Re: Proposal: speeding up GIN build with parallel workers - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Proposal: speeding up GIN build with parallel workers |
Date | |
Msg-id | 0a0df4f6-66bb-aeaf-c7a7-60d6d4abaabb@iki.fi |
In response to | Re: Proposal: speeding up GIN build with parallel workers (Jeff Janes <jeff.janes@gmail.com>) |
Responses | Re: Proposal: speeding up GIN build with parallel workers |
List | pgsql-hackers |
On 01/17/2016 10:03 PM, Jeff Janes wrote:
> On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:
>>> I have a draft implementation which divides the whole process between
>>> N parallel workers, see the patch attached. Instead of a full scan of
>>> the relation, I give each worker a range of blocks to read.
>>
>> I am currently working on a patch that allows B-Tree index builds to
>> be performed in parallel. I think I'm a week or two away from posting
>> it.
>>
>> Even without parallelism, wouldn't it be better if GIN indexes were
>> built using tuplesort? I know way way less about the gin am than the
>> nbtree am, but I imagine that a prominent cost for GIN index builds is
>> constructing the main B-Tree (the one that's constructed over key
>> values) itself. Couldn't tuplesort.c be adapted to cover this case?
>> That would be much faster in general, particularly with the recent
>> addition of abbreviated keys, while also leaving a clear path forward
>> to performing the build in parallel.
>
> I think it would take a lot of changes to tuplesort to make this be an
> almost-always win.
>
> In the general case each GIN key occurs in many tuples, and the
> in-memory rbtree is good at compressing the tid list for each key to
> maximize the amount of data that can be in memory (although perhaps it
> could be even better, as it doesn't use varbyte encoding on the tid
> list the way the posting lists on disk do--it could do so in the bulk
> build case, where it receives the tids in order, but not feasibly in
> the pending-list clean-up case, where it doesn't get them in order).
>
> When I was testing building an index on a column of text identifiers
> where there were a couple million identifiers occurring a few dozen
> times each, building it as a gin index (using btree_gin so that plain
> text columns could be indexed) was quite a bit faster than building it
> as a regular btree index using tuplesort. I didn't really
> investigate this difference, but I assume it was due to the better
> memory usage leading to less IO.

I've been wondering about using tuplesort to build GIN indexes for a
long time, too. Surely the highly-optimized quicksort+merge code in
tuplesort.c is faster than maintaining a red-black tree? Or so I
thought.

I wrote a quick prototype of using tuplesort.c for GIN index builds,
and tested it with a 500 MB table of integer arrays, so that the sort
fit completely in memory. It's a lot slower than the current code. It
turns out that eliminating the duplicates early is really, really
important. So we want to keep the red-black tree, to eliminate the
duplicates. Or add that capability to tuplesort.c, which might speed
up Sort+Unique and aggregates in general, but that's a big effort.

However, I still wonder if we shouldn't use a merge approach when the
tree doesn't fit in memory, like tuplesort.c does. Currently, when the
tree is full, we flush it out to the index by inserting all the
entries. That can get quite expensive, I/O-wise. It also generates
more WAL, compared to writing each page only once. If we flushed the
tree to a tape instead, then we could perhaps use the machinery that
Peter's parallel B-tree patch is adding to tuplesort.c to merge the
tapes. I'm not sure if that works out, but I think it's worth some
experimentation.
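To make that concrete, here's a toy sketch of the merge pattern I have
in mind. All the names are made up and it bears no relation to the
actual GIN code: keys and TIDs are plain ints, and a "tape" is just an
in-memory array of entries (real code would presumably spill the runs
through logtape.c, like tuplesort does). Each time the in-memory tree
fills up, we'd emit one run of (key, posting list) entries sorted by
key. Assuming the heap is scanned in TID order, every TID in a later
run is greater than every TID in an earlier one, so merging two
entries with equal keys is plain concatenation:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { int key; int ntids; int *tids; } Entry;
typedef struct { Entry *entries; int nentries; } Run;

/* Sort accumulated (key, tid) pairs by key, then by tid. */
static int pair_cmp(const void *a, const void *b)
{
    const int *pa = a, *pb = b;     /* pa[0] = key, pa[1] = tid */
    if (pa[0] != pb[0])
        return (pa[0] > pb[0]) - (pa[0] < pb[0]);
    return (pa[1] > pb[1]) - (pa[1] < pb[1]);
}

/*
 * "Flush": instead of inserting the accumulated entries into the
 * index, emit one run of (key, posting list) entries, sorted by key.
 */
static Run flush_run(int pairs[][2], int npairs)
{
    Run run = { malloc(npairs * sizeof(Entry)), 0 };

    qsort(pairs, npairs, sizeof(int[2]), pair_cmp);
    for (int i = 0; i < npairs;)
    {
        int j = i;

        while (j < npairs && pairs[j][0] == pairs[i][0])
            j++;
        Entry e = { pairs[i][0], j - i, malloc((j - i) * sizeof(int)) };
        for (int k = i; k < j; k++)
            e.tids[k - i] = pairs[k][1];
        run.entries[run.nentries++] = e;
        i = j;
    }
    return run;
}

/*
 * Merge two runs.  Since the heap was scanned in TID order, every TID
 * in run b is greater than every TID in run a, so the posting lists of
 * equal keys are merged by simple concatenation.
 */
static Run merge_runs(Run a, Run b)
{
    Run out = { malloc((a.nentries + b.nentries) * sizeof(Entry)), 0 };
    int i = 0, j = 0;

    while (i < a.nentries || j < b.nentries)
    {
        if (j >= b.nentries ||
            (i < a.nentries && a.entries[i].key < b.entries[j].key))
            out.entries[out.nentries++] = a.entries[i++];
        else if (i >= a.nentries || b.entries[j].key < a.entries[i].key)
            out.entries[out.nentries++] = b.entries[j++];
        else
        {
            Entry e = a.entries[i++], f = b.entries[j++];
            Entry m = { e.key, e.ntids + f.ntids,
                        malloc((e.ntids + f.ntids) * sizeof(int)) };

            memcpy(m.tids, e.tids, e.ntids * sizeof(int));
            memcpy(m.tids + e.ntids, f.tids, f.ntids * sizeof(int));
            out.entries[out.nentries++] = m;
        }
    }
    return out;
}

int main(void)
{
    /* Two "memory full" batches from a scan in TID order. */
    int batch1[][2] = { {7, 1}, {3, 2}, {7, 3}, {5, 4} };
    int batch2[][2] = { {3, 5}, {7, 6}, {9, 7} };
    Run merged = merge_runs(flush_run(batch1, 4), flush_run(batch2, 3));

    for (int i = 0; i < merged.nentries; i++)
    {
        printf("key %d:", merged.entries[i].key);
        for (int k = 0; k < merged.entries[i].ntids; k++)
            printf(" %d", merged.entries[i].tids[k]);
        printf("\n");
    }
    return 0;
}

The point being that the index pages would then be written once, at
the end of the final merge, instead of being inserted into on every
flush.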
> But I do think this with aggregation would be worth it despite the
> amount of work involved. In addition to automatically benefiting from
> parallel sorts and any other future sort improvements, I think it
> would also generate better indexes. I have a theory that a problem
> with very large gin indexes is that repeatedly building
> maintenance_work_mem worth of rbtree entries and then merging them to
> disk yields highly fragmented btrees, in which logically adjacent
> key-space does not occupy physically nearby leaf pages, which can then
> cause problems either with access that follows a pattern (like range
> scans, except gin indexes can't really do those anyway) or further
> bulk operations.

Yeah, there's that too.

- Heikki