From: Tomas Vondra
Subject: WIP: parallel GiST index builds
Date:
Msg-id: 0e4f4af8-1088-4995-b2fb-8f92b5c6cef9@enterprisedb.com
List: pgsql-hackers
Hi,

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I knew
next to nothing about how GiST works, but I gave it a shot and here's
what I have - the attached patch allows parallel GiST builds for the
"unsorted" case (i.e. when the opclass does not include sortsupport),
and does not support buffered builds.

unsorted builds only
--------------------

Addressing only the unsorted case may seem a bit weird, but I did it
this way for two reasons - parallel sort is a solved problem, and adding
it to the patch later seems quite straightforward (it's what btree does,
for example). But I also was not very sure how common sorted builds are
- we do have sortsupport for points, but I had no idea whether the
PostGIS opclasses define sorting etc. My guess was "no", but I've been
told that's no longer true, so sorted builds are more widely applicable
than I thought.

In any case, I'm not in a rush to parallelize sorted builds. It can be
added later, as an improvement, IMHO. In fact, it's a well-isolated part
of the patch, which might make it a good choice for someone looking for
an idea for their first patch ...

buffered builds
---------------

The lack of support for buffered builds is a very different thing. The
basic idea is that we don't push the index entries all the way to the
leaf pages right away, but accumulate them in buffers half-way through.
This combines writes and reduces random I/O, which is nice.

Unfortunately, the way it's implemented does not work with parallel
builds at all - all the state is in private memory, and it assumes the
worker is the only backend that can split a page (at which point the
buffers need to be split too, etc.). For parallel builds this is
obviously not true. I'm not saying parallel builds can't do similar
buffering, but it requires moving the buffers into shared memory and
introducing locking to coordinate access to them. (Or perhaps it might
be enough to only "notify" the workers about page splits, with the
buffers still in private memory?) Anyway, it seems far too complicated
for v1.

In fact, I'm not sure the buffering is entirely necessary - maybe the
increase in the amount of RAM makes this less of an issue? If the index
fits into shared buffers (or at least the page cache), maybe the extra
I/O is not that bad? I'm sure there are cases really affected by this,
but maybe it's OK to tell people to disable parallel builds in those
cases?

gistGetFakeLSN
--------------

One more thing - GiST disables WAL-logging during the build, and only
WAL-logs the index once at the end. For serial builds this is fine,
because there are no concurrent splits, and so we don't need to rely on
page LSNs to detect them (in fact, it uses a bogus value). But for
parallel builds this would not work - we need page LSNs that actually
change, otherwise we'd miss page splits, and the index build would
either fail or produce a broken index.

The existing is_build flag affects both things (it suppresses
WAL-logging and also drives the page LSN handling), so I had to
introduce a new "is_parallel" flag which only affects the page LSN part,
using the gistGetFakeLSN() function, previously used only for unlogged
indexes. This means we'll produce WAL during the index build (because
gistGetFakeLSN() writes a trivial message into WAL). Compared to serial
builds this produces maybe 25-75% more WAL, but it's an order of
magnitude less than with "full" WAL logging (is_build=false). For
example, a serial build of a 5GB index needs ~5GB of WAL. A parallel
build may need ~7GB, while a parallel build with "full" logging would
use 50GB. I think this is a reasonable trade-off.

There's one "strange" thing, though - the amount of WAL decreases with
the number of parallel workers. Consider for example an index on a
numeric field, where the index is ~9GB, but the amount of WAL changes
like this (0 workers means a serial build):

parallel workers      0     1     3     5     7
WAL (GB)            5.7   9.2   7.6   7.0   6.8

The explanation for this is fairly simple (AFAIK) - gistGetFakeLSN
determines if it needs to actually assign a new LSN (and write stuff to
WAL) by comparing the last LSN assigned (in a given worker) to the
current insert LSN. But the current insert LSN might have been updated
by some other worker, in which case we simply use that. Which means
multiple workers may use the same fake LSN, and the likelihood increases
with the number of workers - and this is consistent with the observed
behavior of the WAL volume decreasing as the number of workers increases
(because more workers use the same LSN).

I'm not entirely sure if this is OK or a problem. I was worried two
workers might end up using the same LSN for the same page, leading to
other workers not noticing the split. But after a week of pretty
intensive stress testing, I haven't seen a single such failure ... If
this turns out to be a problem, the fix is IMHO quite simple - it should
be enough to force gistGetFakeLSN() to produce a new fake LSN every time
when is_parallel=true.
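To illustrate, here's roughly what such a forced-LSN variant could look
like - just a sketch, not what the patch currently does; the extra
parameter and the function name are only for illustration, and it skips
the temp/unlogged cases the real gistGetFakeLSN() handles:

/*
 * Sketch only: a variant of gistGetFakeLSN() that always emits the
 * dummy WAL record for parallel builds, so every call returns a
 * distinct LSN and two workers can never hand out the same fake LSN.
 * The is_parallel parameter is just for illustration.
 */
#include "postgres.h"

#include "access/gist_private.h"
#include "access/xlog.h"
#include "utils/rel.h"

XLogRecPtr
gistGetFakeLSN_sketch(Relation rel, bool is_parallel)
{
    static XLogRecPtr lastlsn = InvalidXLogRecPtr;
    XLogRecPtr  currlsn = GetXLogInsertRecPtr();

    /* Only used while the build suppresses regular WAL-logging. */
    Assert(!RelationNeedsWAL(rel));

    /*
     * Serial build: if the insert LSN advanced since the last call, it
     * is already distinct from the LSN we handed out, so reuse it.
     * Parallel build: another worker may have advanced it (and handed
     * out the very same value), so always write the trivial WAL record
     * to get a unique LSN.
     */
    if (is_parallel ||
        (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn))
        currlsn = gistXLogAssignLSN();

    lastlsn = currlsn;
    return currlsn;
}

That should also make the WAL volume independent of the number of
workers, at the cost of one tiny WAL record per fake LSN.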
performance
-----------

Obviously, the primary goal of the patch is to speed up the builds, so
does it actually do that? For indexes of different sizes I got these
timings (in seconds), with the columns showing the number of parallel
workers (0 = serial build):

scale    type           0       1       3       5       7
------------------------------------------------------------------
small    inet          13       7       4       4       2
         numeric      239     122      67      46      36
         oid           15       8       5       3       2
         text          71      35      19      13      10
medium   inet         207     111      59      42      32
         numeric     3409    1714     885     618     490
         oid          214     114      60      43      33
         text         940     479     247     180     134
large    inet        2167    1459     865     632     468
         numeric    38125   20256   10454    7487    5846
         oid         2647    1490     808     594     475
         text       10987    6298    3376    2462    1961

Here "small" is a ~100-200MB index, "medium" 1-2GB and "large" 10-20GB,
depending on the data type. The raw durations are not particularly easy
to interpret, so let's look at the actual speedup, calculated as
(serial duration) / (parallel duration):

scale    type           1       3       5       7
--------------------------------------------------------------
small    inet         1.9     3.3     3.3     6.5
         numeric      2.0     3.6     5.2     6.6
         oid          1.9     3.0     5.0     7.5
         text         2.0     3.7     5.5     7.1
medium   inet         1.9     3.5     4.9     6.5
         numeric      2.0     3.9     5.5     7.0
         oid          1.9     3.6     5.0     6.5
         text         2.0     3.8     5.2     7.0
large    inet         1.5     2.5     3.4     4.6
         numeric      1.9     3.6     5.1     6.5
         oid          1.8     3.3     4.5     5.6
         text         1.7     3.3     4.5     5.6

Ideally (if the build scaled linearly with the number of workers), we'd
get a speedup of (number of workers + 1), because the leader
participates too. Obviously, it's not that great - for example, for the
large text index we get 3.3 instead of 4.0 with 3 workers, and 5.6
(10987 / 1961) instead of 8.0 with 7 workers. But I think those numbers
are actually pretty good - I'd definitely not complain if my index
builds got 5x faster.

But those are synthetic tests on random data, using the btree_gist
opclasses. It'd be interesting if people could do their own testing on
real-world data sets.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company