Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons |
Date | |
Msg-id | CAH2-Wzm_nBWnbO8wG+zvHHF2QA9j=BOByXZcmhPK1PBii859Kg@mail.gmail.com Whole thread Raw |
In response to | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons |
List | pgsql-hackers |
On Fri, Dec 28, 2018 at 3:32 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Dec 28, 2018 at 3:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I'm envisioning that you have an array, with one element for each item > > on the page (including the tuple we're inserting, which isn't really on > > the page yet). In the first pass, you count up from left to right, > > filling the array. Next, you compute the complete penalties, starting > > from the middle, walking outwards. > Ah, right. I think I see what you mean now. > Leave it with me. I'll need to think about this some more. Attached is v10 of the patch series, which has many changes based on your feedback. However, I didn't end up refactoring _bt_findsplitloc() in the way you described, because it seemed hard to balance all of the concerns there. I still have an open mind on this question, and recognize the merit in what you suggested. Perhaps it's possible to reach a compromise here. I did refactor the _bt_findsplitloc() stuff to make the division of work clearer, though -- I think that you'll find that to be a clear improvement, even though it's less than what you asked for. I also moved all of the _bt_findsplitloc() stuff (old and new) into its own .c file, nbtsplicloc.c, as you suggested. Other significant changes ========================= * Creates a new commit that changes routines like _bt_search() and _bt_binsrch() to use a dedicated insertion scankey struct, per request from Heikki. * As I mentioned in passing, many other small changes to comments, the nbtree README, and the commit messages based on your (Heikki's) first round of review. * v10 generalizes the previous _bt_lowest_scantid() logic for adding a tie-breaker on equal pivot tuples during a descent of a B-Tree. The new code works with any truncated attribute, not just a truncated heap TID (I removed _bt_lowest_scantid() entirely). This also allowed me to remove a couple of places that previously opted in to _bt_lowest_scantid(), since the new approach can work without anybody explicitly opting in. As a bonus, the new approach makes the patch faster, since remaining queries where we unnecessarily follow an equal-though-truncated downlink are fixed (it's usually only the heap TID that's truncated when we can do this, but not always). The idea behind this new generalized approach is to recognize that minus infinity is an artificial/sentinel value that doesn't appear in real keys (it only appears in pivot tuples). The majority of callers (all callers aside from VACUUM's leaf page deletion code) can therefore go to the right of a pivot that has all-equal attributes, if and only if: 1. The pivot has at least one truncated/minus infinity attribute *and* 2. The number of attributes matches the scankey. In other words, we tweak the comparison logic to add a new tie-breaker. There is no change to the on-disk structures compared to v9 of the patch -- I've only made index scans able to take advantage of minus infinity values in *all* cases. If this explanation is confusing to somebody less experienced with nbtree than Heikki: consider the way we descend *between* the values on internal pages, rather than expecting exact matches. _bt_binsrch() behaves slightly differently when doing a binary search on an internal page already: equality actually means "go left" when descending the tree (though it doesn't work like that on leaf pages, where insertion scankeys almost always search for a >= match). We want to "go right" instead in cases where it's clear that tuples of interest to our scan can only be in that child page (we're rarely searching for a minus infinity value, since that doesn't appear in real tuples). (Note that this optimization has nothing to do with "moving right" to recover from concurrent page splits -- we would have relied on code like _bt_findsplicloc() and _bt_readpage() to move right once we reach the leaf level when we didn't have this optimization, but that code isn't concerned with recovering from concurrent page splits.) Minor changes ============= * Addresses Heikki's concerns about locking the metapage more frequently in a general way. Comments are added to nbtpage.c, and updated in a number of places that already talk about the same risk. The master branch seems to be doing much the same thing in similar situations already (e.g. during a root page split, when we need to finish an interrupted page split but don't have a usable parent/ancestor page stack). Importantly, the patch does not change the dependency graph. * Small changes to user docs where existing descriptions of things seem to be made inaccurate by the patch. Benchmarking ============ I have also recently been doing a lot of automated benchmarking. Here are results of a BenchmarkSQL benchmark (plus various instrumentation) as a bz2 archive: https://drive.google.com/file/d/1RVJUzMtMNDi4USg0-Yo56LNcRItbFg1Q/view?usp=sharing I completed on my home server last night, against v10 of the patch series. Note that there were 4 runs for each case (master case + public/patch case), with each run lasting 2 hours (so the benchmark took over 8 hours once you include bulk loading time). There were 400 "warehouses" (this is similar to pgbench's scale factor), and 16 terminals/clients. This left the database 110GB+ in size on a server with 32GB of memory and a fast consumer grade SSD. Autovacuum was tuned to perform aggressive cleanup of bloat. All the settings used are available in the bz2 archive (there are "settings" output files, too). Summary ------- See the html "report" files for a quick visual indication of how the tests progresses. BenchmarkSQL uses R to produce useful graphs, which is quite convenient. (I have automated a lot of this with my own ugly shellscript.) We see a small but consistent increase in transaction throughput here, as well as a small but consistent decrease in average latency for each class of transaction. There is also a large and consistent decrease in the on-disk size of indexes, especially if you just consider the number of internal pages (diff the "balance" files to see what I mean). Note that the performance is expected to degrade across runs, since the database is populated once, at the start, and has more data added over time; the important thing is that run n on master be compared to run n on public/patch. Note also that I use my own fork of BenchmarkSQL that does its CREATE INDEX before initial bulk loading, not after [1]. It'll take longer to see problems on Postgres master if the initial bulk load does CREATE INDEX after BenchmarkSQL workers populate tables (we only need INSERTs to see significant index bloat). Avoiding pristine indexes at the start of the benchmark makes the problems on the master branch apparent sooner. The benchmark results also include things like pg_statio* + pg_stat_bgwriter view output (reset between test runs), which gives some insight into what's going on. Checkpoints tend to write out a few more dirty buffers with the patch, while there is a much larger drop in the number of buffers written out by backends. There are probably workloads where we'd see a much larger increase in transaction throughput -- TPC-C happens to access index pages with significant locality, and happens to be very write-heavy, especially compared to the more modern (though less influential) TPC-E benchmark. Plus, the TPC-C workload isn't at all helped by the fact that the patch will never "get tired", even though that's the most notable improvement overall. [1] https://github.com/petergeoghegan/benchmarksql/tree/nbtree-customizations -- Peter Geoghegan
Attachment
- v10-0002-Add-pg_depend-index-scan-tiebreaker-column.patch
- v10-0001-Refactor-nbtree-insertion-scankeys.patch
- v10-0004-Pick-nbtree-split-points-discerningly.patch
- v10-0005-Add-split-at-new-tuple-page-split-optimization.patch
- v10-0003-Treat-heap-TID-as-part-of-the-nbtree-key-space.patch
- v10-0007-DEBUG-Add-pageinspect-instrumentation.patch
- v10-0006-Add-high-key-continuescan-optimization.patch
pgsql-hackers by date: