[WIP] [B-Tree] Keep indexes sorted by heap physical location - Mailing list pgsql-hackers
From: Claudio Freire
Subject: [WIP] [B-Tree] Keep indexes sorted by heap physical location
Msg-id: CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
List: pgsql-hackers
The attached patch tries to maintain the initial state of B-Tree indexes, which are created with equal-key runs in heap physical order, throughout the whole life of the B-Tree, and to make key-tid pairs efficiently searchable in the process.

The patch doesn't add an interface for searching key-tid pairs, nor does it address all potential issues that could arise from the implementation, or backward compatibility with earlier-format indexes. It does, however, pass all tests, and seems to work well with the workloads I have tried so far, which include:

- dump/restore/compare (with lots of indexes, both unique and non-unique)
- sanity checks of index lookups
- running a complex application on top of it, with no crashes or corruption (detected)

Recovery hasn't been tested at all, and I'd expect it to have issues: though I have tried to make the necessary changes, I may have missed more than one. Nor have I tested high-concurrency workloads (although the application mentioned does produce some), so this is an early WIP in search of feedback on the overall approach. Still, it should be enough to measure the merits of the idea, namely index size, performance and code impact.

A couple of points make me uneasy about this patch, yet I can think of no better alternative, so I seek feedback:

- Introducing a different format for inner index tuples makes for an invasive patch and quite difficult-to-reason-about code (it's easy to forget whether a page is a leaf or inner page, and that now causes assertion failures or segfaults).

- The ascent-descent to check for uniqueness when there are large dead-tuple runs could have locking subtleties that escape me. Perhaps there are better ways to solve this.

- The patch changes the output of regression tests in nondeterministic ways, making index scan results always follow physical order. While that's not only not incorrect but desirable, lots of tests depended on the almost deterministic order that resulted from the way a suitable insertion point was selected before.
The first patch addresses that, but at the expense of some "coverage" (ignoring some lines of reference output; the minimum I could ignore).

- While I don't see why it would cause more bloat under normal usage than the previous method, there are probably new/different pathological cases to account for, and maybe I'm missing one that is a common pattern.

- It's btree; it doesn't get any more critical than that. It really needs a heapload of testing before going in, and some of it (crash recovery) may be tough to get right.

On the performance side, I don't see a significant impact. If anything, it has the potential to speed things up, since scans over long runs of equal keys will now be done in physical order, and insertions will be spread evenly around the index leaf pages (due to the spread of heap tuples), but I haven't done any benchmarks beyond pgbench to back that up. It is also an important enabling feature for making vacuum more efficient and for implementing WARM tuples in a more general way, so that has to be accounted for when evaluating the potential benefit of this patch.

So, the numbers. The size impact I measured on a dump of a test database, and it seems under the noise, which is what I expected since the change only affects non-leaf pages:

patched db: 23G
  hourly_full: 562M + 790M (118M + 118M + 554M)
  bid_logs:    591M + 268M (134M + 134M)
  uas:         285M + 496M (20M + 138M + 159M + 20M + 159M)

unpatched db: 23G
  hourly_full: 562M + 789M (118M + 118M + 553M)
  bid_logs:    591M + 268M (134M + 134M)
  uas:         285M + 496M (20M + 138M + 159M + 20M + 159M)

Here, numbers are HEAP + INDEXES (i1 + i2 + i3 ...), for tables that are a mix of wide and narrow, with wide indexed columns (varchar) and narrow ones (int). Only a few indexes seem to have grown, but this is before any bloat accumulates (I haven't yet devised a test that produces the same amount of bloat on both databases to make them comparable).
pgbench, in-memory, patched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13271118
latency average: 0.181 ms
tps = 44237.026822 (including connections establishing)
tps = 44237.227678 (excluding connections establishing)

pgbench, in-memory, unpatched:

transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 13336128
latency average: 0.180 ms
tps = 44453.718954 (including connections establishing)
tps = 44453.895436 (excluding connections establishing)

pgbench, bigger-than-RAM, patched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 71194
latency average: 33.711 ms
tps = 237.269144 (including connections establishing)
tps = 237.270122 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 26748
latency average: 89.726 ms
tps = 89.043800 (including connections establishing)
tps = 89.044218 (excluding connections establishing)

pgbench, bigger-than-RAM, unpatched:

transaction type: <builtin: select only>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 86495
latency average: 27.747 ms
tps = 288.256959 (including connections establishing)
tps = 288.258202 (excluding connections establishing)

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1000
query mode: simple
number of clients: 8
number of threads: 1
duration: 300 s
number of transactions actually processed: 27145
latency average: 88.414 ms
tps = 90.461600 (including connections establishing)
tps = 90.462105 (excluding connections establishing)

This is expected, since pgbench exercises none of the potentially improved paths. I/O here (just a laptop) is horrible, so the difference is probably just noise. The drop in select-only seems strangely high; I think that's also noise, but I haven't rerun the tests yet.