Re: Processing btree walks as a batch to parallelize IO - Mailing list pgsql-hackers
From: James Coleman
Subject: Re: Processing btree walks as a batch to parallelize IO
Msg-id: CAAaqYe8uaSeK-k8mESzAHxg8m5-iime_6mNgEi-dv4rMS255=Q@mail.gmail.com
In response to: Re: Processing btree walks as a batch to parallelize IO (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On Fri, Apr 9, 2021 at 4:57 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
> On 4/9/21 7:33 PM, James Coleman wrote:
> > $SUBJECT is still a very loosely formed idea, so forgive lack of
> > detail or things I've likely missed, but I wanted to get it out there
> > to see if it sounded at all intriguing to people.
> >
> > Background: One of the big problems with non-local storage such as
> > AWS EBS volumes or a SAN is that when a large database (really,
> > working set, where working set includes reads) exceeds the size of
> > buffer cache (and page cache), the cost of random page reads hitting
> > the underlying disk system dominates. This is because networked disks
> > have an order of magnitude higher latency than a bunch of RAIDed SSDs
> > (even more so with NVMe storage). In some of our experiments on
> > Aurora I've seen a 10x difference versus pretty good physical
> > hardware, and I'd assume RDS (since it's EBS-backed) is similar.
> >
> > A specific area where this is particularly painful is btree index
> > reads. Walking the tree to leaf pages isn't naturally prefetchable,
> > and so for each level you pay the random page cost. Of course higher
> > levels in the tree will almost certainly exhibit emergent behavior
> > such that they (just by fact of the LRU caching) will be in the
> > buffer cache, but for a large index lower levels likely won't be.
>
> What do you consider a large index level?

In general it's probably all levels but the leaves (though it depends on
cache size, index size, etc.).

> Consider a 1TB table, with just a single UUID column - that's ~25B
> rows, give or take. Real tables will have more columns, so this seems
> like a reasonable model of the largest number of rows per relation.
> With ~32B per index tuple, that's about 100M leaf pages, and with ~256
> branches per internal page, that's still only ~5 levels. I think it's
> quite rare to see indexes with more than 6 or 7 levels.
>
> And the internal pages are maybe 0.5% of the whole index (so ~4GB out
> of 750GB). I think the usual expectation is that most of that will fit
> into RAM, but of course there may be more indexes competing for that.
>
> I think the index level is not really the crucial bit - it's more
> about the total amount of indexes in the DB.

I suppose? If the tables/indexes/etc. are sufficiently large relative to
the cache size, the quantity of indexes won't matter.
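As a quick sanity check on that math, here's a back-of-the-envelope
sketch (standalone C, not Postgres code; it just assumes the figures
from your estimate: 8KB pages, ~32B index tuples, and therefore a
fanout of ~256 at each level):

    #include <stdio.h>

    int main(void)
    {
        double nrows = 25e9;            /* ~25B rows in the 1TB example */
        double fanout = 8192.0 / 32.0;  /* ~256 tuples per 8KB page */
        double leaf_pages = nrows / fanout;
        double pages = leaf_pages;
        double internal_pages = 0;
        int levels = 1;                 /* count the leaf level itself */

        while (pages > 1)
        {
            pages /= fanout;            /* one level up the tree */
            internal_pages += pages;
            levels++;
        }

        /* Prints ~98M leaf pages, ~383K internal pages, 5 levels. */
        printf("leaf: %.0f, internal: %.0f, levels: %d\n",
               leaf_pages, internal_pages, levels);
        return 0;
    }

That agrees with your numbers: ~5 levels, with internal pages well under
1% of the index.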
> > If we squint a bit, insertions look a whole lot like reads as well,
> > since we have to walk the tree to find the leaf insertion page for a
> > new tuple. This is particularly true for indexes where inserts are
> > roughly randomly distributed data, like a uuid.
>
> Yep. We need to walk the index to the leaf pages in both cases, both
> for read and insert workloads.
>
> > The read-for-lookups problem is harder to solve, but the cost as it
> > relates to table inserts is possibly more tractable. Tables
> > typically have more than one index to update, so the obvious
> > approach is "let's just parallelize the index insertions". Of course
> > we know that's difficult given the multi-process approach Postgres
> > uses for parallelism.
>
> Hmm. Not sure if reads are harder to deal with, but I think you're
> right that those two cases (reads and writes) may look similar at the
> level of a single index, but may need rather different approaches,
> exactly because inserts have to deal with all indexes, while reads
> only really deal with a single index.

Right. In practice it's harder to deal with a single index scan because
you don't have multiple such scans to parallelize.

> FWIW I think there are a couple of options for improving reads, at
> least in some cases.
>
> 1) I wonder if e.g. _bt_readnextpage could prefetch at least one page
> ahead. We can't look further ahead, but perhaps this would help.
>
> 2) In some cases (e.g. a nested loop with an inner index scan) we
> could collect an array of values and then look them up at once, which
> should allow us to do at least some of the I/O in parallel, I think.
> That's similar to what you propose for writes, except that it works
> against the same index.

The "collect an array of values" approach isn't one I'd considered, but
it seems likely interesting.

> > Another approach that at first glance seems like it fits better into
> > Postgres (I'm not claiming it's easy or a small patch) would be to
> > process a batch of indexes at once. For example, if the index access
> > methods were extended to allow being given a list of indexes that
> > need to be walked, then the btree code could process each layer in
> > the walk as a group -- issuing IO fetches for all of the first level
> > blocks in the tree, then computing all of the next level blocks
> > needed and issuing those IO requests at once, and so on.
>
> Yeah, I agree having a way to say "prefetch all pages needed to insert
> these keys into these indexes" might be better than just parallelizing
> it in a "naive" way.
>
> Not sure how complex it would be - I think the API would need to allow
> traversing the index with each step split into two phases:
>
> 1) determine the page needed for the next step, return it to caller
>
> 2) the caller collects pages from all indexes, initiates prefetch
>
> 3) instruct indexes to actually do the next step, stop if it's a leaf
> page (otherwise go to (1))
>
> And then we might just do index inserts in a serial way, just like we
> do today, hoping to hit the prefetched pages.

Correct; this is roughly what I was envisioning.
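To make that concrete, here's a rough sketch of the shape such an API
might take. Everything named IndexWalkState or index_walk_* is
hypothetical; PrefetchBuffer is the existing bufmgr call:

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    /* Hypothetical per-index descent state, maintained by the AM. */
    typedef struct IndexWalkState IndexWalkState;

    /*
     * Phase 1: report the block this index wants to visit next, or
     * InvalidBlockNumber once the walk has already reached a leaf.
     */
    extern BlockNumber index_walk_next_block(IndexWalkState *state);

    /* Phase 3: actually take the step; returns true once at a leaf. */
    extern bool index_walk_step(IndexWalkState *state);

    /*
     * Walk all indexes down in lockstep, one level at a time, issuing
     * prefetches for the whole batch before any index reads its page.
     */
    static void
    batch_descend(Relation *indexes, IndexWalkState **states, int nindexes)
    {
        bool all_done = false;

        while (!all_done)
        {
            /* Phases 1+2: collect next blocks, kick off the I/O. */
            for (int i = 0; i < nindexes; i++)
            {
                BlockNumber next = index_walk_next_block(states[i]);

                if (next != InvalidBlockNumber)
                    PrefetchBuffer(indexes[i], MAIN_FORKNUM, next);
            }

            /* Phase 3: take the steps; reads should hit in-flight I/O. */
            all_done = true;
            for (int i = 0; i < nindexes; i++)
            {
                if (!index_walk_step(states[i]))
                    all_done = false;
            }
        }
    }

The inserts themselves would then run serially, as you describe, against
pages that are hopefully already in shared buffers.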
> FWIW while this probably helps saturate the I/O, it unfortunately does
> nothing to reduce the write amplification - we still need to modify
> the same number of leaf pages in all indexes, produce the same amount
> of WAL, etc. I think there were some proposals to add small internal
> buffers, and instead of pushing the inserts all the way down to the
> leaf page, just add them to the internal buffer. And when the buffer
> gets full, propagate the contents to the next level of buffers.
>
> For example, each internal page might have one "buffer" page, so the
> index size would not really change (the internal pages would double,
> but that's still just ~1% of the total index size). Of course, this
> makes lookups more complex/expensive, because we need to check the
> internal buffers. But it does reduce the write amplification, because
> it combines changes to leaf pages.

I think I've seen that discussion, and it's very interesting, but also I
think it's still orthogonal to this.
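For anyone following along, here's roughly how I understand that
buffering idea (entirely hypothetical; nothing like this exists in the
btree code today):

    #include "postgres.h"
    #include "storage/block.h"

    /*
     * Hypothetical "buffered" internal page: alongside the usual
     * downlinks, each internal page carries a companion buffer page
     * that absorbs inserts bound for its subtree.
     */
    typedef struct BufferedInternalOpaque
    {
        BlockNumber buffer_blkno;   /* companion page of pending inserts */
    } BufferedInternalOpaque;

    /*
     * Write path: an insert lands in the root's buffer page; only when
     * a buffer fills do its entries get partitioned by downlink and
     * pushed into the children's buffers, and at the bottom a whole
     * batch is applied to each leaf in one pass. That's how many
     * logical inserts come to share one physical leaf-page write, at
     * the cost of lookups also having to probe buffers on the way down.
     */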
> > In some workloads we've been testing I believe such an approach
> > could plausibly improve table insert (and update) performance by
> > multiple hundreds of percent.
> >
> > I don't have any code at the moment to show here, but I wanted to
> > get the idea out there to see if there were any immediate reactions
> > or other thoughts on the topic.
> >
> > Thoughts?
>
> I think you're right that indexes may be a serious bottleneck in some
> cases, so exploring ways to improve that seems useful. Ultimately I
> think we should be looking for ways to reduce the amount of work we
> need to do, but parallelizing it (i.e. doing the same amount of work
> but in multiple processes) is a valid approach too.

Thanks for the feedback.

James