Re: Processing btree walks as a batch to parallelize IO - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Processing btree walks as a batch to parallelize IO |
Date | |
Msg-id | 3507221b-f664-17ab-ddc3-6113f2032c51@enterprisedb.com |
In response to | Processing btree walks as a batch to parallelize IO (James Coleman <jtc331@gmail.com>) |
Responses | Re: Processing btree walks as a batch to parallelize IO; Re: Processing btree walks as a batch to parallelize IO |
List | pgsql-hackers |
On 4/9/21 7:33 PM, James Coleman wrote:
> $SUBJECT is still a very loosely formed idea, so forgive lack of detail
> or things I've likely missed, but I wanted to get it out there to see if
> it sounded at all intriguing to people.
>
> Background: One of the big problems with non-local storage such as AWS
> EBS volumes or a SAN is that in a large database (really, working set,
> where working set includes reads) exceeds the size of buffer cache (and
> page cache) the cost of random page reads hitting the underlying disk
> system dominates. This is because networked disks have an order of
> magnitude higher latency than a bunch of RAIDed SSDs (even more so with
> NVMe storage). In some of our experiments on Aurora I've seen a 10x
> change versus pretty good physical hardware, and I'd assume RDS (since
> it's EBS-backed) is similar.
>
> A specific area where this is particularly painful is btree index reads.
> Walking the tree to leaf pages isn't naturally prefetchable, and so for
> each level you pay the random page cost. Of course higher levels in the
> tree will almost certainly exhibit emergent behavior such that they
> (just by fact of the LRU caching) will be in the buffer cache, but for a
> large index lower levels likely won't be.
>

What do you consider a large index, in terms of levels? Consider a 1TB
table with just a single UUID column - that's ~25B rows, give or take.
Real tables will have more columns, so this seems like a reasonable
model for the largest number of rows per relation. With ~32B per index
tuple, that's about 100M leaf pages, and with ~256 branches per internal
page, that's still only ~5 levels. I think it's quite rare to see
indexes with more than 6 or 7 levels.

And the internal pages are maybe 0.5% of the whole index (so ~4GB out of
750GB). I think the usual expectation is that most of that will fit into
RAM, but of course there may be more indexes competing for that.

I think the number of index levels is not really the crucial bit - it's
more about the total amount of indexes in the DB.

> If we squint a bit, insertions look a whole lot like reads as well since
> we have to walk the tree to find the leaf insertion page for a new
> tuple. This is particularly true for indexes where inserts are roughly
> randomly distributed data, like a uuid.
>

Yep. We need to walk the index to the leaf pages in both cases, reads
and inserts alike.

> The read-for-lookups problem is harder to solve, but the cost as it
> relates to table inserts is possibly more tractable. Tables typically
> have more than one index to update, so the obvious approach is "let's
> just parallelize the index insertions". Of course we know that's
> difficult given the multi-process approach Postgres uses for
> parallelism.
>

Hmm. Not sure if reads are harder to deal with, but I think you're right
that the two cases (reads and writes) may look similar at the level of a
single index, yet need rather different approaches, exactly because
inserts have to deal with all indexes while reads only really deal with
a single index.

FWIW I think there are a couple of options for improving reads, at least
in some cases:

1) I wonder if e.g. _bt_readnextpage could prefetch at least one page
ahead. We can't look further ahead, but perhaps this would help.

2) In some cases (e.g. a nested loop with an inner index scan) we could
collect an array of values and then look them up at once, which should
allow us to do at least some of the I/O in parallel, I think. (A rough
sketch of what I mean follows below.)
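To make (2) a bit more concrete, here's a minimal stand-alone sketch -
emphatically not PostgreSQL code. The "index.dat" file, the fixed 8kB
page size and the key_to_block() mapping are just assumptions standing
in for walking the (cached) internal levels of the index. The point is
only the two phases: compute the target blocks for the whole batch of
keys and hint them to the kernel first, then do the actual reads, so the
random I/O can be in flight concurrently:

    /*
     * Toy illustration of a batched index lookup (not PostgreSQL code).
     *
     * Assumptions: "index.dat" is a file of fixed-size 8kB pages, and
     * key_to_block() stands in for walking the (cached) internal levels
     * of a btree to find the leaf block a key belongs to.
     */
    #define _XOPEN_SOURCE 600

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    /* hypothetical mapping from a key to the leaf block containing it */
    static uint32_t
    key_to_block(uint64_t key, uint32_t nblocks)
    {
        return (uint32_t) (key % nblocks);
    }

    static void
    batch_lookup(int fd, uint32_t nblocks, const uint64_t *keys, int nkeys)
    {
        uint32_t   *blocks = malloc(sizeof(uint32_t) * nkeys);
        char        page[BLCKSZ];

        /* phase 1: compute target blocks, hint the kernel to read them */
        for (int i = 0; i < nkeys; i++)
        {
            blocks[i] = key_to_block(keys[i], nblocks);
            (void) posix_fadvise(fd, (off_t) blocks[i] * BLCKSZ, BLCKSZ,
                                 POSIX_FADV_WILLNEED);
        }

        /* phase 2: do the reads, hopefully hitting I/O already in flight */
        for (int i = 0; i < nkeys; i++)
        {
            if (pread(fd, page, BLCKSZ, (off_t) blocks[i] * BLCKSZ) != BLCKSZ)
                perror("pread");
            /* ... search the leaf page for keys[i] here ... */
        }

        free(blocks);
    }

    int
    main(void)
    {
        uint64_t    keys[] = {42, 123456, 987654321, 7};
        int         fd = open("index.dat", O_RDONLY);
        off_t       size;

        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        size = lseek(fd, 0, SEEK_END);
        if (size >= BLCKSZ)
            batch_lookup(fd, (uint32_t) (size / BLCKSZ), keys, 4);

        close(fd);
        return 0;
    }

In the real thing this would presumably go through the buffer manager
(PrefetchBuffer) rather than posix_fadvise, but the shape of the loop
would be similar.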
That second idea is similar to what you propose for writes, except that
it works against the same index.

> Another approach that at first glance seems like it fits better into
> Postgres (I'm not claiming it's easy or a small patch) would be to
> process a batch of indexes at once. For example, if the index access
> methods were extended to allow being given a list of indexes that need
> to be walked, then the btree code could process each layer in the walk
> as a group -- issuing IO fetches for all of the first level blocks in
> the tree, and then computing all of the next level blocks needed and
> issuing those IO requests at a time, and so on.
>

Yeah, I agree having a way to say "prefetch all pages needed to insert
these keys into these indexes" might be better than just parallelizing
it in a "naive" way. Not sure how complex that would be - I think the
API would need to allow traversing the index one level at a time, with
each step split into separate phases:

1) determine the page needed for the next step, return it to the caller

2) the caller collects those pages from all indexes, initiates prefetch

3) instruct the indexes to actually do the next step, stop if it's a
leaf page (otherwise go back to (1))

And then we might just do the index inserts in a serial way, just like
we do today, hoping to hit the prefetched pages.

FWIW while this probably helps saturate the I/O, it unfortunately does
nothing to reduce the write amplification - we still need to modify the
same number of leaf pages in all indexes, produce the same amount of
WAL, etc.

I think there were some proposals to add small internal buffers, and
instead of pushing the inserts all the way down to the leaf page, just
add them to the internal buffer. And when the buffer gets full,
propagate the contents to the next level of buffers. For example, each
internal page might have one "buffer" page, so the index size would not
really change (the internal pages would double, but that's still just
~1% of the total index size). Of course, this makes lookups more
complex/expensive, because we need to check the internal buffers too.
But it does reduce the write amplification, because it combines changes
to the leaf pages.

> In some workloads we've been testing I believe such an approach could
> plausibly improve table insert (and update) performance by multiple
> hundreds of percent.
>
> I don't have any code at the moment to show here, but I wanted to get
> the idea out there to see if there were any immediate reactions or other
> thoughts on the topic.
>
> Thoughts?
>

I think you're right that indexes may be a serious bottleneck in some
cases, so exploring ways to improve that seems useful. Ultimately I
think we should be looking for ways to reduce the amount of work we need
to do, but parallelizing it (i.e. doing the same amount of work, but in
multiple processes) is a valid approach too.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company