Re: accounting for memory used for BufFile during hash joins - Mailing list pgsql-hackers
From | Thomas Munro
---|---
Subject | Re: accounting for memory used for BufFile during hash joins
Date |
Msg-id | CA+hUKG+ppih9DtZwGJh4KMZXdtD=c9+fYoS1CSeBQBs9JzZtbA@mail.gmail.com
In response to | Re: accounting for memory used for BufFile during hash joins (Melanie Plageman <melanieplageman@gmail.com>)
Responses | Re: accounting for memory used for BufFile during hash joins; Re: accounting for memory used for BufFile during hash joins
List | pgsql-hackers
On Tue, May 7, 2019 at 9:58 AM Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Fri, May 3, 2019 at 5:34 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> The second patch tries to enforce work_mem more strictly. That would be
>> impossible if we were to keep all the BufFile structs in memory, so
>> instead it slices the batches into chunks that fit into work_mem, and
>> then uses a single "overflow" file for slices currently not in memory.
>> These extra slices can't be counted into work_mem, but we should need
>> just very few of them. For example with work_mem=4MB the slice is 128
>> batches, so we need 128x less overflow files (compared to per-batch).
>>
> I want to see if I understand the implications of the per-slice-overflow patch
> for execution of hashjoin:
> For each bucket in the hashtable, when attempting to double the number of
> batches, if the memory that the BufFile structs will occupy once this is done
> will exceed the work_mem, split each batch into slices that fit into memory.
> This means that, for each probe-side tuple hashing to that bucket, you have to
> load every slice of each batch separately into memory to ensure correct results.
> Is this right?

Seems expensive for large numbers of slices -- you need to join the outer batch against each inner slice. But I wonder how we'd deal with outer joins, as Tom Lane asked in another thread:

https://www.postgresql.org/message-id/12185.1488932980%40sss.pgh.pa.us

>> I'm not entirely sure which of those approaches is the right one. The
>> first one is clearly just a "damage control" for cases where the hash
>> side turned out to be much larger than we expected. With good estimates
>> we probably would not have picked a hash join for those (that is, we
>> should have realized we can't keep work_mem and prohibit hash join).
>>
>> The second patch however makes hash join viable for some of those cases,
>> and it seems to work pretty well (there are some numbers in the message
>> posted to pgsql-performance thread). So I kinda like this second one.
>>
> So, my initial reaction after taking a look at the patches is that I prefer the
> first approach--increasing the resize threshold. The second patch, the
> per-slice-overflow patch, adds a major new mechanic to hashjoin in order to
> address what is, based on my understanding, an edge case.

Personally I'd like to make work_mem more reliable, even if it takes a major new mechanism.

Stepping back a bit, I think there is something fishy about the way we detect extreme skew. Is that a factor in this case? Right now we wait until we have a batch that gets split into child batches containing exactly 0% and 100% of the tuples before we give up. Previously I had thought of that as merely a waste of time, but clearly it's also a waste of unmetered memory. Oops.

I think our extreme skew detector should go off sooner, because otherwise if you have N nicely distributed unique keys and also M duplicates of one bad egg key that'll never fit in memory, we keep repartitioning until none of the N keys fall into the batch containing the key for the M duplicates before we give up! You can use balls-into-bins maths to figure out the number, but I think that means we expect to keep splitting until we have N * some_constant batches, and that's just silly and liable to create massive numbers of partitions proportional to N, even though we're trying to solve a problem with M. In another thread I suggested we should stop when (say) 95% of the tuples go to one child batch. I'm not sure how you pick the number.
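To be concrete, an earlier-firing check could look something like the sketch below. This is illustration only, not what nodeHash.c actually does: the function name and the 0.95 cutoff are invented, and the inputs are just "how many tuples stayed in the old batch vs. moved to the new one" after a split, which is the information the repartitioning step already has in some form.

    #include <stdbool.h>

    /*
     * Invented knob: give up on further batch splits once this fraction of a
     * batch's tuples lands in a single child batch.  Today we effectively
     * only give up at 1.0, i.e. when one child gets 100% of the tuples.
     */
    #define SKEW_DISABLE_THRESHOLD 0.95

    /*
     * After splitting a batch, 'kept' tuples stayed in the old batch and
     * 'moved' tuples went to the new one.  Return true if the split was so
     * lopsided that further repartitioning should be disabled.
     */
    static bool
    split_was_too_skewed(double kept, double moved)
    {
        double total = kept + moved;
        double biggest = (kept > moved) ? kept : moved;

        return total > 0.0 && (biggest / total) >= SKEW_DISABLE_THRESHOLD;
    }

The caller would then clear the hashtable's grow-enabled flag as soon as this fires, instead of waiting for the exact 0%/100% case, so the M duplicates stop driving the batch count toward something proportional to N.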
Of course that doesn't solve the problem that we don't have a better plan for dealing with the M duplicates -- it just avoids a needless batch explosion triggered by bad maths. I think we need something like Tomas's #2, or a way to switch to sort-merge, or some other scheme. I'm not sure how to compare the slice idea, which involves processing outer tuples * inner slices, with the sort-merge idea, which involves sorting the inner and outer batch, plus the entirely new concept of switching to another node at execution time.

I also wondered about reducing the buffer size of the BufFiles, but that doesn't seem to be fixing the real problem.

--
Thomas Munro
https://enterprisedb.com