Re: Memory-Bounded Hash Aggregation - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: Memory-Bounded Hash Aggregation |
Date | |
Msg-id | 6e2b4a4475a690fa515829950aeed935696b0df3.camel@j-davis.com Whole thread Raw |
In response to | Re: Memory-Bounded Hash Aggregation (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Memory-Bounded Hash Aggregation
|
List | pgsql-hackers |
On Fri, 2020-01-24 at 17:16 -0800, Peter Geoghegan wrote: > That sounds weird. Might be pathological in some sense. > > I have a wild guess for you. Maybe this has something to do with the > "test for presorted input" added by commit a3f0b3d68f9. That can > perform very badly when the input is almost sorted, but has a few > tuples that are out of order towards the end. (I have called these > "banana skin tuples" in the past.) My simple test case is: 'explain analyze select i from big group by i;', where "big" has 20M tuples. I tried without that change and it helped (brought the time from 55s to 45s). But if I completely remove the sorting of the freelist, it goes down to 12s. So it's something about the access pattern. After digging a bit more, I see that, for Sort, the LogicalTapeSet's freelist hovers around 300 entries and doesn't grow larger than that. For HashAgg, it gets up to almost 60K. The pattern in HashAgg is that the space required is at a maximum after the first spill, and after that point the used space declines with each batch (because the groups that fit in the hash table were finalized and emitted, and only the ones that didn't fit were written out). As the amount of required space declines, the size of the freelist grows. That leaves a few options: 1) Cap the size of the LogicalTapeSet's freelist. If the freelist is growing large, that's probably because it will never actually be used. I'm not quite sure how to pick the cap though, and it seems a bit hacky to just leak the freed space. 2) Use a different structure more capable of handling a large fraction of free space. A compressed bitmap might make sense, but that seems like overkill to waste effort tracking a lot of space that is unlikely to ever be used. 3) Don't bother tracking free space for HashAgg at all. There's already an API for that so I don't need to further hack logtape.c. 4) Try to be clever and shrink the file (or at least the tracked portion of the file) if the freed blocks are at the end. This wouldn't be very useful in the first recursive level, but the problem is worst for the later levels anyway. Unfortunately, I think this requires a breadth-first strategy to make sure that blocks at the end get freed. If I do change it to breadth-first also, this does amount to a significant speedup. I am leaning toward #1 or #3. As an aside, I'm curious why the freelist is managed the way it is. Newly-released blocks are likely to be higher in number (or at least not the lowest in number), but they are added to the end of an array. The array is therefore likely to require repeated re-sorting to get back to descending order. Wouldn't a minheap or something make more sense? Regards, Jeff Davis
pgsql-hackers by date: