Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
Date | |
Msg-id | 603c8f070812250747t3193a27fj8c50ff4c34cb7daa@mail.gmail.com |
In response to | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets ("Lawrence, Ramon" <ramon.lawrence@ubc.ca>) |
Responses | Re: Proposed Patch to Improve Performance of Multi-Batch Hash Join for Skewed Data Sets |
List | pgsql-hackers |
> There is almost zero penalty for selecting incorrect MCV tuples to
> buffer in memory. Since the number of MCVs is approximately 100, the
> "overhead" is keeping these 100 tuples in memory where they *might* not
> be MCVs. The cost is the little extra memory and the checking of the
> MCVs which is very fast.

I looked at this some more. I'm a little concerned about the way we're
maintaining the in-memory hash table. Since the highest legal statistics
target is now 10,000, it's possible that we could have two orders of
magnitude more MCVs than what you're expecting. As I read the code, that
could lead to construction of an in-memory hash table with 64K slots. On
a 32-bit machine, I believe that works out to 16 bytes per partition (12
and 4), which is a 1MB hash table. That's not necessarily problematic,
except that I don't think you're considering the size of the hash table
itself when evaluating whether you are blowing out work_mem, and the
default size of work_mem is 1MB.

I also don't really understand why we're trying to control the size of
the hash table by flushing tuples after the fact. Right now, when the
in-memory table fills up, we just keep adding tuples to it, which in turn
forces us to flush out other tuples to keep the size down. This seems
quite inefficient - not only are we doing a lot of unnecessary allocating
and freeing, but those flushed slots in the hash table degrade performance
(because they don't stop the scan for an empty slot). It seems like we
could simplify things considerably by adding tuples to the in-memory hash
table only to the point where the next tuple would blow it out. Once we
get to that point, we can skip the isAMostCommonValue() test and send any
future tuples straight to temp files. (This would also reduce the memory
consumption of the in-memory table by a factor of two.) A sketch of this
approach appears below.

We could potentially improve on this even further if we can estimate in
advance how many MCVs we can fit into the in-memory hash table before it
gets blown out. If, for example, we have only 1MB of work_mem but there
are 10,000 MCVs, getMostCommonValues() might decide to only hash the
first 1,000 MCVs. Even if we still blow out the in-memory hash table, the
earlier MCVs are more frequent than the later MCVs, so the ones that
actually make it into the table are likely to be more beneficial (see the
second sketch below). I'm not sure exactly how to do this tuning, though,
since we'd need to approximate the size of the tuples... I guess the query
planner makes some effort to estimate that, but I'm not sure how to get
at it.

> However, the join with Part and LineItem *should* show a benefit but may
> not because of a limitation of the patch implementation (not the idea).
> The MCV optimization is only enabled currently when the probe side is a
> sequential scan. This limitation is due to our current inability to
> determine a stats tuple of the join attribute on the probe side for
> other operators. (This should be possible - help please?)

Not sure how to get at this either, but I'll take a look and see if I can
figure it out.

Merry Christmas,

...Robert
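A minimal sketch of the first idea: stop inserting as soon as the next tuple would exceed the memory budget, rather than inserting and flushing other tuples afterward. The types and names here are simplified stand-ins for illustration, not the patch's or PostgreSQL's actual structures; only the isAMostCommonValue()/work_mem behavior it mimics comes from the discussion above.

```c
/*
 * Illustrative sketch only (simplified, self-contained types).  The idea:
 * charge both the slot array and the buffered tuples against one space
 * budget, and once the next tuple would exceed it, stop buffering entirely
 * so that later tuples bypass the MCV test and go to the temp files.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct MCVTuple
{
    size_t len;                 /* length of the tuple data in bytes */
    char   data[];              /* the buffered tuple itself */
} MCVTuple;

typedef struct MCVHashTable
{
    MCVTuple **slots;           /* open-addressed slot array */
    int        nslots;          /* number of slots */
    size_t     space_used;      /* slot array plus buffered tuples so far */
    size_t     space_allowed;   /* share of work_mem granted to the table */
    bool       enabled;         /* false once the budget is reached */
} MCVHashTable;

/*
 * Buffer one build-side MCV tuple, or report that the table is full.
 * Tuples are never flushed after insertion, so no "deleted" slots are
 * left behind to lengthen later probe scans.  Assumes the space budget
 * (which includes the slot array) keeps the tuple count below nslots.
 */
static bool
mcv_table_insert(MCVHashTable *table, const char *tuple, size_t len,
                 unsigned hashvalue)
{
    size_t needed = sizeof(MCVTuple) + len;

    if (!table->enabled || table->space_used + needed > table->space_allowed)
    {
        table->enabled = false;     /* callers now skip the MCV test */
        return false;
    }

    MCVTuple *copy = malloc(needed);
    if (copy == NULL)
        return false;
    copy->len = len;
    memcpy(copy->data, tuple, len);

    /* linear probing for an empty slot */
    int slot = (int) (hashvalue % (unsigned) table->nslots);
    while (table->slots[slot] != NULL)
        slot = (slot + 1) % table->nslots;
    table->slots[slot] = copy;

    table->space_used += needed;
    return true;
}
```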
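And a sketch of the second idea, sizing the table up front: since the MCV list is ordered most-frequent first, buffering only a prefix that fits the budget keeps the most valuable entries. The width and overhead parameters are assumptions for illustration; how to obtain a good per-tuple width estimate is exactly the open question above.

```c
#include <stddef.h>

/*
 * Illustrative only: given a frequency-ordered MCV list, decide how many
 * of the leading MCVs to buffer so the in-memory table never overflows.
 * est_tuple_width and per_slot_overhead are assumed inputs; in practice
 * they would have to come from some planner-style width estimate.
 */
static int
mcvs_that_fit(int num_mcvs, size_t space_allowed,
              size_t est_tuple_width, size_t per_slot_overhead)
{
    size_t per_mcv = est_tuple_width + per_slot_overhead;

    if (per_mcv == 0)
        return num_mcvs;

    size_t max_fit = space_allowed / per_mcv;

    /* MCVs are ordered most-frequent first, so a prefix is the best subset */
    return (max_fit < (size_t) num_mcvs) ? (int) max_fit : num_mcvs;
}
```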