Re: bad estimation together with large work_mem generates terrible slow hash joins - Mailing list pgsql-hackers
From | Tomas Vondra
---|---
Subject | Re: bad estimation together with large work_mem generates terrible slow hash joins
Msg-id | 540F8401.6040808@fuzzy.cz
In response to | Re: bad estimation together with large work_mem generates terrible slow hash joins (Robert Haas <robertmhaas@gmail.com>)
Responses | Re: bad estimation together with large work_mem generates terrible slow hash joins
List | pgsql-hackers

On 9.9.2014 16:09, Robert Haas wrote:
> On Mon, Sep 8, 2014 at 5:53 PM, Tomas Vondra <tv@fuzzy.cz> wrote:
>> So I only posted the separate patch for those who want to do a review,
>> and then a "complete patch" with both parts combined. But it sure may be
>> a bit confusing.
>
> Let's do this: post each new version of the patches only on this
> thread, as a series of patches that can be applied one after another.

OK, attached. Apply in this order:

1) dense-alloc-v5.patch
2) hashjoin-nbuckets-v12.patch

>>> In ExecChooseHashTableSize(), if buckets_bytes is independent of
>>> nbatch, could we just compute it before working out dbatch, and just
>>> deduct it from hash_table_bytes? That seems like it'd do the same
>>> thing for less work.
>>
>> Well, we can do that. But I really don't think that's going to make
>> measurable difference. And I think the code is more clear this way.
>
> Really? It seems to me that you're doing more or less the same
> calculation that's already been done a second time. It took me 20
> minutes of staring to figure out what it was really doing. Now
> admittedly I was a little low on caffeine, but...

I reworked this part a bit, to make it easier to understand, mostly
following your recommendations (a rough sketch of the reworked flow is
near the end of this mail).

>>> Either way, what happens if buckets_bytes is already bigger than
>>> hash_table_bytes?
>>
>> Hm, I don't see how that could happen.
>
> How about an Assert() and a comment, then?

Done. According to the comment, buckets_bytes should never really exceed
25% of hash_table_bytes, except maybe for strange work_mem values,
because we keep nbuckets as 2^N. But even then it should not reach 50%,
so I added this assert:

    Assert(buckets_bytes <= hash_table_bytes / 2);

We then subtract buckets_bytes and continue with the loop.

>>> A few more stylistic issues that I see:
>>>
>>> +    if ((hashtable->nbatch == 1) &&
>>> +    if (hashtable->nbuckets_optimal <= (INT_MAX/2))
>>> +    if (size > (HASH_CHUNK_SIZE/8))
>>>
>>> While I'm all in favor of using parentheses generously where it's
>>> needed to add clarity, these and similar usages seem over the top to
>>> me.
>>
>> It seemed more readable for me at the time of coding it, and it still
>> seems better this way, but ...
>>
>> https://www.youtube.com/watch?v=CxK_nA2iVXw
>
> Heh. Well, at any rate, I think PostgreSQL style is to not use
> parentheses in that situation.

FWIW, I added a HASH_CHUNK_THRESHOLD macro specifying the threshold. I
also simplified the conditions a bit and removed some of the
parentheses.

>>> +char * chunk_alloc(HashJoinTable hashtable, int size) {
>>>
>>> Eh... no.
>>>
>>> +    /* XXX maybe we should use MAXALIGN(size) here ... */
>>>
>>> +1.
>>
>> I'm not sure what's the 'no' pointing at? Maybe that the parenthesis
>> should be on the next line? And is the +1 about doing MAXALIGN?
>
> The "no" is about the fact that you have the return type, the function
> name, and the opening brace on one line instead of three lines as is
> project style. The +1 is for applying MAXALIGN to the size.

OK, fixed. I also did a few 'minor' changes to the dense allocation
patch, most notably:

* renamed HashChunk/HashChunkData to MemoryChunk/MemoryChunkData

  The original naming seemed a bit awkward.

* renamed the function to 'dense_alloc' (instead of 'chunk_alloc')

  Seems like a better fit to what the function does.

* the chunk size is 32kB (instead of 16kB), and we're using a 1/4
  threshold for 'oversized' items

  We need the threshold to be >= 8kB, to trigger the special case
  within AllocSet. The 1/4 rule is consistent with ALLOC_CHUNK_FRACTION.
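To make that a bit more concrete, here is a rough sketch of what the
dense allocator ends up looking like after these changes. This is an
illustration only, not the exact patch code -- the chunk field names
(next, maxlen, used, data) and the 'chunks' / 'batchCxt' members of
HashJoinTable are assumptions made for the sketch:

    #define HASH_CHUNK_SIZE       (32 * 1024)
    #define HASH_CHUNK_THRESHOLD  (HASH_CHUNK_SIZE / 4)

    typedef struct MemoryChunkData
    {
        struct MemoryChunkData *next;   /* chunks form a linked list */
        Size        maxlen;             /* size of the data[] buffer */
        Size        used;               /* bytes already handed out */
        char        data[1];            /* buffer holding the tuples */
    } MemoryChunkData;

    typedef struct MemoryChunkData *MemoryChunk;

    static char *
    dense_alloc(HashJoinTable hashtable, Size size)
    {
        MemoryChunk newChunk;
        char       *ptr;

        /* keep the allocations aligned */
        size = MAXALIGN(size);

        /* oversized items (> 8kB here) get a chunk of their own */
        if (size > HASH_CHUNK_THRESHOLD)
        {
            newChunk = (MemoryChunk) MemoryContextAlloc(hashtable->batchCxt,
                                offsetof(MemoryChunkData, data) + size);
            newChunk->maxlen = size;
            newChunk->used = size;

            /*
             * Keep the current (partially filled) chunk at the head of
             * the list, so we can keep adding small tuples to it.
             */
            if (hashtable->chunks != NULL)
            {
                newChunk->next = hashtable->chunks->next;
                hashtable->chunks->next = newChunk;
            }
            else
            {
                newChunk->next = NULL;
                hashtable->chunks = newChunk;
            }

            return newChunk->data;
        }

        /* if the current chunk is full (or missing), start a new one */
        if (hashtable->chunks == NULL ||
            hashtable->chunks->maxlen - hashtable->chunks->used < size)
        {
            newChunk = (MemoryChunk) MemoryContextAlloc(hashtable->batchCxt,
                        offsetof(MemoryChunkData, data) + HASH_CHUNK_SIZE);
            newChunk->maxlen = HASH_CHUNK_SIZE;
            newChunk->used = 0;
            newChunk->next = hashtable->chunks;
            hashtable->chunks = newChunk;
        }

        /* carve the requested space out of the current chunk */
        ptr = hashtable->chunks->data + hashtable->chunks->used;
        hashtable->chunks->used += size;

        return ptr;
    }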
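And for completeness, the reworked part of ExecChooseHashTableSize()
mentioned above boils down to roughly this (again just an illustration
of the flow, not the exact code):

    /* space needed for the bucket array (nbuckets is kept as 2^N) */
    buckets_bytes = sizeof(HashJoinTuple) * nbuckets;

    /*
     * The buckets are sized to ~25% of hash_table_bytes, so even with
     * odd work_mem values they cannot use up half of it.
     */
    Assert(buckets_bytes <= hash_table_bytes / 2);

    /* reserve the bucket space, the rest is available for the tuples */
    hash_table_bytes -= buckets_bytes;

    /* ... and the nbatch loop then works with this reduced budget ... */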
I also considered Heikki's suggestion that while rebatching, we can
reuse chunks with a single large tuple, instead of copying it
needlessly. My attempt, however, made the code much uglier (I almost
used a GOTO, but my hands refused to type it ...). I'll look into that.

Tomas
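PS: Just to illustrate what I mean by reusing the chunks -- a purely
hypothetical sketch with made-up names ('oldChunk' is the chunk being
walked, batchno/curbatch stand for the keep-or-spill decision); the
real loop in ExecHashIncreaseNumBatches is more involved, which is
exactly where it got ugly:

    if (oldChunk->used > HASH_CHUNK_THRESHOLD && batchno == curbatch)
    {
        /*
         * The chunk holds a single oversized tuple that stays in the
         * current batch -- relink the whole chunk into the new list
         * instead of copying the tuple into a fresh chunk.
         */
        oldChunk->next = hashtable->chunks;
        hashtable->chunks = oldChunk;
    }
    else
    {
        /* otherwise copy or spill the tuple(s) as the code does now */
    }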