Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: Hash Joins vs. Bloom Filters / take 2
Date
Msg-id CAGTBQpZ5UwF2jphmbHO4u5m_oqDj2jqU1L1AT1K16XKkJuNL6g@mail.gmail.com
Whole thread Raw
In response to Re: Hash Joins vs. Bloom Filters / take 2  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Hash Joins vs. Bloom Filters / take 2
List pgsql-hackers
On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>> That's kinda slow to do per-item. I tried to "count" distinct items by
>> checking the BF before adding (don't add redundantly), but that's less
>> precise than a HLL in my experience.
>
> But you don't need to do that for each item you try to add - you know
> that with M bits and K hash functions you can't reach 50% before at
> least M/(2*K) additions. And you don't really need to track the number
> of bits exactly - if it gets 55% full, it's not going to explode.
>
> So just wait until M/(2*K) additions, see how many bits remain until the
> threshold - rinse and repeat. And use some reasonable minimum distance
> (say, 5% of the capacity), not to do the check too often.

Ok, that works too.

Point is, SBFs help to keep the BF size as small as possible, keep it
below work_mem, and to avoid caring about the total number of distinct
items.

You may want the planner to try and estimate that to figure out
whether it's worth trying BFs or not, but once you decide to try, SBFs
remove any dependency on the accuracy of that estimate.


pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Allow workers to override datallowconn
Next
From: Andres Freund
Date:
Subject: Re: Add PGDLLIMPORT to enable_hashagg