Home > mailing lists

Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

From	Claudio Freire
Subject	Re: Hash Joins vs. Bloom Filters / take 2
Date	February 22, 2018 23:52:10
Msg-id	CAGTBQpZ5UwF2jphmbHO4u5m_oqDj2jqU1L1AT1K16XKkJuNL6g@mail.gmail.com Whole thread Raw
In response to	Re: Hash Joins vs. Bloom Filters / take 2 (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: Hash Joins vs. Bloom Filters / take 2
List	pgsql-hackers

Tree view

On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>> That's kinda slow to do per-item. I tried to "count" distinct items by
>> checking the BF before adding (don't add redundantly), but that's less
>> precise than a HLL in my experience.
>
> But you don't need to do that for each item you try to add - you know
> that with M bits and K hash functions you can't reach 50% before at
> least M/(2*K) additions. And you don't really need to track the number
> of bits exactly - if it gets 55% full, it's not going to explode.
>
> So just wait until M/(2*K) additions, see how many bits remain until the
> threshold - rinse and repeat. And use some reasonable minimum distance
> (say, 5% of the capacity), not to do the check too often.

Ok, that works too.

Point is, SBFs help to keep the BF size as small as possible, keep it
below work_mem, and to avoid caring about the total number of distinct
items.

You may want the planner to try and estimate that to figure out
whether it's worth trying BFs or not, but once you decide to try, SBFs
remove any dependency on the accuracy of that estimate.

pgsql-hackers by date:

From: Andres Freund
Date: 22 February 2018, 23:32:38
Subject: Re: Allow workers to override datallowconn

From: Andres Freund
Date: 23 February 2018, 00:04:46
Subject: Re: Add PGDLLIMPORT to enable_hashagg

Re: Hash Joins vs. Bloom Filters / take 2 - Mailing list pgsql-hackers

Previous

Next