Home > mailing lists

Re: On partitioning - Mailing list pgsql-hackers

From	Amit Langote
Subject	Re: On partitioning
Date	December 15, 2014 10:01:53
Msg-id	007201d0184d$077bd9d0$16738d70$@lab.ntt.co.jp Whole thread Raw
In response to	Re: On partitioning (Claudio Freire <klaussfreire@gmail.com>)
List	pgsql-hackers

Tree view

Claudio Freire wrote:
> On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote
> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> >> On egress you need some direct way to compare the scan quals with the
> >> partitioning values.  I would imagine this to be similar to how scan
> >> quals are compared to the values stored in a BRIN index: each scan qual
> >> has a corresponding operator strategy and a scan key, and you can say
> >> "aye" or "nay" based on a small set of operations that can be run
> >> cheaply, again without any proof or running arbitrary expressions.
> >>
> >
> > My knowledge of this is far from being perfect, though to clear any
> confusions -
> >
> > As far as planning is concerned, I could not imagine how index access
> method way of pruning partitions could be made to work. Of course, I may
> be missing something.
>
> Let me be overly verbose, don't take it as patronizing, just answering
> in lots of detail why this could be a good idea to try.
>

Thanks for explaining. It helps.

> Normal indexes store a pointer for each key value of sorts. So B-Tree
> gets you a set of tids for each key, and so does GIN and hash.
>
> But BRIN is different in that it does the mapping differently. BRIN
> stores a compact, approximate representation of the set of keys within
> a page range. It can tell with some degree of (in)accuracy whether a
> key or key range could be part of that page range or not. The way it
> does this is abstracted out, but at its core, it stores a "compressed"
> representation of the key set that takes a constant amount of bits to
> store, and no more, no matter how many elements. What changes as the
> element it represents grows, is its accuracy.
>
> Currently, BRIN only supports min-max representations. It will store,
> for each page range, the minimum and maximum of some columns, and
> when
> you query it, you can compare range vs range, and discard whole page
> ranges.
>
> Now, what are partitions, if not page ranges?

Yes, I can see a partition as a page range. The fixed summary info in BRIN's terms would be range bounds in case this
isa rang partition, list of values in case this is a list partition and hash value in case this is a hash partition. 

There is debate on the topic but each of these partitions also happens to be a separate relation. IIUC, BRIN is an
accessmethod for a relation (say, top-level partitioned relation) that comes into play in executor if that access
methodsurvives as preferred access method by the planner. I cannot see a way to generalize it further and make it
supporteach block range as a separate relation and then use it for partition pruning in planner. This is assuming a
partitionedrelation is planned as an Append node which contains a list of plans for surviving partition relations based
on,say, restrict quals. 

I may be thinking of BRIN as a whole as not being generalized enough but I may be wrong. Could you point out if so?

> A BRIN tuple is a min-max pair. But BRIN in more general, it could use
> other data structures to hold that "compressed representation", if
> someone implemented them. Like bloom filters [0].
>
> A BRIN index is a complex data structure because it has to account for
> physically growing tables, but all the complexities vanish when you
> fix a "block range" (the BR in BRIN) to a partition. Then, a mere
> array of BRIN tuples would suffice.
>
> BRIN already contains the machinery to turn quals into something that
> filters out entire partitions, if you provide the BRIN tuples.
>

IIUC, that machinery comes into play when, say, a Bitmap Heap scan starts, right?

> And you could even effectively matain a BRIN index for the partitions
> (just a BRIN tuple per partition, dynamically updated with every
> insertion).
>
> If you do that, you start with empty partitions, and each insert
> updates the BRIN tuple. Avoiding concurrency loss in this case would
> be tricky, but in theory this could allow very general partition
> exclusion. In fact it could even work with constraint exclusion right
> now: you'd have a single-tuple BRIN index for each partition and
> benefit from it.
>
> But you don't need to pay the price of updating BRIN indexes, as
> min-max tuples for each partition can be produced while creating the
> partitions if the syntax already provides the information. Then, it's
> just a matter of querying this meta-data which just happens to have
> the form of a BRIN tuple for each partition.
>

Thanks,
Amit

pgsql-hackers by date:

From: Andres Freund
Date: 15 December 2014, 09:58:52
Subject: Re: Escaping from blocked send() reprised.

From: Petr Jelinek
Date: 15 December 2014, 10:37:13
Subject: Re: Sequence Access Method WIP

Re: On partitioning - Mailing list pgsql-hackers

Previous

Next