Re: On partitioning - Mailing list pgsql-hackers
From | Amit Langote |
---|---|
Subject | Re: On partitioning |
Date | |
Msg-id | 007201d0184d$077bd9d0$16738d70$@lab.ntt.co.jp Whole thread Raw |
In response to | Re: On partitioning (Claudio Freire <klaussfreire@gmail.com>) |
List | pgsql-hackers |
Claudio Freire wrote: > On Sun, Dec 14, 2014 at 11:12 PM, Amit Langote > <Langote_Amit_f8@lab.ntt.co.jp> wrote: > >> On egress you need some direct way to compare the scan quals with the > >> partitioning values. I would imagine this to be similar to how scan > >> quals are compared to the values stored in a BRIN index: each scan qual > >> has a corresponding operator strategy and a scan key, and you can say > >> "aye" or "nay" based on a small set of operations that can be run > >> cheaply, again without any proof or running arbitrary expressions. > >> > > > > My knowledge of this is far from being perfect, though to clear any > confusions - > > > > As far as planning is concerned, I could not imagine how index access > method way of pruning partitions could be made to work. Of course, I may > be missing something. > > Let me be overly verbose, don't take it as patronizing, just answering > in lots of detail why this could be a good idea to try. > Thanks for explaining. It helps. > Normal indexes store a pointer for each key value of sorts. So B-Tree > gets you a set of tids for each key, and so does GIN and hash. > > But BRIN is different in that it does the mapping differently. BRIN > stores a compact, approximate representation of the set of keys within > a page range. It can tell with some degree of (in)accuracy whether a > key or key range could be part of that page range or not. The way it > does this is abstracted out, but at its core, it stores a "compressed" > representation of the key set that takes a constant amount of bits to > store, and no more, no matter how many elements. What changes as the > element it represents grows, is its accuracy. > > Currently, BRIN only supports min-max representations. It will store, > for each page range, the minimum and maximum of some columns, and > when > you query it, you can compare range vs range, and discard whole page > ranges. > > Now, what are partitions, if not page ranges? Yes, I can see a partition as a page range. The fixed summary info in BRIN's terms would be range bounds in case this isa rang partition, list of values in case this is a list partition and hash value in case this is a hash partition. There is debate on the topic but each of these partitions also happens to be a separate relation. IIUC, BRIN is an accessmethod for a relation (say, top-level partitioned relation) that comes into play in executor if that access methodsurvives as preferred access method by the planner. I cannot see a way to generalize it further and make it supporteach block range as a separate relation and then use it for partition pruning in planner. This is assuming a partitionedrelation is planned as an Append node which contains a list of plans for surviving partition relations based on,say, restrict quals. I may be thinking of BRIN as a whole as not being generalized enough but I may be wrong. Could you point out if so? > A BRIN tuple is a min-max pair. But BRIN in more general, it could use > other data structures to hold that "compressed representation", if > someone implemented them. Like bloom filters [0]. > > A BRIN index is a complex data structure because it has to account for > physically growing tables, but all the complexities vanish when you > fix a "block range" (the BR in BRIN) to a partition. Then, a mere > array of BRIN tuples would suffice. > > BRIN already contains the machinery to turn quals into something that > filters out entire partitions, if you provide the BRIN tuples. > IIUC, that machinery comes into play when, say, a Bitmap Heap scan starts, right? > And you could even effectively matain a BRIN index for the partitions > (just a BRIN tuple per partition, dynamically updated with every > insertion). > > If you do that, you start with empty partitions, and each insert > updates the BRIN tuple. Avoiding concurrency loss in this case would > be tricky, but in theory this could allow very general partition > exclusion. In fact it could even work with constraint exclusion right > now: you'd have a single-tuple BRIN index for each partition and > benefit from it. > > But you don't need to pay the price of updating BRIN indexes, as > min-max tuples for each partition can be produced while creating the > partitions if the syntax already provides the information. Then, it's > just a matter of querying this meta-data which just happens to have > the form of a BRIN tuple for each partition. > Thanks, Amit
pgsql-hackers by date: