Re: GSoC 2014 proposal - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: GSoC 2014 proposal |
Date | |
Msg-id | 533DB4D5.4010109@vmware.com Whole thread Raw |
In response to | Re: GSoC 2014 proposal (Alexander Korotkov <aekorotkov@gmail.com>) |
Responses |
Re: GSoC 2014 proposal
|
List | pgsql-hackers |
On 04/03/2014 04:15 PM, Alexander Korotkov wrote: > On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote: > >> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas < >> hlinnakangas@vmware.com> wrote: >> >>> The BIRCH algorithm as described in the paper describes building a tree >>> in memory. If I understood correctly, you're suggesting to use a pre-built >>> GiST index instead. Interesting idea! >>> >>> There are a couple of signifcant differences between the CF tree >>> described in the paper and GiST: >>> >>> 1. In GiST, a leaf item always represents one heap tuple. In the CF tree, >>> a leaf item represents a cluster, which consists of one or more tuples. So >>> the CF tree doesn't store an entry for every input tuple, which makes it >>> possible to keep it in memory. >>> >>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold >>> requirement, with respect to a threshold value T: the diameter (or radius) >>> has to be less than T". GiST imposes no such restrictions. An item can >>> legally be placed anywhere in the tree; placing it badly will just lead to >>> degraded search performance, but it's still a legal GiST tree. >>> >>> 3. A GiST index, like any other index in PostgreSQL, holds entries also >>> for deleted tuples, until the index is vacuumed. So you cannot just use >>> information from a non-leaf node and use it in the result, as the >>> information summarized at a non-leaf level includes noise from the dead >>> tuples. >>> >>> Can you elaborate how you are planning to use a GiST index to implement >>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more >>> strict in where in the tree an item can be stored, and lets the operator >>> class to specify exactly when a node is split etc. >>> >> >> Hmmm, it's likely I've imagined something quite outside of this paper, and >> even already suggested it to Ivan... :) >> I need a little time to rethink it. >> > > Using GiST we can implement BIRCH-like clustering like so: > 1) Build a CF tree as GiST index without restriction of T threshold value. > 2) Scan CF tree with threshold T with some auxiliary operator. If > consistent method see CF entry which diameter is greater than T then it > returns true. Otherwise it returns false and put this CF entry into output > area (could be either in-memory or temporary table). > 3) Process other steps of algorithm as usual. I still don't understand how that would work. You can't trust the non-leaf entries, because their CF value can contain deleted entries. So you have to scan every tuple anyway. Once you do that, what's the point of the index? - Heikki
pgsql-hackers by date: