Home > mailing lists

Re: GSoC 2014 proposal - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: GSoC 2014 proposal
Date	April 3, 2014 19:22:08
Msg-id	533DB4D5.4010109@vmware.com Whole thread Raw
In response to	Re: GSoC 2014 proposal (Alexander Korotkov <aekorotkov@gmail.com>)
Responses	Re: GSoC 2014 proposal
List	pgsql-hackers

Tree view

On 04/03/2014 04:15 PM, Alexander Korotkov wrote:
> On Wed, Apr 2, 2014 at 2:22 PM, Alexander Korotkov <aekorotkov@gmail.com>wrote:
>
>> On Tue, Apr 1, 2014 at 2:23 PM, Heikki Linnakangas <
>> hlinnakangas@vmware.com> wrote:
>>
>>> The BIRCH algorithm as described in the paper describes building a tree
>>> in memory. If I understood correctly, you're suggesting to use a pre-built
>>> GiST index instead. Interesting idea!
>>>
>>> There are a couple of signifcant differences between the CF tree
>>> described in the paper and GiST:
>>>
>>> 1. In GiST, a leaf item always represents one heap tuple. In the CF tree,
>>> a leaf item represents a cluster, which consists of one or more tuples. So
>>> the CF tree doesn't store an entry for every input tuple, which makes it
>>> possible to keep it in memory.
>>>
>>> 2. In the CF tree, "all entries in a leaf node must satisfy a threshold
>>> requirement, with respect to a threshold value T: the diameter (or radius)
>>> has to be less than T". GiST imposes no such restrictions. An item can
>>> legally be placed anywhere in the tree; placing it badly will just lead to
>>> degraded search performance, but it's still a legal GiST tree.
>>>
>>> 3. A GiST index, like any other index in PostgreSQL, holds entries also
>>> for deleted tuples, until the index is vacuumed. So you cannot just use
>>> information from a non-leaf node and use it in the result, as the
>>> information summarized at a non-leaf level includes noise from the dead
>>> tuples.
>>>
>>> Can you elaborate how you are planning to use a GiST index to implement
>>> BIRCH? You might also want to take a look at SP-GiST; SP-GiST is more
>>> strict in where in the tree an item can be stored, and lets the operator
>>> class to specify exactly when a node is split etc.
>>>
>>
>> Hmmm, it's likely I've imagined something quite outside of this paper, and
>> even already suggested it to Ivan... :)
>> I need a little time to rethink it.
>>
>
> Using GiST we can implement BIRCH-like clustering like so:
> 1) Build a CF tree as GiST index without restriction of T threshold value.
> 2) Scan CF tree with threshold T with some auxiliary operator. If
> consistent method see CF entry which diameter is greater than T then it
> returns true. Otherwise it returns false and put this CF entry into output
> area (could be either in-memory or temporary table).
> 3) Process other steps of algorithm as usual.

I still don't understand how that would work. You can't trust the 
non-leaf entries, because their CF value can contain deleted entries. So 
you have to scan every tuple anyway. Once you do that, what's the point 
of the index?

- Heikki

pgsql-hackers by date:

From: Thom Brown
Date: 03 April 2014, 19:20:24
Subject: Re: B-Tree support function number 3 (strxfrm() optimization)

From: Alexander Korotkov
Date: 03 April 2014, 19:44:29
Subject: Re: GSoC 2014 proposal

Re: GSoC 2014 proposal - Mailing list pgsql-hackers

Previous

Next