Re: Designing an extension for feature-space similarity search - Mailing list pgsql-hackers

From	Jay Levitt
Subject	Re: Designing an extension for feature-space similarity search
Date	February 16, 2012 14:18:45
Msg-id	4F3D4870.3020704@gmail.com
In response to	Re: Designing an extension for feature-space similarity search (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Designing an extension for feature-space similarity search
List	pgsql-hackers

Tom Lane wrote:
> Jay Levitt <jay.levitt@gmail.com> writes:
>> - I'm not sure how to represent arbitrary column-like features without
>> reinventing the wheel and putting a database in the database.
>
> ISTM you could define a composite type and then create operators and an
> operator class over that type.  If you were trying to make a btree
> opclass there might be a conflict with the built-in record_ops opclass,
> but since you're only interested in GIST I don't see any real
> roadblocks.

Perfect. Composite types are exactly what I need here; the application can 
declare its composite type and provide distance functions for each member, 
and the extension can use those to calculate similarity. How do I introspect 
the composite type's pg_class to see what it contains? I assume there's a 
better way than SPI on system catalogs :) Should I be using systable_* 
functions from genam, or is there an in-memory tree? I feel like funcapi 
gets me partway there but there's magic in the middle.

Can you think of any code that would serve as a sample, maybe whatever 
creates the output for psql's \d?
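
As a rough illustration of the per-member distance idea above (not the extension's actual design), the sketch below combines application-supplied distance functions into one record-level distance. Everything in it is an assumption made for illustration: the names Feature, record_distance, days_apart and abs_diff are hypothetical, the weighted sum is just one possible way to combine members, and skipping members that are NULL (None) is one possible policy among several. None of it is PostgreSQL API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class Feature:
    """One member of the (hypothetical) composite type."""
    name: str
    distance: Callable[[object, object], float]  # supplied by the application
    weight: float = 1.0                          # assumption: linear weighting


def days_apart(a: date, b: date) -> float:
    """Distance between two dates, in whole days."""
    return float(abs((a - b).days))


def abs_diff(a: float, b: float) -> float:
    """Distance between two numeric values."""
    return abs(a - b)


def record_distance(features: Sequence[Feature],
                    left: Mapping[str, object],
                    right: Mapping[str, object]) -> float:
    """Weighted sum of per-member distances between two records."""
    total = 0.0
    for f in features:
        a, b = left.get(f.name), right.get(f.name)
        if a is None or b is None:
            # Assumption: an unknown member contributes nothing.
            continue
        total += f.weight * f.distance(a, b)
    return total
```

With features [Feature("birthdate", days_apart), Feature("height_cm", abs_diff, weight=2.0)], two records whose birthdates are 10 days apart and whose heights differ by 5 come out at distance 10 + 2*5 = 20. In the real extension the member list would come from the composite type's definition rather than a hand-written list.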

> The main potential disadvantage of this is that you'd have
> the standard tuple header as overhead in index entries --- but maybe the
> entries are large enough that that doesn't matter, and in any case you
> could probably make use of the GIST "compress" method to get rid of most
> of the header.  Maybe convert to MinimalTuple, for instance, if you want
> to still be able to leverage existing support code for field extraction.

Probably not worth it to save the 8 bytes; we're starting out at about 20 
floats per row. But good to know for later optimization...

>
>> - Can domains have operators, or are operators defined on types?
>
> I think the current state of play is that you can have such things but
> the system will only consider them for exact type matches, so you might
> need more explicit casts than you ordinarily would.  However, we only
> support domains over base types not composites, so this isn't really
> going to be a profitable direction for you anyway.

Actually, as mentioned to Alexander, I'm thinking of domains per feature, 
not for the overall tuple, so that birthdate <-> birthdate can behave 
differently from now() <-> posting_date.  Sounds like that might work - 
I'll play.

>
>> - Does KNN-GiST run into problems when <-> returns values that don't "make
>> sense" in the physical world?
>
> Wouldn't surprise me.  In general, non-strict index operators are a bad
> idea.  However, if the indexed entities are records, it would be
> entirely your own business how you handled individual fields being NULL.

Yeah, that example conflated NULLs in the feature fields (we don't know your 
birthdate) with <-> on the whole tuple.  Oops.

I guess I can just test this by verifying that a KNN-GiST query ordered by 
distance returns the same results as the same query run without the index.
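
One wrinkle with that check is that rows at equal distance may legitimately come back in a different order from the index scan than from a plain sort. The sketch below is one way to compare the two result lists while tolerating that. It assumes each query's output has already been fetched as a list of (row id, distance) pairs in the order returned; the names Row, group_by_distance and same_knn_results are hypothetical helpers, not part of any library.

```python
from itertools import groupby
from typing import Iterable, List, Set, Tuple

# Assumed shape of one result row: (row id, distance), in query output order.
Row = Tuple[int, float]


def group_by_distance(rows: Iterable[Row]) -> List[Tuple[float, Set[int]]]:
    """Collapse consecutive rows with equal distance into (distance, {ids})."""
    return [(dist, {row_id for row_id, _ in group})
            for dist, group in groupby(rows, key=lambda r: r[1])]


def same_knn_results(indexed: List[Row], sequential: List[Row]) -> bool:
    """True if both lists give the same distances in the same order and
    the same set of row ids at each distance (tie order is ignored)."""
    return group_by_distance(indexed) == group_by_distance(sequential)
```

This compares distances for exact equality, which is reasonable only if both queries compute them with the same function. With a LIMIT, ties at the cutoff can still produce different but equally valid row sets, so it is simplest to compare unlimited results or to ignore the last tie group.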

Thanks for your help here.

Jay
