Re: Designing an extension for feature-space similarity search - Mailing list pgsql-hackers
From | Jay Levitt |
---|---|
Subject | Re: Designing an extension for feature-space similarity search |
Date | |
Msg-id | 4F3D4870.3020704@gmail.com Whole thread Raw |
In response to | Re: Designing an extension for feature-space similarity search (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Designing an extension for feature-space similarity search
|
List | pgsql-hackers |
Tom Lane wrote: > Jay Levitt<jay.levitt@gmail.com> writes: >> - I'm not sure how to represent arbitrary column-like features without >> reinventing the wheel and putting a database in the database. > > ISTM you could define a composite type and then create operators and an > operator class over that type. If you were trying to make a btree > opclass there might be a conflict with the built-in record_ops opclass, > but since you're only interested in GIST I don't see any real > roadblocks. Perfect. Composite types are exactly what I need here; the application can declare its composite type and provide distance functions for each member, and the extension can use those to calculate similarity. How do I introspect the composite type's pg_class to see what it contains? I assume there's a better way than SPI on system catalogs :) Should I be using systable_* functions from genam, or is there an in-memory tree? I feel like funcapi gets me partway there but there's magic in the middle. Can you think of any code that would serve as a sample, maybe whatever creates the output for psql's \d? > The main potential disadvantage of this is that you'd have > the standard tuple header as overhead in index entries --- but maybe the > entries are large enough that that doesn't matter, and in any case you > could probably make use of the GIST "compress" method to get rid of most > of the header. Maybe convert to MinimalTuple, for instance, if you want > to still be able to leverage existing support code for field extraction. Probably not worth it to save the 8 bytes; we're starting out at about 20 floats per row. But good to know for later optimization... > >> - Can domains have operators, or are operators defined on types? > > I think the current state of play is that you can have such things but > the system will only consider them for exact type matches, so you might > need more explicit casts than you ordinarily would. However, we only > support domains over base types not composites, so this isn't really > going to be a profitable direction for you anyway. Actually, as mentioned to Alexander, I'm thinking of domains per feature, not for the overall tuple, so birthdate<->birthdate differs from now()<->posting_date. Sounds like that might work - I'll play. > >> - Does KNN-GiST run into problems when<-> returns values that don't "make >> sense" in the physical world? > > Wouldn't surprise me. In general, non-strict index operators are a bad > idea. However, if the indexed entities are records, it would be > entirely your own business how you handled individual fields being NULL. Yeah, that example conflated NULLs in the feature fields (we don't know your birthdate) with <-> on the whole tuple. Oops. I guess I can just test this by verifying that KNN-GiST ordered by distance returns the same results as without the index. Thanks for your help here. Jay
pgsql-hackers by date: