Thread: Understanding and implementing a GiST Index
Hi there!
I'm trying to implement a custom indexing system using the GiST index framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search across a set of 64 bit integers by hamming distance. This is intended to be used in searching for similar images, where the 64 bit integer is actually a phash of an image, and similarity between two images is reflected in the hamming distance between two integers.
Anyways, The appropriate approach here is to use something called a BK tree, for which I've written some test code and I think I have a decent grip of (my test code seems to work, in any event).
That said:
I'm trying to implement a custom indexing system using the GiST index framework, and I have a few questions.
Basically, I'm trying to implement a system that allows me to search across a set of 64 bit integers by hamming distance. This is intended to be used in searching for similar images, where the 64 bit integer is actually a phash of an image, and similarity between two images is reflected in the hamming distance between two integers.
Anyways, The appropriate approach here is to use something called a BK tree, for which I've written some test code and I think I have a decent grip of (my test code seems to work, in any event).
That said:
- Is there a simple piece of example-code for implementing a GiST index for a single index? I've been looking through the /contrib/ directory, particularly the /contrib/btree_gist/ files, but it looks like 90% of the complexity there is related to supporting all the column types Postgres has, rather then actually tied to producing a functional index.
- Once I have something compiling, how can I check to be sure that I'm actually using the indexing module I created? I can enable my compiled extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX test_index ON testing USING gist (val);` that lets me specify *which* GiST index is actually employed? Is this even a valid question?
This is probably something that's available in one of the system tables somewhere, but I haven't had much luck with google finding out where. - Testing: What's the appropriate way to examine the generated tree structure of the index? I certainly went through a few bugs with my test tree system (and that was in python!). Is there any way to examine the index structure for debugging purposes?
- Also, it looks like `ereport()` is the proper way to emit debugging information. Is this correct?
- In that vein, is there any way to have information that is on the operations of an entire query? Looking at the number of tree nodes touched for a scan would be nice (and I would not be surprised if there is already a facility for it).
Project code is here if anyone is interested, any help would be great. I have very little idea what I'm doing.
Thanks,
Connor
On Thu, Oct 9, 2014 at 12:09 AM, Connor Wolf <wolf@imaginaryindustries.com> wrote: > I'm trying to implement a custom indexing system using the GiST index > framework, and I have a few questions. > Basically, I'm trying to implement a system that allows me to search across > a set of 64 bit integers by hamming distance. This is intended to be used in > searching for similar images, where the 64 bit integer is actually a phash > of an image, and similarity between two images is reflected in the hamming > distance between two integers. Have you seen the smlar extension? http://www.pgcon.org/2012/schedule/events/443.en.html http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;hb=HEAD;f=README http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/ > > Anyways, The appropriate approach here is to use something called a BK tree, > for which I've written some test code and I think I have a decent grip of > (my test code seems to work, in any event). > > That said: > > Is there a simple piece of example-code for implementing a GiST index for a > single index? I've been looking through the /contrib/ directory, > particularly the /contrib/btree_gist/ files, but it looks like 90% of the > complexity there is related to supporting all the column types Postgres has, > rather then actually tied to producing a functional index. > Once I have something compiling, how can I check to be sure that I'm > actually using the indexing module I created? I can enable my compiled > extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX > test_index ON testing USING gist (val);` that lets me specify *which* GiST > index is actually employed? Is this even a valid question? > This is probably something that's available in one of the system tables > somewhere, but I haven't had much luck with google finding out where. > Testing: What's the appropriate way to examine the generated tree structure > of the index? I certainly went through a few bugs with my test tree system > (and that was in python!). Is there any way to examine the index structure > for debugging purposes? > Also, it looks like `ereport()` is the proper way to emit debugging > information. Is this correct? > In that vein, is there any way to have information that is on the operations > of an entire query? Looking at the number of tree nodes touched for a scan > would be nice (and I would not be surprised if there is already a facility > for it). > > Project code is here if anyone is interested, any help would be great. I > have very little idea what I'm doing. > > Thanks, > Connor -- Kind regards, Sergey Konoplev PostgreSQL Consultant and DBA http://www.linkedin.com/in/grayhemp +1 (415) 867-9984, +7 (499) 346-7196, +7 (988) 888-1979 gray.ru@gmail.com
I had skimmed the presentation slides, but I hadn't looked that closely because it appeared to be using cosine similarity metrics, and only operated on matrices, neither of which are what I wanted. On closer examination, I think I could explode my packed hash values to a matrix. I'm not sure how the cosine similarity metric would work given that the arrays would only contain the values either 0 or 1 (as my hash value is fundamentally a integer of configurable length (you can alter the hash size, I'm using 64 bits)). I'll check out the code and poke it a bit. I'll probably also move this discussion to the hackers mailing list (the instructions say to ask here first, but Oleg suggested I go there, and I generally agree). Connor On 10/9/2014 8:19 AM, Sergey Konoplev wrote: > On Thu, Oct 9, 2014 at 12:09 AM, Connor Wolf > <wolf@imaginaryindustries.com> wrote: >> I'm trying to implement a custom indexing system using the GiST index >> framework, and I have a few questions. >> Basically, I'm trying to implement a system that allows me to search across >> a set of 64 bit integers by hamming distance. This is intended to be used in >> searching for similar images, where the 64 bit integer is actually a phash >> of an image, and similarity between two images is reflected in the hamming >> distance between two integers. > Have you seen the smlar extension? > > http://www.pgcon.org/2012/schedule/events/443.en.html > http://sigaev.ru/git/gitweb.cgi?p=smlar.git;a=blob;hb=HEAD;f=README > http://railsware.com/blog/2012/05/10/effective-similarity-search-in-postgresql/ > >> Anyways, The appropriate approach here is to use something called a BK tree, >> for which I've written some test code and I think I have a decent grip of >> (my test code seems to work, in any event). >> >> That said: >> >> Is there a simple piece of example-code for implementing a GiST index for a >> single index? I've been looking through the /contrib/ directory, >> particularly the /contrib/btree_gist/ files, but it looks like 90% of the >> complexity there is related to supporting all the column types Postgres has, >> rather then actually tied to producing a functional index. >> Once I have something compiling, how can I check to be sure that I'm >> actually using the indexing module I created? I can enable my compiled >> extension using `CREATE EXTENSION`, but is there an option for `CREATE INDEX >> test_index ON testing USING gist (val);` that lets me specify *which* GiST >> index is actually employed? Is this even a valid question? >> This is probably something that's available in one of the system tables >> somewhere, but I haven't had much luck with google finding out where. >> Testing: What's the appropriate way to examine the generated tree structure >> of the index? I certainly went through a few bugs with my test tree system >> (and that was in python!). Is there any way to examine the index structure >> for debugging purposes? >> Also, it looks like `ereport()` is the proper way to emit debugging >> information. Is this correct? >> In that vein, is there any way to have information that is on the operations >> of an entire query? Looking at the number of tree nodes touched for a scan >> would be nice (and I would not be surprised if there is already a facility >> for it). >> >> Project code is here if anyone is interested, any help would be great. I >> have very little idea what I'm doing. >> >> Thanks, >> Connor > >