Re: [HACKERS] How to implement a SP-GiST index as a extension module? - Mailing list pgsql-hackers
From | Connor Wolf |
---|---|
Subject | Re: [HACKERS] How to implement a SP-GiST index as a extension module? |
Date | |
Msg-id | CAAVqP=ra29L6bUc5n3mjzG5+Mv37i+uaGCvh_kzL6x1FSo1hXw@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] How to implement a SP-GiST index as a extension module? (Alexander Korotkov <a.korotkov@postgrespro.ru>) |
Responses |
Re: [HACKERS] How to implement a SP-GiST index as a extension module?
|
List | pgsql-hackers |
On Mon, Nov 13, 2017 at 2:09 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hi!On Mon, Nov 13, 2017 at 6:47 AM, Connor Wolf <connorw@imaginaryindustries.com> wrote: Ok, I've managed to get my custom index working.Good!It's all on github here: https://github.com/fake-name/pg-spgist_hamming, if anyone else needs a fuzzy-image searching system that can integrate into postgresql..It should be a pretty good basis for anyone else to use if they want to implement a SP-GiST index too.I took a look at the code, and I feel myself a bit confused :)It appears that you're indexing int8 values. That seems like unrealistic short representation for image signature.
It is a int8, and nope, it's a surprisingly robust and functional signature. There's a document describing the hashing mechanism here:
Functionally, the procedure is relatively simple:
- Convert to greyscale
- Resize to intermediate resolution (32x32 is common)
- Perform DCT on 32x32 image.
- Crop 32x32 image to 8x8 by throwing away the high-frequency components
- Threshold the 8x8 image by it's average
- Serialize the 64 binary values into a int8
In my case, the actual implementation is here: https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/scanner/hashFile.py#L95-L102
Also, name of repository make me think that hamming distance would be used to compare signatures. But after look at the code, I see that plain absolute value of difference is used for that purpose.static doublegetDistance(Datum v1, Datum v2){int64_t a1 = DatumGetInt64(v1);int64_t a2 = DatumGetInt64(v2);int64_t diff = Abs(a1 - a2);fprintf_to_ereport("getDistance %ld <-> %ld : %ld", a1, a2, diff); return diff;}For such notion of distance, you don't need a VP-tree or another complex indexing. B-tree is quite enough in this case. Alternatively, distance function is not what it meant to be.
You're looking in the wrong place.
https://github.com/fake-name/pg-spgist_hamming/tree/master/vptree is the code you sent me, with some simplification to make it only work on single integers. Basically,
before I started on my own stuff, I wanted to make sure I could at least implement a functional index using a much more basic structure.
https://github.com/fake-name/pg-spgist_hamming/tree/master/bktree is the actual BK-tree index, and it does indeed use hamming distance for the search metric:
static int64_t
f_hamming(int64_t a_int, int64_t b_int)
{
/*
Compute number of bits that are not common between `a` and `b`.
return value is a plain integer
*/
uint64_t x = (a_int ^ b_int);
uint64_t ret = __builtin_popcountll (x);
return ret;
}
It would be useful if you provide complete usage example of this extension: from image to signature conversion to search queries.
Actual usage is done with this project: https://github.com/fake-name/IntraArchiveDeduplicator, which
also has the older in-memory BK tree I've implemented, and it's actually used here.
I also have unit tests that sit on top of this here (see all the files that are named "Test_db_BKTree...".
Connor
pgsql-hackers by date: