GIN - Generalized Inverted iNdex. - Mailing list pgsql-hackers
From | Teodor Sigaev |
---|---|
Subject | GIN - Generalized Inverted iNdex. |
Date | |
Msg-id | 44366388.1000106@sigaev.ru Whole thread Raw |
Responses |
Re: GIN - Generalized Inverted iNdex.
Re: GIN - Generalized Inverted iNdex. |
List | pgsql-hackers |
We (me and Oleg) are glad to present GIN to PostgreSQL. If community will agree, we will commit it to HEAD branch. http://www.sigaev.ru/gin/gin.gz http://www.sigaev.ru/gin/README Install: % cd pgsql % zcat gin.gz | patch -p0 make and initdb README: Gin for PostgreSQL Gin stands for Generalized Inverted iNdex and should be considered as a genie, not drink. Generalized means that index doesn't know what operation it accelerates, it works with strategies, defined for specific data type (Index Method Strategies). In that sense, gin is similar to GiST and differ from btree index,which has predefined comparison based operations. Inverted index is an index structure storing a (key,posting list) pairs, where posting list is a set of documents, which contain this key (document, usually, contains many keys). The primary goal of the Gin index is a scalable full text search in PostgreSQL. Gin is consists of a B-tree builded over entries (ET, entries tree), where entry is an element of indexed value ( element of array, lexeme for tsvector) and where each tuple in the leaf pages is either pointer to B-tree over item pointers (PT, posting tree), or list of item pointers (PL, posting list) if tuple is small enough. Notes: There is no delete operation for ET. The reason for that is that from our experience, a set of unique words of a whole collection changed very rare. This greatly simplify code and concurrency algorithm. Gin comes with built-in support for one dimensional arrays ( integer[], text[], no support for NULL elements) and following operations: * contains : value_array @ query_array * overlap : value_array && query_array * contained: value_array ~ query_array Synopsis =# create index txt_idx on aa using gin(a); Features * Concurrency * WAL-logging (recoverable) * user-defined opclass, the scheme is similar to GiST * optimizedindex creation (make use of maintenance_work_mem to accumulate postings in memory) Limitations * no support for multicolumn indices * Gin doesn't uses scan->kill_prior_tuple & scan->ignore_killed_tuples * Ginsearch entry only by equality matching, this may be improved in future Gin interface OpClass interface (pseudocode). Example for opclass is in ginarayproc.c. Datum* extractValue(Datum inputValue, uint32* nentries) Returns array of Datum of entries of value to be indexed, nentriesshould contains number of returning entries int compareEntry( Datum a, Datum b ) Compares two entries (not the indexing values!) Datum* extractQuery(Datum query, uint32* nentries, StrategyNumber n) Returns array of Datum of entries of query to besearched, n contains Strategy number of operation. bool consistent( bool[] check, StrategyNumber n, Datum query) The size of check array is the same as sizeof of array returnedby extractQuery. Each element of check array is true if indexed value has corresponding entry in the query,i.e. if check[i] == TRUE then i-th entry of query is presented in indexed value. Function should returns true if indexed value matches by StrategyNumber and query. Open items (We appreciate any comments, help, advice and recommendations). * teach optimizer/executor, that GIN is intrinsically clustered, i.e., it always returns ItemPointer in ascendingorder. * tweak gincostestimate * GIN stores several ItemPointer to heap tuple, so vacuum full produces warning message: WARNING: index "idx" contains 88395 row versions, but table contains 51812 row versions HINT: Rebuild the index with REINDEX. TODO Nearest future: * add tsearch2 opclass * create full scale test suite Distant future: * Replace entry btree to something like to GiST * Add multicolumn support * Optimize insert operation (use backgroundindex insertion) Authors:Teodor Sigaev <teodor@sigaev.ru>Oleg Bartunov <oleg@sai.msu.su> -- Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
pgsql-hackers by date: