Thread: fulltext searching via a custom index type
(I started with this addressed to the -general list, but it just doesn't seem like a general usage topic. If -hackers is the wrong place, please point me in the right direction.)

I've been working on a custom index type (access method) that allows postgres to do fulltext searching (including phrase and proximity) that I believe is more flexible and natural than "tsearch2" (no offense to the tsearch2 guys... it's been a source of inspiration). I also plan on providing "term browsing" and hit-position information. I'm using Xapian (www.xapian.org) as the backend fulltext engine, and I'm trying to learn more of the innards of postgres as I go.

I've only spent about 18-24 hours on this, so it's nowhere near complete, and I really hesitate to even mention it in a public forum. But if in the end it doesn't suck, it might make a great contrib/ package. *shrug* who knows?

In a nutshell, I've implemented all the necessary access method functions to teach postgres about a new index type named "xapian", and so far, everything is really great. It works just like any other index type would:

    create table test (stuff varchar(255), more_stuff text);
    create index idxstuff on test using xapian (stuff);
    create index idxmore_stuff on test using xapian (more_stuff);

    insert into test (stuff, more_stuff)
         values ('this is stuff', 'this is more stuff');
    insert into test (stuff, more_stuff)
         values ('my name is eric ridge', 'i like to drink beer');

    select * from test where stuff => 'stuff' or more_stuff => '"drink beer"';

             stuff         |      more_stuff
    -----------------------+----------------------
     this is stuff         | this is more stuff
     my name is eric ridge | i like to drink beer
    (2 rows)

All this aside, I've got some questions related to indexes and query plans.

1) Is it possible for an access method to receive some kind of "DROP INDEX" notification? As far as I can tell, the answer is "no", but it can't hurt to ask.

2) When does the query planner decide that it needs to "Filter" results, and is there any way to turn this off or otherwise fool the query planner into NOT doing Filters (even if only for certain operators)? For example, this query:

    select * from test where stuff => 'stuff' AND NOT more_stuff => '"drink beer"';

has this plan:

    Index Scan using idxstuff on test (cost=0.00..-0.98 rows=250 width=177)
      Index Cond: ((stuff)::text => 'stuff'::text)
      Filter: (NOT (more_stuff => '"drink beer"'::text))

In this case, postgres is forced to re-parse the contents of the "more_stuff" field (via the defined procedure for the => operator) for every row returned by the previous index scan, just so it can determine if the field contains the phrase 'drink beer' or not. Since so much overhead is involved here, it would be cool if postgres could somehow do another index scan. Maybe there's some way for the operator function to know not only the Datum value, but the actual field (and ItemPointer)?

3) How does one get the $PGDATA directory? getenv() doesn't seem ideal since PGDATA can be specified as a command-line argument.

4) Can a function be registered as part of a transaction, pre-commit -- so the function can have an opportunity to abort the transaction? I've seen RegisterEOXactCallback, but it's not quite what I'm looking for.

5) Are there discussions in the archives of this list (or other pgsql- lists) that discuss fulltext searching that y'all think are worth reading?

thanks in advance for your time and input!

eric
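A minimal sketch of the kind of support procedure that sits behind an operator like => (all names here are hypothetical, including the xapian_match_text helper): it is handed only the two Datum values, which is why a Filter forces a full re-parse per row, and why no ItemPointer is available to the function.

    #include "postgres.h"
    #include "fmgr.h"

    /* hypothetical helper that parses the field and runs the query */
    extern bool xapian_match_text(text *field, text *query);

    /* Hypothetical support function behind the => operator.  It sees
     * only the Datum values -- no ItemPointer, no column identity --
     * which is the crux of question 2 above. */
    Datum
    xapian_contains(PG_FUNCTION_ARGS)
    {
        text   *field = PG_GETARG_TEXT_P(0);
        text   *query = PG_GETARG_TEXT_P(1);

        /* re-parse the whole field value for this row; this is the
         * per-row overhead a Filter node incurs */
        PG_RETURN_BOOL(xapian_match_text(field, query));
    }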
Eric Ridge <ebr@tcdi.com> writes:
> 1) Is it possible for an access method to receive some kind of "DROP
> INDEX" notification?

No.

> select * from test where stuff => 'stuff' AND NOT more_stuff =>
> '"drink beer"';
> has this plan:

To do better here, you'd need to invent a "not-=>" operator, so that the above could be simplified to, say,

    select * from test where stuff => 'stuff' AND more_stuff ~=> '"drink beer"';

and then define both => and ~=> as members of your index opclass, and then build a two-column index on (stuff, more_stuff) ... whereupon the planner would pass your index AM both clauses and it would be up to you to do something intelligent. You might want to back off a little on how much of this you really want to do in a first cut.

> 3) How does one get the $PGDATA directory?

DataDir. Why should you care? An index AM that wants to know this is probably broken IMHO; it's certainly trying to do something that's outside the charter of index AMs, and is likely to cause lots of headaches.

> 4) Can a function be registered as part of a transaction, pre-commit --
> so the function can have an opportunity to abort the transaction?

Why would that be a good idea? When exactly would you expect it to be called (relative to the other ninety-nine things that happen at commit)? How are you going to recover if something else aborts the transaction, either before or after your function runs?

I get the impression from your questions that you are trying to make an end run around the transaction mechanism. This is almost certainly doomed to failure. You need to find a way of storing all your data within the ordinary index structure.

			regards, tom lane
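A sketch of the negator half of Tom's suggestion, assuming the => procedure is the xapian_contains function above (both names hypothetical); DirectFunctionCall2 is the standard fmgr way to invoke one built-in function from another. Declaring => and ~=> as each other's NEGATOR (and as opclass members) is what lets the planner turn "NOT (x => y)" into an indexable qual.

    #include "postgres.h"
    #include "fmgr.h"

    extern Datum xapian_contains(PG_FUNCTION_ARGS);   /* the => procedure */

    /* hypothetical support function for a ~=> negator operator */
    Datum
    xapian_not_contains(PG_FUNCTION_ARGS)
    {
        Datum matched = DirectFunctionCall2(xapian_contains,
                                            PG_GETARG_DATUM(0),
                                            PG_GETARG_DATUM(1));

        PG_RETURN_BOOL(!DatumGetBool(matched));
    }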
On Dec 26, 2003, at 3:22 PM, Tom Lane wrote:
>> 3) How does one get the $PGDATA directory?
>
> DataDir. Why should you care? An index AM that wants to know this is
> probably broken IMHO; it's certainly trying to do something that's
> outside the charter of index AMs, and is likely to cause lots of
> headaches.

Xapian has its own storage subsystem, and that's what I'm using to store the index... not using anything internal to postgres (although this could change). As such, Xapian needs to know *where* to save its indexes... $PGDATA seemed like a good place to start.

>> 4) Can a function be registered as part of a transaction, pre-commit --
>> so the function can have an opportunity to abort the transaction?
>
> Why would that be a good idea? When exactly would you expect it to be
> called (relative to the other ninety-nine things that happen at
> commit)? How are you going to recover if something else aborts the
> transaction, either before or after your function runs?

I don't really have an answer to this. :)

> I get the impression from your questions that you are trying to make an
> end run around the transaction mechanism.

Perhaps. I'm actually fishing for ideas to bridge Xapian's transaction facilities to postgres. Your comment confirms my suspicions that it just ain't gunna work out.

> This is almost certainly doomed to failure. You need to find a way of
> storing all your data within the ordinary index structure.

You are probably right. :) I'm just playing around right now.

I do appreciate your response, it's given me a lot to think about.

eric
Eric Ridge <ebr@tcdi.com> writes:
> Xapian has its own storage subsystem, and that's what I'm using to
> store the index... not using anything internal to postgres (although
> this could change).

I would say you have absolutely zero chance of making it work that way. You will not be able to get it to interoperate reliably with transactions, checkpointing, or WAL replay; to say nothing of features we might add in the future, such as tablespaces and point-in-time recovery. You need to migrate all the data into the Postgres storage mechanism.

It might be worth pointing out here that an index AM is not bound to use exactly the typical Postgres page layout. I think you probably do have to use the standard page header, but the page contents don't have to look like tuples if you don't want 'em to. For precedent see the hash index AM, which stores ordinary index tuples on some index pages but uses other pages for its own purposes.

			regards, tom lane
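A minimal sketch of that precedent, assuming a hypothetical XapPageOpaqueData trailer: PageInit() lays down the standard page header and reserves "special space" at the end of the page, and everything between pd_lower and pd_upper is then free for whatever layout the AM likes.

    #include "postgres.h"
    #include "storage/bufpage.h"

    /* hypothetical per-page trailer, kept in the page's special space */
    typedef struct XapPageOpaqueData
    {
        uint16      xpo_flags;      /* e.g. which of several trees owns it */
    } XapPageOpaqueData;

    static void
    xap_init_page(Page page)
    {
        XapPageOpaqueData *opaque;

        /* standard page header + special space; the page body need not
         * hold IndexTuples at all, per the hash AM precedent */
        PageInit(page, BLCKSZ, sizeof(XapPageOpaqueData));
        opaque = (XapPageOpaqueData *) PageGetSpecialPointer(page);
        opaque->xpo_flags = 0;
    }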
On Dec 26, 2003, at 4:04 PM, Tom Lane wrote:
> Eric Ridge <ebr@tcdi.com> writes:
>> Xapian has its own storage subsystem, and that's what I'm using to
>> store the index... not using anything internal to postgres (although
>> this could change).
>
> I would say you have absolutely zero chance of making it work that way.

thanks for the encouragement! :)

> You will not be able to get it to interoperate reliably with
> transactions, checkpointing, or WAL replay; to say nothing of features
> we might add in the future, such as tablespaces and point-in-time
> recovery. You need to migrate all the data into the Postgres storage
> mechanism.

And these are the things I'm struggling with now. The basic indexing and searching currently works flawlessly, but the moment another user connects up, everything goes to hell.

> It might be worth pointing out here that an index AM is not bound to
> use exactly the typical Postgres page layout. I think you probably do
> have to use the standard page header, but the page contents don't have
> to look like tuples if you don't want 'em to. For precedent see the
> hash index AM, which stores ordinary index tuples on some index pages
> but uses other pages for its own purposes.

That's useful information. Thanks. I've been using the hash AM as my guide b/c it's not as complex as the other index types (at least on the public interface side).

Obviously, I'm trying to save the time and energy of re-inventing the wheel when it comes to full text indexing and searching. Xapian is an awesome standalone engine (and it's amazingly fast too!), so it seemed like a good place to start. Its backend storage subsystem is pluggable, and after our little exchange here today, I'm now considering writing a postgres backend for Xapian.

I assume the doc chapter on Page Files and the various storage-related README files are good places for more information. Any other tips or pointers?

eric
Eric Ridge <ebr@tcdi.com> writes:
> I assume the doc chapter on Page Files and the various storage-related
> README files are good places for more information. Any other tips or
> pointers?

Not right offhand, but feel free to ask questions when you get stuck. Also don't forget that there's a wealth of info in source code comments; you should always try looking for existing code that does something similar to what you want to do.

BTW, one problem with using the hash AM as a model is that it's not yet WAL-aware. The btree AM is the only one that's really WAL-ified yet, so you need to look at that if you want to think now about how to make your code safe for WAL. I think you'd probably be okay to postpone this consideration for later, but keep in mind that all your operations that change the contents of index files should be divisible into bite-size operations that can be logged as WAL entries. Each "bite-size operation" has to leave the index in a legal state, too, so there's a limit as to how small you can subdivide.

			regards, tom lane
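To make "bite-size operations" concrete, here is a hedged sketch against the 7.4-era WAL interface: each operation is described by an XLogRecData chain handed to XLogInsert. The record struct, the resource manager id, and the info flag below are all assumptions, and the XLogRecData field set should be checked against access/xlog.h for the actual version.

    #include "postgres.h"
    #include "access/xlog.h"
    #include "storage/block.h"
    #include "storage/buf.h"

    #define RM_XAPIAN_ID        99      /* hypothetical resource manager id */
    #define XLOG_XAPIAN_WRITE   0x10    /* hypothetical record type */

    /* hypothetical WAL record body: "these bytes went to this location" */
    typedef struct xl_xapian_write
    {
        BlockNumber blkno;      /* page within the index relation */
        uint16      offset;     /* byte offset within the page */
        uint16      len;        /* number of bytes written */
        /* the bytes themselves follow, as a second rdata entry */
    } xl_xapian_write;

    static void
    xap_log_write(xl_xapian_write *xlrec, char *bytes)
    {
        XLogRecData rdata[2];

        rdata[0].buffer = InvalidBuffer;
        rdata[0].data = (char *) xlrec;
        rdata[0].len = sizeof(xl_xapian_write);
        rdata[0].next = &rdata[1];

        rdata[1].buffer = InvalidBuffer;
        rdata[1].data = bytes;
        rdata[1].len = xlrec->len;
        rdata[1].next = NULL;

        (void) XLogInsert(RM_XAPIAN_ID, XLOG_XAPIAN_WRITE, rdata);
    }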
On Dec 26, 2003, at 4:04 PM, Tom Lane wrote:
> Eric Ridge <ebr@tcdi.com> writes:
>> Xapian has its own storage subsystem, and that's what I'm using to
>> store the index... not using anything internal to postgres (although
>> this could change).
>
> I would say you have absolutely zero chance of making it work that way.

I still think this is one of the best quotes I've heard in awhile. :)

> It might be worth pointing out here that an index AM is not bound to
> use exactly the typical Postgres page layout.

Thanks again for this little bit of info. It was just enough to get me thinking about how to make "it work".

Xapian is basically a big btree index, except it's 5 btree indexes. One for terms, one for posts (terms with positions), one for positions, one for arbitrary document values, and one for the documents themselves. Each index is made up of 3 physical files on disk. All told, there's 17 files for a single Xapian index (15 db files, a versioninfo file, and a lock file).

I couldn't think of a way to create a whole new database type for Xapian that could deal with managing 5 btree indexes inside of Postgres (other than using tables w/ standard postgres btree indexes on certain fields), so instead, I dug into Xapian and abstracted out its filesystem i/o (open, read, write, etc).

(as an aside, I did spend some time pondering ways to adapt Postgres' nbtree AM to handle this, but I just don't understand how it works)

Once I had Xapian's filesystem i/o encapsulated into a nice little C++ class, I embarked on creating a mini "filesystem" on top of Postgres' storage subsystem. In essence, I've now got a Postgres access method that mirrors the basics of a filesystem, from creating/opening files to reading from and writing to them, in addition to truncation and deletion. After that, it was just a matter of the glue code to teach Xapian to use this "filesystem" for all its filesystem i/o, and voila!, Xapian works on top of Postgres' storage subsystem and I didn't have to rewrite Xapian from scratch. And surprisingly, despite the additional overhead of this filesystem abstraction layer, it's still very fast... esp. once Buffers get cached.

I've still got more work to do (like dealing with locking and general concurrency issues, not to mention bugs I haven't found yet), but it's working *really* well in a single-user environment.

So here's the important question: How stupid is this?

I've done some benchmarking against tsearch2. Attached are the queries and execution times on my dual 2gig G5 w/ 2gig ram. The table contains 51,160 records. It's every text file contained on my computer (which includes multiple copies of all my java projects). All told, it's 337,343,569 bytes of data, with an average file size of 6,594 bytes. The Xapian operator is "=>", and tsearch2's operator is "@@". I ran each query 6 times, and just took the best execution time.

It's also worth noting that my document parser is much different than tsearch2's. I'm splitting words on non-alphanumerics (and currently am not using stopwords), and it seems that tsearch2 tries to do something more intelligent, so the # of results returned vary widely between tsearch2 and Xapian. I'm not offering an opinion on which way is "better".

I've got a few more questions about transactions, locking, and a few other things, but I just thought I'd throw this out as a status report and to see if there's any kind of reaction.

thanks for your time.

eric
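The abstraction described above might look something like the following vtable (names entirely hypothetical; the real interface is a C++ class): every "file" operation Xapian issues is redirected into reads and writes of Postgres index pages instead of OS files.

    #include <stddef.h>

    /* Hypothetical file-I/O vtable: Xapian's open/read/write/etc. calls
     * are routed through these hooks, whose implementations store the
     * "file" contents in pages of a Postgres index relation. */
    typedef struct XapVfs
    {
        int     (*open)     (const char *name, int flags);
        int     (*close)    (int fd);
        long    (*read)     (int fd, void *buf, size_t len);
        long    (*write)    (int fd, const void *buf, size_t len);
        long    (*seek)     (int fd, long offset, int whence);
        int     (*truncate) (int fd, long length);
        int     (*unlink)   (const char *name);
    } XapVfs;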
On Thu, Jan 01, 2004 at 11:19:07PM -0500, Eric B. Ridge wrote:
> I couldn't think of a way to create a whole new database type for
> Xapian that could deal with managing 5 btree indexes inside of Postgres
> (other than using tables w/ standard postgres btree indexes on certain
> fields), so instead, I dug into Xapian and abstracted out its
> filesystem i/o (open, read, write, etc).
>
> (as an aside, I did spend some time pondering ways to adapt Postgres'
> nbtree AM to handle this, but I just don't understand how it works)

I think your approach is too ugly. You will have tons of problems the minute you start thinking about concurrency (unless you want to allow only a single user accessing the index) and recovery (unless you want to force users to REINDEX when the system crashes).

I think one way of attacking the problem would be using the existing nbtree by allowing it to store the five btrees. First read the README in the nbtree dir, and then poke at the metapage's only structure. You will see that it has a BlockNumber to the root page of the index. Try modifying that to make it have a BlockNumber to every index's root page. You will have to provide ways to access each root page and maybe other nonstandard things (such as telling the root split operation what root page you are going to split), but you will get recovery and concurrency (at least to a point) for free.

Hope this helps,

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La espina, desde que nace, ya pincha" (Proverbio africano)
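A hedged sketch of the metapage change Alvaro suggests, modeled loosely on nbtree's BTMetaPageData (the struct and field names below are hypothetical; compare src/backend/access/nbtree):

    #include "postgres.h"
    #include "storage/block.h"

    #define XAP_NTREES 5    /* terms, posts, positions, values, documents */

    /* Hypothetical multi-root variant of BTMetaPageData: the single
     * metapage tracks a root (and tree depth) for each of the five
     * btrees instead of just one. */
    typedef struct XapMetaPageData
    {
        uint32      xm_magic;               /* should contain a magic number */
        uint32      xm_version;             /* layout version */
        BlockNumber xm_root[XAP_NTREES];    /* root page of each tree */
        uint32      xm_level[XAP_NTREES];   /* depth of each tree */
    } XapMetaPageData;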
> I think one way of attacking the problem would be using the existing
> nbtree by allowing it to store the five btrees. First read the README
> in the nbtree dir, and then poke at the metapage's only structure. You
> will see that it has a BlockNumber to the root page of the index. Try
> modifying that to make it have a BlockNumber to every index's root
> page. You will have to provide ways to access each root page and maybe
> other nonstandard things (such as telling the root split operation what
> root page you are going to split), but you will get recovery and
> concurrency (at least to a point) for free.

Why not write it using the GiST interface - that is after all the entire point of GiST...

Chris
On Sat, Jan 03, 2004 at 02:49:23PM +0800, Christopher Kings-Lynne wrote:
> > I think one way of attacking the problem would be using the existing
> > nbtree by allowing it to store the five btrees.
>
> Why not write it using the GiST interface - that is after all the
> entire point of GiST...

Well, the project is to adapt Postgres as a Xapian backend ... I think it is much more straightforward to use the btrees that Xapian already uses than to use GiST. Plus, GiST already has an interface for fulltext indexing, doesn't it?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"La victoria es para quien se atreve a estar solo"
On Sat, 3 Jan 2004, Christopher Kings-Lynne wrote:
> > I think one way of attacking the problem would be using the existing
> > nbtree by allowing it to store the five btrees. First read the README
> > in the nbtree dir, and then poke at the metapage's only structure. You
> > will see that it has a BlockNumber to the root page of the index. Try
> > modifying that to make it have a BlockNumber to every index's root
> > page. You will have to provide ways to access each root page and maybe
> > other nonstandard things (such as telling the root split operation
> > what root page you are going to split), but you will get recovery and
> > concurrency (at least to a point) for free.
>
> Why not write it using the GiST interface - that is after all the
> entire point of GiST...

I think this would be the best approach; we already have btree_gist, which is a GiST realization of btree. We started working on concurrency and recovery for GiST and hope to get it working for 7.5. Another thing we'd like to work on is extending the GiST interface, so other important ADTs like digital trees, quad-trees, s-trees, etc. could be implemented. But we decided to get concurrency and recovery working first.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
On Jan 2, 2004, at 4:54 PM, Alvaro Herrera wrote:
> I think your approach is too ugly. You will have tons of problems the
> minute you start thinking about concurrency (unless you want to allow
> only a single user accessing the index)

It might be ugly, but it's very fast. Surprisingly fast, actually.

Concerning concurrency, Xapian internally supports multiple readers and only 1 concurrent writer. So the locking requirements should be far less complex than a true concurrent solution. Now, I'm not arguing that this is ideal, but if Xapian is a search engine you're interested in, then you've already made up your mind that you're willing to deal with 1 writer at a time.

However, Xapian does have built-in support for searching multiple databases at once. One thought I've had is to simply create a new 1-document database on every INSERT/UPDATE beyond the initial CREATE INDEX. Then whenever you do an index scan, tell Xapian to use all the little databases that exist in the index. This would give some bit of concurrency. Then on VACUUM (or VACUUM FULL), all these little databases could be merged back into the main index.

> and recovery (unless you want to force users to REINDEX when the
> system crashes).

I don't yet understand how the WAL stuff works. I haven't looked at the APIs yet, but if something you can record is "write these bytes to this BlockNumber at this offset", or if you can say, "index Tuple X from Relation Y", then it seems like recovery is still possible. If ya can't do any of that, then I need to go look at WAL further.

> I think one way of attacking the problem would be using the existing
> nbtree by allowing it to store the five btrees. First read the README
> in the nbtree dir, and then poke at the metapage's only structure. You
> will see that it has a BlockNumber to the root page of the index.

Right, I had gotten this far in my investigation already. The daunting thing about trying to use the nbtree code is the code itself. It's very complex. Plus, I just don't know how well the rest of Xapian would respond to all of a sudden having a concurrent backend. It's likely that it would make no difference, but it's just an unknown to me at this time.

> Try modifying that to make it have a BlockNumber to every index's root
> page. You will have to provide ways to access each root page and maybe
> other nonstandard things (such as telling the root split operation what
> root page you are going to split), but you will get recovery and
> concurrency (at least to a point) for free.

And I'm not convinced that recovery and concurrency would be "for free" in this case either. The need to keep essentially 5 different trees in sync greatly complicates the concurrency issue, I would think.

thanks for your time!

eric
Thanks to everyone that provided input about this idea. After taking everything into consideration and talking with Olly Betts from the Xapian project, what I'm going to do is implement a btree that supports multiple roots and a "tag" value of arbitrary length for each item in the tree, then implement a new Xapian backend that uses it. In the end, it'll be much more efficient and much less dirty than this silly filesystem thing I have now. And hopefully, concurrency and WAL will be easier to deal with too. Having the existing nbtree AM as a guide will be very useful.

I have one issue that could potentially make all of this completely useless, and it's related to Postgres deciding to use a Filter instead of an Index Scan. I asked this question last week, and Tom Lane responded with a solution, but I don't think I explained myself very well. And now that I've got a bit more experience with some of the PG internals, maybe I can ask the question more intelligently.

Given this query:

    select id from table where document_text ==> 'food' and title ==> 'water';

Postgres generates this plan:

    Index Scan using idxtitle on table (cost=0.00..4.01 rows=1 width=8)
      Index Cond: (title ==> 'water'::text)
      Filter: (document_text ==> 'food'::text)

The problem is, the "document_text" column is *really* big. Average value length is 171k. With this query plan, my operator procedure is forced to re-parse the document_text column from each row returned by the index scan against the title column, and do a bunch of Xapian tricks for each row. The overhead is huge.

The composite index solution doesn't seem ideal here because there's absolutely no way I can anticipate every combination of fields a user might choose to search.

Forgetting about composite indexes for a moment, is postgres even capable of doing 2 index scans in this situation? Does it know how to take the intersection of two scans?

My AM's cost estimate function literally sets the selectivity, correlation, total cost, and startup cost values to zero (and I've tried tons of combinations of really big, really small, and really negative values). My thought behind setting them all to zero was that if the cost function basically says, "There is no cost", I could fool Postgres into wanting to use the index everywhere it can. Sadly, this doesn't work out.

Now, I realize I can fix my cost estimate function to return better costs for the title and document_text fields so that PG will instead decide to index scan on document_text, but what I really want to do is make PG do an index scan for each field. Can PG even do this?

I'd appreciate any thoughts on this. Hopefully, I'm just missing something really obvious.

thanks!

eric
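For reference, a cost estimator that zeroes every output, as described above, would look roughly like this under the 7.4-era amcostestimate calling convention; the argument positions are an assumption to verify against the index AM docs of the actual version.

    #include "postgres.h"
    #include "fmgr.h"
    #include "nodes/nodes.h"

    /* Hedged sketch of the AM's cost estimator, zeroing every output
     * the way the message above describes.  Note that however cheap it
     * claims the index is, the planner still cannot run two separate
     * index scans and intersect them. */
    Datum
    xapcostestimate(PG_FUNCTION_ARGS)
    {
        /* args 0-2 (Query *, IndexOptInfo *, List *indexQuals) unused */
        Cost        *indexStartupCost = (Cost *) PG_GETARG_POINTER(3);
        Cost        *indexTotalCost = (Cost *) PG_GETARG_POINTER(4);
        Selectivity *indexSelectivity = (Selectivity *) PG_GETARG_POINTER(5);
        double      *indexCorrelation = (double *) PG_GETARG_POINTER(6);

        *indexStartupCost = 0;      /* "there is no cost" */
        *indexTotalCost = 0;
        *indexSelectivity = 0;
        *indexCorrelation = 0;

        PG_RETURN_VOID();
    }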
Eric Ridge kirjutas R, 09.01.2004 kell 01:16:
> Forgetting about composite indexes for a moment, is postgres even
> capable of doing 2 index scans in this situation? Does it know how to
> take the intersection of two scans?

Not currently, but it has been in TODO for quite some time:

    * Use bitmaps to fetch heap pages in sequential order [performance]
    * Use bitmaps to combine existing indexes [performance]

---------------
Hannu
Hannu Krosing wrote:
> Eric Ridge kirjutas R, 09.01.2004 kell 01:16:
> > Forgetting about composite indexes for a moment, is postgres even
> > capable of doing 2 index scans in this situation? Does it know how to
> > take the intersection of two scans?
>
> Not currently, but it has been in TODO for quite some time:
>
>   * Use bitmaps to fetch heap pages in sequential order [performance]
>   * Use bitmaps to combine existing indexes [performance]

And this is the foundation of bitmapped indexes, as I understand it, and is used for "star" joins, popular with data warehousing. Rather than use those terms, I spelled out the actual behavior we want.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  + If your life is a hard drive,      |  13 Roberts Road
  + Christ can be your backup.         |  Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Hannu Krosing wrote:
> > Eric Ridge kirjutas R, 09.01.2004 kell 01:16:
> > > Forgetting about composite indexes for a moment, is postgres even
> > > capable of doing 2 index scans in this situation? Does it know how
> > > to take the intersection of two scans?
> >
> > Not currently, but it has been in TODO for quite some time:
> >
> >   * Use bitmaps to fetch heap pages in sequential order [performance]
> >   * Use bitmaps to combine existing indexes [performance]
>
> And this is the foundation of bitmapped indexes, as I understand it,
> and is used for "star" joins, popular with data warehousing.

Actually I think the thinking with those two TODO items is to use these algorithms in conjunction with regular indexes: do regular index scans in parallel, construct bitmaps, perform logical operations to join them, and then do the table scan. It could be done in conjunction with a scheme for storing bitmap indexes, or not. Probably not, though.

--
greg
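A toy sketch of the mechanism Greg describes, not actual Postgres code: each index scan sets one bit per heap page it would visit, the bitmaps are ANDed, and the heap is then read once in page order, fetching only pages that satisfy both quals.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustration of the TODO items above: a page survives only if
     * *both* index scans flagged it, so the AND of the two bitmaps
     * gives the set of heap pages worth fetching, in sequential order. */
    static void
    bitmap_and(uint8_t *dst, const uint8_t *a, const uint8_t *b,
               size_t nbytes)
    {
        size_t  i;

        for (i = 0; i < nbytes; i++)
            dst[i] = a[i] & b[i];
    }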