Thread: Database-based alternatives to tsearch2?
I've seen questions asked on the list about alternatives to tsearch2, but not for the type of full text indexing I'm looking for. I'm looking for a non index-based full text indexing - one that stores the information as table data instead of index data. I do not need to implement SQL operators for searches. The application library would need to implement the actual word search. Indexes are too fragile. Our documents will be offline, and re-indexing would be impossible. Additionally, as I undertstand it, tsearch2 doesn't scale to the numbers I need (hundreds of millions of documents). Is anyone aware of any such solutions for PostgreSQL, open source or otherwise? Thanks Wes
On Tue, 2006-12-12 at 12:19 -0600, Wes wrote: > I'm looking for a non index-based full text indexing - one that stores the > information as table data instead of index data. I do not need to implement > SQL operators for searches. The application library would need to implement > the actual word search. > Store the tsvector (a custom type provided by tsearch2) as a separate column in the table. This data type holds all the important information about the indexed text, such as distinct words and some position information, but it takes up much less space than a large document. The tsearch2 package provides a lot of functionality even without the index. But after you have a tsvector column, you can create an index on it if you want. > Indexes are too fragile. Our documents will be offline, and re-indexing > would be impossible. Additionally, as I undertstand it, tsearch2 doesn't > scale to the numbers I need (hundreds of millions of documents). > Try PostgreSQL 8.2 with tsearch2 using GIN. As I understand it, that's very scalable. Also, as I understand it, a GIN index should not need to be reindexed unless there is a huge shift in the set of distinct words you're using. However, if you do need to reindex, you can if you have the tsvector column. Regards, Jeff Davis
Wes wrote: > > Indexes are too fragile. Our documents will be offline, and re-indexing > would be impossible. Additionally, as I undertstand it, tsearch2 doesn't > scale to the numbers I need (hundreds of millions of documents). Jeff's right about tsvector - sounds like it's what you're looking for. If you're worried about reindexing costs, perhaps look at partioning the table, or using partial indexes (so you could have multiple indexes for each table, based on (id mod 100) or some such). Obviously, partitioning over multiple machines is usually quite do-able for this sort of task too. > Is anyone aware of any such solutions for PostgreSQL, open source or > otherwise? Without wishing to discourage a potential large user from PG, it might be worth checking if Google/Yahoo/etc have a non-relational server that meets your needs off-the-shelf. -- Richard Huxton Archonet Ltd
Wes wrote: > I've seen questions asked on the list about alternatives to tsearch2, but > not for the type of full text indexing I'm looking for. > > I'm looking for a non index-based full text indexing - one that stores the > information as table data instead of index data. I do not need to implement > SQL operators for searches. The application library would need to implement > the actual word search. Not sure if it would fit your needs, but DBIx-TextIndex implements FTI using table data: http://search.cpan.org/~dkoch/DBIx-TextIndex-0.25/lib/DBIx/TextIndex.pm -- Daniel PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org