Re: tsearch2/GIST performance factors? - Mailing list pgsql-performance
From | Oleg Bartunov |
---|---|
Subject | Re: tsearch2/GIST performance factors? |
Date | |
Msg-id | Pine.GSO.4.63.0510172127510.10366@ra.sai.msu.su Whole thread Raw |
In response to | tsearch2/GIST performance factors? ("Craig A. James" <cjames@modgraph-usa.com>) |
Responses |
Re: tsearch2/GIST performance factors?
|
List | pgsql-performance |
On Sat, 15 Oct 2005, Craig A. James wrote: > We are indexing about 5 million small documents using tsearch2/GIST. Each > "document" contains 2 to 50 words. This is a "write once, read many" > situation. Write performance is unimportant, and the database contents are > static. (We build it offline.) > > We're having problems with inconsistent performance, and it's very hard to > separate the effects of various factors. Here are the things we think may be > relevant. > > 1. Total number of words > > Our documents currently contain about 110,000 unique words. Oleg wrote: > "[The limit is] 100K, but it's very fuzzy limit." By trial and error, we've > learned that 50,000 works well, and 150,000 works poorly, so Oleg's comment > appears to be a good rule-of-thumb. (With SIGLENINT enlarged, see below.) > But there may be other factors that affect this conclusion (such as shared > memory, total memory, etc.). > Did you consider *decreasing* SIGLENINT ? Size of index will diminish and performance could be increased. I use in current project SIGLENINT=15 > > 2. Total size of the table > > 5 million documents is not a very big database (each document is a few to a > few hundred bytes), so we don't think this is relevant. > > > 3. Number of documents per word > > There seems to be a VERY strong effect related to "common" words. When a > word occurs in more than about 1% of the documents (say 50,000 to 150,000 > documents), performance goes WAY down. Not just for that specific query, but > it screws up tsearch2/GIST completely. > > We have a test of 100 queries that return 382,000 documents total. The first > time we run it, it's slow, about 20 minutes (as expected). The second time > we run it, it's very fast, about 72 seconds -- very fast!! As long as we > avoid queries with common words, performance is very good. > > But, if we run just one query that contains a common word (a word that's in > more than about 2% of the documents, roughly 150,000 documents), then the > next time we run the 100 test queries, it will take 20 minutes again. > > We can't simply eliminate these common words. First of all, they can be very > significant. Second, it doesn't seem like 2% is "common". I can understand > that a words like "the" which occur in most documents shouldn't be indexed. > But a word that occurs in 2% of the database seems like a very good word to > index, yet it causes us great problems. > tsearch2's index is a lossy index, read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals so search results should be rechecked ! > I've read a bit about tsearchd, and wonder if it would solve our problem. > For our application, consistent performance is VERY important. If we could > lock the GIST index into memory, I think it would fix our problem. I think so, tsearchd was designed for static contents in mind and it's index doesn't require rechecking ! > > I tried copying the GIST indexes (which are in a separate tablespace) to a 1 > GB RAM disk, and it made the initial query faster, but overall performance > seemed worse, probably because the RAM disk was using memory that could have > been used by the file-system cache. > > > 4. Available RAM and Disk drives > > Would more RAM help? How would we tell Postgres to use it effectively? The > GIST indexes are currently about 2.6 GB on the disk. try to decrease signature size, say, #define SIGLENINT 15 > I feel like I'm shooting in the dark -- Linux, Postgres and tsearch2/GIST are > interacting in ways that I can't predict or analyze. Thanks very much for > any comments and advice. We have our TODO http://www.sai.msu.su/~megera/oddmuse/index.cgi/todo and hope to find sponsorhips for fts project for 8.2 release. Unfortunately, I didn't find spare time to package tsearchd for you, it should certainly help you. Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
pgsql-performance by date: