Thread: full text search to_tsquery performance with ispell dictionary
Hello everybody,
I was experimenting with the FTS feature on postgres 8.3.4 lately and encountered a weird performance issue when using a custom FTS configuration.
I use this german ispell dictionary, re-encoded to utf8:
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
With the following configuration:
CREATE TEXT SEARCH CONFIGURATION public.german_de (COPY = pg_catalog.german);
CREATE TEXT SEARCH DICTIONARY german_de_ispell (
    TEMPLATE = ispell,
    DictFile = german_de_utf8,
    AffFile = german_de_utf8,
    StopWords = german_de_utf8
);

ALTER TEXT SEARCH CONFIGURATION german_de
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
        word, hword, hword_part
    WITH german_de_ispell, german_stem;
So far so good. Indexing and creation of tsvectors works like a charm.
The problem is that if I open a new connection to the database and do something like this:
SELECT to_tsquery('german_de', 'abcd');
it takes A LOT of time for the query to complete the first time, about 1-1.5s. If I submit the same query a second, third, fourth time and so on, it takes only some 10-20ms, which is what I would expect.
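To illustrate, a fresh psql session looks roughly like this (the
timings are illustrative, just within the ranges above):

  \timing
  -- first call in this session: the ispell dictionary is loaded now
  SELECT to_tsquery('german_de', 'abcd');
  -- Time: ~1200 ms
  -- second call: the dictionary is already cached in this backend
  SELECT to_tsquery('german_de', 'abcd');
  -- Time: ~15 ms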
It almost seems as if the dictionary is somehow analyzed or indexed and the results cached for each connection, which seems counter-intuitive to me. After all, the dictionaries should not change that often.
Did I miss something or did I do something wrong?
I'd be thankful for any advice.
Kind Regards
--
Stanislav Raskin
livn GmbH
Campus Freudenberg
Rainer-Gruenter-Str. 21
42119 Wuppertal
+49(0)202-8 50 66 921
raskin@livn.de
http://www.livn.de
livn
local individual video news GmbH
Registergericht Wuppertal HRB 20086
Geschäftsführer:
Dr. Stefan Brües
Alexander Jacob
Hello

2011/5/11 Stanislav Raskin <raskin@livn.de>:
> The problem is that if I open a new connection to the database and do
> something like this:
> SELECT to_tsquery('german_de', 'abcd');
> it takes A LOT of time for the query to complete the first time, about
> 1-1.5s.

This is expected behaviour :(. Loading an ispell dictionary is very
slow.

Use a German snowball dictionary instead.

You can also use some connection pooling software.

Regards

Pavel Stehule
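P.S. A minimal sketch of the snowball route - the stock 'german'
configuration already maps word tokens to the german_stem snowball
dictionary only, so no custom setup is needed:

  -- no dictionary file to load, so even the first call is fast
  SELECT to_tsquery('german', 'abcd');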
Stanislav Raskin <raskin@livn.de> writes:
> The problem is that if I open a new connection to the database and do
> something like this:
> SELECT to_tsquery('german_de', 'abcd');
> it takes A LOT of time for the query to complete the first time, about
> 1-1.5s. If I submit the same query a second, third, fourth time and so
> on, it takes only some 10-20ms, which is what I would expect.
> It almost seems as if the dictionary is somehow analyzed or indexed and
> the results cached for each connection, which seems counter-intuitive
> to me.

Yes, loading a large dictionary is known to be a fairly expensive
operation. There's been discussions about how to make it cheaper, but
nothing's been done yet.

			regards, tom lane
Hello

2011/5/11 Stanislav Raskin <raskin@livn.de>:
> Is the dictionary german_stem supplied with postgresql a snowball
> stemmer? If yes, it sure is incredibly fast, but yields much worse
> results and thus fewer and worse matches for search queries.
>
> To use connection pooling is...difficult in my situation, to say the
> least. We currently use quite a complex pgcluster/corosync setup for
> multi-master replication, load balancing and high availability.

german_stem is part of the distribution. I think the stemmer's results
are usable, because reports about slow speed are not frequent. There is
no Czech stemmer, so we have to use ispell.

I wrote a patch that stores the loaded dictionary in shared memory. You
can find the source code in the pgsql-hackers mailing list archive. But
it isn't well tested and it is just a prototype - not accepted into pg.
You can test it.

Sometimes people use the >>simple<< configuration here. It isn't the
best, but it is fast.

Regards

Pavel Stehule
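P.S. A sketch of the >>simple<< variant (the configuration name
german_simple is only an example):

  -- 'simple' just lowercases tokens, so there is nothing to load
  CREATE TEXT SEARCH CONFIGURATION public.german_simple
    (COPY = pg_catalog.german);
  ALTER TEXT SEARCH CONFIGURATION german_simple
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
      word, hword, hword_part
    WITH simple;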
> Yes, loading a large dictionary is known to be a fairly expensive
> operation. There's been discussions about how to make it cheaper, but
> nothing's been done yet.
>
>			regards, tom lane

Hi Tom,

thanks for the quick response. Bad news for me ;(

We develop ajax-driven web apps, which rely on quick calls to data
services. Each call to a service opens a new connection. This makes the
search service, if using FTS and ispell, about 100 times slower than a
"dumb" ILIKE implementation.

Is there any hack or compromise to achieve good performance without
losing FTS ability? I am thinking, for example, of a way to permanently
keep a loaded dictionary in memory instead of loading it for every
connection. As I wrote in response to Pavel Stehule's post, connection
pooling is not really an option.

Our front-end is strictly PHP, so I was thinking about using a single
persistent connection
(http://de.php.net/manual/en/function.pg-pconnect.php) for all calls.
Is there some sort of major disadvantage in this approach from the
database point of view?

Kind regards

--

Stanislav Raskin
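P.S. If persistent connections work out, I guess the load cost could
even be paid eagerly with a throwaway query right after (re)connecting.
Just an untested idea:

  -- hypothetical warm-up: force the german_de ispell dictionary to
  -- load once per backend, so no user-facing request pays the 1-1.5s
  -- penalty
  SELECT to_tsquery('german_de', 'init');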
On 11.05.11 15:45, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:

> This is expected behaviour :(. Loading an ispell dictionary is very
> slow.
>
> Use a German snowball dictionary instead.
>
> You can also use some connection pooling software.

Thank you for the response.

Is the dictionary german_stem supplied with postgresql a snowball
stemmer? If yes, it sure is incredibly fast, but yields much worse
results and thus fewer and worse matches for search queries.

To use connection pooling is...difficult in my situation, to say the
least. We currently use quite a complex pgcluster/corosync setup for
multi-master replication, load balancing and high availability. To
introduce connection pooling to this setup could turn out to be quite a
big project.

--

Stanislav Raskin
> Is there any hack or compromise to achieve good performance without
> losing FTS ability?
>
> Our front-end is strictly PHP, so I was thinking about using a single
> persistent connection
> (http://de.php.net/manual/en/function.pg-pconnect.php) for all calls.
> Is there some sort of major disadvantage in this approach from the
> database point of view?

Hi,

opening a completely new connection for each request may be a bit
expensive, so I'd recommend using some kind of connection pooling,
especially when you're doing 'small' transactions (because that's when
the overhead matters).

We had exactly the same problem and a persistent connection solved it.
But it has some drawbacks too - each connection has its own copy of the
dictionary. So if the dictionary takes 30MB and you have 10 connections,
then 300MB of memory is used.

regards
Tomas
2011/5/11 Stanislav Raskin <raskin@livn.de>:
> What do you think about using ispell to create, store and index
> tsvectors, but at the same time to use the stemmer to create
> tsqueries? Thus, I hope to create a fast user experience while
> searching, but at the same time have more detailed tsvectors and more
> matches.

I don't know for sure - but there can be a risk: two different methods
can return two different results.

Regards

Pavel
Stanislav Raskin <raskin@livn.de> writes:
> Is there any hack or compromise to achieve good performance without
> losing FTS ability?
> I am thinking, for example, of a way to permanently keep a loaded
> dictionary in memory instead of loading it for every connection. As I
> wrote in response to Pavel Stehule's post, connection pooling is not
> really an option.

You really, really, really need to fix whatever is preventing you from
using pooling. Opening a database connection to run one query is just
horridly inefficient. Loading a dictionary has evidently raised it past
the threshold where you couldn't ignore it anymore, but that doesn't
mean you weren't paying through the nose before. A postgres backend
just isn't a lightweight object --- there is all sorts of stuff it has
to cache before it's really up to speed.

			regards, tom lane
On 11.05.11 16:42, "Pavel Stehule" <pavel.stehule@gmail.com> wrote:

> I wrote a patch that stores the loaded dictionary in shared memory.

Hi Pavel,

very interesting. I will give it a closer look.

What do you think about using ispell to create, store and index
tsvectors, but at the same time to use the stemmer to create tsqueries?
Thus, I hope to create a fast user experience while searching, but at
the same time have more detailed tsvectors and more matches.

Regards

--

Stanislav Raskin
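P.S. In other words, something like this (the table docs with columns
body and tsv is made up for illustration):

  -- ispell-based configuration at index time, for detailed tsvectors
  UPDATE docs SET tsv = to_tsvector('german_de', body);

  -- fast snowball-only configuration at query time
  SELECT * FROM docs WHERE tsv @@ to_tsquery('german', 'suchbegriff');

  -- caveat (as Pavel notes above): the two configurations may
  -- normalize the same word to different lexemes, so some matches
  -- could be missed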
On Wed, 11 May 2011, Stanislav Raskin wrote:

> Is there any hack or compromise to achieve good performance without
> losing FTS ability?

Regards,

	Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On 11.05.11 17:04, "tv@fuzzy.cz" <tv@fuzzy.cz> wrote:

> We had exactly the same problem and a persistent connection solved it.

First testing with persistent connections seems to work like a charm.
Will do some thorough testing and watch the memory load. Hopefully, I
will not trip over some sort of pitfall. Google seems to be full of
people who have problems with persistent connections.

Big thanks for your advice.
On 11.05.11 17:31, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> You really, really, really need to fix whatever is preventing you from
> using pooling. Opening a database connection to run one query is just
> horridly inefficient.

Very true. I did not mean that anything actually prevents us from using
pooling. We just have no idea how it would interact with our production
pgcluster/corosync setup. I imagine the evaluation, testing and
verification of pooling systems in combination with our setup to be
quite tedious.

Of course, I don't open a connection for each query. One is opened for
each service call. The services are designed to start a process of
their own for every call, and such a process - for now - needs its own
connection. It usually handles dozens of queries. Until now, we never
ran into performance problems; the time consumed by DB operations is
usually negligible compared to the rest.

First tests with a simple persistent connection setup seem to work fine
and solve the performance issue. I will do some thorough testing on
this setup and see whether I step into a pitfall.

Big kudos to you and this list. You were a great help, as always.