Re: Tsearch2 performance on big database - Mailing list pgsql-performance
From: Oleg Bartunov
Subject: Re: Tsearch2 performance on big database
Date:
Msg-id: Pine.GSO.4.62.0503231236390.5508@ra.sai.msu.su
In response to: Re: Tsearch2 performance on big database (Rick Jansen <rick@rockingstone.nl>)
Responses: Re: Tsearch2 performance on big database
List: pgsql-performance
On Wed, 23 Mar 2005, Rick Jansen wrote:

> Oleg Bartunov wrote:
>> On Tue, 22 Mar 2005, Rick Jansen wrote:
>>
>> Hmm, the default configuration is too eager: you index every lexeme using
>> the simple dictionary! That's probably too much. Here is what I have for my
>> russian configuration in the dictionary database:
>>
>>  default_russian | lword        | {en_ispell,en_stem}
>>  default_russian | lpart_hword  | {en_ispell,en_stem}
>>  default_russian | lhword       | {en_ispell,en_stem}
>>  default_russian | nlword       | {ru_ispell,ru_stem}
>>  default_russian | nlpart_hword | {ru_ispell,ru_stem}
>>  default_russian | nlhword      | {ru_ispell,ru_stem}
>>
>> Notice, I index only russian and english words; no numbers, urls, etc.
>> You may just delete unwanted rows in pg_ts_cfgmap for your configuration,
>> but I'd recommend just updating them, setting dict_name to NULL.
>> For example, to avoid indexing integers:
>>
>>   update pg_ts_cfgmap set dict_name=NULL
>>    where ts_name='default_russian' and tok_alias='int';
>>
>>   voc=# select token,dict_name,tok_type,tsvector
>>           from ts_debug('Do you have +70000 bucks');
>>    token  |      dict_name      | tok_type | tsvector
>>   --------+---------------------+----------+----------
>>    Do     | {en_ispell,en_stem} | lword    |
>>    you    | {en_ispell,en_stem} | lword    |
>>    have   | {en_ispell,en_stem} | lword    |
>>    +70000 |                     | int      |
>>    bucks  | {en_ispell,en_stem} | lword    | 'buck'
>>
>> Only 'bucks' gets indexed :)
>> Hmm, probably I should add this to the documentation.
>>
>> What about word statistics (# of unique words, for example)?
>>
>
> I'm now following the guide to add the ispell dictionary and I've updated
> most of the rows, setting dict_name to NULL:
>
>  ts_name | tok_alias    | dict_name
> ---------+--------------+-----------
>  default | lword        | {en_stem}
>  default | nlword       | {simple}
>  default | word         | {simple}
>  default | part_hword   | {simple}
>  default | nlpart_hword | {simple}
>  default | lpart_hword  | {en_stem}
>  default | hword        | {simple}
>  default | lhword       | {en_stem}
>  default | nlhword      | {simple}
>
> These are left, but I have no idea what an 'hword' or 'nlhword' or any of
> these other tokens is.

From my notes, http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes :

I've been asked how to find out which token types the parser supports.
There is a function token_type(parser), so you can just use:

   select * from token_type();

> Anyway, how do I find out the number of unique words or other word
> statistics?

Again from my notes, http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes :

It's useful to look at word statistics, for example, to check how well your
dictionaries work or how you configured pg_ts_cfgmap. You may also spot
probable stop words relevant for your collection. Tsearch provides the
stat() function:

.......................

Don't hesitate to read the notes, and if you find bugs or know better wording
I'd be glad to improve them.

> Rick

Regards,
     Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
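As a rough sketch of the stat() call Oleg points to above (the table name
'messages' and tsvector column 'fts_idx' are placeholders for illustration,
not names from this thread), a typical tsearch2 word-statistics query looks
like:

   -- gather per-lexeme statistics over every tsvector in the column;
   -- stat() returns one (word, ndoc, nentry) row per distinct lexeme
   select word, ndoc, nentry
     from stat('select fts_idx from messages')
    order by ndoc desc, nentry desc, word
    limit 10;

Here ndoc is the number of documents (tsvectors) containing the lexeme and
nentry the total number of occurrences; sorting by ndoc desc surfaces the
most common words, i.e. candidate stop words for the collection.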