Thread: Tsearch2 custom dictionaries
Part1.

I have created a dictionary called 'webwords' which checks all words and
curtails them to 300 chars (for now). After running

    make
    make install

I copied lib_webwords.so into my $libdir, and then ran

    psql mybd < dict_webwords.sql

The tutorial shows how to install the intdict for integer types. How should I
install my custom dictionary?

Part2.

The dictionary I am trying to create is to be used for searching multilingual
text. My aim is to have fast search over all the text, but to ignore binary
encoded data which is also present. (I will probably move to ignoring long
words in the text eventually.) What is the best approach to tackle this
problem? As the text can be multilingual, I don't think stemming is possible?

I also need to include many non-standard words in the index, such as URLs and
message IDs contained in the text.

I get the feeling that building these indexes will by no means be an easy
task, so any suggestions will be gratefully received!

Thanks...

--
On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

> Part1.
>
> I have created a dictionary called 'webwords' which checks all words
> and curtails them to 300 chars (for now)
>
> after running
>     make
>     make install
>
> I then copied the lib_webwords.so into my $libdir
>
> I have run
>
>     psql mybd < dict_webwords.sql
>
> The tutorial shows how to install the intdict for integer types. How
> should I install my custom dictionary?

Once you did 'psql mybd < dict_webwords.sql' you should be able to use it :)
Test it:

    select lexize('webwords','some_web_word');

Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict ?

> Part2.
>
> The dictionary I am trying to create is to be used for searching
> multilingual text. My aim is to have fast search over all text, but
> ignore binary encoded data which is also present. (I will probably move
> to ignoring long words in the text eventually.)
> What is the best approach to tackle this problem?
> As the text can be multilingual I don't think stemming is possible?

You're right. I'm afraid you need a UTF database, but tsearch2 isn't
UTF-8 compatible :(

> I also need to include many non-standard words in the index such as
> URLs and message IDs contained in the text.

What's a message ID? Integer? It's already recognized by the parser.
Try:

    select * from token_type();

Also, the last version of tsearch2 (for 7.3, grab it from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/; for 7.4 it's
available from CVS) has a rather useful function - ts_debug:

    apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
     ts_name | tok_type | description |     token      | dict_name |     tsvector
    ---------+----------+-------------+----------------+-----------+------------------
     simple  | host     | Host        | www.sai.msu.su | {simple}  | 'www.sai.msu.su'
     simple  | lword    | Latin word  | megera         | {simple}  | 'megera'
    (2 rows)

> I get the feeling that building these indexes will by no means be an
> easy task so any suggestions will be gratefully received!

You may write your own parser, at last. Some info about the parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

> Thanks...

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
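[For readers following the archive: the checks Oleg suggests can be combined into a quick sanity pass. This is a sketch against the tsearch2 V2 catalog tables as described in its docs; the dictionary name 'webwords' and the sample word come from the thread above, and it requires a PostgreSQL 7.3/7.4 server with tsearch2 loaded.]

    -- Verify the dictionary loaded by dict_webwords.sql is registered:
    SELECT dict_name FROM pg_ts_dict;

    -- Run the dictionary directly, bypassing any configuration:
    SELECT lexize('webwords', 'some_web_word');

    -- List which token types the default parser already recognizes
    -- (URLs, hosts, etc. may not need a custom dictionary at all):
    SELECT * FROM token_type();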
> On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:
>
> > Part1.
> >
> > I have created a dictionary called 'webwords' which checks all words
> > and curtails them to 300 chars (for now)
<snip>
> Once you did 'psql mybd < dict_webwords.sql' you should be able to use it :)
> Test it:
> select lexize('webwords','some_web_word');

I did test it with

    select lexize('webwords','some_web_word');
     lexize
    -------
     {some_web_word}

    select lexize('webwords','some_400char_web_word');
     lexize
    --------
     {some_shortened_web_word}

so that bit works, but then I tried

    SELECT to_tsvector( 'webwords', 'my words' );
    Error: No tsearch config

> Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict

Yeah, I did read it - it's good!
Should I run:

    update pg_ts_cfgmap set dict_name='{webwords}';

> > Part2.
<snip>
> > As the text can be multilingual I don't think stemming is possible?
>
> You're right. I'm afraid you need a UTF database, but tsearch2 isn't
> UTF-8 compatible :(

My database was created as unicode - does this mean I cannot use
tsearch?!

> > I also need to include many non-standard words in the index such as
> > URLs and message IDs contained in the text.
>
> What's a message ID? Integer? It's already recognized by the parser.
> Try:
> select * from token_type();
<snip>
> You may write your own parser, at last. Some info about the parser API:
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

Parser writing... scary stuff :-)

Thanks!

--
On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

<snip>
> so that bit works, but then I tried
>
>     SELECT to_tsvector( 'webwords', 'my words' );
>     Error: No tsearch config

From the reference guide:

    to_tsvector( [configuration,] document TEXT) RETURNS tsvector

> > Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict
>
> Yeah, I did read it - it's good!
> Should I run:
>
>     update pg_ts_cfgmap set dict_name='{webwords}';

After loading your dictionary into the db you should have it registered in
pg_ts_dict. Try:

    select * from pg_ts_dict;

Next, you need to read the docs, for example
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
on how to create your configuration and specify the lexem_type-to-dictionary
mapping.

> > > Part2.
> <snip>
> > > As the text can be multilingual I don't think stemming is possible?
> >
> > You're right. I'm afraid you need a UTF database, but tsearch2 isn't
> > UTF-8 compatible :(
>
> My database was created as unicode - does this mean I cannot use
> tsearch?!

We don't have any experience with UTF, so you'd better ask on the openfts
mailing list and read its archives.

<snip>
> Parser writing... scary stuff :-)
>
> Thanks!

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
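[Archive note, to close the loop on the 'No tsearch config' error: following the tsearch-V2-intro document Oleg points at, the configuration and mapping could be set up roughly as below. This is an unverified sketch: the configuration name 'webwords_cfg', the locale value, and the exact token aliases ('lword' etc.) are assumptions that should be checked against that document and `select * from token_type();` on your installation.]

    -- Create a new configuration using the default parser
    -- ('webwords_cfg' is a made-up name):
    INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
    VALUES ('webwords_cfg', 'default', 'C');

    -- Start from the default token-to-dictionary mapping...
    INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
    SELECT 'webwords_cfg', tok_alias, dict_name
    FROM pg_ts_cfgmap
    WHERE ts_name = 'default';

    -- ...then route word-like tokens through the custom dictionary
    -- instead of updating every row unconditionally:
    UPDATE pg_ts_cfgmap
    SET dict_name = '{webwords}'
    WHERE ts_name = 'webwords_cfg'
      AND tok_alias IN ('lword', 'lhword', 'lpart_hword');

    -- to_tsvector can then be told which configuration to use:
    SELECT to_tsvector('webwords_cfg', 'my words');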