Thread: Tsearch2 custom dictionaries
Part1.

I have created a dictionary called 'webwords' which checks all words and
curtails them to 300 chars (for now). After running

    make
    make install

I copied lib_webwords.so into my $libdir, and then ran

    psql mybd < dict_webwords.sql

The tutorial shows how to install the intdict for integer types. How should I
install my custom dictionary?

Part2.

The dictionary I am trying to create is to be used for searching multilingual
text. My aim is to have fast search over all the text, but to ignore binary
encoded data which is also present. (I will probably move to ignoring long
words in the text eventually.) What is the best approach to tackle this
problem? As the text can be multilingual, I don't think stemming is possible?

I also need to include many non-standard words in the index, such as URLs and
message IDs contained in the text.

I get the feeling that building these indexes will by no means be an easy
task, so any suggestions will be gratefully received!

Thanks...

--
On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

> Part1.
>
> I have created a dictionary called 'webwords' which checks all words
> and curtails them to 300 chars (for now)
>
> after running
>     make
>     make install
>
> I then copied the lib_webwords.so into my $libdir
>
> I have run
>
>     psql mybd < dict_webwords.sql
>
> The tutorial shows how to install the intdict for integer types. How
> should I install my custom dictionary?

Once you did 'psql mybd < dict_webwords.sql' you should be able to use it :)
Test it:

    select lexize('webwords','some_web_word');

Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict ?

> Part2.
>
> The dictionary I am trying to create is to be used for searching
> multilingual text. My aim is to have fast search over all text, but
> ignore binary encoded data which is also present. (I will probably move
> to ignoring long words in the text eventually.)
> What is the best approach to tackle this problem?
> As the text can be multilingual I don't think stemming is possible?

You're right. I'm afraid you need a UTF database, but tsearch2 isn't
UTF-8 compatible :(

> I also need to include many non-standard words in the index such as
> URLs and message IDs contained in the text.

What's a message ID? Integer? It's already recognized by the parser.
Try:

    select * from token_type();

Also, the last version of tsearch2 (for 7.3, grab it from
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/; for 7.4 it's
available from CVS) has a rather useful function - ts_debug:

    apod=# select * from ts_debug('http://www.sai.msu.su/~megera');
     ts_name | tok_type | description |     token      | dict_name |     tsvector
    ---------+----------+-------------+----------------+-----------+------------------
     simple  | host     | Host        | www.sai.msu.su | {simple}  | 'www.sai.msu.su'
     simple  | lword    | Latin word  | megera         | {simple}  | 'megera'
    (2 rows)

> I get the feeling that building these indexes will by no means be an
> easy task so any suggestions will be gratefully received!

You may write your own parser, at last. Some info about the parser API:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

> Thanks...

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
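[For readers following the archive: the checks Oleg suggests can be combined into a quick sanity pass. This is a sketch against the tsearch2 V2 catalog tables as described in its docs; the dictionary name 'webwords' and the sample word come from the thread above, and it requires a PostgreSQL 7.3/7.4 server with tsearch2 loaded.]

    -- Verify the dictionary loaded by dict_webwords.sql is registered:
    SELECT dict_name FROM pg_ts_dict;

    -- Run the dictionary directly, bypassing any configuration:
    SELECT lexize('webwords', 'some_web_word');

    -- List which token types the default parser already recognizes
    -- (URLs, hosts, etc. may not need a custom dictionary at all):
    SELECT * FROM token_type();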
> On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:
>
> > Part1.
> >
> > I have created a dictionary called 'webwords' which checks all words
> > and curtails them to 300 chars (for now)
<snip>
> Once you did 'psql mybd < dict_webwords.sql' you should be able to use it :)
> Test it:
> select lexize('webwords','some_web_word');

I did test it with

    select lexize('webwords','some_web_word');
     lexize
    -------
     {some_web_word}

    select lexize('webwords','some_400char_web_word');
     lexize
    --------
     {some_shortened_web_word}

so that bit works, but then I tried

    SELECT to_tsvector( 'webwords', 'my words' );
    Error: No tsearch config

> Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict

Yeah, I did read it - it's good!
Should I run:

    update pg_ts_cfgmap set dict_name='{webwords}';

> > Part2.
<snip>
> > As the text can be multilingual I don't think stemming is possible?
>
> You're right. I'm afraid you need a UTF database, but tsearch2 isn't
> UTF-8 compatible :(

My database was created as unicode - does this mean I cannot use
tsearch?!

> > I also need to include many non-standard words in the index such as
> > URLs and message IDs contained in the text.
>
> What's a message ID? Integer? It's already recognized by the parser.
> Try:
> select * from token_type();
<snip>
> You may write your own parser, at last. Some info about the parser API:
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_in_Brief

Parser writing... scary stuff :-)

Thanks!

--
On Thu, 7 Aug 2003 psql-mail@freeuk.com wrote:

<snip>
> so that bit works, but then I tried
>
>     SELECT to_tsvector( 'webwords', 'my words' );
>     Error: No tsearch config

From the reference guide:

    to_tsvector( [configuration,] document TEXT) RETURNS tsvector

> > Did you read http://www.sai.msu.su/~megera/oddmuse/index.cgi/Gendict
>
> Yeah, I did read it - it's good!
> Should I run:
>
>     update pg_ts_cfgmap set dict_name='{webwords}';

After loading your dictionary into the db you should have it registered in
pg_ts_dict. Try:

    select * from pg_ts_dict;

Next, you need to read the docs, for example
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/tsearch-V2-intro.html
on how to create your configuration and specify the lexem_type-to-dictionary
mapping.

> > > Part2.
> <snip>
> > > As the text can be multilingual I don't think stemming is possible?
> >
> > You're right. I'm afraid you need a UTF database, but tsearch2 isn't
> > UTF-8 compatible :(
>
> My database was created as unicode - does this mean I cannot use
> tsearch?!

We don't have any experience with UTF, so you'd better ask on the openfts
mailing list and read its archives.

<snip>
> Parser writing... scary stuff :-)
>
> Thanks!

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
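[Archive note, to close the loop on the 'No tsearch config' error: following the tsearch-V2-intro document Oleg points at, the configuration and mapping could be set up roughly as below. This is an unverified sketch: the configuration name 'webwords_cfg', the locale value, and the exact token aliases ('lword' etc.) are assumptions that should be checked against that document and `select * from token_type();` on your installation.]

    -- Create a new configuration using the default parser
    -- ('webwords_cfg' is a made-up name):
    INSERT INTO pg_ts_cfg (ts_name, prs_name, locale)
    VALUES ('webwords_cfg', 'default', 'C');

    -- Start from the default token-to-dictionary mapping...
    INSERT INTO pg_ts_cfgmap (ts_name, tok_alias, dict_name)
    SELECT 'webwords_cfg', tok_alias, dict_name
    FROM pg_ts_cfgmap
    WHERE ts_name = 'default';

    -- ...then route word-like tokens through the custom dictionary
    -- instead of updating every row unconditionally:
    UPDATE pg_ts_cfgmap
    SET dict_name = '{webwords}'
    WHERE ts_name = 'webwords_cfg'
      AND tok_alias IN ('lword', 'lhword', 'lpart_hword');

    -- to_tsvector can then be told which configuration to use:
    SELECT to_tsvector('webwords_cfg', 'my words');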