Thread: TSearch2 / Get all unique lexems
Is there a way to get all unique lexemes from a table with a tsvector column? The stat() function does this (and more), but I cannot use it.

Thanks

--
Regards,
Hannes Dorbath
On Wed, 7 Dec 2005, Hannes Dorbath wrote:
> Is there a way to get all unique lexems from a table with a tsvector column?
> The stat() function does this (and more), but I cannot use it..

Hmm, you could dump the tsvector column and use awk+sort+uniq.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
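The dump-and-post-process idea above can also be sketched in Python instead of awk+sort+uniq. This is only an illustration under stated assumptions: the input is the text form of each tsvector value, every lexeme is wrapped in single quotes with an embedded quote doubled, and backslash escapes are not handled. The name `unique_lexemes` is hypothetical.

```python
import re

# One single-quoted lexeme; '' inside the quotes is treated as an escaped quote.
# Assumption: the dump uses the quoted text form, e.g. 'cat':3 'fat':2,11
LEXEME_RE = re.compile(r"'((?:[^']|'')*)'")


def unique_lexemes(rows):
    """Return the distinct lexemes found in an iterable of tsvector text values."""
    seen = set()
    for row in rows:
        for match in LEXEME_RE.finditer(row):
            # Undo the quote doubling; positions such as :3 are simply skipped.
            seen.add(match.group(1).replace("''", "'"))
    return sorted(seen)


if __name__ == "__main__":
    dumped = ["'cat':3 'fat':2,11 'mat':7", "'cat':1 'it''s':2"]
    for lexeme in unique_lexemes(dumped):
        print(lexeme)
```

The resulting list could then be loaded into a separate table to serve as the word source for pg_trgm.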
On 07.12.2005 16:13, Oleg Bartunov wrote:
> hmm, you could dump tsvector column and use awk+sort+uniq

Thanks. I hoped for something possible inside a PL/pgSQL procedure. I'm trying to integrate pg_trgm with Tsearch2. I'm still on my UTF-8 database. Yes, I know, there is _NO_ UTF-8 support of any kind in Tsearch2 yet, but I got it working to a degree that is OK for my application (I created my own stemmer variant, ispell dictionary, affix file, etc.). The last missing bit is a source of words for pg_trgm. I cannot use the stat() function, because it breaks as soon as it sees a UTF-8 character. I thought of using lexize(), turning the text array into rows somehow, writing them to a temp table, and using SELECT DISTINCT, but I haven't had any success yet.

--
Regards,
Hannes Dorbath
> Thanks. I hoped for something possible inside a pl/pgsql proc. I'm
> trying to integrate pg_trgm with Tsearch2. I'm still on my UTF-8
> database. Yes I know, there is _NO_ UTF-8 support of any kind in
> Tsearch2 yet, but I got it working to a degree that is OK for my
> application (Created my own stemmer variant, ispell dict, affix file
> etc). The last missing bit is to get a source for pg_trgm. I cannot use
> the stat() function, because it breaks as soon as it sees an UTF-8 char.

I suspect that a word parser that is not UTF-compatible can produce illegal lexemes (containing only part of a multibyte character) and store them in the tsvector. The tsvector type does not validate lexemes (for example, with a pg_verifymbstr() call), but stat() builds a text field, which Postgres then checks, and it finds the incomplete multibyte characters.

The options I see (other than waiting for UTF support in tsearch2, which we are developing now):

1. Modify the stat() function to check the text field and, if the check fails, drop the lexeme from the output.
2. Take the word parser from CVS HEAD (ts_locale.[ch], wparser_def.c, wordparser/parser.[ch]). to_tsvector will work fine; to_tsquery will work correctly only with quoted strings (for example, 'foo' & 'bar' is fine, foo & bar is not). But casting 'asasas'::tsvector and dump/reload will not work correctly.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
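The idea behind option 1 (check each lexeme and leave out the ones that fail) can be illustrated outside the server. The sketch below is Python, not a patch to the C code of stat(); it assumes the lexemes are available as raw byte strings and uses Python's strict UTF-8 decoder as a stand-in for the pg_verifymbstr() check. The name `valid_utf8_lexemes` is hypothetical.

```python
def valid_utf8_lexemes(raw_lexemes):
    """Keep only the byte strings that decode as complete, valid UTF-8."""
    kept = []
    for raw in raw_lexemes:
        try:
            # Strict decoding fails on a lexeme cut in the middle of a
            # multibyte character, which is the failure described above.
            kept.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # Drop the broken lexeme instead of aborting the whole run.
            continue
    return kept


if __name__ == "__main__":
    sample = [b"caf\xc3\xa9", b"caf\xc3", b"bar"]
    print(valid_utf8_lexemes(sample))
```

Dropping broken lexemes loses some words, so this only suits cases where an incomplete word list is acceptable.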
On Thu, 8 Dec 2005, Hannes Dorbath wrote:
> On 07.12.2005 16:13, Oleg Bartunov wrote:
>> hmm, you could dump tsvector column and use awk+sort+uniq
>
> Thanks. I hoped for something possible inside a pl/pgsql proc. I'm trying to
> integrate pg_trgm with Tsearch2. I'm still on my UTF-8 database. Yes I know,
> there is _NO_ UTF-8 support of any kind in Tsearch2 yet, but I got it working
> to a degree that is OK for my application (Created my own stemmer variant,
> ispell dict, affix file etc). The last missing bit is to get a source for
> pg_trgm. I cannot use the stat() function, because it breaks as soon as it
> sees an UTF-8 char.

Unless there is some way to ignore errors in the UTF-8 conversion to text, this is a dead end: the stat() function uses the text representation. You have to wait for the new release with full UTF-8 support or go the 'lazy' way, i.e. use any tools to get a list of unique words and create a pg_trgm index.

There are several questions:

* Do you actually need to be synchronized with tsvector?
* Do you need to recognize all words? I suppose not. In real life you should have a dictionary of the words you certainly need to recognize.

Regards,
Oleg