Thread: full text search: the concept of a "word"
I'm considering using tsearch2 in the project I'm working on right now. However, I'm not sure whether tsearch2 can handle my very specific requirements, so I hope someone can tell me if the following is possible and how I should go about it.

My text fields are trigger-generated using information from a number of tables: these fields can be, say, a couple of thousand characters wide. Up to here, there's no problem.

What I'd like to do is define - possibly using regexps - what constitutes a word. For instance, my word separator is a semicolon, not a space; a dash is not a separator, and neither are language-specific characters (which might be interpreted that way by a language-agnostic tool).

BTW, I use UTF-8 as my database encoding, if it's of any importance.

What it comes down to is this: is it possible to somehow define what constitutes a word?

TIA,
Tomislav
> My textfields are trigger-generated using information from a number of
> tables: these fields can be, say, a couple of thousand characters
> wide.
> Up to here, there's no problem.
> What I'd like to do is define - possibly using regexps - what
> constitutes a word. For instance, my word separator is a semicolon,
> not a space; a dash is not a separator, and neither are language
> specific characters (which might be interpreted that way by a language
> agnostic tool)...
> BTW, I use UTF-8 as my database encoding if it's of any importance.

I do not see a big problem: just write your own parser. UTF-8 may be a problem, though: only the CVS HEAD version of tsearch2 supports UTF-8. However, you can find a patch for 8.1 at http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
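
To make the word definition from the question concrete, here is a minimal sketch of the splitting rule only. It is not a tsearch2 parser and does not use the tsearch2 parser interface (a real custom parser would be written in C against that interface). The function name `split_words` and the choice to trim surrounding whitespace from each word are assumptions made for illustration.

```python
import re

# Hypothetical rule from the question: a word is any maximal run of
# characters that are not semicolons. Dashes, spaces and non-ASCII
# letters are therefore all kept inside a word.
WORD_RE = re.compile(r"[^;]+")


def split_words(text):
    """Return the words in text, treating ';' as the only separator."""
    words = []
    for match in WORD_RE.finditer(text):
        # Assumption: whitespace around a word is not significant.
        word = match.group().strip()
        if word:
            words.append(word)
    return words


if __name__ == "__main__":
    print(split_words("well-known; čaša vode;;Zagreb"))
```

Python 3 strings are Unicode, so the UTF-8 concern raised in the reply does not show up in this sketch; it applies to the tsearch2 parser itself.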