Thread: tsearch2: word position
Hi,

I'm fiddling with to_tsvector() and parse() from tsearch2, trying to get the word position from those functions. I'd like to use the tsearch2 parser and stemmer, but I need to know the exact position of the word as well as the original, unstemmed word.

What I came up with so far is pretty ugly:

    SELECT (parse('my test text')).tokid,
           (parse('my test text')).token,
           strip(to_tsvector((parse('my test text')).token));

And this only tells me a word position, not a character or byte position within the string. Is there a way to get this information from tsearch2?

Regards

Markus
> I'm fiddling with to_tsvector() and parse() from tsearch2, trying to get
> the word position from those functions. I'd like to use the tsearch2
> parser and stemmer, but I need to know the exact position of the word as
> well as the original, unstemmed word.

It's not supposed usage... Why do you need that?

> And this only tells me a word position, not a character or byte position
> within the string. Is there a way to get this information from tsearch2?

Have a look at the headline framework as an example or starting point. hlparsetext() returns the parsed text with the lexemes that match the tsquery marked. A short description of hlparsetext() is at
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html
near the end. The description of the HLWORD struct is somewhat out of date, sorry.

-- 
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Hello Teodor,

Teodor Sigaev wrote:
> It's not supposed usage... Why do you need that?

Well, long story... I'm still using my own indexing on top of the tsearch2 parsers and stemming. However, two obvious cases come to mind:

- autocompletion, where I want to give the user one of the possible known words. Currently, I'm returning the stemmed word, which is obviously not quite right.
- highlighting of matching words

> Have a look at the headline framework as an example or starting point.
> hlparsetext() returns the parsed text with the lexemes that match the
> tsquery marked. A short description of hlparsetext() is at
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html
> near the end. The description of the HLWORD struct is somewhat out of
> date, sorry.

Thanks. I probably need to dig in the sources, though.

Markus
Hi,

Teodor Sigaev wrote:
>> I'm fiddling with to_tsvector() and parse() from tsearch2, trying to
>> get the word position from those functions. I'd like to use the
>> tsearch2 parser and stemmer, but I need to know the exact position of
>> the word as well as the original, unstemmed word.
>
> It's not supposed usage... Why do you need that?

Counter question: what's the supposed usage of the word number? Why would anyone be interested in that? You always need to parse the text yourself, to be able to get any use from the word number.

to_tsvector() could as well return the character number or a byte pointer, I could see advantages for both. But the word number makes little sense to me.

Regards

Markus
> to_tsvector() could as well return the character number or a byte
> pointer, I could see advantages for both. But the word number makes
> little sense to me.

The word number is used only in the ranking functions. If you don't need ranking, then you can safely strip the positional information.

-- 
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Hi,

Teodor Sigaev wrote:
> The word number is used only in the ranking functions. If you don't
> need ranking, then you can safely strip the positional information.

Huh? I explicitly *want* positional information. But I find the word number to be less useful than a character number or a simple (byte) pointer to the position of the word in the string.

Given only the word number, I have to go and parse the string again.

Regards

Markus
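The re-parsing step described above might look like the following. This is only a minimal Python sketch and not tsearch2: the helper name `token_offsets` is hypothetical, and the `\w+` regular expression is an assumed stand-in for the tsearch2 parser, whose token rules differ. It walks the string once and records, for each word, the word number together with the character offset, the UTF-8 byte offset, and the original unstemmed token.

```python
import re

def token_offsets(text):
    """Return (word_number, char_offset, byte_offset, token) for each word.

    Assumption: words are runs of \\w characters. The real tsearch2 parser
    recognizes many more token types (URLs, e-mail addresses, numbers, ...).
    """
    rows = []
    for number, match in enumerate(re.finditer(r"\w+", text), start=1):
        char_offset = match.start()
        # Byte offset = length of the UTF-8 encoding of everything before the word.
        byte_offset = len(text[:char_offset].encode("utf-8"))
        rows.append((number, char_offset, byte_offset, match.group()))
    return rows

for row in token_offsets("my test text"):
    print(row)
# (1, 0, 0, 'my')
# (2, 3, 3, 'test')
# (3, 8, 8, 'text')
```

Character and byte offsets coincide for pure ASCII input and diverge as soon as the text contains multi-byte characters, which is one reason to keep the two apart.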
> Huh? I explicitly *want* positional information. But I find the word
> number to be less useful than a character number or a simple (byte)
> pointer to the position of the word in the string.
>
> Given only the word number, I have to go and parse the string again.

The byte offset of a word is useless for ranking purposes.

-- 
Teodor Sigaev E-mail: teodor@sigaev.ru WWW: http://www.sigaev.ru/
Hello Teodor,

Teodor Sigaev wrote:
> The byte offset of a word is useless for ranking purposes.

Why is a word number more meaningful for ranking? Are the first 100 words more important than the rest? That seems as ambiguous as saying the first 1000 bytes are more important, no?

Or does the ranking work with the word numbers internally to do something more clever?

Do you understand why I find the word number inconvenient?

Regards

Markus
On 2/22/07, Markus Schiltknecht <markus@bluegap.ch> wrote:
> Hello Teodor,
>
> Teodor Sigaev wrote:
> > The byte offset of a word is useless for ranking purposes.
>
> Why is a word number more meaningful for ranking? Are the first 100
> words more important than the rest? That seems as ambiguous as saying
> the first 1000 bytes are more important, no?

No, the first X aren't more important, but being able to determine word proximity is very important for partial phrase matching and ranking. The closer the words, the "better" the match, all else being equal.

-- 
Mike Rylander mrylander@gmail.com
GPLS -- PINES Development Database Developer
http://open-ils.org
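The proximity argument can be illustrated with a small Python sketch. It is not the tsearch2 ranking code: the function names and the `1 / distance` scoring formula are assumptions chosen purely for illustration. Each query word comes with the list of word numbers at which it occurs in a document, and the score grows as the closest pair of occurrences gets nearer.

```python
from itertools import product

def min_word_distance(positions_a, positions_b):
    """Smallest gap, in words, between any occurrence of word A and word B."""
    return min(abs(a - b) for a, b in product(positions_a, positions_b))

def proximity_score(positions_a, positions_b):
    """Toy score (an assumption, not tsearch2's formula): closer pairs score higher."""
    if not positions_a or not positions_b:
        return 0.0  # one of the words does not occur at all
    return 1.0 / max(min_word_distance(positions_a, positions_b), 1)

# "quick" at word 1 and "fox" at word 3, versus "fox" at word 10.
print(proximity_score([1], [3]))                               # 0.5
print(proximity_score([1], [3]) > proximity_score([1], [10]))  # True
```

Word numbers make this comparison independent of how long the intervening words are, whereas a gap measured in bytes would also vary with word length and character encoding.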
Hi,

Mike Rylander wrote:
> No, the first X aren't more important, but being able to determine
> word proximity is very important for partial phrase matching and
> ranking. The closer the words, the "better" the match, all else being
> equal.

Ah, yeah, for word-pairs, that certainly helps. Thanks.

Regards

Markus
> No, the first X aren't more important, but being able to determine
> word proximity is very important for partial phrase matching and
> ranking. The closer the words, the "better" the match, all else being
> equal.

exactly