Thread: tsearch2: word position

tsearch2: word position

From: Markus Schiltknecht
Date: 21 February 2007, 11:07:53

Hi,

I'm fiddling with to_tsvector() and parse() from tsearch2, trying to get
the word position from those functions. I'd like to use the tsearch2
parser and stemmer, but I need to know the exact position of the word as
well as the original, unstemmed word.

What I came up with so far is pretty ugly:

SELECT
   (parse('my test text')).tokid,
   (parse('my test text')).token,
   strip(to_tsvector((parse('my test text')).token));

And this only tells me a word position, not a character or byte position
within the string. Is there a way to get this information from tsearch2?

Regards

Markus

Re: tsearch2: word position

From: Teodor Sigaev
Date: 21 February 2007, 11:39:43

> I'm fiddling with to_tsvector() and parse() from tsearch2, trying to get
> the word position from those functions. I'd like to use the tsearch2
> parser and stemmer, but I need to know the exact position of the word as
> well as the original, unstemmed word.

That's not the intended usage... Why do you need that?

> And this only tells me a word position, not a character or byte position
> within the string. Is there a way to get this information from tsearch2?

Have a look at the headline framework as an example or starting point.
hlparsetext() returns the parsed text together with the lexemes matched in the
tsquery. A short description of hlparsetext() is near the end of
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html
The description of the HLWORD struct is somewhat out of date, sorry.

--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch2: word position

From: Markus Schiltknecht
Date: 21 February 2007, 11:57:11

Hello Teodor,

Teodor Sigaev wrote:
> That's not the intended usage... Why do you need that?

Well, long story... I'm still using my own indexing on top of the
tsearch2 parsers and stemming.

However, two obvious cases come to mind:

- autocompletion, where I want to give the user one of the possible
known words. Currently, I'm returning the stemmed word, which is
obviously not quite right.

- highlighting of matching words
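
For the autocompletion case, a minimal sketch of keeping the original words next to their stems, so that a lookup can return real words instead of stems. The names toy_stem(), build_index() and complete() are hypothetical, and toy_stem() is only a crude stand-in for the tsearch2 stemmer, not its actual behaviour.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of the word after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(words):
    """Map each stem to the set of original (unstemmed) words that produced it."""
    index = defaultdict(set)
    for word in words:
        index[toy_stem(word.lower())].add(word)
    return index


def complete(index, prefix):
    """Return the original words whose stem starts with the given prefix."""
    prefix = prefix.lower()
    found = set()
    for stem, originals in index.items():
        if stem.startswith(prefix):
            found.update(originals)
    return sorted(found)


index = build_index(["testing", "tested", "tests", "text"])
suggestions = complete(index, "tes")
```

Here the search still runs against the stems, but the suggestions shown to the user are the original words.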

> Have a look at the headline framework as an example or starting point.
> hlparsetext() returns the parsed text together with the lexemes matched
> in the tsquery. A short description of hlparsetext() is near the end of
> http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/HOWTO-parser-tsearch2.html
> The description of the HLWORD struct is somewhat out of date, sorry.

Thanks. I probably need to dig in the sources, though.

Markus

Re: tsearch2: word position

From: Markus Schiltknecht
Date: 22 February 2007, 10:19:05

Hi,

Teodor Sigaev wrote:
>> I'm fiddling with to_tsvector() and parse() from tsearch2, trying to
>> get the word position from those functions. I'd like to use the
>> tsearch2 parser and stemmer, but I need to know the exact position of
>> the word as well as the original, unstemmed word.
>
> That's not the intended usage... Why do you need that?

Counter question: what's the supposed usage of the word number? Why
would anyone be interested in that? You always need to parse the text
yourself, to be able to get any use from the word number.

to_tsvector() could as well return the character number or a byte
pointer, I could see advantages for both. But the word number makes
little sense to me.

Regards

Markus

Re: tsearch2: word position

From: Teodor Sigaev
Date: 22 February 2007, 10:29:07

> to_tsvector() could as well return the character number or a byte
> pointer, I could see advantages for both. But the word number makes
> little sense to me.

The word number is used only in the ranking functions. If you don't need
ranking, then you can safely strip the positional information.


--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch2: word position

From: Markus Schiltknecht
Date: 22 February 2007, 10:48:32

Hi,

Teodor Sigaev wrote:
> The word number is used only in the ranking functions. If you don't need
> ranking, then you can safely strip the positional information.

Huh? I explicitly *want* positional information. But I find the word
number to be less useful than a character number or a simple (byte)
pointer to the position of the word in the string.

Given only the word number, I have to go and parse the string again.
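
To make the difference concrete, here is a rough sketch of what such a second pass over the string looks like. It reports the word number, the character offset and the byte offset of every token. The function token_offsets() is hypothetical, tokens are simply runs of non-whitespace characters (not what the tsearch2 parser produces), and UTF-8 is assumed for the byte offsets.

```python
import re


def token_offsets(text):
    """Return (word_number, char_offset, byte_offset, token) for each token.

    Tokens are runs of non-whitespace characters; this is a stand-in for
    illustration, not the tsearch2 parser.
    """
    result = []
    for word_number, match in enumerate(re.finditer(r"\S+", text), start=1):
        char_offset = match.start()
        # The byte offset depends on the encoding; UTF-8 is assumed here.
        byte_offset = len(text[:char_offset].encode("utf-8"))
        result.append((word_number, char_offset, byte_offset, match.group()))
    return result


rows = token_offsets("mein süßer Text")
for row in rows:
    print(row)
```

With multi-byte characters in the text, the character offset and the byte offset of a later word differ, which is why the two are listed separately.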

Regards

Markus

Re: tsearch2: word position

From: Teodor Sigaev
Date: 22 February 2007, 10:57:06

> Huh? I explicitly *want* positional information. But I find the word
> number to be less useful than a character number or a simple (byte)
> pointer to the position of the word in the string.
>
> Given only the word number, I have to go and parse the string again.

The byte offset of a word is useless for ranking purposes.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/

Re: tsearch2: word position

From: Markus Schiltknecht
Date: 22 February 2007, 12:02:01

Hello Teodor,

Teodor Sigaev wrote:
> The byte offset of a word is useless for ranking purposes.

Why is a word number more meaningful for ranking? Are the first 100
words more important than the rest? That seems as ambiguous as saying
the first 1000 bytes are more important, no?

Or does the ranking work with the word numbers internally to do
something more clever?

Do you understand why I find the word number inconvenient?

Regards

Markus

Re: tsearch2: word position

From: Mike Rylander
Date: 22 February 2007, 13:08:37

On 2/22/07, Markus Schiltknecht <markus@bluegap.ch> wrote:
> Hello Teodor,
>
> Teodor Sigaev wrote:
> > The byte offset of a word is useless for ranking purposes.
>
> Why is a word number more meaningful for ranking? Are the first 100
> words more important than the rest? That seems as ambiguous as saying
> the first 1000 bytes are more important, no?

No, the first X aren't more important, but being able to determine
word proximity is very important for partial phrase matching and
ranking.  The closer the words, the "better" the match, all else being
equal.
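
As a rough illustration of the proximity idea, the sketch below numbers the words of a text and takes the smallest distance between two query words as a crude closeness measure. This is not the tsearch2 ranking algorithm. The helpers word_positions() and min_word_distance() are hypothetical, and words are split on whitespace only.

```python
def word_positions(text):
    """Map each lower-cased word to the list of its word numbers (1-based)."""
    positions = {}
    for number, word in enumerate(text.lower().split(), start=1):
        positions.setdefault(word, []).append(number)
    return positions


def min_word_distance(positions_a, positions_b):
    """Smallest absolute distance between any position of A and any of B."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


doc_near = word_positions("full text search in postgres")
doc_far = word_positions("text about a very long and slow search")

# Adjacent words give a distance of 1; words far apart give a larger value.
near = min_word_distance(doc_near["text"], doc_near["search"])
far = min_word_distance(doc_far["text"], doc_far["search"])
print(near, far)
```

Measured in words, the distance does not depend on how long the words in between are, which a byte offset could not guarantee.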

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org

Re: tsearch2: word position

From: Markus Schiltknecht
Date: 22 February 2007, 13:24:00

Hi,

Mike Rylander wrote:
> No, the first X aren't more important, but being able to determine
> word proximity is very important for partial phrase matching and
> ranking.  The closer the words, the "better" the match, all else being
> equal.

Ah, yeah, for word-pairs, that certainly helps.
Thanks.

Regards

Markus

Re: tsearch2: word position

From: Teodor Sigaev
Date: 22 February 2007, 13:25:32

> No, the first X aren't more important, but being able to determine
> word proximity is very important for partial phrase matching and
> ranking.  The closer the words, the "better" the match, all else being
> equal.

Exactly.