Re: english parser in text search: support for multiple words in the same position - Mailing list pgsql-hackers

From	Sushant Sinha
Subject	Re: english parser in text search: support for multiple words in the same position
Date	December 23, 2010 01:35:39
Msg-id	AANLkTin+XiewXD396WMqr-Pnk9QOHday3OTTM3MyS7SR@mail.gmail.com
In response to	Re: english parser in text search: support for multiple words in the same position (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: english parser in text search: support for multiple words in the same position
List	pgsql-hackers

Just a reminder that this patch is discussing how to break URLs, emails, etc. into their components.

On Mon, Oct 4, 2010 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [ sorry for not responding on this sooner, it's been hectic the last
>   couple weeks ]
>
> Sushant Sinha <sushant354@gmail.com> writes:
> >> I looked at this patch a bit.  I'm fairly unhappy that it seems to be
> >> inventing a brand new mechanism to do something the ts parser can
> >> already do.  Why didn't you code the url-part mechanism using the
> >> existing support for compound words?
>
> > I am not familiar with compound word implementation and so I am not sure
> > how to split a url with compound word support. I looked into the
> > documentation for compound words and that does not say much about how to
> > identify components of a token.
>
> IIRC, the way that that works is associated with pushing a sub-state
> of the state machine in order to scan each compound-word part.  I don't
> have the details in my head anymore, though I recall having traced
> through it in the past.  Look at the state machine actions that are
> associated with producing the compound word tokens and sub-tokens.

I did look around for compound word support in postgres. In particular, I read the documentation and the code in tsearch/spell.c that seems to implement the compound word support.

So in my understanding the way it works is:

1. Specify a dictionary of words in which each word will have applicable prefix/suffix flags
2. Specify a flag file that provides prefix/suffix operations on those flags
3. Flag z indicates that a word in the dictionary can participate in compound word splitting
4. When a token matches words specified in the dictionary (after applying affix/suffix operations), the matching words are emitted as sub-words of the token (i.e., compound word)

If my above understanding is correct, then I think it will not be possible to implement url/email splitting using the compound word support.

The main reason is that the compound word support requires a "PRE-DETERMINED" dictionary of words. So to split a url/email we would need to provide a list of *all possible* host names and user names. I do not think that is a possibility.

Please correct me if I have misunderstood something.

-Sushant.
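
To illustrate the argument above, here is a minimal Python sketch contrasting the two approaches. It is not the code in tsearch/spell.c (which is written in C and also applies affix rules); the word list, the function names, and the example tokens are all hypothetical and chosen only for illustration. The first function splits a token only when every piece is in a fixed word list, which is the limitation described in the message; the other two split a URL or an email address by structure alone, so no list of host or user names is needed.

```python
from urllib.parse import urlsplit

# Hypothetical, pre-determined word list standing in for dictionary
# entries that carry the compound flag. Not a real PostgreSQL dictionary.
COMPOUND_WORDS = {"foot", "ball", "base", "basket"}


def split_compound(token, words=COMPOUND_WORDS):
    """Return a list of dictionary words that tile `token` exactly, or None.

    Simplified: a real ispell-style dictionary would also apply
    prefix/suffix rules before matching.
    """
    if token == "":
        return []
    for end in range(1, len(token) + 1):
        head = token[:end]
        if head in words:
            rest = split_compound(token[end:], words)
            if rest is not None:
                return [head] + rest
    return None


def split_url(url):
    """Split a URL into host and path using only its structure."""
    parts = urlsplit(url)
    return {"host": parts.hostname, "path": parts.path}


def split_email(address):
    """Split an email address at the last '@' into user and host."""
    user, _, host = address.rpartition("@")
    return {"user": user, "host": host}
```

With this word list, `split_compound("football")` yields the two sub-words, while `split_compound("example.com")` returns `None` because the host name is not in the list. `split_url` and `split_email` handle any host or user name, since they rely on delimiters rather than on a dictionary.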
