Thread: fulltext parser strange behave

fulltext parser strange behave

From

"Pavel Stehule"

Date:

06 November 2007, 18:13:39

Hello

I am writing a tsearch2 wrapper and testing its functionality. I found
something a little strange in the default parser: it can't parse tags
that contain numbers:

test=# select * from parse('<h1>zluty kun se napil <b>zlute</b> vody</h2>');
 tokid | token
-------+-------
    12 | <
     3 | h1
    12 | >
     1 | zluty
    12 |
     1 | kun
    12 |
     1 | se
    12 |
     1 | napil
    12 |
    13 | <b>
     1 | zlute
    13 | </b>
    12 |
     1 | vody
    12 | <              <=====
    19 | /h2
    12 | >              <=====
(19 rows)

Is this correct?

Regards
Pavel Stehule

Re: fulltext parser strange behave

From

Tom Lane

Date:

07 November 2007, 19:11:25

"Pavel Stehule" <pavel.stehule@gmail.com> writes:
> I am writing tsearch2 wrapper and I testing functionality. I found
> some little bit strange on default parser. It can't parse tags with
> numbers:

Well, the state machine definitely thinks that tag names should contain
only ASCII letters (with possibly a leading or trailing '/').  Given the
HTML examples I suppose we should allow non-first digits too.  Is there
anything else that should be considered a tag?  What about dash and
underscore for instance?
        regards, tom lane
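
The difference between the letters-only rule described above and the suggested relaxation (digits allowed after the first character) can be sketched with two regular expressions. This is only an illustration in Python under simplifying assumptions: the real parser is a hand-written state machine in C, attributes are ignored here, and the names `LETTERS_ONLY_TAG`, `LETTERS_THEN_DIGITS_TAG` and `is_tag` are invented for the example.

```python
import re

# Letters-only tag names with an optional leading or trailing '/'.
# This approximates the rule described for the current state machine.
LETTERS_ONLY_TAG = re.compile(r"</?[A-Za-z]+/?>")

# Suggested relaxation: digits are allowed, but not as the first character.
LETTERS_THEN_DIGITS_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*/?>")


def is_tag(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    # Print, for each candidate, the verdict of the strict and relaxed rule.
    for candidate in ["<b>", "</b>", "<h1>", "</h2>", "<1h>"]:
        print(candidate,
              is_tag(LETTERS_ONLY_TAG, candidate),
              is_tag(LETTERS_THEN_DIGITS_TAG, candidate))
```

Under the first pattern `<h1>` is rejected as a tag, which is consistent with the parser output reported at the start of the thread; under the second it is accepted while `<1h>` is still rejected.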

Re: fulltext parser strange behave

From

Andrew Dunstan

Date:

07 November 2007, 19:39:26


Tom Lane wrote:
> "Pavel Stehule" <pavel.stehule@gmail.com> writes:
>   
>> I am writing tsearch2 wrapper and I testing functionality. I found
>> some little bit strange on default parser. It can't parse tags with
>> numbers:
>>     
>
> Well, the state machine definitely thinks that tag names should contain
> only ASCII letters (with possibly a leading or trailing '/').  Given the
> HTML examples I suppose we should allow non-first digits too.  Is there
> anything else that should be considered a tag?  What about dash and
> underscore for instance?
>
>     
>   

The docs say we specifically accept HTML tags. Are we really just 
accepting anything that is a string of ASCII letters as the tag name? 
Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.

cheers

andrew

Re: fulltext parser strange behave

From

Tom Lane

Date:

07 November 2007, 21:12:30

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Well, the state machine definitely thinks that tag names should contain
>> only ASCII letters (with possibly a leading or trailing '/').  Given the
>> HTML examples I suppose we should allow non-first digits too.  Is there
>> anything else that should be considered a tag?  What about dash and
>> underscore for instance?

> The docs say we specifically accept HTML tags. Are we really just 
> accepting anything that is a string of ASCII letters as the tag name? 
> Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.

I don't think I want to try to maintain a list of exactly which
identifiers are considered valid tag names ... and if I did, I wouldn't
put it into the parser.  It would be a dictionary's job to tell valid
from invalid tag names, no?
        regards, tom lane
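
The division of labour suggested above (a purely lexical parser, with validity decided later) might look roughly like the following Python sketch. Everything in it is hypothetical: the `TAG` pattern, the tiny `KNOWN_TAG_NAMES` set, and the `classify` function are stand-ins chosen for illustration, not the tsearch parser or dictionary API.

```python
import re

# Permissive, purely lexical rule (an assumption made for this sketch):
# the "parser" only decides whether something has the shape of a tag.
TAG = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)/?>")

# Hypothetical "dictionary": a deliberately tiny set of known tag names.
KNOWN_TAG_NAMES = {"b", "h1", "h2", "p"}


def classify(token):
    """Return 'not a tag', 'known tag' or 'unknown tag' for `token`."""
    match = TAG.fullmatch(token)
    if match is None:
        return "not a tag"
    # Validity is a lookup, kept separate from the lexical rule above.
    name = match.group(1).lower()
    return "known tag" if name in KNOWN_TAG_NAMES else "unknown tag"


if __name__ == "__main__":
    for token in ["<h1>", "</B>", "<foo1234>", "zluty"]:
        print(token, "->", classify(token))
```

The point of the split is that the lexical rule never needs updating when the list of valid names changes; only the lookup set does.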

Re: fulltext parser strange behave

From

Andrew Dunstan

Date:

07 November 2007, 22:02:14


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> Tom Lane wrote:
>>     
>>> Well, the state machine definitely thinks that tag names should contain
>>> only ASCII letters (with possibly a leading or trailing '/').  Given the
>>> HTML examples I suppose we should allow non-first digits too.  Is there
>>> anything else that should be considered a tag?  What about dash and
>>> underscore for instance?
>>>       
>
>   
>> The docs say we specifically accept HTML tags. Are we really just 
>> accepting anything that is a string of ASCII letters as the tag name? 
>> Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.
>>     
>
> I don't think I want to try to maintain a list of exactly which
> identifiers are considered valid tag names ... and if I did, I wouldn't
> put it into the parser.  It would be a dictionary's job to tell valid
> from invalid tag names, no?
>
>             
>   

I don't have a quarrel with that. But then we should be more clear about 
what we are recognizing. We could describe the thing as an HTML-like 
tag, possibly. I think the same probably goes for entities too.

cheers

andrew

Re: fulltext parser strange behave

From

Oleg Bartunov

Date:

08 November 2007, 03:07:41

On Wed, 7 Nov 2007, Tom Lane wrote:

> Andrew Dunstan <andrew@dunslane.net> writes:
>> Tom Lane wrote:
>>> Well, the state machine definitely thinks that tag names should contain
>>> only ASCII letters (with possibly a leading or trailing '/').  Given the
>>> HTML examples I suppose we should allow non-first digits too.  Is there
>>> anything else that should be considered a tag?  What about dash and
>>> underscore for instance?
>
>> The docs say we specifically accept HTML tags. Are we really just
>> accepting anything that is a string of ASCII letters as the tag name?
>> Then we should adjust the docs. <foo> and <foo1234> are not HTML tags.
>
> I don't think I want to try to maintain a list of exactly which
> identifiers are considered valid tag names ... and if I did, I wouldn't
> put it into the parser.  It would be a dictionary's job to tell valid
> from invalid tag names, no?

It'd be nice for the dictionary to know the parser state, but I think
that's too much knowledge for a dictionary, and the only possibility is to
let <foo1234> pass to the dictionary. Currently we get three separate tokens.


>
>             regards, tom lane
>
    Regards,        Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Re: fulltext parser strange behave

From

Andrew Dunstan

Date:

08 November 2007, 16:12:07


Andrew Dunstan wrote:
>
>
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>  
>>> Tom Lane wrote:
>>>    
>>>> Well, the state machine definitely thinks that tag names should 
>>>> contain
>>>> only ASCII letters (with possibly a leading or trailing '/').  
>>>> Given the
>>>> HTML examples I suppose we should allow non-first digits too.  Is 
>>>> there
>>>> anything else that should be considered a tag?  What about dash and
>>>> underscore for instance?
>>>>       
>>
>>  
>>> The docs say we specifically accept HTML tags. Are we really just 
>>> accepting anything that is a string of ASCII letters as the tag 
>>> name? Then we should adjust the docs. <foo> and <foo1234> are not 
>>> HTML tags.
>>>     
>>
>> I don't think I want to try to maintain a list of exactly which
>> identifiers are considered valid tag names ... and if I did, I wouldn't
>> put it into the parser.  It would be a dictionary's job to tell valid
>> from invalid tag names, no?
>>
>>            
>>   
>
> I don't have a quarrel with that. But then we should be more clear 
> about what we are recognizing. We could describe the thing as an 
> HTML-like tag, possibly. I think the same probably goes for entities too.
>
>
I've just been looking at the state machine in wparser_def.c. I think 
the processing for entities is also a few bob short in the pound. It 
recognises decimal numeric character references, but not hexadecimal 
numeric character references. That's fairly silly since the HTML spec 
specifically says the latter are "particularly useful". The rules for 
named entities are also deficient w.r.t. digits, just like the case of 
tags that Tom noticed. This isn't academic: HTML features a number of 
named entities with digits in the name (sup2, frac14 for example).
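
The three kinds of reference mentioned above can be told apart with a few regular expressions. The Python below is a rough sketch, not the logic in wparser_def.c: the pattern names and the `kind` function are made up for the example, and named entities are checked only for shape, not against the list of entities HTML actually defines.

```python
import re

# Decimal numeric character reference, e.g. &#229;
DECIMAL_REF = re.compile(r"&#[0-9]+;")

# Hexadecimal numeric character reference, e.g. &#xE5;
HEX_REF = re.compile(r"&#[xX][0-9A-Fa-f]+;")

# Named entity restricted to letters: misses names such as sup2 or frac14.
NAMED_LETTERS_ONLY = re.compile(r"&[A-Za-z]+;")

# Named entity that also allows digits after the first character.
NAMED_WITH_DIGITS = re.compile(r"&[A-Za-z][A-Za-z0-9]*;")


def kind(text):
    """Classify `text` as 'decimal', 'hexadecimal', 'named' or None."""
    if DECIMAL_REF.fullmatch(text):
        return "decimal"
    if HEX_REF.fullmatch(text):
        return "hexadecimal"
    if NAMED_WITH_DIGITS.fullmatch(text):
        return "named"
    return None


if __name__ == "__main__":
    for ref in ["&#229;", "&#xE5;", "&amp;", "&sup2;", "&frac14;"]:
        print(ref, kind(ref), bool(NAMED_LETTERS_ONLY.fullmatch(ref)))
```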

In XML at least, legal names are defined by the following rules from the 
spec:

[4]   NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] |
                        [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] |
                        [#x37F-#x1FFF] | [#x200C-#x200D] |
                        [#x2070-#x218F] | [#x2C00-#x2FEF] |
                        [#x3001-#xD7FF] | [#xF900-#xFDCF] |
                        [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a]  NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 |
                        [#x0300-#x036F] | [#x203F-#x2040]
[5]   Name          ::= NameStartChar (NameChar)*

Restricting this to ASCII, we get:

[4]   NameStartChar ::= ":" | [A-Z] | "_" | [a-z]
[4a]  NameChar      ::= NameStartChar | "-" | "." | [0-9]
[5]   Name          ::= NameStartChar (NameChar)*

or this regex for Name:

[A-Za-z:_][A-Za-z0-9:_.-]*


I suggest we use that or something very close to it as the rule for 
names in these patterns.
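
To see what the ASCII-restricted Name rule accepts and rejects, the regex above can be tried directly. This is a small Python check rather than parser code; `NAME` and `is_name` are names chosen for the example.

```python
import re

# The ASCII-restricted Name rule from the message above:
# a start character, then any number of name characters.
NAME = re.compile(r"[A-Za-z:_][A-Za-z0-9:_.-]*")


def is_name(text):
    """Return True if the whole of `text` is a Name under the ASCII rule."""
    return NAME.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["h1", "frac14", "xsl:template", "my-tag_2.0", "1h", "-x"]:
        print(candidate, is_name(candidate))
```

Digits, '-' and '.' are accepted anywhere except the first position, so `h1` and `frac14` qualify while `1h` and `-x` do not.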

cheers

andrew

Re: fulltext parser strange behave

From

Tom Lane

Date:

09 November 2007, 14:54:10

Andrew Dunstan <andrew@dunslane.net> writes:
> I've just been looking at the state machine in wparser_def.c. I think 
> the processing for entities is also a few bob short in the pound. It 
> recognises decimal numeric character references, but not hexadecimal 
> numeric character references. That's fairly silly since the HTML spec 
> specifically says the latter are "particularly useful". The rules for 
> named entities are also deficient w.r.t. digits, just like the case of 
> tags that Tom noticed. This isn't academic: HTML features a number of 
> named entities with digits in the name (sup2, frac14 for example).

> In XML at least, legal names are defined by the following rules from the 
> spec:
> ...
> [A-Za-z:_][A-Za-z0-9:_.-]*

> I suggest we use that or something very close to it as the rule for 
> names in these patterns.

No objections here.  Who wants to patch wparser_def?
        regards, tom lane

Re: fulltext parser strange behave

From

Andrew Dunstan

Date:

13 November 2007, 15:42:24


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>   
>> I've just been looking at the state machine in wparser_def.c. I think 
>> the processing for entities is also a few bob short in the pound. It 
>> recognises decimal numeric character references, but not hexadecimal 
>> numeric character references. That's fairly silly since the HTML spec 
>> specifically says the latter are "particularly useful". The rules for 
>> named entities are also deficient w.r.t. digits, just like the case of 
>> tags that Tom noticed. This isn't academic: HTML features a number of 
>> named entities with digits in the name (sup2, frac14 for example).
>>     
>
>   
>> In XML at least, legal names are defined by the following rules from the 
>> spec:
>> ...
>> [A-Za-z:_][A-Za-z0-9:_.-]*
>>     
>
>   
>> I suggest we use that or something very close to it as the rule for 
>> names in these patterns.
>>     
>
> No objections here.  Who wants to patch wparser_def?
>
>             
>   


I can get to it some time in the next week; I'm rather snowed under right now.

BTW, I'm also suspicious of the clause that allows <?xml ... It appears 
that it will also allow <?xfoo and <?XFOO, which seems quite odd, 
especially the latter.

cheers

andrew