Thread: Example non-Latin words for text search parser docs?
I'm afraid my English-centricity is showing, but I could use a little help filling in the missing examples in the table here: http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html I'm not sure of a suitable example all-non-ASCII-letters word, and even less sure of how to represent it in SGML. (I remember we had quite a bit of trouble dealing with accented letters in people's names, for instance.) regards, tom lane
Tom Lane wrote: > I'm afraid my English-centricity is showing, but I could use a little > help filling in the missing examples in the table here: > http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html > I'm not sure of a suitable example all-non-ASCII-letters word, It's easy to find an example -- I went to the english Wikipedia, searched for "elephant", then clicked on the russian link at the left. It gives you "Слоновые", which I see on my terminal as a series of black squares :-) so there's not a single latin letter in it. http://ru.wikipedia.org/wiki/%D0%A1%D0%BB%D0%BE%D0%BD%D0%BE%D0%B2%D1%8B%D0%B5 In that page they also mention the word "Слон" which looks like "Slon". > and even less sure of how to represent it in SGML. (I remember we had > quite a bit of trouble dealing with accented letters in people's > names, for instance.) Yeah, that will prove difficult. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> and even less sure of how to represent it in SGML. (I remember we had >> quite a bit of trouble dealing with accented letters in people's >> names, for instance.) > Yeah, that will prove difficult. This problem largely goes away if we redefine the word categories as under discussion in the -hackers thread: with any of the proposed alternatives it'd be pretty easy to make up real words that are easily representable in SGML. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > Tom Lane wrote: >> I'm afraid my English-centricity is showing, but I could use a little >> help filling in the missing examples in the table here: >> http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html >> I'm not sure of a suitable example all-non-ASCII-letters word, > It's easy to find an example -- I went to the english Wikipedia, > searched for "elephant", then clicked on the russian link at the left. > It gives you "Слоновые", which I see on my terminal as a series of black > squares :-) so there's not a single latin letter in it. Given the just-applied changes in the definition of a "word", we no longer need a totally-not-ASCII sample word. But I wonder if anyone has a better idea than the føø that I made up on the spot... regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Tom Lane wrote: > >> I'm afraid my English-centricity is showing, but I could use a little > >> help filling in the missing examples in the table here: > >> http://developer.postgresql.org/pgdocs/postgres/textsearch-parsers.html > >> I'm not sure of a suitable example all-non-ASCII-letters word, > > > It's easy to find an example -- I went to the english Wikipedia, > > searched for "elephant", then clicked on the russian link at the left. > > It gives you "Слоновые", which I see on my terminal as a series of black > > squares :-) so there's not a single latin letter in it. > > Given the just-applied changes in the definition of a "word", we no > longer need a totally-not-ASCII sample word. But I wonder if anyone > has a better idea than the føø that I made up on the > spot... Actually I was wondering if we should use actual words. So instead of "foo" we could use "elephant" for asciiword and "Éléphant" (french) for word. And for the hword, "sous-espèces" (which appears on the French Wikipedia) would do. -- Alvaro Herrera http://www.flickr.com/photos/alvherre/ "La espina, desde que nace, ya pincha" (Proverbio africano)
Alvaro Herrera <alvherre@commandprompt.com> writes: > Actually I was wondering if we should use actual words. So instead of > "foo" we could use "elephant" for asciiword and "Éléphant" (french) for > word. And for the hword, "sous-espèces" (which appears on the French > Wikipedia) would do. Hmm ... I see a potential problem with that, which is that if someone happened to be viewing the page on something that dropped the accents, or even just made them too small to be easily readable, the examples wouldn't make any sense at all. I have no problem with "elephant" as a sample asciiword, but for the sample non-ascii word I'd suggest something that (a) is clearly not English and (b) as much as possible, everybody knows has an accent. At least in large parts of the US, something like "mañana" would work nicely. Anyway, feel free to hack on it --- I'm getting a bit weary of looking at that chapter. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > Actually I was wondering if we should use actual words. So instead of > > "foo" we could use "elephant" for asciiword and "Éléphant" (french) for > > word. And for the hword, "sous-espèces" (which appears on the French > > Wikipedia) would do. > > Hmm ... I see a potential problem with that, which is that if someone > happened to be viewing the page on something that dropped the accents, > or even just made them too small to be easily readable, the examples > wouldn't make any sense at all. > > I have no problem with "elephant" as a sample asciiword, but for the > sample non-ascii word I'd suggest something that (a) is clearly not > English and (b) as much as possible, everybody knows has an accent. > At least in large parts of the US, something like "mañana" would > work nicely. OK I went with that. I also used real spanish hyphenated words in the hword examples. I also changed the domains foo.com to example.com, just because I'm anal enough to do it. The hword_asciipart I'm not 100% sure about. I used this: militar in the context político-militar, or postgresql in the context postgresql-beta1 What I wanted to emphasize here is that it's the "ascii-ness" of the part that matters, not that of the complete token. The reason I'm not sure about it is that it makes the table wider. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
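A minimal sketch of the point about per-part "ascii-ness", for readers who want to try the examples. It is not the PostgreSQL parser, only a rough Python approximation: the function names are hypothetical, and Python's `str.isalpha()` and `str.isascii()` are assumed to be close enough to the parser's notion of a letter for these sample words. The labels are the token type names from the table under discussion.

```python
def classify_part(part: str) -> str:
    """Label one piece of a hyphenated token, looking only at that piece."""
    if part.isalpha():
        # All letters: ASCII-only letters get the "ascii" label.
        return "hword_asciipart" if part.isascii() else "hword_part"
    if part.isalnum():
        # Contains digits (possibly mixed with letters), e.g. "beta1".
        return "hword_numpart"
    return "other"


def classify_parts(token: str) -> list[tuple[str, str]]:
    """Split a hyphenated token and label every part independently."""
    return [(part, classify_part(part)) for part in token.split("-")]


for token in ("político-militar", "postgresql-beta1", "lógico-matemática"):
    print(token, classify_parts(token))
```

Each part is labeled without reference to the whole token, so "militar" counts as an ASCII part even though "político-militar" as a whole contains a non-ASCII letter.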
Alvaro Herrera <alvherre@commandprompt.com> writes: > The hword_asciipart I'm not 100% sure about. I used this: > militar in the context político-militar, or postgresql in the > context postgresql-beta1 Hmm ... I went and looked at the page on developer.postgresql.org, and it's just as I feared: with slightly bleary morning eyes, the accents over the i's are not obvious, and so you have to look *real* close before you get the point of the examples. It doesn't help that 'politico' with no accent is exactly how the phrase would be spelled in English, and so it's easy to not see the accent because you're not expecting one. The other examples seem alright, but I think that one's a bad choice. regards, tom lane
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > The hword_asciipart I'm not 100% sure about. I used this: > > militar in the context político-militar, or postgresql in the > > context postgresql-beta1 > > Hmm ... I went and looked at the page on developer.postgresql.org, > and it's just as I feared: with slightly bleary morning eyes, the > accents over the i's are not obvious, and so you have to look *real* > close before you get the point of the examples. It doesn't help that > 'politico' with no accent is exactly how the phrase would be spelled > in English, and so it's easy to not see the accent because you're not > expecting one. The other examples seem alright, but I think that one's > a bad choice. Damn. Ok, I'll search for a different example. We're making progress nonetheless ;-) -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera wrote: > Tom Lane wrote: > > Alvaro Herrera <alvherre@commandprompt.com> writes: > > > The hword_asciipart I'm not 100% sure about. I used this: > > > militar in the context político-militar, or postgresql in the > > > context postgresql-beta1 > > > > Hmm ... I went and looked at the page on developer.postgresql.org, > > and it's just as I feared: with slightly bleary morning eyes, the > > accents over the i's are not obvious, and so you have to look *real* > > close before you get the point of the examples. It doesn't help that > > 'politico' with no accent is exactly how the phrase would be spelled > > in English, and so it's easy to not see the accent because you're not > > expecting one. The other examples seem alright, but I think that one's > > a bad choice. > > Damn. Ok, I'll search for a different example. We're making progress > nonetheless ;-) How about "lógico-matemática"? (If that one doesn't work for you, maybe we should look into words in another language, more different from english. Maybe Magnus can suggest hyphenated words with weird letters). -- Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J "La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)
On Thursday, 25 October 2007, Tom Lane wrote: > Hmm ... I went and looked at the page on developer.postgresql.org, > and it's just as I feared: with slightly bleary morning eyes, the > accents over the i's are not obvious, and so you have to look *real* > close before you get the point of the examples. By that standard, you will have to use non-Latin letters, which might decrease the usability of the examples much more. There are not likely to be any Latin-looking letters that are not ASCII and are not resembling another Latin letter. -- Peter Eisentraut http://developer.postgresql.org/~petere/
Peter Eisentraut wrote: > On Thursday, 25 October 2007, Tom Lane wrote: > > Hmm ... I went and looked at the page on developer.postgresql.org, > > and it's just as I feared: with slightly bleary morning eyes, the > > accents over the i's are not obvious, and so you have to look *real* > > close before you get the point of the examples. > > By that standard, you will have to use non-Latin letters, which might decrease > the usability of the examples much more. There are not likely to be any > Latin-looking letters that are not ASCII and are not resembling another Latin > letter. I think it would suffice to use an accent over a vowel that's not an i. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > Peter Eisentraut wrote: >> On Thursday, 25 October 2007, Tom Lane wrote: >>> Hmm ... I went and looked at the page on developer.postgresql.org, >>> and it's just as I feared: with slightly bleary morning eyes, the >>> accents over the i's are not obvious, and so you have to look *real* >>> close before you get the point of the examples. >> >> By that standard, you will have to use non-Latin letters, which might decrease >> the usability of the examples much more. There are not likely to be any >> Latin-looking letters that are not ASCII and are not resembling another Latin >> letter. > I think it would suffice to use an accent over a vowel that's not an i. Yeah, that would help. But the real problem with político-militar is that it looks way too much like the English equivalent --- my first reaction was "huh, he forgot the 'y'". I'm after a word that *looks* not-English. Alvaro's comment that maybe we need to look to something besides Spanish seems on point. regards, tom lane
Alvaro Herrera <alvherre@commandprompt.com> writes: > How about "lógico-matemática"? Works for me. regards, tom lane