Re: Fulltext search configuration - Mailing list pgsql-general

From:           Oleg Bartunov
Subject:        Re: Fulltext search configuration
Msg-id:         Pine.LNX.4.64.0902021829280.4158@sn.sai.msu.ru
In response to: Re: Fulltext search configuration (Oleg Bartunov <oleg@sai.msu.su>)
Responses:      Re: Fulltext search configuration
List:           pgsql-general
On Mon, 2 Feb 2009, Oleg Bartunov wrote:

> On Mon, 2 Feb 2009, Mohamed wrote:
>
>> Hehe, ok.. I don't know either, but I took some lines from Al-Jazeera:
>> http://aljazeera.net/portal
>>
>> I just made the change you said, created it successfully, and tried this:
>>
>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ???? ????????? ?????')
>>
>> but I got nothing... :(
>
> Mohamed, what did you expect from ts_lexize? Please provide us with useful
> information, else we can't help you.
>
>> Is there a way of making sure that words which are not recognized also get
>> indexed/searched for? (Not that I think this is the problem)
>
> yes

Read http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html

"A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token type that
the parser can return, a separate list of dictionaries is specified by the
configuration. When a token of that type is found by the parser, each
dictionary in the list is consulted in turn, until some dictionary recognizes
it as a known word. If it is identified as a stop word, or if no dictionary
recognizes the token, it will be discarded and not indexed or searched for.
The general rule for configuring a list of dictionaries is to place first the
most narrow, most specific dictionary, then the more general dictionaries,
finishing with a very general dictionary, like a Snowball stemmer or simple,
which recognizes everything."
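[Editor's illustration of the quoted paragraph, not part of the original message: the built-in `simple` dictionary accepts every token, which is why the documentation suggests it as the final fallback in a dictionary list.]

```sql
-- `simple` recognizes any token (it only lowercases and drops stop words),
-- so a word no other dictionary knows is still indexed when `simple`
-- is last in the list for that token type.
SELECT ts_lexize('simple', 'unrecognizedword');  -- {unrecognizedword}

-- A stemming dictionary, by contrast, normalizes known word forms:
SELECT ts_lexize('english_stem', 'elephants');   -- {eleph}
```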
A quick example:

CREATE TEXT SEARCH CONFIGURATION arabic ( COPY = english );

=# \dF+ arabic
Text search configuration "public.arabic"
Parser: "pg_catalog.default"
      Token      | Dictionaries
-----------------+--------------
 asciihword      | english_stem
 asciiword       | english_stem
 email           | simple
 file            | simple
 float           | simple
 host            | simple
 hword           | english_stem
 hword_asciipart | english_stem
 hword_numpart   | simple
 hword_part      | english_stem
 int             | simple
 numhword        | simple
 numword         | simple
 sfloat          | simple
 uint            | simple
 url             | simple
 url_path        | simple
 version         | simple
 word            | english_stem

Then you can alter this configuration.

>> / Moe
>>
>> On Mon, Feb 2, 2009 at 3:50 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>>
>>> Mohamed,
>>>
>>> Comment the line in ar.affix:
>>> #FLAG long
>>> and creation of the ispell dictionary will work. This is a temporary
>>> solution; Teodor is working on fixing affix autorecognition.
>>>
>>> I can't say anything about testing, since somebody should provide a
>>> first test case. I don't know how to type Arabic :)
>>>
>>> Oleg
>>>
>>> On Mon, 2 Feb 2009, Mohamed wrote:
>>>
>>>> Oleg, like I mentioned earlier, I have a different .affix file that I
>>>> got from Andrew along with the stop file, and I get no errors creating
>>>> the dictionary using that one, but I get nothing out of ts_lexize.
>>>> The size of that one is 406,219 bytes,
>>>> and the size of the hunspell one (first): 406,229 bytes.
>>>>
>>>> A little too close, don't you think?
>>>>
>>>> It might be that the Arabic hunspell (ayaspell) affix file is damaged
>>>> on some lines and I got the fixed one from Andrew.
>>>>
>>>> Just wanted to let you know.
>>>>
>>>> / Moe
>>>>
>>>> On Mon, Feb 2, 2009 at 3:25 PM, Mohamed <mohamed5432154321@gmail.com> wrote:
>>>>
>>>>> Ok, thank you Oleg.
>>>>> I have another dictionary package which is a conversion to hunspell
>>>>> as well:
>>>>>
>>>>> http://wiki.services.openoffice.org/wiki/Dictionaries#Arabic_.28North_Africa_and_Middle_East.29
>>>>> (Conversion of Buckwalter's Arabic morphological analyser) 2006-02-08
>>>>>
>>>>> And running that gives me this error (again the affix file):
>>>>>
>>>>> ERROR: wrong affix file format for flag
>>>>> CONTEXT: line 560 of configuration file "C:/Program
>>>>> Files/PostgreSQL/8.3/share/tsearch_data/arabic_utf8_alias.affix":
>>>>> "PFX 1013 Y 6"
>>>>>
>>>>> / Moe
>>>>>
>>>>> On Mon, Feb 2, 2009 at 2:41 PM, Oleg Bartunov <oleg@sai.msu.su> wrote:
>>>>>
>>>>>> Mohamed,
>>>>>>
>>>>>> We are looking into the problem.
>>>>>>
>>>>>> Oleg
>>>>>>
>>>>>> On Mon, 2 Feb 2009, Mohamed wrote:
>>>>>>
>>>>>>> No, I don't. But ts_lexize doesn't return anything, so I figured
>>>>>>> there must be an error somewhere.
>>>>>>> I think we are using the same dictionary, except that I am using the
>>>>>>> stopwords file and a different affix file, because using the hunspell
>>>>>>> (ayaspell) .aff gives me this error:
>>>>>>>
>>>>>>> ERROR: wrong affix file format for flag
>>>>>>> CONTEXT: line 42 of configuration file "C:/Program
>>>>>>> Files/PostgreSQL/8.3/share/tsearch_data/hunarabic.affix": "PFX Aa Y 40"
>>>>>>>
>>>>>>> / Moe
>>>>>>>
>>>>>>> On Mon, Feb 2, 2009 at 12:13 PM, Daniel Chiaramello
>>>>>>> <daniel.chiaramello@golog.net> wrote:
>>>>>>>
>>>>>>>> Hi Mohamed.
>>>>>>>>
>>>>>>>> I don't know where you got the dictionary - I unsuccessfully tried
>>>>>>>> the OpenOffice one myself (the Ayaspell one), and I had no Arabic
>>>>>>>> stopwords file.
>>>>>>>>
>>>>>>>> Renaming the file is supposed to be enough (I did it successfully
>>>>>>>> for the Thai dictionary) - the ".aff" file becoming the ".affix" one.
>>>>>>>> When I tried to create the dictionary:
>>>>>>>>
>>>>>>>> CREATE TEXT SEARCH DICTIONARY ar_ispell (
>>>>>>>>     TEMPLATE = ispell,
>>>>>>>>     DictFile = ar_utf8,
>>>>>>>>     AffFile = ar_utf8,
>>>>>>>>     StopWords = english
>>>>>>>> );
>>>>>>>>
>>>>>>>> I had an error:
>>>>>>>>
>>>>>>>> ERREUR: mauvais format de fichier affixe pour le drapeau
>>>>>>>> CONTEXTE : ligne 42 du fichier de configuration
>>>>>>>> « /usr/share/pgsql/tsearch_data/ar_utf8.affix » : « PFX Aa Y 40 »
>>>>>>>>
>>>>>>>> (which means: bad affix file format for flag, line 42 of the
>>>>>>>> configuration file)
>>>>>>>>
>>>>>>>> Do you have an error when creating your dictionary?
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>> Mohamed wrote:
>>>>>>>>
>>>>>>>> I have run into some problems here.
>>>>>>>> I am trying to implement Arabic fulltext search on three columns.
>>>>>>>>
>>>>>>>> To create a dictionary I have a hunspell dictionary and an Arabic
>>>>>>>> stop file.
>>>>>>>>
>>>>>>>> CREATE TEXT SEARCH DICTIONARY hunspell_dic (
>>>>>>>>     TEMPLATE = ispell,
>>>>>>>>     DictFile = hunarabic,
>>>>>>>>     AffFile = hunarabic,
>>>>>>>>     StopWords = arabic
>>>>>>>> );
>>>>>>>>
>>>>>>>> 1) The problem is that the hunspell package contains a .dic and a
>>>>>>>> .aff file, but the configuration requires a .dict and a .affix file.
>>>>>>>> I have tried to change the extensions but with no success.
>>>>>>>>
>>>>>>>> 2) ts_lexize('hunspell_dic', 'ARABIC WORD') returns nothing.
>>>>>>>>
>>>>>>>> 3) How can I convert my .dic and .aff to a valid .dict and .affix?
>>>>>>>>
>>>>>>>> 4) I have read that when using dictionaries, if a word is not
>>>>>>>> recognized by any dictionary it will not be indexed. I find that
>>>>>>>> troublesome; I would like everything but the stop words to be
>>>>>>>> indexed. I guess this might be a step that I am not ready for yet,
>>>>>>>> but just wanted to put it out there.
>>>>>>>> Also I would like to know what the process of the fulltext search
>>>>>>>> implementation looks like, from config to search.
>>>>>>>>
>>>>>>>> Create a dictionary, then a text search configuration, add the
>>>>>>>> dictionary to the configuration, index the columns with GIN or
>>>>>>>> GiST...
>>>>>>>>
>>>>>>>> What does a search look like? Does it match against the GIN/GiST
>>>>>>>> index? Has that index been built using the dictionary/configuration,
>>>>>>>> or is the dictionary only used on search phrases?
>>>>>>>>
>>>>>>>> / Moe

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
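[Editor's note, not part of the original message: Mohamed's end-to-end question above can be sketched as follows. The table `articles` and columns `title`/`body` are made-up names for illustration; this assumes an `arabic` configuration has been created as shown earlier in the thread. The key point is that the same configuration is applied both when building the index and, via to_tsquery, to the search phrase.]

```sql
-- Index the columns: the expression index stores tsvectors produced
-- with the 'arabic' configuration, so dictionaries run at index build time.
CREATE INDEX articles_fts_idx ON articles
    USING gin (to_tsvector('arabic',
                           coalesce(title, '') || ' ' || coalesce(body, '')));

-- A search: the query text goes through the same configuration, and the
-- @@ match operator can be satisfied from the GIN index.
SELECT id, title
FROM articles
WHERE to_tsvector('arabic', coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('arabic', 'word');
```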