Configuring Text Search parser? - Mailing list pgsql-hackers
From | jesper@krogh.cc |
---|---|
Subject | Configuring Text Search parser? |
Date | |
Msg-id | 1a26550c0b55c0a0af0dcbd8e080bc82.squirrel@shrek.krogh.cc |
List | pgsql-hackers |
Hi.

I'm trying to migrate an application off an existing full text search engine and onto PostgreSQL. One of my main (remaining) headaches is the fact that PostgreSQL treats _ as a separator character, whereas the existing behaviour is to "not split". That means:

testdb=# select ts_debug('database_tag_number_999');
                                   ts_debug
------------------------------------------------------------------------------
 (asciiword,"Word, all ASCII",database,{english_stem},english_stem,{databas})
 (blank,"Space symbols",_,{},,)
 (asciiword,"Word, all ASCII",tag,{english_stem},english_stem,{tag})
 (blank,"Space symbols",_,{},,)
 (asciiword,"Word, all ASCII",number,{english_stem},english_stem,{number})
 (blank,"Space symbols",_,{},,)
 (uint,"Unsigned integer",999,{simple},simple,{999})
(7 rows)

The incoming data, by design, contains a set of tags which include _ and are each expected to be one "lexeme".

I've tried patching my way out of this using this patch:

$ diff -w -C 5 src/backend/tsearch/wparser_def.c.orig src/backend/tsearch/wparser_def.c
*** src/backend/tsearch/wparser_def.c.orig 2010-09-20 15:58:37.033336460 +0200
--- src/backend/tsearch/wparser_def.c 2010-09-20 15:58:41.193335577 +0200
***************
*** 967,986 ****
--- 967,988 ----
  static const TParserStateActionItem actionTPS_InNumWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
      {p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
      {p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_InNumWord, 0, NULL},
      {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
      {p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenNumWordFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, NUMWORD, NULL}
  };

  static const TParserStateActionItem actionTPS_InAsciiWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
      {p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
      {p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenAsciiWordFirst, 0, NULL},
      {p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
***************
*** 995,1004 ****
--- 997,1007 ----
  static const TParserStateActionItem actionTPS_InWord[] = {
      {p_isEOF, 0, A_BINGO, TPS_Base, WORD_T, NULL},
      {p_isalpha, 0, A_NEXT, TPS_Null, 0, NULL},
      {p_isspecial, 0, A_NEXT, TPS_Null, 0, NULL},
+     {p_iseqC, '_', A_NEXT, TPS_Null, 0, NULL},
      {p_isdigit, 0, A_NEXT, TPS_InNumWord, 0, NULL},
      {p_iseqC, '-', A_PUSH, TPS_InHyphenWordFirst, 0, NULL},
      {NULL, 0, A_BINGO, TPS_Base, WORD_T, NULL}
  };

This will obviously break other people's applications, so my question would be: if this should be made configurable, how should it be done?

As a side note: Xapian doesn't split on _, Lucene does.

Thanks.

-- 
Jesper
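To make the "configurable" question concrete, here is a minimal standalone sketch of a tokenizer whose treatment of _ is a runtime option rather than a compiled-in rule. It is not PostgreSQL code and does not reflect how wparser_def.c is structured: ToyParserOptions, toy_is_word_char, toy_next_token and toy_count_tokens are hypothetical names invented for illustration, and the toy lumps letters and digits into one token class instead of distinguishing asciiword, numword and uint.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option struct; nothing like this exists in PostgreSQL's parser. */
typedef struct
{
    bool underscore_in_word;    /* true: '_' continues a word; false: '_' splits */
} ToyParserOptions;

/* Decide whether c belongs to a word token under the given options. */
bool
toy_is_word_char(unsigned char c, const ToyParserOptions *opts)
{
    if (isalnum(c))
        return true;
    return c == '_' && opts->underscore_in_word;
}

/*
 * Copy the next word token starting at *cursor into buf (truncated to
 * buflen - 1 bytes, always NUL-terminated) and advance *cursor past it.
 * Returns false when no further token is found.
 */
bool
toy_next_token(const char **cursor, const ToyParserOptions *opts,
               char *buf, size_t buflen)
{
    const char *p;
    const char *start;
    size_t      len;

    assert(cursor != NULL && *cursor != NULL);
    assert(opts != NULL && buf != NULL && buflen > 0);

    /* Skip separator characters. */
    p = *cursor;
    while (*p != '\0' && !toy_is_word_char((unsigned char) *p, opts))
        p++;

    if (*p == '\0')
    {
        *cursor = p;
        return false;
    }

    /* Consume one run of word characters. */
    start = p;
    while (*p != '\0' && toy_is_word_char((unsigned char) *p, opts))
        p++;

    len = (size_t) (p - start);
    if (len >= buflen)
        len = buflen - 1;
    memcpy(buf, start, len);
    buf[len] = '\0';

    *cursor = p;
    return true;
}

/* Count the word tokens in text under the given options. */
int
toy_count_tokens(const char *text, const ToyParserOptions *opts)
{
    char buf[64];
    int  count = 0;

    while (toy_next_token(&text, opts, buf, sizeof buf))
        count++;
    return count;
}
```

With underscore_in_word set to false, 'database_tag_number_999' yields four tokens (database, tag, number, 999), which mirrors the splitting shown by ts_debug above. With it set to true, the same input comes back as a single token, which is the behaviour the patch is after. Where such an option would actually live in PostgreSQL (a parser option, a separate parser, or something else) is exactly the open question.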