Re: PATCH: Allow empty targets in unaccent dictionary - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: PATCH: Allow empty targets in unaccent dictionary |
Date | |
Msg-id | 20035.1404155957@sss.pgh.pa.us Whole thread Raw |
In response to | Re: PATCH: Allow empty targets in unaccent dictionary (Abhijit Menon-Sen <ams@2ndQuadrant.com>) |
Responses |
Re: PATCH: Allow empty targets in unaccent dictionary
|
List | pgsql-hackers |
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes: > I've attached a patch to contrib/unaccent as outlined in my review the > other day. I went to commit this, and while testing I realized that the current implementation of unaccent_lexize is only capable of coping with "src" strings that are single characters in the current encoding. I'm not sure exactly how complicated it would be to fix that, but it might not be terribly difficult. (Overlapping matches would be the main complication, likely.) Anyway, this raises the question of whether the current patch is actually a desirable way to do things, or whether it would be better if the unaccenting rules were like "base-char accent-char" -> "base-char". Presumably, that would require more rules, but it would prevent deletion of a standalone accent-char, which might or might not be a good thing. Also, if there are any contexts where the right translation of an accent-char depends on the base-char, you couldn't do it with the patch as it stands. I don't know any languages that use separate accent-chars, so I'm not qualified to opine on whether this is important. It's not unlikely that we want this patch *and* an improvement that allows multi-character src strings, but I held off committing for the moment until somebody weighs in with an opinion about that. In any case, the patch failed to update the user-facing docs (unaccent.sgml) so we need some more work in that department. The user-facing docs are already pretty weak in that they fail to explain the whitespace rules, much less clarify that the src must be exactly a single character. If we don't improve the code to allow multi-character src, I wonder whether the rule-reading code shouldn't be improved to forbid such cases. I'm also wondering why it silently ignores malformed lines instead of bleating. But that's more or less orthogonal to this patch. Lastly, I didn't especially like the coding details of either proposed patch, and rewrote it as attached. I didn't see a need to introduce a "state 5", and I also didn't agree with changing the initial values of trg/trglen to be meaningful. Right now, those variables are initialized only to keep the compiler from bleating; the initial values aren't actually used anywhere. It didn't seem like a good idea for the trgxxx variables to be different from the srcxxx variables in that respect. regards, tom lane diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c index a337df6..5a31f85 100644 *** a/contrib/unaccent/unaccent.c --- b/contrib/unaccent/unaccent.c *************** initTrie(char *filename) *** 104,114 **** while ((line = tsearch_readline(&trst)) != NULL) { ! /* ! * The format of each line must be "src trg" where src and trg ! * are sequences of one or more non-whitespace characters, ! * separated by whitespace. Whitespace at start or end of ! * line is ignored. */ int state; char *ptr; --- 104,124 ---- while ((line = tsearch_readline(&trst)) != NULL) { ! /*---------- ! * The format of each line must be "src" or "src trg", where ! * src and trg are sequences of one or more non-whitespace ! * characters, separated by whitespace. Whitespace at start ! * or end of line is ignored. If trg is omitted, an empty ! * string is used as the replacement. ! * ! * We use a simple state machine, with states ! * 0 initial (before src) ! * 1 in src ! * 2 in whitespace after src ! * 3 in trg ! * 4 in whitespace after trg ! * -1 syntax error detected (line will be ignored) ! *---------- */ int state; char *ptr; *************** initTrie(char *filename) *** 160,166 **** } } ! if (state >= 3) rootTrie = placeChar(rootTrie, (unsigned char *) src, srclen, trg, trglen); --- 170,183 ---- } } ! if (state == 1 || state == 2) ! { ! /* trg was omitted, so use "" */ ! trg = ""; ! trglen = 0; ! } ! ! if (state > 0) rootTrie = placeChar(rootTrie, (unsigned char *) src, srclen, trg, trglen);
pgsql-hackers by date: