Re: Better locale-specific-character-class handling for regexps - Mailing list pgsql-hackers
From | Tom Lane
Subject | Re: Better locale-specific-character-class handling for regexps
Date |
Msg-id | 20763.1473011046@sss.pgh.pa.us
In response to | Re: Better locale-specific-character-class handling for regexps (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses | Re: Better locale-specific-character-class handling for regexps
List | pgsql-hackers
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 08/23/2016 03:54 AM, Tom Lane wrote:
>> ! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
>> ! whose rows correspond to character ranges that are explicitly mentioned
>> ! in the input, and whose columns correspond to sets of relevant locale
>> ! character classes.

> I think that last sentence should say "whose rows correspond to
> character ranges that are explicitly mentioned in the *regex pattern*",
> rather than "in the input".

OK.  The pattern is the input under consideration by regcomp.c, but I see
your point that someone might be confused.

> An example would be very helpful here.

I'll see what I can do.

> I can easily come up with examples for character classes, but not for
> "high" character-ranges.  The best I could come up with is to check if a
> character belongs to some special group of unicode characters, like
> U&'[\+01D100-\+01D1FF]' to check for musical symbol characters.

Right, or someone might write the equivalent of [A-Z] in the Klingon
alphabet, or whatever --- it wouldn't necessarily have to use backslash
notation, it could be literal characters with large codes.

> In practice, I guess you will only see single characters in the
> colormaprange array, although we must of course cope with ranges too.

I suspect that's true; I'm not sure that there are many places in the
upper reaches of Unicode space where ranges of consecutive character
codes form natural units to search for.

>> + /* this relies on WHITE being zero: */
>> + memset(cm->locolormap, WHITE,
>> +        (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));

> Is the comment correct?  I don't see why this wouldn't work with "WHITE
> != 0".

The array entries are shorts not chars, so if WHITE were say 2, memset
would fill the array with 0x0202 rather than 2.  I suppose if WHITE were
say 0x0303 then it would accidentally work, but I don't think the comment
needs to get into that.

> Perhaps "backwards" would be clearer than "downwards"?

OK.

> +1 for this patch in general.  Some regression test cases would be nice.

I'm not sure how to write such tests without introducing insurmountable
platform dependencies --- particularly on platforms with weak support for
UTF8 locales, such as OS X.  All the interesting cases require knowing
what iswalpha() etc will return for some high character codes.

What I did to test it during development was to set MAX_SIMPLE_CHR to
something in the ASCII range, so that the high-character-code paths could
be tested without making any assumptions about locale classifications for
non-ASCII characters.  I'm not sure that's a helpful idea for regression
testing purposes, though.

I guess I could follow the lead of collate.linux.utf8.sql and produce a
test that's only promised to pass on one platform with one encoding, but
I'm not terribly excited by that.  AFAIK that test file does not get run
at all in the buildfarm or in the wild.

Thanks for reviewing!

			regards, tom lane
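
For readers following along, here is a minimal sketch of the data-structure
shape being discussed above; the type and field names are hypothetical
illustrations, not the actual patch code:

    /*
     * Illustration only: colors for chrs above MAX_SIMPLE_CHR are found by
     * first locating the row whose [cmin, cmax] range (a range written
     * explicitly in the regex pattern) contains the chr, then indexing that
     * row by the chr's combination of relevant locale character classes.
     * All names here are hypothetical.
     */
    typedef int chr;            /* stand-in for the regex library's chr type */
    typedef short color;        /* color entries are shorts in the regex code */

    typedef struct HighColorMapRow
    {
        chr     cmin;           /* first chr of a range mentioned in the pattern */
        chr     cmax;           /* last chr of that range */
        color  *classcolors;    /* one color per relevant locale-class set */
    } HighColorMapRow;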
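
To see the memset point concretely, here is a small standalone demo; WHITE
is given a nonzero value purely for illustration (in the actual code it is
zero, which is exactly why the memset works there):

    #include <stdio.h>
    #include <string.h>

    typedef short color;            /* array entries are shorts, not chars */
    #define DEMO_WHITE ((color) 2)  /* hypothetical nonzero value, for the demo */

    int
    main(void)
    {
        color   map[4];

        /* memset writes bytes, so each 2-byte short becomes 0x0202, not 2 */
        memset(map, DEMO_WHITE, sizeof(map));
        printf("%d\n", (int) map[0]);   /* prints 514, not 2 */
        return 0;
    }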