Re: Better locale-specific-character-class handling for regexps - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Better locale-specific-character-class handling for regexps |
Date | |
Msg-id | e604d9ec-0c09-5e1d-14a2-196dd66fd462@iki.fi Whole thread Raw |
In response to | Better locale-specific-character-class handling for regexps (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Better locale-specific-character-class handling for regexps
|
List | pgsql-hackers |
On 08/23/2016 03:54 AM, Tom Lane wrote: > ! That's still not quite enough, though, because of locale-dependent > ! character classes such as [[:alpha:]]. In Unicode locales these classes > ! may have thousands of entries that are above MAX_SIMPLE_CHR, and we > ! certainly don't want to be searching large colormaprange arrays at runtime. > ! Nor do we even want to spend the time to initialize cvec structures that > ! exhaustively describe all of those characters. Our solution is to compute > ! exact per-character colors at compile time only up to MAX_SIMPLE_CHR. > ! For characters above that, we apply the <ctype.h> or <wctype.h> lookup > ! functions at runtime for each locale-dependent character class used in the > ! regex pattern, constructing a bitmap that describes which classes the > ! runtime character belongs to. The per-character-range data structure > ! mentioned above actually holds, for each range, a separate color entry > ! for each possible combination of character class properties. That is, > ! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array, > ! whose rows correspond to character ranges that are explicitly mentioned > ! in the input, and whose columns correspond to sets of relevant locale > ! character classes. I think that last sentence should say "whose rows correspond to character ranges that are explicitly mentioned in the *regex pattern*", rather than "in the input". An example would be very helpful here. I had to read through this many times, until I understood it. I can easily come up with examples for character classes, but not for "high" character-ranges. The best I could come up with is to check if a characters belongs to some special group of unicode characters, like U&'[\+01D100-\+01D1FF]' to check for musical symbol characters. In practice, I guess you will only see single characters in the colormaprange array, although we must of course cope with ranges too. > + /* this relies on WHITE being zero: */ > + memset(cm->locolormap, WHITE, > + (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color)); > + > + memset(cm->classbits, 0, sizeof(cm->classbits)); > + cm->numcmranges = 0; > + cm->cmranges = NULL; > + cm->maxarrayrows = 4; /* arbitrary initial allocation */ > + cm->hiarrayrows = 1; > + cm->hiarraycols = 1; > + cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color)); > + if (cm->hicolormap == NULL) > + { > + CERR(REG_ESPACE); > + return; > + } > + /* initialize the "all other characters" row to WHITE */ > + cm->hicolormap[0] = WHITE; Is the comment correct? I don't see why this wouldn't work with "WHITE != 0". > ! /* Duplicate existing columns to the right, and increase ref counts */ > ! /* Must work downwards in the array because we realloc'd in place */ > ! for (r = cm->hiarrayrows - 1; r >= 0; r--) > ! { > ! color *oldrowptr = &newarray[r * cm->hiarraycols]; > ! color *newrowptr = &newarray[r * cm->hiarraycols * 2]; > ! color *newrowptr2 = newrowptr + cm->hiarraycols; > > ! for (c = 0; c < cm->hiarraycols; c++) > ! { > ! color co = oldrowptr[c]; > ! > ! newrowptr[c] = newrowptr2[c] = co; > ! cm->cd[co].nuchrs++; > ! } > ! } Perhaps "backwards" would be clearer than "downwards"? At least in my mental model, index 0 is the top row of an array, so "downwards" means 0, 1, 2. I guess you meant downwards numerically, rather than visually, but it took me a moment to process that. +1 for this patch in general. Some regression test cases would be nice. - Heikki
pgsql-hackers by date: