WIP: lookbehind constraints for our regexp engine - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | WIP: lookbehind constraints for our regexp engine |
Date | |
Msg-id | 5404.1445039538@sss.pgh.pa.us Whole thread Raw |
Responses |
Re: WIP: lookbehind constraints for our regexp engine
Re: WIP: lookbehind constraints for our regexp engine Re: WIP: lookbehind constraints for our regexp engine |
List | pgsql-hackers |
I've occasionally heard complaints that our regex engine only has lookahead constraints not lookbehind constraints, while Perl's for example does have those. While I was fooling around with the other issues in that code, I learned enough to realize that it would not be that hard to implement such things, and the attached proof-of-concept patch does so (using Perl-compatible syntax, in case you're wondering). However, I don't feel like this is done to the point of being committable, because its performance leaves something to be desired in a lot of cases. The difficulty is that because the engine can only match in a forward direction, in the worst case you have to check a lookbehind constraint by trying to match starting from the string start to see if a match exists ending at the current-location where you're testing the constraint. That makes it, worst case, O(N^2) to search a string of length N. Now, you can improve on that greatly if you can determine that possible matches don't extend all the way back to the string start. In the attached patch I use the "cold start pointer" returned by shortest() to decide that characters before the coldstart point no longer have to be rechecked. That helps some, but it's not good enough because there are too many cases where the engine doesn't move the coldstart point forward very aggressively. It might be that that behavior itself can be improved on, which would be nice because there are also non-lookbehind-related scenarios where smarter coldstart detection would help performance. But I have no very good ideas about how to do it. Another idea I've been toying with is to add logic that tries to determine the maximum possible match length for a regex. If we can determine that for the lookbehind sub-regex, then we'd know we have to back up at most that far before applying the match check. This seems promising because a lot of other engines don't even support variable-length lookbehind patterns at all, and so we'd be assured of good performance in all the cases that competitors can handle at all. Anyway, I'm not planning to do much more work on this right now, but I thought I'd throw it out there just to let people know that this seems within reach. I'm curious to know how many people care, and how much. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 2946122..4d482ec 100644 *** a/doc/src/sgml/func.sgml --- b/doc/src/sgml/func.sgml *************** SELECT foo FROM regexp_split_to_table('t *** 4477,4489 **** where no substring matching <replaceable>re</> begins (AREs only) </entry> </row> </tbody> </tgroup> </table> <para> ! Lookahead constraints cannot contain <firstterm>back references</> ! (see <xref linkend="posix-escape-sequences">), and all parentheses within them are considered non-capturing. </para> </sect3> --- 4477,4503 ---- where no substring matching <replaceable>re</> begins (AREs only) </entry> </row> + + <row> + <entry> <literal>(?<=</><replaceable>re</><literal>)</> </entry> + <entry> <firstterm>positive lookbehind</> matches at any point + where a substring matching <replaceable>re</> ends + (AREs only) </entry> + </row> + + <row> + <entry> <literal>(?<!</><replaceable>re</><literal>)</> </entry> + <entry> <firstterm>negative lookbehind</> matches at any point + where no substring matching <replaceable>re</> ends + (AREs only) </entry> + </row> </tbody> </tgroup> </table> <para> ! Lookahead and lookbehind constraints cannot contain <firstterm>back ! references</> (see <xref linkend="posix-escape-sequences">), and all parentheses within them are considered non-capturing. </para> </sect3> *************** SELECT regexp_matches('abc01234xyz', '(? *** 5355,5361 **** the lack of special treatment for a trailing newline, the addition of complemented bracket expressions to the things affected by newline-sensitive matching, ! the restrictions on parentheses and back references in lookahead constraints, and the longest/shortest-match (rather than first-match) matching semantics. </para> --- 5369,5375 ---- the lack of special treatment for a trailing newline, the addition of complemented bracket expressions to the things affected by newline-sensitive matching, ! the restrictions on parentheses and back references in lookahead/lookbehind constraints, and the longest/shortest-match (rather than first-match) matching semantics. </para> diff --git a/src/backend/regex/README b/src/backend/regex/README index 5c24d3d..6c9f483 100644 *** a/src/backend/regex/README --- b/src/backend/regex/README *************** The possible arc types are: *** 332,341 **** as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line constraints respectively. ! LACON constraints, which represent "(?=re)" and "(?!re)" constraints, ! i.e. the input starting at this point must match (or not match) a ! given sub-RE, but the matching input is not consumed. These are ! dumped as ":subtree_number:->to_state". If you see anything else (especially any question marks) in the display of an arc, it's dumpnfa() trying to tell you that there's something fishy --- 332,341 ---- as "$0->to_state" or "$1->to_state" for end-of-string and end-of-line constraints respectively. ! LACON constraints, which represent "(?=re)", "(?!re)", "(?<=re)", and ! "(?<!re)" constraints, i.e. the input starting/ending at this point must ! match (or not match) a given sub-RE, but the matching input is not ! consumed. These are dumped as ":subtree_number:->to_state". If you see anything else (especially any question marks) in the display of an arc, it's dumpnfa() trying to tell you that there's something fishy diff --git a/src/backend/regex/re_syntax.n b/src/backend/regex/re_syntax.n index f37bb85..4621bfc 100644 *** a/src/backend/regex/re_syntax.n --- b/src/backend/regex/re_syntax.n *************** where a substring matching \fIre\fR begi *** 196,205 **** \fB(?!\fIre\fB)\fR \fInegative lookahead\fR (AREs only), matches at any point where no substring matching \fIre\fR begins .RE .PP ! The lookahead constraints may not contain back references (see later), ! and all parentheses within them are considered non-capturing. .PP An RE may not end with `\fB\e\fR'. --- 196,213 ---- \fB(?!\fIre\fB)\fR \fInegative lookahead\fR (AREs only), matches at any point where no substring matching \fIre\fR begins + .TP + \fB(?<=\fIre\fB)\fR + \fIpositive lookbehind\fR (AREs only), matches at any point + where a substring matching \fIre\fR ends + .TP + \fB(?<!\fIre\fB)\fR + \fInegative lookbehind\fR (AREs only), matches at any point + where no substring matching \fIre\fR ends .RE .PP ! Lookahead and lookbehind constraints may not contain back references ! (see later), and all parentheses within them are considered non-capturing. .PP An RE may not end with `\fB\e\fR'. *************** Incompatibilities of note include `\fB\e *** 856,862 **** the lack of special treatment for a trailing newline, the addition of complemented bracket expressions to the things affected by newline-sensitive matching, ! the restrictions on parentheses and back references in lookahead constraints, and the longest/shortest-match (rather than first-match) matching semantics. .PP The matching rules for REs containing both normal and non-greedy quantifiers --- 864,871 ---- the lack of special treatment for a trailing newline, the addition of complemented bracket expressions to the things affected by newline-sensitive matching, ! the restrictions on parentheses and back references in lookahead/lookbehind ! constraints, and the longest/shortest-match (rather than first-match) matching semantics. .PP The matching rules for REs containing both normal and non-greedy quantifiers diff --git a/src/backend/regex/regc_lex.c b/src/backend/regex/regc_lex.c index f6ed9f0..bfd9dcd 100644 *** a/src/backend/regex/regc_lex.c --- b/src/backend/regex/regc_lex.c *************** next(struct vars * v) *** 582,587 **** --- 582,589 ---- { NOTE(REG_UNONPOSIX); v->now++; + if (ATEOS()) + FAILW(REG_BADRPT); switch (*v->now++) { case CHR(':'): /* non-capturing paren */ *************** next(struct vars * v) *** 596,607 **** return next(v); break; case CHR('='): /* positive lookahead */ ! NOTE(REG_ULOOKAHEAD); ! RETV(LACON, 1); break; case CHR('!'): /* negative lookahead */ ! NOTE(REG_ULOOKAHEAD); ! RETV(LACON, 0); break; default: FAILW(REG_BADRPT); --- 598,628 ---- return next(v); break; case CHR('='): /* positive lookahead */ ! NOTE(REG_ULOOKAROUND); ! RETV(LACON, LATYPE_AHEAD_POS); break; case CHR('!'): /* negative lookahead */ ! NOTE(REG_ULOOKAROUND); ! RETV(LACON, LATYPE_AHEAD_NEG); ! break; ! case CHR('<'): ! if (ATEOS()) ! FAILW(REG_BADRPT); ! switch (*v->now++) ! { ! case CHR('='): /* positive lookbehind */ ! NOTE(REG_ULOOKAROUND); ! RETV(LACON, LATYPE_BEHIND_POS); ! break; ! case CHR('!'): /* negative lookbehind */ ! NOTE(REG_ULOOKAROUND); ! RETV(LACON, LATYPE_BEHIND_NEG); ! break; ! default: ! FAILW(REG_BADRPT); ! break; ! } ! assert(NOTREACHED); break; default: FAILW(REG_BADRPT); diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index b733bc7..cfb8f83 100644 *** a/src/backend/regex/regcomp.c --- b/src/backend/regex/regcomp.c *************** static int numst(struct subre *, int); *** 65,71 **** static void markst(struct subre *); static void cleanst(struct vars *); static long nfatree(struct vars *, struct subre *, FILE *); ! static long nfanode(struct vars *, struct subre *, FILE *); static int newlacon(struct vars *, struct state *, struct state *, int); static void freelacons(struct subre *, int); static void rfree(regex_t *); --- 65,71 ---- static void markst(struct subre *); static void cleanst(struct vars *); static long nfatree(struct vars *, struct subre *, FILE *); ! static long nfanode(struct vars *, struct subre *, int, FILE *); static int newlacon(struct vars *, struct state *, struct state *, int); static void freelacons(struct subre *, int); static void rfree(regex_t *); *************** struct vars *** 245,252 **** int ntree; /* number of tree nodes, plus one */ struct cvec *cv; /* interface cvec */ struct cvec *cv2; /* utility cvec */ ! struct subre *lacons; /* lookahead-constraint vector */ ! int nlacons; /* size of lacons */ size_t spaceused; /* approx. space used for compilation */ }; --- 245,253 ---- int ntree; /* number of tree nodes, plus one */ struct cvec *cv; /* interface cvec */ struct cvec *cv2; /* utility cvec */ ! struct subre *lacons; /* lookaround-constraint vector */ ! int nlacons; /* size of lacons[]; note that only slots ! * numbered 1 .. nlacons-1 are used */ size_t spaceused; /* approx. space used for compilation */ }; *************** struct vars *** 277,283 **** #define CCLASS 'C' /* start of [: */ #define END 'X' /* end of [. [= [: */ #define RANGE 'R' /* - within [] which might be range delim. */ ! #define LACON 'L' /* lookahead constraint subRE */ #define AHEAD 'a' /* color-lookahead arc */ #define BEHIND 'r' /* color-lookbehind arc */ #define WBDRY 'w' /* word boundary constraint */ --- 278,284 ---- #define CCLASS 'C' /* start of [: */ #define END 'X' /* end of [. [= [: */ #define RANGE 'R' /* - within [] which might be range delim. */ ! #define LACON 'L' /* lookaround constraint subRE */ #define AHEAD 'a' /* color-lookahead arc */ #define BEHIND 'r' /* color-lookbehind arc */ #define WBDRY 'w' /* word boundary constraint */ *************** pg_regcomp(regex_t *re, *** 432,442 **** assert(v->nlacons == 0 || v->lacons != NULL); for (i = 1; i < v->nlacons; i++) { #ifdef REG_DEBUG if (debug != NULL) fprintf(debug, "\n\n\n========= LA%d ==========\n", i); #endif ! nfanode(v, &v->lacons[i], debug); } CNOERR(); if (v->tree->flags & SHORTER) --- 433,447 ---- assert(v->nlacons == 0 || v->lacons != NULL); for (i = 1; i < v->nlacons; i++) { + struct subre *lasub = &v->lacons[i]; + #ifdef REG_DEBUG if (debug != NULL) fprintf(debug, "\n\n\n========= LA%d ==========\n", i); #endif ! ! /* Prepend .*? to pattern if it's a lookbehind LACON */ ! nfanode(v, lasub, !LATYPE_IS_AHEAD(lasub->subno), debug); } CNOERR(); if (v->tree->flags & SHORTER) *************** makesearch(struct vars * v, *** 640,646 **** static struct subre * parse(struct vars * v, int stopper, /* EOS or ')' */ ! int type, /* LACON (lookahead subRE) or PLAIN */ struct state * init, /* initial state */ struct state * final) /* final state */ { --- 645,651 ---- static struct subre * parse(struct vars * v, int stopper, /* EOS or ')' */ ! int type, /* LACON (lookaround subRE) or PLAIN */ struct state * init, /* initial state */ struct state * final) /* final state */ { *************** parse(struct vars * v, *** 719,725 **** static struct subre * parsebranch(struct vars * v, int stopper, /* EOS or ')' */ ! int type, /* LACON (lookahead subRE) or PLAIN */ struct state * left, /* leftmost state */ struct state * right, /* rightmost state */ int partial) /* is this only part of a branch? */ --- 724,730 ---- static struct subre * parsebranch(struct vars * v, int stopper, /* EOS or ')' */ ! int type, /* LACON (lookaround subRE) or PLAIN */ struct state * left, /* leftmost state */ struct state * right, /* rightmost state */ int partial) /* is this only part of a branch? */ *************** parsebranch(struct vars * v, *** 768,774 **** static void parseqatom(struct vars * v, int stopper, /* EOS or ')' */ ! int type, /* LACON (lookahead subRE) or PLAIN */ struct state * lp, /* left state to hang it on */ struct state * rp, /* right state to hang it on */ struct subre * top) /* subtree top */ --- 773,779 ---- static void parseqatom(struct vars * v, int stopper, /* EOS or ')' */ ! int type, /* LACON (lookaround subRE) or PLAIN */ struct state * lp, /* left state to hang it on */ struct state * rp, /* right state to hang it on */ struct subre * top) /* subtree top */ *************** parseqatom(struct vars * v, *** 782,788 **** struct subre *atom; /* atom's subtree */ struct subre *t; int cap; /* capturing parens? */ ! int pos; /* positive lookahead? */ int subno; /* capturing-parens or backref number */ int atomtype; int qprefer; /* quantifier short/long preference */ --- 787,793 ---- struct subre *atom; /* atom's subtree */ struct subre *t; int cap; /* capturing parens? */ ! int latype; /* lookaround constraint type */ int subno; /* capturing-parens or backref number */ int atomtype; int qprefer; /* quantifier short/long preference */ *************** parseqatom(struct vars * v, *** 866,873 **** nonword(v, AHEAD, s, rp); return; break; ! case LACON: /* lookahead constraint */ ! pos = v->nextvalue; NEXT(); s = newstate(v->nfa); s2 = newstate(v->nfa); --- 871,878 ---- nonword(v, AHEAD, s, rp); return; break; ! case LACON: /* lookaround constraint */ ! latype = v->nextvalue; NEXT(); s = newstate(v->nfa); s2 = newstate(v->nfa); *************** parseqatom(struct vars * v, *** 876,882 **** freesubre(v, t); /* internal structure irrelevant */ assert(SEE(')') || ISERR()); NEXT(); ! n = newlacon(v, s, s2, pos); NOERR(); ARCV(LACON, n); return; --- 881,887 ---- freesubre(v, t); /* internal structure irrelevant */ assert(SEE(')') || ISERR()); NEXT(); ! n = newlacon(v, s, s2, latype); NOERR(); ARCV(LACON, n); return; *************** nfatree(struct vars * v, *** 1826,1840 **** if (t->right != NULL) (DISCARD) nfatree(v, t->right, f); ! return nfanode(v, t, f); } /* ! * nfanode - do one NFA for nfatree */ static long /* optimize results */ nfanode(struct vars * v, struct subre * t, FILE *f) /* for debug output */ { struct nfa *nfa; --- 1831,1848 ---- if (t->right != NULL) (DISCARD) nfatree(v, t->right, f); ! return nfanode(v, t, 0, f); } /* ! * nfanode - do one NFA for nfatree or lacons ! * ! * If converttosearch is true, apply makesearch() to the NFA. */ static long /* optimize results */ nfanode(struct vars * v, struct subre * t, + int converttosearch, FILE *f) /* for debug output */ { struct nfa *nfa; *************** nfanode(struct vars * v, *** 1855,1864 **** NOERRZ(); dupnfa(nfa, t->begin, t->end, nfa->init, nfa->final); if (!ISERR()) - { specialcolors(nfa); ret = optimize(nfa, f); ! } if (!ISERR()) compact(nfa, &t->cnfa); --- 1863,1873 ---- NOERRZ(); dupnfa(nfa, t->begin, t->end, nfa->init, nfa->final); if (!ISERR()) specialcolors(nfa); + if (!ISERR()) ret = optimize(nfa, f); ! if (converttosearch && !ISERR()) ! makesearch(v, nfa); if (!ISERR()) compact(nfa, &t->cnfa); *************** nfanode(struct vars * v, *** 1867,1879 **** } /* ! * newlacon - allocate a lookahead-constraint subRE */ static int /* lacon number */ newlacon(struct vars * v, struct state * begin, struct state * end, ! int pos) { int n; struct subre *newlacons; --- 1876,1888 ---- } /* ! * newlacon - allocate a lookaround-constraint subRE */ static int /* lacon number */ newlacon(struct vars * v, struct state * begin, struct state * end, ! int latype) { int n; struct subre *newlacons; *************** newlacon(struct vars * v, *** 1900,1912 **** sub = &v->lacons[n]; sub->begin = begin; sub->end = end; ! sub->subno = pos; ZAPCNFA(sub->cnfa); return n; } /* ! * freelacons - free lookahead-constraint subRE vector */ static void freelacons(struct subre * subs, --- 1909,1921 ---- sub = &v->lacons[n]; sub->begin = begin; sub->end = end; ! sub->subno = latype; ZAPCNFA(sub->cnfa); return n; } /* ! * freelacons - free lookaround-constraint subRE vector */ static void freelacons(struct subre * subs, *************** dump(regex_t *re, *** 2020,2028 **** } for (i = 1; i < g->nlacons; i++) { ! fprintf(f, "\nla%d (%s):\n", i, ! (g->lacons[i].subno) ? "positive" : "negative"); ! dumpcnfa(&g->lacons[i].cnfa, f); } fprintf(f, "\n"); dumpst(g->tree, f, 0); --- 2029,2057 ---- } for (i = 1; i < g->nlacons; i++) { ! struct subre *lasub = &g->lacons[i]; ! const char *latype; ! ! switch (lasub->subno) ! { ! case LATYPE_AHEAD_POS: ! latype = "positive lookahead"; ! break; ! case LATYPE_AHEAD_NEG: ! latype = "negative lookahead"; ! break; ! case LATYPE_BEHIND_POS: ! latype = "positive lookbehind"; ! break; ! case LATYPE_BEHIND_NEG: ! latype = "negative lookbehind"; ! break; ! default: ! latype = "???"; ! break; ! } ! fprintf(f, "\nla%d (%s):\n", i, latype); ! dumpcnfa(&lasub->cnfa, f); } fprintf(f, "\n"); dumpst(g->tree, f, 0); diff --git a/src/backend/regex/rege_dfa.c b/src/backend/regex/rege_dfa.c index a37e4b0..4b6c516 100644 *** a/src/backend/regex/rege_dfa.c --- b/src/backend/regex/rege_dfa.c *************** miss(struct vars * v, *** 613,631 **** } /* ! * lacon - lookahead-constraint checker for miss() */ static int /* predicate: constraint satisfied? */ lacon(struct vars * v, struct cnfa * pcnfa, /* parent cnfa */ chr *cp, ! pcolor co) /* "color" of the lookahead constraint */ { int n; struct subre *sub; struct dfa *d; - struct smalldfa sd; chr *end; /* Since this is recursive, it could be driven to stack overflow */ if (STACK_TOO_DEEP(v->re)) --- 613,631 ---- } /* ! * lacon - lookaround-constraint checker for miss() */ static int /* predicate: constraint satisfied? */ lacon(struct vars * v, struct cnfa * pcnfa, /* parent cnfa */ chr *cp, ! pcolor co) /* "color" of the lookaround constraint */ { int n; struct subre *sub; struct dfa *d; chr *end; + int satisfied; /* Since this is recursive, it could be driven to stack overflow */ if (STACK_TOO_DEEP(v->re)) *************** lacon(struct vars * v, *** 635,653 **** } n = co - pcnfa->ncolors; ! assert(n < v->g->nlacons && v->g->lacons != NULL); FDEBUG(("=== testing lacon %d\n", n)); sub = &v->g->lacons[n]; ! d = newdfa(v, &sub->cnfa, &v->g->cmap, &sd); if (d == NULL) - { - ERR(REG_ESPACE); return 0; } ! end = longest(v, d, cp, v->stop, (int *) NULL); ! freedfa(d); ! FDEBUG(("=== lacon %d match %d\n", n, (end != NULL))); ! return (sub->subno) ? (end != NULL) : (end == NULL); } /* --- 635,673 ---- } n = co - pcnfa->ncolors; ! assert(n > 0 && n < v->g->nlacons && v->g->lacons != NULL); FDEBUG(("=== testing lacon %d\n", n)); sub = &v->g->lacons[n]; ! d = getladfa(v, n); if (d == NULL) return 0; + if (LATYPE_IS_AHEAD(sub->subno)) + { + /* used to use longest() here, but shortest() could be much cheaper */ + end = shortest(v, d, cp, cp, v->stop, + (chr **) NULL, (int *) NULL); } ! else ! { ! /* ! * In general a lookbehind LACON would need to be matched to the whole ! * string back to v->start. But as long as we check a given LACON at ! * successively increasing points (which is certainly the case within ! * any one run of longest() or shortest(), and is often true across ! * runs), we can start this check from the last no-progress point seen ! * in the prior check, if any. The NFA we're using is a search NFA, ! * so it doesn't mind scanning over stuff before the match. ! */ ! if (v->larestarts[n] > cp) ! v->larestarts[n] = v->start; ! end = shortest(v, d, v->larestarts[n], cp, cp, ! (chr **) NULL, (int *) NULL); ! /* shortest() won't do this on failure return, so do it ourselves */ ! v->larestarts[n] = lastcold(v, d); ! } ! satisfied = LATYPE_IS_POS(sub->subno) ? (end != NULL) : (end == NULL); ! FDEBUG(("=== lacon %d satisfied %d\n", n, satisfied)); ! return satisfied; } /* diff --git a/src/backend/regex/regexec.c b/src/backend/regex/regexec.c index 8a21f2c..ba644fb 100644 *** a/src/backend/regex/regexec.c --- b/src/backend/regex/regexec.c *************** struct vars *** 112,118 **** chr *search_start; /* search start of string */ chr *stop; /* just past end of string */ int err; /* error code if any (0 none) */ ! struct dfa **subdfas; /* per-subre DFAs */ struct smalldfa dfa1; struct smalldfa dfa2; }; --- 112,120 ---- chr *search_start; /* search start of string */ chr *stop; /* just past end of string */ int err; /* error code if any (0 none) */ ! struct dfa **subdfas; /* per-tree-subre DFAs */ ! struct dfa **ladfas; /* per-lacon-subre DFAs */ ! chr **larestarts; /* per-lacon-subre search restart pointers */ struct smalldfa dfa1; struct smalldfa dfa2; }; *************** struct vars *** 132,137 **** --- 134,140 ---- */ /* === regexec.c === */ static struct dfa *getsubdfa(struct vars *, struct subre *); + static struct dfa *getladfa(struct vars *, int); static int find(struct vars *, struct cnfa *, struct colormap *); static int cfind(struct vars *, struct cnfa *, struct colormap *); static int cfindloop(struct vars *, struct cnfa *, struct colormap *, struct dfa *, struct dfa *, chr **); *************** pg_regexec(regex_t *re, *** 226,246 **** v->search_start = (chr *) string + search_start; v->stop = (chr *) string + len; v->err = 0; assert(v->g->ntree >= 0); n = (size_t) v->g->ntree; if (n <= LOCALDFAS) v->subdfas = subdfas; else - v->subdfas = (struct dfa **) MALLOC(n * sizeof(struct dfa *)); - if (v->subdfas == NULL) { ! if (v->pmatch != pmatch && v->pmatch != mat) ! FREE(v->pmatch); ! return REG_ESPACE; } for (i = 0; i < n; i++) v->subdfas[i] = NULL; /* do it */ assert(v->g->tree != NULL); if (backref) --- 229,273 ---- v->search_start = (chr *) string + search_start; v->stop = (chr *) string + len; v->err = 0; + v->subdfas = NULL; + v->ladfas = NULL; + v->larestarts = NULL; + /* below this point, "goto cleanup" will behave sanely */ + assert(v->g->ntree >= 0); n = (size_t) v->g->ntree; if (n <= LOCALDFAS) v->subdfas = subdfas; else { ! v->subdfas = (struct dfa **) MALLOC(n * sizeof(struct dfa *)); ! if (v->subdfas == NULL) ! { ! st = REG_ESPACE; ! goto cleanup; ! } } for (i = 0; i < n; i++) v->subdfas[i] = NULL; + assert(v->g->nlacons >= 0); + n = (size_t) v->g->nlacons; + if (n > 0) + { + v->ladfas = (struct dfa **) MALLOC(n * sizeof(struct dfa *)); + v->larestarts = (chr **) MALLOC(n * sizeof(chr *)); + if (v->ladfas == NULL || v->larestarts == NULL) + { + st = REG_ESPACE; + goto cleanup; + } + for (i = 0; i < n; i++) + { + v->ladfas[i] = NULL; + v->larestarts[i] = v->start; + } + } + /* do it */ assert(v->g->tree != NULL); if (backref) *************** pg_regexec(regex_t *re, *** 257,278 **** } /* clean up */ if (v->pmatch != pmatch && v->pmatch != mat) FREE(v->pmatch); ! n = (size_t) v->g->ntree; ! for (i = 0; i < n; i++) { ! if (v->subdfas[i] != NULL) ! freedfa(v->subdfas[i]); } ! if (v->subdfas != subdfas) ! FREE(v->subdfas); return st; } /* ! * getsubdfa - create or re-fetch the DFA for a subre node * * We only need to create the DFA once per overall regex execution. * The DFA will be freed by the cleanup step in pg_regexec(). --- 284,321 ---- } /* clean up */ + cleanup: if (v->pmatch != pmatch && v->pmatch != mat) FREE(v->pmatch); ! if (v->subdfas != NULL) { ! n = (size_t) v->g->ntree; ! for (i = 0; i < n; i++) ! { ! if (v->subdfas[i] != NULL) ! freedfa(v->subdfas[i]); ! } ! if (v->subdfas != subdfas) ! FREE(v->subdfas); } ! if (v->ladfas != NULL) ! { ! n = (size_t) v->g->nlacons; ! for (i = 0; i < n; i++) ! { ! if (v->ladfas[i] != NULL) ! freedfa(v->ladfas[i]); ! } ! FREE(v->ladfas); ! } ! if (v->larestarts != NULL) ! FREE(v->larestarts); return st; } /* ! * getsubdfa - create or re-fetch the DFA for a tree subre node * * We only need to create the DFA once per overall regex execution. * The DFA will be freed by the cleanup step in pg_regexec(). *************** getsubdfa(struct vars * v, *** 291,296 **** --- 334,361 ---- } /* + * getladfa - create or re-fetch the DFA for a LACON subre node + * + * Same as above, but for LACONs. + */ + static struct dfa * + getladfa(struct vars * v, + int n) + { + assert(n > 0 && n < v->g->nlacons && v->g->lacons != NULL); + + if (v->ladfas[n] == NULL) + { + struct subre *sub = &v->g->lacons[n]; + + v->ladfas[n] = newdfa(v, &sub->cnfa, &v->g->cmap, DOMALLOC); + if (ISERR()) + return NULL; + } + return v->ladfas[n]; + } + + /* * find - find a match for the main NFA (no-complications case) */ static int diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c index c5524ae..9134071 100644 *** a/src/backend/regex/regexport.c --- b/src/backend/regex/regexport.c *************** *** 6,12 **** * In this implementation, the NFA defines a necessary but not sufficient * condition for a string to match the regex: that is, there can be strings * that match the NFA but don't match the full regex, but not vice versa. ! * Thus, for example, it is okay for the functions below to ignore lookahead * constraints, which merely constrain the string some more. * * Notice that these functions return info into caller-provided arrays --- 6,12 ---- * In this implementation, the NFA defines a necessary but not sufficient * condition for a string to match the regex: that is, there can be strings * that match the NFA but don't match the full regex, but not vice versa. ! * Thus, for example, it is okay for the functions below to ignore lookaround * constraints, which merely constrain the string some more. * * Notice that these functions return info into caller-provided arrays diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c index e78faf3..4ca4b48 100644 *** a/src/backend/regex/regprefix.c --- b/src/backend/regex/regprefix.c *************** static int findprefix(struct cnfa * cnfa *** 36,42 **** * the common prefix or exact value, of length *slength (measured in chrs * not bytes!). * ! * This function does not analyze all complex cases (such as lookahead * constraints) exactly. Therefore it is possible that some strings matching * the reported prefix or exact-match string do not satisfy the regex. But * it should never be the case that a string satisfying the regex does not --- 36,42 ---- * the common prefix or exact value, of length *slength (measured in chrs * not bytes!). * ! * This function does not analyze all complex cases (such as lookaround * constraints) exactly. Therefore it is possible that some strings matching * the reported prefix or exact-match string do not satisfy the regex. But * it should never be the case that a string satisfying the regex does not *************** findprefix(struct cnfa * cnfa, *** 162,168 **** thiscolor = COLORLESS; for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++) { ! /* We ignore lookahead constraints */ if (ca->co >= cnfa->ncolors) continue; /* We can also ignore BOS/BOL arcs */ --- 162,168 ---- thiscolor = COLORLESS; for (ca = cnfa->states[st]; ca->co != COLORLESS; ca++) { ! /* We ignore lookaround constraints */ if (ca->co >= cnfa->ncolors) continue; /* We can also ignore BOS/BOL arcs */ diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h index 5e1b692..2f89dc9 100644 *** a/src/include/regex/regex.h --- b/src/include/regex/regex.h *************** typedef struct *** 58,64 **** size_t re_nsub; /* number of subexpressions */ long re_info; /* information about RE */ #define REG_UBACKREF 000001 ! #define REG_ULOOKAHEAD 000002 #define REG_UBOUNDS 000004 #define REG_UBRACES 000010 #define REG_UBSALNUM 000020 --- 58,64 ---- size_t re_nsub; /* number of subexpressions */ long re_info; /* information about RE */ #define REG_UBACKREF 000001 ! #define REG_ULOOKAROUND 000002 #define REG_UBOUNDS 000004 #define REG_UBRACES 000010 #define REG_UBSALNUM 000020 diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index 19fe991..2ceffa6 100644 *** a/src/include/regex/regguts.h --- b/src/include/regex/regguts.h *************** *** 89,101 **** */ #define NOTREACHED 0 - #define xxx 1 #define DUPMAX _POSIX2_RE_DUP_MAX #define DUPINF (DUPMAX+1) #define REMAGIC 0xfed7 /* magic number for main struct */ /* --- 89,107 ---- */ #define NOTREACHED 0 #define DUPMAX _POSIX2_RE_DUP_MAX #define DUPINF (DUPMAX+1) #define REMAGIC 0xfed7 /* magic number for main struct */ + /* Type codes for lookaround constraints */ + #define LATYPE_AHEAD_POS 03 /* positive lookahead */ + #define LATYPE_AHEAD_NEG 02 /* negative lookahead */ + #define LATYPE_BEHIND_POS 01 /* positive lookbehind */ + #define LATYPE_BEHIND_NEG 00 /* negative lookbehind */ + #define LATYPE_IS_POS(la) ((la) & 01) + #define LATYPE_IS_AHEAD(la) ((la) & 02) /* *************** struct nfa *** 351,357 **** * * The non-dummy carc structs are of two types: plain arcs and LACON arcs. * Plain arcs just store the transition color number as "co". LACON arcs ! * store the lookahead constraint number plus cnfa.ncolors as "co". LACON * arcs can be distinguished from plain by testing for co >= cnfa.ncolors. */ struct carc --- 357,363 ---- * * The non-dummy carc structs are of two types: plain arcs and LACON arcs. * Plain arcs just store the transition color number as "co". LACON arcs ! * store the lookaround constraint number plus cnfa.ncolors as "co". LACON * arcs can be distinguished from plain by testing for co >= cnfa.ncolors. */ struct carc *************** struct cnfa *** 365,371 **** int nstates; /* number of states */ int ncolors; /* number of colors (max color in use + 1) */ int flags; ! #define HASLACONS 01 /* uses lookahead constraints */ int pre; /* setup state number */ int post; /* teardown state number */ color bos[2]; /* colors, if any, assigned to BOS and BOL */ --- 371,377 ---- int nstates; /* number of states */ int ncolors; /* number of colors (max color in use + 1) */ int flags; ! #define HASLACONS 01 /* uses lookaround constraints */ int pre; /* setup state number */ int post; /* teardown state number */ color bos[2]; /* colors, if any, assigned to BOS and BOL */ *************** struct subre *** 433,439 **** #define PREF2(f1, f2) ((PREF(f1) != 0) ? PREF(f1) : PREF(f2)) #define COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2)) short id; /* ID of subre (1..ntree-1) */ ! int subno; /* subexpression number (for 'b' and '(') */ short min; /* min repetitions for iteration or backref */ short max; /* max repetitions for iteration or backref */ struct subre *left; /* left child, if any (also freelist chain) */ --- 439,446 ---- #define PREF2(f1, f2) ((PREF(f1) != 0) ? PREF(f1) : PREF(f2)) #define COMBINE(f1, f2) (UP((f1)|(f2)) | PREF2(f1, f2)) short id; /* ID of subre (1..ntree-1) */ ! int subno; /* subexpression number for 'b' and '(', or ! * LATYPE code for lookaround constraint */ short min; /* min repetitions for iteration or backref */ short max; /* max repetitions for iteration or backref */ struct subre *left; /* left child, if any (also freelist chain) */ *************** struct guts *** 479,484 **** int ntree; /* number of subre's, plus one */ struct colormap cmap; int FUNCPTR(compare, (const chr *, const chr *, size_t)); ! struct subre *lacons; /* lookahead-constraint vector */ ! int nlacons; /* size of lacons */ }; --- 486,492 ---- int ntree; /* number of subre's, plus one */ struct colormap cmap; int FUNCPTR(compare, (const chr *, const chr *, size_t)); ! struct subre *lacons; /* lookaround-constraint vector */ ! int nlacons; /* size of lacons[]; note that only slots ! * numbered 1 .. nlacons-1 are used */ }; diff --git a/src/test/regress/expected/regex.out b/src/test/regress/expected/regex.out index 320f5e8..8e624cc 100644 *** a/src/test/regress/expected/regex.out --- b/src/test/regress/expected/regex.out *************** select substring('a' from '((a)+)'); *** 90,95 **** --- 90,197 ---- a (1 row) + -- Test lookahead constraints + select regexp_matches('ab', 'a(?=b)b*'); + regexp_matches + ---------------- + {ab} + (1 row) + + select regexp_matches('a', 'a(?=b)b*'); + regexp_matches + ---------------- + (0 rows) + + select regexp_matches('abc', 'a(?=b)b*(?=c)c*'); + regexp_matches + ---------------- + {abc} + (1 row) + + select regexp_matches('ab', 'a(?=b)b*(?=c)c*'); + regexp_matches + ---------------- + (0 rows) + + select regexp_matches('ab', 'a(?!b)b*'); + regexp_matches + ---------------- + (0 rows) + + select regexp_matches('a', 'a(?!b)b*'); + regexp_matches + ---------------- + {a} + (1 row) + + select regexp_matches('b', '(?=b)b'); + regexp_matches + ---------------- + {b} + (1 row) + + select regexp_matches('a', '(?=b)b'); + regexp_matches + ---------------- + (0 rows) + + -- Test lookbehind constraints + select regexp_matches('abb', '(?<=a)b*'); + regexp_matches + ---------------- + {bb} + (1 row) + + select regexp_matches('a', 'a(?<=a)b*'); + regexp_matches + ---------------- + {a} + (1 row) + + select regexp_matches('abc', 'a(?<=a)b*(?<=b)c*'); + regexp_matches + ---------------- + {abc} + (1 row) + + select regexp_matches('ab', 'a(?<=a)b*(?<=b)c*'); + regexp_matches + ---------------- + {ab} + (1 row) + + select regexp_matches('ab', 'a(?<!a)b*'); + regexp_matches + ---------------- + (0 rows) + + select regexp_matches('a', 'a(?<!a)b*'); + regexp_matches + ---------------- + (0 rows) + + select regexp_matches('b', '(?<=b)b'); + regexp_matches + ---------------- + (0 rows) + + select regexp_matches('foobar', '(?<=f)b+'); + regexp_matches + ---------------- + (0 rows) + + select regexp_matches('foobar', '(?<=foo)b+'); + regexp_matches + ---------------- + {b} + (1 row) + + select regexp_matches('foobar', '(?<=oo)b+'); + regexp_matches + ---------------- + {b} + (1 row) + -- Test conversion of regex patterns to indexable conditions explain (costs off) select * from pg_proc where proname ~ 'abc'; QUERY PLAN diff --git a/src/test/regress/sql/regex.sql b/src/test/regress/sql/regex.sql index 5412f6e..ae737e1 100644 *** a/src/test/regress/sql/regex.sql --- b/src/test/regress/sql/regex.sql *************** select substring('asd TO foo' from ' TO *** 25,30 **** --- 25,52 ---- select substring('a' from '((a))+'); select substring('a' from '((a)+)'); + -- Test lookahead constraints + select regexp_matches('ab', 'a(?=b)b*'); + select regexp_matches('a', 'a(?=b)b*'); + select regexp_matches('abc', 'a(?=b)b*(?=c)c*'); + select regexp_matches('ab', 'a(?=b)b*(?=c)c*'); + select regexp_matches('ab', 'a(?!b)b*'); + select regexp_matches('a', 'a(?!b)b*'); + select regexp_matches('b', '(?=b)b'); + select regexp_matches('a', '(?=b)b'); + + -- Test lookbehind constraints + select regexp_matches('abb', '(?<=a)b*'); + select regexp_matches('a', 'a(?<=a)b*'); + select regexp_matches('abc', 'a(?<=a)b*(?<=b)c*'); + select regexp_matches('ab', 'a(?<=a)b*(?<=b)c*'); + select regexp_matches('ab', 'a(?<!a)b*'); + select regexp_matches('a', 'a(?<!a)b*'); + select regexp_matches('b', '(?<=b)b'); + select regexp_matches('foobar', '(?<=f)b+'); + select regexp_matches('foobar', '(?<=foo)b+'); + select regexp_matches('foobar', '(?<=oo)b+'); + -- Test conversion of regex patterns to indexable conditions explain (costs off) select * from pg_proc where proname ~ 'abc'; explain (costs off) select * from pg_proc where proname ~ '^abc';
pgsql-hackers by date: