Re: SQL-spec incompatibilities in similar_escape() and related stuff - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: SQL-spec incompatibilities in similar_escape() and related stuff |
Date | |
Msg-id | 32617.1558649424@sss.pgh.pa.us Whole thread Raw |
In response to | Re: SQL-spec incompatibilities in similar_escape() and related stuff (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: SQL-spec incompatibilities in similar_escape() and related stuff
Re: SQL-spec incompatibilities in similar_escape() and related stuff |
List | pgsql-hackers |
I wrote: > I propose therefore that we leave similar_escape in place with its > current behavior, as a compatibility measure for cases like this. > Intead, invent two new strict functions, say > similar_to_escape(pattern) > similar_to_escape(pattern, escape) > and change the parser and the implementation of SUBSTRING() to > rely on these going forward. > The net effect will be to make explicit "ESCAPE NULL" spec-compliant, > and to get rid of the performance problem from inlining failure for > substring(). All else is just doc clarifications. Here's a proposed patch for that. I think it's a bit too late to be messing with this kind of thing for v12, so I'll add this to the upcoming CF. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index a79e7c0..7d4472a 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -4146,6 +4146,14 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation> </para> <para> + According to the SQL standard, omitting <literal>ESCAPE</literal> + means there is no escape character (rather than defaulting to a + backslash), and a zero-length <literal>ESCAPE</literal> value is + disallowed. <productname>PostgreSQL</productname>'s behavior in + this regard is therefore slightly nonstandard. + </para> + + <para> The key word <token>ILIKE</token> can be used instead of <token>LIKE</token> to make the match case-insensitive according to the active locale. This is not in the <acronym>SQL</acronym> standard but is a @@ -4280,9 +4288,27 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation> </para> <para> - As with <function>LIKE</function>, a backslash disables the special meaning - of any of these metacharacters; or a different escape character can - be specified with <literal>ESCAPE</literal>. + As with <function>LIKE</function>, a backslash disables the special + meaning of any of these metacharacters. A different escape character + can be specified with <literal>ESCAPE</literal>, or the escape + capability can be disabled by writing <literal>ESCAPE ''</literal>. + </para> + + <para> + According to the SQL standard, omitting <literal>ESCAPE</literal> + means there is no escape character (rather than defaulting to a + backslash), and a zero-length <literal>ESCAPE</literal> value is + disallowed. <productname>PostgreSQL</productname>'s behavior in + this regard is therefore slightly nonstandard. + </para> + + <para> + Another nonstandard extension is that following the escape character + with a letter or digit provides access to the same escape sequences + defined for POSIX regular expressions, below (see + <xref linkend="posix-character-entry-escapes-table"/>, + <xref linkend="posix-class-shorthand-escapes-table"/>, and + <xref linkend="posix-constraint-escapes-table"/>). </para> <para> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 8311b1d..6462e13 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -13073,15 +13073,15 @@ a_expr: c_expr { $$ = $1; } | a_expr SIMILAR TO a_expr %prec SIMILAR { - FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"), - list_make2($4, makeNullAConst(-1)), + FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"), + list_make1($4), @2); $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~", $1, (Node *) n, @2); } | a_expr SIMILAR TO a_expr ESCAPE a_expr %prec SIMILAR { - FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"), + FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"), list_make2($4, $6), @2); $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "~", @@ -13089,15 +13089,15 @@ a_expr: c_expr { $$ = $1; } } | a_expr NOT_LA SIMILAR TO a_expr %prec NOT_LA { - FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"), - list_make2($5, makeNullAConst(-1)), + FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"), + list_make1($5), @2); $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~", $1, (Node *) n, @2); } | a_expr NOT_LA SIMILAR TO a_expr ESCAPE a_expr %prec NOT_LA { - FuncCall *n = makeFuncCall(SystemFuncName("similar_escape"), + FuncCall *n = makeFuncCall(SystemFuncName("similar_to_escape"), list_make2($5, $7), @2); $$ = (Node *) makeSimpleA_Expr(AEXPR_SIMILAR, "!~", @@ -14323,9 +14323,9 @@ subquery_Op: | NOT_LA ILIKE { $$ = list_make1(makeString("!~~*")); } /* cannot put SIMILAR TO here, because SIMILAR TO is a hack. - * the regular expression is preprocessed by a function (similar_escape), + * the regular expression is preprocessed by a function (similar_to_escape), * and the ~ operator for posix regular expressions is used. - * x SIMILAR TO y -> x ~ similar_escape(y) + * x SIMILAR TO y -> x ~ similar_to_escape(y) * this transformation is made on the fly by the parser upwards. * however the SubLink structure which handles any/some/all stuff * is not ready for such a thing. diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c index 90a9197..3d38aef 100644 --- a/src/backend/utils/adt/regexp.c +++ b/src/backend/utils/adt/regexp.c @@ -654,15 +654,18 @@ textregexreplace(PG_FUNCTION_ARGS) } /* - * similar_escape() - * Convert a SQL:2008 regexp pattern to POSIX style, so it can be used by - * our regexp engine. + * similar_to_escape(), similar_escape() + * + * Convert a SQL "SIMILAR TO" regexp pattern to POSIX style, so it can be + * used by our regexp engine. + * + * similar_escape_internal() is the common workhorse for three SQL-exposed + * functions. esc_text can be passed as NULL to select the default escape + * (which is '\'), or as an empty string to select no escape character. */ -Datum -similar_escape(PG_FUNCTION_ARGS) +static text * +similar_escape_internal(text *pat_text, text *esc_text) { - text *pat_text; - text *esc_text; text *result; char *p, *e, @@ -673,13 +676,9 @@ similar_escape(PG_FUNCTION_ARGS) bool incharclass = false; int nquotes = 0; - /* This function is not strict, so must test explicitly */ - if (PG_ARGISNULL(0)) - PG_RETURN_NULL(); - pat_text = PG_GETARG_TEXT_PP(0); p = VARDATA_ANY(pat_text); plen = VARSIZE_ANY_EXHDR(pat_text); - if (PG_ARGISNULL(1)) + if (esc_text == NULL) { /* No ESCAPE clause provided; default to backslash as escape */ e = "\\"; @@ -687,12 +686,11 @@ similar_escape(PG_FUNCTION_ARGS) } else { - esc_text = PG_GETARG_TEXT_PP(1); e = VARDATA_ANY(esc_text); elen = VARSIZE_ANY_EXHDR(esc_text); if (elen == 0) e = NULL; /* no escape character */ - else + else if (elen > 1) { int escape_mblen = pg_mbstrlen_with_len(e, elen); @@ -898,6 +896,65 @@ similar_escape(PG_FUNCTION_ARGS) SET_VARSIZE(result, r - ((char *) result)); + return result; +} + +/* + * similar_to_escape(pattern, escape) + */ +Datum +similar_to_escape_2(PG_FUNCTION_ARGS) +{ + text *pat_text = PG_GETARG_TEXT_PP(0); + text *esc_text = PG_GETARG_TEXT_PP(1); + text *result; + + result = similar_escape_internal(pat_text, esc_text); + + PG_RETURN_TEXT_P(result); +} + +/* + * similar_to_escape(pattern) + * Inserts a default escape character. + */ +Datum +similar_to_escape_1(PG_FUNCTION_ARGS) +{ + text *pat_text = PG_GETARG_TEXT_PP(0); + text *result; + + result = similar_escape_internal(pat_text, NULL); + + PG_RETURN_TEXT_P(result); +} + +/* + * similar_escape(pattern, escape) + * + * Legacy function for compatibility with views stored using the + * pre-v13 expansion of SIMILAR TO. Unlike the above functions, this + * is non-strict, which leads to not-per-spec handling of "ESCAPE NULL". + */ +Datum +similar_escape(PG_FUNCTION_ARGS) +{ + text *pat_text; + text *esc_text; + text *result; + + /* This function is not strict, so must test explicitly */ + if (PG_ARGISNULL(0)) + PG_RETURN_NULL(); + pat_text = PG_GETARG_TEXT_PP(0); + + if (PG_ARGISNULL(1)) + esc_text = NULL; /* use default escape character */ + else + esc_text = PG_GETARG_TEXT_PP(1); + + result = similar_escape_internal(pat_text, esc_text); + PG_RETURN_TEXT_P(result); } diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 8733524..15045d0 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -3321,9 +3321,15 @@ proname => 'repeat', prorettype => 'text', proargtypes => 'text int4', prosrc => 'repeat' }, -{ oid => '1623', descr => 'convert SQL99 regexp pattern to POSIX style', +{ oid => '1623', descr => 'convert SQL regexp pattern to POSIX style', proname => 'similar_escape', proisstrict => 'f', prorettype => 'text', proargtypes => 'text text', prosrc => 'similar_escape' }, +{ oid => '1986', descr => 'convert SQL regexp pattern to POSIX style', + proname => 'similar_to_escape', prorettype => 'text', + proargtypes => 'text text', prosrc => 'similar_to_escape_2' }, +{ oid => '1987', descr => 'convert SQL regexp pattern to POSIX style', + proname => 'similar_to_escape', prorettype => 'text', proargtypes => 'text', + prosrc => 'similar_to_escape_1' }, { oid => '1624', proname => 'mul_d_interval', prorettype => 'interval', @@ -5743,10 +5749,10 @@ { oid => '2073', descr => 'extract text matching regular expression', proname => 'substring', prorettype => 'text', proargtypes => 'text text', prosrc => 'textregexsubstr' }, -{ oid => '2074', descr => 'extract text matching SQL99 regular expression', +{ oid => '2074', descr => 'extract text matching SQL regular expression', proname => 'substring', prolang => 'sql', prorettype => 'text', proargtypes => 'text text text', - prosrc => 'select pg_catalog.substring($1, pg_catalog.similar_escape($2, $3))' }, + prosrc => 'select pg_catalog.substring($1, pg_catalog.similar_to_escape($2, $3))' }, { oid => '2075', descr => 'convert int8 to bitstring', proname => 'bit', prorettype => 'bit', proargtypes => 'int8 int4', diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out index 486c00b..2483966 100644 --- a/src/test/regress/expected/strings.out +++ b/src/test/regress/expected/strings.out @@ -410,7 +410,56 @@ SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde"; cde (1 row) --- PostgreSQL extension to allow using back reference in replace string; +-- Check behavior of SIMILAR TO, which uses largely the same regexp variant +SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true; + true +------ + t +(1 row) + +SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false; + false +------- + f +(1 row) + +SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false; + false +------- + f +(1 row) + +SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true; + true +------ + t +(1 row) + +-- Postgres uses '\' as the default escape character, which is not per spec +SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false; + false +------- + f +(1 row) + +-- and an empty string to mean "no escape", which is also not per spec +SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true; + true +------ + t +(1 row) + +-- these behaviors are per spec, though: +SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null; + null +------ + +(1 row) + +SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error; +ERROR: invalid escape string +HINT: Escape string must be empty or one character. +-- Test back reference in regexp_replace SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3'); regexp_replace ---------------- diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql index 5744c9f..b5e75c3 100644 --- a/src/test/regress/sql/strings.sql +++ b/src/test/regress/sql/strings.sql @@ -144,7 +144,20 @@ SELECT SUBSTRING('abcdefg' FROM 'c.e') AS "cde"; -- With a parenthesized subexpression, return only what matches the subexpr SELECT SUBSTRING('abcdefg' FROM 'b(.*)f') AS "cde"; --- PostgreSQL extension to allow using back reference in replace string; +-- Check behavior of SIMILAR TO, which uses largely the same regexp variant +SELECT 'abcdefg' SIMILAR TO '_bcd%' AS true; +SELECT 'abcdefg' SIMILAR TO 'bcd%' AS false; +SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '#' AS false; +SELECT 'abcd%' SIMILAR TO '_bcd#%' ESCAPE '#' AS true; +-- Postgres uses '\' as the default escape character, which is not per spec +SELECT 'abcdefg' SIMILAR TO '_bcd\%' AS false; +-- and an empty string to mean "no escape", which is also not per spec +SELECT 'abcd\efg' SIMILAR TO '_bcd\%' ESCAPE '' AS true; +-- these behaviors are per spec, though: +SELECT 'abcdefg' SIMILAR TO '_bcd%' ESCAPE NULL AS null; +SELECT 'abcdefg' SIMILAR TO '_bcd#%' ESCAPE '##' AS error; + +-- Test back reference in regexp_replace SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3'); SELECT regexp_replace('AAA BBB CCC ', E'\\s+', ' ', 'g'); SELECT regexp_replace('AAA', '^|$', 'Z', 'g');
pgsql-hackers by date: