Re: Fix number skipping in to_number - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | Re: Fix number skipping in to_number |
Date | |
Msg-id | 5597.1510781328@sss.pgh.pa.us Whole thread Raw |
In response to | Re: Fix number skipping in to_number (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: Fix number skipping in to_number
Re: Fix number skipping in to_number |
List | pgsql-hackers |
I wrote: > Oliver Ford <ojford@gmail.com> writes: >> On Monday, 13 November 2017, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I don't follow your concern? If "$" is not the correct currency >>> symbol for the locale, we shouldn't accept it as a match to an L format. >>> Your patch is tightening what we will accept as a match to a G format, >>> so I don't see why you're concerned about backward compatibility in >>> one case but not the other. >> It's a guess as to the likely use case. I would imagine that people are >> likely to use a currency symbol different from the locale, but unlikely to >> use a different group separator. Others might have a different opinion >> though. > Well, if they use a currency symbol different from the locale's, they're > in trouble anyway because the number of bytes might be different. In most > encodings, symbols other than "$" are probably not 1-byte characters. > At the very least I think we need to constrain it enough that it not > swallow a fractional character. After more testing I understood your concern about L_currency_symbol: in C locale that's " ", not "$" as I naively imagined. Americans, at least, would be pretty unhappy if "$1234.56" suddenly stopped matching "L9999.99". So it seems like we can't institute a strict matching rule. However, it is certainly not good that this happens: regression=# select to_number('1234.56', 'L9999.99'); to_number ----------- 234.56 (1 row) To me that seems just as bad as having ',' or 'G' eat a digit. After some reflection I propose that the rule that we want is: * ',' and 'G' consume input only if it exactly matches the expected separator. * Other non-data template patterns consume a number of input characters equal to the number of characters they'd produce in output, *except* that these patterns will not consume data characters (digits, signs, decimal point, comma). I think that while we are at it we should take some measures to ensure that "character" in this definition means "character", not "byte". It is not good that a euro currency symbol might consume an encoding-dependent number of input characters. That leads me to the attached patch. There is more that could be done here --- in particular, I'd like to see the character-not-byte-count rule extended to literal text. But that seems like fit material for a different patch. Also, I noticed that in your form of the patch, the strncmp() could read past the end of the string, possibly resulting in a crash. So I made it use the AMOUNT_TEST infrastructure from NUM_numpart_from_char to avoid that. One other note: I realized that it was only pure luck that your regression test cases worked in locales where 'G' is decimal point --- they still gave the same answer, but through a totally different interpretation of the input. That did not seem like a good idea, so I adjusted the regression test to force C locale for the to_number() tests. I wish we could use some other locale here, but then it likely wouldn't be portable to Windows. regards, tom lane diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index f901567..35a845c 100644 *** a/doc/src/sgml/func.sgml --- b/doc/src/sgml/func.sgml *************** SELECT regexp_match('abc01234xyz', '(?:( *** 5850,5856 **** data based on the given value. Any text that is not a template pattern is simply copied verbatim. Similarly, in an input template string (for the other functions), template patterns identify the values to be supplied by ! the input data string. </para> <para> --- 5850,5859 ---- data based on the given value. Any text that is not a template pattern is simply copied verbatim. Similarly, in an input template string (for the other functions), template patterns identify the values to be supplied by ! the input data string. If there are characters in the template string ! that are not template patterns, the corresponding characters in the input ! data string are simply skipped over (whether or not they are equal to the ! template string characters). </para> <para> *************** SELECT regexp_match('abc01234xyz', '(?:( *** 6176,6188 **** Ordinary text is allowed in <function>to_char</function> templates and will be output literally. You can put a substring in double quotes to force it to be interpreted as literal text ! even if it contains pattern key words. For example, in <literal>'"Hello Year "YYYY'</literal>, the <literal>YYYY</literal> will be replaced by the year data, but the single <literal>Y</literal> in <literal>Year</literal> ! will not be. In <function>to_date</function>, <function>to_number</function>, ! and <function>to_timestamp</function>, double-quoted strings skip the number of ! input characters contained in the string, e.g. <literal>"XX"</literal> ! skips two input characters. </para> </listitem> --- 6179,6193 ---- Ordinary text is allowed in <function>to_char</function> templates and will be output literally. You can put a substring in double quotes to force it to be interpreted as literal text ! even if it contains template patterns. For example, in <literal>'"Hello Year "YYYY'</literal>, the <literal>YYYY</literal> will be replaced by the year data, but the single <literal>Y</literal> in <literal>Year</literal> ! will not be. ! In <function>to_date</function>, <function>to_number</function>, ! and <function>to_timestamp</function>, literal text and double-quoted ! strings result in skipping the number of characters contained in the ! string; for example <literal>"XX"</literal> skips two input characters ! (whether or not they are <literal>XX</literal>). </para> </listitem> *************** SELECT regexp_match('abc01234xyz', '(?:( *** 6485,6490 **** --- 6490,6506 ---- <listitem> <para> + In <function>to_number</function>, if non-data template patterns such + as <literal>L</literal> or <literal>TH</literal> are used, the + corresponding number of input characters are skipped, whether or not + they match the template pattern, unless they are data characters + (that is, digits, sign, decimal point, or comma). For + example, <literal>TH</literal> would skip two non-data characters. + </para> + </listitem> + + <listitem> + <para> <literal>V</literal> with <function>to_char</function> multiplies the input values by <literal>10^<replaceable>n</replaceable></literal>, where diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c index 50254f2..5afc293 100644 *** a/src/backend/utils/adt/formatting.c --- b/src/backend/utils/adt/formatting.c *************** static char *get_last_relevant_decnum(ch *** 988,994 **** static void NUM_numpart_from_char(NUMProc *Np, int id, int input_len); static void NUM_numpart_to_char(NUMProc *Np, int id); static char *NUM_processor(FormatNode *node, NUMDesc *Num, char *inout, ! char *number, int from_char_input_len, int to_char_out_pre_spaces, int sign, bool is_to_char, Oid collid); static DCHCacheEntry *DCH_cache_getnew(const char *str); static DCHCacheEntry *DCH_cache_search(const char *str); --- 988,994 ---- static void NUM_numpart_from_char(NUMProc *Np, int id, int input_len); static void NUM_numpart_to_char(NUMProc *Np, int id); static char *NUM_processor(FormatNode *node, NUMDesc *Num, char *inout, ! char *number, int input_len, int to_char_out_pre_spaces, int sign, bool is_to_char, Oid collid); static DCHCacheEntry *DCH_cache_getnew(const char *str); static DCHCacheEntry *DCH_cache_search(const char *str); *************** get_last_relevant_decnum(char *num) *** 4232,4237 **** --- 4232,4245 ---- return result; } + /* + * These macros are used in NUM_processor() and its subsidiary routines. + * OVERLOAD_TEST: true if we've reached end of input string + * AMOUNT_TEST(s): true if at least s characters remain in string + */ + #define OVERLOAD_TEST (Np->inout_p >= Np->inout + input_len) + #define AMOUNT_TEST(s) (Np->inout_p <= Np->inout + (input_len - (s))) + /* ---------- * Number extraction for TO_NUMBER() * ---------- *************** NUM_numpart_from_char(NUMProc *Np, int i *** 4246,4254 **** (id == NUM_0 || id == NUM_9) ? "NUM_0/9" : id == NUM_DEC ? "NUM_DEC" : "???"); #endif - #define OVERLOAD_TEST (Np->inout_p >= Np->inout + input_len) - #define AMOUNT_TEST(_s) (input_len-(Np->inout_p-Np->inout) >= _s) - if (OVERLOAD_TEST) return; --- 4254,4259 ---- *************** NUM_numpart_to_char(NUMProc *Np, int id) *** 4641,4654 **** ++Np->num_curr; } static char * NUM_processor(FormatNode *node, NUMDesc *Num, char *inout, ! char *number, int from_char_input_len, int to_char_out_pre_spaces, int sign, bool is_to_char, Oid collid) { FormatNode *n; NUMProc _Np, *Np = &_Np; MemSet(Np, 0, sizeof(NUMProc)); --- 4646,4677 ---- ++Np->num_curr; } + /* + * Skip over "n" input characters, but only if they aren't numeric data + */ + static void + NUM_eat_non_data_chars(NUMProc *Np, int n, int input_len) + { + while (n-- > 0) + { + if (OVERLOAD_TEST) + break; /* end of input */ + if (strchr("0123456789.,+-", *Np->inout_p) != NULL) + break; /* it's a data character */ + Np->inout_p += pg_mblen(Np->inout_p); + } + } + static char * NUM_processor(FormatNode *node, NUMDesc *Num, char *inout, ! char *number, int input_len, int to_char_out_pre_spaces, int sign, bool is_to_char, Oid collid) { FormatNode *n; NUMProc _Np, *Np = &_Np; + const char *pattern; + int pattern_len; MemSet(Np, 0, sizeof(NUMProc)); *************** NUM_processor(FormatNode *node, NUMDesc *** 4816,4824 **** if (!Np->is_to_char) { /* ! * Check non-string inout end */ ! if (Np->inout_p >= Np->inout + from_char_input_len) break; } --- 4839,4849 ---- if (!Np->is_to_char) { /* ! * Check at least one character remains to be scanned. (In ! * actions below, must use AMOUNT_TEST if we want to read more ! * characters than that.) */ ! if (OVERLOAD_TEST) break; } *************** NUM_processor(FormatNode *node, NUMDesc *** 4828,4839 **** if (n->type == NODE_TYPE_ACTION) { /* ! * Create/reading digit/zero/blank/sing * * 'NUM_S' note: The locale sign is anchored to number and we * read/write it when we work with first or last number ! * (NUM_0/NUM_9). This is reason why NUM_S missing in follow ! * switch(). */ switch (n->key->id) { --- 4853,4868 ---- if (n->type == NODE_TYPE_ACTION) { /* ! * Create/read digit/zero/blank/sign/special-case * * 'NUM_S' note: The locale sign is anchored to number and we * read/write it when we work with first or last number ! * (NUM_0/NUM_9). This is why NUM_S is missing in switch(). ! * ! * Notice the "Np->inout_p++" at the bottom of the loop. This is ! * why most of the actions advance inout_p one less than you might ! * expect. In cases where we don't want that increment to happen, ! * a switch case ends with "continue" not "break". */ switch (n->key->id) { *************** NUM_processor(FormatNode *node, NUMDesc *** 4848,4854 **** } else { ! NUM_numpart_from_char(Np, n->key->id, from_char_input_len); break; /* switch() case: */ } --- 4877,4883 ---- } else { ! NUM_numpart_from_char(Np, n->key->id, input_len); break; /* switch() case: */ } *************** NUM_processor(FormatNode *node, NUMDesc *** 4872,4881 **** --- 4901,4914 ---- if (IS_FILLMODE(Np->Num)) continue; } + if (*Np->inout_p != ',') + continue; } break; case NUM_G: + pattern = Np->L_thousands_sep; + pattern_len = strlen(pattern); if (Np->is_to_char) { if (!Np->num_in) *************** NUM_processor(FormatNode *node, NUMDesc *** 4884,4899 **** continue; else { ! int x = strlen(Np->L_thousands_sep); ! ! memset(Np->inout_p, ' ', x); ! Np->inout_p += x - 1; } } else { ! strcpy(Np->inout_p, Np->L_thousands_sep); ! Np->inout_p += strlen(Np->inout_p) - 1; } } else --- 4917,4932 ---- continue; else { ! /* just in case there are MB chars */ ! pattern_len = pg_mbstrlen(pattern); ! memset(Np->inout_p, ' ', pattern_len); ! Np->inout_p += pattern_len - 1; } } else { ! strcpy(Np->inout_p, pattern); ! Np->inout_p += pattern_len - 1; } } else *************** NUM_processor(FormatNode *node, NUMDesc *** 4903,4920 **** if (IS_FILLMODE(Np->Num)) continue; } ! Np->inout_p += strlen(Np->L_thousands_sep) - 1; } break; case NUM_L: if (Np->is_to_char) { ! strcpy(Np->inout_p, Np->L_currency_symbol); ! Np->inout_p += strlen(Np->inout_p) - 1; } else ! Np->inout_p += strlen(Np->L_currency_symbol) - 1; break; case NUM_RN: --- 4936,4968 ---- if (IS_FILLMODE(Np->Num)) continue; } ! ! /* ! * Because L_thousands_sep typically contains data ! * characters (either '.' or ','), we can't use ! * NUM_eat_non_data_chars here. Instead skip only if ! * the input matches L_thousands_sep. ! */ ! if (AMOUNT_TEST(pattern_len) && ! strncmp(Np->inout_p, pattern, pattern_len) == 0) ! Np->inout_p += pattern_len - 1; ! else ! continue; } break; case NUM_L: + pattern = Np->L_currency_symbol; if (Np->is_to_char) { ! strcpy(Np->inout_p, pattern); ! Np->inout_p += strlen(pattern) - 1; } else ! { ! NUM_eat_non_data_chars(Np, pg_mbstrlen(pattern), input_len); ! continue; ! } break; case NUM_RN: *************** NUM_processor(FormatNode *node, NUMDesc *** 4949,4956 **** continue; if (Np->is_to_char) strcpy(Np->inout_p, get_th(Np->number, TH_LOWER)); ! Np->inout_p += 1; break; case NUM_TH: --- 4997,5012 ---- continue; if (Np->is_to_char) + { strcpy(Np->inout_p, get_th(Np->number, TH_LOWER)); ! Np->inout_p += 1; ! } ! else ! { ! /* All variants of 'th' occupy 2 characters */ ! NUM_eat_non_data_chars(Np, 2, input_len); ! continue; ! } break; case NUM_TH: *************** NUM_processor(FormatNode *node, NUMDesc *** 4959,4966 **** continue; if (Np->is_to_char) strcpy(Np->inout_p, get_th(Np->number, TH_UPPER)); ! Np->inout_p += 1; break; case NUM_MI: --- 5015,5030 ---- continue; if (Np->is_to_char) + { strcpy(Np->inout_p, get_th(Np->number, TH_UPPER)); ! Np->inout_p += 1; ! } ! else ! { ! /* All variants of 'TH' occupy 2 characters */ ! NUM_eat_non_data_chars(Np, 2, input_len); ! continue; ! } break; case NUM_MI: *************** NUM_processor(FormatNode *node, NUMDesc *** 4977,4982 **** --- 5041,5051 ---- { if (*Np->inout_p == '-') *Np->number = '-'; + else + { + NUM_eat_non_data_chars(Np, 1, input_len); + continue; + } } break; *************** NUM_processor(FormatNode *node, NUMDesc *** 4994,5016 **** { if (*Np->inout_p == '+') *Np->number = '+'; } break; case NUM_SG: if (Np->is_to_char) *Np->inout_p = Np->sign; - else { if (*Np->inout_p == '-') *Np->number = '-'; else if (*Np->inout_p == '+') *Np->number = '+'; } break; - default: continue; break; --- 5063,5093 ---- { if (*Np->inout_p == '+') *Np->number = '+'; + else + { + NUM_eat_non_data_chars(Np, 1, input_len); + continue; + } } break; case NUM_SG: if (Np->is_to_char) *Np->inout_p = Np->sign; else { if (*Np->inout_p == '-') *Np->number = '-'; else if (*Np->inout_p == '+') *Np->number = '+'; + else + { + NUM_eat_non_data_chars(Np, 1, input_len); + continue; + } } break; default: continue; break; *************** NUM_processor(FormatNode *node, NUMDesc *** 5019,5025 **** else { /* ! * Remove to output char from input in TO_CHAR */ if (Np->is_to_char) *Np->inout_p = n->character; --- 5096,5107 ---- else { /* ! * In TO_CHAR, non-pattern characters in the format are copied to ! * the output. In TO_NUMBER, we skip one input character for each ! * non-pattern format character, whether or not it matches the ! * format character. (Currently, that's actually implemented as ! * skipping one input byte per non-pattern format byte, which is ! * wrong...) */ if (Np->is_to_char) *Np->inout_p = n->character; diff --git a/src/test/regress/expected/numeric.out b/src/test/regress/expected/numeric.out index 7e55b0e..a96bfc0 100644 *** a/src/test/regress/expected/numeric.out --- b/src/test/regress/expected/numeric.out *************** SELECT '' AS to_char_26, to_char('100':: *** 1219,1224 **** --- 1219,1225 ---- -- TO_NUMBER() -- + SET lc_numeric = 'C'; SELECT '' AS to_number_1, to_number('-34,338,492', '99G999G999'); to_number_1 | to_number -------------+----------- *************** SELECT '' AS to_number_13, to_number(' . *** 1297,1302 **** --- 1298,1358 ---- | -0.01 (1 row) + SELECT '' AS to_number_14, to_number('34,50','999,99'); + to_number_14 | to_number + --------------+----------- + | 3450 + (1 row) + + SELECT '' AS to_number_15, to_number('123,000','999G'); + to_number_15 | to_number + --------------+----------- + | 123 + (1 row) + + SELECT '' AS to_number_16, to_number('123456','999G999'); + to_number_16 | to_number + --------------+----------- + | 123456 + (1 row) + + SELECT '' AS to_number_17, to_number('$1234.56','L9,999.99'); + to_number_17 | to_number + --------------+----------- + | 1234.56 + (1 row) + + SELECT '' AS to_number_18, to_number('$1234.56','L99,999.99'); + to_number_18 | to_number + --------------+----------- + | 1234.56 + (1 row) + + SELECT '' AS to_number_19, to_number('$1,234.56','L99,999.99'); + to_number_19 | to_number + --------------+----------- + | 1234.56 + (1 row) + + SELECT '' AS to_number_20, to_number('1234.56','L99,999.99'); + to_number_20 | to_number + --------------+----------- + | 1234.56 + (1 row) + + SELECT '' AS to_number_21, to_number('1,234.56','L99,999.99'); + to_number_21 | to_number + --------------+----------- + | 1234.56 + (1 row) + + SELECT '' AS to_number_22, to_number('42nd', '99th'); + to_number_22 | to_number + --------------+----------- + | 42 + (1 row) + + RESET lc_numeric; -- -- Input syntax -- diff --git a/src/test/regress/sql/numeric.sql b/src/test/regress/sql/numeric.sql index 9675b6e..321c7bd 100644 *** a/src/test/regress/sql/numeric.sql --- b/src/test/regress/sql/numeric.sql *************** SELECT '' AS to_char_26, to_char('100':: *** 788,793 **** --- 788,794 ---- -- TO_NUMBER() -- + SET lc_numeric = 'C'; SELECT '' AS to_number_1, to_number('-34,338,492', '99G999G999'); SELECT '' AS to_number_2, to_number('-34,338,492.654,878', '99G999G999D999G999'); SELECT '' AS to_number_3, to_number('<564646.654564>', '999999.999999PR'); *************** SELECT '' AS to_number_10, to_number('0' *** 801,806 **** --- 802,817 ---- SELECT '' AS to_number_11, to_number('.-01', 'S99.99'); SELECT '' AS to_number_12, to_number('.01-', '99.99S'); SELECT '' AS to_number_13, to_number(' . 0 1-', ' 9 9 . 9 9 S'); + SELECT '' AS to_number_14, to_number('34,50','999,99'); + SELECT '' AS to_number_15, to_number('123,000','999G'); + SELECT '' AS to_number_16, to_number('123456','999G999'); + SELECT '' AS to_number_17, to_number('$1234.56','L9,999.99'); + SELECT '' AS to_number_18, to_number('$1234.56','L99,999.99'); + SELECT '' AS to_number_19, to_number('$1,234.56','L99,999.99'); + SELECT '' AS to_number_20, to_number('1234.56','L99,999.99'); + SELECT '' AS to_number_21, to_number('1,234.56','L99,999.99'); + SELECT '' AS to_number_22, to_number('42nd', '99th'); + RESET lc_numeric; -- -- Input syntax
pgsql-hackers by date: