Home > mailing lists
Re: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5) - Mailing list pgsql-bugs

From	Tom Lane
Subject	Re: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Date	March 23, 2016 15:44:08
Msg-id	31913.1458747836@sss.pgh.pa.us Whole thread Raw
In response to	Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5) (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
List	pgsql-bugs
Tree view
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Mar 22, 2016 at 10:44 PM, Noah Misch <noah@leadboat.com> wrote:
>> I, too, found MAXXFRMLEN insufficient; I raised it fourfold.  Cygwin
>> 2.2.1(0.289/5/3) caught fire; 10% of locales passed.  (varstr_sortsupport()
>> already blacklists the UTF8/native Windows case.)  The test passed on Solaris
>> 10, Solaris 11, HP-UX B.11.31, OpenBSD 5.0, NetBSD 5.1.2, and FreeBSD 9.0.
>> See attached tryalllocales.sh outputs.  I did not test AIX, because the AIX
>> machines I use have no UTF8 locales installed.

> Wow, thanks for the extensive testing.  This suggests that, apart from
> Cygwin which apparently doesn't matter right now, the only thing that
> is busted is glibc.  I believe we have yet to see a single locale that
> fails anywhere else (apart from Cygwin).  Good thing so few of our
> users run glibc!

I extended my test program to be able to check locales using ISO-8859-x
encodings.  RHEL6 shows me failures in a set of locales that is remarkably
unlike the set it fails on for UTF8 (though good ol de_DE manages to fail
in both encodings, as do a few others).  I'm not sure what that implies
for the underlying bug(s).

> So, options:

> 1. We could make it the user's problem to figure out whether they've
> got a buggy glibc and add a GUC to shut this off, as previously
> suggested.

> 2. We could add a blacklist (either hardcoded or a GUC) shutting this
> off for locales known to be buggy anywhere.

> 3. We could write some test code that runs at startup time which
> reliably detects all of the broken locales we've so far uncovered and
> disables this if so.

> 4. We could shut this off for all Linux users in all locales and tell
> everybody to REINDEX.  That would be pretty sad, though.

TBH, I think #1 is right out, unless maybe the GUC defaults to off.
We aren't that cavalier with data consistency in other departments.

#2 and #3 presume a level of knowledge of the bug details that we
have not got, and probably can't get by Monday.

As far as #4 goes, we're going to have to tell people to REINDEX
no matter what the other aspects of the fix look like.  On-disk
indexes are broken right now, if you're using one of the affected
locales.

            regards, tom lane

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <langinfo.h>
#include <time.h>

/*
 * Test: generate 1000 random UTF8 strings, sort them by strcoll, sanity-
 * check the sort result, sort them by strxfrm, sanity-check that result,
 * and compare the two sort orders.
 */
#define NSTRINGS 1000
#define MAXSTRLEN 20
#define MAXXFRMLEN (MAXSTRLEN * 10)

typedef struct
{
    char        strval[MAXSTRLEN];
    char        xfrmval[MAXXFRMLEN];
    int            strsortpos;
    int            xfrmsortpos;
} OneString;

/* qsort comparators */

static int
strcoll_compare(const void *pa, const void *pb)
{
    const OneString *a = (const OneString *) pa;
    const OneString *b = (const OneString *) pb;

    return strcoll(a->strval, b->strval);
}

static int
strxfrm_compare(const void *pa, const void *pb)
{
    const OneString *a = (const OneString *) pa;
    const OneString *b = (const OneString *) pb;

    return strcmp(a->xfrmval, b->xfrmval);
}


/* returns 1 if OK, 0 if inconsistency detected */
static int
run_test_case(int is_utf8)
{
    int            ok = 1;
    OneString    data[NSTRINGS];
    int            i,
                j;

    /* Generate random strings of length less than MAXSTRLEN bytes */
    for (i = 0; i < NSTRINGS; i++)
    {
        char       *p = data[i].strval;
        int            len;

        len = 1 + (random() % (MAXSTRLEN - 1));
        while (len > 0)
        {
            int            c;

            /* Generate random printable char in ISO8859-1 range */
            /* Bias towards producing a lot of spaces */
            if ((random() % 16) < 3)
                c = ' ';
            else
            {
                do
                {
                    c = random() & 0xFF;
                } while (!((c >= ' ' && c <= 127) || (c >= 0xA0 && c <= 0xFF)));
            }

            if (c <= 127 || !is_utf8)
            {
                *p++ = c;
                len--;
            }
            else
            {
                if (len < 2)
                    break;
                /* Poor man's utf8-ification */
                *p++ = 0xC0 + (c >> 6);
                len--;
                *p++ = 0x80 + (c & 0x3F);
                len--;
            }
        }
        *p = '\0';

        /* strxfrm each string as we produce it */
        if (strxfrm(data[i].xfrmval, data[i].strval, MAXXFRMLEN) >= MAXXFRMLEN)
        {
            fprintf(stderr, "strxfrm() result for %d-length string exceeded %d bytes\n",
                    (int) strlen(data[i].strval), MAXXFRMLEN);
            exit(1);
        }

#if 0
        printf("%d %s\n", i, data[i].strval);
#endif
    }

    /* Sort per strcoll(), and label, being careful in case some are equal */
    qsort(data, NSTRINGS, sizeof(OneString), strcoll_compare);
    j = 0;
    for (i = 0; i < NSTRINGS; i++)
    {
        if (i > 0 && strcoll(data[i].strval, data[i-1].strval) != 0)
            j++;
        data[i].strsortpos = j;
    }

    /* Sanity-check: is each string <= those after it? */
    for (i = 0; i < NSTRINGS; i++)
    {
        for (j = i + 1; j < NSTRINGS; j++)
        {
            if (strcoll(data[i].strval, data[j].strval) > 0)
            {
                fprintf(stdout, "strcoll sort inconsistency between positions %d and %d\n",
                        i, j);
                ok = 0;
            }
        }
    }

    /* Sort per strxfrm(), and label, being careful in case some are equal */
    qsort(data, NSTRINGS, sizeof(OneString), strxfrm_compare);
    j = 0;
    for (i = 0; i < NSTRINGS; i++)
    {
        if (i > 0 && strcmp(data[i].xfrmval, data[i-1].xfrmval) != 0)
            j++;
        data[i].xfrmsortpos = j;
    }

    /* Sanity-check: is each string <= those after it? */
    for (i = 0; i < NSTRINGS; i++)
    {
        for (j = i + 1; j < NSTRINGS; j++)
        {
            if (strcmp(data[i].xfrmval, data[j].xfrmval) > 0)
            {
                fprintf(stdout, "strxfrm sort inconsistency between positions %d and %d\n",
                        i, j);
                ok = 0;
            }
        }
    }

    /* Compare */
    for (i = 0; i < NSTRINGS; i++)
    {
        if (data[i].strsortpos != data[i].xfrmsortpos)
        {
            fprintf(stdout, "inconsistency between strcoll (%d) and strxfrm (%d) orders\n",
                    data[i].strsortpos, data[i].xfrmsortpos);
            ok = 0;
        }
    }

    return ok;
}

int
main(int argc, char **argv)
{
    const char *lc;
    const char *cset;
    int            is_utf8;
    int            ntries;
    int result = 0;

    /* Absorb locale from environment, and report what we're using */
    if (setlocale(LC_ALL, "") == NULL)
    {
        perror("setlocale(LC_ALL) failed");
        exit(1);
    }
    lc = setlocale(LC_COLLATE, NULL);
    if (lc)
    {
        printf("Using LC_COLLATE = \"%s\"\n", lc);
    }
    else
    {
        perror("setlocale(LC_COLLATE) failed");
        exit(1);
    }
    lc = setlocale(LC_CTYPE, NULL);
    if (lc)
    {
        printf("Using LC_CTYPE = \"%s\"\n", lc);
    }
    else
    {
        perror("setlocale(LC_CTYPE) failed");
        exit(1);
    }

    /* Identify encoding */
    cset = nl_langinfo(CODESET);
    if (!cset)
    {
        perror("nl_langinfo(CODESET) failed");
        exit(1);
    }
    if (strstr(cset, "utf") || strstr(cset, "UTF"))
        is_utf8 = 1;
    else if (strstr(cset, "iso-8859") || strstr(cset, "ISO-8859") ||
         strstr(cset, "iso8859") || strstr(cset, "ISO8859"))
        is_utf8 = 0;
    else
    {
        fprintf(stderr, "unrecognized codeset name \"%s\"\n", cset);
        exit(1);
    }

    /* Ensure new random() values on every run */
    srandom((unsigned int) time(NULL));

    /* argv[1] can be the max number of tries to run */
    if (argc > 1)
        ntries = atoi(argv[1]);
    else
        ntries = 1;

    /* Run one test instance per loop */
    while (ntries-- > 0)
    {
        if (!run_test_case(is_utf8))
            result = 1;
    }

    return result;
}
az_AZ.utf8 BAD
ca_AD.utf8 BAD
ca_ES.utf8 BAD
ca_FR.utf8 BAD
ca_IT.utf8 BAD
crh_UA.utf8 BAD
csb_PL.utf8 BAD
cv_RU.utf8 BAD
da_DK.utf8 BAD
de_DE.utf8 BAD
en_CA.utf8 BAD
es_EC.utf8 BAD
es_US.utf8 BAD
fi_FI.utf8 BAD
fil_PH.utf8 BAD
fo_FO.utf8 BAD
fr_CA.utf8 BAD
fur_IT.utf8 BAD
hu_HU.utf8 BAD
ig_NG.utf8 BAD
ik_CA.utf8 BAD
iu_CA.utf8 BAD
kl_GL.utf8 BAD
ku_TR.utf8 BAD
nb_NO.utf8 BAD
nn_NO.utf8 BAD
no_NO.utf8 BAD
ro_RO.utf8 BAD
sc_IT.utf8 BAD
se_NO.utf8 BAD
shs_CA.utf8 BAD
sq_AL.utf8 BAD
sq_MK.utf8 BAD
sv_FI.utf8 BAD
sv_SE.utf8 BAD
tk_TM.utf8 BAD
tt_RU.utf8 BAD
tt_RU.utf8@iqtelif BAD
ug_CN.utf8 BAD
vi_VN.utf8 BAD
yo_NG.utf8 BAD
ar_AE.iso88596 BAD
ar_BH.iso88596 BAD
ar_DZ.iso88596 BAD
ar_EG.iso88596 BAD
ar_IQ.iso88596 BAD
ar_JO.iso88596 BAD
ar_KW.iso88596 BAD
ar_LB.iso88596 BAD
ar_LY.iso88596 BAD
ar_MA.iso88596 BAD
ar_OM.iso88596 BAD
ar_QA.iso88596 BAD
ar_SD.iso88596 BAD
ar_SY.iso88596 BAD
ar_TN.iso88596 BAD
ar_YE.iso88596 BAD
bs_BA.iso88592 BAD
ca_AD.iso885915 BAD
ca_ES.iso88591 BAD
ca_ES.iso885915@euro BAD
ca_FR.iso885915 BAD
ca_IT.iso885915 BAD
da_DK.iso88591 BAD
da_DK.iso885915 BAD
de_DE.iso88591 BAD
es_EC.iso88591 BAD
es_US.iso88591 BAD
fi_FI.iso88591 BAD
fi_FI.iso885915@euro BAD
fo_FO.iso88591 BAD
fr_CA.iso88591 BAD
he_IL.iso88598 BAD
hu_HU.iso88592 BAD
iw_IL.iso88598 BAD
kl_GL.iso88591 BAD
ku_TR.iso88599 BAD
mk_MK.iso88595 BAD
mt_MT.iso88593 BAD
nb_NO.iso88591 BAD
nn_NO.iso88591 BAD
no_NO.iso88591 BAD
ro_RO.iso88592 BAD
ru_RU.iso88595 BAD
sq_AL.iso88591 BAD
sv_FI.iso88591 BAD
sv_FI.iso885915@euro BAD
sv_SE.iso88591 BAD
sv_SE.iso885915 BAD
pgsql-bugs by date:
From: Robert Haas
Date: 23 March 2016, 14:47:13
Subject: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
From: Tom Lane
Date: 23 March 2016, 16:13:52
Subject: Re: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5)
Re: Re: Missing rows with index scan when collation is not "C" (PostgreSQL 9.5) - Mailing list pgsql-bugs

Previous

Next