Thread: BUG #17142: COPY ignores client_encoding for octal digit characters

BUG #17142: COPY ignores client_encoding for octal digit characters

From

PG Bug reporting form

Date:

11 August 2021, 21:24:45

The following bug has been logged on the website:

Bug reference:      17142
Logged by:          Andreas Grob
Email address:      vilarion@illarion.org
PostgreSQL version: 13.3
Operating system:   Debian GNU/Linux 11 (bullseye)
Description:

Test db and table:
```
CREATE DATABASE test WITH TEMPLATE = template0 ENCODING = 'UTF8' LC_COLLATE = 'C' LC_CTYPE = 'C';
CREATE TABLE test (text character varying(50));
```

Test program in C:
```
#include <postgresql/libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    const char *conninfo;
    char *errmsg;
    PGconn     *conn;
    PGresult   *res;
    int a, b;
    ExecStatusType status;
    int enc;

    char buffer[] = "\\304\\366\\337"; // Äöß as COPY octal escapes of the LATIN1 bytes (fails)
    // char buffer[] = "\304\366\337"; // Äöß as raw LATIN1 bytes (works)

    if (argc > 1)
        conninfo = argv[1];
    else
        conninfo = "user=postgres dbname=test port=5433 client_encoding=LATIN1";

    /* Make a connection to the database */
    conn = PQconnectdb(conninfo);

    /* Check to see that the backend connection was successfully made */
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "Connection to database failed: %s"
                , PQerrorMessage(conn));
        PQfinish(conn);
        exit(1);
    }

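    res = PQexec(conn, "BEGIN");
    PQclear(res);               /* result not needed; free it */

    res = PQexec(conn, "COPY public.test(text) from STDIN;");
    PQclear(res);               /* connection is now in COPY IN state */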
    a = PQputCopyData(conn, buffer, strlen(buffer));
    b = PQputCopyEnd(conn, NULL);
    res = PQgetResult(conn);
    status = PQresultStatus(res);
    enc = PQclientEncoding(conn);
    errmsg = PQresultErrorMessage(res);

    printf("status=%d a=%d,b=%d, enc=%d\n", status, a, b, enc);

    if (status != PGRES_COMMAND_OK)
        printf("%s\n", errmsg);
    else
        printf("worked.\n");

    PQclear(res);               /* errmsg pointed into res; done with it now */
    res = PQexec(conn, "COMMIT");
    PQclear(res);

    /* close the connection to the database and cleanup */
    PQfinish(conn);

    return 0;
}
```

Output:
```
status=7 a=1,b=1, enc=8
ERROR:  invalid byte sequence for encoding "UTF8": 0xc4 0xf6
CONTEXT:  COPY test, line 1: "\304\366\337"
```

Expected output:
```
status=1 a=1,b=1, enc=8
worked.
```
(Äöß got inserted into the table.)


Characters in octal digits should be possible as per
https://www.postgresql.org/docs/13/sql-copy.html
When using characters directly (char buffer[] = "\304\366\337") the expected
output is displayed.

My apologies if I misunderstood something.

Re: BUG #17142: COPY ignores client_encoding for octal digit characters

From

Heikki Linnakangas

Date:

12 August 2021, 07:40:35

On 12/08/2021 00:24, PG Bug reporting form wrote:
> Characters in octal digits should be possible as per
> https://www.postgresql.org/docs/13/sql-copy.html
> When using characters directly (char buffer[] = "\304\366\337") the expected
> output is displayed.
> 
> My apologies if I misunderstood something.

The code is pretty clear that the \123 and \x12 escapes are evaluated 
after encoding conversion. That means, the escapes are interpreted using 
the database encoding, regardless of client encoding. The documentation 
doesn't say anything about that, though. We should fix the docs. How 
does the attached patch look?
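
As a rough illustration of what that means for a client: if the database encoding is UTF8, the octal escapes have to spell out the UTF-8 bytes, not the LATIN1 ones. The sketch below is only an assumption about how a client might build such escapes; `latin1_to_utf8_octal` is a hypothetical helper, not part of libpq, and it does not escape backslash, tab, or newline as a complete COPY writer would need to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of libpq): read LATIN1 text and write COPY
 * text-format data in which every non-ASCII character becomes octal escapes
 * of its UTF-8 bytes, i.e. the bytes of a UTF8 *database* encoding.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the output buffer is too small.
 */
int latin1_to_utf8_octal(const char *in, char *out, size_t outsize)
{
    size_t used = 0;

    assert(in != NULL && out != NULL);

    for (const unsigned char *p = (const unsigned char *) in; *p; p++)
    {
        unsigned char utf8[2];

        if (*p < 0x80)
        {
            /* ASCII is identical in LATIN1 and UTF8; copy it unescaped. */
            if (used + 2 > outsize)
                return -1;
            out[used++] = (char) *p;
            continue;
        }

        /* LATIN1 0x80..0xFF maps to U+0080..U+00FF: two UTF-8 bytes. */
        utf8[0] = (unsigned char) (0xC0 | (*p >> 6));
        utf8[1] = (unsigned char) (0x80 | (*p & 0x3F));

        for (int i = 0; i < 2; i++)
        {
            /* Each escape is a backslash plus three octal digits. */
            if (used + 5 > outsize)
                return -1;
            used += (size_t) snprintf(out + used, outsize - used,
                                      "\\%03o", (unsigned int) utf8[i]);
        }
    }

    if (used + 1 > outsize)
        return -1;
    out[used] = '\0';
    return (int) used;
}
```

For the LATIN1 bytes of Äöß (0xC4 0xF6 0xDF) this produces the text `\303\204\303\266\303\237`. That text is pure ASCII, so the client-to-server conversion leaves it unchanged and the escapes then decode to valid UTF-8 on the server.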

You could get weird results if you use the escapes for some bytes in a 
multi-byte character. Mostly you'd get invalid byte sequence errors, but 
I think with the right combination of the client and database encodings, 
it could get more strange. I think the wording in the attached docs 
patch is enough to cover that, though.
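
To make the "invalid byte sequence" case concrete, here is a deliberately simplified check, not PostgreSQL's actual verifier. It only handles ASCII and two-byte UTF-8 sequences, and the function name `first_char_ok_utf8` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified sketch: returns 1 if the first character in s[0..len) is ASCII
 * or a well-formed two-byte UTF-8 sequence, 0 otherwise. Three- and
 * four-byte sequences are deliberately not handled here.
 */
int first_char_ok_utf8(const unsigned char *s, size_t len)
{
    assert(s != NULL);

    if (len == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;               /* plain ASCII */
    if (s[0] >= 0xC2 && s[0] <= 0xDF)
    {
        /* Two-byte lead: the next byte must be a continuation, 10xxxxxx. */
        return len >= 2 && (s[1] & 0xC0) == 0x80;
    }
    return 0;                   /* not covered by this sketch */
}
```

With the bytes from the report, 0xC4 is a two-byte lead but 0xF6 is not a continuation byte, so the check fails, which matches the `0xc4 0xf6` in the server's error message. The UTF-8 encoding of Ä (0xC3 0x84) passes.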

- Heikki

Attachment

0001-Doc-123-and-x12-escapes-in-COPY-are-in-database-enco.patch

Re: BUG #17142: COPY ignores client_encoding for octal digit characters

From

vilarion@illarion.org

Date:

12 August 2021, 08:01:56

On 12.08.2021 09:40, Heikki Linnakangas wrote:
> On 12/08/2021 00:24, PG Bug reporting form wrote:
>> Characters in octal digits should be possible as per
>> https://www.postgresql.org/docs/13/sql-copy.html
>> When using characters directly (char buffer[] = "\304\366\337") the 
>> expected
>> output is displayed.
>>
>> My apologies if I misunderstood something.
>
> The code is pretty clear that the \123 and \x12 escapes are evaluated 
> after encoding conversion. That means, the escapes are interpreted 
> using the database encoding, regardless of client encoding. The 
> documentation doesn't say anything about that, though. We should fix 
> the docs. How does the attached patch look?
>
> You could get weird results if you use the escapes for some bytes in a 
> multi-byte character. Mostly you'd get invalid byte sequence errors, 
> but I think with the right combination of the client and database 
> encodings, it could get more strange. I think the wording in the 
> attached docs patch is enough to cover that, though.
>
> - Heikki


Thanks for clarifying! This patch to the docs will allow me to file a 
bug report against the library I am using (pqxx).

Andreas

Re: BUG #17142: COPY ignores client_encoding for octal digit characters

From

Heikki Linnakangas

Date:

17 August 2021, 08:28:31

On 12/08/2021 11:01, vilarion@illarion.org wrote:
> Thanks for clarifying! This patch to the docs will allow me to file a
> bug report against the library I am using (pqxx).

Pushed the docs patch now.

- Heikki