Thread: BUG #16469: High unicode character mangled on write and read with ODBC driver

BUG #16469: High unicode character mangled on write and read with ODBC driver

From
PG Bug reporting form
Date:
29 May 2020, 00:50:48
The following bug has been logged on the website:

Bug reference:      16469
Logged by:          Keith Erskine
Email address:      toastie604@gmail.com
PostgreSQL version: 11.7
Operating system:   Ubuntu 18.04
Description:

I was trying to insert a high unicode character (i.e. beyond the basic
multilingual plane) into a table using the Python module pyodbc.  The INSERT
succeeded but when I read the text back, it appeared to be different.  The
unicode character in question was U+1F31C (LAST QUARTER MOON WITH FACE).

Here is my setup:
Postgres 11.7
Linux 18.04
unixODBC 2.3.7
driver psqlodbcw.so 12.01.0000
pyodbc 4.0.30
Python 3.7

Steps to reproduce:
1) Create a utf-8 database:
psql -c "CREATE DATABASE test WITH encoding='UTF8' LC_COLLATE='en_US.UTF-8'
LC_CTYPE='en_US.UTF-8'" -U postgres
2) Run the following Python 3.7 code (NOT Python 2.7):
import pyodbc
cnxn = pyodbc.connect("DRIVER={PostgreSQL
Unicode};SERVER=localhost;UID=postgres;DATABASE=test", autocommit=True)
cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8')
cnxn.setencoding(str, encoding='utf-8')
cnxn.setencoding(unicode, encoding='utf-8')
crsr = cnxn.cursor()
crsr.execute("DROP TABLE IF EXISTS t1")
crsr.execute("CREATE TABLE t1 (s varchar(50))")
v = "aaa \U0001F31C zzz"
crsr.execute(r"insert into t1 values ('{}')".format(v))
rows = crsr.execute("select * from t1").fetchall()
print(rows)
crsr.close()
cnxn.close()

Note:
1) I can write and read 2-byte unicode characters, without any problem.
2) Curiously, the same Python code actually works on Windows.  The correct
characters are returned.
3) Here's an ODBC trace fragment for a similar example:
(writing)
[ODBC][25398][1590664490.057932][SQLExecDirect.c][240]
        Entry:
            Statement = 0x5612f0515100
            SQL = [insert into t1 values (1, U&'x \+01F31C z', 'x 🌜 z')][length =
55]
[ODBC][25398][1590664490.058301][SQLExecDirect.c][515]
        Exit:[SQL_SUCCESS]

...

(reading)
[ODBC][25398][1590664490.061578][SQLGetData.c][237]
        Entry:
            Statement = 0x5612f0515100
            Column Number = 2
            Target Type = 1 SQL_CHAR
            Buffer Length = 4096
            Target Value = 0x5612f051bfb0
            StrLen Or Ind = 0x7ffcfd40cfa8
[ODBC][25398][1590664490.061584][SQLGetData.c][534]
        Exit:[SQL_SUCCESS]                
            Buffer = [x 🌜 z]                
            Strlen Or Ind = 0x7ffcfd40cfa8 -> 8
[ODBC][25398][1590664490.061593][SQLGetData.c][237]
        Entry:
            Statement = 0x5612f0515100
            Column Number = 3
            Target Type = 1 SQL_CHAR
            Buffer Length = 4096
            Target Value = 0x5612f051bfb0
            StrLen Or Ind = 0x7ffcfd40cfa8
[ODBC][25398][1590664490.061599][SQLGetData.c][534]
        Exit:[SQL_SUCCESS]                
            Buffer = [x ð��� z]                
            Strlen Or Ind = 0x7ffcfd40cfa8 -> 12


As you can see, if I use the PostgreSQL unicode format in the INSERT
statement (the first text column, column number 2), it is read back fine.
But in column number 3, with an embedded literal high unicode character, the
text is not read back correctly.  It appear as if the original INSERT SQL
statement was not decoded as utf-8 text, but was treated as UCS-2 or ASCII.
Hence, when the data is read back out and decoded into utf-8, it is
mangled.

A broader question, when using SQLExecDirect with Postgres, how should the
SQL statement be encoded?  Does it have to be utf-8, or perhaps UCS-2?  Any
documentation on this would be greatly appreciated.  I haven't been able to
find much official documentation on the subject.  Even a code reference
would help.

Many thanks for any and all help.