Thread: BUG #16469: High unicode character mangled on write and read with ODBC driver
BUG #16469: High unicode character mangled on write and read with ODBC driver
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 16469 Logged by: Keith Erskine Email address: toastie604@gmail.com PostgreSQL version: 11.7 Operating system: Ubuntu 18.04 Description: I was trying to insert a high unicode character (i.e. beyond the basic multilingual plane) into a table using the Python module pyodbc. The INSERT succeeded but when I read the text back, it appeared to be different. The unicode character in question was U+1F31C (LAST QUARTER MOON WITH FACE). Here is my setup: Postgres 11.7 Linux 18.04 unixODBC 2.3.7 driver psqlodbcw.so 12.01.0000 pyodbc 4.0.30 Python 3.7 Steps to reproduce: 1) Create a utf-8 database: psql -c "CREATE DATABASE test WITH encoding='UTF8' LC_COLLATE='en_US.UTF-8' LC_CTYPE='en_US.UTF-8'" -U postgres 2) Run the following Python 3.7 code (NOT Python 2.7): import pyodbc cnxn = pyodbc.connect("DRIVER={PostgreSQL Unicode};SERVER=localhost;UID=postgres;DATABASE=test", autocommit=True) cnxn.setdecoding(pyodbc.SQL_WCHAR, encoding='utf-8') cnxn.setencoding(str, encoding='utf-8') cnxn.setencoding(unicode, encoding='utf-8') crsr = cnxn.cursor() crsr.execute("DROP TABLE IF EXISTS t1") crsr.execute("CREATE TABLE t1 (s varchar(50))") v = "aaa \U0001F31C zzz" crsr.execute(r"insert into t1 values ('{}')".format(v)) rows = crsr.execute("select * from t1").fetchall() print(rows) crsr.close() cnxn.close() Note: 1) I can write and read 2-byte unicode characters, without any problem. 2) Curiously, the same Python code actually works on Windows. The correct characters are returned. 3) Here's an ODBC trace fragment for a similar example: (writing) [ODBC][25398][1590664490.057932][SQLExecDirect.c][240] Entry: Statement = 0x5612f0515100 SQL = [insert into t1 values (1, U&'x \+01F31C z', 'x 🌜 z')][length = 55] [ODBC][25398][1590664490.058301][SQLExecDirect.c][515] Exit:[SQL_SUCCESS] ... (reading) [ODBC][25398][1590664490.061578][SQLGetData.c][237] Entry: Statement = 0x5612f0515100 Column Number = 2 Target Type = 1 SQL_CHAR Buffer Length = 4096 Target Value = 0x5612f051bfb0 StrLen Or Ind = 0x7ffcfd40cfa8 [ODBC][25398][1590664490.061584][SQLGetData.c][534] Exit:[SQL_SUCCESS] Buffer = [x 🌜 z] Strlen Or Ind = 0x7ffcfd40cfa8 -> 8 [ODBC][25398][1590664490.061593][SQLGetData.c][237] Entry: Statement = 0x5612f0515100 Column Number = 3 Target Type = 1 SQL_CHAR Buffer Length = 4096 Target Value = 0x5612f051bfb0 StrLen Or Ind = 0x7ffcfd40cfa8 [ODBC][25398][1590664490.061599][SQLGetData.c][534] Exit:[SQL_SUCCESS] Buffer = [x ð��� z] Strlen Or Ind = 0x7ffcfd40cfa8 -> 12 As you can see, if I use the PostgreSQL unicode format in the INSERT statement (the first text column, column number 2), it is read back fine. But in column number 3, with an embedded literal high unicode character, the text is not read back correctly. It appear as if the original INSERT SQL statement was not decoded as utf-8 text, but was treated as UCS-2 or ASCII. Hence, when the data is read back out and decoded into utf-8, it is mangled. A broader question, when using SQLExecDirect with Postgres, how should the SQL statement be encoded? Does it have to be utf-8, or perhaps UCS-2? Any documentation on this would be greatly appreciated. I haven't been able to find much official documentation on the subject. Even a code reference would help. Many thanks for any and all help.