Thread: BUG #3697: utf8 issue: can not reimport a table that was successfully exported.
BUG #3697: utf8 issue: can not reimport a table that was successfully exported.
From
"Marc Mamin"
Date:
The following bug has been logged online: Bug reference: 3697 Logged by: Marc Mamin Email address: m.mamin@intershop.de PostgreSQL version: 8.2.4 Operating system: SuSE Linux 9.1 (i586) Description: utf8 issue: can not reimport a table that was successfully exported. Details: Hello, I'm not sure this is a bug; the problem might be related to a client encoding issue. My Database is on a Linux server which I connect to using putty from Windows. both server and client are set to UTF8: client_encoding | UTF8 backslash_quote | safe_encoding server_encoding | UTF8 I stumbled on this issue while trying to import a "malicious" user agent string... I didn't check if all characters are valid UTF8... My concern is about database recovery. I'm using pg_dump to regulary export my users, bu according to the example below, it seems that my dumps may be worthless ! May be you should consider not to publish this before a fix exist as this is a serious issue which could eventually be exploited to damage existing instances (for the case this is really a bug)... regards, Marc Mamin steps to repeat: CREATE TABLE utf8_test(s varchar); CREATE OR REPLACE FUNCTION f_utf8_test( st VARCHAR) RETURNS INT AS $$ DECLARE quotedline varchar = quote_literal($1); BEGIN INSERT INTO utf8_test ( s ) VALUES ( quotedline); RETURN 0; END; $$ LANGUAGE plpgsql; select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9 \xE3\xEE\xF1\xF3\xE4 xE4\xE6 \xCD\xC1 \xD0\xC1")'); -- here the same statement, but with all backslashed duplicated for the case when the string was modified when posting this issue: select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind \\xE0\\xF0\\xF1\\xF2\\xE2\\xE5\\xED\\xED\\xFB\\xE9 \\xE2\\xFB\\xF1\\xF8\\9 \\xE3\\xEE\\xF1\\xF3\\xE4 xE4\\xE6 \\xCD\\xC1 \\xD0\\xC1")'); WARNING: nonstandard use of escape in a string literal LINE 1: select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind COPY utf8_test TO '/tmp/utf8_test.txt'; COPY 1 COPY utf8_test FROM '/tmp/utf8_test.txt'; ERROR: invalid byte sequence for encoding "UTF8": 0xd3ce HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". CONTEXT: COPY utf8_test, line 1 The same isuue occure when using pg_dump: pg_dump -i -v -p 5433 -Uisdb2 -tutf8_test > /tmp/utf8_dump pg_dump: server version: 8.2.4; pg_dump version: 8.2.1 pg_dump: proceeding despite version mismatch psql -f"/tmp/utf8_dump" => psql:/tmp/utf8_dump:40: ERROR: invalid byte sequence for encoding "UTF8": 0xd3ce HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". CONTEXT: COPY utf8_test, line 1
Re: BUG #3697: utf8 issue: can not reimport a table that was successfully exported.
From
Tom Lane
Date:
"Marc Mamin" <m.mamin@intershop.de> writes: > I didn't check if all characters are valid UTF8... They aren't ... > select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind > \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9 > \xE3\xEE\xF1\xF3\xE4 > xE4\xE6 \xCD\xC1 \xD0\xC1")'); In 8.3 that will throw an error: utf8=# select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind utf8'# \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9 utf8'# \xE3\xEE\xF1\xF3\xE4 utf8'# xE4\xE6 \xCD\xC1 \xD0\xC1")'); WARNING: nonstandard use of escape in a string literal LINE 1: select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind ^ HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. ERROR: invalid byte sequence for encoding "UTF8": 0xe0f0f1 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlledby "client_encoding". utf8=# However, since this behavior isn't backwards-compatible, there's not much appetite for back-patching it. I don't think this is a security issue --- if you don't quote backslashes in untrusted input you'll have problems far worse than this one. regards, tom lane
Re: BUG #3697: utf8 issue: can not reimport a table that was successfully exported.
From
Tom Lane
Date:
"Marc Mamin" <M.Mamin@intershop.de> writes: > Is there a recommendation how to clean these data (I know where to > search for them) Usually people do it by running iconv with the delete-bad-data option on a pg_dump file. regards, tom lane
Re: BUG #3697: utf8 issue: can not reimport a table that was successfully exported.
From
"Marc Mamin"
Date:
Thank you for your quick response, > if you don't quote backslashes in untrusted input you'll have problems far worse than this one I do it now but not since by db is live...=20 So I probably have some invalid caraters in.=20 Is this an issue that must be fixed before I can upgrade to 8.3 ? Is there a recommendation how to clean these data (I know where to search for them) Thanks, Marc Mamin =20 -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us]=20 Sent: Thursday, October 25, 2007 6:08 PM To: Marc Mamin Cc: pgsql-bugs@postgresql.org Subject: Re: [BUGS] BUG #3697: utf8 issue: can not reimport a table that was successfully exported.=20 "Marc Mamin" <m.mamin@intershop.de> writes: > I didn't check if all characters are valid UTF8... They aren't ... > select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind > \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9 > \xE3\xEE\xF1\xF3\xE4 > xE4\xE6 \xCD\xC1 \xD0\xC1")'); In 8.3 that will throw an error: utf8=3D# select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind utf8'# \xE0\xF0\xF1\xF2\xE2\xE5\xED\xED\xFB\xE9 \xE2\xFB\xF1\xF8\9 utf8'# \xE3\xEE\xF1\xF3\xE4 utf8'# xE4\xE6 \xCD\xC1 \xD0\xC1")'); WARNING: nonstandard use of escape in a string literal LINE 1: select f_utf8_test('(Mozilla/4.0 (compatible; MSIE 6.0; Wind ^ HINT: Use the escape string syntax for escapes, e.g., E'\r\n'. ERROR: invalid byte sequence for encoding "UTF8": 0xe0f0f1 HINT: This error can also happen if the byte sequence does not match the encoding expected by the server, which is controlled by "client_encoding". utf8=3D#=20 However, since this behavior isn't backwards-compatible, there's not much appetite for back-patching it. I don't think this is a security issue --- if you don't quote backslashes in untrusted input you'll have problems far worse than this one. regards, tom lane