Re: plperlu problem with utf8 - Mailing list pgsql-hackers
From | Alex Hunsaker |
---|---|
Subject | Re: plperlu problem with utf8 |
Date | |
Msg-id | AANLkTi=ODW=o2R7i2NOfj0tv6dqGCAsBcKesAiWYWMw4@mail.gmail.com Whole thread Raw |
In response to | Re: plperlu problem with utf8 ("David E. Wheeler" <david@kineticode.com>) |
Responses |
Re: plperlu problem with utf8
Re: plperlu problem with utf8 |
List | pgsql-hackers |
On Sat, Dec 18, 2010 at 20:29, David E. Wheeler <david@kineticode.com> wrote: > ... > I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 beforepassing the text to the Perl function. Yeah, I don't think you will find anyone who disagrees :) PL/TCL and PL/Python get this right FWIW. Anyway find attached a patch that does just this. With the attached we: - for function arguments, convert (using pg_do_encoding_conversion) to utf8 from the current database encoding. We also turn on the utf8 flag so perl knows it was given utf8. Pre patch things only really worked for SQL_ASCII or PG_UTF8 databases. In practice everything worked fine for single byte charsets. However things like uc() or lc() on bytes with high bits set were probably broken. - for output from perl convert from perls internal format to utf8 (using SvPVutf8()), and then convert that to the current database encoding. This sounds unoptimized, but in the common case SvPVutf8() should be a noop. Pre patch this was "random" (dependent on how perl decided to represent the string internally) but it worked 99% of the time (at least in the single byte charset or UTF8 cases). - fix errors so they properly encode their message to the current database encoding (pre patch we were doing no conversion at all, similar to the output case were it worked most of the time) - always do the utf8 hack so utf8.pm is loaded (fixes utf8 regexs in plperl). Pre patch this only happened on a UTF8 database. That meant multi-byte character regexs were broken on non utf8 databases. -remove some old perl version checks for 5.6 and 5.8. We require 5.8.1 so these were nothing but clutter. Something interesting to note is when we are SQL_ASCII, pg_do_encoding_conversion() does nothing, yet we turn on the utf8 flag. This means if you pass in valid utf8 perl will treat it as such. It also means on output it will hand utf8 back. Both PL/Tcl and PL/Python do the same thing so I suppose its sensible to match their behavior (and it was the lazy thing to do). The difference being with PL/Python if you return said string you get "ERROR: PL/Python: could not convert Python Unicode object to PostgreSQL server encoding". While PL/Tcl and now Pl/perl give you back a utf8 version. For example: (SQL_ASCII database) =# select length('☺'); length -------- 3 =# CREATE FUNCTION tcl_len(text) returns text as $$ return [string length $1] $$ language pltcl; CREATE FUNCTION postgres=# SELECT tcl_len('☺'); tcl_len ------------ 1 (1 row) =# CREATE FUNCTION py_len(str text) returns text as $$ return len(str) $$ language plpython3; =# SELECT py_len('☺'); py_len -------- 1 (1 row) I wouldn't really say thats right, but its at least consistent... This does *not* address the bytea issue where currently if you have bytea input or output we try to encode that the same as any string. I think thats going to be a bit more invasive and this patch should stands on its own.
Attachment
pgsql-hackers by date: