Re: plperlu problem with utf8 - Mailing list pgsql-hackers
From | David E. Wheeler |
---|---|
Subject | Re: plperlu problem with utf8 |
Date | |
Msg-id | C9982425-2453-479A-88FB-D12B6F20839B@kineticode.com |
In response to | Re: plperlu problem with utf8 (Alex Hunsaker <badalex@gmail.com>) |
Responses |
Re: plperlu problem with utf8
Re: plperlu problem with utf8 |
List | pgsql-hackers |
On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:

> You might argue this is a bug with URI::Escape as I *think* all uri's
> will be utf8 encoded. Anyway, I think postgres is doing the right
> thing here.

No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, it assumes that it's Latin-1.

> In playing around I did find what I think is a postgres bug. Perl has
> 2 ways it can store things internally. per perldoc perlunicode:
>
> Using Unicode in XS
> ... What the "UTF8" flag means is that the sequence of octets in the
> representation of the scalar is the sequence of UTF-8 encoded code
> points of the characters of a string. The "UTF8" flag being off means
> that each octet in this representation encodes a single character with
> code point 0..255 within the string.
>
> Postgres always prints whatever the internal representation happens to
> be ignoring the UTF8 flag and the server encoding.
>
> # create or replace function chr(i int, i2 int) returns text as $$
> return chr($_[0]).chr($_[1]); $$ language plperlu;
> CREATE FUNCTION
>
> # show server_encoding;
> server_encoding
> -----------------
> SQL_ASCII
>
> # SELECT length(chr(128, 33));
> length
> --------
> 2
>
> # SELECT length(chr(128, 333));
> length
> --------
> 4
>
> Grr that should error out with "Invalid server encoding", or worst
> case should return a length of 3 (it utf8 encoded 128 into 2 bytes
> instead of leaving it as 1). In this case the 333 causes perl store
> it internally as utf8.

Well with SQL_ASCII anything goes, no?

> Now on a utf8 database:
>
> # show server_encoding;
> server_encoding
> -----------------
> UTF8
>
> # SELECT length(chr(128, 33));
> ERROR: invalid byte sequence for encoding "UTF8": 0x80
> CONTEXT: PL/Perl function "chr"
>
> # SELECT length(chr(128, 333));
> CONTEXT: PL/Perl function "chr"
> length
> --------
> 2
>
> Same thing here, we just end up using the internal format. In one
> case it works in the other it does not. The main point being, most of
> the time it *happens* to work. But its really just by chance.
>
> I think what we should do is use SvPVutf8() when we are UTF8 instead
> of SvPV in sv2text_mbverified(). SvPV gives us a pointer to a string
> in perls current internal format (maybe unicode, maybe a utf8 byte
> sequence). While SvPVutf8 will always give us utf8 (may or may not be
> valid!) encoded string.
>
> Something like the attached. Thoughts? Im not very happy with the non
> utf8 case-- The elog(ERROR, "invalid byte sequence") is a total
> cop-out yes. But I did not see a good solution short of hand rolling
> our own version of sv_utf8_downgrade(). Is it worth it?
> <plperl_encoding.patch>

Maybe I'm misunderstanding, but it seems to me that:

* String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal representation before the function actually gets them.

* Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server encoding before they're returned.

I didn't really follow all of the above; are you aiming for the same thing?

Best,

David
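
For illustration only, here is a minimal sketch in plain Perl (run outside the server, not PL/Perl itself and not the attached patch) of the decode-on-input and encode-on-output idea from the two bullet points. It assumes the server encoding is UTF-8; the variable names are made up for the example, and the core `Encode` module stands in for whatever conversion the server would actually perform.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two strings built by the chr() function in the quoted example.
my $narrow = chr(128) . chr(33);     # every code point is below 256
my $wide   = chr(128) . chr(333);    # 333 forces the UTF8-flagged form

# Both are two characters long as far as Perl code is concerned.
my @char_lengths = (length $narrow, length $wide);

# Their internal storage differs, which is what reading the raw buffer exposes.
my @flags = map { utf8::is_utf8($_) ? 1 : 0 } ($narrow, $wide);

# Encode on the way out: explicitly converting to the (assumed) server
# encoding gives a result that does not depend on the internal form.
my @octet_lengths = map { length encode('UTF-8', $_) } ($narrow, $wide);

# Decode on the way in: hypothetical octets for U+00E9 as a UTF-8 server
# would send them. Undecoded, Perl sees two Latin-1 characters.
my $from_server      = "\xC3\xA9";
my $undecoded_length = length $from_server;
my $decoded_length   = length decode('UTF-8', $from_server);

print "chars: @char_lengths, flags: @flags, octets: @octet_lengths\n";
print "input: $undecoded_length undecoded, $decoded_length decoded\n";
```

With an explicit encode step, the `chr(128, 33)` case would yield valid UTF-8 octets instead of a bare 0x80 byte, and with an explicit decode step the function body would see characters rather than octets.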