Re: Careful PL/Perl Release Not Required - Mailing list pgsql-hackers
From | Alex Hunsaker |
---|---|
Subject | Re: Careful PL/Perl Release Not Required |
Date | |
Msg-id | AANLkTimp9yiGqAGLvwJifb1gvJ6xK0PUZh3td30BEU5C@mail.gmail.com |
In response to | Re: Careful PL/Perl Release Not Required ("David E. Wheeler" <david@kineticode.com>) |
Responses | Re: Careful PL/Perl Release Not Required |
List | pgsql-hackers |
On Thu, Feb 10, 2011 at 21:53, David E. Wheeler <david@kineticode.com> wrote:
> On Feb 10, 2011, at 5:28 PM, Alex Hunsaker wrote:
>> The other thing that changed is non UTF-8 databases now also get
>> character semantics. That is we convert from the database encoding
>> into utf8 and vice versa on output. That probably should be noted
>> somewhere...
>
> Oh. I see. And Oleg's database wasn't utf-8 then, I guess. I'll have to re-read the JSON docs, I guess. Erm…feh. Okay. I have to pass the false value to utf8() *now*. Okay, at least that's more consistent.

I'd like to quibble with you over this point if I may. :-)

Per perldoc: JSON::XS
"utf8" flag disabled
    When "utf8" is disabled (the default), then "encode"/"decode" generate and expect Unicode strings ...

So
- If you are on < 9.1 and a utf8 database you want to pass utf8(false), as you have a Unicode string.
- If you are on < 9.1 and on a non utf8 database you would want to pass utf8(false) as the string is *not* Unicode, it's byte soup. It's in some _other_ encoding, say EUC_JP. You would need to decode() it into Unicode first.
- If you are on 9.1 and a utf8 database you still want to pass utf8(false) as the string is still Unicode.
- If you are on 9.1 and a non utf8 database you want to pass utf8(false) as the string is _now_ Unicode.

So... it seems you always want to pass false. The only case I can see where you would want to pass true is when you are on < 9.1 with a SQL_ASCII database and you know for a fact the string represents a utf8 byte sequence.

Or am I missing something obvious?

>> If you do have to change your semantics/functions, could you post an
>> example? I'd like to make sure it's because you were hitting one of
>> those nasty corner cases and not something new is broken.
>
> I think that people who have non-utf-8 databases might be surprised.

Yeah, surprised it does the right thing and it's actually usable now ;).
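To make the utf8() flag cases above concrete, here is a minimal sketch outside of PL/Perl. It assumes the core module JSON::PP as a stand-in for JSON::XS (the two document the same utf8() option), and the variable names are made up for illustration. It only shows what the flag expects as input, not how any particular PostgreSQL version hands strings to the function.

```perl
use strict;
use warnings;
use JSON::PP;    # assumption: used here in place of JSON::XS

my $chars = "\x{e9}";      # one character, U+00E9 (a Unicode string)
my $bytes = "\xc3\xa9";    # the same character as two UTF-8 bytes

my $char_json = JSON::PP->new->utf8(0);    # expects Unicode strings
my $byte_json = JSON::PP->new->utf8(1);    # expects UTF-8 encoded bytes

# Wrap the value in an array so no allow_nonref setting is needed.
my $from_chars = $char_json->decode(qq{["$chars"]})->[0];
my $from_bytes = $byte_json->decode(qq{["$bytes"]})->[0];

# Feeding UTF-8 bytes to the utf8(0) decoder: each byte is taken as a character.
my $mismatch = $char_json->decode(qq{["$bytes"]})->[0];

printf "utf8(0) on characters: len %d\n", length $from_chars;
printf "utf8(1) on bytes:      len %d\n", length $from_bytes;
printf "utf8(0) on bytes:      len %d\n", length $mismatch;
```

The first two results should be the same one-character string. The third is the mismatch case, where the flag does not describe what the string really holds.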
>>> This probably won't be that common, but Oleg, for example, will need to convert his fixed function from:
>
> No, he had to add the decode line, IIRC:
>
>    CREATE OR REPLACE FUNCTION url_decode(Vkw varchar) RETURNS varchar AS $$
>    use strict;
>    use URI::Escape;
>    utf8::decode($_[0]);
>    return uri_unescape($_[0]);
>    $$ LANGUAGE plperlu;
>
> Because uri_unescape() needs its argument to be decoded to Perl's internal form. On 9.1, it will be, so he won't need to call utf8::decode(). That is, in a latin-1 database:

Meh, no, not really. He will still need to call decode. The problem is uri_unescape() does not assume an encoding on the URI. It could be UTF-16 encoded for all it knows (UTF-8 is probably standard, but that's not the point, it knows nothing about Unicode or encodings).

For example, let's say you have a latin-1 accented e "é". The byte sequence is the one byte: 0xe9. If you were to uri_escape that you get the 3 byte ascii string "%E9":

    $ perl -E 'use URI::Escape; my $str = "\xe9"; say uri_escape($str)'
    %E9

If you uri_unescape "%E9" you get 1 byte back with a hex value of 0xe9:

    $ perl -E 'use URI::Escape; my $str = uri_unescape("%E9"); say sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length $str)'
    chr: é hex: e9, len: 1

What if we want to uri_escape a UTF-16 accented e? That's the two bytes 0x00 0xe9:

    $ perl -E 'use URI::Escape; my $str = "\x00\xe9"; say uri_escape($str)'
    %00%E9

What happens when we uri_unescape that? Do we get back a Unicode string that has one character? No. And why should we? How is uri_unescape supposed to know what %00%E9 represents?
All it knows is that's 2 separate bytes:

    $ perl -E 'use URI::Escape; my $str = uri_unescape("%00%E9"); say sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length $str)'
    chr: é hex: 00e9, len: 2

Now, let's say you want to uri_escape a utf8 accented e. That's the two byte sequence 0xc3 0xa9:

    $ perl -E 'use URI::Escape; my $str = "\xc3\xa9"; say uri_escape($str)'
    %C3%A9

Ok, what happens when we uri_unescape those?

    $ perl -E 'use URI::Escape; my $str = uri_unescape("%C3%A9"); say sprintf("chr: %s hex: %s, len: %s", $str, unpack("H*", $str), length $str)'
    chr: é hex: c3a9, len: 2

So, plperl will also return 2 characters here.

In the cited case he was passing "%C3%A9" to uri_unescape() and expecting it to return 1 character. The additional utf8::decode() will tell perl the string is in utf8 so it will then return 1 char. The point being, decode is needed, and with it the function will work pre and post 9.1.

In fact, on a latin-1 database it sure as heck better return two characters. It would be a bug if it only returned 1, as that would mean it would be treating a series of latin1 bytes as a series of utf8 bytes!
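A small standalone sketch of the same point, for anyone who wants to try it without URI::Escape installed. The percent_decode() helper is hypothetical and only mimics the byte-level unescaping shown above. Applying utf8::decode() to the unescaped result is this sketch's assumption about where the decode step takes effect, not a copy of the cited function.

```perl
use strict;
use warnings;

# Hypothetical stand-in for uri_unescape(): turns each %XX into one byte
# and knows nothing about encodings.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $raw     = percent_decode("%C3%A9");
my $raw_len = length $raw;              # counts bytes: 0xc3 0xa9

my $text = $raw;
my $ok   = utf8::decode($text);         # in place: treat the bytes as UTF-8
my $text_len = length $text;            # counts characters now

printf "before decode: hex %s, len %d\n", unpack("H*", $raw), $raw_len;
printf "after decode:  U+%04X, len %d\n", ord($text), $text_len;
```

Before the decode the string is two bytes. After it, Perl sees one character, U+00E9. Nothing in the unescape step could have made that choice, since it was never told the bytes were UTF-8.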