Re: plperlu problem with utf8 - Mailing list pgsql-hackers
From | David E. Wheeler |
---|---|
Subject | Re: plperlu problem with utf8 |
Date | |
Msg-id | BBD0B93C-7B24-4C36-A44E-4863EC98CE6A@kineticode.com |
In response to | Re: plperlu problem with utf8 (David Christensen <david@endpoint.com>) |
Responses |
Re: plperlu problem with utf8
Re: plperlu problem with utf8 |
List | pgsql-hackers |
On Dec 17, 2010, at 9:32 PM, David Christensen wrote:

> +1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments. In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like types. If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion regardless of data type. For any other server_encoding, the data would need to be converted from the server_encoding to UTF-8, presumably using the built-in conversions before passing it off to the first code path. A similar handling would need to be done for the return values, again datatype-dependent.

+1

> Recent upgrades of the Encode module included with perl 5.10+ have caused issues wherein circular dependencies between Encode and Encode::Alias have made it impossible to load in a Safe container without major pain. (There may be some better options than I'd had on a previous project, given that we're embedding our own interpreters and accessing more through the XS guts, so I'm not ruling out this possibility completely).

Fortunately, thanks to Tim Bunce, PL/Perl no longer relies on Safe.pm.

>> Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?
>
> I'm not sure what the current plperl runtime does as far as marshaling this, but it would be fairly easy to ensure the parameters came in in perl's internal format given a server_encoding of UTF8 and some type introspection to identify the string-like types/text data. (Perhaps any type which had a binary cast to text would be a sufficient definition here. Do domains automatically inherit binary casts from their originating types?)

Their labels are TEXT. I believe that the only type that should not be treated as text is bytea.

>>> 2) its not utf8, so we just leave it as octets.
>>
>> Which mean's Perl will assume that it's Latin-1, IIUC.
>
> This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed out earlier. This would produce bogus results for any non-UTF-8, non-ASCII, non latin-1 encoding, even if it did not generally bite most people in general usage.

Agreed.

> This example seems bogus; wouldn't length be 3 if this is the example text this was run with? Additionally, since all ASCII is trivially UTF-8, I think a better example would be using a string with hi-bit characters so if this was improperly handled the lengths wouldn't match; length($all_ascii) == length(encode_utf8($all_ascii)) vs length($hi_bit) < length(encode_utf8($hi_bit)). I don't see that this test shows us much with the test case as given. The is_utf8() function merely returns the state of the SV_utf8 flag, which doesn't speak to UTF-8 validity (i.e., this need not be set on ascii-only strings, which are still valid in the UTF-8 encoding), nor does it indicate that there are no hi-bit characters in the string (i.e., with encode_utf8($hi_bit_string)), the source string $hi_bit_string (in perl's internal format) with hi-bit characters will have the utf8 flag set, but the return value of encode_utf8 will not, even though the underlying data, as represented in perl will be identical).

Sorry, I probably had a pasto there. how about this?

    CREATE OR REPLACE FUNCTION perlgets(
        TEXT
    ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
        my $text = shift;
        return_next {
            length  => length $text,
            is_utf8 => utf8::is_utf8($text) ? 1 : 0
        };
    $$;

    utf8=# SELECT * FROM perlgets('“hello”');
     length │ is_utf8
    ────────┼─────────
          7 │ t

    latin=# SELECT * FROM perlgets('“hello”');
     length │ is_utf8
    ────────┼─────────
         11 │ f

(Yes I used Latin-1 curly quotes in that last example). I would argue that it should output the same as the first example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.
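The two lengths above line up with the character count and the UTF-8 octet count of the same literal. A minimal sketch of that arithmetic follows, written in Python rather than Perl purely for illustration. It says nothing about how PL/Perl marshals arguments, and `describe` is a hypothetical helper introduced only for this example.

```python
# Character length vs. octet length for the literal used in the example.
text = "\u201chello\u201d"        # the string “hello” with curly quotes

octets = text.encode("utf-8")     # byte-oriented view of the same string
decoded = octets.decode("utf-8")  # explicit decode back to character data


def describe(value):
    """Hypothetical helper: report the length and whether value is decoded text."""
    return {"length": len(value), "is_text": isinstance(value, str)}


print(describe(decoded))  # counts characters
print(describe(octets))   # counts octets

# An all-ASCII string cannot show the difference, which is why a string
# with non-ASCII characters makes the better test case.
ascii_text = "hello"
print(len(ascii_text) == len(ascii_text.encode("utf-8")))
```

Seen this way, the argument is that a text argument should arrive already decoded, so that length counts characters in both databases.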
>
>> In a latin-1 database:
>>
>> latin=# select * from perlgets('foo');
>>  length │ is_utf8
>> ────────┼─────────
>>       8 │ f
>> (1 row)
>>
>> I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to Perl's internal form.
>
> See above for discussion of the is_utf8 flag; if we're dealing with latin-1 data or (more precisely in this case) data that has not been decoded from the server_encoding to perl's internal format, this would exactly be the expectation for the state of that flag.

Right. I think that it *should* be decoded.

>> Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8 database. That doesn't seem right to me.
>
> I'm not sure what you mean here, but I do think that if bytea is identifiable as one of the input types, we should do no encoding on the data itself, which would indicate that the utf8 flag for that variable would be unset.

Right.

> If this is not currently handled this way, I'd be a bit surprised, as bytea should just be an array of bytes with no character semantics attached to it.

It looks as though it is not handled that way. The utf8 flag *is* set on a bytea string passed to a PL/Perl function in a UTF-8 database.

> As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded version is 28. I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to UTF8 first and then escape the encoded value, and b) uri_unescape has *no* logic in it to automatically decode from UTF8 into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

Right.

> -1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it.
> Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift))}. This is definitely not a bug in uri_escape, as it is only defined to return octets.

Right, I think we're agreed on that count. I wouldn't mind seeing a uri_unescape_utf8() though, as it might prevent some confusion.

>>> Yeah, the patch address this part. Right now we just spit out
>>> whatever the internal format happens to be.
>>
>> Ah, excellent.
>
> I agree with the sentiments that: data (server_encoding) -> function parameters (-> perl internal) -> function return (-> server_encoding). This should be for any character-type data insofar as it is feasible, but ISTR there is already datatype-specific marshaling occurring.

Dunno about that.

> There is definitely a lot of confusion surrounding perl's handling of character data; I hope this was able to clear a few things up.

Yes, it helped, thanks!

David
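The unescape-then-decode order agreed on above for a uri_unescape_utf8() can be sketched as follows. This is a Python illustration under stated assumptions, not the URI::Escape implementation: the function name mirrors the one proposed in the thread but is defined here as a hypothetical helper, built on the standard library's `urllib.parse`.

```python
from urllib.parse import quote, unquote_to_bytes


def uri_unescape_utf8(escaped):
    """Hypothetical helper: percent-unescape to octets, then decode as UTF-8."""
    octets = unquote_to_bytes(escaped)  # step 1: octets only, no character semantics
    return octets.decode("utf-8")       # step 2: the caller-visible decode to text


original = "caf\u00e9"                  # four characters, one of them non-ASCII
escaped = quote(original)               # quote() encodes as UTF-8 before escaping

print(escaped)
print(len(unquote_to_bytes(escaped)))   # octet length after unescaping
print(len(uri_unescape_utf8(escaped)))  # character length after decoding
```

Keeping the decode as a separate, explicit step matches the point made in the thread: unescaping is defined only in terms of octets, and interpreting those octets as characters is a second decision.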