Re: plperlu problem with utf8 - Mailing list pgsql-hackers

From	David E. Wheeler
Subject	Re: plperlu problem with utf8
Date	December 18, 2010 23:35:39
Msg-id	BBD0B93C-7B24-4C36-A44E-4863EC98CE6A@kineticode.com
In response to	Re: plperlu problem with utf8 (David Christensen <david@endpoint.com>)
Responses	Re: plperlu problem with utf8 Re: plperlu problem with utf8
List	pgsql-hackers
On Dec 17, 2010, at 9:32 PM, David Christensen wrote:

> +1 on the original sentiment, but only for the case that we're dealing with data that is passed in/out as arguments.
> In the case that the server_encoding is UTF-8, this is as trivial as a few macros on the underlying SVs for text-like
> types. If the server_encoding is SQL_ASCII (= byte soup), this is a trivial case of doing nothing with the conversion
> regardless of data type. For any other server_encoding, the data would need to be converted from the server_encoding to
> UTF-8, presumably using the built-in conversions before passing it off to the first code path. A similar handling would
> need to be done for the return values, again datatype-dependent.

+1

> Recent upgrades of the Encode module included with perl 5.10+ have caused issues wherein circular dependencies
> between Encode and Encode::Alias have made it impossible to load in a Safe container without major pain. (There may be
> some better options than I'd had on a previous project, given that we're embedding our own interpreters and accessing
> more through the XS guts, so I'm not ruling out this possibility completely).

Fortunately, thanks to Tim Bunce, PL/Perl no longer relies on Safe.pm.

>> Well that works for me. I always use UTF8. Oleg, what was the encoding of your database where you saw the issue?
>
> I'm not sure what the current plperl runtime does as far as marshaling this, but it would be fairly easy to ensure
> the parameters came in in perl's internal format given a server_encoding of UTF8 and some type introspection to identify
> the string-like types/text data. (Perhaps any type which had a binary cast to text would be a sufficient definition
> here. Do domains automatically inherit binary casts from their originating types?)

Their labels are TEXT. I believe that the only type that should not be treated as text is bytea.

>>> 2) it's not utf8, so we just leave it as octets.
>>
>> Which means Perl will assume that it's Latin-1, IIUC.
>
> This is sub-optimal for non-UTF-8-encoded databases, for reasons I pointed out earlier. This would produce bogus
> results for any non-UTF-8, non-ASCII, non-latin-1 encoding, even if it did not generally bite most people in general
> usage.

Agreed.

> This example seems bogus; wouldn't length be 3 if this is the example text this was run with? Additionally, since
> all ASCII is trivially UTF-8, I think a better example would be using a string with hi-bit characters so if this was
> improperly handled the lengths wouldn't match; length($all_ascii) == length(encode_utf8($all_ascii)) vs length($hi_bit)
> < length(encode_utf8($hi_bit)). I don't see that this test shows us much with the test case as given. The is_utf8()
> function merely returns the state of the SV_utf8 flag, which doesn't speak to UTF-8 validity (i.e., this need not be set
> on ascii-only strings, which are still valid in the UTF-8 encoding), nor does it indicate that there are no hi-bit
> characters in the string (i.e., with encode_utf8($hi_bit_string), the source string $hi_bit_string (in perl's internal
> format) with hi-bit characters will have the utf8 flag set, but the return value of encode_utf8 will not, even though
> the underlying data, as represented in perl, will be identical).
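
The length comparison and the flag behaviour described in the quoted paragraph can be checked outside the database. This is a minimal sketch in plain Perl using only the core Encode module; the variable names follow the quoted text, and the sample strings are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Character strings in Perl's internal format (sample values are assumptions).
my $all_ascii = "hello";
my $hi_bit    = "\x{201C}hello\x{201D}";   # "hello" wrapped in curly quotes

# encode_utf8 returns octets, so length() on the result counts bytes.
my $ascii_octets  = encode_utf8($all_ascii);
my $hi_bit_octets = encode_utf8($hi_bit);

printf "ascii:  chars=%d octets=%d\n", length($all_ascii), length($ascii_octets);
printf "hi-bit: chars=%d octets=%d\n", length($hi_bit),    length($hi_bit_octets);

# utf8::is_utf8 reports only the internal flag, not validity or content.
printf "flag on character string: %d\n", utf8::is_utf8($hi_bit)        ? 1 : 0;
printf "flag on encoded octets:   %d\n", utf8::is_utf8($hi_bit_octets) ? 1 : 0;
```

For the curly-quoted string this gives 7 characters against 11 octets, while the ASCII-only string has the same length in both forms.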


Sorry, I probably had a pasto there. How about this?

    CREATE OR REPLACE FUNCTION perlgets(
        TEXT
    ) RETURNS TABLE(length INT, is_utf8 BOOL) LANGUAGE plperl AS $$
        my $text = shift;
        return_next {
            length  => length $text,
            is_utf8 => utf8::is_utf8($text) ? 1 : 0
        };
    $$;

    utf8=# SELECT * FROM perlgets('“hello”');
     length │ is_utf8
    ────────┼─────────
          7 │ t

    latin=# SELECT * FROM perlgets('“hello”');
     length │ is_utf8
    ────────┼─────────
         11 │ f

(Yes, I used Latin-1 curly quotes in that last example.) I would argue that it should output the same as the first
example. That is, PL/Perl should have decoded the latin-1 before passing the text to the Perl function.
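
To illustrate what "decoded before passing the text to the Perl function" would mean, here is a minimal sketch in plain Perl. The helper name decode_argument and the sample text are hypothetical; this is not how PL/Perl is implemented, only a demonstration that the same text yields the same character length once it is decoded from whichever encoding the database uses.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode an argument from the database encoding into
# Perl's internal character format. Name and signature are assumptions.
sub decode_argument {
    my ($octets, $server_encoding) = @_;
    return decode($server_encoding, $octets);
}

# "cafe" with an e-acute as stored in a Latin-1 database: one byte per character.
my $latin1_octets = "caf\xE9";
# The same text as stored in a UTF-8 database: the e-acute takes two bytes.
my $utf8_octets   = "caf\xC3\xA9";

my $from_latin1 = decode_argument($latin1_octets, 'ISO-8859-1');
my $from_utf8   = decode_argument($utf8_octets,   'UTF-8');

printf "latin-1: length=%d\n", length($from_latin1);
printf "utf-8:   length=%d\n", length($from_utf8);
print  "same text after decoding\n" if $from_latin1 eq $from_utf8;
```

With decoding in place, the function body sees four characters in both cases, regardless of how many bytes the server used to store them.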

>
>> In a latin-1 database:
>>
>>   latin=# select * from perlgets('foo');
>>    length │ is_utf8
>>   ────────┼─────────
>>         8 │ f
>>   (1 row)
>>
>> I would argue that in the latter case, is_utf8 should be true, too. That is, PL/Perl should decode from Latin-1 to
>> Perl's internal form.
>
> See above for discussion of the is_utf8 flag; if we're dealing with latin-1 data or (more precisely in this case)
> data that has not been decoded from the server_encoding to perl's internal format, this would exactly be the expectation
> for the state of that flag.

Right. I think that it *should* be decoded.

>> Interestingly, when I created a function that takes a bytea argument, utf8 was *still* enabled in the utf-8
>> database. That doesn't seem right to me.
>
> I'm not sure what you mean here, but I do think that if bytea is identifiable as one of the input types, we should do
> no encoding on the data itself, which would indicate that the utf8 flag for that variable would be unset.

Right.

> If this is not currently handled this way, I'd be a bit surprised, as bytea should just be an array of bytes with no
> character semantics attached to it.

It looks as though it is not handled that way. The utf8 flag *is* set on a bytea string passed to a PL/Perl function in
a UTF-8 database.

> As shown above, the character length for the example should be 27, while the octet length for the UTF-8 encoded
> version is 28. I've reviewed the source of URI::Escape, and can say definitively that: a) regular uri_escape does not
> handle > 255 code points in the encoding, but there exists a uri_escape_utf8 which will convert the source string to
> UTF8 first and then escape the encoded value, and b) uri_unescape has *no* logic in it to automatically decode from UTF8
> into perl's internal format (at least as far as the version that I'm looking at, which came with 5.10.1).

Right.

> -1; if you need to decode from an octets-only encoding, it's your responsibility to do so after you've unescaped it.
> Perhaps later versions of the URI::Escape module contain a uri_unescape_utf8() function, but it's trivially: sub
> uri_unescape_utf8 { Encode::decode_utf8(uri_unescape(shift)) }. This is definitely not a bug in uri_escape, as it is
> only defined to return octets.

Right, I think we're agreed on that count. I wouldn't mind seeing a uri_unescape_utf8() though, as it might prevent
some confusion.
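
The unescape-then-decode order discussed here can be sketched without depending on URI::Escape being installed. In the sketch below, unescape_octets is a hand-written stand-in for uri_unescape (an assumption, not the module's code), and uri_unescape_utf8 is the hypothetical helper from the quoted text, not an existing URI::Escape function.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for URI::Escape::uri_unescape (an assumption for this sketch):
# turns %XX escapes into raw octets and does no character decoding.
sub unescape_octets {
    my ($str) = @_;
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $str;
}

# Hypothetical helper along the lines of the quoted one-liner:
# unescape first, then decode the resulting octets as UTF-8.
sub uri_unescape_utf8 {
    return decode_utf8(unescape_octets(shift));
}

my $escaped = 'caf%C3%A9';
my $octets  = unescape_octets($escaped);     # still UTF-8 encoded octets
my $chars   = uri_unescape_utf8($escaped);   # decoded character string

printf "octets=%d chars=%d\n", length($octets), length($chars);
```

The unescaped value is five octets long, and only the explicit decode step turns it into a four-character string.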

>>> Yeah, the patch addresses this part.  Right now we just spit out
>>> whatever the internal format happens to be.
>>
>> Ah, excellent.
>
> I agree with the sentiments that: data (server_encoding) -> function parameters (-> perl internal) -> function return
> (-> server_encoding). This should be for any character-type data insofar as it is feasible, but ISTR there is already
> datatype-specific marshaling occurring.
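
The round trip in the quoted paragraph (server_encoding in, perl internal inside the function, server_encoding out) can be sketched as a pair of helpers. Everything named here (needs_conversion, marshal_in, marshal_out, the type strings) is hypothetical and only mirrors the rules proposed in this thread: no conversion for bytea or SQL_ASCII, decode and encode for everything else. It is not a description of what PL/Perl actually does.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical marshaling helpers; names, type strings, and encoding names
# are assumptions for illustration, not PL/Perl internals.
sub needs_conversion {
    my ($type, $server_encoding) = @_;
    return 0 if $type eq 'bytea';                 # raw bytes, no character semantics
    return 0 if $server_encoding eq 'SQL_ASCII';  # byte soup, nothing to convert
    return 1;
}

sub marshal_in {
    my ($octets, $type, $server_encoding) = @_;
    return $octets unless needs_conversion($type, $server_encoding);
    return decode($server_encoding, $octets);
}

sub marshal_out {
    my ($value, $type, $server_encoding) = @_;
    return $value unless needs_conversion($type, $server_encoding);
    return encode($server_encoding, $value);
}

# "cafe!" with an e-acute, as UTF-8 octets (6 bytes, 5 characters).
my $stored = "caf\xC3\xA9!";

# Text argument: decoded on the way in, so substr() counts characters.
my $text   = marshal_in($stored, 'text', 'UTF-8');
my $result = marshal_out(substr($text, 0, 4), 'text', 'UTF-8');

# bytea argument: passed through untouched, so substr() counts bytes.
my $bytes  = marshal_in($stored, 'bytea', 'UTF-8');
my $sliced = marshal_out(substr($bytes, 0, 4), 'bytea', 'UTF-8');

printf "text result:  %d octets\n", length($result);
printf "bytea result: %d octets\n", length($sliced);
```

Taking the first four units of the text value keeps the accented character whole (five octets on the way out), while the same operation on the bytea value cuts after four bytes, which is the behaviour one would expect from a type with no character semantics.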

Dunno about that.

> There is definitely a lot of confusion surrounding perl's handling of character data; I hope this was able to clear a
> few things up.

Yes, it helped, thanks!

David