
Re: plperlu problem with utf8 - Mailing list pgsql-hackers

From	David E. Wheeler
Subject	Re: plperlu problem with utf8
Date	December 16, 2010 23:25:00
Msg-id	C9982425-2453-479A-88FB-D12B6F20839B@kineticode.com
In response to	Re: plperlu problem with utf8 (Alex Hunsaker <badalex@gmail.com>)
Responses	Re: plperlu problem with utf8 Re: plperlu problem with utf8
List	pgsql-hackers


On Dec 16, 2010, at 6:39 PM, Alex Hunsaker wrote:

> You might argue this is a bug with URI::Escape as I *think* all uri's
> will be utf8 encoded.  Anyway, I think postgres is doing the right
> thing here.

No, URI::Escape is fine. The issue is that if you don't decode text to Perl's internal form, Perl assumes that it's Latin-1.
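
To illustrate the Latin-1 assumption, here is a minimal sketch using only the core Encode module. The variable names are made up for the example, and the input is assumed to be the UTF-8 octets for U+00E9; none of this is code from the thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 octets for U+00E9 (e with acute), as they might arrive undecoded.
my $octets = "\xC3\xA9";

# Decoded to Perl's internal character form: one character.
my $chars = decode('UTF-8', $octets);

# Encoding the undecoded octets treats each byte as a Latin-1 character,
# so the data ends up double-encoded.
my $double = encode('UTF-8', $octets);

# Encoding the decoded string round-trips to the original octets.
my $round = encode('UTF-8', $chars);

printf "undecoded length: %d\n", length $octets;
printf "decoded length:   %d\n", length $chars;
printf "double-encoded:   %s\n", unpack('H*', $double);
printf "round trip:       %s\n", unpack('H*', $round);
```

The undecoded string counts as two characters, and encoding it again yields four octets instead of two.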

> In playing around I did find what I think is a postgres bug.  Perl has
> 2 ways it can store things internally.  per perldoc perlunicode:
>
> Using Unicode in XS
> ... What the "UTF8" flag means is that the sequence of octets in the
> representation of the scalar is the sequence of UTF-8 encoded code
> points of the characters of a string.  The "UTF8" flag being off means
> that each octet in this representation encodes a single character with
> code point 0..255 within the string.
>
> Postgres always prints whatever the internal representation happens to
> be ignoring the UTF8 flag and the server encoding.
>
> # create or replace function chr(i int, i2 int) returns text as $$
> return chr($_[0]).chr($_[1]); $$ language plperlu;
> CREATE FUNCTION
>
> # show server_encoding;
> server_encoding
> -----------------
> SQL_ASCII
>
> # SELECT length(chr(128, 33));
> length
> --------
>      2
>
> # SELECT length(chr(128, 333));
> length
> --------
>      4
>
> Grr that should error out with "Invalid server encoding", or worst
> case should return a length of 3 (it utf8 encoded 128 into 2 bytes
> instead of leaving it as 1).  In this case the 333 causes perl to store
> it internally as utf8.

Well with SQL_ASCII anything goes, no?
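
For reference, the lengths 2 and 4 in the quoted SQL_ASCII session can be reproduced in plain Perl by looking at the internal buffer. This is only a sketch: `internal_info` is a hypothetical helper, and `bytes::length` is used purely as a debugging peek at the internal representation.

```perl
use strict;
use warnings;
use bytes ();   # loaded only so bytes::length() is available

# Hypothetical helper: report character count, internal octet count,
# and whether the UTF8 flag is set on the scalar.
sub internal_info {
    my ($s) = @_;
    return {
        chars     => length($s),
        octets    => bytes::length($s),
        utf8_flag => utf8::is_utf8($s) ? 1 : 0,
    };
}

my $narrow = internal_info(chr(128) . chr(33));    # both code points < 256
my $wide   = internal_info(chr(128) . chr(333));   # 333 forces UTF-8 storage

for my $case ([narrow => $narrow], [wide => $wide]) {
    my ($name, $info) = @$case;
    printf "%-6s chars=%d octets=%d utf8_flag=%d\n",
        $name, @{$info}{qw(chars octets utf8_flag)};
}
```

Both strings are two characters long, but the second is held as four octets internally because chr(128) is upgraded to a two-octet sequence alongside chr(333).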

> Now on a utf8 database:
>
> # show server_encoding;
> server_encoding
> -----------------
> UTF8
>
> # SELECT length(chr(128, 33));
> ERROR:  invalid byte sequence for encoding "UTF8": 0x80
> CONTEXT:  PL/Perl function "chr"
>
> # SELECT length(chr(128, 333));
> CONTEXT:  PL/Perl function "chr"
> length
> --------
>      2
>
> Same thing here, we just end up using the internal format.  In one
> case it works, in the other it does not.  The main point being, most of
> the time it *happens* to work.  But it's really just by chance.
>
> I think what we should do is use SvPVutf8() when we are UTF8 instead
> of SvPV in sv2text_mbverified().  SvPV gives us a pointer to a string
> in perl's current internal format (maybe unicode, maybe a utf8 byte
> sequence), while SvPVutf8 will always give us a utf8 (may or may not be
> valid!) encoded string.
>
> Something like the attached.  Thoughts? I'm not very happy with the non
> utf8 case--  The elog(ERROR, "invalid byte sequence") is a total
> cop-out, yes.  But I did not see a good solution short of hand rolling
> our own version of sv_utf8_downgrade().  Is it worth it?
> <plperl_encoding.patch>
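
A rough Perl-level analogue of the SvPV versus SvPVutf8 distinction, offered as a sketch rather than as what the attached patch does: `utf8::encode` on a copy yields UTF-8 octets regardless of how the scalar happens to be stored. `as_utf8_octets` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 octets for a string without
# depending on its internal representation (similar in spirit to SvPVutf8).
sub as_utf8_octets {
    my ($s) = @_;
    my $copy = $s;          # utf8::encode works in place, so encode a copy
    utf8::encode($copy);
    return $copy;
}

my $narrow = as_utf8_octets(chr(128) . chr(33));
my $wide   = as_utf8_octets(chr(128) . chr(333));

printf "chr(128).chr(33)  -> %s (%d octets)\n", unpack('H*', $narrow), length $narrow;
printf "chr(128).chr(333) -> %s (%d octets)\n", unpack('H*', $wide),   length $wide;
```

With this approach the first string comes out as three octets, which corresponds to the "length of 3" case mentioned in the quoted text, rather than as the lone 0x80 byte that a UTF8 server rejects.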

Maybe I'm misunderstanding, but it seems to me that:

* String arguments passed to PL/Perl functions should be decoded from the server encoding to Perl's internal
representation before the function actually gets them.

* Values returned from PL/Perl functions that are in Perl's internal representation should be encoded into the server
encoding before they're returned.
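
The two points above can be sketched in plain Perl with the core Encode module. `call_with_server_encoding` is a hypothetical wrapper written for this illustration, not anything in PL/Perl, and it assumes the server encoding is one Encode knows by name (so it does not cover SQL_ASCII).

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper: decode arguments from the server encoding, call the
# function on character strings, then encode the result back.
sub call_with_server_encoding {
    my ($server_encoding, $func, @raw_args) = @_;

    # Die on malformed input, and leave the caller's data untouched.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    my @args   = map { Encode::decode($server_encoding, $_, $check) } @raw_args;
    my $result = $func->(@args);
    return Encode::encode($server_encoding, $result, $check);
}

my $reverse = sub { scalar reverse $_[0] };
my $input   = "a\xC3\xA9";    # UTF-8 octets for "a" followed by U+00E9

my $wrapped = call_with_server_encoding('UTF-8', $reverse, $input);
my $naive   = $reverse->($input);   # operates on raw octets

printf "wrapped: %s\n", unpack('H*', $wrapped);
printf "naive:   %s\n", unpack('H*', $naive);
```

Reversing the decoded string keeps the two-octet character intact, while reversing the raw octets splits it and produces an invalid UTF-8 sequence.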

I didn't really follow all of the above; are you aiming for the same thing?

Best,

David
