plperlu problem with utf8 [REVIEW] - Mailing list pgsql-hackers
From | Andy Colson |
---|---|
Subject | plperlu problem with utf8 [REVIEW] |
Date | |
Msg-id | 4D320FA6.3000005@squeakycode.net |
In response to | Re: plperlu problem with utf8 (Alex Hunsaker <badalex@gmail.com>) |
Responses |
Re: plperlu problem with utf8 [REVIEW]
|
List | pgsql-hackers |
This is a review of "plperl encoding issues"
https://commitfest.postgresql.org/action/patch_view?id=452

Purpose:
========

Your database uses one encoding, and passes data to perl in the same encoding, which perl is not prepared for (it assumes UTF-8). This patch makes sure data is encoded into UTF-8 before it's passed to plperl, then converts the response from UTF-8 back to the database encoding for storage.

My test:

ptest2=# create database ptest2 encoding 'EUC_JP' template template0;

I created a simple perl function that reverses the string. I don't know Japanese, so I found a tattoo website that had sayings in Japanese... I picked: "I am awesome".

create or replace function preverse(x text) returns text as $$
    my $tmp = reverse($_[0]);
    return $tmp;
$$ LANGUAGE plperl;

Before the patch:

ptest2=# select preverse('私はよだれを垂らす');
      preverse
--------------------
 垢蕕眇鬚譴世茲呂篁
(1 row)

It is also possible to generate invalid characters. This function pulls off the last character in the string... assuming it's UTF-8:

create or replace function plastchar(x text) returns text as $$
    my $tmp = substr($_[0], -1);
    return $tmp;
$$ LANGUAGE plperl;

ptest2=# select plastchar('私はよだれを垂らす');
ERROR:  invalid byte sequence for encoding "EUC_JP": 0xb9
CONTEXT:  PL/Perl function "plastchar"

Because the string was not UTF-8, perl got confused and returned an invalid character.

After the patch:

The exact same plperl functions work fine:

ptest2=# select preverse('私はよだれを垂らす');
      preverse
--------------------
 すら垂をれだよは私
(1 row)

ptest2=# select plastchar('私はよだれを垂らす');
 plastchar
-----------
 す
(1 row)

Performance:
============

This is a bug fix, not a performance patch. However, as noted by the author, many encodings will be very UTF-8'ish and the overhead will be very small. For those encodings that do need to be converted, you'd need to do the same conversion inside your perl function anyway before you could use the data. The processing has just moved from inside your perl func to inside PG.
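To illustrate the byte-versus-character problem and the conversion step discussed above, here is a small sketch in plain Python. It is not plperl and not code from the patch; the helper names byte_reverse and with_conversion are made up for this example, and Python's "euc_jp" codec is assumed to be a fair stand-in for the database encoding.

```python
# Illustration only: plain Python standing in for the database/perl boundary.
# byte_reverse and with_conversion are hypothetical names, not part of the patch.
ENCODING = "euc_jp"  # the database encoding used in the test above


def byte_reverse(raw: bytes) -> bytes:
    """Roughly the pre-patch situation: the string is handled byte by byte."""
    return raw[::-1]


def with_conversion(raw: bytes, func) -> bytes:
    """Roughly the post-patch situation: decode, run func on characters, encode back."""
    return func(raw.decode(ENCODING)).encode(ENCODING)


text = "私はよだれを垂らす"
raw = text.encode(ENCODING)  # two bytes per character in EUC_JP

# Reversing bytes also swaps the two bytes inside each character,
# so the result is not the character-wise reversal.
print(byte_reverse(raw) == text[::-1].encode(ENCODING))  # False

# The last byte on its own is only half of a character.
try:
    raw[-1:].decode(ENCODING)
except UnicodeDecodeError:
    print("invalid byte sequence:", raw[-1:].hex())  # b9, as in the error above

# With the conversion in place, both operations work on whole characters.
print(with_conversion(raw, lambda s: s[::-1]).decode(ENCODING) == text[::-1])  # True
print(with_conversion(raw, lambda s: s[-1:]).decode(ENCODING) == text[-1])     # True
```

The decode and encode calls in with_conversion are the work that would otherwise have to be done by hand inside each function.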
The Patch:
==========

Applies cleanly to git head as of January 15 2011. PG built with --enable-cassert and --enable-debug seems to run fine with no errors.

I don't think the regression tests cover plperl, so it is understandable that there are no tests in the patch.

There are no manual updates in the patch either, and I think there should be. I think it should be made clear that data (varchar, text, etc., but not bytea) will be passed to perl as UTF-8, regardless of database encoding. Also that "use utf8;" is always loaded and in use.

Code Review:
============

I am not qualified. Looking through the patch, I'm reminded of the old saying: "Any sufficiently advanced perl XS code is indistinguishable from magic" :-)

Other Remarks:
==============

- Yes I know... it was a joke.
- I sure hope this posts to the news group ok
- My terminal (konsole) had a hard time displaying Japanese, so I used psql's \i and \o to read/write files that kwrite showed/encoded correctly via EUC_JP

Summary:
========

Looks good. Looks needed. Needs manual updates.