Thread: Bad Data Back Door
Hackers,

I’ve discovered something a bit disturbing at $work. We’re migrating
(slowly) from Oracle to PostgreSQL, and in some cases are using
oracle_fdw to copy data over. Alas, there are a fair number of text
values in the Oracle database that, although the database is UTF-8,
are actually something else (CP1252 or Latin1). When we copy from an
oracle_fdw foreign table into a PostgreSQL table, PostgreSQL does not
complain, but ends up storing the mis-encoded strings, even though
the database is UTF-8.

I assume that this is because the foreign table, as a table, is
assumed by the system to have valid data, and therefore additional
character encoding validation is skipped, yes?

If so, I’m wondering if it might be possible to add some sort of
option to the CREATE FOREIGN TABLE statement to the effect that
certain values should not be trusted to be in the encoding they say
they are.

At any rate, I’m spending some quality time re-encoding bogus values
I never expected to see in our systems. :-(

Thanks,

David
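The failure mode David describes can be reproduced outside the database. A minimal sketch in Python (the sample string is invented for illustration), showing how CP1252 bytes are perfectly legal in their own charset yet fail strict UTF-8 validation:

```python
# A CP1252-encoded string containing "smart quotes" (0x93/0x94) and
# e-acute (0xE9): legal CP1252, but not a valid UTF-8 byte sequence.
cp1252_bytes = "\u201cr\u00e9sum\u00e9\u201d".encode("cp1252")

# Strict UTF-8 decoding rejects it -- the check PostgreSQL applies on
# ordinary input paths such as COPY, but which the FDW path skipped.
try:
    cp1252_bytes.decode("utf-8")
    print("accepted")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

If such bytes are stored unchecked, the database ends up holding CP1252 data labeled as UTF-8, exactly as described above.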
"David E. Wheeler" <david@justatheory.com> writes: > I�ve discovered something a bit disturbing at $work. We�re migrating (slowly) from Oracle to PostgreSQL, and in some casesare using oracle_fdw to copy data over. Alas, there are a fair number of text values in the Oracle database that, althoughthe database is UTF-8, are actually something else (CP1252 or Latin1). When we copy from an oracle_fdw foreign tableinto a PostgreSQL table, PostgreSQL does not complain, but ends up storing the mis-encoded strings, even though thedatabase is UTF-8. > I assume that this is because the foreign table, as a table, is assumed by the system to have valid data, and thereforadditional character encoding validation is skipped, yes? Probably not so much "assumed" as "nobody thought about it". In e.g. plperl we expend the cycles to do encoding validity checking on *every* string entering the system from Perl. I'm not sure why foreign tables ought to get a pass on that, especially when you consider the communication overhead that the encoding check would be amortized against. Now, having said that, I think it has to be the reponsibility of the FDW to apply any required check ... which makes this a bug report against oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the COPY code, which will check encoding.) regards, tom lane
On 10/05/12 6:12 PM, Tom Lane wrote:
> Now, having said that, I think it has to be the responsibility of the
> FDW to apply any required check ... which makes this a bug report
> against oracle_fdw, not the core system.  (FWIW, contrib/file_fdw
> depends on the COPY code, which will check encoding.)

I'm not sure of that. What if the FDW is used to connect to (say) a
Postgres database that is in POSIX/C? Is that checked for?

I'd like to see some encoding validation and substitution functions in
Postgres. For instance, one that can take any supported encoding and
convert it to the database encoding, generating an error on any
invalid character. This translation could be the identity (e.g.,
UTF8->UTF8), whereupon it would just validate. A second function would
do the same, but replace errors with the substitution character in the
target charset rather than erroring.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast
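The two functions John proposes map onto strict versus substituting conversion in most codec libraries. A hedged Python sketch of the proposed semantics (the function names are invented, and Python's substitution characters -- U+FFFD on decode, '?' on encode -- stand in for whatever the target charset would use):

```python
def convert_strict(data: bytes, src: str, dst: str) -> bytes:
    """Convert between encodings, raising on any invalid byte.
    With src == dst this degenerates to pure validation."""
    return data.decode(src).encode(dst)

def convert_lossy(data: bytes, src: str, dst: str) -> bytes:
    """Same, but substitute for invalid input instead of raising."""
    return data.decode(src, errors="replace").encode(dst, errors="replace")

# Identity conversion doubles as validation (UTF8 -> UTF8):
convert_strict(b"caf\xc3\xa9", "utf-8", "utf-8")   # passes unchanged
convert_lossy(b"caf\xe9", "utf-8", "utf-8")        # bad byte -> U+FFFD
```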
On 06.10.2012 05:14, John R Pierce wrote:
> I'd like to see some encoding validation and substitution functions
> in Postgres. For instance, one that can take any supported encoding
> and convert it to the database encoding, generating an error on any
> invalid character. This translation could be the identity (e.g.,
> UTF8->UTF8), whereupon it would just validate.

See pg_any_to_server() in mbutils.c. At the SQL level, there's the
convert(bytea, name, name) function.

> A second function would do the same, but replace errors with the
> substitution character in the target charset rather than erroring.

Hmm, I don't think we have that.

- Heikki
On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Probably not so much "assumed" as "nobody thought about it".  In
> e.g. plperl we expend the cycles to do encoding validity checking on
> *every* string entering the system from Perl.  I'm not sure why
> foreign tables ought to get a pass on that, especially when you
> consider the communication overhead that the encoding check would be
> amortized against.

Yes, that’s what I was thinking.

> Now, having said that, I think it has to be the responsibility of the
> FDW to apply any required check ... which makes this a bug report
> against oracle_fdw, not the core system.  (FWIW, contrib/file_fdw
> depends on the COPY code, which will check encoding.)

I agree that this is a bug in oracle_fdw (well, potentially;
ultimately, it’s Oracle that’s lying about the encoding of those text
values). But I think that it would be much more useful overall -- not
to mention more database-like -- for PostgreSQL to provide a way to
enforce it. That is, to consider foreign tables to be an input like
COPY or SQL, and to validate values before displaying them.

Best,

David
On Sat, Oct 6, 2012 at 1:34 PM, David E. Wheeler <david@justatheory.com> wrote:
> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Probably not so much "assumed" as "nobody thought about it".  In
>> e.g. plperl we expend the cycles to do encoding validity checking on
>> *every* string entering the system from Perl.  I'm not sure why
>> foreign tables ought to get a pass on that, especially when you
>> consider the communication overhead that the encoding check would be
>> amortized against.
>
> Yes, that’s what I was thinking.
>
>> Now, having said that, I think it has to be the responsibility of
>> the FDW to apply any required check ... which makes this a bug
>> report against oracle_fdw, not the core system.  (FWIW,
>> contrib/file_fdw depends on the COPY code, which will check
>> encoding.)
>
> I agree that this is a bug in oracle_fdw (well, potentially;
> ultimately, it’s Oracle that’s lying about the encoding of those
> text values). But I think that it would be much more useful overall
> -- not to mention more database-like -- for PostgreSQL to provide a
> way to enforce it. That is, to consider foreign tables to be an
> input like COPY or SQL, and to validate values before displaying
> them.
>
> Best,
>
> David
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

+1

--
Regards,
Atri
l'apprenant
On 10/06/12 3:45 AM, Heikki Linnakangas wrote:
> At the SQL level, there's the convert(bytea, name, name) function.

ahhh, right. (forehead slap)

>> A second function would do the same, but replace errors with the
>> substitution character in the target charset rather than erroring.
>
> Hmm, I don't think we have that.

Methinks this would be extremely useful for importing 'dirty' data --
that, or a per-connection flag (or an option on the COPY command?)
that says "substitute on error" for the likes of UTF-8 imports from
CSV.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast
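Pending such a COPY option, the substitute-on-error behavior can be approximated by scrubbing the file before COPY sees it. A pre-pass sketch in Python (the sample CSV data is invented for illustration):

```python
import csv
import io

def scrub_to_utf8(raw: bytes) -> str:
    # Replace any byte sequence that is not valid UTF-8 with U+FFFD
    # instead of aborting the whole import.
    return raw.decode("utf-8", errors="replace")

dirty = b"id,name\n1,caf\xe9\n2,ok\n"   # 0xE9: a bare Latin-1 byte
rows = list(csv.reader(io.StringIO(scrub_to_utf8(dirty))))
# The bad value survives as 'caf\ufffd': importable, and the
# substitution character flags it for later cleanup.
```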
"David E. Wheeler" <david@justatheory.com> writes: > On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Now, having said that, I think it has to be the reponsibility of the FDW >> to apply any required check ... which makes this a bug report against >> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the >> COPY code, which will check encoding.) > I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it�s Oracle that�s lying about the encoding ofthose text values). But I think that it would be much more useful overall -- not to mention more database-like -- for PostgreSQLto provide a way to enforce it. That is, to consider foreign tables to be an input like COPY or SQL, and to validatevalues before displaying them. It is the FDW's responsibility to deal with this. We expect it to hand back valid tuples; it is not reasonable to disassemble them looking for mistakes (and we couldn't catch most mistakes, anyway). If the interface were defined in terms of text, we could do checking above the FDW level ... but it isn't. regards, tom lane
On 10/06/2012 03:35 PM, Tom Lane wrote:
> "David E. Wheeler" <david@justatheory.com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Now, having said that, I think it has to be the responsibility of
>>> the FDW to apply any required check ... which makes this a bug
>>> report against oracle_fdw, not the core system.  (FWIW,
>>> contrib/file_fdw depends on the COPY code, which will check
>>> encoding.)
>> I agree that this is a bug in oracle_fdw (well, potentially;
>> ultimately, it’s Oracle that’s lying about the encoding of those
>> text values). But I think that it would be much more useful overall
>> -- not to mention more database-like -- for PostgreSQL to provide a
>> way to enforce it. That is, to consider foreign tables to be an
>> input like COPY or SQL, and to validate values before displaying
>> them.
> It is the FDW's responsibility to deal with this.  We expect it to
> hand back valid tuples; it is not reasonable to disassemble them
> looking for mistakes (and we couldn't catch most mistakes, anyway).
> If the interface were defined in terms of text, we could do checking
> above the FDW level ... but it isn't.

Exactly. We've done quite a lot of work over the years tightening the
ways that badly encoded data can enter the database. It's a
never-ending game of whack-a-mole. There aren't any easy answers.

cheers

andrew
Tom Lane wrote:
> "David E. Wheeler" <david@justatheory.com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Now, having said that, I think it has to be the responsibility of
>>> the FDW to apply any required check ... which makes this a bug
>>> report against oracle_fdw, not the core system.  (FWIW,
>>> contrib/file_fdw depends on the COPY code, which will check
>>> encoding.)
>> I agree that this is a bug in oracle_fdw (well, potentially;
>> ultimately, it's Oracle that's lying about the encoding of those
>> text values). But I think that it would be much more useful overall
>> -- not to mention more database-like -- for PostgreSQL to provide a
>> way to enforce it. That is, to consider foreign tables to be an
>> input like COPY or SQL, and to validate values before displaying
>> them.
> It is the FDW's responsibility to deal with this.  We expect it to
> hand back valid tuples; it is not reasonable to disassemble them
> looking for mistakes (and we couldn't catch most mistakes, anyway).
> If the interface were defined in terms of text, we could do checking
> above the FDW level ... but it isn't.

As the author, I agree that this is a bug in oracle_fdw.

This was caused by ignorance on my part: I had assumed that the type
input functions would perform the necessary checks, but it seems that
is not the case. I'll look into it.

Oracle does not care much about correct encoding. If the client
character set and the database character set are the same, Oracle does
not bother to check the data. This is probably how WINDOWS-1252
characters slipped into the UTF-8 database in question. I consider
this a bug in Oracle, but never reported it, because I don't have much
hope that Oracle would see it as a problem, given their habitually
sloppy handling of encoding issues.

Yours,
Laurenz Albe
On Oct 8, 2012, at 12:25 AM, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote:

> As the author I agree that this is a bug in oracle_fdw.

Thanks. Should I file a report somewhere?

> This was caused by ignorance on my part: I had assumed that the
> type input functions would perform the necessary checks, but it
> seems like that is not the case. I'll look into it.

Thank you!

> Oracle does not care much about correct encoding.
> If client character set and database character set are the same,
> Oracle does not bother to check the data. This is probably how
> WINDOWS-1252 characters slipped into the UTF-8 database in question.
> I consider this a bug in Oracle, but never reported it, because
> I don't have much hope that Oracle would see it as a problem
> given their habitually sloppy handling of encoding issues.

Yeah, same here. I've been looking into writing a function to try to
fix poorly-encoded data, but haven't gotten far, because CONVERT()
does not indicate failure. If you have any insight, I'd appreciate
your thoughts on this Stack Overflow question:

  http://stackoverflow.com/q/12717363/79202

Thanks,

David
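For the repair problem in that Stack Overflow question, one heuristic that works outside the database (sketched here in Python; it assumes the only other encoding in play is CP1252, which the data may not guarantee) is to keep values that validate as UTF-8 and transcode the rest:

```python
def fix_mislabeled(raw: bytes) -> str:
    """Keep bytes that are already valid UTF-8; otherwise assume they
    are really CP1252 and transcode. A heuristic, not a proof: some
    CP1252 byte strings also happen to be valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")

fix_mislabeled(b"caf\xc3\xa9")   # already UTF-8  -> 'café'
fix_mislabeled(b"caf\xe9")       # CP1252 bytes   -> 'café'
```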
On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Now, having said that, I think it has to be the responsibility of the
> FDW to apply any required check ... which makes this a bug report
> against oracle_fdw, not the core system.  (FWIW, contrib/file_fdw
> depends on the COPY code, which will check encoding.)

FWIW, I believe that dblink does not check encoding.

Best,

David
"David E. Wheeler" <david@justatheory.com> writes: > On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Now, having said that, I think it has to be the reponsibility of the FDW >> to apply any required check ... which makes this a bug report against >> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the >> COPY code, which will check encoding.) > FWIW, I believe that dblink does not check encoding. In dblink's case, that boils down to trusting a remote instance of Postgres to get this right, which doesn't seem totally unreasonable. But I wouldn't object to adding checks there if someone wanted to submit a patch. regards, tom lane
On Oct 8, 2012, at 11:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> FWIW, I believe that dblink does not check encoding.
>
> In dblink's case, that boils down to trusting a remote instance of
> Postgres to get this right, which doesn't seem totally unreasonable.
> But I wouldn't object to adding checks there if someone wanted to
> submit a patch.

Yeah, I found this because we had a dblink to another PostgreSQL
server's table with data populated from oracle_fdw. I guess trusting
it is reasonable, though. I wonder about dbi-link, though…

David
On 10/08/2012 02:13 PM, Tom Lane wrote:
> "David E. Wheeler" <david@justatheory.com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Now, having said that, I think it has to be the responsibility of
>>> the FDW to apply any required check ... which makes this a bug
>>> report against oracle_fdw, not the core system.  (FWIW,
>>> contrib/file_fdw depends on the COPY code, which will check
>>> encoding.)
>> FWIW, I believe that dblink does not check encoding.
> In dblink's case, that boils down to trusting a remote instance of
> Postgres to get this right, which doesn't seem totally unreasonable.
> But I wouldn't object to adding checks there if someone wanted to
> submit a patch.

It does do:

    PQsetClientEncoding(conn, GetDatabaseEncodingName());

I'd be mildly reluctant to do anything more, except possibly as an
option, unless it could be shown to have minimal performance impact.

cheers

andrew
David E. Wheeler wrote:
>> As the author I agree that this is a bug in oracle_fdw.
>
> Thanks. Should I file a report somewhere?

That's not necessary. Thanks for reporting the problem.
It may be a few days until I get around to fixing it.

>> Oracle does not care much about correct encoding.
>
> Yeah, same here. I've been looking into writing a function to try
> to fix poorly-encoded data, but haven't gotten far, because
> CONVERT() does not indicate failure. If you have any insight,
> I'd appreciate your thoughts on this Stack Overflow question:
>
> http://stackoverflow.com/q/12717363/79202

The only thing I can think of is a stored procedure in Java. You
could use java.nio.charset.CharsetEncoder and
java.nio.charset.CharsetDecoder, which should throw exceptions if
they encounter illegal bytes.

Yours,
Laurenz Albe
I wrote:
> Tom Lane wrote:
>>>> Now, having said that, I think it has to be the responsibility of
>>>> the FDW to apply any required check ... which makes this a bug
>>>> report against oracle_fdw, not the core system.
>
> As the author I agree that this is a bug in oracle_fdw.

OK, fixed. David, could you try the latest version from CVS to see if
it works for you?

Yours,
Laurenz Albe