Thread: Bad Data Back Door
Hackers,

I’ve discovered something a bit disturbing at $work. We’re migrating
(slowly) from Oracle to PostgreSQL, and in some cases are using
oracle_fdw to copy data over. Alas, there are a fair number of text
values in the Oracle database that, although the database is UTF-8,
are actually something else (CP1252 or Latin1). When we copy from an
oracle_fdw foreign table into a PostgreSQL table, PostgreSQL does not
complain, but ends up storing the mis-encoded strings, even though
the database is UTF-8.

I assume that this is because the foreign table, as a table, is
assumed by the system to have valid data, and therefore additional
character encoding validation is skipped, yes?

If so, I’m wondering if it might be possible to add some sort of
option to the CREATE FOREIGN TABLE statement to the effect that
certain values should not be trusted to be in the encoding they say
they are.

At any rate, I’m spending some quality time re-encoding bogus values
I never expected to see in our systems. :-(

Thanks,

David
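The failure mode David describes can be reproduced outside the database. A minimal sketch in Python (the sample string is invented for illustration), showing how CP1252 bytes are perfectly legal in their own charset yet fail strict UTF-8 validation:

```python
# A CP1252-encoded string containing "smart quotes" (0x93/0x94) and
# e-acute (0xE9): legal CP1252, but not a valid UTF-8 byte sequence.
cp1252_bytes = "\u201cr\u00e9sum\u00e9\u201d".encode("cp1252")

# Strict UTF-8 decoding rejects it -- the check PostgreSQL applies on
# ordinary input paths such as COPY, but which the FDW path skipped.
try:
    cp1252_bytes.decode("utf-8")
    print("accepted")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

If such bytes are stored unchecked, the database ends up holding CP1252 data labeled as UTF-8, exactly as described above.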
"David E. Wheeler" <david@justatheory.com> writes: > I�ve discovered something a bit disturbing at $work. We�re migrating (slowly) from Oracle to PostgreSQL, and in some casesare using oracle_fdw to copy data over. Alas, there are a fair number of text values in the Oracle database that, althoughthe database is UTF-8, are actually something else (CP1252 or Latin1). When we copy from an oracle_fdw foreign tableinto a PostgreSQL table, PostgreSQL does not complain, but ends up storing the mis-encoded strings, even though thedatabase is UTF-8. > I assume that this is because the foreign table, as a table, is assumed by the system to have valid data, and thereforadditional character encoding validation is skipped, yes? Probably not so much "assumed" as "nobody thought about it". In e.g. plperl we expend the cycles to do encoding validity checking on *every* string entering the system from Perl. I'm not sure why foreign tables ought to get a pass on that, especially when you consider the communication overhead that the encoding check would be amortized against. Now, having said that, I think it has to be the reponsibility of the FDW to apply any required check ... which makes this a bug report against oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the COPY code, which will check encoding.) regards, tom lane
On 10/05/12 6:12 PM, Tom Lane wrote:
> Now, having said that, I think it has to be the responsibility of the
> FDW to apply any required check ... which makes this a bug report
> against oracle_fdw, not the core system.  (FWIW, contrib/file_fdw
> depends on the COPY code, which will check encoding.)

I'm not sure of that. What if the FDW is used to connect to (say) a
Postgres database that is in POSIX/C? Is that checked for?

I'd like to see some encoding validation and substitution functions in
Postgres. For instance, one that can take any supported encoding and
convert it to the database encoding, generating an error on any
invalid character. This translation could be the identity (e.g.,
UTF8->UTF8), whereupon it would just validate. A second function would
do the same, but replace errors with the substitution character in the
target charset rather than erroring.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast
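The two functions John proposes map onto strict versus substituting conversion in most codec libraries. A hedged Python sketch of the proposed semantics (the function names are invented, and Python's substitution characters -- U+FFFD on decode, '?' on encode -- stand in for whatever the target charset would use):

```python
def convert_strict(data: bytes, src: str, dst: str) -> bytes:
    """Convert between encodings, raising on any invalid byte.
    With src == dst this degenerates to pure validation."""
    return data.decode(src).encode(dst)

def convert_lossy(data: bytes, src: str, dst: str) -> bytes:
    """Same, but substitute for invalid input instead of raising."""
    return data.decode(src, errors="replace").encode(dst, errors="replace")

# Identity conversion doubles as validation (UTF8 -> UTF8):
convert_strict(b"caf\xc3\xa9", "utf-8", "utf-8")   # passes unchanged
convert_lossy(b"caf\xe9", "utf-8", "utf-8")        # bad byte -> U+FFFD
```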
On 06.10.2012 05:14, John R Pierce wrote:
> I'd like to see some encoding validation and substitution functions
> in Postgres. For instance, one that can take any supported encoding
> and convert it to the database encoding, generating an error on any
> invalid character. This translation could be the identity (e.g.,
> UTF8->UTF8), whereupon it would just validate.

See pg_any_to_server() in mbutils.c. At the SQL level, there's the
convert(bytea, name, name) function.

> A second function would do the same, but replace errors with the
> substitution character in the target charset rather than erroring.

Hmm, I don't think we have that.

- Heikki
On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Probably not so much "assumed" as "nobody thought about it".  In
> e.g. plperl we expend the cycles to do encoding validity checking on
> *every* string entering the system from Perl.  I'm not sure why
> foreign tables ought to get a pass on that, especially when you
> consider the communication overhead that the encoding check would be
> amortized against.

Yes, that’s what I was thinking.

> Now, having said that, I think it has to be the responsibility of the
> FDW to apply any required check ... which makes this a bug report
> against oracle_fdw, not the core system.  (FWIW, contrib/file_fdw
> depends on the COPY code, which will check encoding.)

I agree that this is a bug in oracle_fdw (well, potentially;
ultimately, it’s Oracle that’s lying about the encoding of those text
values). But I think that it would be much more useful overall -- not
to mention more database-like -- for PostgreSQL to provide a way to
enforce it. That is, to consider foreign tables to be an input like
COPY or SQL, and to validate values before displaying them.

Best,

David
On Sat, Oct 6, 2012 at 1:34 PM, David E. Wheeler <david@justatheory.com> wrote:
> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> Probably not so much "assumed" as "nobody thought about it".  In
>> e.g. plperl we expend the cycles to do encoding validity checking on
>> *every* string entering the system from Perl.  I'm not sure why
>> foreign tables ought to get a pass on that, especially when you
>> consider the communication overhead that the encoding check would be
>> amortized against.
>
> Yes, that’s what I was thinking.
>
>> Now, having said that, I think it has to be the responsibility of
>> the FDW to apply any required check ... which makes this a bug
>> report against oracle_fdw, not the core system.  (FWIW,
>> contrib/file_fdw depends on the COPY code, which will check
>> encoding.)
>
> I agree that this is a bug in oracle_fdw (well, potentially;
> ultimately, it’s Oracle that’s lying about the encoding of those
> text values). But I think that it would be much more useful overall
> -- not to mention more database-like -- for PostgreSQL to provide a
> way to enforce it. That is, to consider foreign tables to be an
> input like COPY or SQL, and to validate values before displaying
> them.
>
> Best,
>
> David
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

+1

--
Regards,
Atri
l'apprenant
On 10/06/12 3:45 AM, Heikki Linnakangas wrote:
> At the SQL level, there's the convert(bytea, name, name) function.

ahhh, right. (forehead slap)

>> A second function would do the same, but replace errors with the
>> substitution character in the target charset rather than erroring.
>
> Hmm, I don't think we have that.

Methinks this would be extremely useful for importing 'dirty' data --
that, or a per-connection flag (or an option on the COPY command?)
that says "substitute on error" for the likes of UTF-8 imports from
CSV.

--
john r pierce                            N 37, W 122
santa cruz ca                         mid-left coast
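Pending such a COPY option, the substitute-on-error behavior can be approximated by scrubbing the file before COPY sees it. A pre-pass sketch in Python (the sample CSV data is invented for illustration):

```python
import csv
import io

def scrub_to_utf8(raw: bytes) -> str:
    # Replace any byte sequence that is not valid UTF-8 with U+FFFD
    # instead of aborting the whole import.
    return raw.decode("utf-8", errors="replace")

dirty = b"id,name\n1,caf\xe9\n2,ok\n"   # 0xE9: a bare Latin-1 byte
rows = list(csv.reader(io.StringIO(scrub_to_utf8(dirty))))
# The bad value survives as 'caf\ufffd': importable, and the
# substitution character flags it for later cleanup.
```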
"David E. Wheeler" <david@justatheory.com> writes: > On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Now, having said that, I think it has to be the reponsibility of the FDW >> to apply any required check ... which makes this a bug report against >> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the >> COPY code, which will check encoding.) > I agree that this is a bug in oracle_fdw (well, potentially; ultimately, it�s Oracle that�s lying about the encoding ofthose text values). But I think that it would be much more useful overall -- not to mention more database-like -- for PostgreSQLto provide a way to enforce it. That is, to consider foreign tables to be an input like COPY or SQL, and to validatevalues before displaying them. It is the FDW's responsibility to deal with this. We expect it to hand back valid tuples; it is not reasonable to disassemble them looking for mistakes (and we couldn't catch most mistakes, anyway). If the interface were defined in terms of text, we could do checking above the FDW level ... but it isn't. regards, tom lane
On 10/06/2012 03:35 PM, Tom Lane wrote:
> "David E. Wheeler" <david@justatheory.com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Now, having said that, I think it has to be the responsibility of
>>> the FDW to apply any required check ... which makes this a bug
>>> report against oracle_fdw, not the core system.  (FWIW,
>>> contrib/file_fdw depends on the COPY code, which will check
>>> encoding.)
>> I agree that this is a bug in oracle_fdw (well, potentially;
>> ultimately, it’s Oracle that’s lying about the encoding of those
>> text values). But I think that it would be much more useful overall
>> -- not to mention more database-like -- for PostgreSQL to provide a
>> way to enforce it. That is, to consider foreign tables to be an
>> input like COPY or SQL, and to validate values before displaying
>> them.
> It is the FDW's responsibility to deal with this.  We expect it to
> hand back valid tuples; it is not reasonable to disassemble them
> looking for mistakes (and we couldn't catch most mistakes, anyway).
> If the interface were defined in terms of text, we could do checking
> above the FDW level ... but it isn't.

Exactly. We've done quite a lot of work over the years tightening the
ways that badly encoded data can enter the database. It's a
never-ending game of whack-a-mole. There aren't any easy answers.

cheers

andrew
Tom Lane wrote:
> "David E. Wheeler" <david@justatheory.com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Now, having said that, I think it has to be the responsibility of
>>> the FDW to apply any required check ... which makes this a bug
>>> report against oracle_fdw, not the core system.  (FWIW,
>>> contrib/file_fdw depends on the COPY code, which will check
>>> encoding.)
>> I agree that this is a bug in oracle_fdw (well, potentially;
>> ultimately, it's Oracle that's lying about the encoding of those
>> text values). But I think that it would be much more useful overall
>> -- not to mention more database-like -- for PostgreSQL to provide a
>> way to enforce it. That is, to consider foreign tables to be an
>> input like COPY or SQL, and to validate values before displaying
>> them.
> It is the FDW's responsibility to deal with this.  We expect it to
> hand back valid tuples; it is not reasonable to disassemble them
> looking for mistakes (and we couldn't catch most mistakes, anyway).
> If the interface were defined in terms of text, we could do checking
> above the FDW level ... but it isn't.

As the author, I agree that this is a bug in oracle_fdw.

This was caused by ignorance on my part: I had assumed that the type
input functions would perform the necessary checks, but it seems that
is not the case. I'll look into it.

Oracle does not care much about correct encoding. If the client
character set and the database character set are the same, Oracle does
not bother to check the data. This is probably how WINDOWS-1252
characters slipped into the UTF-8 database in question. I consider
this a bug in Oracle, but never reported it, because I don't have much
hope that Oracle would see it as a problem, given their habitually
sloppy handling of encoding issues.

Yours,
Laurenz Albe
On Oct 8, 2012, at 12:25 AM, "Albe Laurenz" <laurenz.albe@wien.gv.at> wrote:

> As the author I agree that this is a bug in oracle_fdw.

Thanks. Should I file a report somewhere?

> This was caused by ignorance on my part: I had assumed that the
> type input functions would perform the necessary checks, but it
> seems like that is not the case. I'll look into it.

Thank you!

> Oracle does not care much about correct encoding.
> If client character set and database character set are the same,
> Oracle does not bother to check the data. This is probably how
> WINDOWS-1252 characters slipped into the UTF-8 database in question.
> I consider this a bug in Oracle, but never reported it, because
> I don't have much hope that Oracle would see it as a problem
> given their habitually sloppy handling of encoding issues.

Yeah, same here. I've been looking into writing a function to try to
fix poorly-encoded data, but haven't gotten far, because CONVERT()
does not indicate failure. If you have any insight, I'd appreciate
your thoughts on this Stack Overflow question:

  http://stackoverflow.com/q/12717363/79202

Thanks,

David
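For the repair problem in that Stack Overflow question, one heuristic that works outside the database (sketched here in Python; it assumes the only other encoding in play is CP1252, which the data may not guarantee) is to keep values that validate as UTF-8 and transcode the rest:

```python
def fix_mislabeled(raw: bytes) -> str:
    """Keep bytes that are already valid UTF-8; otherwise assume they
    are really CP1252 and transcode. A heuristic, not a proof: some
    CP1252 byte strings also happen to be valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")

fix_mislabeled(b"caf\xc3\xa9")   # already UTF-8  -> 'café'
fix_mislabeled(b"caf\xe9")       # CP1252 bytes   -> 'café'
```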
On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Now, having said that, I think it has to be the responsibility of the
> FDW to apply any required check ... which makes this a bug report
> against oracle_fdw, not the core system.  (FWIW, contrib/file_fdw
> depends on the COPY code, which will check encoding.)

FWIW, I believe that dblink does not check encoding.

Best,

David
"David E. Wheeler" <david@justatheory.com> writes: > On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Now, having said that, I think it has to be the reponsibility of the FDW >> to apply any required check ... which makes this a bug report against >> oracle_fdw, not the core system. (FWIW, contrib/file_fdw depends on the >> COPY code, which will check encoding.) > FWIW, I believe that dblink does not check encoding. In dblink's case, that boils down to trusting a remote instance of Postgres to get this right, which doesn't seem totally unreasonable. But I wouldn't object to adding checks there if someone wanted to submit a patch. regards, tom lane
On Oct 8, 2012, at 11:13 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

>> FWIW, I believe that dblink does not check encoding.
>
> In dblink's case, that boils down to trusting a remote instance of
> Postgres to get this right, which doesn't seem totally unreasonable.
> But I wouldn't object to adding checks there if someone wanted to
> submit a patch.

Yeah, I found this because we had a dblink to another PostgreSQL
server's table with data populated from oracle_fdw. I guess trusting
it is reasonable, though. I wonder about dbi-link, though…

David
On 10/08/2012 02:13 PM, Tom Lane wrote:
> "David E. Wheeler" <david@justatheory.com> writes:
>> On Oct 5, 2012, at 6:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Now, having said that, I think it has to be the responsibility of
>>> the FDW to apply any required check ... which makes this a bug
>>> report against oracle_fdw, not the core system.  (FWIW,
>>> contrib/file_fdw depends on the COPY code, which will check
>>> encoding.)
>> FWIW, I believe that dblink does not check encoding.
> In dblink's case, that boils down to trusting a remote instance of
> Postgres to get this right, which doesn't seem totally unreasonable.
> But I wouldn't object to adding checks there if someone wanted to
> submit a patch.

It does do:

    PQsetClientEncoding(conn, GetDatabaseEncodingName());

I'd be mildly reluctant to do anything more, except possibly as an
option, unless it could be shown to have minimal performance impact.

cheers

andrew
David E. Wheeler wrote:
>> As the author I agree that this is a bug in oracle_fdw.
>
> Thanks. Should I file a report somewhere?

That's not necessary. Thanks for reporting the problem.
It may be a few days until I get around to fixing it.

>> Oracle does not care much about correct encoding.
>
> Yeah, same here. I've been looking into writing a function to try
> to fix poorly-encoded data, but haven't gotten far, because
> CONVERT() does not indicate failure. If you have any insight,
> I'd appreciate your thoughts on this Stack Overflow question:
>
> http://stackoverflow.com/q/12717363/79202

The only thing I can think of is a stored procedure in Java. You
could use java.nio.charset.CharsetEncoder and
java.nio.charset.CharsetDecoder, which should throw exceptions if
they encounter illegal bytes.

Yours,
Laurenz Albe
I wrote:
> Tom Lane wrote:
>>>> Now, having said that, I think it has to be the responsibility of
>>>> the FDW to apply any required check ... which makes this a bug
>>>> report against oracle_fdw, not the core system.
>
> As the author I agree that this is a bug in oracle_fdw.

OK, fixed. David, could you try the latest version from CVS to see if
it works for you?

Yours,
Laurenz Albe