Thread: [GENERAL] Postgres, apps, special characters and UTF-8 encoding
Hi. I've got a recurring problem with character encoding for a Postgres-based web PHP app, and am hoping someone can clue me in or at least point me in the right direction. I'll confess upfront my understanding of encoding issues is extremely limited. Here goes.
The app uses a Postgres database, UTF-8 encoded. Through their browsers, users can add and edit records often including text. Most of the time this works fine. Though sometimes this will fail with Postgres complaining, for example, "Could query with ... , The error text was: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20 0x67"
So this generally happens when people copy and paste things out of their word documents and such.
As I understand it, those are likely encoded in something non-UTF-8, like WIN-1251 or something. And that one way or another, the encoding needs to be translated before it can be placed into the database. I'm not clear how this is supposed to happen though. Automatically by the browser? Done in the app? Some other way? And if in the app, how is one supposed to know what the incoming encoding is?
Thanks in advance for any help or pointers.
Ken

AGENCY Software
A Free Software data system
By and for non-profits
(253) 245-3801
learn more about AGENCY or
follow the discussion.
As I understand it, those are likely encoded in something non-UTF-8, like WIN-1251 or something. And that one way or another, the encoding needs to be translated before it can be placed into the database. I'm not clear how this is supposed to happen though. Automatically by the browser? Done in the app? Some other way? And if in the app, how is one supposed to know what the incoming encoding is?
Haven't run into this problem personally, probably accidentally lucky, but ISTM that you need to tell the browser what character set you are working in. It is in the best position to mediate between the user and the server.
Ideally, on the server, you can examine HTTP headers to learn about the incoming data charset/encoding (I may not be using these terms precisely but you should get the idea).
Googling "html input field encoding" seems to provide a decent start.
Note that technically the data encoding issues can occur without HTML, its really an HTTP layer thing, but the medium of use you care about is HTTP/Browsers.
David J.
On 03/07/2017 03:20 PM, Ken Tanzer wrote: > Hi. I've got a recurring problem with character encoding for a > Postgres-based web PHP app, and am hoping someone can clue me in or at > least point me in the right direction. I'll confess upfront my > understanding of encoding issues is extremely limited. Here goes. > > The app uses a Postgres database, UTF-8 encoded. Through their > browsers, users can add and edit records often including text. Most of > the time this works fine. Though sometimes this will fail with Postgres > complaining, for example, "Could query with ... , The error text was: > ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20 0x67" > > So this generally happens when people copy and paste things out of their > word documents and such. > > As I understand it, those are likely encoded in something non-UTF-8, > like WIN-1251 or something. And that one way or another, the encoding > needs to be translated before it can be placed into the database. I'm > not clear how this is supposed to happen though. Automatically by the > browser? Done in the app? Some other way? And if in the app, how is > one supposed to know what the incoming encoding is? I don't use PHP, but found this: http://www.php.net/manual/en/function.mb-detect-encoding.php and this: http://php.net/manual/en/function.mb-convert-encoding.php > > Thanks in advance for any help or pointers. > > Ken > > > -- > AGENCY Software > A Free Software data system > By and for non-profits > /http://agency-software.org// > /https://agency-software.org/demo/client/ > ken.tanzer@agency-software.org <mailto:ken.tanzer@agency-software.org> > (253) 245-3801 > > Subscribe to the mailing list > <mailto:agency-general-request@lists.sourceforge.net?body=subscribe> to > learn more about AGENCY or > follow the discussion. -- Adrian Klaver adrian.klaver@aklaver.com
Hi Ken, On Tue, 2017-03-07 at 15:20 -0800, Ken Tanzer wrote: > Hi. I've got a recurring problem with character encoding for a > Postgres-based web PHP app, and am hoping someone can clue me in or > at least point me in the right direction. I'll confess upfront my > understanding of encoding issues is extremely limited. Here goes. > > The app uses a Postgres database, UTF-8 encoded. Through their > browsers, users can add and edit records often including text. Most > of the time this works fine. Though sometimes this will fail with > Postgres complaining, for example, "Could query with ... , The error > text was: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x20 > 0x67" > > So this generally happens when people copy and paste things out of > their word documents and such. > > As I understand it, those are likely encoded in something non-UTF-8, > like WIN-1251 or something. And that one way or another, the > encoding needs to be translated before it can be placed into the > database. I'm not clear how this is supposed to happen though. > Automatically by the browser? Done in the app? Some other way? And > if in the app, how is one supposed to know what the incoming encoding > is? > > Thanks in advance for any help or pointers. > > Ken > > > 1) Make sure the text editor you use to create your pages, etc. uses UTF-8 as its encoding. That way the file's BOM is set correctly. 2) Make sure your headers contain the following:- <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> or <meta charset="UTF-8"/> which is HTML5, however the long version is still recognised by HTML5. I understand that some versions of IE have issues with correctly determining the character set, so, unfortunately, you still have to verify that user input is UTF-8 compatible. HTH, Rob
On 2017-03-08 00:20, Ken Tanzer wrote: > Hi. I've got a recurring problem with character encoding for a > Postgres-based web PHP app, and am hoping someone can clue me in or at > least point me in the right direction. I'll confess upfront my > understanding of encoding issues is extremely limited. Here goes. > > And that one way or another, the encoding > needs to be translated before it can be placed into the database. > Ken > > -- > You don't really have to translate the encoding, because all parts of the system are capable of dealing with all encodings. What you have to make sure that that they are indeed all working in the same encoding. You have to set the encoding of the HTML document, the database, and the database connection to the same encoding, like utf8. People tend to forget the "set names" on the database connection, which can make the database think you are sending latin1, but you are really sending utf-8, and presto problemo. Then the only problem left is that PHP doesn't do utf-8 very well internally so if you receive data from an UTF-8 page and want to substring etc then you have to use the multibyte variants of those functions. You could convert everything back to latin1 first, but then you might as well just do everything in latin1 in the first place.
Ken Tanzer wrote: > Hi. I've got a recurring problem with character encoding for a Postgres-based web PHP app, and am > hoping someone can clue me in or at least point me in the right direction. I'll confess upfront my > understanding of encoding issues is extremely limited. Here goes. > > The app uses a Postgres database, UTF-8 encoded. Through their browsers, users can add and edit > records often including text. Most of the time this works fine. Though sometimes this will fail with > Postgres complaining, for example, "Could query with ... , The error text was: ERROR: invalid byte > sequence for encoding "UTF8": 0xe9 0x20 0x67" > > So this generally happens when people copy and paste things out of their word documents and such. > > As I understand it, those are likely encoded in something non-UTF-8, like WIN-1251 or something. And > that one way or another, the encoding needs to be translated before it can be placed into the > database. I'm not clear how this is supposed to happen though. Automatically by the browser? Done > in the app? Some other way? And if in the app, how is one supposed to know what the incoming > encoding is? > > Thanks in advance for any help or pointers. The byte sequence 0xe9 0x20 0x67 means "é g" in ISO-8859-1 and WINDOWS-1252, so I think that your setup is as follows: - The PHP application gets data encoded in ISO-8859-1 or WINDOWS-1252 and tries to store it in a database. - The PHP application has a database connection with client_encoding set to UTF8. Then the database thinks it gets UTF-8 and will choke if it gets something different. The solution: - Make sure that your web application gets data in only one encoding. - Set client_encoding to that encoding. Yours, Laurenz Albe