Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3 - Mailing list psycopg
From | Karl O. Pinc |
---|---|
Subject | Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3 |
Date | |
Msg-id | 20240215234515.0ab71d58@slate.karlpinc.com Whole thread Raw |
In response to | Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3 (Daniele Varrazzo <daniele.varrazzo@gmail.com>) |
Responses |
Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3
|
List | psycopg |
On Wed, 14 Feb 2024 19:28:57 +0100 Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote: > On Wed, 14 Feb 2024 at 16:42, Karl O. Pinc <kop@karlpinc.com> wrote: > > > I did not use conn.info.encoding because the docs say that it > > contains the _client_ encoding, not the server-side encoding > > used to store the db content. > > [...] > > Confirming the encodings, connecting to the "latin1" db with psql > > shows: > > > kop_latin1=> show client_encoding; > > UTF8 > > > > kop_latin1=> show server_encoding; > > LATIN1 > > > > But, conn.info.encoding does return iso8859-1. > > > > So I think your documentation has confused client > > and server in this case. > No, I am pretty sure that this is the client encoding that is > reported. You are right. > Maybe the way you are connecting via psql sets the client_encoding? > Can you try to get the result of `SHOW client_encoding` from psycopg? psycopg, as well as libpq -- called directly from C, report LATIN1 as the client encoding. More on this below. > From psycopg PoV, the client encoding is more important, because it's > how strings must be encoded to send them to the server; the server > encoding is relatively less important. So what you can actually store > is the smallest set of characters between server encoding and client > encoding. What you could do is to set the client encoding equal to the > server's: > > SELECT set_config('client_encoding', > current_setting('server_encoding'), false); > > and then proceed using `conn.info.encoding`. This is what is happening: Unless you open a connection with a connection string (etc.) that does client_encoding=auto, or otherwise set a specific client encoding, the connection defaults to setting the client encoding to the server encoding. This means to me that there are 2(+) ways to use psycopg. You can have psycopg do all the encoding conversion, client side, as is probably typical because I don't expect that people are setting client_encoding. In this case you get a UnicodeEncodeError raised when there's a problem. Or you use client_encoding=auto, the server does the converting, and you get a psycopg.errors.UntranslatableCharacter when there's a problem. (I was surprised when I first wrote my test code to get a UnicodeEncodingError instead of an UntranslateableCharacter exception.) This leads me a problem with your documentation. https://www.psycopg.org/psycopg3/docs/api/adapt.html#psycopg.adapt.Dumper.dump Says: abstract dump(obj: Any) → Union[bytes, bytearray, memoryview] Convert the object obj to PostgreSQL representation. When in fact dump() additionally converts the object to the connection's client encoding. To me "PostgreSQL representation" means server encoding. Even if it means something else to you, it wouldn't hurt to be as specific as possible and write "to the PostgreSQL representation, in the connection's client encoding when this is text" to avoid ambiguity. There's 2 things going on here, encoding and "representation", seemingly kinda separate. Anyway, your docs talk "type" and "representation" and I think it'd be helpful to talk "type", "representation" and also "encoding" when the representation/presentation is text. I find the current docs unclear when it comes to exactly what encoding or decoding might happen where and what affects what can happen where. Perhaps something as simple as, in the "Passing Parameters to SQL Queries" section, https://www.psycopg.org/psycopg3/docs/basic/params.html#passing-parameters-to-sql-queries above the "Binary Parameters and Results" section, https://www.psycopg.org/psycopg3/docs/basic/params.html#binary-parameters-and-results you have a "Text Parameters and Results" section: Once upon a time, when you bought a computer it came with only one set of characters and only one way to represent, encode, each of those characters into exactly one byte. These days life is not so simple and character encoding is more complicated. Today there is no substitute for knowing the encoding of the text your application obtains from the outside world. This can be highly system dependent because when reading files open()-ed as text, Python decodes (into UTF-8) the bytes read. By default decoding from the system locale's character encoding. And when writing files open()-ed as text Python encodes (from UTF-8) the bytes written. Again by default encoding into the system locale's character encoding. Encoding is also application dependent. Your application might have to accept multiple text encodings, or at least handle exceptions when given text in an unexpected encoding. No matter how you get your data, to put your data into the database as text, its bytes must first have their external encoding decoded to UTF-8. Because Python strings are UTF-8. You must manage this initial conversion into Python because only you knows the encoding of the text supplied to your application. Although there are conventions, markers, etc., there is no way to automatically know an encoding just by looking at the bytes input. Once in Python, psycopg converts the UTF-8 text to the database connection's client encoding and the PostgreSQL server converts that to the server encoding. Reverse the entire process when retrieving textual data from PostgreSQL and outputting it. All text is passed to and received from the server in the connection's client encoding. Unless the client_encoding connection parameter is set, PostgreSQL defaults to the database's encoding, the server encoding. Setting client_encoding=auto automatically sets the client encoding to the client locale's character encoding. Adaption converts all text to the client encoding when sent, and back when received. So when the client encoding is the server encoding, all conversion is done on the client-side. Any exceptions raised should the conversion fail are Python UnicodeError exceptions, or one of its encode or decode subclasses. Whereas when client and server encoding differ, encoding conversion is also done server-side. An UntranslatableCharacter exception is raised when server-side conversion fails. This is a subclass of the standard DBAPI DataException. Setting client_encoding=UTF8, the same as Python's encoding, covers the final use-case where all encoding conversion, except, possibly, the initial reading of the text into Python, is done server-side. See also: PEP 686 & PEP 540. To be clear, I'm not an expert here. But I'm not entirely ignorant and this is my understanding. It's important to get the encoding right so I think it'd be good to talk about it. There's a lot of Python-language explanation in the above. And some PostgreSQL specifics too. All this might be extraneous but if the Python or pg side is gotten wrong it's easy to have odd, occasional, bugs when suddenly somebody inputs an unexpected character. So better to cover the subject from end-to-end and provide an integrated understanding. This will only become more important as UTF-8 becomes more prevalent, because even though non-UTF-8 encodings will always be with us, they'll be less familiarity with these encodings as time passes and encoding bugs will become even more mysterious. What this means for my problem at hand is that calling Dumper.dump() does me no good at all when it comes to looking for the exact source of the exceptions raised when converting between the client and server encoding. dump() does not convert between client and server encodings. And that means, please correct me if I'm wrong, that psycopg does not presently expose the server encoding so that I can do my own checking and pinpoint the exact problem after catching an encoding related exception. (Short of using psycopg to query the server and ask for the server encoding, or omitting client_encoding when making a connection, etc.) FYI. `man psql` says: If both standard input and standard output are a terminal, then psql sets the client encoding to “auto”, which will detect the appropriate client encoding from the locale settings (LC_CTYPE environment variable on Unix systems). Thanks for the help. Apologies for the lengthy reply. Regards, Karl <kop@karlpinc.com> Free Software: "You don't pay back, you pay forward." -- Robert A. Heinlein