Home > mailing lists

Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3 - Mailing list psycopg

From	Karl O. Pinc
Subject	Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3
Date	February 16, 2024 05:45:15
Msg-id	20240215234515.0ab71d58@slate.karlpinc.com Whole thread Raw
In response to	Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3 (Daniele Varrazzo <daniele.varrazzo@gmail.com>)
Responses	Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3
List	psycopg

Tree view

On Wed, 14 Feb 2024 19:28:57 +0100
Daniele Varrazzo <daniele.varrazzo@gmail.com> wrote:

> On Wed, 14 Feb 2024 at 16:42, Karl O. Pinc <kop@karlpinc.com> wrote:
>
> > I did not use conn.info.encoding because the docs say that it
> > contains the _client_ encoding, not the server-side encoding
> > used to store the db content.
> > [...]
> > Confirming the encodings, connecting to the  "latin1" db with psql
> > shows:
>
> > kop_latin1=> show client_encoding;
> >  UTF8
> >
> > kop_latin1=> show server_encoding;
> >  LATIN1
> >
> > But, conn.info.encoding does return iso8859-1.
> >
> > So I think your documentation has confused client
> > and server in this case.

> No, I am pretty sure that this is the client encoding that is
> reported.

You are right.

> Maybe the way you are connecting via psql sets the client_encoding?
> Can you try to get the result of `SHOW client_encoding` from psycopg?

psycopg, as well as libpq -- called directly from C, report LATIN1
as the client encoding.  More on this below.

> From psycopg PoV, the client encoding is more important, because it's
> how strings must be encoded to send them to the server; the server
> encoding is relatively less important. So what you can actually store
> is the smallest set of characters between server encoding and client
> encoding. What you could do is to set the client encoding equal to the
> server's:
>
>     SELECT set_config('client_encoding',
> current_setting('server_encoding'), false);
>
> and then proceed using `conn.info.encoding`.

This is what is happening:

Unless you open a connection with a connection string (etc.)
that does client_encoding=auto, or otherwise set a specific
client encoding, the connection defaults to setting the
client encoding to the server encoding.

This means to me that there are 2(+) ways to use psycopg.
You can have psycopg do all the encoding conversion, client
side, as is probably typical because I don't expect that
people are setting client_encoding.  In this case
you get a UnicodeEncodeError raised when there's a
problem.  Or you use client_encoding=auto, the server
does the converting, and you get a
psycopg.errors.UntranslatableCharacter when there's
a problem.

(I was surprised when I first wrote my test code to
get a UnicodeEncodingError instead of an
UntranslateableCharacter exception.)

This leads me a problem with your documentation.
https://www.psycopg.org/psycopg3/docs/api/adapt.html#psycopg.adapt.Dumper.dump
Says:
abstract dump(obj: Any) → Union[bytes, bytearray, memoryview]

Convert the object obj to PostgreSQL representation.

When in fact dump() additionally converts the object to the
connection's client encoding.  To me "PostgreSQL representation"
means server encoding.  Even if it means something else to you,
it wouldn't hurt to be as specific as possible and write
"to the PostgreSQL representation, in the connection's
client encoding when this is text" to avoid ambiguity.  There's 2 things
going on here, encoding and "representation", seemingly
kinda separate.

Anyway, your docs talk "type" and "representation"
and I think it'd be helpful to talk "type", "representation"
and also "encoding" when the representation/presentation
is text.  I find the current docs unclear
when it comes to exactly what encoding or decoding might
happen where and what affects what can happen where.

Perhaps something as simple as, in the "Passing Parameters
to SQL Queries" section,
https://www.psycopg.org/psycopg3/docs/basic/params.html#passing-parameters-to-sql-queries
above the "Binary Parameters and Results" section,
https://www.psycopg.org/psycopg3/docs/basic/params.html#binary-parameters-and-results
you have a "Text Parameters and Results" section:

  Once upon a time, when you bought a computer it came
with only one set of characters and only one way to represent,
encode, each of those characters into exactly one byte.  These days
life is not so simple and character encoding is more complicated.

  Today there is no substitute for knowing the encoding of the
text your application obtains from the outside world.
This can be highly system dependent because when reading
files open()-ed as text, Python decodes (into UTF-8) the bytes read.
By default decoding from the system locale's character encoding.
And when writing files open()-ed as text Python encodes (from UTF-8)
the bytes written.  Again by default encoding into the
system locale's character encoding.

  Encoding is also application dependent.  Your application
might have to accept multiple text encodings, or at least
handle exceptions when given text in an unexpected encoding.

  No matter how you get your data, to put your data into
the database as text, its bytes must first have their external
encoding decoded to UTF-8.  Because Python strings are
UTF-8.  You must manage this initial conversion into Python
because only you knows the encoding of the text supplied to your
application.  Although there are conventions, markers, etc.,
there is no way to automatically know an encoding just by
looking at the bytes input.

  Once in Python, psycopg converts the UTF-8 text to the database
connection's client encoding and the PostgreSQL server converts
that to the server encoding.  Reverse the entire process when
retrieving textual data from PostgreSQL and outputting it.

  All text is passed to and received from the server in the
connection's client encoding.  Unless the client_encoding
connection parameter is set, PostgreSQL defaults to the database's
encoding, the server encoding.  Setting client_encoding=auto
automatically sets the client encoding to the client
locale's character encoding.

  Adaption converts all text to the client encoding when sent,
and back when received.  So when the client encoding is the
server encoding, all conversion is done on the client-side.
Any exceptions raised should the conversion fail are Python
UnicodeError exceptions, or one of its encode or decode
subclasses.  Whereas when client and server encoding differ,
encoding conversion is also done server-side.  An
UntranslatableCharacter exception is raised when server-side
conversion fails.  This is a subclass of the standard DBAPI
DataException.  Setting client_encoding=UTF8,
the same as Python's encoding, covers the final use-case
where all encoding conversion, except, possibly, the initial
reading of the text into Python, is done server-side.

  See also: PEP 686 & PEP 540.

To be clear, I'm not an expert here.  But I'm not entirely
ignorant and this is my understanding.  It's important
to get the encoding right so I think it'd be good to
talk about it.  There's a lot of Python-language
explanation in the above.  And some PostgreSQL
specifics too.  All this might be extraneous
but if the Python or pg side is gotten wrong it's easy
to have odd, occasional, bugs when suddenly somebody
inputs an unexpected character.  So better to cover
the subject from end-to-end and provide an integrated
understanding.  This will only become more important
as UTF-8 becomes more prevalent, because even though
non-UTF-8 encodings will always be with us, they'll
be less familiarity with these encodings as time passes
and encoding bugs will become even more mysterious.

What this means for my problem at hand is that calling
Dumper.dump() does me no good at all when it comes to
looking for the exact source of the exceptions raised when
converting between the client and server encoding.
dump() does not convert between client and server encodings.
And that means, please correct me if I'm wrong,
that psycopg does not presently expose
the server encoding so that I can do my own checking
and pinpoint the exact problem after catching an encoding
related exception.  (Short of using psycopg to query the
server and ask for the server encoding, or omitting
client_encoding when making a connection, etc.)

FYI.  `man psql` says:
       If both standard input and standard output are a terminal, then psql
       sets the client encoding to “auto”, which will detect the appropriate
       client encoding from the locale settings (LC_CTYPE environment variable
       on Unix systems).

Thanks for the help.  Apologies for the lengthy reply.

Regards,

Karl <kop@karlpinc.com>
Free Software:  "You don't pay back, you pay forward."
                 -- Robert A. Heinlein

psycopg by date:

From: Daniele Varrazzo
Date: 14 February 2024, 18:28:57
Subject: Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3

From: Karsten Hilbert
Date: 16 February 2024, 10:47:29
Subject: Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3

Re: Reporting UnicodeEncodeError info on arbitrary data sent to PG with psycopg3 - Mailing list psycopg

Previous

Next