Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Pre-proposal: unicode normalized text
Date:
Msg-id: CA+TgmoYzYR-yhU6k1XFCADeyj=Oyz2PkVsa3iKv+keM8wp-F_A@mail.gmail.com
In response to: Re: Pre-proposal: unicode normalized text (Jeff Davis <pgsql@j-davis.com>)
Responses:
  Re: Pre-proposal: unicode normalized text
  Re: Pre-proposal: unicode normalized text
  Re: Pre-proposal: unicode normalized text
List: pgsql-hackers
On Tue, Oct 3, 2023 at 3:54 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I assume you mean because we reject invalid byte sequences? Yeah, I'm
> sure that causes a problem for some (especially migrations), but it's
> difficult for me to imagine a database working well with no rules at
> all for the basic data types.

There's a very popular commercial database where, or so I have been led to believe, any byte sequence at all is accepted when you try to put values into the database. The rumors I've heard -- I have not played with it myself -- are that when you try to do anything, byte sequences that are not valid in the configured encoding are treated as single-byte characters or something of that sort. So, for example, if you had UTF-8 as the encoding and the first byte of the string is something that can only appear as a continuation byte in UTF-8, I think that byte is just treated as a separate character. I don't quite know how you make all of the operations work that way, but it seems like they've come up with a somewhat-consistent set of principles that are applied across the board. Very different from the PG philosophy, of course. And I'm not saying it's better. But it does eliminate the problem of being unable to load data into the database, because in such a model there's no such thing as invalidly-encoded data. Instead, an encoding like UTF-8 is effectively extended so that every byte sequence represents *something*. Whether that something is what you wanted is another story.

At any rate, if we were to go in the direction of rejecting code points that aren't yet assigned, or aren't yet known to the collation library, that's another way for data loading to fail, which feels like very defensible behavior, but not what everyone wants, or is used to.

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

That's a good idea.

> I also think we should expose some SQL functions to check for
> unassigned code points. That sounds useful, especially since we already
> expose normalization functions.

That's a good idea, too.

> One could easily imagine a domain with CHECK(NOT
> contains_unassigned(a)). Or an extension with a data type that uses the
> internal functions.

Yeah.

> Whether we ever get to a core data type -- and more importantly,
> whether anyone uses it -- I'm not sure.

Same here.

> Yeah, I am looking for a better compromise between:
>
> * everything is memcmp() and 'á' sometimes doesn't equal 'á'
> (depending on code point sequence)
> * everything is constantly changing, indexes break, and text
> comparisons are slow
>
> A stable idea of unicode normalization based on using only assigned
> code points is very tempting.

The fact that there are multiple types of normalization and multiple notions of equality doesn't make this easier.

--
Robert Haas
EDB: http://www.enterprisedb.com
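[Editorial illustration] For concreteness, here is a minimal SQL sketch of the ideas discussed in the message above. normalize() and IS NORMALIZED already exist in core (PostgreSQL 13 and later); contains_unassigned() and the stable_text domain are hypothetical names used only to show what the proposed check might look like, not an actual API.

```sql
-- Existing normalization support (PostgreSQL 13 and later):
SELECT normalize(U&'a\0301', NFC);      -- recomposes 'a' + combining acute into U+00E1
SELECT U&'a\0301' IS NFC NORMALIZED;    -- false: the decomposed form is not NFC

-- Hypothetical: contains_unassigned() does not exist in core; it stands in
-- for the internal/SQL-level check discussed in the thread.
CREATE DOMAIN stable_text AS text
    CHECK (NOT contains_unassigned(VALUE));

-- A table using the domain would then reject strings containing code points
-- that are unassigned in the Unicode version the server was built with.
CREATE TABLE documents (body stable_text);
```

Putting the check in a domain (or an extension data type) would leave the base text type unchanged while still letting data loads fail early when a string contains code points the build's Unicode tables do not know about.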