Re: Pre-proposal: unicode normalized text - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Pre-proposal: unicode normalized text
Date:
Msg-id: CA+TgmoYzYR-yhU6k1XFCADeyj=Oyz2PkVsa3iKv+keM8wp-F_A@mail.gmail.com
In response to: Re: Pre-proposal: unicode normalized text (Jeff Davis <pgsql@j-davis.com>)
Responses:
  Re: Pre-proposal: unicode normalized text
  Re: Pre-proposal: unicode normalized text
  Re: Pre-proposal: unicode normalized text
List: pgsql-hackers
On Tue, Oct 3, 2023 at 3:54 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I assume you mean because we reject invalid byte sequences? Yeah, I'm
> sure that causes a problem for some (especially migrations), but it's
> difficult for me to imagine a database working well with no rules at
> all for the basic data types.

There's a very popular commercial database where, or so I have been led to believe, any byte sequence at all is accepted when you try to put values into the database. The rumors I've heard -- I have not played with it myself -- are that when you try to do anything, byte sequences that are not valid in the configured encoding are treated as single-byte characters or something of that sort. So, for example, if you had UTF-8 as the encoding and the first byte of the string is something that can only appear as a continuation byte in UTF-8, I think that byte is just treated as a separate character. I don't quite know how you make all of the operations work that way, but it seems like they've come up with a somewhat-consistent set of principles that are applied across the board. Very different from the PG philosophy, of course. And I'm not saying it's better. But it does eliminate the problem of being unable to load data into the database, because in such a model there's no such thing as invalidly-encoded data. Instead, an encoding like UTF-8 is effectively extended so that every byte sequence represents *something*. Whether that something is what you wanted is another story.

At any rate, if we were to go in the direction of rejecting code points that aren't yet assigned, or aren't yet known to the collation library, that's another way for data loading to fail, which feels like very defensible behavior, but not what everyone wants, or is used to.

> At minimum I think we need to have some internal functions to check for
> unassigned code points. That belongs in core, because we generate the
> unicode tables from a specific version.

That's a good idea.

> I also think we should expose some SQL functions to check for
> unassigned code points. That sounds useful, especially since we already
> expose normalization functions.

That's a good idea, too.

> One could easily imagine a domain with CHECK(NOT
> contains_unassigned(a)). Or an extension with a data type that uses the
> internal functions.

Yeah.

> Whether we ever get to a core data type -- and more importantly,
> whether anyone uses it -- I'm not sure.

Same here.

> Yeah, I am looking for a better compromise between:
>
> * everything is memcmp() and 'á' sometimes doesn't equal 'á'
> (depending on code point sequence)
> * everything is constantly changing, indexes break, and text
> comparisons are slow
>
> A stable idea of unicode normalization based on using only assigned
> code points is very tempting.

The fact that there are multiple types of normalization and multiple notions of equality doesn't make this easier.

--
Robert Haas
EDB: http://www.enterprisedb.com
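[Editorial illustration] For concreteness, here is a minimal SQL sketch of the ideas discussed in the message above. normalize() and IS NORMALIZED already exist in core (PostgreSQL 13 and later); contains_unassigned() and the stable_text domain are hypothetical names used only to show what the proposed check might look like, not an actual API.

```sql
-- Existing normalization support (PostgreSQL 13 and later):
SELECT normalize(U&'a\0301', NFC);      -- recomposes 'a' + combining acute into U+00E1
SELECT U&'a\0301' IS NFC NORMALIZED;    -- false: the decomposed form is not NFC

-- Hypothetical: contains_unassigned() does not exist in core; it stands in
-- for the internal/SQL-level check discussed in the thread.
CREATE DOMAIN stable_text AS text
    CHECK (NOT contains_unassigned(VALUE));

-- A table using the domain would then reject strings containing code points
-- that are unassigned in the Unicode version the server was built with.
CREATE TABLE documents (body stable_text);
```

Putting the check in a domain (or an extension data type) would leave the base text type unchanged while still letting data loads fail early when a string contains code points the build's Unicode tables do not know about.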