Change initdb default to the builtin collation provider - Mailing list pgsql-hackers

From Jeff Davis
Subject Change initdb default to the builtin collation provider
Date
Msg-id e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com
Whole thread Raw
List pgsql-hackers
-------
Summary
-------

The libc collation provider is a bad default[1]. The builtin collation
provider is a good default, so let's use that.

----------
Motivation
----------

The initdb default is what we choose for new postgres instances when
we have little information about the user's collation needs. The
default has no effect on upgrades, which always use the previous
instance's locale settings.

There are a number of factors to consider when choosing a default:

* Risk of primary key inconsistencies due to libc or ICU updates
* Performance
* Quality of query semantics for a variety of scripts, languages and
  regions (excluding the final result order)
* Final result ordering / display

In the absence of specific user requirements, these factors weigh
heavily in favor of the builtin collation provider, and heavily
against libc.

With the builtin provider, there's no risk of primary key or plain
index inconsistencies, the performance is great (ordering with
memcmp()), and the query semantics are based on Unicode.

-------------------
Why does it matter?
-------------------

Arguably, we could just not worry and let various service providers,
tools, scripts, packages, and wrappers make the choice independently.

But that just moves the problem -- someone still needs to make that
choice. Collectively, we have built up some knowledge about collation
here on -hackers, and it would be good to offer it as guidance. And an
initdb default is a good way to offer that guidance.

Unifying around one default also creates a more consistent, tested,
and documented Postgres experience that benefits hackers and users
alike.

--------------------------------
What's the catch? Display order.
--------------------------------

The builtin provider uses code point order, i.e. memcmp(), so the
final result display order is less human-friendly. For instance, 'Z'
comes before 'a'.

That problem is annoying, but *much* easier to fix than the other
factors. The user might add a COLLATE clause to the final ORDER BY, or
perform the sort in the application layer or presentation layer.

Other providers offer a better final display order, but it comes at a
heavy price: index inconsistencies and poor performance. Those
problems are hard to address in an existing system. Some users may be
willing to pay that price, but it should be opt-in.

Furthermore, in the default case, we don't even really know which
language and region to use. We infer it from the environment variable
LC_COLLATE at initdb time, but that's a weak signal: there's little
reason to think that the OS admin, DBA, and end user are all in the
same locale.

In general, there's little reason to think that a single locale for
display order is enough for a whole database. Often, databases are
used (directly or indirectly) by people from dozens of locales. When
per-locale display order becomes an issue, it will be necessary to add
COLLATE clauses or application logic to tailor to the end user
regardless, so the database default locale won't be useful.

For all of these reasons, display order is the wrong thing to optimize
for when the user doesn't specify anything. We should prioritize the
other factors, and for those other factors, the builtin provider is
the best.

------------
Why not ICU?
------------

ICU is better than libc in a lot of ways:

* Better performance
* Platform-independent
* Easier to manage it as a separate library

But fundamentally, I don't think it's a great default, because it
favors final result display order at the risk of primary key
inconsistencies.

------------------
Other Alternatives
------------------

In a previous thread[1], I laid out some alternatives. If someone
disagrees with this proposal, please choose one of those or suggest a
new one.

The most interesting alternative, in my opinion, is #4, but that was
soundly rejected.

---------------------
Which builtin locale?
---------------------

All builtin locales use the exact same ordering: they sort by code
point. Code point order is stable, so primary keys and plain indexes
remain consistent across upgrades forever.

The difference is in case conversion and character classification
semantics:

1. C: Only basic ASCII semantics which never change.

2. PG_C_UTF8: Provides "simple" Unicode semantics. In spirit, this
   is similar to the libc "C.UTF-8" locale available on Linux. It's
   also similar to the default semantics of at least one big
   commercial database, making migrations easier.

3. PG_UNICODE_FAST: Provides "full" Unicode semantics. It's more
   aligned with the SQL standard, which specifies in an example the
   uppercase of 'ß' is 'SS'.

For the latter two locales, expression and partial indexes depending
on these semantics may be subject to inconsistencies after a Unicode
update.

I propose changing the default to PG_C_UTF8 because it seems simple
and practical. However, I'm also fine with PG_UNICODE_FAST if those
affected by the "full" case mapping find it helpful. "C" is also a
possibility, but the query semantics suffer. All are better than libc.

-------
Details
-------

The mechanics of the default itself are being worked out here[2]. The
concrete proposal here is to commit those patches, and then
change DEFAULT_LOCALE_PROVIDER to be COLLPROVIDER_BUILTIN and
DEFAULT_BUILTIN_LOCALE to whatever we choose here.

Note: the builtin provider requires UTF-8, which can potentially
conflict with the LC_CTYPE. Fortunately, when the builtin provider is
being used, LC_CTPE has little effect. To further reduce the
consequences of LC_CTYPE when using the builtin provider, another
patch[3] fixes tsearch to parse based on the database default locale
rather than depending on LC_CTYPE.


Comments welcome.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
[2]
https://www.postgresql.org/message-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
[3]
https://www.postgresql.org/message-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com




pgsql-hackers by date:

Previous
From: Chao Li
Date:
Subject: Re: new environment variable INITDB_LOCALE_PROVIDER
Next
From: Jeff Davis
Date:
Subject: Re: new environment variable INITDB_LOCALE_PROVIDER