Change initdb default to the builtin collation provider - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Change initdb default to the builtin collation provider |
Date | |
Msg-id | e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com Whole thread Raw |
List | pgsql-hackers |
------- Summary ------- The libc collation provider is a bad default[1]. The builtin collation provider is a good default, so let's use that. ---------- Motivation ---------- The initdb default is what we choose for new postgres instances when we have little information about the user's collation needs. The default has no effect on upgrades, which always use the previous instance's locale settings. There are a number of factors to consider when choosing a default: * Risk of primary key inconsistencies due to libc or ICU updates * Performance * Quality of query semantics for a variety of scripts, languages and regions (excluding the final result order) * Final result ordering / display In the absence of specific user requirements, these factors weigh heavily in favor of the builtin collation provider, and heavily against libc. With the builtin provider, there's no risk of primary key or plain index inconsistencies, the performance is great (ordering with memcmp()), and the query semantics are based on Unicode. ------------------- Why does it matter? ------------------- Arguably, we could just not worry and let various service providers, tools, scripts, packages, and wrappers make the choice independently. But that just moves the problem -- someone still needs to make that choice. Collectively, we have built up some knowledge about collation here on -hackers, and it would be good to offer it as guidance. And an initdb default is a good way to offer that guidance. Unifying around one default also creates a more consistent, tested, and documented Postgres experience that benefits hackers and users alike. -------------------------------- What's the catch? Display order. -------------------------------- The builtin provider uses code point order, i.e. memcmp(), so the final result display order is less human-friendly. For instance, 'Z' comes before 'a'. That problem is annoying, but *much* easier to fix than the other factors. The user might add a COLLATE clause to the final ORDER BY, or perform the sort in the application layer or presentation layer. Other providers offer a better final display order, but it comes at a heavy price: index inconsistencies and poor performance. Those problems are hard to address in an existing system. Some users may be willing to pay that price, but it should be opt-in. Furthermore, in the default case, we don't even really know which language and region to use. We infer it from the environment variable LC_COLLATE at initdb time, but that's a weak signal: there's little reason to think that the OS admin, DBA, and end user are all in the same locale. In general, there's little reason to think that a single locale for display order is enough for a whole database. Often, databases are used (directly or indirectly) by people from dozens of locales. When per-locale display order becomes an issue, it will be necessary to add COLLATE clauses or application logic to tailor to the end user regardless, so the database default locale won't be useful. For all of these reasons, display order is the wrong thing to optimize for when the user doesn't specify anything. We should prioritize the other factors, and for those other factors, the builtin provider is the best. ------------ Why not ICU? ------------ ICU is better than libc in a lot of ways: * Better performance * Platform-independent * Easier to manage it as a separate library But fundamentally, I don't think it's a great default, because it favors final result display order at the risk of primary key inconsistencies. ------------------ Other Alternatives ------------------ In a previous thread[1], I laid out some alternatives. If someone disagrees with this proposal, please choose one of those or suggest a new one. The most interesting alternative, in my opinion, is #4, but that was soundly rejected. --------------------- Which builtin locale? --------------------- All builtin locales use the exact same ordering: they sort by code point. Code point order is stable, so primary keys and plain indexes remain consistent across upgrades forever. The difference is in case conversion and character classification semantics: 1. C: Only basic ASCII semantics which never change. 2. PG_C_UTF8: Provides "simple" Unicode semantics. In spirit, this is similar to the libc "C.UTF-8" locale available on Linux. It's also similar to the default semantics of at least one big commercial database, making migrations easier. 3. PG_UNICODE_FAST: Provides "full" Unicode semantics. It's more aligned with the SQL standard, which specifies in an example the uppercase of 'ß' is 'SS'. For the latter two locales, expression and partial indexes depending on these semantics may be subject to inconsistencies after a Unicode update. I propose changing the default to PG_C_UTF8 because it seems simple and practical. However, I'm also fine with PG_UNICODE_FAST if those affected by the "full" case mapping find it helpful. "C" is also a possibility, but the query semantics suffer. All are better than libc. ------- Details ------- The mechanics of the default itself are being worked out here[2]. The concrete proposal here is to commit those patches, and then change DEFAULT_LOCALE_PROVIDER to be COLLPROVIDER_BUILTIN and DEFAULT_BUILTIN_LOCALE to whatever we choose here. Note: the builtin provider requires UTF-8, which can potentially conflict with the LC_CTYPE. Fortunately, when the builtin provider is being used, LC_CTPE has little effect. To further reduce the consequences of LC_CTYPE when using the builtin provider, another patch[3] fixes tsearch to parse based on the database default locale rather than depending on LC_CTYPE. Comments welcome. Regards, Jeff Davis [1] https://www.postgresql.org/message-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com [2] https://www.postgresql.org/message-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com [3] https://www.postgresql.org/message-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
pgsql-hackers by date: