Thread: Locale, Collation, ICU patch
Regarding the ICU patch in the commitfest, here's my plan. IMHO the idea of making ICU a hard dependency which Postgres will have to use forevermore on all systems is a non-starter. I'm not entirely against having ICU as a supported collation system which packagers on systems with weak locale support can choose to make a dependency of their binary packages, though, assuming the issues raised elsewhere about ICU are resolved.

As long as this bogeyman is scaring us, though, it's preventing us from having the SQL-standard collation syntax and the accompanying catalog and planner changes. And as long as we don't have that support -- which is a big job -- nobody who's interested in implementing ICU or strcoll_l() or any other interface for a new platform will get around to it. The actual porting glue to call those functions on each platform is fairly lightweight and could easily be done by experts on that platform who aren't catalog and planner mavens.

So we have a bit of a chicken-and-egg problem: we aren't getting the planner and syntax changes because we aren't sure the support would be good on every platform, and we aren't getting the platform support because we don't have the planner and catalog changes.

What I want to do is focus on adding the planner and catalog changes somehow. We implement a kind of baseline locale support, something only slightly better than what we have now, using setlocale() before every comparison. This is clearly not the recommended configuration, but as long as it handles what we handle today without a performance hit, and a bit more besides, it would be a big start.

I'm assuming we would check whether the desired locale is the current locale and skip the assignment. So if only one locale is *actually* in use then basically no additional overhead is incurred. Moreover, if the desired locale is C then we can skip the assignment and use strcmp() directly.
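[Editorial sketch of the baseline approach described above: switch locales with setlocale() only when the requested collation differs from the one already active, and bypass locale machinery entirely for the C locale. The function name pg_compare_text and the caching scheme are illustrative, not actual Postgres code.]

```c
#include <locale.h>
#include <string.h>

/* Track which LC_COLLATE setting is currently active in the process. */
static char current_collate[64] = "C";

static int
pg_compare_text(const char *a, const char *b, const char *collation)
{
    /* C locale: byte-wise comparison, no setlocale() call at all */
    if (strcmp(collation, "C") == 0)
        return strcmp(a, b);

    /* Only pay for setlocale() when the desired locale changes */
    if (strcmp(collation, current_collate) != 0)
    {
        if (setlocale(LC_COLLATE, collation) == NULL)
            return strcmp(a, b);    /* unknown locale: degrade to strcmp */
        strncpy(current_collate, collation, sizeof(current_collate) - 1);
    }
    return strcoll(a, b);
}
```

With only one non-C locale in use, the setlocale() branch is taken at most once, which is the "basically no additional overhead" case.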
So actually, as long as only one non-C locale is in use, no additional overhead would be incurred.

The big gotcha is what collation to use when comparing with data in the system tables, especially the shared system tables. I think we do need to define a database-wide encoding and collation to use for system tables. (Unless we can get by with varchar_pattern_ops indexes on system tables?)

So the following use cases arise:

a) They're actually using only one collation for both the system tables and their own data. This is well handled by our existing setup and would be basically unchanged in the new setup.

b) They're using multiple collations for their data but only one "at a time", either one per database or one per session. In which case they don't incur any overhead.

c) They're using multiple collations for their data but only one collation in a given application unit of work. This is probably the most common case for OLTP applications, since each unit of work represents some particular user's operation. In this case, as long as the system tables are set up to use the C locale, this would require at most one setlocale() call per unit of work.

d) They're actively using multiple collations in a single query, possibly even within a single sort (something like ORDER BY a COLLATE en_US, b COLLATE es_US). This would perform passably on glibc but abysmally on most other libc's.

From that point forward we would go about adding support for strcoll_l() and other interfaces to handle case (d) on various platforms. For platforms with no reasonable interface we could add an --enable-icu option that users or packagers could choose to use.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
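[Editorial sketch of the strcoll_l() direction mentioned for case (d): rather than switching the process-wide locale per comparison, each collation gets its own locale_t handle created once via newlocale(), and comparisons pass the handle explicitly. Availability of newlocale()/strcoll_l() is platform-dependent (POSIX.1-2008 / glibc); the struct and function names here are illustrative only.]

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>

typedef struct
{
    const char *name;       /* e.g. "en_US.UTF-8" */
    locale_t    loc;        /* (locale_t) 0 until lazily created */
} collation_cache_entry;

static int
compare_with_collation(const char *a, const char *b,
                       collation_cache_entry *entry)
{
    if (entry->loc == (locale_t) 0)
    {
        /* Build the per-collation handle once; reuse it thereafter */
        entry->loc = newlocale(LC_COLLATE_MASK, entry->name, (locale_t) 0);
        if (entry->loc == (locale_t) 0)
            return strcmp(a, b);    /* locale unavailable: fall back */
    }
    return strcoll_l(a, b, entry->loc);
}
```

Because each comparison carries its own handle, mixing collations within a single sort costs no locale switches at all, which is what makes case (d) cheap where this interface exists.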
Gregory Stark <stark@enterprisedb.com> writes:
> The big gotcha is what collation to use when comparing with data in the system
> tables, especially the shared system tables. I think we do need to define a
> database-wide encoding and collation to use for system tables.

You mean cluster-wide? If we can get away with that, it'd solve a lot of problems.

Note that the stuff in the system tables is mostly type "name", not text, and the comparison semantics for that have always been strcmp(), so the question of collation doesn't really apply. Name in itself doesn't care about encoding either, but I think we have to restrict encoding to avoid the problem of injecting data that's invalidly encoded into one database from another via the shared catalogs.

The other issue that'd have to be resolved is the problem of system log output. I think we'd wish that log messages are written in a uniform encoding (CSV output in particular is going to have a hard time otherwise), but what do you do when you need to report something that includes a character not present in that encoding?

			regards, tom lane
Tom Lane wrote:
> The other issue that'd have to be resolved is the problem of system log
> output. I think we'd wish that log messages are written in a uniform
> encoding (CSV output in particular is going to have a hard time
> otherwise) but what do you do when you need to report something that
> includes a character not present in that encoding?

I think the only problem with CSV logs would be in trying to read them back into Postgres (which I agree is the main point of having them). We need to be more aggressive about dealing with these problems, or else how will we ever get to per-column charsets/collations?

cheers

andrew
On Thu, Apr 03, 2008 at 06:54:50PM +0100, Gregory Stark wrote:
> The big gotcha is what collation to use when comparing with data in the system
> tables, especially the shared system tables. I think we do need to define a
> database-wide encoding and collation to use for system tables. (Unless we can
> get by with varchar_pattern_ops indexes on system tables?)

In my version I simply made all system tables use locale C, standard binary sorting. Anything else is likely to blow up if you want to start using multiple encodings. From a performance point of view this is best, since you don't want the planner bogging down because the user wants an expensive collation.

> b) They're using multiple collations for their data but only one "at a time".
> Either one per database or one per session. In which case they don't incur any
> overhead

The case I'm thinking of is people wanting some columns to be case-insensitive. I'm not sure if this will be in the first version, though.

Also, there isn't a one-to-one correspondence between locales and collations. There are many more collations, and I hope we will support that. The common variations are ascending/descending and case-sensitivity.

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.