Thread: Locale, Collation, ICU patch
Regarding the ICU patch in the commitfest, here's my plan. IMHO the idea of making ICU a hard dependency which Postgres will have to use forevermore on all systems is a non-starter. I'm not entirely against having ICU as a supported collation system which packagers on systems with weak locale support can choose to make a dependency of their binary packages, though, assuming the issues raised elsewhere about ICU are resolved.

As long as this bogeyman is scaring us, though, it's preventing us from having the SQL-standard collation syntax and the accompanying catalog and planner changes. And as long as we don't have that support -- which is a big job -- nobody who's interested in implementing ICU or strcoll_l() or any other interface for a new platform will get around to it. The actual porting glue to call those functions on each platform is fairly lightweight and could easily be done by experts on that platform who aren't catalog and planner mavens.

So we have a bit of a chicken-and-egg problem: we aren't getting the planner and syntax changes because we aren't sure the support would be good on every platform, and we aren't getting the platform support because we don't have the planner and catalog changes.

What I want to do is focus on adding the planner and catalog changes somehow. We implement a kind of baseline locale support, something only slightly better than what we have now, using setlocale() before every comparison. This is clearly not the recommended configuration, but as long as it handles what we handle today without a performance hit, and a bit more besides, it would be a big start.

I'm assuming we would check whether the desired locale is the current locale and skip the assignment. So if only one locale is *actually* in use then basically no additional overhead is incurred. Moreover, if the desired locale is C then we can skip the assignment and use strcmp() directly.
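[Editorial sketch of the baseline approach described above: switch locales with setlocale() only when the requested collation differs from the one already active, and bypass locale machinery entirely for the C locale. The function name pg_compare_text and the caching scheme are illustrative, not actual Postgres code.]

```c
#include <locale.h>
#include <string.h>

/* Track which LC_COLLATE setting is currently active in the process. */
static char current_collate[64] = "C";

static int
pg_compare_text(const char *a, const char *b, const char *collation)
{
    /* C locale: byte-wise comparison, no setlocale() call at all */
    if (strcmp(collation, "C") == 0)
        return strcmp(a, b);

    /* Only pay for setlocale() when the desired locale changes */
    if (strcmp(collation, current_collate) != 0)
    {
        if (setlocale(LC_COLLATE, collation) == NULL)
            return strcmp(a, b);    /* unknown locale: degrade to strcmp */
        strncpy(current_collate, collation, sizeof(current_collate) - 1);
    }
    return strcoll(a, b);
}
```

With only one non-C locale in use, the setlocale() branch is taken at most once, which is the "basically no additional overhead" case.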
So actually, as long as only one non-C locale is in use, no additional overhead would be incurred.

The big gotcha is what collation to use when comparing with data in the system tables, especially the shared system tables. I think we do need to define a database-wide encoding and collation to use for system tables. (Unless we can get by with varchar_pattern_ops indexes on system tables?)

So the following use cases arise:

a) They're actually using only one collation for both the system tables and their own data. This is well handled by our existing setup and would be basically unchanged in the new setup.

b) They're using multiple collations for their data but only one "at a time", either one per database or one per session. In which case they don't incur any overhead.

c) They're using multiple collations for their data but only one collation in a given application unit of work. This is probably the most common case for OLTP applications, since each unit of work represents some particular user's operation. In this case, as long as the system tables are set up to use the C locale, this would require at most one setlocale() call per unit of work.

d) They're actively using multiple collations in a single query, possibly even within a single sort (something like ORDER BY a COLLATE en_US, b COLLATE es_US). This would perform passably on glibc but abysmally on most other libc's.

From that point forward we would go about adding support for strcoll_l() and other interfaces to handle case (d) on various platforms. For platforms with no reasonable interface we could add an --enable-icu option that users or packagers could choose to use.

-- 
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
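[Editorial sketch of the strcoll_l() direction mentioned for case (d): rather than switching the process-wide locale per comparison, each collation gets its own locale_t handle created once via newlocale(), and comparisons pass the handle explicitly. Availability of newlocale()/strcoll_l() is platform-dependent (POSIX.1-2008 / glibc); the struct and function names here are illustrative only.]

```c
#define _POSIX_C_SOURCE 200809L
#include <locale.h>
#include <string.h>

typedef struct
{
    const char *name;       /* e.g. "en_US.UTF-8" */
    locale_t    loc;        /* (locale_t) 0 until lazily created */
} collation_cache_entry;

static int
compare_with_collation(const char *a, const char *b,
                       collation_cache_entry *entry)
{
    if (entry->loc == (locale_t) 0)
    {
        /* Build the per-collation handle once; reuse it thereafter */
        entry->loc = newlocale(LC_COLLATE_MASK, entry->name, (locale_t) 0);
        if (entry->loc == (locale_t) 0)
            return strcmp(a, b);    /* locale unavailable: fall back */
    }
    return strcoll_l(a, b, entry->loc);
}
```

Because each comparison carries its own handle, mixing collations within a single sort costs no locale switches at all, which is what makes case (d) cheap where this interface exists.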
Gregory Stark <stark@enterprisedb.com> writes:
> The big gotcha is what collation to use when comparing with data in the system
> tables, especially the shared system tables. I think we do need to define a
> database-wide encoding and collation to use for system tables.

You mean cluster-wide? If we can get away with that, it'd solve a lot of problems.

Note that the stuff in the system tables is mostly type "name", not text, and the comparison semantics for that have always been strcmp(), so the question of collation doesn't really apply. Name in itself doesn't care about encoding either, but I think we have to restrict encoding to avoid the problem of injecting data that's invalidly encoded into one database from another via the shared catalogs.

The other issue that'd have to be resolved is the problem of system log output. I think we'd wish that log messages are written in a uniform encoding (CSV output in particular is going to have a hard time otherwise), but what do you do when you need to report something that includes a character not present in that encoding?

			regards, tom lane
Tom Lane wrote:
> The other issue that'd have to be resolved is the problem of system log
> output. I think we'd wish that log messages are written in a uniform
> encoding (CSV output in particular is going to have a hard time
> otherwise) but what do you do when you need to report something that
> includes a character not present in that encoding?

I think the only problem with CSV logs would be in trying to read them back into Postgres (which I agree is the main point of having them). We need to be more aggressive about dealing with these problems, or else how will we ever get to per-column charsets/collations?

cheers

andrew
On Thu, Apr 03, 2008 at 06:54:50PM +0100, Gregory Stark wrote:
> The big gotcha is what collation to use when comparing with data in the system
> tables, especially the shared system tables. I think we do need to define a
> database-wide encoding and collation to use for system tables. (Unless we can
> get by with varchar_pattern_ops indexes on system tables?)

In my version I simply made all system tables use locale C, standard binary sorting. Anything else is likely to blow up if you want to start using multiple encodings. From a performance point of view this is best, since you don't want the planner bogging down because the user wants an expensive collation.

> b) They're using multiple collations for their data but only one "at a time".
> Either one per database or one per session. In which case they don't incur any
> overhead

The case I'm thinking of is people wanting some columns to be case-insensitive. I'm not sure if this will be in the first version, though.

Also, there isn't a one-to-one correspondence between locales and collations. There are many more collations, and I hope we will support that. The common variations are ascending/descending and case-sensitivity.

Have a nice day,
-- 
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.