Re: ICU locale validation / canonicalization - Mailing list pgsql-hackers

From Jeff Davis
Subject Re: ICU locale validation / canonicalization
Date
Msg-id 423180d32bd2e9a61b839aff0dfefa1655d2fc1f.camel@j-davis.com
In response to Re: ICU locale validation / canonicalization  (Noah Misch <noah@leadboat.com>)
Responses Re: ICU locale validation / canonicalization
List pgsql-hackers
On Tue, 2023-05-02 at 07:29 -0700, Noah Misch wrote:
> On Thu, Mar 30, 2023 at 08:59:41AM +0200, Peter Eisentraut wrote:
> > On 30.03.23 04:33, Jeff Davis wrote:
> > > Attached is a new version of the final patch, which performs
> > > canonicalization. I'm not 100% sure that it's wanted, but it still
> > > seems like a good idea to get the locales into a standard format in
> > > the catalogs, and if a lot more people start using ICU in v16
> > > (because it's the default), then it would be a good time to do it.
> > > But perhaps there are risks?
> >
> > I say, let's do it.
>
> The following is not cause for postgresql.git changes at this time, but
> I'm sharing it in case it saves someone else the study effort.  Commit
> ea1db8a ("Canonicalize ICU locale names to language tags.") slowed
> buildfarm member hoverfly, but that disappears if I drop
> debug_parallel_query from its config.  Typical end-to-end duration rose
> from 2h5m to 2h55m.  Most-affected were installcheck runs, which rose
> from 11m to 19m.  (The "check" stage uses NO_LOCALE=1, so it changed
> less.)  From profiles, my theory is that each of the many parallel
> workers burns notable CPU and I/O opening its ICU collator for the first
> time.

I didn't repro the overall test timings (mine is ~1m40s compared to
~11-19m on hoverfly) but I think a microbenchmark on the ICU calls
showed a possible cause.

I ran open in a loop 10M times on the requested locale. The root locale
(spelled "und"[1], "root", or "") takes about 1.3s to open 10M times;
simple locales like "en", "fr-CA", and "de-DE" are all a little slower at
3.3s.

Unrecognized locales like "xyz" take about 10 times as long: 13s to
open 10M times, presumably to perform the fallback logic that
ultimately opens the root locale. Not sure if 10X slower in the open
path is enough to explain the overall test slowdown.
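
For anyone who wants to try a similar measurement, here is a minimal
sketch of such a loop. It is not the actual benchmark code: it assumes
the call being timed is ucol_open() paired with ucol_close(), and the
names time_opens, noop_open, noop_close, icu_open, icu_close and the
USE_ICU macro are made up for illustration. The no-op pair only measures
loop overhead, so the harness can be compiled without ICU installed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical callback types: open returns an opaque handle, NULL on failure. */
typedef void *(*open_fn) (const char *locale);
typedef void (*close_fn) (void *handle);

/*
 * Run `iterations` open/close cycles on `locale` and return the CPU
 * seconds consumed, or -1.0 if any open fails.
 */
double
time_opens(open_fn do_open, close_fn do_close, const char *locale,
           long iterations)
{
    clock_t     start;

    assert(do_open != NULL && do_close != NULL);
    start = clock();
    for (long i = 0; i < iterations; i++)
    {
        void       *handle = do_open(locale);

        if (handle == NULL)
            return -1.0;
        do_close(handle);
    }
    return (double) (clock() - start) / CLOCKS_PER_SEC;
}

/* Baseline pair that does no real work; it measures loop overhead only. */
void *
noop_open(const char *locale)
{
    return (void *) locale;
}

void
noop_close(void *handle)
{
    (void) handle;
}

#ifdef USE_ICU
#include <unicode/ucol.h>

/* Assumption: the "open" being measured is ucol_open(). */
static void *
icu_open(const char *locale)
{
    UErrorCode  status = U_ZERO_ERROR;
    UCollator  *coll = ucol_open(locale, &status);

    /* Warnings (such as falling back to root) still count as success. */
    return U_FAILURE(status) ? NULL : (void *) coll;
}

static void
icu_close(void *handle)
{
    ucol_close((UCollator *) handle);
}

int
main(void)
{
    const char *locales[] = {"und", "root", "", "en", "fr-CA", "de-DE", "xyz"};
    const long  iterations = 10000000L;     /* 10M, as in the text above */

    for (size_t i = 0; i < sizeof(locales) / sizeof(locales[0]); i++)
        printf("%-6s %.2fs\n", locales[i],
               time_opens(icu_open, icu_close, locales[i], iterations));
    return 0;
}
#endif
```

With ICU development headers installed, something like
"cc -O2 -DUSE_ICU bench.c -licui18n -licuuc" should build it (the file
name is arbitrary). Note that clock() reports CPU time rather than
wall-clock time, which seems like the more relevant number here.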

My guess is that the ICU locale for these tests is not recognized, or
is some other locale that opens slowly. Can you tell me the actual
daticulocale?

Regards,
    Jeff Davis

[1] It appears that "und" is also slow to open in ICU < 64. Hoverfly is
on v58, so it's possible that's the problem if daticulocale=und.
