Thread: tiny step toward threading: reduce dependence on setlocale()

tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
There was an unconference session at pgconf.dev related to threading
support. One of the problems identified was setlocale().

The attached series of patches make collation not depend on
setlocale(), even if the database collation uses the libc provider.

Since commit 8d9a9f034e, all supported platforms have locale_t, so we
can use strcoll_l(), etc., or uselocale() when no "_l" variant is
available.

A brief test shows that there may be a performance regression for libc
default collations. But if so, I'm not sure that's avoidable if the
goal is to take away setlocale. I'll see if removing the extra branches
mitigates it.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-06-05 at 17:23 -0700, Jeff Davis wrote:
> A brief test shows that there may be a performance regression for
> libc
> default collations. But if so, I'm not sure that's avoidable if the
> goal is to take away setlocale. I'll see if removing the extra
> branches
> mitigates it.

I redid the test and didn't see a difference, and then I ran a
standalone microbenchmark to compare strcoll() vs strcoll_l(), and
didn't see a difference there, either.

Another implementation may show a difference, but it doesn't seem to be
a problem for glibc.

I think this patch series is a nice cleanup, as well, making libc more
like the other providers and not dependent on global state.

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Thu, 2024-06-06 at 11:37 -0700, Jeff Davis wrote:
>
> I think this patch series is a nice cleanup, as well, making libc
> more
> like the other providers and not dependent on global state.

New rebased series attached with additional cleanup. Now that
pg_locale_t is never NULL, we can simplify the way the collation cache
works, eliminating ~100 lines.

--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: tiny step toward threading: reduce dependence on setlocale()

From
Peter Eisentraut
Date:
On 15.06.24 01:35, Jeff Davis wrote:
> On Thu, 2024-06-06 at 11:37 -0700, Jeff Davis wrote:
>>
>> I think this patch series is a nice cleanup, as well, making libc
>> more
>> like the other providers and not dependent on global state.
> 
> New rebased series attached with additional cleanup. Now that
> pg_locale_t is never NULL, we can simplify the way the collation cache
> works, eliminating ~100 lines.

Overall, this is great.  I have wished for a simplification like this
for a long time.  It is the logical continuation of relying on
locale_t stuff rather than process-global settings.


* v2-0001-Make-database-default-collation-internal-to-pg_lo.patch

+void
+pg_init_database_collation()

The function argument should be (void).

The prefix pg_ makes it sound like it's a user-facing function of some
sort.  Is there a reason for it?

Maybe add a small comment before the function.


* v2-0002-Make-database-collation-pg_locale_t-always-non-NU.patch

There is quite a bit of code duplication from
pg_newlocale_from_collation().  Can we do this better?


* v2-0003-ts_locale.c-do-not-use-NULL-to-mean-the-database-.patch

The "TODO" markers should be left, because what they refer to is that
these functions just use the default collation rather than something
passed in from the expression machinery.  This is not addressed by
this change.  (Obviously, the comments could have been clearer about
this.)


* v2-0004-Remove-support-for-null-pg_locale_t.patch

I found a few more places that should be adjusted in a similar way.

- In src/backend/regex/regc_pg_locale.c, the whole business with
   pg_regex_locale, in particular in pg_set_regex_collation().

- In src/backend/utils/adt/formatting.c, various places such as
   str_tolower().

- In src/backend/utils/adt/pg_locale.c, wchar2char() and char2wchar().
   (Also, for wchar2char() there is one caller that explicitly passes
   NULL.)

The changes at the call sites of pg_locale_deterministic() are
unfortunate, because they kind of go into the opposite direction: They
add checks for NULL locales instead of removing them.  (For a minute I
was thinking I was reading your patch backwards.)  We should think of
a way to write this clearer.

Looking for example at hashtext() after 0001..0004 applied, it is

     if (!lc_collate_is_c(collid))
         mylocale = pg_newlocale_from_collation(collid);

     if (!mylocale || pg_locale_deterministic(mylocale))
     {

But then after your 0006 patch, lc_locale_is_c() internally also calls
pg_newlocale_from_collation(), so this code just becomes redundant.
Better might be if pg_locale_deterministic() itself checked if collate
is C, and then hashtext() would just need to write:

     mylocale = pg_newlocale_from_collation(collid);

     if (pg_locale_deterministic(mylocale))
     {

The patch sequencing might be a bit tricky here.  Maybe it's ok if
patch 0004 stays as is in this respect if 0006 were to fix it back.


* v2-0005-Avoid-setlocale-in-lc_collate_is_c-and-lc_ctype_i.patch

Nothing uses this, AFAICT, so why?

Also, this creates more duplication between
pg_init_database_collation() and pg_newlocale_from_collation(), as
mentioned at patch 0002.


* v2-0006-Simplify-collation-cache.patch

ok




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-06-19 at 11:15 +0200, Peter Eisentraut wrote:
> > * v2-0001-Make-database-default-collation-internal-to-pg_lo.patch
> >
> > +void
> > +pg_init_database_collation()
> >
> > The function argument should be (void).
> >
> > The prefix pg_ makes it sound like it's a user-facing function of >
> > some
> > sort.  Is there a reason for it?
> >
> > Maybe add a small comment before the function.

Agreed, done.

> >
> > * v2-0002-Make-database-collation-pg_locale_t-always-non-NU.patch
> >
> > There is quite a bit of code duplication from
> > pg_newlocale_from_collation().  Can we do this better?

I refactored it into make_libc_collator().

> > * v2-0003-ts_locale.c-do-not-use-NULL-to-mean-the-database-.patch
> >
> > The "TODO" markers should be left, because what they refer to is
> > that
> > these functions just use the default collation rather than
> > something
> > passed in from the expression machinery.  This is not addressed by
> > this change.  (Obviously, the comments could have been clearer
> > about
> > this.)

Done.

> > * v2-0004-Remove-support-for-null-pg_locale_t.patch
> >
> > I found a few more places that should be adjusted in a similar way.

Fixed, thank you.

> > The changes at the call sites of pg_locale_deterministic() are
> > unfortunate

Yeah, that part was a bit annoying.

> > But then after your 0006 patch, lc_locale_is_c() internally also >
> > calls
> > pg_newlocale_from_collation()

Not always. It still returns early for C_COLLATION_OID or
POSIX_COLLATION_OID, and that's actually required because
pg_regcomp(..., C_COLLATION_OID) is called when parsing pg_hba.conf,
before catalog access is available. I don't think that detail is
relevant in the cases you brought up, but it got in the way of some
other refactoring I was trying to do.

> > , so this code just becomes redundant.
> > Better might be if pg_locale_deterministic() itself checked if >
> > collate
> > is C, and then hashtext() would just need to write:
> >
> >      mylocale = pg_newlocale_from_collation(collid);
> >
> >      if (pg_locale_deterministic(mylocale))
> >      {
> >
> > The patch sequencing might be a bit tricky here.  Maybe it's ok if
> > patch 0004 stays as is in this respect if 0006 were to fix it back.

Addressed in v3-0006.

> > * v2-0005-Avoid-setlocale-in-lc_collate_is_c-and-lc_ctype_i.patch
> >
> > Nothing uses this, AFAICT, so why?

You're right. It was used to avoid setlocale() in lc_collate_is_c(),
but that's eliminated in the next patch anyway.


Also, in v3-0005, I had to also check for "C" or "POSIX" in
make_libc_collator(), so that it wouldn't try to actually create the
locale_t in that case.


Regards,
    Jeff Davis



Attachment

Re: tiny step toward threading: reduce dependence on setlocale()

From
Andreas Karlsson
Date:
Nice refactoring!

Two small comments about CheckMyDatabase().

- Shouldn't we look at the default_locale.ctype_is_c when setting 
database_ctype_is_c instead of doing a strcmp()? or maybe we should even 
remove the global variable and always look at the default_locale?

- I think that the lookup of Anum_pg_database_datlocale could be done 
later in the code since it is not needed when we use a libc locale. E.g. 
as below.

     if (dbform->datlocprovider == COLLPROVIDER_LIBC)
         locale = collate;
     else
     {
         datum = SysCacheGetAttr(DATABASEOID, tup, 
Anum_pg_database_datlocale, &isnull);
         if (!isnull)
         locale = TextDatumGetCString(datum);
     }

Also is there any reaosn you do not squash th 4th and the 6th patch?

Andreas



Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Fri, 2024-07-26 at 19:38 +0200, Andreas Karlsson wrote:
> Nice refactoring!
>
> Two small comments about CheckMyDatabase().
>
> - Shouldn't we look at the default_locale.ctype_is_c when setting
> database_ctype_is_c instead of doing a strcmp()? or maybe we should
> even
> remove the global variable and always look at the default_locale?

database_ctype_is_c refers to the LC_CTYPE environment of the database
-- pg_database.datctype. default_locale.ctype_is_c is the ctype of the
database's default collation.

Confusing, I know, but it matters for a few things that still depend on
the LC_CTYPE, such as tsearch and maybe a few extensions. See
f413941f41.

> - I think that the lookup of Anum_pg_database_datlocale could be done
> later in the code since it is not needed when we use a libc locale.
> E.g.
> as below.

Done, thank you.

> Also is there any reaosn you do not squash th 4th and the 6th patch?

Done. I had to rearrange the patch ordering a bit because prior to the
cache refactoring patch, it's unsafe to call
pg_newlocale_from_collation() without checking lc_collate_is_c() or
lc_ctype_is_c() first.

Regards,
    Jeff Davis


Attachment

Re: tiny step toward threading: reduce dependence on setlocale()

From
Andreas Karlsson
Date:
On 7/26/24 10:35 PM, Jeff Davis wrote:
> database_ctype_is_c refers to the LC_CTYPE environment of the database
> -- pg_database.datctype. default_locale.ctype_is_c is the ctype of the
> database's default collation.
> 
> Confusing, I know, but it matters for a few things that still depend on
> the LC_CTYPE, such as tsearch and maybe a few extensions. See
> f413941f41.

Ah, right! That was thinko on my behalf.

The set of patches looks good to me now. There is further refactoring 
that can be done in this area (and should be done given all calls e.g to 
isalpha()) but I think this set of patches improves code readability 
while moving us away from setlocale().

And even if we take a tiny performance hit here, which your tests did 
not measure, I would say it is worth it both due to code clarity and due 
to not relying on thread unsafe state.

I do not see these patches in the commitfest app but if they were I 
would have marked them as ready for committer.

Andreas




Re: tiny step toward threading: reduce dependence on setlocale()

From
Peter Eisentraut
Date:
On 27.07.24 21:03, Andreas Karlsson wrote:
> On 7/26/24 10:35 PM, Jeff Davis wrote:
>> database_ctype_is_c refers to the LC_CTYPE environment of the database
>> -- pg_database.datctype. default_locale.ctype_is_c is the ctype of the
>> database's default collation.
>>
>> Confusing, I know, but it matters for a few things that still depend on
>> the LC_CTYPE, such as tsearch and maybe a few extensions. See
>> f413941f41.
> 
> Ah, right! That was thinko on my behalf.
> 
> The set of patches looks good to me now. There is further refactoring 
> that can be done in this area (and should be done given all calls e.g to 
> isalpha()) but I think this set of patches improves code readability 
> while moving us away from setlocale().
> 
> And even if we take a tiny performance hit here, which your tests did 
> not measure, I would say it is worth it both due to code clarity and due 
> to not relying on thread unsafe state.
> 
> I do not see these patches in the commitfest app but if they were I 
> would have marked them as ready for committer.

Here: https://commitfest.postgresql.org/48/5023/

I have also re-reviewed the patches and I agree they are good to go.




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Mon, 2024-07-29 at 21:45 +0200, Peter Eisentraut wrote:
> I have also re-reviewed the patches and I agree they are good to go.

I found a couple issues with the later patches:

* There was still some confusion about the default collation vs.
datcollate/datctype for callers of wchar2char() and char2wchar() (those
functions only work for libc). I introduced a new pg_locale_t structure
to represent datcollate/datctype regardless of datlocprovider to solve
this.

* Another loose end relying on setlocale(): in selfuncs.c, there's
still a call directly to strxfrm(), which depends on setlocale(). I
changed this to lookup the collation and then use pg_strxfrm(). That
should improve histogram selectivity estimates because it uses the
correct provider, rather than relying on setlocale(), right?

New series attached.

Regards,
    Jeff Davis


Attachment

Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Tue, 2024-07-30 at 12:13 -0700, Jeff Davis wrote:
> I found a couple issues with the later patches:
>
> * There was still some confusion about the default collation vs.
> datcollate/datctype for callers of wchar2char() and char2wchar()
> (those
> functions only work for libc). I introduced a new pg_locale_t
> structure
> to represent datcollate/datctype regardless of datlocprovider to
> solve
> this.

I didn't quite like the API I introduced in 0001, so I skipped 0001.

For 0002 I left char2wchar() and wchar2char() as-is, where they expect
libc and accept a NULL pg_locale_t. I committed the rest of 0002.

> * Another loose end relying on setlocale(): in selfuncs.c, there's
> still a call directly to strxfrm(), which depends on setlocale(). I
> changed this to lookup the collation and then use pg_strxfrm(). That
> should improve histogram selectivity estimates because it uses the
> correct provider, rather than relying on setlocale(), right?

Committed 0003.

With these changes, collations are no longer dependent on the
environment locale (setlocale()) at all for either collation behavior
(ORDER BY) or ctype behavior (LOWER(), etc.).

Additionally, unless I missed something, nothing in the server is
dependent on LC_COLLATE at all.

There are still some things that depend on setlocale() in one way or
another:

  - char2wchar() & wchar2char()
  - ts_locale.c
  - various places that depend on LC_CTYPE unrelated to the collation
infrastructure
  - things that depend on other locale settings, like LC_NUMERIC

We can address those as part of a separate thread. I'll count this as
committed.

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Peter Eisentraut
Date:
On 06.08.24 23:40, Jeff Davis wrote:
> With these changes, collations are no longer dependent on the
> environment locale (setlocale()) at all for either collation behavior
> (ORDER BY) or ctype behavior (LOWER(), etc.).
> 
> Additionally, unless I missed something, nothing in the server is
> dependent on LC_COLLATE at all.
> 
> There are still some things that depend on setlocale() in one way or
> another:
> 
>    - char2wchar() & wchar2char()
>    - ts_locale.c
>    - various places that depend on LC_CTYPE unrelated to the collation
> infrastructure
>    - things that depend on other locale settings, like LC_NUMERIC
> 
> We can address those as part of a separate thread. I'll count this as
> committed.

I have a couple of small follow-up patches for this.

First, in like.c, SB_lower_char() now reads:

static char
SB_lower_char(unsigned char c, pg_locale_t locale, bool locale_is_c)
{
     if (locale_is_c)
         return pg_ascii_tolower(c);
     else if (locale)
         return tolower_l(c, locale->info.lt);
     else
         return pg_tolower(c);
}

But after this patch set, locale cannot be NULL anymore, so the third 
branch is obsolete.

(Now that I look at it, pg_tolower() has some short-circuiting for ASCII 
letters, so it would not handle Turkish-i correctly if that had been the 
global locale.  By removing the use of pg_tolower(), we fix that issue 
in passing.)

Second, there are a number of functions in like.c like the above that 
take separate arguments like pg_locale_t locale, bool locale_is_c. 
Because pg_locale_t now contains the locale_is_c information, these can 
be combined.
Attachment

Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-08-07 at 22:44 +0200, Peter Eisentraut wrote:
> But after this patch set, locale cannot be NULL anymore, so the third
> branch is obsolete.

...

> Second, there are a number of functions in like.c like the above that
> take separate arguments like pg_locale_t locale, bool locale_is_c.
> Because pg_locale_t now contains the locale_is_c information, these
> can
> be combined.

I believe these patches are correct, but the reasoning is fairly
complex:

1. Some MatchText variants are called with 0 for locale. But that's OK
because ...

2. A MatchText variant only cares about the locale if MATCH_LOWER(t) is
defined, and ...

3. Only one variant, SB_IMatchText() defines MATCH_LOWER(), and ...

4. SB_IMatchText() is called with a non-zero locale.

All of these are a bit confusing to follow because it's generated code.
#2 is particularly non-obvious, because "locale" is not even an
argument of the MATCH_LOWER(t) or GETCHAR(t) macros, it's taken
implicitly from the outer scope.

I don't think your patches cause this confusion, but is there a way you
can clarify some of this along the way?

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Peter Eisentraut
Date:
On 07.08.24 22:44, Peter Eisentraut wrote:
> (Now that I look at it, pg_tolower() has some short-circuiting for ASCII 
> letters, so it would not handle Turkish-i correctly if that had been the 
> global locale.  By removing the use of pg_tolower(), we fix that issue 
> in passing.)

It occurred to me that this issue also surfaces in a more prominent 
place.  These arguably-wrong pg_tolower() and pg_toupper() calls were 
also used by the normal SQL lower() and upper() functions before commit 
e9931bfb751 if you used a single byte encoding.

For example, in PG17, multi-byte encoding:

initdb --locale=tr_TR.utf8

select upper('hij'); --> HİJ

PG17, single-byte encoding:

initdb --locale=tr_TR  # uses LATIN5

select upper('hij'); --> HIJ

With current master, after commit e9931bfb751, you get the first result 
in both cases.

So this could break indexes across pg_upgrade in such configurations.




Re: tiny step toward threading: reduce dependence on setlocale()

From
Peter Eisentraut
Date:
On 08.08.24 22:00, Jeff Davis wrote:
> On Wed, 2024-08-07 at 22:44 +0200, Peter Eisentraut wrote:
>> But after this patch set, locale cannot be NULL anymore, so the third
>> branch is obsolete.
> 
> ...
> 
>> Second, there are a number of functions in like.c like the above that
>> take separate arguments like pg_locale_t locale, bool locale_is_c.
>> Because pg_locale_t now contains the locale_is_c information, these
>> can
>> be combined.
> 
> I believe these patches are correct, but the reasoning is fairly
> complex:
> 
> 1. Some MatchText variants are called with 0 for locale. But that's OK
> because ...
> 
> 2. A MatchText variant only cares about the locale if MATCH_LOWER(t) is
> defined, and ...
> 
> 3. Only one variant, SB_IMatchText() defines MATCH_LOWER(), and ...
> 
> 4. SB_IMatchText() is called with a non-zero locale.
> 
> All of these are a bit confusing to follow because it's generated code.
> #2 is particularly non-obvious, because "locale" is not even an
> argument of the MATCH_LOWER(t) or GETCHAR(t) macros, it's taken
> implicitly from the outer scope.
> 
> I don't think your patches cause this confusion, but is there a way you
> can clarify some of this along the way?

Yes, this is also my analysis.  The patch in 
<https://www.postgresql.org/message-id/flat/700d2e86-bf75-4607-9cf2-f5b7802f6e88@eisentraut.org> 
would replace passing 0 with an actual locale object.  The changes in 
GenericMatchText() could also be applied independently so that we'd 
always pass in a non-zero locale value, even if it would not be used in 
some cases.  I need to update that patch to cover your latest changes. 
I'll see if I can propose something here that looks a bit nicer.




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Thu, 2024-08-08 at 23:09 +0200, Peter Eisentraut wrote:
> With current master, after commit e9931bfb751, you get the first
> result
> in both cases.
>
> So this could break indexes across pg_upgrade in such configurations.

Good observation. Is there any action we should take here, or just add
that to the release notes for 18?

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> We can address those as part of a separate thread. I'll count this as
> committed.

Coverity has a nit about this:

*** CID 1616189:  Null pointer dereferences  (REVERSE_INULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/like.c: 206 in Generic_Text_IC_like()
200          * on the pattern and text, but instead call SB_lower_char on each
201          * character.  In the multi-byte case we don't have much choice :-(. Also,
202          * ICU does not support single-character case folding, so we go the long
203          * way.
204          */
205
>>>     CID 1616189:  Null pointer dereferences  (REVERSE_INULL)
>>>     Null-checking "locale" suggests that it may be null, but it has already been dereferenced on all paths leading
tothe check. 
206         if (pg_database_encoding_max_length() > 1 || (locale && locale->provider == COLLPROVIDER_ICU))
207         {
208             pat = DatumGetTextPP(DirectFunctionCall1Coll(lower, collation,
209                                                          PointerGetDatum(pat)));
210             p = VARDATA_ANY(pat);
211             plen = VARSIZE_ANY_EXHDR(pat);

I assume it would now be okay to take out "locale &&" here?

            regards, tom lane



Re: tiny step toward threading: reduce dependence on setlocale()

From
Peter Eisentraut
Date:
On 11.08.24 18:33, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> We can address those as part of a separate thread. I'll count this as
>> committed.
> 
> Coverity has a nit about this:
> 
> *** CID 1616189:  Null pointer dereferences  (REVERSE_INULL)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/utils/adt/like.c: 206 in Generic_Text_IC_like()
> 200          * on the pattern and text, but instead call SB_lower_char on each
> 201          * character.  In the multi-byte case we don't have much choice :-(. Also,
> 202          * ICU does not support single-character case folding, so we go the long
> 203          * way.
> 204          */
> 205
>>>>      CID 1616189:  Null pointer dereferences  (REVERSE_INULL)
>>>>      Null-checking "locale" suggests that it may be null, but it has already been dereferenced on all paths
leadingto the check.
 
> 206         if (pg_database_encoding_max_length() > 1 || (locale && locale->provider == COLLPROVIDER_ICU))
> 207         {
> 208             pat = DatumGetTextPP(DirectFunctionCall1Coll(lower, collation,
> 209                                                          PointerGetDatum(pat)));
> 210             p = VARDATA_ANY(pat);
> 211             plen = VARSIZE_ANY_EXHDR(pat);
> 
> I assume it would now be okay to take out "locale &&" here?

Correct, that check is no longer necessary.




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Mon, 2024-08-12 at 09:00 +0200, Peter Eisentraut wrote:
> > I assume it would now be okay to take out "locale &&" here?
>
> Correct, that check is no longer necessary.

Thank you, fixed.

Regards,
    Jeff Davis





Re: tiny step toward threading: reduce dependence on setlocale()

From
Peter Eisentraut
Date:
I'm wondering if after this patch series, lc_collate_is_c() and 
lc_ctype_is_c() are still useful.

They used to be completely separate from pg_newlocale_from_collation(), 
but now they are just mostly a thin wrapper around it.  Except there is 
some hardcoded handling of C_COLLATION_OID and POSIX_COLLATION_OID.  Do 
we care about that?

In many places, the notational and structural complexity would be 
significantly improved if we changed code like

     if (pg_collate_is_c(colloid))
     {
         ...
     }
     else
     {
         pg_locale_t locale = pg_newlocale_from_collation(colloid);

         if (locale->provider == ...)
         {
             ...
     }

to more like

     pg_locale_t locale = pg_newlocale_from_collation(colloid);

     if (locale->collate_is_c)
     {
         ...
     }
     else if (locale->provider == ...)
         ...
     }
     ...

However, it's not clear whether the hardcoded handling of some 
collations is needed for performance parity or perhaps some 
bootstrapping reasons.  It would be useful to get that cleared up. 
Thoughts?




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Tue, 2024-08-13 at 17:11 +0200, Peter Eisentraut wrote:

> They used to be completely separate from
> pg_newlocale_from_collation(),
> but now they are just mostly a thin wrapper around it.  Except there
> is
> some hardcoded handling of C_COLLATION_OID and POSIX_COLLATION_OID. 
> Do
> we care about that?

...

> However, it's not clear whether the hardcoded handling of some
> collations is needed for performance parity or perhaps some
> bootstrapping reasons.  It would be useful to get that cleared up.
> Thoughts?

There's at least one place where we expect lc_collate_is_c() to work
without catalog access at all: libpq/hba.c uses regexes with
C_COLLATION_OID.

But I don't think that's a major problem -- we can just move the
hardcoded test into pg_newlocale_from_collation() and return a
predefined struct with collate_is_c/ctype_is_c already set.

+1.

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Andreas Karlsson
Date:
On 8/13/24 7:56 PM, Jeff Davis wrote:
> But I don't think that's a major problem -- we can just move the
> hardcoded test into pg_newlocale_from_collation() and return a
> predefined struct with collate_is_c/ctype_is_c already set.

I tried that out but thought it felt cleaner to do the hardcoding in 
pg_set_regex_collation(). What do you think?

I have attached patches removing lc_collate_is_c() and lc_ctype_is_c(). 
I have not checked if there are any performance regressions when using 
the C and POSIX locales but remove these special cases makes the code a 
lot cleaner in my book.

I also attach some other clean up patches I did while touching this code.

0001-Remove-lc_collate_is_c.patch

Removes lc_collate_is_c().

0002-Remove-lc_ctype_is_c.patch

Removes lc_ctype_is_c() and POSIX_COLLATION_OID which is no longer 
necessary.

0003-Remove-dubious-check-against-default-locale.patch

This patch removes a check against DEFAULT_COLLATION_OID which I thought 
looked really dubious. Shouldn't this just be a simple check for if the 
locale is deterministic? Since we know we have a valid locale that 
should be enough, right?

0004-Do-not-check-both-for-collate_is_c-and-deterministic.patch

It is redundant to check both for "collation_is_c && deterministic", right?

0005-Remove-pg_collate_deterministic-and-check-field-dire.patch

Since after my patches we look a lot directly at the collation_is_c and 
ctype_is_c fields I think the thin wrapper around the deterministic 
field makes it seem like there is more to it so I suggest that we should 
just remove it.

0006-Slightly-refactor-varstr_sortsupport-to-improve-read.patch

Small refactor to make a hard to read function a bit easier to read.

Andreas
Attachment

Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-08-14 at 01:31 +0200, Andreas Karlsson wrote:
> 0001-Remove-lc_collate_is_c.patch
>
> Removes lc_collate_is_c().

This could use some performance testing, as the commit message says,
otherwise it looks good.

> 0002-Remove-lc_ctype_is_c.patch
>
> Removes lc_ctype_is_c() and POSIX_COLLATION_OID which is no longer
> necessary.

This overlaps a bit with what Peter already proposed here:

https://www.postgresql.org/message-id/4f562d84-87f4-44dc-8946-01d6c437936f%40eisentraut.org

right?

> 0003-Remove-dubious-check-against-default-locale.patch
>
> This patch removes a check against DEFAULT_COLLATION_OID which I
> thought
> looked really dubious. Shouldn't this just be a simple check for if
> the
> locale is deterministic? Since we know we have a valid locale that
> should be enough, right?

Looks good to me. The code was correct in the sense that the default
collation is always deterministic, but (a) Peter is working on non-
deterministic default collations; and (b) it was a redundant check.

> 0004-Do-not-check-both-for-collate_is_c-and-deterministic.patch
>
> It is redundant to check both for "collation_is_c && deterministic",
> right?

+1.

> 0005-Remove-pg_collate_deterministic-and-check-field-dire.patch
>
> Since after my patches we look a lot directly at the collation_is_c
> and
> ctype_is_c fields I think the thin wrapper around the deterministic
> field makes it seem like there is more to it so I suggest that we
> should
> just remove it.

+1. When I added that, there was also a NULL check, but that was
removed so we might as well just read the field.

> 0006-Slightly-refactor-varstr_sortsupport-to-improve-read.patch
>
> Small refactor to make a hard to read function a bit easier to read.

Looks good to me.

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Wed, 2024-08-28 at 18:43 +0200, Andreas Karlsson wrote:
> On 8/15/24 12:55 AM, Jeff Davis wrote:
> > This overlaps a bit with what Peter already proposed here:
> >
> > https://www.postgresql.org/message-id/4f562d84-87f4-44dc-8946-01d6c437936f%40eisentraut.org
> >
> > right?
>
> Maybe I am missing something but not as far as I can see. It is
> related
> but I do not see any overlap.

v2-0002:

Oh, I see, pattern_char_isalpha() only has one caller, and it never
passes a NULL for pg_locale_t, so my complaint doesn't affect your
patch. This is about ready, then.

> > > 0003-Remove-dubious-check-against-default-locale.patch
> > >
> > > This patch removes a check against DEFAULT_COLLATION_OID which I
> > > thought
> > > looked really dubious. Shouldn't this just be a simple check for
> > > if
> > > the
> > > locale is deterministic? Since we know we have a valid locale
> > > that
> > > should be enough, right?
> >
> > Looks good to me. The code was correct in the sense that the
> > default
> > collation is always deterministic, but (a) Peter is working on non-
> > deterministic default collations; and (b) it was a redundant check.
>
> Suspected so.
>
> I have attached a set of rebased patches with an added assert. Maybe
> I
> should start a new thread though.

v2-0001:

* There was a performance regression for repeated lookups, such as a "t
< 'xyz'" predicate. I committed 12d3345c0d (to remember the last
collation lookup), which will prevent that regression.

* This patch may change the handling of collation oid 0, and I'm not
sure whether that was intentional or not. lc_collate_is_c(0) returned
false, whereas pg_newlocale_from_collation(0)->collate_is_c raised
Assert or threw an cache lookup error. I'm not sure if this is an
actual problem, but looking at the callers, they should be more
defensive if they expect a collation oid of 0 to work at all.

* The comment in lc_collate_is_c() said that it returned false with the
idea that the caller would throw a useful error, and I think that
comment has been wrong for a while. If it returns false, the caller is
expected to call pg_newlocale_from_collation(), which would Assert on a
debug build. Should we remove that Assert, too, so that it will
consistently throw a cache lookup failure?

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
Committed v2-0001.

On Tue, 2024-09-03 at 22:04 -0700, Jeff Davis wrote:

> * This patch may change the handling of collation oid 0, and I'm not
> sure whether that was intentional or not. lc_collate_is_c(0) returned
> false, whereas pg_newlocale_from_collation(0)->collate_is_c raised
> Assert or threw an cache lookup error. I'm not sure if this is an
> actual problem, but looking at the callers, they should be more
> defensive if they expect a collation oid of 0 to work at all.

For functions that do call pg_newlocale_from_collation() when
lc_collation_is_c() returns false, the behavior is unchanged.

There are only 3 callers which don't follow that pattern:

  * spg_text_inner_consistent: gets collation ID from the index, and
text is a collatable type so it will be valid

  * match_pattern_prefix: same

  * make_greater_string: I changed the order of the tests in
make_greater_string so that if len=0 and collation=0, it won't throw an
error. If len !=0, it goes into varstr_cmp(), which will call
pg_newlocale_from_collation().

> * The comment in lc_collate_is_c() said that it returned false with
> the
> idea that the caller would throw a useful error, and I think that
> comment has been wrong for a while. If it returns false, the caller
> is
> expected to call pg_newlocale_from_collation(), which would Assert on
> a
> debug build. Should we remove that Assert, too, so that it will
> consistently throw a cache lookup failure?

I fixed this by replacing the assert with an elog(ERROR, ...), so that
it will consistently show a "cache lookup failed for collation 0"
regardless of whether it's a debug build or not. It's not expected that
the error will be encountered.

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Jeff Davis
Date:
On Mon, 2024-09-09 at 15:37 +0200, Peter Eisentraut wrote:
> In the end, I figured the best thing to do here is to add an explicit
> locale argument to MATCH_LOWER() and GETCHAR() so one can actually
> understand this code by reading it.  New patch attached.

Looks good to me, thank you.

Regards,
    Jeff Davis




Re: tiny step toward threading: reduce dependence on setlocale()

From
Andreas Karlsson
Date:
On 9/4/24 11:45 PM, Jeff Davis wrote:
> Committed v2-0001.
>
 > [...]
>
> I fixed this by replacing the assert with an elog(ERROR, ...), so that
> it will consistently show a "cache lookup failed for collation 0"
> regardless of whether it's a debug build or not. It's not expected that
> the error will be encountered.

Thanks!

Andreas




Re: tiny step toward threading: reduce dependence on setlocale()

From
Peter Eisentraut
Date:
On 12.09.24 22:01, Jeff Davis wrote:
> On Mon, 2024-09-09 at 15:37 +0200, Peter Eisentraut wrote:
>> In the end, I figured the best thing to do here is to add an explicit
>> locale argument to MATCH_LOWER() and GETCHAR() so one can actually
>> understand this code by reading it.  New patch attached.
> 
> Looks good to me, thank you.

committed, thanks