Thread: PATCH: CITEXT 2.0
Howdy,

[N.B.: I tried to send this a while ago but it didn't get delivered, I'm assuming because, with the uncompressed patch, the email was too big for -hackers. So this is a re-send with the patch gzip'd. Sorry for any duplication.]

Please find attached a patch adding a locale-aware, case-insensitive text type, called citext, as a contrib module. A few notes:

* I had originally called it lctext, as it's not a true case-insensitive type, but just converts strings to lowercase before comparing them. I changed it to citext at Tom Lane's suggestion, to ease compatibility for users of the original citext module on pgFoundry.

* Differences from the original citext are:
  + Locale-aware lowercasing of strings, rather than just lowercasing ASCII characters.
  + No implicit casts from text to citext except via assignment.
  + A few more functions overloaded.
  + Works with 8.3 and CVS head.

* Many thanks to whoever added str_tolower() to formatting.c. If I had known about that, I could have saved myself a lot of grief! My original implementation for 8.3.1 had copied a lot of code from oracle_compat.c to get things working. With this patch, I've eliminated a whole lot of code, as I can now just call str_tolower(). So thank you for that! I'll probably keep my original in my personal Subversion repository, but I don't know about releasing it if this is accepted as a contrib module for 8.4.

* All comparisons simply convert the strings to be compared to lowercase using str_tolower(). I've made no other optimizations, though I'm sure someone with more experience with collations and such could add them.

* The regression test uses a new module I've created, now on pgFoundry, called pgTAP. It should just work. sql/citext.sql adds plpgsql to the database and then includes pgtap.sql, which has the test functions in it.

* I wrote the tests assuming a collation of en_US.UTF-8. I expect it'd work with most West European languages, and maybe all languages other than the C locale, but I'm not sure. YMMV. If there's a way to generalize it and still be able to test the locale awareness, that would be great. What locales do the build farm servers use?

* In the documentation, I've pitched this type as a replacement for the use of LOWER() in ad-hoc queries, while also stipulating that this is not a "true" case-insensitive text type, and is furthermore less efficient than just TEXT (though I'm sure more efficient than ad-hoc LOWER()s). I've also mentioned a few other caveats, including casts for TEXT that don't work for citext and non-case-insensitive matching in replace(), regexp_replace(), and a few others.

* I wrote all the code here myself, but of course used the original citext implementation (which is case-insensitive only for ASCII characters) for inspiration and guidance. Thanks to Donald Fraser for that original implementation.

I've compiled the CVS checkout, run its regressions, then built and installed the citext module (hence my discovery of the deprecation of wstring_lower and the addition of str_tolower -- should the declaration of the former be removed from formatting.c?), and all tests passed as of an hour ago.

I of course welcome feedback, advice, insults, commiserations, and just about any mode of comment on this patch. Please let me know if I need to provide any additional information.

Best,

David
Attachment
David E. Wheeler wrote:
> Howdy,
>
> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module. A few notes:

What is the benefit of having this type when per-database collation is implemented? It seems to me that it's an overlapping feature. The definition of a collation should allow case sensitivity to be set up. The only advantage I see is that you can combine case-sensitive and case-insensitive text in one database. However, that will be solved when per-column collation is implemented.

Zdenek
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> However, it will be solved when collation per column will be implemented.

Well, yeah, but that seems a very long way off. In the meantime a lot of people use the existing pgfoundry citext module.

regards, tom lane
On Jul 1, 2008, at 07:38, Tom Lane wrote:

>> However, it will be solved when collation per column will be
>> implemented.
>
> Well, yeah, but that seems a very long way off. In the meantime a lot
> of people use the existing pgfoundry citext module.

And even more of us have written queries using LOWER(col) = LOWER(?), which is just a PITA.

Best,

David
David E. Wheeler wrote:
> On Jul 1, 2008, at 07:38, Tom Lane wrote:
>
>>> However, it will be solved when collation per column will be
>>> implemented.
>>
>> Well, yeah, but that seems a very long way off. In the meantime a lot
>> of people use the existing pgfoundry citext module.
>
> And even more of us have written queries using LOWER(col) = LOWER(?),
> which is just a PITA.

OK. I'm going to review your patch.

Zdenek
Yay!

Best,

David

Sent from my iPhone

On Jul 2, 2008, at 5:03, Zdenek Kotala <Zdenek.Kotala@Sun.COM> wrote:

> David E. Wheeler wrote:
>> On Jul 1, 2008, at 07:38, Tom Lane wrote:
>>>> However, it will be solved when collation per column will be
>>>> implemented.
>>>
>>> Well, yeah, but that seems a very long way off. In the meantime a lot
>>> of people use the existing pgfoundry citext module.
>>
>> And even more of us have written queries using LOWER(col) = LOWER(?),
>> which is just a PITA.
>
> OK. I'm going to review your patch.
>
> Zdenek
David E. Wheeler wrote:
> Howdy,

Howdy

> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module. A few notes:

I went through your code and I have the following comments/questions:

1) Formatting. Please do not type a space before the asterisk:

   char * lcstr, * rcstr  ->  char *lcstr, *rcstr

and do not put extra spaces inside brackets:

   citextcmp( left, right )  ->  citextcmp(left, right)

The rest seems OK to me (remove the 8.2 version mention at the bottom).

2) citextcmp returns int but citext_cmp returns int32. I think citextcmp should return int32 as well.

3) The citext_larger and citext_smaller functions have a memory leak. You don't call PG_FREE_IF_COPY on the unused text. I'm not sure if the return value in these functions should be duplicated or not.

4) Operator = citext_eq is not correct. See this comment: http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac There must be a difference between equality and collation. For example, in the Czech language 'láska' and 'laská' are different words, which means that 'láska' != 'laská', but there is no difference in collation order. See the Unicode Universal Collation Algorithm for details.

5) There are several commented-out lines in the CREATE OPERATOR statements, mostly related to NEGATOR. Is there some reason for that? Also, OPERATOR || probably has a wrong negator.

6) You use pgTAP for the unit tests. That is something that should be widely discussed. It is a new approach and I'm not the correct person to say whether it is good or not. I see two problems there: first, the licensing is not clear (https://svn.kineticode.com/pgtap/trunk/README.pgtap); and second, in my opinion it should be implemented as a general framework, not only as a part of the citext contrib module.

Please let me know when you have updated code and I will start deep testing.

With regards

Zdenek
On Tuesday, July 1, 2008, Tom Lane wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> However, it will be solved when collation per column will be implemented.
>
> Well, yeah, but that seems a very long way off. In the meantime a lot
> of people use the existing pgfoundry citext module.

Indeed, but why isn't this code put there as well?
> I went through your code and I have the following comments/questions:

One more comment:

7) A hash opclass is absent. The hash opclass framework is used for hash joins.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Peter Eisentraut <peter_e@gmx.net> writes:
> On Tuesday, July 1, 2008, Tom Lane wrote:
>> Well, yeah, but that seems a very long way off. In the meantime a lot
>> of people use the existing pgfoundry citext module.
>
> Indeed, but why isn't this code put there as well?

Well, actually, that might be the best thing to do with it. But it would be sensible to give it a code review anyway, no?

regards, tom lane
On Jul 2, 2008, at 09:13, Zdenek Kotala wrote:

> I went through your code and I have the following comments/questions:

Thanks. I'm on a short family vacation, so it will probably be Monday before I can submit a new patch. I've got a few replies below, though.

> 1) Formatting. Please do not type a space before the asterisk:
>    char * lcstr, * rcstr  ->  char *lcstr, *rcstr
>
> and do not put extra spaces inside brackets:
>    citextcmp( left, right )  ->  citextcmp(left, right)

Okay.

> The rest seems OK to me (remove the 8.2 version mention at the bottom).

Um, are you referring to this (at the top)?

+// PostgreSQL 8.2 Magic.
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif

That's gotta be there (though I can of course remove the comment).

> 2) citextcmp returns int but citext_cmp returns int32. I think
> citextcmp should return int32 as well.

Yeah, I'm a bit confused by this. I followed what's in varlena.c on this. The comparison functions are documented to return 1, 0, or -1, which of course would be covered by int. But varstr_cmp(), which ultimately returns the value, returns all kinds of numbers. So, perhaps someone could say what it's *supposed* to be?

> 3) The citext_larger and citext_smaller functions have a memory leak.
> You don't call PG_FREE_IF_COPY on the unused text.
>
> I'm not sure if the return value in these functions should be
> duplicated or not.

These I also duplicated from varlena.c, and I asked about a potential memory leak in a previous email. If there's a leak here, would there not also be a leak in varlena.c?

> 4) Operator = citext_eq is not correct. See comment
> http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac

So should citextcmp() call strncmp() instead of varstr_cmp()? The latter is what I saw in varlena.c.

> There must be a difference between equality and collation. For example,
> in the Czech language 'láska' and 'laská' are different words, which
> means that 'láska' != 'laská', but there is no difference in collation
> order. See the Unicode Universal Collation Algorithm for details.

I'll leave the collation stuff to the functions I call (*far* from my specialty), but I'll add a test for this and make sure it works as expected. Um, although, with what collation should it be tested? The tests I wrote assume en_US.UTF-8.

> 5) There are several commented-out lines in the CREATE OPERATOR
> statements, mostly related to NEGATOR. Is there some reason for that?

I copied it from the original citext.sql. Not sure what effect it has.

> Also, OPERATOR || probably has a wrong negator.

Right, good catch.

> 6) You use pgTAP for the unit tests. That is something that should be
> widely discussed. It is a new approach and I'm not the correct person
> to say whether it is good or not.

Sure. I created a pgFoundry project for it here: http://pgtap.projects.postgresql.org/

> I see two problems there: first, the licensing is not clear
> (https://svn.kineticode.com/pgtap/trunk/README.pgtap).

That's a standard revised BSD license.

> And, in my opinion, it should be implemented as a general framework,
> not only as a part of the citext contrib module.

It is. I just put the file in citext so that the tests can use it. It's distributed on pgFoundry and maintained at the SVN URL quoted in the file.

> Please let me know when you have updated code and I will start deep
> testing.

Will do.

Best,

David
On Jul 2, 2008, at 09:39, Peter Eisentraut wrote:

>> Well, yeah, but that seems a very long way off. In the meantime a lot
>> of people use the existing pgfoundry citext module.
>
> Indeed, but why isn't this code put there as well?

It could be, but this is *such* a common need (and few even know about pgFoundry) that I thought it'd be worthwhile to get it distributed as part of contrib.

Best,

David
On Jul 2, 2008, at 12:18, Teodor Sigaev wrote:

>> 7) A hash opclass is absent. The hash opclass framework is used for
>> hash joins.

Thanks. I haven't seen docs on this. The original citext only supports btree, and I don't remember reading about creating a hash opclass in the Douglas book, though I probably missed it. Anyone got a link for me to read to make it happen?

Thanks,

David
On Jul 2, 2008, at 17:21, Tom Lane wrote:

>> Indeed, but why isn't this code put there as well?
>
> Well, actually, that might be the best thing to do with it. But it
> would be sensible to give it a code review anyway, no?

Obviously, I would argue that contrib is a good place for it, hence the patch. I can say more on my reasoning if you'd like. But even if it doesn't end up in contrib, I'm extremely grateful for the code reviews.

Best,

David
"David E. Wheeler" <david@kineticode.com> writes:
> On Jul 2, 2008, at 09:13, Zdenek Kotala wrote:
>> Please do not type a space before the asterisk:
>>    char * lcstr, * rcstr  ->  char *lcstr, *rcstr
>>
>> and do not put extra spaces inside brackets:
>>    citextcmp( left, right )  ->  citextcmp(left, right)

> Okay.

Note that this sort of stuff will mostly be fixed by pg_indent, whether or not David does anything about it. But certainly conforming to the project style to begin with will cause less pain to reviewers' eyeballs.

> Um, are you referring to this (at the top)?

> +// PostgreSQL 8.2 Magic.
> +#ifdef PG_MODULE_MAGIC
> +PG_MODULE_MAGIC;
> +#endif

Here however is an outright bug: we do not allow // comments, because we still support ANSI-spec compilers that don't recognize them.

> Yeah, I'm a bit confused by this. I followed what's in varlena.c on
> this. The comparison functions are documented to return 1, 0, or -1,
> which of course would be covered by int. But varstr_cmp(), which
> ultimately returns the value, returns all kinds of numbers. So, perhaps
> someone could say what it's *supposed* to be?

btree cmp functions are allowed to return int32 negative, zero, or positive. There is *not* any requirement that negative or positive results be exactly -1 or +1. However, if you are comparing values that are int32 or wider then you can't just return their difference, because it might overflow.

>> 3) The citext_larger and citext_smaller functions have a memory leak.
>> You don't call PG_FREE_IF_COPY on the unused text.

> These I also duplicated from varlena.c, and I asked about a potential
> memory leak in a previous email. If there's a leak here, would there
> not also be a leak in varlena.c?

The "leak" is irrelevant for larger/smaller. The only place where it's actually useful to do PG_FREE_IF_COPY is in a btree or hash index support function. In other cases you can assume that you're being called in a memory context that's too short-lived for it to matter.

>> 5) There are several commented-out lines in the CREATE OPERATOR
>> statements, mostly related to NEGATOR. Is there some reason for that?

> I copied it from the original citext.sql. Not sure what effect it has.

http://developer.postgresql.org/pgdocs/postgres/xoper-optimization.html

regards, tom lane
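Tom's caveat about not returning a raw difference can be sketched in plain C outside the server. This is a hypothetical illustration, not code from the patch; cmp_sign and safe_str_cmp are made-up names:

```c
#include <string.h>

/*
 * Clamp a raw comparison result to its sign only, so a btree-style
 * comparator returns a safe negative/zero/positive int and never
 * overflows the way "return a - b" can for wide values.
 */
static int cmp_sign(long raw)
{
    return (raw > 0) - (raw < 0);
}

static int safe_str_cmp(const char *a, const char *b)
{
    return cmp_sign(strcmp(a, b));
}
```

The exact -1/0/+1 values are not required, as Tom notes; only the sign matters.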
> I don't remember reading about creating a hash opclass in the Douglas
> book, though I probably missed it. Anyone got a link for me to read to
> make it happen?

A hash opclass is 5 times simpler than the btree one :)

CREATE FUNCTION citext_hash(mchar)
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;

CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE mchar USING hash AS
    OPERATOR 1 = (citext, citext),
    FUNCTION 1 citext_hash(citext);

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
> CREATE FUNCTION citext_hash(*citext*)
> DEFAULT FOR TYPE *citext* USING hash AS

Oops, citext of course.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Jul 2, 2008, at 22:14, Tom Lane wrote:

> Note that this sort of stuff will mostly be fixed by pg_indent, whether
> or not David does anything about it. But certainly conforming to the
> project style to begin with will cause less pain to reviewers'
> eyeballs.

Yeah, I'll change it. I'm JAPH, so I kind of made up the formatting as I went, though I did try to copy the style in varlena.c.

>> +// PostgreSQL 8.2 Magic.
>> +#ifdef PG_MODULE_MAGIC
>> +PG_MODULE_MAGIC;
>> +#endif
>
> Here however is an outright bug: we do not allow // comments, because
> we still support ANSI-spec compilers that don't recognize them.

Forgot about that. I'll change it for the next version.

> btree cmp functions are allowed to return int32 negative, zero, or
> positive. There is *not* any requirement that negative or positive
> results be exactly -1 or +1. However, if you are comparing values that
> are int32 or wider then you can't just return their difference, because
> it might overflow.

Thanks for the explanation. I'll make sure that they're both int32.

> The "leak" is irrelevant for larger/smaller. The only place where it's
> actually useful to do PG_FREE_IF_COPY is in a btree or hash index
> support function. In other cases you can assume that you're being
> called in a memory context that's too short-lived for it to matter.

So would that be for any function used by

CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE CITEXT USING btree AS
    OPERATOR 1 <  (citext, citext),
    OPERATOR 2 <= (citext, citext),
    OPERATOR 3 =  (citext, citext),
    OPERATOR 4 >= (citext, citext),
    OPERATOR 5 >  (citext, citext),
    FUNCTION 1 citext_cmp(citext, citext);

? (And then the btree operator and function to be added, of course.)

>>> 5) There are several commented-out lines in the CREATE OPERATOR
>>> statements, mostly related to NEGATOR. Is there some reason for that?
>
>> I copied it from the original citext.sql. Not sure what effect it has.
>
> http://developer.postgresql.org/pgdocs/postgres/xoper-optimization.html

Thanks. Sounds like it'd be valuable to have them in there. I'll add tests, as well.

Best,

David
On Jul 3, 2008, at 00:19, Teodor Sigaev wrote:

>> A hash opclass is 5 times simpler than the btree one :)
>
> CREATE FUNCTION citext_hash(mchar)
> RETURNS int4
> AS 'MODULE_PATHNAME'
> LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;
>
> CREATE OPERATOR CLASS citext_ops
> DEFAULT FOR TYPE mchar USING hash AS
>     OPERATOR 1 = (citext, citext),
>     FUNCTION 1 citext_hash(citext);

Thanks. What would citext_hash() look like? I don't see a text_hash() to borrow from anywhere in src/.

Thanks,

David
David E. Wheeler wrote:
> On Jul 3, 2008, at 00:19, Teodor Sigaev wrote:
>
>>> A hash opclass is 5 times simpler than the btree one :)
>>
>> CREATE FUNCTION citext_hash(mchar)
>> RETURNS int4
>> AS 'MODULE_PATHNAME'
>> LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;
>>
>> CREATE OPERATOR CLASS citext_ops
>> DEFAULT FOR TYPE mchar USING hash AS
>>     OPERATOR 1 = (citext, citext),
>>     FUNCTION 1 citext_hash(citext);
>
> Thanks. What would citext_hash() look like? I don't see a text_hash()
> to borrow from anywhere in src/.

See hash_any(). I assume the difficulty is making sure that hash("FOO") = hash("foo") ...

--
Alvaro Herrera                    http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
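A minimal sketch of the invariant Alvaro describes, that the hash must agree with equality so that hash("FOO") = hash("foo"): lowercase each byte before mixing it in. This is an ASCII-only illustration; the djb2 mixer merely stands in for PostgreSQL's hash_any(), and ci_hash is a made-up name:

```c
#include <ctype.h>

/*
 * Case-insensitive string hash: fold each byte to lowercase before
 * mixing, so strings that compare equal case-insensitively always
 * hash identically. djb2 is just a stand-in mixer.
 */
static unsigned long ci_hash(const char *s)
{
    unsigned long h = 5381;

    for (; *s; s++)
        h = h * 33 + (unsigned char) tolower((unsigned char) *s);
    return h;
}
```

Any hash opclass built this way stays consistent with an equality operator that also compares lowercased strings.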
On Jul 3, 2008, at 09:53, Alvaro Herrera wrote:

>> Thanks. What would citext_hash() look like? I don't see a text_hash()
>> to borrow from anywhere in src/.
>
> See hash_any(). I assume the difficulty is making sure that
> hash("FOO") = hash("foo") ...

Great, big help, thank you. So does this look sensible?

Datum
citext_hash(PG_FUNCTION_ARGS)
{
    text    *txt;
    char    *str;
    Datum    result;

    txt = cilower(PG_GETARG_TEXT_PP(0));
    str = VARDATA_ANY(txt);

    result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));

    /* Avoid leaking memory for toasted inputs */
    PG_FREE_IF_COPY(txt, 0);

    return result;
}

And how might I be able to test that it actually works?

Best,

David
On Jul 2, 2008, at 22:14, Tom Lane wrote:

> The "leak" is irrelevant for larger/smaller. The only place where it's
> actually useful to do PG_FREE_IF_COPY is in a btree or hash index
> support function. In other cases you can assume that you're being
> called in a memory context that's too short-lived for it to matter.

Stupid question: for the btree index support function, is that *only* the function referenced in the OPERATOR CLASS, or does it also apply to the functions that implement the operators in that class? IOW, do I need to worry about memory leaks in citext_eq, citext_ne, citext_gt, etc., or only in citext_cmp()?

Thanks,

David
Replying to myself, but I've made some local changes (see other messages) and just wanted to follow up on some of my own comments.

On Jul 2, 2008, at 21:38, David E. Wheeler wrote:

>> 4) Operator = citext_eq is not correct. See comment
>> http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac
>
> So should citextcmp() call strncmp() instead of varstr_cmp()? The
> latter is what I saw in varlena.c.

I'm guessing that the answer is "no," since varstr_cmp() uses strncmp() internally, as appropriate to the locale. Correct?

>> There must be a difference between equality and collation. For
>> example, in the Czech language 'láska' and 'laská' are different
>> words, which means that 'láska' != 'laská', but there is no difference
>> in collation order. See the Unicode Universal Collation Algorithm for
>> details.
>
> I'll leave the collation stuff to the functions I call (*far* from my
> specialty), but I'll add a test for this and make sure it works as
> expected. Um, although, with what collation should it be tested? The
> tests I wrote assume en_US.UTF-8.

I added this test and it passes:

SELECT isnt( 'láska'::citext, 'laská'::citext, 'Different accented characters should not be equivalent' );

>> 5) There are several commented-out lines in the CREATE OPERATOR
>> statements, mostly related to NEGATOR. Is there some reason for that?
>
> I copied it from the original citext.sql. Not sure what effect it has.

I restored these (and one of them was wrong anyway).

>> Also, OPERATOR || probably has a wrong negator.
>
> Right, good catch.

Stupid question: What would the negation of || actually be? There isn't one, is there?

Thanks!

David
"David E. Wheeler" <david@kineticode.com> writes:

> On Jul 3, 2008, at 09:53, Alvaro Herrera wrote:
>
>> See hash_any(). I assume the difficulty is making sure that
>> hash("FOO") = hash("foo") ...
>
> Great, big help, thank you. So does this look sensible?
>
>     txt = cilower( PG_GETARG_TEXT_PP(0) );
>     str = VARDATA_ANY(txt);
>
>     result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));

I thought your data type implemented a locale-dependent collation, not just a case-insensitive collation. That is, does this hash agree with your citext_eq on strings like "foo bar" <=> "foobar" and "fooß" <=> "fooss"?

You may have to use strxfrm.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
"David E. Wheeler" <david@kineticode.com> writes:

> do I need to worry about memory leaks in citext_eq, citext_ne,
> citext_gt, etc.,

yes

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
"David E. Wheeler" <david@kineticode.com> writes:

>>> Also, OPERATOR || probably has a wrong negator.
>>
>> Right, good catch.
>
> Stupid question: What would the negation of || actually be? There
> isn't one, is there?

Per the docs, NEGATOR is only sensible for operators returning boolean.

regards, tom lane
On Jul 5, 2008, at 02:58, Gregory Stark wrote:

>> do I need to worry about memory leaks in citext_eq, citext_ne,
>> citext_gt, etc.,
>
> yes

Thanks.

David
On Jul 5, 2008, at 02:58, Gregory Stark wrote:

>> txt = cilower( PG_GETARG_TEXT_PP(0) );
>> str = VARDATA_ANY(txt);
>>
>> result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));
>
> I thought your data type implemented a locale-dependent collation, not
> just a case-insensitive collation. That is, does this hash agree with
> your citext_eq on strings like "foo bar" <=> "foobar" and "fooß" <=>
> "fooss"?

CITEXT is basically intended to replace all those queries that do `WHERE LOWER(col) = LOWER(?)` by doing it internally. That's it. It's locale-aware to the same extent that `LOWER()` is (and that citext 1.0 is not, since it only compares ASCII characters case-insensitively). And I expect that it does, in fact, agree with your examples, in that all the current tests for = and <> pass:

try=# select 'foo bar' = 'foobar';
 ?column?
----------
 f

try=# SELECT 'fooß' = 'fooss';
 ?column?
----------
 f

> You may have to use strxfrm.

In the patch against CVS HEAD, it uses str_tolower() in formatting.c.

Best,

David
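The semantics described here, equality behaving like LOWER(a) = LOWER(b) and nothing more, can be illustrated with a small C sketch. This is an ASCII-only approximation (the patch itself relies on the locale-aware str_tolower() from formatting.c), and ci_eq_ascii is a made-up name:

```c
#include <ctype.h>

/*
 * ASCII-only illustration: two strings are "citext-equal" iff their
 * lowercased forms match byte for byte. Note that "foo bar" and
 * "foobar" remain unequal, matching the test output above.
 */
static int ci_eq_ascii(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;        /* both must end at the same point */
}
```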
On Jul 5, 2008, at 08:13, Tom Lane wrote:

>> Stupid question: What would the negation of || actually be? There
>> isn't one, is there?
>
> Per the docs, NEGATOR is only sensible for operators returning boolean.

Message received. Many thanks, Tom, as usual.

Best,

David
On Jun 27, 2008, at 18:22, David E. Wheeler wrote:

> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module.

Here is a new version of the patch, with the following changes:

* Fixed formatting to be more like core.
* Added appropriate NEGATORs to operators.
* Removed NEGATOR from the || operator.
* Added hash index function and operator class.
* The = operator now supports HASHES and MERGES.
* citext_cmp and citextcmp both return int32.
* Changed // comments to /* comments */.
* Added test confirming 'láska'::citext <> 'laská'::citext.
* A few other organizational, formatting, and pasto fixes.
* Updated the FAQ entry on case-insensitive queries to recommend citext (it would, of course, need to be translated).

Stuff I was asked about but didn't change:

* citext_cmp() still uses varstr_cmp() instead of strncmp(). When I tried the latter, everything seemed to be equivalent.
* citext_smaller() and citext_larger() don't have memory leaks, says Tom, so I added no calls to PG_FREE_IF_COPY().

Thank you everyone for your feedback and suggestions!

Best,

David
Attachment
David E. Wheeler wrote:
> Replying to myself, but I've made some local changes (see other
> messages) and just wanted to follow up on some of my own comments.
>
> On Jul 2, 2008, at 21:38, David E. Wheeler wrote:
>
>>> 4) Operator = citext_eq is not correct. See comment
>>> http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac
>>
>> So should citextcmp() call strncmp() instead of varstr_cmp()? The
>> latter is what I saw in varlena.c.
>
> I'm guessing that the answer is "no," since varstr_cmp() uses strncmp()
> internally, as appropriate to the locale. Correct?

You have to use varstr_cmp in citextcmp. Your code is correct, because for the < <= >= > operators you need a collation-sensitive function. You need to change only the citext_cmp function to use strncmp() or call the texteq function.

>>> There must be a difference between equality and collation. For
>>> example, in the Czech language 'láska' and 'laská' are different
>>> words, which means that 'láska' != 'laská', but there is no
>>> difference in collation order. See the Unicode Universal Collation
>>> Algorithm for details.
>>
>> I'll leave the collation stuff to the functions I call (*far* from my
>> specialty), but I'll add a test for this and make sure it works as
>> expected. Um, although, with what collation should it be tested? The
>> tests I wrote assume en_US.UTF-8.
>
> I added this test and it passes:
>
> SELECT isnt( 'láska'::citext, 'laská'::citext, 'Different accented
> characters should not be equivalent' );

I think that this test will work correctly for en_US.UTF-8 at any time. I guess the test makes sense only when the Czech collation (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change the collation during your test :(. I think the best solution for now is to keep the test and add a comment about the recommended collation for this test.

Zdenek
David E. Wheeler wrote:
> On Jun 27, 2008, at 18:22, David E. Wheeler wrote:
>
>> Please find attached a patch adding a locale-aware, case-insensitive
>> text type, called citext, as a contrib module.
>
> Here is a new version of the patch, with the following changes:
>
> * Fixed formatting to be more like core.
> * Added appropriate NEGATORs to operators.
> * Removed NEGATOR from the || operator.
> * Added hash index function and operator class.
> * The = operator now supports HASHES and MERGES.
> * citext_cmp and citextcmp both return int32.
> * Changed // comments to /* comments */.
> * Added test confirming 'láska'::citext <> 'laská'::citext.
> * A few other organizational, formatting, and pasto fixes.
> * Updated the FAQ entry on case-insensitive queries to recommend citext
>   (it would, of course, need to be translated).
>
> Stuff I was asked about but didn't change:
>
> * citext_cmp() still uses varstr_cmp() instead of strncmp(). When I
>   tried the latter, everything seemed to be equivalent.

citext_eq (operator =) should use strncmp and citext_cmp() should use varstr_cmp().

> * citext_smaller() and citext_larger() don't have memory leaks, says
>   Tom, so I added no calls to PG_FREE_IF_COPY().

Yeah, it is correct. fmgr does the cleanup job for you.

However, it seems to me that the code is OK now (except citext_eq). I think there are two open problems/questions:

1) The regression test.

a) I think that the regression test is not correct. It depends on the en_US locale, but it uses characters which are not in the related character repertoire. That means the comparison is not defined, and I guess it could generate different results on different OSes, depending on the locale implementation.

b) pgTAP is something new. We need to make a decision whether this framework is acceptable or not.

2) contrib vs. pgFoundry.

There is an unresolved question of whether we want to have this in contrib or not. It is good to mention that the citext type will be obsoleted by a full collation implementation in the future. I personally prefer to keep it on pgFoundry because it is a temporary workaround (in my opinion), but I can live with a contrib module as well.

Zdenek
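The distinction being drawn here, a cheap byte comparison for equality versus a collation-sensitive comparison for ordering, can be sketched as two separate C functions. This is a hypothetical illustration: memcmp plays the role of the texteq-style check, strcoll() stands in for varstr_cmp(), and both assume already-lowercased input:

```c
#include <string.h>

/*
 * Equality: cheap byte comparison. Different lengths mean "not equal"
 * immediately, and byte-distinct strings such as 'láska' and 'laská'
 * compare unequal even if a collation would order them identically.
 */
static int ci_eq_bytes(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/*
 * Ordering for < <= >= >: must go through a collation-sensitive
 * comparison (strcoll here, varstr_cmp in the server).
 */
static int ci_order(const char *a, const char *b)
{
    return strcoll(a, b);
}
```

This split is why equality and the btree comparison function can legitimately use different primitives.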
Zdenek Kotala wrote:
> 2) contrib vs. pgFoundry.
>
> There is an unresolved question of whether we want to have this in
> contrib or not. It is good to mention that the citext type will be
> obsoleted by a full collation implementation in the future. I
> personally prefer to keep it on pgFoundry because it is a temporary
> workaround (in my opinion), but I can live with a contrib module as
> well.

Is there a concrete plan to get to full collation support (i.e. per column) in any known time frame? If not, then I think a citext module would be acceptable.

What does still bother me is its performance. I'd like to know if any measurement has been done of using citext vs. a functional index on lower(foo).

cheers

andrew
On Jul 7, 2008, at 00:46, Zdenek Kotala wrote:

> You have to use varstr_cmp in citextcmp. Your code is correct, because
> for the < <= >= > operators you need a collation-sensitive function.
>
> You need to change only the citext_cmp function to use strncmp() or
> call the texteq function.

I'm confused. What's the difference between strncmp() and varstr_cmp()? And why would I use one in one place and the other elsewhere? Shouldn't they be used consistently?

> I think that this test will work correctly for en_US.UTF-8 at any time.
> I guess the test makes sense only when the Czech collation
> (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change the
> collation during your test :(.

No, I was wondering before what locale was used for initdb on the build farm. I mean, how are locale-aware things really tested?

> I think the best solution for now is to keep the test and add a comment
> about the recommended collation for this test.

Yep, that's what it does.

Thanks,

David
On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: > However, It seems to me that code is ok now (exclude citex_eq). I > think there two open problems/questions: > > 1) regression test - > > a) I think that regresion is not correct. It depends on en_US > locale, but it uses characters which is not in related character > repertoire. It means comparing is not defined and I guess it could > generate different result on different OS - depends on locale > implementation. That I don't know about. The test requires en_US.UTF-8, at least at this point. How are tests run on the build farm? And how else could I ensure that comparisons are case-insensitive for non-ASCII characters other than requiring a Unicode locale? Or is it just an issue for the sort order tests? For those, I could potentially remove accented characters, just as long as I'm verifying in other tests that comparisons are case-insensitive (without worrying about collation). > b) pgTap is something new. Need make a decision if this framework > is acceptable or not. Well, from the point of view of `make installcheck`, it's invisible. I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it further here though, if folks are interested. > 2) contrib vs. pgFoundry > > There is unresolved answer if we want to have this in contrib or > not. Good to mention that citext type will be obsoleted with full > collation implementation in a future. I personally prefer to keep it > on pgFoundry because it is temporally workaround (by my opinion), > but I can live with contrib module as well. I second what Andrew has said in reply to this issue. I'll also say that, since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that it'd be very valuable to have citext in contrib, especially since so few folks seem to even know about pgFoundry, let alone be able to find it. 
I mean, look at these search results: http://www.google.com/search?q=PostgreSQL%20case-insensitive%20text My blog entry about this patch is hit #3. pgFoundry (and CITEXT 1) is #7. Last time I did a query like this, it didn't turn up at all. Believe me, I'd love for pgFoundry (or something like it) to become the CPAN for PostgreSQL. But without some love and SEO, I don't think that's gonna happen. Besides, CITEXT 2 would be a PITA to maintain for both 8.3 and 8.4, given the changes in the string comparison API in CVS. Thanks, David
On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: > What does still bother me is its performance. I'd like to know if > any measurement has been done of using citext vs. a functional index > on lower(foo). That's a good question. I can whip up a quick test by populating a column full of data and trying both, but I'm no performance testing expert. Anyone else know more about such performance testing who can help out? Many Thanks, David
David E. Wheeler wrote: > On Jul 7, 2008, at 00:46, Zdenek Kotala wrote: > >> You have to use varstr_cmp in citextcmp. Your code is correct, >> because for >> < <= >= > operators you need collation sensible function. >> >> You need to change only citext_cmp function to use strncmp() or call >> texteq function. > > I'm confused. What's the difference between strncmp() and varstr_cmp()? > And why would I use one in one place and the other elsewhere? Shouldn't > they be used consistently? Not necessarily -- see texteq. I think the point is that varstr_cmp() is a lot slower. >> I'm think that this test will work correctly for en_US.UTF-8 at any >> time. I guess the test make sense only when Czech collation >> (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change >> collation during your test :(. > > No, I was wondering before what locale was used for initdb on the build > farm. I mean, how are locale-aware things really tested? They aren't AFAIK. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Jul 7, 2008, at 11:44, Alvaro Herrera wrote: >> I'm confused. What's the difference between strncmp() and >> varstr_cmp()? >> And why would I use one in one place and the other elsewhere? >> Shouldn't >> they be used consistently? > > Not necessarily -- see texteq. I think the point is that varstr_cmp() > is a lot slower. Then why shouldn't I use strncmp() for all comparisons? >> No, I was wondering before what locale was used for initdb on the >> build >> farm. I mean, how are locale-aware things really tested? > > They aren't AFAIK. Ow. Pity. Best, David
David E. Wheeler wrote: > On Jul 7, 2008, at 11:44, Alvaro Herrera wrote: > >>> I'm confused. What's the difference between strncmp() and >>> varstr_cmp()? >>> And why would I use one in one place and the other elsewhere? >>> Shouldn't >>> they be used consistently? >> >> Not necessarily -- see texteq. I think the point is that varstr_cmp() >> is a lot slower. > > Then why shouldn't I use strncmp() for all comparisons? I have no idea :-) -- because it's not locale-aware perhaps. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
David E. Wheeler napsal(a): > On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: > >> However, It seems to me that code is ok now (exclude citex_eq). I >> think there two open problems/questions: >> >> 1) regression test - >> >> a) I think that regresion is not correct. It depends on en_US locale, >> but it uses characters which is not in related character repertoire. >> It means comparing is not defined and I guess it could generate >> different result on different OS - depends on locale implementation. > > That I don't know about. The test requires en_US.UTF-8, at least at this > point. How are tests run on the build farm? And how else could I ensure > that comparisons are case-insensitive for non-ASCII characters other > than requiring a Unicode locale? Or is it just an issue for the sort > order tests? For those, I could potentially remove accented characters, > just as long as I'm verifying in other tests that comparisons are > case-insensitive (without worrying about collation). Hmm, it is a complex area and I'm not sure if anybody on the planet knows the correct answer :-). I think the best approach is to keep it as is, and when a problem occurs it will be fixed. >> b) pgTap is something new. Need make a decision if this framework is >> acceptable or not. > > Well, from the point of view of `make installcheck`, it's invisible. > I've submitted a talk proposal for pgDay.US on pgTAP. I'm happy to > discuss it further here though, if folks are interested. Yeah, it is invisible, but the question is why not put it in a common place as a framework. But that is for another discussion. >> 2) contrib vs. pgFoundry >> >> There is unresolved answer if we want to have this in contrib or not. >> Good to mention that citext type will be obsoleted with full collation >> implementation in a future. I personally prefer to keep it on >> pgFoundry because it is temporally workaround (by my opinion), but I >> can live with contrib module as well. 
> > I second what Andrew has said in reply to this issue. I'll also say > that, since people *so* often end up using `WHERE LOWER(col) = > LOWER(?)`, that it'd be very valuable to have citext in contrib, > especially since so few folks seem to even know about pgFoundry, let > alone be able to find it. I mean, look at these search results: I understand, but there is a parallel project which should solve this problem completely, I guess in the "close" future (2-3 years). Afterward this module will be obsolete, and it will take time to remove it from contrib. It seems to me that having citext in contrib for only two releases is not an optimal solution. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Andrew Dunstan napsal(a): > > > Zdenek Kotala wrote: >> >> >> 2) contrib vs. pgFoundry >> >> There is unresolved answer if we want to have this in contrib or not. >> Good to mention that citext type will be obsoleted with full collation >> implementation in a future. I personally prefer to keep it on >> pgFoundry because it is temporally workaround (by my opinion), but I >> can live with contrib module as well. >> >> >> > > Is there a concrete plan to get to full collation support (i.e. per > column) in any known time frame? If not, then I think a citext module > would be acceptable. Collation per database should be part of 8.4. Radek Strnad works on it as a SoC project. He plans to continue this work and implement full support as his Master's thesis in a 2-3 year time frame. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: >> Then why shouldn't I use strncmp() for all comparisons? > > I have no idea :-) -- because it's not locale-aware perhaps. Could someone who does have an idea answer these questions: * Does it need to be locale-aware or not? * Should I use strncmp() or varstr_cmp() to compare strings? * Shouldn't it use one or the other, but not both? Sorry, I'm just confused about the "correct" thing to do here. If someone who knows the definitive answers could weigh in, I'd be happy to make the adjustment. Thanks, David
On Jul 7, 2008, at 11:54, Zdenek Kotala wrote: > Hmm, it is complex area and I'm not sure if anybody on planet know > correct answer :-). I think that best approach is to keep it as is > and when a problem occur then it will be fixed. Regression tests are really important, though. >>> b) pgTap is something new. Need make a decision if this framework >>> is acceptable or not. >> Well, from the point of view of `make installcheck`, it's >> invisible. I've submitted a talk proposal for pgDay.US on pgTAP. >> I'm happy to discuss it further here though, if folks are interested. > > Yeah, it is invisible, but question is why don't put it as a > framework to common place. But it is for another discussion. It is in a common place as a project, on pgFoundry. Whether the core team wants to use it for testing other parts of PostgreSQL (core or contrib) and therefore put it in a central location is, as you say, a separate conversation. It'd be easy to move it in such a case, of course. > I understand it but there is parallel project which should solve > this problem completely I guess in "close" future (2-3years). > Afterward this module will be obsolete and it will takes times to > remove it from contrib. It seems to me that have citext in contrib > only for two releases is not optimal solution. I guess that'd be the reason to keep it on pgFoundry, but I have two comments: * 2-3 years is a *long* time in Internet time. * There is no guarantee that it will be finished in that time or, indeed, ever (no disrespect to Radek Strnad, it's just there are always unforeseen issues). Thanks, David
2008/7/7 Zdenek Kotala <Zdenek.Kotala@sun.com>: > David E. Wheeler napsal(a): >> >> On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: >> >>> However, It seems to me that code is ok now (exclude citex_eq). I think >>> there two open problems/questions: >>> >>> 1) regression test - >>> >>> a) I think that regresion is not correct. It depends on en_US locale, >>> but it uses characters which is not in related character repertoire. It >>> means comparing is not defined and I guess it could generate different >>> result on different OS - depends on locale implementation. >> >> That I don't know about. The test requires en_US.UTF-8, at least at this >> point. How are tests run on the build farm? And how else could I ensure that >> comparisons are case-insensitive for non-ASCII characters other than >> requiring a Unicode locale? Or is it just an issue for the sort order tests? >> For those, I could potentially remove accented characters, just as long as >> I'm verifying in other tests that comparisons are case-insensitive (without >> worrying about collation). > > Hmm, it is complex area and I'm not sure if anybody on planet know correct > answer :-). I think that best approach is to keep it as is and when a > problem occur then it will be fixed. > Maybe we can have some "locale" test outside our regress tests - >>> b) pgTap is something new. Need make a decision if this framework is >>> acceptable or not. >> >> Well, from the point of view of `make installcheck`, it's invisible. I've >> submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it >> further here though, if folks are interested. > > Yeah, it is invisible, but question is why don't put it as a framework to > common place. But it is for another discussion. > >>> 2) contrib vs. pgFoundry >>> >>> There is unresolved answer if we want to have this in contrib or not. >>> Good to mention that citext type will be obsoleted with full collation >>> implementation in a future. 
I personally prefer to keep it on pgFoundry >>> because it is temporally workaround (by my opinion), but I can live with >>> contrib module as well. >> >> I second what Andrew has said in reply to this issue. I'll also say that, >> since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that >> it'd be very valuable to have citext in contrib, especially since so few >> folks seem to even know about pgFoundry, let alone be able to find it. I >> mean, look at these search results: > > I understand it but there is parallel project which should solve this > problem completely I guess in "close" future (2-3years). Afterward this > module will be obsolete and it will takes times to remove it from contrib. > It seems to me that have citext in contrib only for two releases is not > optimal solution. > > Zdenek > > Regards Pavel > > -- > Zdenek Kotala Sun Microsystems > Prague, Czech Republic http://sun.com/postgresql > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
David E. Wheeler napsal(a): > On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: > >>> Then why shouldn't I use strncmp() for all comparisons? >> >> I have no idea :-) -- because it's not locale-aware perhaps. > > Could someone who does have an idea answer these questions: > > * Does it need to be locale-aware or not? > * Should I use strncmp() or varstr_cmp() to compare strings? > * Shouldn't it use one or the other, but not both? > > Sorry, I'm just confused about the "correct" thing to do here. If > someone who knows the definitive answers could weigh in, I'd be happy to > make the adjustment. I'm sorry. I meant bttext(): http://doxygen.postgresql.org/varlena_8c-source.html#l01270 bttext should be used in the citextcmp function. It uses the strcoll function. And citext_eq should be implemented like texteq: http://doxygen.postgresql.org/varlena_8c-source.html#l01164 I'm sorry for the confusion; I mixed up bttext and varstr_cmp. :( Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
On Jul 7, 2008, at 12:13, Zdenek Kotala wrote: > I'm sorry. I meant bttext() http://doxygen.postgresql.org/varlena_8c-source.html#l01270 > > bttext should use in citextcmp function. It uses strcol function. I've no idea what bttext has to do with anything. Sorry if I'm being slow here. > And citext_eq should be implemented as texteq: > > http://doxygen.postgresql.org/varlena_8c-source.html#l01164 > > I'm sorry for confusion I'm exchange bttext and varstr_cmp. :( Okay, I see that text_cmp() uses varstr_cmp(): http://doxygen.postgresql.org/varlena_8c-source.html#l01139 While texteq() and textne() use strncmp(): http://doxygen.postgresql.org/varlena_8c-source.html#l01164 http://doxygen.postgresql.org/varlena_8c-source.html#l01187 The other operator functions, like text_lt(), use text_cmp() and therefore varstr_cmp(): http://doxygen.postgresql.org/varlena_8c-source.html#l01210 My question is: why? Shouldn't they all use the same function for comparison? I'm happy to dupe this implementation for citext, but I don't understand it. Should not all comparisons be executed consistently? Thanks, David
On Jul 7, 2008, at 12:10, Pavel Stehule wrote: > Maybe we can have some "locale" test outside our regress tests - I think that would be useful. Best, David
On Jul 7, 2008, at 12:21, David E. Wheeler wrote:

> My question is: why? Shouldn't they all use the same function for
> comparison? I'm happy to dupe this implementation for citext, but I
> don't understand it. Should not all comparisons be executed
> consistently?

Let me try to answer my own question by citing this comment:

    /*
     * Since we only care about equality or not-equality, we can avoid all the
     * expense of strcoll() here, and just do bitwise comparison.
     */

So, the upshot is that the = and <> operators are not locale-aware, yes? They just do byte comparisons. Is that really the way it should be? I mean, could there not be strings that are equivalent but have different bytes?

Thanks,

David
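A minimal C sketch of the fast path that comment describes (not the actual texteq() code, which works on varlena datums; the simplified function name and C-string interface here are mine):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the texteq() fast path: since only equality matters, we
 * can skip strcoll() entirely.  Two strings of different lengths can
 * never be byte-identical, so compare lengths first and fall back to
 * a bitwise memcmp() only when they match.
 */
static bool
simple_texteq(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return false;
    return memcmp(a, b, alen) == 0;
}
```

The point is just that a length check plus memcmp() suffices for = and <>, while ordering still needs the locale-aware path.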
2008/7/7 David E. Wheeler <david@kineticode.com>: > On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: > >>> Then why shouldn't I use strncmp() for all comparisons? >> >> I have no idea :-) -- because it's not locale-aware perhaps. > > Could someone who does have an idea answer these questions: > > * Does it need to be locale-aware or not? yes! > * Should I use strncmp() or varstr_cmp() to compare strings? > * Shouldn't it use one or the other, but not both? > > Sorry, I'm just confused about the "correct" thing to do here. If someone > who knows the definitive answers could weigh in, I'd be happy to make the > adjustment. > > Thanks, > > David
On Jul 7, 2008, at 12:36, Pavel Stehule wrote: >> * Does it need to be locale-aware or not? > > yes! texteq() in varlena.c does not seem to be. Best, David
2008/7/7 David E. Wheeler <david@kineticode.com>: > On Jul 7, 2008, at 12:36, Pavel Stehule wrote: > >>> * Does it need to be locale-aware or not? >> >> yes! > > texteq() in varlena.c does not seem to be. > it's case sensitive and it's +/- binary compare Regards Pavel Stehule > Best, > > David >
David E. Wheeler napsal(a): > On Jul 7, 2008, at 12:21, David E. Wheeler wrote: > >> My question is: why? Shouldn't they all use the same function for >> comparison? I'm happy to dupe this implementation for citext, but I >> don't understand it. Should not all comparisons be executed consistently? > > Let me try to answer my own question by citing this comment: > > /* > * Since we only care about equality or not-equality, we can avoid > all the > * expense of strcoll() here, and just do bitwise comparison. > */ > > So, the upshot is that the = and <> operators are not locale-aware, yes? > They just do byte comparisons. Is that really the way it should be? I > mean, could there not be strings that are equivalent but have different > bytes? Correct. The problem is complex. It works fine only for normalized strings. But postgres now assumes that all utf8 strings are normalized. If you need to implement the < <= >= > operators you need to use strcoll, which takes care of locale collation. See the unicode collation algorithm: http://www.unicode.org/reports/tr10/ Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
"David E. Wheeler" <david@kineticode.com> writes: > So, the upshot is that the = and <> operators are not locale-aware, > yes? They just do byte comparisons. Is that really the way it should > be? I mean, could there not be strings that are equivalent but have > different bytes? We intentionally force such strings to be considered non-equal. See varstr_cmp, and if you like see the archives from back when that was put in. The = and <> operators are in fact consistent with the behavior of varstr_cmp (and had better be!); they're just optimized a bit. regards, tom lane
On Jul 7, 2008, at 13:10, Tom Lane wrote: > We intentionally force such strings to be considered non-equal. > See varstr_cmp, and if you like see the archives from back when > that was put in. > > The = and <> operators are in fact consistent with the behavior of > varstr_cmp (and had better be!); they're just optimized a bit. Thank you, Tom. I'll post my updated patch shortly. In the meantime, can anyone suggest an easy way to populate a table full of random strings so that I can do a bit of benchmarking? Thanks, David
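One quick way to produce such data is a small generator that emits rows of 1-10 random words, suitable for feeding to psql's \copy. This is just a sketch with hypothetical helper names, standing in for sampling real words from /usr/share/dict/words:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build one random lowercase "word" of 3 to maxlen-1 characters. */
static void
random_word(char *buf, size_t maxlen)
{
    size_t len = 3 + (size_t) rand() % (maxlen - 3);
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = (char) ('a' + rand() % 26);
    buf[len] = '\0';
}

/*
 * Write nrows lines, each of 1-10 space-separated random words, to
 * out.  The output is ready to pipe into \copy to fill the text and
 * citext columns of a benchmark table.
 */
static void
emit_rows(FILE *out, int nrows)
{
    char word[10];
    int  row, w;

    for (row = 0; row < nrows; row++)
    {
        int nwords = 1 + rand() % 10;

        for (w = 0; w < nwords; w++)
        {
            random_word(word, sizeof word);
            fprintf(out, "%s%s", w ? " " : "", word);
        }
        fputc('\n', out);
    }
}
```

The output can then be loaded with something like `psql try -c '\copy try(text) from stdin'`, adjusting for the citext column.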
On Jul 7, 2008, at 12:46, Zdenek Kotala wrote: >> So, the upshot is that the = and <> operators are not locale-aware, >> yes? They just do byte comparisons. Is that really the way it >> should be? I mean, could there not be strings that are equivalent >> but have different bytes? > > Correct. The problem is complex. It works fine only for normalized > string. But postgres now assume that all utf8 strings are normalized. I see. So binary equivalence is okay, in that case. > If you need to implement < <= >= > operators you need to use strcol > which take care of locale collation. Which varstr_cmp() does, I guess. It's what textlt uses, for example. > See unicode collation algorithm http://www.unicode.org/reports/tr10/ Wow, that looks like a fun read. Best, David
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 7, 2008, at 12:21, David E. Wheeler wrote: > >> My question is: why? Shouldn't they all use the same function for >> comparison? I'm happy to dupe this implementation for citext, but I don't >> understand it. Should not all comparisons be executed consistently? > > Let me try to answer my own question by citing this comment: > > /* > * Since we only care about equality or not-equality, we can avoid all > the > * expense of strcoll() here, and just do bitwise comparison. > */ > > So, the upshot is that the = and <> operators are not locale-aware, yes? They > just do byte comparisons. Is that really the way it should be? I mean, could > there not be strings that are equivalent but have different bytes? There could be strings that strcoll returns 0 for even though they're not identical. However, that caused problems in Postgres, so we decided that only equal strings should actually compare equal. So if strcoll returns 0 then we do a bytewise comparison to impose an arbitrary ordering. Of course the obvious case of two equivalent strings with different bytes would be two strings which differ only in case in a collation which doesn't distinguish based on case. So you obviously can't take this route for citext. I don't think you have to worry about the problem that caused Postgres to make this change. IIRC it was someone comparing strings like paths and usernames and getting false positives because they were in a Turkish locale which found certain sequences of characters to be insignificant for ordering. Someone who's using a citext data type has obviously decided that's precisely the kind of behaviour they want. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
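The rule described here can be sketched roughly like so (a simplification of what varstr_cmp() does; the function name is hypothetical, and the real code handles length-counted strings and per-platform special cases):

```c
#include <string.h>

/*
 * Sketch of the varstr_cmp() rule: order with the locale-aware
 * strcoll(), but when the collation calls two different strings
 * equivalent (strcoll() returns 0), break the tie with a bytewise
 * strcmp() so that only byte-identical strings actually compare
 * equal.
 */
static int
cmp_with_tiebreak(const char *a, const char *b)
{
    int result = strcoll(a, b);

    if (result == 0)
        result = strcmp(a, b);  /* arbitrary but deterministic order */
    return result;
}
```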
On Jul 7, 2008, at 13:59, Gregory Stark wrote: > Of course the obvious case of two equivalent strings with different > bytes > would be two strings which differ only in case in a collation which > doesn't > distinguish based on case. So you obviously can't take this route > for citext. Well, to be fair, citext isn't imposing a collation. It's just calling str_tolower() on strings before passing them on to varstr_cmp() or strncmp() to compare. > I don't think you have to worry about the problem that cause > Postgres to make > this change. IIRC it was someone comparing strings like paths and > usernames > and getting false positives because they were in a Turkish locale > which found > certain sequences of characters to be insignificant for ordering. > Someone > who's using a citext data type has obviously decided that's > precisely the kind > of behaviour they want. Hrm. So in your opinion, strncmp() could be used for all comparisons by citext, rather than varstr_cmp()? Thanks, David
On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

> What does still bother me is its performance. I'd like to know if
> any measurement has been done of using citext vs. a functional index
> on lower(foo).

Okay, here's a start. The attached script inserts random strings of 1-10 space-delimited words into text and citext columns, and then compares the performance of queries with and without indexes. The output for me is as follows:

Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against the CITEXT column aren't using them. Furthermore, the `lower(text) LIKE lower(?)` query isn't using *its* index. Huh?

So this leaves me with two questions:

1. For what reason would the query against the citext column *not* use the index?

2. Is there some way to get the CITEXT index to behave like a LOWER() index, that is, so that its value is stored using the result of the str_tolower() function, thus removing some of the overhead of converting the values for each row fetched from the index? (Does this question make any sense?)

Thanks,

David
And here is the script. D'oh! Thanks, David On Jul 7, 2008, at 16:24, David E. Wheeler wrote: > On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: > >> What does still bother me is its performance. I'd like to know if >> any measurement has been done of using citext vs. a functional >> index on lower(foo). > > Okay, here's a start. The attached script inserts random strings of > 1-10 space-delimited words into text and citext columns, and then > compares the performance of queries with and without indexes. The > output for me is as follows: > > Loading words from dictionary. > Inserting into the table. > > Test =. > SELECT * FROM try WHERE LOWER(text) = LOWER('food'); > Time: 254.254 ms > SELECT * FROM try WHERE citext = 'food'; > Time: 288.535 ms > > Test LIKE and ILIKE > SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); > Time: 209.385 ms > SELECT * FROM try WHERE citext ILIKE 'C%'; > Time: 236.186 ms > SELECT * FROM try WHERE citext LIKE 'C%'; > Time: 235.818 ms > > Adding indexes... > > Test =. > SELECT * FROM try WHERE LOWER(text) = LOWER('food'); > Time: 1.260 ms > SELECT * FROM try WHERE citext = 'food'; > Time: 277.755 ms > > Test LIKE and ILIKE > SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); > Time: 209.073 ms > SELECT * FROM try WHERE citext ILIKE 'C%'; > Time: 238.430 ms > SELECT * FROM try WHERE citext LIKE 'C%'; > Time: 238.685 ms > benedict% > > So for some reason, after adding the indexes, the queries against > the CITEXT column aren't using them. Furthermore, the `lower(text) > LIKE lower(?)` query isn't using *its* index. Huh? > > So this leaves me with two questions: > > 1. For what reason would the query against the citext column *not* > use the index? > > 2. Is there some way to get the CITEXT index to behave like a > LOWER() index, that is, so that its value is stored using the result > of the str_tolower() function, thus removing some of the overhead of > converting the values for each row fetched from the index? 
(Does > this question make any sense?) > > Thanks, > > David
Attachment
No, *really* Sheesh, sorry. David On Jul 7, 2008, at 16:26, David E. Wheeler wrote: > And here is the script. D'oh! > > Thanks, > > David > > <try.sql> > > > On Jul 7, 2008, at 16:24, David E. Wheeler wrote: > >> On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: >> >>> What does still bother me is its performance. I'd like to know if >>> any measurement has been done of using citext vs. a functional >>> index on lower(foo). >> >> Okay, here's a start. The attached script inserts random strings of >> 1-10 space-delimited words into text and citext columns, and then >> compares the performance of queries with and without indexes. The >> output for me is as follows: >> >> Loading words from dictionary. >> Inserting into the table. >> >> Test =. >> SELECT * FROM try WHERE LOWER(text) = LOWER('food'); >> Time: 254.254 ms >> SELECT * FROM try WHERE citext = 'food'; >> Time: 288.535 ms >> >> Test LIKE and ILIKE >> SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); >> Time: 209.385 ms >> SELECT * FROM try WHERE citext ILIKE 'C%'; >> Time: 236.186 ms >> SELECT * FROM try WHERE citext LIKE 'C%'; >> Time: 235.818 ms >> >> Adding indexes... >> >> Test =. >> SELECT * FROM try WHERE LOWER(text) = LOWER('food'); >> Time: 1.260 ms >> SELECT * FROM try WHERE citext = 'food'; >> Time: 277.755 ms >> >> Test LIKE and ILIKE >> SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); >> Time: 209.073 ms >> SELECT * FROM try WHERE citext ILIKE 'C%'; >> Time: 238.430 ms >> SELECT * FROM try WHERE citext LIKE 'C%'; >> Time: 238.685 ms >> benedict% >> >> So for some reason, after adding the indexes, the queries against >> the CITEXT column aren't using them. Furthermore, the `lower(text) >> LIKE lower(?)` query isn't using *its* index. Huh? >> >> So this leaves me with two questions: >> >> 1. For what reason would the query against the citext column *not* >> use the index? >> >> 2. 
Is there some way to get the CITEXT index to behave like a >> LOWER() index, that is, so that its value is stored using the >> result of the str_tolower() function, thus removing some of the >> overhead of converting the values for each row fetched from the >> index? (Does this question make any sense?) >> >> Thanks, >> >> David
Attachment
"David E. Wheeler" <david@kineticode.com> writes: > Hrm. So in your opinion, strncmp() could be used for all comparisons > by citext, rather than varstr_cmp()? I thought the charter of this type was to work like lower(textcol). If that's so, you certainly can't use strncmp, because that would result in sort orderings totally different from lower()'s result. Even without that argument, for most multibyte cases you'd get a pretty arbitrary, user-unfriendly sort ordering. regards, tom lane
On Jul 7, 2008, at 16:58, Tom Lane wrote: > "David E. Wheeler" <david@kineticode.com> writes: >> Hrm. So in your opinion, strncmp() could be used for all comparisons >> by citext, rather than varstr_cmp()? > > I thought the charter of this type was to work like lower(textcol). Correct. > If that's so, you certainly can't use strncmp, because that would > result > in sort orderings totally different from lower()'s result. Even > without > that argument, for most multibyte cases you'd get a pretty arbitrary, > user-unfriendly sort ordering. Now I'm confused again. :-( Whether or not I use strncmp() or varstr_cmp(), I first lowercase the value to be compared using str_tolower(). What Zdenek has said is, that aside, just as for the TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for everything else. Are you saying something different? I could use some examples to play with in order to ensure that things are behaving as they should. I'll add regression tests for them. Thanks, David
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 7, 2008, at 16:58, Tom Lane wrote: >> If that's so, you certainly can't use strncmp, because that would >> result >> in sort orderings totally different from lower()'s result. Even >> without >> that argument, for most multibyte cases you'd get a pretty arbitrary, >> user-unfriendly sort ordering. > Now I'm confused again. :-( Whether or not I use strncmp() or > varstr_cmp(), I first lowercase the value to be compared using > str_tolower(). What Zdenek has said is, that aside, just as for the > TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for > everything else. Are you saying something different? No, but you were: you proposed using strncmp for everything. regards, tom lane
On Jul 7, 2008, at 17:18, Tom Lane wrote: > No, but you were: you proposed using strncmp for everything. Yes, that's right. I was trying to understand why I wouldn't use just one or the other. And the answer turned out to be, you wouldn't, except that strncmp() is a desirable optimization for = and <>. So I've changed only those. Phew, I think I'm clear now. Thanks! David
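For the record, the scheme that emerged can be sketched in C roughly as follows. It is a simplified sketch only: the patch folds case with the backend's locale-aware str_tolower(), for which plain ASCII tolower() stands in here, and these function names are mine:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* ASCII-only stand-in for the backend's locale-aware str_tolower(). */
static char *
fold_lower(const char *s)
{
    size_t len = strlen(s);
    char  *out = malloc(len + 1);
    size_t i;

    for (i = 0; i < len; i++)
        out[i] = (char) tolower((unsigned char) s[i]);
    out[len] = '\0';
    return out;
}

/* = and <>: fold, then the cheap bytewise compare (the strncmp path). */
static int
ci_eq(const char *a, const char *b)
{
    char *la = fold_lower(a);
    char *lb = fold_lower(b);
    int   eq = strcmp(la, lb) == 0;

    free(la);
    free(lb);
    return eq;
}

/* < <= >= >: fold, then the locale-aware compare (the varstr_cmp path). */
static int
ci_cmp(const char *a, const char *b)
{
    char *la = fold_lower(a);
    char *lb = fold_lower(b);
    int   result = strcoll(la, lb);

    free(la);
    free(lb);
    return result;
}
```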
Thanks to help from RhodiumToad on IRC, I got some things improved here:

On Jul 7, 2008, at 16:24, David E. Wheeler wrote:

> So for some reason, after adding the indexes, the queries against
> the CITEXT column aren't using them. Furthermore, the `lower(text)
> LIKE lower(?)` query isn't using *its* index. Huh?

I never knew that one needed to use the text_pattern_ops operator class to index a column for use with LIKE! I had no clue. Would that work for a citext column, too, since it's essentially the same as TEXT?

> So this leaves me with two questions:
>
> 1. For what reason would the query against the citext column *not*
> use the index?

It turns out that it did use the index if I put `SET enable_seqscan = false;` into my script. So with RhodiumToad's direction, I added some `RESTRICT` and `JOIN` clauses to my comparison operators (copying them from ip4r). So now I have:

CREATE OPERATOR = (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    COMMUTATOR = =, NEGATOR = <>,
    PROCEDURE = citext_eq, RESTRICT = eqsel, JOIN = eqjoinsel,
    HASHES, MERGES
);

CREATE OPERATOR <> (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = =, COMMUTATOR = <>,
    PROCEDURE = citext_ne, RESTRICT = neqsel, JOIN = neqjoinsel
);

CREATE OPERATOR < (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = >=, COMMUTATOR = >,
    PROCEDURE = citext_lt, RESTRICT = scalarltsel, JOIN = scalarltjoinsel
);

CREATE OPERATOR <= (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = >, COMMUTATOR = <=,
    PROCEDURE = citext_le, RESTRICT = scalarltsel, JOIN = scalarltjoinsel
);

CREATE OPERATOR >= (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = <, COMMUTATOR = <=,
    PROCEDURE = citext_ge, RESTRICT = scalargtsel, JOIN = scalargtjoinsel
);

CREATE OPERATOR > (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = <=, COMMUTATOR = <,
    PROCEDURE = citext_gt, RESTRICT = scalargtsel, JOIN = scalargtjoinsel
);

With this change, the index was used:

Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 261.295 ms
SELECT * FROM try WHERE citext = 'food';
Time: 289.304 ms
Time: 1228.961 ms
Adding indexes...
Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 2.018 ms
SELECT * FROM try WHERE citext = 'food';
Time: 0.788 ms

Seems to be faster than the LOWER() version, too, which makes me happy. The output from EXPLAIN ANALYZE:

try=# EXPLAIN ANALYZE SELECT * FROM try WHERE citext = 'food';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_try_citext on try  (cost=0.00..8.31 rows=1 width=119) (actual time=0.324..0.324 rows=0 loops=1)
   Index Cond: (citext = 'food'::citext)
 Total runtime: 0.377 ms
(3 rows)

try=# EXPLAIN ANALYZE SELECT * FROM try WHERE LOWER(text) = LOWER('food');
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on try  (cost=28.17..1336.10 rows=500 width=119) (actual time=0.170..0.170 rows=0 loops=1)
   Recheck Cond: (lower(text) = 'food'::text)
   ->  Bitmap Index Scan on idx_try_text  (cost=0.00..28.04 rows=500 width=0) (actual time=0.168..0.168 rows=0 loops=1)
         Index Cond: (lower(text) = 'food'::text)
 Total runtime: 0.211 ms
(5 rows)

So my only other question related to this is:

* Are the above RESTRICT and JOIN functions the ones to use, or is there some way to make use of those used by the TEXT type that would be more appropriate?

> 2. Is there some way to get the CITEXT index to behave like a
> LOWER() index, that is, so that its value is stored using the result
> of the str_tolower() function, thus removing some of the overhead of
> converting the values for each row fetched from the index? (Does
> this question make any sense?)

Given the performance with an index, I think this is moot, yes? There is, of course, much more overhead for a table scan.

Best,

David
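On the LIKE question above, I believe the answer is that a plain btree index can't serve left-anchored LIKE patterns in a non-C locale, so a second index with text_pattern_ops is needed. A sketch, with index names of my own invention:

```sql
-- Plain expression index: good for =, <, >, etc., but NOT usable
-- by LIKE when the database locale is anything other than C.
CREATE INDEX idx_try_lower ON try (LOWER(text));

-- Pattern-ops index: usable by left-anchored LIKE, e.g. LIKE 'foo%'.
CREATE INDEX idx_try_lower_pattern ON try (LOWER(text) text_pattern_ops);

-- This query should now be able to use idx_try_lower_pattern:
EXPLAIN SELECT * FROM try WHERE LOWER(text) LIKE LOWER('foo%');
```

Presumably the same trick would apply to a citext column once it has a comparable pattern operator class, but I haven't verified that.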
On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:

> I guess that'd be the reason to keep it on pgFoundry, but I have two
> comments:
>
> * 2-3 years is a *long* time in Internet time.

There have been patches over the years, but they tend not to get looked at. Recently someone pulled up the COLLATE patch from a couple of years ago, but it didn't get much feedback either. (I can't find the link right now.)

It's disappointing that the discussions get hung up on the ICU library when it's not required or even needed for COLLATE support. My original patch never even mentioned it.

I note that Firebird added COLLATE using ICU a few years back now. I think PostgreSQL is the only large DBMS not to support it.

> * There is no guarantee that it will be finished in that time or,
> indeed ever (no disrespect to Radek Strnad, it's just there are always
> unforeseen issues).

I think that with concerted coding effort it could be done in 2-3 months (judging by how long it took to write the first version). The problem is it needs some planner kung-fu which not many people have.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout napsal(a):

> On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:
>> I guess that'd be the reason to keep it on pgFoundry, but I have two
>> comments:
>>
>> * 2-3 years is a *long* time in Internet time.
>
> There have been patches over the years, but they tend not to get looked
> at. Recently someone pulled up the COLLATE patch from a couple of years
> ago but it didn't get much feedback either. (I can't find the link
> right now).

I know about it. I have printed your proposal on my desk. I think it is linked from the TODO list.

> It's disappointing that the discussions get hung up on the ICU library
> when it's not required or even needed for COLLATE support. My original
> patch never even mentioned it.
>
> I note that Firebird added COLLATE using ICU a few years back now. I
> think PostgreSQL is the only large DBMS to not support it.

I completely agree. Missing collation support is a big problem for many users.

>> * There is no guarantee that it will be finished in that time or,
>> indeed ever (no disrespect to Radek Strnad, it's just there are always
>> unforeseen issues).
>
> I think that with concerted coding effort it could be done in 2-3
> months (judging by how long it took to write the first version). The
> problem is it needs some planner kung-fu which not many people have.

I agree that 2-3 months of full-time work is a good estimate; the problem is that you need a kung-fu master who has time to do it :(. What we currently have is a student who works on it in his free time.

Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
On Jul 7, 2008, at 12:06, David E. Wheeler wrote:

>> I understand it but there is parallel project which should solve
>> this problem completely I guess in "close" future (2-3 years).
>> Afterward this module will be obsolete and it will take time to
>> remove it from contrib. It seems to me that having citext in contrib
>> for only two releases is not an optimal solution.
>
> I guess that'd be the reason to keep it on pgFoundry, but I have two
> comments:
>
> * 2-3 years is a *long* time in Internet time.
>
> * There is no guarantee that it will be finished in that time or,
> indeed ever (no disrespect to Radek Strnad, it's just there are
> always unforeseen issues).

One other thing that occurred to me yesterday: given that the feature will ultimately support column-level collations, I suspect that it will be much easier to migrate CITEXT to a case-insensitive collation (perhaps using an updated CITEXT contrib module that just does so transparently) than to migrate application code from using LOWER() all over the place to not using it. One transition requires a change to the schema; the other requires a change to application code, of which there is generally a lot more.

Best,

David
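To make the schema-versus-code comparison concrete, the citext migration is one DDL statement per affected column (table and column names here are hypothetical):

```sql
-- Schema change: one statement per affected column...
ALTER TABLE users ALTER COLUMN nick TYPE CITEXT;

-- ...after which existing equality queries become case-insensitive as-is:
SELECT * FROM users WHERE nick = 'Theory';

-- versus auditing every query in the application for this pattern:
SELECT * FROM users WHERE LOWER(nick) = LOWER('Theory');
```

The same one-statement shape would presumably hold for a future column-level collation, which is why I think the citext-to-collation path will be the easy one.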