Thread: PATCH: CITEXT 2.0
Howdy,

[N.B.: I tried to send this a while ago but it didn't get delivered, I'm assuming because, with the uncompressed patch, the email was too big for -hackers. So this is a re-send with the patch gzip'd. Sorry for any duplication.]

Please find attached a patch adding a locale-aware, case-insensitive text type, called citext, as a contrib module. A few notes:

* I had originally called it lctext, as it's not a true case-insensitive type, but just converts strings to lowercase before comparing them. I changed it to citext at Tom Lane's suggestion, to ease compatibility for users of the original citext module on pgFoundry.

* Differences from the original citext are:
  + Locale-aware lowercasing of strings, rather than just lowercasing ASCII characters.
  + No implicit casts from text to citext except via assignment.
  + A few more functions overloaded.
  + Works with 8.3 and CVS head.

* Many thanks to whoever added str_tolower() to formatting.c. If I had known about that, I could have saved myself a lot of grief! My original implementation for 8.3.1 had copied a lot of code from oracle_compat.c to get things working. With this patch, I've eliminated a whole lot of code, as I can now just call str_tolower(). So thank you for that! I'll probably keep my original in my personal Subversion repository, but I don't know about releasing it if this is accepted as a contrib module for 8.4.

* All comparisons simply convert the strings to be compared to lowercase using str_tolower(). I've made no other optimizations, though I'm sure someone with more experience with collations and such could add them.

* The regression test uses a new module I've created, now on pgFoundry, called pgTAP. It should just work. sql/citext.sql adds plpgsql to the database and then includes pgtap.sql, which has the test functions in it.

* I wrote the tests assuming a collation of en_US.UTF-8. I expect it'd work with most West European languages, and maybe all languages other than the C locale, but I'm not sure. YMMV. If there's a way to generalize it and still be able to test the locale awareness, that would be great. What locales do the build farm servers use?

* In the documentation, I've pitched this type as a replacement for the use of LOWER() in ad-hoc queries, while also stipulating that this is not a "true" case-insensitive text type, and is furthermore less efficient than just TEXT (though I'm sure more efficient than ad-hoc LOWER()s). I've also mentioned a few other caveats, including casts for TEXT that don't work for citext and non-case-insensitive matching in replace(), regexp_replace(), and a few others.

* I wrote all the code here myself, but of course used the original citext implementation (which is case-insensitive only for ASCII characters) for inspiration and guidance. Thanks to Donald Fraser for that original implementation.

I've compiled the CVS checkout, run its regressions, then built and installed the citext module (hence my discovery of the deprecation of wstring_lower and the addition of str_tolower -- should the declaration of the former be removed from formatting.c?), and all tests passed as of an hour ago.

I of course welcome feedback, advice, insults, commiserations, and just about any mode of comment on this patch. Please let me know if I need to provide any additional information.

Best,

David
Attachment
David E. Wheeler wrote:
> Howdy,
>
> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module. A few notes:

What is the benefit of having this type when per-database collation is implemented? It seems to me that it's an overlapping feature. The definition of a collation should allow case sensitivity to be set up. The only advantage I see is that you can combine case-sensitive and case-insensitive text in one database. However, that will be solved when per-column collation is implemented.

Zdenek
Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
> However, it will be solved when collation per column will be implemented.

Well, yeah, but that seems a very long way off. In the meantime a lot of people use the existing pgfoundry citext module.

regards, tom lane
On Jul 1, 2008, at 07:38, Tom Lane wrote:

>> However, it will be solved when collation per column will be
>> implemented.
>
> Well, yeah, but that seems a very long way off. In the meantime a lot
> of people use the existing pgfoundry citext module.

And even more of us have written queries using LOWER(col) = LOWER(?), which is just a PITA.

Best,

David
David E. Wheeler wrote:
> On Jul 1, 2008, at 07:38, Tom Lane wrote:
>
>>> However, it will be solved when collation per column will be
>>> implemented.
>>
>> Well, yeah, but that seems a very long way off. In the meantime a lot
>> of people use the existing pgfoundry citext module.
>
> And even more of us have written queries using LOWER(col) = LOWER(?),
> which is just a PITA.

OK. I'm going to review your patch.

Zdenek
Yay!

Best,

David

Sent from my iPhone

On Jul 2, 2008, at 5:03, Zdenek Kotala <Zdenek.Kotala@Sun.COM> wrote:

> David E. Wheeler wrote:
>> On Jul 1, 2008, at 07:38, Tom Lane wrote:
>>>> However, it will be solved when collation per column will be
>>>> implemented.
>>>
>>> Well, yeah, but that seems a very long way off. In the meantime a lot
>>> of people use the existing pgfoundry citext module.
>>
>> And even more of us have written queries using LOWER(col) = LOWER(?),
>> which is just a PITA.
>
> OK. I'm going to review your patch.
>
> Zdenek
David E. Wheeler wrote:
> Howdy,

Howdy

> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module. A few notes:

I went through your code and I have the following comments/questions:

1) Formatting. Please do not type a space before the asterisk:

   char * lcstr, * rcstr  ->  char *lcstr, *rcstr

and do not put extra spaces inside brackets:

   citextcmp( left, right )  ->  citextcmp(left, right)

The rest seems OK to me (remove the 8.2 version mention at the bottom).

2) citextcmp returns int but citext_cmp returns int32. I think citextcmp should return int32 as well.

3) The citext_larger and citext_smaller functions have a memory leak. You don't call PG_FREE_IF_COPY on the unused text. I'm not sure if the return value in these functions should be duplicated or not.

4) Operator = citext_eq is not correct. See this comment: http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac There must be a difference between equality and collation. For example, in the Czech language 'láska' and 'laská' are different words, which means that 'láska' != 'laská', but there is no difference in collation order. See the Unicode Universal Collation Algorithm for details.

5) There are several commented-out lines in the CREATE OPERATOR statements, mostly related to NEGATOR. Is there some reason for that? Also, OPERATOR || probably has a wrong negator.

6) You use pgTAP for the unit tests. That is something that should be widely discussed. It is a new approach and I'm not the correct person to say whether it is good or not. I see two problems there: first, the licensing is not clear (https://svn.kineticode.com/pgtap/trunk/README.pgtap); and second, in my opinion it should be implemented as a general framework, not only as a part of the citext contrib module.

Please let me know when you have updated code and I will start deep testing.

With regards

Zdenek
On Tuesday, July 1, 2008, Tom Lane wrote:
> Zdenek Kotala <Zdenek.Kotala@Sun.COM> writes:
>> However, it will be solved when collation per column will be implemented.
>
> Well, yeah, but that seems a very long way off. In the meantime a lot
> of people use the existing pgfoundry citext module.

Indeed, but why isn't this code put there as well?
> I went through your code and I have the following comments/questions:

One more comment:

7) A hash opclass is absent. The hash opclass framework is used for hash joins.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Peter Eisentraut <peter_e@gmx.net> writes:
> On Tuesday, July 1, 2008, Tom Lane wrote:
>> Well, yeah, but that seems a very long way off. In the meantime a lot
>> of people use the existing pgfoundry citext module.
>
> Indeed, but why isn't this code put there as well?

Well, actually, that might be the best thing to do with it. But it would be sensible to give it a code review anyway, no?

regards, tom lane
On Jul 2, 2008, at 09:13, Zdenek Kotala wrote:

> I went through your code and I have the following comments/questions:

Thanks. I'm on a short family vacation, so it will probably be Monday before I can submit a new patch. I've got a few replies below, though.

> 1) Formatting. Please do not type a space before the asterisk:
>    char * lcstr, * rcstr  ->  char *lcstr, *rcstr
>
> and do not put extra spaces inside brackets:
>    citextcmp( left, right )  ->  citextcmp(left, right)

Okay.

> The rest seems OK to me (remove the 8.2 version mention at the bottom).

Um, are you referring to this (at the top)?

+// PostgreSQL 8.2 Magic.
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif

That's gotta be there (though I can of course remove the comment).

> 2) citextcmp returns int but citext_cmp returns int32. I think
> citextcmp should return int32 as well.

Yeah, I'm a bit confused by this. I followed what's in varlena.c on this. The comparison functions are documented to return 1, 0, or -1, which of course would be covered by int. But varstr_cmp(), which ultimately returns the value, returns all kinds of numbers. So, perhaps someone could say what it's *supposed* to be?

> 3) The citext_larger and citext_smaller functions have a memory leak.
> You don't call PG_FREE_IF_COPY on the unused text.
>
> I'm not sure if the return value in these functions should be
> duplicated or not.

These I also duplicated from varlena.c, and I asked about a potential memory leak in a previous email. If there's a leak here, would there not also be a leak in varlena.c?

> 4) Operator = citext_eq is not correct. See comment
> http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac

So should citextcmp() call strncmp() instead of varstr_cmp()? The latter is what I saw in varlena.c.

> There must be a difference between equality and collation. For example,
> in the Czech language 'láska' and 'laská' are different words, which
> means that 'láska' != 'laská', but there is no difference in collation
> order. See the Unicode Universal Collation Algorithm for details.

I'll leave the collation stuff to the functions I call (*far* from my specialty), but I'll add a test for this and make sure it works as expected. Um, although, with what collation should it be tested? The tests I wrote assume en_US.UTF-8.

> 5) There are several commented-out lines in the CREATE OPERATOR
> statements, mostly related to NEGATOR. Is there some reason for that?

I copied it from the original citext.sql. Not sure what effect it has.

> Also, OPERATOR || probably has a wrong negator.

Right, good catch.

> 6) You use pgTAP for the unit tests. That is something that should be
> widely discussed. It is a new approach and I'm not the correct person
> to say whether it is good or not.

Sure. I created a pgFoundry project for it here: http://pgtap.projects.postgresql.org/

> I see two problems there: first, the licensing is not clear
> (https://svn.kineticode.com/pgtap/trunk/README.pgtap).

That's a standard revised BSD license.

> And, in my opinion, it should be implemented as a general framework,
> not only as a part of the citext contrib module.

It is. I just put the file in citext so that the tests can use it. It's distributed on pgFoundry and maintained at the SVN URL quoted in the file.

> Please let me know when you have updated code and I will start deep
> testing.

Will do.

Best,

David
On Jul 2, 2008, at 09:39, Peter Eisentraut wrote:

>> Well, yeah, but that seems a very long way off. In the meantime a lot
>> of people use the existing pgfoundry citext module.
>
> Indeed, but why isn't this code put there as well?

It could be, but this is *such* a common need (and few even know about pgFoundry) that I thought it'd be worthwhile to get it distributed as part of contrib.

Best,

David
On Jul 2, 2008, at 12:18, Teodor Sigaev wrote:

>> 7) A hash opclass is absent. The hash opclass framework is used for
>> hash joins.

Thanks. I haven't seen docs on this. The original citext only supports btree, and I don't remember reading about creating a hash opclass in the Douglas book, though I probably missed it. Anyone got a link for me to read to make it happen?

Thanks,

David
On Jul 2, 2008, at 17:21, Tom Lane wrote:

>> Indeed, but why isn't this code put there as well?
>
> Well, actually, that might be the best thing to do with it. But it
> would be sensible to give it a code review anyway, no?

Obviously, I would argue that contrib is a good place for it, hence the patch. I can say more on my reasoning if you'd like. But even if it doesn't end up in contrib, I'm extremely grateful for the code reviews.

Best,

David
"David E. Wheeler" <david@kineticode.com> writes:
> On Jul 2, 2008, at 09:13, Zdenek Kotala wrote:
>> Please do not type a space before the asterisk:
>>    char * lcstr, * rcstr  ->  char *lcstr, *rcstr
>>
>> and do not put extra spaces inside brackets:
>>    citextcmp( left, right )  ->  citextcmp(left, right)

> Okay.

Note that this sort of stuff will mostly be fixed by pg_indent, whether or not David does anything about it. But certainly conforming to the project style to begin with will cause less pain to reviewers' eyeballs.

> Um, are you referring to this (at the top)?

> +// PostgreSQL 8.2 Magic.
> +#ifdef PG_MODULE_MAGIC
> +PG_MODULE_MAGIC;
> +#endif

Here however is an outright bug: we do not allow // comments, because we still support ANSI-spec compilers that don't recognize them.

> Yeah, I'm a bit confused by this. I followed what's in varlena.c on
> this. The comparison functions are documented to return 1, 0, or -1,
> which of course would be covered by int. But varstr_cmp(), which
> ultimately returns the value, returns all kinds of numbers. So, perhaps
> someone could say what it's *supposed* to be?

btree cmp functions are allowed to return int32 negative, zero, or positive. There is *not* any requirement that negative or positive results be exactly -1 or +1. However, if you are comparing values that are int32 or wider then you can't just return their difference, because it might overflow.

>> 3) The citext_larger and citext_smaller functions have a memory leak.
>> You don't call PG_FREE_IF_COPY on the unused text.

> These I also duplicated from varlena.c, and I asked about a potential
> memory leak in a previous email. If there's a leak here, would there
> not also be a leak in varlena.c?

The "leak" is irrelevant for larger/smaller. The only place where it's actually useful to do PG_FREE_IF_COPY is in a btree or hash index support function. In other cases you can assume that you're being called in a memory context that's too short-lived for it to matter.

>> 5) There are several commented-out lines in the CREATE OPERATOR
>> statements, mostly related to NEGATOR. Is there some reason for that?

> I copied it from the original citext.sql. Not sure what effect it has.

http://developer.postgresql.org/pgdocs/postgres/xoper-optimization.html

regards, tom lane
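Tom's caveat about not returning a raw difference can be sketched in plain C outside the server. This is a hypothetical illustration, not code from the patch; cmp_sign and safe_str_cmp are made-up names:

```c
#include <string.h>

/*
 * Clamp a raw comparison result to its sign only, so a btree-style
 * comparator returns a safe negative/zero/positive int and never
 * overflows the way "return a - b" can for wide values.
 */
static int cmp_sign(long raw)
{
    return (raw > 0) - (raw < 0);
}

static int safe_str_cmp(const char *a, const char *b)
{
    return cmp_sign(strcmp(a, b));
}
```

The exact -1/0/+1 values are not required, as Tom notes; only the sign matters.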
> I don't remember reading about creating a hash opclass in the Douglas
> book, though I probably missed it. Anyone got a link for me to read to
> make it happen?

A hash opclass is 5 times simpler than the btree one :)

CREATE FUNCTION citext_hash(mchar)
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;

CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE mchar USING hash AS
    OPERATOR 1 = (citext, citext),
    FUNCTION 1 citext_hash(citext);

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
> CREATE FUNCTION citext_hash(*citext*)
> DEFAULT FOR TYPE *citext* USING hash AS

Oops, citext of course.

--
Teodor Sigaev
E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Jul 2, 2008, at 22:14, Tom Lane wrote:

> Note that this sort of stuff will mostly be fixed by pg_indent, whether
> or not David does anything about it. But certainly conforming to the
> project style to begin with will cause less pain to reviewers'
> eyeballs.

Yeah, I'll change it. I'm JAPH, so I kind of made up the formatting as I went, though I did try to copy the style in varlena.c.

>> +// PostgreSQL 8.2 Magic.
>> +#ifdef PG_MODULE_MAGIC
>> +PG_MODULE_MAGIC;
>> +#endif
>
> Here however is an outright bug: we do not allow // comments, because
> we still support ANSI-spec compilers that don't recognize them.

Forgot about that. I'll change it for the next version.

> btree cmp functions are allowed to return int32 negative, zero, or
> positive. There is *not* any requirement that negative or positive
> results be exactly -1 or +1. However, if you are comparing values that
> are int32 or wider then you can't just return their difference, because
> it might overflow.

Thanks for the explanation. I'll make sure that they're both int32.

> The "leak" is irrelevant for larger/smaller. The only place where it's
> actually useful to do PG_FREE_IF_COPY is in a btree or hash index
> support function. In other cases you can assume that you're being
> called in a memory context that's too short-lived for it to matter.

So would that be for any function used by

CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE CITEXT USING btree AS
    OPERATOR 1 <  (citext, citext),
    OPERATOR 2 <= (citext, citext),
    OPERATOR 3 =  (citext, citext),
    OPERATOR 4 >= (citext, citext),
    OPERATOR 5 >  (citext, citext),
    FUNCTION 1 citext_cmp(citext, citext);

? (And then the btree operator and function to be added, of course.)

>>> 5) There are several commented-out lines in the CREATE OPERATOR
>>> statements, mostly related to NEGATOR. Is there some reason for that?
>
>> I copied it from the original citext.sql. Not sure what effect it has.
>
> http://developer.postgresql.org/pgdocs/postgres/xoper-optimization.html

Thanks. Sounds like it'd be valuable to have them in there. I'll add tests, as well.

Best,

David
On Jul 3, 2008, at 00:19, Teodor Sigaev wrote:

>> A hash opclass is 5 times simpler than the btree one :)
>
> CREATE FUNCTION citext_hash(mchar)
> RETURNS int4
> AS 'MODULE_PATHNAME'
> LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;
>
> CREATE OPERATOR CLASS citext_ops
> DEFAULT FOR TYPE mchar USING hash AS
>     OPERATOR 1 = (citext, citext),
>     FUNCTION 1 citext_hash(citext);

Thanks. What would citext_hash() look like? I don't see a text_hash() to borrow from anywhere in src/.

Thanks,

David
David E. Wheeler wrote:
> On Jul 3, 2008, at 00:19, Teodor Sigaev wrote:
>
>>> A hash opclass is 5 times simpler than the btree one :)
>>
>> CREATE FUNCTION citext_hash(mchar)
>> RETURNS int4
>> AS 'MODULE_PATHNAME'
>> LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;
>>
>> CREATE OPERATOR CLASS citext_ops
>> DEFAULT FOR TYPE mchar USING hash AS
>>     OPERATOR 1 = (citext, citext),
>>     FUNCTION 1 citext_hash(citext);
>
> Thanks. What would citext_hash() look like? I don't see a text_hash()
> to borrow from anywhere in src/.

See hash_any(). I assume the difficulty is making sure that hash("FOO") = hash("foo") ...

--
Alvaro Herrera                    http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
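A minimal sketch of the invariant Alvaro describes, that the hash must agree with equality so that hash("FOO") = hash("foo"): lowercase each byte before mixing it in. This is an ASCII-only illustration; the djb2 mixer merely stands in for PostgreSQL's hash_any(), and ci_hash is a made-up name:

```c
#include <ctype.h>

/*
 * Case-insensitive string hash: fold each byte to lowercase before
 * mixing, so strings that compare equal case-insensitively always
 * hash identically. djb2 is just a stand-in mixer.
 */
static unsigned long ci_hash(const char *s)
{
    unsigned long h = 5381;

    for (; *s; s++)
        h = h * 33 + (unsigned char) tolower((unsigned char) *s);
    return h;
}
```

Any hash opclass built this way stays consistent with an equality operator that also compares lowercased strings.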
On Jul 3, 2008, at 09:53, Alvaro Herrera wrote:

>> Thanks. What would citext_hash() look like? I don't see a text_hash()
>> to borrow from anywhere in src/.
>
> See hash_any(). I assume the difficulty is making sure that
> hash("FOO") = hash("foo") ...

Great, big help, thank you. So does this look sensible?

Datum
citext_hash(PG_FUNCTION_ARGS)
{
    text    *txt;
    char    *str;
    Datum    result;

    txt = cilower(PG_GETARG_TEXT_PP(0));
    str = VARDATA_ANY(txt);

    result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));

    /* Avoid leaking memory for toasted inputs */
    PG_FREE_IF_COPY(txt, 0);

    return result;
}

And how might I be able to test that it actually works?

Best,

David
On Jul 2, 2008, at 22:14, Tom Lane wrote:

> The "leak" is irrelevant for larger/smaller. The only place where it's
> actually useful to do PG_FREE_IF_COPY is in a btree or hash index
> support function. In other cases you can assume that you're being
> called in a memory context that's too short-lived for it to matter.

Stupid question: for the btree index support function, is that *only* the function referenced in the OPERATOR CLASS, or does it also apply to the functions that implement the operators in that class? IOW, do I need to worry about memory leaks in citext_eq, citext_ne, citext_gt, etc., or only in citext_cmp()?

Thanks,

David
Replying to myself, but I've made some local changes (see other messages) and just wanted to follow up on some of my own comments.

On Jul 2, 2008, at 21:38, David E. Wheeler wrote:

>> 4) Operator = citext_eq is not correct. See comment
>> http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac
>
> So should citextcmp() call strncmp() instead of varstr_cmp()? The
> latter is what I saw in varlena.c.

I'm guessing that the answer is "no," since varstr_cmp() uses strncmp() internally, as appropriate to the locale. Correct?

>> There must be a difference between equality and collation. For
>> example, in the Czech language 'láska' and 'laská' are different
>> words, which means that 'láska' != 'laská', but there is no difference
>> in collation order. See the Unicode Universal Collation Algorithm for
>> details.
>
> I'll leave the collation stuff to the functions I call (*far* from my
> specialty), but I'll add a test for this and make sure it works as
> expected. Um, although, with what collation should it be tested? The
> tests I wrote assume en_US.UTF-8.

I added this test and it passes:

SELECT isnt( 'láska'::citext, 'laská'::citext, 'Different accented characters should not be equivalent' );

>> 5) There are several commented-out lines in the CREATE OPERATOR
>> statements, mostly related to NEGATOR. Is there some reason for that?
>
> I copied it from the original citext.sql. Not sure what effect it has.

I restored these (and one of them was wrong anyway).

>> Also, OPERATOR || probably has a wrong negator.
>
> Right, good catch.

Stupid question: What would the negation of || actually be? There isn't one, is there?

Thanks!

David
"David E. Wheeler" <david@kineticode.com> writes:

> On Jul 3, 2008, at 09:53, Alvaro Herrera wrote:
>
>> See hash_any(). I assume the difficulty is making sure that
>> hash("FOO") = hash("foo") ...
>
> Great, big help, thank you. So does this look sensible?
>
>     txt = cilower( PG_GETARG_TEXT_PP(0) );
>     str = VARDATA_ANY(txt);
>
>     result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));

I thought your data type implemented a locale-dependent collation, not just a case-insensitive collation. That is, does this hash agree with your citext_eq on strings like "foo bar" <=> "foobar" and "fooß" <=> "fooss"?

You may have to use strxfrm.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
"David E. Wheeler" <david@kineticode.com> writes:

> do I need to worry about memory leaks in citext_eq, citext_ne,
> citext_gt, etc.,

yes

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
"David E. Wheeler" <david@kineticode.com> writes:

>>> Also, OPERATOR || probably has a wrong negator.
>>
>> Right, good catch.
>
> Stupid question: What would the negation of || actually be? There
> isn't one, is there?

Per the docs, NEGATOR is only sensible for operators returning boolean.

regards, tom lane
On Jul 5, 2008, at 02:58, Gregory Stark wrote:

>> do I need to worry about memory leaks in citext_eq, citext_ne,
>> citext_gt, etc.,
>
> yes

Thanks.

David
On Jul 5, 2008, at 02:58, Gregory Stark wrote:

>> txt = cilower( PG_GETARG_TEXT_PP(0) );
>> str = VARDATA_ANY(txt);
>>
>> result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));
>
> I thought your data type implemented a locale-dependent collation, not
> just a case-insensitive collation. That is, does this hash agree with
> your citext_eq on strings like "foo bar" <=> "foobar" and "fooß" <=>
> "fooss"?

CITEXT is basically intended to replace all those queries that do `WHERE LOWER(col) = LOWER(?)` by doing it internally. That's it. It's locale-aware to the same extent that `LOWER()` is (and that citext 1.0 is not, since it only compares ASCII characters case-insensitively). And I expect that it does, in fact, agree with your examples, in that all the current tests for = and <> pass:

try=# select 'foo bar' = 'foobar';
 ?column?
----------
 f

try=# SELECT 'fooß' = 'fooss';
 ?column?
----------
 f

> You may have to use strxfrm.

In the patch against CVS HEAD, it uses str_tolower() in formatting.c.

Best,

David
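The semantics described here, equality behaving like LOWER(a) = LOWER(b) and nothing more, can be illustrated with a small C sketch. This is an ASCII-only approximation (the patch itself relies on the locale-aware str_tolower() from formatting.c), and ci_eq_ascii is a made-up name:

```c
#include <ctype.h>

/*
 * ASCII-only illustration: two strings are "citext-equal" iff their
 * lowercased forms match byte for byte. Note that "foo bar" and
 * "foobar" remain unequal, matching the test output above.
 */
static int ci_eq_ascii(const char *a, const char *b)
{
    while (*a && *b)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;        /* both must end at the same point */
}
```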
On Jul 5, 2008, at 08:13, Tom Lane wrote:

>> Stupid question: What would the negation of || actually be? There
>> isn't one, is there?
>
> Per the docs, NEGATOR is only sensible for operators returning boolean.

Message received. Many thanks, Tom, as usual.

Best,

David
On Jun 27, 2008, at 18:22, David E. Wheeler wrote:

> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module.

Here is a new version of the patch, with the following changes:

* Fixed formatting to be more like core.
* Added appropriate NEGATORs to operators.
* Removed NEGATOR from the || operator.
* Added hash index function and operator class.
* The = operator now supports HASHES and MERGES.
* citext_cmp and citextcmp both return int32.
* Changed // comments to /* comments */.
* Added test confirming 'láska'::citext <> 'laská'::citext.
* A few other organizational, formatting, and pasto fixes.
* Updated the FAQ entry on case-insensitive queries to recommend citext (it would, of course, need to be translated).

Stuff I was asked about but didn't change:

* citext_cmp() still uses varstr_cmp() instead of strncmp(). When I tried the latter, everything seemed to be equivalent.
* citext_smaller() and citext_larger() don't have memory leaks, says Tom, so I added no calls to PG_FREE_IF_COPY().

Thank you everyone for your feedback and suggestions!

Best,

David
Attachment
David E. Wheeler wrote:
> Replying to myself, but I've made some local changes (see other
> messages) and just wanted to follow up on some of my own comments.
>
> On Jul 2, 2008, at 21:38, David E. Wheeler wrote:
>
>>> 4) Operator = citext_eq is not correct. See comment
>>> http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac
>>
>> So should citextcmp() call strncmp() instead of varstr_cmp()? The
>> latter is what I saw in varlena.c.
>
> I'm guessing that the answer is "no," since varstr_cmp() uses strncmp()
> internally, as appropriate to the locale. Correct?

You have to use varstr_cmp in citextcmp. Your code is correct, because for the < <= >= > operators you need a collation-sensitive function. You need to change only the citext_cmp function to use strncmp() or call the texteq function.

>>> There must be a difference between equality and collation. For
>>> example, in the Czech language 'láska' and 'laská' are different
>>> words, which means that 'láska' != 'laská', but there is no
>>> difference in collation order. See the Unicode Universal Collation
>>> Algorithm for details.
>>
>> I'll leave the collation stuff to the functions I call (*far* from my
>> specialty), but I'll add a test for this and make sure it works as
>> expected. Um, although, with what collation should it be tested? The
>> tests I wrote assume en_US.UTF-8.
>
> I added this test and it passes:
>
> SELECT isnt( 'láska'::citext, 'laská'::citext, 'Different accented
> characters should not be equivalent' );

I think that this test will work correctly for en_US.UTF-8 at any time. I guess the test makes sense only when the Czech collation (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change the collation during your test :(. I think the best solution for now is to keep the test and add a comment about the recommended collation for this test.

Zdenek
David E. Wheeler wrote:
> On Jun 27, 2008, at 18:22, David E. Wheeler wrote:
>
>> Please find attached a patch adding a locale-aware, case-insensitive
>> text type, called citext, as a contrib module.
>
> Here is a new version of the patch, with the following changes:
>
> * Fixed formatting to be more like core.
> * Added appropriate NEGATORs to operators.
> * Removed NEGATOR from the || operator.
> * Added hash index function and operator class.
> * The = operator now supports HASHES and MERGES.
> * citext_cmp and citextcmp both return int32.
> * Changed // comments to /* comments */.
> * Added test confirming 'láska'::citext <> 'laská'::citext.
> * A few other organizational, formatting, and pasto fixes.
> * Updated the FAQ entry on case-insensitive queries to recommend citext
>   (it would, of course, need to be translated).
>
> Stuff I was asked about but didn't change:
>
> * citext_cmp() still uses varstr_cmp() instead of strncmp(). When I
>   tried the latter, everything seemed to be equivalent.

citext_eq (operator =) should use strncmp and citext_cmp() should use varstr_cmp().

> * citext_smaller() and citext_larger() don't have memory leaks, says
>   Tom, so I added no calls to PG_FREE_IF_COPY().

Yeah, it is correct. fmgr does the cleanup job for you.

However, it seems to me that the code is OK now (except citext_eq). I think there are two open problems/questions:

1) The regression test.

a) I think that the regression test is not correct. It depends on the en_US locale, but it uses characters which are not in the related character repertoire. That means the comparison is not defined, and I guess it could generate different results on different OSes, depending on the locale implementation.

b) pgTAP is something new. We need to make a decision whether this framework is acceptable or not.

2) contrib vs. pgFoundry.

There is an unresolved question of whether we want to have this in contrib or not. It is good to mention that the citext type will be obsoleted by a full collation implementation in the future. I personally prefer to keep it on pgFoundry because it is a temporary workaround (in my opinion), but I can live with a contrib module as well.

Zdenek
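The distinction being drawn here, a cheap byte comparison for equality versus a collation-sensitive comparison for ordering, can be sketched as two separate C functions. This is a hypothetical illustration: memcmp plays the role of the texteq-style check, strcoll() stands in for varstr_cmp(), and both assume already-lowercased input:

```c
#include <string.h>

/*
 * Equality: cheap byte comparison. Different lengths mean "not equal"
 * immediately, and byte-distinct strings such as 'láska' and 'laská'
 * compare unequal even if a collation would order them identically.
 */
static int ci_eq_bytes(const char *a, size_t alen, const char *b, size_t blen)
{
    return alen == blen && memcmp(a, b, alen) == 0;
}

/*
 * Ordering for < <= >= >: must go through a collation-sensitive
 * comparison (strcoll here, varstr_cmp in the server).
 */
static int ci_order(const char *a, const char *b)
{
    return strcoll(a, b);
}
```

This split is why equality and the btree comparison function can legitimately use different primitives.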
Zdenek Kotala wrote:
> 2) contrib vs. pgFoundry.
>
> There is an unresolved question of whether we want to have this in
> contrib or not. It is good to mention that the citext type will be
> obsoleted by a full collation implementation in the future. I
> personally prefer to keep it on pgFoundry because it is a temporary
> workaround (in my opinion), but I can live with a contrib module as
> well.

Is there a concrete plan to get to full collation support (i.e. per column) in any known time frame? If not, then I think a citext module would be acceptable.

What does still bother me is its performance. I'd like to know if any measurement has been done of using citext vs. a functional index on lower(foo).

cheers

andrew
On Jul 7, 2008, at 00:46, Zdenek Kotala wrote:

> You have to use varstr_cmp in citextcmp. Your code is correct, because
> for the < <= >= > operators you need a collation-sensitive function.
>
> You need to change only the citext_cmp function to use strncmp() or
> call the texteq function.

I'm confused. What's the difference between strncmp() and varstr_cmp()? And why would I use one in one place and the other elsewhere? Shouldn't they be used consistently?

> I think that this test will work correctly for en_US.UTF-8 at any time.
> I guess the test makes sense only when the Czech collation
> (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change the
> collation during your test :(.

No, I was wondering before what locale was used for initdb on the build farm. I mean, how are locale-aware things really tested?

> I think the best solution for now is to keep the test and add a comment
> about the recommended collation for this test.

Yep, that's what it does.

Thanks,

David
On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: > However, It seems to me that code is ok now (exclude citex_eq). I > think there two open problems/questions: > > 1) regression test - > > a) I think that regresion is not correct. It depends on en_US > locale, but it uses characters which is not in related character > repertoire. It means comparing is not defined and I guess it could > generate different result on different OS - depends on locale > implementation. That I don't know about. The test requires en_US.UTF-8, at least at this point. How are tests run on the build farm? And how else could I ensure that comparisons are case-insensitive for non-ASCII characters other than requiring a Unicode locale? Or is it just an issue for the sort order tests? For those, I could potentially remove accented characters, just as long as I'm verifying in other tests that comparisons are case-insensitive (without worrying about collation). > b) pgTap is something new. Need make a decision if this framework > is acceptable or not. Well, from the point of view of `make installcheck`, it's invisible. I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it further here though, if folks are interested. > 2) contrib vs. pgFoundry > > There is unresolved answer if we want to have this in contrib or > not. Good to mention that citext type will be obsoleted with full > collation implementation in a future. I personally prefer to keep it > on pgFoundry because it is temporally workaround (by my opinion), > but I can live with contrib module as well. I second what Andrew has said in reply to this issue. I'll also say that, since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that it'd be very valuable to have citext in contrib, especially since so few folks seem to even know about pgFoundry, let alone be able to find it. 
I mean, look at these search results: http://www.google.com/search?q=PostgreSQL%20case-insensitive%20text My blog entry about this patch is hit #3. pgFoundry (and CITEXT 1) is #7. Last time I did a query like this, it didn't turn up at all. Believe me, I'd love for pgFoundry (or something like it) to become the CPAN for PostgreSQL. But without some love and SEO, I don't think that's gonna happen. Besides, CITEXT 2 would be a PITA to maintain for both 8.3 and 8.4, given the changes in the string comparison API in CVS. Thanks, David
On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: > What does still bother me is its performance. I'd like to know if > any measurement has been done of using citext vs. a functional index > on lower(foo). That's a good question. I can whip up a quick test by populating a column full of data and trying both, but I'm no performance testing expert. Anyone else know more about such performance testing who can help out? Many Thanks, David
David E. Wheeler wrote: > On Jul 7, 2008, at 00:46, Zdenek Kotala wrote: > >> You have to use varstr_cmp in citextcmp. Your code is correct, >> because for >> < <= >= > operators you need collation sensible function. >> >> You need to change only citext_cmp function to use strncmp() or call >> texteq function. > > I'm confused. What's the difference between strncmp() and varstr_cmp()? > And why would I use one in one place and the other elsewhere? Shouldn't > they be used consistently? Not necessarily -- see texteq. I think the point is that varstr_cmp() is a lot slower. >> I'm think that this test will work correctly for en_US.UTF-8 at any >> time. I guess the test make sense only when Czech collation >> (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change >> collation during your test :(. > > No, I was wondering before what locale was used for initdb on the build > farm. I mean, how are locale-aware things really tested? They aren't AFAIK. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Jul 7, 2008, at 11:44, Alvaro Herrera wrote: >> I'm confused. What's the difference between strncmp() and >> varstr_cmp()? >> And why would I use one in one place and the other elsewhere? >> Shouldn't >> they be used consistently? > > Not necessarily -- see texteq. I think the point is that varstr_cmp() > is a lot slower. Then why shouldn't I use strncmp() for all comparisons? >> No, I was wondering before what locale was used for initdb on the >> build >> farm. I mean, how are locale-aware things really tested? > > They aren't AFAIK. Ow. Pity. Best, David
David E. Wheeler wrote: > On Jul 7, 2008, at 11:44, Alvaro Herrera wrote: > >>> I'm confused. What's the difference between strncmp() and >>> varstr_cmp()? >>> And why would I use one in one place and the other elsewhere? >>> Shouldn't >>> they be used consistently? >> >> Not necessarily -- see texteq. I think the point is that varstr_cmp() >> is a lot slower. > > Then why shouldn't I use strncmp() for all comparisons? I have no idea :-) -- because it's not locale-aware perhaps. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
David E. Wheeler napsal(a): > On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: > >> However, It seems to me that code is ok now (exclude citex_eq). I >> think there two open problems/questions: >> >> 1) regression test - >> >> a) I think that regresion is not correct. It depends on en_US locale, >> but it uses characters which is not in related character repertoire. >> It means comparing is not defined and I guess it could generate >> different result on different OS - depends on locale implementation. > > That I don't know about. The test requires en_US.UTF-8, at least at this > point. How are tests run on the build farm? And how else could I ensure > that comparisons are case-insensitive for non-ASCII characters other > than requiring a Unicode locale? Or is it just an issue for the sort > order tests? For those, I could potentially remove accented characters, > just as long as I'm verifying in other tests that comparisons are > case-insensitive (without worrying about collation). Hmm, it is a complex area and I'm not sure if anybody on the planet knows the correct answer :-). I think the best approach is to keep it as is, and when a problem occurs it will be fixed. >> b) pgTap is something new. Need make a decision if this framework is >> acceptable or not. > > Well, from the point of view of `make installcheck`, it's invisible. > I've submitted a talk proposal for pgDay.US on pgTAP. I'm happy to > discuss it further here though, if folks are interested. Yeah, it is invisible, but the question is why not put it in a common place as a framework. But that is for another discussion. >> 2) contrib vs. pgFoundry >> >> There is unresolved answer if we want to have this in contrib or not. >> Good to mention that citext type will be obsoleted with full collation >> implementation in a future. I personally prefer to keep it on >> pgFoundry because it is temporally workaround (by my opinion), but I >> can live with contrib module as well. 
> > I second what Andrew has said in reply to this issue. I'll also say > that, since people *so* often end up using `WHERE LOWER(col) = > LOWER(?)`, that it'd be very valuable to have citext in contrib, > especially since so few folks seem to even know about pgFoundry, let > alone be able to find it. I mean, look at these search results: I understand, but there is a parallel project which should solve this problem completely, I guess in the "close" future (2-3 years). Afterward this module will be obsolete, and it will take time to remove it from contrib. It seems to me that having citext in contrib for only two releases is not an optimal solution. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
Andrew Dunstan napsal(a): > > > Zdenek Kotala wrote: >> >> >> 2) contrib vs. pgFoundry >> >> There is unresolved answer if we want to have this in contrib or not. >> Good to mention that citext type will be obsoleted with full collation >> implementation in a future. I personally prefer to keep it on >> pgFoundry because it is temporally workaround (by my opinion), but I >> can live with contrib module as well. >> >> >> > > Is there a concrete plan to get to full collation support (i.e. per > column) in any known time frame? If not, then I think a citext module > would be acceptable. Collation per database should be part of 8.4. Radek Strnad works on it as a SoC project. He plans to continue this work and implement full support as his Master's thesis in a 2-3 year time frame. Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: >> Then why shouldn't I use strncmp() for all comparisons? > > I have no idea :-) -- because it's not locale-aware perhaps. Could someone who does have an idea answer these questions: * Does it need to be locale-aware or not? * Should I use strncmp() or varstr_cmp() to compare strings? * Shouldn't it use one or the other, but not both? Sorry, I'm just confused about the "correct" thing to do here. If someone who knows the definitive answers could weigh in, I'd be happy to make the adjustment. Thanks, David
On Jul 7, 2008, at 11:54, Zdenek Kotala wrote: > Hmm, it is complex area and I'm not sure if anybody on planet know > correct answer :-). I think that best approach is to keep it as is > and when a problem occur then it will be fixed. Regression tests are really important, though. >>> b) pgTap is something new. Need make a decision if this framework >>> is acceptable or not. >> Well, from the point of view of `make installcheck`, it's >> invisible. I've submitted a talk proposal for pgDay.US on pgTAP. >> I'm happy to discuss it further here though, if folks are interested. > > Yeah, it is invisible, but question is why don't put it as a > framework to common place. But it is for another discussion. It is in a common place as a project, on pgFoundry. Whether the core team wants to use it for testing other parts of PostgreSQL (core or contrib) and therefore put it in a central location is, as you say, a separate conversation. It'd be easy to move it in such a case, of course. > I understand it but there is parallel project which should solve > this problem completely I guess in "close" future (2-3years). > Afterward this module will be obsolete and it will takes times to > remove it from contrib. It seems to me that have citext in contrib > only for two releases is not optimal solution. I guess that'd be the reason to keep it on pgFoundry, but I have two comments: * 2-3 years is a *long* time in Internet time. * There is no guarantee that it will be finished in that time or, indeed, ever (no disrespect to Radek Strnad, it's just there are always unforeseen issues). Thanks, David
2008/7/7 Zdenek Kotala <Zdenek.Kotala@sun.com>: > David E. Wheeler napsal(a): >> >> On Jul 7, 2008, at 07:41, Zdenek Kotala wrote: >> >>> However, It seems to me that code is ok now (exclude citex_eq). I think >>> there two open problems/questions: >>> >>> 1) regression test - >>> >>> a) I think that regresion is not correct. It depends on en_US locale, >>> but it uses characters which is not in related character repertoire. It >>> means comparing is not defined and I guess it could generate different >>> result on different OS - depends on locale implementation. >> >> That I don't know about. The test requires en_US.UTF-8, at least at this >> point. How are tests run on the build farm? And how else could I ensure that >> comparisons are case-insensitive for non-ASCII characters other than >> requiring a Unicode locale? Or is it just an issue for the sort order tests? >> For those, I could potentially remove accented characters, just as long as >> I'm verifying in other tests that comparisons are case-insensitive (without >> worrying about collation). > > Hmm, it is complex area and I'm not sure if anybody on planet know correct > answer :-). I think that best approach is to keep it as is and when a > problem occur then it will be fixed. > Maybe we can have some "locale" test outside our regress tests - >>> b) pgTap is something new. Need make a decision if this framework is >>> acceptable or not. >> >> Well, from the point of view of `make installcheck`, it's invisible. I've >> submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it >> further here though, if folks are interested. > > Yeah, it is invisible, but question is why don't put it as a framework to > common place. But it is for another discussion. > >>> 2) contrib vs. pgFoundry >>> >>> There is unresolved answer if we want to have this in contrib or not. >>> Good to mention that citext type will be obsoleted with full collation >>> implementation in a future. 
I personally prefer to keep it on pgFoundry >>> because it is temporally workaround (by my opinion), but I can live with >>> contrib module as well. >> >> I second what Andrew has said in reply to this issue. I'll also say that, >> since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that >> it'd be very valuable to have citext in contrib, especially since so few >> folks seem to even know about pgFoundry, let alone be able to find it. I >> mean, look at these search results: > > I understand it but there is parallel project which should solve this > problem completely I guess in "close" future (2-3years). Afterward this > module will be obsolete and it will takes times to remove it from contrib. > It seems to me that have citext in contrib only for two releases is not > optimal solution. > > Zdenek > > Regards Pavel > > -- > Zdenek Kotala Sun Microsystems > Prague, Czech Republic http://sun.com/postgresql > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
David E. Wheeler napsal(a): > On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: > >>> Then why shouldn't I use strncmp() for all comparisons? >> >> I have no idea :-) -- because it's not locale-aware perhaps. > > Could someone who does have an idea answer these questions: > > * Does it need to be locale-aware or not? > * Should I use strncmp() or varstr_cmp() to compare strings? > * Shouldn't it use one or the other, but not both? > > Sorry, I'm just confused about the "correct" thing to do here. If > someone who knows the definitive answers could weigh in, I'd be happy to > make the adjustment. I'm sorry. I meant bttext(): http://doxygen.postgresql.org/varlena_8c-source.html#l01270 bttext should be used in the citextcmp function. It uses the strcoll function. And citext_eq should be implemented like texteq: http://doxygen.postgresql.org/varlena_8c-source.html#l01164 I'm sorry for the confusion; I mixed up bttext and varstr_cmp. :( Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
On Jul 7, 2008, at 12:13, Zdenek Kotala wrote: > I'm sorry. I meant bttext() http://doxygen.postgresql.org/varlena_8c-source.html#l01270 > > bttext should use in citextcmp function. It uses strcol function. I've no idea what bttext has to do with anything. Sorry if I'm being slow here. > And citext_eq should be implemented as texteq: > > http://doxygen.postgresql.org/varlena_8c-source.html#l01164 > > I'm sorry for confusion I'm exchange bttext and varstr_cmp. :( Okay, I see that text_cmp() uses varstr_cmp(): http://doxygen.postgresql.org/varlena_8c-source.html#l01139 While texteq() and textne() use strncmp(): http://doxygen.postgresql.org/varlena_8c-source.html#l01164 http://doxygen.postgresql.org/varlena_8c-source.html#l01187 The other operator functions, like text_lt(), use text_cmp() and therefore varstr_cmp(): http://doxygen.postgresql.org/varlena_8c-source.html#l01210 My question is: why? Shouldn't they all use the same function for comparison? I'm happy to dupe this implementation for citext, but I don't understand it. Should not all comparisons be executed consistently? Thanks, David
On Jul 7, 2008, at 12:10, Pavel Stehule wrote: > Maybe we can have some "locale" test outside our regress tests - I think that would be useful. Best, David
On Jul 7, 2008, at 12:21, David E. Wheeler wrote:

> My question is: why? Shouldn't they all use the same function for
> comparison? I'm happy to dupe this implementation for citext, but I
> don't understand it. Should not all comparisons be executed
> consistently?

Let me try to answer my own question by citing this comment:

    /*
     * Since we only care about equality or not-equality, we can avoid all the
     * expense of strcoll() here, and just do bitwise comparison.
     */

So, the upshot is that the = and <> operators are not locale-aware, yes? They just do byte comparisons. Is that really the way it should be? I mean, could there not be strings that are equivalent but have different bytes?

Thanks,

David
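A minimal C sketch of the fast path that comment describes (not the actual texteq() code, which works on varlena datums; the simplified function name and C-string interface here are mine):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Sketch of the texteq() fast path: since only equality matters, we
 * can skip strcoll() entirely.  Two strings of different lengths can
 * never be byte-identical, so compare lengths first and fall back to
 * a bitwise memcmp() only when they match.
 */
static bool
simple_texteq(const char *a, size_t alen, const char *b, size_t blen)
{
    if (alen != blen)
        return false;
    return memcmp(a, b, alen) == 0;
}
```

The point is just that a length check plus memcmp() suffices for = and <>, while ordering still needs the locale-aware path.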
2008/7/7 David E. Wheeler <david@kineticode.com>: > On Jul 7, 2008, at 11:54, Alvaro Herrera wrote: > >>> Then why shouldn't I use strncmp() for all comparisons? >> >> I have no idea :-) -- because it's not locale-aware perhaps. > > Could someone who does have an idea answer these questions: > > * Does it need to be locale-aware or not? yes! > * Should I use strncmp() or varstr_cmp() to compare strings? > * Shouldn't it use one or the other, but not both? > > Sorry, I'm just confused about the "correct" thing to do here. If someone > who knows the definitive answers could weigh in, I'd be happy to make the > adjustment. > > Thanks, > > David
On Jul 7, 2008, at 12:36, Pavel Stehule wrote: >> * Does it need to be locale-aware or not? > > yes! texteq() in varlena.c does not seem to be. Best, David
2008/7/7 David E. Wheeler <david@kineticode.com>: > On Jul 7, 2008, at 12:36, Pavel Stehule wrote: > >>> * Does it need to be locale-aware or not? >> >> yes! > > texteq() in varlena.c does not seem to be. > it's case sensitive and it's +/- binary compare Regards Pavel Stehule > Best, > > David >
David E. Wheeler napsal(a): > On Jul 7, 2008, at 12:21, David E. Wheeler wrote: > >> My question is: why? Shouldn't they all use the same function for >> comparison? I'm happy to dupe this implementation for citext, but I >> don't understand it. Should not all comparisons be executed consistently? > > Let me try to answer my own question by citing this comment: > > /* > * Since we only care about equality or not-equality, we can avoid > all the > * expense of strcoll() here, and just do bitwise comparison. > */ > > So, the upshot is that the = and <> operators are not locale-aware, yes? > They just do byte comparisons. Is that really the way it should be? I > mean, could there not be strings that are equivalent but have different > bytes? Correct. The problem is complex. It works fine only for normalized strings. But postgres now assumes that all utf8 strings are normalized. If you need to implement the < <= >= > operators you need to use strcoll, which takes care of locale collation. See the unicode collation algorithm: http://www.unicode.org/reports/tr10/ Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
"David E. Wheeler" <david@kineticode.com> writes: > So, the upshot is that the = and <> operators are not locale-aware, > yes? They just do byte comparisons. Is that really the way it should > be? I mean, could there not be strings that are equivalent but have > different bytes? We intentionally force such strings to be considered non-equal. See varstr_cmp, and if you like see the archives from back when that was put in. The = and <> operators are in fact consistent with the behavior of varstr_cmp (and had better be!); they're just optimized a bit. regards, tom lane
On Jul 7, 2008, at 13:10, Tom Lane wrote: > We intentionally force such strings to be considered non-equal. > See varstr_cmp, and if you like see the archives from back when > that was put in. > > The = and <> operators are in fact consistent with the behavior of > varstr_cmp (and had better be!); they're just optimized a bit. Thank you, Tom. I'll post my updated patch shortly. In the meantime, can anyone suggest an easy way to populate a table full of random strings so that I can do a bit of benchmarking? Thanks, David
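One quick way to produce such data is a small generator that emits rows of 1-10 random words, suitable for feeding to psql's \copy. This is just a sketch with hypothetical helper names, standing in for sampling real words from /usr/share/dict/words:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build one random lowercase "word" of 3 to maxlen-1 characters. */
static void
random_word(char *buf, size_t maxlen)
{
    size_t len = 3 + (size_t) rand() % (maxlen - 3);
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = (char) ('a' + rand() % 26);
    buf[len] = '\0';
}

/*
 * Write nrows lines, each of 1-10 space-separated random words, to
 * out.  The output is ready to pipe into \copy to fill the text and
 * citext columns of a benchmark table.
 */
static void
emit_rows(FILE *out, int nrows)
{
    char word[10];
    int  row, w;

    for (row = 0; row < nrows; row++)
    {
        int nwords = 1 + rand() % 10;

        for (w = 0; w < nwords; w++)
        {
            random_word(word, sizeof word);
            fprintf(out, "%s%s", w ? " " : "", word);
        }
        fputc('\n', out);
    }
}
```

The output can then be loaded with something like `psql try -c '\copy try(text) from stdin'`, adjusting for the citext column.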
On Jul 7, 2008, at 12:46, Zdenek Kotala wrote: >> So, the upshot is that the = and <> operators are not locale-aware, >> yes? They just do byte comparisons. Is that really the way it >> should be? I mean, could there not be strings that are equivalent >> but have different bytes? > > Correct. The problem is complex. It works fine only for normalized > string. But postgres now assume that all utf8 strings are normalized. I see. So binary equivalence is okay, in that case. > If you need to implement < <= >= > operators you need to use strcol > which take care of locale collation. Which varstr_cmp() does, I guess. It's what textlt uses, for example. > See unicode collation algorithm http://www.unicode.org/reports/tr10/ Wow, that looks like a fun read. Best, David
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 7, 2008, at 12:21, David E. Wheeler wrote: > >> My question is: why? Shouldn't they all use the same function for >> comparison? I'm happy to dupe this implementation for citext, but I don't >> understand it. Should not all comparisons be executed consistently? > > Let me try to answer my own question by citing this comment: > > /* > * Since we only care about equality or not-equality, we can avoid all > the > * expense of strcoll() here, and just do bitwise comparison. > */ > > So, the upshot is that the = and <> operators are not locale-aware, yes? They > just do byte comparisons. Is that really the way it should be? I mean, could > there not be strings that are equivalent but have different bytes? There could be strings that strcoll returns 0 for even though they're not identical. However, that caused problems in Postgres, so we decided that only equal strings should actually compare equal. So if strcoll returns 0 then we do a bytewise comparison to impose an arbitrary ordering. Of course the obvious case of two equivalent strings with different bytes would be two strings which differ only in case in a collation which doesn't distinguish based on case. So you obviously can't take this route for citext. I don't think you have to worry about the problem that caused Postgres to make this change. IIRC it was someone comparing strings like paths and usernames and getting false positives because they were in a Turkish locale which found certain sequences of characters to be insignificant for ordering. Someone who's using a citext data type has obviously decided that's precisely the kind of behaviour they want. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
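The rule described here can be sketched roughly like so (a simplification of what varstr_cmp() does; the function name is hypothetical, and the real code handles length-counted strings and per-platform special cases):

```c
#include <string.h>

/*
 * Sketch of the varstr_cmp() rule: order with the locale-aware
 * strcoll(), but when the collation calls two different strings
 * equivalent (strcoll() returns 0), break the tie with a bytewise
 * strcmp() so that only byte-identical strings actually compare
 * equal.
 */
static int
cmp_with_tiebreak(const char *a, const char *b)
{
    int result = strcoll(a, b);

    if (result == 0)
        result = strcmp(a, b);  /* arbitrary but deterministic order */
    return result;
}
```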
On Jul 7, 2008, at 13:59, Gregory Stark wrote: > Of course the obvious case of two equivalent strings with different > bytes > would be two strings which differ only in case in a collation which > doesn't > distinguish based on case. So you obviously can't take this route > for citext. Well, to be fair, citext isn't imposing a collation. It's just calling str_tolower() on strings before passing them on to varstr_cmp() or strncmp() to compare. > I don't think you have to worry about the problem that cause > Postgres to make > this change. IIRC it was someone comparing strings like paths and > usernames > and getting false positives because they were in a Turkish locale > which found > certain sequences of characters to be insignificant for ordering. > Someone > who's using a citext data type has obviously decided that's > precisely the kind > of behaviour they want. Hrm. So in your opinion, strncmp() could be used for all comparisons by citext, rather than varstr_cmp()? Thanks, David
On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

> What does still bother me is its performance. I'd like to know if
> any measurement has been done of using citext vs. a functional index
> on lower(foo).

Okay, here's a start. The attached script inserts random strings of 1-10 space-delimited words into text and citext columns, and then compares the performance of queries with and without indexes. The output for me is as follows:

Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against the CITEXT column aren't using them. Furthermore, the `lower(text) LIKE lower(?)` query isn't using *its* index. Huh?

So this leaves me with two questions:

1. For what reason would the query against the citext column *not* use the index?

2. Is there some way to get the CITEXT index to behave like a LOWER() index, that is, so that its value is stored using the result of the str_tolower() function, thus removing some of the overhead of converting the values for each row fetched from the index? (Does this question make any sense?)

Thanks,

David
And here is the script. D'oh! Thanks, David On Jul 7, 2008, at 16:24, David E. Wheeler wrote: > On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: > >> What does still bother me is its performance. I'd like to know if >> any measurement has been done of using citext vs. a functional >> index on lower(foo). > > Okay, here's a start. The attached script inserts random strings of > 1-10 space-delimited words into text and citext columns, and then > compares the performance of queries with and without indexes. The > output for me is as follows: > > Loading words from dictionary. > Inserting into the table. > > Test =. > SELECT * FROM try WHERE LOWER(text) = LOWER('food'); > Time: 254.254 ms > SELECT * FROM try WHERE citext = 'food'; > Time: 288.535 ms > > Test LIKE and ILIKE > SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); > Time: 209.385 ms > SELECT * FROM try WHERE citext ILIKE 'C%'; > Time: 236.186 ms > SELECT * FROM try WHERE citext LIKE 'C%'; > Time: 235.818 ms > > Adding indexes... > > Test =. > SELECT * FROM try WHERE LOWER(text) = LOWER('food'); > Time: 1.260 ms > SELECT * FROM try WHERE citext = 'food'; > Time: 277.755 ms > > Test LIKE and ILIKE > SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); > Time: 209.073 ms > SELECT * FROM try WHERE citext ILIKE 'C%'; > Time: 238.430 ms > SELECT * FROM try WHERE citext LIKE 'C%'; > Time: 238.685 ms > benedict% > > So for some reason, after adding the indexes, the queries against > the CITEXT column aren't using them. Furthermore, the `lower(text) > LIKE lower(?)` query isn't using *its* index. Huh? > > So this leaves me with two questions: > > 1. For what reason would the query against the citext column *not* > use the index? > > 2. Is there some way to get the CITEXT index to behave like a > LOWER() index, that is, so that its value is stored using the result > of the str_tolower() function, thus removing some of the overhead of > converting the values for each row fetched from the index? 
(Does > this question make any sense?) > > Thanks, > > David
Attachment
No, *really* Sheesh, sorry. David On Jul 7, 2008, at 16:26, David E. Wheeler wrote: > And here is the script. D'oh! > > Thanks, > > David > > <try.sql> > > > On Jul 7, 2008, at 16:24, David E. Wheeler wrote: > >> On Jul 7, 2008, at 08:01, Andrew Dunstan wrote: >> >>> What does still bother me is its performance. I'd like to know if >>> any measurement has been done of using citext vs. a functional >>> index on lower(foo). >> >> Okay, here's a start. The attached script inserts random strings of >> 1-10 space-delimited words into text and citext columns, and then >> compares the performance of queries with and without indexes. The >> output for me is as follows: >> >> Loading words from dictionary. >> Inserting into the table. >> >> Test =. >> SELECT * FROM try WHERE LOWER(text) = LOWER('food'); >> Time: 254.254 ms >> SELECT * FROM try WHERE citext = 'food'; >> Time: 288.535 ms >> >> Test LIKE and ILIKE >> SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); >> Time: 209.385 ms >> SELECT * FROM try WHERE citext ILIKE 'C%'; >> Time: 236.186 ms >> SELECT * FROM try WHERE citext LIKE 'C%'; >> Time: 235.818 ms >> >> Adding indexes... >> >> Test =. >> SELECT * FROM try WHERE LOWER(text) = LOWER('food'); >> Time: 1.260 ms >> SELECT * FROM try WHERE citext = 'food'; >> Time: 277.755 ms >> >> Test LIKE and ILIKE >> SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%'); >> Time: 209.073 ms >> SELECT * FROM try WHERE citext ILIKE 'C%'; >> Time: 238.430 ms >> SELECT * FROM try WHERE citext LIKE 'C%'; >> Time: 238.685 ms >> benedict% >> >> So for some reason, after adding the indexes, the queries against >> the CITEXT column aren't using them. Furthermore, the `lower(text) >> LIKE lower(?)` query isn't using *its* index. Huh? >> >> So this leaves me with two questions: >> >> 1. For what reason would the query against the citext column *not* >> use the index? >> >> 2. 
Is there some way to get the CITEXT index to behave like a >> LOWER() index, that is, so that its value is stored using the >> result of the str_tolower() function, thus removing some of the >> overhead of converting the values for each row fetched from the >> index? (Does this question make any sense?) >> >> Thanks, >> >> David
Attachment
"David E. Wheeler" <david@kineticode.com> writes: > Hrm. So in your opinion, strncmp() could be used for all comparisons > by citext, rather than varstr_cmp()? I thought the charter of this type was to work like lower(textcol). If that's so, you certainly can't use strncmp, because that would result in sort orderings totally different from lower()'s result. Even without that argument, for most multibyte cases you'd get a pretty arbitrary, user-unfriendly sort ordering. regards, tom lane
On Jul 7, 2008, at 16:58, Tom Lane wrote: > "David E. Wheeler" <david@kineticode.com> writes: >> Hrm. So in your opinion, strncmp() could be used for all comparisons >> by citext, rather than varstr_cmp()? > > I thought the charter of this type was to work like lower(textcol). Correct. > If that's so, you certainly can't use strncmp, because that would > result > in sort orderings totally different from lower()'s result. Even > without > that argument, for most multibyte cases you'd get a pretty arbitrary, > user-unfriendly sort ordering. Now I'm confused again. :-( Whether or not I use strncmp() or varstr_cmp(), I first lowercase the value to be compared using str_tolower(). What Zdenek has said is, that aside, just as for the TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for everything else. Are you saying something different? I could use some examples to play with in order to ensure that things are behaving as they should. I'll add regression tests for them. Thanks, David
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 7, 2008, at 16:58, Tom Lane wrote: >> If that's so, you certainly can't use strncmp, because that would >> result >> in sort orderings totally different from lower()'s result. Even >> without >> that argument, for most multibyte cases you'd get a pretty arbitrary, >> user-unfriendly sort ordering. > Now I'm confused again. :-( Whether or not I use strncmp() or > varstr_cmp(), I first lowercase the value to be compared using > str_tolower(). What Zdenek has said is, that aside, just as for the > TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for > everything else. Are you saying something different? No, but you were: you proposed using strncmp for everything. regards, tom lane
On Jul 7, 2008, at 17:18, Tom Lane wrote: > No, but you were: you proposed using strncmp for everything. Yes, that's right. I was trying to understand why I wouldn't use just one or the other. And the answer turned out to be, you wouldn't, except that strncmp() is a desirable optimization for = and <>. So I've changed only those. Phew, I think I'm clear now. Thanks! David
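For the record, the scheme that emerged can be sketched in C roughly as follows. It is a simplified sketch only: the patch folds case with the backend's locale-aware str_tolower(), for which plain ASCII tolower() stands in here, and these function names are mine:

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* ASCII-only stand-in for the backend's locale-aware str_tolower(). */
static char *
fold_lower(const char *s)
{
    size_t len = strlen(s);
    char  *out = malloc(len + 1);
    size_t i;

    for (i = 0; i < len; i++)
        out[i] = (char) tolower((unsigned char) s[i]);
    out[len] = '\0';
    return out;
}

/* = and <>: fold, then the cheap bytewise compare (the strncmp path). */
static int
ci_eq(const char *a, const char *b)
{
    char *la = fold_lower(a);
    char *lb = fold_lower(b);
    int   eq = strcmp(la, lb) == 0;

    free(la);
    free(lb);
    return eq;
}

/* < <= >= >: fold, then the locale-aware compare (the varstr_cmp path). */
static int
ci_cmp(const char *a, const char *b)
{
    char *la = fold_lower(a);
    char *lb = fold_lower(b);
    int   result = strcoll(la, lb);

    free(la);
    free(lb);
    return result;
}
```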
Thanks to help from RhodiumToad on IRC, I got some things improved here:

On Jul 7, 2008, at 16:24, David E. Wheeler wrote:

> So for some reason, after adding the indexes, the queries against
> the CITEXT column aren't using them. Furthermore, the `lower(text)
> LIKE lower(?)` query isn't using *its* index. Huh?

I never knew that one needed to use the text_pattern_ops operator class to index a column for use with LIKE! I had no clue. Would that work for a citext column, too, since it's essentially the same as TEXT?

> So this leaves me with two questions:
>
> 1. For what reason would the query against the citext column *not*
> use the index?

It turns out that it did use the index if I put `SET enable_seqscan = false;` into my script. So with RhodiumToad's direction, I added some `RESTRICT` and `JOIN` clauses to my comparison operators (copying them from ip4r). So now I have:

CREATE OPERATOR = (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    COMMUTATOR = =, NEGATOR = <>,
    PROCEDURE = citext_eq, RESTRICT = eqsel, JOIN = eqjoinsel,
    HASHES, MERGES
);

CREATE OPERATOR <> (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = =, COMMUTATOR = <>,
    PROCEDURE = citext_ne, RESTRICT = neqsel, JOIN = neqjoinsel
);

CREATE OPERATOR < (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = >=, COMMUTATOR = >,
    PROCEDURE = citext_lt, RESTRICT = scalarltsel, JOIN = scalarltjoinsel
);

CREATE OPERATOR <= (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = >, COMMUTATOR = <=,
    PROCEDURE = citext_le, RESTRICT = scalarltsel, JOIN = scalarltjoinsel
);

CREATE OPERATOR >= (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = <, COMMUTATOR = <=,
    PROCEDURE = citext_ge, RESTRICT = scalargtsel, JOIN = scalargtjoinsel
);

CREATE OPERATOR > (
    LEFTARG = CITEXT, RIGHTARG = CITEXT,
    NEGATOR = <=, COMMUTATOR = <,
    PROCEDURE = citext_gt, RESTRICT = scalargtsel, JOIN = scalargtjoinsel
);

With this change, the index was used:

Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 261.295 ms
SELECT * FROM try WHERE citext = 'food';
Time: 289.304 ms
Time: 1228.961 ms
Adding indexes...
Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 2.018 ms
SELECT * FROM try WHERE citext = 'food';
Time: 0.788 ms

Seems to be faster than the LOWER() version, too, which makes me happy. The output from EXPLAIN ANALYZE:

try=# EXPLAIN ANALYZE SELECT * FROM try WHERE citext = 'food';
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_try_citext on try  (cost=0.00..8.31 rows=1 width=119) (actual time=0.324..0.324 rows=0 loops=1)
   Index Cond: (citext = 'food'::citext)
 Total runtime: 0.377 ms
(3 rows)

try=# EXPLAIN ANALYZE SELECT * FROM try WHERE LOWER(text) = LOWER('food');
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on try  (cost=28.17..1336.10 rows=500 width=119) (actual time=0.170..0.170 rows=0 loops=1)
   Recheck Cond: (lower(text) = 'food'::text)
   ->  Bitmap Index Scan on idx_try_text  (cost=0.00..28.04 rows=500 width=0) (actual time=0.168..0.168 rows=0 loops=1)
         Index Cond: (lower(text) = 'food'::text)
 Total runtime: 0.211 ms
(5 rows)

So my only other question related to this is:

* Are the above RESTRICT and JOIN functions the ones to use, or is there some way to make use of those used by the TEXT type that would be more appropriate?

> 2. Is there some way to get the CITEXT index to behave like a
> LOWER() index, that is, so that its value is stored using the result
> of the str_tolower() function, thus removing some of the overhead of
> converting the values for each row fetched from the index? (Does
> this question make any sense?)

Given the performance with an index, I think this is moot, yes? There is, of course, much more overhead for a table scan.

Best,

David
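On the LIKE question above, I believe the answer is that a plain btree index can't serve left-anchored LIKE patterns in a non-C locale, so a second index with text_pattern_ops is needed. A sketch, with index names of my own invention:

```sql
-- Plain expression index: good for =, <, >, etc., but NOT usable
-- by LIKE when the database locale is anything other than C.
CREATE INDEX idx_try_lower ON try (LOWER(text));

-- Pattern-ops index: usable by left-anchored LIKE, e.g. LIKE 'foo%'.
CREATE INDEX idx_try_lower_pattern ON try (LOWER(text) text_pattern_ops);

-- This query should now be able to use idx_try_lower_pattern:
EXPLAIN SELECT * FROM try WHERE LOWER(text) LIKE LOWER('foo%');
```

Presumably the same trick would apply to a citext column once it has a comparable pattern operator class, but I haven't verified that.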
On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:

> I guess that'd be the reason to keep it on pgFoundry, but I have two
> comments:
>
> * 2-3 years is a *long* time in Internet time.

There have been patches over the years, but they tend not to get looked at. Recently someone pulled up the COLLATE patch from a couple of years ago, but it didn't get much feedback either. (I can't find the link right now.)

It's disappointing that the discussions get hung up on the ICU library when it's not required or even needed for COLLATE support. My original patch never even mentioned it.

I note that Firebird added COLLATE using ICU a few years back now. I think PostgreSQL is the only large DBMS not to support it.

> * There is no guarantee that it will be finished in that time or,
> indeed ever (no disrespect to Radek Strnad, it's just there are always
> unforeseen issues).

I think that with concerted coding effort it could be done in 2-3 months (judging by how long it took to write the first version). The problem is it needs some planner kung-fu which not many people have.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
Martijn van Oosterhout napsal(a):

> On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:
>> I guess that'd be the reason to keep it on pgFoundry, but I have two
>> comments:
>>
>> * 2-3 years is a *long* time in Internet time.
>
> There have been patches over the years, but they tend not to get looked
> at. Recently someone pulled up the COLLATE patch from a couple of years
> ago but it didn't get much feedback either. (I can't find the link
> right now).

I know about it. I have printed your proposal on my desk. I think it is linked from the TODO list.

> It's disappointing that the discussions get hung up on the ICU library
> when it's not required or even needed for COLLATE support. My original
> patch never even mentioned it.
>
> I note that Firebird added COLLATE using ICU a few years back now. I
> think PostgreSQL is the only large DBMS to not support it.

I completely agree. Missing collation support is a big problem for many users.

>> * There is no guarantee that it will be finished in that time or,
>> indeed ever (no disrespect to Radek Strnad, it's just there are always
>> unforeseen issues).
>
> I think that with concerted coding effort it could be done in 2-3
> months (judging by how long it took to write the first version). The
> problem is it needs some planner kung-fu which not many people have.

I agree that 2-3 months of full-time work is a good estimate; the problem is that you need a kung-fu master who has time to do it :(. What we currently have is a student who works on it in his free time.

Zdenek
--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql
On Jul 7, 2008, at 12:06, David E. Wheeler wrote:

>> I understand it but there is parallel project which should solve
>> this problem completely I guess in "close" future (2-3 years).
>> Afterward this module will be obsolete and it will take time to
>> remove it from contrib. It seems to me that having citext in contrib
>> for only two releases is not an optimal solution.
>
> I guess that'd be the reason to keep it on pgFoundry, but I have two
> comments:
>
> * 2-3 years is a *long* time in Internet time.
>
> * There is no guarantee that it will be finished in that time or,
> indeed ever (no disrespect to Radek Strnad, it's just there are
> always unforeseen issues).

One other thing that occurred to me yesterday: given that the feature will ultimately support column-level collations, I suspect that it will be much easier to migrate CITEXT to a case-insensitive collation (perhaps using an updated CITEXT contrib module that just does so transparently) than to migrate application code from using LOWER() all over the place to not using it. One transition requires a change to the schema; the other requires a change to application code, of which there is generally a lot more.

Best,

David
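To make the schema-versus-code comparison concrete, the citext migration is one DDL statement per affected column (table and column names here are hypothetical):

```sql
-- Schema change: one statement per affected column...
ALTER TABLE users ALTER COLUMN nick TYPE CITEXT;

-- ...after which existing equality queries become case-insensitive as-is:
SELECT * FROM users WHERE nick = 'Theory';

-- versus auditing every query in the application for this pattern:
SELECT * FROM users WHERE LOWER(nick) = LOWER('Theory');
```

The same one-statement shape would presumably hold for a future column-level collation, which is why I think the citext-to-collation path will be the easy one.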