Thread: PATCH: CITEXT 2.0 v3
Attached is a new version of a patch to add a CITEXT contrib module. Changes since v2:

* Optimized citext_eq() and citext_ne() (the = and <> operators, respectively) by having them use strncmp() instead of varstr_cmp(). Per discussion.

* Added `RESTRICT` and `JOIN` clauses to the comparison operators (=, <>, <, >, <=, >=). These improve statistics estimation, increasing the likelihood that indices will be used.

Thanks!

David
Attachment
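For readers following along: RESTRICT and JOIN attach selectivity-estimation functions to an operator so the planner can make better row-count estimates and choose index plans more often. A sketch of what the patch's equality operator definition might look like with those clauses (the exact text is in the attachment; the estimator names shown, eqsel and eqjoinsel, are the standard built-in ones):

    CREATE OPERATOR = (
        LEFTARG    = CITEXT,
        RIGHTARG   = CITEXT,
        COMMUTATOR = =,
        NEGATOR    = <>,
        PROCEDURE  = citext_eq,
        RESTRICT   = eqsel,       -- restriction selectivity for col = const
        JOIN       = eqjoinsel,   -- join selectivity for t1.col = t2.col
        HASHES,
        MERGES
    );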
I guess you're all just blown away by the perfection of this patch? ;-) Best, David On Jul 7, 2008, at 18:03, David E. Wheeler wrote: > Attached is a new version of a patch to add a CITEXT contrib module. > Changes since v2: > > * Optimized citext_eq() and citext_ne() (the = and <> operators, > respectively) by having them use strncmp() instead of varstr_cmp(). > Per discussion. > > * Added `RESTRICT` and `JOIN` clauses to the comparison operators > (=, <>, <, >, <=, >=). These improve statistics estimations, > increasing the liklihood that indices will be used. > > Thanks! > > David
David E. Wheeler wrote: > I guess you're all just blown away by the perfection of this patch? ;-) The problem is that we're in the middle of a commitfest, so everybody is busy reviewing other patches (in theory at least). One thing that jumps at me is pgTAP usage, as Zdenek said. I understand that it's neat and all that, but we can't include the tests because they won't run unless one installs pgTAP which seems a nonstarter. So if you want the tests in the repository along the rest of the stuff, they really should use pg_regress. It's not even difficult to use. Have a look at contrib/ltree/sql and contrib/ltree/expected for examples. If you want to push for pgTAP in core, that's fine, but it's a separate discussion. The other possibility being, of course, that you are proposing citext to live on pgFoundry. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

> The problem is that we're in the middle of a commitfest, so everybody is
> busy reviewing other patches (in theory at least).

Of course.

> One thing that jumps at me is pgTAP usage, as Zdenek said. I understand
> that it's neat and all that, but we can't include the tests because they
> won't run unless one installs pgTAP which seems a nonstarter. So if you
> want the tests in the repository along the rest of the stuff, they
> really should use pg_regress.

It does use pg_regress. The tests just load the included pgtap.sql file to get the TAP functions, and then away they go. If you run `make installcheck` it works.

> It's not even difficult to use. Have a look at contrib/ltree/sql and
> contrib/ltree/expected for examples.
>
> If you want to push for pgTAP in core, that's fine, but it's a separate
> discussion.

Agreed. I've sent a couple of messages in a thread about that, the latest this morning.

> The other possibility being, of course, that you are proposing citext to
> live on pgFoundry.

I'm not, actually. I mean, I have an updated version for 8.3, but it'd be quite a pita to maintain them both, since the API for lowercasing text is so much simpler in 8.4.

Best,

David
David E. Wheeler wrote: > On Jul 9, 2008, at 13:40, Alvaro Herrera wrote: >> One thing that jumps at me is pgTAP usage, as Zdenek said. I >> understand that it's neat and all that, but we can't include the >> tests because they won't run unless one installs pgTAP which seems a >> nonstarter. So if you want the tests in the repository along the >> rest of the stuff, they really should use pg_regress. > > It does use pg_regress. The test just load the included pgtap.sql file > to get the tap functions, and then away they go. If you run `make > installcheck` it works. Uh-huh, OK, that's fine then I guess. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
David E. Wheeler wrote:

> Attached is a new version of a patch to add a CITEXT contrib module.
> Changes since v2:
>
> * Optimized citext_eq() and citext_ne() (the = and <> operators,
> respectively) by having them use strncmp() instead of varstr_cmp(). Per
> discussion.
>
> * Added `RESTRICT` and `JOIN` clauses to the comparison operators (=,
> <>, <, >, <=, >=). These improve statistics estimation, increasing the
> likelihood that indices will be used.

I'm sorry for the late response; I'm a little bit busy this week. I took a quick look at your latest changes and everything seems OK to me, so I'm going to change the status to ready for commit. After discussion with Pavel Stehule I have changed my mind about pgFoundry vs. contrib: contrib is OK for me. Maybe some native speaker should review the documentation.

Zdenek

--
Zdenek Kotala              Sun Microsystems
Prague, Czech Republic     http://sun.com/postgresql
On Jul 11, 2008, at 12:37, Zdenek Kotala wrote: > I'm sorry for late response. I'm little bit busy this week. I > quickly look on your latest changes and it seems to me that > everything is OK. I'm going to change status to ready for commit. > After discussion with Pavel Stehule I changed meaning about > pgFoundry vs. contrib. Contrib is OK for me. Thank you, Zdenek. Have you had a chance to try citext yet? Or did you just read the source? Best, David
David E. Wheeler napsal(a): > On Jul 11, 2008, at 12:37, Zdenek Kotala wrote: > >> I'm sorry for late response. I'm little bit busy this week. I quickly >> look on your latest changes and it seems to me that everything is OK. >> I'm going to change status to ready for commit. After discussion with >> Pavel Stehule I changed meaning about pgFoundry vs. contrib. Contrib >> is OK for me. > > Thank you, Zdenek. Have you had a chance to try citext yet? Or did you > just read the source? I tested version two on Solaris/SPARC and Sun studio compiler. I checked last version only quickly (comparing your changes). With regards Zdenek -- Zdenek Kotala Sun Microsystems Prague, Czech Republic http://sun.com/postgresql
On Jul 11, 2008, at 13:02, Zdenek Kotala wrote:

>> Thank you, Zdenek. Have you had a chance to try citext yet? Or did
>> you just read the source?
>
> I tested version two on Solaris/SPARC and Sun studio compiler. I
> checked last version only quickly (comparing your changes).

Thanks. I just updated my performance test script (attached) by increasing the number of rows tested by an order of magnitude. So now it creates 1,000,000 rows, queries them, adds indexes, and then queries them again.

Unfortunately, CITEXT seems to have a memory leak somewhere, because when I index the CITEXT column, it fails with "ERROR: out of memory". So obviously something's not getting cleaned up. Here's the btree indexing function:

    Datum
    citext_cmp(PG_FUNCTION_ARGS)
    {
        text   *left  = PG_GETARG_TEXT_PP(0);
        text   *right = PG_GETARG_TEXT_PP(1);
        int32   result;

        result = citextcmp(left, right);
        PG_FREE_IF_COPY(left, 0);
        PG_FREE_IF_COPY(right, 1);
        PG_RETURN_INT32(result);
    }

And here's citextcmp():

    static int32
    citextcmp (text *left, text *right)
    {
        char   *lcstr, *rcstr;
        int     result;

        lcstr = str_tolower(VARDATA_ANY(left),  VARSIZE_ANY_EXHDR(left));
        rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

        result = varstr_cmp(
            lcstr, VARSIZE_ANY_EXHDR(left),
            rcstr, VARSIZE_ANY_EXHDR(right)
        );

        pfree(lcstr);
        pfree(rcstr);
        return result;
    }

Can anyone see where I'm failing to free up memory? Might it be in some other function?

Thanks!

David
Attachment
"David E. Wheeler" <david@kineticode.com> writes: > lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left)); > rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right)); > result = varstr_cmp( > lcstr, > VARSIZE_ANY_EXHDR(left), > rcstr, > VARSIZE_ANY_EXHDR(right) > ); Not sure about a memory leak, but this code is clearly wrong if str_tolower results in a change of physical string length (clearly possible in Turkish, for instance, and probably in some other locales too). You need to do strlen's on the lowercased strings. regards, tom lane
On Jul 11, 2008, at 16:45, Tom Lane wrote: > Not sure about a memory leak, but this code is clearly wrong if > str_tolower results in a change of physical string length (clearly > possible in Turkish, for instance, and probably in some other > locales too). You need to do strlen's on the lowercased strings. Ah, great point. Thanks. Anyone else got an idea on the memory leak? Best, David
"David E. Wheeler" <david@kineticode.com> writes: > Unfortunately, CITEXT seems to have a memory leak somewhere, I tried it here and didn't see any apparent memory leak, on two different systems. What platform are you testing on, and with what encoding/locale settings? regards, tom lane
On Jul 11, 2008, at 19:18, Tom Lane wrote: > I tried it here and didn't see any apparent memory leak, on two > different systems. What platform are you testing on, and with > what encoding/locale settings? PostgreSQL 8.3.3 on i386-apple-darwin9.4.0, compiled by GCC i686- apple-darwin9-gcc-4.0.1 (G That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8. This is using my version for 8.3.3 from svn. https://svn.kineticode.com/citext/trunk/ I'll give it a try myself with the patch against CVS in a bit. Thanks, David
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 11, 2008, at 19:18, Tom Lane wrote: >> I tried it here and didn't see any apparent memory leak, on two >> different systems. What platform are you testing on, and with >> what encoding/locale settings? > That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8. Oh, got one of those too ... [ time passes ... ] Nope, no leak seen here either, though you could possibly fry an egg on my laptop right now ... Also, I notice that the Mac is the *only* one of the three systems on which your regression tests pass. You're depending way too much on platform-specific behavior there, I fear. regards, tom lane
Here's an updated version of citext.c. Some changes cosmetic, some not so much ...

			regards, tom lane

/*
 * PostgreSQL type definitions for CITEXT 2.0.
 */

#include "postgres.h"

#include "access/hash.h"
#include "fmgr.h"
#include "utils/builtins.h"
#include "utils/formatting.h"

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

/*
 * ====================
 * FORWARD DECLARATIONS
 * ====================
 */

static int32 citextcmp (text *left, text *right);

extern Datum citext_cmp (PG_FUNCTION_ARGS);
extern Datum citext_hash (PG_FUNCTION_ARGS);
extern Datum citext_eq (PG_FUNCTION_ARGS);
extern Datum citext_ne (PG_FUNCTION_ARGS);
extern Datum citext_gt (PG_FUNCTION_ARGS);
extern Datum citext_ge (PG_FUNCTION_ARGS);
extern Datum citext_lt (PG_FUNCTION_ARGS);
extern Datum citext_le (PG_FUNCTION_ARGS);
extern Datum citext_smaller (PG_FUNCTION_ARGS);
extern Datum citext_larger (PG_FUNCTION_ARGS);

/*
 * =================
 * UTILITY FUNCTIONS
 * =================
 */

/* citextcmp()
 * Internal comparison function for citext strings.
 * Returns int32 negative, zero, or positive.
 */
static int32
citextcmp (text *left, text *right)
{
    char   *lcstr,
           *rcstr;
    int32   result;

    lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
    rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

    result = varstr_cmp(lcstr, strlen(lcstr), rcstr, strlen(rcstr));

    pfree(lcstr);
    pfree(rcstr);

    return result;
}

/*
 * ==================
 * INDEXING FUNCTIONS
 * ==================
 */

PG_FUNCTION_INFO_V1(citext_cmp);

Datum
citext_cmp(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    int32   result;

    result = citextcmp(left, right);

    PG_FREE_IF_COPY(left, 0);
    PG_FREE_IF_COPY(right, 1);

    PG_RETURN_INT32(result);
}

PG_FUNCTION_INFO_V1(citext_hash);

Datum
citext_hash(PG_FUNCTION_ARGS)
{
    text   *txt = PG_GETARG_TEXT_PP(0);
    char   *str;
    Datum   result;

    str = str_tolower(VARDATA_ANY(txt), VARSIZE_ANY_EXHDR(txt));
    result = hash_any((unsigned char *) str, strlen(str));
    pfree(str);

    /* Avoid leaking memory for toasted inputs */
    PG_FREE_IF_COPY(txt, 0);

    PG_RETURN_DATUM(result);
}

/*
 * ==================
 * OPERATOR FUNCTIONS
 * ==================
 */

PG_FUNCTION_INFO_V1(citext_eq);

Datum
citext_eq(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    char   *lcstr,
           *rcstr;
    bool    result;

    lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
    rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

    /*
     * We can't do the length-comparison optimization here, as is done for the
     * text type in varlena.c, because sometimes the lengths can be different.
     * The canonical example is the turkish dotted i: the lowercase version is
     * the standard ASCII i, but the uppercase version is multibyte.
     * Otherwise, since we only care about equality or not-equality, we can
     * avoid all the expense of strcoll() here, and just do bitwise
     * comparison.
     */
    result = (strcmp(lcstr, rcstr) == 0);

    pfree(lcstr);
    pfree(rcstr);
    PG_FREE_IF_COPY(left, 0);
    PG_FREE_IF_COPY(right, 1);

    PG_RETURN_BOOL(result);
}

PG_FUNCTION_INFO_V1(citext_ne);

Datum
citext_ne(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    char   *lcstr,
           *rcstr;
    bool    result;

    lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
    rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

    /*
     * We can't do the length-comparison optimization here, as is done for the
     * text type in varlena.c, because sometimes the lengths can be different.
     * The canonical example is the turkish dotted i: the lowercase version is
     * the standard ASCII i, but the uppercase version is multibyte.
     * Otherwise, since we only care about equality or not-equality, we can
     * avoid all the expense of strcoll() here, and just do bitwise
     * comparison.
     */
    result = (strcmp(lcstr, rcstr) != 0);

    pfree(lcstr);
    pfree(rcstr);
    PG_FREE_IF_COPY(left, 0);
    PG_FREE_IF_COPY(right, 1);

    PG_RETURN_BOOL(result);
}

PG_FUNCTION_INFO_V1(citext_lt);

Datum
citext_lt(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    bool    result;

    result = citextcmp(left, right) < 0;

    PG_FREE_IF_COPY(left, 0);
    PG_FREE_IF_COPY(right, 1);

    PG_RETURN_BOOL(result);
}

PG_FUNCTION_INFO_V1(citext_le);

Datum
citext_le(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    bool    result;

    result = citextcmp(left, right) <= 0;

    PG_FREE_IF_COPY(left, 0);
    PG_FREE_IF_COPY(right, 1);

    PG_RETURN_BOOL(result);
}

PG_FUNCTION_INFO_V1(citext_gt);

Datum
citext_gt(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    bool    result;

    result = citextcmp(left, right) > 0;

    PG_FREE_IF_COPY(left, 0);
    PG_FREE_IF_COPY(right, 1);

    PG_RETURN_BOOL(result);
}

PG_FUNCTION_INFO_V1(citext_ge);

Datum
citext_ge(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    bool    result;

    result = citextcmp(left, right) >= 0;

    PG_FREE_IF_COPY(left, 0);
    PG_FREE_IF_COPY(right, 1);

    PG_RETURN_BOOL(result);
}

/*
 * ===================
 * AGGREGATE FUNCTIONS
 * ===================
 */

PG_FUNCTION_INFO_V1(citext_smaller);

Datum
citext_smaller(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    text   *result;

    result = citextcmp(left, right) < 0 ? left : right;
    PG_RETURN_TEXT_P(result);
}

PG_FUNCTION_INFO_V1(citext_larger);

Datum
citext_larger(PG_FUNCTION_ARGS)
{
    text   *left = PG_GETARG_TEXT_PP(0);
    text   *right = PG_GETARG_TEXT_PP(1);
    text   *result;

    result = citextcmp(left, right) > 0 ? left : right;
    PG_RETURN_TEXT_P(result);
}
On Jul 11, 2008, at 20:10, Tom Lane wrote: > Oh, got one of those too ... [ time passes ... ] Nope, no leak > seen here either, though you could possibly fry an egg on my laptop > right now ... Yeah, gotta love that, right? So the issue must be with my version for 8.3.x, meaning, likely, my custom cilower() function. I'll have another look at it. Thanks for giving it a whirl on your end. > Also, I notice that the Mac is the *only* one of the three systems > on which your regression tests pass. You're depending way too much > on platform-specific behavior there, I fear. Hrm. I wonder what that could be. I'll have to give it a shot on my Ubuntu box and find out. Thanks. David
On Jul 11, 2008, at 20:22, Tom Lane wrote:

> Here's an updated version of citext.c. Some changes cosmetic, some
> not so much ...

Thanks! Good catch on my missing all of the s/int/int32/ stuff related to citextcmp(). Stupid oversights on my part. I'll submit a new patch with these changes shortly.

Best,

David
BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments: * An explicit comment explaining that you're piggybacking on the I/O functions (and some others) for "text" wouldn't be out of place. * Lose the GRANT EXECUTEs on the I/O functions; they're redundant. (If you needed them, you'd need them on a lot more than these two.) I'd be inclined to lose the COMMENTs on the functions too; again these are about the least useful ones to comment on out of the whole module. * You should provide binary I/O (send/receive) functions, if you want this to be an industrial-strength module. It's easy since you can piggyback on text's. * The type declaration needs to say storage = extended, else the type won't be toastable. * The cast from bpchar to citext cannot be WITHOUT FUNCTION; it needs to invoke rtrim1. Compare the bpchar to text cast. * <= is surely not its own commutator. You might try running the opr_sanity regression test on this module to see if it finds any other silliness. (Procedure: insert the citext definition script into the serial_schedule list just ahead of opr_sanity, run tests, make sure you understand the reason for any diffs in the opr_sanity result. There will be at least one from the uses of text-related functions for citext.) * I think you can and should drop all of the "size" functions and a lot of the "miscellaneous" functions: anyplace where the implicit coercion to text would serve, you don't need a piggyback function, and introducing one just creates risks of can't-resolve-ambiguous-function failures. The overloaded miscellaneous functions are only justifiable to the extent that it's important to preserve "citext-ness" of the result of a function, which seems at best dubious for many of these. It is likewise pretty pointless AFAICS to introduce regex functions taking citext pattern arguments. * Don't use the OPERATOR() notation when you don't need to. It's just clutter. regards, tom lane
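To make a few of these points concrete, the type declaration and bpchar cast being described would look roughly like this. This is a sketch rather than the patch text; citextrecv/citextsend are the binary I/O wrappers discussed further below, and the bpchar cast wraps the internal rtrim1 function the same way the built-in bpchar-to-text cast does:

    CREATE TYPE citext (
        INPUT          = citextin,
        OUTPUT         = citextout,
        RECEIVE        = citextrecv,   -- binary I/O piggybacking on textrecv
        SEND           = citextsend,   -- binary I/O piggybacking on textsend
        INTERNALLENGTH = VARIABLE,
        STORAGE        = extended      -- required for the type to be toastable
    );

    -- The bpchar -> citext cast must trim trailing blanks, just as the
    -- built-in bpchar -> text cast does, so it needs a function:
    CREATE FUNCTION citext(bpchar) RETURNS citext
    AS 'rtrim1'
    LANGUAGE 'internal' IMMUTABLE STRICT;

    CREATE CAST (bpchar AS citext) WITH FUNCTION citext(bpchar);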
On Jul 12, 2008, at 12:17, Tom Lane wrote:

> BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments:

Thanks a million for these, Tom. I greatly appreciate it.

> * An explicit comment explaining that you're piggybacking on the I/O
> functions (and some others) for "text" wouldn't be out of place.

I've added SQL comments. Were you talking about a COMMENT?

> * Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
> (If you needed them, you'd need them on a lot more than these two.)
> I'd be inclined to lose the COMMENTs on the functions too; again
> these are about the least useful ones to comment on out of the
> whole module.

I wondered about that; those were copied from CITEXT 1. I've removed all GRANTs and COMMENTs.

> * You should provide binary I/O (send/receive) functions, if you want
> this to be an industrial-strength module. It's easy since you can
> piggyback on text's.

I'm confused. Is that not what the citextin and citextout functions are?

> * The type declaration needs to say storage = extended, else the type
> won't be toastable.

Ah, good, thanks.

> * The cast from bpchar to citext cannot be WITHOUT FUNCTION;
> it needs to invoke rtrim1. Compare the bpchar to text cast.

Where do I find that? I have trouble finding the SQL that creates the core types. :-(

> * <= is surely not its own commutator.

Changed to >=.

> You might try running the
> opr_sanity regression test on this module to see if it finds any
> other silliness. (Procedure: insert the citext definition script
> into the serial_schedule list just ahead of opr_sanity, run tests,
> make sure you understand the reason for any diffs in the opr_sanity
> result. There will be at least one from the uses of text-related
> functions for citext.)

Thanks. Added to my list.

> * I think you can and should drop all of the "size" functions and
> a lot of the "miscellaneous" functions: anyplace where the implicit
> coercion to text would serve, you don't need a piggyback function,
> and introducing one just creates risks of
> can't-resolve-ambiguous-function failures. The overloaded miscellaneous
> functions are only justifiable to the extent that it's important to
> preserve "citext-ness" of the result of a function, which seems at
> best dubious for many of these. It is likewise pretty pointless AFAICS
> to introduce regex functions taking citext pattern arguments.

I added most of those as I wrote tests and they failed to find the functions. Once I added the functions, they worked. But I'll do an audit to make sure that I didn't inadvertently leave in any unneeded ones (I'm happy to have less code :-)).

> * Don't use the OPERATOR() notation when you don't need to.
> It's just clutter.

Sorry, don't know what you're referring to here. CREATE OPERATOR appears to require parens…

Thanks for the great feedback! I'll work on getting things all straightened out and a new patch in tonight.

Best,

David
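For reference on the commutator point: an operator's commutator is the operator that yields the same result with the arguments swapped, so <= and >= commute with each other while = is its own commutator. A sketch of the corrected definition, using the function and operator names from the patch (the selectivity functions shown are the standard built-in estimators):

    CREATE OPERATOR <= (
        LEFTARG    = CITEXT,
        RIGHTARG   = CITEXT,
        COMMUTATOR = >=,          -- a <= b is the same test as b >= a
        NEGATOR    = >,
        PROCEDURE  = citext_le,
        RESTRICT   = scalarltsel,
        JOIN       = scalarltjoinsel
    );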
There was some discussion earlier about whether the proposed regression tests for citext are suitable for use in contrib or not. After playing with them for awhile, I have to come down very firmly on the side of "not". I have these gripes:

1. The style is gratuitously different from every other regression test in the system. This is not a good thing. If it were an amazingly better way to do things, then maybe, but as far as I can tell the style pgTAP forces on you is really pretty darn poorly suited for SQL tests. You have to contort what could naturally be expressed in SQL as a table result into a scalar. Plus it's redundant with the expected-output file.

2. It's ridiculously slow; at least a factor of ten slower than doing equivalent tests directly in SQL. This is a very bad thing. Speed of regression tests matters a lot to those of us who run them a dozen times per day --- and I do not wish to discourage any developers who don't work that way from learning better habits ;-)

Because of #1 and #2 I find the use of pgTAP to be a nonstarter. I have a couple of gripes about the substance of the tests as well:

3. What you are mostly testing is not the behavior of citext itself, but the behavior of the underlying strcoll function. This is pretty much doomed to failure, first because the regression tests are intended to work in multiple locales (and we are *not* giving that up for citext), and second because the behavior of strcoll isn't all that portable across platforms even given allegedly similar locale settings (as we already found out yesterday). Sadly, I think you have to give up attempts to check the interesting multibyte cases and confine yourself to tests using ASCII strings.

4. A lot of the later test cases are equally uselessly testing whether piggybacking over text functions works. The odds of ever finding anything with those tests are not distinguishable from zero (unless the underlying text function is busted, which is not your responsibility to test). So I don't see any point in putting them into the standard regression package. (What maybe *would* be useful to test, but you didn't, is whether the result of a function is considered citext rather than text.)

I suggest something more like the attached as a suitable regression test. I got bored about halfway through and started to skim, so there might be a few tests toward the end of the original set that would be worth transposing into this one, but nothing jumped out at me.

			regards, tom lane

--
-- Test citext datatype
--

--
-- first, define the datatype.  Turn off echoing so that expected file
-- does not depend on contents of citext.sql.
--
SET client_min_messages = warning;
\set ECHO none
\i citext.sql
\set ECHO all
RESET client_min_messages;

-- Test the operators and indexing functions

-- Test = and <>.
SELECT 'a'::citext = 'a'::citext as t;
SELECT 'a'::citext = 'A'::citext as t;
SELECT 'a'::citext = 'A'::text as f;   -- text wins the discussion
SELECT 'a'::citext = 'b'::citext as f;
SELECT 'a'::citext = 'ab'::citext as f;
SELECT 'a'::citext <> 'ab'::citext as t;

-- Test > and >=
SELECT 'B'::citext > 'a'::citext as t;
SELECT 'b'::citext > 'A'::citext as t;
SELECT 'B'::citext > 'b'::citext as f;
SELECT 'B'::citext >= 'b'::citext as t;

-- Test < and <=
SELECT 'a'::citext < 'B'::citext as t;
SELECT 'a'::citext <= 'B'::citext as t;

-- Test implicit casting. citext casts to text, but not vice-versa.
SELECT 'a'::citext = 'a'::text as t;
SELECT 'A'::text <> 'a'::citext as t;

SELECT 'aardvark'::citext = 'aardvark'::citext as t;
SELECT 'aardvark'::citext = 'aardVark'::citext as t;

-- Check the citext_cmp() function explicitly.
SELECT citext_cmp('aardvark'::citext, 'aardvark'::citext) as zero;
SELECT citext_cmp('aardvark'::citext, 'aardVark'::citext) as zero;
SELECT citext_cmp('B'::citext, 'a'::citext) as one;

-- Do some tests using a table and index.

CREATE TEMP TABLE try (
   name citext PRIMARY KEY
);

INSERT INTO try (name)
VALUES ('a'), ('ab'), ('aba'), ('b'), ('ba'), ('bab'), ('AZ');

SELECT name, 'a' = name from try;
SELECT name, 'a' = name from try where name = 'a';

SELECT name, 'A' = name from try;
SELECT name, 'A' = name from try where name = 'A';
SELECT name, 'A' = name from try where name = 'A';

-- expected failures on duplicate key
INSERT INTO try (name) VALUES ('a');
INSERT INTO try (name) VALUES ('A');
INSERT INTO try (name) VALUES ('aB');

-- Test aggregate functions and sort ordering

CREATE TEMP TABLE srt (
   name CITEXT
);

INSERT INTO srt (name)
VALUES ('aardvark'),
       ('AAA'),
       ('aba'),
       ('ABC'),
       ('abd');

-- Check the min() and max() aggregates, with and without index.
set enable_seqscan = off;
SELECT MIN(name) from srt;
SELECT MAX(name) from srt;
reset enable_seqscan;
set enable_indexscan = off;
SELECT MIN(name) from srt;
SELECT MAX(name) from srt;
reset enable_indexscan;

-- Check sorting likewise
set enable_seqscan = off;
SELECT name FROM srt ORDER BY name;
reset enable_seqscan;
set enable_indexscan = off;
SELECT name FROM srt ORDER BY name;
reset enable_indexscan;

-- Check LIKE
SELECT name FROM srt WHERE name LIKE '%a%' ORDER BY name;
SELECT name FROM srt WHERE name NOT LIKE '%a%' ORDER BY name;
SELECT name FROM srt WHERE name LIKE '%A%' ORDER BY name;
SELECT name FROM srt WHERE name NOT LIKE '%A%' ORDER BY name;

-- ~ should be case-insensitive
SELECT name FROM srt WHERE name ~ '%a%' ORDER BY name;
SELECT name FROM srt WHERE name !~ '%a%' ORDER BY name;
SELECT name FROM srt WHERE name ~ '%A%' ORDER BY name;
SELECT name FROM srt WHERE name !~ '%A%' ORDER BY name;
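One way to cover the parenthetical suggestion above (checking that a function's result is still citext rather than text) without re-testing the underlying text functions is to lean on case-insensitive equality: if a result had decayed to text, comparison against a differently-cased literal would be case-sensitive and fail. A sketch of that style of test, assuming the module overloads lower() and upper() for citext:

    -- True only if lower()/upper() return citext; if they returned text,
    -- these comparisons would be case-sensitive and yield false.
    SELECT lower('ABC'::citext) = 'ABC' AS t;
    SELECT upper('abc'::citext) = 'abc' AS t;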
On Jul 12, 2008, at 14:57, Tom Lane wrote:

> There was some discussion earlier about whether the proposed regression
> tests for citext are suitable for use in contrib or not. After playing
> with them for awhile, I have to come down very firmly on the side of
> "not". I have these gripes:

You're nothing if not thorough, Tom.

> 1. The style is gratuitously different from every other regression test
> in the system. This is not a good thing. If it were an amazingly
> better way to do things, then maybe, but as far as I can tell the style
> pgTAP forces on you is really pretty darn poorly suited for SQL tests.
> You have to contort what could naturally be expressed in SQL as a table
> result into a scalar. Plus it's redundant with the expected-output file.

Sure. I wasn't aware of the existing regression test methodology when I wrote pgTAP and those tests. I'm fine to change them and use pgTAP for testing my app code, rather than PostgreSQL itself.

> 2. It's ridiculously slow; at least a factor of ten slower than doing
> equivalent tests directly in SQL. This is a very bad thing. Speed of
> regression tests matters a lot to those of us who run them a dozen times
> per day --- and I do not wish to discourage any developers who don't
> work that way from learning better habits ;-)

Hrm. I wonder why it's so slow? The test functions don't really do a lot. Anyway, I agree that they should perform well.

> Because of #1 and #2 I find the use of pgTAP to be a nonstarter.
> I have a couple of gripes about the substance of the tests as well:
>
> 3. What you are mostly testing is not the behavior of citext itself,
> but the behavior of the underlying strcoll function. This is pretty
> much doomed to failure, first because the regression tests are intended
> to work in multiple locales (and we are *not* giving that up for
> citext), and second because the behavior of strcoll isn't all that
> portable across platforms even given allegedly similar locale settings
> (as we already found out yesterday).

Yes, I *just* ran the tests on Ubuntu and opened my mail to ask about the bizarre differences when I saw this message. Thanks for answering my question before I asked it. :-)

> Sadly, I think you have to give up
> attempts to check the interesting multibyte cases and confine yourself
> to tests using ASCII strings.

Grr. Kind of defeats the purpose. Is there no infrastructure for testing multibyte functionality? Are test database clusters all built using SQL_ASCII and the C locale?

> 4. A lot of the later test cases are equally uselessly testing whether
> piggybacking over text functions works. The odds of ever finding
> anything with those tests are not distinguishable from zero (unless the
> underlying text function is busted, which is not your responsibility to
> test). So I don't see any point in putting them into the standard
> regression package. (What maybe *would* be useful to test, but you
> didn't, is whether the result of a function is considered citext rather
> than text.)

Well, I was doing test-driven development: I wrote the tests to ensure that I had added the functions for CITEXT properly, and when they passed, I could move on. As a unit tester, it'd feel weird for me to drop these. I'm not testing the underlying functions; I'm making sure that they work properly with CITEXT.

> I suggest something more like the attached as a suitable regression
> test. I got bored about halfway through and started to skim, so there
> might be a few tests toward the end of the original set that would be
> worth transposing into this one, but nothing jumped out at me.

Thanks! I'll work this in.

Best,

David

PS: I confirmed late yesterday that the memory leak I saw was with my version for 8.3, where I had my own str_tolower(). It clearly has a leak that I'll have to fix, but it has no bearing on the contribution to 8.4, which has no such leak. Thanks for running the test yourself to confirm.
On Jul 12, 2008, at 15:13, David E. Wheeler wrote:

>> Sadly, I think you have to give up
>> attempts to check the interesting multibyte cases and confine yourself
>> to tests using ASCII strings.
>
> Grr. Kind of defeats the purpose. Is there no infrastructure for
> testing multibyte functionality? Are test database clusters all built
> using SQL_ASCII and the C locale?

I just tried to take your modified tests and add multibyte tests that run only on OS X with en_US.UTF-8. They worked like this:

    CREATE OR REPLACE FUNCTION test_multibyte () RETURNS BOOLEAN AS $$
        SELECT version() ~ 'apple-darwin'
           AND (select setting from pg_settings where name = 'lc_collate') = 'en_US.UTF-8';
    $$ LANGUAGE SQL IMMUTABLE;

    SELECT 'À'::citext = 'À'::citext WHERE test_multibyte() = true;
    SELECT 'À'::citext = 'à'::citext WHERE test_multibyte() = true;

But then I realized that this would change the expected output depending on the platform, and thus make the tests fail. This is one reason why the inflexibility of the existing regression test model is a drag: it limits you to testing only what works everywhere! Grrr.

I'll remove all the multibyte character tests, but I have to say that, as a result, the CITEXT 1 module would likely pass all such tests, but it still isn't locale-aware. How can one add regressions to ensure that something truly is locale-aware?

Thanks,

David
On Jul 12, 2008, at 14:50, David E. Wheeler wrote:

>> * An explicit comment explaining that you're piggybacking on the I/O
>> functions (and some others) for "text" wouldn't be out of place.
>
> I've added SQL comments. Were you talking about a COMMENT?
>
>> * Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
>> (If you needed them, you'd need them on a lot more than these two.)
>> I'd be inclined to lose the COMMENTs on the functions too; again
>> these are about the least useful ones to comment on out of the
>> whole module.
>
> I wondered about that; those were copied from CITEXT 1. I've removed
> all GRANTs and COMMENTs.
>
>> * You should provide binary I/O (send/receive) functions, if you want
>> this to be an industrial-strength module. It's easy since you can
>> piggyback on text's.
>
> I'm confused. Is that not what the citextin and citextout functions
> are?
>
>> * The type declaration needs to say storage = extended, else the type
>> won't be toastable.
>
> Ah, good, thanks.
>
>> * The cast from bpchar to citext cannot be WITHOUT FUNCTION;
>> it needs to invoke rtrim1. Compare the bpchar to text cast.
>
> Where do I find that? I have trouble finding the SQL that creates
> the core types. :-(

Duh, you just told me. Added, thanks.

>> * <= is surely not its own commutator.
>
> Changed to >=.
>
>> You might try running the
>> opr_sanity regression test on this module to see if it finds any
>> other silliness. (Procedure: insert the citext definition script
>> into the serial_schedule list just ahead of opr_sanity, run tests,
>> make sure you understand the reason for any diffs in the opr_sanity
>> result. There will be at least one from the uses of text-related
>> functions for citext.)
>
> Thanks. Added to my list.
>
>> * I think you can and should drop all of the "size" functions and
>> a lot of the "miscellaneous" functions: anyplace where the implicit
>> coercion to text would serve, you don't need a piggyback function,
>> and introducing one just creates risks of
>> can't-resolve-ambiguous-function failures. The overloaded miscellaneous
>> functions are only justifiable to the extent that it's important to
>> preserve "citext-ness" of the result of a function, which seems at
>> best dubious for many of these. It is likewise pretty pointless AFAICS
>> to introduce regex functions taking citext pattern arguments.
>
> I added most of those as I wrote tests and they failed to find the
> functions. Once I added the functions, they worked. But I'll do an
> audit to make sure that I didn't inadvertently leave in any unneeded
> ones (I'm happy to have less code :-)).

Of the size functions, I was able to remove only this one and keep all of my pgTAP tests passing:

    CREATE FUNCTION textlen(citext)
    RETURNS int4
    AS 'textlen'
    LANGUAGE 'internal' IMMUTABLE STRICT;

When I deleted any of the others, I got errors like this:

    psql:sql/citext.sql:865: ERROR:  function length(citext) is not unique
    LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||...
                       ^
    HINT:  Could not choose a best candidate function. You might need to
    add explicit type casts.

I think you can see now why I wrote the tests: I wanted to ensure that CITEXT really does work (almost) just like TEXT.

I was able to eliminate *all* of the miscellaneous functions, but the upper() and lower() functions now return TEXT instead of CITEXT, which I don't think is exactly what we want, is it? For now, I've left upper() and lower() in. It just seems like more expected functionality.

>> * Don't use the OPERATOR() notation when you don't need to.
>> It's just clutter.
>
> Sorry, don't know what you're referring to here. CREATE OPERATOR
> appears to require parens…

Best,

David
"David E. Wheeler" <david@kineticode.com> writes: > When I deleted any of the others, I got errors like this: > psql:sql/citext.sql:865: ERROR: function length(citext) is not unique > LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||... > ^ > HINT: Could not choose a best candidate function. You might need to > add explicit type casts. Hmm. I think what that actually means is that the cast from citext to bpchar should be AS ASSIGNMENT rather than IMPLICIT. What is happening is that the system can't figure out whether to use length(text) or length(bpchar) when presented with a citext argument. I had been thinking yesterday that it would automatically prefer length(text) because text is a "preferred type", but after tracing through it I see that that doesn't happen because citext is not thought to be of the string category. (We really need a way to let user-defined types specify their category...) The fact that you need all these piggyback functions is a red flag because what it implies is that citext will not work nicely for any situation where both text and bpchar functions have been provided --- and that includes user-added functions, so it's hopeless to think that you can get to a solution this way. Downgrading the cast seems like the right thing to me. The implicit cast to varchar is a bit worrisome because it creates the same issue if someone has provided both varchar and text versions of a function. However, that seems a bit pointless given the lack of semantic difference, and I suspect that a lot of user-written functions come only in varchar variants --- so on balance my recommendation is to keep that one as implicit. regards, tom lane
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 12, 2008, at 12:17, Tom Lane wrote: >> * You should provide binary I/O (send/receive) functions, if you want >> this to be an industrial-strength module. It's easy since you can >> piggyback on text's. > I'm confused. Is that not what the citextin and citextout functions are? No, those are text I/O. You need analogues of textsend and textrecv too. >> You might try running the >> opr_sanity regression test on this module to see if it finds any >> other silliness. (Procedure: insert the citext definition script >> into the serial_schedule list just ahead of opr_sanity, run tests, >> make sure you understand the reason for any diffs in the opr_sanity >> result. There will be at least one from the uses of text-related >> functions for citext.) > Thanks. Added to my list. BTW, actually a better idea would be to put citext.sql at the front of the list and just run the whole main regression series with it present. typ_sanity and oidjoins might possibly find issues too. >> * Don't use the OPERATOR() notation when you don't need to. >> It's just clutter. > Sorry, don't know what you're referring to here. Some (not all) of your CREATE OPERATOR commands have things like NEGATOR = OPERATOR(!~), which seems unnecessary, and is certainly inconsistent. regards, tom lane
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 12, 2008, at 14:57, Tom Lane wrote: >> Sadly, I think you have to give up >> attempts to check the interesting multibyte cases and confine yourself >> to tests using ASCII strings. > Grr. Kind of defeats the purpose. Is there no infrastructure for > testing multibyte functionality? There's some stuff under src/test/locale and src/test/mb, though it doesn't get exercised regularly. The contrib tests are a particularly bad place for trying to do any locale-dependent testing, because we only support "make installcheck" which means there is no way to set the database locale --- you *have to* work with whatever locale is predetermined by the postmaster you're pointed at. I don't recall the reason for not supporting "make check" in the contrib modules; maybe it was just that preparing a test installation for each contrib module sounded too slow? In any case, what you'd need to pursue is having some additional tests available that are not executed by the standard contrib regression test sequence. (If we get to having per-database collations in 8.4 then integrating a test with a non-default collation would get a lot easier, of course; but for the moment I'm afraid you've got to work with what's there.) regards, tom lane
On Jul 13, 2008, at 10:16, Tom Lane wrote:

> Hmm. I think what that actually means is that the cast from citext to
> bpchar should be AS ASSIGNMENT rather than IMPLICIT. What is happening
> is that the system can't figure out whether to use length(text) or
> length(bpchar) when presented with a citext argument. I had been
> thinking yesterday that it would automatically prefer length(text)
> because text is a "preferred type", but after tracing through it I see
> that that doesn't happen because citext is not thought to be of the
> string category. (We really need a way to let user-defined types
> specify their category...)

That'd be nice.

> The fact that you need all these piggyback functions is a red flag
> because what it implies is that citext will not work nicely for any
> situation where both text and bpchar functions have been provided
> --- and that includes user-added functions, so it's hopeless to think
> that you can get to a solution this way. Downgrading the cast seems
> like the right thing to me.

Yes, that works for me. I've downgraded it and can now remove the size functions and all the tests still pass.

> The implicit cast to varchar is a bit worrisome because it creates the
> same issue if someone has provided both varchar and text versions of a
> function. However, that seems a bit pointless given the lack of
> semantic difference, and I suspect that a lot of user-written functions
> come only in varchar variants --- so on balance my recommendation is to
> keep that one as implicit.

Yes, I agree. Thanks for tracing this out, Tom, it never would have occurred to me -- and now I know more than I did before!

Best,

David
On Jul 13, 2008, at 10:19, Tom Lane wrote: > "David E. Wheeler" <david@kineticode.com> writes: >> On Jul 12, 2008, at 12:17, Tom Lane wrote: >>> * You should provide binary I/O (send/receive) functions, if you >>> want >>> this to be an industrial-strength module. It's easy since you can >>> piggyback on text's. > >> I'm confused. Is that not what the citextin and citextout functions >> are? > > No, those are text I/O. You need analogues of textsend and textrecv > too. Okay. >> Thanks. Added to my list. > > BTW, actually a better idea would be to put citext.sql at the front of > the list and just run the whole main regression series with it > present. > typ_sanity and oidjoins might possibly find issues too. Also added to my list. :-) > Some (not all) of your CREATE OPERATOR commands have things like > > NEGATOR = OPERATOR(!~), > > which seems unnecessary, and is certainly inconsistent. Oh, I hadn't even noticed those; I'd just copied them from citext 1. Fixed! Best, David
On Jul 13, 2008, at 10:19, Tom Lane wrote: >> I'm confused. Is that not what the citextin and citextout functions >> are? > > No, those are text I/O. You need analogues of textsend and textrecv > too. Should those return bytea and citext, respectively? IOW, are these right? CREATE OR REPLACE FUNCTION citextrecv(bytea) RETURNS citext AS 'textrecv' LANGUAGE 'internal' IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION citextsend(citext) RETURNS bytea AS 'textsend' LANGUAGE 'internal' IMMUTABLE STRICT; Thanks, David
On Jul 13, 2008, at 16:06, David E. Wheeler wrote: > Should those return bytea and citext, respectively? IOW, are these > right? > > CREATE OR REPLACE FUNCTION citextrecv(bytea) > RETURNS citext > AS 'textrecv' > LANGUAGE 'internal' IMMUTABLE STRICT; > > CREATE OR REPLACE FUNCTION citextsend(citext) > RETURNS bytea > AS 'textsend' > LANGUAGE 'internal' IMMUTABLE STRICT; To judge by the User-Defined Types docs, I was close. :-) I just changed the argument to citextrecv to "internal". Thanks, David
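For completeness, a sketch of the corrected binary I/O pair as described in this exchange (the receive function takes internal and returns the type, the send function returns bytea; volatility and strictness here simply follow the other piggyback declarations in the module):

    CREATE FUNCTION citextrecv(internal)
    RETURNS citext
    AS 'textrecv'
    LANGUAGE 'internal' IMMUTABLE STRICT;

    CREATE FUNCTION citextsend(citext)
    RETURNS bytea
    AS 'textsend'
    LANGUAGE 'internal' IMMUTABLE STRICT;

    -- and, in the type declaration:
    --   CREATE TYPE citext ( ..., RECEIVE = citextrecv, SEND = citextsend, ... );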
"David E. Wheeler" <david@kineticode.com> writes: > To judge by the User-Defined Types docs, I was close. :-) I just > changed the argument to citextrecv to "internal". Right. The APIs for send and recv aren't inverses. (Oh, the sins we'll commit in the name of performance ...) regards, tom lane
On Jul 13, 2008, at 10:19, Tom Lane wrote: >>> You might try running the >>> opr_sanity regression test on this module to see if it finds any >>> other silliness. (Procedure: insert the citext definition script >>> into the serial_schedule list just ahead of opr_sanity, run tests, >>> make sure you understand the reason for any diffs in the opr_sanity >>> result. There will be at least one from the uses of text-related >>> functions for citext.) > >> Thanks. Added to my list. > > BTW, actually a better idea would be to put citext.sql at the front of > the list and just run the whole main regression series with it > present. > typ_sanity and oidjoins might possibly find issues too. Um, stupid question (sorry, I'm not familiar with how the regression tests are organized): serial_schedule doesn't look like a file into which I can insert an SQL file. How would I do that? Thanks, David
On Jul 13, 2008, at 10:31, Tom Lane wrote: >> Grr. Kind of defeats the purpose. Is there no infrastructure for >> testing multibyte functionality? > > There's some stuff under src/test/locale and src/test/mb, though it > doesn't get exercised regularly. The contrib tests are a particularly > bad place for trying to do any locale-dependent testing, because we > only support "make installcheck" which means there is no way to set > the database locale --- you *have to* work with whatever locale is > predetermined by the postmaster you're pointed at. > > I don't recall the reason for not supporting "make check" in the > contrib > modules; maybe it was just that preparing a test installation for each > contrib module sounded too slow? In any case, what you'd need to > pursue > is having some additional tests available that are not executed by the > standard contrib regression test sequence. > > (If we get to having per-database collations in 8.4 then integrating a > test with a non-default collation would get a lot easier, of course; > but for the moment I'm afraid you've got to work with what's there.) Could I supply two comparison files, one for Mac OS X with en_US.UTF-8 and one for everything else, as described in the last three paragraphs here? http://www.postgresql.org/docs/current/static/regress-variant.html That way, I can at least make sure that the multibyte stuff *does* work. Even if it's tested on only one platform, the purpose is not to test a particular collation, but to test that CITEXT is actually sensitive to locale. Thanks, David
On Jul 12, 2008, at 14:57, Tom Lane wrote: > 4. A lot of the later test cases are equally uselessly testing whether > piggybacking over text functions works. The odds of ever finding > anything with those tests are not distinguishable from zero (unless > the > underlying text function is busted, which is not your responsibility > to > test). So I don't see any point in putting them into the standard > regression package. (What maybe *would* be useful to test, but you > didn't, is whether the result of a function is considered citext > rather > than text.) I'd like to keep these tests, since they ensure not just that the functions work but that they work with citext. Given what we found with length() and friends not working when there was an implicit cast to bpchar, I think it's valuable to continue to ensure that these functions work as expected with citext. Otherwise someone in the future might come along and make the cast to bpchar implicit again, and no tests would fail to tell him/her otherwise. These tests make good regressions. Thanks, David
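As a concrete illustration of the regression being argued for here: a couple of cheap queries are enough to catch a reintroduced implicit citext-to-bpchar cast, because they would once again fail with a "function ... is not unique" error. A sketch (the expected-output file would pin the results):

    SELECT length('AbCdE'::citext) AS five;       -- fails if length() becomes ambiguous again
    SELECT strpos('AbCdE'::citext, 'd') AS four;  -- exercises the implicit cast to text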
David E. Wheeler wrote: >> >> (If we get to having per-database collations in 8.4 then integrating a >> test with a non-default collation would get a lot easier, of course; >> but for the moment I'm afraid you've got to work with what's there.) > > Could I supply two comparison files, one for Mac OS X with en_US.UTF-8 > and one for everything else, as described in the last three paragraphs > here? > > http://www.postgresql.org/docs/current/static/regress-variant.html > > That way, I can at least make sure that the multibyte stuff *does* > work. Even if it's tested on only one platform, the purpose is not to > test a particular collation, but to test that CITEXT is actually > sensitive to locale. > > You can certainly add any tests you like. But the buildfarm does all its regression tests using an install done with --no-locale. Any test that requires some locale to work, or make sense, should probably not be in your Makefile's REGRESS target. Some time ago I raised the question of doing locale- and encoding-aware regression tests, and IIRC the consensus seemed to be that it was not worth the effort, but maybe we need to revisit that. We certainly seem to be doing a lot more work now to get Postgres to where it needs to be w.r.t. locale support, and we've cleaned up a lot of the encoding issues we had, so maybe we need to reflect that in our testing regime more. cheers andrew
Andrew Dunstan wrote:

> Some time ago I raised the question of doing locale- and encoding-aware
> regression tests, and IIRC the consensus seemed to be that it was not
> worth the effort, but maybe we need to revisit that. We certainly seem
> to be doing a lot more work now to get Postgres to where it needs to be
> w.r.t. locale support, and we've cleaned up a lot of the encoding issues
> we had, so maybe we need to reflect that in our testing regime more.

I agree. Maybe we need to unearth the src/test/mb or src/test/locale; perhaps have the buildfarm run those tests when they get to a useful point.

I notice that src/test/locale is not VPATH-aware, and I'm unsure if it's got a clear distinction of locales vs. encodings (it calls koi8-r a locale). src/test/mb is not VPATH aware either (and it doesn't have a Makefile at all), and is now broken by the fact that createdb refuses to create a database with mismatching encoding.

$ LC_ALL=C sh mbregress.sh
euc_jp .. createdb: database creation failed: ERROR:  encoding EUC_JP does not match server's locale fr_CA.UTF-8
DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
failed
sjis .. failed
euc_kr .. createdb: database creation failed: ERROR:  encoding EUC_KR does not match server's locale fr_CA.UTF-8
DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
failed
euc_cn .. createdb: database creation failed: ERROR:  encoding EUC_CN does not match server's locale fr_CA.UTF-8
DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
failed
euc_tw .. createdb: database creation failed: ERROR:  encoding EUC_TW does not match server's locale fr_CA.UTF-8
DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
failed
big5 .. failed
utf8 .. ok
mule_internal .. createdb: database creation failed: ERROR:  encoding MULE_INTERNAL does not match server's locale fr_CA.UTF-8
DETAIL:  The server's LC_CTYPE setting requires encoding UTF8.
failed

So this whole area would seem to need a lot of love ...

--
Alvaro Herrera                            http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 13, 2008, at 10:19, Tom Lane wrote: >> BTW, actually a better idea would be to put citext.sql at the front of >> the list and just run the whole main regression series with it >> present. >> typ_sanity and oidjoins might possibly find issues too. > Um, stupid question (sorry, I'm not familiar with how the regression > tests are organized): serial_schedule doesn't look like a file into > which I can insert an SQL file. How would I do that? You're really making it into another test. Just copy the citext.sql file into the sql/ subdirectory and add a "citext" entry to the schedule file. The last time I did this, I had to at least "touch" a corresponding expected file (expected/citext.out) or pg_regress would go belly-up without running the rest of the tests. There's no special need to make the expected file match, of course, since you can just ignore the "failure" report from that one test. regards, tom lane
"David E. Wheeler" <david@kineticode.com> writes: > Could I supply two comparison files, one for Mac OS X with en_US.UTF-8 > and one for everything else, as described in the last three paragraphs > here? The fallacy in that proposal is the assumption that there are only two behaviors out there. Let me recalibrate your thoughts a bit: so far I have tried citext on three different machines (Mac, Fedora 8, HPUX), and I got three different answers from those tests. That's despite endeavoring to make the database locales match ... which is less than trivial in itself because they use three slightly different spellings of "en_US.UTF8". Given that you were more or less deliberately testing corner cases, I think it's quite likely that the number of observable reactions from N platforms would be more nearly O(N) than O(1). In the real world, to the extent that we are able to control the locale of the regression tests, we make it "C" --- and to a large extent we can't control it at all, which means you have another uncontrolled variable besides platform. So in the current universe there is absolutely no value in submitting locale-specific tests for a contrib module. I see some discussion in the thread about improving the situation, but until we are able to decouple database locale from environment locale, I doubt we'll be able to do a whole lot about automating this kind of test. There are too many variables at the moment. regards, tom lane
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 12, 2008, at 14:57, Tom Lane wrote: >> 4. A lot of the later test cases are equally uselessly testing whether >> piggybacking over text functions works. > I'd like to keep these tests, since they ensure not just that the > functions work but that they work with citext. It might be reasonable to test a couple of them for that purpose. If your agenda is "test every function in the system that comes or might come in a bpchar variant", I think that's pointless. regards, tom lane
On Jul 14, 2008, at 06:05, Andrew Dunstan wrote:

> You can certainly add any tests you like. But the buildfarm does all
> its regression tests using an install done with --no-locale. Any test
> that requires some locale to work, or make sense, should probably not
> be in your Makefile's REGRESS target.

Well, unless the display of the characters would be broken (the build farm databases use UTF-8, no?), the output would look like this, which should more or less work (I'm hoping):

    SELECT citext_larger( 'Â'::citext, 'ç'::citext ) = 'ç' AS t
    WHERE false and test_multibyte();
     t
    ---
    (0 rows)

> Some time ago I raised the question of doing locale- and encoding-aware
> regression tests, and IIRC the consensus seemed to be that it was not
> worth the effort, but maybe we need to revisit that. We certainly seem
> to be doing a lot more work now to get Postgres to where it needs to be
> w.r.t. locale support, and we've cleaned up a lot of the encoding issues
> we had, so maybe we need to reflect that in our testing regime more.

Makes sense to me.

Best,

David
On Jul 14, 2008, at 07:24, Tom Lane wrote:

> "David E. Wheeler" <david@kineticode.com> writes:
>> Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
>> and one for everything else, as described in the last three paragraphs
>> here?
>
> The fallacy in that proposal is the assumption that there are only two
> behaviors out there.

Well, no, that's not the assumption at all. The assumption is that the type works properly with multibyte characters under multibyte-aware locales. So I want to have tests to ensure that such is true by having multibyte characters run under a very specific locale and platform. I don't really care what platform or locale; the point is to make sure that the type is actually multibyte-aware.

> Let me recalibrate your thoughts a bit: so far
> I have tried citext on three different machines (Mac, Fedora 8, HPUX),
> and I got three different answers from those tests. That's despite
> endeavoring to make the database locales match ... which is less than
> trivial in itself because they use three slightly different spellings of
> "en_US.UTF8".

<rant>

This is a truly pitiful state of affairs. Rhetorical question: Why is there no standardization of locales? I'm sure there are a lot of opinions out there (should all uppercase chars precede all lowercase chars, or be mixed in with lowercase chars?), but I should think that, in this day and age, there would be some sort of standard defining locales and how they work -- and to allow such opinions to be expressed by different locales, not in the same locale names on different platforms.

</rant>

> Given that you were more or less deliberately testing corner cases,
> I think it's quite likely that the number of observable reactions from
> N platforms would be more nearly O(N) than O(1).

To me they're not corner cases. To me it is just, "given a specific platform/locale, does CITEXT respect the locale's rules?" I don't care to test all platforms and locales (I'm not *that* stupid :-)).

> In the real world, to the extent that we are able to control the locale
> of the regression tests, we make it "C" --- and to a large extent we
> can't control it at all, which means you have another uncontrolled
> variable besides platform. So in the current universe there is
> absolutely no value in submitting locale-specific tests for a contrib
> module.

Then how do we know that it will continue to be locale-aware over time? Someone could replace the comparison function with one that just lowercases ASCII characters, like CITEXT 1 does, and no tests would fail. How do you prevent that from happening without being hypervigilant (and never leaving the project, I might add)?

> I see some discussion in the thread about improving the situation, but
> until we are able to decouple database locale from environment locale,
> I doubt we'll be able to do a whole lot about automating this kind
> of test. There are too many variables at the moment.

Is the decoupling of database locale from environment locale likely to happen anytime soon? Now that I've written CITEXT, I dare say that such might become my top-desired feature (aside from replication).

Thanks for the discussion, much appreciated, and I'm learning a ton. I retain the right to be opinionated, however. ;-)

Best,

David
On Jul 14, 2008, at 07:26, Tom Lane wrote: >> I'd like to keep these tests, since they ensure not just that the >> functions work but that they work with citext. > > It might be reasonable to test a couple of them for that purpose. > If your agenda is "test every function in the system that comes or > might come in a bpchar variant", I think that's pointless. Or a varchar variant, or where such a variant might be added in the future. To my mind, it's important to have good coverage in my unit tests to ensure that things continue to work exactly the same over time. So, since the tests are already written, and are unlikely to add more than a few milliseconds to test runtime, can you at least agree that such tests are harmless? Updated patch later today. Thanks, David
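For illustration, the kind of coverage test being argued for here is nothing more elaborate than calling an existing string function with citext arguments and pinning the result down in the expected output, so that a future change to casts or function resolution shows up as a regression diff. A couple of examples of the shape, assuming the module's implicit citext-to-text cast is in place (whether any given function also behaves case-insensitively depends on which variants the module actually installs):

    SELECT lower('Aa'::citext) = 'aa' AS t;
    SELECT char_length('jose'::citext) = 4 AS t;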
On Jul 14, 2008, at 06:49, Alvaro Herrera wrote: > So this whole area would seem to need a lot of love ... Do the tests control for platforms, as well, since locales with the same spellings can vary on different platforms? Or even on different versions of the same platforms? Thanks, David
David E. Wheeler wrote: > On Jul 14, 2008, at 06:05, Andrew Dunstan wrote: > >> You can certainly add any tests you like. But the buildfarm does all >> its regression tests using an install done with --no-locale. Any test >> that requires some locale to work, or make sense, should probably not >> be in your Makefile's REGRESS target. > > Well, unless the display of the characters would be broken (the build > farm databases use UTF-8, no?), No. --no-locale is the same as --locale=C which means the encoding is SQL_ASCII. cheers andrew
On Jul 14, 2008, at 10:55, Andrew Dunstan wrote: > No. > > --no-locale is the same as --locale=C which means the encoding is > SQL_ASCII. I've used --no-locale --encoding UTF-8 many times. But if the regression tests run with SQL_ASCII…well, I'm just amazed that there haven't been more unexpected side-effects from source code changes without controlling for such variables in tests. Seems like a lot to keep in one's head. Anyway, are tests for contrib modules run on the build farms? If so, I could still potentially get around this issue by turning off ECHO. Best, David
"David E. Wheeler" <david@kineticode.com> writes: > Well, unless the display of the characters would be broken (the build > farm databases use UTF-8, no?), No, they use C. regards, tom lane
On Jul 14, 2008, at 10:58, Tom Lane wrote: >> Well, unless the display of the characters would be broken (the build >> farm databases use UTF-8, no?), > > No, they use C. Um, you mean SQL_ASCII? Best, David
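For what it's worth, a --no-locale install is easy to recognize from inside a session, which is one way to confirm what the buildfarm animals are actually running under:

    SELECT current_setting('server_encoding') AS encoding,
           current_setting('lc_collate')      AS collation;
    -- On a --no-locale install this reports SQL_ASCII and C.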
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 14, 2008, at 07:24, Tom Lane wrote: >> The fallacy in that proposal is the assumption that there are only two >> behaviors out there. > Well, no, that's not the assumption at all. The assumption is that the > type works properly with multibyte characters under multibyte-aware > locales. So I want to have tests to ensure that such is true by having > multibyte characters run under a very specific locale and platform. [ shrug... ] Seems pretty useless to me: we already know that it works for you. The point of a regression test in my mind is to make sure that it works for everybody. Given the platform variations involved in strcoll's behavior, the definition of "works for everybody" is going to be pretty darn narrowly circumscribed anyway, and thus I don't have a big problem with restricting the tests to ASCII cases. Let me put it another way: if we test on another platform and find out that strcoll's behavior is different there, are you going to fix that version of strcoll? No, you're not. So you might as well just test the behavior of the code that's actually under your control. regards, tom lane
On Jul 14, 2008, at 11:06, Tom Lane wrote: > [ shrug... ] Seems pretty useless to me: we already know that it > works > for you. The point of a regression test in my mind is to make sure > that > it works for everybody. Given the platform variations involved in > strcoll's behavior, the definition of "works for everybody" is going > to > be pretty darn narrowly circumscribed anyway, and thus I don't have a > big problem with restricting the tests to ASCII cases. Neither do I, as long as there is *some* context to ensure that the type remains locale-aware. We only know that it works for me because I've written the tests. > Let me put it another way: if we test on another platform and find out > that strcoll's behavior is different there, are you going to fix that > version of strcoll? No, you're not. So you might as well just test > the > behavior of the code that's actually under your control. You don't seem to understand what I'm suggesting: I'm not talking about testing strcoll. I'm talking about making sure that citext *uses* strcoll. Whether or not strcoll actually works properly is not my concern. Best, David
"David E. Wheeler" <david@kineticode.com> writes: > On Jul 14, 2008, at 11:06, Tom Lane wrote: >> Let me put it another way: if we test on another platform and find out >> that strcoll's behavior is different there, are you going to fix that >> version of strcoll? No, you're not. So you might as well just test >> the >> behavior of the code that's actually under your control. > You don't seem to understand what I'm suggesting: I'm not talking > about testing strcoll. I'm talking about making sure that citext > *uses* strcoll. This seems pointless, as well as essentially impossible to enforce by black-box testing. Nobody is likely to consider a patch that removes such obviously basic functionality of the module. I think you're worrying about testing the wrong things --- and as evidence I'd note the serious errors that slipped by your testing (including the fact that the last submitted version would still claim that 'a' = 'ab'). What we generally ask a regression test to do is catch portability problems (which is certainly a real-enough issue here, but we don't know how to address it well) or catch foreseeable breakage (such as someone introducing a cast that breaks resolution of citext function calls). The methodology can't catch everything, and trying to push it beyond what it can do is just a time sink for effort that can more usefully be spent other ways, such as code-reading. regards, tom lane
On Jul 14, 2008, at 11:49, Tom Lane wrote: >> You don't seem to understand what I'm suggesting: I'm not talking >> about testing strcoll. I'm talking about making sure that citext >> *uses* strcoll. > > This seems pointless, as well as essentially impossible to enforce by > black-box testing. Nobody is likely to consider a patch that removes > such obviously basic functionality of the module. Never underestimate the human capacity for incompetence. One of my mottos. > I think you're > worrying about testing the wrong things --- and as evidence I'd note > the > serious errors that slipped by your testing (including the fact that > the > last submitted version would still claim that 'a' = 'ab'). Um, say what? I had that problem at one time, but it was when I was writing my own lower() function, which was shitty (I wasn't creating a nul-terminated string). I don't recall that being in any patch I sent to the list, and indeed you wrote that very test in the revised test script you sent in. At any rate, I make no claims that my tests properly covered everything. There is a lot I didn't know to test, but I'm learning more all the time. That's why this code review has been so immensely valuable, both for me and for CITEXT. > What we > generally ask a regression test to do is catch portability problems > (which is certainly a real-enough issue here, but we don't know how > to address it well) or catch foreseeable breakage (such as someone > introducing a cast that breaks resolution of citext function calls). > The methodology can't catch everything, and trying to push it beyond > what it can do is just a time sink for effort that can more usefully > be spent other ways, such as code-reading. I guess what I'm thinking of is not regression tests but unit tests. And with unit testing, one of the goals is coverage. Good coverage has saved my ass many times in the past, even catching changes that, in hindsight, should obviously never have been attempted. Perhaps it'd be worth thinking about creating a project for unit testing PostgreSQL in controlled environments? Maybe having tests that can contain proper conditionals? Anyway, it seems that, given the limitations of the current testing system with regard to testing multibyte characters, the issue is moot. I'll throw in a few tests that test multibyte characters, but I'll comment them out and just uncomment them for individual test runs on my box for the purposes of my own sanity going forward. Thanks, David
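A sketch of how those commented-out multibyte cases might sit in the test file, reusing the citext_larger() example from earlier in the thread; they are kept as comments precisely so the --no-locale buildfarm runs are unaffected, and are easy to uncomment for a local run under a UTF-8 locale:

    -- Locale-dependent checks; uncomment only under a multibyte-aware locale.
    -- SELECT 'À'::citext = 'à'::citext AS t;
    -- SELECT citext_larger( 'Â'::citext, 'ç'::citext ) = 'ç' AS t;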
On Jul 12, 2008, at 15:13, David E. Wheeler wrote: >> 2. It's ridiculously slow; at least a factor of ten slower than doing >> equivalent tests directly in SQL. This is a very bad thing. Speed >> of >> regression tests matters a lot to those of us who run them a dozen >> times >> per day --- and I do not wish to discourage any developers who don't >> work that way from learning better habits ;-) > > Hrm. I wonder why it's so slow? The test functions don't really do > a lot. Anyway, I agree that they should perform well. Just as an FYI, I've just moved all the tests to regular SQL instead of using pgTAP. The difference in runtime is: psql -Xd try -f sql/citext.sql 0.03s user 0.02s system 19% cpu 0.253 total psql -Xd try -f sql/citext.sql 0.03s user 0.02s system 4% cpu 1.298 total So it's close to a factor of five, though you can subtract about 0.125 seconds for the time it takes to load the pgTAP functions. The pgTAP tests *are* doing a lot more work, but I'm sure that they could be made a lot more efficient, though of course the TAP functions will always introduce some overhead. One just needs to decide whether the tradeoffs are worth it. Best, David
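For anyone comparing the two styles, the difference looks roughly like this; is() is pgTAP's equality assertion, while the plain form leans on the expected-output diff instead of a per-test function call, which is presumably where much of the runtime difference comes from:

    -- pgTAP style:
    -- SELECT is( 'a'::citext, 'A'::citext, 'citext comparisons are case-insensitive' );

    -- plain pg_regress style:
    SELECT 'a'::citext = 'A'::citext AS t;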
On Jul 14, 2008, at 07:08, Tom Lane wrote: > You're really making it into another test. Just copy the citext.sql > file into the sql/ subdirectory and add a "citext" entry to the > schedule > file. > > The last time I did this, I had to at least "touch" a corresponding > expected file (expected/citext.out) or pg_regress would go belly-up > without running the rest of the tests. There's no special need to > make the expected file match, of course, since you can just ignore > the "failure" report from that one test. Okay, I copied citext.sql into src/test/regress/sql and then added "test: citext" to the top of src/test/regress/serial_schedule. Then I ran `make check`. All tests passed, but I don't think that citext was tested. Do I need to install the server, build a cluster, and run `make installcheck`? Thanks for the hand-holding. Best, David
"David E. Wheeler" <david@kineticode.com> writes: > Okay, I copied citext.sql into src/test/regress/sql and then added > "test: citext" to the top of src/test/regress/serial_schedule. Then I > ran `make check`. All tests passed, but I don't think that citext was > tested. > Do I need to install the server, build a cluster, and run `make > installcheck`? Yeah, probably. I don't think the "make check" path will support it because it doesn't install contrib into the temp installation. (You'd also need to have put the extra entry in parallel_schedule not serial_schedule, but it's not gonna work anyway.) regards, tom lane
On Jul 15, 2008, at 07:09, Tom Lane wrote: > Yeah, probably. I don't think the "make check" path will support it > because it doesn't install contrib into the temp installation. > (You'd also need to have put the extra entry in parallel_schedule > not serial_schedule, but it's not gonna work anyway.) Well now that was cool to see. I got some failures, of course, but nothing stands out to me as an obvious bug. I attach the diffs file (with the citext.sql failure removed) for your perusal. What would be the best way for me to resolve those permission issues? Or do they matter for sanity-checking citext? Thanks, David
Attachment
"David E. Wheeler" <david@kineticode.com> writes: > Well now that was cool to see. I got some failures, of course, but > nothing stands out to me as an obvious bug. I attach the diffs file > (with the citext.sql failure removed) for your perusal. What would be > the best way for me to resolve those permission issues? Don't run the tests in a read-only directory, perhaps. > Or do they matter for sanity-checking citext? Hard to tell --- I'd suggest trying to get a clean run. As for what you have, the first diff hunk suggests you've got the wrong function properties for citextsend/citextrecv. regards, tom lane
On Jul 15, 2008, at 12:56, Tom Lane wrote: > Don't run the tests in a read-only directory, perhaps. Yes, I changed the owner to the postgres system user and that did the trick. > >> Or do they matter for sanity-checking citext? > > Hard to tell --- I'd suggest trying to get a clean run. As for what > you > have, the first diff hunk suggests you've got the wrong function > properties for citextsend/citextrecv. Here's the new diff: *** ./expected/opr_sanity.out Mon Jul 14 21:55:49 2008 --- ./results/opr_sanity.out Tue Jul 15 17:41:03 2008 *************** *** 87,94 **** p1.provolatile != p2.provolatile OR p1.pronargs != p2.pronargs); oid | proname | oid | proname ! -----+---------+-----+--------- ! (0 rows) -- Look for uses of different type OIDs in the argument/result type fields -- for different aliases of the same built-in function. --- 87,96 ---- p1.provolatile != p2.provolatile OR p1.pronargs != p2.pronargs); oid | proname | oid | proname ! ------+----------+-------+------------ ! 2414 | textrecv | 87258 | citextrecv ! 2415 | textsend | 87259 | citextsend ! (2 rows) -- Look for uses of different type OIDs in the argument/result type fields -- for different aliases of the same built-in function. *************** *** 110,117 **** prorettype | prorettype ------------+------------ 25 | 1043 1114 | 1184 ! (2 rows) SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0] FROM pg_proc AS p1, pg_proc AS p2 --- 112,120 ---- prorettype | prorettype ------------+------------ 25 | 1043 + 25 | 87255 1114 | 1184 ! (3 rows) SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0] FROM pg_proc AS p1, pg_proc AS p2 *************** *** 124,133 **** -------------+------------- 25 | 1042 25 | 1043 1114 | 1184 1560 | 1562 2277 | 2283 ! (5 rows) SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1] FROM pg_proc AS p1, pg_proc AS p2 --- 127,138 ---- -------------+------------- 25 | 1042 25 | 1043 + 25 | 87255 + 1042 | 87255 1114 | 1184 1560 | 1562 2277 | 2283 ! (7 rows) SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1] FROM pg_proc AS p1, pg_proc AS p2 *************** *** 139,148 **** proargtypes | proargtypes -------------+------------- 23 | 28 1114 | 1184 1560 | 1562 2277 | 2283 ! (4 rows) SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2] FROM pg_proc AS p1, pg_proc AS p2 --- 144,154 ---- proargtypes | proargtypes -------------+------------- 23 | 28 + 25 | 87255 1114 | 1184 1560 | 1562 2277 | 2283 ! (5 rows) SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2] FROM pg_proc AS p1, pg_proc AS p2 *************** *** 305,311 **** 142 | 25 | 0 | a 142 | 1043 | 0 | a 142 | 1042| 0 | a ! (6 rows) -- **************** pg_operator **************** -- Look for illegal values in pg_operator fields. --- 311,318 ---- 142 | 25 | 0 | a 142 | 1043 | 0 | a 142 | 1042| 0 | a ! 87255 | 1042 | 0 | a ! (7 rows) -- **************** pg_operator **************** -- Look for illegal values in pg_operator fields. ====================================================================== So I guess my question is: what is wrong with the properties for citextsend/citextrecv and what else might these failures be indicating is wrong? CREATE OR REPLACE FUNCTION citextrecv(internal) RETURNS citext AS 'textrecv' LANGUAGE 'internal' IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION citextsend(citext) RETURNS bytea AS 'textsend' LANGUAGE 'internal' IMMUTABLE STRICT; CREATE TYPE citext ( INPUT = citextin, OUTPUT = citextout, RECEIVE = citextrecv, SEND = citextsend, INTERNALLENGTH = VARIABLE, STORAGE = extended ); Thanks, David
"David E. Wheeler" <david@kineticode.com> writes: > So I guess my question is: what is wrong with the properties for > citextsend/citextrecv [ checks catalogs... ] textsend and textrecv are marked STABLE not IMMUTABLE. I am not totally sure about the reasoning offhand --- it might be because their behavior depends on client_encoding. > and what else might these failures be indicating > is wrong? I think the other diffs are okay, they just reflect the fact that you're depending on binary equivalence of text and citext. regards, tom lane
On Jul 15, 2008, at 20:26, Tom Lane wrote: > "David E. Wheeler" <david@kineticode.com> writes: >> So I guess my question is: what is wrong with the properties for >> citextsend/citextrecv > > [ checks catalogs... ] textsend and textrecv are marked STABLE not > IMMUTABLE. I am not totally sure about the reasoning offhand --- it > might be because their behavior depends on client_encoding. Thanks. Looks like maybe the xtypes docs need to be updated? http://www.postgresql.org/docs/8.3/static/xtypes.html Anyway, changing them to "STABLE STRICT" appears to have done the trick (diff attached). >> and what else might these failures be indicating >> is wrong? > > I think the other diffs are okay, they just reflect the fact that > you're > depending on binary equivalence of text and citext. Great, thanks. And with that, I think I'm just about ready to submit a new version of the patch, coming up shortly. Best, David
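Based on that, the corrected declarations would look something like this -- a sketch matching the "STABLE STRICT" change described above; the actual patch may differ in spelling:

    CREATE OR REPLACE FUNCTION citextrecv(internal)
    RETURNS citext
    AS 'textrecv'
    LANGUAGE 'internal' STABLE STRICT;

    CREATE OR REPLACE FUNCTION citextsend(citext)
    RETURNS bytea
    AS 'textsend'
    LANGUAGE 'internal' STABLE STRICT;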