Thread: Latest on CITEXT 2.0
Howdy, I just wanted to report the latest on my pet project: implementing a new case-insensitive text type, "citext", to be locale-aware and to build and run on PostgreSQL 8.3. I'm not much of a C programmer (this is only the second time I've written *anything* in C), so I also have a few questions about my code, best practices, coverage, etc. You can grab the latest here: https://svn.kineticode.com/citext/trunk/ BTW, the tests in sql/citext.sql use the pgtap.sql file to run TAP regression tests. So you can run them using `make installcheck` or `make test`. The latter requires that pg_prove be installed; you can get it here: https://svn.kineticode.com/pgtap/trunk/ Anyway, I think I've got it pretty close to done. The tests cover a lot of stuff -- nearly everything I could figure out, anyway. But there are a few gaps. As a result, I'd appreciate a little help with these questions, all in the name of making this a solid data type suitable for use on production systems: * There seem to still be some implicit CASTS to text that I'd like to duplicate. For example, select '192.168.1.2'::cidr::text;` works, but `select '192.168.1.2'::cidr::citext;` does not. Where can I find the C functions that do these casts for TEXT so that I can put them to work for citext, too? The internal cast functions used in the old citext distribution don't exist at all on 8.3. * There are casts from text that I'd also like to harness for use by citext, like `cidr(text)`. Where can I find these C functions as well? (The upshot of this and the previous points is that I'd like citext to be as compatible with TEXT as possible, and I just need to figure out how to fill in the gaps in that compatibility.) * Regular expression and LIKE comparisons using the the operators properly work case-insensitively, but functions like replace() and regexp_replace() do not. Should they? and if so, how can I make them do so? * The tests assume that LC_COLLATE is set to en_US.UTF-8. Does that work well for standard PostgreSQL regression tests? How are locale- sensitive tests run in core regression tests? * As for my C programming, well, what's broken? I'm especially concerned that I pfree variables appropriately, but I'm not at all clear on what needs to be freed. Martijn mentioned before that btree comparison functions free memory, but I'm such a C n00b that I don't know what that actually means for my implementation. I'd actually appreciate a bit of pedantry here. :-) * Am I in fact getting an appropriate nul-terminated string in my cilower() function using this code? char * str = DatumGetCString( DirectFunctionCall1( textout, PointerGetDatum( arg ) ) ); Those are all the questions I had about my implementation. I'd like to get this thing done and released soon, so that I can be done with this particular Yak and get back to what I'm *supposed* to be doing with my time. BTW, would there be any interest in this code going into contrib/ in the distribution? I think that, if we can ensure that it works just like LOWER() = LOWER(), but without requiring that code, then it would be a great type to point people to to use instead of that SQL hack (with all the usual caveats about it being locale-sensitive and not canonically case-insensitive in the Unicode sense). If so, I'd be happy to make whatever changes are necessary to make it fit in with the coding and organization standards of the core and to submit it. But please, don't expect a civarchar type from me anytime soon. ;-) Many thanks, David
On Wed, Jun 25, 2008 at 11:47:39AM -0700, David E. Wheeler wrote: > * There seem to still be some implicit CASTS to text that I'd like to > duplicate. For example, select '192.168.1.2'::cidr::text;` works, but > `select '192.168.1.2'::cidr::citext;` does not. Where can I find the C > functions that do these casts for TEXT so that I can put them to work > for citext, too? The internal cast functions used in the old citext > distribution don't exist at all on 8.3. Hmm, casts to/from text are somewhat "magic" in postgres. They are implemented by calling the usual type input/output function. I have no idea how to extend that to other types. > * There are casts from text that I'd also like to harness for use by > citext, like `cidr(text)`. Where can I find these C functions as well? > (The upshot of this and the previous points is that I'd like citext to > be as compatible with TEXT as possible, and I just need to figure out > how to fill in the gaps in that compatibility.) As above, they're probably not as seperate functions but a special hack inthe casting code. > * Regular expression and LIKE comparisons using the the operators > properly work case-insensitively, but functions like replace() and > regexp_replace() do not. Should they? and if so, how can I make them > do so? Regexes have case-insensetive modifiers, don't they? In which case I don't think it'd be becessary. > * As for my C programming, well, what's broken? I'm especially > concerned that I pfree variables appropriately, but I'm not at all > clear on what needs to be freed. Martijn mentioned before that btree > comparison functions free memory, but I'm such a C n00b that I don't > know what that actually means for my implementation. I'd actually > appreciate a bit of pedantry here. :-) When creating an index, your comparison functions are going ot be called O(N log N) times. If they leak into a context that isn't regularly freed you may have a problem. I'd suggest loking at how the text comparisons do it. PG_FREE_IF_COPY() is probably a good idea because the incoming tuples may be detoasted. > * Am I in fact getting an appropriate nul-terminated string in my > cilower() function using this code? > > char * str = DatumGetCString( > DirectFunctionCall1( textout, PointerGetDatum( arg ) ) > ); Yes. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Please line up in a tree and maintain the heap invariant while > boarding. Thank you for flying nlogn airlines.
On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote: > Hmm, casts to/from text are somewhat "magic" in postgres. They are > implemented by calling the usual type input/output function. I have no > idea how to extend that to other types. Oh. Okay. Perhaps I won't worry about it just now, then. > As above, they're probably not as seperate functions but a special > hack > inthe casting code. Okay. > Regexes have case-insensetive modifiers, don't they? In which case I > don't think it'd be becessary. They do, but replace(), split_part(), strpos(), and translate() do not. > When creating an index, your comparison functions are going ot be > called O(N log N) times. If they leak into a context that isn't > regularly freed you may have a problem. I'd suggest loking at how the > text comparisons do it. PG_FREE_IF_COPY() is probably a good idea > because the incoming tuples may be detoasted. Okay. I'll have a look at varlena.c, then. >> * Am I in fact getting an appropriate nul-terminated string in my >> cilower() function using this code? >> >> char * str = DatumGetCString( >> DirectFunctionCall1( textout, PointerGetDatum( arg ) ) >> ); > > Yes. Great, I thought so (since it made the failures go away). Many thanks. David
Thanks a million for your answers, Martijn. I just have some more stupid questions, if you could bear with me. On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote: > When creating an index, your comparison functions are going ot be > called O(N log N) times. If they leak into a context that isn't > regularly freed you may have a problem. I'd suggest loking at how the > text comparisons do it. PG_FREE_IF_COPY() is probably a good idea > because the incoming tuples may be detoasted. Okay, I see that text_cmp in varlena doesn't use PG_FREE_IF_COPY(), and neither do text_smaller nor text_larger (which just dispatch to text_cmp anyway). The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing it's these functions you're talking about. However, my implementation just looks like this: Datum citext_ne (PG_FUNCTION_ARGS) { // Fast path for different-length inputs. Okay for canonical equivalence? if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1))) PG_RETURN_BOOL( 1 ); PG_RETURN_BOOL(citextcmp( PG_ARGS ) != 0 ); } I don't *thinkI any variables are copied there. citextcmp() is just this: int citextcmp (PG_FUNCTION_ARGS) { // XXX These are all just references to existing structures, right? text * left = PG_GETARG_TEXT_P(0); text * right = PG_GETARG_TEXT_P(1); return varstr_cmp( cilower( left ), VARSIZE_ANY_EXHDR(left), cilower( right ), VARSIZE_ANY_EXHDR(right) ); } Again, no copying. cilower() does copy: int index, len; char * result; index = 0; len = VARSIZE(arg) - VARHDRSZ; result = (char *) palloc( strlen( str ) + 1 ); for (index = 0; index <= len; index++) { result[index] = tolower((unsigned char) str[index] ); } // XXXI don't need to pfree result if I'm returning it, right? return result; But the copied value is returned. Hrm…it should probably be pfreed somewhere, yes? So I'm wondering if I should change citextcmp to pfree values? Something like this: text * left = PG_GETARG_TEXT_P(0); text * right = PG_GETARG_TEXT_P(1); char * lcstr = cilower( left ); char* rcstr = cilower( right ); int result = varstr_cmp( cilower( left ), VARSIZE_ANY_EXHDR(left), cilower( right ), VARSIZE_ANY_EXHDR(right) ); pfree( lcstr ); pfree( rcstr ); return result; This is the only function that calls cilower(). And I *think* it's the only place where values are copied or memory is allocated needing to be freed. Does that sound right to you? On a side note, I've implemented this pretty differently from how the text functions are implemented in varlena.c, just to try to keep things succinct. But I'm wondering now if I shouldn't switch back to the style used by varlena.c, if only to keep the style the same, and thus perhaps to increase the chances that citext would be a welcome contrib addition. Thoughts? Many thanks again. You're a great help to this C n00b. Best, David
David E. Wheeler wrote: > The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing it's > these functions you're talking about. However, my implementation just > looks like this: > > Datum citext_ne (PG_FUNCTION_ARGS) { > // Fast path for different-length inputs. Okay for canonical > equivalence? > if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1))) > PG_RETURN_BOOL( 1 ); > PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 ); > } PG_GETARG_TEXT_P can detoast the datum, which creates a copy. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Jun 26, 2008, at 09:19, Alvaro Herrera wrote: > PG_GETARG_TEXT_P can detoast the datum, which creates a copy. Thanks. I've just completely refactored things to look more like the approach taken by varlena.c, both in terms of when stuff gets freed and in terms of coding style. It's more verbose, but I feel much more comfortable with memory management now that I'm following a known implementation more closely. :-) So now I've changed citextcmp to this: static int citextcmp (text * left, text * right) { char * lcstr, * rcstr; int result; lcstr = cilower( left ); rcstr = cilower( right ); result = varstr_cmp( cilower( left ), VARSIZE_ANY_EXHDR(left), cilower( right ), VARSIZE_ANY_EXHDR(right) ); pfree( lcstr ); pfree( rcstr ); return result; } And now all of the operator functions are freeing memory using PG_FREE_IF_COPY() like this: Datum citext_cmp(PG_FUNCTION_ARGS) {text * left = PG_GETARG_TEXT_PP(0);text * right = PG_GETARG_TEXT_PP(1);int32 result; result = citextcmp(left, right); PG_FREE_IF_COPY(left, 0);PG_FREE_IF_COPY(right, 1); PG_RETURN_INT32( result ); } The only functions that don't do that are citext_smaller() and citext_larger(): Datum citext_smaller(PG_FUNCTION_ARGS) {text * left = PG_GETARG_TEXT_PP(0);text * right = PG_GETARG_TEXT_PP(1);text * result; result = citextcmp(left, right) < 0 ? left : right;PG_RETURN_TEXT_P(result); } This is just how varlena.c does it, but I am wondering if something *should* be freed there. Thanks a bunch! Best, David
"David E. Wheeler" <david@kineticode.com> writes: > Datum citext_ne (PG_FUNCTION_ARGS) { > // Fast path for different-length inputs. Okay for canonical > equivalence? > if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1))) > PG_RETURN_BOOL( 1 ); > PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 ); > } BTW, I don't think you can use that same-length optimization for citext. There's no reason to think that upper/lowercase pairs will have the same length all the time in multibyte encodings. regards, tom lane
On Jun 26, 2008, at 10:02, Tom Lane wrote: > BTW, I don't think you can use that same-length optimization for > citext. There's no reason to think that upper/lowercase pairs will > have the same length all the time in multibyte encodings. I was wondering about that. I had been thinking of canonically- equivalent stings and combining marks. Doing a quick test it looks like combining marks are not equivalent. For example, this returns false: SELECT 'Ä'::text = 'Ä'::text; At least with en_US.UTF-8. Hrm. It looks like my client makes them both canonical, so I've attached a script demonstrating this issue. Anyway, I was aware of different byte counts for canonical equivalence, but not for differences between upper- and lowercase characters. I'd certainly defer to your knowledge of how these things truly work in PostgreSQL, Tom, and can of course easily remove that optimization. So, are your certain about this? Many thanks, David
Attachment
David, > Thanks. I've just completely refactored things to look more like the > approach taken by varlena.c, both in terms of when stuff gets freed > and in terms of coding style. It's more verbose, but I feel much more > comfortable with memory management now that I'm following a known > implementation more closely. :-) Will this be ready for the July CommitFest? -- Josh Berkus PostgreSQL @ Sun San Francisco
On Jun 26, 2008, at 12:03, Josh Berkus wrote: > Will this be ready for the July CommitFest? When is it due, July 1? If so, yes, it should be. I could use a close review by someone who knows this shit a whole lot better than I do. Thanks, David
David, > When is it due, July 1? If so, yes, it should be. I could use a close > review by someone who knows this shit a whole lot better than I do. Well, that's what the commitfest is for. Go ahead and add yourself once you post the new patch. -- Josh Berkus PostgreSQL @ Sun San Francisco
"David E. Wheeler" <david@kineticode.com> writes: > So, are your certain about this? See Turkish --- in that locale i and I are not an upper/lower pair, instead they pair with some non-ASCII letters. There are likely other cases but that's the counterexample I remember. regards, tom lane
On Jun 26, 2008, at 13:59, Tom Lane wrote: > "David E. Wheeler" <david@kineticode.com> writes: >> So, are your certain about this? > > See Turkish --- in that locale i and I are not an upper/lower pair, > instead they pair with some non-ASCII letters. There are likely > other cases but that's the counterexample I remember. Perfect, thank you. I was able to add a failing test and make it pass by removing the string length optimization. For future reference of anyone reading the list, this page has a great description of the problem: http://www.i18nguy.com/unicode/turkish-i18n.html Best, David
On 6/26/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "David E. Wheeler" <david@kineticode.com> writes: > > Datum citext_ne (PG_FUNCTION_ARGS) { > > // Fast path for different-length inputs. Okay for canonical > > equivalence? > > if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1))) > > PG_RETURN_BOOL( 1 ); > > PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 ); > > } > > BTW, I don't think you can use that same-length optimization for > citext. There's no reason to think that upper/lowercase pairs will > have the same length all the time in multibyte encodings. What about this code in current str_tolower(): /* Output workspace cannot have more codes than input bytes */ workspace = (wchar_t *) palloc((nbytes + 1) *sizeof(wchar_t)); Bug? -- marko
"Marko Kreen" <markokr@gmail.com> writes: > On 6/26/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> BTW, I don't think you can use that same-length optimization for >> citext. There's no reason to think that upper/lowercase pairs will >> have the same length all the time in multibyte encodings. > What about this code in current str_tolower(): > /* Output workspace cannot have more codes than input bytes */ > workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t)); That's working with wchars, not bytes. regards, tom lane
On 7/1/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Marko Kreen" <markokr@gmail.com> writes: > > On 6/26/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> BTW, I don't think you can use that same-length optimization for > >> citext. There's no reason to think that upper/lowercase pairs will > >> have the same length all the time in multibyte encodings. > > > What about this code in current str_tolower(): > > > /* Output workspace cannot have more codes than input bytes */ > > workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t)); > > > That's working with wchars, not bytes. Ah, I missed the point of char2wchar() line. I'm rather unfamiliar with various MB API-s, sorry. There's another thing I'm probably missing: does current code handle multi-wchar codepoints? Or is it guaranteed they don't happen? (Wasn't wchar_t usually 16bit value?) -- marko
Marko Kreen wrote: > On 7/1/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Marko Kreen" <markokr@gmail.com> writes: > > > On 6/26/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > > >> BTW, I don't think you can use that same-length optimization for > > >> citext. There's no reason to think that upper/lowercase pairs will > > >> have the same length all the time in multibyte encodings. > > > > > What about this code in current str_tolower(): > > > > > /* Output workspace cannot have more codes than input bytes */ > > > workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t)); > > > > > > That's working with wchars, not bytes. > > Ah, I missed the point of char2wchar() line. > > I'm rather unfamiliar with various MB API-s, sorry. > > There's another thing I'm probably missing: does current code handle > multi-wchar codepoints? Or is it guaranteed they don't happen? > (Wasn't wchar_t usually 16bit value?) If you want a simple example of wide character use look at oracle_compat.c::upper() which calls str_toupper() in CVS HEAD. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On 7/1/08, Bruce Momjian <bruce@momjian.us> wrote: > If you want a simple example of wide character use look at > oracle_compat.c::upper() which calls str_toupper() in CVS HEAD. ATM I'm looking at str_tolower/upper internal implementation. They do: workspace[curr_char] = towlower(workspace[curr_char]); where workspace is wchar_t but towlower() operates on wint_t. Is such inconsistency ok? -- marko
"Marko Kreen" <markokr@gmail.com> writes: > ATM I'm looking at str_tolower/upper internal implementation. > They do: > workspace[curr_char] = towlower(workspace[curr_char]); > where workspace is wchar_t but towlower() operates on wint_t. IIRC this is exactly comparable to the type situation for the traditional <ctype.h> macros. The reason is that they are defined to accept EOF in addition to actual char (or wchar) values. regards, tom lane
"Marko Kreen" <markokr@gmail.com> writes: > There's another thing I'm probably missing: does current code handle > multi-wchar codepoints? Or is it guaranteed they don't happen? AFAIK we disallow multi-wchar situations (by rejecting the UTF8 combining codes). > (Wasn't wchar_t usually 16bit value?) Hmm. It's unsigned int on my ancient HPUX box. I think we could have a problem on any machines whose mbstowcs doesn't support 4-byte UTF8 codes, though ... in particular, what about Windows? regards, tom lane
On 7/1/08, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Marko Kreen" <markokr@gmail.com> writes: > > ATM I'm looking at str_tolower/upper internal implementation. > > They do: > > workspace[curr_char] = towlower(workspace[curr_char]); > > where workspace is wchar_t but towlower() operates on wint_t. > > IIRC this is exactly comparable to the type situation for the > traditional <ctype.h> macros. The reason is that they are defined > to accept EOF in addition to actual char (or wchar) values. I read SUS v3 and there is no hint on multi-wchar anything, so for unix systems you are right, wint_t == wchar_t. Seems stories how Windows and Java operate have affected me too much. Then I browsed MSDN: http://msdn.microsoft.com/en-us/library/dtxesf6k.aspx and they seem to strongly hint that wchar_t == 16 bits and UTF-16 is used internally. Probably some Windows developer should look into it and decide if there is a #ifdef WIN32 branch needed. -- marko