Re: Re: LIKE gripes - Mailing list pgsql-hackers
From | Thomas Lockhart |
---|---|
Subject | Re: Re: LIKE gripes |
Date | |
Msg-id | 39916A8A.BE0D9CAE@alumni.caltech.edu |
In response to | RE: Re: LIKE gripes ("Hiroshi Inoue" <Inoue@tpf.co.jp>) |
Responses | Re: Re: LIKE gripes |
List | pgsql-hackers |
> MB has something similar to the "next character" function called
> pg_encoding_mblen. It tells the length of the MB word pointed to, so
> that you can move forward to the next MB word, etc.

> > 2) For each character set, we would need to provide conversion functions
> > to other "compatible" character sets, or to a character "superset". Why
> > don't we have those conversion functions? Answer: we do! There is an
> > internal 32-bit encoding within which all comparisons are done.
> Right.

OK. As you know, I have an interest in this, but little knowledge ;)

> > Anyway, I think it will be pretty easy to put the MB stuff back in, by
> > #ifdef'ing some string copying inside each of the routines (such as
> > namelike()). The underlying routine no longer requires a null-terminated
> > string (using explicit lengths instead), so I'll generate those lengths
> > in the same place unless they are already provided by the char->int MB
> > support code.
> I have not taken a look at your new like code, but I guess you could use
> pg_mbstrlen(const unsigned char *mbstr)
> It tells the number of words in mbstr (however, mbstr needs to be
> null-terminated).

To get the length I'm now just running through the output string looking
for a zero value. This should be more efficient than reading the original
string twice; it might be nice if the conversion routines (which now
return nothing) returned the actual number of pg_wchars in the output.

The original like() code allocates a pg_wchar array dimensioned by the
number of bytes in the input string (which happens to be the absolute
upper limit for the size of the 32-bit-encoded string). Worst case, this
results in a 4-to-1 expansion of memory, and it always requires a
palloc()/pfree() for each call to the comparison routines.

I think I have a solution for the current code; could someone test its
behavior with MB enabled? It is now committed to the source tree; I know
it compiles, but afaik I am not equipped to test it :(
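Roughly, the wrapping I have in mind looks like the standalone sketch
below (a toy, not the committed namelike() code): allocate a pg_wchar
buffer sized by the byte length of the input, convert, then run through
the output looking for the zero value to recover the length. The
conversion stub here stands in for the real MB routine (e.g.
pg_mb2wchar_with_len), which likewise returns nothing; every name with a
"toy_" prefix is illustrative only.

/*
 * Minimal sketch of the MULTIBYTE wrapping described above.
 * Nothing here is actual backend code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int pg_wchar;      /* the internal 32-bit encoding */

/* stand-in conversion routine: single-byte input only, returns nothing */
static void
toy_mb2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
{
    while (len-- > 0 && *from)
        *to++ = *from++;
    *to = 0;                        /* zero-terminate the widened string */
}

/* run through the output looking for a zero value to get the length */
static int
toy_wchar_strlen(const pg_wchar *s)
{
    int         len = 0;

    while (s[len] != 0)
        len++;
    return len;
}

int
main(void)
{
    const char *pattern = "abc%";
    int         nbytes = strlen(pattern);

    /* worst case 4-to-1 expansion: one pg_wchar per input byte, plus terminator */
    pg_wchar   *buf = malloc((nbytes + 1) * sizeof(pg_wchar));

    toy_mb2wchar_with_len((const unsigned char *) pattern, buf, nbytes);
    printf("%d pg_wchars\n", toy_wchar_strlen(buf));
    free(buf);
    return 0;
}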
> > In the future, I'd like to see us use alternate encodings as-is, or as a
> > common set like Unicode (16 bits wide afaik) rather than having to do
> > this widening to 32 bits on the fly. Then, each supported character set
> > can be efficiently manipulated internally, and only converted to another
> > encoding when mixing with another character set.
> If you are planning to convert everything to Unicode or whatever
> before storing it on disk, I'd like to object to the idea. It's
> not only a waste of disk space but will also bring serious performance
> degradation. For example, each ISO 8859 byte occupies 2 bytes after
> being converted to Unicode. I don't think this two-fold disk space
> consumption is acceptable.

I am not planning on converting everything to Unicode for disk storage.
What I would *like* to do is the following (a toy illustration follows
the list):

1) support each encoding "natively", using Postgres' type system to
distinguish between them. This would allow strings with the same encoding
to be used without conversion, and would both minimize storage
requirements *and* run-time conversion costs.

2) support conversions between encodings, again using Postgres' type
system to select the appropriate conversion routines. This would allow
strings with different but compatible encodings to be mixed, but it would
require internal conversions *only* if someone is mixing encodings inside
their database.

3) one of the supported encodings might be Unicode, and if one chooses,
that could be used for on-disk storage. The same goes for the other
existing encodings.

4) this different approach to encoding support can coexist with the
existing MB support, since (1)-(3) are done without reference to existing
MB internal features. So you can choose which scheme to use, and can test
the new scheme without breaking the existing one.
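As a rough, hypothetical sketch of (1)-(3): each string carries its
encoding, strings with the same encoding compare directly, and a
conversion routine is invoked only when encodings are mixed. None of
these "toy_" names exist in Postgres, and the real mechanism would hang
off the type system rather than a struct tag.

#include <stdio.h>
#include <string.h>

typedef enum { TOY_SQL_ASCII, TOY_LATIN1 } toy_encoding;

typedef struct
{
    toy_encoding enc;
    const char  *data;
} toy_text;

/* conversion stub: Latin-1 is a superset of ASCII, so this is just a relabel */
static toy_text
toy_convert(toy_text s, toy_encoding target)
{
    toy_text    out = { target, s.data };

    return out;
}

/* compare directly when encodings match; convert only when they are mixed */
static int
toy_compare(toy_text a, toy_text b)
{
    if (a.enc != b.enc)
        b = toy_convert(b, a.enc);  /* conversion cost paid only here */
    return strcmp(a.data, b.data);
}

int
main(void)
{
    toy_text    x = { TOY_LATIN1, "abc" };
    toy_text    y = { TOY_SQL_ASCII, "abc" };

    printf("%d\n", toy_compare(x, y));  /* 0: equal after trivial conversion */
    return 0;
}

The point is only that the comparison path for same-encoding operands
pays no conversion cost at all.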
imho this comes closer to one of the important goals of maximizing
performance for internal operations (since there is less internal string
copying/conversion required), even at the expense of extra conversion
cost when doing input/output (a good trade, since *usually* there are
many internal operations for every few i/o operations).

Comments?

- Thomas