Re: Re: LIKE gripes - Mailing list pgsql-hackers
From | Thomas Lockhart |
---|---|
Subject | Re: Re: LIKE gripes |
Date | |
Msg-id | 39916A8A.BE0D9CAE@alumni.caltech.edu |
In response to | RE: Re: LIKE gripes ("Hiroshi Inoue" <Inoue@tpf.co.jp>) |
Responses | Re: Re: LIKE gripes |
List | pgsql-hackers |
> MB has something similar to the "next character" function called
> pg_encoding_mblen. It tells the length of the MB word pointed to, so
> that you can move forward to the next MB word, etc.

> > 2) For each character set, we would need to provide conversion functions
> > to other "compatible" character sets, or to a character "superset". Why
> > don't we have those conversion functions? Answer: we do! There is an
> > internal 32-bit encoding within which all comparisons are done.
> Right.

OK. As you know, I have an interest in this, but little knowledge ;)

> > Anyway, I think it will be pretty easy to put the MB stuff back in, by
> > #ifdef'ing some string copying inside each of the routines (such as
> > namelike()). The underlying routine no longer requires a null-terminated
> > string (using explicit lengths instead), so I'll generate those lengths
> > in the same place unless they are already provided by the char->int MB
> > support code.
> I have not taken a look at your new like code, but I guess you could use
> pg_mbstrlen(const unsigned char *mbstr)
> It tells the number of words in mbstr (however, mbstr needs to be
> null-terminated).

To get the length I'm now just running through the output string looking
for a zero value. This should be more efficient than reading the original
string twice; it might be nice if the conversion routines (which now
return nothing) returned the actual number of pg_wchars in the output.

The original like() code allocates a pg_wchar array dimensioned by the
number of bytes in the input string (which happens to be the absolute
upper limit for the size of the 32-bit-encoded string). Worst case, this
results in a 4-to-1 expansion of memory, and it always requires a
palloc()/pfree() for each call to the comparison routines.

I think I have a solution for the current code; could someone test its
behavior with MB enabled? It is now committed to the source tree; I know
it compiles, but afaik I am not equipped to test it :(
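Roughly, the wrapping I have in mind looks like the standalone sketch
below (a toy, not the committed namelike() code): allocate a pg_wchar
buffer sized by the byte length of the input, convert, then run through
the output looking for the zero value to recover the length. The
conversion stub here stands in for the real MB routine (e.g.
pg_mb2wchar_with_len), which likewise returns nothing; every name with a
"toy_" prefix is illustrative only.

/*
 * Minimal sketch of the MULTIBYTE wrapping described above.
 * Nothing here is actual backend code.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int pg_wchar;      /* the internal 32-bit encoding */

/* stand-in conversion routine: single-byte input only, returns nothing */
static void
toy_mb2wchar_with_len(const unsigned char *from, pg_wchar *to, int len)
{
    while (len-- > 0 && *from)
        *to++ = *from++;
    *to = 0;                        /* zero-terminate the widened string */
}

/* run through the output looking for a zero value to get the length */
static int
toy_wchar_strlen(const pg_wchar *s)
{
    int         len = 0;

    while (s[len] != 0)
        len++;
    return len;
}

int
main(void)
{
    const char *pattern = "abc%";
    int         nbytes = strlen(pattern);

    /* worst case 4-to-1 expansion: one pg_wchar per input byte, plus terminator */
    pg_wchar   *buf = malloc((nbytes + 1) * sizeof(pg_wchar));

    toy_mb2wchar_with_len((const unsigned char *) pattern, buf, nbytes);
    printf("%d pg_wchars\n", toy_wchar_strlen(buf));
    free(buf);
    return 0;
}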
> > In the future, I'd like to see us use alternate encodings as-is, or as a
> > common set like Unicode (16 bits wide afaik) rather than having to do
> > this widening to 32 bits on the fly. Then, each supported character set
> > can be efficiently manipulated internally, and only converted to another
> > encoding when mixing with another character set.
> If you are planning to convert everything to Unicode or whatever
> before storing it on disk, I'd like to object to the idea. It's
> not only a waste of disk space but will also bring serious performance
> degradation. For example, each ISO 8859 byte occupies 2 bytes after
> being converted to Unicode. I don't think this two-fold disk space
> consumption is acceptable.

I am not planning on converting everything to Unicode for disk storage.
What I would *like* to do is the following (a toy illustration follows
the list):

1) support each encoding "natively", using Postgres' type system to
distinguish between them. This would allow strings with the same encoding
to be used without conversion, and would both minimize storage
requirements *and* run-time conversion costs.

2) support conversions between encodings, again using Postgres' type
system to select the appropriate conversion routines. This would allow
strings with different but compatible encodings to be mixed, but it would
require internal conversions *only* if someone is mixing encodings inside
their database.

3) one of the supported encodings might be Unicode, and if one chooses,
that could be used for on-disk storage. The same goes for the other
existing encodings.

4) this different approach to encoding support can coexist with the
existing MB support, since (1)-(3) are done without reference to existing
MB internal features. So you can choose which scheme to use, and can test
the new scheme without breaking the existing one.
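As a rough, hypothetical sketch of (1)-(3): each string carries its
encoding, strings with the same encoding compare directly, and a
conversion routine is invoked only when encodings are mixed. None of
these "toy_" names exist in Postgres, and the real mechanism would hang
off the type system rather than a struct tag.

#include <stdio.h>
#include <string.h>

typedef enum { TOY_SQL_ASCII, TOY_LATIN1 } toy_encoding;

typedef struct
{
    toy_encoding enc;
    const char  *data;
} toy_text;

/* conversion stub: Latin-1 is a superset of ASCII, so this is just a relabel */
static toy_text
toy_convert(toy_text s, toy_encoding target)
{
    toy_text    out = { target, s.data };

    return out;
}

/* compare directly when encodings match; convert only when they are mixed */
static int
toy_compare(toy_text a, toy_text b)
{
    if (a.enc != b.enc)
        b = toy_convert(b, a.enc);  /* conversion cost paid only here */
    return strcmp(a.data, b.data);
}

int
main(void)
{
    toy_text    x = { TOY_LATIN1, "abc" };
    toy_text    y = { TOY_SQL_ASCII, "abc" };

    printf("%d\n", toy_compare(x, y));  /* 0: equal after trivial conversion */
    return 0;
}

The point is only that the comparison path for same-encoding operands
pays no conversion cost at all.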
imho this comes closer to one of the important goals of maximizing
performance for internal operations (since there is less internal string
copying/conversion required), even at the expense of extra conversion
cost when doing input/output (a good trade, since *usually* there are
many internal operations for every few i/o operations).

Comments?

- Thomas