Re: [GENERAL] workaround for lack of REPLACE() function - Mailing list pgsql-hackers
From | Joe Conway |
---|---|
Subject | Re: [GENERAL] workaround for lack of REPLACE() function |
Date | |
Msg-id | 3D569BBF.8040309@joeconway.com Whole thread Raw |
In response to | Re: [GENERAL] workaround for lack of REPLACE() function (Thomas Lockhart <lockhart@fourpalms.org>) |
Responses |
Re: [GENERAL] workaround for lack of REPLACE() function
|
List | pgsql-hackers |
Tatsuo Ishii wrote: >>Any objection if I rework this function to meet SQL92 and fix the bug? > I've started working on text_substr() as described in this thread (which is hopefully prep work for the replace() function that started the thread). I haven't really looked at toast or multibyte closely before, so I'd like to ask a couple of questions to be sure I'm understanding the relevant issues correctly. First, in textlen() I see (ignoring multibyte for a moment): text *t = PG_GETARG_TEXT_P(0); PG_RETURN_INT32(VARSIZE(t) - VARHDRSZ); Tom has pointed out to me before that PG_GETARG_TEXT_P(n) incurs the overhead of retrieving and possibly decompressing a toasted datum. So my first question is, can we simply do: PG_RETURN_INT32(toast_raw_datum_size(PG_GETARG_DATUM(0)) - VARHDRSZ); and save the overhead of retrieving and decompressing the whole datum? Now, in the multibyte case, again in textlen(), I see: /* optimization for single byte encoding */ if (pg_database_encoding_max_length() <= 1) PG_RETURN_INT32(VARSIZE(t)- VARHDRSZ); PG_RETURN_INT32( pg_mbstrlen_with_len(VARDATA(t), VARSIZE(t) - VARHDRSZ)); Three questions here. 1) In the case of encoding max length == 1, can we treat it the same as the non-multibyte case (I presume they are exactly the same)? 2) Can encoding max length ever be < 1? Doesn't make sense to me. 3) In the case of encoding max length > 1, if I understand correctly, each encoded character can be one *or more* bytes, up to and encluding encoding max length bytes. So the *only* way presently to get the length of the original character string is to loop through the entire string checking the length of each individual character (that's what pg_mbstrlen_with_len() does it seems)? Finally, if 3) is true, then there is no way to avoid the retrieval and decompression of the datum just to find out its length. For large datums, detoasting plus the looping through each character would add a huge amount of overhead just to get at the length of the original string. I don't know if we need to be able to get *just* the length often enough to really care, but if we do, I had an idea for some future release (I wouldn't propose doing this for 7.3): - add a new EXTENDED state to va_external for MULTIBYTE - any string with max encoding length > 1 would be EXTENDED even if it is not EXTERNAL and not COMPRESSED. - to each of the structs in the union, add va_strlen - populate va_strlen on INSERT and maintain it on UPDATE. Now a new function similar to toast_raw_datum_size(), maybe toast_raw_datum_strlen() could be used to get the original string length, whether MB or not, without needing to retrieve and decompress the entire datum. I understand we would either: have to steal another bit from the VARHDR which would reduce the effective size of a valena from 1GB down to .5GB; or we would need to add a byte or two to the VARHDR which is extra per-datum overhead. I'm not sure we would want to do either. But I wanted to toss out the idea while it was fresh on my mind. Thanks, Joe
pgsql-hackers by date: