Thread: Implementing full UTF-8 support (aka supporting 0x00)
Hi list.

As has been previously discussed (see https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl for instance) varlena fields cannot accept the literal 0x00 value. Sure, you can use bytea, but this is hardly a good solution. The problem seems to be hitting some use cases, like:

- People migrating data from other databases (apart from PostgreSQL, I don't know of any other database which suffers from the same problem).
- People using drivers which use UTF-8 or equivalent encodings by default (Java, for example).

Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're strictly non-compliant. And given the general Postgres policy regarding standards compliance and the people being hit by this, I think it should be addressed. Especially since all the usual fixes are a real PITA (re-parsing and re-generating strings, which is very expensive, or dropping data).

What would it take to support it? Isn't the varlena header propagated everywhere, which could help infer the real length of the string? Any pointers or suggestions would be welcome.

Thanks,

Álvaro

--
Álvaro Hernández Tortosa
-----------
8Kdata
Álvaro Hernández Tortosa <aht@8kdata.com> writes:
> As has been previously discussed (see
> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
> for instance) varlena fields cannot accept the literal 0x00 value.

Yup.

> What would it take to support it?

One key reason why that's hard is that datatype input and output functions use nul-terminated C strings as the representation of the text form of any datatype. We can't readily change that API without breaking huge amounts of code, much of it not under the core project's control.

There may be other places where nul-terminated strings would be a hazard (mumble fgets mumble), but offhand that API seems like the major problem so far as the backend is concerned.

There would be a slew of client-side problems as well. For example, this would assuredly break psql and pg_dump, along with every other client that supposes that it can treat PQgetvalue() as returning a nul-terminated string. This end of it would possibly be even worse than fixing the backend, because so little of the affected code is under our control.

In short, the problem is not with having an embedded nul in a stored text value. The problem is the reams of code that suppose that the text representation of any data value is a nul-terminated C string.

regards, tom lane
On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
> What would it take to support it?

Would it be of any value to support "Modified UTF-8"?

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 8/3/16 11:47 AM, Kevin Grittner wrote:
> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>> What would it take to support it?
>
> Would it be of any value to support "Modified UTF-8"?
>
> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Will this work with OS libraries?

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3 August 2016 at 15:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
> Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
> strictly non-compliant.

It's perhaps worth mentioning that 0x00 is valid ASCII too, and PostgreSQL has never stored that either.

If you want to start quoting standards, there is in fact a specific mention in the ANSI spec of null terminators when passing strings to host languages, so if PostgreSQL stored NULs in that way we would end up with parameters that we couldn't pass to UDFs in a standards-compliant way.

Geoff
On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
> Hi list.
>
> As has been previously discussed (see https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl for instance) varlena fields cannot accept the literal 0x00 value. Sure, you can use bytea, but this is hardly a good solution. The problem seems to be hitting some use cases, like:
>
> - People migrating data from other databases (apart from PostgreSQL, I don't know of any other database which suffers from the same problem).
> - People using drivers which use UTF-8 or equivalent encodings by default (Java, for example).
>
> Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're strictly non-compliant. And given the general Postgres policy regarding standards compliance and the people being hit by this, I think it should be addressed. Especially since all the usual fixes are a real PITA (re-parsing and re-generating strings, which is very expensive, or dropping data).
>
> What would it take to support it? Isn't the varlena header propagated everywhere, which could help infer the real length of the string? Any pointers or suggestions would be welcome.
One of the bigger pain points is that our interaction with C library collation routines for sorting uses NULL-terminated C strings. strcoll, strxfrm, etc.
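Craig's point can be made concrete: any API that receives a buffer as a `char *` only ever sees the bytes up to the first NUL, so distinct values silently collide. A minimal simulation of that C-string truncation in Python (the `c_string_view` helper is made up for illustration, not a real API):

```python
def c_string_view(data: bytes) -> bytes:
    # A C routine such as strcoll() receiving this buffer as char*
    # only sees the bytes before the first NUL terminator.
    return data.split(b"\x00", 1)[0]

s1 = b"ab\x00cd"
s2 = b"ab\x00zz"
assert s1 != s2                                # distinct byte strings...
assert c_string_view(s1) == c_string_view(s2)  # ...indistinguishable to C-string APIs
```

This is why passing text containing embedded NULs through strcoll/strxfrm would corrupt sort order even if storage and I/O were fixed.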
On 03/08/16 17:23, Tom Lane wrote:
> Álvaro Hernández Tortosa <aht@8kdata.com> writes:
>> As has been previously discussed (see
>> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
>> for instance) varlena fields cannot accept the literal 0x00 value.
>
> Yup.
>
>> What would it take to support it?
>
> One key reason why that's hard is that datatype input and output
> functions use nul-terminated C strings as the representation of the
> text form of any datatype. We can't readily change that API without
> breaking huge amounts of code, much of it not under the core project's
> control.
>
> There may be other places where nul-terminated strings would be a hazard
> (mumble fgets mumble), but offhand that API seems like the major problem
> so far as the backend is concerned.
>
> There would be a slew of client-side problems as well. For example this
> would assuredly break psql and pg_dump, along with every other client that
> supposes that it can treat PQgetvalue() as returning a nul-terminated
> string. This end of it would possibly be even worse than fixing the
> backend, because so little of the affected code is under our control.
>
> In short, the problem is not with having an embedded nul in a stored
> text value. The problem is the reams of code that suppose that the
> text representation of any data value is a nul-terminated C string.
>
> regards, tom lane

Wow. That seems like a daunting task.

I guess, then, that even implementing a new datatype based on bytea, but one that would use the text IO functions to show up as text (not send/recv), would not work either, right?

Thanks for the input,

Álvaro

--
Álvaro Hernández Tortosa
-----------
8Kdata
On 03/08/16 17:47, Kevin Grittner wrote:
> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>> What would it take to support it?
>
> Would it be of any value to support "Modified UTF-8"?
>
> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

That's nice, but I don't think so.

The problem is that you cannot predict how people would send you data, like when importing from other databases. I guess it may work if Postgres implemented such a UTF-8 variant and the drivers did too, but that would still require an encoding conversion (i.e., parsing every string) to change the 0x00, which seems like a serious performance hit.

It could be worse than nothing, though!

Thanks,

Álvaro

--
Álvaro Hernández Tortosa
-----------
8Kdata
On 03/08/16 18:35, Geoff Winkless wrote:
> On 3 August 2016 at 15:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>> Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
>> strictly non-compliant.
>
> It's perhaps worth mentioning that 0x00 is valid ASCII too, and
> PostgreSQL has never stored that either.

Then yes, it could also be a problem. But as of today I believe solving the problem for UTF-8 would solve the great majority of these embedded NUL problems.

Álvaro

--
Álvaro Hernández Tortosa
-----------
8Kdata
On 03/08/16 20:14, Álvaro Hernández Tortosa wrote:
> On 03/08/16 17:47, Kevin Grittner wrote:
>> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa
>> <aht@8kdata.com> wrote:
>>> What would it take to support it?
>>
>> Would it be of any value to support "Modified UTF-8"?
>>
>> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
>
> That's nice, but I don't think so.
>
> The problem is that you cannot predict how people would send you
> data, like when importing from other databases. I guess it may work if
> Postgres implemented such a UTF-8 variant and the drivers did too, but
> that would still require an encoding conversion (i.e., parsing every
> string) to change the 0x00, which seems like a serious performance hit.
>
> It could be worse than nothing, though!

It may indeed work. According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 sequence. In short, the \u00 nul is represented (an overlong encoding) by the two-byte, one-character sequence \uc080. These two bytes are invalid UTF-8 and so should not appear in an otherwise valid UTF-8 string. Yet they are accepted by Postgres (as if Postgres supported Modified UTF-8 intentionally). The character in psql does not render as a nul but as this symbol: "삀".

Given that this works, the process would look like this:

- Parse all input data looking for bytes with hex value 0x00. If they appear in the string, they are the null byte.
- Replace that byte with the two bytes 0xc080.
- Reverse the operation when reading.

This is OK but of course a performance hit (searching for 0x00 and then augmenting the byte[] or whatever data structure to account for the extra byte). A little bit of a PITA, but I guess better than fixing it all :)

Álvaro

--
Álvaro Hernández Tortosa
-----------
8Kdata
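The replace/reverse steps described above can be sketched as a driver-side transformation (a minimal illustration in Python; the function names `to_modified_utf8`/`from_modified_utf8` are made up, not part of any real driver):

```python
def to_modified_utf8(data: bytes) -> bytes:
    """Escape NUL bytes as the overlong two-byte sequence 0xC0 0x80
    used by Modified UTF-8 (the encoding Java uses internally)."""
    return data.replace(b"\x00", b"\xc0\x80")

def from_modified_utf8(data: bytes) -> bytes:
    """Reverse the escaping: turn 0xC0 0x80 back into a NUL byte."""
    return data.replace(b"\xc0\x80", b"\x00")

original = "null\x00inside".encode("utf-8")
escaped = to_modified_utf8(original)
assert b"\x00" not in escaped                    # no raw NUL remains
assert from_modified_utf8(escaped) == original   # lossless round trip
```

The naive byte-level replace is only safe because 0xC0 can never occur in valid standard UTF-8, so the pair 0xC0 0x80 cannot appear by accident; the cost is exactly the scan-and-expand pass the email describes.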
On 3 August 2016 at 20:13, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
> Yet they are accepted by Postgres
> (as if Postgres supported Modified UTF-8 intentionally). The character
> in psql does not render as a nul but as this symbol: "삀".

Not accepted as valid utf8:

# select E'\xc0\x80';
ERROR:  invalid byte sequence for encoding "UTF8": 0xc0 0x80

You would need a "modified utf8" encoding, I think.

Geoff
On 03/08/16 21:31, Geoff Winkless wrote:
> On 3 August 2016 at 20:13, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>> Yet they are accepted by Postgres
>> (as if Postgres supported Modified UTF-8 intentionally). The character
>> in psql does not render as a nul but as this symbol: "삀".
>
> Not accepted as valid utf8:
>
> # select E'\xc0\x80';
> ERROR:  invalid byte sequence for encoding "UTF8": 0xc0 0x80
>
> You would need a "modified utf8" encoding, I think.

Isn't the correct syntax something like:

select E'\uc080', U&'\c080';

?

It is a single character, 16-bit unicode sequence (see https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).

Álvaro

--
Álvaro Hernández Tortosa
-----------
8Kdata
Álvaro Hernández Tortosa <aht@8kdata.com> writes:
> According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
> the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 code
> point. In short, the \u00 nul is represented (overlong encoding) by the
> two-byte, 1 character sequence \uc080. These two bytes are invalid UTF-8
> so should not appear in an otherwise valid UTF-8 string. Yet they are
> accepted by Postgres (like if Postgres would support Modified UTF-8
> intentionally).

Really? It sure looks to me like pg_utf8_islegal() would reject this.

We could hack it to allow the case, no doubt, but I concur with Peter's concern that we'd have trouble with OS-level code that is strict about what UTF8 allows. glibc, for example, is known to do very strange things with strings that it thinks are invalid in the active encoding.

regards, tom lane
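Tom's expectation matches what any strict UTF-8 validator does: the overlong sequence 0xC0 0x80 is rejected outright. A quick check outside Postgres (using Python's decoder purely as a convenient strict validator):

```python
# 0xC0 0x80 is an overlong encoding of U+0000, forbidden by the UTF-8
# definition, so a strict decoder rejects it -- consistent with Tom's
# reading of pg_utf8_islegal().
try:
    b"\xc0\x80".decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    accepted = False
assert accepted is False
```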
On 3 August 2016 at 20:36, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
> Isn't the correct syntax something like:
>
> select E'\uc080', U&'\c080';
>
> ?
>
> It is a single character, 16 bit unicode sequence (see
> https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).

No, what you've done there is created the three-byte utf8 sequence \xec8280

# select U&'\c080'::bytea;
  bytea
----------
 \xec8280

It's not a UCS2 c080, it's utf8 c080.

Geoff
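Geoff's byte arithmetic can be verified with any UTF-8 encoder: the code point U+C080 lies in the three-byte range (U+0800..U+FFFF), so it encodes to 0xEC 0x82 0x80, not the raw byte pair 0xC0 0x80 of Modified UTF-8 (Python used here purely as a convenient encoder):

```python
# U+C080 needs the three-byte UTF-8 form 1110xxxx 10xxxxxx 10xxxxxx:
# 0xC080 = 1100 000010 000000 -> 0xEC 0x82 0x80, matching psql's
# U&'\c080'::bytea output of \xec8280.
assert "\uc080".encode("utf-8") == b"\xec\x82\x80"
```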
On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>> What would it take to support it? Isn't the varlena header propagated
>> everywhere, which could help infer the real length of the string? Any
>> pointers or suggestions would be welcome.
>
> One of the bigger pain points is that our interaction with C library
> collation routines for sorting uses NULL-terminated C strings. strcoll,
> strxfrm, etc.

That particular bit of the problem would go away if this ever happened:

https://wiki.postgresql.org/wiki/Todo:ICU

ucol_strcoll takes explicit lengths (though optionally accepts -1 for null-terminated mode).

http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs.-Length-Parameters

--
Thomas Munro
http://www.enterprisedb.com
On 03/08/16 21:42, Geoff Winkless wrote:
> On 3 August 2016 at 20:36, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>> Isn't the correct syntax something like:
>>
>> select E'\uc080', U&'\c080';
>>
>> ?
>>
>> It is a single character, 16 bit unicode sequence (see
>> https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).
>
> No, what you've done there is created the three-byte utf8 sequence \xec8280
>
> # select U&'\c080'::bytea;
>   bytea
> ----------
>  \xec8280
>
> It's not a UCS2 c080, it's utf8 c080.

Yes, you're absolutely right ^_^

Álvaro

--
Álvaro Hernández Tortosa
-----------
8Kdata
On 4 August 2016 at 05:00, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht@8kdata.com> wrote:
>>> What would it take to support it? Isn't the varlena header propagated
>>> everywhere, which could help infer the real length of the string? Any
>>> pointers or suggestions would be welcome.
>>
>> One of the bigger pain points is that our interaction with C library
>> collation routines for sorting uses NULL-terminated C strings. strcoll,
>> strxfrm, etc.
>
> That particular bit of the problem would go away if this ever happened:
>
> https://wiki.postgresql.org/wiki/Todo:ICU
>
> ucol_strcoll takes explicit lengths (though optionally accepts -1 for
> null-terminated mode).
>
> http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs.-Length-Parameters
Yep, it does. But we've made little to no progress on integration of ICU support and AFAIK nobody's working on it right now.
I wonder how MySQL implements their collation and encoding support?
On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
> Yep, it does. But we've made little to no progress on integration of ICU
> support and AFAIK nobody's working on it right now.

Uh, this email from July says Peter Eisentraut will submit it in September :-)

https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
From: Palle Girgensohn
> 4 aug. 2016 kl. 02:40 skrev Bruce Momjian <bruce@momjian.us>:
>
> On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September :-)
>
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.

https://github.com/girgen/postgres/

in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.

I have one question for someone with knowledge about Turkish (Devrim?). This is the diff from regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8
$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"

*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out  2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out  2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  f
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
  SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
   false
  -------
!  t
  (1 row)

  SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  t
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
  SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
   true
  ------
!  f
  (1 row)

  -- The following actually exercises the selectivity estimation for ~*.
======================================================================

The Linux locale behaves differently from ICU for the above (corner?) cases. Any ideas if one is more correct than the other? It seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?

As Thomas points out, using ucol_strcoll it is quick, since no copying is needed. I will get some benchmarks soon.

Palle

[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html
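The discrepancy Palle asks about really can hinge on the direction of case folding: under the default (non-Turkish) Unicode case mapping, dotless ı (U+0131) uppercases to plain I, while I lowercases to dotted i, so upper(both) and lower(both) disagree for 'bıt' vs 'BIT'. A quick illustration (Python's locale-independent default case mapping):

```python
# Default Unicode case mapping (not the Turkish-specific tailoring):
# U+0131 LATIN SMALL LETTER DOTLESS I uppercases to plain 'I'...
assert "bıt".upper() == "BIT"      # upper(both): the strings match
# ...but lowercasing 'BIT' yields dotted 'bit', not 'bıt'
assert "BIT".lower() == "bit"
assert "BIT".lower() != "bıt"      # lower(both): the strings differ
```

So whether 'bıt' ~* 'BIT' matches genuinely depends on whether the implementation folds both operands via upper or via lower (and on whether Turkish tailoring is applied), which is consistent with glibc and ICU disagreeing here.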
Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
From: Peter Geoghegan
On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen@pingpong.net> wrote:
> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.

Looks like you're not using the ICU equivalent of strxfrm(). While 9.5 is not the release that introduced its use, it did expand it significantly. I think you need to fix this, even though it isn't actually used to sort text at present, since presumably FreeBSD builds of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could reasonably trust the ICU equivalent of strxfrm(), so that's a missed opportunity. (See the wiki page on the abbreviated keys issue [1] if you don't know what I'm talking about.)

Shouldn't you really have a strxfrm() wrapper, used across the board, including for callers outside of varlena.c? convert_string_datum() has been calling strxfrm() for many releases now. These calls are still used in FreeBSD builds, I would think, which seems like a bug that is not dodged by simply not defining TRUST_STRXFRM. Doesn't its assumption that it matches the ordering used elsewhere fail to hold on FreeBSD builds?

[1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue

--
Peter Geoghegan
Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
From: Palle Girgensohn
> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan <pg@heroku.com>:
>
> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen@pingpong.net> wrote:
>> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.
>
> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
> is not the release that introduced its use, it did expand it
> significantly. I think you need to fix this, even though it isn't
> actually used to sort text at present, since presumably FreeBSD builds
> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
> opportunity. (See the wiki page on the abbreviated keys issue [1] if
> you don't know what I'm talking about.)

My plan was to get it working without TRUST_STRXFRM first, and then add that functionality. I've made some preliminary tests using ICU's ucol_getSortKey but I will have to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm; the plan is to use that code base.

> Shouldn't you really have a strxfrm() wrapper, used across the board,
> including for callers outside of varlena.c? convert_string_datum() has
> been calling strxfrm() for many releases now. These calls are still
> used in FreeBSD builds, I would think, which seems like a bug that is
> not dodged by simply not defining TRUST_STRXFRM. Isn't its assumption
> that that matching the ordering used elsewhere not really hold on
> FreeBSD builds?

I was not aware of convert_string_datum, I will check that, thanks! Using a wrapper across the board seems like a good idea for refactoring.

> [1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue

Palle
Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
From: Palle Girgensohn
<br class="" /><blockquote class="" type="cite">11 aug. 2016 kl. 11:15 skrev Palle Girgensohn <<a class="" href="mailto:girgen@pingpong.net">girgen@pingpong.net</a>>:<brclass="" /><br class="" /><blockquote class="" style="font-family:Menlo-Regular; font-size: 13px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing:normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows:auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" type="cite"><br class="" />11 aug. 2016 kl. 03:05 skrevPeter Geoghegan <<a class="" href="mailto:pg@heroku.com">pg@heroku.com</a>>:<br class="" /><br class="" />On Wed,Aug 10, 2016 at 1:42 PM, Palle Girgensohn <<a class="" href="mailto:girgen@pingpong.net">girgen@pingpong.net</a>>wrote:<br class="" /><blockquote class="" type="cite">They'vebeen used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly.In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how totest them, but also, I see little point in supporting anything else using ICU.<br class="" /></blockquote><br class=""/>Looks like you're not using the ICU equivalent of strxfrm(). While 9.5<br class="" />is not the release that introducedits use, it did expand it<br class="" />significantly. I think you need to fix this, even though it isn't<br class=""/>actually used to sort text at present, since presumably FreeBSD builds<br class="" />of 9.5 don't TRUST_STRXFRM.Since you're using ICU, though, you could<br class="" />reasonably trust the ICU equivalent of strxfrm(), sothat's a missed<br class="" />opportunity. (See the wiki page on the abbreviated keys issue [1] if<br class="" />you don'tknow what I'm talking about.)<br class="" /></blockquote><br class="" />My plan was to get it working without TRUST_STRXFRMfirst, and then add that functinality. 
I've made some preliminary tests using ICU:s ucol_getSortKey but I willhave to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm, the planis to use that code base.<br class="" /></blockquote><div class=""><br class="" /></div><div class="">Here are some preliminaryresults running 10000 times comparing the same two strings in a tight loop.</div><br class="" /><div class=""> ucol_strcollUTF8: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.002448<br class=""/> strcoll: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.060711<br class=""/> ucol_strcollIter: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.009221<br class=""/> direct memcmp: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.000457<br class=""/> memcpy memcmp: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.001706<br class=""/> memcpy strcoll: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.068425<br class=""/> nextSortKeyPart: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.041011<br class=""/> ucnv_toUChars + getSortKey: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.050379<br class=""/><br class="" /></div><div class=""><br class="" /></div><div class="">correct answer is -1, but since we compareåasdf and äasdf with a Swedish locale, memcmp and strcoll fails of course, as espected. 
Direct memcmp is 5 times fasterthan ucol_strcollUTF8 (used in my patch), but sadly the best implementation using sort keys with ICU, nextSortKeyPart,is way slower.</div><div class=""><br class="" /></div><div class=""><br class="" /></div><div class=""><brclass="" /></div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>startTime = getRealTime();<brclass="" /><span class="Apple-tab-span" style="white-space:pre"> </span>for ( int i = 0; i < loop; i++){<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>result = ucol_strcollUTF8(collator, arg1,len1, arg2, len2, &status);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>}<br class=""/><span class="Apple-tab-span" style="white-space:pre"> </span>endTime = getRealTime();<br class="" /><span class="Apple-tab-span"style="white-space:pre"> </span>printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime);<brclass="" /><br class="" /></div><div class=""><span class="Apple-tab-span" style="white-space:pre"><span class=""style="white-space: normal;"><br class="" /></span></span></div><div class=""><span class="Apple-tab-span" style="white-space:pre"><spanclass="" style="white-space: normal;"><br class="" /></span></span></div><div class=""><spanclass="Apple-tab-span" style="white-space:pre"><span class="" style="white-space: normal;">vs</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class="" style="white-space:normal;"><br class="" /></span></span></div><div class=""><span class="Apple-tab-span" style="white-space:pre"><spanclass="" style="white-space: normal;"><br class="" /></span></span></div><div class=""></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class="Apple-tab-span" style="white-space:pre;"> </span>int sortkeysize=8;<br class="" /><br class="" /><span class="" style="white-space: normal;"><spanclass="Apple-tab-span" style="white-space:pre"> 
</span>startTime = getRealTime();</span></span></div><div class=""><spanclass="Apple-tab-span" style="white-space:pre"><span class="" style="white-space: normal;"><span class="Apple-tab-span"style="white-space:pre"> </span>uint8_t key1[sortkeysize], key2[sortkeysize];</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class="" style="white-space:normal;"><span class="Apple-tab-span" style="white-space:pre"> </span>uint32_t sState[2], tState[2];</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space: pre;"> </span>UCharIterator sIter,tIter; </div><div class=""><span class="Apple-tab-span" style="white-space:pre"><br class="" /></span>for ( int i =0; i < loop; i++) {</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>uiter_setUTF8(&sIter,arg1, len1);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>uiter_setUTF8(&tIter,arg2, len2);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>sState[0]= 0; sState[1] = 0;<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>tState[0]= 0; tState[1] = 0;<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>ucol_nextSortKeyPart(collator,&sIter, sState, key1, sortkeysize, &status);<br class="" /><span class="Apple-tab-span"style="white-space:pre"> </span>ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize,&status);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>result = memcmp (key1,key2, sortkeysize);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>}<br class="" /><spanclass="Apple-tab-span" style="white-space:pre"> </span>endTime = getRealTime();<br class="" /><span class="Apple-tab-span"style="white-space:pre"> </span>printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime);<brclass="" /><br class="" /></div><div class=""><br class="" /></div><div class=""><br 
class="" /></div><divclass="">But in your strxfrm code in PostgreSQL, the keys are cached, and represented as int64:s if I remembercorrectly, so perhaps there is still a benefit using the abbreviated keys? More testing is required, I guess...</div><divclass=""><br class="" /></div><div class="">Palle</div><div class=""><br class="" /></div><div class=""><brclass="" /></div><div class=""><br class="" /></div>
Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
From: Peter Geoghegan
On Thu, Aug 11, 2016 at 4:22 AM, Palle Girgensohn <girgen@pingpong.net> wrote:
> But in your strxfrm code in PostgreSQL, the keys are cached, and represented
> as int64:s if I remember correctly, so perhaps there is still a benefit
> using the abbreviated keys? More testing is required, I guess...

ICU's ucol_nextSortKeyPart() interface is faster for this, I believe, and works because you only need the first sizeof(Datum) bytes (4 or 8 bytes). So, you have the right idea about using it (at least for the abbreviated key stuff), but the second-to-last argument needs to be sizeof(Datum).

The whole point of the abbreviated key optimization is that you can avoid pointer chasing during each and every comparison. Your test here is invalid because it doesn't involve the reuse of the keys. Often, during a sort, each item has to be compared about 20 times. I've experimented with ICU, and know it works well with this. You really need to create a scenario with a real sort, and all the conditions I describe.

--
Peter Geoghegan
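The optimization Peter describes can be sketched outside Postgres: compute a fixed-size "abbreviated key" once per item, compare those cheap keys on every comparison, and fall back to the full (expensive) comparison only on ties. A toy Python sketch, using byte-string prefixes in place of real strxfrm/ICU sort-key blobs (`full_compare` and `sort_with_abbreviated_keys` are made-up names for illustration):

```python
from functools import cmp_to_key

def full_compare(a: bytes, b: bytes) -> int:
    # Stand-in for an expensive collation-aware comparison
    # (strcoll / ucol_strcollUTF8 in the real code).
    return (a > b) - (a < b)

def sort_with_abbreviated_keys(items, keysize=8):
    # Build each abbreviated key ONCE per item; Palle's benchmark
    # instead rebuilt both keys inside every comparison, which is
    # why nextSortKeyPart looked slow there.
    decorated = [(item[:keysize], item) for item in items]

    def cmp(x, y):
        # Cheap fixed-size comparison resolves most comparisons...
        if x[0] != y[0]:
            return -1 if x[0] < y[0] else 1
        # ...full comparison only when the abbreviated keys tie.
        return full_compare(x[1], y[1])

    decorated.sort(key=cmp_to_key(cmp))
    return [item for _, item in decorated]

data = [b"zebra", b"apple", b"applesauce", b"app"]
assert sort_with_abbreviated_keys(data) == sorted(data)
```

Since each item participates in roughly 20 comparisons in a typical sort, amortizing the key construction once per item is what makes the scheme pay off, exactly the point Peter makes about reuse.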