Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) - Mailing list pgsql-hackers
From | Palle Girgensohn |
---|---|
Subject | Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) |
Date | |
Msg-id | 789A2F56-0E42-409D-A840-6AF5110D6085@pingpong.net Whole thread Raw |
In response to | Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00) (Palle Girgensohn <girgen@pingpong.net>) |
Responses |
Re: Improved ICU patch - WAS: Implementing full UTF-8
support (aka supporting 0x00)
|
List | pgsql-hackers |
<br class="" /><blockquote class="" type="cite">11 aug. 2016 kl. 11:15 skrev Palle Girgensohn <<a class="" href="mailto:girgen@pingpong.net">girgen@pingpong.net</a>>:<brclass="" /><br class="" /><blockquote class="" style="font-family:Menlo-Regular; font-size: 13px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing:normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows:auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" type="cite"><br class="" />11 aug. 2016 kl. 03:05 skrevPeter Geoghegan <<a class="" href="mailto:pg@heroku.com">pg@heroku.com</a>>:<br class="" /><br class="" />On Wed,Aug 10, 2016 at 1:42 PM, Palle Girgensohn <<a class="" href="mailto:girgen@pingpong.net">girgen@pingpong.net</a>>wrote:<br class="" /><blockquote class="" type="cite">They'vebeen used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly.In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how totest them, but also, I see little point in supporting anything else using ICU.<br class="" /></blockquote><br class=""/>Looks like you're not using the ICU equivalent of strxfrm(). While 9.5<br class="" />is not the release that introducedits use, it did expand it<br class="" />significantly. I think you need to fix this, even though it isn't<br class=""/>actually used to sort text at present, since presumably FreeBSD builds<br class="" />of 9.5 don't TRUST_STRXFRM.Since you're using ICU, though, you could<br class="" />reasonably trust the ICU equivalent of strxfrm(), sothat's a missed<br class="" />opportunity. (See the wiki page on the abbreviated keys issue [1] if<br class="" />you don'tknow what I'm talking about.)<br class="" /></blockquote><br class="" />My plan was to get it working without TRUST_STRXFRMfirst, and then add that functinality. I've made some preliminary tests using ICU:s ucol_getSortKey but I willhave to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm, the planis to use that code base.<br class="" /></blockquote><div class=""><br class="" /></div><div class="">Here are some preliminaryresults running 10000 times comparing the same two strings in a tight loop.</div><br class="" /><div class=""> ucol_strcollUTF8: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.002448<br class=""/> strcoll: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.060711<br class=""/> ucol_strcollIter: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.009221<br class=""/> direct memcmp: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.000457<br class=""/> memcpy memcmp: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.001706<br class=""/> memcpy strcoll: 1<span class="Apple-tab-span" style="white-space:pre"> </span>0.068425<br class=""/> nextSortKeyPart: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.041011<br class=""/> ucnv_toUChars + getSortKey: -1<span class="Apple-tab-span" style="white-space:pre"> </span>0.050379<br class=""/><br class="" /></div><div class=""><br class="" /></div><div class="">correct answer is -1, but since we compareåasdf and äasdf with a Swedish locale, memcmp and strcoll fails of course, as espected. Direct memcmp is 5 times fasterthan ucol_strcollUTF8 (used in my patch), but sadly the best implementation using sort keys with ICU, nextSortKeyPart,is way slower.</div><div class=""><br class="" /></div><div class=""><br class="" /></div><div class=""><brclass="" /></div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>startTime = getRealTime();<brclass="" /><span class="Apple-tab-span" style="white-space:pre"> </span>for ( int i = 0; i < loop; i++){<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>result = ucol_strcollUTF8(collator, arg1,len1, arg2, len2, &status);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>}<br class=""/><span class="Apple-tab-span" style="white-space:pre"> </span>endTime = getRealTime();<br class="" /><span class="Apple-tab-span"style="white-space:pre"> </span>printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime);<brclass="" /><br class="" /></div><div class=""><span class="Apple-tab-span" style="white-space:pre"><span class=""style="white-space: normal;"><br class="" /></span></span></div><div class=""><span class="Apple-tab-span" style="white-space:pre"><spanclass="" style="white-space: normal;"><br class="" /></span></span></div><div class=""><spanclass="Apple-tab-span" style="white-space:pre"><span class="" style="white-space: normal;">vs</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class="" style="white-space:normal;"><br class="" /></span></span></div><div class=""><span class="Apple-tab-span" style="white-space:pre"><spanclass="" style="white-space: normal;"><br class="" /></span></span></div><div class=""></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class="Apple-tab-span" style="white-space:pre;"> </span>int sortkeysize=8;<br class="" /><br class="" /><span class="" style="white-space: normal;"><spanclass="Apple-tab-span" style="white-space:pre"> </span>startTime = getRealTime();</span></span></div><div class=""><spanclass="Apple-tab-span" style="white-space:pre"><span class="" style="white-space: normal;"><span class="Apple-tab-span"style="white-space:pre"> </span>uint8_t key1[sortkeysize], key2[sortkeysize];</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space:pre"><span class="" style="white-space:normal;"><span class="Apple-tab-span" style="white-space:pre"> </span>uint32_t sState[2], tState[2];</span></span></div><divclass=""><span class="Apple-tab-span" style="white-space: pre;"> </span>UCharIterator sIter,tIter; </div><div class=""><span class="Apple-tab-span" style="white-space:pre"><br class="" /></span>for ( int i =0; i < loop; i++) {</div><div class=""><span class="Apple-tab-span" style="white-space:pre"> </span>uiter_setUTF8(&sIter,arg1, len1);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>uiter_setUTF8(&tIter,arg2, len2);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>sState[0]= 0; sState[1] = 0;<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>tState[0]= 0; tState[1] = 0;<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>ucol_nextSortKeyPart(collator,&sIter, sState, key1, sortkeysize, &status);<br class="" /><span class="Apple-tab-span"style="white-space:pre"> </span>ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize,&status);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>result = memcmp (key1,key2, sortkeysize);<br class="" /><span class="Apple-tab-span" style="white-space:pre"> </span>}<br class="" /><spanclass="Apple-tab-span" style="white-space:pre"> </span>endTime = getRealTime();<br class="" /><span class="Apple-tab-span"style="white-space:pre"> </span>printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime);<brclass="" /><br class="" /></div><div class=""><br class="" /></div><div class=""><br class="" /></div><divclass="">But in your strxfrm code in PostgreSQL, the keys are cached, and represented as int64:s if I remembercorrectly, so perhaps there is still a benefit using the abbreviated keys? More testing is required, I guess...</div><divclass=""><br class="" /></div><div class="">Palle</div><div class=""><br class="" /></div><div class=""><brclass="" /></div><div class=""><br class="" /></div>
pgsql-hackers by date: