Re: [HACKERS] Radix tree for character conversion - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: [HACKERS] Radix tree for character conversion |
Date | |
Msg-id | CAB7nPqREL1fsDBGv4zRvaXY+UKtS0wzkamJcnYhX0--OZvpUUQ@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] Radix tree for character conversion (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: [HACKERS] Radix tree for character conversion
|
List | pgsql-hackers |
On Tue, Jan 10, 2017 at 8:22 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > [...patch...] Nobody has showed up yet to review this patch, so I am giving it a shot. The patch file sizes are scary at first sight, but after having a look:36 files changed, 1411 insertions(+), 54398 deletions(-) Yes that's a surprise, something like git diff --irreversible-delete would have helped as most of the diffs are just caused by 3 files being deleted in patch 0004, making 50k lines going to the abyss of deletion. > Hello, I found a bug in my portion while rebasing. Right, that's 0001. Nice catch. > The attached files are the following. This patchset is not > complete missing changes of map files. The change is tremendously > large but generatable. > > 0001-Add-missing-semicolon.patch > > UCS_to_EUC_JP.pl has a line missing teminating semicolon. This > doesn't harm but surely a syntax error. This patch fixes it. > This might should be a separate patch. This requires a back-patch. This makes me wonder how long this script has actually not run... > 0002-Correct-reference-resolution-syntax.patch > > convutils.pm has lines with different syntax of reference > resolution. This unifies the syntax. Yes that looks right to me. I am the best perl guru on this list but looking around $$var{foo} is bad, ${$var}{foo} is better, and $var->{foo} is even better. This also generates no diffs when running make in src/backend/utils/mb/Unicode/. So no objections to that. > 0003-Apply-pgperltidy-on-src-backend-utils-mb-Unicode.patch > > Before adding radix tree stuff, applied pgperltidy and inserted > format-skipping pragma for the parts where perltidy seems to do > too much. Which version of perltidy did you use? Looking at the archives, the perl code is cleaned up with a specific version, v20090616. See https://www.postgresql.org/message-id/20151204054322.GA2070309@tornado.leadboat.com for example on the matter. As perltidy changes over time, this may be a sensitive change if done this way. > 0004-Use-radix-tree-for-character-conversion.patch > > Radix tree body. Well, here a lot of diffs could have been saved. > The unattached fifth patch is generated by the following steps. > > [$(TOP)]$ ./configure > [Unicode]$ make > [Unicode]$ make distclean > [Unicode]$ git add . > [Unicode]$ commit > === COMMITE MESSSAGE > Replace map files with radix tree files. > > These encodings no longer uses the former map files and uses new radix > tree files. > === OK, I can see that working, with 200k of maps generated.. So going through the important bits of this jungle.. +/* + * radix tree conversion function - this should be identical to the function in + * ../conv.c with the same name + */ +static inline uint32 +pg_mb_radix_conv(const pg_mb_radix_tree *rt, + int l, + unsigned char b1, + unsigned char b2, + unsigned char b3, + unsigned char b4) This is not nice. Having a duplication like that is a recipe to forget about it as this patch introduces a dependency with conv.c and the radix tree generation. Having a .gitignore in Unicode/ would be nice, particularly to avoid committing map_checker. A README documenting things may be welcome, or at least comments at the top of map_checker.c. Why is map_checker essential? What does it do? There is no way to understand that easily, except that it includes a "radix tree conversion function", and that it performs sanity checks on the radix trees to be sure that they are on a good shape. But as this something that one would guess only after looking at your patch and the code (at least I will sleep less stupid tonight after reading this stuff). --- a/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl +++ b/src/backend/utils/mb/Unicode/UCS_to_SJIS.pl# Drop these SJIS codes from the source for UTF8=>SJIS conversion#<<< donot let perltidy touch this -my @reject_sjis =( +my @reject_sjis = ( 0xed40..0xeefc, 0x8754..0x875d, 0x878a, 0x8782, - 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, + 0x8784, 0xfa5b, 0xfa54, 0x8790..0x8792, 0x8795..0x8797, 0x879a..0x879c -); + ); This is not generated, it would be nice to drop the noise from the patch. Here is another one: - $i->{code} = $jis | ( - $jis < 0x100 - ? 0x8e00 - : ($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); - +#<<< do not let perltidy touch this + $i->{code} = $jis | ($jis < 0x100 ? 0x8e00: + ($sjis >= 0xeffd ? 0x8f8080 : 0x8080)); +#>>> if (l == 2) { - iutf = *utf++ << 8; - iutf |= *utf++; + b3 = *utf++; + b4 = *utf++; } Ah, OK. This conversion is important so as it performs a minimum of bitwise operations. Yes let's keep that. That's pretty cool to get a faster operation. -- Michael
pgsql-hackers by date: