Re: Radix tree for character conversion - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Radix tree for character conversion |
Date | |
Msg-id | da58a154-0b28-802e-5e82-5a205f53926e@iki.fi Whole thread Raw |
In response to | Re: Radix tree for character conversion (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: Radix tree for character conversion
Re: Radix tree for character conversion |
List | pgsql-hackers |
On 11/09/2016 10:38 AM, Kyotaro HORIGUCHI wrote: > Thanks. The attached patch contains the patch by perlcritic. > > 0001,2,3 are Heikki's patch that are not modified since it is > first proposed. It's a bit too big so I don't attach them to this > mail (again). > > https://www.postgresql.org/message-id/08e7892a-d55c-eefe-76e6-7910bc8dd1f3@iki.fi I've now pushed these preliminary patches, with the applicable fixes from you and Daniel. The attached patch is now against git master. > 0004 is radix-tree stuff, applies on top of the three patches > above. I've spent the last couple of days reviewing this. While trying to understand how it works, I ended up dismantling, rewriting, and putting back together most of the added perl code. Attached is a new version, with more straightforward logic, making it more understandable. I find it more understandable, anyway, I hope it's not only because I wrote it myself :-). Let me know what you think. In particular, I found the explanations of flat and segmented tables really hard to understand. So in this version, the radix trees for a conversion are stored completely in one large array. Leaf and intermediate levels are all in the same array. When reading this version, please note that I'm not sure if I mean the same thing with "segment" that you did in your version. I moved the "lower" and "upper" values in the structs. Also, there are now also separate "lower" and "upper" values for the leaf levels of the trees, for 1- 2-, 3- and 4-byte inputs. This made a huge difference to the size of gb18030_to_utf8_radix.map, in particular: the source file shrank from about 2 MB to 1/2 MB. In that conversion, the valid range for the last byte of 2-byte inputs is 0x40-0xfe, and the valid range for the last byte of 4-byte inputs is 0x30-0x39. With the old patch version, the "chars" range was therefore 0x30-0xfe, to cover both of those, and most of the array was filled with zeros. With this new patch version, we store separate ranges for those, and can leave out most of the zeros. There's a segment full of zeros at the beginning of each conversion array now. The purpose of that is that when traversing the radix tree, you don't need to check each intermediate value for 0. If you follow a 0 offset, it simply points to the dummy all-zeros segments in the beginning. Seemed like a good idea to shave some cycles, although I'm not sure if it made much difference in reality. I optimized pg_mb_radix_conv() a bit, too. We could do more. For example, I think it would save some cycles to have specialized versions of UtfToLocal and LocalToUtf, moving the tests for whether a combined character map and/or conversion callback is used, out of the loop. They feel a bit ugly too, in their current form... I need a break now, but I'll try to pick this up again some time next week. Meanwhile, please have a look and tell me what you think. - Heikki
Attachment
pgsql-hackers by date: