Unicode grapheme clusters - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Unicode grapheme clusters |
Date | |
Msg-id | Y8iMr2wi2ABrOSBH@momjian.us |
Responses |
Re: Unicode grapheme clusters
Re: Unicode grapheme clusters |
List | pgsql-hackers |
Just my luck, I had to dig into a two-"character" emoji that came to me
as part of a Google Calendar entry --- here it is:

    👩🏼⚕️🩺

                           libc
    Unicode     UTF8       len
    U+1F469  f0 9f 91 a9    2   woman
    U+1F3FC  f0 9f 8f bc    2   emoji modifier fitzpatrick type-3 (skin tone)
    U+200D   e2 80 8d       0   zero width joiner (ZWJ)
    U+2695   e2 9a 95       1   staff with snake
    U+FE0F   ef b8 8f       0   variation selector-16 (VS16) (previous character as emoji)
    U+1FA7A  f0 9f a9 ba    2   stethoscope

Now, in Debian 11 character apps like vi, I see:

    a woman(2) - a black box(2) - a staff with snake(1) - a stethoscope(2)

Display widths are in parentheses. I also see '<200d>' in blue.

In current Firefox, I see a woman with a stethoscope around her neck,
and then a stethoscope. Copying the Unicode string above into a browser
URL bar should show you the same thing, though it might be too small to
see.

For those looking for details on how these should be handled, see this
for an explanation of grapheme clusters that use things like skin tone
modifiers and zero-width joiners:

    https://tonsky.me/blog/emoji/

These comments explain the confusion of the term character:

    https://stackoverflow.com/questions/27331819/whats-the-difference-between-a-character-a-code-point-a-glyph-and-a-grapheme

and I think this comment summarizes it well:

    https://github.com/kovidgoyal/kitty/issues/3998#issuecomment-914807237

    This is by design. wcwidth() is utterly broken. Any terminal or
    terminal application that uses it is also utterly broken. Forget
    about emoji wcwidth() doesn't even work with combining characters,
    zero width joiners, flags, and a whole bunch of other things.

I decided to see how Postgres, without ICU, handles it:

    show lc_ctype;
      lc_ctype
    -------------
     en_US.UTF-8

    select octet_length('👩🏼⚕️🩺');
     octet_length
    --------------
               21

    select character_length('👩🏼⚕️🩺');
     character_length
    ------------------
                    6

The octet_length() is verified as correct by counting the UTF8 bytes
above.
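The byte and code point counts can be cross-checked outside Postgres. What follows is a minimal Python sketch, not anything Postgres or libc actually runs: it rebuilds the string from the six code points in the table, re-derives the UTF8 bytes, and adds up the libc widths. The widths are an assumption in the sense that they are copied by hand from the table above rather than obtained from wcwidth(), and the name SEQUENCE is just illustrative.

```python
# Code points of the emoji sequence. The widths are copied from the
# "libc len" column of the table in this message; they are NOT computed
# here, since Python's standard library has no wcwidth() equivalent.
SEQUENCE = [
    (0x1F469, 2, "woman"),
    (0x1F3FC, 2, "emoji modifier fitzpatrick type-3 (skin tone)"),
    (0x200D, 0, "zero width joiner (ZWJ)"),
    (0x2695, 1, "staff with snake"),
    (0xFE0F, 0, "variation selector-16 (VS16)"),
    (0x1FA7A, 2, "stethoscope"),
]

# Rebuild the string from its code points.
text = "".join(chr(cp) for cp, _, _ in SEQUENCE)

# Same quantity as octet_length(): number of UTF8 bytes.
octet_length = len(text.encode("utf-8"))

# Same quantity as character_length(): number of code points,
# whether or not they are displayed.
character_length = len(text)

# Width a per-code-point summation arrives at, given the table's widths.
summed_width = sum(width for _, width, _ in SEQUENCE)

for cp, width, name in SEQUENCE:
    utf8 = chr(cp).encode("utf-8").hex(" ")
    print(f"U+{cp:04X}  {utf8:<11}  {width}  {name}")

print(octet_length, character_length, summed_width)
```

The three totals are the byte count, the code point count, and the per-code-point width sum. None of them is a grapheme cluster count, which is the quantity a renderer that understands ZWJ sequences effectively works with.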
I think character_length() is correct if we consider the number of
Unicode characters, display and non-display.

I then started looking at how Postgres computes and uses _display_
width. The display width, when properly processed, as by Firefox, is 4
(two double-wide displayed characters). Based on the libc display
lengths above and the incorrect displayed character lengths in Debian
11, it would be 7.

libpq has PQdsplen(), which calls pg_encoding_dsplen(), which then calls
the per-encoding width function stored in pg_wchar_table.dsplen --- for
UTF8, the function is pg_utf_dsplen().

There is no SQL API for display length, but the PQdsplen() code path can
be exercised on a whole string by calling pg_wcswidth() from the gdb
debugger:

    pg_wcswidth(const char *pwcs, size_t len, int encoding)
    UTF8 encoding == 6

    (gdb) print (int)pg_wcswidth("abcd", 4, 6)
    $8 = 4
    (gdb) print (int)pg_wcswidth("👩🏼⚕️🩺", 21, 6)
    $9 = 7

Here is the psql output:

    SELECT octet_length('👩🏼⚕️🩺'), '👩🏼⚕️🩺', character_length('👩🏼⚕️🩺');
     octet_length | ?column? | character_length
    --------------+----------+------------------
               21 | 👩🏼⚕️🩺  |                6

More often called from psql are pg_wcssize() and pg_wcsformat(), which
also call PQdsplen().

I think the question is whether we want to report a string width that
assumes the display doesn't understand the more complex UTF8
controls/"characters" listed above.

tsearch has p_isspecial(), which calls pg_dsplen(), which also uses
pg_wchar_table.dsplen. p_isspecial() also has a small table of what it
calls "strange_letter".

Here is a report about Unicode variation selector and combining
characters from May, 2022:

    https://www.postgresql.org/message-id/flat/013f01d873bb%24ff5f64b0%24fe1e2e10%24%40ndensan.co.jp

Is this something people want improved?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

Embrace your flaws.  They make you human, rather than perfect,
which you will never be.