Jeff Davis wrote at 2025-07-31 02:58:
Apologies for the late reply to the review.
> First, it doesn't mention the "builtin" provider, which uses the same
> word break rules as libc.
I completely forgot about the builtin provider in the first patch, my bad.
> Second, word boundaries can be complex, and I'm wondering if we should
> not be so precise about what ICU does or doesn't do. For instance, ICU
> has options like U_TITLECASE_ADJUST_TO_CASED,
> U_TITLECASE_NO_BREAK_ADJUSTMENT, etc., and I'm not sure exactly
> which one of those we use.
While [1] describes the default word boundary rules and could be useful
as a starting point, I agree that in reality it is probably more
complicated. I could not find any place where
U_TITLECASE_ADJUST_TO_CASED and the like are set in non-test code, but
U_TITLECASE_ADJUST_TO_CASED was the default behavior prior to ICU 60,
so initcap() will also behave differently depending on the ICU version.
> I'd prefer that we try to explain that INITCAP() is intended for
> convenient display, and the specific result should not be relied upon
> (at least for ICU; maybe for all providers). If you want specific word
> boundary rules, write your own function.
The first patch just adds this warning about not relying on the exact
result of initcap(). The second one is the same, but removes the part
describing "what is a word", since it could be moot: we recommend
writing custom functions, so understanding what a word is isn't
strictly needed. I'm still on the fence about which patch is better,
though.
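
To make the "write your own function" suggestion concrete, here is a
minimal sketch of a custom function with one explicit word boundary
rule. It is written in Python purely for illustration. The name
initcap_alnum and the rule itself (a word is a maximal run of
alphanumeric characters) are assumptions chosen for the example, not
what either patch or any provider implements.

```python
def initcap_alnum(text: str) -> str:
    """Capitalize the first character of each word, lowercase the rest.

    Assumed rule (hypothetical, for illustration only): a word is a
    maximal run of characters for which str.isalnum() is true; every
    other character is a separator and is copied through unchanged.
    """
    out = []
    at_word_start = True
    for ch in text:
        if ch.isalnum():
            # Uses plain upper()/lower() rather than a titlecase mapping,
            # so some characters may expand (for example, "ß" uppercases
            # to "SS").
            out.append(ch.upper() if at_word_start else ch.lower())
            at_word_start = False
        else:
            out.append(ch)
            at_word_start = True
    return "".join(out)


if __name__ == "__main__":
    print(initcap_alnum("hello wORLD"))
```

Because the rule is spelled out in the function itself, the result does
not depend on the collation provider or the ICU version, which is the
point of the recommendation.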
Thoughts?
[1]: https://www.unicode.org/reports/tr29/#Word_Boundaries
Regards, Oleg Tselebrovskiy