language agnostic - Fully correct Unicode visual string reversal
[Inspired largely by trying to explain the problems with "Character encoding independent character swap", and also by these other questions, neither of which contains a complete answer: "How to reverse a Unicode string", "How to get a reversed String (unicode safe)"]
Doing a visual string reversal in Unicode is much harder than it looks. In any storage format other than UTF-32 you have to pay attention to code point boundaries rather than going byte-by-byte. But that's not good enough, because of combining glyphs; the spec has a concept of "grapheme cluster" that's closer to the basic unit you want to be reversing. But that's still not good enough; there are all sorts of special case characters, like bidi overrides and final forms, that will have to be fixed up.
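To make the combining-glyph problem concrete, here is a small Python sketch comparing a per-code-point reversal with a per-cluster one. The function names are mine, and the cluster rule used here (a base character followed by any characters in a Unicode "Mark" category) is only an approximation of the real grapheme cluster rules, so treat it as an illustration rather than a conforming segmenter.

```python
import unicodedata


def naive_reverse(s: str) -> str:
    """Reverse code point by code point (Python strings index by code point)."""
    return s[::-1]


def cluster_reverse(s: str) -> str:
    """Reverse by approximate grapheme cluster.

    ASSUMPTION: a cluster is one base character plus any following
    characters whose general category is a Mark (Mn, Mc, Me). The full
    rules in the Unicode spec cover many more cases.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


word = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"
print(ascii(naive_reverse(word)))    # the accent ends up attached to "a"
print(ascii(cluster_reverse(word)))  # the accent stays with "e"
```

The per-code-point version is already wrong for this three-code-point string, which is why the rest of this question works in terms of clusters.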
This pseudo-algorithm handles all the easy cases I know about:
- Segment the string into an alternating list of words and word-separators (some word-separators may be the empty string).
- Reverse the order of the list.
- For each string in the list:
  - Segment the string into grapheme clusters.
  - Reverse the order of the grapheme clusters.
  - Check the initial and final cluster in the reversed sequence; their base characters may need to be reassigned to the correct form (e.g. if U+05DB HEBREW LETTER KAF is now at the end of the sequence it needs to become U+05DA HEBREW LETTER FINAL KAF, and vice versa).
  - Join the sequence back into a string.
- Recombine the list of reversed words to produce the final reversed string.
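The steps above can be sketched in Python roughly as follows. This is a minimal sketch under several assumptions, not a complete solution: words are split on runs of whitespace instead of real Unicode word boundaries, grapheme clusters are approximated as a base character plus trailing Mark-category characters (with CR LF kept together), and the only form fix-up is a hypothetical table of the five Hebrew letters that have final forms. All function and table names are made up for illustration.

```python
import re
import unicodedata

# ASSUMPTION: only Hebrew final forms are handled; regular -> final.
HEBREW_TO_FINAL = {
    "\u05DB": "\u05DA",  # KAF -> FINAL KAF
    "\u05DE": "\u05DD",  # MEM -> FINAL MEM
    "\u05E0": "\u05DF",  # NUN -> FINAL NUN
    "\u05E4": "\u05E3",  # PE -> FINAL PE
    "\u05E6": "\u05E5",  # TSADI -> FINAL TSADI
}
HEBREW_FROM_FINAL = {final: reg for reg, final in HEBREW_TO_FINAL.items()}


def grapheme_clusters(s):
    """Approximate segmentation: base + trailing marks, CR LF kept whole."""
    clusters = []
    for ch in s:
        is_mark = unicodedata.category(ch).startswith("M")
        is_crlf = ch == "\n" and bool(clusters) and clusters[-1] == "\r"
        if clusters and (is_mark or is_crlf):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def remap_base(cluster, table):
    """Swap the cluster's base character if the table has an entry for it."""
    return table.get(cluster[0], cluster[0]) + cluster[1:]


def reverse_word(word):
    clusters = grapheme_clusters(word)
    clusters.reverse()
    if len(clusters) > 1:
        # The old last cluster is now first: a final form becomes regular.
        clusters[0] = remap_base(clusters[0], HEBREW_FROM_FINAL)
        # The old first cluster is now last: a regular form becomes final.
        clusters[-1] = remap_base(clusters[-1], HEBREW_TO_FINAL)
    return "".join(clusters)


def visual_reverse(text):
    # The capturing group keeps separators, giving an alternating
    # word / separator list (possibly with empty strings at the ends).
    pieces = re.split(r"(\s+)", text)
    pieces.reverse()
    return "".join(reverse_word(piece) for piece in pieces)


print(visual_reverse("ab cd"))
print(ascii(visual_reverse("\u05DE\u05DC\u05DA")))  # MEM LAMED FINAL-KAF
```

Even within Hebrew the fix-up is a simplification, since a word-final letter is not always written in its final form, and nothing here looks at bidi control characters at all.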
... but it doesn't handle bidi overrides, and I'm sure there's stuff I don't know about as well. Can anyone fill in the gaps?