> Hebrew (and I'd guess Arabic and other right-to-left languages) work rather ba...

yosefk · on March 17, 2015

Unicode defines a "logical order" and a "rendering order". The logical order is supposed to be the order in which you read the text - letters read earlier are closer to the beginning of the buffer. The rendering order is how the stuff appears on screen - where an English word inside Hebrew text, if you count the letters from right to left, will obviously have its last letter assigned a smaller number than its first letter.

This means that any program rendering Unicode text - which is stored in logical order - needs to implement the crazy "bidi rendering" algorithm which includes a guess of the "document language" (is it a mostly-English or a mostly-Hebrew text? that fact changes everything in how the character order affects on-screen positioning) and the arbitrary constant 64 for nesting levels.

In this sense Unicode has a lot to do with rendering. As to input methods...

Could your program ignore it all and just let users edit text however you choose? Well, you'd need to translate the on-screen result to a logical order that can then be rendered by someone else the way Unicode prescribes, and you'd be breaking a near-universal (though terrible) UI standard, confusing users mightily. For instance users are accustomed to Delete deleting a different character when the cursor is between two words, one English one and one Hebrew one, depending on where you came from (a single on-screen position corresponds to two logical order positions, see?) You also can't store any hints wrt how things should be interpreted in the Unicode text that are not a part of the Unicode standard, you can only store such hints in a side-band channel which is not available if you store plain text someone else might read. And if you don't do that, well, you can ignore Unicode altogether... except when handling paste commands which will force you to implement bidi in all its glory, including a guess of what the document language was.

Now the part that I don't remember (it's been a decade since I last touched this shit) is whether Unicode mandates anything wrt editing, versus just implicitly compelling you to standardize on the same methods that make sense in light of their logical & rendering orders...

BTW - would I, or some bunch of Jews, Arabs etc. with a background in programming and linguistics beat the white folks in the quality of their solution? I dunno, it's a really really tough problem because you kinda don't want to reverse words in the file, on the other hand you kinda ought to "reverse" them on the screen - that is, spell some in the opposite direction of some others. You could keep them reversed in the file - a different tradeoff - or you could have some other bidirectional rendering algorithm, perhaps with less defaults/guesses. How it "really" ought to work I don't know, obviously neither left-to-right nor right-to-left writing systems were designed with interoperability in mind. Much as I hated bidi when I dealt with it and much as I despise it every day when it routinely garbles my email etc., it's hard to expect much better...

As to your comparison of Chinese characters in Unicode and old English words in a spellchecker... The latter is what, 10000x easier to fix/work around? (Not that I know anything about Chinese; I do see people in the comments counter his point about his own language, I just can't know who is right.)

cplease · on March 17, 2015

Okay, how would you do it? If you want bidirectional mixed language support, you're not going to get around implementing bidi rendering at some layer. Character encoding is not that layer. You are making legitimate complaints, but they are largely in the realm of poor internationalization support. There may be finger-pointing between applications, toolkits, and OS middleware, but the character encoding should not be taking the bum rap for that.

Internationalization is a fundamentally difficult engineering problem and no character encoding is going to solve it.

microtherion · on March 17, 2015

You could keep them reversed in the file

I don't think you could, if the screen rendering has to do any soft line breaking.

Yes, bidirectional rendering is messy, but I don't think it's fair to blame the mess on Unicode without a clearly superior alternative.

toast0 · on March 17, 2015

> Could your program ignore it all and just let users edit text however you choose?

Sure, that's what happens in my text editor in my terminal, the unicode rendering rules are ignored, each code point gets one character on the screen, from left to right. Once things look how I expect them to in there, I test on real devices, in the real output context to make sure things are looking how they should; often I also have someone who can read and understand the text check too.

This works for me, but I'm a software developer, and I'm also not fluent in many languages, so seeing code points in logical order doesn't cause me problems with readability.