But the letter *wasn't* left out of Unicode; it's actually typed *in the article...

cplease · on March 17, 2015

> It's just internally represented as multiple codepoints

And in fact it is not, and even in the article it is U+09CE. One codepoint. If his input method irks him, he's as free to tweak it as I am to switch to Dvorak.
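For anyone who wants to check this, here is a minimal Python sketch using only the standard `unicodedata` module. `describe` is a helper made up for illustration. The three-code-point sequence is included for comparison on the assumption that it is the older ta + virama + ZWJ spelling of the same letter.

```python
import unicodedata

# The single code point cited above.
khanda_ta = "\u09CE"

# Assumed older spelling for comparison: ta + virama (hasant) + ZERO WIDTH JOINER.
legacy_sequence = "\u09A4\u09CD\u200D"

def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

print(len(khanda_ta), describe(khanda_ta))
print(len(legacy_sequence), describe(legacy_sequence))
```

The first line reports a length of 1 and the second a length of 3, so whether the letter is "multiple codepoints" depends on which spelling the input method emits, not on what Unicode encodes.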

Also folks, there's no "CJK unification" project. It's Han unification. Han characters are Han characters, just like Latin characters are Latin characters. Just because German has ß and Danish has Ø doesn't mean A stops being a Latin character and becomes, say, a French one. Not to get all Ayn Rand-y, but A is A is U+0041 in all Western European/Latin alphabets. It makes sense for 中国 and 日本 to have the same encoding in Chinese and Japanese.
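A quick sketch of the same point in Python (`code_points` is a hypothetical helper, not a library function). The string carries no language information, so the code points come out the same whichever language the text was written in.

```python
def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(code_points("A"))     # ['U+0041']
print(code_points("中国"))   # ['U+4E2D', 'U+56FD']
print(code_points("日本"))   # ['U+65E5', 'U+672C']
```

Which glyph shape gets drawn for those code points is decided later, by the font and whatever language tagging the renderer sees.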

mikekchar · on March 18, 2015

I hate to say it, but I think the author's objections seem to stem from his lack of understanding of character encoding issues. I don't know Bengali at all and so I will try to refrain from commenting on it, but I do speak and read Japanese fluently and Han Unification is a very, very good thing. Can you imagine the absolute hell you would have to go through trying to determine if place names were the same if they used different code points for identical characters -- just because of geopolitical origins?

Yes, there are some frustrating issues -- it has been historically difficult to set priorities for fonts in X and since Chinese fonts tend to have more glyphs, you often wind up with Chinese glyphs when you wanted Japanese glyphs. But this is not an encoding issue. Localization/Internationalization is really difficult. Making a separate code point for every glyph is not going to change that for the better.

Manishearth · on March 18, 2015

I feel that way too. The distinction between codepoint, glyph, grapheme, character, (...) is not an easy one, and that's what he seems to be stumbling over. Unicode concerns itself with only some of these issues; many of the others are about rendering (the job of the font) or input.
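A small Python illustration of the codepoint/grapheme gap, using a Latin example rather than a Bengali one to keep it simple. Both strings below display as the same user-perceived character, but they differ in code point count until normalized. This is only a sketch with the standard `unicodedata` module.

```python
import unicodedata

composed = "\u00E9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Same grapheme on screen, different code point sequences underneath.
print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False

# NFC normalization composes the pair into the precomposed form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Note that `len` counts code points here, not graphemes or glyphs. Counting graphemes needs segmentation rules that the Python standard library does not provide.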