I wonder if the author has submitted a proposal to get the missing glyph for the...

bbreier · on March 17, 2015

The author's explanation of what characters Chinese, Japanese, and Korean share is very limited. All three languages use Chinese characters in written language to varying extents, and in some cases the differences arose significantly less than a century ago. Though there are cases where the same Chinese character is written differently in Japanese than in Traditional Chinese (e.g. 国, Japanese version only because I don't have Chinese installed on this PC), and may differ again in Simplified Chinese, there are also many instances where the character is identical across all three languages (e.g. 中). Although I am not privy to the specifics of the CJK unification project, identifying these cases and using the same character for them doesn't sound unreasonable.

Edit: To be clear, Korean primarily uses Hangul, which basically derives jack and shit from the Chinese alphabet, and Japanese uses a mixture of the Chinese alphabet and two alphabets that sort-of kind-of owe some of their heritage to the Chinese alphabet but look nothing like it. If they are talking about unifying these alphabets, then they are out of their minds.

yummyfajitas · on March 17, 2015

Nor is it unreasonable to "unify" Latin, Greek and Cyrillic:

Cyrillic ПФ vs Greek ΠΦ

Cyrillic АВ vs Latin AB

Obviously using ω for w (as he does) is stupid, but his reductio ad absurdum is not particularly absurd.

peterfirefly · on March 17, 2015

Not unifying them means that the fonts automatically work when you mix text/names written in these alphabets. It also means that mathematical/physical/chemical stuff (that typically uses Latin and Greek letters together) will just work. There is a similar reasoning behind all the mathematical alphabets in Unicode.

Furthermore, Unicode was supposed to handle transcoding from all important preexisting encodings to Unicode and back with no or minimal loss. Since ISO 8859-5 (Cyrillic) and 8859-7 (Greek) already existed (and both included ASCII, hence all the basic Latin letters), the ship had definitively sailed on LaGreCy unification.

On top of that, CJK unification affected so many characters that the savings would really matter, and it happened at a time when codepoints were only 16 bits, so it helped squeeze the whole thing in. All continental European languages suffered equally or worse back when all their letters had to be squeezed into 8 bits /and/ coexist with ASCII.

masklinn · on March 17, 2015

> Not unifying them means that the fonts automatically work when you mix text/names written in these alphabets. It also means that mathematical/physical/chemical stuff (that typically uses Latin and Greek letters together) will just work.

These are already completely separate symbols. Ignoring precomposition, there are at least 4 different lowercase omegas in unicode: APL (⍵ U+2375 "APL FUNCTIONAL SYMBOL OMEGA"), cyrillic (ѡ U+0461 "CYRILLIC SMALL LETTER OMEGA"), greek (ω U+03C9 "GREEK SMALL LETTER OMEGA") and Mathematics (𝜔 U+1D714 "MATHEMATICAL ITALIC SMALL OMEGA").
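The code points listed above can be checked against the Unicode Character Database. A minimal sketch in Python, assuming only the standard `unicodedata` module; the `OMEGAS` list and the `describe` helper are names made up for this illustration:

```python
import unicodedata

# The four omega-like code points quoted in the comment above.
OMEGAS = [0x2375, 0x0461, 0x03C9, 0x1D714]

def describe(cp: int) -> str:
    """Return 'U+XXXX NAME' for a code point, using the Unicode database."""
    return f"U+{cp:04X} {unicodedata.name(chr(cp))}"

for cp in OMEGAS:
    print(describe(cp))

# All four are distinct code points, so no two compare equal as strings.
assert len({chr(cp) for cp in OMEGAS}) == 4
```

Because the names come from the database that ships with the interpreter, this only confirms what that Unicode version says; it says nothing about how a given font draws them.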

jeorgun · on March 17, 2015

Unicode kind of does this already with dotless 'i': the capital of 'ı' and the lowercase of 'İ' are represented as regular Latin 'I' and 'i' respectively, despite being semantically different letters.

riffraff · on March 17, 2015

Hungarian "a" is also a separate letter from Hungarian "á" but shares the same glyph with English "a" (edit: all vowels are actually considered different letters in their accented form in Hungarian, while obviously they are the same letter with a modifier in some Latin languages).

saalweachter · on March 17, 2015

(For those who are rusty on their Greek, ω is a lower-case omega, and unrelated to the English/German letter w.)

iopq · on March 18, 2015

Cyrillic ПФ vs Greek ΠΦ?

Here's Cyrillic lower case: пф Here's Greek lower case: πφ

in some fonts the pi would be rendered with a longer bar on top, but you just showed why it's a bad idea:

I would want to be able to discuss Greek in Russian on a forum, but this would not be possible because all the glyphs in lowercase would look Russian

claudius · on March 18, 2015

It makes perfect sense to me for the forms 𝜙 and 𝜑 of "lowercase phi" to have different codepoints. That doesn’t mean that upper-case variants of these can’t share a codepoint. As presented elsewhere in this thread, "X.toUpper().toLower()" doesn’t have to be "X". The same holds for "B → b" and "B → β" depending on the context. It’s just that the savings from such a unification would be far smaller due to the smaller sizes of the relevant alphabets.
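The point that an upper-then-lower round trip need not return the original text already holds in today's Unicode. A small sketch, assuming Python's built-in, locale-independent `str.upper()` and `str.lower()`; `round_trip` is a hypothetical helper name, and output is printed with `!a` so it stays ASCII-safe:

```python
def round_trip(s: str) -> str:
    """Uppercase, then lowercase, using Python's locale-independent mappings."""
    return s.upper().lower()

for original in ["\u00df", "\u0131", "abc"]:   # sharp s, dotless i, plain ASCII
    result = round_trip(original)
    print(f"{original!a} -> {result!a} (unchanged: {result == original})")

# Latin B and Greek Beta look alike but are separate code points today,
# so each lowercases to its own script's letter without needing context.
LATIN_B, GREEK_BETA = "B", "\u0392"
print(f"{LATIN_B.lower()!a} vs {GREEK_BETA.lower()!a}")
```

Under the hypothetical unification discussed above, the last two lines would need language context to pick between "b" and "β"; with separate code points they do not.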

iopq · on March 21, 2015

OK, but you still have a problem because I want to use the same font for Greek and Russian. What if my font is CURSIVE?

Russian and Greek have different cursive forms. You might unify κ and к, but actually the cursive form of κ looks like a Roman u.

So really if this were to happen you'd have "Russian" fonts and "Greek" fonts. Kind of like how Japanese and Chinese have to use different fonts for their languages.

microcolonel · on March 17, 2015

I think their distinct calligraphic representations mean that these would be destructive; whereas with regard to CJK, the characters are clearly represented the same way between the considered languages.

EdiX · on March 17, 2015

> If they are talking about unifying these alphabets, then they are out of their minds.

AFAIK the author is just discussing han unification:

http://en.wikipedia.org/wiki/Han_unification

chris_wot · on March 18, 2015

According to Wikipedia, this is being coordinated by the Ideographic Rapporteur Group, and "the working members of the IRG are either appointed by member governments, or are invited experts from other countries. IRG members include Mainland China, Hong Kong, Macau, Taipei Computer Association, Singapore, Japan, South Korea, North Korea, Vietnam and United States."

So this criticism of English speakers seems pretty unfounded! And the unification he is concerned about is being driven by a diverse group of experts in a variety of countries, so I'm not sure why the concern?

kijin · on March 18, 2015

Exactly. The Han Unification project never tried to unify everything. They just took the set of common characters and unified them, leaving the rest alone. They may have made some mistakes in choosing which characters to unify, but for the most part they did a splendid job.

Korean (Hangul) has its own, massive block of over 11K code points. Japanese (Hiragana, Katakana, and other assorted symbols) also has its own block outside of the Unified Han block. Chinese characters that are clearly distinct get their own code points as well. How else would I write 国 and 國 in the same sentence?
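To make the block layout concrete, here is a rough sketch in Python. The `BLOCKS` table is a hand-copied subset of Unicode block ranges and `block_of` is a hypothetical helper, so treat it as an illustration rather than a complete lookup:

```python
# Hand-copied subset of Unicode block ranges (inclusive); not a complete table.
BLOCKS = {
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "Hangul Syllables": (0xAC00, 0xD7AF),
}

def block_of(ch: str) -> str:
    """Return the name of the block (from BLOCKS) containing ch, or 'other'."""
    cp = ord(ch)
    for name, (lo, hi) in BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return "other"

# U+56FD and U+570B are the two forms of "country" mentioned above;
# the remaining samples are one hiragana, one katakana, one Hangul syllable.
for ch in "\u56fd\u570b\u3042\u30ab\ud55c":
    print(f"U+{ord(ch):04X} -> {block_of(ch)}")

# Assigned precomposed Hangul syllables run from U+AC00 to U+D7A3:
# 19 initial x 21 medial x 28 final combinations.
HANGUL_SYLLABLE_COUNT = 0xD7A3 - 0xAC00 + 1
print(HANGUL_SYLLABLE_COUNT)
```

The two "country" characters land in the same block but at different code points, which is why both can appear in one sentence.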

legulere · on March 17, 2015

Also Antiqua and Fraktur used to be seen as different writing systems (ſ, I and J being equivalent, tironian et being some examples where they differ), yet this is largely ignored by Unicode (except when used in mathematics)

nothrabannosir · on March 17, 2015

tangent, but isn't 国 simplified and 國 traditional guó?

bbreier · on March 17, 2015

Yes, that's exactly right. In this case, Japanese uses the simplified version, but in others it uses the traditional version (or even a version that is slightly different from the current traditional version used in Taiwan or Cantonese.)

psychometry · on March 17, 2015

I found it extremely annoying that he doesn't even say specifically what the problem with his name is. So there's a letter that's unavailable? Which letter?

theandrewbailey · on March 17, 2015

> Even today, I am forced to do this when writing my own name. My name is not only a common Indian name, but one of the top 1,000 names in the United States as well. But the final letter has still not been given its own Unicode character, so I have to use a substitute.

Not as descriptive as it could be, but this article isn't about him.

BillinghamJ · on March 17, 2015

Yet the title is solely about him.

vilhelm_s · on March 17, 2015

It sounds like the glyph is in unicode already, but expressed using combining characters?

nemo · on March 17, 2015

His writing there is pretty confusing. He started by complaining about a glyph that was missing until 2005, but was either fixed in 2005 or approximated by combining some characters 'ত + ্ + ‍ = ‍ৎ'. He doesn't really make it very clear whether ৎ is a substitute for a glyph, or whether it's the correct glyph and a case of an input system not making that easy to enter, but it seems like the glyph was added in 2005 and he's complaining about the input method. Assuming it is a case of a clunky input system, then pointing the finger at the Unicode consortium seems pretty weak, since so far as I understand it, various OS vendors/app platforms handle that implementation.

breadbox · on March 17, 2015

Imagine if the letter Q had been left out of Unicode's Latin alphabet. The argument against it is that it can be written with a capital O combined with a comma. (That's going to play hell with naive sorting algorithms, of course, but oh well.) Oh, and also imagine your name is Quentin.

icebraining · on March 17, 2015

But the letter wasn't left out of Unicode; it's actually typed in the article. It's just internally represented as multiple codepoints, much like one of parts of my name (é) may be.

Frankly, this is irrelevant to the actual problem, which is the input system, and which has nothing to do with Unicode. Nothing prevents a single key from typing multiple codepoints at once.

cplease · on March 17, 2015

> It's just internally represented as multiple codepoints

And in fact it is not, and even in the article it is U+09CE. One codepoint. If his input method irks him, he's as free to tweak it as I am to switch to Dvorak.
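For anyone who wants to see the difference between the two spellings discussed in this subthread, a minimal Python sketch follows. It assumes the three-part sequence quoted earlier is TA + VIRAMA + ZERO WIDTH JOINER; `SEQUENCE` and `KHANDA_TA` are names chosen for this example:

```python
import unicodedata

SEQUENCE = "\u09a4\u09cd\u200d"   # TA + VIRAMA + ZERO WIDTH JOINER (older spelling)
KHANDA_TA = "\u09ce"              # the single code point cited above

for label, text in [("sequence", SEQUENCE), ("single", KHANDA_TA)]:
    points = [f"U+{ord(ch):04X}" for ch in text]
    names = [unicodedata.name(ch) for ch in text]
    print(label, points, names)

# The two spellings are different strings, and NFC normalization does not
# turn one into the other, so software comparing them sees two different texts.
same_after_nfc = unicodedata.normalize("NFC", SEQUENCE) == KHANDA_TA
print(same_after_nfc)
```

So the letter exists as one code point; whether a keyboard layout emits it in one keystroke is a separate question for the input method.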

Also folks, there's no "CJK unification" project. It's Han unification. Han characters are Han characters, just like Latin characters are Latin characters. Just because German has ß and Danish has Ø doesn't mean A isn't a Latin character and not, say, a French one. Not to get all Ayn Rand-y, but A is A is U+0041 in all Western European/Latin alphabets. It makes sense for 中国 and 日本 to have the same encoding in Chinese and Japanese.

mikekchar · on March 18, 2015

I hate to say it, but I think the author's objections seem to stem from his lack of understanding of character encoding issues. I don't know Bengali at all and so I will try to refrain from commenting on it, but I do speak and read Japanese fluently and Han Unification is a very, very good thing. Can you imagine the absolute hell you would have to go through trying to determine if place names were the same if they used different code points for identical characters -- just because of geopolitical origins?

Yes, there are some frustrating issues -- it has been historically difficult to set priorities for fonts in X and since Chinese fonts tend to have more glyphs, you often wind up with Chinese glyphs when you wanted Japanese glyphs. But this is not an encoding issue. Localization/Internationalization is really difficult. Making a separate code point for every glyph is not going to change that for the better.

Manishearth · on March 18, 2015

I feel that way too. The distinction between codepoint, glyph, grapheme, character, (...) is not an easy one, and that's what he seems to be stumbling over. Unicode worries itself about only some of these issues, many of the other issues are about rendering (the job of the font) or input.

vilhelm_s · on March 17, 2015

Combining characters are not just used for Bengali though. E.g. umlauted letters in European languages can also be expressed using combining characters, and implementations need to deal with those when sorting.
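A short Python sketch of what "dealing with those when sorting" can look like, assuming the standard `unicodedata.normalize` function. The word list and the `nfc` helper are invented for the example, and sorting by code point after normalization is still not real language-aware collation:

```python
import unicodedata

precomposed = "\u00fc"     # u with diaeresis as one code point
decomposed = "u\u0308"     # plain u followed by COMBINING DIAERESIS

def nfc(s: str) -> str:
    """Normalize to NFC so equivalent spellings become the same string."""
    return unicodedata.normalize("NFC", s)

# The first two entries are the same name spelled with different code points.
words = ["M\u00fcller", "Mu\u0308ller", "Muller", "Mv"]

naive = sorted(words)               # raw code point order
normalized = sorted(words, key=nfc) # compare NFC forms instead

print([f"{w!a}" for w in naive])
print([f"{w!a}" for w in normalized])
```

In the naive order the two spellings of the same name end up separated by "Mv"; with the NFC key they sort next to each other.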

estebank · on March 17, 2015

> Imagine if the letter Q had been left out of Unicode's Latin alphabet.

To properly write my European last name I have to press between 2 and 4 different simultaneous keys, depending on the system. Han unification is beyond misguided, but combining characters is not the problem.

frivoal · on March 17, 2015

Han unification as a whole is misguided? I'll grant you that some characters which were unified probably shouldn't have been, and maybe some that should have been weren't, but what's the argument for the whole thing to be misguided?

Should Norwegian A and English A be different Unicode code points just because Norwegian also has Ø, proving that it is a different writing system? You may want to debate whether i and ı should be the same letter (they aren't), but most letters in the Turkish alphabet are the same as the letters in the English alphabet.

estebank · on March 18, 2015

Well, the Turkish i/ı/I/İ is, I think, exactly the example I would have come up with of characters that look the same as i/I but should have their own code points, just like Cyrillic characters have their own code points despite looking like Latin characters.

frivoal · on March 18, 2015

Absolutely. So i/ı/I/İ do have their own codepoints. But the rest of the letters, which are the same, don't. Just like Han unification. Letters which are the same are the same, and those which are not are not, even if they look pretty close.

estebank · on March 18, 2015

The thing is that the Turkish "i" and "I" don't have their own codepoints; they share the ones used for Latin "i" and "I", when they should have been their own codepoints representing the same glyphs. That way going from I to ı and from i to İ wouldn't be a locale-dependent problem.
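To illustrate why this ends up locale dependent, here is a sketch in Python. Python's built-in `str.upper()` and `str.lower()` ignore locale, so the Turkish-aware helpers `upper_tr` and `lower_tr` below are hypothetical functions written for this example, not part of any library:

```python
# Turkish case pairs: i <-> U+0130 (dotted capital), U+0131 (dotless) <-> I.
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})

def upper_tr(s: str) -> str:
    """Uppercase with Turkish i handling, then default rules for the rest."""
    return s.translate(_TR_UPPER).upper()

def lower_tr(s: str) -> str:
    """Lowercase with Turkish i handling, then default rules for the rest."""
    return s.translate(_TR_LOWER).lower()

word = "istanbul"
print(f"{word.upper()!a}")     # default mapping: dot is lost
print(f"{upper_tr(word)!a}")   # Turkish mapping: dotted capital kept
print(f"{lower_tr('ISPARTA')!a}")
```

The same code point U+0069 has to map to two different capitals depending on language, which is the cost of sharing it with Latin "i".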

ptaipale · on March 17, 2015

When Chinese linguists came up with hanyu pinyin, they specifically wanted to pick up Latin characters (1) for Chinese phonetics, so that Chinese phonetic writing could use what we'd call "white men's writing system".

Now, they did use the letter Q for the sound tɕʰ that was formerly often romanized as "ch". It is not really a "k" as Q is in English.

Are people now saying that hanyu pinyin should use a different coding to English, because it would be more "respectable" for non-English languages to have their own code points even if the character has same roots and appearance? That is absolutely pointless. The whole idea of using Q for tɕʰ is that you can use the same letter, same coding, same symbol as in English.

(1) OK they did add ü to the mix, although that is usually only used in romanization in linguistics or textbooks, and regular pinyin just replaces it with u.

jsmthrowaway · on March 17, 2015

My first choice as theoretical Quentin wouldn't be "how can I frame this accidental, perhaps even flagrantly disrespectful omission as antiprogressive and dissect the credentials, experience, and ethnicity of the people who made the mistake via culture essay," it would probably be "where do I issue a pull request to fix this mistake or in what way can I help?"

Maybe that's just me. I look forward to the future where any mistake not involving a straight white Anglo-Saxon man or his customs can be built up as antiprogressive agenda, and the best advocacy is taking the people who made them down rather than fixing the problem that is the, you know, problem.

(As an aside, imagine my surprise to see a Model View Culture link on HN given how much MVC absolutely hates and criticizes HN, including a weekly "worst of HN" comment dissection.)

kens · on March 18, 2015

Anyone can propose the addition of a new character to Unicode. It doesn't take $18,000 as some people think. You just need to convince the Unicode Consortium that it makes sense (preferably with solid evidence on use of the character). The process is discussed at: http://unicode.org/pending/proposals.html

I have a proposal of my own in the works to add a character to Unicode, so I'll see how it goes. There's a discussion of how someone successfully got the power symbol added to Unicode at https://github.com/jloughry/Unicode so take a look if you're thinking of proposing a character.

__david__ · on March 17, 2015

> …including a weekly "worst of HN" comment dissection.

That sounds interesting, but I can't find any reference to it on their site or search engines. Do you have a link?

jsmthrowaway · on March 18, 2015

They appear to have deleted at least some. Here's one.

https://web.archive.org/web/20140724125906/http://modelviewc...

When you see a piece on MVC by "The Editors," that's code for Shanley Kane writing under an anonymous byline.

ohjesusthatguy · on March 17, 2015

"My first choice as theoretical Quentin wouldn't be "how can I frame this accidental, perhaps even flagrantly disrespectful omission as antiprogressive and dissect the credentials, experience, and ethnicity of the people who made the mistake via culture essay," it would probably be "where do I issue a pull request to fix this mistake or in what way can I help?" "

I could tell you, but I'll need $18,000 first.

ohjesusthatguy · on March 18, 2015

[flagged]

ohjesusthatguy · on March 18, 2015

Just because I'm harsh doesn't mean I don't love you guys :)

ohjesusthatguy · on March 18, 2015

Hey how do I get downvotes? No fair! :) Answer: I could tell you, but you need 2070 karma first :) :)

dang · on March 19, 2015

From https://news.ycombinator.com/newsguidelines.html:

"Resist commenting about being downvoted. It never does any good, and it makes boring reading.

Please don't bait other users by inviting them to downvote you."

discardorama · on March 17, 2015

> I wonder if the author has submitted a proposal to get the missing glyph for their name added.

Probably not. If he did, he wouldn't be able to rant about the injustices of the White Man, now, would he?

gerbal · on March 17, 2015

Holy cow CJK unification is a terrible idea. Maybe if it originated from the CJK governments, it might be an OK idea, but the idea of Western multinationals trying to save Unicode space by disregarding the distinctness of a whole language group is idiotic.

The fundamental role of an institution like the Unicode Consortium is to be descriptive, not prescriptive. If there is a human script, passing certain, low, low thresholds, it should be eligible to be included in its distinct whole in Unicode.

mrgriscom · on March 17, 2015

To oppose Han unification is to say that an 'a' in English and an 'a' in French should be different code points, because they're different languages.

Alternatively, if Unicode directly encoded words, rather than letters, of Western languages akin to ideographs in East Asian languages, it's like arguing that 'color' and 'colour' should be separate code points.

astrange · on March 17, 2015

The Han unification working group (IRG) members were in fact from Asia and appointed by their governments. Why would you think otherwise?

Apparently this even includes North Korea.

fixermark · on March 17, 2015

I don't disagree, while observing that it plays hell with security if (forgive my use of Latin; I don't have a codepoint translator handy) 'bat.com' and 'bat.com' are two different websites because the 'a' in the first is a Chinese-a and the 'a' in the second is a Korean-a.
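The same hazard already exists between scripts that were not unified. A sketch in Python that swaps in a Cyrillic "а" (U+0430) for the Latin one, rather than the hypothetical Chinese-a/Korean-a above; `scripts` is a made-up helper that guesses a script from the first word of each character's Unicode name, which is only a rough heuristic:

```python
import unicodedata

latin = "bat.com"
mixed = "b\u0430t.com"   # U+0430 CYRILLIC SMALL LETTER A replaces Latin 'a'

def scripts(s: str) -> set:
    """Rough script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in s if ch.isalpha()}

print(latin == mixed)                    # visually alike, but not equal
print(sorted(scripts(latin)), sorted(scripts(mixed)))

# Python's built-in "idna" codec (IDNA 2003) shows the two names differ on
# the wire: the mixed-script label becomes an ASCII "xn--" form.
print(latin.encode("idna"))
print(mixed.encode("idna"))
```

Flagging labels that mix scripts is one common mitigation, though it does not catch whole-label lookalikes.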

(Of course, this calls into question the wisdom of expanding DNS into the Unicode space in the first place---a space that does nothing like guarantee 1-to-1 association between visual glyph and code for an application that has been built on the assumption that different codes are visually distinguishable. But that ship has sailed).