English is the best candidate because it has the second-largest user base (1.2 billion vs. 1.3 billion for Mandarin), http://en.wikipedia.org/wiki/List_of_languages_by_total_numb...
and has twice as many speakers as the third most popular language, Spanish (0.55 billion).
If I got to pick the universal language, it would be Lojban (a few hundred speakers), but that is not a realistic goal; teaching the other 6 billion people a language already spoken by 1/7th of the population is at least plausible.
> Why would you want that...
Why would you not want that?! Many popular programming languages are built around array indexing through pointer arithmetic, and a variable-width encoding is a horrible idea there, because you have to iterate through the text just to reach an index.
Length is the number of characters, which is just the number of bytes in ASCII, but has to be calculated by looking at every character in UTF-8.
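A minimal sketch of that point (the string literal is just an illustrative example): in UTF-8 the byte length and the character (codepoint) count diverge, so the count has to be obtained by decoding, i.e. scanning the bytes.

```python
# Byte length vs. codepoint count in UTF-8.
s = "naïve"                 # 5 codepoints
data = s.encode("utf-8")    # 'ï' (U+00EF) encodes as two bytes

print(len(data))  # 6 -- raw byte length, O(1) to read off
print(len(s))     # 5 -- codepoint count, obtained by decoding the bytes
```

In ASCII the two numbers are always equal, which is why byte-indexed string code works there without further thought.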
Even if 1.2 billion seems like a lot, that's still a small fraction of the world's population. Any choice of universal language would force the majority of the world to learn a new one. That's why I think winning a popularity contest is a poor argument; we shouldn't look at that, and should instead focus on things like simplicity (which I don't find in English), speed of learning, consistency, expressiveness, etc. I'd be happy to use Lojban (it's easier for machines too, I guess) or any other invented language. If I had to pick one of the popular ones, I'd prefer Spanish to English.
I was asking what your specific use cases are that prevent you from treating a UTF-8 string as a black-box blob of bytes. If you're dealing with international text, you'd rather use predefined functions anyway. If you want to limit yourself to ASCII, just do it and simply don't touch bytes >= 0x80.
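A quick sketch of why the black-box treatment works (the key/value format is a made-up example): UTF-8 was designed so that every byte of a multi-byte sequence is >= 0x80, so it can never collide with an ASCII delimiter. You can split, search, and parse on ASCII bytes without ever decoding.

```python
# Splitting UTF-8 bytes on an ASCII delimiter is safe without decoding,
# because continuation and lead bytes of multi-byte sequences are all >= 0x80.
blob = "key=héllo wörld".encode("utf-8")

k, _, v = blob.partition(b"=")  # operate on the opaque byte blob

print(k)                  # b'key'
print(v.decode("utf-8"))  # héllo wörld -- decode only at the boundary
```

This is the usual argument for why most parsing code never needs character indexing at all.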
And what is a character? Do you mean graphemes or codepoints? Or something else? A few years ago I thought like you – that calculating length is a useful feature. But usually, when you think about your use case, you realise either that you don't need a length at all, or that you need some other kind of length: monospace width, rendered width, or some kind of entropy-based amount of information. Twitter is the only case I know of where you really want to count "characters". And I find it really silly: e.g. a Japanese tweet vs. an English tweet.