I think this has more to do with the fact that there aren't many sane libraries implementing Unicode and locales -- so you get either some C/C++ lib, a system lib, a Java lib -- or an actual new implementation that has been done "seriously", as part of being able to say: "Yes, X does actually support Unicode strings."

Python 3 got a lot of flak for the decision to break away from strings being byte sequences and make them Unicode strings. But I think that was the right choice. I still understand why people writing software that only cared about network, on-the-wire, pretend-to-be-text type strings were unhappy about it.

Then again, based on some other comments here, apparently there are still some dark corners:

    Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
    [GCC 4.7.2] on linux2
    >>> s="Åßẞ"
    >>> s == s.upper().lower()
    False
    >>> s.lower()
    'åßß'
However, to complicate things:

    Python 3.4.2 (default, Dec 27 2014, 13:16:08)
    [GCC 4.9.2] on linux
    >>> s="Åßẞ"
    >>> s.lower()
    'åßß'
    >>> s.lower().upper()
    'ÅSSSS'
    >>> s == s.lower().upper()
    False
    >>> s.lower().upper() == 'ÅSSSS'
    True
    >>> 'SS'.lower()
    'ss'
    >>> 'ß'.lower()
    'ß'
    >>> 'ß'.lower().upper()
    'SS'
    >>> 'ß'.lower().upper().lower()
    'ss'
So that's fun.


Thanks for pointing that out -- I was vaguely aware 3.2 wasn't good (but PyPy still isn't up to 3.4?) -- it's what's (still) in Debian stable as python3, though. Jessie (to be released soonish) will have 3.4, so at that point python3 should really start to be viable (to the extent that the differences actually are important...).

For the record, .casefold():

    #Python 3.4:
    >>> 'Åßẞ'.casefold() == 'åßß'.casefold() == 'åssss'
    True
[ed: Also, wrt upper/lower being for display purposes -- I thought it was nice to point out that they are not symmetric, as one might expect them to be (although that expectation is probably wrong in the first place...)]
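
To make the asymmetry easy to poke at, here is a minimal sketch. The helper name `round_trips` is made up for illustration, and the behaviour assumes Python 3.3 or later, where `.upper()` uses the full case mappings (3.2 behaves differently, as the first transcript suggests):

```python
def round_trips(s):
    """Return True if upper-casing and then lower-casing gives back s unchanged."""
    return s.upper().lower() == s


# Plain ASCII letters survive the round trip.
print(round_trips('a'))

# 'ß' upper-cases to 'SS', which lower-cases to 'ss', so the original is lost.
print(round_trips('ß'))
```

So the round trip is lossy for at least some characters, which is one more reason not to build comparisons on top of `.upper()`/`.lower()`.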


FWIW,

- 3.2 is considered broken with a narrow unicode build (although it doesn't matter here)

- .lower and .upper are primarily for display purposes

- .casefold is for caseless matching
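
As a rough illustration of the last two points, assuming Python 3.3+ (where `str.casefold()` exists): `caseless_equal` is a hypothetical helper name, and 'Straße' is just an example word of mine, not something from the transcripts above.

```python
def caseless_equal(a, b):
    """Compare two strings ignoring case, using Unicode case folding."""
    # casefold() is more aggressive than lower(): for example it maps 'ß' to 'ss'.
    return a.casefold() == b.casefold()


# Case folding treats these two spellings as a match.
print(caseless_equal('Straße', 'STRASSE'))

# lower() leaves 'ß' alone, so the same comparison via lower() does not match.
print('Straße'.lower() == 'STRASSE'.lower())
```

This only handles case. Strings that differ in composed vs. decomposed accents would additionally need normalization (e.g. via `unicodedata.normalize`) before they compare equal.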



