The most amusing quote in the entire article is this (emphasis mine): > This gro...

sfink · on Jan 13, 2020

> And the complaint doesn't even make sense if taken at face value - if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The article directly answers that question. Many, many things in the standard library now only accept unicode strings, not byte strings. So a wholesale change to b'' everywhere breaks lots of stuff.

> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated - because it has to be explicit now, instead of the Python 2 world, where bytes was the default, and Unicode had to be requested explicitly.

Once again, the article directly states that the default is not the problem. The lack of escape hatches is. Paths are not unicode strings, and pretending they are does not work. Using bytes when you need bytes works only until you need to call a library function that only accepts strings.
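A minimal sketch of the two worlds being described, using only standard-library calls. The file name `example.txt` and the byte string `caf\xe9.txt` are made up for illustration. It shows that `os.listdir` answers in whatever type it was asked in, and that the `surrogateescape` error handler lets undecodable bytes survive a trip through `str`.

```python
import os
import tempfile

# os.listdir returns str names for a str path and bytes names for a bytes path.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()
    str_names = os.listdir(tmp)
    bytes_names = os.listdir(os.fsencode(tmp))

# A name that is not valid UTF-8 (Latin-1 encoded "café.txt").
raw = b"caf\xe9.txt"

# surrogateescape smuggles the undecodable byte through str as a lone surrogate...
as_text = raw.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes.
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
```

The round trip only works if every layer uses the same error handler. A library that encodes the string strictly will raise `UnicodeEncodeError` on `as_text`.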

acdha · on Jan 14, 2020

Paths ARE Unicode strings on 99% of the computers with humans sitting in front of them. NTFS, HFS+, and APFS all use Unicode but more importantly, the experience of not using valid Unicode where that’s possible is horrible: undeletable files, crashes, etc. I’ve seen that many times over the years (it was popular with malware authors) but never a time where this was a desirable behavior.

The default should always be Unicode with only people writing low-level backup and security tools dealing with bytes.

jgraham · on Jan 14, 2020

This just isn't true. On Windows, paths are UCS-2, i.e. arbitrary sequences of 16-bit Unicode code units, including unpaired surrogates. This means that there are paths that will work on Windows but cannot be encoded as e.g. valid UTF-8. As a result Rust has a bespoke encoding just for representing Windows paths in a way that's compatible with UTF-8 ("WTF-8"). It also means that you can't make a guaranteed lossless conversion from a filesystem path to a Rust string; you have to handle the possibility of errors.
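The same constraint can be seen from Python, as a rough illustration. A string holding an unpaired surrogate cannot be encoded as strict UTF-8. The `surrogatepass` error handler produces a generalized encoding in the same spirit as WTF-8. This sketch does not claim the two are byte-for-byte identical in every case.

```python
# An unpaired high surrogate: legal as a 16-bit code unit, not a valid Unicode scalar value.
lone = "\ud800"

try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# surrogatepass encodes the surrogate as if it were an ordinary code point.
generalized = lone.encode("utf-8", errors="surrogatepass")

# The result is not valid UTF-8, so it needs the same handler to decode.
restored = generalized.decode("utf-8", errors="surrogatepass")
```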

On Mac paths are some weird NFKD-ish thing, so equality comparisons are complicated.
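To see why equality gets complicated, here is a small sketch with Python's `unicodedata` module. The helper `same_name` is hypothetical. Which normalization form a given filesystem actually applies is not something this example asserts. It only shows that two visually identical names can differ as raw strings.

```python
import unicodedata

composed = "caf\u00e9"                               # "é" as one code point
decomposed = unicodedata.normalize("NFD", composed)  # "e" followed by a combining acute accent

def same_name(a, b, form="NFC"):
    """Hypothetical helper: compare two names after normalizing both to one form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

raw_equal = composed == decomposed
normalized_equal = same_name(composed, decomposed)
```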

As a rule, if you think that filesystem paths are easy then you're probably ignoring all the edge cases. In an application where you don't deal with arbitrary user files that's fine. In a programming language that's a huge design error.

int_19h · on Jan 14, 2020

This all - including complicated equality comparisons - is why paths should have their own dedicated type, and not just be raw strings. Thankfully, Python has had pathlib for a while now.
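A short sketch of what a dedicated path type buys you, using `pathlib`'s pure path classes so that it behaves the same on any host OS. The paths are invented for illustration. The example covers only case handling and makes no claim about Unicode normalization.

```python
import os
from pathlib import PurePosixPath, PureWindowsPath

# Windows-flavoured paths compare case-insensitively...
win_equal = PureWindowsPath("C:/Repo/README.txt") == PureWindowsPath("c:/repo/readme.TXT")

# ...while POSIX-flavoured paths compare case-sensitively.
posix_equal = PurePosixPath("/repo/README.txt") == PurePosixPath("/repo/readme.txt")

# Path objects implement os.PathLike, so they can be converted to bytes at the boundary.
as_bytes = os.fsencode(PurePosixPath("/repo/file"))
```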

WorldMaker · on Jan 14, 2020

Paths are Unicode strings on Windows. Yes, POSIX adds a lot more spice to the mix, but if the intent is a cross-platform tool, then Unicode is a reasonable lowest-common-denominator assumption for filenames in 2020.

Conan_Kudo · on Jan 14, 2020

Paths are Unicode strings everywhere but Unix/Linux. And I would even argue that this is a broken aspect of POSIX today. We should make Unicode the baseline for paths in POSIX-compliant systems, but there's probably too much hand-wringing for that to ever happen.

ygra · on Jan 14, 2020

Paths are sequences of 16-bit values on Windows, not necessarily valid UTF-16. It's basically the same as in POSIX, just one byte wider per character.

markbnj · on Jan 13, 2020

> if all strings in Mercurial are byte strings, then what is there to think about? just use b'' throughout, no need to worry about anything else.

The author explains later in the article that many system-level Python 3 APIs that are important to a VCS require unicode and won't accept bytes. So apparently it wasn't as easy as just sticking 'b' in front of every literal.

int_19h · on Jan 13, 2020

Right. But that's a very different issue, and it's not at all about string literals as such.

Furthermore, the way they solve it - by using their own wrapper helpers that allow bytes - means that the end result should be b'' throughout, no?
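As a rough idea of what such a wrapper helper could look like (the names `to_sysstr`, `bgetattr` and `Repo` are hypothetical and are not taken from Mercurial's code): the built-in `getattr` rejects a bytes attribute name, so the wrapper decodes the name at the boundary and the calling code can stay b'' throughout.

```python
def to_sysstr(value):
    """Hypothetical helper: turn bytes into str for APIs that insist on str."""
    if isinstance(value, bytes):
        # latin-1 maps every byte to exactly one code point, so this never fails.
        return value.decode("latin-1")
    return value

def bgetattr(obj, name, *default):
    """Hypothetical wrapper: getattr that also accepts a bytes attribute name."""
    return getattr(obj, to_sysstr(name), *default)

class Repo:
    root = b"/srv/repo"

# The built-in refuses bytes attribute names.
try:
    getattr(Repo, b"root")
    builtin_accepts_bytes = True
except TypeError:
    builtin_accepts_bytes = False

# The wrapper lets bytes-centric code keep passing b'' literals.
root = bgetattr(Repo, b"root")
```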

phkahler · on Jan 13, 2020

>> So the real complaint is that Python switched the defaults in a way that made bytes-centric code more complicated

The author made it clear. The issue wasn't just that the default changed. It was that 3.0 took away the ability to always make your choice explicit.

Changing the default would have no effect on code that was always explicit. Going over the code and making all implicit strings explicit would allow them to know when they had full coverage, and also make the code work with both 2 and 3.

With 3, any implicit literal had to get a b prefix, while any string with u had to be made implicit (drop the u). You couldn't tell by looking at code if it was converted or not. At least that's how I read it.

int_19h · on Jan 13, 2020

The lack of u'' in early versions of Python 3 is a valid complaint, but it's a separate one.

It's also not that big of a deal in practice, because you could always write a helper function like u('foo') that would call unicode() on Python 2, and just pass the value through on Python 3. This only breaks when you need a Unicode literal with actual Unicode characters inside, which is a rare case - and should be especially rare in something like Mercurial.
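A minimal sketch of that helper, assuming it only ever receives ASCII-only literals. The name `u` follows the comment above. The Python 2 branch is never executed on Python 3, so the reference to `unicode` there is harmless.

```python
import sys

if sys.version_info[0] >= 3:
    def u(s):
        # Python 3: plain literals are already Unicode, so pass the value through.
        return s
else:
    def u(s):
        # Python 2: promote the byte string to a unicode object.
        return unicode(s)  # noqa: F821 (name only exists on Python 2)

greeting = u("foo")
```

The u'' prefix itself was accepted again starting with Python 3.3, so on current versions `u"foo"` and `"foo"` are the same literal.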

epage · on Jan 14, 2020

Another reason the complaint doesn't make sense is that the author then praises Rust which is more similar to Python 3 than 2.

afiori · on Jan 14, 2020

From other comments, the annoyances for the author were about the standard library using Unicode for system-level APIs; Rust has an OsString type that works with the GIGO model of POSIX.