> the only thing we can do is to encourage everyone, the users, organizations, a...

terinjokes · on Jan 9, 2025

The position I got reading documentation Microsoft has written in the last two years is the opposite: set activeCodePage in your application manifest to UTF-8 and only ever use the "ANSI" functions.

ziml77 · on Jan 9, 2025

Yes that does seem to be the way going forward. Makes it a lot easier to write cross-platform code. Though library code still has to use the Wide Character APIs because it's up to the application as a whole to opt into UTF-8. Also if you're looking for maximal efficiency, the WChar APIs still make sense because it avoids the conversion of all the string inputs and outputs on every call.

terinjokes · on Jan 9, 2025

Many libraries I've encountered have defines available now to use the -A APIs; previously they were using -W APIs and converting to/from UTF-8 internally.

As for my application, any wchar conversions being done by the runtime are a drop in the bucket compared to the actual compute.

account42 · on Jan 15, 2025

> Also if you're looking for maximal efficiency, the WChar APIs still make sense because it avoids the conversion of all the string inputs and outputs on every call.

OTOH you need ~twice as much memory / copy ~twice as much data around than if you converted to WTF-8 internally.

Joker_vD · on Jan 9, 2025

Ah, so they've finally given up? Interesting to hear. But I guess the app manifests does give them a way to move forward this way while maintaining the backward-compatible behaviour (for apps without this setting in their manifests).

dataflow · on Jan 10, 2025

Despite whatever Microsoft may seem to be suggesting, you don't want to do this. Just use the wide APIs. Lots of reasons why UTF-8'ing the narrow APIs is a bad idea:

- The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16. Meaning there are cases that lead to loss of information and that you simply cannot handle.

- You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.

- Some APIs simply don't have narrow versions. Like CommandLineToArgvW() or GetFileInformationByHandleEx() (e.g., FILE_NAME_INFO). You will not avoid wide APIs by doing this if you need to use enough of the APIs; you're just going to have to perform conversions that have dubious semantics anyway (see point #1 above).

- Compatibility with previous Windows versions, obviously.

- Performance

cesarb · on Jan 10, 2025

> You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.

I want to emphasize this point. From what I've heard, on Windows it's very common for DLLs from who knows where to end up loaded in your process. Not only the things you'd also find on other operating systems like the user-space component of graphics APIs like OpenGL and Vulkan, but also things like printer drivers, shell extensions, "anti-malware" stuff, and I've even heard of things like RGB LED control software injecting their DLLs into every single process. It's gotten so bad that browsers like Firefox and Chrome use fairly elaborate mechanisms to try to prevent arbitrary DLLs from being injected into their sandbox processes, since they used to be a common source of crashes.

account42 · on Jan 15, 2025

> The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16. Meaning there are cases that lead to loss of information and that you simply cannot handle.

There's WTF-8 - too bad that's not what Microsoft chose to use for their universal 8-bit codepage.

cryptonector · on Jan 10, 2025

Disagree. At least in the context of Unix utilities portable to Windows. We are NOT going to be forking those to use wchar_t on Windows and char on Unix -that's a non-starter- and we're also not going to be switching to wchar_t on both because wchar_t is a second-class citizen on Unix.

Using UTF-8 with the "A" Windows APIs is the only reasonable solution, and Microsoft needs to commit to that.

> - The wide APIs accept and/or produce invalid UTF-16 in some places (like filesystems). There's no corresponding UTF-8 for invalid UTF-16. Meaning there are cases that lead to loss of information and that you simply cannot handle.

This is also true on Unix systems as to `char`. Yes, that means there will be loss of information regarding paths that have garbage in them. And again, if you want to write code for Windows _and_ Unix, using wchar_t won't spare you this loss on Unix. So you're damned if you do and damned if you don't, so just accept this loss and say "don't do that".

> - You have no control over all the DLLs loaded in your process. If a user DLL loads that can't handle UTF-8 narrow APIs, you're just praying it won't break.

In some cases you do have such control, but if some DLL unknown to you uses "W" APIs then.. it doesn't matter because if it's unknown to you then you're not interacting with it, or if you are interacting with it via another DLL that is known to you then it's that DLL's responsibility to convert between char and wchar_t as needed. I.e., this is not your problem -- I get that other people's bugs have a way of becoming your problem, but strictly speaking it's their problem not yours.

> - Some APIs simply don't have narrow versions. Like CommandLineToArgvW() or GetFileInformationByHandleEx() (e.g., FILE_NAME_INFO). You will not avoid wide APIs by doing this if you need to use enough of the APIs; you're just going to have to perform conversions that have dubious semantics anyway (see point #1 above).

True, but these can be wrapped with code that converts as needed. This is a lot better from a portability point of view than to fork your entire code into Windows and Unix versions.

> - Compatibility with previous Windows versions, obviously.

Sigh. At some point people (companies, contractors/consultants, ...) need to put their feet down and tell the U.S. government to upgrade their ancient Windows systems.

> - Performance

The performance difference between UTF-8 and UTF-16 is in the noise, and it depends greatly on context. But it doesn't matter. UTF-8 could be invariably slower than UTF-16 and it would still be better to move Windows code to UTF-8 than to move Unix to UTF-16 or lose portability between Windows and Unix.

In case you and others had not noticed Linux has a huge share of the market on servers while Windows has a huge share of the market on laptops, which means that giving up on portability is not an option.

The advice we give developers here has to include advice we give to developers who have to write and look after code that is meant to be portable to Windows and Unix. Sure, if you're talking to strictly-Windows-only devs, the advice you give is alright enough, but if later their code needs porting to Unix they'll be sad.

The reality is that UTF-8 is superior to UTF-16. UTF-8 has won. There's just a few UTF-16 holdouts: Windows and JavaScript/ECMAScript. Even Java has moved to UTF-8. And even Microsoft seems to be heading in the direction of making UTF-8 a first-class citizen on Windows.

account42 · on Jan 15, 2025

> This is also true on Unix systems as to `char`. Yes, that means there will be loss of information regarding paths that have garbage in them. And again, if you want to write code for Windows _and_ Unix, using wchar_t won't spare you this loss on Unix. So you're damned if you do and damned if you don't, so just accept this loss and say "don't do that".

The problem is that you can't roundtrip all filenames. CP_UTF8 doesn't solve that only pretends to. For a full solution you need to use the W functiosn and then convert between WTF-16 and WTF-8 yourself.

dataflow · on Jan 10, 2025

Hard disagree:

> At least in the context of Unix utilities portable to Windows. We are NOT going to be forking those to use wchar_t on Windows and char on Unix -that's a non-starter- and we're also not going to be switching to wchar_t on both because wchar_t is a second-class citizen on Unix.

Those aren't the only options. You (or someone) could also write your own compatibility layers for the APIs that avoid some of the problems I mentioned (e.g., by producing errors on inconvertible characters, by being compatible with former Windows versions, by not affecting other DLLs in your process, etc.)

Or you could e.g. get upstream to start caring about their users on other platforms, and play ball.

> This is also true on Unix systems as to `char`. Yes, that means there will be loss of information regarding paths that have garbage in them. And again, if you want to write code for Windows _and_ Unix, using wchar_t won't spare you this loss on Unix.

Er, no. First, if you're actually writing portable code, TCHAR is the solution, not wchar_t. Second, if you can't fork others' code, at the very least you can produce errors to avoid silent bugs (see above). And finally, "this problem also exists with char" is just wrong. In a lot of cases the problem doesn't exist as long as you're using the same representation and avoiding lossy conversion, whatever the data type is. If (say) the file path is invalid UTF, and you save it somewhere and reuse it, or pass it to some program and then have it passed back to you, you won't encounter any issues -- the data is whatever it was. The issues only come up with lossy conversions in any direction.

> if some DLL unknown to you uses "W" APIs then.. it doesn't matter because if it's unknown to you then you're not interacting with it, or if you are interacting with it via another DLL

I don't think you're understanding the problem here. Interaction is not part of the picture at all. You might not be loading the DLL yourself at all. DLLs get loaded by the OS and user for all sorts of reasons (antiviruses, shell extensions, etc.) and they easily run in the background without anything else in the process "knowing" anything about the at all. Your program is declaring that everything in the process is UTF-8 compatible, but those DLLs might not be compatible with that, and so you're just praying that they don't use -A functions in an incompatible manner.

> Sigh. At some point people (companies, contractors/consultants, ...) need to put their feet down and tell the U.S. government to upgrade their ancient Windows systems.

USG? Ancient? These are systems less than 10 years old. We're not talking floppy-controlled nukes here.

> The performance difference between UTF-8 and UTF-16 is in the noise, and it depends greatly on context.

"Depends greatly on the context" kinda makes my point. It can turn a zero-copy program into single- or double-copy. Generally not a showstopper by any means, but it sure as heck can impact some programs. And if that program is a DLL people use - well now you can't work around. (Yes, there's a reason I listed this last. But there's a reason I listed it at all.)

> The reality is that UTF-8 is superior to UTF-16. UTF-8 has won.

The reality is Windows isn't UTF-16 and nix isn't UTF-8, which was the crux of most of my points.

account42 · on Jan 15, 2025

> Er, no. First, if you're actually writing portable code, TCHAR is the solution, not wchar_t.

TCHAR is a Microsoftism, it's NOT portable at all.

dataflow · on Jan 15, 2025

I didn't mean "portable" in the same sense you're using it. Maybe "cross-platform", if you will. Or insert whatever word you want that would get my point across.

cryptonector · on Jan 10, 2025

> Those aren't the only options. You (or someone) could also write your own compatibility layers for the APIs that avoid some of the problems I mentioned (e.g., by producing errors on inconvertible characters, by being compatible with former Windows versions, by not affecting other DLLs in your process, etc.)

That's akin to writing a partial C library. If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.

> Or you could e.g. get upstream to start caring about their users on other platforms, and play ball.

The upstream is often not paid for this. Even if they get a PR, if the PR makes their code harder to work on they might reject it.

Microsoft has to make UTF-8 a first-class citizen.

> I don't think you're understanding the problem here. Interaction is not part of the picture at all. You might not be loading the DLL yourself at all. DLLs get loaded by the OS and user for all sorts of reasons (antiviruses, shell extensions, etc.) and they easily run in the background without anything else in the process "knowing" anything about the at all. Your program is declaring that everything in the process is UTF-8 compatible, but those DLLs might not be compatible with that, and so you're just praying that they don't use -A functions in an incompatible manner.

You mean changing the codepage for use with the "A" functions? Any DLL that does that must go on the bonfire. There's a special place in Hell for developers who build such DLLs.

> "Depends greatly on the context" kinda makes my point. It can turn a zero-copy program into single- or double-copy. Generally not a showstopper by any means, but it sure as heck can impact some programs. And if that program is a DLL people use - well now you can't work around. (Yes, there's a reason I listed this last. But there's a reason I listed it at all.)

I'm assuming you're referring to having to re-encode at certain boundaries. But note that nothing in Windows forces or even encourages you to use UTF-16 for bulk data.

> The reality is Windows isn't UTF-16 and nix isn't UTF-8, which was the crux of most of my points.

Windows clearly prefers UTF-16, and its filesystems generally use just-wchar-strings for filenames on disk (they don't have to though). Unix clearly prefers UTF-8, and its filesystems generally use just-char-strings on disk.

terinjokes · on Jan 11, 2025

>> Those aren't the only options. You (or someone) could also write your own compatibility layers for the APIs that avoid some of the problems I mentioned (e.g., by producing errors on inconvertible characters, by being compatible with former Windows versions, by not affecting other DLLs in your process, etc.)

> That's akin to writing a partial C library. If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.

I found out about activeCodePages thanks to developers of those compatibility layers documenting the option and recommending it over their own solutions.

> The upstream is often not paid for this. Even if they get a PR, if the PR makes their code harder to work on they might reject it

The project I work on is an MFC application stemming from 9x and early XP and abandoned for 15 years. Before I touched it it had no Unicode support at all. I'm definitely not being paid to work on it, let alone the effort to convert everything to UTF-16 when the tide seems to be going the other direction.

> Your program is declaring that everything in the process is UTF-8 compatible, but those DLLs might not be compatible with that, and so you're just praying that they don't use -A functions in an incompatible manner.

Programs much, much, much more popular than mine written by the largest companies in the world, and many programs you likely use as a developer on Windows, set activeCodePage to UTF-8. Notwithstanding the advice in the article to set it globally for all applications (and it implies it already is the default in some locales). Those DLLs will be upgraded, removed, or replaced.

Joker_vD · on Jan 11, 2025

Forget it, you ain't gonna make Linux-centric open-source community to really care about Windows (or other un-POSIX-like OSes, of which today there is almost none). The others have to give in and accomodate to their ways if those others want to use their code.

And since Windows-centric developers, when porting their apps to Linux, are generally willing to accomodate for Linux-specific idiosyncrasies (that's what porting is about, after all) if they care abour that platform enough, the dynamic will generally stay the same: people porting from Windows to Linux will keep making compatibility shims, people porting from Linux to Windows will keep telling you "build it with MinGW or just run it in WSL2, idgaf".

dataflow · on Jan 11, 2025

> That's akin to writing a partial C library.

Not really. It's just writing an encoding layer for the APIs. For most APIs it doesn't actually matter what they're doing at all; you don't have to actually care what their behaviors are. In fact you could probably write compiler tooling to do automatically analyze the APIs and generate code for most functions so you don't have to do this manually.

> If MSFT makes UTF-8 as the codepage work well enough I'd rather use that.

"Well enough" as in, with all the warts I'm pointing out? Their current solution is all-or-nothing for the whole process. They haven't provided a module-by-module solution and I don't expect them to. They haven't provided a way to avoid information loss and I don't expect them to.

> You mean changing the codepage for use with the "A" functions? Any DLL that does that must go on the bonfire. There's a special place in Hell for developers who build such DLLs.

"Changing" the code page? No, I'm just saying any DLL that calls FooA() without realizing FooA() can now accept UTF-8 could easily break. You're just praying that they don't.

> I'm assuming you're referring to having to re-encode at certain boundaries. But note that nothing in Windows forces or even encourages you to use UTF-16 for bulk data.

Nothing? How do you say this with such confidence? What about, say, IDWriteFactory::CreateTextLayout(const wchar_t*) (to give just one random example)?

And literally everything that interacts with other apps/libraries/etc. that use Unicode (which at least includes the OS itself) will have to encode/decode. Like the console, clipboard, or WM_GETTEXT, or whatever.

The whole underlying system is based on 16-bit code units. You're going to get a performance hit in some places, it's just unavoidable. And performance isn't just throughput, it's also latency.

> Windows clearly prefers UTF-16, and its filesystems generally use just-wchar-strings for filenames on disk (they don't have to though). Unix clearly prefers UTF-8, and its filesystems generally use just-char-strings on disk.

Yes, and you completely missed the point. I was replying to your claim that "UTF-8 has won" over UTF-16. I was pointing out that what you have here is neither UTF-8 on one side nor UTF-16 on the other. Going with who "won" makes no sense when neither is the one you're talking about, and you're hitting information loss during conversions. If you were actually dealing with UTF-16 and UTF-8, that would be a very different story.

SleepyMyroslav · on Jan 10, 2025

In gamedev a lot of people read those docs but not a lot of them shipped anything using it. The reason is that file paths are not everything that has A/W versions. There is user input, window message handling ... The API is a maze.

I really would like to learn otherwise. But when I have to suggest fixes my old opinion stays. Dropping any C runtime use and going from API macro or A version to W is the solution to all weird and hard to repro problems on platforms from Ms.

7bit · on Jan 10, 2025

Not a Programmer. Wouldn't manifests risk the application breaking, if the manifest is not copied with the exe file? As a power user, I see the manifests sometimes, but honestly ,if I download e.g., bun.exe I would just copy the bun.exe without any manifest that the downloaded archive would contain.

That does not sound like a good solution.

lmz · on Jan 10, 2025

You can embed manifests in the exe.

terinjokes · on Jan 10, 2025

Expanding on this a bit, if the manifest is available at compile time it's included as a resource in the executable by the RC resource compiler. You can embed a manifest into an existing executable with mt.exe. Embedding the application manifest is recommended.

If you can't embed it for some reason, then you can distribute the application manifest side-by-side with the executable by appending ".manifest" to the binary filename. In this case probably already have defensive checks for other resources not being found if a user copies just the exe, and if not can add one and exit.

masfuerte · on Jan 9, 2025

In my portable code I #define standard functions like main and fopen to their wide equivalents when building on Windows.

This does mean I can't just use char* and unadorned string literals, so I define a tchar type (which is char on Linux and wchar_t on Windows) and an _T() macro for string literals.

This mostly works without thinking about it.

dblohm7 · on Jan 9, 2025

What really annoys me these days is that if you search for a Win32 API on Google, it will always come up with the -A variant, not the -W variant. I don't know if they've got something weird in their robots.txt or what, but I find it bizarre that an API whose guidelines desire developers to use the -W variants in all greenfield code, instead returns the legacy APIs by default.

ack_complete · on Jan 10, 2025

They did a strange reorg of the API docs at one point. Not only does it now have functions split by A/W (mostly unnecessarily), it also groups them by header file instead of feature reference, which is kind of annoying. It used to be just that the function doc would note at the bottom if A/W variants were present and they were grouped under Functions in the feature/subsystem area of the docs tree.

dblohm7 · on Jan 10, 2025

Yeah, that new content management system is awful too -- it doesn't grok preprocessor stuff at all, so sometimes you get nonsensical struct definitions, kernel-mode structs instead of user-mode structs, etc.

delta_p_delta_x · on Jan 9, 2025

> Microsoft's own C/C++ runtime library (msvcrt.dll) is implemented

This has been superseded by the Universal C runtime (UCRT)[1] which is C99-compliant.

pjmlp · on Jan 10, 2025

Mostly C99 compliant, some things are left out.

https://learn.microsoft.com/en-us/cpp/c-runtime-library/comp...

sigsev_251 · on Jan 10, 2025

I think the documentation is outdated given that C11 atomics [1] and threads [2] are available for more than a year now. Same goes for pretty much everything MSVC frontend related stuff (I've yet to try which C++23 features are supported at the moment, but they've secretly added support for C23 features like typeof and attributes, as well as GNU Statement Expressions).

[1]: https://devblogs.microsoft.com/cppblog/c11-atomics-in-visual...

[2]: https://devblogs.microsoft.com/cppblog/c11-threads-in-visual...

pjmlp · on Jan 10, 2025

Outdated documentation is pretty normal unfortunely, even .NET suffers from that nowadays.

Not as bad as Apple nowadays though, quite far from Inside Inside Macintosh days.

Glad to know about C23 features, as they went silent on C23 plans.

C++23 looks quite bad for anything that requires frontend changes, there are even developer connection issues for us to tell what to prioritise, as if it wasn't logically all of it. There is another one for C++26 as well.

Personally, I think that with the improvements on low level coding and AOT compilation from managed languages, we are reaching local optimum, where C and C++ are good enough for the low level glue, C23 and C++23 (eventually C++26 due to static reflection) might be the last ones that are actually relevant.

Similar to how although COBOL and Fortran standards keep being updated, how many ISO 2023 revision compliant compilers are you going to find out for portable code?

sigsev_251 · on Jan 10, 2025

> Outdated documentation is pretty normal unfortunely, even .NET suffers from that nowadays.

That's really unfortunate.

> Not as bad as Apple nowadays though, quite far from Inside Inside Macintosh days.

Funny story, I know a guy who wanted to write a personal Swift project for an esoteric spreadsheet format and the quality of the documentation of SwiftUI made him ragequit. After that, he switched to kotlin native and gtk and he is much happier.

> Personally, I think that with the improvements on low level coding and AOT compilation from managed languages, we are reaching local optimum, where C and C++ are good enough for the low level glue, C23 and C++23 (eventually C++26 due to static reflection) might be the last ones that are actually relevant.

I agree on the managed language thing but, I mean, the fact that other languages are getting more capable with low level resources does not mean that improvements in C/C++ are a bad idea and will not be used. In fact, I think that features like the transcoding functions in <stdmchar.h> in C2y (ironically those are relevant to the current HN post) are useful to those languages too! So even if C, C++ and fortran are just used for numerical kernels, emulators, hardware stuff, glue code and other "dirty" code advancements made to them are not going wasted.

nialv7 · on Jan 9, 2025

Windows really should provide an API that treats path names as just bytes, without any of these stupid encoding stuff. Could probably have done that when they introduced UNC paths.

Dwedit · on Jan 10, 2025

Ever since Windows 95 Long File Names for FAT, filenames have been 16-bit characters in their on-disk format. So passing "bytes" means that they need to become wide characters before the filesystem can act on them. And case-sensitivity is still applied, stupidly enough, using locale-specific rules. (Change your locale, and you change how case-insensitive filenames work!)

It is possible to request for a directory to contain case-sensitive files though, and the filesystem will respect that. And if you use the NT Native API, you have no restrictions on filenames, except for the Backslash character. You can even use filenames that Win32 doesn't allow (name with a ":", name with a null byte, file named "con" etc), and every Win32 program will break badly if it tries to access such a file.

It's also possible to use unpaired surrogate characters (D800-DFFF without the matching second part) in a filename. Now you have a file on the disk whose name can't be represented in UTF-8, but the filename is still sitting happily in the filesystem. So people invented "WTF-8" encoding to allow those characters to be represented.

cesarb · on Jan 10, 2025

> And case-sensitivity is still applied, stupidly enough, using locale-specific rules. (Change your locale, and you change how case-insensitive filenames work!)

AFAIK, it's even worse: it uses the rules for the locale which was in use when the filesystem was created (it's stored in the $UpCase table in NTFS, or its equivalent in EXFAT). So you could have different case-insensitive rules in a single system, if it has more than one partition and they were formatted with different locales.

IMO, case-insensitive filesystems are an abomination; the case-insensitivity should have been done in the user interface layer, not in the filesystem layer.

cryptonector · on Jan 10, 2025

> IMO, case-insensitive filesystems are an abomination; the case-insensitivity should have been done in the user interface layer, not in the filesystem layer.

Implementing case-insensitivity in a file picker or something is OK, but doing that throughout your app's runtime is insane since you'd have to hook every file open and then list the directory, whereas in a file picker you're probably listing the directory anyways.

cesarb · on Jan 10, 2025

The file picker is precisely where case-insensitivity should be done; the rest of the application should already have the correct file name.

cryptonector · on Jan 10, 2025

Though you best not have a million files in that directory...

Dwedit · on Jan 10, 2025

Did not know about $UpCase, the only part I knew was that the FAT16/32 driver from Microsoft (Which has the source code officially released, it's used as an example for how to implement a filesystem on Windows NT) uses locale-specific case-sensitivity tests.

cesarb · on Jan 11, 2025

You're right, in the case of FAT16/FAT32 it AFAIK has to use the current system locale, since unlike EXFAT or NTFS there isn't a place in the filesystem to store that locale table.

Joker_vD · on Jan 9, 2025

Windows does treat path names as just sequences of uint16_t (which is how NTFS stores them) if you use W-functions and prepend the paths with "\\?\".

nialv7 · on Jan 10, 2025

oh, that's interesting. do UNC paths not have to be valid UTF-16?

Dwedit · on Jan 10, 2025

"\\?\" is strange, because it looks just like a UNC path. But it actually isn't. It's actually a way for Win32 programs to request a path in the NT Object Namespace.

What's the NT Object Namespace? You can use "WinObj" from SysInternals to see it.

The NT Object Namespace uses its own special paths called NT-Native paths. A file might be "C:\hello.txt" as a Win32 path, but as an NT-Native path, it's "\??\C:\hello.txt". "\??\" isn't a prefix, or a escape or anything like that. It's a real directory sitting in the NT Object Namespace named "\??", and it's holding symbolic links to all your drive letters. For instance, on my system, "\??\C:" is a symbolic link that points to "\Device\HarddiskVolume4".

Just like Linux has the "/dev/" directory that holds devices, the NT Object Namespace has a directory named "\Device\" that holds all the devices. You can perform File IO (open files, memory map, device IO control) on these devices, just like on Linux.

"\??\" in addition to your drive letters, also happens to have a symbolic link named "GLOBALROOT" that points back to the NT-Native path "\".

Anyway, back to "\\?\". This is a special prefix that when Win32 sees it, it causes the path to be parsed differently. Many of the checks are removed, and the path is rewritten as an NT-Native path that begins with "\??\". You can even use the Win32 Path "\\?\GLOBALROOT\Device\HarddiskVolume4\" (at least on my PC) as another way to get to your C:\ drive. *Windows Explorer and File Dialogs forbid this style of path.* But 7-Zip File Manager allows it! And regular programs will accept a filename as a command line argument in that format.

Another noteworthy path in "\??\" is "\??\UNC\". It's a symbolic link to "\Device\Mup". From there, you can add on the hostname/IP address, and share name, and access a network share. So in addition to the classic UNC path "\\hostname\sharename", you can also access the share with "\\?\UNC\hostname\sharename" or "\\?\GLOBALROOT\Device\Mup\hostname\sharename".

jeroenhd · on Jan 10, 2025

I don't believe they do. Maybe the documentation will tell you it must be, but in practice file names with broken UTF-16 can be created.

cryptonector · on Jan 10, 2025

It's the same on Unix.

On Unix the reason for this is that the kernel has no idea what codeset you're using for your strings in user-land, so filesystem-related system calls have to limit themselves to treating just a few ASCII codepoints as such (mainly NUL, `/`, and `.`).

userbinator · on Jan 9, 2025

And of course making everything twice as big as it needs to be is also extremely repugnant.

Joker_vD · on Jan 10, 2025

Not everyone uses Latin-based scripts, you know. Most of the symbols in the BMP (including Brahmic scripts) take two bytes in either UTF-8 or UTF-16, and CJK symbols take 3 bytes in UTF-8 instead of 2 in UTF-16. Emojis, again, are 4 bytes long in either encoding. So for the most people in the world, UTF-16 is either slightly more compact encoding, or literally the same as UTF-8.

account42 · on Jan 15, 2025

> Not everyone uses Latin-based scripts

Actually, everyne does use Latin-based scripts extensively. Maybe not exclusively but your almost all text-like data intended to be consumed by programs will mainly be Latin-based scripts. So even for languages where you have characters that need 3-bytes in UTF-8 but two in UTF-16 you can still end up saving memory with UTF-8 because all the boilerplate syntax around your fancy characters is ASCII.