New UUID Formats

amluto · on June 12, 2022

Fortunately this is a bit less relevant today as Windows loses market share in database and server applications, but:

UUIDs have historically massively screwed up endian handling. While this new draft discusses sorting UUIDs as strings of octets (bytes) and the text of RFC4122 is fairly explicit about most significant bytes coming first, the C UUID structure in RFC 4122 appendix A is entirely misguided:

    typedef struct {
        unsigned32  time_low;
        unsigned16  time_mid;
        unsigned16  time_hi_and_version;
        unsigned8   clock_seq_hi_and_reserved;
        unsigned8   clock_seq_low;
        byte        node[6];
    } uuid_t;

Those aren’t bytes — they’re integers of various sizes. (Hint: do not use integer types in C code for portable data structures. ntohl, etc are a mess. Just use arrays of bytes.)

I don’t know the whole history, but MS somehow took this structure at face value and caused problems like this:

https://github.com/uuid-rs/uuid/issues/277

So, if you want to do anything (e.g. sorting) that depends on the representation of a UUID (or even depends on converting between string and binary representations), be aware that UUIDs coming from Windows may be little-endian. In my book, this is a Windows bug, but opinions may differ here.

JonathonW · on June 12, 2022

UUIDs (as GUIDs) in Windows predate RFC 4122, so I don't think it's that unreasonable that they're not compliant with the spec (since the exact contents of a UUID aren't typically that important, but consistency in how you produce them is).

The GUID implementation in Windows derives from the DCE RPC specification [1]. That's where the multibyte integers in the RFC 4122 specification come from (they're stated the same way in the DCE RPC spec), and it doesn't explicitly specify their endianness. It does call them "NDR integers", but NDR integers can be little endian or big endian depending on implementation. DCE specifies a mechanism by which you'd indicate which you're using in an RPC call, but that data's not included in the UUID format-- you get whatever byte order the system's decided to use, which, for Windows, is little-endian.

[1] https://pubs.opengroup.org/onlinepubs/9629399/apdxa.htm#tagc...

firebird84 · on June 12, 2022

It's actually worse than that. The first 3 groupings (textually) of the uuid might be little endian while the other 2 are big endian. Learning this cost me more time than I care to admit.

https://en.wikipedia.org/wiki/Universally_unique_identifier#...

formerly_proven · on June 12, 2022

Making mixed-endian even more haunted is quite the achievement. I congratulate whoever did this at Microsoft for their lasting contribution.

amluto · on June 12, 2022

This is consistent with the misguided structure in the RFC. The first three fields (the time fields) are multibyte integers. The remainder is just bytes. The dashes in the textual representation are just there to confuse you.

throwaway9870 · on June 12, 2022

I had to write some EFI/GPT code earlier this year and was dumbfounded to learn this. This is right up there with the mork file format.

qwerty456127 · on June 13, 2022

Why would this matter (except when you build an index for a big database)?

Note:

> 6.8. Opacity: UUIDs SHOULD be treated as opaque values and implementations SHOULD NOT examine the bits in a UUID to whatever extent is possible.

legulere · on June 13, 2022

When you translate between UUIDs in binary and in text form and communicate with other code the binary UUIDs. The other code might expect the uuids in a different encoding.

cpach · on June 12, 2022

What the…?

This fact will haunt me in my dreams :-p

blueflow · on June 12, 2022

This mishap lives forth in the UUID stored in a machines DMI data, as well in GPT partition tables, which are required when EFI is used. It would be really cool if we had some replacement for EFI that would not harbour these kind of painful legacies.

dtech · on June 12, 2022

On the other hand, this is an easy to implement conversion, while changing such a fundamental thing from EFI sounds pretty hard, making it not worth it.

blueflow · on June 12, 2022

You cannot fix this with an conversion because you do not know if your UUID is correct or needs conversion. DMI data for example has inconsistent endian-ness depending on the vendor. So if you have a UUID sticker on a new server, you still have two options which UUID the machine will send during PXE, either the printed UUID in big-endian encoding or in the microsoft mixed-endian encoding.

Use BIOS boot instead of EFI, it has less legacy to implement: PE executables, FAT file system, Win64 ABI

smallnamespace · on June 13, 2022

Isn’t endianness marked by the variant field? 1 is IETF, 2 is Microsoft, and this can be inferred from the sticker.

iforgotpassword · on June 13, 2022

At my last workplace we twice got new workstations where they all had the same bogus UUID. Once it was Dell, the other I don't remember. The fix was to either manually set a new one in the BIOS setup, or install a BIOS Update. So I have my doubts that the variant field can be trusted here.

blueflow · on June 13, 2022

After failed attempts with the UUID due to above problems, we now do machine identification for PXE via MAC address for virtual machines (you see/set it in the hypervisor) and the DMI product_serial for physical machines (written on chassis).

WorldMaker · on June 13, 2022

Raymond Chen's post on UUID sort orders has been indispensable to me at various points: https://devblogs.microsoft.com/oldnewthing/20190426-00/?p=10...

It's fun to note that the worst possible UUID sort order in existence isn't Microsoft's fault but Sun's. They missed the "unsigned" catch to those integers in the struct and Java sorts UUIDs as signed integers. (Which is why sometimes you'll notice in for instance Android apps Guids get sorted such that that 8000... < FFFF... < 0000... < 7FFF...)

throw868788 · on June 13, 2022

He's linked it here: https://devblogs.microsoft.com/oldnewthing/20190913-00/?p=10.... Its a good read. The LE format, especially with UUID 7 like UUID (seq UUID's), seems to be easier/faster to sort just using binary sorting on LE architectures which is most computers nowdays. Interesting to see all these tradeoffs.

Someone · on June 12, 2022

> Hint: do not use integer types in C code for portable data structures. ntohl, etc are a mess. Just use arrays of bytes.

I don’t see how that helps much. If a developer forgets to call ntohl on multi-byte integer fields, I don’t trust them to correctly convert said integers to arrays of bytes, either.

kevincox · on June 12, 2022

If you write it the naive way it works.

    uint8_t bytes[4];
    uint32_t = bytes[0] << 24 + bytes[1] << 16 + bytes[2] << 8 + bytes[3];

The endianness is whatever you write in the indexing and will be the same across architectures.

e4m2 · on June 12, 2022

Unfortunately, the naive way also turns out to be wrong in C. uint8_t gets promoted to a signed int when shifting, which in turn causes undefined behavior for specific input. One way of fixing this is casting to the desired type before the shift, thus avoiding surprising conversions.

On a side note, compiler warnings and sanitizers help with this kind of stuff greatly, use them if you have the option: https://godbolt.org/z/8oq9GTcze

zimmerfrei · on June 13, 2022

>> uint8_t gets promoted to a signed int when shifting, which in turn causes undefined behavior for specific input. >> One way of fixing this is casting to the desired type before the shift, thus avoiding surprising conversions.

Agree. For clarity, the "specific input" in the example would be a bytes[0] value larger than 127.

This behavior is also explained here:

https://wiki.sei.cmu.edu/confluence/display/c/INT34-C.+Do+no...

secondcoming · on June 12, 2022

I thought Windows uses GUIDs, not UUIDs

takeda · on June 12, 2022

It looks like GUID is synonymous with UUID, but the name GUID also implies that it could contain that bug/feature mentioned[1]

[1] https://en.wikipedia.org/wiki/Universally_unique_identifier#...

quotemstr · on June 12, 2022

Little endian won. I don't see the point of maintaining theoretical compatibility with big endian systems that are at best esoteric right now and soon to be extinct. Likewise, every reasonable platform aligns data fields on natural alignment these days. It's just a waste of effort to make software portable to evolutionary dead ends.

mort96 · on June 12, 2022

This isn't about compatibility with big-endian machines. This is about compatibility between different uuid libraries, potentially on different operating systems, all on little endian CPU architectures.

ntauthority · on June 12, 2022

Every existing library follows the standard which specifies host byte order, which usually means little endian. The Rust library cited a few levels up this chain ignored that, somehow assuming big endian, and then had to correct for this mistake.

wolverine876 · on June 12, 2022

You're saying it won for this application (UUIDs)? Universally?

What is the most common remaining use of big endian?

tech2 · on June 12, 2022

TCP/IP might be a reasonable example. There's a reason "network byte order" and "big endian" are the same thing.

wolverine876 · on June 12, 2022

> TCP/IP might be a reasonable example.

That's a pretty significant example of widespread usage that isn't going away soon. Perhaps the GGP was only referring to UUID applications.

GiorgioG · on June 12, 2022

I used to be a big proponent of using UUIDs for database PKs but I've found them inherently difficult to work with. It's much easier to remember/recognize an integer based PK when troubleshooting a data problem.

This isn't to say you shouldn't use UUIDs at all, but I much prefer to use an "ExternalId" column of UUID type if you don't want to expose your integer based PKs externally.

cameronh90 · on June 12, 2022

UUIDs have a few distinct advantages: you'll never run out, you don't need a roundtrip to find out what they are after saving them, they often make a good partitioning key and it makes things easier if you ever need to combine multiple data sources together in migration and recovery type scenarios. I also quite like how they're unique across all data sources and tables, so if you just encounter a random contextless UUID in the wild, for example in a support ticket, you can probably still find what it refers to.

They are quite unwieldy though. There are a few compact representations you can use in URLs which make it a bit less ugly, but they can make your database and logs quite bloated, in particular if you've got a large number of small records.

feoren · on June 13, 2022

> if you ever need to combine multiple data sources together in migration and recovery type scenarios

This insane idea that combining data sources is a rare event in some unusual "migration and recovery" scenarios is one of the most poisonous and yet pervasive ideas in all of database design. You are always combining multiple data sources, all the time. Users submitting data from a form is a data source. Test, staging, and production deployments, with multiple of each. External APIs. Multiple clients. Eventual consistency. Replication. Microservices with distributed systems. Or even sharing any common data at all between different systems, like unit conversions, chemical data, country names, engineering constants, etc.

Anyone who even considers using a single authoritative source for all entity identity either better be making a system in an underground bunker that will never talk to any other system. Otherwise they are making a serious and extremely avoidable mistake. Never use auto-incrementing IDs.

It's even wrong in a monolith! Why does everyone abandon this idea of "separation of concerns" and "single responsibility principle" and "bounded contexts" and proper abstraction and limited communication between system parts and literally every design principle they've ever been taught when they go to design a database? It all just goes out the window! Why do you guys bother reading books about system design if you ignore them when you build a database? "Multiple systems communicating with each other" should apply recursively all the way from deployment and external integration down to individual functions. That means database, too. Auto-incrementing IDs are anathema to that.

bruce511 · on June 13, 2022

While your tone might be a tad hyperbolic, I agree with your basic premise.

If I could go back and tell my 30 year ago self one tip, it would be to use uuids over auto-increments. And this is back when that was expensive - in disk space and database time.

Instead I'm stuck with my design, and as time has passed the real cost of auto-Inc has slowly revealed itself.

What's interesting to me though is that this view is not universal. I get a lot of push-back when promoting uuids, but I can really only speak to my experience.

acmecorps · on June 13, 2022

You’re not alone. I’ve been migrating my tables to use uuid instead of integers and have been using uuid whenever I have new tables, unless I have very good reason not to. Experience was my teacher.

vlovich123 · on June 13, 2022

Don’t UUIDs as primary keys totally destroy the performance because UUIDs aren’t sortable and thus wreak havoc with the index for the primary key?

matwood · on June 13, 2022

Destroy is a strong word (and UUIDs can certainly be sorted, but locality is an issue), but all of software is a series of tradeoffs. I've used both auto-increment and UUIDs, and wish I had used UUIDs in almost every case.

Distributed generation - no sending a record to a server to get a key then using it to generate other records. In a world with increasing use of services, this becomes more important every day.

System wide unique - helps with logging, debugging, and avoiding general errors.

Multi-master db replication - I know this depends on the RDBMS, but having a unique key on every record avoids clashes. Also super useful during data migrations (which will happen. I have another rant that data always outlives code, so plan accordingly).

Validation - UUIDs have a form that can be a first level validation on input.

For me, those advantages outweigh some extra space usage, possible performance impact, and ugly URLs.

And on performance, if it's determined that it is an issue because of using UUIDs there are ways to make them more index friendly.

sumtechguy · on June 13, 2022

There is a perf hit. You can not help it when you are slugging that many bytes around. An int/int64 fits into a register easy and the instruction for comparison is cheap. However, when you add in replication and other items UUID becomes more desirable due to the property having extra information embedded into it to make them unique. You can get some of the same benefits with more complexity if you are using auto increment. Such as using a stride that is similar to the number of machines you have in the cluster. But even that can get weird, and depending on your db can be a pain to setup. Using them as a cluster index and you probably will most certainly create a hotspot index and poor lookup performance, but decent write. Just due to the fact that items probably are not grouped the same as your where clause. SSD's have hidden most of this for most databases. But the issues are the same.

keosak · on June 13, 2022

They don't utterly destroy performance, but there is some hit if you use UUID v4 random values due to database index scattering. That's why this new proposal exists. It adds new versions of UUID that are mostly incrementing in time, so that they group better.

Akronymus · on June 13, 2022

In my experience, thats a big NO. While there is some performance impact, in other ways it can actually be a benefit (spreading writes out to multiple pages, for example) so as with anything in regards to performance: benchmark, benchmark, benchmark.

PS.: UUID's also work quite nicely to deal with certain system hiccups, specifically temporal ones. Could be thought of as something similar to the Erlang supervisor trees.

pcthrowaway · on June 13, 2022

CUID might be a good alternative if you need something sortable over time: https://github.com/ericelliott/cuid

It's mentioned in this IETF draft, but I don't see the analysis made available

netghost · on June 13, 2022

Can you give an example of an issue you faced with auto incrementing keys?

Though I can imagine scenarios, the worst I've ever run into was maxing out the integer size, bit that was easily remedied.

bruce511 · on June 13, 2022

There are a few, but I'll sumarize;

A) data merging. Merging multiple data sets together with auto-Inc is tricky - especially in the case of related tables.

B) data distribution /replication - especially on and off phones - especially if phones are creating data records offline, then syncing later.

C) parent-child forms. Parent has to be committed to the dB before children can be added.

D) data imports - again especially related data.

I'm not doing justice in such a brief question, each item above deserves a whole exploration.

netghost · on June 14, 2022

Thanks! Those are a great set of examples.

Perhaps one through-line is that if IDs are not created by a single source, don't rely on auto incrementing IDs.

The parent/child one is interesting. When I've worked with hierarchical data, I tend to wrap the whole process in a transaction, so it may be many commits (by depth), but one transaction.

bruce511 · on June 15, 2022

>> if IDs are not created by a single source, don't rely on auto incrementing IDs.

Data design outlives programs, and environments by a long time. Circumstances change.

So when I designed the db, there was a single database. But 30 years later we live in a world with smart phones.

Making design decisions because of _current_ circumstances can bite you hard later on.

Reading parent /child. Yes there are ways to mitigate the issue, but it's extra work and code to do so. Ultimately you need the parent id before you can add the children though.

dolmen · on June 14, 2022

One danger with auto incrementing ids is to be tempted to use the key for sorting instead of using a separate creation timestamp.

At some point you might have to create out of order records (ex: inserting missing data) and that will break the order.

remram · on June 14, 2022

A problem here is that UUIDs are not actually a silver bullet against collision. You will never generate a UUID that will collide with someone else's, but it is easy for someone else to collide with you, either on purpose or because they made changes to data from your platform before sending it back (PKs tend to leak in URLs, APIs, data exports, etc).

I find that I have to implement a system for re-numbering incoming data on migration/import anyway, so the advantage of UUIDs is not that huge.

CharlesW · on June 12, 2022

> There are a few compact representations you can use in URLs which make it a bit less ugly…

Any thoughts on where to find best-practices guidance? I need to create an external ID scheme for several million items. hashids (hashids.org) seems interesting, but I have anxiety about choosing a solution with weaknesses that I can't identify given my current level of experience in regards to this.

munawwar · on June 12, 2022

I took a shot at the math behind this at https://www.codepasta.com/databases/2020/09/10/shorter-uniqu...

Using the equation listed in the article I couldn't generate a collision so far. Yet, I still check (in code) for id collision, and pick new id, just to be 100% sure.

Olreich · on June 12, 2022

This article mostly just says: "If your data is small, then 56 bits is fine". Which is true. But if you are at 56 bits in your UUID, then just use an integer PK with 64 bits and you'll be fine a good deal longer. If your data might get large, an extra 64 bits to get a full 128bit UUID isn't expensive, and encoding in Base36 still yields 25 character UUIDs, which are pretty manageable: "abcdef-01234-56789-ghi-jklmno" (6-5-5-3-6, or max of a u16 for digits). I'm honestly shocked no UUID library makes it easy to export in Base36 with that grouping, it's very neat and tidy and easy to remember.

Dylan16807 · on June 13, 2022

Why base36 in particular? That implies that you're worried you might lose case information but you're not worried about characters that look the same?

selcuka · on June 12, 2022

You may consider encoding the uuid in base58 as it is shorter, safe to use in URLs, and easier to debug as it doesn't use characters that can be confused such as 0, 0, I, and l.

inopinatus · on June 13, 2022

I'll add a voice for base58 representation of UUIDs on the wire, it's my preferred solution as well.

Not only for debugging: by being compact, unambiguous, and URL-safe, the helpdesk is also spared a cringeworthy source of PEBCAK incidents since misquoting the identifiers is simply harder.

Caveat programmer, though, there is a hazard, occurring when someone is glib about the usage and relies on randomly generated fixed-length base58 values directly. It happens readily because some popular frameworks include such generation as a utility function. However, 58^22 > 2^128 > 58^21, so a 22-character base58 representation is expected for UUIDs but carelessly random 22-character base58 values may exceed the capacity of UUID's familiar hexadecimal serialization, effectively an integer overflow. We never generate base58 identifiers as a PK, for example, for this reason (they would of course not be UUIDs either). Alas, there is no conventional base encoding more compact than hexadecimal that is robust to the general problem. And I mention it because this issue was observed in the wild.

selcuka · on June 13, 2022

If I'm not mistaken base64 should not cause an integer overflow (i.e. it will always be a 128-bit integer) if you limit it to 22 characters (and append two = characters to pad it when you actually decode it).

It may not be as URL-safe as base58 though.

inopinatus · on June 13, 2022

Yes. I think of base64 as an unconventional encoding, in that regard, since it's more than just an alphabet for powers of the base but also communicates properties of the output octet length in the serialized form. To do that it's relying on an elegant alignment of bit values to that base58 doesn't have. Perhaps there could be a means to inject the same confluence of form and function into base58, albeit I can't see it after a couple minutes of staring at the algorithm¹.

And also yes: the equals sign is potentially hazardous in a query string, and the use of non-alphanumeric symbols in base64 also creates line-break and double-click/double-tap traps when passed around in an ad-hoc fashion, even in the url-safe variant.

[1] https://datatracker.ietf.org/doc/html/draft-msporny-base58

RhodesianHunter · on June 13, 2022

Something like Snowflake ids has many of these advantages while being integers.

Merad · on June 12, 2022

> UUIDs have a few distinct advantages: you'll never run out

I'm curious what kind of applications are limited by the range of bigint values? I have no doubt that such applications exist somewhere, but most software engineers won't ever come close to encountering those limits. Even if you have a table that is consistently consuming a billion (with a B) bigint id values _every second_ (is that even feasible with current hardware and RDBMS software?) you won't run out for almost 300 years.

jandrewrogers · on June 13, 2022

Some famous RDBMS bugs had their root cause in this kind of reasoning, because the model embeds assumptions on how those IDs will be generated and used which may not hold true in the future for reasons that are difficult to anticipate.

For example, Oracle had a 48-bit ID rollover bug many years ago that by all calculations should never occur in real systems. This calculation was made under the assumption that the IDs were mostly actually used. However, many features added later necessitated generating or reserving vast numbers of IDs in bulk, a low-cost optimization, the vast majority of which were ultimately discarded. It got to the point where very large systems started running out of these IDs due to the fact that such a low percentage were used in the way the designers had anticipated.

Extremely large systems do not run into the limitations of bigint because at that scale the identifiers are naturally segmented, often implicitly.

staticassertion · on June 13, 2022

> Extremely large systems do not run into the limitations of bigint because at that scale the identifiers are naturally segmented, often implicitly.

Agreed. That's the real key imo. You could have a 2^64 random number as a primary key if that key is also namespaced per customer - maybe in aggregate your customers generate > 2^32 events but a given customer may not. And there's other ways to limit it further.

This is the approach we take, basically, although we use a counter and not a random number.

Gigachad · on June 13, 2022

For some reason a lot of things used to default to 32 bit primary keys and you absolutely did run out of those and its been a major issue for a lot of apps for the last few years. But yeah, you'll never run out of bigints SQL servers couldn't cope with the amount of data that would require.

thayne · on June 13, 2022

Also, using an incrementing id can lead to to information disclosure if they are visible in user facing APIs. For example, if I know an id for my user, document, cart, etc. Then I know all ids lower than that are probably also valid.

zbuf · on June 12, 2022

"Unique IDs" _can_ be super really easy to work with if they're not so baffling complicated.

A random string generated using quality randomness can be adjusted to length to suit the quantity of data (negligible probability of a collision) which in most cases is very short.

It's easy to increase the length as you get more data.

They are visually very different for each item of data.

They're evenly spread which means they hash/index well.

You can tune a subset of characters if you want to decrease ambiguity eg. when exchanged by voice (no zero vs. letter O, upper/lower case etc.)

And a final bonus, when working with user input only a a short prefix is needed to uniquely identify an item (in contrast, it seems like UUIDs deliberately share a common prefix)

I'm very happy to concede I must be missing something here, and would be interested to know. But the above approach has served me well in a range of uses.

I can see how UUIDs work, and perhaps "looks like a UUID" is a useful feature. But reading the URL above and a bit of Wikipedia doesn't give me much to go on as to _why_ any of this is happening, and why the hyphens aim to retain meaning to what is ostensibly a 'unique' number.

jasonwatkinspdx · on June 12, 2022

Using purely random ids in your database destroys locality. They mention this in the introduction:

> Non-time-ordered UUID versions such as UUIDv4 have poor database index locality. Meaning new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The negative performance effects of which on common structures used for this (B-tree and its variants) can be dramatic.

The V7 ids work similarly to what you like, as they're just a unix timestamp and 74 bits of pseudorandom data (they present several different schemes you could use to generate this randomness, but the basic birthday bound says we'd need to be above 100 billion id's generated in a single millisecond to worry about collisons. Obviously most systems are nowhere near that territory.

So using these id's gives you the practical advantages of random uinique ids, but with the performance of autoincrement ids.

jandrewrogers · on June 12, 2022

100 billion UUIDs per millisecond is the 50% collision probability threshold. Achieving an acceptable collision probability for most applications would limit the UUID generation rate to more like thousands of UUIDs per millisecond.

Even if one was not generating millions of UUIDs per second on average, the risk of spiky temporal distributions when generating UUIDs would still need to be considered.

jasonwatkinspdx · on June 12, 2022

I'm aware of what the bound I quoted is, and am just too lazy to type the refinement into Wolfram for a discussion like this. But just knowing the value off the top of my head for 64 bits is 200k for 1e-9 probability of collision, I'm pretty happy with 74 bits.

And although this standard obviously wants to stick within the existing UUID footprint, if you were say doing some IoT software that would run on billions of nodes simultaneously, just add another 32/64/whatever bits of random data and deal with the minor annoyance of longer ids and lack of UUID RFC compatibility. But even then, you can truncate and reformat these these to v4 UUIDs trivially without meaningfully impacting the collision resistance, for unsorted external ids in systems that need the compatability.

The actual big risk with this sort of scheme is vm initialization. You need to be sure the CSPRNG you're using is initialized, which can be slightly tricky in cloud environments with configuration/control layer stuff that's racy. Mess this up and two nodes hydrated from the same snapshot may overlap in sequence as they start generating ids, and of course the time component cannot be trusted to save you in this instance.

11235813213455 · on June 13, 2022

Is it even possible to reach that rate given that generating a random hash takes some time?

zbuf · on June 12, 2022

Thanks for drawing my attention to that, it's the useful answer I was looking for; my use cases haven't been bound by write performance in this manner. However, I'd still be considering carefully before making use of these UUID schemes.

jasonwatkinspdx · on June 12, 2022

If you want something like this, and need a "I just want it to work, require no central coordination, and to have vanishingly small probability of collision" then just use the same concepts but wider than the 128 bit footprint limit of this scheme. This limit makes sense for the RFC in the post, as there backwards compatibility is an explicit goal. But if you used say a 64 bit nanosecond counter (to preserve best case precision on a single machine) along with 128+ bits of random data and you'll need to worry more about gamma ray bursts than collisions.

jxcole · on June 12, 2022

We used almost this exact scheme for app id indices and the curious problem we had to design against was inadvertent profanity. At some point we decided to just never use vowels to avoid ever having a complaint about 12f*ck if in the URL

jasonwatkinspdx · on June 12, 2022

Another approach is to use something like EFF's dice words lists. One of the smaller lists in particular is interesting as it's 6^4 words, filtered for profanity, and where all words have both a unique 3 letter prefix and an edit distance of 3. That makes them robust for the use case of someone reading out the phrase to someone typing or such.

Never using vowells is a smart idea I wish I'd used in the past. Previously when I've needed something like this I've used other dictionary lists vs EFF's, and those were not curated sufficiently to avoid some really unfortunate combinations.

dolmen · on June 14, 2022

https://www.eff.org/dice

manigandham · on June 12, 2022

Use integer IDs and a library like Hashids for friendly alphanumeric representations: https://hashids.org/

This particular implementation is available in dozens of languages.

manigandham · on June 12, 2022

> "They're evenly spread which means they hash/index well."

What do you mean by this? Why would you hash it further? Hash distribution is primarily down to the hashing algorithm, not the input data.

Also indexes are better with somewhat ordered and smaller data. A 64-bit int sequential counter is much faster and half the size, and compatible everywhere without the annoyances of a UUID.

chrismorgan · on June 12, 2022

I’ve been working on a robust scheme for encrypted sequential IDs, which is done, including library implementations in Rust, JavaScript and Python, pending just a smidgeon more writing about it and reviewing a decision on naming. You store an integer in the database, then encrypt it with a real block cipher, and stringify with Base58. I have three modes: one for 32-bit IDs, using Speck32/64 and producing 4–6 character IDs; one for 64-bit IDs, using Speck64/128 and producing 8–11 character IDs; and one hybrid, using the 32-bit mode for IDs below 2³² and the 64-bit mode above that, providing both a forwards-compatibility measure and a way of producing short IDs as long as possible. Contact me (see my profile) if you’re interested, or I’ll probably publish it in another day or two. Trouble is that I’ve been getting distracted with other related concepts, like optimally-short encoding by using encryption domains [0, 58¹), [58¹, 58²), …, [58¹⁰, 2⁶⁴) (this is format-preserving encryption; the main reputable and practical choices I’ve found are Hasty Pudding, which I’ve just about finished implementing but would like test vectors for but they’re on a dead FTP site, and NIST’s FF1 and FF3, which are patent-encumbered), and ways of avoiding undesirable patterns (curse words and such) by skipping integers from the database’s ID sequence if they encode to what you don’t want, and check characters with the Damm algorithm. If I didn’t keep getting distracted with these things, I’d have published a couple of weeks ago.

(I am not aware of any open-source library embodying a scheme like what I propose—all that I’ve found have either reduced scope or badly broken encryption; https://github.com/yi-jiayu/presents encrypts soundly, but doesn’t stringify; Hashids is broken almost beyond belief and should not be considered encryption; Optimus uses an extremely weak encryption.)

UUIDs are crazy overkill in any situation where you can have centralised ID allocation. Fully decentralised? Sure, 128 bits of randomness or mixed clock and randomness or similar, knock yourself out. But got a master database? Nah, you’re just generating unreasonably long values that take up unnecessary space and make for messy URLs and such.

iroddis · on June 13, 2022

> UUIDs are crazy overkill in any situation where you can have centralised ID allocation.

Except that’s specifically the use case of UUIDs: to have a decentralized method to generate unique IDs with minimal chance of collisions. If you have centralized control, of course there will be options with more attractive properties: they aren’t dealing with the same constraints.

chrismorgan · on June 13, 2022

Sure, decentralised allocation is the intent of UUIDs, but in practice UUIDv4 is used extremely widely where it’s completely unnecessary, because it’s a convenient way of generating non-sequential IDs with good tooling support (UUID libraries, UUID data types in some databases, &c.).

mappu · on June 12, 2022

I've done something similar to obfuscate private DB IDs in a large existing application - just ensure they're all Skip32-encoded in all query parameters with an app-wide secret.

It works well but you have to be very disciplined to catch every case individually. Using GUID PKs from the start just removes this entire category of problem.

chrismorgan · on June 13, 2022

If it requires discipline, you’re doing it wrong. For best results, it should be threaded through your ORM or equivalent so that you can’t do it wrong, that at your app layer you never get raw IDs.

yencabulator · on June 13, 2022

I dabbled with doing that. XTEA is a 64-bit block cipher, which suits 64-bit IDs pretty well, and well you can't really ask for much more without using larger IDs. I use z-base-32 to stringify -- I considered base58, but I think case-sensitive URLs are not nice.

Looks like

    user=1 doc=1
    -> /users/a7e34gz71r4ig/documents/xs69f1c878rzq
    
    user=42 doc=13
    -> /users/am8hng8rnoopg/documents/9othzs4tgujrw

I have the code in Go (it's near-trivial), but ended up doing something else for that project.

Also, I hate that Postgres doesn't actually have unsigned integers.

kortex · on June 12, 2022

> It's much easier to remember/recognize an integer based PK when troubleshooting a data problem.

How often are you actually relying on memory of an ID to troubleshoot a problem? I mean sure, if you are scanning visually, it's good to recognize the same ID over and over again, but my ability to do so caps around 4-6 characters. So I just look at the last 4 chars regardless when fast scanning.

I use copy-paste for any time I need to transport IDs between contexts (that isn't just scripted, which is best). Having a copy-paste stack (Alfred, Raycast and others have this feature) is a huge game changer here.

marcus_holmes · on June 13, 2022

I have exactly the reverse experience.

I do comparisons using ```<uuid fieldname>::text like '<first few chars of the uuid>%``` when debugging in a command line. Or just copy/paste the whole thing. Yes it's marginally more annoying than integers, but only marginally.

I have several times wondered why I was getting no match on a query. And then discovered that I was using a user_id on an account_id field. UUID's have saved me from shooting myself in the foot so many times.

Aeolun · on June 13, 2022

How do UUID’s help in that situation? You get no result at all instead of the wrong result?

marcus_holmes · on June 14, 2022

yes, that.

The extreme case (that I had and is the one where I was finally convinced that uuid's save me from myself) was setting admin permissions on a user. I accidentally copy/pasted their account_id instead of their user_id. If I had been using integers I would have given admin permissions to a random user and never been aware of it. But because I was using uuid's I got a nice, safe "updated 0 records" response and knew there was a problem.

tpetry · on June 12, 2022

That‘s in my experience also the best approach. I wrote an article a few days ago about the exact thing: An integer auto incrementing PK with an UUID you use externally:

https://sqlfordevs.io/uuid-prevent-enumeration-attack

hn_throwaway_99 · on June 12, 2022

That is how I used to do (and currently still do) DB design, but honestly I think if DBs start supporting UUID v7s well that I would use that as the sole primary DB key as well as the external ID:

1. They are still sorted in increasing timestamp order (at millisecond granularity), so they should have good DB index characteristics.

2. At the same time, they contain 62 bits of randomness, would should pretty much eliminate IDOR attacks if there is a bug elsewhere that isn't doing proper access checks. Not good enough for secure tokens, but just good defense against access permission check bugs.

That is, you should basically get the best of both worlds: ordered keys with enough randomness to make ID-increment attacks infeasible.

RobertRoberts · on June 12, 2022

Yep, I want UUID v7, because right now I am using ULID and it's fantastic, but I'd like more official and wide support as well.

jeremyjh · on June 12, 2022

I like the textual representation of ULIDs as well though; I wish they’d just adopted ULID as v7. At least it is binary compatible with existing UUID types.

RobertRoberts · on June 12, 2022

Is there not an equivalent text representation for UUIDv7?

jeremyjh · on June 12, 2022

Sure you could encode it in Crockford's base-32 but if it isn't part of the standard then tools won't implement it natively, so you couldn't copy a key from a url and look it up in postgres without running it through a conversion function, for example.

tpetry · on June 13, 2022

You can write a custom data type in pure SQL for PostgreSQL which is just transforming a visible string to the more efficient uuid type. That‘s basically how the uuid type can be implemented: For storage it‘s binary(16) but all operations transform the value to the visible string you see all the time. It‘s a pretty powerfull feature.

yencabulator · on June 13, 2022

I was enthusiastic for a while but

https://www.postgresql.org/docs/current/sql-createtype.html

> (This restriction is made because an erroneous type definition could confuse or even crash the server.)

Uhh, what..

> Generally these functions have to be coded in C or another low-level language.

Oh, okay. That sounds like this not really doable with many hosted Postgres services out there.

tylerscott · on June 12, 2022

Just seconding this as a sane way to use UUIDs IME. Basically the sequential integer PK is the “internal ID”. IIRC in SQLite regardless of type or existence of PK there is a private sequential integer. Super handy pattern to use.

ciupicri · on June 13, 2022

Are you talking about ROWID? [1]

> Except for WITHOUT ROWID tables, all rows within SQLite tables have a 64-bit signed integer key that uniquely identifies the row within its table. This integer is usually called the "rowid". The rowid value can be accessed using one of the special case-independent names "rowid", "oid", or "_rowid_" in place of a column name. If a table contains a user defined column named "rowid", "oid" or "_rowid_", then that name always refers the explicitly declared column and cannot be used to retrieve the integer rowid value.

> [...] If an INSERT statement attempts to insert a NULL value into a rowid or integer primary key column, the system chooses an integer value to use as the rowid automatically. A detailed description of how this is done is provided separately. [2]

[1]: https://www.sqlite.org/autoinc.html

[2]: https://www.sqlite.org/autoinc.html

evil-olive · on June 12, 2022

my personal favorite UUID replacement, when not concerned with external compatibility or standards-compliance: a 96-bit random value, base64-urlsafe encoded to 16 ASCII characters

    >>> import secrets
    >>> secrets.token_urlsafe(12)
    'uUBpBk2eENDslHyw'

with UUID, you can store it as a compact 16 bytes in a database, but it needs 36 characters if you want to embed it in a URL or a JSON payload. and there's a temptation to be "clever" and strip out the hyphens to shave off 4 bytes and create a nonstandard UUID format. by comparison these have one single canonical representation that is always a 16-character URL-safe string.

tehlike · on June 13, 2022

uuid is just what you said - 16bytes. you could presumably base64 that too...

dagss · on June 13, 2022

This proposal was 12 bytes and the feature that it is 16 chars in base64. UUID is more than 16 chars in base64.

citrin_ru · on June 13, 2022

Being impossible to remember to me is a minor problem compare to how much RAM and HDD space wasted on UUIDs across all systems being built nowadays. In addition to being large than say 64bit integer ID they are almost not compressible. DB compression will not reduce used by UUID space in DB, logs with UUIDs are bigger and compression not a great help here either.

staticassertion · on June 12, 2022

I feel like `serial` has got to be pretty fast for writes and gives you some nice properties like sorted ordering based on insertion time, a smaller key, and a key that can be compressed far better than a uuid.

I get the idea of a ULID/UUID7 encoding a timestamp so you get sorted order, but I wonder at what scale that beats just using serial.

stormbrew · on June 12, 2022

At any scale where you have to have multiple writers generating IDs or need to merge results from multiple sources, basically. Synchronizing a serial increment across hosts and across time is a pain.

Obviously you can composite a timestamp to an incrementing id, but then the serial part of it is kind of useless for ordering, so you may as well use a random number and avoid needing to synchronize at all. And then you've just reinvented a time ordered uuid (but to be fair, without the endian compatibility bs mentioned elsewhere).

staticassertion · on June 12, 2022

I'm wondering what that performance actually looks like though. The closest benchmark I have is a single postgres instance with a uuid pkey and a bigint column, transactionally incrementing the integer in the column.

I mean I guess mathing it out, updating an atomic integer is ~2-100ns, depending on contention. If you need to coordinate the writes you have anywhere from ~250μs-10ms.

We can basically throw away the increment at that point since the network is hundreds/thousands of times slower.

So at 10ms, that's 100 op/s. At 250μs more like 40kops/s.

That ignores the fact that your db can perform those writes concurrently and then batch the writes off to the other database, so long as it doesn't pretend that they're all committed at once. psql HOT updates would presumably be a thing here idk.

If I had to guess, I'd lean towards the "40kop/s" being closer than the "100op/s" but idk! I wish we had benchmarks but I can't find anything :\

deepsun · on June 12, 2022

The best strategy is to have integer primary keys for internal purposes, and some form of uuid for external (think permalinks).

marcos100 · on June 12, 2022

I can't see a case where an UUID PK is better than an INT (or BIGINT). Why would you do that?

lkrubner · on June 12, 2022

There were two problems.

One problem was the style that started around 2004, and was very popular with Ruby on Rails and WordPress, and then Syfmony and Django, where you expose the PK in the URL. If your integer starts with 1 and then increments, you may not get to a billion, and you'll never get to a trillion. So it became ridiculously easy to for outsiders to scan your site:

http://www.example.com/1

http://www.example.com/2

http://www.example.com/3

...

http://www.example.com/10000000000

That was one problem. Using UUIDs for PKs means outsiders can't simply scan your site.

The other problem was that over the years, everyone ran into the problem of moving a database, or needing to combine multiple databases, in which case having PKs the start with 1 and then increment, a collision of the PKs, from different databases, is 100% guaranteed. This often happens when combining WordPress sites, for instance. If you use UUIDs as your PK, then such collisions become unlikely.

madsbuch · on June 12, 2022

Some applications have the need to create IDs in a distributed manner, eg. Og clients, and use that identity before the database returns it. These systems benefit from randomly generated IDs.

You could potentially hist use a random number between 0 and 2^128-1 and still use ints, though I haven't seen that in action, usually pk with ints are centrally generated and consequitive.

__s · on June 12, 2022

If the id appears in a url, you may not want people to guess ids. The information leak exists even if you do authentication: maybe you don't want someone to be able to guess how many records there are, or how quickly records are being generated

tylerscott · on June 12, 2022

If for no other reason than simplified debugging I find there is value. Maybe I’m old but if you have more than one UUID involved in the debugging I’m more likely to trip up than just integers.

unixhero · on June 13, 2022

Useful to have a unique UIID too, but integers are better as PK I agree.

lewisl9029 · on June 12, 2022

I've been using ULID for a while now, which analogous to UUID v7 but with a different (better IMHO) string representation. They've been awesome for using as sort keys in dynamo for instance, since they're lexicographically sortable as strings.

But one thing I'm still wary about is exposing these IDs with millisecond-precision time components to end users, since I've seen multiple discussions here on HN about the potential for timing attacks.

How worried should I really be? Do people have useful heuristics on the kinds of data where it's safe/unsafe to expose timing information, or should I just only expose a separate UUID v4 externally across the board just to be safe?

djbusby · on June 13, 2022

ULID has like 80 bits of random per millisecond. I found some calculator online that showed the probability of collision on 80bits was very, very low even across time much longer than millisecond. I think knowing when an ID was created is harmless - and if it isn't there is a design problem that is larger than ID choice.

There is another lib called hashid which can be used if masking that ID algo is important.

lewisl9029 · on June 13, 2022

> I think knowing when an ID was created is harmless - and if it isn't there is a design problem that is larger than ID choice.

I hope you're right, since I've been cautiously operating under that assumption so far. Cautiously, because of warnings, from people better versed in cryptography than I, in discussions like this one: https://news.ycombinator.com/item?id=29805433. I still haven't quite been able to internalize when this should vs shouldn't be something I should be concerned about, hence the comment.

doliveira · on June 12, 2022

I came up with a scheme in which the random part of the ULID is a slice of the hash of a UUID, which seems to work fine because I'm generating it all server side, I guess? It works very well for insertion into NoSQL databases, for instance.

So I'd also like to know the threat model for these timing attacks.

stevesimmons · on June 12, 2022

This had appeared a few times before at earlier stages of the draft process...

https://news.ycombinator.com/item?id=28088213 [244 comments]

Personally, I really like UUIDv7, except I transform the UUID to a 25-character string which has all the same properties except it doesn't look like a UUID. The last thing I want is my index of time-sortable UUIDs getting contaminated with with some UUIDv4 fully random ones. Since UUIDs may be generated and persisted in a distributed manner, it's a simple way to at least spot this.

ikornaselur · on June 12, 2022

The UUID version is in the actual ID, so you would be able to spot the version as it isn't fully random.

Although you light mean just visually you'd spot the difference much easier!

theptip · on June 12, 2022

Can’t you just check the “version” bits and reject if it’s not 4/7? Or are you worried about someone generating a completely random (non-compliant) set of bits that happens to parse as a v4/7

operator-name · on June 12, 2022

With the introduction of UUIDv8, my fun script to generate vanity uuids[0] can finally be spec comfortant!

[0]:https://github.com/operator-name/vanity-uuid

davidjfelix · on June 12, 2022

I've been staring at the supposedly readable and memorable uuid for like 4 minutes without any idea what it says.

operator-name · on June 12, 2022

Yeah, looking back I could have done a bit more than simple substitution. Since the sentences don't gave meaning, interpreting between 1 as i vs l is especially difficult.

5eedbed5-f05e-b055-ada0-d15ab11171e5

seedbeds-fose-boss-adao-disabilities

"Memorable" was definitely tongue in cheek, as was the spec bending.

kbumsik · on June 12, 2022

UUIDv7 looks interesting, but how is it different from ULID [1] in practice? I was considering using ULID for a upcoming new project because it is lexicographically sortable but it looks like UUIDv7 just can replace that.

[1]: https://cran.r-project.org/web/packages/ulid/vignettes/intro...

rekwah · on June 12, 2022

As the author of a popular ULID implementation in python[1], the spec has no stewardship anymore. The specification repo[2] has plenty of open issues and no real guidance or communication beyond language implementation authors discussing corner cases and the gaps in the spec. The monotonic functionality is ambiguous (at best), doesn't consider distributed id generation, and is implemented differently per-language [3].

Functionally, UUIDv7 might be the _same_ but the hope would be for a more rigid specification for interoperability.

[1]: https://github.com/ahawker/ulid

[2]: https://github.com/ulid/spec

[3]: https://github.com/ulid/spec/issues/11

kortex · on June 12, 2022

I've bee using ULIDs in python for about a year now and so far have been super happy with them, so a) thank you for maintaining this! b) I always felt a bit uneasy about the way the spec describes the monotonicity component. Personally I just rely on the random aspect as I am fortunate enough to say that two events in the same millisecond are effectively simultaneous.

At that point, it's basically just UUID7 with Crockford base32 encoding, more or less.

IMHO the in-process monotonically increasing feature of ULID is misguided. As you mention, distributed ids are a pain. The instant you start talking distributed, monotonic counters or orderable events (two threads count as distributed in this case), you need to talk things like Lamport clocks or other hybrid clock strategies. It's better to reach for the right tools in this case, vs half-baked monotonic-only-in-this-process vague guarantee.

RobertRoberts · on June 12, 2022

Thank you, I've been using ULID for a while now, and it serves my purposes. But I have long term support concerns.

UUIDv7 really seems like the sweet spot between pure INT/BIGINT auto incrementing PKs and universally sortable universal ids.

duxup · on June 12, 2022

I’m always amazed how much work goes into these.

Meanwhile 99% of the time i just call a function when “I just need something unique here”, -calls function-, and it works!

Thanks everyone!

Daegalus · on June 13, 2022

I maintain the Dart UUID library. For anyone using dart, or wants to see one of many implementations of v6,v7 and a custom v8 UUID, feel free to look at my in-progress branch linked below, I plan to merge it in once they add different string representations in a future draft (I've been involved in the conversations).

https://github.com/Daegalus/dart-uuid/tree/uuidv6

kdps · on June 13, 2022

My main concern with random-based UUIDs always has been running out of entropy and cause the application to remain in a blocking state (e.g. as described here: https://blog.fastthread.io/2022/03/09/java-uuid-generation-p...). Not due to any negative experiences that I made myself, but due to a particular colleague of mine who sees it as a dealbreaker.

Is this an actual issue? Most people don't seem to care when talking about random UUIDs. The target platform of our applications is mostly Kubernetes on cloud environments, if that makes any difference.

Why I'm asking: UUID Version 7 looks quite interesting to me, and the document describes rand_a and rand_b just as "pseudo-random data"... which made me think that in the context of "uniqueness per millisecond", a source of entropy is conceptually not required. However, chapter 6.6 clearly advises the usage of CSPRNGs, so I guess the overall problem remains :(

astrange · on June 13, 2022

“Running out of entropy” is not possible; that’s a property of ancient PRNGs written by confused people.

Even if your PRNG could run out of entropy, rdrand would give it all it needs.

ciupicri · on June 13, 2022

From the documentation for java.security.SecureRandom from Java 17 [1]:

> Note: Depending on the implementation, the generateSeed, reseed and nextBytes methods may block as entropy is being gathered, for example, if the entropy source is /dev/random on various Unix-like operating systems.

[1]: https://docs.oracle.com/en/java/javase/17/docs/api/java.base...

astrange · on June 13, 2022

Like I said, ancient PRNGs written by confused people.

yencabulator · on June 13, 2022

On current day Linux, /dev/random will only block during bootup, never after it has once started.

swyx · on June 12, 2022

(poster here)

this RFC is not new new, but is still pretty new and i was surprised to learn that UUIDv7 and v8 are being worked on.

the context is i keep a list of uuid impls and knowledge for my own reference. posted this up today simply because I got a PR from some subscribers https://github.com/sw-yx/brain/pull/36

gigatexal · on June 12, 2022

Uuidv7 looks really interesting. I like that the article mentions all the other projects that attempted to fix uuid issues for different use cases.

nixpulvis · on June 12, 2022

Maybe I'm being slow right now, but can somewhat help me understand why Max UUID is ever specifically useful?

okr · on June 12, 2022

My guess is, that it is just a constant, ready to be used for bitwise operations.

nixpulvis · on June 12, 2022

Yep... I was just being slow. I thought the specialized variant was a new kind of UUID (being inverse of RFC4122), but it's just the inverse of RFC4122's Nil UUID; a single value.

Sorry for the silly comment. This value is just a bunch of binary 1s.

jwilk · on June 12, 2022

Old-school formatting of the same document:

https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...

joshspankit · on June 13, 2022

When I read the headline I thought “why do we need another UUID format?” but my respect was quickly earned.

A lot of excellent thought put in to this, and thank you to the authors for casting aside the abomination called leap seconds.

tqkxzugoaupvwqr · on June 12, 2022

Is there a way to verify a UUID was created by me? I want to avoid database lookups of UUIDs from malicious/manipulated URLs.

rossriley · on June 13, 2022

You could use the Luhn algorithm or something similar to add a check digit to the end, so for instance, generate a standard uuid, remove the last character, pass the remaining characters through the Luhn function and then use the result as the final replacement character for the UUID.

It's obviously not cryptographically secure, as in someone else could use the same algorithm to generate UUIDs that your system recognises, but could be a quick way of doing a verification before you hit the DB.

borplk · on June 13, 2022

The only secure way of doing it is to somehow crypotgraphically sign it.

You'd have to trade that DB lookup with CPU cycles for a signature validation.

For example you can use HMAC.

You send the UUID and HMAC-of-UUID to the client.

The client sends back UUID + HMAC-of-UUID.

You re-calculate HMAC-of-UUID-provided-by-client (using your secret key).

If the calculated HMAC matches the HMAC provided by the client you have confirmation that the UUID was issued by you. Because without the secret key the client can't calculate the correct HMAC for a random or modified UUID.

billpg · on June 13, 2022

Warning: The entirety of this comment might be a bad idea.

There are 124 bits available. (Actually a little less but let's pretend only 4 bits are needed for the version to keep it simple.)

You'll have to decide how many are the ID and how many are the signature. Let's say 32 bits are your ID and the remaining 92 are the signature. (Adjust according to your own needs.)

Let's suppose your ID is 1. Now you need to hash that ID together with a private key using an HMAC algorithm. This signature will have more than 92 bits so trim your signature to this length.

Now build your UUID by combining the 32 bits of your ID with the 92 bits of the signature and four bits specifying version 8 (proprietary).

To test if a claimed UUID is yours, pull out the 32 bits of ID and repeat the process of signature and building a complete UUID. If the bits match, success! If not, its a fake UUID.

Now read the responses to this comment to find out why you shouldn't do this.

fer · on June 13, 2022

Depends on your workload and how it's distributed, but a bloom filter might help you there.

The filter needs to be stored somewhere, so if your workload is write-heavy you'd be replacing lookups with updates, but if it's read-heavy most DB lookups can be replaced with filter lookups.

canadiantim · on June 12, 2022

How would one go about trying to use UUID 7 in a Postgres database / python codebase now?

oittaa · on June 12, 2022

I made a simple Python test library that extends the standard UUID class with UUIDv6 and UUIDv7. You might want to check it out. https://github.com/oittaa/uuid6-python

The official UUID Draft repository has also some alternatives if you'd like to check those out. https://github.com/uuid6/prototypes

Postgres supports UUIDs with any version number natively so you can then do something like this with it:

  create table data (id uuid, firstname varchar(100));
  insert into data (id, firstname) values ('017f21cf-d130-7cc3-98c4-dc0c0c07398f', 'John');
  select * from data;

kjmph · on June 13, 2022

I created a PL/pgSQL function for this purpose. As Starlevel001 mentioned, the UUID format will accept opaque bytes. Perhaps you could find this useful?

https://gist.github.com/kjmph/5bd772b2c2df145aa645b837da7eca...

stevesimmons · on June 13, 2022

My Python implementation is here:

https://github.com/stevesimmons/uuid7

https://pypi.org/project/uuid7/

oittaa · on June 13, 2022

Nice to see another implementation which takes a bit different approach. Just for your information, there's now Draft 03 which changes the format a little bit. I kinda liked the arbitrary precision of Draft 02, but the newer one just requires millisecond precision and then basically leaves it up to the implementation how to handle the generation of multiple UUIDs within the same millisecond.

https://github.com/uuid6/prototypes/issues/21

Starlevel001 · on June 12, 2022

Postgres UUIDs are treated as an opaque 8-byte array, you can use anything as long as it's in the textual format.

Zamicol · on June 12, 2022

Why isn't there an option with a strong cryptographic hash like SHA-256?

staticassertion · on June 12, 2022

It would be slow and doesn't really serve a purpose since you either have uuids that are totally random or uuids that need to preserve their structure.

jwilk · on June 12, 2022

Or UUIDs derived from other identifiers:

https://datatracker.ietf.org/doc/html/rfc4122#section-4.3

jamesrom · on June 13, 2022

You could utilize UUIDv8 for this purpose.

Of course v5 isn't strong in the cryptographic sense, but probably suits most purposes where you need a unique ID per an object: https://datatracker.ietf.org/doc/html/rfc4122#section-4.3

The idea of using UUIDs for cryptographic purposes is probably misguided anyway.

kagitac · on June 13, 2022

Can someone ELI5 why Unix timestamp would be stored as left-padded 48 bits for a 32-bit or potentially a 64-bit number? Is it a coincidence that 48 is the middle of 32-64?

oofbey · on June 13, 2022

I really don’t understand why we need standards for UUIDs. I get that with cryptography it’s super easy to make subtle mistakes. But UUID4 (the most commonly used variant these days) is just a long random number. Except for a few bits which aren’t random because the standard says so. As long as you’re using a goodsource of randomness (most are these days) it’s pretty hard to screw up. Why do we need a standard to tell us how to combine time stamps and random numbers?

swyx · on June 13, 2022

thats what the authors of uuid4 thought. but if you read the RFC again you’ll see they listed out the flaws and pain points. eg the impact of true randomness on database writes at scale.

having standards encodes lessons from a decade worth of pain. don’t dismiss it so casually.

oofbey · on June 14, 2022

Sorry, I still don't get it. It's a timestamp plus a random number. As their references show, plenty of people have figured out this is a fine strategy. I could explain to an intern how to build this in 10 seconds with "use a 48 bit timestamp plus 80 bits of random, then encode it as a hex string" and they're done. Finding timestamps and high-quality RNG's is just not hard these days. Replace "hex string" with "base62 encoding" and we have something significantly better than a UUID.

I get that the UUID authors feel a need to tell people there's a better way to do it, and I guess adding a new version to their standard is easier than telling people to go use something else like ULID. But I still don't see it as particularly important.

jojohohanon · on June 13, 2022

Is the author of the spec here?

I notice in section 6.1 reliability, the word “must” in clock going backwards mitigation is not MUST.

Intentional?

ripe · on June 12, 2022

Side note: I love the HTML format of these IETF RFCs, as in TFA. Over the decades, I was used to seeing the old text format (which I like), but this one is particularly easy on the eye, especially on my Android phone.

hoten · on June 12, 2022

Only mobile issue I see is there is a really long link that overflows the intended document width, which introduces an horizontal scrollbar. That makes scrolling a bit finicky on mobile.

sureglymop · on June 12, 2022

I love the design and would like to copy it for my blog! Especially the code blocks with the nice 'figure' description below them.

bearjaws · on June 12, 2022

UUIDv6 coming in just about 10 years late. Guess its something.

kstenerud · on June 12, 2022

It's a shame they didn't finally drop big endian. Little endian won out and our newer protocols should reflect that for efficiency's sake.

chrismorgan · on June 12, 2022

Big-endianness is mandatory for opaque bytewise sorting (or lexicographic ordering in string form), which is a very desirable property of UUIDv6 and UUIDv7.

amluto · on June 12, 2022

No, they should have tried to find a way to drop little endian. A cycle or two when generating UUIDs is irrelevant, and almost all UUID implementations, and most of the RFC 4122 text, are big-endian.

finnh · on June 12, 2022

But network byte order is big endian, so I can see arguments both ways (as it were)

zuzun · on June 12, 2022

It's necessary to keep them sortable at byte level.

wyager · on June 12, 2022

I guarantee you that byte order swapping is not the limiting factor on the efficiency of any system involving UUIDs on the wire.