Fortunately this is a bit less relevant today as Windows loses market share in database and server applications, but:
UUIDs have historically massively screwed up endian handling. While this new draft discusses sorting UUIDs as strings of octets (bytes) and the text of RFC4122 is fairly explicit about most significant bytes coming first, the C UUID structure in RFC 4122 appendix A is entirely misguided:
Those aren’t bytes — they’re integers of various sizes. (Hint: do not use integer types in C code for portable data structures. ntohl, etc are a mess. Just use arrays of bytes.)
I don’t know the whole history, but MS somehow took this structure at face value and caused problems like this:
So, if you want to do anything (e.g. sorting) that depends on the representation of a UUID (or even depends on converting between string and binary representations), be aware that UUIDs coming from Windows may be little-endian. In my book, this is a Windows bug, but opinions may differ here.
UUIDs (as GUIDs) in Windows predate RFC 4122, so I don't think it's that unreasonable that they're not compliant with the spec (since the exact contents of a UUID aren't typically that important, but consistency in how you produce them is).
The GUID implementation in Windows derives from the DCE RPC specification [1]. That's where the multibyte integers in the RFC 4122 specification come from (they're stated the same way in the DCE RPC spec), and it doesn't explicitly specify their endianness. It does call them "NDR integers", but NDR integers can be little endian or big endian depending on implementation. DCE specifies a mechanism by which you'd indicate which you're using in an RPC call, but that data's not included in the UUID format-- you get whatever byte order the system's decided to use, which, for Windows, is little-endian.
It's actually worse than that. The first 3 groupings (textually) of the uuid might be little endian while the other 2 are big endian. Learning this cost me more time than I care to admit.
This is consistent with the misguided structure in the RFC. The first three fields (the time fields) are multibyte integers. The remainder is just bytes. The dashes in the textual representation are just there to confuse you.
When you translate between UUIDs in binary and in text form and communicate with other code the binary UUIDs. The other code might expect the uuids in a different encoding.
This mishap lives forth in the UUID stored in a machines DMI data, as well in GPT partition tables, which are required when EFI is used. It would be really cool if we had some replacement for EFI that would not harbour these kind of painful legacies.
On the other hand, this is an easy to implement conversion, while changing such a fundamental thing from EFI sounds pretty hard, making it not worth it.
You cannot fix this with an conversion because you do not know if your UUID is correct or needs conversion. DMI data for example has inconsistent endian-ness depending on the vendor. So if you have a UUID sticker on a new server, you still have two options which UUID the machine will send during PXE, either the printed UUID in big-endian encoding or in the microsoft mixed-endian encoding.
Use BIOS boot instead of EFI, it has less legacy to implement: PE executables, FAT file system, Win64 ABI
At my last workplace we twice got new workstations where they all had the same bogus UUID. Once it was Dell, the other I don't remember. The fix was to either manually set a new one in the BIOS setup, or install a BIOS Update. So I have my doubts that the variant field can be trusted here.
After failed attempts with the UUID due to above problems, we now do machine identification for PXE via MAC address for virtual machines (you see/set it in the hypervisor) and the DMI product_serial for physical machines (written on chassis).
It's fun to note that the worst possible UUID sort order in existence isn't Microsoft's fault but Sun's. They missed the "unsigned" catch to those integers in the struct and Java sorts UUIDs as signed integers. (Which is why sometimes you'll notice in for instance Android apps Guids get sorted such that that 8000... < FFFF... < 0000... < 7FFF...)
He's linked it here: https://devblogs.microsoft.com/oldnewthing/20190913-00/?p=10.... Its a good read. The LE format, especially with UUID 7 like UUID (seq UUID's), seems to be easier/faster to sort just using binary sorting on LE architectures which is most computers nowdays. Interesting to see all these tradeoffs.
> Hint: do not use integer types in C code for portable data structures. ntohl, etc are a mess. Just use arrays of bytes.
I don’t see how that helps much. If a developer forgets to call ntohl on multi-byte integer fields, I don’t trust them to correctly convert said integers to arrays of bytes, either.
Unfortunately, the naive way also turns out to be wrong in C. uint8_t gets promoted to a signed int when shifting, which in turn causes undefined behavior for specific input. One way of fixing this is casting to the desired type before the shift, thus avoiding surprising conversions.
On a side note, compiler warnings and sanitizers help with this kind of stuff greatly, use them if you have the option: https://godbolt.org/z/8oq9GTcze
>> uint8_t gets promoted to a signed int when shifting, which in turn causes undefined behavior for specific input.
>> One way of fixing this is casting to the desired type before the shift, thus avoiding surprising conversions.
Agree. For clarity, the "specific input" in the example would be a bytes[0] value larger than 127.
Little endian won. I don't see the point of maintaining theoretical compatibility with big endian systems that are at best esoteric right now and soon to be extinct. Likewise, every reasonable platform aligns data fields on natural alignment these days. It's just a waste of effort to make software portable to evolutionary dead ends.
This isn't about compatibility with big-endian machines. This is about compatibility between different uuid libraries, potentially on different operating systems, all on little endian CPU architectures.
Every existing library follows the standard which specifies host byte order, which usually means little endian. The Rust library cited a few levels up this chain ignored that, somehow assuming big endian, and then had to correct for this mistake.
I used to be a big proponent of using UUIDs for database PKs but I've found them inherently difficult to work with. It's much easier to remember/recognize an integer based PK when troubleshooting a data problem.
This isn't to say you shouldn't use UUIDs at all, but I much prefer to use an "ExternalId" column of UUID type if you don't want to expose your integer based PKs externally.
UUIDs have a few distinct advantages: you'll never run out, you don't need a roundtrip to find out what they are after saving them, they often make a good partitioning key and it makes things easier if you ever need to combine multiple data sources together in migration and recovery type scenarios. I also quite like how they're unique across all data sources and tables, so if you just encounter a random contextless UUID in the wild, for example in a support ticket, you can probably still find what it refers to.
They are quite unwieldy though. There are a few compact representations you can use in URLs which make it a bit less ugly, but they can make your database and logs quite bloated, in particular if you've got a large number of small records.
> if you ever need to combine multiple data sources together in migration and recovery type scenarios
This insane idea that combining data sources is a rare event in some unusual "migration and recovery" scenarios is one of the most poisonous and yet pervasive ideas in all of database design. You are always combining multiple data sources, all the time. Users submitting data from a form is a data source. Test, staging, and production deployments, with multiple of each. External APIs. Multiple clients. Eventual consistency. Replication. Microservices with distributed systems. Or even sharing any common data at all between different systems, like unit conversions, chemical data, country names, engineering constants, etc.
Anyone who even considers using a single authoritative source for all entity identity either better be making a system in an underground bunker that will never talk to any other system. Otherwise they are making a serious and extremely avoidable mistake. Never use auto-incrementing IDs.
It's even wrong in a monolith! Why does everyone abandon this idea of "separation of concerns" and "single responsibility principle" and "bounded contexts" and proper abstraction and limited communication between system parts and literally every design principle they've ever been taught when they go to design a database? It all just goes out the window! Why do you guys bother reading books about system design if you ignore them when you build a database? "Multiple systems communicating with each other" should apply recursively all the way from deployment and external integration down to individual functions. That means database, too. Auto-incrementing IDs are anathema to that.
While your tone might be a tad hyperbolic, I agree with your basic premise.
If I could go back and tell my 30 year ago self one tip, it would be to use uuids over auto-increments. And this is back when that was expensive - in disk space and database time.
Instead I'm stuck with my design, and as time has passed the real cost of auto-Inc has slowly revealed itself.
What's interesting to me though is that this view is not universal. I get a lot of push-back when promoting uuids, but I can really only speak to my experience.
You’re not alone. I’ve been migrating my tables to use uuid instead of integers and have been using uuid whenever I have new tables, unless I have very good reason not to. Experience was my teacher.
Destroy is a strong word (and UUIDs can certainly be sorted, but locality is an issue), but all of software is a series of tradeoffs. I've used both auto-increment and UUIDs, and wish I had used UUIDs in almost every case.
Distributed generation - no sending a record to a server to get a key then using it to generate other records. In a world with increasing use of services, this becomes more important every day.
System wide unique - helps with logging, debugging, and avoiding general errors.
Multi-master db replication - I know this depends on the RDBMS, but having a unique key on every record avoids clashes. Also super useful during data migrations (which will happen. I have another rant that data always outlives code, so plan accordingly).
Validation - UUIDs have a form that can be a first level validation on input.
For me, those advantages outweigh some extra space usage, possible performance impact, and ugly URLs.
And on performance, if it's determined that it is an issue because of using UUIDs there are ways to make them more index friendly.
There is a perf hit. You can not help it when you are slugging that many bytes around. An int/int64 fits into a register easy and the instruction for comparison is cheap. However, when you add in replication and other items UUID becomes more desirable due to the property having extra information embedded into it to make them unique. You can get some of the same benefits with more complexity if you are using auto increment. Such as using a stride that is similar to the number of machines you have in the cluster. But even that can get weird, and depending on your db can be a pain to setup. Using them as a cluster index and you probably will most certainly create a hotspot index and poor lookup performance, but decent write. Just due to the fact that items probably are not grouped the same as your where clause. SSD's have hidden most of this for most databases. But the issues are the same.
They don't utterly destroy performance, but there is some hit if you use UUID v4 random values due to database index scattering. That's why this new proposal exists. It adds new versions of UUID that are mostly incrementing in time, so that they group better.
In my experience, thats a big NO. While there is some performance impact, in other ways it can actually be a benefit (spreading writes out to multiple pages, for example) so as with anything in regards to performance: benchmark, benchmark, benchmark.
PS.: UUID's also work quite nicely to deal with certain system hiccups, specifically temporal ones. Could be thought of as something similar to the Erlang supervisor trees.
Perhaps one through-line is that if IDs are not created by a single source, don't rely on auto incrementing IDs.
The parent/child one is interesting. When I've worked with hierarchical data, I tend to wrap the whole process in a transaction, so it may be many commits (by depth), but one transaction.
>> if IDs are not created by a single source, don't rely on auto incrementing IDs.
Data design outlives programs, and environments by a long time. Circumstances change.
So when I designed the db, there was a single database. But 30 years later we live in a world with smart phones.
Making design decisions because of _current_ circumstances can bite you hard later on.
Reading parent /child. Yes there are ways to mitigate the issue, but it's extra work and code to do so. Ultimately you need the parent id before you can add the children though.
A problem here is that UUIDs are not actually a silver bullet against collision. You will never generate a UUID that will collide with someone else's, but it is easy for someone else to collide with you, either on purpose or because they made changes to data from your platform before sending it back (PKs tend to leak in URLs, APIs, data exports, etc).
I find that I have to implement a system for re-numbering incoming data on migration/import anyway, so the advantage of UUIDs is not that huge.
> There are a few compact representations you can use in URLs which make it a bit less ugly…
Any thoughts on where to find best-practices guidance? I need to create an external ID scheme for several million items. hashids (hashids.org) seems interesting, but I have anxiety about choosing a solution with weaknesses that I can't identify given my current level of experience in regards to this.
Using the equation listed in the article I couldn't generate a collision so far. Yet, I still check (in code) for id collision, and pick new id, just to be 100% sure.
This article mostly just says: "If your data is small, then 56 bits is fine". Which is true. But if you are at 56 bits in your UUID, then just use an integer PK with 64 bits and you'll be fine a good deal longer. If your data might get large, an extra 64 bits to get a full 128bit UUID isn't expensive, and encoding in Base36 still yields 25 character UUIDs, which are pretty manageable: "abcdef-01234-56789-ghi-jklmno" (6-5-5-3-6, or max of a u16 for digits). I'm honestly shocked no UUID library makes it easy to export in Base36 with that grouping, it's very neat and tidy and easy to remember.
You may consider encoding the uuid in base58 as it is shorter, safe to use in URLs, and easier to debug as it doesn't use characters that can be confused such as 0, 0, I, and l.
I'll add a voice for base58 representation of UUIDs on the wire, it's my preferred solution as well.
Not only for debugging: by being compact, unambiguous, and URL-safe, the helpdesk is also spared a cringeworthy source of PEBCAK incidents since misquoting the identifiers is simply harder.
Caveat programmer, though, there is a hazard, occurring when someone is glib about the usage and relies on randomly generated fixed-length base58 values directly. It happens readily because some popular frameworks include such generation as a utility function. However, 58^22 > 2^128 > 58^21, so a 22-character base58 representation is expected for UUIDs but carelessly random 22-character base58 values may exceed the capacity of UUID's familiar hexadecimal serialization, effectively an integer overflow. We never generate base58 identifiers as a PK, for example, for this reason (they would of course not be UUIDs either). Alas, there is no conventional base encoding more compact than hexadecimal that is robust to the general problem. And I mention it because this issue was observed in the wild.
If I'm not mistaken base64 should not cause an integer overflow (i.e. it will always be a 128-bit integer) if you limit it to 22 characters (and append two = characters to pad it when you actually decode it).
Yes. I think of base64 as an unconventional encoding, in that regard, since it's more than just an alphabet for powers of the base but also communicates properties of the output octet length in the serialized form. To do that it's relying on an elegant alignment of bit values to that base58 doesn't have. Perhaps there could be a means to inject the same confluence of form and function into base58, albeit I can't see it after a couple minutes of staring at the algorithm¹.
And also yes: the equals sign is potentially hazardous in a query string, and the use of non-alphanumeric symbols in base64 also creates line-break and double-click/double-tap traps when passed around in an ad-hoc fashion, even in the url-safe variant.
> UUIDs have a few distinct advantages: you'll never run out
I'm curious what kind of applications are limited by the range of bigint values? I have no doubt that such applications exist somewhere, but most software engineers won't ever come close to encountering those limits. Even if you have a table that is consistently consuming a billion (with a B) bigint id values _every second_ (is that even feasible with current hardware and RDBMS software?) you won't run out for almost 300 years.
Some famous RDBMS bugs had their root cause in this kind of reasoning, because the model embeds assumptions on how those IDs will be generated and used which may not hold true in the future for reasons that are difficult to anticipate.
For example, Oracle had a 48-bit ID rollover bug many years ago that by all calculations should never occur in real systems. This calculation was made under the assumption that the IDs were mostly actually used. However, many features added later necessitated generating or reserving vast numbers of IDs in bulk, a low-cost optimization, the vast majority of which were ultimately discarded. It got to the point where very large systems started running out of these IDs due to the fact that such a low percentage were used in the way the designers had anticipated.
Extremely large systems do not run into the limitations of bigint because at that scale the identifiers are naturally segmented, often implicitly.
> Extremely large systems do not run into the limitations of bigint because at that scale the identifiers are naturally segmented, often implicitly.
Agreed. That's the real key imo. You could have a 2^64 random number as a primary key if that key is also namespaced per customer - maybe in aggregate your customers generate > 2^32 events but a given customer may not. And there's other ways to limit it further.
This is the approach we take, basically, although we use a counter and not a random number.
For some reason a lot of things used to default to 32 bit primary keys and you absolutely did run out of those and its been a major issue for a lot of apps for the last few years. But yeah, you'll never run out of bigints SQL servers couldn't cope with the amount of data that would require.
Also, using an incrementing id can lead to to information disclosure if they are visible in user facing APIs. For example, if I know an id for my user, document, cart, etc. Then I know all ids lower than that are probably also valid.
"Unique IDs" _can_ be super really easy to work with if they're not so baffling complicated.
A random string generated using quality randomness can be adjusted to length to suit the quantity of data (negligible probability of a collision) which in most cases is very short.
It's easy to increase the length as you get more data.
They are visually very different for each item of data.
They're evenly spread which means they hash/index well.
You can tune a subset of characters if you want to decrease ambiguity eg. when exchanged by voice (no zero vs. letter O, upper/lower case etc.)
And a final bonus, when working with user input only a a short prefix is needed to uniquely identify an item (in contrast, it seems like UUIDs deliberately share a common prefix)
I'm very happy to concede I must be missing something here, and would be interested to know. But the above approach has served me well in a range of uses.
I can see how UUIDs work, and perhaps "looks like a UUID" is a useful feature. But reading the URL above and a bit of Wikipedia doesn't give me much to go on as to _why_ any of this is happening, and why the hyphens aim to retain meaning to what is ostensibly a 'unique' number.
Using purely random ids in your database destroys locality. They mention this in the introduction:
> Non-time-ordered UUID versions such as UUIDv4 have poor database index locality. Meaning new values created in succession are not close to each other in the index and thus require inserts to be performed at random locations. The negative performance effects of which on common structures used for this (B-tree and its variants) can be dramatic.
The V7 ids work similarly to what you like, as they're just a unix timestamp and 74 bits of pseudorandom data (they present several different schemes you could use to generate this randomness, but the basic birthday bound says we'd need to be above 100 billion id's generated in a single millisecond to worry about collisons. Obviously most systems are nowhere near that territory.
So using these id's gives you the practical advantages of random uinique ids, but with the performance of autoincrement ids.
100 billion UUIDs per millisecond is the 50% collision probability threshold. Achieving an acceptable collision probability for most applications would limit the UUID generation rate to more like thousands of UUIDs per millisecond.
Even if one was not generating millions of UUIDs per second on average, the risk of spiky temporal distributions when generating UUIDs would still need to be considered.
I'm aware of what the bound I quoted is, and am just too lazy to type the refinement into Wolfram for a discussion like this. But just knowing the value off the top of my head for 64 bits is 200k for 1e-9 probability of collision, I'm pretty happy with 74 bits.
And although this standard obviously wants to stick within the existing UUID footprint, if you were say doing some IoT software that would run on billions of nodes simultaneously, just add another 32/64/whatever bits of random data and deal with the minor annoyance of longer ids and lack of UUID RFC compatibility. But even then, you can truncate and reformat these these to v4 UUIDs trivially without meaningfully impacting the collision resistance, for unsorted external ids in systems that need the compatability.
The actual big risk with this sort of scheme is vm initialization. You need to be sure the CSPRNG you're using is initialized, which can be slightly tricky in cloud environments with configuration/control layer stuff that's racy. Mess this up and two nodes hydrated from the same snapshot may overlap in sequence as they start generating ids, and of course the time component cannot be trusted to save you in this instance.
Thanks for drawing my attention to that, it's the useful answer I was looking for; my use cases haven't been bound by write performance in this manner. However, I'd still be considering carefully before making use of these UUID schemes.
If you want something like this, and need a "I just want it to work, require no central coordination, and to have vanishingly small probability of collision" then just use the same concepts but wider than the 128 bit footprint limit of this scheme. This limit makes sense for the RFC in the post, as there backwards compatibility is an explicit goal. But if you used say a 64 bit nanosecond counter (to preserve best case precision on a single machine) along with 128+ bits of random data and you'll need to worry more about gamma ray bursts than collisions.
We used almost this exact scheme for app id indices and the curious problem we had to design against was inadvertent profanity. At some point we decided to just never use vowels to avoid ever having a complaint about 12f*ck if in the URL
Another approach is to use something like EFF's dice words lists. One of the smaller lists in particular is interesting as it's 6^4 words, filtered for profanity, and where all words have both a unique 3 letter prefix and an edit distance of 3. That makes them robust for the use case of someone reading out the phrase to someone typing or such.
Never using vowells is a smart idea I wish I'd used in the past. Previously when I've needed something like this I've used other dictionary lists vs EFF's, and those were not curated sufficiently to avoid some really unfortunate combinations.
> "They're evenly spread which means they hash/index well."
What do you mean by this? Why would you hash it further? Hash distribution is primarily down to the hashing algorithm, not the input data.
Also indexes are better with somewhat ordered and smaller data. A 64-bit int sequential counter is much faster and half the size, and compatible everywhere without the annoyances of a UUID.
I’ve been working on a robust scheme for encrypted sequential IDs, which is done, including library implementations in Rust, JavaScript and Python, pending just a smidgeon more writing about it and reviewing a decision on naming. You store an integer in the database, then encrypt it with a real block cipher, and stringify with Base58. I have three modes: one for 32-bit IDs, using Speck32/64 and producing 4–6 character IDs; one for 64-bit IDs, using Speck64/128 and producing 8–11 character IDs; and one hybrid, using the 32-bit mode for IDs below 2³² and the 64-bit mode above that, providing both a forwards-compatibility measure and a way of producing short IDs as long as possible. Contact me (see my profile) if you’re interested, or I’ll probably publish it in another day or two. Trouble is that I’ve been getting distracted with other related concepts, like optimally-short encoding by using encryption domains [0, 58¹), [58¹, 58²), …, [58¹⁰, 2⁶⁴) (this is format-preserving encryption; the main reputable and practical choices I’ve found are Hasty Pudding, which I’ve just about finished implementing but would like test vectors for but they’re on a dead FTP site, and NIST’s FF1 and FF3, which are patent-encumbered), and ways of avoiding undesirable patterns (curse words and such) by skipping integers from the database’s ID sequence if they encode to what you don’t want, and check characters with the Damm algorithm. If I didn’t keep getting distracted with these things, I’d have published a couple of weeks ago.
(I am not aware of any open-source library embodying a scheme like what I propose—all that I’ve found have either reduced scope or badly broken encryption; https://github.com/yi-jiayu/presents encrypts soundly, but doesn’t stringify; Hashids is broken almost beyond belief and should not be considered encryption; Optimus uses an extremely weak encryption.)
UUIDs are crazy overkill in any situation where you can have centralised ID allocation. Fully decentralised? Sure, 128 bits of randomness or mixed clock and randomness or similar, knock yourself out. But got a master database? Nah, you’re just generating unreasonably long values that take up unnecessary space and make for messy URLs and such.
> UUIDs are crazy overkill in any situation where you can have centralised ID allocation.
Except that’s specifically the use case of UUIDs: to have a decentralized method to generate unique IDs with minimal chance of collisions. If you have centralized control, of course there will be options with more attractive properties: they aren’t dealing with the same constraints.
Sure, decentralised allocation is the intent of UUIDs, but in practice UUIDv4 is used extremely widely where it’s completely unnecessary, because it’s a convenient way of generating non-sequential IDs with good tooling support (UUID libraries, UUID data types in some databases, &c.).
I've done something similar to obfuscate private DB IDs in a large existing application - just ensure they're all Skip32-encoded in all query parameters with an app-wide secret.
It works well but you have to be very disciplined to catch every case individually. Using GUID PKs from the start just removes this entire category of problem.
If it requires discipline, you’re doing it wrong. For best results, it should be threaded through your ORM or equivalent so that you can’t do it wrong, that at your app layer you never get raw IDs.
I dabbled with doing that. XTEA is a 64-bit block cipher, which suits 64-bit IDs pretty well, and well you can't really ask for much more without using larger IDs. I use z-base-32 to stringify -- I considered base58, but I think case-sensitive URLs are not nice.
> It's much easier to remember/recognize an integer based PK when troubleshooting a data problem.
How often are you actually relying on memory of an ID to troubleshoot a problem? I mean sure, if you are scanning visually, it's good to recognize the same ID over and over again, but my ability to do so caps around 4-6 characters. So I just look at the last 4 chars regardless when fast scanning.
I use copy-paste for any time I need to transport IDs between contexts (that isn't just scripted, which is best). Having a copy-paste stack (Alfred, Raycast and others have this feature) is a huge game changer here.
I do comparisons using ```<uuid fieldname>::text like '<first few chars of the uuid>%``` when debugging in a command line. Or just copy/paste the whole thing. Yes it's marginally more annoying than integers, but only marginally.
I have several times wondered why I was getting no match on a query. And then discovered that I was using a user_id on an account_id field. UUID's have saved me from shooting myself in the foot so many times.
The extreme case (that I had and is the one where I was finally convinced that uuid's save me from myself) was setting admin permissions on a user. I accidentally copy/pasted their account_id instead of their user_id. If I had been using integers I would have given admin permissions to a random user and never been aware of it. But because I was using uuid's I got a nice, safe "updated 0 records" response and knew there was a problem.
That‘s in my experience also the best approach. I wrote an article a few days ago about the exact thing: An integer auto incrementing PK with an UUID you use externally:
That is how I used to do (and currently still do) DB design, but honestly I think if DBs start supporting UUID v7s well that I would use that as the sole primary DB key as well as the external ID:
1. They are still sorted in increasing timestamp order (at millisecond granularity), so they should have good DB index characteristics.
2. At the same time, they contain 62 bits of randomness, would should pretty much eliminate IDOR attacks if there is a bug elsewhere that isn't doing proper access checks. Not good enough for secure tokens, but just good defense against access permission check bugs.
That is, you should basically get the best of both worlds: ordered keys with enough randomness to make ID-increment attacks infeasible.
I like the textual representation of ULIDs as well though; I wish they’d just adopted ULID as v7. At least it is binary compatible with existing UUID types.
Sure you could encode it in Crockford's base-32 but if it isn't part of the standard then tools won't implement it natively, so you couldn't copy a key from a url and look it up in postgres without running it through a conversion function, for example.
You can write a custom data type in pure SQL for PostgreSQL which is just transforming a visible string to the more efficient uuid type.
That‘s basically how the uuid type can be implemented: For storage it‘s binary(16) but all operations transform the value to the visible string you see all the time. It‘s a pretty powerfull feature.
Just seconding this as a sane way to use UUIDs IME. Basically the sequential integer PK is the “internal ID”. IIRC in SQLite regardless of type or existence of PK there is a private sequential integer. Super handy pattern to use.
> Except for WITHOUT ROWID tables, all rows within SQLite tables have a 64-bit signed integer key that uniquely identifies the row within its table. This integer is usually called the "rowid". The rowid value can be accessed using one of the special case-independent names "rowid", "oid", or "_rowid_" in place of a column name. If a table contains a user defined column named "rowid", "oid" or "_rowid_", then that name always refers the explicitly declared column and cannot be used to retrieve the integer rowid value.
> [...] If an INSERT statement attempts to insert a NULL value into a rowid or integer primary key column, the system chooses an integer value to use as the rowid automatically. A detailed description of how this is done is provided separately. [2]
my personal favorite UUID replacement, when not concerned with external compatibility or standards-compliance: a 96-bit random value, base64-urlsafe encoded to 16 ASCII characters
with UUID, you can store it as a compact 16 bytes in a database, but it needs 36 characters if you want to embed it in a URL or a JSON payload. and there's a temptation to be "clever" and strip out the hyphens to shave off 4 bytes and create a nonstandard UUID format. by comparison these have one single canonical representation that is always a 16-character URL-safe string.
Being impossible to remember to me is a minor problem compare to how much RAM and HDD space wasted on UUIDs across all systems being built nowadays. In addition to being large than say 64bit integer ID they are almost not compressible. DB compression will not reduce used by UUID space in DB, logs with UUIDs are bigger and compression not a great help here either.
I feel like `serial` has got to be pretty fast for writes and gives you some nice properties like sorted ordering based on insertion time, a smaller key, and a key that can be compressed far better than a uuid.
I get the idea of a ULID/UUID7 encoding a timestamp so you get sorted order, but I wonder at what scale that beats just using serial.
At any scale where you have to have multiple writers generating IDs or need to merge results from multiple sources, basically. Synchronizing a serial increment across hosts and across time is a pain.
Obviously you can composite a timestamp to an incrementing id, but then the serial part of it is kind of useless for ordering, so you may as well use a random number and avoid needing to synchronize at all. And then you've just reinvented a time ordered uuid (but to be fair, without the endian compatibility bs mentioned elsewhere).
I'm wondering what that performance actually looks like though. The closest benchmark I have is a single postgres instance with a uuid pkey and a bigint column, transactionally incrementing the integer in the column.
I mean I guess mathing it out, updating an atomic integer is ~2-100ns, depending on contention. If you need to coordinate the writes you have anywhere from ~250μs-10ms.
We can basically throw away the increment at that point since the network is hundreds/thousands of times slower.
So at 10ms, that's 100 op/s. At 250μs more like 40kops/s.
That ignores the fact that your db can perform those writes concurrently and then batch the writes off to the other database, so long as it doesn't pretend that they're all committed at once. psql HOT updates would presumably be a thing here idk.
If I had to guess, I'd lean towards the "40kop/s" being closer than the "100op/s" but idk! I wish we had benchmarks but I can't find anything :\
One problem was the style that started around 2004, and was very popular with Ruby on Rails and WordPress, and then Syfmony and Django, where you expose the PK in the URL. If your integer starts with 1 and then increments, you may not get to a billion, and you'll never get to a trillion. So it became ridiculously easy to for outsiders to scan your site:
That was one problem. Using UUIDs for PKs means outsiders can't simply scan your site.
The other problem was that over the years, everyone ran into the problem of moving a database, or needing to combine multiple databases, in which case having PKs the start with 1 and then increment, a collision of the PKs, from different databases, is 100% guaranteed. This often happens when combining WordPress sites, for instance. If you use UUIDs as your PK, then such collisions become unlikely.
Some applications have the need to create IDs in a distributed manner, eg. Og clients, and use that identity before the database returns it. These systems benefit from randomly generated IDs.
You could potentially hist use a random number between 0 and 2^128-1 and still use ints, though I haven't seen that in action, usually pk with ints are centrally generated and consequitive.
If the id appears in a url, you may not want people to guess ids. The information leak exists even if you do authentication: maybe you don't want someone to be able to guess how many records there are, or how quickly records are being generated
If for no other reason than simplified debugging I find there is value. Maybe I’m old but if you have more than one UUID involved in the debugging I’m more likely to trip up than just integers.
I've been using ULID for a while now, which analogous to UUID v7 but with a different (better IMHO) string representation. They've been awesome for using as sort keys in dynamo for instance, since they're lexicographically sortable as strings.
But one thing I'm still wary about is exposing these IDs with millisecond-precision time components to end users, since I've seen multiple discussions here on HN about the potential for timing attacks.
How worried should I really be? Do people have useful heuristics on the kinds of data where it's safe/unsafe to expose timing information, or should I just only expose a separate UUID v4 externally across the board just to be safe?
ULID has like 80 bits of random per millisecond. I found some calculator online that showed the probability of collision on 80bits was very, very low even across time much longer than millisecond. I think knowing when an ID was created is harmless - and if it isn't there is a design problem that is larger than ID choice.
There is another lib called hashid which can be used if masking that ID algo is important.
> I think knowing when an ID was created is harmless - and if it isn't there is a design problem that is larger than ID choice.
I hope you're right, since I've been cautiously operating under that assumption so far. Cautiously, because of warnings, from people better versed in cryptography than I, in discussions like this one: https://news.ycombinator.com/item?id=29805433. I still haven't quite been able to internalize when this should vs shouldn't be something I should be concerned about, hence the comment.
I came up with a scheme in which the random part of the ULID is a slice of the hash of a UUID, which seems to work fine because I'm generating it all server side, I guess? It works very well for insertion into NoSQL databases, for instance.
So I'd also like to know the threat model for these timing attacks.
Personally, I really like UUIDv7, except I transform the UUID to a 25-character string which has all the same properties except it doesn't look like a UUID. The last thing I want is my index of time-sortable UUIDs getting contaminated with with some UUIDv4 fully random ones. Since UUIDs may be generated and persisted in a distributed manner, it's a simple way to at least spot this.
Can’t you just check the “version” bits and reject if it’s not 4/7? Or are you worried about someone generating a completely random (non-compliant) set of bits that happens to parse as a v4/7
Yeah, looking back I could have done a bit more than simple substitution. Since the sentences don't gave meaning, interpreting between 1 as i vs l is especially difficult.
5eedbed5-f05e-b055-ada0-d15ab11171e5
seedbeds-fose-boss-adao-disabilities
"Memorable" was definitely tongue in cheek, as was the spec bending.
UUIDv7 looks interesting, but how is it different from ULID [1] in practice? I was considering using ULID for a upcoming new project because it is lexicographically sortable but it looks like UUIDv7 just can replace that.
As the author of a popular ULID implementation in python[1], the spec has no stewardship anymore. The specification repo[2] has plenty of open issues and no real guidance or communication beyond language implementation authors discussing corner cases and the gaps in the spec. The monotonic functionality is ambiguous (at best), doesn't consider distributed id generation, and is implemented differently per-language [3].
Functionally, UUIDv7 might be the _same_ but the hope would be for a more rigid specification for interoperability.
I've bee using ULIDs in python for about a year now and so far have been super happy with them, so a) thank you for maintaining this! b) I always felt a bit uneasy about the way the spec describes the monotonicity component. Personally I just rely on the random aspect as I am fortunate enough to say that two events in the same millisecond are effectively simultaneous.
At that point, it's basically just UUID7 with Crockford base32 encoding, more or less.
IMHO the in-process monotonically increasing feature of ULID is misguided. As you mention, distributed ids are a pain. The instant you start talking distributed, monotonic counters or orderable events (two threads count as distributed in this case), you need to talk things like Lamport clocks or other hybrid clock strategies. It's better to reach for the right tools in this case, vs half-baked monotonic-only-in-this-process vague guarantee.
I maintain the Dart UUID library. For anyone using dart, or wants to see one of many implementations of v6,v7 and a custom v8 UUID, feel free to look at my in-progress branch linked below, I plan to merge it in once they add different string representations in a future draft (I've been involved in the conversations).
My main concern with random-based UUIDs always has been running out of entropy and cause the application to remain in a blocking state (e.g. as described here: https://blog.fastthread.io/2022/03/09/java-uuid-generation-p...). Not due to any negative experiences that I made myself, but due to a particular colleague of mine who sees it as a dealbreaker.
Is this an actual issue? Most people don't seem to care when talking about random UUIDs. The target platform of our applications is mostly Kubernetes on cloud environments, if that makes any difference.
Why I'm asking: UUID Version 7 looks quite interesting to me, and the document describes rand_a and rand_b just as "pseudo-random data"... which made me think that in the context of "uniqueness per millisecond", a source of entropy is conceptually not required. However, chapter 6.6 clearly advises the usage of CSPRNGs, so I guess the overall problem remains :(
From the documentation for java.security.SecureRandom from Java 17 [1]:
> Note: Depending on the implementation, the generateSeed, reseed and nextBytes methods may block as entropy is being gathered, for example, if the entropy source is /dev/random on various Unix-like operating systems.
this RFC is not new new, but is still pretty new and i was surprised to learn that UUIDv7 and v8 are being worked on.
the context is i keep a list of uuid impls and knowledge for my own reference. posted this up today simply because I got a PR from some subscribers https://github.com/sw-yx/brain/pull/36
Yep... I was just being slow. I thought the specialized variant was a new kind of UUID (being inverse of RFC4122), but it's just the inverse of RFC4122's Nil UUID; a single value.
Sorry for the silly comment. This value is just a bunch of binary 1s.
You could use the Luhn algorithm or something similar to add a check digit to the end, so for instance, generate a standard uuid, remove the last character, pass the remaining characters through the Luhn function and then use the result as the final replacement character for the UUID.
It's obviously not cryptographically secure, as in someone else could use the same algorithm to generate UUIDs that your system recognises, but could be a quick way of doing a verification before you hit the DB.
The only secure way of doing it is to somehow crypotgraphically sign it.
You'd have to trade that DB lookup with CPU cycles for a signature validation.
For example you can use HMAC.
You send the UUID and HMAC-of-UUID to the client.
The client sends back UUID + HMAC-of-UUID.
You re-calculate HMAC-of-UUID-provided-by-client (using your secret key).
If the calculated HMAC matches the HMAC provided by the client you have confirmation that the UUID was issued by you. Because without the secret key the client can't calculate the correct HMAC for a random or modified UUID.
Warning: The entirety of this comment might be a bad idea.
There are 124 bits available. (Actually a little less but let's pretend only 4 bits are needed for the version to keep it simple.)
You'll have to decide how many are the ID and how many are the signature. Let's say 32 bits are your ID and the remaining 92 are the signature. (Adjust according to your own needs.)
Let's suppose your ID is 1. Now you need to hash that ID together with a private key using an HMAC algorithm. This signature will have more than 92 bits so trim your signature to this length.
Now build your UUID by combining the 32 bits of your ID with the 92 bits of the signature and four bits specifying version 8 (proprietary).
To test if a claimed UUID is yours, pull out the 32 bits of ID and repeat the process of signature and building a complete UUID. If the bits match, success! If not, its a fake UUID.
Now read the responses to this comment to find out why you shouldn't do this.
Depends on your workload and how it's distributed, but a bloom filter might help you there.
The filter needs to be stored somewhere, so if your workload is write-heavy you'd be replacing lookups with updates, but if it's read-heavy most DB lookups can be replaced with filter lookups.
I made a simple Python test library that extends the standard UUID class with UUIDv6 and UUIDv7. You might want to check it out. https://github.com/oittaa/uuid6-python
Postgres supports UUIDs with any version number natively so you can then do something like this with it:
create table data (id uuid, firstname varchar(100));
insert into data (id, firstname) values ('017f21cf-d130-7cc3-98c4-dc0c0c07398f', 'John');
select * from data;
I created a PL/pgSQL function for this purpose. As Starlevel001 mentioned, the UUID format will accept opaque bytes. Perhaps you could find this useful?
Nice to see another implementation which takes a bit different approach. Just for your information, there's now Draft 03 which changes the format a little bit. I kinda liked the arbitrary precision of Draft 02, but the newer one just requires millisecond precision and then basically leaves it up to the implementation how to handle the generation of multiple UUIDs within the same millisecond.
It would be slow and doesn't really serve a purpose since you either have uuids that are totally random or uuids that need to preserve their structure.
Can someone ELI5 why Unix timestamp would be stored as left-padded 48 bits for a 32-bit or potentially a 64-bit number? Is it a coincidence that 48 is the middle of 32-64?
I really don’t understand why we need standards for UUIDs. I get that with cryptography it’s super easy to make subtle mistakes. But UUID4 (the most commonly used variant these days) is just a long random number. Except for a few bits which aren’t random because the standard says so. As long as you’re using a goodsource of randomness (most are these days) it’s pretty hard to screw up. Why do we need a standard to tell us how to combine time stamps and random numbers?
thats what the authors of uuid4 thought. but if you read the RFC again you’ll see they listed out the flaws and pain points. eg the impact of true randomness on database writes at scale.
having standards encodes lessons from a decade worth of pain. don’t dismiss it so casually.
Sorry, I still don't get it. It's a timestamp plus a random number. As their references show, plenty of people have figured out this is a fine strategy. I could explain to an intern how to build this in 10 seconds with "use a 48 bit timestamp plus 80 bits of random, then encode it as a hex string" and they're done. Finding timestamps and high-quality RNG's is just not hard these days.
Replace "hex string" with "base62 encoding" and we have something significantly better than a UUID.
I get that the UUID authors feel a need to tell people there's a better way to do it, and I guess adding a new version to their standard is easier than telling people to go use something else like ULID. But I still don't see it as particularly important.
Side note: I love the HTML format of these IETF RFCs, as in TFA. Over the decades, I was used to seeing the old text format (which I like), but this one is particularly easy on the eye, especially on my Android phone.
Only mobile issue I see is there is a really long link that overflows the intended document width, which introduces an horizontal scrollbar. That makes scrolling a bit finicky on mobile.
Big-endianness is mandatory for opaque bytewise sorting (or lexicographic ordering in string form), which is a very desirable property of UUIDv6 and UUIDv7.
No, they should have tried to find a way to drop little endian. A cycle or two when generating UUIDs is irrelevant, and almost all UUID implementations, and most of the RFC 4122 text, are big-endian.
UUIDs have historically massively screwed up endian handling. While this new draft discusses sorting UUIDs as strings of octets (bytes) and the text of RFC4122 is fairly explicit about most significant bytes coming first, the C UUID structure in RFC 4122 appendix A is entirely misguided:
Those aren’t bytes — they’re integers of various sizes. (Hint: do not use integer types in C code for portable data structures. ntohl, etc are a mess. Just use arrays of bytes.)I don’t know the whole history, but MS somehow took this structure at face value and caused problems like this:
https://github.com/uuid-rs/uuid/issues/277
So, if you want to do anything (e.g. sorting) that depends on the representation of a UUID (or even depends on converting between string and binary representations), be aware that UUIDs coming from Windows may be little-endian. In my book, this is a Windows bug, but opinions may differ here.