> UUID versions 1, 2, 3, 4, 5 are already outdated.
Interesting comment, since v4 is the only version that provides the maximal random bits and is recommended for use as a primary key for non-correlated rows in several distributed databases to counter hot-spotting and privacy issues.
Edit: Context links for reference, these recommend UUIDv4:
https://www.cockroachlabs.com/docs/stable/uuid
https://docs.cloud.google.com/spanner/docs/schema-design#uui...
Yeah, I thought it was a strange comment, too. v7 is great when you explicitly need monotonicity, but encoded timestamps can expose information about your system. v4 is still very valid.
I think "outdated" was a poor choice of words. It is a failure to meet application requirements, which has more to do with design than age. Every standardized UUID is expressly prohibited in some application contexts due to material deficiencies, including v4. That includes newer standards like v7 and v8.
In practice, most orgs with sufficiently large and complex data models use the term "UUID" to mean a pure 128-bit value that makes no reference to the UUID standard. It is not difficult to find yourself with a set of application requirements that cannot be satisfied with a standardized UUID.
The sophistication of our use case scenarios for UUIDs exceeds their original design assumptions. They don't readily support every operation you might want to do on a UUID.
Yeah v4 is the goto, and you only use something else if you have a very specific reason like needing rough ordering
Deterministic uuids is a very standard usecase
You're talking about the hash-based UUIDv3/v5? I haven't found examples of those being used, but I'm curious.
Using MD5 or 122 bits of a SHA1 hash seems questionable now that both algorithms have known collisions. Using 122 bits of a SHA2/3 seems pretty limited too. Maybe if you've got trusted inputs?
I use these a lot. My favorite use case is templates, especially ones that were not initially planned in the architecture.
Let's say i have some entity like an "organization" that has data that spans several different tables. I want to use that organization as a "parent" in such a way where i can clone them to create new "child" organizations structured the same way they are. I also want to periodically be able to pull changes from the parent organization down into the child organization.
If the primary keys for all tables involved are UUIDs, I can accomplish this very easily by mapping all IDs in the relevant tables `id => uuid5(id, childOrgId)`. This can be done to all join tables, foreign keys, etc. The end result is a perfect "child" clone of the organization with all data relations still in place. This data can be refreshed from the parent organization any time simply by repeating the process.
Common one is if you want two structs deemed "equivalent" based on a few fields to get the same ID, and you're only concerned about accidental collision. There are valid use cases for that, but I've also seen it misused often.
v7 rough ordering also helps as a PK in certain sharded DBs, while others want random, or nonsharded ones usually just serial int.
Have you seen UUIDv3/v5 used there though? I've seen lots of md5 historically and sha variants recently, but not the UUID approach.
I remember using them in a massive SQL query that needed to generate a GIS data set from multiple tables with an ungodly amount of JOINs and sub-queries to achieve ID stability. Don't ask :p
For those ~~curious~~ worried, no, this was not a security sensitive context.
If you want 128 bits of randomness why not use 128 bits of randomness? A random UUID presupposes the random number has to fit in UUID format.
122 bits of randomness.
It's the same reason we use UTF-8. It's well supported. UUIDs are well supported by most languages and storage systems. You don't have to worry about endianness or serialization. It's not a thing you have to think about. It's already been solved and optimized.
byte[16] is well supported by most languages and storage systems.
Sure.
Now generate your random ID. Did you use a CSPRNG, or were your devs lazy and just used a PRNG? Are you doing that every time you're generating one of these IDs in any system that might need to communicate with your API? Or maybe they just generated one random number, and now they're adding 1 every time.
Now transfer it over a wire. Are you sure the way you're serializing it is how the remote system will deserialize it? Maybe you should use a string representation, since character transmission is a solved problem with UTF-8. OK, so who decides what that canonical representation is? How do we make it recognizable as an ID without looking like something that people should do arithmetic with?
It's not like random IDs were a new idea in 2002.
None of these are rocket-science problems, they're just standardization issues. You build a library with your generate_id/serialize_id/deserialize_id functions that work with a wrapper type, and tell your devs to use that library. UUID libraries are exactly that, except backed by an RFC.
Of course they're not rocket science. But, the question here is, "Why don't you use random 16 bytes instead of a UUIDv4?" It's not a question about rocket science. The answer is still, "Because UUIDv4 is still a better way to do it." The UUID standard solves the second and third tier problems and knock-on effects you don't think about until you've run a system for awhile, or until you start adding multiple information systems that need to interact with the same data.
But, using UUIDv4 shouldn't be rocket science, either. UUID support should be built in to a language intended for web applications, database applications, or business applications. That's why you're using Go or C# instead of C. And Go is somewhat focused on micro-service architectures. It's going to need to serialize and deserialize objects regularly.
How's your UUIDv4 generated?
> Are you sure the way you're serializing it is how the remote system will deserialize it?
It's 16 bytes. There's no serialization.
What do they look like when I put it in a url?
Use whatever encoding you want? Base64 is probably one of the most practical, but you're not obligated to use that.
UUIDs don't use base64
> There's no serialization.
Hex encoding with hyphens in the right spot isn't serialization?
Vibe endian
Schrodinger's complement
You are really making it seem like a huge problem. Generate random bytes, serialize to a string and store in a db. Done
A downvote tells me nothing. Please tell me what I'm missing, maybe I could learn something
> serialize to a string and store in a db
Ah, here we are. If it's just bytes, why store it as a string? Sixteen bytes is just a 128-bit integer, don't waste the space. So now the DB needs to know how to convert your string back to an integer. And back to a string when you ask for it.
"Well why not just keep it as an integer?"
Sure, in which base? With leading zeroes as padding?
But now you also need to handle this in JavaScript, where you have to know to deserialize it to a Bigint or Buffer (or Uint8Array).
UUIDs just mean you don't need to do any of this crap yourself. It's already there and it already works. Everything everywhere speaks the same UUIDs.
You have to generate random bytes with sufficient entropy to avoid collisions and you have to have a consistent way to serialize it to a string. There's already a standard for this, it's called UUID.
It’s really not that complicated a problem. Don’t worry, you’ll certainly be able to solve all the problems yourself as you encounter them. What you end up with will be functionally equivalent to a proper UUID and will only have cost you man-months of pain, but then you will be able to truly understand the benefit of not spending your effort on easy problems that someone solved before you.
It's not a huge problem. Uuid adds convenience over reinventing that wheel everywhere. And some of those wheels would use the wrong random or hash or encoding.
(Downvote wasn't me)
Really? Doesn’t v4 locally make the inserts into the B-Tree pretty messy? I was taught to use v7 because it allows writes to be a lot faster due to memory efficient paging by the kernel (something you lose with v4 because the page of a subsequent write is entirely random).
https://www.thenile.dev/blog/uuidv7#why-uuidv7 has some details: " UUID versions that are not time ordered, such as UUIDv4 (described in Section 5.4), have poor database-index locality. This means that new values created in succession are not close to each other in the index; thus, they require inserts to be performed at random locations. The resulting negative performance effects on the common structures used for this (B-tree and its variants) can be dramatic. ".
Also mentioned on HN https://news.ycombinator.com/item?id=45323008
In more practical terms:-
1. Users - your users table may not benefit by being ordered by created_at ( or uuid7 ) index because whether or not you need to query that data is tied to the users activity rather than when they first on-boarded.
2 Orders - The majority of your queries on recent orders or historical reporting type query which should benefit for a created_at ( or uuidv7 ) index.
Obviously the argument is then you're leaking data in the key, but my personal take is this is over stated. You might not want to tell people how old a User is, but you're pretty much always going to tell them how old an Order is.
It's memory and disk paging both.
There's also a hot spot problem with databases. That's the performance problem with autoincrement integers. If you are always writing to the same page on disk, then every write has to lock the same page.
Uuidv7 is a trade off between a messy b-tree (page splits) and a write page hot spot (latch contention). It's always on the right side of the b-tree, but it's spread out more to avoid hot spots.
That still doesn't mean you should always use v7. It does reversibly encode a timestamp, and it could be used to determine the rate that ids are generated (analogous to the German tank problem). If the uuidv7 is monotonic, then it's worse for this issue.
v7 exposes creation date, and maybe you don't want that. So, depends on use-case
I think I read something once about using v7 internally and exposing v4 in your API.
Or even an autoincrement int primary key internally. Depending on your scale and env etc, but still fits enough use cases.
In distributed databases I've worked with, there's usually something like a B-tree per key range, but there can be thousands of key ranges distributed over all the nodes in the cluster in parallel, each handling modifications in a LSM. The goal there is to distribute the storage and processing over all nodes equally, and that's why predictable/clustered IDs fail to do so well. That's different to the Postgres/MySQL scenario where you have one large B-tree per index.
Have you considered using two uuids for more randomness
I believe current official guidance if you want a lot of random data is to use v8, the "user-defined" UUID. The use of v4 is strictly less flexible here.
No, UUIDv8 offers 122 bits for vendor specific or experimental use cases. If you fill those bits randomly, you get the same amount of randomness as a v4. The spec is explicit that it does not replace v4 for random data use case.
> To be clear, UUIDv8 is not a replacement for UUIDv4 (Section 5.4) where all 122 extra bits are filled with random data.
https://www.rfc-editor.org/rfc/rfc9562.html#section-5.8-2
Yes, vendor-specific data can be 100% random.
It can be, but you should prefer UUIDv4 if you do that. One problem is that UUIDv8 does not promise uniqueness.
> UUIDv8's uniqueness will be implementation specific and MUST NOT be assumed.
Here's a spec compliant UUIDv8 implementation I made that doesn't produce unique IDs: https://github.com/robalexdev/uuidv8-xkcd-221
So, given a spec-compliant UUIDv4 you can assume it is unique, but you'd need out-of-band information to make the same assumption about a UUIDv8.
I wrote much more in a blog post: https://alexsci.com/blog/uuid-oops/
[flagged]