What does Hacker News think of ulid/spec?

The canonical spec for ulid

UUIDv7 is a nice idea, and should probably be what people use by default instead of UUIDv4 for internal-facing uses.

For the curious:

* UUIDv4 are 128 bits long, 122 bits of which are random, with 6 bits used for the version and variant fields. Traditionally displayed as 32 hex characters with 4 dashes, so 36 characters total, and compatible with anything that expects a UUID.

* UUIDv7 are 128 bits long, 48 bits encode a unix timestamp with millisecond precision, 6 bits are for the version and variant fields, and 74 bits are random. You're expected to display them the same as other UUIDs, and they should be compatible with basically anything that expects a UUID. (It would be a very odd system that parses a UUID and throws an error because it doesn't recognise v7, but I guess it could happen, in theory?)

* ULIDs (https://github.com/ulid/spec) are 128 bits long, 48 bits encode a unix timestamp with millisecond precision, 80 bits are random. You're expected to display them in Crockford's base32, so 26 alphanumeric characters. Compatible with almost everything that expects a UUID (since they're the right length). Spec has some dumb quirks if followed literally but thankfully they mostly don't hurt things.

* KSUIDs (https://github.com/segmentio/ksuid) are 160 bits long, 32 bits encode a timestamp with second precision and a custom epoch of May 13th, 2014, and 128 bits are random. You're expected to display them in base62, so 27 alphanumeric characters. Since they're a different length, they're not compatible with UUIDs.

I quite like KSUIDs; I think base62 is a smart choice. And while the timestamp portion is a trickier question, KSUIDs use 32 bits which, at second precision (more than good enough), means they won't overflow for well over a century. Whereas UUIDv7s use 48 bits, so even at millisecond precision (not needed) they won't overflow for something like 8,000 years. We can argue whether 100 years is future-proof enough (I'd argue it is), but 8,000 years is just silly. Nobody will ever generate a compliant UUIDv7 whose first several bits aren't 0. The only downside to KSUIDs is that the length isn't UUID compatible (and arguably, that they don't devote 6 bits to a compliant UUID version).
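
To make the layout concrete, here's a back-of-the-envelope sketch (function name mine, not a vetted implementation) of assembling a v7-style UUID from the pieces described above:

```typescript
import { randomBytes } from "node:crypto";

// UUIDv7 layout: 48-bit unix-ms timestamp, then 4 version bits + 2 variant
// bits (the "6 bits" above), with the remaining 74 bits left random.
function uuidv7(): string {
  const bytes = randomBytes(16);
  const ms = BigInt(Date.now());
  for (let i = 0; i < 6; i++) {
    // Big-endian 48-bit timestamp in bytes 0..5, so IDs sort by creation time.
    bytes[i] = Number((ms >> BigInt(8 * (5 - i))) & 0xffn);
  }
  bytes[6] = (bytes[6] & 0x0f) | 0x70; // version 7 in the top nibble
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // variant bits "10"
  const hex = bytes.toString("hex");
  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20)].join("-");
}

console.log(uuidv7()); // 36 characters; ordered by timestamp across milliseconds
```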

Still feels like there's room for improvement, but for now I think I'd always pick UUIDv7 over UUIDv4 unless there's a very specific reason not to. Which would be, mostly, if there's a concern over potentially leaking the time the UUID was generated. Although if you weren't worried about leaking an integer sequence ID, you likely won't care here either.

I would use a ULID rather than thread ID.

https://github.com/ulid/spec

They (and others) are great for this kind of case - and many others.

Many people had the same idea. For example, ULID (https://github.com/ulid/spec) is more compact and stores the time, so it is lexically ordered.
ULIDs, https://github.com/ulid/spec, are also already in widespread use and are essentially the same idea as v7 UUIDs.
While UUIDs are pretty common, they are far from the only (or best) solution. To easily generate IDs on different nodes in a distributed system, something host-specific, purely random, or a mixture of both is needed. To avoid problems with random values in DB indexes, some time-based part is useful. There are other approaches like ULID or TSID which are similar to UUIDv7 but are represented in a more compact way.

Example Ulid vs UUID:

    000360TJXZDDMSKJSQGBQHA5YA
    fb87f306-b613-4948-be24-00609cf9ccc8
https://github.com/ulid/spec

https://tsid.com/de
ULID seems like a better solution to me: still 128 bits but with a more compact canonical format and sensible timestamp and increment semantics. I guess it's not far off a v7 UUID though.

https://github.com/ulid/spec

It's far less exciting I'm afraid: we use [0] to generate our DB IDs and it implements org.omg.CORBA.portable.IDLEntity.

We could fork it and remove the interface or switch to ULID[1] instead.

[0] https://github.com/stephenc/eaio-uuid/blob/master/src/main/j...

[1] https://github.com/ulid/spec

Does ULID solve both?

https://github.com/ulid/spec

ulid() // 01ARZ3NDEKTSV4RRFFQ69G5FAV

- 128-bit compatibility with UUID

- 1.21e+24 unique ULIDs per millisecond

- Lexicographically sortable!

- Canonically encoded as a 26 character string, as opposed to the 36 character UUID

- Uses Crockford's base32 for better efficiency and readability (5 bits per character)

- Case insensitive

- No special characters (URL safe)

- Monotonic sort order (correctly detects and handles the same millisecond)
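
A hedged sketch of what that last bullet means in practice (helper names mine; real implementations handle overflow of the random component, which this doesn't):

```typescript
import { randomBytes } from "node:crypto";

const B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford base32

// Encode a bigint as `len` Crockford base32 characters (5 bits each).
function b32(value: bigint, len: number): string {
  let out = "";
  for (let i = 0; i < len; i++) {
    out = B32[Number(value & 31n)] + out;
    value >>= 5n;
  }
  return out;
}

let lastMs = -1;
let lastRandom = 0n;

// Monotonic ULIDs: within the same millisecond, increment the previous 80-bit
// random component instead of drawing a fresh one, so sort order matches
// generation order even at sub-millisecond rates.
function ulid(): string {
  const ms = Date.now();
  if (ms === lastMs) {
    lastRandom += 1n; // the spec treats overflow here as an error; unhandled in this sketch
  } else {
    lastMs = ms;
    lastRandom = BigInt("0x" + randomBytes(10).toString("hex")); // fresh 80 bits
  }
  return b32(BigInt(ms), 10) + b32(lastRandom, 16); // 10 + 16 = 26 chars
}

console.log(ulid(), ulid()); // second one differs only in the last character(s)
```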

> nano id

There are pros and cons to using random IDs as a PK. For RDBMS clustering on the PK (InnoDB), it's a terrible idea. If you're going to sort by the PK, it's usually a terrible idea (UUIDv1 isn't as bad since it includes the timestamp, but that assumes your access pattern is based on insertion time). There is ULID [0] if you'd like something that's sortable. You could also just have a secondary index. An advantage can be that it _can_ be a good way (again, this depends heavily on your access patterns) to do sharding.

My other concern for nano id is twofold, both around their PRNG. They mention using Node's crypto.randomBytes(), but their source code instead references crypto.randomFill() [1]. Node's docs mention that this can have "surprising and negative performance implications for some applications" [2], related to libuv's thread pool. See my later comment about libuv and containers. Also, the docs for Node's crypto.randomBytes() mention that it "will not complete until there is sufficient entropy available." That sounds suspiciously like they're using `/dev/random` instead of `/dev/urandom`, which, at least for this application, would be an odd decision. I did note that elsewhere in nano id, they're creating their own entropy pool, so it may not matter either way.
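
For anyone who hasn't hit this distinction, a small sketch of the APIs in question (illustrative only; the buffer sizes are arbitrary):

```typescript
import { randomBytes, randomFill, randomFillSync } from "node:crypto";

// Synchronous: blocks the event loop briefly, but no libuv threadpool involved.
const syncBuf = randomFillSync(Buffer.alloc(21));

// Asynchronous: dispatched to libuv's threadpool, which is shared with fs,
// dns.lookup(), zlib, etc. -- the source of the "surprising performance
// implications" the docs warn about when the pool is saturated.
randomFill(Buffer.alloc(21), (err, buf) => {
  if (err) throw err;
  console.log(buf.toString("base64url"));
});

// randomBytes() behaves the same way: sync without a callback, threadpool with one.
const alsoSync = randomBytes(21);
```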

With that out of the way:

If the plan is only for self-hosting, then yeah, you don't really need to consider schema design that carefully. Databases are really good at their job. Also, honestly nearly none of this matters until you have significant scale.

If you plan on starting a SaaS, there's a lot to consider. An incomplete list, in no particular order:

* Foreign keys. They're very handy, but they can introduce performance problems with some access patterns. Consider indexing child table FKs (but not always - benchmark first).

* DDL like ALTER TABLE. I suggest getting intimately familiar with Postgres' locks [3]. The good news is that instant ADD COLUMN with {DEFAULT, NOT NULL} is safer now. The bad news is that it does so by lazy-loading, so if your queries are doing silly things like SELECT *, you're still going to end up with a ton of contention.

* Connection pooling. You don't want to eat up RAM dealing with connections. PgBouncer [4] and Pgpool-II [5] are two that come to mind, but there are others as well. The latter also handles replication and load balancing which is nice. If you aren't using that, you'll need to handle replication and load balancing on your own.

* Load balancing. HAProxy [6] is good for load balancing, but has its own huge set of footguns. Read their docs [7]. A few things that come to mind are:

  * Any kind of abstraction away from the CPU, like containers, may cause contention. Same with VMs (i.e. EC2), for that matter, since a noisy neighbor can drop the single-core turbo of Xeons A LOT. Look into CPU pinning if possible.

  * HAProxy really likes fast clocks over anything else, for x86. Xeons will beat Epyc. ARM can beat x86 if tuned correctly.

  * If you're using Kubernetes, look into Intel's CPU Management [8], which is also now native in K8s v1.26 [9].

* Overall for containers, learn about cgroups. Specifically, how they (both v1 and v2) expose the `/proc` filesystem to applications. Then look at how your application detects that for any kind of multiprocessing (see the sketch after this list). Hint: Node [10] uses libuv, which reads `/proc/cpuinfo` [11].

* If you have access to the disk (e.g. you're running bare metal or VMs with this capability), think carefully about the filesystem you use and its block size (and record size, if you use ZFS).
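
On the cgroups point above: a hedged sketch of why `/proc`-based CPU detection lies in containers, and where the real limit lives under cgroup v2 (the paths are standard, but the helper is mine and only handles v2):

```typescript
import { readFileSync } from "node:fs";
import { cpus } from "node:os";

// os.cpus() is backed by libuv reading /proc/cpuinfo, so it reports the
// host's cores. Under cgroup v2 the container's actual CPU quota lives in
// /sys/fs/cgroup/cpu.max as "<quota> <period>" (or "max <period>").
function effectiveCpus(): number {
  try {
    const [quota, period] = readFileSync("/sys/fs/cgroup/cpu.max", "utf8")
      .trim()
      .split(/\s+/);
    if (quota !== "max") {
      return Math.max(1, Math.floor(Number(quota) / Number(period)));
    }
  } catch {
    // No cgroup v2 quota file; fall through to the host count.
  }
  return cpus().length;
}

console.log(effectiveCpus()); // e.g. 2 when cpu.max reads "200000 100000"
```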

Good luck!

[0]: https://github.com/ulid/spec

[1]: https://github.com/ai/nanoid/blob/main/async/index.js#L5

[2]: https://github.com/nodejs/node/blob/main/doc/api/crypto.md#c...

[3]: https://www.postgresql.org/docs/current/explicit-locking.htm...

[4]: https://www.pgbouncer.org/

[5]: https://www.pgpool.net/mediawiki/index.php/Main_Page

[6]: https://www.haproxy.org/

[7]: https://cbonte.github.io/haproxy-dconv/2.4/configuration.htm...

[8]: https://networkbuilders.intel.com/solutionslibrary/cpu-pin-a...

[9]: https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...

[10]: https://github.com/nodejs/node/blob/main/src/node_os.cc#L100

[11]: https://github.com/libuv/libuv/blob/v1.x/src/unix/linux.c#L8...

ULID hits most of these, and can be converted to UUID for use with databases supporting this datatype (rather than a string column): https://github.com/ulid/spec
TIL about ULID[0]; it looks interesting, but the spec was last touched in 2019 and I haven't really heard of it before... is it actively used?

Also curious because I don't actually know: If a format spec is GPL, does that encumber implementations of said spec?

0: https://github.com/ulid/spec

These days ULID (https://github.com/ulid/spec, or any of the variants), or most recently the new UUID versions (https://www.ietf.org/archive/id/draft-peabody-dispatch-new-u...) give you creation-order sorting.
ULID was new to me. It's quite interesting https://github.com/ulid/spec
As the author of a popular ULID implementation in Python[1], I can say the spec has no stewardship anymore. The specification repo[2] has plenty of open issues and no real guidance or communication beyond language implementation authors discussing corner cases and the gaps in the spec. The monotonic functionality is ambiguous (at best), doesn't consider distributed ID generation, and is implemented differently per language [3].

Functionally, UUIDv7 might be the _same_, but the hope would be for a more rigid specification for interoperability.

[1]: https://github.com/ahawker/ulid

[2]: https://github.com/ulid/spec

[3]: https://github.com/ulid/spec/issues/11

Interesting, I learned about ULIDs for the first time from this article, which says: "The remaining 80 bits [in a ULID] are available for randomness", which I read as saying those last 80 (non-timestamp) bits were random, not incremental. But this was misleading/I got the wrong idea?

Going to the spec [1]... Yeah, that's weird. The spec calls those 80 bits "randomness", and apparently you are meant to generate a random number for the first use within a particular ms... but on second and subsequent uses you need to increment that random number instead of generating another random number?

Very odd. I don't entirely understand the design constraints that led to a non-random section still being called "randomness" in the spec even though it isn't.

[1]: https://github.com/ulid/spec

> They have a temporal aspect to them (IE, I know row with ID 225 was created before row with ID 392, and approximately when they might be created)

UUIDv7s (currently a draft spec[0]) are IDs that can be sorted in the chronological order they were created.

In the meantime, ulid[1] and ksuid[2] are popular time-sortable ID schemes, both previously discussed on HN[3]

[0] https://datatracker.ietf.org/doc/html/draft-peabody-dispatch...

[1] https://github.com/ulid/spec

[2] https://github.com/segmentio/ksuid

[3] ulid discussion: https://news.ycombinator.com/item?id=18768909

UUIDv7 discussion: https://news.ycombinator.com/item?id=28088213

This benefit can’t be overstated: missing or buggy authorization checks are present in almost all real-world web apps and APIs I’ve encountered. A long random component (such as the 80 bits provided by ULIDs) covers up a lot of sins.

A ULID is 48 bits of timestamp plus 80 bits of entropy. These new formats just don’t have enough random-component entropy. 64 bits isn’t enough. Who cares if your millisecond timestamp wraps every ~8,900 years (2^48 ms), as it does in a ULID?

https://github.com/ulid/spec

Seems like a missed opportunity to sneak ULID right in (https://github.com/ulid/spec), given that it's already pretty widely used.
See ulid for similar prior art. We generate Ulids in code and then store them in uuid columns in Postgres to realize the compact (binary) size benefits.

https://github.com/ulid/spec

Looks similar to ULID[0] (I am the author of a popular python implementation[1]).

It appears to have a similar constraint: two IDs generated within the same timestamp (ms, ns) have no strong guarantee of ordering. That might not be a deal breaker depending on your use case, but it's something to consider.

* https://github.com/ulid/spec

* https://github.com/ahawker/ulid

What is the difference between this and ULID? The latter is implemented in dozens of languages already, and many ULID libraries support conversion into UUID format.

https://github.com/ulid/spec

"Universally Unique Lexicographically Sortable Identifier"

Here’s a link to a description of ULID, for those like me who had never heard of it: https://github.com/ulid/spec
Just to see the big picture, what are the main differences between this and Snowflake, or ULID (https://github.com/ulid/spec)?
https://github.com/ulid/spec isn't timeflake, but is a good spec, and there are implementations in multiple languages. There are two listed for JS:

https://github.com/aarondcohen/id128 and https://github.com/ulid/javascript

There’s a spec called ULID that’s pretty much this with default base32 encoding

https://github.com/ulid/spec

I’ve also worked on a UUID-ULID bridge for Go

https://github.com/sudhirj/uulid.go

And seeing as this is just 128 bits it’s quite easy to move seamlessly between formats and representations.

I’ve found this concept especially useful in nosql stores like DynamoDB, where using a ULID primary key makes objects time sortable automatically. It’s also quite easy to query for items by zeroing out the random component and setting only the timestamp bytes.
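
A sketch of that zeroing trick (helper name and key expression are mine): pad the 10-character time prefix with the lowest and highest Crockford characters to bracket every ULID in a time window.

```typescript
const B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford base32

// Encode a unix-ms timestamp as the 10-character ULID time prefix.
function timePrefix(ms: number): string {
  let t = BigInt(ms);
  let out = "";
  for (let i = 0; i < 10; i++) {
    out = B32[Number(t & 31n)] + out;
    t >>= 5n;
  }
  return out;
}

// "Zero out the random component": "0" is the lowest base32 character and
// "Z" the highest, so these two keys bracket every ULID created in the window.
const lo = timePrefix(Date.parse("2023-01-01T00:00:00Z")) + "0".repeat(16);
const hi = timePrefix(Date.parse("2023-01-31T23:59:59Z")) + "Z".repeat(16);

// e.g. a DynamoDB Query with KeyConditionExpression "sk BETWEEN :lo AND :hi"
console.log(lo, hi);
```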

I have stopped using UUID and GUID in favor of https://github.com/ulid/spec
After a lot of time spent investigating the different kind of (U)UIDs, I've come to the conclusion that ULIDs[0] are the best kind of (U)UIDs you can use for your application today:

* they are sortable so pagination is fast and easy.

* they can be represented as UUID in database (at least in Postgres).

* they are kind of serial, so insertion into indexes is rather good, as opposed to completely random UUIDs.

* the 48-bit timestamp gives enough space for the next 9000 years.

[0] https://github.com/ulid/spec
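
On the "represented as UUID" point: a ULID is just 128 bits, so re-spelling it as hex gives you something a Postgres `uuid` column will accept. A minimal sketch (function name mine; no alias handling or validation beyond the alphabet check):

```typescript
const B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford base32

// Decode the 26-char ULID into its 128-bit value, then format the same bits
// in the canonical 8-4-4-4-12 UUID spelling.
function ulidToUuid(ulid: string): string {
  let value = 0n;
  for (const c of ulid.toUpperCase()) {
    const idx = B32.indexOf(c);
    if (idx < 0) throw new Error(`invalid ULID character: ${c}`);
    value = (value << 5n) | BigInt(idx);
  }
  const hex = value.toString(16).padStart(32, "0");
  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16), hex.slice(16, 20), hex.slice(20)].join("-");
}

console.log(ulidToUuid("01ARZ3NDEKTSV4RRFFQ69G5FAV")); // same bits, UUID spelling
```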

Not exactly. By default they can be sorted only down to the millisecond, but you can use a monotonic generator to keep them sorted even if more than one ULID is generated within a millisecond.

Other than that, they have 80 bits of randomness, enough to be unique even if millions are generated per second.

https://github.com/ulid/spec

I've been using ULID as an alternative to UUIDs, e.g.: https://github.com/ulid/spec

I've found ULID to be great for my purposes. Wonder if anyone else has tried using them too?

After discovering ULIDs [0] I can't see using UUIDs ever again.

ULIDs are sortable (time component), short (26 chars), nearly human readable, and have good enough entropy/randomness for everything I'd ever be working on.

Does anyone have any criticisms of ULIDs? I can't see how they don't take over general-purpose use of unique IDs in the future, except where a stronger guarantee of uniqueness is needed (i.e., bajillions of records a second...).

[0] https://github.com/ulid/spec

Hmm, other base32 systems avoid that by not including I and L (and O) - and some other refs I've read (ULID comes to mind) say to produce UPPER output but accept either-case input.

And, like this spec, the values are aliases, so 0/o are the same, 1/I/l are the same, etc.

https://github.com/ulid/spec
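
A tiny sketch of that alias folding (helper name mine; Crockford's alphabet leaves out I, L, O, and U precisely so the lookalikes can be folded back in):

```typescript
// Normalize user input to canonical Crockford base32: fold lookalike
// characters onto their aliases and uppercase everything.
function normalizeCrockford(input: string): string {
  return input
    .toUpperCase()          // accept either case, produce UPPER
    .replace(/O/g, "0")     // o/O alias 0
    .replace(/[IL]/g, "1"); // i/I/l/L alias 1
}

console.log(normalizeCrockford("o1arz3ndekTSV4RRFFQ69G5FAV"));
// -> "01ARZ3NDEKTSV4RRFFQ69G5FAV"
```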

I've not tried quick search. For the most part in the applications I've worked on I've just relied on the main primary key index (the _id field) for most lookups.

Generally I'm using a `folder/structure/ULID` approach to keys and it's really easy with start_key and end_key on allDocs to grab an entire "folder" at a time. I've had some pretty large "folders" and not seen too much trouble. At this point the biggest application I worked on pulls a lot of folders into Redux on startup and so far (knock on wood) performance seems strong. (ULIDs [1] are similar to GUIDs but are timestamp ordered lexicographically, so synchronizations leave a stable sort within the folder when just pulling by _id order.)
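
A minimal sketch of that folder fetch (the database name and key shape are made up; the `"\ufff0"` high-sentinel endkey is the usual PouchDB prefix-scan idiom):

```typescript
import PouchDB from "pouchdb";

const db = new PouchDB("app");

// With keys like `projects/<id>/notes/<ULID>`, a startkey/endkey pair over
// the prefix returns the whole "folder" in _id order -- which is creation
// order, because the last path segment is a time-ordered ULID.
async function loadFolder(prefix: string) {
  const result = await db.allDocs({
    include_docs: true,
    startkey: prefix,
    endkey: prefix + "\ufff0", // sorts after any key that extends `prefix`
  });
  return result.rows.map((row) => row.doc);
}

loadFolder("projects/01ARZ3NDEKTSV4RRFFQ69G5FAV/notes/").then(console.log);
```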

At least as far as my queries have been and what my applications needs have been, PouchDB is as fast or faster than the equivalent server-side queries (accounting for HTTPS time of flight), especially now that all modern browsers have good IndexedDB support. (There were some performance concerns I had previously when things fell back to WebSQL or worse, such as various iOS IndexedDB polyfills built on top of bad WebSQL polyfill implementations, and also a brief attempt that did not go well to use Couchbase Mobile on iOS only.)

Photos have been the bane of my applications' existence, but not for client-side reasons. I had PouchDB on top of IndexedDB handling hundreds of photos without breaking a sweat, and IndexedDB's size limits all have nice opt-in permission dialogs if you exceed them. Where I found all of the pain in working with photos was server side. CouchDB supports binary attachments, but the Replication Protocol is really dumb at handling them. Trying to replicate/synchronize photos was always filled with HTTP timeouts due to hideously bloated JSON requests (because things often get serialized as Base64), to the point where I was restricting PouchDB to only synchronize a single document at a time (and that was painfully slow). Binary attachments would balloon CouchDB's own B-Tree files badly and its homegrown database engine is not great with that (sharding in 3.0 would help, presumably). Other replication protocol servers had their own interesting limits on binary attachments; Couchbase in my tests didn't handle them well either and Cloudant turned out to have attachment size limits that weren't obvious and would result in errors, though at least their documentation also kindly pointed out that Cloudant was not intended to be a good Blob store and recommended against using binary attachments (despite CouchDB "supporting" them). (It sounds like the proposed move to FoundationDB in CouchDB 4.0 would also hugely shake up the binary attachment game. The 8 MB document limit already eliminates some of the photos I was seeing from iOS/Android cameras.)

I'd imagine you'd have all the same replication problems with large data URIs (as it was the Base64 encoding during transfers that seemed the biggest trouble), without the benefits of how well PouchDB handles binary attachments (because of how well the browsers today have optimized IndexedDB's handling of binary Blobs).

The approach I've been slowly moving towards is using `_local` documents (which don't replicate) with attached photos in PouchDB, metadata documents that do replicate with name, date, captions, ULID, resource paths/bucket IDs (and comments or whatever else makes sense) and a Blurhash [2] so there's at least a placeholder to show when photos haven't replicated, and side-banding photo replication to some other Blob storage option (S3 or Azure Storage). It's somewhat disappointing to need two entirely different replication paths (and have to secure both) and multiple storage systems in play, but I haven't found a better approach.

[1] https://github.com/ulid/spec

[2] https://blurha.sh/

ULID - https://github.com/ulid/spec it's like 48 bits of control and 80 bits of power packed into 128 bits of fun!!
Often UUIDs are used as keys to things...

And these things are often stored in databases...

And usually the database puts them into a btree internally, because that's how tables are stored.

The moment you have any kind of load on such a table, your performance goes to hell. This is because the inserts are going to happen all over the place, and the way tables are stored definitely prefers appends.

So a common word of wisdom is to have an auto-increment primary key, and the uuid indexed! Ugh, what a workaround.

A better way is to discover ULIDs, which are like UUIDs but with the high bits encoding a timestamp. This turns inserts into, approximately, appends. Much nicer!

https://github.com/ulid/spec is fairly recent, but the 'trick' has been used for years. I've built gazillion-row databases where all the data sources have, independently, used this same high-bits-are-timestamp trick. So even though I didn't see it discussed and called ULID until fairly recently, it's a very old and well-known trick in some circles.

Once you have ULIDs and you know you have ULIDs, a lot of database-type tasks become easier because the ID encodes a creation date. I've found myself adding AND id BETWEEN to queries, using them as a partition key, using them to DELETE old data, and various other admin stuff that the original creators of the table never thought about.
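
As a sketch of those AND id BETWEEN queries (the `events` table, `id` column, and helper are hypothetical, using node-postgres): since the high bits are the timestamp, a time range is just a lexicographic range over the key, and no created_at column is needed.

```typescript
import { Pool } from "pg";

const B32 = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford base32

// Build a ULID bound for a timestamp: the 10-char time prefix plus a filler
// ("0" is the lowest base32 character, "Z" the highest).
function ulidBound(ms: number, fill: "0" | "Z"): string {
  let t = BigInt(ms);
  let prefix = "";
  for (let i = 0; i < 10; i++) {
    prefix = B32[Number(t & 31n)] + prefix;
    t >>= 5n;
  }
  return prefix + fill.repeat(16);
}

// Time-range scan straight off the primary key index.
async function eventsBetween(pool: Pool, from: number, to: number) {
  const { rows } = await pool.query(
    "SELECT * FROM events WHERE id BETWEEN $1 AND $2",
    [ulidBound(from, "0"), ulidBound(to, "Z")],
  );
  return rows;
}
```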

Reading through this, I kept thinking that ULIDs[1] give the same benefits described, with wider adoption/support.

Luckily it looks like the author has already written up his thoughts on the differences[2].

[1] https://github.com/ulid/spec

[2] https://github.com/segmentio/ksuid/issues/8