What does HackerNews think of riegeli?
Riegeli/records is a file format for storing a sequence of string records, typically serialized protocol buffers.
The fact that protobufs are not self-delimiting is an endless source of frustration, but I know of two standards for delimiting them:
- SerializeDelimited* is part of the protobuf library: https://github.com/protocolbuffers/protobuf/blob/main/src/go...
- Riegeli is "a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding": https://github.com/google/riegeli
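For a sense of what delimiting involves, here is a minimal stdlib-only sketch of varint length-prefixed framing, the general approach the delimited helpers in the protobuf library take (a varint size followed by the message bytes). The function names are made up for illustration and the payloads are plain bytes standing in for serialized messages; this is not the protobuf library's API and not Riegeli's format.

```python
import io
from typing import Iterator, Optional


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(stream) -> Optional[int]:
    """Read one varint; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        b = stream.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("truncated varint")
        result |= (b[0] & 0x7F) << shift
        if not b[0] & 0x80:
            return result
        shift += 7


def write_delimited(stream, payload: bytes) -> None:
    """Write the payload size as a varint, then the payload itself."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_delimited(stream) -> Iterator[bytes]:
    """Yield payloads until the stream is exhausted."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated record")
        yield payload


# Hypothetical payloads; real use would pass serialized protobuf bytes.
records = [b"first", b"", b"x" * 300]
buf = io.BytesIO()
for record in records:
    write_delimited(buf, record)
buf.seek(0)
decoded = list(read_delimited(buf))
```

Plain framing like this gives no seeking, compression, or corruption detection, which is the gap the Riegeli description above is addressing.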
FWIW, this is a conscious design decision in Protobuf: it allows easy upsert operations on serialized messages by appending another message with the updated field values. This is very useful for middleware that either just wants to add its own context to a message it doesn't even parse [1], or that has to handle protobuf messages serialized with unknown fields.
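To make the append-to-upsert idea concrete, here is a small sketch that builds wire-format bytes by hand and parses the concatenation with a toy decoder. It assumes a hypothetical message with a varint field 1 and a bytes field 2, and it only models singular scalar fields, where the last value seen wins; in real protobuf parsers repeated fields are concatenated and embedded messages are merged instead. None of these helpers are part of the protobuf library.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def varint_field(number: int, value: int) -> bytes:
    """Serialize one field with wire type 0 (varint)."""
    return encode_varint((number << 3) | 0) + encode_varint(value)


def bytes_field(number: int, data: bytes) -> bytes:
    """Serialize one field with wire type 2 (length-delimited)."""
    return encode_varint((number << 3) | 2) + encode_varint(len(data)) + data


def decode_varint(buf: bytes, pos: int):
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def parse_last_wins(buf: bytes) -> dict:
    """Toy parser: a later occurrence of a field number overwrites an earlier one."""
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:
            value, pos = decode_varint(buf, pos)
        elif wire_type == 2:
            size, pos = decode_varint(buf, pos)
            value = buf[pos:pos + size]
            pos += size
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        fields[number] = value
    return fields


# Hypothetical message: field 1 is a counter, field 2 is a user name.
original = varint_field(1, 7) + bytes_field(2, b"alice")
patch = varint_field(1, 42)  # middleware appends this without parsing `original`
merged = parse_last_wins(original + patch)
```

Here `merged` ends up with field 1 set to 42 while field 2 is untouched, which only works because the serialized message carries no length or end marker of its own.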
On the other hand, 'newline delimited protobuf' is much less useful day-to-day than ndjson, as gRPC provides message streaming, which solves the problem of wanting to stream small elements of a long response (which is the general use case of ndjson in my experience). For on-disk storage of sequential protobufs (or any other data, really), you should be using something like riegeli [2], as it provides critical features like seek offsets, compression and corruption resiliency.
[1] - e.g., passing a Request message from some web server frontend, through request routers, logging, ACL and ratelimit systems up to the actual service handling the request.
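As a rough illustration of why seek offsets and corruption handling matter for on-disk records, here is a toy container where every record carries its length and a CRC32, and the writer returns each record's offset. The layout, the `HEADER` struct and the function names are assumptions made up for this sketch; it is not Riegeli's actual format, it leaves out compression, and it assumes the header itself is intact.

```python
import io
import struct
import zlib

# Assumed toy header: little-endian payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_record(stream, payload: bytes) -> int:
    """Append one record and return the offset it was written at."""
    offset = stream.tell()
    stream.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    stream.write(payload)
    return offset


def read_record(stream, offset: int):
    """Seek to a known offset; return the payload, or None if it fails the checksum."""
    stream.seek(offset)
    header = stream.read(HEADER.size)
    if len(header) != HEADER.size:
        raise EOFError("truncated header")
    size, crc = HEADER.unpack(header)
    payload = stream.read(size)
    if len(payload) != size or zlib.crc32(payload) != crc:
        return None  # corrupted record: the caller can skip it
    return payload


buf = io.BytesIO()
offsets = [append_record(buf, p) for p in (b"rec-1", b"rec-2", b"rec-3")]

# Simulate corruption by flipping the first payload byte of the second record.
raw = bytearray(buf.getvalue())
raw[offsets[1] + HEADER.size] ^= 0xFF
damaged = io.BytesIO(bytes(raw))

results = [read_record(damaged, off) for off in offsets]
```

With the saved offsets, the third record is still readable even though the second one is damaged; a bare concatenation of messages offers neither property.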
When I needed something like RecordIO to store market data, I couldn't find anything, so I implemented https://github.com/romkatv/ChunkIO. I later learned of https://github.com/google/riegeli (work in progress), which could've saved me a lot of time if only I had found it earlier. I think ChunkIO is better, though.