> author decided it was a good idea to prepend the message with the message length encoded as a varint.
> WHY? Oh, why?!
Uh oh. Is this my HN moment?
This is exactly how I implemented it at my company. We had to write many protobuf messages to one file in bulk (in parallel). I did a fair amount of research before designing this and didn’t find any standard for separating protobuf messages (in fact, I found that there explicitly isn’t one, in that protobuf itself doesn’t care). So rather than using some “special” control character, like a null byte, which would inevitably turn out to be not-so-special and collide with somebody else’s (like Schema Registry’s “magic byte”), I thought I’d use something meaningful: the number of bytes in the record that follows.
As for why I chose varint instead of just picking an integer size, well, for one I got nerd-sniped by varint encoding and thought it would be cool to try and implement it in Scala. Secondly, I thought if I chose a fixed-size integer, no matter what size I picked, my users would always surprise me and exceed it at least once, and when that happens, kaboom! I wanted to future-proof this without wasting 64 goddamn bits in front of each message, and also I got nerd-sniped, OK?!?
Someone on my team recently shared one of these files outside the company and so I really hope she’s not talking about me but that’s a crazy coincidence if not!
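For anyone who hasn't seen the scheme, here is a rough sketch of varint length-prefixed framing. It's in Python rather than the Scala mentioned above, and it is not the code from that comment. The names `encode_varint`, `read_varint`, `write_framed` and `read_framed` are made up for illustration, and the payloads are treated as opaque bytes (in practice they would be serialized protobuf messages).

```python
import io


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint, low 7-bit group first."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def read_varint(stream):
    """Read one varint; return None on a clean EOF before its first byte."""
    result = 0
    shift = 0
    while True:
        b = stream.read(1)
        if not b:
            if shift == 0:
                return None
            raise EOFError("stream ended inside a varint")
        result |= (b[0] & 0x7F) << shift
        if (b[0] & 0x80) == 0:
            return result
        shift += 7


def write_framed(stream, payload: bytes) -> None:
    """Write the payload length as a varint, then the payload itself."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_framed(stream):
    """Yield each payload from a stream of length-prefixed records."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) != size:
            raise EOFError("truncated record")
        yield payload


# Round trip: payloads may contain any byte, including nulls.
records = [b"hello", b"\x00\x00 not so special", b"x" * 300]
buf = io.BytesIO()
for r in records:
    write_framed(buf, r)
buf.seek(0)
decoded = list(read_framed(buf))
```

Short records pay one byte of overhead and the 300-byte record pays two, which is the appeal over a fixed 8-byte prefix. Since the reader never scans for a delimiter, payload bytes can't collide with the framing.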
The fact that protobufs are not self-delimiting is an endless source of frustration, but I know of 2 standards for doing this:
- SerializeDelimited* is part of the protobuf library: https://github.com/protocolbuffers/protobuf/blob/main/src/go...
- Riegeli is "a file format for storing a sequence of string records, typically serialized protocol buffers. It supports dense compression, fast decoding, seeking, detection and optional skipping of data corruption, filtering of proto message fields for even faster decoding, and parallel encoding": https://github.com/google/riegeli