What does HackerNews think of simdjson?

Parsing gigabytes of JSON per second

Language: C++

#5 in C++
#1 in C++
#2 in JSON
Being a maintainer of the fastest XML library for Rust, I strongly disagree that XML is inherently fast to parse, and I question any such claim which comes with no evidence. Especially when it has remained unchanged on their page since (at least) 2008 [0]. Have you actually tested that claim or are you taking it at face value?

IME the XML spec is so complex that you either end up with a slow but compliant parser or a fast one that doesn't implement the spec completely.

JSON, unlike XML, is minimal enough that writing an entire compliant parser with SIMD intrinsics [1] is actually practically feasible. That library claims a parsing speed of 3 GB/s, which could theoretically process your 120 KB of data in 1/25000th of a second instead of 2/1000ths of a second.
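For anyone who wants to check the arithmetic, here is a small sketch. It assumes decimal units (120 KB = 120,000 bytes, 3 GB/s = 3e9 bytes per second) and treats the claimed throughput as sustained, which it may not be for a payload this small. The function name `parse_seconds` is made up for illustration.

```python
def parse_seconds(size_bytes, throughput_bytes_per_s):
    """Idealized parse time: payload size divided by sustained throughput."""
    return size_bytes / throughput_bytes_per_s

SIZE_BYTES = 120_000        # 120 KB, assuming decimal kilobytes
CLAIMED_THROUGHPUT = 3e9    # 3 GB/s, the figure claimed for simdjson
QUOTED_XML_SECONDS = 2 / 1000  # the 2/1000ths of a second quoted above

fast = parse_seconds(SIZE_BYTES, CLAIMED_THROUGHPUT)
one_over = round(1 / fast)                   # denominator of the "1/N s" form
speedup = round(QUOTED_XML_SECONDS / fast)   # ratio between the two timings
implied_mb_per_s = SIZE_BYTES / QUOTED_XML_SECONDS / 1e6

print(f"{fast:.6f} s, i.e. 1/{one_over} s, about {speedup}x faster")
print(f"2 ms for 120 KB implies roughly {implied_mb_per_s:.0f} MB/s")
```

Under those assumptions the two timings differ by a factor of about 50, and the 2 ms figure corresponds to roughly 60 MB/s.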

I would wager that JSON is faster to parse, on balance.

[0] https://web.archive.org/web/20080209172554/https://rapidxml....

[1] https://github.com/simdjson/simdjson

I recently had 28 GB of JSON IoT data with no guarantees about the data structure inside.

I used simdjson [1] together with its Python bindings [2] and achieved massive speedups when analyzing the data. Before, a run took on the order of minutes; afterwards it was fast enough that I didn't need to leave my desk. Reading from disk became the bottleneck, not CPU power or memory.

[1] https://github.com/simdjson/simdjson [2] https://pysimdjson.tkte.ch/

The README mentions this:

> Selects a CPU-tailored parser at runtime. No configuration needed.

https://github.com/simdjson/simdjson

One could combine JSON with a serializationless library. Your JSON would be blown up with whitespace, but reads and updates could be O(1) and serialization would be a memcpy. You could probably canonicalize the JSON during the memcpy using Lemire's SIMD techniques.

I did this once for reading JSON on the fast path: the sending system laid out the arrays in a periodic pattern in memory that enabled parseless retrieval of individual values.

https://github.com/simdjson/simdjson
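A toy illustration of the whitespace-padding idea, not the layout that was actually used: if every array element is padded to the same width, the text stays valid JSON, yet element `i` sits at a computable byte offset. `WIDTH`, `encode_fixed`, `read_at` and `write_at` are hypothetical names, and the sketch only handles integers that fit in the cell.

```python
import json

WIDTH = 8  # hypothetical fixed cell width in bytes, chosen for illustration


def _cell(value):
    """Render an int right-aligned in exactly WIDTH ASCII characters."""
    cell = str(value).rjust(WIDTH)
    if len(cell) != WIDTH:
        raise ValueError(f"{value!r} does not fit in {WIDTH} bytes")
    return cell


def encode_fixed(values):
    """Serialize ints as a JSON array with space-padded, fixed-width cells."""
    text = "[" + ",".join(_cell(v) for v in values) + "]"
    return bytearray(text.encode("ascii"))


def _offset(i):
    # Skip the opening '[' and then i cells, each followed by one comma.
    return 1 + i * (WIDTH + 1)


def read_at(buf, i):
    """Fetch element i by offset arithmetic, without parsing the rest."""
    start = _offset(i)
    return int(buf[start:start + WIDTH].decode("ascii"))


def write_at(buf, i, value):
    """Overwrite element i in place; the buffer length never changes."""
    start = _offset(i)
    buf[start:start + WIDTH] = _cell(value).encode("ascii")


buf = encode_fixed([1, 22, 333])
print(read_at(buf, 1))
write_at(buf, 1, 7)
print(json.loads(buf))  # an ordinary JSON parser still accepts the buffer
```

The cost is the one mentioned above: the document is mostly padding, which is why this only makes sense when both ends agree on the layout.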

The author knows C++ (https://github.com/simdjson/simdjson) and writes a lot about his performance experiments. I don't see why he, or anybody else, shouldn't be able to raise such questions and arguments without people (like you) getting angry about it. Is it offensive?

Anyway, he has a point: `cout` is used extensively as a logging mechanism. If you don't see that "single millisecond" making any difference, you certainly haven't worked on a relevant system.

> There doesn’t seem to be anything that is reasonably space efficient, simple and quick to parse and text based (not binary) so you can view and edit it with a standard editor.

> XML and Javascript are tree structures and not suitable for efficiently storing tabular data (plus other issues).

You can certainly be efficient with json(net). See:

Notice how they are separate objects:

   {"name": "foo", "age": 2}
   {"name": "cat", "age": 6}
You can do it very efficiently: https://github.com/simdjson/simdjson

Compress it if you need compact.
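A minimal sketch of that combination, using only the Python standard library as a stand-in (it is not simdjson and will not be as fast): one JSON object per line, written and read through gzip. The helper names `write_jsonl_gz` and `read_jsonl_gz` are made up for this example.

```python
import gzip
import io
import json


def write_jsonl_gz(records):
    """Serialize records as one compact JSON object per line, gzip-compressed."""
    buf = io.BytesIO()
    with gzip.open(buf, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, separators=(",", ":")) + "\n")
    return buf.getvalue()


def read_jsonl_gz(blob):
    """Yield records one at a time; each line is parsed independently."""
    with gzip.open(io.BytesIO(blob), "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)


records = [{"name": "foo", "age": 2}, {"name": "cat", "age": 6}]
blob = write_jsonl_gz(records)
for record in read_jsonl_gz(blob):
    print(record["name"], record["age"])
```

Because every line is a complete document, a reader never has to hold more than one row in memory, and the repeated keys compress well.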

There's also UBF, but it never saw much traction: https://ubf.github.io/ubf/ubf-user-guide.en.html#specificati...

Daniel Lemire’s simdjson probably belongs in this discussion and I would be surprised if it is not the fastest tool by some margin.

https://github.com/simdjson/simdjson

Yup, it is crazy how much optimization these "narrow waist" formats get, and it goes even further with simdjson and so forth [1].

The Python protobuf implementation has a somewhat checkered history (I used protobuf v1 and v2 for a long time, and reviewed v3 a tiny bit).

The type system issue is that protobufs to a large extent "replace" your language's types. They're essentially a language-independent type system. So that means you are limited to a lowest common denominator, and you have the issues of "winners" and "losers"... I would call Python somewhat of a "loser" in the protobuf world, i.e. it feels more second class and is more of a compromise.

This doesn't mean that anybody did a bad job; it's just a fundamental issue with such IDLs. In contrast, JSON/XML/CSV are "data-first" and there are multiple ways of using them and parsing them. You can lazily parse all of them, DOM and SAX, for example, and you have push and pull parsers, etc. Protobufs have grown some of that but it wasn't the primary usage, and many people don't know about it.

[1] https://github.com/simdjson/simdjson
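To make the "multiple ways of parsing" point concrete, here is a rough pull-style sketch using the standard library's `json.JSONDecoder.raw_decode`. It hands back one top-level value at a time from a stream of concatenated JSON values, rather than building everything up front as a DOM-style `json.loads` call would. `iter_values` is a hypothetical helper, and this is lazy only at the granularity of whole top-level values, not individual fields.

```python
import json


def iter_values(text):
    """Pull top-level JSON values out of `text` one at a time."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value


stream = '{"a": 1} [2, 3]\n"x"'
values = iter_values(stream)
print(next(values))   # only the first value has been parsed so far
print(list(values))   # the consumer decides when to pull the rest
```

Since nothing is parsed until it is requested, malformed input later in the stream only raises once the consumer actually reaches it.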

I'm sorry, I don't mean Hyperscan, I mean simdjson [0]. I think I got confused by my recollection of Lemire/Langdale.

[0] https://github.com/simdjson/simdjson

Just another case where a library tests and publishes results for all competing libraries slower than it, but none faster. cough simdjson [1] cough

---

[1] https://github.com/simdjson/simdjson

That's true, but the main argument made by the website is about the space advantage, so it's very relevant that that space advantage is basically nullified by the widespread use of compression.

If your worry is parsing speed, then JSON not only has battle-tested parsers, but also has SIMD-assisted parsers which can process gigabytes a second on a single core (e.g. https://github.com/simdjson/simdjson). It would take Internet Object years to develop parsers as performant as that, even if it did, by some miracle, achieve wide uptake. So the notional advantage afforded by not having keys on each row is neither here nor there.

And incidentally, as someone who's written a handful of parsers, I suspect that this scheme would not be particularly easy to parse. You need lookahead because of optional fields, as well as maintaining state and a lookup table for mapping positions to keys, etc. I can draw up a quick parser in pseudocode or Python to explain, if you disagree.

Source: https://github.com/simdjson/simdjson

PyPI: https://pypi.org/project/pysimdjson/

There's a rust port: https://github.com/simd-lite/simd-json

... From ijson https://pypi.org/project/ijson/#id3 which supports streaming JSON:

> Ijson provides several implementations of the actual parsing in the form of backends located in ijson/backends: [yajl2_c, yajl2_cffi, yajl2, yajl, python]

Is this the project? https://github.com/simdjson/simdjson

If so, I've been following it for a couple of years, but I put it out of my mind recently after moving to AMD. I could swear it was an Intel-only project, but a quick scan of that repo suggests I'm wrong. So either I'm totally misremembering, or AMD support was added later.

Anyway, I can't wait to try it out again. I wonder why most projects don't just use this as their default JSON parser now?