I asked about this on the GitHub issue regarding these benchmarks as well.

I'm curious as to why libraries like ultrajson[0] and orjson[1] weren't explored. They aren't command-line tools, but neither is pandas (which is benchmarked), right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use in the same way pandas was used (i.e., `python -c "..."`)?

[0] https://github.com/ultrajson/ultrajson

[1] https://github.com/ijl/orjson

The idea was to focus on querying tools. ujson and orjson (as well as the json module from Python's standard library) offer JSON encoding and decoding but not a query language: you need to implement the query logic in Python yourself, resulting in large programs with lots of boilerplate. Still, I agree that Pandas is an outlier... it was included due to its popularity for querying datasets.
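To illustrate the boilerplate point, here is a minimal sketch (hypothetical data and field names, standard-library `json` only): even a query that a tool like jq or spyql expresses in one line requires hand-written filtering and projection when all the library gives you is a decoder.

```python
import io
import json

# Hypothetical JSON-lines input; with orjson you would swap json.loads
# for orjson.loads, but the query logic below stays hand-written either way.
data = io.StringIO(
    '{"name": "a", "size": 3}\n'
    '{"name": "b", "size": 7}\n'
    '{"name": "c", "size": 5}\n'
)

# Hand-rolled equivalent of a one-liner such as
# "SELECT name FROM json WHERE size > 4":
result = []
for line in data:
    record = json.loads(line)   # decoding is the easy part...
    if record["size"] > 4:      # ...the filtering and projection are all manual
        result.append(record["name"])

print(result)  # ['b', 'c']
```

Scale this up to joins, grouping, and aggregations and the boilerplate grows quickly, which is the convenience gap the querying tools fill.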

I should mention that spyql leverages orjson, which has a considerable impact on performance. spyql supports both the json module from the standard library and orjson as its JSON decoder/encoder. Performance-wise, for 1GB of input data, orjson cuts processing time by 20-30%. So orjson is part of the reason why a Python-based tool outperforms tools written in C, Go, etc., and it deserves credit.
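Supporting both decoders can be as simple as the fallback pattern below. This is only a sketch of the idea (spyql's actual internals may differ): pick `orjson.loads` when the package is installed, otherwise fall back to the standard library, and keep the rest of the code oblivious to which one is in use.

```python
import json

# Hypothetical pluggable-decoder pattern, not spyql's actual code:
# prefer the Rust-backed orjson.loads when available, otherwise use
# the pure-stdlib json.loads. Both return plain Python dicts/lists,
# so downstream query code doesn't need to know which one ran.
try:
    import orjson
    loads = orjson.loads
except ImportError:
    loads = json.loads

doc = loads('{"size": 42}')
print(doc["size"])  # 42
```

Because the two functions share the same input/output contract, swapping one for the other changes only decoding speed, not behavior.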

> resulting in large programs with lots of boilerplate

That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me.

Thank you for this benchmark! I'll probably switch to spyql now from jq.

> So, orjson is part of the reason why a python-based tool outperforms tools written in C, Go, etc and deserves credit.

Yes, I definitely think this is worth mentioning upfront in the future since, IIUC, orjson's core is written in Rust (the serde library, specifically). The initial title gave me the impression that a pure-Python JSON parsing-and-querying solution was the fastest out there.

A parallel I find helpful is a claim like "the fastest BERT implementation is written in Python[0]". While the linked implementation is written in Python, it offloads the performance-critical parts to C/C++ through TensorFlow.

I'm not sure how such claims advance our understanding of the tradeoffs of programming languages. I initially thought I was going to change my mind about my impression that "python is not a good tool for implementing fast parsing/querying", but in the end I haven't, so I do think the title is a bit misleading.

[0] https://github.com/google-research/bert