What does HackerNews think of FastPFor?

The FastPFOR C++ library: Fast integer compression

Language: C++

Well, if your integers are sequential you can encode huge numbers of them in just a few bytes using diff (delta) + RLE, likely far fewer than half a byte per integer on average for the right dataset (in theory you can store 1, 2, 3, 4, 5, ..., 10_000 in 2 bytes).
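To make the diff + RLE idea concrete, here is a minimal C++ sketch. The names `Run`, `delta_rle_encode` and `delta_rle_decode` are made up for illustration and are not part of FastPFor. It assumes the first delta is taken against 0 and that differences do not overflow `int64_t`. How the resulting runs are serialized to bytes is left out.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// One run: a delta value and how many consecutive times it occurs.
using Run = std::pair<int64_t, uint64_t>;

// Delta-encode the input (the first delta is taken against 0), then
// collapse repeated deltas into (delta, count) runs.
std::vector<Run> delta_rle_encode(const std::vector<int64_t>& values) {
    std::vector<Run> runs;
    int64_t prev = 0;
    for (int64_t v : values) {
        int64_t delta = v - prev;  // assumes no overflow
        prev = v;
        if (!runs.empty() && runs.back().first == delta) {
            ++runs.back().second;        // extend the current run
        } else {
            runs.emplace_back(delta, 1); // start a new run
        }
    }
    return runs;
}

// Expand the runs and rebuild the original values with a running sum.
std::vector<int64_t> delta_rle_decode(const std::vector<Run>& runs) {
    std::vector<int64_t> values;
    int64_t current = 0;
    for (const Run& run : runs) {
        for (uint64_t i = 0; i < run.second; ++i) {
            current += run.first;
            values.push_back(current);
        }
    }
    return values;
}
```

With this sketch the sequence 1, 2, ..., 10_000 collapses into a single run (delta 1, repeated 10_000 times), which is why so few bytes are needed once the run itself is stored compactly.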

But for other integer datasets there's FastPFOR

https://github.com/lemire/FastPFor

The papers linked there cover techniques that can pack multiple 32-bit integers into a single byte, among other things. Integer compression is pretty powerful if your data isn't random. The thing with UUIDs is that your data is pretty random - even a UUIDv7 contains a significant amount of random data.
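As a rough illustration of how several 32-bit integers can end up in one byte, here is a naive fixed-width bit-packing sketch in C++. The function names `bitpack` and `bitunpack` are hypothetical and this is not FastPFor's API. The schemes in the linked papers are considerably more elaborate (per-block bit widths, exception handling, SIMD). It assumes every value fits in `bits` bits, with `bits` between 1 and 32.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack each value into `bits` bits (1..32), least significant bit first.
// Assumes every value fits in `bits` bits; higher bits are silently dropped.
std::vector<uint8_t> bitpack(const std::vector<uint32_t>& values, unsigned bits) {
    std::vector<uint8_t> out((values.size() * bits + 7) / 8, 0);
    std::size_t bitpos = 0;
    for (uint32_t v : values) {
        for (unsigned b = 0; b < bits; ++b, ++bitpos) {
            if ((v >> b) & 1u) {
                out[bitpos / 8] |= static_cast<uint8_t>(1u << (bitpos % 8));
            }
        }
    }
    return out;
}

// Inverse of bitpack: read `count` values of `bits` bits each.
std::vector<uint32_t> bitunpack(const std::vector<uint8_t>& packed,
                                std::size_t count, unsigned bits) {
    std::vector<uint32_t> values(count, 0);
    std::size_t bitpos = 0;
    for (std::size_t i = 0; i < count; ++i) {
        for (unsigned b = 0; b < bits; ++b, ++bitpos) {
            if ((packed[bitpos / 8] >> (bitpos % 8)) & 1u) {
                values[i] |= (1u << b);
            }
        }
    }
    return values;
}
```

Four values that each fit in 2 bits, such as {1, 0, 3, 2}, occupy a single byte here instead of 16. Going bit by bit is slow; it is written this way only to keep the layout obvious.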

One notable omission from this piece is a technique to compress integer time series with both positive and negative values.

If you naively apply bit-packing using the Simple8b algorithm, you'll find that negative integers are not compressed. This is due to how signed integers are represented in modern computers: negative integers will have their most significant bit set [1].

Zigzag encoding is a neat transform that circumvents this issue. It works by mapping signed integers to unsigned integers so that numbers with a small absolute value can be encoded using a small number of bits. Put another way, it encodes negative numbers using the least significant bit for sign. [2]
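For reference, a minimal C++ sketch of the zigzag transform for 32-bit integers follows. The function names are my own and not taken from FastPFor or any other library. It avoids right-shifting negative values. The final cast in `zigzag_decode` relies on the usual two's complement conversion, which is guaranteed only from C++20 onward.

```cpp
#include <cstdint>

// Map signed to unsigned so that small magnitudes get small codes:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
inline uint32_t zigzag_encode(int32_t n) {
    // Shift the magnitude bits left by one; for negatives, flip all bits,
    // which leaves the sign in the least significant bit.
    return (static_cast<uint32_t>(n) << 1) ^ (n < 0 ? 0xFFFFFFFFu : 0u);
}

// Inverse mapping: the lowest bit selects whether to flip the remaining bits.
inline int32_t zigzag_decode(uint32_t z) {
    // 0u - (z & 1u) is 0 for even codes and 0xFFFFFFFF for odd codes.
    return static_cast<int32_t>((z >> 1) ^ (0u - (z & 1u)));
}
```

After this transform, -1 needs a single bit rather than a full 32-bit pattern with the top bit set, so a bit-packing scheme such as Simple8b can store small negative values as compactly as small positive ones.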

If you're looking for a quick way to experiment with various time series compression algorithms, I highly recommend Daniel Lemire's FastPFor repository [3] (as linked in the article). I've used the Python bindings [4] to quickly evaluate various compression algorithms with great success.

Finally I'd like to humbly mention my own tiny contribution [5], an adaptation of Lemire's C++ Simple8b implementation (including basic methods for delta & zigzag encoding/decoding).

I used C++ templates to make the encoding and decoding routines generic over integer bit-width, which expands support up to 64-bit integers and offers efficient usage with smaller integers (e.g. 16-bit). I made a couple of other minor tweaks, including support for arrays up to 2^64 in length and adjusting the API/method signatures so they can be used in a more functional style. This implementation is slightly simpler to invoke via FFI, and I intend to add examples showing how to compile for usage via JS (WebAssembly), Python, and C#. I threw my code up quickly in order to share with you all; hopefully someone finds it useful. I intend to expand on usage examples/test cases/etc., and am looking forward to any comments or contributions.

[1] https://en.wikipedia.org/wiki/Signed_number_representation

[2] https://en.wikipedia.org/wiki/Variable-length_quantity#Zigza...

[3] https://github.com/lemire/FastPFor

[4] https://github.com/searchivarius/PyFastPFor

[5] https://github.com/naturalplasmoid/simple8b-timeseries-compr...

> Simply put, it is nicer to build your systems so that, as much as possible, they use a constant amount of memory irrespective of the input size

Really good advice - this is a hard-earned lesson for many folks. I've worked with quite a few data scientists who were brilliant at experimental design but not necessarily experts in the field of comp sci. Their relatively simple Python scripts would run nice and fast initially. As time passed and the organization grew, their scripts would run slower and slower as the datasets scaled and swapping to disk set in. In some cases they would completely lock up shared machines, taking a good chunk of the team offline for a bit.

Anyway, Daniel Lemire's blog is a fantastic resource. I highly recommend taking a look through his publications and open source contributions. I was able to save my employer a lot of money by building on time series compression algorithms [1] and vectorized implementations [2][3] that he has provided.

[1] Decoding billions of integers per second through vectorization https://onlinelibrary.wiley.com/doi/full/10.1002/spe.2203

[2] https://github.com/lemire/FastPFor

[3] https://github.com/searchivarius/PyFastPFor

Have you considered other integer compression algorithms like https://github.com/lemire/FastPFor?

Look for the section that says 'We posted our paper online together with our software.' The word 'paper' is a link to http://arxiv.org/abs/1209.2137 and 'our software' is a link to https://github.com/lemire/FastPFor