What does HackerNews think of minisketch?

Minisketch: an optimized library for BCH-based set reconciliation

Language: C++

Anyone interested in IBLT with low failure probability should also be aware of pinsketch and, particularly, our implementation of it: minisketch ( https://github.com/sipa/minisketch/ ).

Our implementation communicates a difference of <=N b-bit entries with exactly N*b bits with 100% success. The cost for this communications efficiency and reliability is that the decoder takes CPU time quadratic in N, instead of IBLT's linear decoder. However, when N is usually small, this can be fine if the implementation is fast -- especially since you wouldn't normally want to use set reconciliation unless you were communication-limited.

For the sketch sizes (both low N and low bits per entry) we were interested in for relaying transactions in the Bitcoin network, pinsketch/minisketch is more than a dozen times more communications efficient than low-failure-rate IBLT. The concrete efficiency of this paper isn't yet clear to me-- though at a glance I think it also will not be great at small sizes, only asymptotically (which is, of course, also where the computational costs become an issue).

Pinsketches and IBLTs can also be combined: one can use pinsketches as the cells of an IBLT, and one can use a small pinsketch to improve the failure rate of an IBLT (when a correctly sized IBLT fails, it's usually due to a single undecodable cycle), which can be corrected using only a tiny amount of pinsketch data.

Since the protocol appears to use ad hoc synchronization, the authors might be interested in https://github.com/sipa/minisketch/ which is a library that implements a data structure (pinsketch) that allows two parties to synchronize their sets of m b-bit elements which differ by c entries using only b*c bits. A naive protocol would use m*b bits instead, which is potentially much larger.

I'd guess that under normal usage the message densities probably don't justify such efficient means-- we developed this library for use in Bitcoin, targeting rates on the order of a dozen new messages per second, where every participant has many peers with potentially differing sets-- but it's still probably worth being aware of. The pinsketch is always at least as efficient as a naive approach, but may not be worth the complexity.

The somewhat better known IBLT data structure has constant overheads that make it less efficient than even naive synchronization until the set differences are fairly large (particularly when the element hashes are small); so some applications that evaluated and eschewed IBLT might find pinsketch applicable.

I love set reconciliation structures like the IBLT (Invertible Bloom Lookup Table) and BCH set digests like minisketch.

https://github.com/sipa/minisketch

Let's say you have a set of a billion items. Someone else has mostly the same set, but the two differ by 10 items. These structures let you exchange messages that would fit in one UDP packet to reconcile the sets.

Minisketch is more costly in CPU but has deterministic guarantees. IBLTs are very fast but have a chance of failure requiring a randomized iterative procedure.

How about a pinsketch:

A pinsketch is a set data structure that takes a specified amount of memory and into which you can insert and remove members, or even add whole sets, in time O(memory size). You can insert an unbounded number of entries, and at any time that it holds no more entries than its size allows, you can decode the list of members.

For an example usage, say I have a list of ten million IP addresses of people who have DOS attacked my systems recently. I want to send my list to you over an expensive pay as you go iridium connection, so I don't want to just send the 40MiB list since that would cost about $700. (Or 12.7 MiB if you use an optimal set encoding, or somewhat in between if you use the Golomb coded set mentioned in the OP).

Fortunately you've been making your own observations (and maybe have stale data from me), and we don't expect our lists to differ by more than 1000 entries. So I make and maintain a pinsketch with size 1000, which takes 4000 bytes (1000 * 4 bytes, because IP addresses are 32 bits). Then to send you an update I just send it over. You maintain your own pinsketch of addresses; you subtract yours from the one I sent and then decode it. If the number of entries in the resulting set (the number different between us) is 1000 or fewer, you're guaranteed to learn the difference (otherwise the decode will fail, or give a false decode with odds ~= 1/2^(1000)).

Bonus: We don't need to know in advance how different our sets are-- I can send the sketch in units as small as one word at a time (32-bits in this case) and stop sending once you've got enough to decode. Implemented that way I could send you exactly the amount of data you're missing without even knowing which entries you're missing.

The sketch itself is simple: To insert an element you raise it to a sequence of odd powers (1,3,5,7...) in a finite field and add it to an accumulator for each power. The tricky part is decoding it. :P
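The odd-power accumulator described above can be sketched in a few lines. This is a toy encoder over GF(2^8) for illustration only -- the real library uses larger fields and a fast Berlekamp–Massey decoder, which is the hard part and is omitted here; the function names are mine, not minisketch's API:

```python
def gf_mul(a, b, poly=0x11B):
    """Multiply in GF(2^8), reducing by the AES polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return r

def gf_pow(x, e):
    """Exponentiation in GF(2^8) by repeated squaring."""
    r = 1
    while e:
        if e & 1:
            r = gf_mul(r, x)
        x = gf_mul(x, x)
        e >>= 1
    return r

def sketch(elements, capacity):
    # Accumulator i holds the XOR-sum of x^(2i+1) over all elements,
    # i.e. the odd power sums 1, 3, 5, ...
    acc = [0] * capacity
    for x in elements:
        for i in range(capacity):
            acc[i] ^= gf_pow(x, 2 * i + 1)
    return acc

# Because addition in GF(2^b) is XOR, "subtracting" two parties' sketches
# is just XOR-ing them, leaving a sketch of the symmetric difference:
a = sketch({1, 2, 3, 4}, capacity=2)
b = sketch({1, 2, 3, 9}, capacity=2)
diff = [x ^ y for x, y in zip(a, b)]
assert diff == sketch({4, 9}, capacity=2)
```

Shared elements cancel out of the combined sketch, which is why the communicated size depends only on the difference, not on the set sizes.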

Here is an implementation I contributed to: https://github.com/sipa/minisketch/

There is a related data structure called an invertible Bloom lookup table (IBLT) that accomplishes the same task. Its encoding and especially its decoding are much faster, and it has asymptotically the same communications efficiency. However, the constant factors on the communications efficiency are poor, so for 'small' set differences (like the 1000 above) it has a rather high overhead factor, and it can't guarantee decoding. I think that makes it much less magical, though it may be the right tool for some applications.

IBLT also has the benefit that its decoder is a fun bit of code golf to implement.

The encoding for IBLT is simple: take your entries and append a check value (it can't be a plain CRC). Then hash each entry with three hash functions to obtain three locations in a hash table and XOR it into those locations (removal works the same way, due to XOR's self-inverse property). Decoding an IBLT works by finding entries whose check value passes (which will be true when they are the only entry in their position) and subtracting them from the table (hash them to get their other two positions, and XOR them out of all three). Decoding is successful when the data structure is all zeros. When the tables are big enough and have a modest amount of slack space the decoding algorithm will succeed (for it to fail there has to be an overlapping cycle, which becomes rare in sufficiently large random graphs). (This description is probably enough for a reader to make a working IBLT implementation, though it omits some improvements.)
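That recipe can be turned into a toy peeling decoder in a few lines. Everything below (table size, salted-hash check value, the keys) is an arbitrary illustration, not a tuned or production IBLT:

```python
import hashlib

M = 64  # table size; needs slack beyond the expected set difference

def positions(key):
    """Three (deterministic) cell indices for a 64-bit key."""
    h = hashlib.sha256(b"pos" + key.to_bytes(8, "big")).digest()
    return {int.from_bytes(h[i:i + 4], "big") % M for i in (0, 4, 8)}

def check(key):
    """Salted-hash check value (a plain CRC of the key would not work)."""
    h = hashlib.sha256(b"chk" + key.to_bytes(8, "big")).digest()
    return int.from_bytes(h[:8], "big")

def toggle(table, key):
    # Insertion and removal are the same operation, thanks to XOR.
    for p in positions(key):
        table[p][0] ^= key
        table[p][1] ^= check(key)

def decode(table):
    recovered = set()
    progress = True
    while progress:
        progress = False
        for key, chk in list(table):
            # A "pure" cell holds exactly one entry: its check value matches.
            if key and chk == check(key):
                recovered.add(key)
                toggle(table, key)  # peel it out of all three cells
                progress = True
    if any(key or chk for key, chk in table):
        raise ValueError("decode failed (undecodable cycle)")
    return recovered

table = [[0, 0] for _ in range(M)]
for k in (101, 202, 303):
    toggle(table, k)
assert decode(table) == {101, 202, 303}
```

With a table this sparse the peel essentially always succeeds; shrink M toward the number of entries and you start seeing the undecodable cycles mentioned above.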

Here is one not on the list so far:

Set sketches. They allow you to compute the difference between two sets (for example, to see if data has been replicated between two nodes) WITHOUT transmitting all the keys in one set to the other.

Say you have two sets of the numbers [1, ..., 1 million], all 32-bit integers, and you know one set is missing 2 random numbers. Set sketches allow you to send a "set checksum" that is only 64 BITS and lets the other party compute those missing numbers. A naive algorithm would require 4MB of data be transferred to compute the same thing.
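To see why a constant-size checksum can pin down the difference, here is a toy analogue over the integers, assuming exactly two missing elements. (The real construction, as in minisketch, uses power sums in GF(2^b) so each sum fits in b bits; the integer version below only shows the algebraic idea.)

```python
from math import isqrt

big = set(range(1, 1_000_001))
small = big - {123_456, 654_321}

def power_sums(s):
    # Each side sends only these two numbers, not the whole set.
    return sum(s), sum(x * x for x in s)

s1, q1 = power_sums(big)
s2, q2 = power_sums(small)

# The two missing values a, b satisfy a + b = s and a^2 + b^2 = q,
# so they are the roots of t^2 - s*t + (s^2 - q)/2 = 0.
s, q = s1 - s2, q1 - q2
p = (s * s - q) // 2                # p = a*b
disc = isqrt(s * s - 4 * p)         # disc = |a - b|, exact here
a, b = (s - disc) // 2, (s + disc) // 2
print(sorted((a, b)))  # -> [123456, 654321]
```

The same cancellation happens as with the sketches above: shared elements drop out of the subtracted power sums, so the checksum size scales with the difference (2 values here), not the set size (1 million).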

*(in particular, pinsketch: https://github.com/sipa/minisketch).

Networks that need to constrain themselves to limited topologies to avoid traffic magnification do so at the expense of robustness, especially against active attackers that grind their identifiers to gain privileged positions.

Maybe this is a space where efficient reconciliation ( https://github.com/sipa/minisketch/ ) could help-- certainly if the goal were to flood messages to participants reconciliation can give almost optimal communication without compromising robustness.

> Rather than determining if the current site has been submitted by querying the Firebase/Algolia APIs with every page you visit, the extension contains regularly-updating Bloom filters for all submitted HN stories to preserve user privacy.

Nice!

I built a pi-hole-esque stub DNS resolver that uses Bloom filters generated from hostfiles (60 MiB, 5M entries --> 2 MiB with 1% false positives) and it worked like a charm. At some point I also looked into xor filters, which are apparently even lighter and faster, but I couldn't find a JavaScript implementation [0].

I, however, stopped using Bloom filters because their immutability meant rebuilding them over and over again, which was a pain. Inverted Bloom filters [1] or Spectral Bloom filters [2] might have been useful since they can be updated in place. Instead, I went with storing hostnames in a finite state automaton [3], which, while not as compact as a Bloom filter, can be updated in place, is deterministic, and searches faster. Likely not a fit for your use case, however.

PinSketches, otoh, might be a fit for accomplishing efficient set reconciliation [4] of the filters you're distributing.

[0] https://github.com/FastFilter/xorfilter#implementations-of-x...

[1] https://www.youtube.com/watch?v=eIs9nJ-JFvA

[2] https://pncnmnp.github.io/blogs/spectral-bloom-filters.html

[3] http://stevehanov.ca/blog/?id=115

[4] https://github.com/sipa/minisketch

It's possible to be dramatically more communication efficient than bloom filters to synchronize sets.

Bloom filters require O(my_set_size) bandwidth, with a good constant factor. This work optimizes it somewhat by excluding members that existed at the time of last synchronization, but it still wastes bandwidth when those commits came in from another source.

It's possible to instead realize O(symmetric_difference_between_sets) when synchronizing a pair. Applied to this kind of usage the bandwidth required would be reduced by whatever redundancy factor (e.g. number of different repos you'd learn the same commit from).

This can be accomplished via error correcting codes, for a high performance industrial implementation I contributed to, see: https://github.com/sipa/minisketch/ (I tried submitting this to HN a while back, but no joy--).

For our paper on applying this to Bitcoin transaction relay to render rumouring bandwidth largely independent of peer count:

https://arxiv.org/abs/1905.10518

[Similar to the motivations of the post, we consider immunity to attackers to be critical-- and most prior efficient gossip work is extremely non-robust.]

For the specific case of synchronizing the blockchain itself, these fancy techniques aren't needed because there is only a single best chain 'list'-- there is only one relevant tip-- and it can be efficiently synchronized with bisection. Bitcoin sends up to 100 hashes at once (though usually more like 30), the first few spaced densely around the best tip and the rest exponentially spaced, to extremely efficiently find the highest point of commonality.
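The dense-then-exponential spacing can be sketched as follows. This is a hypothetical helper in the spirit of Bitcoin's block locator, not Bitcoin Core's exact code:

```python
def block_locator_heights(tip_height):
    """Heights to sample: dense near the tip, then doubling gaps,
    always ending at the genesis block (height 0)."""
    heights = []
    step = 1
    h = tip_height
    while h > 0:
        heights.append(h)
        if len(heights) >= 10:  # after the first ~10, space exponentially
            step *= 2
        h -= step
    heights.append(0)           # genesis anchors the search
    return heights

# ~10 + log2(tip) entries cover the whole chain; the peer replies from
# the highest height it recognizes, and bisection narrows the fork point.
locator = block_locator_heights(800_000)
```

The exponential tail is what keeps the message small: finding the fork point on a chain of N blocks takes O(log N) hashes per round instead of O(N).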

Network coding is very interesting but the maximum gains from the network coding itself are fairly limited... a small constant factor. In attempting to apply network coding I've found that vulnerability to (maliciously) corrupt data significantly limits the utility.

If this is an area that interests you, you might also be interested in efficient set reconciliation-- where a party can communicate its set to another party using only about the size of the difference between the two sets, without having any idea of what the other party's set contains.

This can result in fairly large bandwidth savings in P2P gossip applications without reducing robustness-- and could be combined with network coding in places where network coding is applicable.

I worked on an efficient implementation of setrecon https://github.com/sipa/minisketch/ and an application of it to Bitcoin ( https://arxiv.org/abs/1905.10518 ).

The minisketch library I worked on can be used for near optimal (in the sense of information leak) error correction for "set like" features:

https://github.com/sipa/minisketch/

Our application is communications-efficient set reconciliation to convert Bitcoin's quadratic-overhead transaction gossip protocol (O(txn*peers)) to effectively linear (O(txn)), though the primary academic work ours was based on was concerned with fuzzy extractors for privacy-preserving (and encryption-key-generating) biometrics.

Some months ago a colleague of mine and I were working on coming up with an intuitive lay explanation of the Berlekamp–Massey algorithm as part of a greater explanation of some of our work that uses it ( https://github.com/sipa/minisketch/ ).

It isn't terribly difficult to work through the algorithm mechanically and see that it solves the equations it's intended to solve, but that is a long way from a good intuitive understanding of it. We spent some time going back through the original literature describing it and related contemporary work, but what we found also focused on a rather procedural description that didn't really deliver the insight we were looking for.

At one point, I quipped that I should reach out to Berlekamp and inquire if this work really weren't the product of reverse engineering instruments found in a 1967 UFO crash-- given what a remarkable advance it was. :)

I'm sad to hear that the world will never know for sure now.

Although Berlekamp's work has already been important to the world-- impacting most of the communications and storage devices we interact with--, I expect that its importance will increase in the future.

Many advances in computer science and related fields have their utility gated or multiplied by the power and pervasiveness of computing technology. I am always struck when I read older papers-- even ones as late as the 90s-- and find them discussing problem sizes at the limits of their computer technology which my laptop is solving thousands of times per second as part of my application (or just as part of my research). So I find that my recent work using the Berlekamp–Massey algorithm casually applies to problem sizes that would have probably been unthinkable in the 80s, even using Cyclotomic's expensive specialized hardware. The increasing power and decreasing cost of general purpose computing increases the importance and broadens the applicability of the algorithms we run on it.