What does HackerNews think of smhasher?

Hash function quality and speed tests

Language: C++

#5 in C++
This is the best, most comprehensive hash test suite I know of: https://github.com/rurban/smhasher/

you might want to particularly look into murmur, spooky, and metrohash. I'm not exactly sure of what the tradeoffs involved are, or what your need is, but that site should serve as a good starting point at least.

We ended up with fasthash64 and lookup3 by looking for a fast hash that is easy to port to the restricted subset of C supported by eBPF with minimal changes. https://github.com/rurban/smhasher is a great resource for that.

I would probably choose different, more robust hash functions if I was targeting regular C.

> You can get higher quality and throughput for short strings from a hash that uses AES instructions or wyhash, or for long strings maybe from xxhash.

What would you consider a short string? The latency of AESENC is 5 if I remember correctly. In that time you can FNV-1a a "short" string.

For longer strings I agree that WyHash and xxHash is better. I have a benchmark of both further down on FastHash's github page[1]. Of course, that is only a measure of speed, but both are high quality as well[2].

From the top of my head, t1ha, falkhash, meowhash and metrohash are using AES-NI and none of them are particularly fast on short inputs, and at least two of them have severe issues, despite guarding against lots of vulnerabilities, which your construction does not.

You are right regarding bulk-keying with SIMD outperforming any kind of scalar hashing one-at-the-time. I've build a database that does just that (requires keys to be available in bulk - like when doing bulk inserts) using SIMD, and the latency becomes negligible.

[1] https://github.com/Genbox/FastHash

[2] https://github.com/rurban/smhasher

Have you seen https://github.com/rurban/smhasher ?

So the fastest hash functions on x86_64 without quality problems are: xxh3low, wyhash, ahash64, t1ha2_atonce, komihash...

I would not be so optimistic on RIPEMD-160 security. I found several weaknesses. https://github.com/rurban/smhasher
SMHasher results are here: https://github.com/rurban/smhasher/ below the FNV's.

In short: it shows why a simple word mixer alone is not enough for a proper hash. It is insecure, fails all tests, and is not as fast as expected. I would not even use it for PRNG's, but I haven't tested that usage yet.

For comparison: https://github.com/rurban/smhasher/

It's indeed pretty fast ( among the top 5), but not particularly good. It fails some basic tests. Rather use xxh3, t1ha or farmhash

I like to check smhasher for comparison of many hash implementations: https://github.com/rurban/smhasher. It’s a bit clunky to search for the exact items you want to compare, but there’s lots of data and a good summary under the table.
Looks good, but has somewhat widely false claims:

> SMHasher is a test and benchmark suite for a variety of hash function implementations. It can provide data for claims like “our new Foo hash function is faster than the widely used Bar, Baz and Qux hash functions”. However, when comparing Foo to CRC-32, be aware that a SIMD-accelerated CRC-32 implementation can be 47x faster than SMHasher’s simple CRC-32 implementation.

In fact smhasher implements both, the slow soft crc32 he cites, and the fastest crc32_pclmul, which is not just 47x faster, but 5000x faster. Other than wuff's implementation of the crc32_pclmul variant.

Compare https://github.com/rurban/smhasher/#smhasher

    crc32  392.10 vs crc32_pclmul 1972140.38 MiB/sec
Reminds me a lot on ATS
Added it to smhasher https://github.com/rurban/smhasher/#smhasher

It may be the fastest rust hash, but certainly not faster than other fast hashes. More like 2x slower.

xxh3, t1ha0, wyhash are all much faster on the 2 machines I tested it on, an old 7 years old Intel i5-2300, and a new Ryzen 3200U.

xxHash is legacy, replaced by xx3. Still not the fastest block hash, https://github.com/rurban/smhasher

This is a good survey paper about the old, known, fast, simple one-line hashes. I've tested all of them in smhasher also. They are all *not* recommended for production use.

Everything is wrong with MD5. Insecure, extremely slow, bad statistical properties. https://github.com/rurban/smhasher
You can see its properties here https://github.com/rurban/smhasher

Passes all smhasher tests for a tiny size (157 byte on x64), but slow as every more secure hash. I would put it as alternative to people using siphash. Personally I see no use case for siphash at all. Slow, big, insecure, easily brute forcable.

If you’re interested in diving deeper, the SMhasher benchmark suite tests a much more exhaustive list of hash functions (including wyhash) - https://github.com/rurban/smhasher

https://www.strchr.com/hash_functions Is also a great write up of how some of the classic hashing algorithms perform of various data sets.

https://nullprogram.com/blog/2018/07/31/ is an interesting approach on generating hashing functions automatically.

Tested at ~ 5GB/s @ 3Gz, (Google Cloud Platform, N1 CPU)

What is the memory bandwidth on that instance? I ask because I'm not seeing that listed [1] but it would be a useful point of comparison. Maybe run Doctor Bandwidth's STREAM benchmark [2].

DiscoHash is included in SMHASHER [3] but its benchmark results aren't.

[1] https://cloud.google.com/compute/docs/machine-types

[2] https://www.cs.virginia.edu/stream/

[3] https://github.com/rurban/smhasher

They claim the same "fastness" as Siphash whilst actually being on the slower end of all good hash functions. Too slow for real world usage, and not secure enough for real world security attacks. Compare https://github.com/rurban/smhasher and my complaints https://github.com/google/highwayhash/issues/28
How does it compare to:

FarmHash

CityHash

SMHasher

MetroHash

FNV1a_YT

Note that if you're just using it for a hash table, where you'll compare the full strings to resolve hash collisions, you don't actually need good randomness. FNV1A_VT is faster than most, has worse collision properties, yet provides a faster overall hash table, because the faster hash (every operation) outweighs the extra full string checks (rare).

https://github.com/rurban/smhasher

t1ha aka "Fast Positive Hash" - In most cases up to 15% faster than StadtX, xxHash, mum-hash, Metro-hash, etc.

Simple benchmark is included (git close && make). The quality and speed could be checked by Reini Urban's (https://github.com/rurban/smhasher) and Yves Orton's (https://github.com/demerphq/smhasher) forks of SMHasher suite.

Reasonable to note: t1ha have two injection points of each data word inside the loop of entropy sponge.

The best is currently Leonid Yuriev's t1ha. See https://github.com/rurban/smhasher/

Note that in contrast with what Andy or DJB say, the collision safety is not a problem of the hash function per se, as you cannot fix collision attacks with any "safer" hash function. You can easily brute-force even the worst of all siphash in under 4min. Safety begins with 256 bits, in a hash table you got typically 10-14, max 32 to attack. It is only making it marginably safer, but much slower. It is only doable by checking collision attacks in the collision scheme, either by collision counting or using a non-attackable collision scheme.

Unfortunately most implementations have no idea, and just go with the slowest of all. The latest such sin done in the linux network stack. I thought at least those guys were above that, but apparently not.

This simple multiplication hash is indeed superior, just with linear probing it has problems. You would use double hashing with another similar constant then. Or find two such constants dynamically, as this would be universal then.

And also note that the upper bits of a hash function are always superior to the lower bits. power-by-2 & checks are thus always worse than right-shifting as with this one.

XXhash is only fast with digest workloads, not hash tables. There the hash function needs to be fast and small, XXhash is too big and trashes the icache.

The current leaders are on my smhasher site, https://github.com/rurban/smhasher/

The answer is probably here: https://github.com/rurban/smhasher

Speed & number of cycles/hash is better with metro hash.

There's also FarmHash, whose 32-bit version is 2x as fast as xxHash on little endian machines (at least according to this benchmark suite https://github.com/rurban/smhasher).
You are welcome. I wanted add another point. rurban does have a bit of a point. The hash functions you referenced in your article are generally old and low quality, and slow. FNV is neither a good hash, nor is it fast. FNV_yt is very fast, but not the best quality hash function, it fails many tests in SMHasher, including some I personally consider dealbreaker: failing avalanche tests is unacceptable.

Both rurban and I have (independently) spent considerable time testing about 100 hash functions using the hash test suite smhasher. In addition I have greatly enhanced smhasher, and included test results and graphs in my version. Included in my fork is the --stream option so you can test different hash functions in counter mode with tools like the dieharder RNG test suite.

https://github.com/demerphq/smhasher https://github.com/rurban/smhasher

These are the FNV test results from my version of SMHasher:

FNV1a 32 32 32 FAILED 99 of 183 tests. Avalanche: 17-41, Crc-MultiCollision: 81, 83, 85-86, 88-91, 94, 96-98, 100, Cyclic: 42-52, Hi-Lo: 157-158, HiBit-Null: 150-152, Highbits2: 147-149, Highbits: 144-146, LowBit-Null: 153-155, Lowbits: 142-143, Murmur2-MultiCollision: 101-110, Murmur3A-MultiCollision: 111, 113, 117, 119-120, Murmur3F-MultiCollision: 122-127, 129-130, OverAll: 183, Text: 160-165, TwoBytes: 54-56, 63, Zeroes: 167-168

FNV1a_YT 32 32 32 FAILED 67 of 181 tests. Avalanche: 15-39, Differential: 9-14, Hi-Lo: 154-156, HiBit-Null: 148-150, Highbits2: 145-147, Highbits: 142-144, Lowbits: 139-141, OverAll: 181, Sparse: 74-78, Text: 157-158, 160-163, TwoBytes: 51, 53-61

FNV64 64 64 64 FAILED 66 of 183 tests. Avalanche: 17-41, Cyclic: 43-45, 47, 49, 51-52, Hi-Lo: 157-158, HiBit-Null: 151-152, Highbits2: 148-149, Highbits: 145-146, LowBit-Null: 154-155, Lowbits: 142-143, OverAll: 183, Sparse: 65-67, 69, 71, 73, 75, 77, 79-80, Text: 160-162, 164-165, TwoBytes: 54-56, 58, 60, 62-63, Words: 182, Zeroes: 167-168

You can see text view of the full list at:

https://github.com/demerphq/smhasher/blob/master/doc/summary...

Also after reading this thread I added some graphs comparing performance of other hashes, including some that I have developed using genetic algorithm for use in Perl. Some of those hashes have hash-time performance at the top of the envelope (StadtX is very fast).

https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F... https://github.com/demerphq/smhasher/blob/master/doc/graph/F...

You may also be interested in looking at

https://github.com/demerphq/smhasher/blob/master/HashAnatomy...

Good luck!

Cyan4973/Yann also is well known for xxhash[1], which is one of the faster hashers[2] out there. Post a new hasher and people will probably ask about xxhash (ex: metrohash[3]). Guy is an absolute machine. If you search Google for 'zstd' right now, you'll find him, not Facebook, namely: https://github.com/Cyan4973/zstd . Glad his work is being supported by someone now! Immensely well deserved, after so many years of helping make everyone fast.

PS - I didn't know a lot of the terms you used, but the Finite State Entropy (FSE) link you provided does a good job intro'ing some of them, and the linked paper Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding [4] (ANS) seems interesting.

[1] https://github.com/Cyan4973/xxHash

[2] https://github.com/rurban/smhasher

[3] https://news.ycombinator.com/item?id=9613098

[4] http://arxiv.org/abs/1311.2540

There is also a great resource with more up do date stuff on hashing and benchmarks right here: https://github.com/rurban/smhasher/
Indeed, asking the question of "Why are we doing this" and "What problem does this solve" is a great way to start a discussion of any engineering solution, and a great way to avoid doing unnecessary work.

There are a lot of things that fundamentally deal with either systems that users interact with or that run on real machines with physical world implications that also make things perform differently, from user interaction latency and pre-computing and pre-caching, to SSD block level reads/writes, cpu caches and other caches, network latency implications, etc.

Similarly, on the algorithm front from things like Boyer–Moore to work being done on non-cryptographic hashes (ie, https://github.com/rurban/smhasher/ ) that is sometimes mind boggling.

Oh come one. Comparing CLHash against CityHash and SipHash only? CityHash was replaced by FarmHash some year ago, and only python uses SipHash with its outrageously bad performance.

The fastest and mostly-secure 64bit hash functions are Metro, Spooky, xxHash64 and Multilinear-HM by the same author. See "Strongly universal string hashing is fast", Daniel Lemire and Owen Kaser, 2014.

CLHash doesn't even pass the avalanche test from smhasher https://github.com/rurban/smhasher I'll add it there for better comparison, with the new leader falkhash: https://github.com/gamozolabs/falkhash

I've updated now my smhasher results at https://github.com/rurban/smhasher including metrohash and xxHash but not FarmHash yet.

From the good enough hash functions, metrohash are by far the fastest, so they should be the new default on 64bit machines. On x86_64 and arm8 the crc variants are the best.

For 32bit we are stuck with Murmur still.

My smhasher fork has both: https://github.com/rurban/smhasher/

Comparable results for xxHash and the new metrohash will be uploaded soon. These things take time.

But I'm more interested in comparing it to HW CRC32-C, the fastest and best hash so far, 2x faster than metro.