What does HackerNews think of despacer?
C library to remove white space from strings as fast as possible
A single modern CPU core is superscalar and has a deep instruction pipeline. With your help, it will decode and reorder many instructions and execute many instructions concurrently. Each of those instructions can operate on a lot of data.
As famous online controversial opinion haver Casey Muratori tells us, most software just sucks, like really really bad (e.g. commonly people will post hash table benchmarks of high-performance hash tables that do bulk inserts in ~100ns/op, but you can do <10ns/op easily if you try), and using SIMD instructions is table stakes for making good use of the machinery inside of a single CPU core. SIMD instructions are not just for math! They are tools for general purpose programming. When your program needs to make decisions based on data that does not contain obvious patterns, it is often a lot cheaper to compute both possible outcomes and have a data dependency than to have a branch. Instructions like pshufb or blendv or just movemask and then using a dang lookup table can replace branches. Often these instructions can replace 32 or 64 branches at a time[0]. Wojciech Muła's web site[1] is the best collection of notes about using SIMD instructions for general-purpose programming, but I have found some of the articles to be a bit terse or sometimes incorrect, and I have not yet done anything to fix the issue. "Using SIMD" ends up meaning that you choose the low-level layout of your data to be more suitable to processing using the instructions available.
Inside your single CPU core there is hardware for handling virtual -> physical address translation. This is a special cache called the translation lookaside buffer (TLB). Normally, chips other than recent Apple chips have a couple hundred entries of 1 4KiB page each in the TLB, and recent Apple chips have a couple hundred entries of 1 16KiB page each. Normal programs deal with a bit more than 1 meg of RAM today, and as a result they spend a huge portion of their execution time on TLB misses. You can fix this by using explicit huge pages on Linux. This feature nominally exists on Windows but is basically unusable for most programs because it requires the application to run as administrator and because the OS will never compact memory once it is fragmented (so the huge pages must be obtained at startup and never released, or they will disappear until you reboot). I have not tried it on Mac. As an example of a normal non-crazy program that is helped by larger pages, one person noted[2] that Linux builds 16% faster on 16K vs on 4K pages.
Inside your single CPU core is a small hierarchy of set-associative caches. With your help, it will have the data it needs in cache almost all the time! An obvious aspect of this is that when you need to work on some data repeatedly, if you have a choice, you should do it before you have worked on a bunch of other data and caused that earlier data to be evicted (that is, you can rearrange your work to avoid "capacity misses"). A less obvious aspect of this is that if you operate on data that is too-aligned, you will greatly reduce the effective size of your cache, because all the data you are using will go into the same tiny subset of your cache! An easy way to run into this issue is to repeatedly request slabs of memory from an allocator that returns pretty-aligned slabs of memory, and then use them all starting at the beginning. That this could cause problems at all seems relatively unknown, so I would guess lots and lots of software is losing 5-10% of its performance because of this sort of thing. Famous online good opinion haver Dan Luu wrote about this here[3]. The links included near the bottom of that post are also excellent resources for the topics you've asked about.
When coordinating between multiple CPU cores, as noted in TFA, it is helpful to avoid false sharing[4]. People who write trading systems have mostly found that it is helpful to avoid sharing *at all*, which is why they have work explicitly divided among cores and communicate over queues rather than dumping things into a concurrent hash map and hoping things work out. In general this is not a popular practice, and if you go online and post stuff like "Well, just don't allocate any memory after startup and don't pass any data between threads other than by using queues" you will generally lose imaginary internet points.
There are some incantations you may want to apply if you would like Linux to prioritize running your program, which are documented in the Red Hat Low Latency Performance Tuning guide[5] and Erik Rigtorp's web site[6].
Some other various resources are highload.fun[7], a web site where you can practice this sort of thing, a list of links associated with highload.fun[8], Sergey Slotin's excellent online book Algorithms for Modern Hardware[9], and Dendi Bakh's online course Perf Ninja[10] and blog easyperf[11].
> Off topic: What level of sophistication about modern CPUs is _good_ to have?
Probably none? These skills are basically unemployable as far as I can tell.
[0]: https://github.com/lemire/despacer
[1]: http://0x80.pl/articles/index.html
[2]: https://twitter.com/AtTheHackOfDawn/status/13338951151741870...
[3]: https://danluu.com/3c-conflict/
[4]: https://rigtorp.se/ringbuffer/
[5]: https://access.redhat.com/sites/default/files/attachments/20...
[6]: https://rigtorp.se/low-latency-guide/
[8]: https://github.com/Highload-fun/platform/wiki
[9]: https://en.algorithmica.org/hpc/
Still, what does it signal that vector extensions are required to get better string performance on x86? Wouldn't it be better if Intel invested their AVX transistor budget into simply making existing REPB prefixes a lot faster?
Daniel Lemire has done brilliant work. His https://github.com/lemire/despacer looks great.
I wonder though if their Base64 SIMD algorithms could be tweaked to do the decode in a single pass?
Like you, also email related, I have written a native Base64 addon for Node in C optimized for decoding MIME wrapped Base64 in a single pass (https://github.com/ronomon/base64). It's not yet using SIMD instructions but it's 1.5-2x as fast as Node's routines with top speeds of 1398.18 MB/s decoding and 1048.58 MB/s encoding on a single core (Intel Xeon E5-1620 v3 @ 3.50GHz). Here is an explanation of the difference: https://github.com/nodejs/node/issues/12114
The Ronomon Base64 addon also ships with a fuzz test (https://github.com/ronomon/base64/blob/master/test.js) which you are welcome to try out on your Go binding. It picked up an interesting bug in Node's decode routine where some white space was actually decoded as data (https://github.com/nodejs/node/issues/11987).