What does HackerNews think of hyperscan?

High-performance regular expression matching library

Language: C++

I think of WAFs as an extra safety net. Defense in depth.

The author complained about the performance cost of WAFs in general, but not all WAFs have to be structured like ModSecurity. They could, for example, be based on something like https://github.com/intel/hyperscan, where performance is at a very different level.

It’s a bespoke scanning setup designed to deal with GitHub’s scale, minimise false positives, and scan fast enough to be in the `git push` request/response cycle. Under the hood it’s using Intel’s hyperscan as the regex engine.

https://github.com/intel/hyperscan

The main difference is that binwalk goes through a file linearly, searching for patterns like magic bytes, and tries to extract everything it finds.

The problems with this:

- very noisy, finding a lot of false positives (license code, format inside another format, etc).

- very slow, trying to extract irrelevant things

- imprecise, because it finds patterns in the middle of a file, where it's actually not relevant on the first level of extraction

unblob solves these problems by being smarter about the file formats, recognizing them by their specification: for example, it unpacks format header structs and carves out files based on the information in the header (size, offset). See a simple example for NTFS [1].
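
To make the idea concrete, here's a toy C++ sketch of header-based carving (not unblob's actual code, which is Python; the "BLOB" format, struct layout, and names below are invented for illustration, and a little-endian host is assumed):

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // Hypothetical chunk: 4 magic bytes followed by a little-endian payload size.
  // Instead of scanning the whole file for patterns, read the header struct and
  // carve exactly the region it describes.
  #pragma pack(push, 1)
  struct ChunkHeader {
      uint8_t  magic[4];      // e.g. "BLOB"
      uint64_t payload_size;  // bytes of payload following the header
  };
  #pragma pack(pop)

  bool carve_chunk(std::FILE *in, long header_offset, std::FILE *out) {
      ChunkHeader hdr;
      if (std::fseek(in, header_offset, SEEK_SET) != 0) return false;
      if (std::fread(&hdr, sizeof hdr, 1, in) != 1) return false;
      std::vector<unsigned char> payload(hdr.payload_size);
      if (std::fread(payload.data(), 1, payload.size(), in) != payload.size())
          return false;  // header lied about the size: reject, don't guess
      return std::fwrite(payload.data(), 1, payload.size(), out) == payload.size();
  }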

We also went to great lengths to prevent unnecessary work by skipping formats nested inside another [2]. We are using hyperscan [3] instead of grepping byte sequences with Python, which is orders of magnitude faster. Because of this it can also handle 4 GB+ files, which binwalk cannot.

It's been used in production for a year now and it's both more precise and faster than binwalk. We are getting fewer false positives too, and even when unblob fails to extract everything, we still get meaningful information out of firmware images where binwalk previously just failed with no output.

[1]: https://github.com/onekey-sec/unblob/blob/main/unblob/handle...

[2]: https://github.com/onekey-sec/unblob/blob/main/unblob/proces...

[3]: https://github.com/intel/hyperscan

The official repo[1] is looking pretty dead.

Edit: This is a surprisingly heavy build. Over 15 minutes on my somewhat older i5. I suggest using make -j instead of cmake --build to speed it up. Decent number of compiler warnings too.

[1] https://github.com/intel/hyperscan

Intel has an implementation of this technique here as well:

https://github.com/intel/hyperscan

I built (or at least "designed most of, built parts of") Hyperscan: https://github.com/intel/hyperscan - a fast multiple regular expression matcher.

It was not originally open sourced; it was a commercial product for some time.

I wasn't satisfied with what already existed at the time (and neither were our customers), because most regular expression libraries are not focused on performance (they use backtracking), on streaming across multiple inputs - e.g. packets - without holding on to old data (again, backtracking gets in the way), or on handling lots of regular expressions at scale.

re2 came along while we were still a commercial product, but it had zero impact on our sales: it didn't handle streaming (weirdly, imo, as it could have), and it also didn't scale very well to large numbers of patterns.

I also designed simdjson (https://github.com/simdjson/simdjson), although others (chiefly Daniel Lemire and John Keiser) now do all the real work. It's still my design under the hood (although the "On Demand" stuff is significantly new and different). I built the first prototypes of simdjson because Daniel trolled me with a paper from Microsoft and I was bored.

An interesting project would be to have a go at trying to automatically generate that malicious input. If I remember correctly, re2 has heuristics to see whether construction is "still worth it"; a particularly nasty input would stay just on the verge of "worth doing construction" but cost as much as possible (if you introduce too many novel states re2 will stop doing the construction at all).

DFA-based engines are often relatively simple under the hood, and backtrackers (like modern JIT'd libpcre) may outrace them thanks to having more optimizations.

Hyperscan (https://github.com/intel/hyperscan), a project I worked on for many years, avoids the potential DoS attacks on backtrackers or the ones RE2 has, but it surely has its own weak spots (notably, it assumes literal factors of the regular expression are rare in the input). We did build some mitigations for that, but fundamentally I bet Hyperscan could be made to slow down in a similar (linear) way to re2.

ripgrep author here.

No, you're trolling because you're being disingenuous and unnecessarily inflammatory. Your LOC counting, in particular, gratuitously misses the mark for a number of reasons.

Firstly, as others have pointed out, the features in GNU grep and ripgrep are radically different. GNU grep, for example, doesn't have anything that the `ignore` crate provides. So on this point alone, a simple LOC comparison falls flat. For example, does more lines of code imply that code complexity is higher? Not necessarily. Try reading and understanding the code itself. Obviously I'm biased, but I think folks would have an easier time with ripgrep than with GNU grep.

Secondly, ripgrep does a lot more than GNU grep does in order to make more searches faster. This ranges from parallelism, to differing search strategies (mmap vs standard read calls) to pattern analysis for better literal optimizations and more involved SIMD algorithms and better performance in the face of Unicode. (In addition to supporting a lot more of Unicode than GNU grep does.) Making programs run faster often leads to greater code complexity. Such a phenomenon, ironically, is easy to observe when comparing GNU tools with other coreutils implementations from BSD or busybox.

Thirdly, the way in which you counted lines is weirdly arbitrary. For example, GNU grep uses a lot of code from gnulib, including, but not limited to, an entire regex engine. (It used to be part of GNU grep proper, so if you had done this analysis a few years ago, your LOC count would be at least double what you came up with. Which supports my previous point: GNU grep's special lazy DFA internal regex engine only exists for performance reasons.)

Fourthly, I made a decision to make a lot of ripgrep's internals generic over both the regex engine and the printer. This was to make it possible for others to use those libraries, but also to make it easier to support plugging other regex engines into ripgrep. The official ripgrep distribution, for example, comes with both Rust's regex engine and PCRE2. But it goes beyond that. Someone even maintains a fork of ripgrep[1] that uses Hyperscan[2] to make it faster in some cases. The fact that this is even plausible to do is only because much of ripgrep's core is, in fact, generic. Making things generic tends to require more code and more complexity. I made this trade off intentionally.

Fifthly, your LOC counts include a ton of tests for ripgrep, whereas your GNU grep LOC count presumably does not. For example, about half the LOC in the `printer` crate are tests. And that's just at a library level. Does GNU grep even have library level tests at all? A cursory glance suggests it does not.

Sixthly, in the course of developing ripgrep, I made a conscious decision to contribute back to the Rust library ecosystem. As a result, several of ripgrep's libraries are now used in Cargo and rustc itself, among many other tools. GNU grep has no such goal beyond the GNU project and gnulib itself. Writing more tightly coupled application level code would have made the codebase quite a bit smaller. (Which you can observe for yourself by looking at the evolution of ripgrep. It started off with a lot more tightly coupled application level code.)

There are probably more reasons why your comparison falls flat, but this is what I could come up with quickly before going to bed. In particular, the choice of programming language is but one factor in overall code complexity. Design goals and other such things arguably have even more of an impact. Thus, your analysis is shallow, and the lack of self-awareness of this fact, combined with your inflammatory remarks and seeming certainty of your position, is why you are a troll.

[1] - https://sr.ht/~pierrenn/ripgrep/

[2] - https://github.com/intel/hyperscan

This is a cool paper. I'm building a SMT-based superoptimizer for SIMD sequences that is still pretty stupid (it can only handle "lanewise" SIMD i.e. some op on each lane, it can only do AVX2, it can only handle arbitrary graphs up to 4 instructions before it gets too slow).

And yet... it concocted a good chunk of a PSHUFB based pattern matcher (effectively "shufti" from https://github.com/intel/hyperscan) that I invented in 2009. If I could push further on this and scale from 4 instructions handled in about 45 minutes (single core, though it scales well) up to 5 instructions (currently takes a month of single-core search time) to 6-7 instructions (I don't even know!), this system would be able to come up with the full version of shufti, a "quite commercially relevant" SIMD sequence that I invented in 2009.
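
For the curious, the heart of shufti is roughly the following (a minimal sketch of the published idea, not Hyperscan's actual source; the two nibble tables would be derived from the character class being searched for):

  #include <immintrin.h>

  // Classify 16 bytes at once: split each byte into low/high nibbles, use
  // PSHUFB as two 16-entry table lookups, and AND the results. A lane is
  // nonzero iff its byte belongs to the (small) character class encoded in
  // the two tables. Requires SSSE3.
  static inline __m128i shufti_block(__m128i input, __m128i lo_table, __m128i hi_table) {
      const __m128i low_mask = _mm_set1_epi8(0x0f);
      __m128i lo_nibbles = _mm_and_si128(input, low_mask);
      __m128i hi_nibbles = _mm_and_si128(_mm_srli_epi16(input, 4), low_mask);
      __m128i lo_match   = _mm_shuffle_epi8(lo_table, lo_nibbles);
      __m128i hi_match   = _mm_shuffle_epi8(hi_table, hi_nibbles);
      return _mm_and_si128(lo_match, hi_match);
  }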

Not saying that sequence was the cleverest thing ever, but it seemed novel enough, and to have a machine just spew it out when given the black box description of what it's meant to achieve... it would be pretty cool.

All I need is a few orders of magnitude worth of search optimizations. That's OK - when I started, my superoptimizer took a couple days to handle 4 instructions, so it might still happen.

Francois is a smart guy, but there's a range of opinions on the matters he raises. We argue a good deal on Twitter ("constructive confrontation" between former Intel Principal Engineers :-) ). He put together the video pretty quickly so don't dismiss it just for lack of polish.

Personally, I argue with him a lot on AVX-512. I think AVX-512 is a Good Thing (or will be shortly - the first instantiation in Skylake Server - SKX - isn't great).

The biggest meta-problem with AVX-512 is that due to process issues, the pause button got hit just as it appeared in a server-only, 'first cut' form. AVX2 had downclocking issues when it first appeared too - but these were rapidly mitigated and forgotten.

I personally feel that SIMD is underutilized in general purpose workloads (partly due to Intel's fecklessness in promoting SIMD - including gratuitous fragmentation - e.g. a new Comet Lake machine with Pentium branding doesn't even support AVX2 and has it fused off). Daniel Lemire and I built simdjson (https://github.com/lemire/simdjson) to show the power of SIMD in a non-traditional domain, and a lot of the benefits of Hyperscan (https://github.com/intel/hyperscan) were also due to SIMD. Notably, both use SIMD in a range of ways that aren't necessarily all that amenable to the whole idea of "just do it all on an accelerator/GPU" - choppy mixes of SIMD and GPR-based computing. Not everything looks like matrix multiplication, or AI (but I repeat myself :-) ).

AVX-512 is a combination of fascinating and frustrating. I enthuse about recent developments on my blog (https://branchfree.org/2019/05/29/why-ice-lake-is-important-...) so I won't repeat why it's so interesting, but the learning curve is steep and there's plenty of frustrations and hidden traps. Intel collectively tends to be more interested in Magic Fixes (autovec) and Special Libraries for Important Benchmarks rather than building first-rate material to improve the general level of expertise with SIMD programming (one of my perpetual complaints).

You more or less have to teach yourself and feel your way through the process of getting good at SIMD. That being said, the benefits are often huge - not only because SIMD is fast, but because you can do fundamentally different things that you can't do quickly in GPR ("General Purpose Register") land (look, for example, at everyone's favorite instruction PSHUFB - you can't do the equivalent on a GPR without going to memory).
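
As a tiny illustration of that PSHUFB point (my own example, nothing to do with any particular library): a 16-entry lookup table that lives entirely in a register, here mapping nibble values to ASCII hex digits, 16 lanes at a time, with no memory access in the hot path.

  #include <immintrin.h>

  // Each input byte must be in 0..15; PSHUFB uses it as an index into the
  // in-register table. The GPR equivalent needs a table in memory.
  __m128i nibbles_to_hex(__m128i nibbles) {
      const __m128i lut = _mm_setr_epi8('0','1','2','3','4','5','6','7',
                                        '8','9','a','b','c','d','e','f');
      return _mm_shuffle_epi8(lut, nibbles);
  }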

Given how things turned out, this might be a good lesson as to whether "I measured my own library informally on some benchmarks that I couldn't be bothered to describe to you and it WINS!" is trustworthy in future.

When we introduced simdjson, we went out of our way to explain (a) why we thought it might be faster than other libraries and (b) how people could replicate our measurements or find fault in our methodology. There were zero claims, public or private, about the performance of simdjson before this material was available.

In my long experience, undocumented and non-reproducible performance claims suffer from the fact that people generally stop benchmarking when they get the result that they wanted to hear, even if the absolute numbers involved are unreasonable. It's very easy to make mistakes in performance measurement (compiling simdjson with anything short of -O2 or -O3, doing repeated runs of a tiny input and thus getting perfect branch prediction behavior), but people generally fix these mistakes only on the side that makes their stuff look good.

Back when I worked on the Hyperscan project ( https://github.com/intel/hyperscan ), we had an API call that allocated scratch memory (necessary due to Hyperscan's rather constrained approach to memory usage). I lost count of the number of times we had users run this allocation (which only ever needed to be done once per thread, effectively) inside their measurement loop. And whenever we were competing against an incumbent system and the people doing the measurement were the authors of that incumbent system, it was amazing how quickly they'd accept the "truth" of their benchmark run.
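
For anyone benchmarking Hyperscan, the shape of a fair measurement loop looks roughly like this (a sketch against the public C API; error handling elided, header path may vary by install):

  #include <hs/hs.h>

  // hs_alloc_scratch is the one-off, per-thread setup call; only hs_scan
  // belongs inside the timed loop.
  void benchmark_scan(const hs_database_t *db, const char *data, unsigned int len,
                      match_event_handler on_match, void *ctx, int iterations) {
      hs_scratch_t *scratch = nullptr;
      hs_alloc_scratch(db, &scratch);            // once per thread, OUTSIDE the loop
      for (int i = 0; i < iterations; ++i) {
          // allocating/freeing scratch in here is the mistake described above;
          // for small inputs it dominates the measurement
          hs_scan(db, data, len, 0, scratch, on_match, ctx);
      }
      hs_free_scratch(scratch);
  }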

It's entertaining that there is yet another area where "running large numbers of regular expressions, known in advance and not changing all that often" over lots of input is performance critical.

This was a key driver behind writing Hyperscan (https://github.com/intel/hyperscan) which I used to work on when it was primarily used for network security (thousands or tens of thousands of regular expression rules for firewalls), but doing this for ad blocking seems pretty sensible too.
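
A rough sketch of what that looks like with Hyperscan's C API: compile the whole known-in-advance rule set into one database, then scan each input in a single pass (the patterns, IDs, and header path here are made up for illustration):

  #include <hs/hs.h>
  #include <cstdio>

  static int on_match(unsigned int id, unsigned long long from,
                      unsigned long long to, unsigned int flags, void *ctx) {
      std::printf("rule %u matched ending at offset %llu\n", id, to);
      return 0;  // returning non-zero would stop scanning
  }

  int main() {
      const char *patterns[] = {"ads\\.example\\.", "/tracking/pixel", "doubleclick"};
      unsigned int flags[]   = {HS_FLAG_CASELESS, HS_FLAG_CASELESS, HS_FLAG_CASELESS};
      unsigned int ids[]     = {1, 2, 3};

      hs_database_t *db = nullptr;
      hs_compile_error_t *compile_err = nullptr;
      if (hs_compile_multi(patterns, flags, ids, 3, HS_MODE_BLOCK,
                           nullptr, &db, &compile_err) != HS_SUCCESS) {
          std::fprintf(stderr, "compile failed: %s\n", compile_err->message);
          hs_free_compile_error(compile_err);
          return 1;
      }

      hs_scratch_t *scratch = nullptr;
      hs_alloc_scratch(db, &scratch);

      const char url[] = "http://ads.example.com/tracking/pixel.gif";
      hs_scan(db, url, static_cast<unsigned int>(sizeof(url) - 1), 0,
              scratch, on_match, nullptr);

      hs_free_scratch(scratch);
      hs_free_database(db);
      return 0;
  }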

Mishit is censored too... I like the number of HN commenters who tried out their favorite Scunthorpe variations.

As someone who designed an implementation of \b for an NFA-based regex matcher (https://github.com/intel/hyperscan), I always have an appreciation for the Scunthorpe problem; we did see quite a number of scam/spam/bad-word detection regex patterns that tried to avoid it with word boundary stuff.

This wasn't really on DRAM in the same way as the paper is, interesting though that project was. Sadly, it's more or less defunct. The academic research at UVa has largely come to an end and Micron spun out the project to a startup that seems to be moribund (http://www.naturalsemi.com/).

I tracked these guys because we were doing automata in s/w (the https://github.com/intel/hyperscan project at Intel). There were some aspects of their methodology I didn't like (notably, they tended to run benchmarks that spewed matches, slowing Hyperscan down, while turning off matches on the h/w card!) but I found them v. interesting and they had some really thought-provoking stuff.

Any good literal search algorithm could do a one-off search for a single long literal 'needle' way faster than the roughly 1GB/s that the author attained with grep.

A single string of that length is extremely easy to search for with a range of different algorithms - I would be surprised if a decent approach couldn't keep up with memory bandwidth (assuming your 22GB file is already, somehow, in memory). The mechanics of simply reading such a big file are likely to dominate in practice.

We implemented some SIMD approaches in https://github.com/intel/hyperscan that would probably work pretty well (effectively a 2-char search followed by a quick confirm) for this case.
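
In that spirit, a minimal sketch of the "2-char search followed by a quick confirm" idea (my own simplified version, not Hyperscan's code): compare two broadcast needle bytes against the haystack at adjacent offsets, AND the masks, and only run memcmp where both line up.

  #include <immintrin.h>
  #include <cstring>
  #include <cstddef>

  // Find `needle` in `hay` (SSE2 plus a GCC/Clang builtin; scalar tail for the
  // remainder). Real implementations choose which two bytes to probe more
  // cleverly; this one always uses the first two.
  const char *find_literal(const char *hay, size_t hay_len,
                           const char *needle, size_t needle_len) {
      if (needle_len == 0 || hay_len < needle_len) return nullptr;
      if (needle_len == 1)
          return static_cast<const char *>(std::memchr(hay, needle[0], hay_len));

      const __m128i first  = _mm_set1_epi8(needle[0]);
      const __m128i second = _mm_set1_epi8(needle[1]);
      size_t i = 0;
      for (; i + 17 <= hay_len; i += 16) {
          __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(hay + i));
          __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(hay + i + 1));
          unsigned mask = _mm_movemask_epi8(
              _mm_and_si128(_mm_cmpeq_epi8(a, first), _mm_cmpeq_epi8(b, second)));
          while (mask) {                        // confirm each candidate position
              size_t pos = i + __builtin_ctz(mask);
              mask &= mask - 1;
              if (pos + needle_len <= hay_len &&
                  std::memcmp(hay + pos, needle, needle_len) == 0)
                  return hay + pos;
          }
      }
      for (; i + needle_len <= hay_len; ++i)    // scalar tail
          if (std::memcmp(hay + i, needle, needle_len) == 0) return hay + i;
      return nullptr;
  }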

Of course, that presupposes that any kind of whole-text search is actually the answer to this question. Assuming that you really do have more than a few searches to do, keeping the data in some kind of prebuilt structure is way superior to an ad hoc literal search.

iirc, Github uses (used?) my old project (https://github.com/intel/hyperscan) at Intel. It's probably faster than the alternatives, although if you want to support all types of regex you'll need to use Hyperscan as a prefilter for a richer regex engine like PCRE.

This project looks like it pulls literal factors out of the regex that I type in, maybe to an index a la that Russ Cox blog post a while back about Code Search. It seems to Not Like things that have very open-ended character classes (e.g. \w) unless there is a decent length literal involved somewhere.

It seems to have a very rudimentary literal extraction routine, as it gives only a partial result set when fed an alternation of two literals, each of which it handles fine on its own.
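
For readers who haven't seen the Code Search post, the indexing idea being referred to is roughly this (a toy sketch; real literal-factor extraction from a regex is considerably more involved):

  #include <map>
  #include <set>
  #include <string>

  // Index documents by trigram, then answer a query for a required literal by
  // intersecting the posting lists of its trigrams. Literals shorter than
  // 3 bytes can't be filtered this way.
  using Postings = std::set<int>;                        // document ids
  using TrigramIndex = std::map<std::string, Postings>;

  void index_document(TrigramIndex &idx, int doc_id, const std::string &text) {
      for (size_t i = 0; i + 3 <= text.size(); ++i)
          idx[text.substr(i, 3)].insert(doc_id);
  }

  // Documents that *might* contain `literal`; candidates still need a real
  // regex/literal scan to confirm.
  Postings candidates(const TrigramIndex &idx, const std::string &literal) {
      Postings result;
      bool first = true;
      for (size_t i = 0; i + 3 <= literal.size(); ++i) {
          auto it = idx.find(literal.substr(i, 3));
          Postings docs = (it == idx.end()) ? Postings{} : it->second;
          if (first) { result = docs; first = false; continue; }
          Postings inter;
          for (int d : result) if (docs.count(d)) inter.insert(d);
          result = std::move(inter);
      }
      return result;
  }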

Oddly, lookbehinds are evil only in a specific backtracking world. We never got around to implementing arbitrary lookarounds in Hyperscan (https://github.com/intel/hyperscan) but if we had done something in the automata world to handle lookaround, lookbehinds are way easier than lookaheads.

To handle a lookbehind, you really only need to occasionally 'AND' together some states (not an operation you would normally do in a standard NFA whether Glushkov or Thompson). To handle lookaheads... well, it gets ugly.
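
To make the "AND some states together" remark concrete, here is a toy bit-parallel (shift-and) sketch for literal sub-patterns, gating the start of the main pattern on a lookbehind automaton's accept state. This is purely an illustration of the concept, not how Hyperscan implements anything:

  #include <cstdint>
  #include <string>

  // Shift-and automaton for a non-empty literal of up to 64 characters:
  // bit i set in `state` means "the first i+1 characters just matched".
  struct ShiftAnd {
      uint64_t masks[256] = {};
      uint64_t accept = 0;
      explicit ShiftAnd(const std::string &pat) {
          for (size_t i = 0; i < pat.size(); ++i)
              masks[(unsigned char)pat[i]] |= 1ULL << i;
          accept = 1ULL << (pat.size() - 1);
      }
  };

  // Match something like (?<=foo)bar: the main automaton may only *start* at a
  // position where the lookbehind automaton accepted just before it.
  bool match_with_lookbehind(const std::string &text,
                             const ShiftAnd &behind, const ShiftAnd &main) {
      uint64_t sb = 0, sm = 0;
      for (unsigned char c : text) {
          uint64_t start_gate = (sb & behind.accept) ? 1ULL : 0ULL;
          sm = ((sm << 1) | start_gate) & main.masks[c];   // AND with the gate bit
          sb = ((sb << 1) | 1ULL) & behind.masks[c];
          if (sm & main.accept) return true;
      }
      return false;
  }

So match_with_lookbehind("foobar", ShiftAnd("foo"), ShiftAnd("bar")) is true, while "bazbar" is not - the only extra machinery is a gate ANDed into the main automaton's start state.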

Dyalog implementor here. The hot loops in this function are running my code!

I'm not surprised at all about this result, although I certainly wouldn't use it to make a pronouncement about Dyalog or C as a whole. But there are some places where interpreted array languages have a major advantage over typical compiled languages.

One of the advantages seen in this wc function is our use of bit booleans rather than byte booleans. Packing 8 bits in a byte uses eight times less space, and can lead to drastic speed improvements: 2-8 times faster than even code which uses short ints.

On that note, the function

  words←{(~(¯1)↑⍵)++/1=(1↓⍵)-(¯1)↓⍵}
can be improved by keeping the data boolean. 1=(1↓⍵)-(¯1)↓⍵ is equivalent to the windowed reduction 2</⍵, which identifies places where the boolean argument increased in value. We get:

  words←{(~¯1↑⍵)++/2</⍵}

Since this function doesn't produce a 1-byte int result from subtraction, it's many times faster. I discuss some similar functions to 2</ in a blog post at https://www.dyalog.com/blog/2018/06/expanding-bits-in-shrink....

Another big improvement is coming next year in Dyalog 18.0. The computations data∊nl and data∊sp will use vectorised search methods similar to Intel's hyperscan (https://github.com/intel/hyperscan). The character lookups here are changed to be lookups from a 256-bit table, or two SSE registers, searched using branchless SSSE3 instructions. I didn't go through this specific algorithm but explained many of our new search techniques at last year's user meeting: https://www.youtube.com/watch?v=paxIkKBzqBU.

By my measurements, each improvement (changing the word boundary computation and switching to the unreleased Dyalog 18.0) knocks off about a third of the total time. With both, this code is three times faster than it was before!
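
For those who don't read APL, a rough C++ analogue of that word-boundary computation on bit booleans (purely illustrative, nothing to do with Dyalog's internals): count the 0→1 transitions in a packed is-word-character bitmask, 64 positions per iteration, which is where the win over one-byte-per-boolean comes from.

  #include <bit>        // std::popcount (C++20)
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Count word starts: positions where the packed bitmask goes 0 -> 1.
  // Bit i of each 64-bit word is position i.
  size_t count_word_starts(const std::vector<uint64_t> &mask) {
      size_t total = 0;
      uint64_t carry = 0;                        // last bit of the previous word
      for (uint64_t w : mask) {
          uint64_t prev = (w << 1) | carry;      // bit for the preceding position
          total += std::popcount(w & ~prev);     // 1 where current=1, previous=0
          carry = w >> 63;
      }
      return total;
  }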

We wrote a generalization of the Levenshtein distance algorithm for regular expressions in Hyperscan (https://github.com/intel/hyperscan), which allows you to do Levenshtein distance matching over regular expressions. It is not necessarily a cheap thing to do, given that incautious use of edit distances and permissive regular expressions can result in automata that match almost anything.
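
If I remember the released API correctly, this is surfaced through Hyperscan's extended parameters; roughly the following (the pattern and distance are arbitrary examples, support depends on the Hyperscan version, and error handling is abbreviated):

  #include <hs/hs.h>
  #include <cstdio>

  int main() {
      const char *exprs[] = {"foo(bar|baz)"};
      unsigned int flags[] = {0};
      unsigned int ids[] = {0};

      hs_expr_ext_t ext{};
      ext.flags = HS_EXT_FLAG_EDIT_DISTANCE;
      ext.edit_distance = 2;                 // match within Levenshtein distance 2
      const hs_expr_ext_t *exts[] = {&ext};

      hs_database_t *db = nullptr;
      hs_compile_error_t *err = nullptr;
      if (hs_compile_ext_multi(exprs, flags, ids, exts, 1, HS_MODE_BLOCK,
                               nullptr, &db, &err) != HS_SUCCESS) {
          std::fprintf(stderr, "compile failed: %s\n", err->message);
          hs_free_compile_error(err);
          return 1;
      }
      // scanning then proceeds exactly as with a normal database
      hs_free_database(db);
      return 0;
  }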

I believe that the first widely available approximate regex matcher was Tre (https://laurikari.net/tre/). However, while everybody wants to talk about approximate matching like they have something to say, most algorithmists act like they forgot about Tre.

It looks like there are some interesting things in this project beyond regex, but the way that it skates over the fact that there are already automata-based multiple pattern matchers is pretty rank.

When we started developing automata-based pattern matching s/w in 2006 - 2006! - we were well aware that it wasn't really a new thing. 12 years of work on Hyperscan (https://github.com/intel/hyperscan), 1 acquisition and 1 open source release later - as well as multiple other projects doing similar things (re2, Rust regex, Parabix, icgrep, Rosie Pattern Language, ...) has only added to this crowded field. All this has taught me many things, but the thing that I keep returning to is that "a lot of people apparently don't know about prior art, or assiduously pretend not to". It will be interesting to see what they try to patent.

> 2. Nevod matches MULTIPLE patterns against a document in ONE PASS. Patterns/expressions are indexed by state machine and filtered effectively during matching.

This is a good idea. It's such a good idea that we did it in 2006, sold a company to Intel based around it in 2013, and open-sourced the resulting library (https://github.com/intel/hyperscan) in 2015.

All multi-pattern systems get a huge improvement over one-at-a-time regex. It would be embarrassing if they did not. Our rule of thumb for Hyperscan was that it should be about 5-10x faster than libpcre on single patterns and the gap should get progressively bigger.

I don't doubt that there is quite a bit of interesting work (entirely outside of regex - the word-based focus looks interesting) going on in this project, but the fact that you couldn't be bothered to Google around for a sensible baseline is not encouraging.

Also, which bit of your system do you think is novel enough to patent? You may find considerable prior art out there, which you don't really seem to be aware of based on how much emphasis you're placing on the not-entirely-novel idea of using a state machine.

https://github.com/intel/hyperscan/ implements this, along with multiple regex matching and a PCRE-pattern-compatible engine (Chimera).

Not wanting to be the 'Old Man Yells At Cloud' figure, but this is a perfect illustration of how not to do it, particularly multithreading in Go before using a reasonable algorithm.

Searching for a long string like this is preposterously easy. You can use a classical algorithm, or, given you are looking for a 64-byte-long string, you could do a strided inspection - every 32 bytes, look to see if you have a 4-gram from the first 35 bytes of the string. More or less anything works, as almost any reasonable string searching mechanism can outrun memory bandwidth.
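
A toy version of that strided idea, to show why it works (my own illustration, certainly not what I'd ship): any occurrence of a 64-byte needle must put one of the 4-grams from its first 35 bytes at a haystack offset that is a multiple of 32, so only every 32nd position needs to be inspected before confirming with memcmp.

  #include <cstddef>
  #include <cstdint>
  #include <cstring>
  #include <unordered_set>

  // `needle` is assumed to be exactly 64 bytes long.
  const char *find64(const char *hay, size_t hay_len, const char *needle) {
      const size_t n = 64, stride = 32;
      if (hay_len < n) return nullptr;

      // 4-grams starting at offsets 0..31 of the needle (its first 35 bytes)
      std::unordered_set<uint32_t> grams;
      for (size_t off = 0; off + 4 <= 35; ++off) {
          uint32_t g; std::memcpy(&g, needle + off, 4);
          grams.insert(g);
      }

      for (size_t s = 0; s + 4 <= hay_len; s += stride) {
          uint32_t g; std::memcpy(&g, hay + s, 4);
          if (!grams.count(g)) continue;
          // candidate: if present, the needle starts within [s - 32, s]
          size_t lo = (s >= stride) ? s - stride : 0;
          for (size_t p = lo; p <= s && p + n <= hay_len; ++p)
              if (std::memcmp(hay + p, needle, n) == 0) return hay + p;
      }
      return nullptr;
  }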

Or you could go use my old project (https://github.com/intel/hyperscan) and then you could look for regexes too. :-)

We are somewhat in amateur hour here. It is well understood that literal matching is easier at scale than regex matching, and some of us have built regex matchers that take advantage of this, whether they are given a pure literal case or regexes with literal factors: https://github.com/intel/hyperscan being a project I've worked on for a little while.