The docs state that it's comparable to RE2 and others, but I don't see benchmarks yet. Is it fair to assume it's faster, or is there a tradeoff made to support the streaming features?

RE2 isn't a multipattern engine. That's the big difference. You can join a bunch of patterns together with | in RE2, but alternation has a precedence order, so it will only report the first one to hit and won't tell you which pattern matched. RE2 also hits a wall when you join many patterns together like this (depending on the patterns, of course). Comparing multipattern and single-pattern engines is a bit apples and oranges.
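A rough illustration of the difference using Python's re (not the RE2 or Hyperscan APIs; the text and patterns are made up for the sketch):

```python
import re

text = "error: disk full, warning: low memory"
patterns = ["warning", "error", "low \\w+"]

# Joining patterns with '|' gives one engine, but only the leftmost
# match is reported, and you don't learn which branch fired without
# adding capture groups.
joined = re.compile("|".join(patterns))
m = joined.search(text)
print(m.group())  # -> "error" (leftmost); branch identity is lost

# A multipattern engine reports (pattern_id, from, to) for every
# pattern. Simulating that naively means one scan per pattern:
for pid, p in enumerate(patterns):
    for m in re.finditer(p, text):
        print(pid, m.start(), m.end())
```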

I haven't run HyperScan but it looks like it should be very fast. It greatly depends on the patterns selected. It also looks like HyperScan has pragmatically sacrificed some aspects of PCRE compatibility/correctness wrt/ matching in order to achieve performance.

We sacrifice features and compatibility. We have sacrificed zero correctness as far as we are aware. We maintain wart-for-wart compatibility with libpcre in order to have a good basis for comparison when fuzzing, and we believe we recognize 100% of libpcre constructs even when we don't support them.

Our semantics differ, but roughly speaking, we provide the set of matches that libpcre would if you hook up a callout to libpcre that rejects each match and sends libpcre back to look for another match (our ordering will be different, but is also well defined).

If a pattern uses quantification and the match extends successfully, won't two matches be reported (instead of one)? That's what I mean--not that it's buggy. As I'm sure you know quite well, deciding when matches can be reported is quite tricky, especially with Perl's ordered alternation rules.
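A sketch of those "all match ends" semantics in Python (this is a naive simulation of the semantics, not how Hyperscan computes them):

```python
import re

# Hyperscan-style "report every match end" semantics, simulated by
# asking, for each end offset, whether some match of the pattern ends
# exactly there.
def all_match_ends(pattern, text):
    anchored = re.compile("(?:" + pattern + ")$")
    return [end for end in range(1, len(text) + 1)
            if anchored.search(text, 0, end)]

# For "a+" on "aaa", libpcre reports one greedy match ("aaa"); with a
# reject-and-retry callout you instead see three match ends:
print(all_match_ends("a+", "aaa"))  # -> [1, 2, 3]
```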

A few questions:

- What kind of overhead is actually implied by using start of matches in streaming mode?

- Is the backend a machine code JIT of the automaton, an interpreter, a table-driven search, or...?

- Why not lazy quantifiers? They're almost always preferable to greedy quantification in a streaming scenario.

cheers, Jon

Sorry - might have sounded a bit defensive there. We've seen a few systems that have a tendency to sweep correctness under the carpet with shortcuts, especially when you hit nasties like "large bounded repeats in streaming mode".

Start of Match (SoM) in streaming mode is a real bear for some patterns. It really depends on the pattern - some are free, others (that we would otherwise support) are very expensive or impossible. Our SoM is fully general (i.e. it doesn't time out or give up after N bytes or K stream writes) so the worst case means tracking lots of 64-bit offsets through our automata.

The backend is a big pile of code that coordinates the efforts of many different automata and subsystems, most of which are a combination of a C implementation plus a big table of some sort to implement NFA/DFA/literal matcher/whatever. This isn't something of which we are particularly proud. We would like to make this a lot simpler and more interpreter-based.

A JIT would be really cool, but we encountered some reluctance in our user base to running arbitrary generated code. I would be delighted to support a JIT project even if it's not on our mainline.

I respectfully differ on lazy vs greedy. In streaming mode, both require information that's expensive to maintain and imply that you can 'predict the future' (especially once you put a complex pattern after the lazy quantifier). We've been using "automata semantics" for a long time and this is simple and easy to keep track of in streaming mode.

Capturing semantics and lazy/greedy simulation are a work in progress for us. We used to do more work in this area, but we didn't see strong demand for it. We do have some IP in this that isn't in Hyperscan - I may blog about this soon. We did have a subsystem that got 100% compatibility for some patterns ("100% right, 50% of the time"?) in block mode for libpcre, covering quantifier semantics and capturing subexpressions.

- With a PCRE/RE2-like NFA implementation, where you have green threads representing the states in the automata, you can just store the start of match in the thread. You do have to be careful then about not killing some threads off. If HyperScan is a DFA, though, then SoM is indeed hard.
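A minimal Pike-VM sketch of that idea (hypothetical bytecode, not RE2's or Hyperscan's actual code): each thread is a (pc, som) pair, so the start offset rides along with the thread for free.

```python
# Tiny Pike VM. Program is a list of ops: ("char", c), ("split", a, b),
# ("jmp", a), ("match",). Each thread is (pc, som) where som is the
# start-of-match offset carried along with the thread.
def pike_vm(prog, text):
    matches = []

    def add_thread(threads, seen, pc, som):
        if pc in seen:        # one thread per pc keeps the run O(n*m)
            return            # (careful: this is where threads get killed)
        seen.add(pc)
        op = prog[pc]
        if op[0] == "jmp":
            add_thread(threads, seen, op[1], som)
        elif op[0] == "split":
            add_thread(threads, seen, op[1], som)
            add_thread(threads, seen, op[2], som)
        else:
            threads.append((pc, som))

    clist, seen = [], set()
    for i in range(len(text) + 1):
        add_thread(clist, seen, 0, i)   # new match attempt starting at i
        nlist, nseen = [], set()
        for pc, som in clist:
            op = prog[pc]
            if op[0] == "match":
                matches.append((som, i))
            elif op[0] == "char" and i < len(text) and text[i] == op[1]:
                add_thread(nlist, nseen, pc + 1, som)
        clist, seen = nlist, nseen
    return matches

# "ab" compiled to the toy bytecode:
prog = [("char", "a"), ("char", "b"), ("match",)]
print(pike_vm(prog, "xabab"))  # -> [(1, 3), (3, 5)] as [start, end) spans
```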

- Redgrep is a cool project by Paul Wankadia that uses LLVM to JIT regexp automata into machine code. https://github.com/google/redgrep. The two things that are tough about JITting for very large automata, though, are breaking the code up into functions (a compiler can't deal with one gigantor block), and all that entails, and the fact that a machine code representation of a large automaton might be many times bigger than a more traditional representation, so what you gain in decode time you lose to L3 cache misses.

- I don't grok what you mean about predicting the future wrt/ lazy/greedy quantification. With lazy quantification, you can stop producing matches once you've gotten the first match (or once you've moved past the quantification iff the subsequent subpattern and the quantification subpattern are disjoint). Looking for ".+" in a large stream is very bad, whereas looking for ".+?" is almost certainly what's desired.
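A quick Python illustration of the lazy-vs-greedy point (example text is made up):

```python
import re

text = "key=1;key=2;key=3;"

# Greedy: ".+" swallows as much as possible, so the match runs to the
# last ';' in the buffer. In a stream, that means waiting on the future.
print(re.search(r"key=.+;", text).group())   # -> "key=1;key=2;key=3;"

# Lazy: ".+?" stops at the first viable end, which is usually what a
# streaming consumer wants.
print(re.search(r"key=.+?;", text).group())  # -> "key=1;"
```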

- Capturing in a streaming mode just doesn't seem worth it.

- Large spans of bounded repetition are indeed the devil. It should be possible to replace states with a counter, but I think it requires dynamic mallocs (though I haven't worked on it yet).
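A sketch of the counter idea for one isolated bounded repeat, here the hypothetical pattern "a{3,5}b", in streaming mode: a single saturating run-length counter replaces the explicit repeat states, and for this simple case the state stays fixed-size with no mallocs (overlapping or nested repeats are where it gets hairy).

```python
# Counter-based streaming matcher for "a{3,5}b" (illustrative only).
class BoundedRepeatMatcher:
    def __init__(self, lo=3, hi=5):
        self.lo, self.hi = lo, hi
        self.run = 0   # length of the current run of 'a', saturated at hi
        self.pos = 0   # absolute offset in the stream

    def scan(self, chunk, on_match):
        for c in chunk:
            if c == "a":
                self.run = min(self.run + 1, self.hi)  # saturating counter
            else:
                if c == "b" and self.run >= self.lo:
                    # leftmost match uses min(run, hi) preceding 'a's
                    on_match((self.pos - self.run, self.pos + 1))
                self.run = 0
            self.pos += 1

m = BoundedRepeatMatcher()
hits = []
m.scan("xaaa", hits.append)   # counter survives across stream writes
m.scan("ab", hits.append)
print(hits)  # -> [(1, 6)]: "aaaab" spans offsets [1, 6)
```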