What does Hacker News think of redgrep?

♥ Janusz Brzozowski

Language: C++

> If it's actually a _regular_ expression, by definition re2 will support it.

Hmmm, kinda sorta. Regular languages are closed under complement and intersection, but RE2 doesn't support those operations. They are difficult to implement efficiently, particularly at regex compilation time. There are some issues on the RE2 issue tracker about this.

Of course, you could always rewrite a regex that uses complement or intersection into an equivalent regex that doesn't use those operations. While true in principle, such a rewrite is usually quite tricky in practice. So treat this comment as adding clarity rather than correcting you. :-)

Complement and intersection are not implemented in any "general purpose" regex engine that I know of. You tend to find them in more specialty projects like redgrep[1] or in libraries specifically devoted to the construction and manipulation of DFAs. (RE2 doesn't have such a thing, only a lazy DFA, because full DFAs are too expensive to build in the "general purpose" context.)

[1]: https://github.com/google/redgrep
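
To make the workaround concrete, here's a minimal sketch of emulating intersection and complement at match time with RE2's C++ API. The patterns and function names are mine; note this only answers "does the string match?" and sidesteps the hard part redgrep actually solves, which is building a single automaton for the combined expression:

    // Emulating intersection and complement by combining match results,
    // since RE2 itself has no operators for them.
    #include <re2/re2.h>
    #include <string>

    // L(a) ∩ L(b): member iff the string matches both patterns.
    bool MatchesIntersection(const std::string& s, const RE2& a, const RE2& b) {
      return RE2::FullMatch(s, a) && RE2::FullMatch(s, b);
    }

    // Complement of L(a): member iff the string does not match.
    bool MatchesComplement(const std::string& s, const RE2& a) {
      return !RE2::FullMatch(s, a);
    }

    int main() {
      RE2 has_digit(".*[0-9].*"), has_alpha(".*[a-z].*");
      // "abc1" is in the intersection; "abc" is not.
      return MatchesIntersection("abc1", has_digit, has_alpha) ? 0 : 1;
    }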

I don't think Rust's regex engine relies on this technique. The main point, I think, is that even when you construct the DFA directly, you still risk an exponential explosion in the number of states. That's why modern engines balance between an NFA, a full DFA, and a lazy DFA.

Though there is an implementation that relies only on Brzozowski derivatives: https://github.com/google/redgrep
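
For a concrete picture of the explosion, here's a small sketch (my example, not from any of these engines) using the textbook family of patterns whose full DFA is exponential, yet which RE2 handles fine because its lazy DFA only materializes the states the input actually reaches:

    // (a|b)*a(a|b){n} must "remember" the last n+1 characters, so its
    // minimal full DFA needs on the order of 2^n states. RE2 still matches
    // in linear time, building DFA states lazily within a memory budget.
    #include <re2/re2.h>
    #include <iostream>
    #include <string>

    int main() {
      const int n = 30;  // a full DFA would need on the order of 2^30 states
      RE2 re("(?:a|b)*a(?:a|b){" + std::to_string(n) + "}");
      std::string input(1000, 'a');
      std::cout << RE2::FullMatch(input, re) << "\n";  // prints 1
    }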

I couldn't figure out from looking through ripgrep's website: does ripgrep support intersection and complement of expressions? Like, e.g., https://github.com/google/redgrep does.

Regular languages are closed under those operations after all.

Most popular regex engines use backtracking and support non-regular features. They aren't based on automata, so implementing AND is perhaps not as theoretically straightforward in that context.

There are some popular finite automata based engines though. RE2 comes to mind. In theory, things like AND and complement could be implemented, but in practice, they blow up the size of the state machine.[1, 2] The current maintainer of RE2 did build a tool that permits intersection and complement though.[3]

[1] - https://dl.acm.org/doi/10.1145/2071368.2071372

[2] - https://github.com/google/re2/issues/156#issuecomment-354068...

[3] - https://github.com/google/redgrep
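
For intuition on why they blow up, here's a toy sketch of the textbook product construction for intersection. The Dfa struct and helper are illustrative only, not RE2's or redgrep's internals:

    // Intersection of two DFAs via the product construction: the result has
    // |Q1| * |Q2| states, and chaining k intersections multiplies k state
    // counts together, before any minimization.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Dfa {
      std::vector<std::vector<int>> delta;  // delta[state][byte] -> next state
      std::vector<bool> accept;
      int start = 0;
    };

    Dfa Intersect(const Dfa& a, const Dfa& b) {
      size_t na = a.delta.size(), nb = b.delta.size();
      Dfa p;
      p.delta.assign(na * nb, std::vector<int>(256));
      p.accept.assign(na * nb, false);
      p.start = a.start * nb + b.start;
      for (size_t i = 0; i < na; ++i) {
        for (size_t j = 0; j < nb; ++j) {
          size_t s = i * nb + j;  // pair state (i, j)
          p.accept[s] = a.accept[i] && b.accept[j];
          for (int c = 0; c < 256; ++c)
            p.delta[s][c] = a.delta[i][c] * nb + b.delta[j][c];
        }
      }
      return p;
    }

    Dfa MakeToggle(char t) {  // accepts strings with an odd count of t
      Dfa d;
      d.delta = {std::vector<int>(256, 0), std::vector<int>(256, 1)};
      d.delta[0][(unsigned char)t] = 1;
      d.delta[1][(unsigned char)t] = 0;
      d.accept = {false, true};
      return d;
    }

    int main() {
      Dfa both = Intersect(MakeToggle('a'), MakeToggle('b'));
      std::printf("%zu states\n", both.delta.size());  // 2 * 2 = 4
    }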

RE2 was used in Google Code Search in an adversarial context, and it handles the DFA issue by simply putting an 8MB limit on the DFA for a regex:

https://github.com/google/re2/blob/master/re2/re2.h#L601
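
Concretely, the 8MB cap is just the default of RE2's max_mem option, which callers can adjust. A minimal sketch (the numbers are mine, not anyone's recommended settings); as I understand it, when the lazy DFA exceeds the budget RE2 flushes its cache and, in the worst case, falls back to slower non-DFA matchers rather than failing:

    #include <re2/re2.h>
    #include <string>

    int main() {
      RE2::Options opts;
      opts.set_max_mem(64 << 20);  // raise the budget from the 8MB default to 64MB
      RE2 re("(?:a|b)*a(?:a|b){30}", opts);
      std::string input(40, 'a');
      return RE2::FullMatch(input, re) ? 0 : 1;
    }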

I learned a few months ago that Rust's regex crate, which is based heavily on RE2, does the same thing.

There is perhaps something worth exploring there, but after reading the Cox articles, I can see how many issues come up in regex implementation, and this seems to be one of the smaller ones.

I don't think the "regex as untrusted data" use case is very common. That is, there are probably 10,000 programs that use regexes for every one that accepts regexes from users.

-----

Although, one thing I've wanted to explore is regex derivatives. Supposedly that technique constructs an optimal DFA directly: you don't have to build an NFA, convert it to a DFA, and then minimize the result.

Cox describes it as a "win-win" here (in contrast to parsing with derivatives, which is asymptotically worse than traditional parsing algorithms):

https://research.swtch.com/yaccalive

There are a bunch of non-production implementations floating around, like:

https://github.com/google/redgrep

https://github.com/MichaelPaddon/epsilon

(The technique was revived by the 2009 paper cited there.)
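
To make the derivative idea concrete, here's a toy matcher I sketched (my code, not redgrep's or epsilon's). nullable(r) asks whether r accepts the empty string, and deriv(r, c) is "r after consuming c"; matching is a left fold of deriv over the input, with no NFA ever built. Real implementations also canonicalize derivatives so that only finitely many dissimilar ones arise, and each distinct derivative becomes a DFA state -- that's the "construct the DFA directly" part:

    // Toy Brzozowski-derivative matcher. No simplification is done, so
    // derivative terms grow; real implementations canonicalize them.
    #include <memory>
    #include <string>

    struct Re;
    using Rp = std::shared_ptr<Re>;
    struct Re {
      enum Kind { Empty, Eps, Chr, Alt, Cat, Star } kind;
      char c;   // only meaningful for Chr
      Rp a, b;  // subexpressions for Alt/Cat (a, b) and Star (a)
    };
    Rp mk(Re::Kind k, char c = 0, Rp a = nullptr, Rp b = nullptr) {
      return std::make_shared<Re>(Re{k, c, a, b});
    }

    // Does r accept the empty string?
    bool nullable(const Rp& r) {
      switch (r->kind) {
        case Re::Empty: case Re::Chr: return false;
        case Re::Eps: case Re::Star: return true;
        case Re::Alt: return nullable(r->a) || nullable(r->b);
        case Re::Cat: return nullable(r->a) && nullable(r->b);
      }
      return false;
    }

    // The derivative of r with respect to x: "r after consuming x".
    Rp deriv(const Rp& r, char x) {
      switch (r->kind) {
        case Re::Empty: case Re::Eps: return mk(Re::Empty);
        case Re::Chr: return r->c == x ? mk(Re::Eps) : mk(Re::Empty);
        case Re::Alt: return mk(Re::Alt, 0, deriv(r->a, x), deriv(r->b, x));
        case Re::Cat: {
          Rp d = mk(Re::Cat, 0, deriv(r->a, x), r->b);
          return nullable(r->a) ? mk(Re::Alt, 0, d, deriv(r->b, x)) : d;
        }
        case Re::Star: return mk(Re::Cat, 0, deriv(r->a, x), r);
      }
      return mk(Re::Empty);
    }

    bool match(Rp r, const std::string& s) {
      for (char x : s) r = deriv(r, x);  // fold the derivative over the input
      return nullable(r);
    }

    int main() {
      // (ab)* matches "abab" but not "aba".
      Rp re = mk(Re::Star, 0, mk(Re::Cat, 0, mk(Re::Chr, 'a'), mk(Re::Chr, 'b')));
      return match(re, "abab") && !match(re, "aba") ? 0 : 1;
    }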

-----

Regarding the blog post, I have some material here. I gave about 10% of this as a 5-minute talk, and realized it's about 2 hours' worth of material.

https://www.oilshell.org/share/05-31-pres.html

I still think the table toward the end is a good summary and perhaps worth publishing with some more context / explanation.

-----

One interesting test of whether the message from Cox's articles got through is what style of regex engine newer languages are using.

I just noticed Julia uses PCRE, which is unfortunate IMO. I like RE2 -- it has a nice API, great performance, and a lot of features.

https://docs.julialang.org/en/v1/manual/strings/index.html#R...

Swift also has Perl-style regexes, but maybe they inherited that from Objective-C:

https://developer.apple.com/documentation/foundation/nsregul...

So basically Rust is the only newer language that appears to have gotten the message.

The Oil shell will of course use "regular languages" rather than regexes, but that's natural because shell never had Perl-style regexes.

One thing I think Cox didn't explore enough is that most libcs appear to use the fast, automata-based method, because they don't need Perl constructs.

I guess that would be a good follow-up article: test GNU libc and musl libc on those pathological cases. I looked at their code a few years ago and it looked "right" (I've forgotten the details), but I don't think I ever ran the test.
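
Here's roughly what that test would look like, sketched against the POSIX regex.h interface. The pattern is Cox's classic a?^n a^n benchmark; the cutoff n = 30 is a guess at where a backtracking implementation would blow up:

    // Probe a libc's regexec with a pattern that is exponential for
    // backtrackers but linear for automata-based matchers. Build and run
    // against the libc you want to test; it returns quickly iff the
    // implementation is non-backtracking.
    #include <regex.h>
    #include <cstdio>
    #include <string>

    int main() {
      const int n = 30;
      std::string pat, text(n, 'a');
      for (int i = 0; i < n; ++i) pat += "a?";  // a? repeated n times...
      for (int i = 0; i < n; ++i) pat += "a";   // ...then a repeated n times

      regex_t re;
      if (regcomp(&re, pat.c_str(), REG_EXTENDED | REG_NOSUB) != 0) return 2;
      int rc = regexec(&re, text.c_str(), 0, nullptr, 0);
      std::printf("match: %s\n", rc == 0 ? "yes" : "no");
      regfree(&re);
    }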

There's also redgrep (https://github.com/google/redgrep), which supports intersection and complement of regular expressions.

I am toying with the idea of writing a little game where player A thinks of a regular expression and player B tries to guess it. If B guesses right, they win. If B guesses wrong, A has to provide a false positive and a false negative (if they exist), and B gets to guess again.

Can you think of ways to automate the roles of A and/or B?
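
One way to automate A's feedback step (and to detect when B has won): compile both expressions to DFAs and search the product automaton for a state where the two machines disagree; the path to it spells out a false positive or false negative, and if no such state is reachable, the languages are equal. A toy sketch, reusing the illustrative Dfa shape from the product-construction sketch above:

    #include <cstdio>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    struct Dfa {
      std::vector<std::vector<int>> delta;  // delta[state][byte] -> next state
      std::vector<bool> accept;
      int start = 0;
    };

    // Returns a string in exactly one of the two languages; found = false
    // means the DFAs are equivalent.
    std::string Distinguish(const Dfa& a, const Dfa& b, bool& found) {
      size_t nb = b.delta.size();
      std::vector<int> seen(a.delta.size() * nb, -1);  // -1 = unvisited
      std::vector<std::pair<size_t, char>> parent(seen.size());
      std::queue<size_t> q;
      size_t s0 = a.start * nb + b.start;
      seen[s0] = 0;
      q.push(s0);
      while (!q.empty()) {
        size_t s = q.front(); q.pop();
        size_t i = s / nb, j = s % nb;
        if (a.accept[i] != b.accept[j]) {  // one accepts, one rejects
          std::string w;
          for (size_t t = s; t != s0; t = parent[t].first) w += parent[t].second;
          found = true;
          return std::string(w.rbegin(), w.rend());
        }
        for (int c = 0; c < 256; ++c) {
          size_t t = a.delta[i][c] * nb + b.delta[j][c];
          if (seen[t] == -1) { seen[t] = 1; parent[t] = {s, char(c)}; q.push(t); }
        }
      }
      found = false;
      return "";
    }

    int main() {
      Dfa a;  // A's secret: strings with an odd number of 'a's
      a.delta = {std::vector<int>(256, 0), std::vector<int>(256, 1)};
      a.delta[0][(unsigned char)'a'] = 1;
      a.delta[1][(unsigned char)'a'] = 0;
      a.accept = {false, true};
      Dfa b;  // B's guess: the empty language
      b.delta = {std::vector<int>(256, 0)};
      b.accept = {false};
      bool found = false;
      std::string w = Distinguish(a, b, found);
      std::printf("%d %s\n", (int)found, w.c_str());  // prints: 1 a
    }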

- With a PCRE/RE2-like NFA implementation, where you have green threads representing the states in the automaton, you can just store the start of match in each thread. You do have to be careful about which threads you kill off, though. If Hyperscan is a DFA, then SoM (start of match) is indeed hard.

- redgrep is a cool project by Paul Wankadia that uses LLVM to JIT regex automata into machine code: https://github.com/google/redgrep. The two things that are tough about JITting very large automata, though, are breaking the code up into functions (a compiler can't deal with one gigantor block), and all that entails, and the fact that a machine-code representation of a large automaton might be many times bigger than a more traditional representation, so what you gain in decode time you lose to L3 cache misses.

- I don't grok what you mean about predicting the future w.r.t. lazy/greedy quantification. With lazy quantification, you can stop producing matches once you've gotten the first match (or once you've moved past the quantification, iff the subsequent subpattern and the quantification subpattern are disjoint). Looking for ".+" in a large stream is very bad, whereas looking for ".+?" is almost certainly what's desired; see the sketch after this list.

- Capturing in a streaming mode just doesn't seem worth it.

- Large spans of bounded repetition are indeed the devil. It should be possible to replace runs of states with a counter, but I think that requires dynamic allocation (though I haven't worked on it yet).
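
A quick illustration of the lazy/greedy point, using RE2 since its API is handy (my example; Hyperscan's streaming semantics are a separate matter). Greedy grabs the longest span; lazy stops at the first closing delimiter, which in a streaming setting means you can emit the match and move on instead of buffering to the end of the input:

    #include <re2/re2.h>
    #include <iostream>
    #include <string>

    int main() {
      std::string text = "<a> <b> <c>", greedy, lazy;
      RE2::PartialMatch(text, "<(.+)>", &greedy);  // captures "a> <b> <c"
      RE2::PartialMatch(text, "<(.+?)>", &lazy);   // captures "a"
      std::cout << greedy << "\n" << lazy << "\n";
    }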