It really feels like they buried the lede here, as almost all discussions about catastrophic backtracking seem to do. Here's the second-to-last paragraph:
> Alternatively you could switch to a regular expression engine that doesn’t exhibit this kind of behaviour. RE2 is an excellent library that uses a finite automata approach to avoid this type of catastrophic backtracking. Using this library on the same 32 character input reduces the running time from 12 minutes to basically instant. The downside being that RE2 doesn’t support all the functionality of the builtin re package.
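The blow-up described in that quote is easy to reproduce yourself. A minimal sketch using Python's built-in `re` (the pattern and input sizes here are my own, not the article's; on a failing match, each extra character roughly doubles the running time):

```python
import re
import time

# A classic pathological pattern: nested quantifiers mean there are
# exponentially many ways to split a run of "a"s between the inner and
# outer repetition, and a backtracking engine tries all of them before
# giving up.
PATTERN = re.compile(r"(a+)+b")

for n in (16, 18, 20):
    s = "a" * n  # no "b", so the match must fail -- after trying every split
    start = time.perf_counter()
    result = PATTERN.match(s)
    elapsed = time.perf_counter() - start
    print(f"n={n}: matched={result is not None}, {elapsed:.3f}s")
```

Successful matches are fast (`PATTERN.match("aaab")` returns immediately); it's the failing inputs that trigger the exponential search.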
Perl Compatible Regular Expressions (the flavor used by most engines, including Python's) are matched with an implementation technique called backtracking, which carries this exponential worst-case behavior. But the theory behind regexes doesn't require it: if you eliminate just a couple of (infrequently used) PCRE features, you can have a regex engine that runs in linear time with respect to the size of the input string. See [2], written by one of the authors of RE2. The features that are incompatible with this sort of implementation are:
1. Lookahead assertions.

2. Backreferences to matched groups.
If you don't know what these are, or you rarely ever use them, then you really have no need for a backtracking regex engine. (And if you do use them, I'd argue you're shoving a bit too much logic into a regex, but that's beside the point.) Nonetheless, every popular programming language (except Go[1]) has included an exponentially backtracking regex implementation in its standard library, exposing the entire industry to this stupidity, all for a few backreferences and lookahead assertions.
What's especially crazy is this: it's easy for a regex engine to detect the use of these constructs, because it has to parse the regex anyway! So it's feasible for the engine to optimistically try to use an efficient implementation, and only fall back to a backtracking implementation when you're actually using these features. This is what the Python re2 module[3] does: it uses RE2 by default and falls back to the backtracking implementation when necessary.
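Detecting those constructs really is straightforward, since it can happen while parsing the pattern. Here's a rough sketch of the idea (a real engine would inspect its parsed syntax tree rather than scan the pattern text, and the function name is mine):

```python
import re

# Syntax that forces a backtracking implementation: look-around assertions
# and backreferences. This textual scan is a simplification -- a real engine
# would check the parsed AST, which sidesteps corner cases this scan gets
# wrong (e.g. these sequences appearing inside a character class).
BACKTRACKING_ONLY = re.compile(
    r"\(\?=|\(\?!|\(\?<=|\(\?<!"   # lookahead / lookbehind assertions
    r"|(?<!\\)\\[1-9]"             # numbered backreference like \1
    r"|\(\?P="                     # named backreference like (?P=name)
)

def needs_backtracking(pattern: str) -> bool:
    """Return True if the pattern uses features an FSM engine can't handle."""
    return BACKTRACKING_ONLY.search(pattern) is not None
```

A hybrid engine would compile to an automaton when this returns False and route everything else to the backtracking implementation.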
Instead, we're stuck reading the same postmortem every few years describing how "catastrophic backtracking" ruined another company/person's day, when the problem has been solved for decades, and language/library creators have just failed to include that solution.
[1]: Rob Pike invented, or at least popularized, the algorithm used by RE2. He was also involved in the creation of Go, as was Russ Cox[2].

[2]: https://swtch.com/~rsc/regexp/

[3]: https://pypi.org/project/re2/
... and Rust. Although Rust doesn't have a regex library in std, the official regex crate (I am its author) descends from RE2[1].
> So it's feasible for the engine to optimistically try to use an efficient implementation
Feasible, sure. But not easy or simple. If Python's re2 module uses RE2 by default and Python's regex engine when necessary, then it's very likely that the match offsets it reports are not consistent across the two engines. RE2 aims to be mostly consistent with things like PCRE, but there are cases where it isn't.
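One concrete source of divergence is match semantics. Perl-style engines, including Python's `re`, are "leftmost-first": alternatives are tried in order and the first one that lets the overall match succeed wins, whereas a textbook POSIX automaton is "leftmost-longest". A small illustration with Python's `re` (RE2 deliberately mimics leftmost-first behavior by default, but subtler discrepancies remain):

```python
import re

# Leftmost-first: the "a" branch is tried first and succeeds, so the match
# stops there even though "ab" would also have matched.
m = re.match(r"a|ab", "ab")
print(m.group())  # -> "a"

# A leftmost-longest (POSIX) engine would report "ab" for the pattern above.
# Under leftmost-first semantics, you reorder the alternation to get it:
m2 = re.match(r"ab|a", "ab")
print(m2.group())  # -> "ab"
```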
The problem is that building a production-grade FSM regex engine and building a production-grade backtracking regex engine are both huge engineering projects, and there are few, if any, people who know how to do both. Each has its own little specialties, and there isn't a ton of crossover among implementors of regex engines.
It's one of those ideas that sounds stupidly obvious in principle, but when you actually sit down to engineer the thing, it's quite a bit harder than you might imagine.
> (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point).
This kind of ignores the fact that regexes are often the interface to something else. If you have a full blown programming language at your fingertips, it's usually not hard to work around the absence of look-around or backreferences by, say, using a second regex or something. But when all you have available to you is, say, a blacklist and you get to provide the regexes to that blacklist, then a more expressive regex syntax becomes a lot more useful. See for example people already running into problems with Discord's new feature: https://github.com/rust-lang/regex/issues/127#issuecomment-1...
Of course, as I noted, it would be very inappropriate for Discord to permit non-regular regexes here. It's borderline already by permitting any regexes at all, because even FSMs have their pathological cases, albeit not exponential ones. But if it's a tool you run on your local system? Sure, look-arounds become much more useful.
To be clear, I am pro-FSM and agree with you that FSM engines should really be the default regex engine everywhere. But instead, reality is indeed inverted, with backtracking being the default (almost) everywhere. I just wanted to pipe up about a few points that I think you might not have quite captured precisely.