What does HackerNews think of regex?

An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.

Language: Rust

#79 in Rust
> Nonetheless, every popular programming language (except Go[1])

... and Rust. Although Rust doesn't have a regex library in std, the official regex crate (I am its author) descends from RE2[1].

> So it's feasible for the engine to optimistically try to use an efficient implementation

Feasible, sure. But not easy or simple. If Python's re2 module uses RE2 by default and Python's regex engine when necessary, then it's very likely that the match offsets it reports are not coherent. RE2 aims to be mostly consistent with things like PCRE, but there are cases where it isn't.

The problem is, building a production grade FSM regex engine or a production grade backtracking regex engine are both huge engineering projects. And there are few, if any, people that know how to do both. They both have their own little specialties and there isn't a ton of crossover among implementors of regex engines.

It's one of those ideas that sounds so stupidly obvious in practice, but when you actually sit down to engineer the thing, it's quite a bit harder than you might imagine.

> (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point).

This kind of ignores the fact that regexes are often the interface to something else. If you have a full blown programming language at your fingertips, it's usually not hard to work around the absence of look-around or backreferences by, say, using a second regex or something. But when all you have available to you is, say, a blacklist and you get to provide the regexes to that blacklist, then a more expressive regex syntax becomes a lot more useful. See for example people already running into problems with Discord's new feature: https://github.com/rust-lang/regex/issues/127#issuecomment-1...

Of course, as I noted, it would be very inappropriate for Discord to permit non-regular regexes here. It's borderline already by permitting any regexes at all, because even FSMs have their pathological cases, albeit, not exponential. But if it's a tool you run on your local system? Sure, look-arounds become much more useful.

To be clear, I am pro-FSM and agree with you that they should really be the default regex engine everywhere. But instead, reality is indeed inverted, with backtracking being the default (almost) everywhere. But I wanted to pipe up about a few points that I think you might have not quite captured precisely.

[1]: https://github.com/rust-lang/regex/

"Third-party" isn't really correct for regex: https://github.com/rust-lang/regex (note the org)

Rust's stdlib is deliberately small in order to allow it to have very strong stability guarantees.

You might care to notice that the resignation was announced by Andrew Gallant—more commonly known as BurntSushi[1]—who is one of the most well-respected, talented, and prolific contributors in the wider Rust community. Amongst other things, they are the author of ripgrep[2], the regex[3] crate, and the byteorder[4] crate. They have multiple projects which are amongst the most-downloaded crates[5] in the Rust ecosystem.

One would struggle to find more than a handful of people who have done more "actual work" for Rust than Andrew.

[1]: https://blog.burntsushi.net/about

[2]: https://github.com/BurntSushi/ripgrep

[3]: https://github.com/rust-lang/regex

[4]: https://github.com/BurntSushi/byteorder

[5]: https://crates.io/crates?sort=recent-downloads

> One such engine is rust's https://github.com/rust-lang/regex

It can process arbitrary deep nestings? I'm not familiar with it, but I had a look and I can't see anything in there that would allow it to do it. Can you point to syntax that allows it?

In general regex's that run in linear time are based on DFA's (as opposed to NFA's - which is what most use).

A DFA is part of the Chomsky hierarchy of languages that has DFA's at the bottom (aka regex's, least powerful - can't handle recursive structures), a DFA coupled with an infinite stack for memory (aka LR parsers, can't handle context sensitive grammars, also runs in very close to linear time), and finally a DFA coupled with an infinite tape (aka Turing Machine, can do any sort of computation).

In each of these cases the DFA (regex) is essentially a computer program and all they are doing is allowing it to store stuff on various kinds of memory (none, stack, tape with arbitrary movements). An interesting corollary of this is any computer program can be expressed as a DFA.