What does HackerNews think of re2?

RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

Language: C++

Doesn't that confuse two uses of n?

One is the length of the word. The other is the number of dictionary elements.

If your language is constrained to 1,000 words, then the dictionary will be pretty small and the hash table can be constructed with no chains. That should have constant-time lookup no matter how long the query concatenation is.

I don't see why O(n^2) should be optimal. The question is only to determine if a match exists, and that can be expressed as a regular expression. In the following I omit all 1-letter words since my dictionary includes all the letters.

  >>> words = [line.strip() for line in open("/usr/share/dict/words") if len(line.strip()) > 1]
  >>> words.sort(key=len, reverse=True) # because Python uses left-to-right match, not longest
  >>> pattern = "(" + "|".join(words) + ")+$"
  >>> import re
  >>> matcher = re.compile(pattern) # takes a few seconds
  >>> matcher.match("ducksoup")
  
  >>> matcher.match("ducksquom")
  >>> matcher.match("theremanywordsinthisone")
  
  >>> matcher.match("theremanywordsqinthisone") is None
  True
Python uses backtracking, so this probably isn't O(n), especially with the ability to choose the dictionary.

But with there are non-backtracking matchers which would make this O(n). Here's re2 from https://github.com/google/re2 :

  >>> import re2
  >>> opts = re2.Options()
  >>> opts.max_mem = 200000000
  >>> matcher = re2.compile(pattern, opts)
  >>> match = matcher.match("ducksoup")
  re2/re2.cc:821: DFA out of memory: pattern length 2493008, program size
  1323834, list count 1089251, bytemap range 54
  >>> match.start(), match.end()
  (0, 8)
  >>> match = matcher.match("ducksquom")
  re2/re2.cc:821: DFA out of memory: pattern length 2493008, program size
  1323834, list count 1089251, bytemap range 54
  >>> match is None
  True
This is first time I've used re2. There's probably a way to prevent that warning message - I believe this is falling back to an NFA, which might not be O(n)?

Anyway, here it is processing the concatenation of 100,000 words selected at random (I've omitted the warning message):

  >>> s="".join(random.sample(words, 100000))
  >>> len(s)
  956249
  >>> import time
  >>> t1 = time.time(); print(f"is concatenation? {matcher.match(s) is not None} " + f"time: {time.time()-t1:.1f} seconds")
  is concatenation? True time: 12.3 seconds
  >>> t1 = time.time(); print(f"is concatenation? {matcher.match(s+'Q') is not None} " + f"time: {time.time()-t1:.1f} seconds")
  is concatenation? False time: 12.3 seconds
Cut the size by 10 and it's about 10x faster:

  >>> s="".join(random.sample(words, 10000))
  >>> len(s)
  95457
  >>> t1 = time.time(); print(f"is concatenation? {matcher.match(s) is not None} " + f"time: {time.time()-t1:.1f} seconds")
  is concatenation? True time: 1.3 seconds
For this test it appears to be linear in the query size.

Edit: if I increase max_mem to 2_000_000_000 then I don't get the warning message. But then the DFA-based search takes 141.5 seconds. That should be O(n) time in the query length, but an order of magnitude slower than the NFA-based search.

That's why O(n) analysis isn't enough to really figure out "a faster solution."

I wonder why they didn't just include an option to use a non-backtracking algorithm, like re2's[1]. As far as I know, that would completely eliminate the possibility of catastrophic backtracking happening.

[1]: https://github.com/google/re2

As this is a regex->DFA implementation I wonder how it compares to Google RE2 [1]

there are nice performance comparisons to glibc, rust regex and cl-ppcre, but re2 was missing from that list.

[1] https://github.com/google/re2

It's worth pointing out that the resource

https://swtch.com/~rsc/regexp/regexp1.html

linked above is also by the same author, Russ Cox,

as the reference I included at the end of the blog post:

https://swtch.com/~rsc/regexp/regexp2.html

All of the insights that I presented through this visualization tool are based upon the knowledge found in Russ Cox's articles. Also, the link above on RE2:

https://github.com/google/re2

is a project that was was also (started by|heavily contributed to by) Russ Cox. His writings on the topic of regular expressions are absolutely world-class and I have never found any better resource for understanding the deep low-level theory on how they work in practice.