June 15th, 2005



My esteemed colleague Philip Hazel recently released pcre-6.0, the latest version of his Perl-compatible regular expression library. The major new feature is an alternative matching algorithm which does not back-track. Unfortunately, Philip has documented this using incorrect terminology from Jeffrey Friedl's otherwise excellent book, Mastering Regular Expressions. This is my rant about Jeffrey Friedl's description of the difference between DFA and NFA, i.e. deterministic or non-deterministic finite automata.

In the correct non-Friedl terminology, DFA and NFA refer to the construction of the FA, not the matching algorithm. When matching a subject string against a DFA, each character determines a unique new automaton state, so you can implement the match with just two variables: a pointer into the string and a pointer into the DFA. In an NFA, each character can determine multiple subsequent automaton states, so to implement the match you need to trace the multiple possible routes through the NFA. Traditionally this has been done with a backtracking depth-first search, in which you keep a history of your state at each decision point so that you can go back and try the alternatives if a match fails. The new PCRE algorithm effectively performs a breadth-first search, by processing the input string a character at a time and keeping track of all the possible NFA states in parallel. (See http://www.cl.cam.ac.uk/Teaching/current/RLFA/ and in particular section 2.2 of the lecture notes for more details.)

Friedl only talks about the two NFA algorithms, referring to the first as an NFA implementation (fine) and the second as a DFA implementation (no! wrong! totally wrong! where'd he learn this?). His confusion probably comes from the fact that the alternative NFA algorithm behaves the same as if it were implemented using an equivalent DFA, but it's much slower because of the overhead of tracking multiple possible states instead of just a single possible state. With pure regular languages it is possible to construct a DFA from an NFA in order to get faster matching (see the lecture notes); the disadvantage is that the DFA construction can use a lot of space.

The problem with DFAs wrt Unix (and Perl) regexes is that Unix regexes have traditionally been implemented with a backtracking NFA matcher, and people have assumed this implementation when adding features to the regex language. The result is that so-called "regular expressions" no longer match regular languages according to the strict terminology of formal language theory. However when using the alternative not-really-DFA algorithm, PCRE restricts the regex language (round brackets do not capture and back-references are not permitted) so that it can only match regular languages. In principle, PCRE could perform the DFA construction in order to make the new algorithm fast, but Philip says this is unlikely to happen any time soon because PCRE is now on the back burner.

Not all Unix regex implementations are backtracking NFAs. Lex and GNU grep use DFA matching in order to run fast; the cost of constructing the DFA is outweighed by the benefit of its faster running time over lots of data. Lex benefits from the DFA construction becuse it can keep multiple possible token types
in mind (e.g. 123L vs. 123.4 in C) with no extra state and no back-tracking; like PCRE it doesn't support capturing sub-expressions or back-refs. GNU grep is a bit more interesting, because it does support back-refs. It uses a two-stage matcher which performs an initial DFA match (with no back-refs) which is then checked for correct back-ref matching. The latter can be done much more efficiently than a general match, or completely skipped if there are no back-refs at all. Maybe a mythical future PCRE will work the same way?