Tony Finch - NFA? DFA?

dotatfanf wrote on 15th June 2005 at 14:39

NFA? DFA?

My esteemed colleague Philip Hazel recently released pcre-6.0, the latest version of his Perl-compatible regular expression library. The major new feature is an alternative matching algorithm which does not back-track. Unfortunately, Philip has documented this using incorrect terminology from Jeffrey Friedl's otherwise excellent book, Mastering Regular Expressions. This is my rant about Jeffrey Friedl's description of the difference between DFA and NFA, i.e. deterministic or non-deterministic finite automata.

In the correct non-Friedl terminology, DFA and NFA refer to the construction of the FA, not the matching algorithm. When matching a subject string against a DFA, each character determines a unique new automaton state, so you can implement the match with just two variables: a pointer into the string and a pointer into the DFA. In an NFA, each character can determine multiple subsequent automaton states, so to implement the match you need to trace the multiple possible routes through the NFA. Traditionally this has been done with a backtracking depth-first search, in which you keep a history of your state at each decision point so that you can go back and try the alternatives if a match fails. The new PCRE algorithm effectively performs a breadth-first search, by processing the input string a character at a time and keeping track of all the possible NFA states in parallel. (See http://www.cl.cam.ac.uk/Teaching/current/RLFA/ and in particular section 2.2 of the lecture notes for more details.)
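
To make the contrast concrete, here is a minimal, self-contained sketch in C (not PCRE's code; the pattern "ab|ac" and the hand-built tables are purely illustrative). The DFA loop needs only a state variable and a string pointer; the breadth-first NFA loop carries a whole set of possible states forward on each character and never backtracks.

/* Sketch only: a DFA matcher and a breadth-first NFA matcher for the
 * toy pattern "ab|ac", both anchored to the whole string. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

enum { NSTATES = 4 };

/* DFA for "ab|ac": 0 --a--> 1, 1 --b/c--> 2; state 2 accepts. */
static int dfa_step(int state, char ch)
{
    if (state == 0 && ch == 'a') return 1;
    if (state == 1 && (ch == 'b' || ch == 'c')) return 2;
    return -1;                              /* dead end: match has failed */
}

static bool dfa_match(const char *s)
{
    int state = 0;                          /* just two variables: s and state */
    for (; *s; s++)
        if ((state = dfa_step(state, *s)) < 0)
            return false;
    return state == 2;
}

/* NFA for "ab|ac": 0 --a--> {1,2}, 1 --b--> {3}, 2 --c--> {3}; 3 accepts. */
static void nfa_step(const bool cur[NSTATES], char ch, bool next[NSTATES])
{
    memset(next, 0, NSTATES * sizeof *next);
    if (cur[0] && ch == 'a') { next[1] = true; next[2] = true; }
    if (cur[1] && ch == 'b') next[3] = true;
    if (cur[2] && ch == 'c') next[3] = true;
}

static bool nfa_match(const char *s)
{
    bool cur[NSTATES] = { true };           /* start in state 0 only */
    for (; *s; s++) {
        bool next[NSTATES];
        nfa_step(cur, *s, next);            /* advance every live state at once */
        memcpy(cur, next, sizeof cur);
    }
    return cur[3];                          /* did any route reach the accept state? */
}

int main(void)
{
    printf("%d %d\n", dfa_match("ab"), nfa_match("ac"));   /* 1 1 */
    printf("%d %d\n", dfa_match("ad"), nfa_match("abx"));  /* 0 0 */
    return 0;
}

On a toy pattern the difference is invisible, but the DFA loop does constant work per character, whereas the NFA loop's work per character grows with the number of states that are simultaneously live.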

Friedl only talks about the two NFA algorithms, referring to the first as an NFA implementation (fine) and the second as a DFA implementation (no! wrong! totally wrong! where'd he learn this?). His confusion probably comes from the fact that the alternative NFA algorithm behaves the same as if it were implemented using an equivalent DFA, but it's much slower because of the overhead of tracking multiple possible states instead of just a single possible state. With pure regular languages it is possible to construct a DFA from an NFA in order to get faster matching (see the lecture notes); the disadvantage is that the DFA construction can use a lot of space.
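
For concreteness, here is a rough sketch of that construction applied to the same toy NFA as in the sketch above (it has no epsilon transitions, so the usual epsilon-closure step is omitted). Each DFA state is a set of NFA states, packed here into a bitmask, which is exactly where the worst-case exponential space cost comes from.

/* Sketch of the subset construction on the toy NFA for "ab|ac". */
#include <stdio.h>

enum { NSTATES = 4, NCHARS = 256 };

/* nfa_delta(q, ch) returns the bitmask of NFA successors of state q on ch. */
static unsigned nfa_delta(int q, int ch)
{
    if (q == 0 && ch == 'a') return (1u << 1) | (1u << 2);
    if (q == 1 && ch == 'b') return 1u << 3;
    if (q == 2 && ch == 'c') return 1u << 3;
    return 0;
}

int main(void)
{
    unsigned seen[1u << NSTATES] = { 0 };    /* which state-sets we have met */
    unsigned work[1u << NSTATES];            /* worklist of unexplored DFA states */
    int head = 0, tail = 0;

    unsigned start = 1u << 0;                /* the set {0}: NFA start state */
    seen[start] = 1;
    work[tail++] = start;

    while (head < tail) {
        unsigned set = work[head++];
        for (int ch = 0; ch < NCHARS; ch++) {
            unsigned next = 0;
            for (int q = 0; q < NSTATES; q++)
                if (set & (1u << q))
                    next |= nfa_delta(q, ch);
            if (next == 0)
                continue;                    /* dead transition: omit it */
            printf("{%#x} --%c--> {%#x}\n", set, ch, next);
            if (!seen[next]) {               /* a new DFA state discovered */
                seen[next] = 1;
                work[tail++] = next;
            }
        }
    }
    return 0;
}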

The problem with DFAs wrt Unix (and Perl) regexes is that Unix regexes have traditionally been implemented with a backtracking NFA matcher, and people have assumed this implementation when adding features to the regex language. The result is that so-called "regular expressions" no longer match regular languages according to the strict terminology of formal language theory. However when using the alternative not-really-DFA algorithm, PCRE restricts the regex language (round brackets do not capture and back-references are not permitted) so that it can only match regular languages. In principle, PCRE could perform the DFA construction in order to make the new algorithm fast, but Philip says this is unlikely to happen any time soon because PCRE is now on the back burner.

Not all Unix regex implementations are backtracking NFAs. Lex and GNU grep use DFA matching in order to run fast; the cost of constructing the DFA is outweighed by the benefit of its faster running time over lots of data. Lex benefits from the DFA construction because it can keep multiple possible token types in mind (e.g. 123L vs. 123.4 in C) with no extra state and no back-tracking; like PCRE it doesn't support capturing sub-expressions or back-refs. GNU grep is a bit more interesting, because it does support back-refs. It uses a two-stage matcher which performs an initial DFA match (with no back-refs) which is then checked for correct back-ref matching. The latter can be done much more efficiently than a general match, or skipped completely if there are no back-refs at all. Maybe a mythical future PCRE will work the same way?
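
To illustrate the two-stage idea (this is emphatically not GNU grep's code, just a toy), take the pattern "(cat|dog)\1" anchored to a whole six-character line: stage one matches the relaxed, back-ref-free pattern "(cat|dog)(cat|dog)", which a DFA can handle, and stage two verifies the back-reference only on the lines that survive stage one.

/* Toy two-stage matcher for "(cat|dog)\1" anchored to the whole line. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

static bool dfa_superset_match(const char *s)
{
    /* relaxed pattern (cat|dog)(cat|dog): back-ref dropped, DFA-friendly */
    if (strlen(s) != 6)
        return false;
    return (memcmp(s, "cat", 3) == 0 || memcmp(s, "dog", 3) == 0) &&
           (memcmp(s + 3, "cat", 3) == 0 || memcmp(s + 3, "dog", 3) == 0);
}

static bool exact_match(const char *s)
{
    /* full pattern (cat|dog)\1: the back-reference forces the halves to be equal */
    return dfa_superset_match(s) && memcmp(s, s + 3, 3) == 0;
}

static bool grep_line(const char *s)
{
    if (!dfa_superset_match(s))
        return false;               /* cheap rejection: the common case */
    return exact_match(s);          /* rare case: verify the back-reference */
}

int main(void)
{
    printf("%d %d %d\n",
           grep_line("catcat"),     /* 1: passes both stages       */
           grep_line("catdog"),     /* 0: stage-one false positive */
           grep_line("fishes"));    /* 0: rejected by stage one    */
    return 0;
}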

From: compilerbitch
Date: 2005-06-15 18:20 (UTC)
Interesting...

A couple of years ago I wrote a compiler that generated natural language parsers: it went from a source language that looked a bit like a mixture of lex, yacc and maybe bash, and generated .NET code directly. It still exists and still works, but is probably only really interesting due to its lexical analyser generation. It effectively generates three lexical analysers: an 'ignorer' that is implemented as a DFA and is used to skip bits of irrelevant input, a tokeniser that is also implemented as a DFA and has the sole responsibility of chopping the input into words, and a 'recogniser', which is basically an interpreted NFA that is responsible for figuring out which word is which. Words can have sets of matches, so it is quite possible for multiple regexes to all correctly match the same word (this isn't a conflict, because they are disambiguated by the parser later).

The ignorer and tokeniser are classic, lex-style table-driven DFAs, so are extremely fast. I found that, in practice, natural language grammars tended to contain lexeme definitions that stomp on each other badly enough that generating a DFA is effectively impossible (Thompson (no relation!) subset construction takes exponentially long, and the size of the resulting table also tends to be impractically enormous). I therefore came up with a four-instruction virtual machine that effectively splits itself into multiple threads of execution in order to deal with the nondeterminism of the underlying NFA. Threads either die when they fail to find a possible outcome at a particular input character, or succeed and add a match to the set of outcomes. Execution continues until no threads remain, at which point the word is recognised. In practice, this seems to result in pretty quick parsers (which have significantly outperformed hand-coded ad-hoc parsers written in C++ compiled to native code), and the four-instruction VM code seems to be pretty compact too (and, usefully, grows linearly with the size of the underlying regex(es)). There is no backtracking whatsoever. I've not written this up or published it. Maybe I should...
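
The comment doesn't spell out the instruction set, but a well-known way to get exactly this behaviour (in the style of Thompson's construction) uses four instructions, CHAR, SPLIT, JMP and MATCH, with 'threads' being nothing more than program counters advanced in lock-step over the input. A minimal sketch, with the program for "ab*c" hand-assembled (it assumes the compiled program has no epsilon cycles):

/* Sketch of a threaded, backtracking-free regex VM with four instructions. */
#include <stdbool.h>
#include <stdio.h>

enum opcode { OP_CHAR, OP_SPLIT, OP_JMP, OP_MATCH };

struct insn {
    enum opcode op;
    int ch;             /* OP_CHAR: literal byte to consume          */
    int x, y;           /* OP_SPLIT: both targets; OP_JMP: target x  */
};

enum { MAXTHREADS = 64 };

/* Add a thread at pc, following SPLIT/JMP without consuming input. */
static void add_thread(const struct insn *prog, int pc, int *list, int *n)
{
    for (int i = 0; i < *n; i++)
        if (list[i] == pc) return;          /* already queued: merge duplicates */
    switch (prog[pc].op) {
    case OP_JMP:
        add_thread(prog, prog[pc].x, list, n);
        return;
    case OP_SPLIT:                          /* fork: pursue both branches */
        add_thread(prog, prog[pc].x, list, n);
        add_thread(prog, prog[pc].y, list, n);
        return;
    default:
        list[(*n)++] = pc;
    }
}

static bool vm_match(const struct insn *prog, const char *s)
{
    int cur[MAXTHREADS], next[MAXTHREADS], ncur = 0, nnext;
    add_thread(prog, 0, cur, &ncur);
    for (;; s++) {
        nnext = 0;
        for (int i = 0; i < ncur; i++) {
            const struct insn *in = &prog[cur[i]];
            if (in->op == OP_MATCH)
                return true;                /* some thread succeeded */
            if (in->op == OP_CHAR && *s == in->ch)
                add_thread(prog, cur[i] + 1, next, &nnext);
            /* otherwise this thread simply dies: no backtracking */
        }
        if (*s == '\0' || nnext == 0)
            return false;
        for (int i = 0; i < nnext; i++) cur[i] = next[i];
        ncur = nnext;
    }
}

int main(void)
{
    /* program for "ab*c": 0 CHAR a; 1 SPLIT 2,4; 2 CHAR b; 3 JMP 1;
       4 CHAR c; 5 MATCH */
    const struct insn prog[] = {
        { OP_CHAR,  'a', 0, 0 }, { OP_SPLIT, 0, 2, 4 },
        { OP_CHAR,  'b', 0, 0 }, { OP_JMP,   0, 1, 0 },
        { OP_CHAR,  'c', 0, 0 }, { OP_MATCH, 0, 0, 0 },
    };
    printf("%d %d\n", vm_match(prog, "abbbc"), vm_match(prog, "abx")); /* 1 0 */
    return 0;
}

Because duplicate program counters are merged as threads are added, the number of live threads is bounded by the program size, which is why there is no exponential blow-up and no backtracking.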
From: fanf
Date: 2005-06-15 18:36 (UTC)
I've just remembered the last time I was thinking about regex implementations, which was in the context of making SpamAssassin run faster. Its core is a pile of regexes which need to be matched against the subject message, but unlike a tokenizer the matches can overlap. At the moment it just iteratively tries each regex against the message using Perl's backtracking NFA engine. I wondered vaguely if the regexes could plausibly be compiled to a DFA (without too much state table explosion) to get better performance, but I didn't look at it in any detail because other things were more important.
From: compilerbitch
Date: 2005-06-16 11:21 (UTC)
If the regexes overlap as much as you suggest, it is likely that state table blowup might be quite extreme, so you may be stuck with NFAs. My suspicion is that mapping multiple regexes to a single NFA can still offer very significant speedups, however, due to the possibilities of optimisations that tend to fall out. That, and you only typically need to scan the input once whilst maintaining a (usually small, cache-friendly) list of current states.
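
As a toy sketch of that single-pass idea (nothing to do with SpamAssassin's real rules; the two literal patterns and the hand-built machine are purely illustrative), two patterns are folded into one automaton and the subject is scanned once, unanchored, with the live-state set recording which patterns matched:

/* One scan over the subject, matching two patterns at once. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Combined NFA for the literals "abc" (pattern 0) and "abd" (pattern 1),
   matched anywhere in the subject.  States:
     0 start, 1 "a" seen, 2 "ab" seen, 3 accept "abc", 4 accept "abd" */
enum { NSTATES = 5 };

static void step(const bool cur[NSTATES], char ch, bool next[NSTATES])
{
    memset(next, 0, NSTATES * sizeof *next);
    next[0] = true;                       /* unanchored: can restart anywhere */
    if (cur[0] && ch == 'a') next[1] = true;
    if (cur[1] && ch == 'b') next[2] = true;
    if (cur[2] && ch == 'c') next[3] = true;
    if (cur[2] && ch == 'd') next[4] = true;
}

static void scan(const char *subject, bool hits[2])
{
    bool cur[NSTATES] = { true };
    hits[0] = hits[1] = false;
    for (; *subject; subject++) {
        bool next[NSTATES];
        step(cur, *subject, next);
        if (next[3]) hits[0] = true;      /* "abc" matched at this point */
        if (next[4]) hits[1] = true;      /* "abd" matched at this point */
        memcpy(cur, next, sizeof cur);
    }
}

int main(void)
{
    bool hits[2];
    scan("xxabcabdxx", hits);             /* both patterns found in one pass */
    printf("abc:%d abd:%d\n", hits[0], hits[1]);   /* abc:1 abd:1 */
    return 0;
}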
From: vrann
Date: 2006-06-15 06:00 (UTC)
Can you say what the NFA for matching back-references in PCRE looks like?
From: fanf
Date: 2006-06-15 09:20 (UTC)
No, but the documentation can.

