fanf

Parsing expression grammars

11th Jun 2007 | 09:56

PEGs are an interesting new formalism for grammars. Unlike most older formalisms, which are based on Chomsky's generative grammars, their starting point is recognizing the strings that are in a language, not generating them. As such they are a closer match to what we usually want a grammar for. The practical effect of this is that they naturally avoid common ambiguities without needing external disambiguation rules, such as C's dangling if/else ambiguity or the various rules about greediness imposed on regexes (e.g. perl's matching rules versus POSIX's longest-leftmost rule, discussed in Friedl's book). Even though PEGs can recognize some non-context-free languages (e.g. a^n b^n c^n) they can be matched in linear time using a packrat parser (which can be implemented very beautifully in Haskell).
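
For the curious, a PEG along the lines of the one in Ford's paper recognises a^n b^n c^n (n >= 1) by using lookahead to check the a/b balance before consuming anything:

    S <- &(A 'c') 'a'+ B !.
    A <- 'a' A? 'b'
    B <- 'b' B? 'c'

Here A matches a^k b^k, B matches b^m c^m, and !. asserts end of input, so the two counts are forced to agree.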

Bryan Ford's 2004 POPL paper establishes the formal foundations of PEGs and defines a concrete syntax for them, fairly similar to ABNF. The key differences are: the choice operator is ordered (prefers to match its left-hand sub-expression); repetition operators are maximally greedy and don't backtrack (so the second a in a*a can never match); and it includes positive and negative lookahead assertions of the form &a and !a (like (?=a) and (?!a) in perl).
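
For instance (my own toy fragment, not from the paper), the dangling-else case mentioned above comes out unambiguous, because the optional suffix is greedy and never gives the "else" back to an enclosing "if":

    Stmt <- If / Other
    If   <- "if" Cond Stmt ("else" Stmt)?

In "if C1 if C2 S1 else S2" the else necessarily attaches to the inner if. (Cond and Other are placeholders.)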

It occurs to me that there is a useful analogy hidden in here, that would be made more obvious with a little change in syntax. Instead of a / b write a || b, and instead of &a b write a && b. With || and && I am referring to C's short-circuiting "orelse" and "andalso" boolean operators - or rather the more liberal versions that can return more than just a boolean, since a PEG returns the amount of input matched when it succeeds. This immediately suggests some new identities on grammars based on De Morgan's laws, e.g. !a || !b === !(a && b). Note that !!a =/= a because the former never consumes any input, so not all of De Morgan's laws work with PEGs. This also suggests how to choose the operators to overload when writing a PEG parser combinator library for C++ (which has a much wider range of possibilities than Lua).
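
To make the analogy concrete, here is a minimal sketch in Haskell of PEG recognisers with those operators (my own throwaway names, no memoization - a parser just returns the remaining input on success):

    import Data.List (stripPrefix)

    type P = String -> Maybe String     -- Just = remaining input on success

    lit :: String -> P                  -- match a literal string
    lit = stripPrefix

    orelse :: P -> P -> P               -- ordered choice a / b, i.e. a || b
    orelse a b = \s -> maybe (b s) Just (a s)

    andalso :: P -> P -> P              -- lookahead &a b, i.e. a && b
    andalso a b = \s -> case a s of
                          Nothing -> Nothing
                          Just _  -> b s      -- b restarts from s:
                                              -- a's match is not consumed

    neg :: P -> P                       -- !a: succeeds iff a fails, consumes nothing
    neg a = \s -> case a s of
                    Nothing -> Just s
                    Just _  -> Nothing

Note that neg (neg a) succeeds exactly when a does but always leaves the input untouched, which is the !!a =/= a point above.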


Comments {11}

(Deleted comment)

Tony Finch

from: fanf
date: 14th Jun 2007 14:47 (UTC)

BTW I post lots of links to stuff like this on dotaturls


Just a random swede

from: vatine
date: 11th Jun 2007 10:32 (UTC)

Eeeeep! I just spotted that CL-peg uses my genhash library. I Am Officially Surprised.


Simon Tatham

from: simont
date: 11th Jun 2007 13:54 (UTC)

Hmm. It's certainly an interesting thought, and I like the way it permits unification of the lexical and syntactic layers.

It does seem to me, though, that this heads directly away from what I actually in practice use formal grammars for. My usual process of language design involves writing down a formal grammar for the language, running it through yacc to check that it doesn't contain any non-obvious ambiguities, and then writing a recursive descent parser by hand (because I've never found yacc to give me decent parse-error reporting). Whereas PEGs' prioritised choice operator means that ambiguities are expected, which is very useful if you know there's an ambiguity and want to specify how it should be resolved, but rather less helpful if you wanted to be told about any ambiguities you'd missed.

The author has noticed this, of course, and section 5 mentions the possibility of introducing an unprioritised choice operator as well. I suspect I'd end up using that almost all the time, resorting to prioritised choice only to resolve an ambiguity I'd convinced myself was necessary.

Now to go and read up on the packrat parsing algorithm and see whether it looks as if it can be persuaded to do decent error reporting...


gareth_rees

from: gareth_rees
date: 11th Jun 2007 14:00 (UTC)

It's a long time since I've used it, but what I recall is that if you want decent error reporting from yacc you have to write your own error productions. Tedious, but not more so than writing your own parser.


Tony Finch

from: fanf
date: 11th Jun 2007 14:10 (UTC)

What you're effectively using yacc for is a grammar lint which checks that your language is in a particular subset of the CFLs. I agree that for practical purposes, PEGs need a linter to help with grammar development. (The same goes for perl6 rules, which are even more powerful.) Fortunately there's already Sylvain Schmitz's paper, Modular syntax demands verification.

Robert Grimm's Java packrat implementation includes a number of optimizations for its imperative setting, plus some practical help for error handling. PEG parsers are supposed to be more helpful than LALR parsers since the structure of the parser closely follows the structure of the grammar, and lookahead assertions mean that errors can be detected closer to the relevant place.


Peter Maydell

from: pm215
date: 11th Jun 2007 20:51 (UTC)

Is there any discussion of memory requirements for PEG parsers anywhere? (both in terms of code/data table size and any runtime memory allocation). The remarks in the wikipedia article about memoization suggest that it's likely to be quite space-hungry.

(The last parser I dealt with was for an application where memory usage was a big deal; and there's been a lot of work put into the problem of minimising the data table size for LALR parsers...)


Tony Finch

from: fanf
date: 12th Jun 2007 09:13 (UTC)

You should be able to parse one top-level syntactic element at a time, so that the parser's theoretical O(n) space usage is limited by that bound rather than by the size of the entire file. Within a definition your parser state consists of the consumed input (which should be discardable), the partial parse tree, the memoized lookahead state, and the unconsumed input. The lookahead state depends on the structure of the grammar, and in practice (since most grammars are designed to avoid unlimited lookahead) it is pretty small. A one- or two-entry LRU cache per grammar rule gives you most of the performance of full memoization at much lower cost.
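
A rough sketch of that small-cache idea in Haskell, with a made-up IO-flavoured parser type purely for illustration:

    import Data.IORef

    type Pos  = Int
    type Rule = Pos -> IO (Maybe Pos)   -- Just p = matched, consumed up to p

    -- Wrap a rule with a one-entry cache: remember only the most recent
    -- (position, result) pair instead of memoizing every position.
    cache1 :: Rule -> IO Rule
    cache1 rule = do
        ref <- newIORef Nothing
        return $ \pos -> do
            memo <- readIORef ref
            case memo of
                Just (p, r) | p == pos -> return r        -- hit: reuse
                _ -> do r <- rule pos                     -- miss: run the rule
                        writeIORef ref (Just (pos, r))    -- and remember it
                        return r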


reference for a statement

from: anonymous
date: 16th Sep 2008 00:40 (UTC)

Hi there,

do you have a reference for your statement "Even though PEGs can recognize some non-context-free languages (e.g. a^nb^nc^n)"?

Thanks
Stefan


Tony Finch

Re: reference for a statement

from: fanf
date: 16th Sep 2008 10:49 (UTC)

See Bryan Ford's paper linked above. Section 3.4 "Language Properties" includes the theorem "the class of Parsing Expression Languages includes non-context-free languages", which is proved by showing a grammar for a^n b^n c^n.


Re: reference for a statement

from: anonymous
date: 16th Sep 2008 14:48 (UTC)

Thanks very much for your quick response :o)
Have a great day.
Stefan
