Log in

No account? Create an account


The joy of lpeg

« previous entry | next entry »
21st Feb 2009 | 00:28

I've recently started playing with lpeg, a parsing library for Lua. It is based on "Parsing Expression Grammars", which were recently popularized by the prolific Bryan Ford. PEGs have some nice properties: they're suitable for unified parsers that handle both the low-level lexical syntax as well as higher-level hierarchial syntax; they have much simpler operational semantics than traditional extended regexes or context-free grammars; and as well as familiar regex-like and CFG-like operators they have nice features for controlling lookahead and backtracking. PEGs were originally developed alongside a cute algorithm for linear-time parsing which unfortunately also requires space linear in the input size with a fairly large multiplier. Lpeg instead uses a simple parsing machine, implemented somewhat like a bytecode interpreter. Its performance is quite competitive: the long paper says it has similar performance to pcre and glibc's POSIX regex implementation, and the short paper says it has similar performance to lex and yacc.

Lpeg actually consists of two modules. The core lpeg module is written in C and allows you to compose parser objects using operator overloading, building them up from primitives returned from tersely named constructor functions. The resulting syntax is rather eccentric. On top of that is the re module which provides a more normal PEG syntax for parsers, which despite the name of the module are rather different from regular expressions. This module is written in Lua, using an lpeg parser to parse PEGs and construct lpeg parsers from them. The PEG syntax is extended so that you can define "captures". Captures are the really nice thing about lpeg. You can use them like captures in Perl regexes to just extract substrings of the subject, but you can often do better. Lpeg captures are more like the semantic actions that you can attach to rules in parser generators like yacc. So, where in Perl you would do the match then fiddle around with $1, $2, etc, with lpeg the match can incorporate the fiddling in a nice tidy way. (In fact, probably the closest comparison is with Perl 6 rules, but they're not yet practically usable.)

The program I was writing with lpeg was to process some logs. I needed to convert the timestamps from ISO 8601 format into POSIX time_t which implied converting 8 fields from strings to numbers. Rather than having to convert each capture individually, or loop over the captures, I could write a single grammar rule to match a pair of digits and convert it to a number, then refer to that rule elsewhere in the grammar. (In fact Lua will coerce strings to numbers implicitly in most - but not all - circumstances. I happened to be tripped up trying to compare a number with a string, which doesn't coerce.) In the end it's nicest to let the parser drive all the program's activity through its semantic actions.

In the following, [[...]] delimits a "long string" containing the PEG grammar. {...} in the grammar denotes a capture, whereas (...) is for non-capturing groups. pat -> fun passes the captures of a pattern to a function. The second argument to compile is a table where the keys are the function names referred to in the parser. The main entry point to the rest of the program is process, whose definition I have left out.

    local parser = re.compile([[
        line <- ( <stamp> ' ' {%a*} ' ' <count> !. ) -> process
        stamp <- ( <date> ' ' <time> ' ' <zone>   ) -> tostamp
        date  <- ( <num><num> '-' <num> '-' <num> ) -> todate
        time  <- (      <num> ':' <num> ':' <num> ) -> totime
        zone  <- ( {[+-]} <num> <num>             ) -> tozone
        count <- {'@'*} -> tocount
        num   <- {%d^2} -> tonumber
    ]], {
        process = process,
        tonumber = tonumber,
        tocount = function (s) return #s end,
        todate = function (c,y,m,d)
            if m > 2 then m = m + 1;  y = c*100 + y
                     else m = m + 13; y = c*100 + y - 1 end
            return int(y*1461/4) - int(y/100) + int(y/400)
                 + int(m*153/5) + d - 719591
        totime = function (H,M,S)
            return H*3600 + M*60 + S
        tozone = function (zs,zh,zm)
            local z = zh*3600 - zm*60
            if zs == "-" then return -z
                         else return  z  end
        tostamp = function (date,time,zone)
            return date*86400 + time - zone

    for line in io.stdin:lines() do
        if not parser:match(line) then
            io.stdout:write("skipping "..line.."\n")

| Leave a comment | Share

Comments {0}