Parsing Expression Grammar - part 1

There are a lot of different ways to parse a regular grammar, but my favorite is to use a Parsing Expression Grammar. This is a technology that I've used in a lot of different projects, and as this and the next couple of articles will show, they are remarkably easy to create from scratch.

I'm not a great language and grammar theoretician, so I'm not going to go into great detail about what a Parsing Expression Grammar(PEG) is, but here are a few observations.

I'm using F# for the code in this series of posts, I'm not a completely fluent F# programmer so there may be some clumsy code in places, but it is a very succinct and clean language for this type of work.

A very simple introduction to PEG

A very minimalist implementation of a PEG only requires a few elements, I'll add some more later, both for completeness and convenience; a PEG expression is comprised of:

First Implementation

While grammars are usually stored as text files, I'm going to start with the in memory representation of the PEG expressions.

type Expression =
| Terminal of string
| NonTerminal of string
| Epsilon
| Sequence of Expression list
| Choice of Expression list
| ZeroOrMore of Expression
| OneOrMore of Expression
| Optional of Expression
| And of Expression
| Not of Expression

This is a straight-forward translation of the description of the elements into an F# discriminated union type, before I dive into how to process this, I'll show some code that will print a PEG expression in a human readable form.

let printPeg expr =
    let rec pegToString = function
    | Terminal x -> "\"" + x + "\""
    | NonTerminal x -> x
    | Epsilon -> "<epsilon>"
    | Sequence x -> List.map pegToString >> List.reduce (fun a b -> a + " " + b) <| x
    | Choice x -> List.map pegToString >> List.reduce (fun a b -> a + " / " + b) <| x
    | ZeroOrMore x -> pegToString x + "*"
    | OneOrMore x -> pegToString x + "+"
    | Optional x -> pegToString x + "?"
    | And x -> "&" + pegToString x
    | Not x -> "!" + pegToString x

There is nothing particularly complex about this, the Sequence and Choice patterns use the forward function combination operator >> to pipeline converting the list of expressions into a list of strings (map), and then to join the elements using the appropriate separator (reduce).

To check that this works as expected I'll try a few expressions in F# interactive.

> printPeg <| Terminal "foo";;
"foo"
val it : unit = ()
> printPeg <| NonTerminal "foo";;
foo
val it : unit = ()
> printPeg <| Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "*"; NonTerminal "product"])];;
atom "*" product?
val it : unit = ()

This doesn't quite work how I want it to, in the last expression it isn't clear what is optional, so I need to make printPeg a touch more complex…

let rec pegToString expr =
    let pegToStringParen = function
    | Terminal _ | NonTerminal _ | Epsilon as x -> pegToString x
    | x -> "(" + pegToString x + ")"

    match expr with
    | Terminal x -> "\"" + x + "\""
    | NonTerminal x -> x
    | Epsilon -> "<epsilon>"
    | Sequence x -> List.map pegToStringParen >> List.reduce (fun a b -> a + " " + b) <| x
    | Choice x -> List.map pegToStringParen >> List.reduce (fun a b -> a + " / " + b) <| x
    | ZeroOrMore x -> pegToStringParen x + "*"
    | OneOrMore x -> pegToStringParen x + "+"
    | Optional x -> pegToStringParen x + "?"
    | And x -> "&" + pegToStringParen x
    | Not x -> "!" + pegToStringParen x


let printPeg expr =
    printfn "%s" <| pegToString expr

This now perhaps errs on the the cautious side, but adds parenthesis where they are needed…

> printPeg <| Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "*"; NonTerminal "product"])];;
atom (("*" product)?)
val it : unit = ()

When text is parsed against a grammar, there needs to be a internal representation of the parsed data, in part 2 when I'll add functions to grammar rules, this representation is the input to those functions.

type ParseResult =
| TerminalSymbol of string
| Production of ParseResult list
| EmptyMatch
| Unmatched

Taking the parse function in parts.

First up the atomic expressions, these are all pretty straightforward.

let parseExpression (grammar:Map<string,Expression>) start (input:string) =
    let rec parse offset = function
    | Terminal x ->
        let e = offset + x.Length - 1
        if e < input.Length && input.[offset..e] = x then (TerminalSymbol x, e + 1) else (Unmatched, offset)
    | NonTerminal x -> parse offset grammar.[x]
    | Epsilon -> if offset = input.Length then (EmptyMatch, offset) else (Unmatched, offset)

Moving on to Sequence and Choice

    | Sequence x ->
        let rec parseList lst off =
            seq {
                if List.isEmpty lst then ()
                else
                    match parse off <| List.head lst with
                    | (Unmatched, _) as y -> yield y
                    | (y, z) -> yield (y, z)
                                yield! parseList (List.tail lst) z
            }

        let s = List.ofSeq <| parseList x offset
        if List.exists (function | (Unmatched, _) -> true | _ -> false) s then (Unmatched, offset)
        else (Production <| List.map fst s, snd <| List.maxBy snd s)
    | Choice x ->
        let rec parseList lst =
            if List.isEmpty lst then (Unmatched, offset)
            else
                match parse offset <| List.head lst with
                | (Unmatched, _) -> parseList <| List.tail lst
                | y -> y

        parseList x

The pattern for Sequence creates an F# seq over the list of sub-expressions for the sequence, when an expression doesn't match the sequence is terminated with an Unmatched parse result; that is used as a signal to mark that the Sequence as a whole didn't match; otherwise the results from the sub-expressions are packaged into a Production parse result.

The pattern for Choice enumerates over the sub-expressions until it finds the first one that matches; that result is returned.

The rest…

    | ZeroOrMore x ->
        let rec parseList off =
            seq {
                match parse off x with
                | (Unmatched, _) -> ()
                | (y, z) -> yield (y, z)
                            yield! parseList z
            }
        let s = List.ofSeq <| parseList offset
        if List.isEmpty s then (EmptyMatch, offset)
        else (Production <| List.map fst s, snd <| List.maxBy snd s)
    | OneOrMore x ->
        let rec parseList off =
            seq {
                match parse off x with
                | (Unmatched, _) -> ()
                | (y, z) -> yield (y, z)
                            yield! parseList z
            }
        let s = List.ofSeq <| parseList offset
        if List.isEmpty s then (Unmatched, offset)
        else (Production <| List.map fst s, snd <| List.maxBy snd s)
    | Optional x ->
        match parse offset x with
        | (Unmatched, _) -> (EmptyMatch, offset)
        | y -> y
    | And x ->
        match parse offset x with
        | (Unmatched, _) -> (Unmatched, offset)
        | _ -> (EmptyMatch, offset)
    | Not x ->
        match parse offset x with
        | (Unmatched, _) -> (EmptyMatch, offset)
        | _ -> (Unmatched, offset)

    parse 0 grammar.[start]

ZeroOrMore and OneOrMore are almost the same, DRY suggests that I should break out the common code for these patterns; both collect a sequence of contiguous matches for the sub-expression, they differ only in how they handle the empty collection.

Optional, And, and Not can all be pulled onto one line (depending on how wide your monitor is...) they simply filter the response from the recursive parsing.

parseExpression is then rounded out with a call to the inner function to start the recursive parsing in the context of the grammar and input string.

Trying it out

First here's a simple grammar for boolean arithmetic.

let g = Map.ofList [("bool", Choice [Terminal "true"; Terminal "false"]);
                    ("paren", Sequence [Terminal "("; NonTerminal "expr"; Terminal ")"]);
                    ("not", Sequence [Terminal "!"; NonTerminal "expr"]);
                    ("atom", Choice [NonTerminal "bool"; NonTerminal "paren"; NonTerminal "not"]);
                    ("and", Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "&"; NonTerminal "and"])]);
                    ("or", Sequence [NonTerminal "and"; Optional (Sequence [Terminal "|"; NonTerminal "or"])]);
                    ("expr", NonTerminal "or");
                    ("start", Sequence [NonTerminal "expr"; Epsilon])]

The printPeg function still works…

> Map.fold (fun _ k v -> printfn "%s -> %s" k <| pegToString v) () g;;
and -> atom (("&" and)?)
atom -> bool / paren / not
bool -> "true" / "false"
expr -> or
not -> "!" expr
or -> and (("|" or)?)
paren -> "(" expr ")"
start -> expr <epsilon>
val it : unit = ()

Map has sorted its elements, but the printed representations are as expected.

> parseExpression g "start" "true&true|false";;
val it : ParseResult * int =
  (Production
     [Production
        [Production
           [TerminalSymbol "true";
            Production
              [TerminalSymbol "&";
               Production [TerminalSymbol "true"; EmptyMatch]]];
         Production
           [TerminalSymbol "|";
            Production
              [Production [TerminalSymbol "false"; EmptyMatch]; EmptyMatch]]];
      EmptyMatch], 15)
> parseExpression g "start" "true|true&false";;
val it : ParseResult * int =
  (Production
     [Production
        [Production [TerminalSymbol "true"; EmptyMatch];
         Production
           [TerminalSymbol "|";
            Production
              [Production
                 [TerminalSymbol "true";
                  Production
                    [TerminalSymbol "&";
                     Production [TerminalSymbol "false"; EmptyMatch]]];
               EmptyMatch]]]; EmptyMatch], 15)
>

This shows how the precedence of the operators is defined explicitly in the grammar, rather than by annotating tokens or rules.

In part 2, I'll turn this into something usable.

The code for this article is on github.