Parsing Expression Grammar - part 3

The PEG parser at the end of part2 will parse, and call out to code when a rule is parsed, but isn't particularly easy to use, most particularly because the grammar has to be expressed as F# objects. In this post I'll extend the parser to be able to parse PEG from a text file, and to generate F# code for a parser described by that PEG.

There are only two (big) things missing from the code needed to do this, a parser for PEG — which I will write as (you guessed it) a PEG, and the code generator for the parser.

Parsing PEG

Parsing the PEG requires a complete grammar.

hexDigit          <- [0123456789abcdefABCDEF]
escapedCharacter  <- '\\' ([abfnrtv0\\"'[\]] / ('u' hexDigit hexDigit hexDigit hexDigit))
safeCharacter     <- escapedCharacter / (!["\\\n] <anychar>)
oneofCharacter    <- escapedCharacter / (![\\\]] <anychar>)
terminalUnicode   <- '{' ( "Lu" / "Ll" / "Lt" / "Lm" / "Lo" /  "Mn" / "Mc" / "Me" / "Nd" / "Nl" / "No" / "Pc" / "Pd" /
                           "Ps" / "Pe" / "Pi" / "Pf" / "Po" / "Sm" / "Sc" / "Sk" / "So" / "Zs" / "Zl" / "Zp" / "Cc" /
                           "Cf" / "Cs" / "Co" / "Cn" ) '}'
terminalOneOf     <- '[' oneofCharacter+ ']'
terminalCharacter <- '\'' safeCharacter '\''
terminalWord      <- '\"' safeCharacter+ '\"'
terminal          <- terminalWord / terminalCharacter / terminalOneOf / terminalUnicode / "<anychar>" / "<epsilon>"
space             <- [ \t\r\n]+
name              <- ({Lu} / {Ll} / '-') ({Lu} / {Ll} / {Nd} / [_-])*
nonterminal       <- name
atom              <- terminal / nonterminal / ('(' space? expr space? ')')
unary             <- ([!&] unary) / (atom [*+?])?
sequence          <- unary (space unary)*
choice            <- sequence (space? '/' space? sequence)*
expr              <- choice
codechar          <- (![{}\] <anychar>) / ('\\' [{}\\]) / ('{' codechar* '}')
codeblock         <- '{' codechar* '}'
rule              <- name space? "<-" space? expr space? codeblock? space? <epsilon>

Parser changes

I have added a few new terminal types to the Expression type defined in part 1, these allow a grammar to specify a terminal symbol match on one of a selection of characters, to match a character (strictly a UTF-16 character) based on a Unicode category, and to match any character, These make specifying real world grammars (including that for PEG itself) very much easier.

type Expression =
| Terminal of string
| TerminalOneOf of string
| TerminalUnicode of UnicodeCategory
| TerminalWildcard
| NonTerminal of string
| Epsilon
| Sequence of Expression list
| Choice of Expression list
| ZeroOrMore of Expression
| OneOrMore of Expression
| Optional of Expression
| And of Expression
| Not of Expression
| Rule of string * Expression * string

Because I have added terminal types that operate on a per character basis, I've added a simple utility method that gets the next single character from the input, if any. This is this only place in the code where surrogate pairs are specifically handled.

let currentCharacter (s:string) i =
    if i >= s.Length then None
    else if Char.IsSurrogatePair(s, i) then Some s.[i..i+1] else Some s.[i..i]

I have extended parseExpression in the obvious way to parse the new terminal expressions.

let parseExpression (grammar:Map<string,GrammarRule<'a>>) start (input:string) =
    let rec parse offset = function
    | Terminal x ->
        let e = offset + x.Length - 1
        if e < input.Length && input.[offset..e] = x then (TerminalSymbol x, e + 1) else (Unmatched, offset)
    | TerminalOneOf x -> match currentCharacter input offset with
                         | Some c -> if x.Contains(c) then (TerminalSymbol c, offset + c.Length) else (Unmatched, offset)
                         | None -> (Unmatched, offset)
    | TerminalWildcard -> match currentCharacter input offset with
                          | Some c -> (TerminalSymbol c, offset + c.Length)
                          | None -> (Unmatched, offset)
    | TerminalUnicode x -> match currentCharacter input offset with
                           | Some c -> if Char.GetUnicodeCategory(c, 0) = x then (TerminalSymbol c, offset + c.Length) else (Unmatched, offset)
                           | None -> (Unmatched, offset)
    | NonTerminal x ->
        let rule = grammar.[x]
// .
// .

Parsing PEG

I now have enough pieces to be able to create a PEG parser for PEG, this is in the file pegOfPeg.fs, an example of one of the grammar rules is…

    // unary <- ([!&] unary) / (atom [*+?])?
    let parseUnary _ = function
    | Production [TerminalSymbol "!"; Parsed x] -> Parsed <| Not x
    | Production [TerminalSymbol "&"; Parsed x] -> Parsed <| And x
    | Production [Parsed x; TerminalSymbol "*"] -> Parsed <| ZeroOrMore x
    | Production [Parsed x; TerminalSymbol "+"] -> Parsed <| OneOrMore x
    | Production [Parsed x; TerminalSymbol "?"] -> Parsed <| Optional x
    | Production [Parsed _ as x; EmptyMatch] -> x
    | x -> unexpected x
    let unaryRule = GrammarRule<Expression>(Choice [Sequence [TerminalOneOf "!&"; NonTerminal "unary"];
                                                    Sequence [NonTerminal "atom"; Optional(TerminalOneOf "*+?")]], parseUnary)

The rest of the files is much the same, with a parse method for each rule in the grammar and a corresponding F# declaration.

Code Generation

The main method for code generation is a recursive function which matches an Expression argument. For the simplest expression types, those that don't require recursively parsing one or more expressions, the code generation consists of writing code that calls out to the function that performs the parsing for that terminal type (for the terminals), matching the end of the input (for the <epsilon> rule), and calls a rule matching function (whose generation I will show below) - for NonTerminal rules. The functions ibprintf and ibprintfn are used to print an indented string to a StringBuilder.

let rec codeGen (b:StringBuilder) i = function
| Terminal x -> ibprintf b i """matchTerminal "%s" input offset""" <| escape x
| TerminalOneOf x -> ibprintf b i """matchTerminalOneOf "%s" input offset""" <| escape x
| TerminalWildcard -> ibprintf b i """matchTerminalWildcard input offset"""
| TerminalUnicode x -> ibprintf b i """matchTerminalUnicode System.Globalization.UnicodeCategory.%s input offset""" <| string(x)
| Epsilon -> ibprintf b i """if offset = input.Length then (EmptyMatch, offset) else (Unmatched, offset)"""
| NonTerminal x -> ibprintf b i """matchRule%s input offset""" <| capitalIdentifier x

To match Sequence and Choice expressions, the code generator generates a list of lambdas, each of which is recursively generated by the codeGen function, one for each sub-expression of the Sequence or Choice. This list of lambdas is then past to the utility function matchSequence or matchChoice as appropriate.

| Sequence x | Choice x as y ->
    let fn = match y with | Sequence _ -> "Sequence" | Choice _ -> "Choice" | _ -> failwith "Internal Error"
    let processItem index item =
        ibprintfn b 0 ""
        ibprintfn b (i + 4) "(fun offset -> "
        codeGen b (i + 8) item
        ibprintf b 0 ");"
    ibprintf b i "let l = ["
    List.iteri processItem x
    ibprintf b 0 "] in match%s input offset l" fn

For the expressions which wrap a single expression (Optional, ZeroOrMore, OneOrMore, And, and Not) a similar approach generates a lambda for the inner expression and passes it to the appropriate utility function.

| ZeroOrMore x | OneOrMore x | Optional x | And x | Not x as y ->
    let fn = match y with | ZeroOrMore _ -> "ZeroOrMore" | OneOrMore _ -> "OneOrMore" | Optional _ -> "Optional" | And _ -> "And" | Not _ -> "Not" | _ -> failwith "Internal Error"
    ibprintfn b i "match%s input offset (fun offset -> " fn
    codeGen b (i + 8) x
    ibprintf b 0 ");"

When matching a top-level grammar rule the code generator creates a function; there are two possible code paths,

for rules where there is no code defined to be executed when the rule is matched (in which case the function simply recursively generates the code for the inner expression), and
for rules where there is code to be executed on the successful match of the rule, where matching on the result of the inner expression has to be called to ensure that the code is not executed inappropriately. I make no effort to ensure that the code specified in the grammar is valid.

| Rule (x, y, z) ->
    ibprintfn b 0 "matchRule%s (input:string) (offset:int) = " <| capitalIdentifier x
    let j = i + 4
    if z = "" then codeGen b j y else
        ibprintfn b j "let res = "
        codeGen b (j + 4) y
        ibprintfn b 0 " in"
        ibprintfn b (j + 4) "match res with"
        ibprintfn b (j + 4) "| (Unmatched, _) -> (Unmatched, offset)"
        ibprintf b (j + 4) "| (parsed, endOffset) -> (%s input.[offset..endOffset - 1] parsed, endOffset)" z
    ibprintfn b 0 ""

The code generator for the Rule expression doesn't generate a valid function, there is no let rec prefixed; this is because the rule matching methods have to be mutually recursive, so an outer function is used to call codeGen for rules — it is that function which is told whether this is the first rule or not.

let codeGenRule b i first = function
| Rule _ as x ->
    ibprintf b i "%s " <| if first then "let rec" else "and"
    codeGen b i x
| x -> failwithf "Cannot generate code for %A" x

The rest of the code generator is found in peg.fs and consists of bolier plate implementations of the functions used by the generated code and utility functions that make the rest of the code generation easier.

Left recursion

A PEG parser cannot work with a grammar that contains left recursion. Left recursion occurs when a non-terminal symbol in the left position in an expression recursively (either directly or indirectly) refers to the same rule. The left position is the first element in a Sequence (this is a simplification!), any element in a Choice, and for the expressions that have a single sub-expression, that sub-expression is always in the left-position. It can be difficult to determine which elements in a sequence are in the left position because a Non-Terminal that successfully matches with zero-width (for example an Optional or ZeroOrMore expression)

example	description
a ← a b	The simplest form of left recursion
a ← b c, b ← a c	Indirect left recursion
a ← b c, b ← d a, d ← e?	Left recursive because `d` can match without consuming input even though `a` is not in the obvious left position

The following code detects simple forms of left recursion, but doesn't find the third example above.

let checkForLeftRecursion (rules:Map<string,Expression>) =
    let rec leftRules = function
    | Terminal _ | TerminalOneOf _ | TerminalUnicode _ | TerminalWildcard | Epsilon -> []
    | NonTerminal x -> [x]
    | ZeroOrMore x | OneOrMore x | Optional x | Not x | And x -> leftRules x
    | Choice x ->
        List.concat <| List.map leftRules x
    | Sequence x ->
        let rec processList = function
        | [] -> []
        | (head :: tail) ->
            match head with
            | Optional _ | Not _ | And _ | ZeroOrMore _ | OneOrMore _ -> List.append (leftRules head) (processList tail)
            | x -> leftRules head
        processList x
    | Rule _ -> failwith "Internal error"

    let rec checkRule name prior =
        if List.exists ((=) name) prior then
            failwithf "Left recursion calling %s" <| List.reduce (fun a b -> (b + " -> " + a)) (name :: prior)
        match Map.tryFind name rules with
        | None -> failwithf "Unknown rule %s" name
        | Some x ->
            let leftchoices = leftRules x
            let sp = (name :: prior)
            for rule in leftchoices do checkRule rule sp done

    try
        Map.iter (fun k _ -> checkRule k []) rules
    with
    | ex -> eprintfn "Grammar error: %s" ex.Message
            failwith "Invalid Grammar"

Putting it all together

The code in Program.fs puts this together into a program that takes an input file and writes an output, taking a grammar description, prefix and suffix from the input file. The format of the input file is as follows:

prefix written verbatim to the output file
(*%%
grammar rules
%%*)
suffix written verbatim to the output file

The section separators in the input file ((_%% and %%_)) are chosen to be similar to yacc and to allow the input file to be read as valid F#, which makes writing the prefix and suffix easier! The example on github implements a basic calculator program.

If the -x argument is passed to the parser generator, it will attempt to compile and execute the output file, this is done using the FSharpCodeProvider class from the F# power pack, Vladimir Ivanovskiy has a great article about Embedded Scripting using F#.

    use compiler = new FSharpCodeProvider()
    let parameters = CompilerParameters()
    parameters.GenerateExecutable <- true
    parameters.GenerateInMemory <- true
    let result = compiler.CompileAssemblyFromSource(parameters, parser)
    if result.Errors.Count > 0 then
        for error in result.Errors do eprintfn "%O" error done
    else
        for output in result.Output do printfn "%O" output done
        printfn "Executing..."
        try
            result.CompiledAssembly.EntryPoint.Invoke(null, null) |> ignore
        with
        | ex -> eprintfn "Error executing generated code: %s" ex.Message

As usual, all the code for this is on github.