Parsing Expression Grammar - part 3
The PEG parser at the end of part2 will parse, and call out to code when a rule is parsed, but isn't particularly easy to use, most particularly because the grammar has to be expressed as F# objects. In this post I'll extend the parser to be able to parse PEG from a text file, and to generate F# code for a parser described by that PEG.
There are only two (big) things missing from the code needed to do this, a parser for PEG — which I will write as (you guessed it) a PEG, and the code generator for the parser.
Parsing PEG
Parsing the PEG requires a complete grammar.
hexDigit <- [0123456789abcdefABCDEF]
escapedCharacter <- '\\' ([abfnrtv0\\"'[\]] / ('u' hexDigit hexDigit hexDigit hexDigit))
safeCharacter <- escapedCharacter / (!["\\\n] <anychar>)
oneofCharacter <- escapedCharacter / (![\\\]] <anychar>)
terminalUnicode <- '{' ( "Lu" / "Ll" / "Lt" / "Lm" / "Lo" / "Mn" / "Mc" / "Me" / "Nd" / "Nl" / "No" / "Pc" / "Pd" /
"Ps" / "Pe" / "Pi" / "Pf" / "Po" / "Sm" / "Sc" / "Sk" / "So" / "Zs" / "Zl" / "Zp" / "Cc" /
"Cf" / "Cs" / "Co" / "Cn" ) '}'
terminalOneOf <- '[' oneofCharacter+ ']'
terminalCharacter <- '\'' safeCharacter '\''
terminalWord <- '\"' safeCharacter+ '\"'
terminal <- terminalWord / terminalCharacter / terminalOneOf / terminalUnicode / "<anychar>" / "<epsilon>"
space <- [ \t\r\n]+
name <- ({Lu} / {Ll} / '-') ({Lu} / {Ll} / {Nd} / [_-])*
nonterminal <- name
atom <- terminal / nonterminal / ('(' space? expr space? ')')
unary <- ([!&] unary) / (atom [*+?])?
sequence <- unary (space unary)*
choice <- sequence (space? '/' space? sequence)*
expr <- choice
codechar <- (![{}\] <anychar>) / ('\\' [{}\\]) / ('{' codechar* '}')
codeblock <- '{' codechar* '}'
rule <- name space? "<-" space? expr space? codeblock? space? <epsilon>
Parser changes
I have added a few new terminal types to the Expression type defined in part 1, these allow a grammar to specify a terminal symbol match on one of a selection of characters, to match a character (strictly a UTF-16 character) based on a Unicode category, and to match any character, These make specifying real world grammars (including that for PEG itself) very much easier.
type Expression =
| Terminal of string
| TerminalOneOf of string
| TerminalUnicode of UnicodeCategory
| TerminalWildcard
| NonTerminal of string
| Epsilon
| Sequence of Expression list
| Choice of Expression list
| ZeroOrMore of Expression
| OneOrMore of Expression
| Optional of Expression
| And of Expression
| Not of Expression
| Rule of string * Expression * string
Because I have added terminal types that operate on a per character basis, I've added a simple utility method that gets the next single character from the input, if any. This is this only place in the code where surrogate pairs are specifically handled.
let currentCharacter (s:string) i =
if i >= s.Length then None
else if Char.IsSurrogatePair(s, i) then Some s.[i..i+1] else Some s.[i..i]
I have extended parseExpression
in the obvious way to parse the new terminal expressions.
let parseExpression (grammar:Map<string,GrammarRule<'a>>) start (input:string) =
let rec parse offset = function
| Terminal x ->
let e = offset + x.Length - 1
if e < input.Length && input.[offset..e] = x then (TerminalSymbol x, e + 1) else (Unmatched, offset)
| TerminalOneOf x -> match currentCharacter input offset with
| Some c -> if x.Contains(c) then (TerminalSymbol c, offset + c.Length) else (Unmatched, offset)
| None -> (Unmatched, offset)
| TerminalWildcard -> match currentCharacter input offset with
| Some c -> (TerminalSymbol c, offset + c.Length)
| None -> (Unmatched, offset)
| TerminalUnicode x -> match currentCharacter input offset with
| Some c -> if Char.GetUnicodeCategory(c, 0) = x then (TerminalSymbol c, offset + c.Length) else (Unmatched, offset)
| None -> (Unmatched, offset)
| NonTerminal x ->
let rule = grammar.[x]
// .
// .
Parsing PEG
I now have enough pieces to be able to create a PEG parser for PEG, this is in the file pegOfPeg.fs, an example of one of the grammar rules is…
// unary <- ([!&] unary) / (atom [*+?])?
let parseUnary _ = function
| Production [TerminalSymbol "!"; Parsed x] -> Parsed <| Not x
| Production [TerminalSymbol "&"; Parsed x] -> Parsed <| And x
| Production [Parsed x; TerminalSymbol "*"] -> Parsed <| ZeroOrMore x
| Production [Parsed x; TerminalSymbol "+"] -> Parsed <| OneOrMore x
| Production [Parsed x; TerminalSymbol "?"] -> Parsed <| Optional x
| Production [Parsed _ as x; EmptyMatch] -> x
| x -> unexpected x
let unaryRule = GrammarRule<Expression>(Choice [Sequence [TerminalOneOf "!&"; NonTerminal "unary"];
Sequence [NonTerminal "atom"; Optional(TerminalOneOf "*+?")]], parseUnary)
The rest of the files is much the same, with a parse method for each rule in the grammar and a corresponding F# declaration.
Code Generation
The main method for code generation is a recursive function which matches an
Expression
argument. For the simplest expression types, those that don't
require recursively parsing one or more expressions, the code generation
consists of writing code that calls out to the function that performs the
parsing for that terminal type (for the terminals), matching the end of the
input (for the <epsilon>
rule), and calls a rule matching function
(whose generation I will show below) - for NonTerminal rules. The functions
ibprintf
and ibprintfn
are used to print an indented string to a
StringBuilder
.
let rec codeGen (b:StringBuilder) i = function
| Terminal x -> ibprintf b i """matchTerminal "%s" input offset""" <| escape x
| TerminalOneOf x -> ibprintf b i """matchTerminalOneOf "%s" input offset""" <| escape x
| TerminalWildcard -> ibprintf b i """matchTerminalWildcard input offset"""
| TerminalUnicode x -> ibprintf b i """matchTerminalUnicode System.Globalization.UnicodeCategory.%s input offset""" <| string(x)
| Epsilon -> ibprintf b i """if offset = input.Length then (EmptyMatch, offset) else (Unmatched, offset)"""
| NonTerminal x -> ibprintf b i """matchRule%s input offset""" <| capitalIdentifier x
To match Sequence
and Choice
expressions, the code generator generates a list of
lambdas, each of which is recursively generated by the codeGen
function, one
for each sub-expression of the Sequence
or Choice
. This list of lambdas is then
past to the utility function matchSequence
or matchChoice
as appropriate.
| Sequence x | Choice x as y ->
let fn = match y with | Sequence _ -> "Sequence" | Choice _ -> "Choice" | _ -> failwith "Internal Error"
let processItem index item =
ibprintfn b 0 ""
ibprintfn b (i + 4) "(fun offset -> "
codeGen b (i + 8) item
ibprintf b 0 ");"
ibprintf b i "let l = ["
List.iteri processItem x
ibprintf b 0 "] in match%s input offset l" fn
For the expressions which wrap a single expression (Optional
, ZeroOrMore
,
OneOrMore
, And
, and Not
) a similar approach generates a lambda for the
inner expression and passes it to the appropriate utility function.
| ZeroOrMore x | OneOrMore x | Optional x | And x | Not x as y ->
let fn = match y with | ZeroOrMore _ -> "ZeroOrMore" | OneOrMore _ -> "OneOrMore" | Optional _ -> "Optional" | And _ -> "And" | Not _ -> "Not" | _ -> failwith "Internal Error"
ibprintfn b i "match%s input offset (fun offset -> " fn
codeGen b (i + 8) x
ibprintf b 0 ");"
When matching a top-level grammar rule the code generator creates a function; there are two possible code paths,
- for rules where there is no code defined to be executed when the rule is matched (in which case the function simply recursively generates the code for the inner expression), and
- for rules where there is code to be executed on the successful match of the rule, where matching on the result of the inner expression has to be called to ensure that the code is not executed inappropriately. I make no effort to ensure that the code specified in the grammar is valid.
| Rule (x, y, z) ->
ibprintfn b 0 "matchRule%s (input:string) (offset:int) = " <| capitalIdentifier x
let j = i + 4
if z = "" then codeGen b j y else
ibprintfn b j "let res = "
codeGen b (j + 4) y
ibprintfn b 0 " in"
ibprintfn b (j + 4) "match res with"
ibprintfn b (j + 4) "| (Unmatched, _) -> (Unmatched, offset)"
ibprintf b (j + 4) "| (parsed, endOffset) -> (%s input.[offset..endOffset - 1] parsed, endOffset)" z
ibprintfn b 0 ""
The code generator for the Rule
expression doesn't generate a valid function,
there is no let rec
prefixed; this is because the rule matching methods have
to be mutually recursive, so an outer function is used to call codeGen
for
rules — it is that function which is told whether this is the first rule
or not.
let codeGenRule b i first = function
| Rule _ as x ->
ibprintf b i "%s " <| if first then "let rec" else "and"
codeGen b i x
| x -> failwithf "Cannot generate code for %A" x
The rest of the code generator is found in peg.fs and consists of bolier plate implementations of the functions used by the generated code and utility functions that make the rest of the code generation easier.
Left recursion
A PEG parser cannot work with a grammar that contains left recursion. Left
recursion occurs when a non-terminal symbol in the left position in an
expression recursively (either directly or indirectly) refers to the same rule.
The left position is the first element in a Sequence
(this is a
simplification!), any element in a Choice
, and for the expressions that have a
single sub-expression, that sub-expression is always in the left-position. It
can be difficult to determine which elements in a sequence are in the left
position because a Non-Terminal that successfully matches with zero-width (for
example an Optional
or ZeroOrMore
expression)
example | description |
---|---|
a ← a b | The simplest form of left recursion |
a ← b c, b ← a c | Indirect left recursion |
a ← b c, b ← d a, d ← e? | Left recursive because d can match without consuming input even though a is not in the obvious left position |
The following code detects simple forms of left recursion, but doesn't find the third example above.
let checkForLeftRecursion (rules:Map<string,Expression>) =
let rec leftRules = function
| Terminal _ | TerminalOneOf _ | TerminalUnicode _ | TerminalWildcard | Epsilon -> []
| NonTerminal x -> [x]
| ZeroOrMore x | OneOrMore x | Optional x | Not x | And x -> leftRules x
| Choice x ->
List.concat <| List.map leftRules x
| Sequence x ->
let rec processList = function
| [] -> []
| (head :: tail) ->
match head with
| Optional _ | Not _ | And _ | ZeroOrMore _ | OneOrMore _ -> List.append (leftRules head) (processList tail)
| x -> leftRules head
processList x
| Rule _ -> failwith "Internal error"
let rec checkRule name prior =
if List.exists ((=) name) prior then
failwithf "Left recursion calling %s" <| List.reduce (fun a b -> (b + " -> " + a)) (name :: prior)
match Map.tryFind name rules with
| None -> failwithf "Unknown rule %s" name
| Some x ->
let leftchoices = leftRules x
let sp = (name :: prior)
for rule in leftchoices do checkRule rule sp done
try
Map.iter (fun k _ -> checkRule k []) rules
with
| ex -> eprintfn "Grammar error: %s" ex.Message
failwith "Invalid Grammar"
Putting it all together
The code in Program.fs puts this together into a program that takes an input file and writes an output, taking a grammar description, prefix and suffix from the input file. The format of the input file is as follows:
prefix written verbatim to the output file
(*%%
grammar rules
%%*)
suffix written verbatim to the output file
The section separators in the input file ((_%%
and %%_)
) are chosen to be
similar to yacc and to allow the input file
to be read as valid F#, which makes writing the prefix and suffix easier! The
example on
github
implements a basic calculator program.
If the -x
argument is passed to the parser generator, it will attempt to
compile and execute the output file, this is done using the FSharpCodeProvider
class from the F# power pack, Vladimir
Ivanovskiy has a great article about Embedded Scripting using
F#.
use compiler = new FSharpCodeProvider()
let parameters = CompilerParameters()
parameters.GenerateExecutable <- true
parameters.GenerateInMemory <- true
let result = compiler.CompileAssemblyFromSource(parameters, parser)
if result.Errors.Count > 0 then
for error in result.Errors do eprintfn "%O" error done
else
for output in result.Output do printfn "%O" output done
printfn "Executing..."
try
result.CompiledAssembly.EntryPoint.Invoke(null, null) |> ignore
with
| ex -> eprintfn "Error executing generated code: %s" ex.Message
As usual, all the code for this is on github.