Parsing Expression Grammar - part 1
I'm not a great language and grammar theoretician, so I'm not going to go into great detail about what a Parsing Expression Grammar (PEG) is, but here are a few observations.
- PEGs don't require a tokenizing stage - this can be a mixed blessing, but does mean that the grammar can be entirely expressed in one specification
- Tools like yacc and bison create parsers from LALR grammars — this is a relatively complicated procedure and requires that the grammar either be very carefully constructed to contain no ambiguity, or to be annotated such that the parser always knows how to resolve that ambiguity
- Grammars expressed as PEGs don't really have this problem, although you can't have left recursion — which I'll get to eventually
I'm using F# for the code in this series of posts, I'm not a completely fluent F# programmer so there may be some clumsy code in places, but it is a very succinct and clean language for this type of work.
A very simple introduction to PEG
A very minimalist implementation of a PEG only requires a few elements, I'll add some more later, both for completeness and convenience; a PEG expression is comprised of:
- Terminal symbol — which is commonly represented as text in quotation marks -
"foo"
would be a terminal symbol - A NonTerminal symbol — a word or identifier not in quotations mars - e.g.
foo
- Empty — matches the end of the input — represented with an
<epsilon>
- A Sequence — a list of expressions, each of which must be matched in order — represented as a space separated list
- A Choice — a list of expressions, tested in order, the first match is chosen — represented as a list of expressions separated by
/
- Zero or More — an expression that can occur any number of times, including zero — represented as a
*
suffix - One or More — an expression that can occur any number of times, but must occur at least once — represented as a
+
suffix - Optional — an expression that may or may-not match — represented as a
?
suffix - And — an expression that must match but that isn't captured by the matching process — represented as a
&
prefix - Not — an expression that must not match and isn't captured by the matching process — represented as a
!
prefix
First Implementation
While grammars are usually stored as text files, I'm going to start with the in memory representation of the PEG expressions.
type Expression =
| Terminal of string
| NonTerminal of string
| Epsilon
| Sequence of Expression list
| Choice of Expression list
| ZeroOrMore of Expression
| OneOrMore of Expression
| Optional of Expression
| And of Expression
| Not of Expression
This is a straight-forward translation of the description of the elements into an F# discriminated union type, before I dive into how to process this, I'll show some code that will print a PEG expression in a human readable form.
let printPeg expr =
let rec pegToString = function
| Terminal x -> "\"" + x + "\""
| NonTerminal x -> x
| Epsilon -> "<epsilon>"
| Sequence x -> List.map pegToString >> List.reduce (fun a b -> a + " " + b) <| x
| Choice x -> List.map pegToString >> List.reduce (fun a b -> a + " / " + b) <| x
| ZeroOrMore x -> pegToString x + "*"
| OneOrMore x -> pegToString x + "+"
| Optional x -> pegToString x + "?"
| And x -> "&" + pegToString x
| Not x -> "!" + pegToString x
There is nothing particularly complex about this, the Sequence
and Choice
patterns use the forward function combination operator >>
to pipeline converting the list of expressions into a list of strings (map), and then to join the elements using the appropriate separator (reduce).
To check that this works as expected I'll try a few expressions in F# interactive.
> printPeg <| Terminal "foo";;
"foo"
val it : unit = ()
> printPeg <| NonTerminal "foo";;
foo
val it : unit = ()
> printPeg <| Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "*"; NonTerminal "product"])];;
atom "*" product?
val it : unit = ()
This doesn't quite work how I want it to, in the last expression it isn't clear what is optional, so I need to make printPeg
a touch more complex…
let rec pegToString expr =
let pegToStringParen = function
| Terminal _ | NonTerminal _ | Epsilon as x -> pegToString x
| x -> "(" + pegToString x + ")"
match expr with
| Terminal x -> "\"" + x + "\""
| NonTerminal x -> x
| Epsilon -> "<epsilon>"
| Sequence x -> List.map pegToStringParen >> List.reduce (fun a b -> a + " " + b) <| x
| Choice x -> List.map pegToStringParen >> List.reduce (fun a b -> a + " / " + b) <| x
| ZeroOrMore x -> pegToStringParen x + "*"
| OneOrMore x -> pegToStringParen x + "+"
| Optional x -> pegToStringParen x + "?"
| And x -> "&" + pegToStringParen x
| Not x -> "!" + pegToStringParen x
let printPeg expr =
printfn "%s" <| pegToString expr
This now perhaps errs on the the cautious side, but adds parenthesis where they are needed…
> printPeg <| Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "*"; NonTerminal "product"])];;
atom (("*" product)?)
val it : unit = ()
When text is parsed against a grammar, there needs to be a internal representation of the parsed data, in part 2 when I'll add functions to grammar rules, this representation is the input to those functions.
type ParseResult =
| TerminalSymbol of string
| Production of ParseResult list
| EmptyMatch
| Unmatched
TerminalSymbol
— hopefully self explanatoryProduction
— represents aSequence
,ZeroOrMore
, orOneOrMore
EmptyMatch
— Used for non-capturing successful matchesZeroOrMore
(with zero matches),Optional
(not present),And
,Not
Unmatched
— No match found
Taking the parse function in parts.
First up the atomic expressions, these are all pretty straightforward.
let parseExpression (grammar:Map<string,Expression>) start (input:string) =
let rec parse offset = function
| Terminal x ->
let e = offset + x.Length - 1
if e < input.Length && input.[offset..e] = x then (TerminalSymbol x, e + 1) else (Unmatched, offset)
| NonTerminal x -> parse offset grammar.[x]
| Epsilon -> if offset = input.Length then (EmptyMatch, offset) else (Unmatched, offset)
Moving on to Sequence
and Choice
| Sequence x ->
let rec parseList lst off =
seq {
if List.isEmpty lst then ()
else
match parse off <| List.head lst with
| (Unmatched, _) as y -> yield y
| (y, z) -> yield (y, z)
yield! parseList (List.tail lst) z
}
let s = List.ofSeq <| parseList x offset
if List.exists (function | (Unmatched, _) -> true | _ -> false) s then (Unmatched, offset)
else (Production <| List.map fst s, snd <| List.maxBy snd s)
| Choice x ->
let rec parseList lst =
if List.isEmpty lst then (Unmatched, offset)
else
match parse offset <| List.head lst with
| (Unmatched, _) -> parseList <| List.tail lst
| y -> y
parseList x
The pattern for Sequence
creates an F# seq over the list of sub-expressions for the sequence, when an expression doesn't match the sequence is terminated with an Unmatched parse result; that is used as a signal to mark that the Sequence
as a whole didn't match; otherwise the results from the sub-expressions are packaged into a Production parse result.
The pattern for Choice
enumerates over the sub-expressions until it finds the first one that matches; that result is returned.
The rest…
| ZeroOrMore x ->
let rec parseList off =
seq {
match parse off x with
| (Unmatched, _) -> ()
| (y, z) -> yield (y, z)
yield! parseList z
}
let s = List.ofSeq <| parseList offset
if List.isEmpty s then (EmptyMatch, offset)
else (Production <| List.map fst s, snd <| List.maxBy snd s)
| OneOrMore x ->
let rec parseList off =
seq {
match parse off x with
| (Unmatched, _) -> ()
| (y, z) -> yield (y, z)
yield! parseList z
}
let s = List.ofSeq <| parseList offset
if List.isEmpty s then (Unmatched, offset)
else (Production <| List.map fst s, snd <| List.maxBy snd s)
| Optional x ->
match parse offset x with
| (Unmatched, _) -> (EmptyMatch, offset)
| y -> y
| And x ->
match parse offset x with
| (Unmatched, _) -> (Unmatched, offset)
| _ -> (EmptyMatch, offset)
| Not x ->
match parse offset x with
| (Unmatched, _) -> (EmptyMatch, offset)
| _ -> (Unmatched, offset)
parse 0 grammar.[start]
ZeroOrMore
and OneOrMore
are almost the same, DRY suggests that I should break out the common code for these patterns; both collect a sequence of contiguous matches for the sub-expression, they differ only in how they handle the empty collection.
Optional
, And
, and Not
can all be pulled onto one line (depending on how wide your monitor is...) they simply filter the response from the recursive parsing.
parseExpression is then rounded out with a call to the inner function to start the recursive parsing in the context of the grammar and input string.
Trying it out
First here's a simple grammar for boolean arithmetic.
let g = Map.ofList [("bool", Choice [Terminal "true"; Terminal "false"]);
("paren", Sequence [Terminal "("; NonTerminal "expr"; Terminal ")"]);
("not", Sequence [Terminal "!"; NonTerminal "expr"]);
("atom", Choice [NonTerminal "bool"; NonTerminal "paren"; NonTerminal "not"]);
("and", Sequence [NonTerminal "atom"; Optional (Sequence [Terminal "&"; NonTerminal "and"])]);
("or", Sequence [NonTerminal "and"; Optional (Sequence [Terminal "|"; NonTerminal "or"])]);
("expr", NonTerminal "or");
("start", Sequence [NonTerminal "expr"; Epsilon])]
The printPeg function still works…
> Map.fold (fun _ k v -> printfn "%s -> %s" k <| pegToString v) () g;;
and -> atom (("&" and)?)
atom -> bool / paren / not
bool -> "true" / "false"
expr -> or
not -> "!" expr
or -> and (("|" or)?)
paren -> "(" expr ")"
start -> expr <epsilon>
val it : unit = ()
Map has sorted its elements, but the printed representations are as expected.
> parseExpression g "start" "true&true|false";;
val it : ParseResult * int =
(Production
[Production
[Production
[TerminalSymbol "true";
Production
[TerminalSymbol "&";
Production [TerminalSymbol "true"; EmptyMatch]]];
Production
[TerminalSymbol "|";
Production
[Production [TerminalSymbol "false"; EmptyMatch]; EmptyMatch]]];
EmptyMatch], 15)
> parseExpression g "start" "true|true&false";;
val it : ParseResult * int =
(Production
[Production
[Production [TerminalSymbol "true"; EmptyMatch];
Production
[TerminalSymbol "|";
Production
[Production
[TerminalSymbol "true";
Production
[TerminalSymbol "&";
Production [TerminalSymbol "false"; EmptyMatch]]];
EmptyMatch]]]; EmptyMatch], 15)
>
This shows how the precedence of the operators is defined explicitly in the grammar, rather than by annotating tokens or rules.
In part 2, I'll turn this into something usable.
The code for this article is on github.