`
`1
`
`Highcr—Order Functions for Parsing>i<
`
`Graham Hutton
`
`Department of Computer Science, University of Utrecht,
`PO Boar 80.089, 3508 TB Utrecht, The Netherlands.
`
`Abstract
`
`In combinatar parsing, the text of parsers resembles BNF notation. We present the basic
`method, and a number of extensions. We address the special problems presented by white-
`space, and parsers with separate lexical and syntactic phases. In particular, a combining
`form for handling the “offside rule” is given. Other extensions to the basic method include
`an “into” combining form with many useful applications, and a simple means by which
`combinator parsers can produce more informative error messages.
`
`1 Introduction
`
`Broadly speaking, a parser may be defined as a program which analyses text to
`determine its logical structure. For example, the parsing phase in a compiler takes
`a program text, and produces a parse tree which expounds the structure of the
`program. Many programs can be improved by having their input parsed. The form of
`input which is acceptable is usually defined by a context—free grammar, using BNF
`notation. Parsers themselves may be built by hand, but are most often generated
`automatically using tools like Lex and Yacc from Unix (Aho86).
`Although there are many methods of building parsing, one in particular has
`gained widespread acceptance for use in lazy functional languages. In this method,
`parsers are modelled directly as functions; larger parsers are built piecewise from
`smaller parsers using higher order functions. For example, we define higher order
`functions for sequencing, alternation and repetition. In this way, the text of parsers
`closely resembles BNF notation. Parsers in this style are quick to build, and simple
`to understand and modify. In the sequel, we refer to the method as combinator
`parsing, after the higher order functions used to combine parsers.
`Combinator parsing is considerably more powerful than the commonly used meth-
`ods, being able to handle ambiguous grammars, and providing full backtracking if
`it is needed. In fact, we can do more than just parsing. Semantic actions can be
`added to parsers, allowing their results to be manipulated in any way we please. For
`example, in section 2.4 we convert a parser for arithmetic expressions to an eval-
`uator simply by changing the semantic actions. More generally, we could imagine
`generating some form of abstract machine code as programs are parsed.
`
`* Appears in the Journal of Functional Programming 2(3):323—343, July 1992.
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 1
`
`
`
`2
`
`Graham Hutton
`
`Although the principles are widely known (due in most part to (Wadler85)),
`little has been written on combinator parsing itself. In this article, we present the
`basic method, and a number of extensions. The techniques may be used in any lazy
`functional language with a higher—order/polymorphic style type system. All our
`programming examples are given in Mirandal; features and standard functions are
`explained as they are used. A library of parsing functions taken from this paper is
`available by electronic mail from the author. Versions exist in both Miranda and
`Lazy ML.
`
`2 Parsing Using Combinators
`
`We begin by defining a type of parsers. A parser may be viewed as a function from
`a string of symbols to a result value. Since a parser might not consume the entire
`string, part of this result will be a suflix of the input string. Sometimes a parser may
`not be able to produce a result at all. For example, it may be expecting a letter, but
`find a digit. Rather than defining a special type for the success or failure of a parser,
`we choose to have parsers return a list of pairs as their result, with the empty list
`[] denoting failure, and a singleton list [(v,xs)] indicating success, with value v
`and unconsumed input xs. As we shall see in section 2.2, having parsers return a
`list of results proves very useful. Since we want to specify the type of any parser,
`regardless of the kind of symbols and results involved, these types are included as
`extra parameters. In Miranda, type variables are denoted by sequences of stars.
`
`parser * ** == [*1 -> [(**.[*])]
`
`For example, a parser for arithmetic expressions might have type (parser char
`expr), indicating that it takes a string of characters, and produces an expression
`tree. Notice that parser
`not a new type as such, but an abbreviation (or syn-
`onym); it’s only purpose is to make types involving parsers easier to understand.
`
`2.1 Primitive parsers
`
`The primitive parsers are the building blocks of combinator parsing. The first of
`these corresponds to the 5 symbol in BNF notation, denoting the empty string.
`The succeed parser always succeeds, without actually consuming any of the input
`string. Since the outcome of succeed does not depend upon its input, its result
`value must be pre—deter1nined, so is included as an extra parameter:
`
`succeed :: ** -> parser * **
`
`succeed v inp = [(v,inp)]
`
`This definition relies on partial application to work properly. The order of the argu-
`ments means that if succeed is supplied only one argument, the result is a parser
`(Le. a function) which always succeeds with this value. For example, (succeed 5)
`
`1 Miranda is a trademark of Research Software Limited.
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 2
`
`
`
`Hz'gher«Order Functions for Parsing
`
`3
`
`is a parser which always returns the value 5. Furthermore, even though succeed
`plainly has two arguments, its type would suggest it has only one. There is no
`magic, the second argument is simply hidden inside the type of the result, as would
`be clear upon expansion of the type according to the parser abbreviation.
`While succeed never fails, fail always does, regardless of the input string:
`
`fail :: parser * **
`
`fail inp = []
`
`The next function allows us to make parsers that recognise single symbols. Rather
`than enumerating the acceptable symbols, we find it more convenient to provide the
`set implicitly, via a predicate'which determines if an arbitrary symbol is a member.
`Successful parses return the consumed symbol as their result value.
`
`satisfy 2:
`
`(* —> bool) -> parser * *
`
`= fail []
`satisfy p []
`satisfy 13 (x:xs) = succeed x xs , p X
`= fail xs
`, otherwise
`
`Notice how succeed and fail are used in this example. Although they are not
`strictly necessary, their presence makes the parser easier to read. Note also that the
`parser (satisfy p) returns failure if supplied with an empty input string.
`Using satisfy we can define a parser for single syrribols:
`
`literal
`
`::
`
`* —> parser * *
`
`literal x = satisfy (=x)
`
`For example, applying the parser (literal ’3’) to the string "345" gives the
`result [(’3’ ,"45")]. In the definition of literal, (=x) is a function which tests
`its argument for equality with x. It is an example of operator sectioning, a useful
`syntactic convention which allows us to partially apply infix operators.
`
`2. 2 Combinators
`
`Now that we have the basic building blocks, we consider how they should be put
`together to form useful parsers. In BNF notation, larger grammars are built piece-
`wise from smaller ones using | to denote alternation, and juxtaposition to indicate
`sequencing. So that our parsers resemble BNF notation, we define higher order
`functions which correspond directly to these operators. Since higher order func-
`tions like these combine parsers to form other parsers, they are often referred to as
`combining forms or combinators. We will use these terms from now on.
`The alt combinator corresponds to alternation in BNF. The parser (pl $a1t
`p2) recognises anything that either pl or p2 would. Normally we would interpret
`either in a sequential (or exclusive) manner, returning the result of the first parser
`to succeed, and failure if neither does. This approach is taken in (Fairbairn86).
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 3
`
`
`
`4
`
`Graham Hutton
`
`In combinator parsing however, we use inclusive either — it is acceptable for both
`parsers to succeed, in which case we return both results. In general then, combinator
`parsers may return an arbitrary number of results. This explains our decision earlier
`to have parsers return a list of results.
`
`With parsers returning a list, alt is implemented simply by appending (denoted
`by ++ in Miranda) the result of applying both parsers to the input string. In keeping
`with the BNF notation, we use the Miranda $ notation to convert alt to an infix
`operator. Just as for sectioning, the infix notation is merely a syntactic convenience:
`(x $f y) is equivalent to (f x y) in all contexts.
`
`alt
`
`:: parser * ** -> parser * H -> parser * **
`
`(pl $alt p2)
`
`inp = pl
`
`inp ++ p2 inp
`
`is the identity element for ++, it is easy to verify
`[]
`Knowing that the empty—list
`from this definition that failure is the identity element for alternation: (fail $alt
`p) = (p $alt fail) : p. In practical terms this means that alt has the expected
`behaviour if only one of the argument parsers succeeds. Similarly, alt inherits
`associativity from ++: (p $alt q) $alt r : p $alt (q $alt r). This means we
`do not not need to worry about bracketing repeated alternation correctly.
`Allowing parsers to produce more than one result allows us to handle ambiguous
`grammars, with all possible parses being produced for an ambiguous string. The
`feature has proved particularly useful in natural language processing (Frost88). An
`example ambiguous string from (Frost88) is “Who discovered a moon that orbits
`Mars or Jupiter 7” Most often however, we are only interested in the single longest
`parse of a string (i.e. that which consumes the most symbols). For this reason, it is
`normal in combinator parsing to arrange for the parses to be returned in descending
`order of length. All that is required is a little care in the ordering of the argument
`parsers to alt. See for example the many combinator in the next section.
`The then combinator corresponds to sequencing in BNF. The parser (pl $then
`p2) recognises anything that p1 and p2 would if placed in succession. Since the first
`parser may succeed with many results, each with an input stream suffix, the second
`parser must be applied to each of these in turn. In this manner, two results are
`produced for each successful parse, one from each parser. They are combined (by
`pairing) to form a single result for the compound parser.
`
`then :: parser * ** -> parser * **»< —> parser *
`
`(**,***)
`
`(pl $then p2)
`
`inp = [((v1,v2),out2)
`
`I
`
`inp;
`(v1,out1) <— pl
`(v2,out2) <— p2 out1]
`
`For example, applying the parser (literal ’a’ $then literal ’b’) to the input
`"abcd" gives the result [((’a’ , ’b’) , "cd“)] . The then combinator is an excellent
`example of list compre/tension notation, analogous to set comprehension in math-
`ematics (eg. {:52 l 00 E IN /\ at < 10} defines the first ten squares), except that lists
`replace sets, and elements are drawn in a determined order. Much of the elegance
`of the then combinator would be lost if this notation were not available.
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 4
`
`
`
`Hig/ter—O7“de7" Functions for Parsing
`
`5
`
`Unlike alternation, sequencing is not associative, due to the tupling of results from
`the component parsers. In Miranda, all infix operators made using the $ notation
`are assumed to associate to the right. Thus, when we write (p $then q $then r)
`it is interpreted as (p $then (q $then
`
`2.3 Manipulating values
`
`Part of the result from a parser is a value. The using combinator allows us to
`manipulate these results, building a parse tree being the most common application.
`The parser (p Susing f) has the same behaviour as the parser p, except that the
`function f is applied to each of its result values:
`
`using :: parser * ** -> (** -> ***) —> parser * ***
`
`(p $using ;f)
`
`inp = [(f v,out)
`
`I
`
`(v,out) <— p inp]
`
`Although using has no counterpart in pure BNF notation, it does have much in
`common with the
`-
`operator in Yacc (Aho86). In fact, the using combinator
`does not restrict us to building parse trees. Arbitrary semantic actions can be
`used. For example, in section 2.4 we convert a parser for arithmetic expressions
`to an evaluator simply by changing the actions. There is a clear connection here
`with attribute grammars. A recent and relevant article on attribute grammars is
`(Johnsson87). A combinator parser may be viewed as the implementation in a lazy
`functional language of an attribute grammar in which every node has one inherited
`attribute (the input string), and two synthesised attributes (the result value of the
`parse and the unconsumed part of the input string.) In the remainder of this section
`we define some useful new parsers and combinators in terms of our primitives.
`In BNF notation, repetition occurs often enough to merit its own abbreviation.
`When zero or more repetitions of a phrase p are admissible, we simply write p*. For-
`mally, this notation is defined by the equation p* : p p* | 5. The many combinator
`corresponds directly to this operator, and is defined in much the same way:
`
`many :: parser >o< ** —> parser *
`
`[**]
`
`many p = ((p $then many p) $using cons) $a1t
`
`(succeed [D
`
`The action cons is the uncurried version of the list constructor “ : ”, and is defined by
`cons (x,xs) = x:xs. Since combinator parsers return all possible parses according
`to a grammar, if failure occurs on the nth application of (many p), it results will be
`returned, one for each of the O to n—1 successful applications. Following convention,
`the results are returned in descending order of length. For example, applying the
`parser many (literal ’a’) to the string "aaab" gives the list
`
`[:(IIaaaII , nbu) , (naan ’ nabu) ’ (Ivan ) uaabu) ’ (nu , naaab1I)]
`
`Not surprisingly, the next parser corresponds to the other common iterative form
`in BNF, defined by pl” = p p‘. The parser (some p) has the same behaviour as
`(many p), except that it accepts one or more repetitions of p, rather of zero or
`more:
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 5
`
`
`
`6
`
`Graham Hutton
`
`some :: parser * ** -> parser *
`
`[**]
`
`some p = (p $then many p) $using cons
`
`Note that (some p) may fail, whereas (many p) always succeeds. Using some we
`define parsers for number and words —~ non—empty sequences of digits and letters:
`
`number
`word
`
`:: parser char
`:: parser char
`
`[char]
`[char]
`
`number = some (satisfy digit)
`where digit X = ’0’ <= x <= ’9’
`
`word = some (satisfy letter)
`where letter x = (’a’ <= x <= ’z’) \/ (’A’ <= x <= ’Z’)
`
`The next combinator is a generalisation of the literal primitive, allowing us
`build parsers which recognise strings of symbols, rather than just single symbols:
`
`string ::
`
`[*] —> parser *
`
`[*]
`
`= succeed []
`string []
`string (xzxs) = (literal x $then string xs) $using cons
`
`For example, applying the parser (string "begin") to the string "begin end"
`gives the output [(“begin" , " end")]. It is important to note that (string xs)
`fails if only a prefix of the sequence xs is available in the input string.
`As well as being used the define other parsers, the using combinator is often
`used to prune unwanted components from a parse tree. Recall that two parsers
`composed in sequence produce a pair of results. Sometimes we are only interested in
`one component of the pair. For example, it is common to throw away reserved words
`such as “begin” and “where” during parsing. In such cases, two special versions of
`the then combinator are useful, which throw away either the left or right result
`values, as reflected by the position of the letter “X” in their names:
`
`xthen :: parser * >o<>o< -> parser * *** -> parser * ***
`thenx :: parser * ** -> parser * *** —> parser * **
`
`pl $xthen p2 = (pl $then p2) $using snd
`pl $thenx p2 = (pl $then p2) $using fst
`
`The actions fst and snd are the standard projection functions on pairs, defined by
`fst (x,y) = x and snd (x,y) = y.
`Sometimes we are not interested in the result from a parser at all, only that the
`parser succeeds. For example, if we find a reserved word during lexical analysis, it
`may be convenient to return some short representation rather than the string itself.
`The return combinator is useful in such cases. The parser (p $return v) has the
`same behaviour as p, except that it returns the value v if successful:
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 6
`
`
`
`Hz'gher—Order Functions for Parsing
`
`7
`
`return :: parser * M —> =«** —> parser * ***
`
`p $return v = p $using (const V)
`where const x y = x
`
`2.4 Eccample
`
`To conclude our introduction to combinator parsing, we will work through the
`derivation of a simple parser. Suppose we have a program which works with arith-
`metic expressions, defined in Miranda as follows:
`
`expr
`
`::= Num num I expr $Add expr
`I expr $Mu1 expr
`
`I expr $Sub expr
`I expr $Div expr
`
`We can imagine a function showexpr which converts terms of type expr to the
`normal arithmetic notation. For example,
`
`showexpr
`
`((Num 3) $Mu1 ((Num 6) $Add (Num 1))) = "3*(6+1)"
`
`While such pretty—printing is notionally quite simple, the inverse operation, pars-
`ing, is usually thought of as being much more involved. As we shall see however,
`building a combinator parser for arithmetic expressions is no more complicated
`than implementing the showexpr function.
`Before we start thinking about parsing, we must define a BNF grammar for
`expressions. To begin with, the definition for the type expr may itself be cast in
`BNF notation. All we need do is include parenthesised expressions as an extra case:
`
`empri
`
`:2:
`
`emprt + empn I expri — emprt I
`eacpn >l< eztpn I eccpn/easpn I
`dtgtt+ I (e:L'prL)
`
`in practice it
`Although this grammar could be used as the basis of the parser,
`is useful to impose a little more structure. To simplify expressions, multiplication
`and division are normally assumed to have higher precedence than addition and
`subtraction. For example, 3 + 5 a< 2 is interpreted as 3 + (5 * 2). In terms of our
`grammar, we introduce a new non—terminal for each level of precedence:
`
`exprt
`term
`factor
`
`term —I- term I term — term I term
`2::
`= factor >I< factor I factor / factor I factor
`:2:
`dtgz't+ I (catprt)
`
`VVhile addition and multiplication are clearly associative, division and subtraction
`are normally assumed to associate to the left. The natural way to express this
`convention in the grammar is with left recursive production rules (such as exp/n. ::=
`exprt — term). Unfortunately, in top—down methods such as combinator parsing, it
`is well known that left—recursion leads to non—termination of the parser (Aho86).
`In section 4.1 we show how to transform a grammar to eliminate left—recursion. For
`the present however, we will leave the grammar as above, and use extra parenthesis
`to disambiguate expressions involving repeated operations.
`
`Blue Coat Systems — CORRECTED Exhibit 1058 Page 7
`
`
`
`8
`
`Graham Hutton
`
`Now that we have a grammar for expressions, it is a simple step to build a com-
`binator parser. The BNF description is simply re—written in combinator notation,
`and augmented with semantic actions to manipulate the result values:
`
`expn
`
`term
`
`$a1t
`= ((term $then literal ’+’ $xthen term) $using plus)
`((term $then literal ’-’ $xthen term) $using minus) $a1t
`term
`
`$alt
`((factor $then literal ’*’ $xthen factor) $using times)
`((factor $then literal ’/’ $xthen factor) $using divide) $alt
`factor
`
`factor = (number $using value) $alt
`(literal ’(’ $xthen expn $thenX literal ’)’)
`
`Note that the parser makes use of the special sequential combining forms Xthen
`and thenx to strip non—numeric components from result values. In this way, the
`arithmetic actions simply take a pair of expressions as their argument. In the defi-
`nitions given below for the actions, numval is the standard Miranda function which
`converts a string of digits to the corresponding number.
`
`value
`
`xs
`
`= Num (numval xs)
`
`(x,y) = X $Add y
`plus
`(X,y) = X $Sub y
`minus
`(X,y) = X $Mul y
`times
`divide (X,y) = X $Div y
`
`This completes the parser. For example, expn "2+(4-1)*3" gives
`
`[( Add (Num 2)
`( Add (Num 2)
`( Num 2
`
`(Num 1))
`(Sub (Num 4)
`(Mul
`(Sub (Num 4)
`(Num 1))
`
`(Num 3))
`
`),
`""
`,
`),
`"*3"
`,
`, "+(4-1)*3" )]
`
`More than one result is produced because the parser is not forced to consume all
`the input. As we would expect however, the longest parse is returned first. This
`behaviour results from careful ordering of the alternatives in the parser.
`Although a parse tree is the natural output from a parser, there is no such restric-
`tion in combinator parsing. For example, simply by replacing the standard semantic
`actions with the following set, we have an evaluator for arithmetic expressions.
`
`value
`
`xs
`
`numval xs
`
`(x,y) = X + y
`plus
`(x,y) = X - y
`minus
`(X,y) = X * y
`times
`divide (x,y) = X div y
`
`Under this interpretation,
`
`expn u2+(4__1)*3n =
`
`[:(11,nu)’
`
`(5’ll*31l)’
`
`(2’ll+(4_1)*3|I)]
`
`Blue Coat Systems — CORRECTED Exhibit 1058 Page 8
`
`
`
`Hig/1e7'—07'de7" Functions for Pa7's'mg
`
`9
`
`3 Layout Conventions
`
`Most programming languages have a set of layout rules, which specify how white-
`space (spaces, tabs and newlines) may be used to improve readability. In this sec-
`tion we show how two common layout conventions may be handled in combinator
`parsers.
`
`3.1 Free-format input
`
`At the syntactic level, programs comprise a sequence of tokens. Many languages
`adopt free—format input, imposing few restrictions on the use of white—space — it
`is not permitted inside tokens, but may be freely inserted between them, although
`it is only strictly necessary when two tokens would otherwise form a single larger
`token. White~space is normally stripped out along with comments during a separate
`lexical phase, in which the source program is divided into its component tokens.
`This approach is developed in section 4.3.
`For many simple parsers however, a separate lexer is not required (as is the case
`for the arithmetic expression parser of the previous section), but we still might want
`to allow the use of white—space. The nibble combinator provides a simple solution.
`The parser (nibble p) has the same behaviour as the parser p, except that it eats
`up any white—space in the input string before or afterwards:
`
`nibble :: parser char * —> parser char *
`
`nibble p = white $xthen p $thenx white
`where white = many (any literal " \t\n")
`
`The any combinator used in this definition can often be used to simplify parsers
`involving repeated use of literal or string. It is defined as follows:
`
`any ::
`
`(* -> parser ** ***)
`
`-> [>v<] —> parser ** ***
`
`any p = foldr (alt.p) fail
`
`The library function foldr captures a common pattern of recursion over lists. It
`takes a list, a binary operator (3) and a value a, and replaces each constructor “:”
`in the list by (8), and the empty list [] at the end by oz. For example, foldr (+) O
`[1 ,2,3] : 1+(2+(3+O)) = 6. As in this example, Oz is often chosen to be the right
`identity for (8). The infix dot “.” used in any denotes function composition, defined
`by (f .g) x = f
`(g x). It should be clear that any has the following behaviour:
`
`any p [x1,x2,...,xn] = (p x1) $alt
`
`(p X2) $alt
`
`$alt
`
`(p xn)
`
`In practice, nibble is often used in conjunction with the string combinator.
`The following abbreviation is useful in this case:
`
`symbol
`
`2:
`
`[char] —> parser char
`
`[char]
`
`symbol = nibb1e.string
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 9
`
`
`
`10
`
`Graham Hutton
`
`For example, applying the parser (symbol "hi") to the string "
`gives ("hi" , "there") as the first result.
`There are two points worth noting about freewformat input. First of all, it is
`good practice to indent programs to reveal their structure. Although freeeformat
`input allows us to do this, it does not prevent us doing it wrongly. Secondly, extra
`symbols are usually needed in programs to guide the parser in determining their
`structure. Classic examples are “begin”, “end” and semi—colon from Pascal.
`
`there",
`
`hi
`
`3.2 The offside rule
`
`Another approach to layout, as adopted by many functional languages, is to con-
`strain the generality of free—format input just enough so that extra symbols to guide
`the parser are no longer needed. This is normally done by imposing a weak inden-
`tation strategy, and having the parser make intelligent use of layout to determine
`the structure of programs, Consider for example the following program:
`
`a = b+c
`where
`b = 10
`c = 15-5
`
`d = a*2
`
`It is clear from the indention that a and d are intended to be global definitions, with
`b and c local to a. The constraint which guarantees that we can always determine
`the structure of programs in this way is usually given by Landin’s ofiside rule
`(Landin66), defined as follows:
`
`If a syntactic class obeys the offside rule, every token of an object of the class must lie
`either directly below, or to the right of its first token. A token which breaks this rule is
`said to be offside with respect to the object, and terminates its parse.
`
`In Miranda, the offside rule is applied to the body of definitions, so that special
`
`symbols to separate definitions, or indicate block structuring, are not required. The
`offside rule does not force a specific way of indenting programs, so we are still
`free to use our own personal styles. It is worthwhile noting that there are other
`interpretations of the offside rule. In particular, the proposed standard functional
`language, Haskell, takes a slightly different approach (Hudak90).
`
`3.3 The ofiside combinator
`
`In keeping with the spirit of combinator parsing, we would like to define a single
`combinator which encapsulates the ofiside rule. Given a parser p, we can imagine a
`parser offside p with the same behaviour, except that it is required to consume
`precisely those symbols which are onside with respect to the first symbol parsed.
`At present, parsers only see a suffix of the entire input string, having no knowledge
`of what has already been consumed by previous parsers. To implement the offside
`combinator however, we need some context information, to decide which symbols
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 10
`
`
`
`Higher'— Order Functions for Parsing
`
`11
`
`in the input are onside. Our approach to this extra information is the key to the
`offside combinator. Rather than actually passing an extra argument to parsers,
`we will assume that each symbol in the input string has been paired with its row
`and column position at some stage prior to parsing.
`To simplify to types of parsers involving the ofiside rule, we use the abbreviation
`(pos *) for a symbol of type * paired with its position.
`
`pos * == (*,(num,num))
`
`Since the input string is now assumed to contain the position of each symbol, the
`primitive parsing function satisfy must be changed slightly. As row and column
`numbers are present only to guide the parser, it is reasonable to have satisfy strip
`this information from consumed symbols. In this manner, the annotations in the
`input string are of no concern when building parsers, being entirely hidden within
`the parsing notation itself. The other parsers defined in terms of satisfy need a
`minor change to their types, but otherwise remain the same.
`
`satisfy ::
`
`(* -> bool)
`
`-> parser (pos *)
`
`*
`
`satisfy p []
`satisfy p (xzxs)
`
`= fail []
`succeed a xs , p a
`fail xs
`, otherwise
`where (a,(r,c)) = x
`
`We are now able to define the offside combinator. The only complication is
`that white—space must be treated as a special case, in never being offside. To avoid
`this problem, we assume that white—space has been stripped from the input prior
`to parsing. No layout information is lost, since each symbol in the input is paired
`with its position. In reality, most parsers will have a separate lexical phase anyway,
`in which both comments and white—space are stripped.
`
`offside :2 parser (pos *) ** -7 parser (pos *) **
`
`offside p inp = [(v,inpOFF)
`where
`
`I
`
`(v,[]) <- p inpUN]
`
`inpDN = takewhile (onside (hd inp)) inp
`inpDFF = drop (#inp0N)
`inp
`onside (a,(r,c)) (b,(r’,c’)) = r’>=r & c’>=c
`
`The offside rule tells us that for the parser (offside p) to succeed, it must consume
`precisely the onside symbols in the input string. As such, in the definition above
`it is sufficient to apply the parser p only to the longest onside prefix (inp0N). The
`pattern (v, [D in the list comprehension filters out parses which do not consume
`all such symbols. For successful parses, we simply return the result value v, and
`remaining portion of the input string (inpOFF). It is interesting to note that the
`offside combinator does not depend upon the structure of the symbols in the
`input, only that they are paired with their position. For example, it is irrelevant
`whether symbols are single characters or complete tokens.
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 11
`
`
`
`12
`
`Graham Hutton
`
`For completeness, we briefly explain the four standard Miranda functions used
`in offside. Given a list, the function (takewhile p) returns the longest prefix in
`which predicate p holds of each element. The function hd selects the first element
`of a list, and is defined by hd (x:xs) = x. The function (drop n) retains all but
`the first 11 elements of a list. Finally, “it” is the length operator for lists.
`
`4 Building Realistic Parsers
`
`Many simple grammars can be parsed in a single phase, but most programming
`languages need two distinct parsing phases — lexical and syntactic analysis. Since
`lexical analysis is nothing more than a simple form of parsing, it is not surprising
`to find that lexers themselves may be built as combinator parsers. In this section
`we work through an extended example, which shows how to build two—phase com-
`binator parsers, and demonstrates the use of the offside combinator.
`
`4.1 Example language
`
`We develop a parser for a small programming language, similar in form to Miranda.
`The following program shows all the syntactic features we are considering:
`
`fxy=addab
`where
`a=25
`
`b=subxy
`answer = mult (f 3 7) 5
`
`the parser should produce a parse tree of type
`If a program is well—formed,
`script, as defined below. Even though local definitions are attached to definitions
`in the language, it is normal to have them at the expression level in the parse tree.
`
`script
`def
`expn
`
`: := Script
`::= Def var
`::= Var var
`
`[def]
`[var] expn
`I Num num I expn $Apply expn I expn $Where [def]
`
`var == [char]
`
`The context~free aspects of the syntax are captured by the BNF grammar below.
`The non—terminals var and num correspond to variables and numbers, definedin
`the usual way. Ambiguity is resolved by the offside rule, applied to the body of
`definitions to avoid special symbols to separate definitions and delimit scope.
`
`::= defn*
`prog
`= U0/f‘+ “:” body
`defn
`= exp?" [“where” defn+]
`body
`= expr prim | prim
`ezpr
`prim ::= var | num | “(”e:cpr“)”
`
`in our language is ex-
`As we would expect, application associating to the left
`pressed by a left—recursive production rule in the grammar (expr). As already
`
`BmeComSymmns—CORRECTEDIkhmfi1058 Page12
`
`
`
`Higlier—07"der Functions for Parsing
`
`13
`
`mentioned in section 2.4 however, left—recursion and top—down parsing methods do
`not mix. If we are to build a combinator parser for this grammar, we must first
`eliminate the left—recursion. Consider the left—recursive production rule
`
`oz
`
`2:: afi|7
`
`in which it is assumed that 'y does not begin with an 04. The assumption ensures
`that the production has a non—recursive base case. (For the more general situation
`when there is more than one recursive production for Ct, the reader is referred to
`(Aho86).) What language is generated by a 7 Unwinding the recursion a few times,
`it is clear that a single 7, followed by any number of ,8s is acceptable. Thus, we
`would assert that a ::: 75* is equivalent to oz :2 cvfi
`| y. The proof is simple:
`
`#3" = 'y (/)’*fi | 5)
`= 73%’ l we
`= (yfi*) ,6 l 7
`: ozfi
`I 7
`
`{ properties of >+< }
`{ distributivity }
`{ properties of sequencing }
`{ definition of Oz }
`
`In our example language, this allows us to replace the left~recursive eccpr pro-
`duction rule with empr 2:: prim prim‘, which in turn simplifies to ewpr ::= priml.
`VVhile the languages accepted by the left—recursive and iterative production rules
`are provably equivalent, the parse trees will in fact be different. This problem can
`fixed by a simple action in the parser; we return to this point at the end of sec-
`tion 4.5.
`
`4 . 2 Layout analysis
`
`Recall that the offside combinator assumes white—space in the input is replaced by
`row and column annotations on the symbols. To this end, each character is paired
`with its position during a simple layout phase prior to lexical analysis. White—space
`itself will be stripped by the lexer, as is normal practice.
`
`prelex = pl (0,0)
`where
`
`= [1
`[J
`pl (r,c)
`pl (r,c) (xzxs) = (x,(r,c)) : pl (r,tab c) xs , x = ’\t’
`= (X,(r,C))
`: pl
`(r+1,0)
`xs , x = ’\n’
`= (x,(r,c)) : pl
`(r,c+1)
`xs , otherwise
`tab c = ((c div 8)+1)*8
`
`4 . 3 Lecztical analysis
`
`The primary function of lexical analysis is to d