throbber
Under consideration for publication in J. Functional Programming
`
`1
`
`Highcr—Order Functions for Parsing>i<
`
`Graham Hutton
`
`Department of Computer Science, University of Utrecht,
`PO Boar 80.089, 3508 TB Utrecht, The Netherlands.
`
`Abstract
`
`In combinatar parsing, the text of parsers resembles BNF notation. We present the basic
`method, and a number of extensions. We address the special problems presented by white-
`space, and parsers with separate lexical and syntactic phases. In particular, a combining
`form for handling the “offside rule” is given. Other extensions to the basic method include
`an “into” combining form with many useful applications, and a simple means by which
`combinator parsers can produce more informative error messages.
`
`1 Introduction
`
`Broadly speaking, a parser may be defined as a program which analyses text to
`determine its logical structure. For example, the parsing phase in a compiler takes
`a program text, and produces a parse tree which expounds the structure of the
`program. Many programs can be improved by having their input parsed. The form of
`input which is acceptable is usually defined by a context—free grammar, using BNF
`notation. Parsers themselves may be built by hand, but are most often generated
`automatically using tools like Lex and Yacc from Unix (Aho86).
`Although there are many methods of building parsing, one in particular has
`gained widespread acceptance for use in lazy functional languages. In this method,
`parsers are modelled directly as functions; larger parsers are built piecewise from
`smaller parsers using higher order functions. For example, we define higher order
`functions for sequencing, alternation and repetition. In this way, the text of parsers
`closely resembles BNF notation. Parsers in this style are quick to build, and simple
`to understand and modify. In the sequel, we refer to the method as combinator
`parsing, after the higher order functions used to combine parsers.
`Combinator parsing is considerably more powerful than the commonly used meth-
`ods, being able to handle ambiguous grammars, and providing full backtracking if
`it is needed. In fact, we can do more than just parsing. Semantic actions can be
`added to parsers, allowing their results to be manipulated in any way we please. For
`example, in section 2.4 we convert a parser for arithmetic expressions to an eval-
`uator simply by changing the semantic actions. More generally, we could imagine
`generating some form of abstract machine code as programs are parsed.
`
`* Appears in the Journal of Functional Programming 2(3):323—343, July 1992.
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 1
`
`

`
`2
`
`Graham Hutton
`
`Although the principles are widely known (due in most part to (Wadler85)),
`little has been written on combinator parsing itself. In this article, we present the
`basic method, and a number of extensions. The techniques may be used in any lazy
`functional language with a higher—order/polymorphic style type system. All our
`programming examples are given in Mirandal; features and standard functions are
`explained as they are used. A library of parsing functions taken from this paper is
`available by electronic mail from the author. Versions exist in both Miranda and
`Lazy ML.
`
`2 Parsing Using Combinators
`
`We begin by defining a type of parsers. A parser may be viewed as a function from
`a string of symbols to a result value. Since a parser might not consume the entire
`string, part of this result will be a suflix of the input string. Sometimes a parser may
`not be able to produce a result at all. For example, it may be expecting a letter, but
`find a digit. Rather than defining a special type for the success or failure of a parser,
`we choose to have parsers return a list of pairs as their result, with the empty list
`[] denoting failure, and a singleton list [(v,xs)] indicating success, with value v
`and unconsumed input xs. As we shall see in section 2.2, having parsers return a
`list of results proves very useful. Since we want to specify the type of any parser,
`regardless of the kind of symbols and results involved, these types are included as
`extra parameters. In Miranda, type variables are denoted by sequences of stars.
`
`parser * ** == [*1 -> [(**.[*])]
`
`For example, a parser for arithmetic expressions might have type (parser char
`expr), indicating that it takes a string of characters, and produces an expression
`tree. Notice that parser
`not a new type as such, but an abbreviation (or syn-
`onym); it’s only purpose is to make types involving parsers easier to understand.
`
`2.1 Primitive parsers
`
`The primitive parsers are the building blocks of combinator parsing. The first of
`these corresponds to the 5 symbol in BNF notation, denoting the empty string.
`The succeed parser always succeeds, without actually consuming any of the input
`string. Since the outcome of succeed does not depend upon its input, its result
`value must be pre—deter1nined, so is included as an extra parameter:
`
`succeed :: ** -> parser * **
`
`succeed v inp = [(v,inp)]
`
`This definition relies on partial application to work properly. The order of the argu-
`ments means that if succeed is supplied only one argument, the result is a parser
`(Le. a function) which always succeeds with this value. For example, (succeed 5)
`
`1 Miranda is a trademark of Research Software Limited.
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 2
`
`

`
`Hz'gher«Order Functions for Parsing
`
`3
`
`is a parser which always returns the value 5. Furthermore, even though succeed
`plainly has two arguments, its type would suggest it has only one. There is no
`magic, the second argument is simply hidden inside the type of the result, as would
`be clear upon expansion of the type according to the parser abbreviation.
`While succeed never fails, fail always does, regardless of the input string:
`
`fail :: parser * **
`
`fail inp = []
`
`The next function allows us to make parsers that recognise single symbols. Rather
`than enumerating the acceptable symbols, we find it more convenient to provide the
`set implicitly, via a predicate'which determines if an arbitrary symbol is a member.
`Successful parses return the consumed symbol as their result value.
`
`satisfy 2:
`
`(* —> bool) -> parser * *
`
`= fail []
`satisfy p []
`satisfy 13 (x:xs) = succeed x xs , p X
`= fail xs
`, otherwise
`
`Notice how succeed and fail are used in this example. Although they are not
`strictly necessary, their presence makes the parser easier to read. Note also that the
`parser (satisfy p) returns failure if supplied with an empty input string.
`Using satisfy we can define a parser for single syrribols:
`
`literal
`
`::
`
`* —> parser * *
`
`literal x = satisfy (=x)
`
`For example, applying the parser (literal ’3’) to the string "345" gives the
`result [(’3’ ,"45")]. In the definition of literal, (=x) is a function which tests
`its argument for equality with x. It is an example of operator sectioning, a useful
`syntactic convention which allows us to partially apply infix operators.
`
`2. 2 Combinators
`
`Now that we have the basic building blocks, we consider how they should be put
`together to form useful parsers. In BNF notation, larger grammars are built piece-
`wise from smaller ones using | to denote alternation, and juxtaposition to indicate
`sequencing. So that our parsers resemble BNF notation, we define higher order
`functions which correspond directly to these operators. Since higher order func-
`tions like these combine parsers to form other parsers, they are often referred to as
`combining forms or combinators. We will use these terms from now on.
`The alt combinator corresponds to alternation in BNF. The parser (pl $a1t
`p2) recognises anything that either pl or p2 would. Normally we would interpret
`either in a sequential (or exclusive) manner, returning the result of the first parser
`to succeed, and failure if neither does. This approach is taken in (Fairbairn86).
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 3
`
`

`
`4
`
`Graham Hutton
`
`In combinator parsing however, we use inclusive either — it is acceptable for both
`parsers to succeed, in which case we return both results. In general then, combinator
`parsers may return an arbitrary number of results. This explains our decision earlier
`to have parsers return a list of results.
`
`With parsers returning a list, alt is implemented simply by appending (denoted
`by ++ in Miranda) the result of applying both parsers to the input string. In keeping
`with the BNF notation, we use the Miranda $ notation to convert alt to an infix
`operator. Just as for sectioning, the infix notation is merely a syntactic convenience:
`(x $f y) is equivalent to (f x y) in all contexts.
`
`alt
`
`:: parser * ** -> parser * H -> parser * **
`
`(pl $alt p2)
`
`inp = pl
`
`inp ++ p2 inp
`
`is the identity element for ++, it is easy to verify
`[]
`Knowing that the empty—list
`from this definition that failure is the identity element for alternation: (fail $alt
`p) = (p $alt fail) : p. In practical terms this means that alt has the expected
`behaviour if only one of the argument parsers succeeds. Similarly, alt inherits
`associativity from ++: (p $alt q) $alt r : p $alt (q $alt r). This means we
`do not not need to worry about bracketing repeated alternation correctly.
`Allowing parsers to produce more than one result allows us to handle ambiguous
`grammars, with all possible parses being produced for an ambiguous string. The
`feature has proved particularly useful in natural language processing (Frost88). An
`example ambiguous string from (Frost88) is “Who discovered a moon that orbits
`Mars or Jupiter 7” Most often however, we are only interested in the single longest
`parse of a string (i.e. that which consumes the most symbols). For this reason, it is
`normal in combinator parsing to arrange for the parses to be returned in descending
`order of length. All that is required is a little care in the ordering of the argument
`parsers to alt. See for example the many combinator in the next section.
`The then combinator corresponds to sequencing in BNF. The parser (pl $then
`p2) recognises anything that p1 and p2 would if placed in succession. Since the first
`parser may succeed with many results, each with an input stream suffix, the second
`parser must be applied to each of these in turn. In this manner, two results are
`produced for each successful parse, one from each parser. They are combined (by
`pairing) to form a single result for the compound parser.
`
`then :: parser * ** -> parser * **»< —> parser *
`
`(**,***)
`
`(pl $then p2)
`
`inp = [((v1,v2),out2)
`
`I
`
`inp;
`(v1,out1) <— pl
`(v2,out2) <— p2 out1]
`
`For example, applying the parser (literal ’a’ $then literal ’b’) to the input
`"abcd" gives the result [((’a’ , ’b’) , "cd“)] . The then combinator is an excellent
`example of list compre/tension notation, analogous to set comprehension in math-
`ematics (eg. {:52 l 00 E IN /\ at < 10} defines the first ten squares), except that lists
`replace sets, and elements are drawn in a determined order. Much of the elegance
`of the then combinator would be lost if this notation were not available.
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 4
`
`

`
`Hig/ter—O7“de7" Functions for Parsing
`
`5
`
`Unlike alternation, sequencing is not associative, due to the tupling of results from
`the component parsers. In Miranda, all infix operators made using the $ notation
`are assumed to associate to the right. Thus, when we write (p $then q $then r)
`it is interpreted as (p $then (q $then
`
`2.3 Manipulating values
`
`Part of the result from a parser is a value. The using combinator allows us to
`manipulate these results, building a parse tree being the most common application.
`The parser (p Susing f) has the same behaviour as the parser p, except that the
`function f is applied to each of its result values:
`
`using :: parser * ** -> (** -> ***) —> parser * ***
`
`(p $using ;f)
`
`inp = [(f v,out)
`
`I
`
`(v,out) <— p inp]
`
`Although using has no counterpart in pure BNF notation, it does have much in
`common with the
`-
`operator in Yacc (Aho86). In fact, the using combinator
`does not restrict us to building parse trees. Arbitrary semantic actions can be
`used. For example, in section 2.4 we convert a parser for arithmetic expressions
`to an evaluator simply by changing the actions. There is a clear connection here
`with attribute grammars. A recent and relevant article on attribute grammars is
`(Johnsson87). A combinator parser may be viewed as the implementation in a lazy
`functional language of an attribute grammar in which every node has one inherited
`attribute (the input string), and two synthesised attributes (the result value of the
`parse and the unconsumed part of the input string.) In the remainder of this section
`we define some useful new parsers and combinators in terms of our primitives.
`In BNF notation, repetition occurs often enough to merit its own abbreviation.
`When zero or more repetitions of a phrase p are admissible, we simply write p*. For-
`mally, this notation is defined by the equation p* : p p* | 5. The many combinator
`corresponds directly to this operator, and is defined in much the same way:
`
`many :: parser >o< ** —> parser *
`
`[**]
`
`many p = ((p $then many p) $using cons) $a1t
`
`(succeed [D
`
`The action cons is the uncurried version of the list constructor “ : ”, and is defined by
`cons (x,xs) = x:xs. Since combinator parsers return all possible parses according
`to a grammar, if failure occurs on the nth application of (many p), it results will be
`returned, one for each of the O to n—1 successful applications. Following convention,
`the results are returned in descending order of length. For example, applying the
`parser many (literal ’a’) to the string "aaab" gives the list
`
`[:(IIaaaII , nbu) , (naan ’ nabu) ’ (Ivan ) uaabu) ’ (nu , naaab1I)]
`
`Not surprisingly, the next parser corresponds to the other common iterative form
`in BNF, defined by pl” = p p‘. The parser (some p) has the same behaviour as
`(many p), except that it accepts one or more repetitions of p, rather of zero or
`more:
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 5
`
`

`
`6
`
`Graham Hutton
`
`some :: parser * ** -> parser *
`
`[**]
`
`some p = (p $then many p) $using cons
`
`Note that (some p) may fail, whereas (many p) always succeeds. Using some we
`define parsers for number and words —~ non—empty sequences of digits and letters:
`
`number
`word
`
`:: parser char
`:: parser char
`
`[char]
`[char]
`
`number = some (satisfy digit)
`where digit X = ’0’ <= x <= ’9’
`
`word = some (satisfy letter)
`where letter x = (’a’ <= x <= ’z’) \/ (’A’ <= x <= ’Z’)
`
`The next combinator is a generalisation of the literal primitive, allowing us
`build parsers which recognise strings of symbols, rather than just single symbols:
`
`string ::
`
`[*] —> parser *
`
`[*]
`
`= succeed []
`string []
`string (xzxs) = (literal x $then string xs) $using cons
`
`For example, applying the parser (string "begin") to the string "begin end"
`gives the output [(“begin" , " end")]. It is important to note that (string xs)
`fails if only a prefix of the sequence xs is available in the input string.
`As well as being used the define other parsers, the using combinator is often
`used to prune unwanted components from a parse tree. Recall that two parsers
`composed in sequence produce a pair of results. Sometimes we are only interested in
`one component of the pair. For example, it is common to throw away reserved words
`such as “begin” and “where” during parsing. In such cases, two special versions of
`the then combinator are useful, which throw away either the left or right result
`values, as reflected by the position of the letter “X” in their names:
`
`xthen :: parser * >o<>o< -> parser * *** -> parser * ***
`thenx :: parser * ** -> parser * *** —> parser * **
`
`pl $xthen p2 = (pl $then p2) $using snd
`pl $thenx p2 = (pl $then p2) $using fst
`
`The actions fst and snd are the standard projection functions on pairs, defined by
`fst (x,y) = x and snd (x,y) = y.
`Sometimes we are not interested in the result from a parser at all, only that the
`parser succeeds. For example, if we find a reserved word during lexical analysis, it
`may be convenient to return some short representation rather than the string itself.
`The return combinator is useful in such cases. The parser (p $return v) has the
`same behaviour as p, except that it returns the value v if successful:
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 6
`
`

`
`Hz'gher—Order Functions for Parsing
`
`7
`
`return :: parser * M —> =«** —> parser * ***
`
`p $return v = p $using (const V)
`where const x y = x
`
`2.4 Eccample
`
`To conclude our introduction to combinator parsing, we will work through the
`derivation of a simple parser. Suppose we have a program which works with arith-
`metic expressions, defined in Miranda as follows:
`
`expr
`
`::= Num num I expr $Add expr
`I expr $Mu1 expr
`
`I expr $Sub expr
`I expr $Div expr
`
`We can imagine a function showexpr which converts terms of type expr to the
`normal arithmetic notation. For example,
`
`showexpr
`
`((Num 3) $Mu1 ((Num 6) $Add (Num 1))) = "3*(6+1)"
`
`While such pretty—printing is notionally quite simple, the inverse operation, pars-
`ing, is usually thought of as being much more involved. As we shall see however,
`building a combinator parser for arithmetic expressions is no more complicated
`than implementing the showexpr function.
`Before we start thinking about parsing, we must define a BNF grammar for
`expressions. To begin with, the definition for the type expr may itself be cast in
`BNF notation. All we need do is include parenthesised expressions as an extra case:
`
`empri
`
`:2:
`
`emprt + empn I expri — emprt I
`eacpn >l< eztpn I eccpn/easpn I
`dtgtt+ I (e:L'prL)
`
`in practice it
`Although this grammar could be used as the basis of the parser,
`is useful to impose a little more structure. To simplify expressions, multiplication
`and division are normally assumed to have higher precedence than addition and
`subtraction. For example, 3 + 5 a< 2 is interpreted as 3 + (5 * 2). In terms of our
`grammar, we introduce a new non—terminal for each level of precedence:
`
`exprt
`term
`factor
`
`term —I- term I term — term I term
`2::
`= factor >I< factor I factor / factor I factor
`:2:
`dtgz't+ I (catprt)
`
`VVhile addition and multiplication are clearly associative, division and subtraction
`are normally assumed to associate to the left. The natural way to express this
`convention in the grammar is with left recursive production rules (such as exp/n. ::=
`exprt — term). Unfortunately, in top—down methods such as combinator parsing, it
`is well known that left—recursion leads to non—termination of the parser (Aho86).
`In section 4.1 we show how to transform a grammar to eliminate left—recursion. For
`the present however, we will leave the grammar as above, and use extra parenthesis
`to disambiguate expressions involving repeated operations.
`
`Blue Coat Systems — CORRECTED Exhibit 1058 Page 7
`
`

`
`8
`
`Graham Hutton
`
`Now that we have a grammar for expressions, it is a simple step to build a com-
`binator parser. The BNF description is simply re—written in combinator notation,
`and augmented with semantic actions to manipulate the result values:
`
`expn
`
`term
`
`$a1t
`= ((term $then literal ’+’ $xthen term) $using plus)
`((term $then literal ’-’ $xthen term) $using minus) $a1t
`term
`
`$alt
`((factor $then literal ’*’ $xthen factor) $using times)
`((factor $then literal ’/’ $xthen factor) $using divide) $alt
`factor
`
`factor = (number $using value) $alt
`(literal ’(’ $xthen expn $thenX literal ’)’)
`
`Note that the parser makes use of the special sequential combining forms Xthen
`and thenx to strip non—numeric components from result values. In this way, the
`arithmetic actions simply take a pair of expressions as their argument. In the defi-
`nitions given below for the actions, numval is the standard Miranda function which
`converts a string of digits to the corresponding number.
`
`value
`
`xs
`
`= Num (numval xs)
`
`(x,y) = X $Add y
`plus
`(X,y) = X $Sub y
`minus
`(X,y) = X $Mul y
`times
`divide (X,y) = X $Div y
`
`This completes the parser. For example, expn "2+(4-1)*3" gives
`
`[( Add (Num 2)
`( Add (Num 2)
`( Num 2
`
`(Num 1))
`(Sub (Num 4)
`(Mul
`(Sub (Num 4)
`(Num 1))
`
`(Num 3))
`
`),
`""
`,
`),
`"*3"
`,
`, "+(4-1)*3" )]
`
`More than one result is produced because the parser is not forced to consume all
`the input. As we would expect however, the longest parse is returned first. This
`behaviour results from careful ordering of the alternatives in the parser.
`Although a parse tree is the natural output from a parser, there is no such restric-
`tion in combinator parsing. For example, simply by replacing the standard semantic
`actions with the following set, we have an evaluator for arithmetic expressions.
`
`value
`
`xs
`
`numval xs
`
`(x,y) = X + y
`plus
`(x,y) = X - y
`minus
`(X,y) = X * y
`times
`divide (x,y) = X div y
`
`Under this interpretation,
`
`expn u2+(4__1)*3n =
`
`[:(11,nu)’
`
`(5’ll*31l)’
`
`(2’ll+(4_1)*3|I)]
`
`Blue Coat Systems — CORRECTED Exhibit 1058 Page 8
`
`

`
`Hig/1e7'—07'de7" Functions for Pa7's'mg
`
`9
`
`3 Layout Conventions
`
`Most programming languages have a set of layout rules, which specify how white-
`space (spaces, tabs and newlines) may be used to improve readability. In this sec-
`tion we show how two common layout conventions may be handled in combinator
`parsers.
`
`3.1 Free-format input
`
`At the syntactic level, programs comprise a sequence of tokens. Many languages
`adopt free—format input, imposing few restrictions on the use of white—space — it
`is not permitted inside tokens, but may be freely inserted between them, although
`it is only strictly necessary when two tokens would otherwise form a single larger
`token. White~space is normally stripped out along with comments during a separate
`lexical phase, in which the source program is divided into its component tokens.
`This approach is developed in section 4.3.
`For many simple parsers however, a separate lexer is not required (as is the case
`for the arithmetic expression parser of the previous section), but we still might want
`to allow the use of white—space. The nibble combinator provides a simple solution.
`The parser (nibble p) has the same behaviour as the parser p, except that it eats
`up any white—space in the input string before or afterwards:
`
`nibble :: parser char * —> parser char *
`
`nibble p = white $xthen p $thenx white
`where white = many (any literal " \t\n")
`
`The any combinator used in this definition can often be used to simplify parsers
`involving repeated use of literal or string. It is defined as follows:
`
`any ::
`
`(* -> parser ** ***)
`
`-> [>v<] —> parser ** ***
`
`any p = foldr (alt.p) fail
`
`The library function foldr captures a common pattern of recursion over lists. It
`takes a list, a binary operator (3) and a value a, and replaces each constructor “:”
`in the list by (8), and the empty list [] at the end by oz. For example, foldr (+) O
`[1 ,2,3] : 1+(2+(3+O)) = 6. As in this example, Oz is often chosen to be the right
`identity for (8). The infix dot “.” used in any denotes function composition, defined
`by (f .g) x = f
`(g x). It should be clear that any has the following behaviour:
`
`any p [x1,x2,...,xn] = (p x1) $alt
`
`(p X2) $alt
`
`$alt
`
`(p xn)
`
`In practice, nibble is often used in conjunction with the string combinator.
`The following abbreviation is useful in this case:
`
`symbol
`
`2:
`
`[char] —> parser char
`
`[char]
`
`symbol = nibb1e.string
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 9
`
`

`
`10
`
`Graham Hutton
`
`For example, applying the parser (symbol "hi") to the string "
`gives ("hi" , "there") as the first result.
`There are two points worth noting about freewformat input. First of all, it is
`good practice to indent programs to reveal their structure. Although freeeformat
`input allows us to do this, it does not prevent us doing it wrongly. Secondly, extra
`symbols are usually needed in programs to guide the parser in determining their
`structure. Classic examples are “begin”, “end” and semi—colon from Pascal.
`
`there",
`
`hi
`
`3.2 The offside rule
`
`Another approach to layout, as adopted by many functional languages, is to con-
`strain the generality of free—format input just enough so that extra symbols to guide
`the parser are no longer needed. This is normally done by imposing a weak inden-
`tation strategy, and having the parser make intelligent use of layout to determine
`the structure of programs, Consider for example the following program:
`
`a = b+c
`where
`b = 10
`c = 15-5
`
`d = a*2
`
`It is clear from the indention that a and d are intended to be global definitions, with
`b and c local to a. The constraint which guarantees that we can always determine
`the structure of programs in this way is usually given by Landin’s ofiside rule
`(Landin66), defined as follows:
`
`If a syntactic class obeys the offside rule, every token of an object of the class must lie
`either directly below, or to the right of its first token. A token which breaks this rule is
`said to be offside with respect to the object, and terminates its parse.
`
`In Miranda, the offside rule is applied to the body of definitions, so that special
`
`symbols to separate definitions, or indicate block structuring, are not required. The
`offside rule does not force a specific way of indenting programs, so we are still
`free to use our own personal styles. It is worthwhile noting that there are other
`interpretations of the offside rule. In particular, the proposed standard functional
`language, Haskell, takes a slightly different approach (Hudak90).
`
`3.3 The ofiside combinator
`
`In keeping with the spirit of combinator parsing, we would like to define a single
`combinator which encapsulates the ofiside rule. Given a parser p, we can imagine a
`parser offside p with the same behaviour, except that it is required to consume
`precisely those symbols which are onside with respect to the first symbol parsed.
`At present, parsers only see a suffix of the entire input string, having no knowledge
`of what has already been consumed by previous parsers. To implement the offside
`combinator however, we need some context information, to decide which symbols
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 10
`
`

`
`Higher'— Order Functions for Parsing
`
`11
`
`in the input are onside. Our approach to this extra information is the key to the
`offside combinator. Rather than actually passing an extra argument to parsers,
`we will assume that each symbol in the input string has been paired with its row
`and column position at some stage prior to parsing.
`To simplify to types of parsers involving the ofiside rule, we use the abbreviation
`(pos *) for a symbol of type * paired with its position.
`
`pos * == (*,(num,num))
`
`Since the input string is now assumed to contain the position of each symbol, the
`primitive parsing function satisfy must be changed slightly. As row and column
`numbers are present only to guide the parser, it is reasonable to have satisfy strip
`this information from consumed symbols. In this manner, the annotations in the
`input string are of no concern when building parsers, being entirely hidden within
`the parsing notation itself. The other parsers defined in terms of satisfy need a
`minor change to their types, but otherwise remain the same.
`
`satisfy ::
`
`(* -> bool)
`
`-> parser (pos *)
`
`*
`
`satisfy p []
`satisfy p (xzxs)
`
`= fail []
`succeed a xs , p a
`fail xs
`, otherwise
`where (a,(r,c)) = x
`
`We are now able to define the offside combinator. The only complication is
`that white—space must be treated as a special case, in never being offside. To avoid
`this problem, we assume that white—space has been stripped from the input prior
`to parsing. No layout information is lost, since each symbol in the input is paired
`with its position. In reality, most parsers will have a separate lexical phase anyway,
`in which both comments and white—space are stripped.
`
`offside :2 parser (pos *) ** -7 parser (pos *) **
`
`offside p inp = [(v,inpOFF)
`where
`
`I
`
`(v,[]) <- p inpUN]
`
`inpDN = takewhile (onside (hd inp)) inp
`inpDFF = drop (#inp0N)
`inp
`onside (a,(r,c)) (b,(r’,c’)) = r’>=r & c’>=c
`
`The offside rule tells us that for the parser (offside p) to succeed, it must consume
`precisely the onside symbols in the input string. As such, in the definition above
`it is sufficient to apply the parser p only to the longest onside prefix (inp0N). The
`pattern (v, [D in the list comprehension filters out parses which do not consume
`all such symbols. For successful parses, we simply return the result value v, and
`remaining portion of the input string (inpOFF). It is interesting to note that the
`offside combinator does not depend upon the structure of the symbols in the
`input, only that they are paired with their position. For example, it is irrelevant
`whether symbols are single characters or complete tokens.
`
`Blue Coat Systems - CORRECTED Exhibit 1058 Page 11
`
`

`
`12
`
`Graham Hutton
`
`For completeness, we briefly explain the four standard Miranda functions used
`in offside. Given a list, the function (takewhile p) returns the longest prefix in
`which predicate p holds of each element. The function hd selects the first element
`of a list, and is defined by hd (x:xs) = x. The function (drop n) retains all but
`the first 11 elements of a list. Finally, “it” is the length operator for lists.
`
`4 Building Realistic Parsers
`
`Many simple grammars can be parsed in a single phase, but most programming
`languages need two distinct parsing phases — lexical and syntactic analysis. Since
`lexical analysis is nothing more than a simple form of parsing, it is not surprising
`to find that lexers themselves may be built as combinator parsers. In this section
`we work through an extended example, which shows how to build two—phase com-
`binator parsers, and demonstrates the use of the offside combinator.
`
`4.1 Example language
`
`We develop a parser for a small programming language, similar in form to Miranda.
`The following program shows all the syntactic features we are considering:
`
`fxy=addab
`where
`a=25
`
`b=subxy
`answer = mult (f 3 7) 5
`
`the parser should produce a parse tree of type
`If a program is well—formed,
`script, as defined below. Even though local definitions are attached to definitions
`in the language, it is normal to have them at the expression level in the parse tree.
`
`script
`def
`expn
`
`: := Script
`::= Def var
`::= Var var
`
`[def]
`[var] expn
`I Num num I expn $Apply expn I expn $Where [def]
`
`var == [char]
`
`The context~free aspects of the syntax are captured by the BNF grammar below.
`The non—terminals var and num correspond to variables and numbers, definedin
`the usual way. Ambiguity is resolved by the offside rule, applied to the body of
`definitions to avoid special symbols to separate definitions and delimit scope.
`
`::= defn*
`prog
`= U0/f‘+ “:” body
`defn
`= exp?" [“where” defn+]
`body
`= expr prim | prim
`ezpr
`prim ::= var | num | “(”e:cpr“)”
`
`in our language is ex-
`As we would expect, application associating to the left
`pressed by a left—recursive production rule in the grammar (expr). As already
`
`BmeComSymmns—CORRECTEDIkhmfi1058 Page12
`
`

`
`Higlier—07"der Functions for Parsing
`
`13
`
`mentioned in section 2.4 however, left—recursion and top—down parsing methods do
`not mix. If we are to build a combinator parser for this grammar, we must first
`eliminate the left—recursion. Consider the left—recursive production rule
`
`oz
`
`2:: afi|7
`
`in which it is assumed that 'y does not begin with an 04. The assumption ensures
`that the production has a non—recursive base case. (For the more general situation
`when there is more than one recursive production for Ct, the reader is referred to
`(Aho86).) What language is generated by a 7 Unwinding the recursion a few times,
`it is clear that a single 7, followed by any number of ,8s is acceptable. Thus, we
`would assert that a ::: 75* is equivalent to oz :2 cvfi
`| y. The proof is simple:
`
`#3" = 'y (/)’*fi | 5)
`= 73%’ l we
`= (yfi*) ,6 l 7
`: ozfi
`I 7
`
`{ properties of >+< }
`{ distributivity }
`{ properties of sequencing }
`{ definition of Oz }
`
`In our example language, this allows us to replace the left~recursive eccpr pro-
`duction rule with empr 2:: prim prim‘, which in turn simplifies to ewpr ::= priml.
`VVhile the languages accepted by the left—recursive and iterative production rules
`are provably equivalent, the parse trees will in fact be different. This problem can
`fixed by a simple action in the parser; we return to this point at the end of sec-
`tion 4.5.
`
`4 . 2 Layout analysis
`
`Recall that the offside combinator assumes white—space in the input is replaced by
`row and column annotations on the symbols. To this end, each character is paired
`with its position during a simple layout phase prior to lexical analysis. White—space
`itself will be stripped by the lexer, as is normal practice.
`
`prelex = pl (0,0)
`where
`
`= [1
`[J
`pl (r,c)
`pl (r,c) (xzxs) = (x,(r,c)) : pl (r,tab c) xs , x = ’\t’
`= (X,(r,C))
`: pl
`(r+1,0)
`xs , x = ’\n’
`= (x,(r,c)) : pl
`(r,c+1)
`xs , otherwise
`tab c = ((c div 8)+1)*8
`
`4 . 3 Lecztical analysis
`
`The primary function of lexical analysis is to d

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket