Chapter 12 Lexer and parser generators (ocamllex, ocamlyacc)
This chapter describes two program generators: ocamllex, which produces a lexical analyzer from a set of regular expressions with associated semantic actions, and ocamlyacc, which produces a parser from a grammar with associated semantic actions. These program generators are very close to the well-known lex and yacc commands that can be found in most C programming environments. This chapter assumes a working knowledge of lex and yacc: while it describes the input syntax for ocamllex and ocamlyacc and the main differences with lex and yacc, it does not explain the basics of writing a lexer or parser description in lex and yacc. Readers unfamiliar with lex and yacc are referred to “Compilers: principles, techniques, and tools” by Aho, Sethi and Ullman (Addison-Wesley, 1986), or “Lex & Yacc”, by Levine, Mason and Brown (O'Reilly, 1992).
12.1 Overview of ocamllex
The ocamllex command produces a lexical analyzer from a set of regular expressions with attached semantic actions, in the style of lex. Assuming the input file is lexer.mll, executing
ocamllex lexer.mll
produces Caml code for a lexical analyzer in file lexer.ml. This file defines one lexing function per entry point in the lexer definition. These functions have the same names as the entry points. Lexing functions take as argument a lexer buffer, and return the semantic attribute of the corresponding entry point.
Lexer buffers are an abstract data type implemented in the standard library module Lexing. The functions Lexing.from_channel, Lexing.from_string and Lexing.from_function create lexer buffers that read from an input channel, a character string, or any reading function, respectively. (See the description of module Lexing in chapter 20.)
When used in conjunction with a parser generated by ocamlyacc, the semantic actions compute a value belonging to the type token defined by the generated parsing module. (See the description of ocamlyacc below.)
12.1.1 Options
The following command-line options are recognized by ocamllex.
12.2 Syntax of lexer definitions
The format of lexer definitions is as follows:
{ header }
let ident = regexp …
rule entrypoint [arg1… argn] =
parse regexp { action }
| …
| regexp { action }
and entrypoint [arg1… argn] =
parse …
and …
{ trailer }
Comments are delimited by (* and *), as in Caml. The parse keyword can be replaced by the shortest keyword, with the semantic consequences explained below.
12.2.1 Header and trailer
The header and trailer sections are arbitrary Caml text enclosed in curly braces. Either or both can be omitted. If present, the header text is copied as is at the beginning of the output file and the trailer text at the end. Typically, the header section contains the open directives required by the actions, and possibly some auxiliary functions used in the actions.
12.2.2 Naming regular expressions
Between the header and the entry points, one can give names to frequently-occurring regular expressions. This is written let ident = regexp. In regular expressions that follow this declaration, the identifier ident can be used as shorthand for regexp.
12.2.3 Entry points
The names of the entry points must be valid identifiers for Caml values (starting with a lowercase letter). Similarly, the arguments arg1… argn must be valid Caml identifiers. Each entry point becomes a Caml function that takes n+1 arguments, the extra implicit last argument being of type Lexing.lexbuf. Characters are read from the Lexing.lexbuf argument and matched against the regular expressions provided in the rule, until a prefix of the input matches one of the regular expressions of the rule. The corresponding action is then evaluated and its value is returned as the result of the function.
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
However, if lexer rules are introduced with the shortest keyword in place of the parse keyword, then the “shortest match” rule applies: the shortest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is still selected. This feature is not intended for use in ordinary lexical analyzers; it may facilitate the use of ocamllex as a simple text processing tool.
12.2.4 Regular expressions
The regular expressions are in the style of lex, with a more Caml-like syntax.
Concerning the precedences of operators, * and + have highest precedence, followed by ?, then concatenation, then | (alternation), then as.
12.2.5 Actions
The actions are arbitrary Caml expressions. They are evaluated in a context where the identifiers defined by using the as construct are bound to subparts of the matched string. Additionally, lexbuf is bound to the current lexer buffer. Some typical uses for lexbuf, in conjunction with the operations on lexer buffers provided by the Lexing standard library module, are listed below.
12.2.6 Variables in regular expressions
The as construct is similar to “groups” as provided by numerous regular expression packages. The type of these variables can be string, char, string option or char option.
We first consider the case of linear patterns, that is the case when all as bound variables are distinct. In regexp as ident, the type of ident normally is string (or string option) except when regexp is a character constant, an underscore, a string constant of length one, a character set specification, or an alternation of those. Then, the type of ident is char (or char option).
Option types are introduced when overall rule matching does not imply matching of the bound sub-pattern. This is in particular the case of ( regexp as ident ) ? and of regexp1 | ( regexp2 as ident ).
There is no linearity restriction over as bound variables. When a variable is bound more than once, the previous rules are to be extended as follows:
For instance, in
In some cases, a successful match may not yield a unique set of bindings.
For instance the matching of
12.2.7 Reserved identifiers
All identifiers starting with __ocaml_lex are reserved for use by ocamllex; do not use any such identifier in your programs.
12.3 Overview of ocamlyacc
The ocamlyacc command produces a parser from a context-free grammar specification with attached semantic actions, in the style of yacc. Assuming the input file is grammar.mly, executing
ocamlyacc options grammar.mly
produces Caml code for a parser in the file grammar.ml, and its interface in file grammar.mli.
The generated module defines one parsing function per entry point in the grammar. These functions have the same names as the entry points. Parsing functions take as arguments a lexical analyzer (a function from lexer buffers to tokens) and a lexer buffer, and return the semantic attribute of the corresponding entry point. Lexical analyzer functions are usually generated from a lexer specification by the ocamllex program. Lexer buffers are an abstract data type implemented in the standard library module Lexing. Tokens are values from the concrete type token, defined in the interface file grammar.mli produced by ocamlyacc.
12.4 Syntax of grammar definitions
Grammar definitions have the following format:
%{
header
%}
declarations
%%
rules
%%
trailer
Comments are enclosed between /* and */, as in C.
12.4.1 Header and trailer
The header and the trailer sections are Caml code that is copied as is into file grammar.ml. Both sections are optional. The header goes at the beginning of the output file; it usually contains open directives and auxiliary functions required by the semantic actions of the rules. The trailer goes at the end of the output file.
12.4.2 Declarations
Declarations are given one per line. They all start with a % sign.
12.4.3 Rules
The syntax for rules is as usual:
nonterminal :
symbol … symbol { semantic-action }
| …
| symbol … symbol { semantic-action }
;
Rules can also contain the %prec symbol directive in the right-hand side part, to override the default precedence and associativity of the rule with the precedence and associativity of the given symbol.
Semantic actions are arbitrary Caml expressions that are evaluated to produce the semantic attribute attached to the defined nonterminal. The semantic actions can access the semantic attributes of the symbols in the right-hand side of the rule with the $ notation: $1 is the attribute of the first (leftmost) symbol, $2 is the attribute of the second symbol, etc.
The rules may contain the special symbol error to indicate resynchronization points, as in yacc.
Actions occurring in the middle of rules are not supported.
Nonterminal symbols are like regular Caml symbols, except that they cannot end with ' (single quote).
12.4.4 Error handling
Error recovery is supported as follows: when the parser reaches an error state (no grammar rules can apply), it calls a function named parse_error with the string "syntax error" as argument. The default parse_error function does nothing and returns, thus initiating error recovery (see below). The user can define a customized parse_error function in the header section of the grammar file.
The parser also enters error recovery mode if one of the grammar actions raises the Parsing.Parse_error exception.
In error recovery mode, the parser discards states from the stack until it reaches a place where the error token can be shifted. It then discards tokens from the input until it finds three successive tokens that can be accepted, and starts processing with the first of these. If no state can be uncovered where the error token can be shifted, then the parser aborts by raising the Parsing.Parse_error exception.
Refer to documentation on yacc for more details and guidance in how to use error recovery.
12.5 Options
The ocamlyacc command recognizes the following options:
At run-time, the ocamlyacc-generated parser can be debugged by setting the p option in the OCAMLRUNPARAM environment variable (see section 10.2). This causes the pushdown automaton executing the parser to print a trace of its action (tokens shifted, rules reduced, etc). The trace mentions rule numbers and state numbers that can be interpreted by looking at the file grammar.output generated by ocamlyacc -v.
12.6 A complete example
The all-time favorite: a desk calculator. This program reads arithmetic expressions on standard input, one per line, and prints their values. Here is the grammar definition:
/* File parser.mly */
%token <int> INT
%token PLUS MINUS TIMES DIV
%token LPAREN RPAREN
%token EOL
%left PLUS MINUS /* lowest precedence */
%left TIMES DIV /* medium precedence */
%nonassoc UMINUS /* highest precedence */
%start main /* the entry point */
%type <int> main
%%
main:
    expr EOL                { $1 }
;
expr:
    INT                     { $1 }
  | LPAREN expr RPAREN      { $2 }
  | expr PLUS expr          { $1 + $3 }
  | expr MINUS expr         { $1 - $3 }
  | expr TIMES expr         { $1 * $3 }
  | expr DIV expr           { $1 / $3 }
  | MINUS expr %prec UMINUS { - $2 }
;
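The grammar above relies on the default parse_error function. As described in section 12.4.4, a customized version can be defined in the header section of the grammar file. The following is a minimal sketch of such a definition, written as plain Caml code; the counter error_count and the message prefix are hypothetical names chosen for illustration and are not part of ocamlyacc.

```ocaml
(* Hypothetical contents for the header section (between %{ and %})
   of parser.mly.  The generated parser calls parse_error with the
   string "syntax error" when it reaches an error state. *)

(* Assumed helper: counts the syntax errors reported so far. *)
let error_count = ref 0

(* Record the error and report it on standard error, then return,
   which lets the parser start error recovery. *)
let parse_error (msg : string) : unit =
  incr error_count;
  prerr_endline ("calc: " ^ msg)
```

Since the calculator grammar contains no error symbol, the parser cannot resynchronize after calling this function and would still abort by raising Parsing.Parse_error.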
Here is the definition for the corresponding lexer:
(* File lexer.mll *)
{
open Parser (* The type token is defined in parser.mli *)
exception Eof
}
rule token = parse
    [' ' '\t']        { token lexbuf }     (* skip blanks *)
  | ['\n' ]           { EOL }
  | ['0'-'9']+ as lxm { INT(int_of_string lxm) }
  | '+'               { PLUS }
  | '-'               { MINUS }
  | '*'               { TIMES }
  | '/'               { DIV }
  | '('               { LPAREN }
  | ')'               { RPAREN }
  | eof               { raise Eof }
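In the lexer above, the pattern ['0'-'9']+ consumes a whole run of digits because of the “longest match” rule of section 12.2.3. The sketch below imitates that selection rule, and its “shortest match” counterpart, in plain Caml. It is only an illustration under simplifying assumptions: patterns are literal strings rather than regular expressions, and the names is_prefix, select, longest_match, shortest_match and rules are hypothetical, not part of ocamllex.

```ocaml
(* A "rule" is a literal pattern paired with the name of its action.
   Real ocamllex rules use regular expressions; literal strings keep
   the sketch short. *)

(* [is_prefix pat input] is true when [input] starts with [pat]. *)
let is_prefix (pat : string) (input : string) : bool =
  String.length pat <= String.length input
  && String.sub input 0 (String.length pat) = pat

(* [select better rules input] scans the rules in order and keeps the
   current best match unless a later match is strictly better, so that
   in case of a tie the rule that occurs earlier is selected. *)
let select (better : int -> int -> bool) rules input =
  List.fold_left
    (fun best (pat, action) ->
      if not (is_prefix pat input) then best
      else
        match best with
        | Some (p, _)
          when not (better (String.length pat) (String.length p)) -> best
        | _ -> Some (pat, action))
    None rules

(* Longest match, as with the parse keyword. *)
let longest_match rules input = select (fun n m -> n > m) rules input

(* Shortest match, as with the shortest keyword. *)
let shortest_match rules input = select (fun n m -> n < m) rules input

(* Hypothetical rule set: the third rule duplicates the first one and
   can therefore never be selected. *)
let rules = [ ("<", "LT"); ("<=", "LE"); ("<", "DUPLICATE") ]
```

With these definitions, longest_match rules "<= 3" selects the "<=" rule, whereas shortest_match rules "<= 3" selects the first "<" rule.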
Here is the main program, that combines the parser with the lexer:
(* File calc.ml *)
let _ =
  try
    let lexbuf = Lexing.from_channel stdin in
    while true do
      let result = Parser.main Lexer.token lexbuf in
      print_int result; print_newline(); flush stdout
    done
  with Lexer.Eof ->
    exit 0
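The main program above reads from standard input. The same combination of a parsing function, a lexical analyzer and a lexer buffer can also be applied to a character string, using Lexing.from_string. Here is a hedged sketch of this idea; parse_string is a hypothetical helper, and it receives the parsing function and the lexing function as parameters so that the code stays self-contained and does not depend on the generated modules.

```ocaml
(* Hypothetical helper: run a parsing function of the shape produced by
   ocamlyacc on the contents of a string.
   [parse] is a parsing function such as Parser.main,
   [lex] is a lexing function such as Lexer.token. *)
let parse_string
    (parse : (Lexing.lexbuf -> 'token) -> Lexing.lexbuf -> 'a)
    (lex : Lexing.lexbuf -> 'token)
    (input : string) : 'a option =
  let lexbuf = Lexing.from_string input in
  (* Parsing.Parse_error is raised when the parser cannot recover
     from a syntax error. *)
  try Some (parse lex lexbuf) with Parsing.Parse_error -> None

(* With the calculator of this section, one would write for instance:
     parse_string Parser.main Lexer.token "1+2*3\n"
   Exceptions raised by the lexer itself, such as Lexer.Eof, are not
   caught by this helper. *)
```

Returning an option keeps syntax errors out of the caller's control flow; a program that needs to report the position of the error would have to inspect the lexer buffer instead.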
To compile everything, execute:
ocamllex lexer.mll # generates lexer.ml
ocamlyacc parser.mly # generates parser.ml and parser.mli
ocamlc -c parser.mli
ocamlc -c lexer.ml
ocamlc -c parser.ml
ocamlc -c calc.ml
ocamlc -o calc lexer.cmo parser.cmo calc.cmo
12.7 Common errors