An LALR-1 parser generator for Erlang, similar to
To understand this text, you also have to
look at the
Any of the Boolean options can be set to
The value of the
Yecc will add the extension
Returns a descriptive string in English of an error tuple
returned by
A
The user should implement a scanner that segments the input
text, and turns it into one or more lists of tokens. Each token
should be a tuple containing information about syntactic
category, position in the text (e.g. line number), and the
actual terminal symbol found in the text:
If a terminal symbol is the only member of a category, and the
symbol name is identical to the category name, the token format
may be
A list of tokens produced by the scanner should end with a
special
The simplest case is to segment the input string into a list of
identifiers (atoms) and use those atoms both as categories and
values of the tokens. For example, the input string
[{aaa, 1}, {bbb, 1}, {777, 1}, {',', 1}, {'X', 1},
 {'$end', 1}].
This assumes that this is the first line of the input text, and
that
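As an illustration of the two-element token format, the following minimal sketch uses the standard erl_scan module to segment a string and then reduces each token to a {Symbol, Line} pair. The module name simple_scan and the function scan/1 are hypothetical and not part of Yecc. The sketch also assumes that '$end' is the end symbol of the grammar.

```erlang
-module(simple_scan).
-export([scan/1]).

%% Hypothetical helper, not part of Yecc.
%% Scans String with erl_scan, keeps only the symbol and the line of
%% each token, and appends an end token carrying the last line number.
scan(String) ->
    {ok, Tokens, EndLine} = erl_scan:string(String),
    Pairs = [{erl_scan:symbol(T), erl_scan:line(T)} || T <- Tokens],
    Pairs ++ [{'$end', EndLine}].
```

Because erl_scan classifies 777 as an integer and X as a variable, the category is dropped here and the symbol itself serves as both category and value.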
The Erlang scanner in the
Erlang style
Each
The grammar starts with an optional
Header "%% Copyright (C)"
"%% @private"
"%% @Author John"
Next comes a declaration of the
Nonterminals sentence nounphrase verbphrase.
A non-terminal category can be used at the left hand side (=
Next comes a declaration of the
Terminals article adjective noun verb.
Terminal categories may only appear in the right hand sides (=
Next comes a declaration of the
Rootsymbol sentence.
This symbol should appear in the lhs of at least one grammar rule. This is the most general syntactic category which the parser ultimately will parse every input string into.
After the rootsymbol declaration comes an optional declaration
of the
Endsymbol '$end'.
Next comes one or more declarations of
Examples of operator declarations:
Right 100 '='.
Nonassoc 200 '==' '=/='.
Left 300 '+'.
Left 400 '*'.
Unary 500 '-'.
These declarations mean that
Certain rules are assigned precedence: each rule gets its precedence from the last terminal symbol mentioned in the right hand side of the rule. It is also possible to declare precedence for non-terminals, "one level up". This is practical when an operator is overloaded (see also example 3 below).
Next come the
Left_hand_side -> Right_hand_side : Associated_code.
The left hand side is a non-terminal category. The right hand
side is a sequence of one or more non-terminal or terminal
symbols with spaces between. The associated code is a sequence
of zero or more Erlang expressions (with commas
Symbols such as
The last part of the grammar file is an optional section with Erlang code (= function definitions) which is included 'as is' in the resulting parser file. This section must start with the pseudo declaration, or key words
Erlang code.
No syntax rule definitions or other declarations may follow this section. To avoid conflicts with internal variables, do not use variable names beginning with two underscore characters ('__') in the Erlang code in this section, or in the code associated with the individual syntax rules.
The optional
Expect 2.
The warning is given if the number of shift/reduce conflicts differs from 2, or if there are reduce/reduce conflicts.
A grammar to parse list expressions (with empty associated code):
Nonterminals list elements element.
Terminals atom '(' ')'.
Rootsymbol list.
list -> '(' ')'.
list -> '(' elements ')'.
elements -> element.
elements -> element elements.
element -> atom.
element -> list.
This grammar can be used to generate a parser which parses list
expressions, such as
[{'(', 1}, {atom, 1, peter}, {atom, 1, charles}, {')', 1},
 {'$end', 1}]
When a grammar rule is used by the parser to parse (part of) the input string as a grammatical phrase, the associated code is evaluated, and the value of the last expression becomes the value of the parsed phrase. This value may be used by the parser later to build structures that are values of higher phrases of which the current phrase is a part. The values initially associated with terminal category phrases, i.e. input tokens, are the token tuples themselves.
Below is an example of the grammar above with structure building code added:
list -> '(' ')' : nil.
list -> '(' elements ')' : '$2'.
elements -> element : {cons, '$1', nil}.
elements -> element elements : {cons, '$1', '$2'}.
element -> atom : '$1'.
element -> list : '$1'.
With this code added to the grammar rules, the parser produces
the following value (structure) when parsing the input string
{cons, {atom, 1, a}, {cons, {atom, 1, b},
                      {cons, {atom, 1, c}, nil}}}
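To make the shape of such a value concrete, here is a small sketch that walks a {cons, Head, Tail} structure ending in nil and turns it into an ordinary Erlang list of atoms. The module cons_util and the function to_term/1 are hypothetical helpers written for this illustration only. They assume that the elements are {atom, Line, Name} tokens or nested lists.

```erlang
-module(cons_util).
-export([to_term/1]).

%% Hypothetical helper, not part of Yecc.
%% nil marks the empty list built by the rule  list -> '(' ')'.
to_term(nil) ->
    [];
%% A cons cell holds one element and the rest of the list.
to_term({cons, Head, Tail}) ->
    [to_term(Head) | to_term(Tail)];
%% An atom token contributes its symbol; the line number is dropped.
to_term({atom, _Line, Name}) ->
    Name.
```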
The associated code contains
The associated code may not only be used to build structures
associated with phrases, but may also be used for syntactic and
semantic tests, printout actions (for example for tracing), etc.
during the parsing process. Since tokens contain positional
(line number) information, it is possible to produce error
messages which contain line numbers. If there is no associated
code after the right hand side of the rule, the value
The right hand side of a grammar rule may be empty. This is
indicated by using the special symbol
list -> '(' elements ')' : '$2'.
elements -> element elements : {cons, '$1', '$2'}.
elements -> '$empty' : nil.
element -> atom : '$1'.
element -> list : '$1'.
To call the parser generator, use the following command:
yecc:file(Grammarfile).
An error message from Yecc will be shown if the grammar
is not of the LALR type (for example, if it is too ambiguous).
Shift/reduce conflicts are resolved in favor of shifting if
there are no operator precedence declarations. Refer to the
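The generation step can be combined with compilation of the generated source. The sketch below assumes the default options, under which yecc:file/1 returns {ok, ParserFile} on success. The module build_parser and the function build/1 are hypothetical names used only for this illustration.

```erlang
-module(build_parser).
-export([build/1]).

%% Hypothetical helper, not part of Yecc.
%% Generates a parser from GrammarFile and, if that succeeds, compiles
%% the generated Erlang source file with the standard compiler.
build(GrammarFile) ->
    case yecc:file(GrammarFile) of
        {ok, ParserFile} ->
            %% Returns {ok, Module} when the parser compiles cleanly.
            compile:file(ParserFile);
        Error ->
            %% Pass the failure from Yecc on to the caller unchanged.
            Error
    end.
```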
The output file contains Erlang source code for a parser module
with module name equal to the
myparser:parse(myscanner:scan(Inport))
The call format may be different if a customized prologue file
has been included when generating the parser instead of the
default file
With the standard prologue, this call will return either
By default, the parser that was generated will not print out error messages to the screen. The user will have to do this either by printing the returned error messages, or by inserting tests and print instructions in the Erlang code associated with the syntax rules of the grammar file.
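One way to print a returned error is sketched below. It assumes the standard prologue, where a syntax error is reported as {error, {Line, Module, Message}} and Module:format_error(Message) yields printable text. The module parse_report and the function describe/1 are hypothetical. The standard erl_parse module, which is itself generated by Yecc, can be used to try it out.

```erlang
-module(parse_report).
-export([describe/1]).

%% Hypothetical helper, not part of Yecc.
%% A successful parse is passed through unchanged.
describe({ok, Result}) ->
    {ok, Result};
%% A syntax error is turned into a flat, readable string.
describe({error, {Line, Module, Message}}) ->
    Text = lists:flatten(Module:format_error(Message)),
    {error, lists:flatten(io_lib:format("line ~w: ~ts", [Line, Text]))}.
```

For example, describe(erl_parse:parse_exprs(Tokens)) returns {error, String} when Tokens do not form valid expressions, and String can then be written with io:format/2.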
It is also possible to make the parser ask for more input tokens when needed if the following call format is used:
myparser:parse_and_scan({Function, Args})
myparser:parse_and_scan({Mod, Tokenizer, Args})
The tokenizer
The tokenizer used above has to be implemented so as to return one of the following:
{ok, Tokens, Endline}
{eof, Endline}
{error, Error_description, Endline}
This conforms to the format used by the scanner in the Erlang
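As a hedged sketch of this call format, the function below hands io:scan_erl_form/2 to a generated parser as its tokenizer, so that tokens are read from an open IO device on demand. The module scan_on_demand and the function parse_device/2 are hypothetical. ParserMod stands for any Yecc-generated parser module whose grammar accepts the tokens that erl_scan produces, including the final dot token.

```erlang
-module(scan_on_demand).
-export([parse_device/2]).

%% Hypothetical helper, not part of Yecc.
%% ParserMod: a parser module generated by Yecc (standard prologue assumed).
%% Dev: an IO device opened for reading.
%% The parser applies io:scan_erl_form(Dev, '') each time it needs tokens.
parse_device(ParserMod, Dev) ->
    ParserMod:parse_and_scan({io, scan_erl_form, [Dev, '']}).
```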
If
1. A grammar for parsing infix arithmetic expressions into prefix notation, without operator precedence:
Nonterminals E T F.
Terminals '+' '*' '(' ')' number.
Rootsymbol E.
E -> E '+' T: ['$2', '$1', '$3'].
E -> T : '$1'.
T -> T '*' F: ['$2', '$1', '$3'].
T -> F : '$1'.
F -> '(' E ')' : '$2'.
F -> number : '$1'.
2. The same with operator precedence becomes simpler:
Nonterminals E.
Terminals '+' '*' '(' ')' number.
Rootsymbol E.
Left 100 '+'.
Left 200 '*'.
E -> E '+' E : ['$2', '$1', '$3'].
E -> E '*' E : ['$2', '$1', '$3'].
E -> '(' E ')' : '$2'.
E -> number : '$1'.
3. An overloaded minus operator:
Nonterminals E uminus.
Terminals '*' '-' number.
Rootsymbol E.
Left 100 '-'.
Left 200 '*'.
Unary 300 uminus.
E -> E '-' E.
E -> E '*' E.
E -> uminus.
E -> number.
uminus -> '-' E.
4. The Yecc grammar that is used for parsing grammar files, including itself:
Nonterminals
    grammar declaration rule head symbol symbols attached_code
    token tokens.
Terminals
    atom float integer reserved_symbol reserved_word string char var
    '->' ':' dot.
Rootsymbol grammar.
Endsymbol '$end'.
grammar -> declaration : '$1'.
grammar -> rule : '$1'.
declaration -> symbol symbols dot: {'$1', '$2'}.
rule -> head '->' symbols attached_code dot: {rule, ['$1' | '$3'], '$4'}.
head -> symbol : '$1'.
symbols -> symbol : ['$1'].
symbols -> symbol symbols : ['$1' | '$2'].
attached_code -> ':' tokens : {erlang_code, '$2'}.
attached_code -> '$empty' : {erlang_code, [{atom, 0, '$undefined'}]}.
tokens -> token : ['$1'].
tokens -> token tokens : ['$1' | '$2'].
symbol -> var : value_of('$1').
symbol -> atom : value_of('$1').
symbol -> integer : value_of('$1').
symbol -> reserved_word : value_of('$1').
token -> var : '$1'.
token -> atom : '$1'.
token -> float : '$1'.
token -> integer : '$1'.
token -> string : '$1'.
token -> char : '$1'.
token -> reserved_symbol : {value_of('$1'), line_of('$1')}.
token -> reserved_word : {value_of('$1'), line_of('$1')}.
token -> '->' : {'->', line_of('$1')}.
token -> ':' : {':', line_of('$1')}.
Erlang code.
value_of(Token) ->
    element(3, Token).
line_of(Token) ->
    element(2, Token).
The symbols
5. The file
Syntactic tests are used in the code associated with some
rules, and an error is thrown (and caught by the generated
parser to produce an error message) when a test fails. The
same effect can be achieved with a call to
lib/parsetools/include/yeccpre.hrl
Aho & Johnson: 'LR Parsing', ACM Computing Surveys, vol. 6:2, 1974.