Diffstat (limited to 'lib/parsetools/doc/src/leex.xml')
-rw-r--r-- | lib/parsetools/doc/src/leex.xml | 455 |
1 files changed, 455 insertions, 0 deletions
diff --git a/lib/parsetools/doc/src/leex.xml b/lib/parsetools/doc/src/leex.xml new file mode 100644 index 0000000000..c113b586df --- /dev/null +++ b/lib/parsetools/doc/src/leex.xml @@ -0,0 +1,455 @@ +<?xml version="1.0" encoding="latin1" ?> +<!DOCTYPE erlref SYSTEM "erlref.dtd"> + +<erlref> + <header> + <copyright> + <year>2009</year><year>2009</year> + <holder>Ericsson AB. All Rights Reserved.</holder> + </copyright> + <legalnotice> + Copyright (c) 2008,2009 Robert Virding. All rights reserved. + </legalnotice> + + <title>leex</title> + <prepared>Robert Virding</prepared> + <responsible>nobody</responsible> + <docno></docno> + <approved>nobody</approved> + <checked></checked> + <date>2009-05-07</date> + <rev>A</rev> + <file>leex.xml</file> + </header> + <module>leex</module> + <modulesummary>Lexical analyzer generator for Erlang</modulesummary> + <description> + <p>A regular expression based lexical analyzer generator for + Erlang, similar to lex or flex.</p> + <note><p>The Leex module should be considered experimental + as it will be subject to changes in future releases.</p></note> + </description> + <section> + <title>DATA TYPES</title> + <code type="none"> +ErrorInfo = {ErrorLine,module(),error_descriptor()} +ErrorLine = integer() +Token = tuple()</code> + </section> + <funcs> + <func> + <name>file(FileName) -> ok | error</name> + <name>file(FileName, Options) -> ok | error</name> + <fsummary>Generate a lexical analyzer</fsummary> + <type> + <v>FileName = filename()</v> + <v>Options = Option | [Option]</v> + <v>Option = - see below -</v> + <v>FileReturn = {ok, Scannerfile} + | {ok, Scannerfile, Warnings} + | error + | {error, Warnings, Errors}</v> + <v>Scannerfile = filename()</v> + <v>Warnings = Errors = [{filename(), [ErrorInfo]}]</v> + </type> + <desc> + <p>Generates a lexical analyzer from the definition in the input + file. The input file has the extension <c>.xrl</c>. This is + added to the filename if it is not given. 
The resulting module + is the Xrl filename without the <c>.xrl</c> extension.</p> + + <p>The current options are:</p> + <taglist> + <tag><c>dfa_graph</c></tag> + <item><p>Generates a <c>.dot</c> file which contains a + description of the DFA in a format which can be viewed with + Graphviz, <c>www.graphviz.com</c>.</p> + </item> + <tag><c>{includefile,Includefile}</c></tag> + <item><p>Uses a specific or customised prologue file + instead of default + <c>lib/parsetools/include/leexinc.hrl</c> which is + otherwise included.</p> + </item> + <tag><c>{report_errors, bool()}</c></tag> + <item><p>Causes errors to be printed as they occur. Default is + <c>true</c>.</p> + </item> + <tag><c>{report_warnings, bool()}</c></tag> + <item><p>Causes warnings to be printed as they occur. Default is + <c>true</c>.</p> + </item> + <tag><c>{report, bool()}</c></tag> + <item><p>This is a short form for both <c>report_errors</c> and + <c>report_warnings</c>.</p> + </item> + <tag><c>{return_errors, bool()}</c></tag> + <item><p>If this flag is set, <c>{error, Errors, Warnings}</c> + is returned when there are errors. Default is <c>false</c>.</p> + </item> + <tag><c>{return_warnings, bool()}</c></tag> + <item><p>If this flag is set, an extra field containing + <c>Warnings</c> is added to the tuple returned upon + success. Default is <c>false</c>.</p> + </item> + <tag><c>{return, bool()}</c></tag> + <item><p>This is a short form for both <c>return_errors</c> and + <c>return_warnings</c>.</p> + </item> + <tag><c>{scannerfile, Scannerfile}</c></tag> + <item><p><c>Scannerfile</c> is the name of the file that + will contain the Erlang scanner code that is generated. 
+ The default (<c>""</c>) is to add the extension + <c>.erl</c> to <c>FileName</c> stripped of the + <c>.xrl</c> extension.</p> + </item> + <tag><c>{verbose, bool()}</c></tag> + <item><p>Outputs information from parsing the input file and + generating the internal tables.</p> + </item> + </taglist> + <p>Any of the Boolean options can be set to <c>true</c> by + stating the name of the option. For example, <c>verbose</c> + is equivalent to <c>{verbose, true}</c>.</p> + <p>Leex will add the extension <c>.hrl</c> to the + <c>Includefile</c> name and the extension <c>.erl</c> to the + <c>Scannerfile</c> name, unless the extension is already + there.</p> + </desc> + </func> + <func> + <name>format_error(ErrorInfo) -> Chars</name> + <fsummary>Return an English description of an error tuple.</fsummary> + <type> + <v>Chars = [char() | Chars]</v> + </type> + <desc> + <p>Returns a string which describes the error + <c>ErrorInfo</c> returned when there is an error in a + regular expression.</p> + </desc> + </func> + </funcs> + + + <section> + <title>GENERATED SCANNER EXPORTS</title> + <p>The following functions are exported by the generated scanner.</p> + </section> + + <funcs> + <func> + <name>string(String) -> StringRet</name> + <name>string(String, StartLine) -> StringRet</name> + <fsummary>Generated by Leex</fsummary> + <type> + <v>String = string()</v> + <v>StringRet = {ok,Tokens,EndLine} | ErrorInfo</v> + <v>Tokens = [Token]</v> + <v>EndLine = StartLine = integer()</v> + </type> + <desc> + <p>Scans <c>String</c> and returns all the tokens in it, or an + error.</p> + <note><p>It is an error if not all of the characters in + <c>String</c> are consumed.</p></note> + </desc> + </func> + + <func> + <name>token(Cont, Chars) -> {more,Cont1} | {done,TokenRet,RestChars} + </name> + <name>token(Cont, Chars, StartLine) -> {more,Cont1} + | {done,TokenRet,RestChars} + </name> + <fsummary>Generated by Leex</fsummary> + <type> + <v>Cont = [] | Cont1</v> + <v>Cont1 = tuple()</v> + 
<v>Chars = RestChars = string() | eof</v> + <v>TokenRet = {ok, Token, EndLine} + | {eof, EndLine} + | ErrorInfo</v> + <v>StartLine = EndLine = integer()</v> + </type> + <desc> + <p>This is a re-entrant call to try and scan one token from + <c>Chars</c>. If there are enough characters in <c>Chars</c> + to either scan a token or detect an error then this will be + returned with <c>{done,...}</c>. Otherwise + <c>{more,Cont1}</c> will be returned where <c>Cont1</c> is + used in the next call to <c>token()</c> with more characters + to try and scan the token. This is continued until a token + has been scanned. <c>Cont</c> is initially <c>[]</c>.</p> + + <p>It is not designed to be called directly by an application + but used through the i/o system where it can typically be + called in an application by:</p> + <code> +io:request(InFile, {get_until,Prompt,Module,token,[Line]}) + -> TokenRet</code> + </desc> + </func> + + <func> + <name>tokens(Cont, Chars) -> {more,Cont1} | {done,TokensRet,RestChars} + </name> + <name>tokens(Cont, Chars, StartLine) -> + {more,Cont1} | {done,TokensRet,RestChars} + </name> + <fsummary>Generated by Leex</fsummary> + <type> + <v>Cont = [] | Cont1</v> + <v>Cont1 = tuple()</v> + <v>Chars = RestChars = string() | eof</v> + <v>TokensRet = {ok, Tokens, EndLine} + | {eof, EndLine} + | ErrorInfo</v> + <v>Tokens = [Token]</v> + <v>StartLine = EndLine = integer()</v> + </type> + <desc> + <p>This is a re-entrant call to try and scan tokens from + <c>Chars</c>. If there are enough characters in <c>Chars</c> + to either scan tokens or detect an error then this will be + returned with <c>{done,...}</c>. Otherwise + <c>{more,Cont1}</c> will be returned where <c>Cont1</c> is + used in the next call to <c>tokens()</c> with more + characters to try and scan the tokens. This is continued + until all tokens have been scanned. 
<c>Cont</c> is initially + <c>[]</c>.</p> + + <p>This function differs from <c>token</c> in that it will + continue to scan tokens until an + <c>{end_token,Token}</c> has been scanned (see next + section). It will then return all the tokens. This is + typically used for scanning grammars like Erlang where there + is an explicit end token, <c>'.'</c>. If no end token is + found then the whole file will be scanned and returned. If + an error occurs then all tokens up to and including the next + end token will be skipped.</p> + + <p>It is not designed to be called directly by an application + but used through the i/o system where it can typically be + called in an application by:</p> + <code> +io:request(InFile, {get_until,Prompt,Module,tokens,[Line]}) + -> TokensRet</code> + </desc> + </func> + </funcs> + + <section> + <title>Input File Format</title> + <p>Erlang style comments starting with a <c>%</c> are allowed in + scanner files. A definition file has the following format:</p> + <code> +<Header> + +Definitions. + +<Macro Definitions> + +Rules. + +<Token Rules> + +Erlang code. + +<Erlang code></code> + + <p>The "Definitions.", "Rules." and "Erlang code." headings are + mandatory and must occur at the beginning of a source line. The + <Header>, <Macro Definitions> and <Erlang code> + sections may be empty but there must be at least one rule.</p> + + <p>Macro definitions have the following format:</p> + + <code> +NAME = VALUE</code> + + <p>and there must be spaces around <c>=</c>. 
Macros can be used in + the regular expressions of rules by writing <c>{NAME}</c>.</p> + + <note><p>When macros are expanded in expressions the macro calls + are replaced by the macro value without any form of quoting or + enclosing in parentheses.</p></note> + + <p>Rules have the following format:</p> + + <code> +<Regexp> : <Erlang code>.</code> + + <p>The <Regexp> must occur at the start of a line and not + include any blanks; use <c>\\t</c> and <c>\\s</c> to include TAB + and SPACE characters in the regular expression. If <Regexp> + matches then the corresponding <Erlang code> is evaluated to + generate a token. Within the Erlang code the following predefined + variables are available:</p> + + <taglist> + <tag><c>TokenChars</c></tag> + <item><p>A list of the characters in the matched token.</p> + </item> + <tag><c>TokenLen</c></tag> + <item><p>The number of characters in the matched token.</p> + </item> + <tag><c>TokenLine</c></tag> + <item><p>The line number where the token occurred.</p> + </item> + </taglist> + + <p>The code must return:</p> + + <taglist> + <tag><c>{token,Token}</c></tag> + <item><p>Return <c>Token</c> to the caller.</p> + </item> + <tag><c>{end_token,Token}</c></tag> + <item><p>Return <c>Token</c> as the last token in a <c>tokens</c> call.</p> + </item> + <tag><c>skip_token</c></tag> + <item><p>Skip this token completely.</p> + </item> + <tag><c>{error,ErrString}</c></tag> + <item><p>An error in the token; <c>ErrString</c> is a string + describing the error.</p> + </item> + </taglist> + + <p>It is also possible to push back characters into the input + characters with the following returns:</p> + + <list> + <item><c>{token,Token,PushBackList}</c></item> + <item><c>{end_token,Token,PushBackList}</c></item> + <item><c>{skip_token,PushBackList}</c></item> + </list> + + <p>These have the same meanings as the normal returns but the + characters in <c>PushBackList</c> will be prepended to the input + characters and scanned for the next token. 
Note that pushing + back a newline will mean the line numbering will no longer be + correct.</p> + + <note><p>Pushing back characters gives you unexpected + possibilities to cause the scanner to loop!</p></note> + + <p>The following example would match a simple Erlang integer or + float and return a token which could be sent to the Erlang + parser:</p> + <code> +D = [0-9] + +{D}+ : + {token,{integer,TokenLine,list_to_integer(TokenChars)}}. + +{D}+\\.{D}+((E|e)(\\+|\\-)?{D}+)? : + {token,{float,TokenLine,list_to_float(TokenChars)}}.</code> + + <p>The Erlang code in the "Erlang code." section is written into + the output file directly after the module declaration and + predefined exports declaration so it is possible to add extra + exports, define imports and other attributes which are then + visible in the whole file.</p> + </section> + + <section> + <title>Regular Expressions</title> + + <p>The regular expressions allowed here are a subset of the set + found in <c>egrep</c> and in the AWK programming language, as + defined in the book, The AWK Programming Language, by A. V. Aho, + B. W. Kernighan, P. J. Weinberger. They are composed of the + following characters:</p> + + <taglist> + <tag><c>c</c></tag> + <item><p>Matches the non-metacharacter c.</p> + </item> + <tag><c>\\c</c></tag> + <item><p>Matches the escape sequence or literal character c.</p> + </item> + <tag><c>.</c></tag> + <item><p>Matches any character.</p> + </item> + <tag><c>^</c></tag> + <item><p>Matches the beginning of a string.</p> + </item> + <tag><c>$</c></tag> + <item><p>Matches the end of a string.</p></item> + <tag><c>[abc...]</c></tag> + <item><p>Character class, which matches any of the characters + <c>abc...</c>. Character ranges are specified by a pair of + characters separated by a <c>-</c>.</p> + </item> + <tag><c>[^abc...]</c></tag> + <item><p>Negated character class, which matches any character + except <c>abc...</c>.</p> + </item> + <tag><c>r1 | r2</c></tag> + <item><p>Alternation. 
It matches either <c>r1</c> or <c>r2</c>.</p> + </item> + <tag><c>r1r2</c></tag> + <item><p>Concatenation. It matches <c>r1</c> and then <c>r2</c>.</p> + </item> + <tag><c>r+</c></tag> + <item><p>Matches one or more <c>rs</c>.</p> + </item> + <tag><c>r*</c></tag> + <item><p>Matches zero or more <c>rs</c>.</p> + </item> + <tag><c>r?</c></tag> + <item><p>Matches zero or one <c>rs</c>.</p> + </item> + <tag><c>(r)</c></tag> + <item><p>Grouping. It matches <c>r</c>.</p> + </item> + </taglist> + + <p>The escape sequences allowed are the same as for Erlang strings:</p> + + <taglist> + <tag><c>\\b</c></tag> + <item><p>Backspace.</p></item> + <tag><c>\\f</c></tag> + <item><p>Form feed.</p></item> + <tag><c>\\n</c></tag> + <item><p>Newline (line feed).</p></item> + <tag><c>\\r</c></tag> + <item><p>Carriage return.</p></item> + <tag><c>\\t</c></tag> + <item><p>Tab.</p></item> + <tag><c>\\e</c></tag> + <item><p>Escape.</p></item> + <tag><c>\\v</c></tag> + <item><p>Vertical tab.</p></item> + <tag><c>\\s</c></tag> + <item><p>Space.</p></item> + <tag><c>\\d</c></tag> + <item><p>Delete.</p></item> + <tag><c>\\ddd</c></tag> + <item><p>The octal value <c>ddd</c>.</p></item> + <tag><c>\\xhh</c></tag> + <item><p>The hexadecimal value <c>hh</c>.</p></item> + <tag><c>\\x{h...}</c></tag> + <item><p>The hexadecimal value <c>h...</c>.</p></item> + <tag><c>\\c</c></tag> + <item><p>Any other character literally, for example <c>\\\\</c> for + backslash, <c>\\"</c> for <c>"</c>.</p> + </item> + </taglist> + + <p>The following examples define Erlang data types:</p> + <code> +Atoms [a-z][0-9a-zA-Z_]* + +Variables [A-Z_][0-9a-zA-Z_]* + +Floats (\\+|-)?[0-9]+\\.[0-9]+((E|e)(\\+|-)?[0-9]+)?</code> + + <note><p>Anchoring a regular expression with <c>^</c> and <c>$</c> + is not implemented in the current version of Leex and just + generates a parse error.</p></note> + </section> +</erlref> |
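The input file format documented in the patch above can be sketched as a complete definition file. This is a minimal illustration only and is not part of the patch: the file name numbers.xrl and the macro name WS are hypothetical, and escapes are written with a single backslash, which is an assumption about how the doubled backslashes in the documentation source are meant to render.

```erlang
% numbers.xrl (hypothetical file name, not part of the patch).
% Scans integers and floats and skips whitespace.

Definitions.

% Macro definitions need spaces around "=".
D = [0-9]
WS = [\s\t\n\r]

Rules.

% Each rule starts at the beginning of a line and contains no blanks
% inside the regular expression itself.
{D}+ : {token,{integer,TokenLine,list_to_integer(TokenChars)}}.
{D}+\.{D}+ : {token,{float,TokenLine,list_to_float(TokenChars)}}.
{WS}+ : skip_token.

Erlang code.

% No extra Erlang code is needed for this sketch.
```

Assuming the file is saved as numbers.xrl, calling leex:file("numbers") should generate numbers.erl as described under the scannerfile option. Once that module is compiled, numbers:string("12 3.5") would be expected to return the integer and float tokens in an {ok,Tokens,EndLine} tuple.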