From 811134171adb1c926111338dba8f36b02a0ff4e0 Mon Sep 17 00:00:00 2001
From: Patrik Nyblom
Date: Fri, 19 Jul 2013 14:49:17 +0200
Subject: Add documentation of extensions to re module

The following compile options are documented:
  no_start_optimize
  ucp
  never_utf

The following run options are documented:
  notempty_atstart
  {capture, all_names}

The following new functions are documented:
  re:inspect/2
---
 lib/stdlib/doc/src/re.xml | 81 +++++++++++++++++++++++++++++++++++++----------
 lib/stdlib/src/re.erl     |  6 ++--
 2 files changed, 68 insertions(+), 19 deletions(-)

diff --git a/lib/stdlib/doc/src/re.xml b/lib/stdlib/doc/src/re.xml
index aae6345e84..8a47b1579d 100644
--- a/lib/stdlib/doc/src/re.xml
+++ b/lib/stdlib/doc/src/re.xml
@@ -101,7 +101,7 @@

The options have the following meanings:

unicode
- The regular expression is given as a Unicode charlist() and the resulting regular expression code is to be run against a valid Unicode charlist() subject.
+ The regular expression is given as a Unicode charlist() and the resulting regular expression code is to be run against a valid Unicode charlist() subject. Also consider the ucp option when using Unicode characters.
anchored
The pattern is forced to be "anchored", that is, it is constrained to match only at the first matching point in the string that is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself.
caseless
@@ -147,10 +147,50 @@ This option makes it possible to include comments inside complicated patterns. N
Specifies specifically that \R is to match only the cr, lf or crlf sequences, not the Unicode specific newline characters.
bsr_unicode
Specifies specifically that \R is to match all the Unicode newline characters (including crlf etc, the default).
+ no_start_optimize
+ This option disables optimization that may malfunction if "Special start-of-pattern items" are present in the regular expression. A typical example would be when matching "DEFABC" against "(*COMMIT)ABC", where the start optimization of PCRE would skip the subject up to the "A" and would never realize that the (*COMMIT) instruction should have made the matching fail. This option is only relevant if you use "start-of-pattern items", as discussed in the section "PCRE regular expression details" below.
+ ucp
+ Specifies that Unicode Character Properties should be used when resolving \B, \b, \D, \d, \S, \s, \W and \w. Without this flag, only ISO-Latin-1 properties are used. Using Unicode properties hurts performance, but is semantically correct when working with Unicode characters beyond the ISO-Latin-1 range.
+ never_utf
+ Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern items" are forbidden. This flag can not be combined with unicode.
Useful if ISO-Latin-1 patterns from an external source are to be compiled.
+
+
+ Inspects a compiled regular expression
+

This function takes a compiled regular expression and an item, returning the relevant data from the regular expression. Currently the only supported item is namelist, which returns the tuple {namelist, [binary()]}, containing the names of all (unique) named subpatterns in the regular expression.

+

Example:

+
+1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
+{ok,{re_pattern,3,0,0,
+                <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
+                  255,255,...>>}}
+2> re:inspect(MP,namelist).
+{namelist,[<<"A">>,<<"B">>,<<"C">>]}
+3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]).
+{ok,{re_pattern,3,0,0,
+                <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
+                  255,255,...>>}}
+4> re:inspect(MPD,namelist).
+{namelist,[<<"B">>,<<"C">>]}
+

Note specifically in the second example that the duplicate name only occurs once in the returned list, and that the list is in alphabetical order regardless of where the names are positioned in the regular expression. The order of the names is the same as the order of captured subexpressions if {capture, all_names} is given as an option to re:run/3. You can therefore create a name-to-value mapping from the result of re:run/3 like this:

+
+1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)").
+{ok,{re_pattern,3,0,0,
+                <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
+                  255,255,...>>}}
+2> {namelist, N} = re:inspect(MP,namelist).
+{namelist,[<<"A">>,<<"B">>,<<"C">>]}
+3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
+{match,[<<"A">>,<<>>,<<>>]}
+4> NameMap = lists:zip(N,L).
+[{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]
+

More items are expected to be added in the future.

+
+
Match a subject against regular expression and capture subpatterns
@@ -179,7 +219,7 @@ This option makes it possible to include comments inside complicated patterns. N

If the regular expression is previously compiled, the option list can only contain the options anchored, global, notbol, noteol,
- notempty, {offset, integer() >= 0}, {newline,
+ notempty, notempty_atstart, {offset, integer() >= 0}, {newline,
NLSpec} and {capture, ValueSpec}/{capture, ValueSpec, Type}. Otherwise all options valid for the re:compile/2 function are allowed as well. Options
@@ -241,7 +281,7 @@ This option makes it possible to include comments inside complicated patterns. N
When the global option is given, re:run/3 handles empty matches in the same way as Perl: a zero-length match at any point will be retried with the options [anchored,
- notempty] as well. If that search gives a result of length
+ notempty_atstart] as well. If that search gives a result of length
> 0, the result is included. For example:

re:run("cat","(|at)",[global]).
@@ -254,9 +294,9 @@ This option makes it possible to include comments inside complicated patterns. N
[{0,0},{0,0}] (the second {0,0} is due to the subexpression marked by the parentheses). As the length of the match is 0, we don't advance to the next position yet.
- At offset 0 with [anchored, notempty]
+ At offset 0 with [anchored, notempty_atstart]
The search is retried
- with the options [anchored, notempty] at the same
+ with the options [anchored, notempty_atstart] at the same
position, which does not give any interesting result of longer length, so the search position is now advanced to the next character (a).
@@ -264,7 +304,7 @@ This option makes it possible to include comments inside complicated patterns. N
This time, the search results in [{1,0},{1,0}], so this search will also be repeated with the extra options.
- At offset 1 with [anchored, notempty]
+ At offset 1 with [anchored, notempty_atstart]
Now the ab alternative is found and the result will be [{1,2},{1,2}]. The result is added to the list of results and the position in the
@@ -272,7 +312,7 @@ This option makes it possible to include comments inside complicated patterns. N
At offset 3
The search now once again matches the empty string, giving [{3,0},{3,0}].
- At offset 1 with [anchored, notempty]
+ At offset 1 with [anchored, notempty_atstart]
This will give no result of length > 0 and we are at the last position, so the global search is complete.
@@ -293,15 +333,21 @@ This option makes it possible to include comments inside complicated patterns. N
subject. With the notempty option, this match is not valid, so re:run/3 searches further into the string for occurrences of "a" or "b".

- -

Perl has no direct equivalent of notempty, but it does
- make a special case of a pattern match of the empty string
- within its split() function, and when using the /g modifier. It
- is possible to emulate Perl's behavior after matching a null
- string by first trying the match again at the same offset with
- notempty and anchored, and then, if that fails, by
- advancing the starting offset (see below) and trying an ordinary
- match again.

+
+ notempty_atstart + +

This is like notempty, except that an empty string
+ match that is not at the start of the subject is permitted. If
+ the pattern is anchored, such a match can occur only if the
+ pattern contains \K.
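The difference between the two options is easiest to see side by side. The following is a minimal sketch, not part of the patch: the module name `notempty_sketch` and the function `compare/0` are hypothetical, and it assumes an OTP release that already contains the notempty_atstart option.

```erlang
-module(notempty_sketch).
-export([compare/0]).

%% Run the same pattern against the same subject with three option lists.
%% "a?b?" is able to match the empty string, so the options decide which
%% empty matches are acceptable. Returns a list of {Options, Result} pairs.
compare() ->
    Subject = "xyb",
    Pattern = "a?b?",
    [{Opts, re:run(Subject, Pattern, Opts)}
     || Opts <- [[], [notempty], [notempty_atstart]]].
```

With no options the empty match at offset 0 is accepted, giving {match,[{0,0}]}. With notempty every empty match is rejected, so the first result is the "b" at offset 2, {match,[{2,1}]}. With notempty_atstart only the empty match at the starting offset is rejected, so the empty match at offset 1 is returned, {match,[{1,0}]}.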

+

Perl has no direct equivalent of notempty or notempty_atstart, but it does
+ make a special case of a pattern match of the empty string
+ within its split() function, and when using the /g modifier. It
+ is possible to emulate Perl's behavior after matching a null
+ string by first trying the match again at the same offset with
+ notempty_atstart and anchored, and then, if that fails, by
+ advancing the starting offset (see below) and trying an ordinary
+ match again.
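The emulation described above can be sketched as an explicit loop over re:run/3. This is an illustration only, not how the global option is implemented: the module name `global_sketch` and the function `all_matches/2` are hypothetical, only the complete match is captured, and the sketch assumes a Latin-1 or binary subject with no unicode option, so advancing by one byte is always safe (multi-byte characters and CRLF newlines are ignored).

```erlang
-module(global_sketch).
-export([all_matches/2]).

%% Collect {Start, Length} for every match of Pattern in Subject,
%% handling empty matches the way the text above describes.
all_matches(Subject, Pattern) ->
    {ok, MP} = re:compile(Pattern),
    loop(Subject, MP, 0, iolist_size(Subject), []).

%% Stop once the offset has moved past the end of the subject.
loop(_Subject, _MP, Offset, Size, Acc) when Offset > Size ->
    lists:reverse(Acc);
loop(Subject, MP, Offset, Size, Acc) ->
    case re:run(Subject, MP, [{offset, Offset}, {capture, first}]) of
        nomatch ->
            lists:reverse(Acc);
        {match, [{Start, 0}]} ->
            %% Empty match: retry at the same offset, anchored and
            %% with empty matches at the starting offset forbidden.
            Retry = [{offset, Start}, anchored, notempty_atstart,
                     {capture, first}],
            case re:run(Subject, MP, Retry) of
                {match, [{Start, Len}]} when Len > 0 ->
                    %% Keep both results and continue after the longer one.
                    loop(Subject, MP, Start + Len, Size,
                         [{Start, Len}, {Start, 0} | Acc]);
                _ ->
                    %% Nothing longer here: advance by one byte and do
                    %% an ordinary match again.
                    loop(Subject, MP, Start + 1, Size, [{Start, 0} | Acc])
            end;
        {match, [{Start, Len}]} ->
            loop(Subject, MP, Start + Len, Size, [{Start, Len} | Acc])
    end.
```

Calling global_sketch:all_matches("cat", "(|at)") returns [{0,0},{1,0},{1,2},{3,0}], which is the sequence of complete matches that the re:run("cat","(|at)",[global]) walkthrough arrives at.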

notbol
@@ -394,6 +440,9 @@ This option makes it possible to include comments inside complicated patterns. N
all
All captured subpatterns including the complete matching string. This is the default.
+ all_names
+ All named subpatterns in the regular expression, as if a list()
+ of all the names in alphabetical order was given. The list of all names can also be retrieved with the inspect/2 function.
first
Only the first captured subpattern, which is always the complete matching part of the subject. All explicitly captured subpatterns are discarded.
all_but_first
diff --git a/lib/stdlib/src/re.erl b/lib/stdlib/src/re.erl
index 4d6de1100d..79176ff317 100644
--- a/lib/stdlib/src/re.erl
+++ b/lib/stdlib/src/re.erl
@@ -27,9 +27,9 @@
 -type compile_option() :: unicode | anchored | caseless | dollar_endonly
                         | dotall | extended | firstline | multiline
                         | no_auto_capture | dupnames | ungreedy
-                        | {newline, nl_spec()}| bsr_anycrlf
-                        | no_start_optimize | ucp | never_utf
-                        | bsr_unicode.
+                        | {newline, nl_spec()}
+                        | bsr_anycrlf | bsr_unicode
+                        | no_start_optimize | ucp | never_utf.
 
 %%% BIFs
-- 
cgit v1.2.3