Diffstat (limited to 'lib')
-rw-r--r-- | lib/stdlib/doc/src/re.xml | 81
-rw-r--r-- | lib/stdlib/src/re.erl | 6
2 files changed, 68 insertions, 19 deletions
diff --git a/lib/stdlib/doc/src/re.xml b/lib/stdlib/doc/src/re.xml index aae6345e84..8a47b1579d 100644 --- a/lib/stdlib/doc/src/re.xml +++ b/lib/stdlib/doc/src/re.xml @@ -101,7 +101,7 @@ <p><marker id="compile_options"/>The options have the following meanings:</p> <taglist> <tag><c>unicode</c></tag> - <item>The regular expression is given as a Unicode <c>charlist()</c> and the resulting regular expression code is to be run against a valid Unicode <c>charlist()</c> subject.</item> + <item>The regular expression is given as a Unicode <c>charlist()</c> and the resulting regular expression code is to be run against a valid Unicode <c>charlist()</c> subject. Also consider the <c>ucp</c> option when using Unicode characters.</item> <tag><c>anchored</c></tag> <item>The pattern is forced to be "anchored", that is, it is constrained to match only at the first matching point in the string that is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself.</item> <tag><c>caseless</c></tag> @@ -147,11 +147,51 @@ This option makes it possible to include comments inside complicated patterns. N <item>Specifies specifically that \R is to match only the cr, lf or crlf sequences, not the Unicode specific newline characters.</item> <tag><c>bsr_unicode</c></tag> <item>Specifies specifically that \R is to match all the Unicode newline characters (including crlf etc, the default).</item> + <tag><c>no_start_optimize</c></tag> + <item>This option disables optimization that may malfunction if "Special start-of-pattern items" are present in the regular expression. A typical example would be when matching "DEFABC" against "(*COMMIT)ABC", where the start optimization of PCRE would skip the subject up to the "A" and would never realize that the (*COMMIT) instruction should have made the matching fail. 
This option is only relevant if you use "start-of-pattern items", as discussed in the section "PCRE regular expression details" below.</item> + <tag><c>ucp</c></tag> + <item>Specifies that Unicode Character Properties should be used when resolving \B, \b, \D, \d, \S, \s, \Wand \w. Without this flag, only ISO-Latin-1 properties are used. Using Unicode properties hurts performance, but is semantically correct when working with Unicode characters beyond the ISO-Latin-1 range.</item> + <tag><c>never_utf</c></tag> + <item>Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern items" are forbidden. This flag can not be combined with <c>unicode</c>. Useful if ISO-Latin-1 patterns from an external source are to be compiled.</item> </taglist> </desc> </func> <func> + <name name="inspect" arity="2"/> + <fsummary>Inspects a compiled regular expression</fsummary> + <desc> + <p>This function takes a compiled regular expression and an item, returning the relevant data from the regular expression. Currently the only supported item is <c>namelist</c>, which returns the tuple <c>{namelist, [ binary()]}</c>, containing the names of all (unique) named subpatterns in the regular expression.</p> + <p>Example:</p> + <code type="none"> +1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)"). +{ok,{re_pattern,3,0,0, + <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255, + 255,255,...>>}} +2> re:inspect(MP,namelist). +{namelist,[<<"A">>,<<"B">>,<<"C">>]} +3> {ok,MPD} = re:compile("(?<C>A)|(?<B>B)|(?<C>C)",[dupnames]). +{ok,{re_pattern,3,0,0, + <<69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255, + 255,255,...>>}} +4> re:inspect(MPD,namelist). +{namelist,[<<"B">>,<<"C">>]}</code> + <p>Note specifically in the second example that the duplicate name only occurs once in the returned list, and that the list is in alphabetical order regardless of where the names are positioned in the regular expression. 
The order of the names is the same as the order of captured subexpressions if <c>{capture, all_names}</c> is given as an option to <c>re:run/3</c>. You can therefore create a name-to-value mapping from the result of <c>re:run/3</c> like this:</p> +<code> +1> {ok,MP} = re:compile("(?<A>A)|(?<B>B)|(?<C>C)"). +{ok,{re_pattern,3,0,0, + <<69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255, + 255,255,...>>}} +2> {namelist, N} = re:inspect(MP,namelist). +{namelist,[<<"A">>,<<"B">>,<<"C">>]} +3> {match,L} = re:run("AA",MP,[{capture,all_names,binary}]). +{match,[<<"A">>,<<>>,<<>>]} +4> NameMap = lists:zip(N,L). +[{<<"A">>,<<"A">>},{<<"B">>,<<>>},{<<"C">>,<<>>}]</code> + <p>More items are expected to be added in the future.</p> + </desc> + </func> + <func> <name name="run" arity="2"/> <fsummary>Match a subject against regular expression and capture subpatterns</fsummary> <desc> @@ -179,7 +219,7 @@ This option makes it possible to include comments inside complicated patterns. N <p>If the regular expression is previously compiled, the option list can only contain the options <c>anchored</c>, <c>global</c>, <c>notbol</c>, <c>noteol</c>, - <c>notempty</c>, <c>{offset, integer() >= 0}</c>, <c>{newline, + <c>notempty</c>, <c>notempty_atstart</c>, <c>{offset, integer() >= 0}</c>, <c>{newline, <anno>NLSpec</anno>}</c> and <c>{capture, <anno>ValueSpec</anno>}/{capture, <anno>ValueSpec</anno>, <anno>Type</anno>}</c>. Otherwise all options valid for the <c>re:compile/2</c> function are allowed as well. Options @@ -241,7 +281,7 @@ This option makes it possible to include comments inside complicated patterns. N When the global option is given, <c>re:run/3</c> handles empty matches in the same way as Perl: a zero-length match at any point will be retried with the options <c>[anchored, - notempty]</c> as well. If that search gives a result of length + notempty_atstart]</c> as well. If that search gives a result of length > 0, the result is included. 
For example:</p> <code> re:run("cat","(|at)",[global]).</code> @@ -254,9 +294,9 @@ This option makes it possible to include comments inside complicated patterns. N <c>[{0,0},{0,0}]</c> (the second <c>{0,0}</c> is due to the subexpression marked by the parentheses). As the length of the match is 0, we don't advance to the next position yet.</item> - <tag>At offset <c>0</c> with <c>[anchored, notempty]</c></tag> + <tag>At offset <c>0</c> with <c>[anchored, notempty_atstart]</c></tag> <item> The search is retried - with the options <c>[anchored, notempty]</c> at the same + with the options <c>[anchored, notempty_atstart]</c> at the same position, which does not give any interesting result of longer length, so the search position is now advanced to the next character (<c>a</c>).</item> @@ -264,7 +304,7 @@ This option makes it possible to include comments inside complicated patterns. N <item>This time, the search results in <c>[{1,0},{1,0}]</c>, so this search will also be repeated with the extra options.</item> - <tag>At offset <c>1</c> with <c>[anchored, notempty]</c></tag> + <tag>At offset <c>1</c> with <c>[anchored, notempty_atstart]</c></tag> <item>Now the <c>ab</c> alternative is found and the result will be [{1,2},{1,2}]. The result is added to the list of results and the position in the @@ -272,7 +312,7 @@ This option makes it possible to include comments inside complicated patterns. N <tag>At offset <c>3</c></tag> <item>The search now once again matches the empty string, giving <c>[{3,0},{3,0}]</c>.</item> - <tag>At offset <c>1</c> with <c>[anchored, notempty]</c></tag> + <tag>At offset <c>1</c> with <c>[anchored, notempty_atstart]</c></tag> <item>This will give no result of length > 0 and we are at the last position, so the global search is complete.</item> </taglist> @@ -293,15 +333,21 @@ This option makes it possible to include comments inside complicated patterns. N subject. 
With the <c>notempty</c> option, this match is not valid, so re:run/3 searches further into the string for occurrences of "a" or "b".</p> - - <p>Perl has no direct equivalent of <c>notempty</c>, but it does - make a special case of a pattern match of the empty string - within its split() function, and when using the /g modifier. It - is possible to emulate Perl's behavior after matching a null - string by first trying the match again at the same offset with - <c>notempty</c> and <c>anchored</c>, and then, if that fails, by - advancing the starting offset (see below) and trying an ordinary - match again.</p> + </item> + <tag><c>notempty_atstart</c></tag> + <item> + <p>This is like <c>notempty</c>, except that an empty string + match that is not at the start of the subject is permitted. If + the pattern is anchored, such a match can occur only if the + pattern contains \K.</p> + <p>Perl has no direct equivalent of <c>notempty</c> or <c>notempty_atstart</c>, but it does + make a special case of a pattern match of the empty string + within its split() function, and when using the /g modifier. It + is possible to emulate Perl's behavior after matching a null + string by first trying the match again at the same offset with + <c>notempty_atstart</c> and <c>anchored</c>, and then, if that fails, by + advancing the starting offset (see below) and trying an ordinary + match again.</p> </item> <tag><c>notbol</c></tag> @@ -394,6 +440,9 @@ This option makes it possible to include comments inside complicated patterns. N <taglist> <tag><c>all</c></tag> <item>All captured subpatterns including the complete matching string. This is the default.</item> + <tag><c>all_names</c></tag> + <item>All <em>named</em> subpatterns in the regular expression, as if a <c>list()</c> + of all the names <em>in alphabetical order</em> was given. 
The list of all names can also be retrieved with the <seealso marker="#inspect_2">inspect/2</seealso> function.</item> <tag><c>first</c></tag> <item>Only the first captured subpattern, which is always the complete matching part of the subject. All explicitly captured subpatterns are discarded.</item> <tag><c>all_but_first</c></tag> diff --git a/lib/stdlib/src/re.erl b/lib/stdlib/src/re.erl index 4d6de1100d..79176ff317 100644 --- a/lib/stdlib/src/re.erl +++ b/lib/stdlib/src/re.erl @@ -27,9 +27,9 @@ -type compile_option() :: unicode | anchored | caseless | dollar_endonly | dotall | extended | firstline | multiline | no_auto_capture | dupnames | ungreedy - | {newline, nl_spec()}| bsr_anycrlf - | no_start_optimize | ucp | never_utf - | bsr_unicode. + | {newline, nl_spec()} + | bsr_anycrlf | bsr_unicode + | no_start_optimize | ucp | never_utf. %%% BIFs |
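
The name-to-value mapping described in the new inspect/2 documentation can be wrapped in a small helper. The following is only a minimal sketch, not part of the patch: the module name re_names_sketch and the function name_map/2 are hypothetical, and it assumes an OTP release where re:inspect/2 and the all_names capture option from this change are available.

```erlang
%% Hypothetical helper module, for illustration only (not part of the patch).
-module(re_names_sketch).
-export([name_map/2]).

%% Compile Pattern, run it against Subject and pair every named
%% subpattern with the binary it captured.
name_map(Subject, Pattern) ->
    {ok, MP} = re:compile(Pattern),
    %% namelist returns the unique names in alphabetical order.
    {namelist, Names} = re:inspect(MP, namelist),
    %% all_names returns the captures in that same order, so the two
    %% lists can be zipped directly.
    case re:run(Subject, MP, [{capture, all_names, binary}]) of
        {match, Values} ->
            {ok, lists:zip(Names, Values)};
        match ->
            %% Assumption: a pattern without named subpatterns gives the
            %% bare atom match, as an empty capture list does.
            {ok, []};
        nomatch ->
            nomatch
    end.
```

With the pattern from the documentation example, re_names_sketch:name_map("AA", "(?<A>A)|(?<B>B)|(?<C>C)") should return the same pairs as the lists:zip/2 call shown there. Subpatterns that did not take part in the match appear with an empty binary as their value.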