Add documentation of extensions to re module

The following compile options are documented:
  no_start_optimize
  ucp
  never_utf
The following run options are documented:
  notempty_atstart
  {capture, all_names}
The following new functions are documented:
  re:inspect/2
author: Patrik Nyblom <[email protected]> 2013-07-19 14:49:17 +0200
committer: Patrik Nyblom <[email protected]> 2013-08-09 12:10:43 +0200
commit: 811134171adb1c926111338dba8f36b02a0ff4e0 (patch)
tree: 64d7ca6690f453690ff4bfd81960afb8a894ce13 /lib/stdlib
parent: 6146e7642d4bb9f7c9bb5f8cbca548c1d9667e5c (diff)
download: otp-811134171adb1c926111338dba8f36b02a0ff4e0.tar.gz
otp-811134171adb1c926111338dba8f36b02a0ff4e0.tar.bz2
otp-811134171adb1c926111338dba8f36b02a0ff4e0.zip
2 files changed, 68 insertions, 19 deletions
diff --git a/lib/stdlib/doc/src/re.xml b/lib/stdlib/doc/src/re.xml
index aae6345e84..8a47b1579d 100644
--- a/lib/stdlib/doc/src/re.xml
+++ b/lib/stdlib/doc/src/re.xml
@@ -101,7 +101,7 @@
       <p><marker id="compile_options"/>The options have the following meanings:</p>
       <taglist>
       <tag><c>unicode</c></tag>
-      <item>The regular expression is given as a Unicode <c>charlist()</c> and the resulting regular expression code is to be run against a valid Unicode <c>charlist()</c> subject.</item>
+      <item>The regular expression is given as a Unicode <c>charlist()</c> and the resulting regular expression code is to be run against a valid Unicode <c>charlist()</c> subject. Also consider the <c>ucp</c> option when using Unicode characters.</item>
       <tag><c>anchored</c></tag>
       <item>The pattern is forced to be "anchored", that is, it is constrained to match only at the first matching point in the string that is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself.</item>
       <tag><c>caseless</c></tag>
@@ -147,11 +147,51 @@ This option makes it possible to include comments inside complicated patterns. N
       <item>Specifies specifically that \R is to match only the cr, lf or crlf sequences, not the Unicode specific newline characters.</item>
       <tag><c>bsr_unicode</c></tag>
       <item>Specifies specifically that \R is to match all the Unicode newline characters (including crlf etc, the default).</item>
+      <tag><c>no_start_optimize</c></tag>
+      <item>This option disables optimization that may malfunction if "Special start-of-pattern items" are present in the regular expression. A typical example would be when matching "DEFABC" against "(*COMMIT)ABC", where the start optimization of PCRE would skip the subject up to the "A" and would never realize that the (*COMMIT) instruction should have made the matching fail. This option is only relevant if you use "start-of-pattern items", as discussed in the section "PCRE regular expression details" below.</item>
+      <tag><c>ucp</c></tag>
+      <item>Specifies that Unicode Character Properties should be used when resolving \B, \b, \D, \d, \S, \s, \W and \w. Without this flag, only ISO-Latin-1 properties are used. Using Unicode properties hurts performance, but is semantically correct when working with Unicode characters beyond the ISO-Latin-1 range.</item>
+      <tag><c>never_utf</c></tag>
+      <item>Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern items" are forbidden. This flag cannot be combined with <c>unicode</c>. Useful if ISO-Latin-1 patterns from an external source are to be compiled.</item>
       </taglist>
     </desc>
     </func> 
 
     <func>
+      <name name="inspect" arity="2"/>
+      <fsummary>Inspects a compiled regular expression</fsummary>
+      <desc>
+      <p>This function takes a compiled regular expression and an item, returning the relevant data from the regular expression. Currently the only supported item is <c>namelist</c>, which returns the tuple <c>{namelist, [ binary()]}</c>, containing the names of all (unique) named subpatterns in the regular expression.</p>
+      <p>Example:</p>
+      <code type="none">
+1&gt; {ok,MP} = re:compile("(?&lt;A&gt;A)|(?&lt;B&gt;B)|(?&lt;C&gt;C)").
+{ok,{re_pattern,3,0,0,
+                &lt;&lt;69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
+                  255,255,...&gt;&gt;}}
+2&gt; re:inspect(MP,namelist).
+{namelist,[&lt;&lt;"A"&gt;&gt;,&lt;&lt;"B"&gt;&gt;,&lt;&lt;"C"&gt;&gt;]}
+3&gt; {ok,MPD} = re:compile("(?&lt;C&gt;A)|(?&lt;B&gt;B)|(?&lt;C&gt;C)",[dupnames]).
+{ok,{re_pattern,3,0,0,
+                &lt;&lt;69,82,67,80,119,0,0,0,0,0,8,0,1,0,0,0,255,255,255,255,
+                  255,255,...&gt;&gt;}}
+4&gt; re:inspect(MPD,namelist).                                   
+{namelist,[&lt;&lt;"B"&gt;&gt;,&lt;&lt;"C"&gt;&gt;]}</code>
+      <p>Note specifically in the second example that the duplicate name only occurs once in the returned list, and that the list is in alphabetical order regardless of where the names are positioned in the regular expression. The order of the names is the same as the order of captured subexpressions if <c>{capture, all_names}</c> is given as an option to <c>re:run/3</c>. You can therefore create a name-to-value mapping from the result of <c>re:run/3</c> like this:</p>
+<code>
+1&gt; {ok,MP} = re:compile("(?&lt;A&gt;A)|(?&lt;B&gt;B)|(?&lt;C&gt;C)").
+{ok,{re_pattern,3,0,0,
+                &lt;&lt;69,82,67,80,119,0,0,0,0,0,0,0,1,0,0,0,255,255,255,255,
+                  255,255,...&gt;&gt;}}
+2&gt; {namelist, N} = re:inspect(MP,namelist).
+{namelist,[&lt;&lt;"A"&gt;&gt;,&lt;&lt;"B"&gt;&gt;,&lt;&lt;"C"&gt;&gt;]}
+3&gt; {match,L} = re:run("AA",MP,[{capture,all_names,binary}]).
+{match,[&lt;&lt;"A"&gt;&gt;,&lt;&lt;&gt;&gt;,&lt;&lt;&gt;&gt;]}
+4&gt; NameMap = lists:zip(N,L).
+[{&lt;&lt;"A"&gt;&gt;,&lt;&lt;"A"&gt;&gt;},{&lt;&lt;"B"&gt;&gt;,&lt;&lt;&gt;&gt;},{&lt;&lt;"C"&gt;&gt;,&lt;&lt;&gt;&gt;}]</code>
+      <p>More items are expected to be added in the future.</p>
+      </desc>
+    </func> 
+    <func>
       <name name="run" arity="2"/>
       <fsummary>Match a subject against regular expression and capture subpatterns</fsummary>
       <desc>
@@ -179,7 +219,7 @@ This option makes it possible to include comments inside complicated patterns. N
       <p>If the regular expression is previously compiled, the option
       list can only contain the options <c>anchored</c>,
       <c>global</c>, <c>notbol</c>, <c>noteol</c>,
-      <c>notempty</c>, <c>{offset, integer() >= 0}</c>, <c>{newline,
+      <c>notempty</c>, <c>notempty_atstart</c>, <c>{offset, integer() >= 0}</c>, <c>{newline,
       <anno>NLSpec</anno>}</c> and <c>{capture, <anno>ValueSpec</anno>}/{capture, <anno>ValueSpec</anno>,
       <anno>Type</anno>}</c>.  Otherwise all options valid for the
       <c>re:compile/2</c> function are allowed as well. Options
@@ -241,7 +281,7 @@ This option makes it possible to include comments inside complicated patterns. N
       When the global option is given, <c>re:run/3</c> handles empty
       matches in the same way as Perl: a zero-length match at any
       point will be retried with the options <c>[anchored,
-      notempty]</c> as well. If that search gives a result of length
+      notempty_atstart]</c> as well. If that search gives a result of length
       &gt; 0, the result is included.  For example:</p>
       
 <code>    re:run("cat","(|at)",[global]).</code>
@@ -254,9 +294,9 @@ This option makes it possible to include comments inside complicated patterns. N
       <c>[{0,0},{0,0}]</c> (the second <c>{0,0}</c> is due to the
       subexpression marked by the parentheses). As the length of the
       match is 0, we don't advance to the next position yet.</item>
-      <tag>At offset <c>0</c> with <c>[anchored, notempty]</c></tag>
+      <tag>At offset <c>0</c> with <c>[anchored, notempty_atstart]</c></tag>
       <item> The search is retried
-      with the options <c>[anchored, notempty]</c> at the same
+      with the options <c>[anchored, notempty_atstart]</c> at the same
       position, which does not give any interesting result of longer
       length, so the search position is now advanced to the next
       character (<c>a</c>).</item>
@@ -264,7 +304,7 @@ This option makes it possible to include comments inside complicated patterns. N
       <item>This time, the search results in
       <c>[{1,0},{1,0}]</c>, so this search will also be repeated
       with the extra options.</item>
-      <tag>At offset <c>1</c> with <c>[anchored, notempty]</c></tag>
+      <tag>At offset <c>1</c> with <c>[anchored, notempty_atstart]</c></tag>
       <item>Now the <c>at</c> alternative
       is found and the result will be [{1,2},{1,2}]. The result is
       added to the list of results and the position in the
@@ -272,7 +312,7 @@ This option makes it possible to include comments inside complicated patterns. N
       <tag>At offset <c>3</c></tag>
       <item>The search now once again
       matches the empty string, giving <c>[{3,0},{3,0}]</c>.</item>
-      <tag>At offset <c>1</c> with <c>[anchored, notempty]</c></tag>
+      <tag>At offset <c>3</c> with <c>[anchored, notempty_atstart]</c></tag>
       <item>This will give no result of length &gt; 0 and we are at
       the last position, so the global search is complete.</item>
       </taglist>
@@ -293,15 +333,21 @@ This option makes it possible to include comments inside complicated patterns. N
       subject. With the <c>notempty</c> option, this match is not
       valid, so re:run/3 searches further into the string for
       occurrences of "a" or "b".</p>
-
-      <p>Perl has no direct equivalent of <c>notempty</c>, but it does
-      make a special case of a pattern match of the empty string
-      within its split() function, and when using the /g modifier. It
-      is possible to emulate Perl's behavior after matching a null
-      string by first trying the match again at the same offset with
-      <c>notempty</c> and <c>anchored</c>, and then, if that fails, by
-      advancing the starting offset (see below) and trying an ordinary
-      match again.</p>
+      </item>
+      <tag><c>notempty_atstart</c></tag>
+      <item>
+	<p>This is like <c>notempty</c>, except that an empty string
+	match that is not at the start of the subject is permitted. If
+	the pattern is anchored, such a match can occur only if the
+	pattern contains \K.</p>
+	<p>Perl has no direct equivalent of <c>notempty</c> or <c>notempty_atstart</c>, but it does
+	make a special case of a pattern match of the empty string
+	within its split() function, and when using the /g modifier. It
+	is possible to emulate Perl's behavior after matching a null
+	string by first trying the match again at the same offset with
+	<c>notempty_atstart</c> and <c>anchored</c>, and then, if that fails, by
+	advancing the starting offset (see below) and trying an ordinary
+	match again.</p>
       </item>
       <tag><c>notbol</c></tag>
 
@@ -394,6 +440,9 @@ This option makes it possible to include comments inside complicated patterns. N
         <taglist>
         <tag><c>all</c></tag>
         <item>All captured subpatterns including the complete matching string. This is the default.</item>
+        <tag><c>all_names</c></tag>
+        <item>All <em>named</em> subpatterns in the regular expression, as if a <c>list()</c> 
+	of all the names <em>in alphabetical order</em> was given. The list of all names can also be retrieved with the <seealso marker="#inspect_2">inspect/2</seealso> function.</item>
         <tag><c>first</c></tag>
         <item>Only the first captured subpattern, which is always the complete matching part of the subject. All explicitly captured subpatterns are discarded.</item>
         <tag><c>all_but_first</c></tag>
diff --git a/lib/stdlib/src/re.erl b/lib/stdlib/src/re.erl
index 4d6de1100d..79176ff317 100644
--- a/lib/stdlib/src/re.erl
+++ b/lib/stdlib/src/re.erl
@@ -27,9 +27,9 @@
 -type compile_option() :: unicode | anchored | caseless | dollar_endonly
                         | dotall | extended | firstline | multiline
                         | no_auto_capture | dupnames | ungreedy
-                        | {newline, nl_spec()}| bsr_anycrlf
-                        | no_start_optimize | ucp | never_utf
-                        | bsr_unicode.
+                        | {newline, nl_spec()}
+                        | bsr_anycrlf | bsr_unicode
+                        | no_start_optimize | ucp | never_utf.
 
 %%% BIFs
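
The documentation added above gives no shell example for ucp or notempty_atstart. The sketch below is one way to see their effect, assuming an OTP release that includes this commit. The module name re_ext_demo and its two functions are hypothetical and not part of the patch; only re:run/2,3 and the option names come from the documentation.

```erlang
-module(re_ext_demo).
-export([ucp_demo/0, notempty_demo/0]).

%% Hypothetical demo module, not part of the commit. It only calls
%% re:run/2 and re:run/3 with options described in the re.xml changes.

%% Match \w against code point 955 (a Greek letter), which lies beyond
%% the ISO-Latin-1 range. {capture, none} reduces the result to
%% match | nomatch.
ucp_demo() ->
    Subject = [955],
    WithoutUcp = re:run(Subject, "\\w", [unicode, {capture, none}]),
    WithUcp = re:run(Subject, "\\w", [unicode, ucp, {capture, none}]),
    {WithoutUcp, WithUcp}.

%% "x*" can match the empty string at every position of "abc".
%% notempty rejects all empty matches; notempty_atstart rejects only
%% the empty match at the start of the subject.
notempty_demo() ->
    Plain = re:run("abc", "x*"),
    NotEmpty = re:run("abc", "x*", [notempty]),
    AtStart = re:run("abc", "x*", [notempty_atstart]),
    {Plain, NotEmpty, AtStart}.
```

Under these assumptions, re_ext_demo:ucp_demo() should return {nomatch,match}, and re_ext_demo:notempty_demo() should return {{match,[{0,0}]},nomatch,{match,[{1,0}]}}: the empty match at offset 0 is refused by both options, but notempty_atstart accepts the empty match at offset 1.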