diff options
Diffstat (limited to 'lib/stdlib/doc/src/re.xml')
-rw-r--r-- | lib/stdlib/doc/src/re.xml | 56 |
1 files changed, 31 insertions, 25 deletions
diff --git a/lib/stdlib/doc/src/re.xml b/lib/stdlib/doc/src/re.xml index a1833f6a51..8c19926b10 100644 --- a/lib/stdlib/doc/src/re.xml +++ b/lib/stdlib/doc/src/re.xml @@ -9,16 +9,17 @@ <holder>Ericsson AB, All Rights Reserved</holder> </copyright> <legalnotice> - The contents of this file are subject to the Erlang Public License, - Version 1.1, (the "License"); you may not use this file except in - compliance with the License. You should have received a copy of the - Erlang Public License along with this software. If not, it can be - retrieved on line at http://www.erlang.org/. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 - Software distributed under the License is distributed on an "AS IS" - basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See - the License for the specific language governing rights and limitations - under the License. + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. The Initial Developer of the Original Code is Ericsson AB. </legalnotice> @@ -150,7 +151,11 @@ This option makes it possible to include comments inside complicated patterns. N <tag><c>no_start_optimize</c></tag> <item>This option disables optimization that may malfunction if "Special start-of-pattern items" are present in the regular expression. A typical example would be when matching "DEFABC" against "(*COMMIT)ABC", where the start optimization of PCRE would skip the subject up to the "A" and would never realize that the (*COMMIT) instruction should have made the matching fail. This option is only relevant if you use "start-of-pattern items", as discussed in the section "PCRE regular expression details" below.</item> <tag><c>ucp</c></tag> - <item>Specifies that Unicode Character Properties should be used when resolving \B, \b, \D, \d, \S, \s, \Wand \w. Without this flag, only ISO-Latin-1 properties are used. Using Unicode properties hurts performance, but is semantically correct when working with Unicode characters beyond the ISO-Latin-1 range.</item> + <item>Specifies that Unicode Character Properties should be used when + resolving \B, \b, \D, \d, \S, \s, \W and \w. Without this flag, only + ISO-Latin-1 properties are used. Using Unicode properties hurts + performance, but is semantically correct when working with Unicode + characters beyond the ISO-Latin-1 range.</item> <tag><c>never_utf</c></tag> <item>Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern items" are forbidden. This flag can not be combined with <c>unicode</c>. Useful if ISO-Latin-1 patterns from an external source are to be compiled.</item> </taglist> @@ -200,8 +205,8 @@ This option makes it possible to include comments inside complicated patterns. N </func> <func> <name name="run" arity="3"/> - <type_desc variable="CompileOpt">See <seealso marker="#compile_options">compile/2</seealso> above.</type_desc> <fsummary>Match a subject against regular expression and capture subpatterns</fsummary> + <type_desc variable="CompileOpt">See <seealso marker="#compile_options">compile/2</seealso> above.</type_desc> <desc> <p>Executes a regexp matching, returning <c>match/{match, @@ -876,11 +881,11 @@ nomatch </desc> </func> </funcs> - - <marker id="regexp_syntax"></marker> + <section> <title>PERL LIKE REGULAR EXPRESSIONS SYNTAX</title> - <p>The following sections contain reference material for the + <p><marker id="regexp_syntax"></marker> + The following sections contain reference material for the regular expressions used by this module. The regular expression reference is based on the PCRE documentation, with changes in cases where the re module behaves differently to the PCRE library.</p> @@ -966,7 +971,7 @@ appearance causes an error. </quote> <p>This has the same effect as setting the <c>ucp</c> option: it causes sequences such as \d and \w to use Unicode properties to determine character types, -instead of recognizing only characters with codes less than 128 via a lookup +instead of recognizing only characters with codes less than 256 via a lookup table. </p> @@ -1307,7 +1312,8 @@ By default, the definition of letters and digits is controlled by PCRE's low-valued character tables, in Erlang's case (and without the <c>unicode</c> option), the ISO-Latin-1 character set.</p> -<p>By default, in <c>unicode</c> mode, characters with values greater than 128 never match +<p>By default, in <c>unicode</c> mode, characters with values greater than 255, +i.e. all characters outside the ISO-Latin-1 character set, never match \d, \s, or \w, and always match \D, \S, and \W. These sequences retain their original meanings from before UTF support was available, mainly for efficiency reasons. However, if the <c>ucp</c> option is set, the behaviour is changed so that Unicode @@ -1954,10 +1960,10 @@ can be included in a class as a literal string of data units, or by using the upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would. In a UTF mode, PCRE always understands the concept of -case for characters whose values are less than 128, so caseless matching is +case for characters whose values are less than 256, so caseless matching is always possible. For characters with higher values, the concept of case is supported if PCRE is compiled with Unicode property support, but not otherwise. -If you want to use caseless matching in a UTF mode for characters 128 and +If you want to use caseless matching in a UTF mode for characters 256 and above, you must ensure that PCRE is compiled with Unicode property support as well as with UTF support.</p> @@ -1989,7 +1995,7 @@ matches the letters in either case. For example, [W-c] is equivalent to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character tables for a French locale are in use, [\xc8-\xcb] matches accented E characters in both cases. In UTF modes, PCRE supports the concept of case for -characters with values greater than 128 only when it is compiled with Unicode +characters with values greater than 255 only when it is compiled with Unicode property support.</p> <p>The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, @@ -2062,9 +2068,9 @@ by a ^ character after the colon. For example,</p> syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not supported, and an error is given if they are encountered.</p> -<p>By default, in UTF modes, characters with values greater than 128 do not match +<p>By default, in UTF modes, characters with values greater than 255 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed -to <b>pcre_compile()</b>, some of the classes are changed so that Unicode +to <em>pcre_compile()</em>, some of the classes are changed so that Unicode character properties are used. This is achieved by replacing the POSIX classes by other sequences, as follows:</p> @@ -2072,16 +2078,16 @@ by other sequences, as follows:</p> <tag>[:alnum:]</tag> <item>becomes <em>\p{Xan}</em></item> <tag>[:alpha:]</tag> <item>becomes <em>\p{L}</em></item> <tag>[:blank:]</tag> <item>becomes <em>\h</em></item> - <tag>[:digit:</tag>] <item>becomes <em>\p{Nd}</em></item> + <tag>[:digit:]</tag> <item>becomes <em>\p{Nd}</em></item> <tag>[:lower:]</tag> <item>becomes <em>\p{Ll}</em></item> <tag>[:space:]</tag> <item>becomes <em>\p{Xps}</em></item> - <tag>[:upper:</tag>] <item>becomes <em>\p{Lu}</em></item> + <tag>[:upper:]</tag> <item>becomes <em>\p{Lu}</em></item> <tag>[:word:]</tag> <item>becomes <em>\p{Xwd}</em></item> </taglist> <p>Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX classes are unchanged, and match only characters with code points less than -128.</p> +256.</p> </section> @@ -3053,7 +3059,7 @@ default newline convention is in force:</p> <quote><p> abc #comment \n still comment</p></quote> -<p>On encountering the # character, <b>pcre_compile()</b> skips along, looking for +<p>On encountering the # character, <em>pcre_compile()</em> skips along, looking for a newline in the pattern. The sequence \n is still literal at this stage, so it does not terminate the comment. Only an actual character with the code value 0x0a (the default newline) does so.</p> |