From 748f3a6f7414f4e92597b76d430b492c4503e7c5 Mon Sep 17 00:00:00 2001 From: Anthony Ramine Date: Thu, 18 Dec 2014 01:34:37 +0100 Subject: Fix documentation about character types in unicode mode Behaviour of character types \d, \w and \s has always been to not match characters with value above 255, not 128, i.e. they are limited to ISO-Latin-1 and not ASCII. --- lib/stdlib/doc/src/re.xml | 21 +++++++++++++-------- 1 file changed, 13 insertions(+), 8 deletions(-) (limited to 'lib/stdlib') diff --git a/lib/stdlib/doc/src/re.xml b/lib/stdlib/doc/src/re.xml index a1833f6a51..5af1468e9b 100644 --- a/lib/stdlib/doc/src/re.xml +++ b/lib/stdlib/doc/src/re.xml @@ -150,7 +150,11 @@ This option makes it possible to include comments inside complicated patterns. N no_start_optimize This option disables optimization that may malfunction if "Special start-of-pattern items" are present in the regular expression. A typical example would be when matching "DEFABC" against "(*COMMIT)ABC", where the start optimization of PCRE would skip the subject up to the "A" and would never realize that the (*COMMIT) instruction should have made the matching fail. This option is only relevant if you use "start-of-pattern items", as discussed in the section "PCRE regular expression details" below. ucp - Specifies that Unicode Character Properties should be used when resolving \B, \b, \D, \d, \S, \s, \Wand \w. Without this flag, only ISO-Latin-1 properties are used. Using Unicode properties hurts performance, but is semantically correct when working with Unicode characters beyond the ISO-Latin-1 range. + Specifies that Unicode Character Properties should be used when + resolving \B, \b, \D, \d, \S, \s, \W and \w. Without this flag, only + ISO-Latin-1 properties are used. Using Unicode properties hurts + performance, but is semantically correct when working with Unicode + characters beyond the ISO-Latin-1 range. never_utf Specifies that the (*UTF) and/or (*UTF8) "start-of-pattern items" are forbidden. This flag can not be combined with unicode. Useful if ISO-Latin-1 patterns from an external source are to be compiled. @@ -966,7 +970,7 @@ appearance causes an error.

This has the same effect as setting the ucp option: it causes sequences such as \d and \w to use Unicode properties to determine character types, -instead of recognizing only characters with codes less than 128 via a lookup +instead of recognizing only characters with codes less than 256 via a lookup table.

@@ -1307,7 +1311,8 @@ By default, the definition of letters and digits is controlled by PCRE's low-valued character tables, in Erlang's case (and without the unicode option), the ISO-Latin-1 character set.

-

By default, in unicode mode, characters with values greater than 128 never match +

By default, in unicode mode, characters with values greater than 255, +i.e. all characters outside the ISO-Latin-1 character set, never match \d, \s, or \w, and always match \D, \S, and \W. These sequences retain their original meanings from before UTF support was available, mainly for efficiency reasons. However, if the ucp option is set, the behaviour is changed so that Unicode @@ -1954,10 +1959,10 @@ can be included in a class as a literal string of data units, or by using the upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would. In a UTF mode, PCRE always understands the concept of -case for characters whose values are less than 128, so caseless matching is +case for characters whose values are less than 256, so caseless matching is always possible. For characters with higher values, the concept of case is supported if PCRE is compiled with Unicode property support, but not otherwise. -If you want to use caseless matching in a UTF mode for characters 128 and +If you want to use caseless matching in a UTF mode for characters 256 and above, you must ensure that PCRE is compiled with Unicode property support as well as with UTF support.

@@ -1989,7 +1994,7 @@ matches the letters in either case. For example, [W-c] is equivalent to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character tables for a French locale are in use, [\xc8-\xcb] matches accented E characters in both cases. In UTF modes, PCRE supports the concept of case for -characters with values greater than 128 only when it is compiled with Unicode +characters with values greater than 255 only when it is compiled with Unicode property support.

The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, @@ -2062,7 +2067,7 @@ by a ^ character after the colon. For example,

syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not supported, and an error is given if they are encountered.

-

By default, in UTF modes, characters with values greater than 128 do not match +

By default, in UTF modes, characters with values greater than 255 do not match any of the POSIX character classes. However, if the PCRE_UCP option is passed to pcre_compile(), some of the classes are changed so that Unicode character properties are used. This is achieved by replacing the POSIX classes @@ -2081,7 +2086,7 @@ by other sequences, as follows:

Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX classes are unchanged, and match only characters with code points less than -128.

+256.

-- cgit v1.2.3