From 34db76765561487e526fe66d3d19ecf3b3fb9dc8 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bj=C3=B6rn=20Gustavsson?=
Date: Tue, 30 Aug 2011 11:51:11 +0200
Subject: Allow noncharacter code points in unicode encoding and decoding

The two noncharacter code points 16#FFFE and 16#FFFF were not
allowed to be encoded or decoded using the unicode module or
bit syntax. That causes an inconsistency, since the noncharacters
16#FDD0 to 16#FDEF could be encoded/decoded.

There are two ways to fix that inconsistency.

We have chosen to allow 16#FFFE and 16#FFFF to be encoded and
decoded, because the noncharacters could be useful internally
within an application and it will make encoding and decoding
slightly faster.

Reported-by: Alisdair Sullivan
---
 system/doc/reference_manual/expressions.xml | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

(limited to 'system/doc')

diff --git a/system/doc/reference_manual/expressions.xml b/system/doc/reference_manual/expressions.xml
index 497d7eb464..644896cd7f 100644
--- a/system/doc/reference_manual/expressions.xml
+++ b/system/doc/reference_manual/expressions.xml
@@ -879,9 +879,8 @@ Ei = Value |
 and UTF-32, respectively.

 When constructing a segment of a utf type, Value
- must be an integer in one of the ranges 0..16#D7FF,
- 16#E000..16#FFFD, or 16#10000..16#10FFFF
- (i.e. a valid Unicode code point). Construction
+ must be an integer in the range 0..16#D7FF or
+ 16#E000....16#10FFFF. Construction
 will fail with a badarg exception if Value is
 outside the allowed ranges. The size of the resulting
 binary segment depends on the type and/or Value. For utf8,
@@ -896,14 +895,13 @@ Ei = Value |
 >]]>.

 A successful match of a segment of a utf type results
- in an integer in one of the ranges 0..16#D7FF, 16#E000..16#FFFD,
- or 16#10000..16#10FFFF
- (i.e. a valid Unicode code point). The match will fail if returned value
+ in an integer in the range 0..16#D7FF or 16#E000..16#10FFFF.
+ The match will fail if returned value
 would fall outside those ranges.

 A segment of type utf8 will match 1 to 4 bytes in the binary,
 if the binary at the match position contains a valid UTF-8 sequence.
- (See RFC-2279 or the Unicode standard.)

+ (See RFC-3629 or the Unicode standard.)

 A segment of type utf16 may match 2 or 4 bytes in the binary.
 The match will fail if the binary at the match position does not contain
--
cgit v1.2.3
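
The behaviour described by the changed paragraphs can be illustrated with a small Erlang sketch. It is not part of the patch: the module name `utf_noncharacters` and the helper functions `encode/1` and `decode/1` are hypothetical names chosen for illustration, and the sketch assumes an Erlang/OTP release that includes this change, so that 16#FFFE and 16#FFFF are accepted.

```erlang
-module(utf_noncharacters).
-export([encode/1, decode/1]).

%% Hypothetical helper: try to construct a utf8 segment from an integer.
%% Construction raises badarg when the value lies outside
%% 0..16#D7FF and 16#E000..16#10FFFF.
encode(CodePoint) ->
    try
        {ok, <<CodePoint/utf8>>}
    catch
        error:badarg -> {error, badarg}
    end.

%% Hypothetical helper: match one utf8 segment at the start of a binary.
%% The match fails if the bytes are not a valid UTF-8 sequence or if the
%% decoded value would fall outside the allowed ranges.
decode(<<CodePoint/utf8, Rest/binary>>) -> {ok, CodePoint, Rest};
decode(_Other) -> nomatch.
```

Under that assumption, `utf_noncharacters:encode(16#FFFE)` should return `{ok,<<239,191,190>>}`, while a surrogate such as `16#D800` or a value above `16#10FFFF` should still give `{error,badarg}`. Likewise, `utf_noncharacters:decode(<<239,191,191>>)` should return `{ok,65535,<<>>}`.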