From 26b59dfe67ef551cd94765557cdd8c79794bcc38 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jos=C3=A9=20Valim?=
Date: Tue, 31 May 2016 14:28:54 +0200
Subject: Add new AtU8 beam chunk
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The new chunk stores atoms encoded in UTF-8.

beam_lib has also been modified to handle the new
'utf8_atoms' attribute while the 'atoms' attribute
may be a missing chunk from now on.

The binary_to_atom/2 BIF can now encode any utf8
binary with up to 255 characters.

The list_to_atom/1 BIF can now accept codepoints
higher than 255 with up to 255 characters (thanks
to Björn Gustavsson).
---
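Note: the behaviour described in the commit message can be exercised with a
short module such as the sketch below. It assumes Erlang/OTP 20 or later; the
module name utf8_atoms_demo and the function run/0 are hypothetical and not
part of this patch. The code point 1024 is the one used in the erlang.xml
example changed further down.

```erlang
-module(utf8_atoms_demo).
-export([run/0]).

%% Minimal sketch, assuming Erlang/OTP 20 or later.
%% On earlier releases the atom conversions below raise badarg.
run() ->
    %% Code point 1024 (U+0400) occupies two bytes in UTF-8.
    Bin = <<1024/utf8>>,
    <<208,128>> = Bin,

    %% binary_to_atom/2 and list_to_atom/1 yield the same atom.
    Atom = binary_to_atom(Bin, utf8),
    Atom = list_to_atom([1024]),

    %% The atom converts back to the same UTF-8 binary.
    Bin = atom_to_binary(Atom, utf8),

    %% Latin-1 cannot represent code point 1024, so this raises badarg.
    badarg = try atom_to_binary(Atom, latin1)
             catch error:badarg -> badarg
             end,

    %% The limit is counted in characters, not bytes: 255 two-byte
    %% characters are accepted, 256 characters are rejected.
    accepted = try_atom(lists:duplicate(255, 1024)),
    rejected = try_atom(lists:duplicate(256, 1024)),
    ok.

%% Hypothetical helper: reports whether list_to_atom/1 accepts the string.
try_atom(String) ->
    try list_to_atom(String) of
        A when is_atom(A) -> accepted
    catch
        error:_ -> rejected
    end.
```

Compiling the module and calling utf8_atoms_demo:run() should return ok if
every match holds; a badmatch error points at the assumption that failed.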
erts/doc/src/erl_ext_dist.xml | 15 +++++----------
erts/doc/src/erlang.xml | 35 +++++++++++++----------------------
2 files changed, 18 insertions(+), 32 deletions(-)
(limited to 'erts/doc')
diff --git a/erts/doc/src/erl_ext_dist.xml b/erts/doc/src/erl_ext_dist.xml
index 4f799f8f34..a436a9ca74 100644
--- a/erts/doc/src/erl_ext_dist.xml
+++ b/erts/doc/src/erl_ext_dist.xml
@@ -119,16 +119,11 @@
Compressed Data Format when Expanded
-
As from ERTS 5.10 (OTP R16) support
- for UTF-8 encoded atoms has been introduced in the external format.
- However, only characters that can be encoded using Latin-1 (ISO-8859-1)
- are currently supported in atoms. The support for UTF-8 encoded atoms
- in the external format has been implemented to be able to support
- all Unicode characters in atoms in some future release.
- Until full Unicode support for atoms has been introduced,
- it is an error to pass atoms containing
- characters that cannot be encoded in Latin-1, and the behavior is
- undefined.
+
As from ERTS 9.0 (OTP 20), UTF-8 encoded atoms may contain any Unicode
+ character. Although the support for UTF-8 encoded atoms in the external
+ format is available since ERTS 5.10 (OTP R16), passing atoms that cannot
+ be encoded in Latin-1 is an error in versions earlier than
+ Erlang/OTP 20, and the behavior is undefined.
When distribution flag DFLAG_UTF8_ATOMS has been exchanged between both nodes
in the
diff --git a/erts/doc/src/erlang.xml b/erts/doc/src/erlang.xml
index b3fab3874b..cf038c49f0 100644
--- a/erts/doc/src/erlang.xml
+++ b/erts/doc/src/erlang.xml
@@ -325,16 +325,11 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).
is latin1, one byte exists for each character
in the text representation. If Encoding is
utf8 or
- unicode, the characters are encoded using UTF-8
- (that is, characters from 128 through 255 are
- encoded in two bytes).
+ unicode, the characters are encoded using UTF-8 where
+ characters may require multiple bytes.
-
atom_to_binary(Atom, latin1) never
- fails, as the text representation of an atom can only
- contain characters from 0 through 255. In a future release,
- the text representation
- of atoms can be allowed to contain any Unicode character and
- atom_to_binary(Atom, latin1) then fails if the
+
As from Erlang/OTP 20, atoms can contain any Unicode character
+ and atom_to_binary(Atom, latin1) may fail if the
text representation for Atom contains a Unicode
character > 255.
@@ -402,13 +397,11 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).
translation of bytes in the binary is done.
If Encoding
is utf8 or unicode, the binary must contain
- valid UTF-8 sequences. Only Unicode characters up
- to 255 are allowed.
+ valid UTF-8 sequences.
-
binary_to_atom(Binary, utf8) fails if
- the binary contains Unicode characters > 255.
- In a future release, such Unicode characters can be allowed and
- binary_to_atom(Binary, utf8) does then not fail.
+
As from Erlang/OTP 20, binary_to_atom(Binary, utf8)
+ is capable of encoding any Unicode character. Earlier versions would
+ fail if the binary contained Unicode characters > 255.
For more information about Unicode support in atoms, see the
note on UTF-8
encoded atoms
@@ -419,9 +412,7 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).
> binary_to_atom(<<"Erlang">>, latin1).
'Erlang'
> binary_to_atom(<<1024/utf8>>, utf8).
-** exception error: bad argument
- in function binary_to_atom/2
- called as binary_to_atom(<<208,128>>,utf8)
+'Ѐ'
@@ -2401,10 +2392,10 @@ os_prompt%
Returns the atom whose text representation is
String.
-
String can only contain ISO-latin-1
- characters (that is, numbers < 256) as the implementation does not
- allow Unicode characters equal to or above 256 in atoms.
- For more information on Unicode support in atoms, see
+
As from Erlang/OTP 20, String may contain
+ any Unicode character. Earlier versions allowed only ISO-latin-1
+ characters as the implementation did not allow Unicode characters
+ above 255. For more information on Unicode support in atoms, see
note on UTF-8
encoded atoms
in section "External Term Format" in the User's Guide.