From 26b59dfe67ef551cd94765557cdd8c79794bcc38 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jos=C3=A9=20Valim?=
Date: Tue, 31 May 2016 14:28:54 +0200
Subject: Add new AtU8 beam chunk
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The new chunk stores atoms encoded in UTF-8.

beam_lib has also been modified to handle the new 'utf8_atoms' attribute while the 'atoms' attribute may be a missing chunk from now on.

The binary_to_atom/2 BIF can now encode any utf8 binary with up to 255 characters.

The list_to_atom/1 BIF can now accept codepoints higher than 255 with up to 255 characters (thanks to Björn Gustavsson).
---
 erts/doc/src/erl_ext_dist.xml | 15 +++++----------
 erts/doc/src/erlang.xml       | 35 +++++++++++++----------------------
 2 files changed, 18 insertions(+), 32 deletions(-)

(limited to 'erts/doc')

diff --git a/erts/doc/src/erl_ext_dist.xml b/erts/doc/src/erl_ext_dist.xml
index 4f799f8f34..a436a9ca74 100644
--- a/erts/doc/src/erl_ext_dist.xml
+++ b/erts/doc/src/erl_ext_dist.xml
@@ -119,16 +119,11 @@ Compressed Data Format when Expanded
-As from ERTS 5.10 (OTP R16) support
-for UTF-8 encoded atoms has been introduced in the external format.
-However, only characters that can be encoded using Latin-1 (ISO-8859-1)
-are currently supported in atoms. The support for UTF-8 encoded atoms
-in the external format has been implemented to be able to support
-all Unicode characters in atoms in some future release.
-Until full Unicode support for atoms has been introduced,
-it is an error to pass atoms containing
-characters that cannot be encoded in Latin-1, and the behavior is
-undefined.

+As from ERTS 9.0 (OTP 20), UTF-8 encoded atoms may contain any Unicode
+character. Although the support for UTF-8 encoded atoms in the external
+format is available since ERTS 5.10 (OTP R16), passing atoms that cannot
+be encoded in Latin-1 is an error in versions earlier than
+Erlang/OTP 20, and the behavior is undefined.

 When distribution flag DFLAG_UTF8_ATOMS has been exchanged between both nodes in the
diff --git a/erts/doc/src/erlang.xml b/erts/doc/src/erlang.xml
index b3fab3874b..cf038c49f0 100644
--- a/erts/doc/src/erlang.xml
+++ b/erts/doc/src/erlang.xml
@@ -325,16 +325,11 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).
 is latin1, one byte exists for each character in the text representation. If Encoding is utf8 or
-unicode, the characters are encoded using UTF-8
-(that is, characters from 128 through 255 are
-encoded in two bytes).

+unicode, the characters are encoded using UTF-8 where
+characters may require multiple bytes.

-atom_to_binary(Atom, latin1) never
-fails, as the text representation of an atom can only
-contain characters from 0 through 255. In a future release,
-the text representation
-of atoms can be allowed to contain any Unicode character and
-atom_to_binary(Atom, latin1) then fails if the
+As from Erlang/OTP 20, atoms can contain any Unicode character
+and atom_to_binary(Atom, latin1) may fail if the
 text representation for Atom contains a Unicode character > 255.

@@ -402,13 +397,11 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).
 translation of bytes in the binary is done. If Encoding is utf8 or unicode, the binary must contain
-valid UTF-8 sequences. Only Unicode characters up
-to 255 are allowed.

+ valid UTF-8 sequences.

-binary_to_atom(Binary, utf8) fails if
-the binary contains Unicode characters > 255.
-In a future release, such Unicode characters can be allowed and
-binary_to_atom(Binary, utf8) does then not fail.
+As from Erlang/OTP 20, binary_to_atom(Binary, utf8)
+is capable of encoding any Unicode character. Earlier versions would
+fail if the binary contained Unicode characters > 255.
 For more information about Unicode support in atoms, see the note on UTF-8 encoded atoms
@@ -419,9 +412,7 @@ Z = erlang:adler32_combine(X,Y,iolist_size(Data2)).
 > binary_to_atom(<<"Erlang">>, latin1).
 'Erlang'
 > binary_to_atom(<<1024/utf8>>, utf8).
-** exception error: bad argument
-in function binary_to_atom/2
-called as binary_to_atom(<<208,128>>,utf8)
+'Ѐ'
@@ -2401,10 +2392,10 @@ os_prompt%

Returns the atom whose text representation is String.

-String can only contain ISO-latin-1
-characters (that is, numbers < 256) as the implementation does not
-allow Unicode characters equal to or above 256 in atoms.
-For more information on Unicode support in atoms, see
+As from Erlang/OTP 20, String may contain
+any Unicode character. Earlier versions allowed only ISO-latin-1
+characters as the implementation did not allow Unicode characters
+above 255. For more information on Unicode support in atoms, see
 note on UTF-8 encoded atoms in section "External Term Format" in the User's Guide.

-- cgit v1.2.3