From b15688d40d5147c1122aaad3b82495fbbc4dede8 Mon Sep 17 00:00:00 2001
From: Rickard Green
Date: Sat, 19 Jan 2013 00:45:16 +0100
Subject: UTF-8 atom documentation

---
 erts/doc/src/erl_ext_dist.xml | 121 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 108 insertions(+), 13 deletions(-)

diff --git a/erts/doc/src/erl_ext_dist.xml b/erts/doc/src/erl_ext_dist.xml
index fd2da2cfe3..28afea8b29 100644
--- a/erts/doc/src/erl_ext_dist.xml
+++ b/erts/doc/src/erl_ext_dist.xml
@@ -119,10 +119,39 @@ Data
+
+
+

As of ERTS version 5.10 (OTP-R16) support + for UTF-8 encoded atoms has been introduced in the external format. + However, only characters that can be encoded using Latin1 (ISO-8859-1) + are currently supported in atoms. The support for UTF-8 encoded atoms + in the external format has been implemented in order to be able to support + all Unicode characters in atoms in some future release. Full + support for Unicode atoms will not happen before OTP-R18, and might + be introduced even later than that. Until full Unicode support for + atoms has been introduced, it is an error to pass atoms containing + characters that cannot be encoded in Latin1, and the behavior is + undefined.

+

When the + DFLAG_UTF8_ATOMS + distribution flag has been exchanged between both nodes in the + distribution handshake, + all atoms in the distribution header will be encoded in UTF-8; otherwise, + all atoms in the distribution header will be encoded in Latin1. The two + new tags ATOM_UTF8_EXT and + SMALL_ATOM_UTF8_EXT + will only be used if the DFLAG_UTF8_ATOMS distribution flag has + been exchanged between nodes, or if an atom containing characters + that cannot be encoded in Latin1 is encountered. +
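
The rule above can be sketched in Python as follows. This is only an illustration: the function name `atom_encoding` and its parameters are hypothetical, and the sketch assumes the UTF-8 tags are always chosen whenever the text permits them, which the text itself does not promise.

```python
def atom_encoding(name: str, utf8_atoms_flag: bool) -> str:
    """Return "utf8" or "latin1" for an atom name.

    utf8_atoms_flag stands for "DFLAG_UTF8_ATOMS was exchanged in the
    handshake" (hypothetical parameter, not a real API).
    """
    try:
        name.encode("latin-1")
        latin1_ok = True
    except UnicodeEncodeError:
        latin1_ok = False

    # Assumption: use the UTF-8 tags in every case where the text allows them.
    if utf8_atoms_flag or not latin1_ok:
        return "utf8"
    return "latin1"
```

Note that, per the note earlier in this patch, atoms that cannot be encoded in Latin1 are not yet supported, so the second branch of the condition describes the wire format rather than behavior a sender should rely on.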

+

The maximum number of allowed characters in an atom is 255. In the + UTF-8 case each character may need up to 4 bytes to be encoded. +
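
To make the size consequence concrete, here is a small Python sketch. The helper name `utf8_length_field_size` and the constants are hypothetical; the sketch only works out whether the UTF-8 byte length of an atom name fits in a 1 byte length field or needs the 2 byte one.

```python
MAX_ATOM_CHARS = 255         # character limit stated in the text
MAX_UTF8_BYTES_PER_CHAR = 4  # worst case stated in the text

# Worst-case encoded size of a maximum-length atom.
worst_case = MAX_ATOM_CHARS * MAX_UTF8_BYTES_PER_CHAR


def utf8_length_field_size(name: str) -> int:
    """Return 1 if the UTF-8 byte length fits in one byte, else 2."""
    if len(name) > MAX_ATOM_CHARS:
        raise ValueError("atom exceeds 255 characters")
    return 1 if len(name.encode("utf-8")) <= 255 else 2
```

A 255 character atom made only of ASCII characters still fits the 1 byte length, while the same number of non-ASCII Latin1 characters (2 bytes each in UTF-8) does not.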

+
-
- + +
Distribution header

As of erts version 5.7.2 the old atom cache protocol was @@ -219,8 +248,7 @@

The least significant bit in that half byte is the LongAtoms flag. If it is set, 2 bytes are used for atom lengths instead of - 1 byte in the distribution header. However, the current emulator - cannot handle long atoms, so it will currently always be 0. + 1 byte in the distribution header.

After the Flags field follow the AtomCacheRefs. The @@ -247,15 +275,25 @@

InternalSegmentIndex together with the SegmentIndex completely identify the location of an atom cache entry in the - atom cache. Length is number of one byte characters that - the atom text consists of. Length is a two byte big endian integer + atom cache. Length is the number of bytes that AtomText + consists of. Length is a two byte big endian integer if the LongAtoms flag has been set, otherwise a one byte - integer. Subsequent CachedAtomRefs with the same + integer. When the + DFLAG_UTF8_ATOMS + distribution flag has been exchanged between both nodes in the + distribution handshake, + characters in AtomText are encoded in UTF-8; otherwise, + they are encoded in Latin1. Subsequent CachedAtomRefs with the same SegmentIndex and InternalSegmentIndex as this NewAtomCacheRef will refer to this atom until a new NewAtomCacheRef with the same SegmentIndex and InternalSegmentIndex appears.
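
The Length and AtomText rules above can be sketched in Python as follows. The function `read_atom_text` and its boolean parameters are hypothetical stand-ins for the LongAtoms flag and the DFLAG_UTF8_ATOMS handshake result; the sketch covers only the Length and AtomText fields, not the rest of a NewAtomCacheRef.

```python
def read_atom_text(buf: bytes, offset: int, long_atoms: bool, utf8_atoms: bool):
    """Read Length and AtomText at offset; return (text, next_offset)."""
    # Length is 2 bytes big-endian with LongAtoms set, otherwise 1 byte.
    width = 2 if long_atoms else 1
    if offset + width > len(buf):
        raise ValueError("truncated Length field")
    length = int.from_bytes(buf[offset:offset + width], "big")

    start = offset + width
    end = start + length
    if end > len(buf):
        raise ValueError("truncated AtomText")

    # Length counts bytes, so the slice is taken before decoding.
    encoding = "utf-8" if utf8_atoms else "latin-1"
    return buf[start:end].decode(encoding), end
```

Because Length counts bytes rather than characters, the same atom text can have different Length values depending on which encoding was negotiated.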

+

+ For more information on encoding of atoms, see + note on UTF-8 encoded atoms + in the beginning of this document. +

If the NewCacheEntryFlag for the next AtomCacheRef has not been set, a CachedAtomRef on the following format @@ -383,9 +421,9 @@

An atom is stored with a 2 byte unsigned length in big-endian order, - followed by Len numbers of 8 bit characters that forms the - AtomName. - Note: The maximum allowed value for Len is 255. + followed by Len 8-bit Latin1 characters that form + the AtomName. + Note: The maximum allowed value for Len is 255.

@@ -754,12 +792,14 @@

An atom is stored with a 1 byte unsigned length, - followed by Len numbers of 8 bit characters that + followed by Len 8-bit Latin1 characters that form the AtomName. Longer atoms can be represented by ATOM_EXT. Note the SMALL_ATOM_EXT was introduced in erts version 5.7.2 and - require a small atom distribution flag exchanged in the distribution - handshake. + requires an exchange of the + DFLAG_SMALL_ATOM_TAGS + distribution flag in the + distribution handshake.
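
As an illustration of the two Latin1 atom encodings, here is a minimal Python sketch. The function name and the `small_atom_tags` parameter (standing for "DFLAG_SMALL_ATOM_TAGS was exchanged") are hypothetical. The tag values 100 for ATOM_EXT and 115 for SMALL_ATOM_EXT are not shown in this patch excerpt and are taken from the external format as commonly documented.

```python
import struct

ATOM_EXT = 100        # assumed tag value, not visible in this excerpt
SMALL_ATOM_EXT = 115  # assumed tag value, not visible in this excerpt


def encode_latin1_atom(name: str, small_atom_tags: bool) -> bytes:
    """Encode an atom name as SMALL_ATOM_EXT if allowed, else ATOM_EXT."""
    data = name.encode("latin-1")  # raises if not Latin1-encodable
    if len(data) > 255:
        raise ValueError("the maximum allowed value for Len is 255")
    if small_atom_tags:
        # 1 byte tag, 1 byte length, then the Latin1 bytes.
        return struct.pack(">BB", SMALL_ATOM_EXT, len(data)) + data
    # 1 byte tag, 2 byte big-endian length, then the Latin1 bytes.
    return struct.pack(">BH", ATOM_EXT, len(data)) + data
```

For example, `encode_latin1_atom("ok", True)` produces 4 bytes, while `encode_latin1_atom("ok", False)` produces 5.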

@@ -1007,7 +1047,62 @@ This term is used in minor version 1 of the external format.

+
+
+ ATOM_UTF8_EXT
+
+   1     2     Len
+   118   Len   AtomName
+
+

+ An atom is stored with a 2 byte unsigned length in big-endian order, + followed by Len bytes containing the AtomName encoded + in UTF-8. +
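
A minimal Python sketch of this layout, for illustration only. The function name `encode_atom_utf8_ext` is hypothetical, and the 255 character check reflects the limit stated in the note at the beginning of the document rather than anything specific to this tag.

```python
import struct

ATOM_UTF8_EXT = 118  # tag value from the table above


def encode_atom_utf8_ext(name: str) -> bytes:
    """Encode an atom name as tag, 2 byte big-endian Len, UTF-8 bytes."""
    if len(name) > 255:
        raise ValueError("atom exceeds 255 characters")
    data = name.encode("utf-8")
    # Len counts bytes, not characters.
    return struct.pack(">BH", ATOM_UTF8_EXT, len(data)) + data
```

For the atom name `ok` this yields the bytes 118, 0, 2, followed by the two ASCII bytes of the name.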

+

+ For more information on encoding of atoms, see + note on UTF-8 encoded atoms + in the beginning of this document. +

+
+
+
+ SMALL_ATOM_UTF8_EXT
+
+   1     1     Len
+   119   Len   AtomName
+
+

+ An atom is stored with a 1 byte unsigned length, + followed by Len bytes containing the AtomName encoded + in UTF-8. Longer atoms encoded in UTF-8 can be represented using + ATOM_UTF8_EXT. +
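
A decoder for both UTF-8 atom tags could look like the following Python sketch. The function name `decode_utf8_atom` is hypothetical and the error handling is one possible choice, not a description of how any emulator behaves.

```python
from typing import Tuple

ATOM_UTF8_EXT = 118        # 2 byte big-endian Len
SMALL_ATOM_UTF8_EXT = 119  # 1 byte Len


def decode_utf8_atom(buf: bytes) -> Tuple[str, int]:
    """Decode a UTF-8 atom at the start of buf; return (name, bytes_consumed)."""
    if not buf:
        raise ValueError("empty buffer")
    tag = buf[0]
    if tag == ATOM_UTF8_EXT:
        width = 2
    elif tag == SMALL_ATOM_UTF8_EXT:
        width = 1
    else:
        raise ValueError("not a UTF-8 atom tag: %d" % tag)

    if len(buf) < 1 + width:
        raise ValueError("truncated Len field")
    length = int.from_bytes(buf[1:1 + width], "big")

    end = 1 + width + length
    if end > len(buf):
        raise ValueError("truncated AtomName")
    return buf[1 + width:end].decode("utf-8"), end
```

The second element of the result lets a caller continue parsing at the byte that follows the atom.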

+

+ For more information on encoding of atoms, see + note on UTF-8 encoded atoms + in the beginning of this document. +

+