From b15688d40d5147c1122aaad3b82495fbbc4dede8 Mon Sep 17 00:00:00 2001
From: Rickard Green
Date: Sat, 19 Jan 2013 00:45:16 +0100
Subject: UTF-8 atom documentation
---
erts/doc/src/erl_ext_dist.xml | 121 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 108 insertions(+), 13 deletions(-)
diff --git a/erts/doc/src/erl_ext_dist.xml b/erts/doc/src/erl_ext_dist.xml
index fd2da2cfe3..28afea8b29 100644
--- a/erts/doc/src/erl_ext_dist.xml
+++ b/erts/doc/src/erl_ext_dist.xml
@@ -119,10 +119,39 @@
Data |
+
+
+ As of ERTS version 5.10 (OTP-R16) support
+ for UTF-8 encoded atoms has been introduced in the external format.
+ However, only characters that can be encoded using Latin1 (ISO-8859-1)
+ are currently supported in atoms. The support for UTF-8 encoded atoms
+ in the external format has been implemented in order to be able to support
+ all Unicode characters in atoms in some future release. Full
+ support for Unicode atoms will not happen before OTP-R18, and might
+ be introduced even later than that. Until full Unicode support for
+ atoms has been introduced, it is an error to pass atoms containing
+ characters that cannot be encoded in Latin1, and the behavior is
+ undefined.
+ When the
+ DFLAG_UTF8_ATOMS
+ distribution flag has been exchanged between both nodes in the
+ distribution handshake,
+ all atoms in the distribution header will be encoded in UTF-8; otherwise,
+ all atoms in the distribution header will be encoded in Latin1. The two
+ new tags ATOM_UTF8_EXT and
+ SMALL_ATOM_UTF8_EXT
+ will only be used if the DFLAG_UTF8_ATOMS distribution flag has
+ been exchanged between nodes, or if an atom containing characters
+ that cannot be encoded in Latin1 is encountered.
+
+ The maximum number of allowed characters in an atom is 255. In the
+ UTF-8 case each character may need up to 4 bytes to be encoded.
+
+
-
-
+
+
Distribution header
As of erts version 5.7.2 the old atom cache protocol was
@@ -219,8 +248,7 @@
The least significant bit in that half byte is the LongAtoms
flag. If it is set, 2 bytes are used for atom lengths instead of
- 1 byte in the distribution header. However, the current emulator
- cannot handle long atoms, so it will currently always be 0.
+ 1 byte in the distribution header.
After the Flags field follow the AtomCacheRefs. The
@@ -247,15 +275,25 @@
InternalSegmentIndex together with the SegmentIndex
completely identify the location of an atom cache entry in the
- atom cache. Length is number of one byte characters that
- the atom text consists of. Length is a two byte big endian integer
+ atom cache. Length is the number of bytes that AtomText
+ consists of. Length is a two byte big endian integer
if the LongAtoms flag has been set, otherwise a one byte
- integer. Subsequent CachedAtomRefs with the same
+ integer. When the
+ DFLAG_UTF8_ATOMS
+ distribution flag has been exchanged between both nodes in the
+ distribution handshake,
+ the characters in AtomText are encoded in UTF-8; otherwise,
+ they are encoded in Latin1. Subsequent CachedAtomRefs with the same
SegmentIndex and InternalSegmentIndex as this
NewAtomCacheRef will refer to this atom until a new
NewAtomCacheRef with the same SegmentIndex
and InternalSegmentIndex appear.
+
+ For more information on the encoding of atoms, see the
+ note on UTF-8 encoded atoms
+ at the beginning of this document.
+
If the NewCacheEntryFlag for the next AtomCacheRef
has not been set, a CachedAtomRef on the following format
@@ -383,9 +421,9 @@
An atom is stored with a 2 byte unsigned length in big-endian order,
- followed by Len numbers of 8 bit characters that forms the
- AtomName.
- Note: The maximum allowed value for Len is 255.
+ followed by Len 8-bit Latin1 characters that form
+ the AtomName.
+ Note: The maximum allowed value for Len is 255.
@@ -754,12 +792,14 @@
An atom is stored with a 1 byte unsigned length,
- followed by Len numbers of 8 bit characters that
+ followed by Len 8-bit Latin1 characters that
forms the AtomName. Longer atoms can be represented
by ATOM_EXT. Note
the SMALL_ATOM_EXT was introduced in erts version 5.7.2 and
- require a small atom distribution flag exchanged in the distribution
- handshake.
+ requires an exchange of the
+ DFLAG_SMALL_ATOM_TAGS
+ distribution flag in the
+ distribution handshake.
@@ -1007,7 +1047,62 @@
This term is used in minor version 1 of the external format.
+
+
+ ATOM_UTF8_EXT
+
+
+
+ 1 |
+ 2 |
+ Len |
+
+
+ 118 |
+ Len |
+ AtomName |
+
+
+
+ An atom is stored with a 2 byte unsigned length in big-endian order,
+ followed by Len bytes containing the AtomName encoded
+ in UTF-8.
+
+
+ For more information on the encoding of atoms, see the
+ note on UTF-8 encoded atoms
+ at the beginning of this document.
+
+
+
+
+ SMALL_ATOM_UTF8_EXT
+
+
+
+ 1 |
+ 1 |
+ Len |
+
+
+ 119 |
+ Len |
+ AtomName |
+
+
+
+ An atom is stored with a 1 byte unsigned length,
+ followed by Len bytes containing the AtomName encoded
+ in UTF-8. Longer atoms encoded in UTF-8 can be represented using
+ ATOM_UTF8_EXT.
+
+
+ For more information on the encoding of atoms, see the
+ note on UTF-8 encoded atoms
+ at the beginning of this document.
+
+
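Illustration only, not part of the patch: the ATOM_UTF8_EXT (118) and SMALL_ATOM_UTF8_EXT (119) layouts documented above can be sketched in Python as follows. The function names are hypothetical. Choosing the small tag whenever the encoded name fits in one length byte is an assumption of this sketch, not something the patch requires. The sketch also does not enforce the current Latin1-only restriction described in the note.

```python
import struct

# Tag values taken from the tables in this patch.
ATOM_UTF8_EXT = 118
SMALL_ATOM_UTF8_EXT = 119
MAX_ATOM_CHARS = 255  # limit is in characters, not bytes


def encode_atom_utf8(name: str) -> bytes:
    """Encode an atom name using one of the two UTF-8 atom tags."""
    if len(name) > MAX_ATOM_CHARS:
        raise ValueError("atom names are limited to 255 characters")
    data = name.encode("utf-8")
    if len(data) <= 255:
        # Assumption: prefer the 1-byte length form when it fits.
        return bytes([SMALL_ATOM_UTF8_EXT, len(data)]) + data
    # 2-byte big-endian length, then Len bytes of UTF-8 text.
    return struct.pack(">BH", ATOM_UTF8_EXT, len(data)) + data


def decode_atom_utf8(buf: bytes, offset: int = 0):
    """Return (atom_name, next_offset) for a UTF-8 atom term at offset."""
    tag = buf[offset]
    if tag == SMALL_ATOM_UTF8_EXT:
        length = buf[offset + 1]
        start = offset + 2
    elif tag == ATOM_UTF8_EXT:
        (length,) = struct.unpack_from(">H", buf, offset + 1)
        start = offset + 3
    else:
        raise ValueError("not a UTF-8 atom tag: %d" % tag)
    end = start + length
    if end > len(buf):
        raise ValueError("truncated AtomName")
    return buf[start:end].decode("utf-8"), end
```

Because Len counts bytes, a 200-character name made of 2-byte characters needs 400 bytes and therefore falls through to ATOM_UTF8_EXT in this sketch.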
--
cgit v1.2.3
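
Illustration only, not part of the patch: a sketch of how a reader of the distribution header might interpret the Length and AtomText fields of a NewAtomCacheRef as documented in the hunks above. The function name and the two boolean parameters are hypothetical. They stand for the LongAtoms flag and for whether DFLAG_UTF8_ATOMS was exchanged in the handshake. The worst-case figure follows from the note in this patch: 255 characters at up to 4 bytes each is 1020 bytes, which does not fit in a 1-byte length.

```python
MAX_ATOM_CHARS = 255
MAX_UTF8_BYTES_PER_CHAR = 4
# Worst-case AtomText size in bytes under the limits stated in the note.
MAX_ATOM_TEXT_BYTES = MAX_ATOM_CHARS * MAX_UTF8_BYTES_PER_CHAR  # 1020


def read_atom_text(buf: bytes, offset: int, long_atoms: bool, utf8_atoms: bool):
    """Return (atom_text, next_offset) for a Length + AtomText pair.

    long_atoms: the LongAtoms flag (2-byte big-endian Length if set).
    utf8_atoms: True if DFLAG_UTF8_ATOMS was exchanged, else Latin1.
    """
    size = 2 if long_atoms else 1
    if offset + size > len(buf):
        raise ValueError("truncated Length field")
    length = int.from_bytes(buf[offset:offset + size], "big")
    start = offset + size
    end = start + length
    if end > len(buf):
        raise ValueError("truncated AtomText")
    # Length counts bytes, so the character count may be smaller in UTF-8.
    text = buf[start:end].decode("utf-8" if utf8_atoms else "latin-1")
    return text, end
```

The same two bytes decode differently depending on the negotiated flag, which is why the reader has to know the handshake result before interpreting AtomText.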