UTF-8 atom documentation

author: Rickard Green <[email protected]> 2013-01-19 00:45:16 +0100
committer: Rickard Green <[email protected]> 2013-01-19 00:45:16 +0100
commit: b15688d40d5147c1122aaad3b82495fbbc4dede8 (patch)
tree: 8af86d278082f202a43902a8c88e6998f7e976e3 /erts/doc
parent: a912b3c6f4759a6a8e60fc4ea559c19edb02448c (diff)
3 files changed, 397 insertions, 22 deletions
diff --git a/erts/doc/src/erl_dist_protocol.xml b/erts/doc/src/erl_dist_protocol.xml
index 6c725fc82d..0252187be5 100644
--- a/erts/doc/src/erl_dist_protocol.xml
+++ b/erts/doc/src/erl_dist_protocol.xml
@@ -547,13 +547,289 @@ If Result > 0, the packet only consists of [119, Result].
 -->
 
     </section>
-
+    <marker id="distribution_handshake"/>
       <section>
-	<title>Handshake</title>
-	<p>
-	  The handshake is discussed in detail in the internal documentation for 
-	  the kernel (Erlang) application.
-	</p>
+	<title>Distribution Handshake</title>
+	  <p>
+	    This section describes the distribution handshake protocol
+	    introduced in the OTP-R6 release of Erlang/OTP. This
+	    description was previously located in
+	    <c>$ERL_TOP/lib/kernel/internal_doc/distribution_handshake.txt</c>,
+	    and has been copied here with only minor reformatting. The text has
+	    remained largely unchanged since 1999, but the handshake itself
+	    should not have changed much since then either.
+	  </p>
+	<section>
+	  <title>General</title>
+	  <p>
+	    The TCP/IP distribution uses a handshake which expects a
+	    connection-based protocol, i.e. the protocol does not include
+	    any authentication after the handshake procedure.
+	  </p>
+	  <p>
+	    This is not entirely safe, as it is vulnerable to takeover
+	    attacks, but it is a trade-off between reasonable safety and performance.
+	  </p>
+	  <p>
+	    The cookies are never sent in cleartext, and the handshake procedure
+	    expects the client (called A) to be the first one to prove that it can
+	    generate a valid digest. The digest is generated with the
+	    MD5 message digest algorithm, and the challenges are expected to be
+	    unpredictable random numbers.
+	  </p>
+	</section>
+	<section>
+	  <title>Definitions</title>
+	  <p>
+	    A challenge is a 32 bit integer number in big endian order. Below, the function
+	    <c>gen_challenge()</c> returns a random 32 bit integer used as a challenge.
+	  </p>
+	  <p>
+	    A digest is a 16-byte MD5 hash of the cookie (as text) concatenated
+	    with the challenge (as decimal text). Below, the function <c>gen_digest(Challenge, Cookie)</c>
+	    generates a digest as described above.
+	  </p>
+	  <p>
+	    An out_cookie is the cookie used in outgoing communication to a certain node,
+	    so that A's out_cookie for B should correspond with B's in_cookie for A and 
+	    the other way around. A's out_cookie for B and A's in_cookie for B need <em>NOT</em>
+	    be the same. Below the function <c>out_cookie(Node)</c> returns the current
+	    node's out_cookie for <c>Node</c>.
+	  </p>
+	  <p>
+	    An in_cookie is the cookie expected to be used by another node when 
+	    communicating with us, so that A's in_cookie for B corresponds with B's 
+	    out_cookie for A. Below the function <c>in_cookie(Node)</c> returns the current
+	    node's <c>in_cookie</c> for <c>Node</c>.
+	  </p>
+	  <p>
+	    The cookies are text strings that can be viewed as passwords.
+	  </p>
+	  <p>
+	    Every message in the handshake starts with a 16 bit big endian integer 
+	    which contains the length of the message (not counting the two initial bytes).
+	    In Erlang this corresponds to the <c>gen_tcp</c> option <c>{packet, 2}</c>. Note that after
+	    the handshake, the distribution switches to 4 byte packet headers.
+	  </p>
+
+	</section>
+	<section>
+	  <title>The Handshake in Detail</title>
+	  <p>
+	    Imagine two nodes, node A, which initiates the handshake and node B, which
+	    accepts the connection.
+	  </p>
+	  <taglist>
+	    <tag>1) connect/accept</tag>
+	    <item><p>A connects to B via TCP/IP and B accepts the connection.</p></item>
+	    <tag>2) send_name/receive_name</tag>
+	    <item><p>A sends an initial identification to B. B receives the message.
+	    The message looks like this (every "square" being one byte and the packet
+	    header removed):
+	  </p>
+<pre>
++---+--------+--------+-----+-----+-----+-----+-----+-----+-...-+-----+
+|'n'|Version0|Version1|Flag0|Flag1|Flag2|Flag3|Name0|Name1| ... |NameN|
++---+--------+--------+-----+-----+-----+-----+-----+-----+-... +-----+
+</pre>
+	  <p>
+	    The 'n' is just a message tag.
+	    Version0 and Version1 are the distribution version selected by node A,
+	    based on information from EPMD (16 bit big endian).
+	    Flag0 ... Flag3 are capability flags (32 bit big endian); the
+	    capabilities are defined in
+	    <c>$ERL_TOP/lib/kernel/include/dist.hrl</c>.
+	    Name0 ... NameN is the full nodename of A, as a string of bytes (the
+	    packet length denotes how long it is).
+	  </p></item>
+	  <tag>3) recv_status/send_status</tag>
+	  <item><p>B sends a status message to A, which indicates
+	    if the connection is allowed. The following status codes are defined:</p>
+	    <taglist>
+	      <tag><c>ok</c></tag>
+	      <item>The handshake will continue.</item>
+	      <tag><c>ok_simultaneous</c></tag>
+	      <item>The handshake will continue, but A is informed that B
+              has another ongoing connection attempt that will be
+	      shut down (simultaneous connect where A's name is 
+	      greater than B's name, compared literally).</item>
+	      <tag><c>nok</c></tag>
+	      <item>The handshake will not continue, as B already has an ongoing handshake
+	      which it itself has initiated (simultaneous connect where B's name is
+	      greater than A's).</item>
+	      <tag><c>not_allowed</c></tag>
+	      <item>The connection is disallowed for some (unspecified) security 
+              reason.</item>
+	      <tag><c>alive</c></tag>
+	      <item>A connection to the node is already active, which either means
+	      that node A is confused or that the TCP connection breakdown
+	      of a previous node with this name has not yet reached node B.
+	      See 3B below.</item>
+	    </taglist>
+	    <p>This is the format of the status message:</p>
+<pre>
++---+-------+-------+-...-+-------+
+|'s'|Status0|Status1| ... |StatusN|
++---+-------+-------+-...-+-------+
+</pre>
+	    <p>
+	      's' is the message tag, and Status0 ... StatusN is the status as a string (not terminated).
+	    </p>
+	  </item>
+	  <tag>3B) send_status/recv_status</tag>
+	  <item><p>If the status was <c>alive</c>, node A will answer with
+	  another status message containing either <c>true</c>, which means that the
+	  connection should continue (the old connection from this node is broken), or
+	  <c>false</c>, which simply means that the connection should be closed because
+	  the connection attempt was a mistake.</p></item>
+	  <tag>4) recv_challenge/send_challenge</tag>
+	  <item><p>If the status was <c>ok</c> or <c>ok_simultaneous</c>,
+	  the handshake continues with B sending A another message, the challenge.
+	  The challenge contains the same type of information as the "name" message 
+	  initially sent from A to B, with the addition of a 32 bit challenge:</p>
+<pre>
++---+--------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-...-+-----+
+|'n'|Version0|Version1|Flag0|Flag1|Flag2|Flag3|Chal0|Chal1|Chal2|Chal3|Name0|Name1| ... |NameN|
++---+--------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-... +-----+
+</pre>
+          <p>
+	    Where Chal0 ... Chal3 is the challenge as a 32 bit big endian integer
+	    and the other fields are B's version, flags and full nodename.
+	  </p></item>
+	  <tag>5) send_challenge_reply/recv_challenge_reply</tag>
+	  <item><p>A now generates a digest and its own challenge. These are
+	  sent together in one packet to B:</p>
+<pre>
++---+-----+-----+-----+-----+-----+-----+-----+-----+-...-+------+
+|'r'|Chal0|Chal1|Chal2|Chal3|Dige0|Dige1|Dige2|Dige3| ... |Dige15|
++---+-----+-----+-----+-----+-----+-----+-----+-----+-...-+------+
+</pre>
+          <p>
+	    Where 'r' is the tag, Chal0 ... Chal3 is A's challenge for B to handle and
+	    Dige0 ... Dige15 is the digest that A constructed from the challenge B sent
+	    in the previous step.
+	  </p></item>
+	  <tag>6) recv_challenge_ack/send_challenge_ack</tag>
+	  <item><p>B checks that the digest received from A is correct and generates a
+	  digest from the challenge received from A. The digest is then sent to A. The
+	  message looks like this:</p>
+<pre>
++---+-----+-----+-----+-----+-...-+------+
+|'a'|Dige0|Dige1|Dige2|Dige3| ... |Dige15|
++---+-----+-----+-----+-----+-...-+------+
+</pre>
+          <p>
+	    Where 'a' is the tag and Dige0 ... Dige15 is the digest calculated by B
+	    for A's challenge.</p></item>
+	  <tag>7)</tag>
+	  <item><p>A checks the digest from B and the connection is up.</p></item>
+	</taglist>
+	</section>
+	<section>
+	  <title>Semigraphic View</title>
+<pre>
+A (initiator)						B (acceptor)
+
+TCP connect -----------------------------------------&gt;	
+							TCP accept
+
+send_name   -----------------------------------------&gt;
+							recv_name
+
+	    &lt;----------------------------------------	send_status
+recv_status
+(if status was 'alive'
+ send_status - - - - - - - - - - - - - - - - - - - -&gt;
+							recv_status)
+							ChB = gen_challenge()
+		          (ChB)
+	    &lt;----------------------------------------	send_challenge
+recv_challenge
+
+ChA = gen_challenge(),
+OCA = out_cookie(B),
+DiA = gen_digest(ChB,OCA)
+			  (ChA, DiA)
+send_challenge_reply --------------------------------&gt;
+							recv_challenge_reply
+							ICB = in_cookie(A),
+							check:
+							DiA == gen_digest
+								(ChB, ICB) ?
+							- if OK:
+	    						 OCB = out_cookie(A),
+							 DiB = gen_digest
+			(DiB)					(ChA, OCB)
+	    &lt;-----------------------------------------	 send_challenge_ack
+recv_challenge_ack					 DONE
+ICA = in_cookie(B),                                     - else
+check:                                                   CLOSE
+DiB == gen_digest(ChA,ICA) ?
+- if OK
+ DONE
+- else
+ CLOSE
+</pre>
+	</section>
+	<marker id="dflags"/>
+	<section>
+	  <title>The Currently Defined Distribution Flags</title>
+	  <p>
+	    Currently (OTP-R16) the following capability flags are defined:
+	  </p>
+<pre>
+%% The node should be published and part of the global namespace
+-define(DFLAG_PUBLISHED,1).
+
+%% The node implements an atom cache (obsolete)
+-define(DFLAG_ATOM_CACHE,2).
+
+%% The node implements extended (3 * 32 bits) references. This is
+%% required today. If not present connection will be refused.
+-define(DFLAG_EXTENDED_REFERENCES,4).
+
+%% The node implements distributed process monitoring.
+-define(DFLAG_DIST_MONITOR,8).
+
+%% The node uses separate tag for fun's (lambdas) in the distribution protocol.
+-define(DFLAG_FUN_TAGS,16#10).
+
+%% The node implements distributed named process monitoring.
+-define(DFLAG_DIST_MONITOR_NAME,16#20).
+
+%% The (hidden) node implements atom cache (obsolete)
+-define(DFLAG_HIDDEN_ATOM_CACHE,16#40).
+
+%% The node understands new fun-tags
+-define(DFLAG_NEW_FUN_TAGS,16#80).
+
+%% The node is capable of handling extended pids and ports. This is
+%% required today. If not present connection will be refused.
+-define(DFLAG_EXTENDED_PIDS_PORTS,16#100).
+
+%%
+-define(DFLAG_EXPORT_PTR_TAG,16#200).
+
+%%
+-define(DFLAG_BIT_BINARIES,16#400).
+
+%% The node understands new float format
+-define(DFLAG_NEW_FLOATS,16#800).
+
+%%
+-define(DFLAG_UNICODE_IO,16#1000).
+
+%% The node implements atom cache in distribution header.
+-define(DFLAG_DIST_HDR_ATOM_CACHE,16#2000).
+
+%% The node understands the SMALL_ATOM_EXT tag
+-define(DFLAG_SMALL_ATOM_TAGS, 16#4000).
+
+%% The node understands UTF-8 encoded atoms
+-define(DFLAG_UTF8_ATOMS, 16#10000).
+
+</pre>
+	</section>
       </section>
 
       <section>
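The handshake messages described in the hunk above can be sketched in a few lines of Python. This is a minimal illustration, not the OTP implementation. The function names `frame`, `name_message` and `challenge_reply` are hypothetical, and the digest input order (cookie text first, then the challenge as decimal text) is an assumption based on how the Erlang/OTP sources are understood to compute it.

```python
import hashlib
import struct


def gen_digest(challenge: int, cookie: str) -> bytes:
    """Return the 16-byte MD5 digest for a 32 bit challenge and a cookie.

    Assumption: the cookie text comes first, followed by the challenge
    written as decimal text.
    """
    text = cookie.encode("latin-1") + str(challenge).encode("ascii")
    return hashlib.md5(text).digest()


def frame(payload: bytes) -> bytes:
    """Prefix a handshake message with its 16 bit big endian length.

    The two length bytes are not counted in the length itself.
    """
    return struct.pack(">H", len(payload)) + payload


def name_message(version: int, flags: int, node_name: str) -> bytes:
    """Build the 'n' message: tag, 16 bit version, 32 bit flags, node name."""
    body = b"n" + struct.pack(">HI", version, flags) + node_name.encode("latin-1")
    return frame(body)


def challenge_reply(own_challenge: int, peer_challenge: int, cookie: str) -> bytes:
    """Build the 'r' message: tag, A's 32 bit challenge, 16-byte digest.

    The digest answers the challenge previously received from the peer.
    """
    digest = gen_digest(peer_challenge, cookie)
    return frame(b"r" + struct.pack(">I", own_challenge) + digest)
```

With these helpers, `challenge_reply(ChA, ChB, Cookie)` yields a 23-byte packet: 2 length bytes, the tag, 4 challenge bytes and 16 digest bytes.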
diff --git a/erts/doc/src/erl_ext_dist.xml b/erts/doc/src/erl_ext_dist.xml
index fd2da2cfe3..28afea8b29 100644
--- a/erts/doc/src/erl_ext_dist.xml
+++ b/erts/doc/src/erl_ext_dist.xml
@@ -119,10 +119,39 @@
 	<cell align="center">Data</cell>
       </row>
     <tcaption></tcaption></table>
+    <marker id="utf8_atoms"/>
+    <note>
+    <p>As of ERTS version 5.10 (OTP-R16) support
+    for UTF-8 encoded atoms has been introduced in the external format.
+    However, only characters that can be encoded using Latin1 (ISO-8859-1)
+    are currently supported in atoms. The support for UTF-8 encoded atoms
+    in the external format has been implemented in order to be able to support
+    all Unicode characters in atoms in <em>some future release</em>. Full
+    support for Unicode atoms will not happen before OTP-R18, and might
+    be introduced even later than that. Until full Unicode support for
+    atoms has been introduced, it is an <em>error</em> to pass atoms containing
+    characters that cannot be encoded in Latin1, and <em>the behavior is
+    undefined</em>.</p>
+    <p>When the
+    <seealso marker="erl_dist_protocol#dflags"><c>DFLAG_UTF8_ATOMS</c></seealso>
+    distribution flag has been exchanged between both nodes in the
+    <seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>,
+    all atoms in the distribution header will be encoded in UTF-8; otherwise,
+    all atoms in the distribution header will be encoded in Latin1. The two
+    new tags <seealso marker="#ATOM_UTF8_EXT">ATOM_UTF8_EXT</seealso>, and
+    <seealso marker="#SMALL_ATOM_UTF8_EXT">SMALL_ATOM_UTF8_EXT</seealso>
+    will only be used if the <c>DFLAG_UTF8_ATOMS</c> distribution flag has
+    been exchanged between nodes, or if an atom containing characters
+    that cannot be encoded in Latin1 is encountered.
+    </p>
+    <p>The maximum number of allowed characters in an atom is 255. In the
+    UTF-8 case each character may need up to 4 bytes to be encoded.
+    </p>
+    </note>
   </section>
 
-  <section>
-    <marker id="distribution_header"/>
+  <marker id="distribution_header"/>
+  <section>  
     <title>Distribution header</title>
     <p>
       As of erts version 5.7.2 the old atom cache protocol was
@@ -219,8 +248,7 @@
     <p>
       The least significant bit in that half byte is the <c>LongAtoms</c>
       flag. If it is set, 2 bytes are used for atom lengths instead of
-      1 byte in the distribution header. However, the current emulator
-      cannot handle long atoms, so it will currently always be 0.
+      1 byte in the distribution header.
     </p>
     <p>
       After the <c>Flags</c> field follow the <c>AtomCacheRefs</c>. The
@@ -247,16 +275,26 @@
     <p>
       <c>InternalSegmentIndex</c> together with the <c>SegmentIndex</c>
       completely identify the location of an atom cache entry in the
-      atom cache. <c>Length</c> is number of one byte characters that
-      the atom text consists of. Length is a two byte big endian integer
+      atom cache. <c>Length</c> is the number of bytes that <c>AtomText</c>
+      consists of. Length is a two byte big endian integer
       if the <c>LongAtoms</c> flag has been set, otherwise a one byte
-      integer. Subsequent <c>CachedAtomRef</c>s with the same
+      integer. When the
+      <seealso marker="erl_dist_protocol#dflags"><c>DFLAG_UTF8_ATOMS</c></seealso>
+      distribution flag has been exchanged between both nodes in the
+      <seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>,
+      the characters in <c>AtomText</c> are encoded in UTF-8; otherwise,
+      they are encoded in Latin1. Subsequent <c>CachedAtomRef</c>s with the same
       <c>SegmentIndex</c> and <c>InternalSegmentIndex</c> as this
       <c>NewAtomCacheRef</c> will refer to this atom until a new
       <c>NewAtomCacheRef</c> with the same <c>SegmentIndex</c>
       and <c>InternalSegmentIndex</c> appear.
     </p>
     <p>
+      For more information on encoding of atoms, see
+      <seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+      in the beginning of this document.
+    </p>
+    <p>
       If the <c>NewCacheEntryFlag</c> for the next <c>AtomCacheRef</c>
       has not been set, a <c>CachedAtomRef</c> on the following format
       will follow:
@@ -383,9 +421,9 @@
 	<tcaption></tcaption></table>
       <p>
 	An atom is stored with a 2 byte unsigned length in big-endian order, 
-	followed by <c>Len</c> numbers of 8 bit characters that forms the 
-	<c>AtomName</c>.
-	Note: The maximum allowed value for <c>Len</c> is 255.
+	followed by <c>Len</c> 8 bit Latin1 characters that form
+	the <c>AtomName</c>.
+	<em>Note</em>: The maximum allowed value for <c>Len</c> is 255.
       </p>
     </section>
 
@@ -754,12 +792,14 @@
 	<tcaption></tcaption></table>
       <p>
 	An atom is stored with a 1 byte unsigned length, 
-	followed by <c>Len</c> numbers of 8 bit characters that
+	followed by a sequence of <c>Len</c> 8 bit Latin1 characters that
 	forms the <c>AtomName</c>. Longer atoms can be represented
 	by <seealso marker="#ATOM_EXT">ATOM_EXT</seealso>. <em>Note</em>
 	the <c>SMALL_ATOM_EXT</c> was introduced in erts version 5.7.2 and
-	require a small atom distribution flag exchanged in the distribution
-	handshake.
+	requires an exchange of the
+	<seealso marker="erl_dist_protocol#dflags"><c>DFLAG_SMALL_ATOM_TAGS</c></seealso>
+	distribution flag in the
+	<seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>.
       </p>
     </section>
 
@@ -1007,7 +1047,62 @@
 	  This term is used in minor version 1 of the external format.
 	</p>
     </section>
+    <section>
+      <marker id="ATOM_UTF8_EXT"/>
+      <title>ATOM_UTF8_EXT</title>
+
+	<table align="left">
+	  <row>
+	    <cell align="center">1</cell>
+	    <cell align="center">2</cell>
+	    <cell align="center">Len</cell>
+	  </row>
+	  <row>
+	    <cell align="center"><c>118</c></cell>
+	    <cell align="center"><c>Len</c></cell>
+	    <cell align="center"><c>AtomName</c></cell>
+	  </row>
+	<tcaption></tcaption></table>
+      <p>
+	An atom is stored with a 2 byte unsigned length in big-endian order,
+	followed by <c>Len</c> bytes containing the <c>AtomName</c> encoded
+	in UTF-8.
+      </p>
+      <p>
+	For more information on encoding of atoms, see
+	<seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+	in the beginning of this document.
+      </p>
+    </section>
 
+    <section>
+      <marker id="SMALL_ATOM_UTF8_EXT"/>
+      <title>SMALL_ATOM_UTF8_EXT</title>
+
+	<table align="left">
+	  <row>
+	    <cell align="center">1</cell>
+	    <cell align="center">1</cell>
+	    <cell align="center">Len</cell>
+	  </row>
+	  <row>
+	    <cell align="center"><c>119</c></cell>
+	    <cell align="center"><c>Len</c></cell>
+	    <cell align="center"><c>AtomName</c></cell>
+	  </row>
+	<tcaption></tcaption></table>
+      <p>
+	An atom is stored with a 1 byte unsigned length, 
+	followed by <c>Len</c> bytes containing the <c>AtomName</c> encoded
+	in UTF-8. Longer atoms encoded in UTF-8 can be represented using
+	<seealso marker="#ATOM_UTF8_EXT">ATOM_UTF8_EXT</seealso>.
+      </p>
+      <p>
+	For more information on encoding of atoms, see
+	<seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+	in the beginning of this document.
+      </p>
+    </section>
 
   </chapter>
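The two UTF-8 atom tags added in this file can be illustrated with a short Python sketch. It is only an illustration of the layouts in the tables above (tag 118 with a 2 byte length, tag 119 with a 1 byte length), not the ERTS encoder. The function names are hypothetical, and choosing the small tag whenever the encoded text fits in one length byte is an assumption.

```python
import struct

ATOM_UTF8_EXT = 118        # 2 byte big-endian length, then UTF-8 bytes
SMALL_ATOM_UTF8_EXT = 119  # 1 byte length, then UTF-8 bytes
MAX_ATOM_CHARACTERS = 255  # limit is in characters, not bytes


def encode_atom_utf8(name: str) -> bytes:
    """Encode an atom name using one of the two UTF-8 atom tags."""
    if len(name) > MAX_ATOM_CHARACTERS:
        raise ValueError("atom longer than 255 characters")
    text = name.encode("utf-8")
    # Assumption: prefer the small tag when the byte length fits in 1 byte.
    if len(text) <= 255:
        return bytes([SMALL_ATOM_UTF8_EXT, len(text)]) + text
    return bytes([ATOM_UTF8_EXT]) + struct.pack(">H", len(text)) + text


def decode_atom_utf8(data: bytes) -> str:
    """Decode an atom encoded with ATOM_UTF8_EXT or SMALL_ATOM_UTF8_EXT."""
    tag = data[0]
    if tag == SMALL_ATOM_UTF8_EXT:
        length, start = data[1], 2
    elif tag == ATOM_UTF8_EXT:
        (length,) = struct.unpack_from(">H", data, 1)
        start = 3
    else:
        raise ValueError("not a UTF-8 atom tag: %d" % tag)
    return data[start:start + length].decode("utf-8")
```

Note that `Len` counts bytes, so an atom of 200 two-byte characters needs 400 bytes and therefore the 2 byte length of `ATOM_UTF8_EXT`.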
   
diff --git a/erts/doc/src/erlang.xml b/erts/doc/src/erlang.xml
index 5002c48ca1..0077c0096c 100644
--- a/erts/doc/src/erlang.xml
+++ b/erts/doc/src/erlang.xml
@@ -277,7 +277,9 @@
 	the binary contains Unicode characters greater than 16#FF.
 	In a future release, such Unicode characters might be allowed
 	and <c>binary_to_atom(<anno>Binary</anno>, utf8)</c>
-	will not fail in that case.</p></note>
+	will not fail in that case. For more information on Unicode support in atoms
+	see <seealso marker="erl_ext_dist#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+	in the chapter about the external term format in the ERTS User's Guide.</p></note>
 
         <pre>
 > <input>binary_to_atom(&lt;&lt;"Erlang"&gt;&gt;, latin1).</input>
@@ -1647,9 +1649,11 @@ os_prompt% </pre>
       <desc>
         <p>Returns the atom whose text representation is <c><anno>String</anno></c>.</p>
 	<p><c><anno>String</anno></c> may only contain ISO-latin-1
-	characterns (i.e. numbers below 256) as the current
+	characters (i.e. numbers below 256) as the current
 	implementation does not allow unicode characters >= 256 in
-	atoms.</p>
+	atoms. For more information on Unicode support in atoms
+	see <seealso marker="erl_ext_dist#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+	in the chapter about the external term format in the ERTS User's Guide.</p>
         <pre>
 > <input>list_to_atom("Erlang").</input>
 'Erlang'</pre>
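The Latin1 restriction mentioned for `binary_to_atom/2` and `list_to_atom/1` amounts to a check on code points. The Python sketch below only mirrors the rule as stated in the text (every character below 256, at most 255 characters); the helper name `fits_in_atom` is hypothetical and the code is not part of Erlang/OTP.

```python
def fits_in_atom(text: str) -> bool:
    """Return True if text satisfies the atom limits described above.

    Every character must be encodable in Latin1 (code point below 256),
    and the atom may hold at most 255 characters.
    """
    return len(text) <= 255 and all(ord(ch) < 256 for ch in text)
```

For example, `fits_in_atom("Erlang")` is true, while a string containing a character such as U+20AC (the euro sign) is rejected because its code point is above 255.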