author Sverker Eriksson <sverker@erlang.org> 2013-01-23 18:09:35 +0100
committer Sverker Eriksson <sverker@erlang.org> 2013-01-23 18:09:35 +0100
commit b8e623410d1c22fe6d5fdeb8ccb0b2305533f033
tree 708d64e36e18b61ae1801c02ec3aeef42a697be3 /erts/doc
parent e99df74bee7c245ec76678e336fcd09d4b51a089
parent d6e3e256b850050b7a86323b2948009d5fcc30a9
Merge branch 'sverk/r16/utf8-atoms'

* sverk/r16/utf8-atoms:
  erl_interface: Fix bug when transcoding atoms from and to UTF8
  erl_interface: Changed erlang_char_encoding interface
  erts: Testcase doing unicode atom printout with ~w
  erl_interface: even more utf8 atom stuff
  erts: Fix bug in analyze_utf8 causing faulty latin1 detection
  Add UTF-8 node name support for epmd
  workaround...
  Fix merge conflict with hasse
  UTF-8 atom documentation
  test case
  erl_interface: utf8 atoms continued
  Add utf8 atom distribution test cases
  atom fixes for NIFs and atom_to_binary
  UTF-8 support for distribution
  Implement UTF-8 atom support for jinterface
  erl_interface: Enable decode of unicode atoms
  stdlib: Fix printing of unicode atoms
  erts: Change internal representation of atoms to utf8
  erts: Refactor rename DFLAG(S)_INTERNAL_TAGS for conformity

Conflicts:
  erts/emulator/beam/io.c

OTP-10753

Diffstat (limited to 'erts/doc')
-rw-r--r-- erts/doc/src/erl_dist_protocol.xml | 288
-rw-r--r-- erts/doc/src/erl_ext_dist.xml | 121
-rw-r--r-- erts/doc/src/erlang.xml | 10
3 files changed, 397 insertions, 22 deletions
diff --git a/erts/doc/src/erl_dist_protocol.xml b/erts/doc/src/erl_dist_protocol.xml
index 6c725fc82d..0252187be5 100644
--- a/erts/doc/src/erl_dist_protocol.xml
+++ b/erts/doc/src/erl_dist_protocol.xml
@@ -547,13 +547,289 @@ If Result > 0, the packet only consists of [119, Result].
-->
</section>
-
+ <marker id="distribution_handshake"/>
<section>
- <title>Handshake</title>
- <p>
- The handshake is discussed in detail in the internal documentation for
- the kernel (Erlang) application.
- </p>
+ <title>Distribution Handshake</title>
+ <p>
+ This section describes the distribution handshake protocol
+ introduced in the OTP-R6 release of Erlang/OTP. This
+ description was previously located in
+ <c>$ERL_TOP/lib/kernel/internal_doc/distribution_handshake.txt</c>,
+ and has more or less been copied and "formatted" here. It has been
+ more or less unchanged since the year 1999, but the handshake
+ should not have changed much since then either.
+ </p>
+ <section>
+ <title>General</title>
+ <p>
+ The TCP/IP distribution uses a handshake which expects a
+ connection based protocol, i.e. the protocol does not include
+ any authentication after the handshake procedure.
+ </p>
+ <p>
+ This is not entirely safe, as it is vulnerable to takeover
+ attacks, but it is a tradeoff between fair safety and performance.
+ </p>
+ <p>
+ The cookies are never sent in cleartext and the handshake procedure
+ expects the client (called A) to be the first one to prove that it can
+ generate a sufficient digest. The digest is generated with the
+ MD5 message digest algorithm and the challenges are expected to be very
+ random numbers.
+ </p>
+ </section>
+ <section>
+ <title>Definitions</title>
+ <p>
+ A challenge is a 32 bit integer number in big endian order. Below the function
+ <c>gen_challenge()</c> returns a random 32 bit integer used as a challenge.
+ </p>
+ <p>
+ A digest is a (16 bytes) MD5 hash of the Challenge (as text) concatenated
+ with the cookie (as text). Below, the function <c>gen_digest(Challenge, Cookie)</c>
+ generates a digest as described above.
+ </p>
+ <p>
+ An out_cookie is the cookie used in outgoing communication to a certain node,
+ so that A's out_cookie for B should correspond with B's in_cookie for A and
+ the other way around. A's out_cookie for B and A's in_cookie for B need <em>NOT</em>
+ be the same. Below the function <c>out_cookie(Node)</c> returns the current
+ node's out_cookie for <c>Node</c>.
+ </p>
+ <p>
+ An in_cookie is the cookie expected to be used by another node when
+ communicating with us, so that A's in_cookie for B corresponds with B's
+ out_cookie for A. Below the function <c>in_cookie(Node)</c> returns the current
+ node's <c>in_cookie</c> for <c>Node</c>.
+ </p>
+ <p>
+ The cookies are text strings that can be viewed as passwords.
+ </p>
+ <p>
+ Every message in the handshake starts with a 16 bit big endian integer
+ which contains the length of the message (not counting the two initial bytes).
+ In Erlang this corresponds to the <c>gen_tcp</c> option <c>{packet, 2}</c>. Note that after
+ the handshake, the distribution switches to 4 byte packet headers.
+ </p>
+
+ </section>
+ <section>
+ <title>The Handshake in Detail</title>
+ <p>
+ Imagine two nodes, node A, which initiates the handshake and node B, which
+ accepts the connection.
+ </p>
+ <taglist>
+ <tag>1) connect/accept</tag>
+ <item><p>A connects to B via TCP/IP and B accepts the connection.</p></item>
+ <tag>2) send_name/receive_name</tag>
+ <item><p>A sends an initial identification to B. B receives the message.
+ The message looks like this (every "square" being one byte and the packet
+ header removed):
+ </p>
+<pre>
++---+--------+--------+-----+-----+-----+-----+-----+-----+-...-+-----+
+|'n'|Version0|Version1|Flag0|Flag1|Flag2|Flag3|Name0|Name1| ... |NameN|
++---+--------+--------+-----+-----+-----+-----+-----+-----+-... +-----+
+</pre>
+ <p>
+ The 'n' is just a message tag.
+ Version0 and Version1 is the distribution version selected by node A,
+ based on information from EPMD. (16 bit big endian)
+ Flag0 ... Flag3 are capability flags, the capabilities defined in
+ <c>$ERL_TOP/lib/kernel/include/dist.hrl</c>.
+ (32 bit big endian)
+ Name0 ... NameN is the full nodename of A, as a string of bytes (the
+ packet length denotes how long it is).
+ </p></item>
+ <tag>3) recv_status/send_status</tag>
+ <item><p>B sends a status message to A, which indicates
+ if the connection is allowed. The following status codes are defined:</p>
+ <taglist>
+ <tag><c>ok</c></tag>
+ <item>The handshake will continue.</item>
+ <tag><c>ok_simultaneous</c></tag>
+ <item>The handshake will continue, but A is informed that B
+ has another ongoing connection attempt that will be
+ shut down (simultaneous connect where A's name is
+ greater than B's name, compared literally).</item>
+ <tag><c>nok</c></tag>
+ <item>The handshake will not continue, as B already has an ongoing handshake
+ which it itself has initiated. (simultaneous connect where B's name is
+ greater than A's).</item>
+ <tag><c>not_allowed</c></tag>
+ <item>The connection is disallowed for some (unspecified) security
+ reason.</item>
+ <tag><c>alive</c></tag>
+ <item>A connection to the node is already active, which either means
+ that node A is confused or that the TCP connection breakdown
+ of a previous node with this name has not yet reached node B.
+ See 3B below.</item>
+ </taglist>
+ <p>This is the format of the status message:</p>
+<pre>
++---+-------+-------+-...-+-------+
+|'s'|Status0|Status1| ... |StatusN|
++---+-------+-------+-...-+-------+
+</pre>
+ <p>
+ 's' is the message tag. Status0 ... StatusN is the status as a string (not terminated).
+ </p>
+ </item>
+ <tag>3B) send_status/recv_status</tag>
+ <item><p>If status was 'alive', node A will answer with
+ another status message containing either 'true' which means that the
+ connection should continue (The old connection from this node is broken), or
+ <c>'false'</c>, which simply means that the connection should be closed, the
+ connection attempt was a mistake.</p></item>
+ <tag>4) recv_challenge/send_challenge</tag>
+ <item><p>If the status was <c>ok</c> or <c>ok_simultaneous</c>,
+ the handshake continues with B sending A another message, the challenge.
+ The challenge contains the same type of information as the "name" message
+ initially sent from A to B, with the addition of a 32 bit challenge:</p>
+<pre>
++---+--------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-...-+-----+
+|'n'|Version0|Version1|Flag0|Flag1|Flag2|Flag3|Chal0|Chal1|Chal2|Chal3|Name0|Name1| ... |NameN|
++---+--------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-... +-----+
+</pre>
+ <p>
+ Where Chal0 ... Chal3 is the challenge as a 32 bit big endian integer
+ and the other fields are B's version, flags and full nodename.
+ </p></item>
+ <tag>5) send_challenge_reply/recv_challenge_reply</tag>
+ <item><p>Now A has generated a digest and its own challenge. Those are
+ sent together in a packet to B:</p>
+<pre>
++---+-----+-----+-----+-----+-----+-----+-----+-----+-...-+------+
+|'r'|Chal0|Chal1|Chal2|Chal3|Dige0|Dige1|Dige2|Dige3| ... |Dige15|
++---+-----+-----+-----+-----+-----+-----+-----+-----+-...-+------+
+</pre>
+ <p>
+ Where 'r' is the tag, Chal0 ... Chal3 is A's challenge for B to handle and
+ Dige0 ... Dige15 is the digest that A constructed from the challenge B sent
+ in the previous step.
+ </p></item>
+ <tag>6) recv_challenge_ack/send_challenge_ack</tag>
+ <item><p>B checks that the digest received from A is correct and generates a
+ digest from the challenge received from A. The digest is then sent to A. The
+ message looks like this:</p>
+<pre>
++---+-----+-----+-----+-----+-...-+------+
+|'a'|Dige0|Dige1|Dige2|Dige3| ... |Dige15|
++---+-----+-----+-----+-----+-...-+------+
+</pre>
+ <p>
+ Where 'a' is the tag and Dige0 ... Dige15 is the digest calculated by B
+ for A's challenge.</p></item>
+ <tag>7)</tag>
+ <item><p>A checks the digest from B and the connection is up.</p></item>
+ </taglist>
+ </section>
+ <section>
+ <title>Semigraphic View</title>
+<pre>
+A (initiator) B (acceptor)
+
+TCP connect -----------------------------------------&gt;
+ TCP accept
+
+send_name -----------------------------------------&gt;
+ recv_name
+
+ &lt;---------------------------------------- send_status
+recv_status
+(if status was 'alive'
+ send_status - - - - - - - - - - - - - - - - - - - -&gt;
+ recv_status)
+ ChB = gen_challenge()
+ (ChB)
+ &lt;---------------------------------------- send_challenge
+recv_challenge
+
+ChA = gen_challenge(),
+OCA = out_cookie(B),
+DiA = gen_digest(ChB,OCA)
+ (ChA, DiA)
+send_challenge_reply --------------------------------&gt;
+ recv_challenge_reply
+ ICB = in_cookie(A),
+ check:
+ DiA == gen_digest
+ (ChB, ICB) ?
+ - if OK:
+ OCB = out_cookie(A),
+ DiB = gen_digest
+ (DiB) (ChA, OCB)
+ &lt;----------------------------------------- send_challenge_ack
+recv_challenge_ack DONE
+ICA = in_cookie(B), - else
+check: CLOSE
+DiB == gen_digest(ChA,ICA) ?
+- if OK
+ DONE
+- else
+ CLOSE
+</pre>
+ </section>
+ <marker id="dflags"/>
+ <section>
+ <title>The Currently Defined Distribution Flags</title>
+ <p>
+ Currently (OTP-R16) the following capability flags are defined:
+ </p>
+<pre>
+%% The node should be published and part of the global namespace
+-define(DFLAG_PUBLISHED,1).
+
+%% The node implements an atom cache (obsolete)
+-define(DFLAG_ATOM_CACHE,2).
+
+%% The node implements extended (3 * 32 bits) references. This is
+%% required today. If not present connection will be refused.
+-define(DFLAG_EXTENDED_REFERENCES,4).
+
+%% The node implements distributed process monitoring.
+-define(DFLAG_DIST_MONITOR,8).
+
+%% The node uses separate tag for fun's (lambdas) in the distribution protocol.
+-define(DFLAG_FUN_TAGS,16#10).
+
+%% The node implements distributed named process monitoring.
+-define(DFLAG_DIST_MONITOR_NAME,16#20).
+
+%% The (hidden) node implements atom cache (obsolete)
+-define(DFLAG_HIDDEN_ATOM_CACHE,16#40).
+
+%% The node understands new fun-tags
+-define(DFLAG_NEW_FUN_TAGS,16#80).
+
+%% The node is capable of handling extended pids and ports. This is
+%% required today. If not present connection will be refused.
+-define(DFLAG_EXTENDED_PIDS_PORTS,16#100).
+
+%%
+-define(DFLAG_EXPORT_PTR_TAG,16#200).
+
+%%
+-define(DFLAG_BIT_BINARIES,16#400).
+
+%% The node understands new float format
+-define(DFLAG_NEW_FLOATS,16#800).
+
+%%
+-define(DFLAG_UNICODE_IO,16#1000).
+
+%% The node implements atom cache in distribution header.
+-define(DFLAG_DIST_HDR_ATOM_CACHE,16#2000).
+
+%% The node understands the SMALL_ATOM_EXT tag
+-define(DFLAG_SMALL_ATOM_TAGS, 16#4000).
+
+%% The node understands UTF-8 encoded atoms
+-define(DFLAG_UTF8_ATOMS, 16#10000).
+
+</pre>
+ </section>
</section>
<section>
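
Note on the erl_dist_protocol.xml hunk above: the challenge/digest exchange and the message layouts it documents can be sketched in Python as follows. This is a minimal illustration, not the OTP implementation. All function names except `gen_challenge` and `gen_digest` are hypothetical. The digest input order (cookie text first, then the challenge as decimal text) is an assumption, since the documentation only says the two are concatenated, and the simulation assumes each node's in_cookie and out_cookie for its peer are the same string.

```python
import hashlib
import os
import struct


def gen_challenge() -> int:
    """Return a random 32-bit unsigned integer to use as a challenge."""
    return struct.unpack(">I", os.urandom(4))[0]


def gen_digest(challenge: int, cookie: str) -> bytes:
    """16-byte MD5 digest over the cookie and the challenge as decimal text.

    Assumption: cookie bytes come first, then the challenge text.
    """
    text = cookie.encode("latin-1") + str(challenge).encode("ascii")
    return hashlib.md5(text).digest()


def frame(payload: bytes) -> bytes:
    """Prefix a handshake message with its 16-bit big-endian length ({packet, 2})."""
    return struct.pack(">H", len(payload)) + payload


def send_name(version: int, flags: int, name: str) -> bytes:
    """Step 2: 'n', 16-bit version, 32-bit flags, node name."""
    return frame(b"n" + struct.pack(">HI", version, flags) + name.encode("latin-1"))


def send_challenge(version: int, flags: int, challenge: int, name: str) -> bytes:
    """Step 4: like the name message, plus a 32-bit challenge before the name."""
    body = struct.pack(">HII", version, flags, challenge) + name.encode("latin-1")
    return frame(b"n" + body)


def send_challenge_reply(challenge: int, digest: bytes) -> bytes:
    """Step 5: 'r', A's 32-bit challenge, A's 16-byte digest."""
    return frame(b"r" + struct.pack(">I", challenge) + digest)


def send_challenge_ack(digest: bytes) -> bytes:
    """Step 6: 'a', B's 16-byte digest."""
    return frame(b"a" + digest)


def handshake_ok(cookie_a: str, cookie_b: str) -> bool:
    """Simulate steps 4 to 7 in memory; True if both digest checks pass."""
    ch_b = gen_challenge()                  # B sends its challenge to A
    ch_a = gen_challenge()                  # A picks its own challenge
    di_a = gen_digest(ch_b, cookie_a)       # A answers B's challenge
    if di_a != gen_digest(ch_b, cookie_b):  # B verifies A's digest
        return False
    di_b = gen_digest(ch_a, cookie_b)       # B answers A's challenge
    return di_b == gen_digest(ch_a, cookie_a)  # A verifies B's digest
```

With matching cookies `handshake_ok` returns `True`; with different cookies B rejects A's digest in step 6 and the connection would be closed.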
diff --git a/erts/doc/src/erl_ext_dist.xml b/erts/doc/src/erl_ext_dist.xml
index fd2da2cfe3..28afea8b29 100644
--- a/erts/doc/src/erl_ext_dist.xml
+++ b/erts/doc/src/erl_ext_dist.xml
@@ -119,10 +119,39 @@
<cell align="center">Data</cell>
</row>
<tcaption></tcaption></table>
+ <marker id="utf8_atoms"/>
+ <note>
+ <p>As of ERTS version 5.10 (OTP-R16) support
+ for UTF-8 encoded atoms has been introduced in the external format.
+ However, only characters that can be encoded using Latin1 (ISO-8859-1)
+ are currently supported in atoms. The support for UTF-8 encoded atoms
+ in the external format has been implemented in order to be able to support
+ all Unicode characters in atoms in <em>some future release</em>. Full
+ support for Unicode atoms will not happen before OTP-R18, and might
+ be introduced even later than that. Until full Unicode support for
+ atoms has been introduced, it is an <em>error</em> to pass atoms containing
+ characters that cannot be encoded in Latin1, and <em>the behavior is
+ undefined</em>.</p>
+ <p>When the
+ <seealso marker="erl_dist_protocol#dflags"><c>DFLAG_UTF8_ATOMS</c></seealso>
+ distribution flag has been exchanged between both nodes in the
+ <seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>,
+ all atoms in the distribution header will be encoded in UTF-8; otherwise,
+ all atoms in the distribution header will be encoded in Latin1. The two
+ new tags <seealso marker="#ATOM_UTF8_EXT">ATOM_UTF8_EXT</seealso>, and
+ <seealso marker="#SMALL_ATOM_UTF8_EXT">SMALL_ATOM_UTF8_EXT</seealso>
+ will only be used if the <c>DFLAG_UTF8_ATOMS</c> distribution flag has
+ been exchanged between nodes, or if an atom containing characters
+ that cannot be encoded in Latin1 is encountered.
+ </p>
+ <p>The maximum number of allowed characters in an atom is 255. In the
+ UTF-8 case each character may need 4 bytes to be encoded.
+ </p>
+ </note>
</section>
- <section>
- <marker id="distribution_header"/>
+ <marker id="distribution_header"/>
+ <section>
<title>Distribution header</title>
<p>
As of erts version 5.7.2 the old atom cache protocol was
@@ -219,8 +248,7 @@
<p>
The least significant bit in that half byte is the <c>LongAtoms</c>
flag. If it is set, 2 bytes are used for atom lengths instead of
- 1 byte in the distribution header. However, the current emulator
- cannot handle long atoms, so it will currently always be 0.
+ 1 byte in the distribution header.
</p>
<p>
After the <c>Flags</c> field follow the <c>AtomCacheRefs</c>. The
@@ -247,16 +275,26 @@
<p>
<c>InternalSegmentIndex</c> together with the <c>SegmentIndex</c>
completely identify the location of an atom cache entry in the
- atom cache. <c>Length</c> is number of one byte characters that
- the atom text consists of. Length is a two byte big endian integer
+ atom cache. <c>Length</c> is number of bytes that <c>AtomText</c>
+ consists of. Length is a two byte big endian integer
if the <c>LongAtoms</c> flag has been set, otherwise a one byte
- integer. Subsequent <c>CachedAtomRef</c>s with the same
+ integer. When the
+ <seealso marker="erl_dist_protocol#dflags"><c>DFLAG_UTF8_ATOMS</c></seealso>
+ distribution flag has been exchanged between both nodes in the
+ <seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>,
+ the characters in <c>AtomText</c> are encoded in UTF-8; otherwise,
+ they are encoded in Latin1. Subsequent <c>CachedAtomRef</c>s with the same
<c>SegmentIndex</c> and <c>InternalSegmentIndex</c> as this
<c>NewAtomCacheRef</c> will refer to this atom until a new
<c>NewAtomCacheRef</c> with the same <c>SegmentIndex</c>
and <c>InternalSegmentIndex</c> appear.
</p>
<p>
+ For more information on encoding of atoms, see
+ <seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+ in the beginning of this document.
+ </p>
+ <p>
If the <c>NewCacheEntryFlag</c> for the next <c>AtomCacheRef</c>
has not been set, a <c>CachedAtomRef</c> on the following format
will follow:
@@ -383,9 +421,9 @@
<tcaption></tcaption></table>
<p>
An atom is stored with a 2 byte unsigned length in big-endian order,
- followed by <c>Len</c> numbers of 8 bit characters that forms the
- <c>AtomName</c>.
- Note: The maximum allowed value for <c>Len</c> is 255.
+ followed by <c>Len</c> numbers of 8 bit Latin1 characters that forms
+ the <c>AtomName</c>.
+ <em>Note</em>: The maximum allowed value for <c>Len</c> is 255.
</p>
</section>
@@ -754,12 +792,14 @@
<tcaption></tcaption></table>
<p>
An atom is stored with a 1 byte unsigned length,
- followed by <c>Len</c> numbers of 8 bit characters that
+ followed by <c>Len</c> numbers of 8 bit Latin1 characters that
forms the <c>AtomName</c>. Longer atoms can be represented
by <seealso marker="#ATOM_EXT">ATOM_EXT</seealso>. <em>Note</em>
the <c>SMALL_ATOM_EXT</c> was introduced in erts version 5.7.2 and
- require a small atom distribution flag exchanged in the distribution
- handshake.
+ requires an exchange of the
+ <seealso marker="erl_dist_protocol#dflags"><c>DFLAG_SMALL_ATOM_TAGS</c></seealso>
+ distribution flag in the
+ <seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>.
</p>
</section>
@@ -1007,7 +1047,62 @@
This term is used in minor version 1 of the external format.
</p>
</section>
+ <section>
+ <marker id="ATOM_UTF8_EXT"/>
+ <title>ATOM_UTF8_EXT</title>
+
+ <table align="left">
+ <row>
+ <cell align="center">1</cell>
+ <cell align="center">2</cell>
+ <cell align="center">Len</cell>
+ </row>
+ <row>
+ <cell align="center"><c>118</c></cell>
+ <cell align="center"><c>Len</c></cell>
+ <cell align="center"><c>AtomName</c></cell>
+ </row>
+ <tcaption></tcaption></table>
+ <p>
+ An atom is stored with a 2 byte unsigned length in big-endian order,
+ followed by <c>Len</c> bytes containing the <c>AtomName</c> encoded
+ in UTF-8.
+ </p>
+ <p>
+ For more information on encoding of atoms, see
+ <seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+ in the beginning of this document.
+ </p>
+ </section>
+ <section>
+ <marker id="SMALL_ATOM_UTF8_EXT"/>
+ <title>SMALL_ATOM_UTF8_EXT</title>
+
+ <table align="left">
+ <row>
+ <cell align="center">1</cell>
+ <cell align="center">1</cell>
+ <cell align="center">Len</cell>
+ </row>
+ <row>
+ <cell align="center"><c>119</c></cell>
+ <cell align="center"><c>Len</c></cell>
+ <cell align="center"><c>AtomName</c></cell>
+ </row>
+ <tcaption></tcaption></table>
+ <p>
+ An atom is stored with a 1 byte unsigned length,
+ followed by <c>Len</c> bytes containing the <c>AtomName</c> encoded
+ in UTF-8. Longer atoms encoded in UTF-8 can be represented using
+ <seealso marker="#ATOM_UTF8_EXT">ATOM_UTF8_EXT</seealso>.
+ </p>
+ <p>
+ For more information on encoding of atoms, see
+ <seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+ in the beginning of this document.
+ </p>
+ </section>
</chapter>
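
Note on the erl_ext_dist.xml hunks above: the tag selection they describe for atoms can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the emulator's encoder. `encode_atom` and its `utf8_atoms` parameter (standing in for "DFLAG_UTF8_ATOMS was exchanged in the handshake") are hypothetical names. Tags 118 and 119 come from the tables in the diff; tag 100 for `ATOM_EXT` comes from the existing external term format documentation. For simplicity the Latin1 path always uses `ATOM_EXT`, so it does not depend on `DFLAG_SMALL_ATOM_TAGS`.

```python
import struct

ATOM_EXT = 100             # 2-byte length, Latin1 text
ATOM_UTF8_EXT = 118        # 2-byte length, UTF-8 text
SMALL_ATOM_UTF8_EXT = 119  # 1-byte length, UTF-8 text


def encode_atom(name: str, utf8_atoms: bool) -> bytes:
    """Encode an atom name using the tag choice described in the note on UTF-8 atoms."""
    if len(name) > 255:
        raise ValueError("an atom may contain at most 255 characters")
    try:
        latin1 = name.encode("latin-1")
    except UnicodeEncodeError:
        latin1 = None  # contains characters outside Latin1
    if utf8_atoms or latin1 is None:
        data = name.encode("utf-8")  # up to 4 bytes per character
        if len(data) <= 255:
            return struct.pack(">BB", SMALL_ATOM_UTF8_EXT, len(data)) + data
        return struct.pack(">BH", ATOM_UTF8_EXT, len(data)) + data
    return struct.pack(">BH", ATOM_EXT, len(latin1)) + latin1
```

Note that `Len` counts bytes, not characters, in the UTF-8 tags, which is why a 200-character atom made of two-byte characters no longer fits in `SMALL_ATOM_UTF8_EXT`.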
diff --git a/erts/doc/src/erlang.xml b/erts/doc/src/erlang.xml
index d09f286e36..315dc323ba 100644
--- a/erts/doc/src/erlang.xml
+++ b/erts/doc/src/erlang.xml
@@ -277,7 +277,9 @@
the binary contains Unicode characters greater than 16#FF.
In a future release, such Unicode characters might be allowed
and <c>binary_to_atom(<anno>Binary</anno>, utf8)</c>
- will not fail in that case.</p></note>
+ will not fail in that case. For more information on Unicode support in atoms
+ see <seealso marker="erl_ext_dist#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+ in the chapter about the external term format in the ERTS User's Guide.</p></note>
<pre>
> <input>binary_to_atom(&lt;&lt;"Erlang"&gt;&gt;, latin1).</input>
@@ -1712,9 +1714,11 @@ os_prompt% </pre>
<desc>
<p>Returns the atom whose text representation is <c><anno>String</anno></c>.</p>
<p><c><anno>String</anno></c> may only contain ISO-latin-1
- characterns (i.e. numbers below 256) as the current
+ characters (i.e. numbers below 256) as the current
implementation does not allow unicode characters >= 256 in
- atoms.</p>
+ atoms. For more information on Unicode support in atoms
+ see <seealso marker="erl_ext_dist#utf8_atoms">note on UTF-8 encoded atoms</seealso>
+ in the chapter about the external term format in the ERTS User's Guide.</p>
<pre>
> <input>list_to_atom("Erlang").</input>
'Erlang'</pre>