diff options
author | Rickard Green <[email protected]> | 2013-01-19 00:45:16 +0100 |
---|---|---|
committer | Rickard Green <[email protected]> | 2013-01-19 00:45:16 +0100 |
commit | b15688d40d5147c1122aaad3b82495fbbc4dede8 (patch) | |
tree | 8af86d278082f202a43902a8c88e6998f7e976e3 /erts/doc | |
parent | a912b3c6f4759a6a8e60fc4ea559c19edb02448c (diff) | |
download | otp-b15688d40d5147c1122aaad3b82495fbbc4dede8.tar.gz otp-b15688d40d5147c1122aaad3b82495fbbc4dede8.tar.bz2 otp-b15688d40d5147c1122aaad3b82495fbbc4dede8.zip |
UTF-8 atom documentation
Diffstat (limited to 'erts/doc')
-rw-r--r-- | erts/doc/src/erl_dist_protocol.xml | 288 | ||||
-rw-r--r-- | erts/doc/src/erl_ext_dist.xml | 121 | ||||
-rw-r--r-- | erts/doc/src/erlang.xml | 10 |
3 files changed, 397 insertions, 22 deletions
diff --git a/erts/doc/src/erl_dist_protocol.xml b/erts/doc/src/erl_dist_protocol.xml index 6c725fc82d..0252187be5 100644 --- a/erts/doc/src/erl_dist_protocol.xml +++ b/erts/doc/src/erl_dist_protocol.xml @@ -547,13 +547,289 @@ If Result > 0, the packet only consists of [119, Result]. --> </section> - + <marker id="distribution_handshake"/> <section> - <title>Handshake</title> - <p> - The handshake is discussed in detail in the internal documentation for - the kernel (Erlang) application. - </p> + <title>Distribution Handshake</title> + <p> + This section describes the distribution handshake protocol + introduced in the OTP-R6 release of Erlang/OTP. This + description was previously located in + <c>$ERL_TOP/lib/kernel/internal_doc/distribution_handshake.txt</c>, + and has more or less been copied and "formatted" here. It has been + more or less unchanged since the year 1999, but the handshake + should not have changed much since then either. + </p> + <section> + <title>General</title> + <p> + The TCP/IP distribution uses a handshake which expects a + connection based protocol, i.e. the protocol does not include + any authentication after the handshake procedure. + </p> + <p> + This is not entirely safe, as it is vulnerable against takeover + attacks, but it is a tradeoff between fair safety and performance. + </p> + <p> + The cookies are never sent in cleartext and the handshake procedure + expects the client (called A) to be the first one to prove that it can + generate a sufficient digest. The digest is generated with the + MD5 message digest algorithm and the challenges are expected to be very + random numbers. + </p> + </section> + <section> + <title>Definitions</title> + <p> + A challenge is a 32 bit integer number in big endian order. Below the function + <c>gen_challenge()</c> returns a random 32 bit integer used as a challenge. + </p> + <p> + A digest is a (16 bytes) MD5 hash of the Challenge (as text) concatenated + with the cookie (as text). Below, the function <c>gen_digest(Challenge, Cookie)</c> + generates a digest as described above. + </p> + <p> + An out_cookie is the cookie used in outgoing communication to a certain node, + so that A's out_cookie for B should correspond with B's in_cookie for A and + the other way around. A's out_cookie for B and A's in_cookie for B need <em>NOT</em> + be the same. Below the function <c>out_cookie(Node)</c> returns the current + node's out_cookie for <c>Node</c>. + </p> + <p> + An in_cookie is the cookie expected to be used by another node when + communicating with us, so that A's in_cookie for B corresponds with B's + out_cookie for A. Below the function <c>in_cookie(Node)</c> returns the current + node's <c>in_cookie</c> for <c>Node</c>. + </p> + <p> + The cookies are text strings that can be viewed as passwords. + </p> + <p> + Every message in the handshake starts with a 16 bit big endian integer + which contains the length of the message (not counting the two initial bytes). + In erlang this corresponds to the <c>gen_tcp</c> option <c>{packet, 2}</c>. Note that after + the handshake, the distribution switches to 4 byte packet headers. + </p> + + </section> + <section> + <title>The Handshake in Detail</title> + <p> + Imagine two nodes, node A, which initiates the handshake and node B, which + accepts the connection. + </p> + <taglist> + <tag>1) connect/accept</tag> + <item><p>A connects to B via TCP/IP and B accepts the connection.</p></item> + <tag>2) send_name/receive_name</tag> + <item><p>A sends an initial identification to B. B receives the message. + The message looks like this (every "square" being one byte and the packet + header removed): + </p> +<pre> ++---+--------+--------+-----+-----+-----+-----+-----+-----+-...-+-----+ +|'n'|Version0|Version1|Flag0|Flag1|Flag2|Flag3|Name0|Name1| ... |NameN| ++---+--------+--------+-----+-----+-----+-----+-----+-----+-... +-----+ +</pre> + <p> + The 'n' is just a message tag. + Version0 and Version1 is the distribution version selected by node A, + based on information from EPMD. (16 bit big endian) + Flag0 ... Flag3 are capability flags, the capabilities defined in + <c>$ERL_TOP/lib/kernel/include/dist.hrl</c>. + (32 bit big endian) + Name0 ... NameN is the full nodename of A, as a string of bytes (the + packet length denotes how long it is). + </p></item> + <tag>3) recv_status/send_status</tag> + <item><p>B sends a status message to A, which indicates + if the connection is allowed. The following status codes are defined:</p> + <taglist> + <tag><c>ok</c></tag> + <item>The handshake will continue.</item> + <tag><c>ok_simultaneous</c></tag> + <item>The handshake will continue, but A is informed that B + has another ongoing connection attempt that will be + shut down (simultaneous connect where A's name is + greater than B's name, compared literally).</item> + <tag><c>nok</c></tag> + <item>The handshake will not continue, as B already has an ongoing handshake + which it itself has initiated. (simultaneous connect where B's name is + greater than A's).</item> + <tag><c>not_allowed</c></tag> + <item>The connection is disallowed for some (unspecified) security + reason.</item> + <tag><c>alive</c></tag> + <item>A connection to the node is already active, which either means + that node A is confused or that the TCP connection breakdown + of a previous node with this name has not yet reached node B. + See 3B below.</item> + </taglist> + <p>This is the format of the status message:</p> +<pre> ++---+-------+-------+-...-+-------+ +|'s'|Status0|Status1| ... |StatusN| ++---+-------+-------+-...-+-------+ +</pre> + <p> + 's' is the message tag Status0 ... StatusN is the status as a string (not terminated) + </p> + </item> + <tag>3B) send_status/recv_status</tag> + <item><p>If status was 'alive', node A will answer with + another status message containing either 'true' which means that the + connection should continue (The old connection from this node is broken), or + <c>'false'</c>, which simply means that the connection should be closed, the + connection attempt was a mistake.</p></item> + <tag>4) recv_challenge/send_challenge</tag> + <item><p>If the status was <c>ok</c> or <c>ok_simultaneous</c>, + The handshake continues with B sending A another message, the challenge. + The challenge contains the same type of information as the "name" message + initially sent from A to B, with the addition of a 32 bit challenge:</p> +<pre> ++---+--------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-...-+-----+ +|'n'|Version0|Version1|Flag0|Flag1|Flag2|Flag3|Chal0|Chal1|Chal2|Chal3|Name0|Name1| ... |NameN| ++---+--------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-... +-----+ +</pre> + <p> + Where Chal0 ... Chal3 is the challenge as a 32 bit big endian integer + and the other fields are B's version, flags and full nodename. + </p></item> + <tag>5) send_challenge_reply/recv_challenge_reply</tag> + <item><p>Now A has generated a digest and its own challenge. Those are + sent together in a package to B:</p> +<pre> ++---+-----+-----+-----+-----+-----+-----+-----+-----+-...-+------+ +|'r'|Chal0|Chal1|Chal2|Chal3|Dige0|Dige1|Dige2|Dige3| ... |Dige15| ++---+-----+-----+-----+-----+-----+-----+-----+-----+-...-+------+ +</pre> + <p> + Where 'r' is the tag, Chal0 ... Chal3 is A's challenge for B to handle and + Dige0 ... Dige15 is the digest that A constructed from the challenge B sent + in the previous step. + </p></item> + <tag>6) recv_challenge_ack/send_challenge_ack</tag> + <item><p>B checks that the digest received from A is correct and generates a + digest from the challenge received from A. The digest is then sent to A. The + message looks like this:</p> +<pre> ++---+-----+-----+-----+-----+-...-+------+ +|'a'|Dige0|Dige1|Dige2|Dige3| ... |Dige15| ++---+-----+-----+-----+-----+-...-+------+ +</pre> + <p> + Where 'a' is the tag and Dige0 ... Dige15 is the digest calculated by B + for A's challenge.</p></item> + <tag>7)</tag> + <item><p>A checks the digest from B and the connection is up.</p></item> + </taglist> + </section> + <section> + <title>Semigraphic View</title> +<pre> +A (initiator) B (acceptor) + +TCP connect -----------------------------------------> + TCP accept + +send_name -----------------------------------------> + recv_name + + <---------------------------------------- send_status +recv_status +(if status was 'alive' + send_status - - - - - - - - - - - - - - - - - - - -> + recv_status) + ChB = gen_challenge() + (ChB) + <---------------------------------------- send_challenge +recv_challenge + +ChA = gen_challenge(), +OCA = out_cookie(B), +DiA = gen_digest(ChB,OCA) + (ChA, DiA) +send_challenge_reply --------------------------------> + recv_challenge_reply + ICB = in_cookie(A), + check: + DiA == gen_digest + (ChB, ICB) ? + - if OK: + OCB = out_cookie(A), + DiB = gen_digest + (DiB) (ChA, OCB) + <----------------------------------------- send_challenge_ack +recv_challenge_ack DONE +ICA = in_cookie(B), - else +check: CLOSE +DiB == gen_digest(ChA,ICA) ? +- if OK + DONE +- else + CLOSE +</pre> + </section> + <marker id="dflags"/> + <section> + <title>The Currently Defined Distribution Flags</title> + <p> + Currently (OTP-R16) the following capability flags are defined: + </p> +<pre> +%% The node should be published and part of the global namespace +-define(DFLAG_PUBLISHED,1). + +%% The node implements an atom cache (obsolete) +-define(DFLAG_ATOM_CACHE,2). + +%% The node implements extended (3 * 32 bits) references. This is +%% required today. If not present connection will be refused. +-define(DFLAG_EXTENDED_REFERENCES,4). + +%% The node implements distributed process monitoring. +-define(DFLAG_DIST_MONITOR,8). + +%% The node uses separate tag for fun's (lambdas) in the distribution protocol. +-define(DFLAG_FUN_TAGS,16#10). + +%% The node implements distributed named process monitoring. +-define(DFLAG_DIST_MONITOR_NAME,16#20). + +%% The (hidden) node implements atom cache (obsolete) +-define(DFLAG_HIDDEN_ATOM_CACHE,16#40). + +%% The node understand new fun-tags +-define(DFLAG_NEW_FUN_TAGS,16#80). + +%% The node is capable of handling extended pids and ports. This is +%% required today. If not present connection will be refused. +-define(DFLAG_EXTENDED_PIDS_PORTS,16#100). + +%% +-define(DFLAG_EXPORT_PTR_TAG,16#200). + +%% +-define(DFLAG_BIT_BINARIES,16#400). + +%% The node understands new float format +-define(DFLAG_NEW_FLOATS,16#800). + +%% +-define(DFLAG_UNICODE_IO,16#1000). + +%% The node implements atom cache in distribution header. +-define(DFLAG_DIST_HDR_ATOM_CACHE,16#2000). + +%% The node understand the SMALL_ATOM_EXT tag +-define(DFLAG_SMALL_ATOM_TAGS, 16#4000). + +%% The node understand UTF-8 encoded atoms +-define(DFLAG_UTF8_ATOMS, 16#10000). + +</pre> + </section> </section> <section> diff --git a/erts/doc/src/erl_ext_dist.xml b/erts/doc/src/erl_ext_dist.xml index fd2da2cfe3..28afea8b29 100644 --- a/erts/doc/src/erl_ext_dist.xml +++ b/erts/doc/src/erl_ext_dist.xml @@ -119,10 +119,39 @@ <cell align="center">Data</cell> </row> <tcaption></tcaption></table> + <marker id="utf8_atoms"/> + <note> + <p>As of ERTS version 5.10 (OTP-R16) support + for UTF-8 encoded atoms has been introduced in the external format. + However, only characters that can be encoded using Latin1 (ISO-8859-1) + are currently supported in atoms. The support for UTF-8 encoded atoms + in the external format has been implemented in order to be able to support + all Unicode characters in atoms in <em>some future release</em>. Full + support for Unicode atoms will not happen before OTP-R18, and might + be introduced even later than that. Until full Unicode support for + atoms has been introduced, it is an <em>error</em> to pass atoms containing + characters that cannot be encoded in Latin1, and <em>the behavior is + undefined</em>.</p> + <p>When the + <seealso marker="erl_dist_protocol#dflags"><c>DFLAG_UTF8_ATOMS</c></seealso> + distribution flag has been exchanged between both nodes in the + <seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>, + all atoms in the distribution header will be encoded in UTF-8; otherwise, + all atoms in the distribution header will be encoded in Latin1. The two + new tags <seealso marker="#ATOM_UTF8_EXT">ATOM_UTF8_EXT</seealso>, and + <seealso marker="#SMALL_ATOM_UTF8_EXT">SMALL_ATOM_UTF8_EXT</seealso> + will only be used if the <c>DFLAG_UTF8_ATOMS</c> distribution flag has + been exchanged between nodes, or if an atom containing characters + that cannot be encoded in Latin1 is encountered. + </p> + <p>The maximum number of allowed characters in an atom is 255. In the + UTF-8 case each character may need 4 bytes to be encoded. + </p> + </note> </section> - <section> - <marker id="distribution_header"/> + <marker id="distribution_header"/> + <section> <title>Distribution header</title> <p> As of erts version 5.7.2 the old atom cache protocol was @@ -219,8 +248,7 @@ <p> The least significant bit in that half byte is the <c>LongAtoms</c> flag. If it is set, 2 bytes are used for atom lengths instead of - 1 byte in the distribution header. However, the current emulator - cannot handle long atoms, so it will currently always be 0. + 1 byte in the distribution header. </p> <p> After the <c>Flags</c> field follow the <c>AtomCacheRefs</c>. The @@ -247,16 +275,26 @@ <p> <c>InternalSegmentIndex</c> together with the <c>SegmentIndex</c> completely identify the location of an atom cache entry in the - atom cache. <c>Length</c> is number of one byte characters that - the atom text consists of. Length is a two byte big endian integer + atom cache. <c>Length</c> is number of bytes that <c>AtomText</c> + consists of. Length is a two byte big endian integer if the <c>LongAtoms</c> flag has been set, otherwise a one byte - integer. Subsequent <c>CachedAtomRef</c>s with the same + integer. When the + <seealso marker="erl_dist_protocol#dflags"><c>DFLAG_UTF8_ATOMS</c></seealso> + distribution flag has been exchanged between both nodes in the + <seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>, + characters in <c>AtomText</c> is encoded in UTF-8; otherwise, + encoded in Latin1. Subsequent <c>CachedAtomRef</c>s with the same <c>SegmentIndex</c> and <c>InternalSegmentIndex</c> as this <c>NewAtomCacheRef</c> will refer to this atom until a new <c>NewAtomCacheRef</c> with the same <c>SegmentIndex</c> and <c>InternalSegmentIndex</c> appear. </p> <p> + For more information on encoding of atoms, see + <seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso> + in the beginning of this document. + </p> + <p> If the <c>NewCacheEntryFlag</c> for the next <c>AtomCacheRef</c> has not been set, a <c>CachedAtomRef</c> on the following format will follow: @@ -383,9 +421,9 @@ <tcaption></tcaption></table> <p> An atom is stored with a 2 byte unsigned length in big-endian order, - followed by <c>Len</c> numbers of 8 bit characters that forms the - <c>AtomName</c>. - Note: The maximum allowed value for <c>Len</c> is 255. + followed by <c>Len</c> numbers of 8 bit Latin1 characters that forms + the <c>AtomName</c>. + <em>Note</em>: The maximum allowed value for <c>Len</c> is 255. </p> </section> @@ -754,12 +792,14 @@ <tcaption></tcaption></table> <p> An atom is stored with a 1 byte unsigned length, - followed by <c>Len</c> numbers of 8 bit characters that + followed by <c>Len</c> numbers of 8 bit Latin1 characters that forms the <c>AtomName</c>. Longer atoms can be represented by <seealso marker="#ATOM_EXT">ATOM_EXT</seealso>. <em>Note</em> the <c>SMALL_ATOM_EXT</c> was introduced in erts version 5.7.2 and - require a small atom distribution flag exchanged in the distribution - handshake. + require an exchange of the + <seealso marker="erl_dist_protocol#dflags"><c>DFLAG_SMALL_ATOM_TAGS</c></seealso> + distribution flag in the + <seealso marker="erl_dist_protocol#distribution_handshake">distribution handshake</seealso>. </p> </section> @@ -1007,7 +1047,62 @@ This term is used in minor version 1 of the external format. </p> </section> + <section> + <marker id="ATOM_UTF8_EXT"/> + <title>ATOM_UTF8_EXT</title> + + <table align="left"> + <row> + <cell align="center">1</cell> + <cell align="center">2</cell> + <cell align="center">Len</cell> + </row> + <row> + <cell align="center"><c>118</c></cell> + <cell align="center"><c>Len</c></cell> + <cell align="center"><c>AtomName</c></cell> + </row> + <tcaption></tcaption></table> + <p> + An atom is stored with a 2 byte unsigned length in big-endian order, + followed by <c>Len</c> bytes containing the <c>AtomName</c> encoded + in UTF-8. + </p> + <p> + For more information on encoding of atoms, see + <seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso> + in the beginning of this document. + </p> + </section> + <section> + <marker id="SMALL_ATOM_UTF8_EXT"/> + <title>SMALL_ATOM_UTF8_EXT</title> + + <table align="left"> + <row> + <cell align="center">1</cell> + <cell align="center">1</cell> + <cell align="center">Len</cell> + </row> + <row> + <cell align="center"><c>119</c></cell> + <cell align="center"><c>Len</c></cell> + <cell align="center"><c>AtomName</c></cell> + </row> + <tcaption></tcaption></table> + <p> + An atom is stored with a 1 byte unsigned length, + followed by <c>Len</c> bytes containing the <c>AtomName</c> encoded + in UTF-8. Longer atoms encoded in UTF-8 can be represented using + <seealso marker="#ATOM_UTF8_EXT">ATOM_UTF8_EXT</seealso>. + </p> + <p> + For more information on encoding of atoms, see + <seealso marker="#utf8_atoms">note on UTF-8 encoded atoms</seealso> + in the beginning of this document. + </p> + </section> </chapter> diff --git a/erts/doc/src/erlang.xml b/erts/doc/src/erlang.xml index 5002c48ca1..0077c0096c 100644 --- a/erts/doc/src/erlang.xml +++ b/erts/doc/src/erlang.xml @@ -277,7 +277,9 @@ the binary contains Unicode characters greater than 16#FF. In a future release, such Unicode characters might be allowed and <c>binary_to_atom(<anno>Binary</anno>, utf8)</c> - will not fail in that case.</p></note> + will not fail in that case. For more information on Unicode support in atoms + see <seealso marker="erl_ext_dist#utf8_atoms">note on UTF-8 encoded atoms</seealso> + in the chapter about the external term format in the ERTS User's Guide.</p></note> <pre> > <input>binary_to_atom(<<"Erlang">>, latin1).</input> @@ -1647,9 +1649,11 @@ os_prompt% </pre> <desc> <p>Returns the atom whose text representation is <c><anno>String</anno></c>.</p> <p><c><anno>String</anno></c> may only contain ISO-latin-1 - characterns (i.e. numbers below 256) as the current + characters (i.e. numbers below 256) as the current implementation does not allow unicode characters >= 256 in - atoms.</p> + atoms. For more information on Unicode support in atoms + see <seealso marker="erl_ext_dist#utf8_atoms">note on UTF-8 encoded atoms</seealso> + in the chapter about the external term format in the ERTS User's Guide.</p> <pre> > <input>list_to_atom("Erlang").</input> 'Erlang'</pre> |