From b15688d40d5147c1122aaad3b82495fbbc4dede8 Mon Sep 17 00:00:00 2001
From: Rickard Green
Date: Sat, 19 Jan 2013 00:45:16 +0100
Subject: UTF-8 atom documentation

---
 erts/doc/src/erl_dist_protocol.xml | 288 ++++++++++++++++++++++++++++++++++++-
 erts/doc/src/erl_ext_dist.xml | 121 ++++++++++++++--
 erts/doc/src/erlang.xml | 10 +-
 3 files changed, 397 insertions(+), 22 deletions(-)

(limited to 'erts/doc')

diff --git a/erts/doc/src/erl_dist_protocol.xml b/erts/doc/src/erl_dist_protocol.xml
index 6c725fc82d..0252187be5 100644
--- a/erts/doc/src/erl_dist_protocol.xml
+++ b/erts/doc/src/erl_dist_protocol.xml
@@ -547,13 +547,289 @@ If Result > 0, the packet only consists of [119, Result].
--> - +
- Handshake -

- The handshake is discussed in detail in the internal documentation for - the kernel (Erlang) application. -

+ Distribution Handshake +

+ This section describes the distribution handshake protocol + introduced in the OTP-R6 release of Erlang/OTP. This + description was previously located in + $ERL_TOP/lib/kernel/internal_doc/distribution_handshake.txt, + and has been copied here with only minor reformatting. The text has been + largely unchanged since 1999, but the handshake + should not have changed much since then either. +

+
+ General +

+ The TCP/IP distribution uses a handshake which expects a + connection based protocol, i.e. the protocol does not include + any authentication after the handshake procedure. +

+

+ This is not entirely safe, as it is vulnerable to takeover + attacks, but it is a tradeoff between reasonable safety and performance. +

+

+ The cookies are never sent in cleartext and the handshake procedure + expects the client (called A) to be the first one to prove that it can + generate a valid digest. The digest is generated with the + MD5 message digest algorithm and the challenges are expected to be + unpredictable random numbers. +

+
+
+ Definitions +

+ A challenge is a 32 bit integer number in big endian order. Below the function + gen_challenge() returns a random 32 bit integer used as a challenge. +

+

+ A digest is a (16 byte) MD5 hash of the cookie (as text) concatenated + with the Challenge (as text). Below, the function gen_digest(Challenge, Cookie) + generates a digest as described above. +
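The two definitions above can be sketched in Python roughly as follows. The function names mirror gen_challenge() and gen_digest() from the text; everything else is illustrative. The concatenation order used here (cookie text first, then the challenge as a decimal string) and the Latin1 encoding of the cookie are assumptions of this sketch and should be checked against the node you are talking to.

```python
import hashlib
import secrets


def gen_challenge() -> int:
    """Return a random 32 bit unsigned integer to use as a challenge."""
    return secrets.randbits(32)


def gen_digest(challenge: int, cookie: str) -> bytes:
    """Return the 16 byte MD5 digest of a cookie and a challenge.

    Assumption: the cookie text comes first, followed by the challenge
    written as a decimal string.
    """
    data = cookie.encode("latin-1") + str(challenge).encode("ascii")
    return hashlib.md5(data).digest()
```

The digest only proves knowledge of the cookie; it does not protect the traffic that follows the handshake.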

+

+ An out_cookie is the cookie used in outgoing communication to a certain node, + so that A's out_cookie for B should correspond with B's in_cookie for A and + the other way around. A's out_cookie for B and A's in_cookie for B need NOT + be the same. Below the function out_cookie(Node) returns the current + node's out_cookie for Node. +

+

+ An in_cookie is the cookie expected to be used by another node when + communicating with us, so that A's in_cookie for B corresponds with B's + out_cookie for A. Below the function in_cookie(Node) returns the current + node's in_cookie for Node. +

+

+ The cookies are text strings that can be viewed as passwords. +

+

+ Every message in the handshake starts with a 16 bit big endian integer + which contains the length of the message (not counting the two initial bytes). + In Erlang this corresponds to the gen_tcp option {packet, 2}. Note that after + the handshake, the distribution switches to 4 byte packet headers. +
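For readers implementing the handshake outside Erlang, the 2 byte framing described above might look like this in Python. The helper names frame and unframe are hypothetical.

```python
import struct
from typing import Tuple


def frame(payload: bytes) -> bytes:
    """Prefix a handshake message with its 16 bit big endian length."""
    if len(payload) > 0xFFFF:
        raise ValueError("handshake message too long for a 2 byte header")
    return struct.pack(">H", len(payload)) + payload


def unframe(data: bytes) -> Tuple[bytes, bytes]:
    """Split one framed message off the front of data.

    Returns (payload, rest). Raises ValueError if data is incomplete.
    """
    if len(data) < 2:
        raise ValueError("incomplete length header")
    (length,) = struct.unpack(">H", data[:2])
    if len(data) < 2 + length:
        raise ValueError("incomplete payload")
    return data[2:2 + length], data[2 + length:]
```

The length counts only the payload, so a 3 byte status message such as "sok" is sent as 5 bytes on the wire.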

+ +
+
+ The Handshake in Detail +

+ Imagine two nodes, node A, which initiates the handshake and node B, which + accepts the connection. +

+ + 1) connect/accept +

A connects to B via TCP/IP and B accepts the connection.

+ 2) send_name/receive_name +

A sends an initial identification to B. B receives the message. + The message looks like this (every "square" being one byte and the packet + header removed): +

+
++---+--------+--------+-----+-----+-----+-----+-----+-----+-...-+-----+
+|'n'|Version0|Version1|Flag0|Flag1|Flag2|Flag3|Name0|Name1| ... |NameN|
++---+--------+--------+-----+-----+-----+-----+-----+-----+-... +-----+
+
+

+ The 'n' is just a message tag. + Version0 and Version1 form the distribution version selected by node A, + based on information from EPMD (16 bit big endian). + Flag0 ... Flag3 are capability flags; the capabilities are defined in + $ERL_TOP/lib/kernel/include/dist.hrl + (32 bit big endian). + Name0 ... NameN is the full nodename of A, as a string of bytes (the + packet length denotes how long it is). +
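A minimal sketch of building and parsing the name message body (packet header removed) is shown below. The function names are hypothetical, the version value used in the usage note is only an example, and treating the node name as Latin1 text is an assumption; the description above only says "a string of bytes".

```python
import struct
from typing import Tuple


def encode_name(version: int, flags: int, node_name: str) -> bytes:
    """Build the 'n' message: tag, 16 bit version, 32 bit flags, name."""
    return b"n" + struct.pack(">HI", version, flags) + node_name.encode("latin-1")


def decode_name(body: bytes) -> Tuple[int, int, str]:
    """Parse an 'n' message body into (version, flags, node_name)."""
    if body[:1] != b"n" or len(body) < 7:
        raise ValueError("not a name message")
    version, flags = struct.unpack(">HI", body[1:7])
    # The name has no terminator; it runs to the end of the packet.
    return version, flags, body[7:].decode("latin-1")
```

For example, encode_name(5, 0x104, "a@host") yields a 13 byte body whose last 6 bytes are the node name.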

+ 3) recv_status/send_status +

B sends a status message to A, which indicates + if the connection is allowed. The following status codes are defined:

+
+ ok
+   The handshake will continue.
+ ok_simultaneous
+   The handshake will continue, but A is informed that B
+   has another ongoing connection attempt that will be
+   shut down (simultaneous connect where A's name is
+   greater than B's name, compared literally).
+ nok
+   The handshake will not continue, as B already has an ongoing handshake
+   which it itself has initiated (simultaneous connect where B's name is
+   greater than A's).
+ not_allowed
+   The connection is disallowed for some (unspecified) security
+   reason.
+ alive
+   A connection to the node is already active, which either means
+   that node A is confused or that the TCP connection breakdown
+   of a previous node with this name has not yet reached node B.
+   See 3B below.
+
+

This is the format of the status message:

+
++---+-------+-------+-...-+-------+
+|'s'|Status0|Status1| ... |StatusN|
++---+-------+-------+-...-+-------+
+
+

+ 's' is the message tag and Status0 ... StatusN is the status as a string (not terminated). +

+
+ 3B) send_status/recv_status +

If the status was 'alive', node A will answer with + another status message containing either 'true', which means that the + connection should continue (the old connection from this node is broken), or + 'false', which simply means that the connection should be closed because the + connection attempt was a mistake.
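Steps 3 and 3B can be sketched as follows. The helper names and the action strings returned by next_action are hypothetical, and closing the connection on an unrecognised status is an assumption of this sketch rather than something stated above.

```python
def encode_status(status: str) -> bytes:
    """Build an 's' message body; the status string is not terminated."""
    return b"s" + status.encode("ascii")


def decode_status(body: bytes) -> str:
    """Extract the status string from an 's' message body."""
    if body[:1] != b"s":
        raise ValueError("not a status message")
    return body[1:].decode("ascii")


def alive_reply(old_connection_broken: bool) -> bytes:
    """Step 3B: A's answer to an 'alive' status."""
    return encode_status("true" if old_connection_broken else "false")


def next_action(status: str) -> str:
    """Decide what node A does after receiving B's status."""
    if status in ("ok", "ok_simultaneous"):
        return "await_challenge"
    if status == "alive":
        return "send_alive_reply"
    # nok, not_allowed and (by assumption) anything unknown end the attempt.
    return "close"
```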

+ 4) recv_challenge/send_challenge +

If the status was ok or ok_simultaneous, + the handshake continues with B sending A another message, the challenge. + The challenge contains the same type of information as the "name" message + initially sent from A to B, with the addition of a 32 bit challenge:

+
++---+--------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-...-+-----+
+|'n'|Version0|Version1|Flag0|Flag1|Flag2|Flag3|Chal0|Chal1|Chal2|Chal3|Name0|Name1| ... |NameN|
++---+--------+--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-... +-----+
+
+

+ Where Chal0 ... Chal3 is the challenge as a 32 bit big endian integer + and the other fields are B's version, flags and full nodename. +
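The challenge message differs from the name message only by the extra 32 bit field, which the following sketch makes explicit. Function names are hypothetical and the Latin1 treatment of the node name is an assumption.

```python
import struct
from typing import Tuple


def encode_challenge(version: int, flags: int, challenge: int, node_name: str) -> bytes:
    """Build B's challenge: tag, version, flags, 32 bit challenge, name."""
    header = struct.pack(">HII", version, flags, challenge)
    return b"n" + header + node_name.encode("latin-1")


def decode_challenge(body: bytes) -> Tuple[int, int, int, str]:
    """Parse a challenge body into (version, flags, challenge, node_name)."""
    if body[:1] != b"n" or len(body) < 11:
        raise ValueError("not a challenge message")
    version, flags, challenge = struct.unpack(">HII", body[1:11])
    return version, flags, challenge, body[11:].decode("latin-1")
```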

+ 5) send_challenge_reply/recv_challenge_reply +

Now A has generated a digest and its own challenge. Those are + sent together in a packet to B:

+
++---+-----+-----+-----+-----+-----+-----+-----+-----+-...-+------+
+|'r'|Chal0|Chal1|Chal2|Chal3|Dige0|Dige1|Dige2|Dige3| ... |Dige15|
++---+-----+-----+-----+-----+-----+-----+-----+-----+-...-+------+
+
+

+ Where 'r' is the tag, Chal0 ... Chal3 is A's challenge for B to handle and + Dige0 ... Dige15 is the digest that A constructed from the challenge B sent + in the previous step. +
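As an illustration of the reply layout, the sketch below packs and unpacks the 21 byte body (1 tag byte, 4 challenge bytes, 16 digest bytes). The function names are hypothetical; the digest is taken as an opaque 16 byte value.

```python
import struct
from typing import Tuple


def encode_challenge_reply(own_challenge: int, digest: bytes) -> bytes:
    """Build A's reply: tag 'r', A's 32 bit challenge, 16 byte digest."""
    if len(digest) != 16:
        raise ValueError("an MD5 digest is exactly 16 bytes")
    return b"r" + struct.pack(">I", own_challenge) + digest


def decode_challenge_reply(body: bytes) -> Tuple[int, bytes]:
    """Parse a reply body into (A's challenge, A's digest)."""
    if body[:1] != b"r" or len(body) != 21:
        raise ValueError("not a challenge reply")
    (challenge,) = struct.unpack(">I", body[1:5])
    return challenge, body[5:]
```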

+ 6) recv_challenge_ack/send_challenge_ack +

B checks that the digest received from A is correct and generates a + digest from the challenge received from A. The digest is then sent to A. The + message looks like this:

+
++---+-----+-----+-----+-----+-...-+------+
+|'a'|Dige0|Dige1|Dige2|Dige3| ... |Dige15|
++---+-----+-----+-----+-----+-...-+------+
+
+

+ Where 'a' is the tag and Dige0 ... Dige15 is the digest calculated by B + for A's challenge.
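A's final check in step 7 could be sketched as below. The names are hypothetical, the cookie-then-challenge digest order is an assumption, and the constant-time comparison via hmac.compare_digest is a design choice of this sketch, not something the protocol description requires.

```python
import hashlib
import hmac


def gen_digest(challenge: int, cookie: str) -> bytes:
    """MD5 of cookie text followed by the decimal challenge (assumed order)."""
    return hashlib.md5(cookie.encode("latin-1") + str(challenge).encode("ascii")).digest()


def encode_challenge_ack(digest: bytes) -> bytes:
    """Build B's ack: tag 'a' followed by the 16 byte digest."""
    return b"a" + digest


def check_challenge_ack(body: bytes, own_challenge: int, in_cookie: str) -> bool:
    """Return True if B's digest matches A's challenge and in_cookie."""
    if body[:1] != b"a" or len(body) != 17:
        return False
    expected = gen_digest(own_challenge, in_cookie)
    return hmac.compare_digest(body[1:], expected)
```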

+ 7) +

A checks the digest from B and the connection is up.

+
+
+
+ Semigraphic View +
+A (initiator)						B (acceptor)
+
+TCP connect ----------------------------------------->	
+							TCP accept
+
+send_name   ----------------------------------------->
+							recv_name
+
+	    <----------------------------------------	send_status
+recv_status
+(if status was 'alive'
+ send_status - - - - - - - - - - - - - - - - - - - ->
+							recv_status)
+							ChB = gen_challenge()
+		          (ChB)
+	    <----------------------------------------	send_challenge
+recv_challenge
+
+ChA = gen_challenge(),
+OCA = out_cookie(B),
+DiA = gen_digest(ChB,OCA)
+			  (ChA, DiA)
+send_challenge_reply -------------------------------->
+							recv_challenge_reply
+							ICB = in_cookie(A),
+							check:
+							DiA == gen_digest
+								(ChB, ICB) ?
+							- if OK:
+	    						 OCB = out_cookie(A),
+							 DiB = gen_digest
+			(DiB)					(ChA, OCB)
+	    <-----------------------------------------	 send_challenge_ack
+recv_challenge_ack					 DONE
+ICA = in_cookie(B),                                     - else
+check:                                                   CLOSE
+DiB == gen_digest(ChA,ICA) ?
+- if OK
+ DONE
+- else
+ CLOSE
+
+
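The cookie checks in the diagram above can be simulated in memory, without any networking, to see when the handshake ends in DONE and when it ends in CLOSE. This is only a sketch: mutual_auth and its parameter names are hypothetical, and the digest order (cookie, then decimal challenge) is an assumption.

```python
import hashlib
import secrets


def gen_digest(challenge: int, cookie: str) -> bytes:
    """MD5 of cookie text followed by the decimal challenge (assumed order)."""
    return hashlib.md5(cookie.encode("latin-1") + str(challenge).encode("ascii")).digest()


def mutual_auth(a_out: str, a_in: str, b_out: str, b_in: str) -> bool:
    """Run both digest checks; True means DONE on both sides."""
    ch_b = secrets.randbits(32)           # B: ChB = gen_challenge()
    ch_a = secrets.randbits(32)           # A: ChA = gen_challenge()
    di_a = gen_digest(ch_b, a_out)        # A: DiA = gen_digest(ChB, OCA)
    if di_a != gen_digest(ch_b, b_in):    # B: DiA == gen_digest(ChB, ICB) ?
        return False                      # B: CLOSE
    di_b = gen_digest(ch_a, b_out)        # B: DiB = gen_digest(ChA, OCB)
    return di_b == gen_digest(ch_a, a_in)  # A: DiB == gen_digest(ChA, ICA) ?
```

The handshake succeeds only when A's out_cookie equals B's in_cookie and B's out_cookie equals A's in_cookie; the two pairs do not have to be equal to each other.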
+ +
+ The Currently Defined Distribution Flags +

+ Currently (OTP-R16) the following capability flags are defined: +

+
+%% The node should be published and part of the global namespace
+-define(DFLAG_PUBLISHED,1).
+
+%% The node implements an atom cache (obsolete)
+-define(DFLAG_ATOM_CACHE,2).
+
+%% The node implements extended (3 * 32 bits) references. This is
+%% required today. If not present connection will be refused.
+-define(DFLAG_EXTENDED_REFERENCES,4).
+
+%% The node implements distributed process monitoring.
+-define(DFLAG_DIST_MONITOR,8).
+
+%% The node uses a separate tag for funs (lambdas) in the distribution protocol.
+-define(DFLAG_FUN_TAGS,16#10).
+
+%% The node implements distributed named process monitoring.
+-define(DFLAG_DIST_MONITOR_NAME,16#20).
+
+%% The (hidden) node implements atom cache (obsolete)
+-define(DFLAG_HIDDEN_ATOM_CACHE,16#40).
+
+%% The node understands new fun-tags
+-define(DFLAG_NEW_FUN_TAGS,16#80).
+
+%% The node is capable of handling extended pids and ports. This is
+%% required today. If not present connection will be refused.
+-define(DFLAG_EXTENDED_PIDS_PORTS,16#100).
+
+%%
+-define(DFLAG_EXPORT_PTR_TAG,16#200).
+
+%%
+-define(DFLAG_BIT_BINARIES,16#400).
+
+%% The node understands new float format
+-define(DFLAG_NEW_FLOATS,16#800).
+
+%%
+-define(DFLAG_UNICODE_IO,16#1000).
+
+%% The node implements atom cache in distribution header.
+-define(DFLAG_DIST_HDR_ATOM_CACHE,16#2000).
+
+%% The node understands the SMALL_ATOM_EXT tag
+-define(DFLAG_SMALL_ATOM_TAGS, 16#4000).
+
+%% The node understands UTF-8 encoded atoms
+-define(DFLAG_UTF8_ATOMS, 16#10000).
+
+
+
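Since the flags are single bits in a 32 bit word, a received flag field can be decoded with simple masking. The sketch below copies the values from the list above; the helper names are hypothetical, and treating exactly the two flags commented as "required today" as mandatory is this sketch's reading of those comments.

```python
from typing import List

DFLAGS = {
    "DFLAG_PUBLISHED": 0x1,
    "DFLAG_ATOM_CACHE": 0x2,
    "DFLAG_EXTENDED_REFERENCES": 0x4,
    "DFLAG_DIST_MONITOR": 0x8,
    "DFLAG_FUN_TAGS": 0x10,
    "DFLAG_DIST_MONITOR_NAME": 0x20,
    "DFLAG_HIDDEN_ATOM_CACHE": 0x40,
    "DFLAG_NEW_FUN_TAGS": 0x80,
    "DFLAG_EXTENDED_PIDS_PORTS": 0x100,
    "DFLAG_EXPORT_PTR_TAG": 0x200,
    "DFLAG_BIT_BINARIES": 0x400,
    "DFLAG_NEW_FLOATS": 0x800,
    "DFLAG_UNICODE_IO": 0x1000,
    "DFLAG_DIST_HDR_ATOM_CACHE": 0x2000,
    "DFLAG_SMALL_ATOM_TAGS": 0x4000,
    "DFLAG_UTF8_ATOMS": 0x10000,
}

# Flags whose comments say the connection is refused without them.
REQUIRED = ("DFLAG_EXTENDED_REFERENCES", "DFLAG_EXTENDED_PIDS_PORTS")


def flag_names(flags: int) -> List[str]:
    """List the names of all known flags set in a 32 bit flag word."""
    return [name for name, bit in DFLAGS.items() if flags & bit]


def missing_required(flags: int) -> List[str]:
    """List required flags that the peer did not announce."""
    return [name for name in REQUIRED if not flags & DFLAGS[name]]
```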
diff --git a/erts/doc/src/erl_ext_dist.xml b/erts/doc/src/erl_ext_dist.xml
index fd2da2cfe3..28afea8b29 100644
--- a/erts/doc/src/erl_ext_dist.xml
+++ b/erts/doc/src/erl_ext_dist.xml
@@ -119,10 +119,39 @@ Data
+
+
+

As of ERTS version 5.10 (OTP-R16) support + for UTF-8 encoded atoms has been introduced in the external format. + However, only characters that can be encoded using Latin1 (ISO-8859-1) + are currently supported in atoms. The support for UTF-8 encoded atoms + in the external format has been implemented in order to be able to support + all Unicode characters in atoms in some future release. Full + support for Unicode atoms will not happen before OTP-R18, and might + be introduced even later than that. Until full Unicode support for + atoms has been introduced, it is an error to pass atoms containing + characters that cannot be encoded in Latin1, and the behavior is + undefined.

+

When the + DFLAG_UTF8_ATOMS + distribution flag has been exchanged between both nodes in the + distribution handshake, + all atoms in the distribution header will be encoded in UTF-8; otherwise, + all atoms in the distribution header will be encoded in Latin1. The two + new tags ATOM_UTF8_EXT and + SMALL_ATOM_UTF8_EXT + will only be used if the DFLAG_UTF8_ATOMS distribution flag has + been exchanged between nodes, or if an atom containing characters + that cannot be encoded in Latin1 is encountered. +

+

The maximum number of allowed characters in an atom is 255. In the + UTF-8 case each character may need up to 4 bytes to be encoded. +
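The rules in this note can be illustrated with a small sketch: the encoding of atoms in the distribution header follows from whether both nodes announced DFLAG_UTF8_ATOMS, and with at most 255 characters of up to 4 bytes each, an encoded atom can occupy up to 1020 bytes. The helper names are hypothetical, and rejecting non-Latin1 characters up front is this sketch's way of avoiding the undefined behavior mentioned above.

```python
DFLAG_UTF8_ATOMS = 0x10000
MAX_ATOM_CHARS = 255


def header_atom_encoding(local_flags: int, remote_flags: int) -> str:
    """Pick the atom encoding for the distribution header."""
    both = local_flags & remote_flags & DFLAG_UTF8_ATOMS
    return "utf-8" if both else "latin-1"


def encode_header_atom(name: str, local_flags: int, remote_flags: int) -> bytes:
    """Encode atom text for the distribution header."""
    if len(name) > MAX_ATOM_CHARS:
        raise ValueError("atoms may contain at most 255 characters")
    # Only Latin1 characters are currently supported in atoms;
    # this raises UnicodeEncodeError for anything else.
    name.encode("latin-1")
    return name.encode(header_atom_encoding(local_flags, remote_flags))
```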

+
-
- + +
Distribution header

As of erts version 5.7.2 the old atom cache protocol was
@@ -219,8 +248,7 @@

The least significant bit in that half byte is the LongAtoms flag. If it is set, 2 bytes are used for atom lengths instead of - 1 byte in the distribution header. However, the current emulator - cannot handle long atoms, so it will currently always be 0. + 1 byte in the distribution header.

After the Flags field follow the AtomCacheRefs. The
@@ -247,15 +275,25 @@

InternalSegmentIndex together with the SegmentIndex completely identify the location of an atom cache entry in the - atom cache. Length is number of one byte characters that - the atom text consists of. Length is a two byte big endian integer + atom cache. Length is the number of bytes that AtomText + consists of. Length is a two byte big endian integer if the LongAtoms flag has been set, otherwise a one byte - integer. Subsequent CachedAtomRefs with the same + integer. When the + DFLAG_UTF8_ATOMS + distribution flag has been exchanged between both nodes in the + distribution handshake, + characters in AtomText are encoded in UTF-8; otherwise, + they are encoded in Latin1. Subsequent CachedAtomRefs with the same SegmentIndex and InternalSegmentIndex as this NewAtomCacheRef will refer to this atom until a new NewAtomCacheRef with the same SegmentIndex and InternalSegmentIndex appear.

+

+ For more information on encoding of atoms, see + note on UTF-8 encoded atoms + in the beginning of this document. +

If the NewCacheEntryFlag for the next AtomCacheRef has not been set, a CachedAtomRef on the following format
@@ -383,9 +421,9 @@

An atom is stored with a 2 byte unsigned length in big-endian order, - followed by Len numbers of 8 bit characters that forms the - AtomName. - Note: The maximum allowed value for Len is 255. + followed by Len numbers of 8 bit Latin1 characters that form + the AtomName. + Note: The maximum allowed value for Len is 255.

@@ -754,12 +792,14 @@

An atom is stored with a 1 byte unsigned length, - followed by Len numbers of 8 bit characters that + followed by Len numbers of 8 bit Latin1 characters that forms the AtomName. Longer atoms can be represented by ATOM_EXT. Note the SMALL_ATOM_EXT was introduced in erts version 5.7.2 and - require a small atom distribution flag exchanged in the distribution - handshake. + requires an exchange of the + DFLAG_SMALL_ATOM_TAGS + distribution flag in the + distribution handshake.

@@ -1007,7 +1047,62 @@ This term is used in minor version 1 of the external format.

+
+
+ ATOM_UTF8_EXT
+
++---+---+--------+
+| 1 | 2 |  Len   |
++---+---+--------+
+|118|Len|AtomName|
++---+---+--------+
+
+

+ An atom is stored with a 2 byte unsigned length in big-endian order, + followed by Len bytes containing the AtomName encoded + in UTF-8. +

+

+ For more information on encoding of atoms, see + note on UTF-8 encoded atoms + in the beginning of this document. +

+
+
+
+ SMALL_ATOM_UTF8_EXT
+
++---+---+--------+
+| 1 | 1 |  Len   |
++---+---+--------+
+|119|Len|AtomName|
++---+---+--------+
+
+

+ An atom is stored with a 1 byte unsigned length, + followed by Len bytes containing the AtomName encoded + in UTF-8. Longer atoms encoded in UTF-8 can be represented using + ATOM_UTF8_EXT. +

+

+ For more information on encoding of atoms, see + note on UTF-8 encoded atoms + in the beginning of this document. +
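The two UTF-8 atom tags described above differ only in the width of the length field. The sketch below encodes with the small tag whenever the UTF-8 byte length fits in one byte and falls back to the 2 byte form otherwise; that selection policy and the function names are assumptions of this sketch, not a rule stated in the text.

```python
import struct
from typing import Tuple

ATOM_UTF8_EXT = 118
SMALL_ATOM_UTF8_EXT = 119


def encode_atom_utf8(name: str) -> bytes:
    """Encode an atom using SMALL_ATOM_UTF8_EXT or ATOM_UTF8_EXT."""
    if len(name) > 255:
        raise ValueError("atoms may contain at most 255 characters")
    data = name.encode("utf-8")
    if len(data) <= 255:
        return bytes([SMALL_ATOM_UTF8_EXT, len(data)]) + data
    return struct.pack(">BH", ATOM_UTF8_EXT, len(data)) + data


def decode_atom_utf8(buf: bytes) -> Tuple[str, bytes]:
    """Decode one UTF-8 atom term; returns (atom text, remaining bytes)."""
    tag = buf[0]
    if tag == SMALL_ATOM_UTF8_EXT:
        length, start = buf[1], 2
    elif tag == ATOM_UTF8_EXT:
        (length,) = struct.unpack(">H", buf[1:3])
        start = 3
    else:
        raise ValueError("not a UTF-8 atom tag")
    end = start + length
    return buf[start:end].decode("utf-8"), buf[end:]
```

Note that Len counts bytes, not characters, so an atom of 200 two-byte characters needs the 2 byte length form even though it is well under the 255 character limit.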

+
diff --git a/erts/doc/src/erlang.xml b/erts/doc/src/erlang.xml
index 5002c48ca1..0077c0096c 100644
--- a/erts/doc/src/erlang.xml
+++ b/erts/doc/src/erlang.xml
@@ -277,7 +277,9 @@
the binary contains Unicode characters greater than 16#FF. In a future release, such Unicode characters might be allowed and binary_to_atom(Binary, utf8)
- will not fail in that case.

+ will not fail in that case. For more information on Unicode support in atoms + see note on UTF-8 encoded atoms + in the chapter about the external term format in the ERTS User's Guide.

 > binary_to_atom(<<"Erlang">>, latin1).
@@ -1647,9 +1649,11 @@ os_prompt% 

Returns the atom whose text representation is String.

String may only contain ISO-latin-1 - characterns (i.e. numbers below 256) as the current + characters (i.e. numbers below 256) as the current implementation does not allow unicode characters >= 256 in - atoms.

+ atoms. For more information on Unicode support in atoms + see note on UTF-8 encoded atoms + in the chapter about the external term format in the ERTS User's Guide.

 > list_to_atom("Erlang").
 'Erlang'