2007
2014
Ericsson AB, All Rights Reserved
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The Initial Developer of the Original Code is Ericsson AB.
External Term Format
Kenneth
2007-09-21
PA1
erl_ext_dist.xml
Introduction
The external term format is mainly used in the distribution
mechanism of Erlang.
Since Erlang has a fixed number of types, there is no need for a
programmer to define a specification for the external format used
within some application.
All Erlang terms have an external representation, and the interpretation
of the different terms is application specific.
In Erlang the BIF term_to_binary/1,2 is used to convert a
term into the external format.
To convert binary data encoding a term back into a term, the BIF
binary_to_term/1 is used.
The distribution does this implicitly when sending messages across
node boundaries.
The overall format of an encoded term is:
1   | 1   | N
131 | Tag | Data
When messages are passed between connected nodes and a distribution
header is used, the first byte containing the version
number (131) is omitted from the terms that follow the distribution
header. This is because the version number is implied by the version
number in the distribution header.
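As a rough illustration of the paragraph above, the following Python sketch splits an encoded term into its tag byte and the remaining data. The function name `split_term` and the `has_version` parameter are hypothetical; the sketch assumes the one-byte tag directly follows the version byte, and it does not interpret the data.

```python
VERSION = 131  # version number carried in the first byte


def split_term(buf: bytes, has_version: bool = True):
    """Split an encoded term into (tag, data).

    has_version=False models a term that follows a distribution
    header, where the leading version byte (131) is omitted.
    """
    if has_version:
        if not buf or buf[0] != VERSION:
            raise ValueError("expected version byte 131")
        buf = buf[1:]
    if not buf:
        raise ValueError("missing tag byte")
    return buf[0], buf[1:]
```

For a term sent after a distribution header, a caller would pass `has_version=False`.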
A compressed term looks like this:
1   | 1  | 4                | N
131 | 80 | UncompressedSize | Zlib-compressedData
Uncompressed Size (unsigned 32 bit integer in big-endian byte order)
is the size of the data before it was compressed.
The compressed data has the following format when it has been
expanded:
1   | Uncompressed Size
Tag | Data
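The compressed layout can be sketched in Python with the standard `zlib` module. The function names `compress_term` and `expand_term` are hypothetical, and the sketch assumes that Uncompressed Size counts the Tag and Data bytes without the leading version byte.

```python
import struct
import zlib

HEADER = bytes([131, 80])  # version byte followed by the compressed-term tag


def compress_term(encoded: bytes) -> bytes:
    """Wrap an encoded term (starting with 131) in the compressed form."""
    if encoded[:1] != bytes([131]):
        raise ValueError("expected version byte 131")
    body = encoded[1:]  # Tag + Data
    return HEADER + struct.pack(">I", len(body)) + zlib.compress(body)


def expand_term(buf: bytes) -> bytes:
    """Undo compress_term and return the plain encoded term."""
    if buf[:2] != HEADER:
        raise ValueError("not a compressed term")
    (size,) = struct.unpack(">I", buf[2:6])  # unsigned 32 bit, big-endian
    body = zlib.decompress(buf[6:])
    if len(body) != size:
        raise ValueError("UncompressedSize does not match expanded data")
    return bytes([131]) + body
```

Checking the expanded length against UncompressedSize is a cheap way to reject truncated or corrupted input before decoding the term itself.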
As of ERTS version 5.10 (OTP-R16) support
for UTF-8 encoded atoms has been introduced in the external format.
However, only characters that can be encoded using Latin1 (ISO-8859-1)
are currently supported in atoms. The support for UTF-8 encoded atoms
in the external format has been implemented in order to be able to support
all Unicode characters in atoms in some future release.
Until full Unicode support for
atoms has been introduced, it is an error to pass atoms containing
characters that cannot be encoded in Latin1, and the behavior is
undefined.
When the DFLAG_UTF8_ATOMS distribution flag has been exchanged between
both nodes in the distribution handshake,
all atoms in the distribution header will be encoded in UTF-8; otherwise,
all atoms in the distribution header will be encoded in Latin1. The two
new tags ATOM_UTF8_EXT and SMALL_ATOM_UTF8_EXT
will only be used if the DFLAG_UTF8_ATOMS distribution flag has
been exchanged between nodes, or if an atom containing characters
that cannot be encoded in Latin1 is encountered.
The maximum number of allowed characters in an atom is 255. In the
UTF-8 case each character may need 4 bytes to be encoded.
Distribution header
As of ERTS version 5.7.2 the old atom cache protocol was
dropped and a new one was introduced. This atom cache protocol
introduced the distribution header. Nodes with ERTS versions
earlier than 5.7.2 can still communicate with new nodes,
but no distribution header and no atom cache will be used.
The distribution header currently only contains an atom cache
reference section, but could in the future contain more
information. The distribution header precedes one or more Erlang
terms in the external format. For more information, see the
documentation of the protocol between connected nodes in the
distribution protocol documentation.
ATOM_CACHE_REF entries with corresponding AtomCacheReferenceIndex in terms
encoded in the external format following a distribution header refer
to the atom cache references made in the distribution header. The range
is 0 <= AtomCacheReferenceIndex < 255, i.e., at most 255
different atom cache references from the following terms can be made.
The distribution header format is:
1   | 1  | 1                     | NumberOfAtomCacheRefs/2+1 or 0 | N or 0
131 | 68 | NumberOfAtomCacheRefs | Flags                          | AtomCacheRefs
Flags consists of NumberOfAtomCacheRefs/2+1 bytes,
unless NumberOfAtomCacheRefs is 0. If
NumberOfAtomCacheRefs is 0, Flags and
AtomCacheRefs are omitted. Each atom cache reference has
a half byte flag field. Flags corresponding to a specific
AtomCacheReferenceIndex are located in flag byte number
AtomCacheReferenceIndex/2. Flag byte 0 is the first byte
after the NumberOfAtomCacheRefs byte. Flags for an even
AtomCacheReferenceIndex are located in the least significant
half byte and flags for an odd AtomCacheReferenceIndex are
located in the most significant half byte.
The flag field of an atom cache reference has the following
format:
1 bit             | 3 bits
NewCacheEntryFlag | SegmentIndex
The most significant bit is the NewCacheEntryFlag. If set,
the corresponding cache reference is new. The three least
significant bits are the SegmentIndex of the corresponding
atom cache entry. An atom cache consists of 8 segments each of size
256, i.e., an atom cache can contain 2048 entries.
After flag fields for atom cache references, another half byte flag
field is located which has the following format:
3 bits          | 1 bit
CurrentlyUnused | LongAtoms
The least significant bit in that half byte is the LongAtoms
flag. If it is set, 2 bytes are used for atom lengths instead of
1 byte in the distribution header.
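The half-byte addressing described above can be sketched in Python as follows. The function name `parse_flags` is hypothetical, and the sketch assumes that the trailing half byte carrying LongAtoms occupies the half-byte position directly after the last atom cache reference, using the same even/odd ordering.

```python
def parse_flags(flags: bytes, num_refs: int):
    """Return ([(new_cache_entry, segment_index), ...], long_atoms)."""
    if num_refs == 0:
        return [], False  # Flags is omitted entirely in this case
    if len(flags) != num_refs // 2 + 1:
        raise ValueError("Flags must be NumberOfAtomCacheRefs/2+1 bytes")

    def half_byte(index: int) -> int:
        byte = flags[index // 2]
        # even index: least significant half, odd index: most significant half
        return (byte >> 4) & 0x0F if index % 2 else byte & 0x0F

    refs = []
    for index in range(num_refs):
        field = half_byte(index)
        refs.append((bool(field & 0x8), field & 0x7))
    long_atoms = bool(half_byte(num_refs) & 0x1)
    return refs, long_atoms
```

With three references, for example, the two flag bytes hold four half bytes: three for the references and one for the LongAtoms field.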
After the Flags field follow the AtomCacheRefs. The
first AtomCacheRef is the one corresponding to
AtomCacheReferenceIndex 0. Higher indices follow
in sequence up to index NumberOfAtomCacheRefs - 1.
If the NewCacheEntryFlag for the next AtomCacheRef has
been set, a NewAtomCacheRef in the following format will follow:
1                    | 1 or 2 | Length
InternalSegmentIndex | Length | AtomText
InternalSegmentIndex together with the SegmentIndex
completely identify the location of an atom cache entry in the
atom cache. Length is the number of bytes that AtomText
consists of. Length is a two byte big-endian integer
if the LongAtoms flag has been set, otherwise a one byte
integer. When the DFLAG_UTF8_ATOMS
distribution flag has been exchanged between both nodes in the
distribution handshake,
characters in AtomText are encoded in UTF-8; otherwise, they are
encoded in Latin1. Subsequent CachedAtomRefs with the same
SegmentIndex and InternalSegmentIndex as this
NewAtomCacheRef will refer to this atom until a new
NewAtomCacheRef with the same SegmentIndex
and InternalSegmentIndex appears.
For more information on encoding of atoms, see the
note on UTF-8 encoded atoms
in the beginning of this document.
If the NewCacheEntryFlag for the next AtomCacheRef
has not been set, a CachedAtomRef in the following format
will follow:
1
InternalSegmentIndex
InternalSegmentIndex together with the SegmentIndex
identify the location of the atom cache entry in the atom cache.
The atom corresponding to this CachedAtomRef is the
latest NewAtomCacheRef preceding this CachedAtomRef
in another previously passed distribution header.
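A minimal Python sketch of walking the AtomCacheRefs section follows. The function name `parse_atom_cache_refs` and its parameters are hypothetical: `refs` is a list of (NewCacheEntryFlag, SegmentIndex) pairs taken from the flag fields, and `cache` is a plain dictionary standing in for the receiver's atom cache, keyed by (SegmentIndex, InternalSegmentIndex).

```python
def parse_atom_cache_refs(buf: bytes, refs, long_atoms: bool,
                          utf8_atoms: bool, cache: dict):
    """Return (atoms, bytes_consumed) and update cache with new entries."""
    encoding = "utf-8" if utf8_atoms else "latin-1"
    length_size = 2 if long_atoms else 1
    atoms = []
    pos = 0
    for new_entry, segment_index in refs:
        internal_index = buf[pos]
        pos += 1
        key = (segment_index, internal_index)
        if new_entry:
            # NewAtomCacheRef: Length and AtomText follow the index byte
            length = int.from_bytes(buf[pos:pos + length_size], "big")
            pos += length_size
            cache[key] = buf[pos:pos + length].decode(encoding)
            pos += length
        # CachedAtomRef: the atom was stored by an earlier NewAtomCacheRef
        atoms.append(cache[key])
    return atoms, pos
```

The same `cache` object would be kept across distribution headers on a connection, so that a later CachedAtomRef resolves to the most recent entry stored under its key.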
ATOM_CACHE_REF
1  | 1
82 | AtomCacheReferenceIndex
Refers to the atom with AtomCacheReferenceIndex in the
distribution header.
SMALL_INTEGER_EXT
Unsigned 8 bit integer.
INTEGER_EXT
Signed 32 bit integer in big-endian format (i.e. MSB first).
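The two integer encodings can be sketched in Python with `struct`. The function name `decode_integer` is hypothetical, and the tag values 97 (SMALL_INTEGER_EXT) and 98 (INTEGER_EXT) are taken from the Erlang/OTP documentation rather than from the text above.

```python
import struct

SMALL_INTEGER_EXT = 97  # tag value assumed from the OTP documentation
INTEGER_EXT = 98        # tag value assumed from the OTP documentation


def decode_integer(buf: bytes):
    """Return (value, bytes_consumed) for the two fixed-size integer tags."""
    tag = buf[0]
    if tag == SMALL_INTEGER_EXT:
        return buf[1], 2                         # unsigned 8 bit
    if tag == INTEGER_EXT:
        (value,) = struct.unpack(">i", buf[1:5])  # signed 32 bit, MSB first
        return value, 5
    raise ValueError("not an integer tag handled by this sketch")
```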
FLOAT_EXT
A float is stored in string format. The format used in sprintf to
format the float is "%.20e"
(there are more bytes allocated than necessary).
To unpack the float, use sscanf with format "%lf".
This term is used in minor version 0 of the external format;
it has been superseded by NEW_FLOAT_EXT.
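The string-based float encoding can be sketched in Python using `%`-formatting, which follows the same conventions as sprintf. The function names are hypothetical; the tag value 99 and the fixed 31-byte, zero-padded field are assumptions taken from the Erlang/OTP documentation, not from the text above.

```python
FLOAT_EXT = 99      # tag value assumed from the OTP documentation
FIELD_SIZE = 31     # fixed field width assumed from the OTP documentation


def encode_float_ext(x: float) -> bytes:
    text = b"%.20e" % x                      # same format string as sprintf
    return bytes([FLOAT_EXT]) + text.ljust(FIELD_SIZE, b"\x00")


def decode_float_ext(buf: bytes):
    """Return (value, bytes_consumed)."""
    if buf[0] != FLOAT_EXT:
        raise ValueError("not a FLOAT_EXT term")
    field = buf[1:1 + FIELD_SIZE]
    text = field.split(b"\x00", 1)[0]        # drop the zero padding
    return float(text.decode("ascii")), 1 + FIELD_SIZE
```

The formatted text is shorter than the field, which is the "more bytes allocated than necessary" remark in practice.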
ATOM_EXT
An atom is stored with a 2 byte unsigned length in big-endian order,
followed by Len 8 bit Latin1 characters that form
the AtomName.
Note: The maximum allowed value for Len is 255.
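A short Python sketch of decoding this layout follows. The function name `decode_atom_ext` is hypothetical, and the tag value 100 is taken from the Erlang/OTP documentation rather than from the text above.

```python
ATOM_EXT = 100  # tag value assumed from the OTP documentation


def decode_atom_ext(buf: bytes):
    """Return (atom_name, bytes_consumed)."""
    if buf[0] != ATOM_EXT:
        raise ValueError("not an ATOM_EXT term")
    length = int.from_bytes(buf[1:3], "big")   # 2 byte unsigned, big-endian
    if length > 255:
        raise ValueError("Len must not exceed 255")
    name = buf[3:3 + length]
    if len(name) != length:
        raise ValueError("truncated atom")
    return name.decode("latin-1"), 3 + length
```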
REFERENCE_EXT
1   | N    | 4  | 1
101 | Node | ID | Creation
Encodes a reference object (an object generated with make_ref/0).
The Node term is an encoded atom, i.e.
ATOM_EXT, SMALL_ATOM_EXT or ATOM_CACHE_REF.
The ID field contains a big-endian unsigned integer,
but should be regarded as uninterpreted data
since this field is node specific.
Creation is a byte containing a node serial number that
makes it possible to separate old (crashed) nodes from a new one.
In ID, only 18 bits are significant; the rest should be 0.
In Creation, only 2 bits are significant; the rest should be 0.
See NEW_REFERENCE_EXT.
PORT_EXT
1   | N    | 4  | 1
102 | Node | ID | Creation
Encodes a port object (obtained from open_port/2).
The ID is a node specific identifier for a local port.
Port operations are not allowed across node boundaries.
The Creation works just like in
REFERENCE_EXT.
PID_EXT
1   | N    | 4  | 4      | 1
103 | Node | ID | Serial | Creation
Encodes a process identifier object (obtained from spawn/3 or
friends).
The ID and Creation fields work just like in
REFERENCE_EXT, while
the Serial field is used to improve safety.
In ID, only 15 bits are significant; the rest should be 0.
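The PID_EXT layout can be sketched in Python as follows. The function names are hypothetical. The sketch only accepts a Node encoded as ATOM_EXT or SMALL_ATOM_EXT (tag values 100 and 115, assumed from the Erlang/OTP documentation) and does not resolve ATOM_CACHE_REF, which needs a distribution header.

```python
import struct


def _decode_node(buf: bytes, pos: int):
    """Decode the Node atom at pos; return (name, next_pos)."""
    tag = buf[pos]
    if tag == 100:    # ATOM_EXT, 2 byte length
        length = int.from_bytes(buf[pos + 1:pos + 3], "big")
        start = pos + 3
    elif tag == 115:  # SMALL_ATOM_EXT, 1 byte length
        length = buf[pos + 1]
        start = pos + 2
    else:
        raise ValueError("node encoding not handled by this sketch")
    return buf[start:start + length].decode("latin-1"), start + length


def decode_pid_ext(buf: bytes):
    """Return (fields, bytes_consumed) for a PID_EXT term."""
    if buf[0] != 103:
        raise ValueError("not a PID_EXT term")
    node, pos = _decode_node(buf, 1)
    ident, serial, creation = struct.unpack(">IIB", buf[pos:pos + 9])
    fields = {
        "node": node,
        "id": ident,
        "serial": serial,
        "creation": creation,
        # only 15 bits of ID and 2 bits of Creation are significant
        "well_formed": ident >> 15 == 0 and creation >> 2 == 0,
    }
    return fields, pos + 9
```

The ID is returned unmasked because the text treats it as node specific data; the `well_formed` entry only reports whether the unused bits are zero.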
SMALL_TUPLE_EXT
SMALL_TUPLE_EXT encodes a tuple. The Arity
field is an unsigned byte that determines how many elements
follow in the Elements section.
LARGE_TUPLE_EXT
Same as
SMALL_TUPLE_EXT
with the exception that Arity is an
unsigned 4 byte integer in big endian format.
MAP_EXT
MAP_EXT encodes a map. The Arity field is an unsigned
4 byte integer in big endian format that determines the number of
key-value pairs in the map. Key and value pairs (Ki => Vi)
are encoded in the Pairs section in the following order:
K1, V1, K2, V2,..., Kn, Vn.
Duplicate keys are not allowed within the same map.
Since: OTP 17.0
NIL_EXT
The representation for an empty list, i.e. the Erlang syntax [].
STRING_EXT
1   | 2      | Len
107 | Length | Characters
String does NOT have a corresponding Erlang representation,
but is an optimization for sending lists of bytes (integers in
the range 0-255) more efficiently over the distribution.
Since the Length field is an unsigned 2 byte integer
(big endian), implementations must make sure that lists longer than
65535 elements are encoded as
LIST_EXT.
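The length rule can be sketched as a small Python encoder for lists of bytes. The function name `encode_byte_list` is hypothetical. The sketch assumes the LIST_EXT layout (tag 108, 4 byte length, elements, tail) and the tag values 97 for SMALL_INTEGER_EXT and 106 for NIL_EXT from the Erlang/OTP documentation.

```python
import struct

MAX_STRING_LEN = 65535  # largest value of the 2 byte Length field


def encode_byte_list(values) -> bytes:
    """Encode a list of integers in 0-255 without the version byte."""
    values = list(values)
    if any(not isinstance(v, int) or not 0 <= v <= 255 for v in values):
        raise ValueError("only integers in the range 0-255 are supported")
    if not values:
        return bytes([106])                       # NIL_EXT, the empty list
    if len(values) <= MAX_STRING_LEN:
        # STRING_EXT: tag, 2 byte big-endian length, raw bytes
        return bytes([107]) + struct.pack(">H", len(values)) + bytes(values)
    # Too long for STRING_EXT: fall back to LIST_EXT of SMALL_INTEGER_EXT
    body = b"".join(bytes([97, v]) for v in values)
    return bytes([108]) + struct.pack(">I", len(values)) + body + bytes([106])
```

The fallback roughly doubles the encoded size, since each element then carries its own tag byte.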
LIST_EXT
1   | 4      |          |
108 | Length | Elements | Tail
Length is the number of elements that follow in the
Elements section. Tail is the final tail of
the list; it is
NIL_EXT
for a proper list, but may be any type if the list is
improper (for instance [a|b]).
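The container terms described in this and the preceding sections nest, so a decoder is naturally recursive. The Python sketch below handles only small integers, tuples, maps, the empty list, strings, and proper lists. The function name `decode` is hypothetical, the tag values (97, 98, 104, 105, 106, 116) are assumed from the Erlang/OTP documentation, and map keys are limited to values Python can hash.

```python
import struct


def decode(buf: bytes, pos: int = 0):
    """Decode one term at pos (no version byte); return (value, next_pos)."""
    tag = buf[pos]
    pos += 1
    if tag == 97:                                  # SMALL_INTEGER_EXT
        return buf[pos], pos + 1
    if tag == 98:                                  # INTEGER_EXT
        return struct.unpack_from(">i", buf, pos)[0], pos + 4
    if tag == 106:                                 # NIL_EXT
        return [], pos
    if tag == 107:                                 # STRING_EXT
        (length,) = struct.unpack_from(">H", buf, pos)
        pos += 2
        return list(buf[pos:pos + length]), pos + length
    if tag in (104, 105):                          # SMALL/LARGE_TUPLE_EXT
        if tag == 104:
            arity, pos = buf[pos], pos + 1
        else:
            arity, pos = struct.unpack_from(">I", buf, pos)[0], pos + 4
        items = []
        for _ in range(arity):
            item, pos = decode(buf, pos)
            items.append(item)
        return tuple(items), pos
    if tag == 108:                                 # LIST_EXT
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        items = []
        for _ in range(length):
            item, pos = decode(buf, pos)
            items.append(item)
        tail, pos = decode(buf, pos)
        if tail != []:
            raise ValueError("improper lists are not handled by this sketch")
        return items, pos
    if tag == 116:                                 # MAP_EXT
        (arity,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        result = {}
        for _ in range(arity):
            key, pos = decode(buf, pos)
            value, pos = decode(buf, pos)
            if key in result:
                raise ValueError("duplicate keys are not allowed")
            result[key] = value
        return result, pos
    raise ValueError("tag %d not handled by this sketch" % tag)
```

Returning the next position from every call lets the caller continue with the following element without knowing the size of the one just decoded.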
BINARY_EXT
Binaries are generated with bit syntax expressions or with
list_to_binary/1,
term_to_binary/1,
or as input from binary ports.
The Len length field is an unsigned 4 byte integer
(big endian).
SMALL_BIG_EXT
1   | 1 | 1    | n
110 | n | Sign | d(0) ... d(n-1)
Bignums are stored in sign-magnitude form with a Sign byte
that is 0 if the bignum is positive and 1 if it is negative. The
digits of the magnitude are stored with the least significant byte
first. To calculate the magnitude, the following formula can be used:
B = 256
d(0)*B^0 + d(1)*B^1 + d(2)*B^2 + ... + d(n-1)*B^(n-1)
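The formula above can be applied directly in Python. The function name `decode_small_big` is hypothetical; the sketch follows the SMALL_BIG_EXT layout with tag 110, a one byte digit count, and a Sign byte.

```python
def decode_small_big(buf: bytes):
    """Return (value, bytes_consumed) for a SMALL_BIG_EXT term."""
    if buf[0] != 110:
        raise ValueError("not a SMALL_BIG_EXT term")
    n = buf[1]
    sign = buf[2]
    digits = buf[3:3 + n]
    if len(digits) != n:
        raise ValueError("truncated bignum")
    # d(0)*B^0 + d(1)*B^1 + ... + d(n-1)*B^(n-1) with B = 256
    magnitude = sum(d * 256 ** i for i, d in enumerate(digits))
    return (-magnitude if sign == 1 else magnitude), 3 + n
```

The sum is equivalent to `int.from_bytes(digits, "little")`; it is spelled out here to mirror the formula.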
LARGE_BIG_EXT
1   | 4 | 1    | n
111 | n | Sign | d(0) ... d(n-1)
Same as SMALL_BIG_EXT
with the difference that the length field
is an unsigned 4 byte integer.
NEW_REFERENCE_EXT
1   | 2   | N    | 1        | N'
114 | Len | Node | Creation | ID ...
Node and Creation are as in
REFERENCE_EXT.
ID contains a sequence of big-endian unsigned integers
(4 bytes each, so N' is a multiple of 4),
but should be regarded as uninterpreted data.
N' = 4 * Len.
In the first word (four bytes) of ID, only 18 bits are
significant, the rest should be 0.
In Creation, only 2 bits are significant,
the rest should be 0.
NEW_REFERENCE_EXT was introduced with distribution version 4.
In version 4, N' should be at most 12.
See REFERENCE_EXT.
SMALL_ATOM_EXT
An atom is stored with a 1 byte unsigned length,
followed by Len 8 bit Latin1 characters that
form the AtomName. Longer atoms can be represented
by ATOM_EXT. Note that
SMALL_ATOM_EXT was introduced in ERTS version 5.7.2 and
requires an exchange of the DFLAG_SMALL_ATOM_TAGS
distribution flag in the distribution handshake.
FUN_EXT
1   | 4       | N1  | N2     | N3    | N4   | N5
117 | NumFree | Pid | Module | Index | Uniq | Free vars ...
Pid - a process identifier as in PID_EXT.
It represents the process in which the fun was created.
Module - encoded as an atom, using ATOM_EXT, SMALL_ATOM_EXT
or ATOM_CACHE_REF.
This is the module that the fun is implemented in.
Index - an integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT.
It is typically a small index into the module's fun table.
Uniq - an integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT.
Uniq is the hash value of the parse tree for the fun.
Free vars - NumFree terms, each one encoded according
to its type.
NEW_FUN_EXT
1   | 4    | 1     | 16   | 4     | 4       | N1     | N2       | N3      | N4  | N5
112 | Size | Arity | Uniq | Index | NumFree | Module | OldIndex | OldUniq | Pid | Free Vars
This is the new encoding of internal funs: fun F/A and
fun(Arg1,..) -> ... end.
Size - the total number of bytes, including the Size field.
Arity - the arity of the function implementing the fun.
Uniq - the 16 byte MD5 of the significant parts of the Beam file.
Index - an index number. Each fun within a module has a unique
index. Index is stored in big-endian byte order.
NumFree - the number of free variables.
Module - encoded as an atom, using ATOM_EXT, SMALL_ATOM_EXT or
ATOM_CACHE_REF.
This is the module that the fun is implemented in.
OldIndex - an integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT.
It is typically a small index into the module's fun table.
OldUniq - an integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT.
It is the hash value of the parse tree for the fun.
Pid - a process identifier as in PID_EXT.
It represents the process in which the fun was created.
Free vars - NumFree terms, each one encoded according
to its type.
EXPORT_EXT
1   | N1     | N2       | N3
113 | Module | Function | Arity
This term is the encoding for external funs: fun M:F/A.
Module and Function are atoms
(encoded using ATOM_EXT,
SMALL_ATOM_EXT or
ATOM_CACHE_REF).
Arity is an integer encoded using
SMALL_INTEGER_EXT.
BIT_BINARY_EXT
1  | 4   | 1    | Len
77 | Len | Bits | Data
This term represents a bitstring whose length in bits does
not have to be a multiple of 8.
The Len field is an unsigned 4 byte integer (big endian).
The Bits field is the number of bits (1-8) that are used
in the last byte in the data field,
counting from the most significant bit towards the least
significant.
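The relation between Len, Bits, and the bitstring length can be sketched in Python. The function name `decode_bit_binary` is hypothetical, and treating a zero Len as an empty bitstring is an assumption of this sketch.

```python
import struct


def decode_bit_binary(buf: bytes):
    """Return (data, bit_length, bytes_consumed) for a BIT_BINARY_EXT term."""
    if buf[0] != 77:
        raise ValueError("not a BIT_BINARY_EXT term")
    (length,) = struct.unpack(">I", buf[1:5])   # unsigned 4 byte, big endian
    bits = buf[5]
    if not 1 <= bits <= 8:
        raise ValueError("Bits must be in the range 1-8")
    data = buf[6:6 + length]
    if len(data) != length:
        raise ValueError("truncated data")
    # All bytes but the last are full; the last one contributes Bits bits.
    bit_length = 8 * (length - 1) + bits if length else 0
    return data, bit_length, 6 + length
```

Because the used bits are counted from the most significant end, a one byte bitstring holding the 3-bit value 5 is stored as the byte 0xA0, and shifting it right by `8 - Bits` recovers 5.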
NEW_FLOAT_EXT
A float is stored as 8 bytes in big-endian IEEE format.
This term is used in minor version 1 of the external format.
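The 8 byte encoding maps directly onto `struct`'s big-endian double format. The function names are hypothetical, and the tag value 70 is taken from the Erlang/OTP documentation rather than from the text above.

```python
import struct

NEW_FLOAT_EXT = 70  # tag value assumed from the OTP documentation


def encode_new_float(x: float) -> bytes:
    return bytes([NEW_FLOAT_EXT]) + struct.pack(">d", x)  # big-endian IEEE double


def decode_new_float(buf: bytes):
    """Return (value, bytes_consumed)."""
    if buf[0] != NEW_FLOAT_EXT:
        raise ValueError("not a NEW_FLOAT_EXT term")
    (value,) = struct.unpack(">d", buf[1:9])
    return value, 9
```

Unlike the string form used by FLOAT_EXT, this representation is exact and takes 9 bytes including the tag.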
ATOM_UTF8_EXT
An atom is stored with a 2 byte unsigned length in big-endian order,
followed by Len bytes containing the AtomName encoded
in UTF-8.
For more information on encoding of atoms, see
note on UTF-8 encoded atoms
in the beginning of this document.
SMALL_ATOM_UTF8_EXT
An atom is stored with a 1 byte unsigned length,
followed by Len bytes containing the AtomName encoded
in UTF-8. Longer atoms encoded in UTF-8 can be represented using
ATOM_UTF8_EXT.
For more information on encoding of atoms, see
note on UTF-8 encoded atoms
in the beginning of this document.
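The two UTF-8 atom tags differ only in the width of the length field, which counts bytes rather than characters. The Python sketch below picks the short form whenever the encoded name fits in 255 bytes; that preference is a choice made for this sketch. The function names are hypothetical, and the tag values 118 (ATOM_UTF8_EXT) and 119 (SMALL_ATOM_UTF8_EXT) are assumed from the Erlang/OTP documentation.

```python
MAX_ATOM_CHARS = 255  # maximum number of characters in an atom


def encode_atom_utf8(name: str) -> bytes:
    if len(name) > MAX_ATOM_CHARS:
        raise ValueError("atoms are limited to 255 characters")
    raw = name.encode("utf-8")
    if len(raw) <= 255:
        return bytes([119, len(raw)]) + raw               # 1 byte length
    return bytes([118]) + len(raw).to_bytes(2, "big") + raw  # 2 byte length


def decode_atom_utf8(buf: bytes):
    """Return (atom_name, bytes_consumed)."""
    tag = buf[0]
    if tag == 119:
        length, start = buf[1], 2
    elif tag == 118:
        length, start = int.from_bytes(buf[1:3], "big"), 3
    else:
        raise ValueError("not a UTF-8 atom tag")
    return buf[start:start + length].decode("utf-8"), start + length
```

An atom of 200 characters that each need 2 bytes occupies 400 bytes and therefore needs the 2 byte length of ATOM_UTF8_EXT, even though it is within the 255 character limit.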