External Term Format

Copyright 2007-2017 Ericsson AB. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. The Initial Developer of the Original Code is Ericsson AB.

Kenneth, 2007-09-21, PA1, erl_ext_dist.xml

Introduction

The external term format is mainly used in the distribution mechanism of Erlang.

As Erlang has a fixed number of types, there is no need for a programmer to define a specification for the external format used within some application. All Erlang terms have an external representation and the interpretation of the different terms is application-specific.

In Erlang the BIF erlang:term_to_binary/1,2 is used to convert a term into the external format. To convert binary data that encodes a term back into a term, the BIF erlang:binary_to_term/1 is used.

The distribution does this implicitly when sending messages across node boundaries.

The overall format of an encoded term is as follows:

Bytes | 1 | 1 | N
Field | 131 | Tag | Data

Table: Term Format

When messages are passed between connected nodes and a distribution header is used, the first byte containing the version number (131) is omitted from the terms that follow the distribution header. This is because the version number is implied by the version number in the distribution header.
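As an illustration of the layout above, the following Python sketch splits one encoded term into its tag and data. The function name split_term and the has_version parameter are hypothetical and only serve this example; has_version=False models a term that follows a distribution header.

```python
VERSION = 131  # version number byte from the term format layout


def split_term(data: bytes, has_version: bool = True):
    """Split one encoded term into (tag, data).

    has_version=False models a term that follows a distribution header,
    where the leading version byte is omitted.
    """
    pos = 0
    if has_version:
        if len(data) < 1 or data[0] != VERSION:
            raise ValueError("expected version byte 131")
        pos = 1
    if len(data) <= pos:
        raise ValueError("missing tag byte")
    return data[pos], data[pos + 1:]


# 131, then tag 97 (SMALL_INTEGER_EXT), then one data byte
print(split_term(bytes([131, 97, 42])))
print(split_term(bytes([97, 42]), has_version=False))
```

Both calls yield the same tag and data; only the presence of the version byte differs.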

The compressed term format is as follows:

Bytes | 1 | 1 | 4 | N
Field | 131 | 80 | UncompressedSize | Zlib-compressedData

Table: Compressed Term Format

Uncompressed size (unsigned 32-bit integer in big-endian byte order) is the size of the data before it was compressed. The compressed data has the following format when it has been expanded:

Bytes | 1 | Uncompressed Size
Field | Tag | Data

Table: Compressed Data Format when Expanded
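The compressed layout can be sketched in Python with the standard zlib and struct modules. The helper names compress_term and expand_term are hypothetical. The sketch assumes that UncompressedSize counts the Tag and Data bytes (everything after the version byte), as the expanded format above suggests.

```python
import struct
import zlib


def compress_term(term: bytes) -> bytes:
    """Wrap an uncompressed term (starting with 131) in the compressed format."""
    if term[:1] != bytes([131]):
        raise ValueError("expected version byte 131")
    body = term[1:]  # Tag + Data, without the version byte
    return bytes([131, 80]) + struct.pack(">I", len(body)) + zlib.compress(body)


def expand_term(data: bytes) -> bytes:
    """Inverse of compress_term: returns 131 followed by Tag + Data."""
    if data[:2] != bytes([131, 80]):
        raise ValueError("not a compressed term")
    (size,) = struct.unpack(">I", data[2:6])  # unsigned 32-bit, big-endian
    body = zlib.decompress(data[6:])
    if len(body) != size:
        raise ValueError("UncompressedSize does not match expanded data")
    return bytes([131]) + body


term = bytes([131, 107, 0, 3]) + b"abc"  # STRING_EXT "abc"
assert expand_term(compress_term(term)) == term
```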

As from ERTS 9.0 (OTP 20), atoms may contain any Unicode characters and are always encoded using the UTF-8 external formats ATOM_UTF8_EXT or SMALL_ATOM_UTF8_EXT. The old Latin-1 formats ATOM_EXT and SMALL_ATOM_EXT are deprecated and are only kept for backward compatibility when decoding terms encoded by older nodes.

Support for UTF-8 encoded atoms in the external format has been available since ERTS 5.10 (OTP R16). This ability allows such old nodes to decode, store and encode any Unicode atoms received from a new OTP 20 node.

The maximum number of allowed characters in an atom is 255. In the UTF-8 case, each character can need up to 4 bytes to be encoded.

Distribution Header

As from ERTS 5.7.2 the old atom cache protocol was dropped and a new one was introduced. This protocol introduced the distribution header. Nodes with an ERTS version earlier than 5.7.2 can still communicate with new nodes, but no distribution header and no atom cache are used.

The distribution header currently contains only an atom cache reference section, but can in the future contain more information. The distribution header precedes one or more Erlang terms in the external format. For more information, see the description of the protocol between connected nodes in the distribution protocol documentation.

ATOM_CACHE_REF entries with corresponding AtomCacheReferenceIndex in terms encoded in the external format that follow a distribution header refer to the atom cache references made in the distribution header. The range is 0 <= AtomCacheReferenceIndex < 255, that is, at most 255 different atom cache references from the following terms can be made.

The distribution header format is as follows:

Bytes | 1 | 1 | 1 | NumberOfAtomCacheRefs/2+1 or 0 | N or 0
Field | 131 | 68 | NumberOfAtomCacheRefs | Flags | AtomCacheRefs

Table: Distribution Header Format

Flags consist of NumberOfAtomCacheRefs/2+1 bytes, unless NumberOfAtomCacheRefs is 0. If NumberOfAtomCacheRefs is 0, Flags and AtomCacheRefs are omitted. Each atom cache reference has a half byte flag field. Flags corresponding to a specific AtomCacheReferenceIndex are located in flag byte number AtomCacheReferenceIndex/2. Flag byte 0 is the first byte after the NumberOfAtomCacheRefs byte. Flags for an even AtomCacheReferenceIndex are located in the least significant half byte and flags for an odd AtomCacheReferenceIndex are located in the most significant half byte.

The flag field of an atom cache reference has the following format:

Bits | 1 | 3
Field | NewCacheEntryFlag | SegmentIndex

The most significant bit is the NewCacheEntryFlag. If set, the corresponding cache reference is new. The three least significant bits are the SegmentIndex of the corresponding atom cache entry. An atom cache consists of 8 segments, each of size 256, that is, an atom cache can contain 2048 entries.
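The half byte lookup and the split into NewCacheEntryFlag and SegmentIndex can be sketched as follows. The function name ref_flag and the constants are hypothetical; flags is assumed to hold only the Flags bytes, starting with flag byte 0.

```python
SEGMENT_COUNT = 8    # segments in an atom cache
SEGMENT_SIZE = 256   # entries per segment


def ref_flag(flags: bytes, index: int):
    """Return (new_cache_entry, segment_index) for one AtomCacheReferenceIndex."""
    byte = flags[index // 2]
    # even index: least significant half byte; odd index: most significant
    nibble = (byte >> 4) if index % 2 else (byte & 0x0F)
    return bool(nibble & 0b1000), nibble & 0b0111


flags = bytes([0x1A, 0x00])
print(ref_flag(flags, 0))            # (True, 2)
print(ref_flag(flags, 1))            # (False, 1)
print(SEGMENT_COUNT * SEGMENT_SIZE)  # 2048
```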

After flag fields for atom cache references, another half byte flag field is located with the following format:

Bits | 3 | 1
Field | CurrentlyUnused | LongAtoms

The least significant bit in that half byte is flag LongAtoms. If it is set, 2 bytes are used for atom lengths instead of 1 byte in the distribution header.

After the Flags field follow the AtomCacheRefs. The first AtomCacheRef is the one corresponding to AtomCacheReferenceIndex 0. Higher indices follow in sequence up to index NumberOfAtomCacheRefs - 1.

If the NewCacheEntryFlag for the next AtomCacheRef has been set, a NewAtomCacheRef in the following format follows:

Bytes | 1 | 1 or 2 | Length
Field | InternalSegmentIndex | Length | AtomText

InternalSegmentIndex together with the SegmentIndex completely identify the location of an atom cache entry in the atom cache. Length is the number of bytes that AtomText consists of. Length is a 2 byte big-endian integer if flag LongAtoms has been set, otherwise a 1 byte integer. When distribution flag DFLAG_UTF8_ATOMS has been exchanged between both nodes in the distribution handshake, characters in AtomText are encoded in UTF-8, otherwise in Latin-1. The following CachedAtomRefs with the same SegmentIndex and InternalSegmentIndex as this NewAtomCacheRef refer to this atom until a new NewAtomCacheRef with the same SegmentIndex and InternalSegmentIndex appears.

For more information on encoding of atoms, see the note on UTF-8 encoded atoms in the beginning of this section.

If the NewCacheEntryFlag for the next AtomCacheRef has not been set, a CachedAtomRef in the following format follows:

Bytes | 1
Field | InternalSegmentIndex

InternalSegmentIndex together with the SegmentIndex identify the location of the atom cache entry in the atom cache. The atom corresponding to this CachedAtomRef is the latest NewAtomCacheRef preceding this CachedAtomRef in another previously passed distribution header.
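Putting the pieces of this section together, a distribution header could be walked roughly as in the following Python sketch. The function name parse_dist_header and the returned tuple shape are hypothetical, AtomText is returned as raw bytes because its character encoding depends on the handshake, and no bounds checking is done on truncated input.

```python
def parse_dist_header(buf: bytes):
    """Return (long_atoms, refs, next_pos) for a header starting at buf[0].

    Each ref is (segment_index, internal_segment_index, atom_text_or_None).
    """
    if buf[0] != 131 or buf[1] != 68:
        raise ValueError("not a distribution header")
    count = buf[2]  # NumberOfAtomCacheRefs
    if count == 0:
        return False, [], 3  # Flags and AtomCacheRefs are omitted
    flags = buf[3:3 + count // 2 + 1]

    def nibble(i):
        byte = flags[i // 2]
        return (byte >> 4) if i % 2 else (byte & 0x0F)

    # The extra half byte after the per-reference flags carries LongAtoms.
    long_atoms = bool(nibble(count) & 0x1)
    pos = 3 + len(flags)
    refs = []
    for i in range(count):
        flag = nibble(i)
        is_new, segment = bool(flag & 0x8), flag & 0x7
        internal = buf[pos]
        pos += 1
        text = None
        if is_new:  # NewAtomCacheRef: Length and AtomText follow
            width = 2 if long_atoms else 1
            length = int.from_bytes(buf[pos:pos + width], "big")
            pos += width
            text = buf[pos:pos + length]
            pos += length
        refs.append((segment, internal, text))
    return long_atoms, refs, pos


header = bytes([131, 68, 2, 0x18, 0x00, 5, 2]) + b"ok" + bytes([7])
print(parse_dist_header(header))
```

In the example, reference 0 is a new entry in segment 0 carrying the atom text "ok", and reference 1 is a cached entry in segment 1. A real decoder would then look up or store these entries in its atom cache.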

ATOM_CACHE_REF

Bytes | 1 | 1
Field | 82 | AtomCacheReferenceIndex

Refers to the atom with AtomCacheReferenceIndex in the distribution header.

SMALL_INTEGER_EXT

Bytes | 1 | 1
Field | 97 | Int

Unsigned 8-bit integer.

INTEGER_EXT

Bytes | 1 | 4
Field | 98 | Int

Signed 32-bit integer in big-endian format.
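A minimal Python sketch of decoding the two integer tags, using the standard struct module. The function name decode_integer and the (value, bytes consumed) return shape are hypothetical.

```python
import struct


def decode_integer(data: bytes):
    """Decode SMALL_INTEGER_EXT or INTEGER_EXT; return (value, bytes_consumed)."""
    tag = data[0]
    if tag == 97:   # SMALL_INTEGER_EXT: unsigned 8-bit
        return data[1], 2
    if tag == 98:   # INTEGER_EXT: signed 32-bit, big-endian
        (value,) = struct.unpack(">i", data[1:5])
        return value, 5
    raise ValueError("not an integer tag")


print(decode_integer(bytes([97, 255])))                     # (255, 2)
print(decode_integer(bytes([98, 0xFF, 0xFF, 0xFF, 0xFF])))  # (-1, 5)
```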

FLOAT_EXT

Bytes | 1 | 31
Field | 99 | Float string

A float is stored in string format. The format used in sprintf to format the float is "%.20e" (there are more bytes allocated than necessary). To unpack the float, use sscanf with format "%lf".

This term is used in minor version 0 of the external format; it has been superseded by NEW_FLOAT_EXT.
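The string form can be imitated in Python, whose % operator accepts the same "%.20e" conversion. The helper names are hypothetical, and padding the unused bytes of the 31-byte field with zero bytes is an assumption of this sketch, not something stated above.

```python
def encode_float_ext(value: float) -> bytes:
    text = ("%.20e" % value).encode("ascii")
    # Assumption: the unused tail of the 31-byte field is zero-filled.
    return bytes([99]) + text.ljust(31, b"\x00")


def decode_float_ext(data: bytes) -> float:
    if data[0] != 99:
        raise ValueError("not FLOAT_EXT")
    field = data[1:32]
    return float(field.split(b"\x00", 1)[0].decode("ascii"))


encoded = encode_float_ext(1.5)
print(len(encoded))               # 32
print(decode_float_ext(encoded))  # 1.5
```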

REFERENCE_EXT

Bytes | 1 | N | 4 | 1
Field | 101 | Node | ID | Creation

Encodes a reference object (an object generated with erlang:make_ref/0). The Node term is an encoded atom, that is, ATOM_UTF8_EXT, SMALL_ATOM_UTF8_EXT, or ATOM_CACHE_REF. The ID field contains a big-endian unsigned integer, but is to be regarded as uninterpreted data, as this field is node-specific. Creation is a byte containing a node serial number, which makes it possible to separate old (crashed) nodes from a new one.

In ID, only 18 bits are significant; the rest are to be 0. In Creation, only two bits are significant; the rest are to be 0. See NEW_REFERENCE_EXT.

PORT_EXT

Bytes | 1 | N | 4 | 1
Field | 102 | Node | ID | Creation

Encodes a port object (obtained from erlang:open_port/2). The ID is a node-specific identifier for a local port. Port operations are not allowed across node boundaries. The Creation works just like in REFERENCE_EXT.

PID_EXT

Bytes | 1 | N | 4 | 4 | 1
Field | 103 | Node | ID | Serial | Creation

Encodes a process identifier object (obtained from erlang:spawn/3 or friends). The ID and Creation fields work just like in REFERENCE_EXT, while the Serial field is used to improve safety. In ID, only 15 bits are significant; the rest are to be 0.
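The PID_EXT layout can be read as in the following Python sketch. The helper names are hypothetical, only the two UTF-8 atom tags are handled for Node (an ATOM_CACHE_REF would need the atom cache from a distribution header), and masking ID and Creation to their significant bits is a choice made for this illustration.

```python
import struct


def decode_atom(data: bytes, pos: int):
    """Decode SMALL_ATOM_UTF8_EXT (119) or ATOM_UTF8_EXT (118) at pos."""
    tag = data[pos]
    if tag == 119:
        length, start = data[pos + 1], pos + 2
    elif tag == 118:
        length, start = int.from_bytes(data[pos + 1:pos + 3], "big"), pos + 3
    else:
        raise ValueError("unsupported atom tag in this sketch")
    return data[start:start + length].decode("utf-8"), start + length


def decode_pid_ext(data: bytes):
    if data[0] != 103:
        raise ValueError("not PID_EXT")
    node, pos = decode_atom(data, 1)
    ident, serial, creation = struct.unpack(">IIB", data[pos:pos + 9])
    return {
        "node": node,
        "id": ident & 0x7FFF,         # only 15 bits are significant
        "serial": serial,
        "creation": creation & 0x03,  # only 2 bits are significant
    }


pid = bytes([103, 119, 3]) + b"a@b" + struct.pack(">IIB", 38, 0, 1)
print(decode_pid_ext(pid))
```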

SMALL_TUPLE_EXT

Bytes | 1 | 1 | N
Field | 104 | Arity | Elements

Encodes a tuple. The Arity field is an unsigned byte that determines how many elements follow in section Elements.

LARGE_TUPLE_EXT

Bytes | 1 | 4 | N
Field | 105 | Arity | Elements

Same as SMALL_TUPLE_EXT except that Arity is an unsigned 4 byte integer in big-endian format.
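Because Elements is a sequence of complete encoded terms, a tuple decoder is naturally recursive. The Python sketch below handles only SMALL_INTEGER_EXT elements and nested tuples to stay short; the function name decode_term is hypothetical.

```python
def decode_term(data: bytes, pos: int = 0):
    """Decode a small integer or a (possibly nested) tuple; return (value, next_pos)."""
    tag = data[pos]
    if tag == 97:  # SMALL_INTEGER_EXT
        return data[pos + 1], pos + 2
    if tag in (104, 105):
        if tag == 104:  # SMALL_TUPLE_EXT: 1-byte arity
            arity = data[pos + 1]
            pos += 2
        else:           # LARGE_TUPLE_EXT: 4-byte big-endian arity
            arity = int.from_bytes(data[pos + 1:pos + 5], "big")
            pos += 5
        elements = []
        for _ in range(arity):
            value, pos = decode_term(data, pos)
            elements.append(value)
        return tuple(elements), pos
    raise ValueError("tag not handled in this sketch")


# {1, {2}} as SMALL_TUPLE_EXT with a nested SMALL_TUPLE_EXT
print(decode_term(bytes([104, 2, 97, 1, 104, 1, 97, 2])))  # ((1, (2,)), 8)
```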

MAP_EXT

Bytes | 1 | 4 | N
Field | 116 | Arity | Pairs

Encodes a map. The Arity field is an unsigned 4 byte integer in big-endian format that determines the number of key-value pairs in the map. Key and value pairs (Ki => Vi) are encoded in section Pairs in the following order: K1, V1, K2, V2,..., Kn, Vn. Duplicate keys are not allowed within the same map.

MAP_EXT is available as from Erlang/OTP 17.0.
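The pair ordering K1, V1, K2, V2, ... can be illustrated with a small Python encoder. The function name is hypothetical, and keys and values are restricted to integers in the range 0-255 so that each can be written as a SMALL_INTEGER_EXT; a Python dict already guarantees unique keys.

```python
def encode_small_int_map(mapping: dict) -> bytes:
    """Encode a dict of small integers as MAP_EXT (sketch, 0..255 only)."""
    out = bytearray([116])
    out += len(mapping).to_bytes(4, "big")  # Arity: number of key-value pairs
    for key, value in mapping.items():
        for item in (key, value):           # K1, V1, K2, V2, ...
            if not 0 <= item <= 255:
                raise ValueError("sketch only supports integers 0..255")
            out += bytes([97, item])        # SMALL_INTEGER_EXT
    return bytes(out)


print(list(encode_small_int_map({1: 2, 3: 4})))
# [116, 0, 0, 0, 2, 97, 1, 97, 2, 97, 3, 97, 4]
```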

NIL_EXT

Bytes | 1
Field | 106

The representation for an empty list, that is, the Erlang syntax [].

STRING_EXT

Bytes | 1 | 2 | Len
Field | 107 | Length | Characters

String does not have a corresponding Erlang representation, but is an optimization for sending lists of bytes (integer in the range 0-255) more efficiently over the distribution. As field Length is an unsigned 2 byte integer (big-endian), implementations must ensure that lists longer than 65535 elements are encoded as LIST_EXT.
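The length rule can be sketched as an encoder that picks a representation for a list of bytes. The function name encode_byte_list is hypothetical; the fallback writes each element as a SMALL_INTEGER_EXT inside a LIST_EXT with a NIL_EXT tail, and the empty list is written as NIL_EXT.

```python
def encode_byte_list(values) -> bytes:
    """Encode a list of integers 0..255, using STRING_EXT when it fits."""
    if not all(isinstance(v, int) and 0 <= v <= 255 for v in values):
        raise ValueError("sketch only supports integers 0..255")
    n = len(values)
    if n == 0:
        return bytes([106])  # NIL_EXT
    if n <= 65535:           # Length fits in an unsigned 2-byte field
        return bytes([107]) + n.to_bytes(2, "big") + bytes(values)
    # Too long for STRING_EXT: LIST_EXT of SMALL_INTEGER_EXT with NIL_EXT tail
    body = b"".join(bytes([97, v]) for v in values)
    return bytes([108]) + n.to_bytes(4, "big") + body + bytes([106])


print(list(encode_byte_list([104, 105])))        # [107, 0, 2, 104, 105]
print(list(encode_byte_list([0] * 65536)[:5]))   # [108, 0, 1, 0, 0]
```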

LIST_EXT

Bytes | 1 | 4 | |
Field | 108 | Length | Elements | Tail

Length is the number of elements that follow in section Elements. Tail is the final tail of the list; it is NIL_EXT for a proper list, but can be any type if the list is improper (for example, [a|b]).
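The role of Tail can be seen in the following Python sketch, which returns the elements and the tail separately. The function names are hypothetical, and both elements and a non-empty tail are limited to SMALL_INTEGER_EXT to keep the example short; a proper list is reported with an empty Python list as its tail.

```python
def decode_small_int(data: bytes, pos: int):
    if data[pos] != 97:
        raise ValueError("sketch only handles SMALL_INTEGER_EXT")
    return data[pos + 1], pos + 2


def decode_list_ext(data: bytes, pos: int = 0):
    """Return (elements, tail, next_pos); tail is [] for a proper list."""
    if data[pos] != 108:
        raise ValueError("not LIST_EXT")
    length = int.from_bytes(data[pos + 1:pos + 5], "big")
    pos += 5
    elements = []
    for _ in range(length):
        value, pos = decode_small_int(data, pos)
        elements.append(value)
    if data[pos] == 106:  # NIL_EXT: proper list
        return elements, [], pos + 1
    tail, pos = decode_small_int(data, pos)  # improper list
    return elements, tail, pos


print(decode_list_ext(bytes([108, 0, 0, 0, 2, 97, 1, 97, 2, 106])))  # [1,2]
print(decode_list_ext(bytes([108, 0, 0, 0, 1, 97, 1, 97, 2])))       # [1|2]
```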

BINARY_EXT

Bytes | 1 | 4 | Len
Field | 109 | Len | Data

Binaries are generated with a bit syntax expression or with erlang:list_to_binary/1, erlang:term_to_binary/1, or as input from binary ports. The Len length field is an unsigned 4 byte integer (big-endian).

SMALL_BIG_EXT

Bytes | 1 | 1 | 1 | n
Field | 110 | n | Sign | d(0) ... d(n-1)

Bignums are stored in sign-and-magnitude form with a Sign byte, that is, 0 if the bignum is positive and 1 if it is negative. The digits are stored with the least significant byte stored first. To calculate the integer, the following formula can be used:

B = 256
(d(0)*B^0 + d(1)*B^1 + d(2)*B^2 + ... + d(n-1)*B^(n-1))

LARGE_BIG_EXT

Bytes | 1 | 4 | 1 | n
Field | 111 | n | Sign | d(0) ... d(n-1)

Same as SMALL_BIG_EXT except that the length field is an unsigned 4 byte integer.
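The bignum formula translates directly into Python, which has arbitrary-precision integers. The function name decode_big is hypothetical; the sum below is the base-256 formula with the least significant digit first.

```python
def decode_big(data: bytes) -> int:
    """Decode SMALL_BIG_EXT (110) or LARGE_BIG_EXT (111) into a Python int."""
    tag = data[0]
    if tag == 110:    # 1-byte digit count
        n, pos = data[1], 2
    elif tag == 111:  # 4-byte big-endian digit count
        n, pos = int.from_bytes(data[1:5], "big"), 5
    else:
        raise ValueError("not a bignum tag")
    sign = data[pos]
    digits = data[pos + 1:pos + 1 + n]
    # d(0)*B^0 + d(1)*B^1 + ... + d(n-1)*B^(n-1) with B = 256
    magnitude = sum(d * 256 ** i for i, d in enumerate(digits))
    return -magnitude if sign == 1 else magnitude


print(decode_big(bytes([110, 2, 0, 0, 1])))           # 256
print(decode_big(bytes([110, 1, 1, 5])))              # -5
print(decode_big(bytes([111, 0, 0, 0, 2, 0, 1, 1])))  # 257
```

The same magnitude can be obtained with int.from_bytes(digits, "little"); the explicit sum is used here only to mirror the formula.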

NEW_REFERENCE_EXT

Bytes | 1 | 2 | N | 1 | N'
Field | 114 | Len | Node | Creation | ID ...

Node and Creation are as in REFERENCE_EXT.

ID contains a sequence of big-endian unsigned integers (4 bytes each, so N' is a multiple of 4), but is to be regarded as uninterpreted data.

N' = 4 * Len.

In the first word (4 bytes) of ID, only 18 bits are significant, the rest are to be 0. In Creation, only two bits are significant, the rest are to be 0.

NEW_REFERENCE_EXT was introduced with distribution version 4. In version 4, N' is to be at most 12.

See REFERENCE_EXT.

FUN_EXT

Bytes | 1 | 4 | N1 | N2 | N3 | N4 | N5
Field | 117 | NumFree | Pid | Module | Index | Uniq | Free vars ...

Pid

A process identifier as in PID_EXT. Represents the process in which the fun was created.

Module

Encoded as an atom, using ATOM_UTF8_EXT, SMALL_ATOM_UTF8_EXT, or ATOM_CACHE_REF. This is the module that the fun is implemented in.

Index

An integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. It is typically a small index into the module's fun table.

Uniq

An integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. Uniq is the hash value of the parse tree for the fun.

Free vars

NumFree number of terms, each one encoded according to its type.

NEW_FUN_EXT

Bytes | 1 | 4 | 1 | 16 | 4 | 4 | N1 | N2 | N3 | N4 | N5
Field | 112 | Size | Arity | Uniq | Index | NumFree | Module | OldIndex | OldUniq | Pid | Free Vars

This is the new encoding of internal funs: fun F/A and fun(Arg1,..) -> ... end.

Size

The total number of bytes, including field Size.

Arity

The arity of the function implementing the fun.

Uniq

The 16-byte MD5 of the significant parts of the BEAM file.

Index

An index number. Each fun within a module has a unique index. Index is stored in big-endian byte order.

NumFree

The number of free variables.

Module

Encoded as an atom, using ATOM_UTF8_EXT, SMALL_ATOM_UTF8_EXT, or ATOM_CACHE_REF. This is the module that the fun is implemented in.

OldIndex

An integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. It is typically a small index into the module's fun table.

OldUniq

An integer encoded using SMALL_INTEGER_EXT or INTEGER_EXT. OldUniq is the hash value of the parse tree for the fun.

Pid

A process identifier as in PID_EXT. Represents the process in which the fun was created.

Free vars

NumFree number of terms, each one encoded according to its type.

EXPORT_EXT

Bytes | 1 | N1 | N2 | N3
Field | 113 | Module | Function | Arity

This term is the encoding for external funs: fun M:F/A.

Module and Function are atoms (encoded using ATOM_UTF8_EXT, SMALL_ATOM_UTF8_EXT, or ATOM_CACHE_REF).

Arity is an integer encoded using SMALL_INTEGER_EXT.

BIT_BINARY_EXT

Bytes | 1 | 4 | 1 | Len
Field | 77 | Len | Bits | Data

This term represents a bitstring whose length in bits does not have to be a multiple of 8. The Len field is an unsigned 4 byte integer (big-endian). The Bits field is the number of bits (1-8) that are used in the last byte in the data field, counting from the most significant bit to the least significant.
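The relation between Len, Bits, and the bit length can be sketched in Python as follows. The function name is hypothetical, and the bitstring is returned as a text string of 0 and 1 characters purely for readability. The total number of bits is taken to be (Len - 1) * 8 + Bits, which follows from Bits describing only the last byte.

```python
def decode_bit_binary(data: bytes) -> str:
    """Return the bitstring of a BIT_BINARY_EXT as a string of '0'/'1'."""
    if data[0] != 77:
        raise ValueError("not BIT_BINARY_EXT")
    length = int.from_bytes(data[1:5], "big")
    bits = data[5]
    payload = data[6:6 + length]
    if length == 0:
        return ""
    if not 1 <= bits <= 8:
        raise ValueError("Bits must be in the range 1-8")
    total_bits = (length - 1) * 8 + bits
    as_text = "".join(f"{byte:08b}" for byte in payload)
    # Used bits of the last byte are counted from the most significant bit.
    return as_text[:total_bits]


# A 3-bit bitstring with bits 101, stored in the top bits of one byte
print(decode_bit_binary(bytes([77, 0, 0, 0, 1, 3, 0b10100000])))  # 101
```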

NEW_FLOAT_EXT

Bytes | 1 | 8
Field | 70 | IEEE float

A float is stored as 8 bytes in big-endian IEEE format.

This term is used in minor version 1 of the external format.
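In Python, the 8-byte big-endian IEEE format corresponds to the ">d" format of the standard struct module. The helper names below are hypothetical.

```python
import struct


def encode_new_float(value: float) -> bytes:
    return bytes([70]) + struct.pack(">d", value)


def decode_new_float(data: bytes) -> float:
    if data[0] != 70:
        raise ValueError("not NEW_FLOAT_EXT")
    (value,) = struct.unpack(">d", data[1:9])
    return value


encoded = encode_new_float(1.5)
print(encoded.hex())              # 463ff8000000000000
print(decode_new_float(encoded))  # 1.5
```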

ATOM_UTF8_EXT

Bytes | 1 | 2 | Len
Field | 118 | Len | AtomName

An atom is stored with a 2 byte unsigned length in big-endian order, followed by Len bytes containing the AtomName encoded in UTF-8.

For more information on encoding of atoms, see the note on UTF-8 encoded atoms in the beginning of this section.

SMALL_ATOM_UTF8_EXT

Bytes | 1 | 1 | Len
Field | 119 | Len | AtomName

An atom is stored with a 1 byte unsigned length, followed by Len bytes containing the AtomName encoded in UTF-8. Longer atoms encoded in UTF-8 can be represented using ATOM_UTF8_EXT.

For more information on encoding of atoms, see the note on UTF-8 encoded atoms in the beginning of this section.
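The difference between the character limit and the byte length can be sketched with a small Python encoder. The function name encode_atom_utf8 is hypothetical, and preferring SMALL_ATOM_UTF8_EXT whenever the UTF-8 byte length fits in one byte is an assumption of this sketch. Python's len on a str counts code points, which is taken here as the number of characters.

```python
MAX_ATOM_CHARS = 255  # maximum number of characters in an atom


def encode_atom_utf8(name: str) -> bytes:
    if len(name) > MAX_ATOM_CHARS:
        raise ValueError("atom has more than 255 characters")
    raw = name.encode("utf-8")  # up to 4 bytes per character
    if len(raw) <= 255:
        return bytes([119, len(raw)]) + raw               # SMALL_ATOM_UTF8_EXT
    return bytes([118]) + len(raw).to_bytes(2, "big") + raw  # ATOM_UTF8_EXT


print(list(encode_atom_utf8("ok")))                # [119, 2, 111, 107]
print(list(encode_atom_utf8("\u20ac" * 100)[:3]))  # [118, 1, 44]
print(len(encode_atom_utf8("\U0001F600" * 255)))   # 1023
```

The second call shows a 100-character atom that needs 300 bytes in UTF-8 and therefore the 2-byte length field; the third shows the worst case of 255 characters at 4 bytes each, that is, 1020 bytes of AtomName.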

ATOM_EXT (deprecated)

Bytes | 1 | 2 | Len
Field | 100 | Len | AtomName

An atom is stored with a 2 byte unsigned length in big-endian order, followed by Len 8-bit Latin-1 characters that form the AtomName. The maximum allowed value for Len is 255.

SMALL_ATOM_EXT (deprecated)

Bytes | 1 | 1 | Len
Field | 115 | Len | AtomName

An atom is stored with a 1 byte unsigned length, followed by Len 8-bit Latin-1 characters that form the AtomName.

SMALL_ATOM_EXT was introduced in ERTS 5.7.2 and requires an exchange of distribution flag DFLAG_SMALL_ATOM_TAGS in the distribution handshake.