This module contains functions for converting between different character representations. It converts between ISO Latin-1 characters and Unicode characters, but it can also convert between different Unicode encodings (like UTF-8, UTF-16, and UTF-32).
In binaries, the default Unicode encoding in Erlang is UTF-8, which is also the format in which built-in functions and libraries in OTP expect to find binary Unicode data. In lists, Unicode data is encoded as integers, each integer being the Unicode code point of one character.
Unicode encodings other than integers representing code points (in lists) or UTF-8 (in binaries) are referred to as "external encodings". The ISO Latin-1 encoding, in both binaries and lists, is referred to as latin1-encoding.
It is recommended to only use external encodings for communication with external entities where this is required. When working inside the Erlang/OTP environment, it is recommended to keep binaries in UTF-8 when representing Unicode characters. ISO Latin-1 encoding is supported both for backward compatibility and for communication with external entities not supporting Unicode character sets.
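As a minimal sketch of the recommendation above, the following module re-encodes ISO Latin-1 data as UTF-8 at the system border and decodes UTF-8 binaries into code point lists. The module name `latin1_utf8_sketch` and its function names are hypothetical and chosen only for illustration; the calls to `unicode:characters_to_binary/3` and `unicode:characters_to_list/2` are the only library functions used.

```erlang
-module(latin1_utf8_sketch).
-export([latin1_to_utf8/1, utf8_to_codepoints/1]).

%% Hypothetical helper: re-encode a Latin-1 binary (one character per
%% byte) as UTF-8, the recommended internal binary form.
latin1_to_utf8(Latin1Bin) when is_binary(Latin1Bin) ->
    unicode:characters_to_binary(Latin1Bin, latin1, utf8).

%% Hypothetical helper: decode a UTF-8 binary into a list of integers,
%% one Unicode code point per element.
utf8_to_codepoints(Utf8Bin) when is_binary(Utf8Bin) ->
    unicode:characters_to_list(Utf8Bin, utf8).
```

For example, the Latin-1 byte 229 (the character å) becomes the two-byte UTF-8 sequence `<<195,165>>`, which decodes back to the single code point 229.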
Programs should always operate on a normalized form and compare
canonical-equivalent Unicode characters as equal. All characters
should thus be normalized to one form once on the system borders.
One of the following functions can convert characters to their
normalized forms
A
A
A
An
Same as
Same as
Checks for a UTF Byte Order Mark (BOM) in the beginning of a
binary. If the supplied binary
If no BOM is found, the function returns
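The BOM check can be sketched with `unicode:bom_to_encoding/1`, which returns a tuple of the detected encoding and the BOM length in bytes, or `{latin1,0}` when no BOM is present. The module name `bom_sketch` and the helper `strip_bom/1` are hypothetical; this is one possible way to separate the BOM from the payload, not a prescribed pattern.

```erlang
-module(bom_sketch).
-export([strip_bom/1]).

%% Hypothetical helper: detect a BOM at the start of Bin and return
%% the detected encoding together with the bytes that follow the BOM.
%% When no BOM is found, the encoding is latin1 and nothing is removed.
strip_bom(Bin) when is_binary(Bin) ->
    {Encoding, BomLength} = unicode:bom_to_encoding(Bin),
    <<_Bom:BomLength/binary, Rest/binary>> = Bin,
    {Encoding, Rest}.
```

The returned encoding can then be passed on to a conversion function such as `unicode:characters_to_list/2` to decode the remaining bytes.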
Same as
Same as
Behaves as
Options:
An alias for
An alias for
An alias for
The atoms
Errors and exceptions occur as in
Same as
Converts a possibly deep list of integers and binaries into a list of integers representing Unicode characters. The binaries in the input can have characters encoded as one of the following:
ISO Latin-1 (0-255, one character per byte). Here,
case parameter
One of the UTF-encodings, which is specified as parameter
Note that integers in the list always represent code points
regardless of
If
The purpose of the function is mainly to convert
combinations of Unicode characters into a pure Unicode
string in list representation for further processing. For
writing the data to an external entity, the reverse function
Option
If the data cannot be converted, either
because of illegal Unicode/ISO Latin-1 characters in the list,
or because of invalid UTF encoding in any binaries, an error
tuple is returned. The error tuple contains the tag
However, if the input
Errors occur for the following reasons:
Integers out of range.
If
If
An integer > 16#10FFFF (the maximum Unicode code point)
An integer in the range 16#D800 to 16#DFFF (invalid range reserved for UTF-16 surrogate pairs)
Incorrect UTF encoding.
If
Errors can occur for various reasons, including the following:
"Pure" decoding errors (like the upper bits of the bytes being wrong).
The bytes decode to a number that is too large.
The bytes decode to a code point in the invalid Unicode range.
Encoding is "overlong", meaning that a number should have been encoded in fewer bytes.
The case of a truncated UTF is handled specially; see the paragraph about incomplete binaries below.
If
A special type of error is when no actual invalid integers or
bytes are found, but a trailing
If one UTF character is split over two consecutive binaries in
the
Example:
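%% get_some_more_data/0 and handle_error/2 are application-defined.
decode_data(Data) ->
    case unicode:characters_to_list(Data, unicode) of
        {incomplete, Encoded, Rest} ->
            %% Trailing bytes form a truncated character: fetch more
            %% input and retry with the leftover bytes prepended.
            More = get_some_more_data(),
            Encoded ++ decode_data([Rest, More]);
        {error, Encoded, Rest} ->
            handle_error(Encoded, Rest);
        List ->
            List
    end.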
However, bit strings that are not whole bytes are not allowed, so a UTF character must be split along 8-bit boundaries to ever be decoded.
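To illustrate how the three possible outcomes described above can be told apart, here is a small sketch that tags the result of `unicode:characters_to_list/2`. The module name `convert_outcome_sketch` and the function `classify/1` are hypothetical, and the rest data of the error and incomplete tuples is deliberately ignored to keep the example short.

```erlang
-module(convert_outcome_sketch).
-export([classify/1]).

%% Hypothetical helper: convert Data (code points and/or UTF-8
%% binaries) and report which kind of result was returned, together
%% with the characters that were successfully converted.
classify(Data) ->
    case unicode:characters_to_list(Data, utf8) of
        {error, Converted, _Rest} ->
            %% Invalid integer or invalid UTF-8 byte sequence.
            {error, Converted};
        {incomplete, Converted, _Rest} ->
            %% Input ends in the middle of a UTF-8 character.
            {incomplete, Converted};
        List when is_list(List) ->
            {ok, List}
    end.
```

With this sketch, `<<"abc",16#C3>>` is reported as incomplete because 16#C3 starts a two-byte sequence that is cut off, whereas `<<"abc",16#FF>>` and the list `[$a,16#D800]` are reported as errors.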
A
Converts a possibly deep list of characters and binaries into a Normalized Form of canonical equivalent Composed characters according to the Unicode standard.
Any binaries in the input must be encoded with utf8 encoding.
The result is a list of characters.
3> unicode:characters_to_nfc_list([<<"abc..a">>,[778],$a,[776],$o,[776]]).
"abc..åäö"
Converts a possibly deep list of characters and binaries into a Normalized Form of canonical equivalent Composed characters according to the Unicode standard.
Any binaries in the input must be encoded with utf8 encoding.
The result is a UTF-8 encoded binary.
4> unicode:characters_to_nfc_binary([<<"abc..a">>,[778],$a,[776],$o,[776]]).
<<"abc..åäö"/utf8>>
Converts a possibly deep list of characters and binaries into a Normalized Form of canonical equivalent Decomposed characters according to the Unicode standard.
Any binaries in the input must be encoded with utf8 encoding.
The result is a list of characters.
1> unicode:characters_to_nfd_list("abc..åäö").
[97,98,99,46,46,97,778,97,776,111,776]
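The advice to compare canonical-equivalent characters as equal can be sketched by normalizing both operands before comparison. The module name `canonical_eq_sketch` and the function `equal_canonical/2` are hypothetical, and the sketch assumes both inputs are valid character data (invalid input would make the normalization calls return error tuples rather than lists).

```erlang
-module(canonical_eq_sketch).
-export([equal_canonical/2]).

%% Hypothetical helper: two character data terms are considered equal
%% if their canonical decompositions (NFD) are identical.
%% Assumes valid input; error tuples are not handled here.
equal_canonical(A, B) ->
    unicode:characters_to_nfd_list(A) =:= unicode:characters_to_nfd_list(B).
```

For instance, the precomposed character `[229]` (å) and the sequence `[$a,778]` (a followed by a combining ring above) differ as plain lists but compare equal after normalization.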
Converts a possibly deep list of characters and binaries into a Normalized Form of canonical equivalent Decomposed characters according to the Unicode standard.
Any binaries in the input must be encoded with utf8 encoding.
The result is a UTF-8 encoded binary.
2> unicode:characters_to_nfd_binary("abc..åäö").
<<97,98,99,46,46,97,204,138,97,204,136,111,204,136>>
Converts a possibly deep list of characters and binaries into a Normalized Form of compatibility equivalent Composed characters according to the Unicode standard.
Any binaries in the input must be encoded with utf8 encoding.
The result is a list of characters.
3> unicode:characters_to_nfkc_list([<<"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]).
"abc..åäö32"
Converts a possibly deep list of characters and binaries into a Normalized Form of compatibility equivalent Composed characters according to the Unicode standard.
Any binaries in the input must be encoded with utf8 encoding.
The result is a UTF-8 encoded binary.
4> unicode:characters_to_nfkc_binary([<<"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]).
<<"abc..åäö32"/utf8>>
Converts a possibly deep list of characters and binaries into a Normalized Form of compatibility equivalent Decomposed characters according to the Unicode standard.
Any binaries in the input must be encoded with utf8 encoding.
The result is a list of characters.
1> unicode:characters_to_nfkd_list(["abc..åäö",[65299,65298]]).
[97,98,99,46,46,97,778,97,776,111,776,51,50]
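The difference between canonical and compatibility decomposition can be seen by running both on the same input. The sketch below uses a hypothetical module `nf_compare_sketch` with a hypothetical function `decompose_both/1`; it relies only on `unicode:characters_to_nfd_list/1` and `unicode:characters_to_nfkd_list/1`.

```erlang
-module(nf_compare_sketch).
-export([decompose_both/1]).

%% Hypothetical helper: return the canonical (NFD) and the
%% compatibility (NFKD) decomposition of the same character data.
decompose_both(Chars) ->
    {unicode:characters_to_nfd_list(Chars),
     unicode:characters_to_nfkd_list(Chars)}.
```

The fullwidth digits 65299 and 65298 used in the example above have only compatibility mappings, so the canonical form leaves them unchanged while the compatibility form maps them to the ASCII digits `"32"`.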
Converts a possibly deep list of characters and binaries into a Normalized Form of compatibility equivalent Decomposed characters according to the Unicode standard.
Any binaries in the input must be encoded with utf8 encoding.
The result is a UTF-8 encoded binary.
2> unicode:characters_to_nfkd_binary(["abc..åäö",[65299,65298]]).
<<97,98,99,46,46,97,204,138,97,204,136,111,204,136,51,50>>
Creates a UTF Byte Order Mark (BOM) as a binary from the
supplied
The function returns
Notice that the BOM for UTF-8 is seldom used, and it is not really a byte order mark. UTF-8 has no byte order issues, so the BOM is only there to differentiate UTF-8 encoding from other UTF formats.
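As a sketch of how a BOM might be combined with encoded data for an external consumer, the following prepends the result of `unicode:encoding_to_bom/1` to the output of `unicode:characters_to_binary/3`. The module name `bom_writer_sketch` and the function `with_bom/2` are hypothetical; whether a BOM is wanted at all depends on the receiving side.

```erlang
-module(bom_writer_sketch).
-export([with_bom/2]).

%% Hypothetical helper: encode Chars (code points and/or UTF-8
%% binaries) in Encoding and prepend the matching BOM. For latin1 the
%% BOM is an empty binary, so the data is returned unprefixed.
%% Error and incomplete tuples are passed through unchanged.
with_bom(Chars, Encoding) ->
    Bom = unicode:encoding_to_bom(Encoding),
    case unicode:characters_to_binary(Chars, unicode, Encoding) of
        Bin when is_binary(Bin) ->
            <<Bom/binary, Bin/binary>>;
        Other ->
            Other
    end.
```

For example, `with_bom("a", {utf16,big})` yields `<<254,255,0,97>>`: the two-byte big-endian UTF-16 BOM followed by the character.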