Implementing support for Unicode character sets is an ongoing process. The Erlang Enhancement Proposal (EEP) 10 outlined the basics of Unicode support and specified a default encoding in binaries that all Unicode-aware modules are to handle in the future.
Here is an overview of what has been done so far:
The functionality described in EEP10 was implemented in Erlang/OTP R13A.
Erlang/OTP R14B01 added support for Unicode filenames, but it was not complete and was by default disabled on platforms where no guarantee was given for the filename encoding.
With Erlang/OTP R16A came support for UTF-8 encoded
source code, with enhancements to many of the applications to
support both Unicode encoded filenames and support for UTF-8
encoded files in many circumstances. Most notable is the
support for UTF-8 in files read by
In Erlang/OTP 17.0, the encoding default for Erlang source files was switched to UTF-8.
In Erlang/OTP 20.0, atoms and function names can contain Unicode characters. Module names, application names, and node names are still restricted to the ISO Latin-1 range.
Support was added for normalization forms in
This section outlines the current Unicode support and gives some recipes for working with Unicode data.
Experience with the Unicode support in Erlang has made it clear that understanding Unicode characters and encodings is not as easy as one would expect. The complexity of the field and the implications of the standard require thorough understanding of concepts rarely before thought of.
Also, the Erlang implementation requires understanding of concepts that were never an issue for many (Erlang) programmers. To understand and use Unicode characters requires that you study the subject thoroughly, even if you are an experienced programmer.
As an example, contemplate the issue of converting between upper and lower case letters. Reading the standard makes you realize that there is not a simple one-to-one mapping in all scripts, for example:
In German, the letter "ß" (sharp s) is in lower case, but the uppercase equivalent is "SS".
In Greek, the letter "Σ" has two different lowercase forms, "ς" in word-final position and "σ" elsewhere.
In Turkish, both dotted and dotless "i" exist in lower case and upper case forms.
Cyrillic "I" usually has no lowercase form.
Languages with no concept of upper case (or lower case).
So, a conversion function must know not only one character at a
time, but possibly the whole sentence, the natural language to
translate to, the differences in input and output string length,
and so on. Erlang/OTP has currently no Unicode
Another example is the accented characters, where the same glyph has two different representations. The Swedish letter "ö" is one example. The Unicode standard has a code point for it, but you can also write it as "o" followed by "U+0308" (Combining Diaeresis, with the simplified meaning that the last letter is to have "¨" above). Both forms have the same glyph, that is, the same user-perceived character. They are for most purposes the same, but have different representations. For example, MacOS X converts all filenames to use Combining Diaeresis, while most other programs (including Erlang) try to hide that by doing the opposite when, for example, listing directories. However it is done, it is usually important to normalize such characters to avoid confusion.
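The two representations of "ö" can be compared with a minimal sketch like the following. It assumes Erlang/OTP 20 or later, where the `unicode` module provides normalization functions; the module name `norm_demo` and its function names are hypothetical and only serve the illustration.

```erlang
-module(norm_demo).
-export([to_nfc/1, to_nfd/1, same_text/2]).

%% Precomposed "ö" is the single code point 16#F6. The decomposed form is
%% "o" (16#6F) followed by Combining Diaeresis (16#308).

%% Normalize character data to the composed form (NFC).
to_nfc(Chars) ->
    unicode:characters_to_nfc_list(Chars).

%% Normalize character data to the decomposed form (NFD).
to_nfd(Chars) ->
    unicode:characters_to_nfd_list(Chars).

%% Compare two strings after bringing both to the same normalization form.
same_text(A, B) ->
    to_nfc(A) =:= to_nfc(B).
```

A plain comparison of the two lists fails because the code points differ, while `norm_demo:same_text/2` treats them as equal.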
The list of examples can be made long. One needs a kind of knowledge that was not needed when programs only considered one or two languages. The complexity of human languages and scripts has certainly made this a challenge when constructing a universal standard. Supporting Unicode properly in your program will require effort.
Unicode is a standard defining code points (numbers) for all known, living or dead, scripts. In principle, every symbol used in any language has a Unicode code point. Unicode code points are defined and published by the Unicode Consortium, which is a non-profit organization.
Support for Unicode is increasing throughout the world of computing, as the benefits of one common character set are overwhelming when programs are used in a global environment. Along with the base of the standard, the code points for all the scripts, some encoding standards are available.
It is vital to understand the difference between encodings and Unicode characters. Unicode characters are code points according to the Unicode standard, while the encodings are ways to represent such code points. An encoding is only a standard for representation. UTF-8 can, for example, be used to represent a very limited part of the Unicode character set (for example ISO-Latin-1) or the full Unicode range. It is only an encoding format.
As long as all character sets were limited to 256 characters, each character could be stored in one single byte, so there was more or less only one practical encoding for the characters. Encoding each character in one byte was so common that the encoding was not even named. With the Unicode system there are many more than 256 characters, so a common way is needed to represent these. The common ways of representing the code points are the encodings. This means a whole new concept to the programmer, the concept of character representation, which was a non-issue earlier.
Different operating systems and tools support different encodings. For example, Linux and MacOS X have chosen the UTF-8 encoding, which is backward compatible with 7-bit ASCII and therefore affects programs written in plain English the least. Windows supports a limited version of UTF-16, namely all the code planes where the characters can be stored in one single 16-bit entity, which includes most living languages.
The following are the most widely spread encodings:
This is not a proper Unicode representation, but the representation
used for characters before the Unicode standard. It can still be used
to represent character code points in the Unicode standard with
numbers < 256, which exactly corresponds to the ISO Latin-1
character set. In Erlang, this is commonly denoted
Each character is stored in one to four bytes depending on code point. The encoding is backward compatible with bytewise representation of 7-bit ASCII, as all 7-bit characters are stored in one single byte in UTF-8. The characters beyond code point 127 are stored in more bytes, letting the most significant bit in the first character indicate a multi-byte character. For details on the encoding, the RFC is publicly available.
Notice that UTF-8 is not compatible with bytewise representation for code points from 128 through 255, so an ISO Latin-1 bytewise representation is generally incompatible with UTF-8.
This encoding has many similarities to UTF-8, but the basic unit is a 16-bit number. This means that all characters occupy at least two bytes, and some high numbers four bytes. Some programs, libraries, and operating systems claiming to use UTF-16 only allow for characters that can be stored in one 16-bit entity, which is usually sufficient to handle living languages. As the basic unit is more than one byte, byte-order issues occur, which is why UTF-16 exists in both a big-endian and a little-endian variant.
In Erlang, the full UTF-16 range is supported when applicable, like
in the
The most straightforward representation. Each character is stored in one single 32-bit number. There is no need for escapes or any variable number of entities for one character. All Unicode code points can be stored in one single 32-bit entity. As with UTF-16, there are byte-order issues. UTF-32 can be both big-endian and little-endian.
Basically the same as UTF-32, but without some Unicode semantics. It is defined by ISO/IEC 10646 and has little use as a separate encoding standard. For all normal (and possibly abnormal) use, UTF-32 and UCS-4 are interchangeable.
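The size differences between the encodings listed above can be seen by encoding single code points with the bit syntax. This is only a sketch; the module name `enc_demo` is hypothetical. UTF-16 and UTF-32 are shown in big-endian byte order, which is the bit syntax default.

```erlang
-module(enc_demo).
-export([encodings/1]).

%% Return the byte representation of one code point in the three
%% Unicode encodings. UTF-16 and UTF-32 use big-endian byte order here.
encodings(CodePoint) ->
    [{utf8,  <<CodePoint/utf8>>},
     {utf16, <<CodePoint/utf16>>},
     {utf32, <<CodePoint/utf32>>}].
```

For example, `enc_demo:encodings($A)` gives one byte in UTF-8, two in UTF-16, and four in UTF-32, while a code point above 16#FFFF such as 16#10437 needs four bytes in all three encodings.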
Certain number ranges are unused in the Unicode standard and certain ranges are even deemed invalid. The most notable invalid range is 16#D800-16#DFFF, as the UTF-16 encoding does not allow for encoding of these numbers. This is possibly because the UTF-16 encoding standard, from the beginning, was expected to be able to hold all Unicode characters in one 16-bit entity, but was then extended, leaving a hole in the Unicode range to handle backward compatibility.
Code point 16#FEFF is used for Byte Order Marks (BOMs) and use of that character is not encouraged in other contexts. It is valid though, as the character "ZWNBS" (Zero Width Non Breaking Space). BOMs are used to identify encodings and byte order for programs where such parameters are not known in advance. BOMs are more seldom used than expected, but can become more widely spread as they provide the means for programs to make educated guesses about the Unicode format of a certain file.
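The invalid range and the BOM handling described above can be explored with a small sketch. The module `bom_demo` and its function names are hypothetical; `unicode:characters_to_binary/1` and `unicode:bom_to_encoding/1` are existing functions in STDLIB.

```erlang
-module(bom_demo).
-export([valid_code_point/1, detect/1]).

%% True if the integer can be encoded as UTF-8. Code points in the
%% surrogate range 16#D800-16#DFFF and above 16#10FFFF make
%% unicode:characters_to_binary/1 return an error tuple instead of a binary.
valid_code_point(N) when is_integer(N), N >= 0 ->
    is_binary(unicode:characters_to_binary([N])).

%% Guess the encoding from the first bytes of some data. The result is
%% {Encoding, BomLength}; {latin1,0} means that no BOM was found.
detect(Bin) when is_binary(Bin) ->
    unicode:bom_to_encoding(Bin).
```

Here `bom_demo:detect(<<16#EF,16#BB,16#BF,"abc">>)` returns `{utf8,3}`, as the three first bytes are code point 16#FEFF encoded in UTF-8.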
To support Unicode in Erlang, problems in various areas have been addressed. Each area is described briefly in this section and more thoroughly later in this User's Guide.
To handle Unicode characters in Erlang, a common representation in both lists and binaries is needed. EEP 10 and the subsequent initial implementation in Erlang/OTP R13A settled a standard representation of Unicode characters in Erlang.
The Unicode characters need to be processed by the Erlang
program, which is why library functions must be able to handle
them. In some cases functionality has been added to already
existing interfaces (as the
I/O is by far the most problematic area for Unicode. A file is an entity where bytes are stored, and the lore of programming has been to treat characters and bytes as interchangeable. With Unicode characters, you must decide on an encoding when you want to store the data in a file. In Erlang, you can open a text file with an encoding option, so that you can read characters from it rather than bytes, but you can also open a file for bytewise I/O.
The Erlang I/O-system has been designed (or at least used) in a way where you expect any I/O server to handle any string data. That is, however, no longer the case when working with Unicode characters. The Erlang programmer must now know the capabilities of the device where the data ends up. Also, ports in Erlang are byte-oriented, so an arbitrary string of (Unicode) characters cannot be sent to a port without first converting it to an encoding of choice.
Terminal I/O is slightly easier than file I/O. The output is meant
for human reading and is usually Erlang syntax (for example, in the
shell). There exists syntactic representation of any Unicode
character without displaying the glyph (instead written as
Filenames can be stored as Unicode strings in different ways depending on the underlying operating system and file system. This can be handled fairly easily by a program. The problems arise when the file system is inconsistent in its encodings. For example, Linux allows files to be named with any sequence of bytes, leaving it to each program to interpret those bytes. On systems where these "transparent" filenames are used, Erlang must be informed about the filename encoding by a startup flag. The default is bytewise interpretation, which is usually wrong, but allows for interpretation of all filenames.
The concept of "raw filenames" can be used to handle wrongly encoded
filenames if one enables Unicode filename translation (
The Erlang source code has support for the UTF-8 encoding
and bytewise encoding. The default in Erlang/OTP R16B was bytewise
(
%% -*- coding: utf-8 -*-
This of course requires your editor to support UTF-8 as well. The
same comment is also interpreted by functions like
Having the source code in UTF-8 also allows you to write string
literals, function names, and atoms containing Unicode
characters with code points > 255.
Module names, application names, and node names are still restricted
to the ISO Latin-1 range. Binary literals, where you use type
EEP 40 suggests that the language is also to allow for Unicode characters > 255 in variable names. Whether to implement that EEP is yet to be decided.
In Erlang, strings are lists of integers. A string was until Erlang/OTP R13 defined to be encoded in the ISO Latin-1 (ISO 8859-1) character set, which is, code point by code point, a subrange of the Unicode character set.
The standard list encoding for strings was therefore easily extended to handle the whole Unicode range. A Unicode string in Erlang is a list containing integers, where each integer is a valid Unicode code point and represents one character in the Unicode character set.
Erlang strings in ISO Latin-1 are a subset of Unicode strings.
Only if a string contains code points < 256, can it be directly
converted to a binary by using, for example,
Binaries are more troublesome. For performance reasons, programs often
store textual data in binaries instead of lists, mainly because they are
more compact (one byte per character instead of two words per character,
as is the case with lists). Using
As the UTF-8 encoding is widely spread and provides some backward compatibility in the 7-bit ASCII range, it is selected as the standard encoding for Unicode characters in binaries for Erlang.
The standard binary encoding is used whenever a library function in Erlang is to handle Unicode data in binaries, but is of course not enforced when communicating externally. Functions and bit syntax exist to encode and decode UTF-8, UTF-16, and UTF-32 in binaries. However, library functions dealing with binaries and Unicode in general only deal with the default encoding.
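As a minimal sketch of moving between the list representation and the standard binary encoding, the following hypothetical module `conv_demo` wraps the conversion functions in the `unicode` module. The function names are chosen for the illustration only.

```erlang
-module(conv_demo).
-export([to_utf8/1, from_utf8/1, to_utf16le/1]).

%% Encode a Unicode string (a list of code points) as a UTF-8 binary,
%% which is the standard binary encoding.
to_utf8(Chars) ->
    unicode:characters_to_binary(Chars).

%% Decode a UTF-8 binary to a list of code points.
from_utf8(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin).

%% Encode for external use in another encoding, here little-endian UTF-16.
to_utf16le(Chars) ->
    unicode:characters_to_binary(Chars, unicode, {utf16, little}).
```

The Russian word "Юникод" has six characters, but `conv_demo:to_utf8/1` produces a binary of 12 bytes, as each of its characters needs two bytes in UTF-8. Calling `list_to_binary/1` on the same list fails with `badarg`, as the code points are larger than 255.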
Character data can be combined from many sources, sometimes available in
a mix of strings and binaries. Erlang has for long had the concept of
unicode_binary() = binary() with characters encoded in UTF-8 coding standard
chardata() = charlist() | unicode_binary()
charlist() = maybe_improper_list(char() | unicode_binary() | charlist(),
                                 unicode_binary() | nil())
The module
external_unicode_binary() = binary() with characters coded in a user-specified
                            Unicode encoding other than UTF-8 (UTF-16 or UTF-32)
external_chardata() = external_charlist() | external_unicode_binary()
external_charlist() = maybe_improper_list(char() | external_unicode_binary() |
                                          external_charlist(),
                                          external_unicode_binary() | nil())
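The mixed character data described by the types above can be flattened with `unicode:characters_to_list/1`. The following is a sketch only; the module `chardata_demo` and the shape of its return values are assumptions made for the illustration.

```erlang
-module(chardata_demo).
-export([flatten/1]).

%% Flatten chardata(), that is, possibly nested and improper lists of
%% code points and UTF-8 binaries, into a flat list of code points.
%% Invalid or truncated UTF-8 in a binary is reported together with the
%% part that was successfully converted.
flatten(CharData) ->
    case unicode:characters_to_list(CharData) of
        List when is_list(List) ->
            {ok, List};
        {error, Converted, _Rest} ->
            {error, Converted};
        {incomplete, Converted, _Rest} ->
            {incomplete, Converted}
    end.
```

For example, `chardata_demo:flatten([<<"abc">>, [229, <<195,164>>], 246])` returns `{ok,[97,98,99,229,228,246]}`: the integers are taken as code points and the binaries are decoded as UTF-8.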
The bit syntax contains types for handling binary data in the
three main encodings. The types are named
<<Ch/utf8,_/binary>> = Bin1,
<<Ch/utf16-little,_/binary>> = Bin2,
Bin3 = <<$H/utf32-little, $e/utf32-little, $l/utf32-little, $l/utf32-little,
$o/utf32-little>>,
For convenience, literal strings can be encoded with a Unicode encoding in binaries using the following (or similar) syntax:
Bin4 = <<"Hello"/utf16>>,
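The bit syntax types can also be used to walk through a binary character by character. The following sketch assumes a hypothetical module `bits_demo`; it is not part of Erlang/OTP.

```erlang
-module(bits_demo).
-export([code_points/1, encode_utf16le/1]).

%% Decode a UTF-8 binary one code point at a time using the utf8 type.
%% The match fails (function_clause) if the binary is not valid UTF-8.
code_points(<<Ch/utf8, Rest/binary>>) ->
    [Ch | code_points(Rest)];
code_points(<<>>) ->
    [].

%% Encode a list of code points as little-endian UTF-16 using a
%% binary comprehension.
encode_utf16le(Chars) ->
    << <<Ch/utf16-little>> || Ch <- Chars >>.
```

Here `bits_demo:code_points(<<195,165,195,164,195,182>>)` returns `[229,228,246]`, that is, the string "åäö".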
For source code, there is an extension to syntax
In the shell, if using a Unicode input device, or in source code
stored in UTF-8,
7> $с.
1089
In certain output functions and in the output of return values in the shell, Erlang tries to detect string data in lists and binaries heuristically. Typically you will see heuristic detection in a situation like this:
1> [97,98,99].
"abc"
2> <<97,98,99>>.
<<"abc">>
3> <<195,165,195,164,195,182>>.
<<"åäö"/utf8>>
Here the shell detects lists containing printable characters or binaries containing printable characters in bytewise or UTF-8 encoding. But what is a printable character? One view is that anything the Unicode standard thinks is printable, is also printable according to the heuristic detection. The result is then that almost any list of integers is deemed a string, and all sorts of characters are printed, maybe also characters that your terminal lacks in its font set (resulting in some unappreciated generic output). Another way is to keep it backward compatible so that only the ISO Latin-1 character set is used to detect a string. A third way is to let the user decide exactly which Unicode ranges are to be viewed as characters.
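The first two views of what is printable can be tried out with the `io_lib` functions `printable_latin1_list/1` and `printable_unicode_list/1`. The following sketch classifies a list accordingly; the module `printable_demo` and the returned atoms are hypothetical.

```erlang
-module(printable_demo).
-export([classify/1]).

%% Classify a list of integers according to the two fixed views of
%% what a printable character is.
classify(List) when is_list(List) ->
    case {io_lib:printable_latin1_list(List),
          io_lib:printable_unicode_list(List)} of
        {true, _} ->
            latin1_string;      % printable with only ISO Latin-1 characters
        {false, true} ->
            unicode_string;     % printable only if the whole Unicode range counts
        {false, false} ->
            not_a_string        % contains non-printable integers
    end.
```

With this sketch, `[229,228,246]` is classified as `latin1_string`, while the Cyrillic list `[1070,1085,1080,1082,1086,1076]` is classified as `unicode_string`.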
As from Erlang/OTP R16B you can select the ISO Latin-1 range or the
whole Unicode range by supplying startup flag
The following examples show the two startup options:
$ erl +pc latin1
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> [1024].
[1024]
2> [1070,1085,1080,1082,1086,1076].
[1070,1085,1080,1082,1086,1076]
3> [229,228,246].
"åäö"
4> <<208,174,208,189,208,184,208,186,208,190,208,180>>.
<<208,174,208,189,208,184,208,186,208,190,208,180>>
5> <<229/utf8,228/utf8,246/utf8>>.
<<"åäö"/utf8>>

$ erl +pc unicode
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> [1024].
"Ѐ"
2> [1070,1085,1080,1082,1086,1076].
"Юникод"
3> [229,228,246].
"åäö"
4> <<208,174,208,189,208,184,208,186,208,190,208,180>>.
<<"Юникод"/utf8>>
5> <<229/utf8,228/utf8,246/utf8>>.
<<"åäö"/utf8>>
In the examples, you can see that the default Erlang shell interprets
only characters from the ISO Latin1 range as printable and only detects
lists or binaries with those "printable" characters as containing
string data. The valid UTF-8 binary containing the Russian word
"Юникод", is not printed as a string. When started with all Unicode
characters printable (
These heuristics are also used by
$ erl +pc latin1
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> io:format("~tp~n",[{<<"åäö">>, <<"åäö"/utf8>>, <<208,174,208,189,208,184,208,186,208,190,208,180>>}]).
{<<"åäö">>,<<"åäö"/utf8>>,<<208,174,208,189,208,184,208,186,208,190,208,180>>}
ok

$ erl +pc unicode
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> io:format("~tp~n",[{<<"åäö">>, <<"åäö"/utf8>>, <<208,174,208,189,208,184,208,186,208,190,208,180>>}]).
{<<"åäö">>,<<"åäö"/utf8>>,<<"Юникод"/utf8>>}
ok
Notice that this only affects heuristic interpretation of
lists and binaries on output. For example, the
The interactive Erlang shell, when started to a terminal or started
using command
On Windows, proper operation requires that a suitable font is
installed and selected for the Erlang application to use. If no suitable
font is available on your system, try installing the
On Unix-like operating systems, the terminal is to be able to handle
UTF-8 on input and output (this is done by, for example, modern versions
of XTerm, KDE Konsole, and the Gnome terminal)
and your locale settings must be proper. As
an example, a
$ echo $LANG
en_US.UTF-8
Most systems handle variable
$ echo $LC_CTYPE
en_US.UTF-8
The
To investigate what Erlang thinks about the terminal, the call
$ LC_CTYPE=en_US.ISO-8859-1 erl
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> lists:keyfind(encoding, 1, io:getopts()).
{encoding,latin1}
2> q().
ok
$ LC_CTYPE=en_US.UTF-8 erl
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> lists:keyfind(encoding, 1, io:getopts()).
{encoding,unicode}
2>
When (finally?) everything is in order with the locale settings, fonts, and the terminal emulator, you have probably found a way to input characters in the script you desire. For testing, the simplest way is to add some keyboard mappings for other languages, usually done with some applet in your desktop environment.
In a KDE environment, select KDE Control Center (Personal Settings) > Regional and Accessibility > Keyboard Layout.
On Windows XP, select Control Panel > Regional and Language Options, select tab Language, and click button Details... in the square named Text Services and Input Languages.
Your environment probably provides similar means of changing the keyboard layout. Ensure that you have a way to switch back and forth between keyboards easily if you are not used to this. For example, entering commands using a Cyrillic character set is not easily done in the Erlang shell.
Now you are set up for some Unicode input and output. The simplest thing to do is to enter a string in the shell:
$ erl
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> lists:keyfind(encoding, 1, io:getopts()).
{encoding,unicode}
2> "Юникод".
"Юникод"
3> io:format("~ts~n", [v(2)]).
Юникод
ok
4>
While strings can be input as Unicode characters, the language elements are still limited to the ISO Latin-1 character set. Only character constants and strings are allowed to be beyond that range:
$ erl
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> $ξ.
958
2> Юникод.
* 1: illegal character
2>
Most modern operating systems support Unicode filenames in some way. There are many different ways to do this and Erlang by default treats the different approaches differently:
Windows and, for most common uses, MacOS X enforce Unicode support for filenames. All files created in the file system have names that can consistently be interpreted. In MacOS X, all filenames are retrieved in UTF-8 encoding. In Windows, each system call handling filenames has a special Unicode-aware variant, giving much the same effect. There are no filenames on these systems that are not Unicode filenames. So, the default behavior of the Erlang VM is to work in "Unicode filename translation mode". This means that a filename can be specified as a Unicode list, which is automatically translated to the proper name encoding for the underlying operating system and file system.
Doing, for example, a
Most Unix operating systems have adopted a simpler approach, namely that Unicode file naming is not enforced, but is a convention. Those systems usually use UTF-8 encoding for Unicode filenames, but do not enforce it. On such a system, a filename containing characters with code points from 128 through 255 can be named as plain ISO Latin-1 or use UTF-8 encoding. As no consistency is enforced, the Erlang VM cannot do consistent translation of all filenames.
By default on such systems, Erlang starts in
In
The Unicode file naming support was introduced in Erlang/OTP R14B01. A VM operating in Unicode filename translation mode can work with files having names in any language or character set (as long as it is supported by the underlying operating system and file system). The Unicode character list is used to denote filenames or directory names. If the file system content is listed, you also get Unicode lists as return value. The support lies in the Kernel and STDLIB modules, which is why most applications (that do not explicitly require the filenames to be in the ISO Latin-1 range) benefit from the Unicode support without change.
On operating systems with mandatory Unicode filenames, this means that
you more easily conform to the filenames of other (non-Erlang)
applications. You can also process filenames that, at least on Windows,
were inaccessible (because of having names that could not be represented
in ISO Latin-1). Also, you avoid creating incomprehensible filenames
on MacOS X, as the
For most systems, turning on Unicode filename translation is no problem even if it uses transparent file naming. Very few systems have mixed filename encodings. A consistent UTF-8 named system works perfectly in Unicode filename mode. It was still, however, considered experimental in Erlang/OTP R14B01 and is still not the default on such systems.
Unicode filename translation is turned on with switch
Notice that
In Unicode filename mode, filenames given to BIF
Notice that the file encoding options specified when opening a file have
nothing to do with the filename encoding convention. You can very well
open files containing data encoded in UTF-8, but having filenames in
bytewise (
Erlang drivers and NIF-shared objects still cannot be named with names containing code points > 127. This limitation will be removed in a future release. Erlang module names can contain such code points, but doing so is definitely not a good idea and is still considered experimental.
Raw filenames were introduced together with Unicode filename support in ERTS 5.8.2 (Erlang/OTP R14B01). The reason "raw filenames" were introduced in the system was to be able to represent filenames, specified in different encodings on the same system, consistently. It can seem practical to have the VM automatically translate a filename that is not in UTF-8 to a list of Unicode characters, but this would open up for both duplicate filenames and other inconsistent behavior.
Consider a directory containing a file named "björn" in ISO
Latin-1, while the Erlang VM is operating in Unicode filename mode (and
therefore expects UTF-8 file naming). The ISO Latin-1 name is not valid
UTF-8 and one can be tempted to think that automatic conversion in, for
example,
The
To force Unicode filename translation mode on systems where this is not
the default was considered experimental in Erlang/OTP R14B01. This was
because the initial implementation did not ignore wrongly encoded
filenames, so that raw filenames could spread unexpectedly throughout
the system. As from Erlang/OTP R16B, the wrongly encoded
filenames are only retrieved by special functions (such as
Even if you are operating without Unicode file naming translation automatically done by the VM, you can access and create files with names in UTF-8 encoding by using raw filenames encoded as UTF-8. Enforcing the UTF-8 encoding regardless of the mode the Erlang VM is started in can in some circumstances be a good idea, as the convention of using UTF-8 filenames is spreading.
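Raw filenames are represented as binaries, which makes them easy to tell apart from ordinary filenames. As a sketch, the following hypothetical module `rawnames_demo` uses `file:list_dir_all/1` (available as from Erlang/OTP R16B) to separate the names that could be interpreted from the raw ones.

```erlang
-module(rawnames_demo).
-export([split_names/1]).

%% List a directory and separate ordinary filenames (lists of
%% characters) from raw filenames (binaries), which are returned for
%% names that cannot be interpreted in the current filename mode.
split_names(Dir) ->
    case file:list_dir_all(Dir) of
        {ok, Names} ->
            {Raw, Decoded} = lists:partition(fun is_binary/1, Names),
            {ok, Decoded, Raw};
        {error, _Reason} = Error ->
            Error
    end.
```

On a system where all filenames are consistently encoded, the list of raw names is expected to be empty.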
The
MacOS X reorganizes the filenames so that the representation of
accents, and so on, uses the "combining characters". For example,
character
Environment variables and their interpretation are handled much in the same way as filenames. If Unicode filenames are enabled, environment variables as well as parameters to the Erlang VM are expected to be in Unicode.
If Unicode filenames are enabled, the calls to
On Unix-like operating systems, parameters are expected to be UTF-8 without translation if Unicode filenames are enabled.
Most of the modules in Erlang/OTP are Unicode-unaware in the sense that
they have no notion of Unicode and should not have. Typically they handle
non-textual or byte-oriented data (such as
Modules handling textual data (such as
Fortunately, most textual data has been stored in lists and range
checking has been sparse, so modules like
Some modules are, however, changed to be explicitly Unicode-aware. These modules include:
The
The
I/O-servers throughout the system can handle Unicode data and have
options for converting data upon output or input to/from the device.
As shown earlier, the
Reading and writing of files with Unicode data is, however, not best
done with the
The
The graphical library
The
That Erlang can handle Unicode data in many forms does not automatically mean that the content of any file can be Unicode text. The external entities, such as ports and I/O servers, are not generally Unicode capable.
Ports are always byte-oriented, so before sending data that you are not sure is bytewise-encoded to a port, ensure that you encode it in a proper Unicode encoding. Sometimes this means that only part of the data must be encoded as, for example, UTF-8. Some parts can be binary data (like a length indicator) or something else that must not undergo character encoding, so no automatic translation is present.
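The following sketch shows one way to encode only the textual part of a message before it is sent to a port. The message layout (a 32-bit big-endian length indicator followed by the text in UTF-8) and the module `frame_demo` are assumptions made for the illustration, not a format required by ports.

```erlang
-module(frame_demo).
-export([frame/1, unframe/1]).

%% Build a message consisting of a 32-bit length indicator followed by
%% the text encoded as UTF-8. Only the text undergoes character
%% encoding; the length indicator is plain binary data.
frame(Text) ->
    Payload = unicode:characters_to_binary(Text),
    true = is_binary(Payload),          % crash on invalid character data
    <<(byte_size(Payload)):32, Payload/binary>>.

%% Split one message off the front of a binary and decode its text.
unframe(<<Len:32, Payload:Len/binary, Rest/binary>>) ->
    {unicode:characters_to_list(Payload), Rest}.
```

The binary returned by `frame_demo:frame/1` can then be sent with, for example, `port_command/2`. Notice that the length indicator counts bytes, not characters.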
I/O servers behave a little differently. The I/O servers connected to
terminals (or
A file can have an encoding option that makes it generally usable by the
Recommendations:
Use the
Use the
Functions reading Erlang syntax from files recognize the
$ erl +fna +pc unicode
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> file:write_file("test.term",<<"%% coding: utf-8\n[{\"Юникод\",4711}].\n"/utf8>>).
ok
2> file:consult("test.term").
{ok,[[{"Юникод",4711}]]}
The Unicode support is controlled by command-line switches, some standard environment variables, and the OTP version you are using. Most options affect mainly how Unicode data is displayed, not the functionality of the APIs in the standard libraries. This means that Erlang programs usually do not need to concern themselves with these options; they are more for the development environment. An Erlang program can be written so that it works well regardless of the type of system or the Unicode options that are in effect.
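The options in effect can be inspected from a running system. The following is a minimal sketch that collects them in one list; the module `unicode_settings` and the keys in the returned list are hypothetical, while the functions called are existing functions in STDLIB and Kernel.

```erlang
-module(unicode_settings).
-export([report/0]).

%% Collect the Unicode-related settings of the running system.
report() ->
    TerminalEncoding =
        case io:getopts() of
            Opts when is_list(Opts) ->
                proplists:get_value(encoding, Opts);
            _Error ->
                undefined
        end,
    [{terminal_encoding, TerminalEncoding},               % latin1 | unicode
     {printable_range, io:printable_range()},             % latin1 | unicode
     {filename_encoding, file:native_name_encoding()},    % latin1 | utf8
     {source_default_encoding, epp:default_encoding()}].  % latin1 | utf8
```

Calling `unicode_settings:report()` in shells started with different flags and locale settings shows how the values change.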
Here follows a summary of the settings affecting Unicode:
The language setting in the operating system mainly affects the
shell. The terminal (that is, the group leader) operates with
The environment can also affect filename interpretation, if Erlang
is started with flag
You can check the setting of this by calling
This flag affects what is interpreted as string data when doing
heuristic string detection in the shell and in
You can check this option by calling
This flag affects how the filenames are to be interpreted. On operating systems with transparent file naming, this must be specified to allow for file naming in Unicode characters (and for correct interpretation of filenames containing characters > 255).
The filename translation mode can be read with function
This function returns the default encoding for Erlang source files
(if no encoding comment is present) in the currently running release.
In Erlang/OTP R16B,
The encoding of each file can be specified using comments as
described in the
When Erlang is started with
You can set the encoding of a file or other I/O server with function
Opening files with option
You can retrieve the
When starting with Unicode, one often stumbles over some common issues. This section describes some methods of dealing with Unicode data.
A common method of identifying encoding in text files is to put a Byte
Order Mark (BOM) first in the file. The BOM is the code point 16#FEFF
encoded in the same way as the remaining file. If such a file is to be
read, the first few bytes (depending on encoding) are not part of the
text. This code outlines how to open a file that is believed to
have a BOM, and sets the file's encoding and position for further
sequential reading (preferably using the
Notice that error handling is omitted from the code:
open_bom_file_for_reading(File) ->
    {ok,F} = file:open(File,[read,binary]),
    {ok,Bin} = file:read(F,4),
    {Type,Bytes} = unicode:bom_to_encoding(Bin),
    file:position(F,Bytes),
    io:setopts(F,[{encoding,Type}]),
    {ok,F}.
Function
To open a file for writing and place the BOM first is even simpler:
open_bom_file_for_writing(File,Encoding) ->
    {ok,F} = file:open(File,[write,binary]),
    ok = file:write(F,unicode:encoding_to_bom(Encoding)),
    io:setopts(F,[{encoding,Encoding}]),
    {ok,F}.
The file is in both these cases then best processed using the
When reading and writing to Unicode-aware entities, like a
file opened for Unicode translation, you probably want to format text
strings using the functions in the
1> io:format("~ts~n",[<<"åäö"/utf8>>]).
åäö
ok
2> io:format("~s~n",[<<"åäö"/utf8>>]).
Ã¥Ã¤Ã¶
ok
Clearly, the second
As long as the data is always lists, modifier
Function
$ erl +pc unicode
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> io_lib:format("~ts~n", ["Γιούνικοντ"]).
["Γιούνικοντ","\n"]
2> io:put_chars(io_lib:format("~ts~n", ["Γιούνικοντ"])).
Γιούνικοντ
ok
The Unicode string is returned as a Unicode list, which is recognized
as such, as the Erlang shell uses the Unicode encoding (and is started
with all Unicode characters considered printable). The Unicode list is
valid input to function
So, you can always send Unicode data to the
While it is strongly encouraged that the encoding of characters in binary data is known before processing, that is not always possible. On a typical Linux system, there is a mix of UTF-8 and ISO Latin-1 text files, and there are seldom any BOMs in the files to identify them.
UTF-8 is designed so that ISO Latin-1 characters with numbers beyond
the 7-bit ASCII range are seldom considered valid when decoded as UTF-8.
Therefore one can usually use heuristics to determine if a file is in
UTF-8 or if it is encoded in ISO Latin-1 (one byte per character).
The
heuristic_encoding_bin(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin,utf8,utf8) of
        Bin ->
            utf8;
        _ ->
            latin1
    end.
If you do not have a complete binary of the file content, you can
instead chunk through the file and check part by part. The return-tuple
heuristic_encoding_file(FileName) ->
    {ok,F} = file:open(FileName,[read,binary]),
    loop_through_file(F,<<>>,file:read(F,1024)).

loop_through_file(_,<<>>,eof) ->
    utf8;
loop_through_file(_,_,eof) ->
    latin1;
loop_through_file(F,Acc,{ok,Bin}) when is_binary(Bin) ->
    case unicode:characters_to_binary([Acc,Bin]) of
        {error,_,_} ->
            latin1;
        {incomplete,_,Rest} ->
            loop_through_file(F,Rest,file:read(F,1024));
        Res when is_binary(Res) ->
            loop_through_file(F,<<>>,file:read(F,1024))
    end.
Another option is to try to read the whole file in UTF-8 encoding and
see if it fails. Here we need to read the file using function
heuristic_encoding_file2(FileName) ->
    {ok,F} = file:open(FileName,[read,binary,{encoding,utf8}]),
    loop_through_file2(F,io:get_chars(F,'',1024)).

loop_through_file2(_,eof) ->
    utf8;
loop_through_file2(_,{error,_Err}) ->
    latin1;
loop_through_file2(F,Bin) when is_binary(Bin) ->
    loop_through_file2(F,io:get_chars(F,'',1024)).
For various reasons, you can sometimes have a list of UTF-8 bytes. This is not a regular string of Unicode characters, as each list element does not contain one character. Instead you get the "raw" UTF-8 encoding that you have in binaries. This is easily converted to a proper Unicode string by first converting it, byte by byte, into a binary, and then converting the binary of UTF-8 encoded characters back to a Unicode string:
utf8_list_to_string(StrangeList) ->
    unicode:characters_to_list(list_to_binary(StrangeList)).
When working with binaries, you can get the horrible "double UTF-8
encoding", where strange characters are encoded in your binaries or
files. In other words, you can get a UTF-8 encoded binary that for the
second time is encoded as UTF-8. A common situation is where you read a
file, byte by byte, but the content is already UTF-8. If you then
convert the bytes to UTF-8, using, for example, the
By far the most common situation where this occurs is when you get lists of UTF-8 bytes instead of proper Unicode strings, and then convert them to UTF-8 in a binary or in a file:
wrong_thing_to_do() ->
    {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"),
    MyList = binary_to_list(Bin), %% Wrong! It is an utf8 binary!
    {ok,C} = file:open("catastrophe.txt",[write,{encoding,utf8}]),
    io:put_chars(C,MyList), %% Expects a Unicode string, but gets UTF-8
                            %% bytes in a list!
    file:close(C). %% The file catastrophe.txt contains more or less unreadable
                   %% garbage!
Ensure you know what a binary contains before converting it to a string. If no other option exists, try heuristics:
if_you_can_not_know() ->
    {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"),
    MyList = case unicode:characters_to_list(Bin) of
                 L when is_list(L) ->
                     L;
                 _ ->
                     binary_to_list(Bin) %% The file was bytewise encoded
             end,
    %% Now we know that the list is a Unicode string, not a list of UTF-8 bytes
    {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]),
    io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets!
    file:close(G). %% The file contains valid UTF-8 encoded Unicode characters!
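To see what double UTF-8 encoding does to the bytes, and how it can sometimes be undone, consider the following sketch. The module `double_utf8_demo` is hypothetical, and the repair only works under the assumption that the data was encoded exactly twice.

```erlang
-module(double_utf8_demo).
-export([double_encode/1, repair/1]).

%% Reproduce the mistake: treat each byte of a UTF-8 binary as a
%% character and encode the resulting list as UTF-8 once more.
double_encode(Utf8Bin) when is_binary(Utf8Bin) ->
    unicode:characters_to_binary(binary_to_list(Utf8Bin)).

%% Undo one level of encoding: decode the UTF-8 and store each resulting
%% code point as one byte (latin1). The result is the original UTF-8
%% binary if the input was double encoded. An error tuple is returned
%% instead if any decoded code point is larger than 255.
repair(DoubleBin) when is_binary(DoubleBin) ->
    unicode:characters_to_binary(DoubleBin, utf8, latin1).
```

The letter "å" is `<<195,165>>` in UTF-8. After `double_utf8_demo:double_encode/1` it becomes `<<195,131,194,165>>`, which `double_utf8_demo:repair/1` turns back into `<<195,165>>`.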