Implementing support for Unicode character sets is an ongoing process. The Erlang Enhancement Proposal (EEP) 10 outlined the basics of Unicode support and also specified a default encoding in binaries that all Unicode-aware modules should handle in the future.
The functionality described in EEP10 was implemented in Erlang/OTP
R13A, but that was by no means the end of it. In Erlang/OTP R14B01 support
for Unicode file names was added, although it was in no way complete
and was by default disabled on platforms where no guarantee was given
for the file name encoding. With Erlang/OTP R16A came support for UTF-8 encoded
source code, along with enhancements to many of the applications to
support both Unicode encoded file names and UTF-8
encoded files in several circumstances. Most notable is the support
for UTF-8 in files read by
This guide outlines the current Unicode support and gives a couple of recipes for working with Unicode data.
Experience with the Unicode support in Erlang has made it painfully clear that understanding Unicode characters and encodings is not as easy as one would expect. The complexity of the field as well as the implications of the standard requires thorough understanding of concepts rarely before thought of.
Furthermore the Erlang implementation requires understanding of concepts that never were an issue for many (Erlang) programmers. To understand and use Unicode characters requires that you study the subject thoroughly, even if you're an experienced programmer.
As an example, one could contemplate the issue of converting between upper and lower case letters. Reading the standard will make you realize that, to begin with, there's not a simple one to one mapping in all scripts. Take German as an example, where there's a letter "ß" (Sharp s) in lower case, but the uppercase equivalent is "SS". Or Greek, where "Σ" has two different lowercase forms: "ς" in word-final position and "σ" elsewhere. Or Turkish where dotted and dot-less "i" both exist in lower case and upper case forms, or Cyrillic "I" which usually has no lowercase form. Or of course languages that have no concept of upper case (or lower case). So, a conversion function will need to know not only one character at a time, but possibly the whole sentence, maybe the natural language the translation should be in and also take into account differences in input and output string length and so on. There is at the time of writing no Unicode to_upper/to_lower functionality in Erlang/OTP, but there are publicly available libraries that address these issues.
Another example is the accented characters where the same glyph has two different representations. Let's look at the Swedish "ö". There's a code point for that in the Unicode standard, but you can also write it as "o" followed by U+0308 (Combining Diaeresis, with the simplified meaning that the last letter should have a "¨" above). They have exactly the same glyph. They are for most purposes the same, but they have completely different representations. For example MacOS X converts all file names to use Combining Diaeresis, while most other programs (including Erlang) try to hide that by doing the opposite when for example listing directories. However it's done, it's usually important to normalize such characters to avoid utter confusion.
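The two representations of "ö" can be compared after normalization. The following is a minimal sketch, assuming an OTP release (20 or later) that provides `unicode:characters_to_nfc_list/1`, which is newer than the releases discussed in this guide. The module and function names are made up for illustration.

```erlang
-module(normalize_sketch).
-export([same_text/2]).

%% Hypothetical helper: compare two strings after normalizing both to
%% NFC, so that a precomposed "ö" (code point 246) compares equal to
%% "o" followed by U+0308 (Combining Diaeresis).
%% Assumes both arguments are valid Unicode character data.
same_text(A, B) ->
    unicode:characters_to_nfc_list(A) =:= unicode:characters_to_nfc_list(B).
```

Without normalization, `[246] =:= [$o, 16#308]` is `false`, although both lists render as the same glyph.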
The list of examples can be made as long as the Unicode standard, I suspect. The point is that one needs a kind of knowledge that was never needed when programs only took one or two languages into account. The complexity of human languages and scripts has certainly made this a challenge when constructing a universal standard. Supporting Unicode properly in your program will require effort.
Unicode is a standard defining code points (numbers) for all known, living or dead, scripts. In principle, every known symbol used in any language has a Unicode code point.
Unicode code points are defined and published by the Unicode Consortium, which is a non-profit organization.
Support for Unicode is increasing throughout the world of computing, as the benefits of one common character set are overwhelming when programs are used in a global environment.
Along with the base of the standard: the code points for all the scripts, there are a couple of encoding standards available.
It is vital to understand the difference between encodings and Unicode characters. Unicode characters are code points according to the Unicode standard, while the encodings are ways to represent such code points. An encoding is just a standard for representation. UTF-8 can, for example, be used to represent a very limited part of the Unicode character set (e.g. ISO-Latin-1), or the full Unicode range. It's just an encoding format.
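To make the distinction concrete, the sketch below encodes one and the same code point in several encodings using `unicode:characters_to_binary/3`. The module and function names are hypothetical and only serve the illustration.

```erlang
-module(encoding_sketch).
-export([encodings_of/1]).

%% Hypothetical helper: show how a single code point is represented
%% in different encodings. The code point itself never changes, only
%% the bytes used to store it.
encodings_of(CodePoint) ->
    [{Enc, unicode:characters_to_binary([CodePoint], unicode, Enc)}
     || Enc <- [latin1, utf8, utf16, utf32]].
```

For the code point 246 ("ö") this gives `<<246>>` in ISO-Latin-1, `<<195,182>>` in UTF-8, `<<0,246>>` in (big-endian) UTF-16 and `<<0,0,0,246>>` in UTF-32. For a code point beyond 255, the `latin1` entry is an error tuple, as that encoding cannot represent the character.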
As long as all character sets were limited to 256 characters, each character could be stored in one single byte, so there was more or less only one practical encoding for the characters. Encoding each character in one byte was so common that the encoding wasn't even named. When we now, with the Unicode system, have a lot more than 256 characters, we need a common way to represent these. The common ways of representing the code points are the encodings. This means a whole new concept to the programmer, the concept of character representation, which was before a non-issue.
Different operating systems and tools support different encodings. For example, Linux and MacOS X have chosen the UTF-8 encoding, which is backwards compatible with 7-bit ASCII and therefore affects programs written in plain English the least. Windows, on the other hand, supports a limited version of UTF-16, namely all the code planes where the characters can be stored in one single 16-bit entity, which includes most living languages.
The most widely spread encodings are:
Certain ranges of numbers are left unused in the Unicode standard and certain ranges are even deemed invalid. The most notable invalid range is 16#D800 - 16#DFFF, as the UTF-16 encoding does not allow for encoding of these numbers. It can be speculated that the UTF-16 encoding standard was, from the beginning, expected to be able to hold all Unicode characters in one 16-bit entity, but then had to be extended, leaving a hole in the Unicode range to cope with backward compatibility.
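The invalid range can be observed from Erlang. This is a small sketch, with hypothetical module and function names, that relies on `unicode:characters_to_binary/1` returning an error tuple for code points that cannot be encoded.

```erlang
-module(surrogate_sketch).
-export([encodable/1]).

%% Hypothetical helper: true if the integer can be encoded as UTF-8
%% by the unicode module, false otherwise (for example for the
%% range 16#D800 - 16#DFFF).
encodable(CodePoint) when is_integer(CodePoint) ->
    case unicode:characters_to_binary([CodePoint]) of
        Bin when is_binary(Bin) -> true;
        {error, _Encoded, _Rest} -> false
    end.
```

The code points just outside the hole, 16#D7FF and 16#E000, are accepted, while 16#D800 and 16#DFFF are rejected.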
Additionally, the code point 16#FEFF is used for byte order marks (BOMs) and use of that character is not encouraged in other contexts. It is actually valid though, as the character "ZWNBS" (Zero Width Non Breaking Space). BOMs are used to identify encodings and byte order for programs where such parameters are not known in advance. Byte order marks are used less often than one could expect, but their use might become more widespread as they provide the means for programs to make educated guesses about the Unicode format of a certain file.
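As an illustration of how a BOM identifies an encoding, the following sketch uses `unicode:bom_to_encoding/1` to split a binary into the detected encoding and the remaining text. The module and function names are invented for this example.

```erlang
-module(bom_sketch).
-export([detect/1]).

%% Hypothetical helper: return {Encoding, Text}, where Text is Bin
%% with any leading BOM removed. If no BOM is found,
%% unicode:bom_to_encoding/1 reports {latin1,0} and Bin is returned
%% unchanged.
detect(Bin) when is_binary(Bin) ->
    {Encoding, BomLength} = unicode:bom_to_encoding(Bin),
    <<_Bom:BomLength/binary, Text/binary>> = Bin,
    {Encoding, Text}.
```

Note that the `latin1` answer only means "no BOM found"; it says nothing certain about the actual encoding of the data.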
To support Unicode in Erlang, problems in several areas have been addressed. Each area is described briefly in this section and more thoroughly further down in this document:
%% -*- coding: utf-8 -*-
in the beginning of the file. This of course requires your editor to
support UTF-8 as well. The same comment is also interpreted by
functions like
In Erlang, strings are actually lists of integers. A string was, up until Erlang/OTP R13, defined to be encoded in the ISO-latin-1 (ISO8859-1) character set, which is, code point by code point, a sub-range of the Unicode character set.
The standard list encoding for strings was therefore easily extended to cope with the whole Unicode range: A Unicode string in Erlang is simply a list containing integers, each integer being a valid Unicode code point and representing one character in the Unicode character set.
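A short sketch of what this means in practice: a string literal is the same term as the list of its code points, and whether a string stays within the ISO-latin-1 range can be checked element by element. The module and function names below are hypothetical.

```erlang
-module(string_sketch).
-export([is_latin1_string/1]).

%% Hypothetical helper: true if every element of the list is a code
%% point in the ISO-latin-1 range (0..255), that is, if the string
%% can be stored with one byte per character without any encoding.
is_latin1_string(String) when is_list(String) ->
    lists:all(fun(C) -> is_integer(C) andalso C >= 0 andalso C < 256 end,
              String).
```

For example, `"abc" =:= [97,98,99]` is `true`, `is_latin1_string([229,228,246])` is `true`, and `is_latin1_string([1070,1085,1080,1082,1086,1076])` is `false` although the list is a perfectly valid Unicode string of six characters.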
Erlang strings in ISO-latin-1 are a subset of Unicode strings.
Only if a string contains code points < 256, can it be
directly converted to a binary by using
i.e.
Binaries are more troublesome. For performance reasons, programs
often store textual data in binaries instead of lists, mainly
because they are more compact (one byte per character instead of two
words per character, as is the case with lists). Using
As the UTF-8 encoding is widely spread and provides some backward compatibility in the 7-bit ASCII range, it is selected as the standard encoding for Unicode characters in binaries for Erlang.
The standard binary encoding is used whenever a library function in Erlang should cope with Unicode data in binaries, but is of course not enforced when communicating externally. Functions and bit-syntax exist to encode and decode UTF-8, UTF-16 and UTF-32 in binaries. Library functions dealing with binaries and Unicode in general, however, only deal with the default encoding.
Character data may be combined from several sources, sometimes
available in a mix of strings and binaries. Erlang has for long had
the concept of
unicode_binary() = binary() with characters encoded in UTF-8 coding standard
chardata() = charlist() | unicode_binary()
charlist() = maybe_improper_list(char() | unicode_binary() | charlist(),
unicode_binary() | nil())
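The types above can be exercised with `unicode:characters_to_list/1`, which accepts such mixed character data and returns a flat Unicode string. The sketch below wraps the call to make the three possible kinds of return values explicit; the module and function names are assumptions made for the example.

```erlang
-module(chardata_sketch).
-export([flatten/1]).

%% Hypothetical helper: convert chardata() (lists of code points,
%% UTF-8 binaries and nested lists of these) to a flat list of code
%% points. Errors and incomplete UTF-8 sequences are passed through
%% as returned by the unicode module.
flatten(CharData) ->
    case unicode:characters_to_list(CharData) of
        List when is_list(List) -> {ok, List};
        {error, _Converted, _Rest} = Error -> Error;
        {incomplete, _Converted, _Rest} = Incomplete -> Incomplete
    end.
```

Here `flatten(["abc", <<195,165>>, [1070]])` yields `{ok,[97,98,99,229,1070]}`: the binary part is decoded as UTF-8, while the integers in the lists are taken as code points.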
The module
external_unicode_binary() = binary() with characters coded in
a user specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32)
external_chardata() = external_charlist() | external_unicode_binary()
external_charlist() = maybe_improper_list(char() |
external_unicode_binary() |
external_charlist(),
external_unicode_binary() | nil())
The bit-syntax contains types for coping with binary data in the
three main encodings. The types are named
<<Ch/utf8,_/binary>> = Bin1,
<<Ch/utf16-little,_/binary>> = Bin2,
Bin3 = <<$H/utf32-little, $e/utf32-little, $l/utf32-little, $l/utf32-little,
$o/utf32-little>>,
For convenience, literal strings can be encoded with a Unicode encoding in binaries using the following (or similar) syntax:
Bin4 = <<"Hello"/utf16>>,
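As a slightly larger sketch of the same bit-syntax types, the function below walks through a UTF-8 binary one character at a time by matching on the `utf8` type. The module and function names are hypothetical.

```erlang
-module(utf8_bits_sketch).
-export([code_points/1]).

%% Hypothetical helper: decode a UTF-8 binary into a list of code
%% points using the utf8 bit-syntax type. Stops at the first byte
%% sequence that is not valid UTF-8.
code_points(Bin) when is_binary(Bin) ->
    code_points(Bin, []).

code_points(<<Ch/utf8, Rest/binary>>, Acc) ->
    code_points(Rest, [Ch | Acc]);
code_points(<<>>, Acc) ->
    {ok, lists:reverse(Acc)};
code_points(Invalid, Acc) ->
    {error, lists:reverse(Acc), Invalid}.
```

In real code, `unicode:characters_to_list/1` does the same job; the sketch only shows how the `utf8` type consumes a variable number of bytes per character.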
For source code, there is an extension to the
In the shell, if using a Unicode input device, or in source
code stored in UTF-8,
7> $с.
1089
In certain output functions and in the output of return values in the shell, Erlang tries to heuristically detect string data in lists and binaries. Typically you will see heuristic detection in a situation like this:
1> [97,98,99].
"abc"
2> <<97,98,99>>.
<<"abc">>
3> <<195,165,195,164,195,182>>.
<<"åäö"/utf8>>
Here the shell will detect lists containing printable
characters or binaries containing printable characters either in
bytewise or UTF-8 encoding. The question here is: what is a
printable character? One view would be that anything the Unicode
standard thinks is printable, will also be printable according to
the heuristic detection. The result would be that almost any list
of integers will be deemed a string, resulting in all sorts of
characters being printed, maybe even characters your terminal does
not have in its font set (resulting in some generic output you
probably will not appreciate). Another way is to keep it backwards
compatible so that only the ISO-Latin-1 character set is used to
detect a string. A third way would be to let the user decide
exactly what Unicode ranges are to be viewed as characters. Since
Erlang/OTP R16B you can select either the whole Unicode range or the
ISO-Latin-1 range by supplying the startup flag
Let's look at an example with the two different startup options:
$ erl +pc latin1
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> [1024].
[1024]
2> [1070,1085,1080,1082,1086,1076].
[1070,1085,1080,1082,1086,1076]
3> [229,228,246].
"åäö"
4> <<208,174,208,189,208,184,208,186,208,190,208,180>>.
<<208,174,208,189,208,184,208,186,208,190,208,180>>
5> <<229/utf8,228/utf8,246/utf8>>.
<<"åäö"/utf8>>

$ erl +pc unicode
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> [1024].
"Ѐ"
2> [1070,1085,1080,1082,1086,1076].
"Юникод"
3> [229,228,246].
"åäö"
4> <<208,174,208,189,208,184,208,186,208,190,208,180>>.
<<"Юникод"/utf8>>
5> <<229/utf8,228/utf8,246/utf8>>.
<<"åäö"/utf8>>
In the examples, we can see that the default Erlang shell will
only interpret characters from the ISO-Latin1 range as printable
and will only detect lists or binaries with those "printable"
characters as containing string data. The valid UTF-8 binary
containing "Юникод", will not be printed as a string. When, on the
other hand, started with all Unicode characters printable (
These heuristics are also used by
$ erl +pc latin1
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> io:format("~tp~n",[{<<"åäö">>, <<"åäö"/utf8>>, <<208,174,208,189,208,184,208,186,208,190,208,180>>}]).
{<<"åäö">>,<<"åäö"/utf8>>,<<208,174,208,189,208,184,208,186,208,190,208,180>>}
ok

$ erl +pc unicode
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> io:format("~tp~n",[{<<"åäö">>, <<"åäö"/utf8>>, <<208,174,208,189,208,184,208,186,208,190,208,180>>}]).
{<<"åäö">>,<<"åäö"/utf8>>,<<"Юникод"/utf8>>}
ok
Please observe that this only affects heuristic interpretation
of lists and binaries on output. For example the
The interactive Erlang shell, when started towards a terminal or
started using the
On Windows, proper operation requires that a suitable font
is installed and selected for the Erlang application to use. If no
suitable font is available on your system, try installing the DejaVu
fonts (
On Unix-like operating systems, the terminal should be able
to handle UTF-8 on input and output (modern versions of XTerm, KDE
konsole and the Gnome terminal do for example) and your locale
settings have to be proper. As an example, my
$ echo $LANG
en_US.UTF-8
Actually, most systems handle the
$ echo $LC_CTYPE
en_US.UTF-8
The
To investigate what Erlang thinks about the terminal, the
$ LC_CTYPE=en_US.ISO-8859-1 erl
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> lists:keyfind(encoding, 1, io:getopts()).
{encoding,latin1}
2> q().
ok
$ LC_CTYPE=en_US.UTF-8 erl
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> lists:keyfind(encoding, 1, io:getopts()).
{encoding,unicode}
2>
When (finally?) everything is in order with the locale settings, fonts and the terminal emulator, you have probably also discovered a way to input characters in the script you desire. For testing, the simplest way is to add some keyboard mappings for other languages, usually done with some applet in your desktop environment. In my KDE environment, I start the KDE Control Center (Personal Settings), select "Regional and Accessibility" and then "Keyboard Layout". On Windows XP, I start Control Panel->Regional and Language Options, select the Language tab and click the Details... button in the square named "Text services and input Languages". Your environment probably provides similar means of changing the keyboard layout. Make sure you have a way to easily switch back and forth between keyboards if you are not used to this; entering commands using a Cyrillic character set is, as an example, not easily done in the Erlang shell.
Now you are set up for some Unicode input and output. The simplest thing to do is of course to enter a string in the shell:
$ erl
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> lists:keyfind(encoding, 1, io:getopts()).
{encoding,unicode}
2> "Юникод".
"Юникод"
3> io:format("~ts~n", [v(2)]).
Юникод
ok
4>
While strings can be input as Unicode characters, the language elements are still limited to the ISO-latin-1 character set. Only character constants and strings are allowed to be beyond that range:
$ erl
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> $ξ.
958
2> Юникод.
* 1: illegal character
2>
Most modern operating systems support Unicode file names in some way or another. There are several different ways to do this and Erlang by default treats the different approaches differently:
Windows and, for most common uses, MacOS X enforce Unicode support for file names. All files created in the file system have names that can consistently be interpreted. In MacOS X, all file names are retrieved in UTF-8 encoding, while Windows has selected an approach where each system call handling file names has a special Unicode aware variant, giving much the same effect. There are no file names on these systems that are not Unicode file names, so the default behavior of the Erlang VM is to work in "Unicode file name translation mode", meaning that a file name can be given as a Unicode list, which is automatically translated to the proper name encoding for the underlying operating system and file system.
Doing i.e. a
As the feature is fairly new, you may still stumble upon non-core applications that cannot handle being provided with file names containing characters with code points larger than 255, but the core Erlang system should have no problems with Unicode file names.
Most Unix operating systems have adopted a simpler approach, namely that Unicode file naming is not enforced, but followed by convention. Those systems usually use UTF-8 encoding for Unicode file names, but do not enforce it. On such a system, a file name containing characters having code points between 128 and 255 may be named either as plain ISO-latin-1 or using UTF-8 encoding. As no consistency is enforced, the Erlang VM can do no consistent translation of all file names.
By default on such systems, Erlang starts in
In the
The Unicode file naming support was introduced with Erlang/OTP R14B01. A VM operating in Unicode file name translation mode can work with files having names in any language or character set (as long as it is supported by the underlying OS and file system). The Unicode character list is used to denote file or directory names and if the file system content is listed, you will also get Unicode lists as return value. The support lies in the Kernel and STDLIB modules, so most applications (those that do not explicitly require the file names to be in the ISO-latin-1 range) will benefit from the Unicode support without change.
On operating systems with mandatory Unicode file names, this means that you more easily conform to the file names of other (non Erlang) applications, and you can also process file names that, at least on Windows, were completely inaccessible (due to having names that could not be represented in ISO-latin-1). Also you will avoid creating incomprehensible file names on MacOS X as the vfs layer of the OS will accept all your file names as UTF-8 and will not rewrite them.
For most systems, turning on Unicode file name translation is no
problem even if it uses transparent file naming. Very few systems
have mixed file name encodings. A consistent UTF-8 named system will
work perfectly in Unicode file name mode. It was still however
considered experimental in Erlang/OTP R14B01 and is still not the default on
such systems. Unicode file name translation is turned on with the
In Unicode file name mode, file names given to the BIF
It is worth noting that the file
Erlang drivers and NIF shared objects still cannot be given names containing code points beyond 127. This is a known limitation to be removed in a future release. Erlang modules, however, can, but it is definitely not a good idea and is still considered experimental.
Raw file names were introduced together with Unicode file name
support in erts-5.8.2 (Erlang/OTP R14B01). The reason "raw file
names" were introduced in the system was to be able to
consistently represent file names given in different encodings on
the same system. Having the VM automatically translate a file name
that is not in UTF-8 to a list of Unicode characters might seem
practical, but this would open up for both duplicate file names and
other inconsistent behavior. Consider a directory containing a file
named "björn" in ISO-latin-1, while the Erlang VM is
operating in Unicode file name mode (and therefore expecting UTF-8
file naming). The ISO-latin-1 name is not valid UTF-8 and one could
be tempted to think that automatic conversion in for example
The Erlang
To force Unicode file name translation mode on systems where this
is not the default was considered experimental in Erlang/OTP R14B01 due to
the fact that the initial implementation did not ignore wrongly
encoded file names, so that raw file names could spread unexpectedly
throughout the system. Beginning with Erlang/OTP R16B, the wrongly encoded file
names are only retrieved by special functions
(e.g.
Even if you are operating without Unicode file name translation automatically done by the VM, you can access and create files with names in UTF-8 encoding by using raw file names encoded as UTF-8. Enforcing the UTF-8 encoding regardless of the mode the Erlang VM is started in might, in some circumstances, be a good idea, as the convention of using UTF-8 file names is spreading.
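A minimal sketch of this approach follows, assuming that a raw file name is given to the `file` module as a binary. The module and function names are made up for the example.

```erlang
-module(raw_name_sketch).
-export([utf8_raw_name/1, write_with_utf8_name/2]).

%% Hypothetical helper: turn a Unicode string into a raw file name,
%% that is, a binary holding the UTF-8 encoded name.
utf8_raw_name(Name) ->
    unicode:characters_to_binary(Name, unicode, utf8).

%% Hypothetical helper: write Data to a file whose name is stored as
%% UTF-8 on disk, regardless of the file name translation mode of
%% the VM.
write_with_utf8_name(Name, Data) ->
    file:write_file(utf8_raw_name(Name), Data).
```

A call like `raw_name_sketch:write_with_utf8_name([1070 | ".txt"], <<"data">>)` would then pass the bytes `<<208,174,46,116,120,116>>` to the operating system as the file name.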
MacOS X's vfs layer enforces UTF-8 file names in a quite
aggressive way. Older versions did this by simply refusing to create
non UTF-8 conforming file names, while newer versions replace
offending bytes with the sequence "%HH", where HH is the
original character in hexadecimal notation. As Unicode translation
is enabled by default on MacOS X, the only way to come up against
this is to either start the VM with the
MacOS X also reorganizes the names of files so that the
representation of accents etc is using the "combining characters",
i.e. the character
Environment variables and their interpretation are handled in much the same way as file names. If Unicode file names are enabled, environment variables as well as parameters to the Erlang VM are expected to be in Unicode.
If Unicode file names are enabled, the calls to
On Unix-like operating systems, parameters are expected to be UTF-8 without translation if Unicode file names are enabled.
Most of the modules in Erlang/OTP are of course Unicode-unaware
in the sense that they have no notion of Unicode and really should
not have. Typically they handle non-textual or byte-oriented data
(like
Modules that actually handle textual data (like
Fortunately, most textual data has been stored in lists and range
checking has been sparse, why modules like
Some modules are however changed to be explicitly Unicode-aware. These modules include:
The module
The
I/O-servers throughout the system are able to handle
Unicode data and have options for converting data upon actual
output or input to/from the device. As shown earlier, the
The actual reading and writing of files with Unicode data is
however not best done with the
The
The
The module
The fact that Erlang as such can handle Unicode data in many forms does not automatically mean that the content of any file can be Unicode text. The external entities such as ports or I/O-servers are not generally Unicode capable.
Ports are always byte oriented, so before sending data that you are not sure is bytewise encoded to a port, make sure to encode it in a proper Unicode encoding. Sometimes this will mean that only part of the data shall be encoded as e.g. UTF-8, some parts may be binary data (like a length indicator) or something else that shall not undergo character encoding, so no automatic translation is present.
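As a sketch of such partial encoding, the function below builds a message for a port where only the text is UTF-8 encoded, while the length indicator is plain binary data. The framing (a 4-byte big-endian length followed by the payload) and all names are assumptions made for the example, not a protocol defined by Erlang/OTP.

```erlang
-module(port_frame_sketch).
-export([frame/1]).

%% Hypothetical framing: a 32-bit big-endian byte count followed by
%% the text encoded as UTF-8. Only the text undergoes character
%% encoding; the length field is ordinary binary data.
frame(Text) ->
    Payload = unicode:characters_to_binary(Text, unicode, utf8),
    true = is_binary(Payload),   % crash early on invalid character data
    <<(byte_size(Payload)):32/big, Payload/binary>>.
```

Note that the length counts bytes, not characters: `frame([246])` gives `<<0,0,0,2,195,182>>`, as "ö" occupies two bytes in UTF-8.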
I/O-servers behave a little differently. The I/O-servers connected
to terminals (or stdout) can usually cope with Unicode data
regardless of the
The rule of thumb is that the
Functions reading Erlang syntax from files generally recognize
the
$ erl +fna +pc unicode
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> file:write_file("test.term",<<"%% coding: utf-8\n[{\"Юникод\",4711}].\n"/utf8>>).
ok
2> file:consult("test.term").
{ok,[[{"Юникод",4711}]]}
The Unicode support is controlled by command line switches, some standard environment variables and the version of OTP you are using. Most options mainly affect the way Unicode data is displayed, not the actual functionality of the APIs in the standard libraries. This means that Erlang programs usually do not need to concern themselves with these options; they are more for the development environment. An Erlang program can be written so that it works well regardless of the type of system or the Unicode options that are in effect.
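The settings currently in effect can be inspected from within a running system. The sketch below collects them in one list using `io:printable_range/0`, `file:native_name_encoding/0`, `epp:default_encoding/0` and `io:getopts/0`; the module and function names, and the tags used in the result, are hypothetical.

```erlang
-module(unicode_settings_sketch).
-export([summary/0]).

%% Hypothetical helper: collect the Unicode related settings that are
%% in effect for the calling process.
summary() ->
    [{printable_range, io:printable_range()},
     {native_name_encoding, file:native_name_encoding()},
     {default_source_encoding, epp:default_encoding()},
     {shell_encoding, shell_encoding()}].

%% The encoding of the group leader; undefined if the options cannot
%% be retrieved or contain no encoding.
shell_encoding() ->
    case io:getopts() of
        Opts when is_list(Opts) -> proplists:get_value(encoding, Opts);
        {error, _Reason} -> undefined
    end.
```

Such a summary is mostly useful when debugging a development environment; as stated above, programs should normally not depend on the values.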
Here follows a summary of the settings affecting Unicode:
The language setting in the OS mainly affects the shell. The
terminal (i.e. the group leader) will operate with
The environment can also affect file name interpretation, if
Erlang is started with the
You can check the setting of this by calling
This flag affects what is interpreted as string data when
doing heuristic string detection in the shell and in
You can check this option by calling io:printable_range/0,
which will return
This flag affects how the file names are to be interpreted. On operating systems with transparent file naming, this has to be specified to allow for file naming in Unicode characters (and for correct interpretation of file names containing characters > 255).
The file name translation mode can be read with the
This function returns the default encoding for Erlang source
files (if no encoding comment is present) in the currently
running release. In Erlang/OTP R16B
The encoding of each file can be specified using comments as
described in
When Erlang is started with
With the
Opening files with
You can retrieve the
When starting with Unicode, one often stumbles over some common issues. I try to outline some methods of dealing with Unicode data in this section.
A common method of identifying encoding in text-files is to put
a byte order mark (BOM) first in the file. The BOM is the
code point 16#FEFF encoded in the same way as the rest of the
file. If such a file is to be read, the first few bytes (depending
on encoding) are not part of the actual text. This code outlines
how to open a file which is believed to have a BOM and set the
file's encoding and position for further sequential reading
(preferably using the
open_bom_file_for_reading(File) ->
    {ok,F} = file:open(File,[read,binary]),
    {ok,Bin} = file:read(F,4),
    {Type,Bytes} = unicode:bom_to_encoding(Bin),
    file:position(F,Bytes),
    io:setopts(F,[{encoding,Type}]),
    {ok,F}.
The
To open a file for writing and putting the BOM first is even simpler:
open_bom_file_for_writing(File,Encoding) ->
    {ok,F} = file:open(File,[write,binary]),
    %% Write the BOM to the opened device F, not to the file name
    ok = file:write(F,unicode:encoding_to_bom(Encoding)),
    io:setopts(F,[{encoding,Encoding}]),
    {ok,F}.
In both cases the file is then best processed using the
When reading and writing to Unicode-aware entities, like the
User or a file opened for Unicode translation, you will probably
want to format text strings using the functions in
1> io:format("~ts~n",[<<"åäö"/utf8>>]).
åäö
ok
2> io:format("~s~n",[<<"åäö"/utf8>>]).
Ã¥Ã¤Ã¶
ok
Obviously the second
As long as the data is always lists, the
The function
$ erl +pc unicode
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> io_lib:format("~ts~n", ["Γιούνικοντ"]).
["Γιούνικοντ","\n"]
2> io:put_chars(io_lib:format("~ts~n", ["Γιούνικοντ"])).
Γιούνικοντ
ok
The Unicode string is returned as a Unicode list, which is
recognized as such since the Erlang shell uses the Unicode
encoding (and is started with all Unicode characters considered
printable). The Unicode list is valid input to the
While it is strongly encouraged that the actual encoding of characters in binary data is known prior to processing, that is not always possible. On a typical Linux system, there is a mix of UTF-8 and ISO-latin-1 text files and there are seldom any BOMs in the files to identify them.
UTF-8 is designed in such a way that ISO-latin-1 characters
with numbers beyond the 7-bit ASCII range are seldom considered
valid when decoded as UTF-8. Therefore one can usually use
heuristics to determine if a file is in UTF-8 or if it is encoded
in ISO-latin-1 (one byte per character) encoding. The
heuristic_encoding_bin(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin,utf8,utf8) of
        Bin ->
            utf8;
        _ ->
            latin1
    end.
If one does not have a complete binary of the file content, one
could instead chunk through the file and check part by part. The
return-tuple
heuristic_encoding_file(FileName) ->
    {ok,F} = file:open(FileName,[read,binary]),
    loop_through_file(F,<<>>,file:read(F,1024)).

loop_through_file(_,<<>>,eof) ->
    utf8;
loop_through_file(_,_,eof) ->
    latin1;
loop_through_file(F,Acc,{ok,Bin}) when is_binary(Bin) ->
    case unicode:characters_to_binary([Acc,Bin]) of
        {error,_,_} ->
            latin1;
        {incomplete,_,Rest} ->
            loop_through_file(F,Rest,file:read(F,1024));
        Res when is_binary(Res) ->
            loop_through_file(F,<<>>,file:read(F,1024))
    end.
Another option is to try to read the whole file in UTF-8
encoding and see if it fails. Here we need to read the file using
heuristic_encoding_file2(FileName) ->
    {ok,F} = file:open(FileName,[read,binary,{encoding,utf8}]),
    loop_through_file2(F,io:get_chars(F,'',1024)).

loop_through_file2(_,eof) ->
    utf8;
loop_through_file2(_,{error,_Err}) ->
    latin1;
loop_through_file2(F,Bin) when is_binary(Bin) ->
    loop_through_file2(F,io:get_chars(F,'',1024)).
For various reasons, you may find yourself having a list of UTF-8 bytes. This is not a regular string of Unicode characters as each element in the list does not contain one character. Instead you get the "raw" UTF-8 encoding that you have in binaries. This is easily converted to a proper Unicode string by first converting byte per byte into a binary and then converting the binary of UTF-8 encoded characters back to a Unicode string:
utf8_list_to_string(StrangeList) ->
    unicode:characters_to_list(list_to_binary(StrangeList)).
When working with binaries, you may get the horrible "double
UTF-8 encoding", where strange characters are encoded in your
binaries or files that you did not expect. What you may have got,
is a UTF-8 encoded binary that is for the second time encoded as
UTF-8. A common situation is where you read a file, byte by byte,
but the actual content is already UTF-8. If you then convert the
bytes to UTF-8, using i.e. the
By far the most common situation where this happens is when you get lists of UTF-8 bytes instead of proper Unicode strings, and then convert them to UTF-8 in a binary or in a file:
wrong_thing_to_do() ->
    {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"),
    MyList = binary_to_list(Bin), %% Wrong! It is an utf8 binary!
    {ok,C} = file:open("catastrophe.txt",[write,{encoding,utf8}]),
    io:put_chars(C,MyList), %% Expects a Unicode string, but get UTF-8
                            %% bytes in a list!
    file:close(C). %% The file catastrophe.txt contains more or less unreadable
                   %% garbage!
Make very sure you know what a binary contains before converting it to a string. If no other option exists, try heuristics:
if_you_can_not_know() ->
    {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"),
    MyList = case unicode:characters_to_list(Bin) of
                 L when is_list(L) ->
                     L;
                 _ ->
                     binary_to_list(Bin) %% The file was bytewise encoded
             end,
    %% Now we know that the list is a Unicode string, not a list of UTF-8 bytes
    {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]),
    io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets!
    file:close(G). %% The file contains valid UTF-8 encoded Unicode characters!
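To see what double UTF-8 encoding looks like on the byte level, and how one such step can sometimes be reversed, consider the following sketch. The module and function names are hypothetical, and the repair is a heuristic: it only makes sense when you have reason to believe the data was encoded exactly one time too many.

```erlang
-module(double_utf8_sketch).
-export([double_encode/1, undo_double_encoding/1]).

%% Reproduces the mistake: the bytes of a UTF-8 binary are treated as
%% ISO-latin-1 characters and encoded as UTF-8 once more.
double_encode(Utf8Bin) when is_binary(Utf8Bin) ->
    unicode:characters_to_binary(Utf8Bin, latin1, utf8).

%% Heuristic repair: decode one UTF-8 layer into bytes and accept the
%% result only if those bytes are themselves valid UTF-8.
undo_double_encoding(Bin) when is_binary(Bin) ->
    case unicode:characters_to_binary(Bin, utf8, latin1) of
        Fixed when is_binary(Fixed) ->
            case unicode:characters_to_binary(Fixed, utf8, utf8) of
                Fixed -> {ok, Fixed};
                _ -> error
            end;
        _ ->
            error
    end.
```

The character "ö" is `<<195,182>>` in UTF-8; encoded a second time it becomes `<<195,131,194,182>>`, which `undo_double_encoding/1` turns back into `{ok,<<195,182>>}`. A correctly encoded `<<195,182>>` is left alone, as the function returns `error` for it.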