From 8d0de294eb7ff5d17d3e71101f273bbcae3737b8 Mon Sep 17 00:00:00 2001
From: Patrik Nyblom

  Implementing support for Unicode character sets is an ongoing
process. The Erlang Enhancement Proposal (EEP) 10 outlined the
basics of Unicode support and also specified a default encoding in
@@ -48,13 +48,13 @@
  source code, along with enhancements to many of the applications to
  support both Unicode encoded file names and UTF-8
encoded files in several circumstances. Most notable is the support
- for UTF-8 in files read by file:consult/1, release handler support
+ for UTF-8 in files read by
  In R17, the encoding default for Erlang source files will be
  switched to UTF-8 and in R18 Erlang will support atoms in the full
-  Unicode range, meaning full Unicode function names and module
+  Unicode range, meaning full Unicode function and module names
  This guide outlines the current Unicode support and gives a couple
@@ -88,14 +88,14 @@
  the translation should be in and also take into account differences
  in input and output string length and so on. There is at the time of
  writing no Unicode to_upper/to_lower functionality in Erlang/OTP, but
-  there are publicly available libraries that addresses these issues.
+  there are publicly available libraries that address these issues.
  Another example is the accented characters where the same glyph
has two different representations. Let's look at the Swedish
"ö". There's a code point for that in the Unicode standard, but you
can also write it as "o" followed by U+0308 (Combining Diaeresis,
with the simplified meaning that the last letter should have a "¨"
- above). They have exactly the same glyph. they are for most
+ above). They have exactly the same glyph. They are for most
purposes the same, but they have completely different
representations. For example MacOS X converts all file names to use
Combining Diaeresis, while most other programs (including Erlang)
@@ -113,7 +113,7 @@
Unicode is a standard defining code points (numbers) for all
known, living or dead, scripts. In principle, every known symbol
  used in any language has a Unicode code point.
  It is vital to understand the difference between encodings and
Unicode characters. Unicode characters are code points according to
the Unicode standard, while the encodings are ways to represent such
- code points. An encoding is just an standard for representation,
+ code points. An encoding is just a standard for representation,
UTF-8 can for example be used to represent a very limited part of
the Unicode character set (e.g. ISO-Latin-1), or the full Unicode
range. It's just an encoding format.
To support Unicode in Erlang, problems in several areas have been addressed. Each area is described briefly in this section and more thoroughly further down in this document:
@@ -231,7 +231,7 @@
  (as the string module now can handle lists with arbitrary code
  points), in some cases new functionality or options need to be
  added (as in the
%% -*- coding: utf-8 -*-
- in the beginning of the file. It of course requires your editor to
+ in the beginning of the file. This of course requires your editor to
support UTF-8 as well. The same comment is also interpreted by
- functions like file:consult/1 , the release handler etc, so that
+  functions like
-In Erlang, strings are actually lists of integers. A string was up
-until R13 defined to be encoded in the ISO-latin-1 (ISO8859-1)
-character set, which is, code point by code point, a sub-range of the
-Unicode character set.
-The standard list encoding for strings was therefore easily
-extended to cope with the whole Unicode range: A Unicode string in
-Erlang is simply a list containing integers, each integer being a
-valid Unicode code point and representing one character in the Unicode
-character set.
-Erlang strings in ISO-latin-1 are a subset of Unicode strings.
-Only if a string contains code points < 256, can it be directly
-converted to a binary by using i.e.
+In Erlang, strings are actually lists of integers. A string was
+  up until R13 defined to be encoded in the ISO-latin-1 (ISO8859-1)
+  character set, which is, code point by code point, a sub-range of
+  the Unicode character set.
+The standard list encoding for strings was therefore easily
+  extended to cope with the whole Unicode range: A Unicode string in
+  Erlang is simply a list containing integers, each integer being a
+  valid Unicode code point and representing one character in the
+  Unicode character set.
+Erlang strings in ISO-latin-1 are a subset of Unicode
+  strings.
+Only if a string contains code points < 256, can it be
+  directly converted to a binary by using
+  i.e.
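The distinction above can be sketched in an Erlang shell session (a sketch; the sample code points are arbitrary):

```erlang
%% All code points below 256: a bytewise conversion works directly.
1> list_to_binary("abc").
<<"abc">>
%% A code point at or above 256 cannot be stored one-byte-per-character;
%% it has to be encoded, here using the default UTF-8 encoding.
2> unicode:characters_to_binary([1024]).
<<208,128>>
```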
-Binaries are more troublesome. For performance reasons, programs
-often store textual data in binaries instead of lists, mainly because
-they are more compact (one byte per character instead of two words per
-character, as is the case with lists). Using
-
-As the UTF-8 encoding is widely spread and provides some backward
-compatibility in the 7-bit ASCII range, it is selected as the standard
-encoding for Unicode characters in binaries for Erlang.
-The standard binary encoding is used whenever a library function in
-Erlang should cope with Unicode data in binaries, but is of course not
-enforced when communicating externally. Functions and bit-syntax exist
-to encode and decode both UTF-8, UTF-16 and UTF-32 in
-binaries. Library functions dealing with binaries and Unicode in
-general, however, only deal with the default encoding.
+Binaries are more troublesome. For performance reasons, programs
+  often store textual data in binaries instead of lists, mainly
+  because they are more compact (one byte per character instead of two
+  words per character, as is the case with lists). Using
+
+As the UTF-8 encoding is widely spread and provides some backward
+  compatibility in the 7-bit ASCII range, it is selected as the
+  standard encoding for Unicode characters in binaries for Erlang.
+The standard binary encoding is used whenever a library function
+  in Erlang should cope with Unicode data in binaries, but is of
+  course not enforced when communicating externally. Functions and
+  bit-syntax exist to encode and decode both UTF-8, UTF-16 and UTF-32
+  in binaries. Library functions dealing with binaries and Unicode in
+  general, however, only deal with the default encoding.
-Character data may be combined from several sources, sometimes
-available in a mix of strings and binaries. Erlang has for long had
-the concept of
+ Character data may be combined from several sources, sometimes
+ available in a mix of strings and binaries. Erlang has for long had
+ the concept of iodata or iolist s, where binaries and
+ lists can be combined to represent a sequence of bytes. In the same
+ way, the Unicode aware modules often allow for combinations of
+ binaries and lists where the binaries have characters encoded in
+ UTF-8 and the lists contain such binaries or numbers representing
+ Unicode code points:
+
unicode_binary() = binary() with characters encoded in UTF-8 coding standard
chardata() = charlist() | unicode_binary()
charlist() = maybe_improper_list(char() | unicode_binary() | charlist(),
unicode_binary() | nil())
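As a sketch of how such mixes are consumed (the variable names here are made up for illustration), the unicode module accepts chardata() directly:

```erlang
%% A chardata() mix: a plain string, a UTF-8 encoded binary and a
%% list with a code point above the ISO-latin-1 range.
Mixed = ["abc", <<"åäö"/utf8>>, [1024]],
%% Flatten to one list of Unicode code points:
List = unicode:characters_to_list(Mixed),
%% ...or re-encode the same text as a single UTF-8 binary:
Bin = unicode:characters_to_binary(Mixed).
```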
-The module unicode in STDLIB even supports similar mixes
-with binaries containing other encodings than UTF-8, but that is a
-special case to allow for conversions to and from external data:
-
+ The module unicode in STDLIB even
+ supports similar mixes with binaries containing other encodings than
+ UTF-8, but that is a special case to allow for conversions to and
+ from external data:
+
external_unicode_binary() = binary() with characters coded in
a user specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32)
@@ -371,12 +375,12 @@ external_charlist() = maybe_improper_list(char() |
external_unicode_binary() | nil())
For source code, there is an extension to the
In the shell, if using a Unicode input device, or in source
code stored in UTF-8,
  In certain output functions and in the output of return values in
  the shell, Erlang tries to heuristically detect string data in
  lists and binaries. Typically you will see heuristic detection in
@@ -429,7 +433,7 @@ Bin4 = <<"Hello"/utf16>>,
 "abc"
 2> <<97,98,99>>.
 <<"abc">>
-3> <<195,165,195,164,195,182>>
+3> <<195,165,195,164,195,182>>.
 <<"åäö"/utf8>>
Here the shell will detect lists containing printable
characters or binaries containing printable characters either in
@@ -439,7 +443,7 @@ Bin4 = <<"Hello"/utf16>>,
the heuristic detection. The result would be that almost any list
of integers will be deemed a string, resulting in all sorts of
characters being printed, maybe even characters your terminal does
- not have in it's font set (resulting in some generic output you
+ not have in its font set (resulting in some generic output you
probably will not appreciate). Another way is to keep it backwards
compatible so that only the ISO-Latin-1 character set is used to
detect a string. A third way would be to let the user decide
@@ -489,7 +493,7 @@ Eshell V5.10.1 (abort with ^G)
only interpret characters from the ISO-Latin1 range as printable
and will only detect lists or binaries with those "printable"
characters as containing string data. The valid UTF-8 binary
- containing "Юникод", will not be print as a string. When, on the
+ containing "Юникод", will not be printed as a string. When, on the
other hand, started with all Unicode characters printable (
-The interactive Erlang shell, when started towards a terminal or
-started using the werl command on windows, can support Unicode
-input and output.
-On Windows®, proper operation requires that a suitable font is
-installed and selected for the Erlang application to use. If no
-suitable font is available on your system, try installing the DejaVu
-fonts (dejavu-fonts.org), which are freely available and then select
-that font in the Erlang shell application.
-On Unix®-like operating systems, the terminal should be able to
-handle UTF-8 on input and output (modern versions of XTerm, KDE
-konsole and the Gnome terminal do for example) and your locale
-settings have to be proper. As an example, my LANG environment
-variable is set as this:
+The Interactive Shell
+The interactive Erlang shell, when started towards a terminal or
+  started using the werl command on windows, can support
+  Unicode input and output.
+On Windows, proper operation requires that a suitable font
+  is installed and selected for the Erlang application to use. If no
+  suitable font is available on your system, try installing the DejaVu
+  fonts (dejavu-fonts.org), which are freely available and then
+  select that font in the Erlang shell application.
+On Unix-like operating systems, the terminal should be able
+  to handle UTF-8 on input and output (modern versions of XTerm, KDE
+  konsole and the Gnome terminal do for example) and your locale
+  settings have to be proper. As an example, my LANG
+  environment variable is set as this:
 $ echo $LANG
 en_US.UTF-8
-Actually, most systems handle the LC_CTYPE variable before
-LANG, so if that is set, it has to be set to UTF-8:
+Actually, most systems handle the LC_CTYPE variable before
+LANG, so if that is set, it has to be set to
+UTF-8:
 $ echo $LC_CTYPE
 en_US.UTF-8
-The LANG or LC_CTYPE setting should be consistent
-with what the terminal is capable of, there is no portable way for
-Erlang to ask the actual terminal about its UTF-8 capacity, we have to
-rely on the language and character type settings.
-To investigate what Erlang thinks about the terminal, the
-io:getopts() call can be used when the shell is started:
+The LANG or LC_CTYPE setting should be consistent
+  with what the terminal is capable of, there is no portable way for
+  Erlang to ask the actual terminal about its UTF-8 capacity, we have
+  to rely on the language and character type settings.
+To investigate what Erlang thinks about the terminal, the
+  io:getopts() call can be used when the shell is started:
 $ LC_CTYPE=en_US.ISO-8859-1 erl
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -570,47 +575,49 @@ Eshell V5.10.1 (abort with ^G)
 {encoding,unicode}
 2>
-When (finally?) everything is in order with the locale settings,
-fonts and the terminal emulator, you probably also have discovered a
-way to input characters in the script you desire. For testing, the
-simplest way is to add some keyboard mappings for other languages,
-usually done with some applet in your desktop environment. In my KDE
-environment, I start the KDE Control Center (Personal Settings),
-select "Regional and Accessibility" and then "Keyboard Layout". On
-Windows XP®, I start Control Panel->Regional and Language Options,
-select the Language tab and click the Details... button in the square
-named "Text services and input Languages". Your environment probably
-provides similar means of changing the keyboard layout. Make sure you
-have a way to easily switch back and forth between keyboards if you
-are not used to this, entering commands using a Cyrillic character set
-is, as an example, not easily done in the Erlang shell.
+When (finally?) everything is in order with the locale settings,
+  fonts and the terminal emulator, you probably also have discovered a
+  way to input characters in the script you desire. For testing, the
+  simplest way is to add some keyboard mappings for other languages,
+  usually done with some applet in your desktop environment. In my KDE
+  environment, I start the KDE Control Center (Personal Settings),
+  select "Regional and Accessibility" and then "Keyboard Layout". On
+  Windows XP, I start Control Panel->Regional and Language
+  Options, select the Language tab and click the Details... button in
+  the square named "Text services and input Languages". Your
+  environment probably provides similar means of changing the keyboard
+  layout. Make sure you have a way to easily switch back and forth
+  between keyboards if you are not used to this, entering commands
+  using a Cyrillic character set is, as an example, not easily done in
+  the Erlang shell.
-Now you are set up for some Unicode input and output. The simplest
-thing to do is of course to enter a string in the shell:
+Now you are set up for some Unicode input and output. The
+  simplest thing to do is of course to enter a string in the
+  shell:
 $ erl
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 Eshell V5.10.1 (abort with ^G)
 1> lists:keyfind(encoding, 1, io:getopts()).
 {encoding,unicode}
-2> "Юникод"
+2> "Юникод".
 "Юникод"
 3> io:format("~ts~n", [v(2)]).
 Юникод
 ok
 4>
-While strings can be input as Unicode characters, the language
-elements are still limited to the ISO-latin-1 character set. Only
-character constants and strings are allowed to be beyond that
-range:
+While strings can be input as Unicode characters, the language
+  elements are still limited to the ISO-latin-1 character set. Only
+  character constants and strings are allowed to be beyond that
+  range:
 $ erl
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 Eshell V5.10.1 (abort with ^G)
-1> $ξ
+1> $ξ.
 958
 2> Юникод.
 * 1: illegal character
@@ -666,23 +673,11 @@ Eshell V5.10.1 (abort with ^G)
  [195,150,115,116,101,114,115,117,110,100], which is a list
  containing UTF-8 bytes - not what you would want... If you on the
  other hand use Unicode file name translation on such a
-  system, nun-UTF-8 file names will simply be ignored by functions
+  system, non-UTF-8 file names will simply be ignored by functions
  like file:list_dir/1. They can be retrieved with
  file:list_dir_all/1, but wrongly encoded file names will appear
  as "raw file names".
-A raw file name is not a list, but a binary with undefined
-  encoding. Many non core applications still do not handle file
-  names given as binaries, why such raw names are avoided by
-  default. All functions in the
@@ -691,13 +686,13 @@ Eshell V5.10.1 (abort with ^G)
  work with files having names in any language or character set (as
  long as it is supported by the underlying OS and file system). The
  Unicode character list is used to denote file or directory names and
-  if the file system content is listed, you will also be able to get
+  if the file system content is listed, you will also get
  Unicode lists as return value. The support lies in the Kernel and
  STDLIB modules, why most applications (that does not explicitly
  require the file names to be in the ISO-latin-1 range) will benefit
  from the Unicode support without change.
-  All functions in the file module taking
-  file names as input will handle raw file names, sending them more
-  or less uninterpreted to the underlying OS API, but only the
-  functions with names ending in _all will produce raw file
-  names. As special considerations will have to be taken by tools
-  etc to be able to handle non-UTF-8 encoded file names when
-  Unicode file name translation is activated on systems with
-  transparent file naming, the default is to leave such
-  translation off on such operating systems.
-On Operating systems with mandatory Unicode file names, this
+On operating systems with mandatory Unicode file names, this
  means that you more easily conform to the file names of other (non
  Erlang) applications, and you can also process file names that, at
  least on Windows, were completely inaccessible (due to having names
@@ -712,8 +707,17 @@ Eshell V5.10.1 (abort with ^G)
  work perfectly in Unicode file name mode. It was still however
  considered experimental in R14B01 and is still not the default on
  such systems. Unicode file name translation is turned on with the
-  +fnu switch to the erl program. If the VM is started
-  in Unicode file name translation mode,
+  +fnu switch to the erl program. On Linux, a VM started
+  without explicitly stating the file name translation mode will
+  default to latin1 as the native file name encoding. On Windows
+  and MacOS X, the default behavior is that of Unicode file name
+  translation, why file:native_name_encoding/0 by default returns
+  utf8 on those systems (the fact that Windows actually does not
+  use UTF-8 on the file system level can safely be ignored by the
+  Erlang programmer). The default behavior can, as stated before,
+  be changed using the +fnu or +fnl options to the VM. If the
+  VM is started in Unicode file name translation mode,
  file:native_name_encoding/0 will return the atom utf8. The
  +fnu switch can be followed by w, i or e, to control how
  wrongly encoded file names are
@@ -722,7 +726,9 @@ Eshell V5.10.1 (abort with ^G)
  "skipped" in directory listings, i means that those wrongly
  encoded file names are silently ignored and e means that the API
  function will return an error whenever a wrongly encoded file
  (or directory) name is encountered.
-  w is the default.
+  w is the default. Note
+  that file:read_link/1 will always return an error if the link
+  points to an invalid file name.
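A minimal sketch of coping with both translation modes (the helper's name is made up): decode a raw (binary) file name using the encoding reported by file:native_name_encoding/0, falling back to bytewise latin1 if the bytes are not valid in the native encoding:

```erlang
%% Hypothetical helper: turn any file name (code point list or raw
%% binary) into a Unicode code point list suitable for display.
display_name(Name) when is_list(Name) ->
    Name;  %% already a list of Unicode code points
display_name(Name) when is_binary(Name) ->
    case unicode:characters_to_list(Name, file:native_name_encoding()) of
        L when is_list(L) ->
            L;
        _ ->
            %% Not valid in the native encoding; interpret bytewise.
            unicode:characters_to_list(Name, latin1)
    end.
```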
open_port/2 with the option{spawn_executable,...} are @@ -734,7 +740,7 @@ Eshell V5.10.1 (abort with ^G)It is worth noting that the file
encoding options given when opening a file has nothing to do with the file name encoding convention. You can very well open files containing data - encoded in UTF-8 but having file names in bytewise (latin1) encoding + encoded in UTF-8 but having file names in bytewise (latin1 ) encoding or vice versa.Erlang drivers and NIF shared objects still can not be @@ -744,9 +750,9 @@ Eshell V5.10.1 (abort with ^G) experimental.
-Notes About Raw File Names and Automatic File Name Conversion
+Notes About Raw File Names
-Raw file names was introduced together with Unicode file name
+Raw file names were introduced together with Unicode file name
  support in erts-5.8.2 (OTP R14B01). The reason "raw file
  names" was introduced in the system was to be able to consistently
  represent file names given in different encodings on
@@ -793,26 +799,10 @@ Eshell V5.10.1 (abort with ^G)
  encoded file names, so that raw file names could spread
  unexpectedly throughout the system. Beginning with R16B, the
  wrongly encoded file names are only retrieved by special functions
-  (e.g. file:list_dir_all/1, so the impact on existing code is
+  (e.g. file:list_dir_all/1), so the impact on existing code is
  much lower, why it is now supported. Unicode file name
  translation is expected to be default in future releases.
-If working with raw file names, one can still conform to the
-  encoding convention of the Erlang VM by using the
-  file:native_name_encoding/0 function, which returns either
-  the atom latin1 or the atom utf8 depending on the file
-  name translation mode. On Linux, a VM started without explicitly
-  stating the file name translation mode will default to latin1
-  as the native file name encoding. On Windows and MacOS X, the default
-  behavior is that of Unicode file name translation, why the
-  file:native_name_encoding/0 by default returns utf8 on
-  those systems (the fact that Windows actually does not use UTF-8 on
-  the file system level can safely be ignored by the Erlang
-  programmer). The default behavior can, as been stated before, be
-  changed using the +fnu or +fnl options to the VM, see
-  the erl(1) command manual page.
  Even if you are operating without Unicode file naming translation
  automatically done by the VM, you can access and create files with
  names in UTF-8 encoding by using raw file names encoded as
@@ -822,16 +812,19 @@ Eshell V5.10.1 (abort with ^G)
 Notes About MacOS X
-MacOS X's vfs layer enforces UTF-8 file names in a quite aggressive
-way. Older versions did this by simply refusing to create non UTF-8
-conforming file names, while newer versions replace offending bytes
-with the sequence "%HH", where HH is the original
-character in hexadecimal notation. As Unicode translation is enabled
-by default on MacOS X, the only way to come up against this is to
-either start the VM with the +fnl flag or to use a raw file
-name in latin1 encoding. In that case, the file can not be
-opened with the same name as the one used to create this. The
-problem is by design in newer versions of MacOS X.
+MacOS X's vfs layer enforces UTF-8 file names in a quite
+  aggressive way. Older versions did this by simply refusing to create
+  non UTF-8 conforming file names, while newer versions replace
+  offending bytes with the sequence "%HH", where HH is the
+  original character in hexadecimal notation. As Unicode translation
+  is enabled by default on MacOS X, the only way to come up against
+  this is to either start the VM with the +fnl flag or to use a
+  raw file name in bytewise (latin1) encoding. If using a raw
+  file name, with a bytewise encoding containing characters between
+  127 and 255, to create a file, the file can not be opened using the
+  same name as the one used to create it. There is no remedy for this
+  behaviour, other than keeping the file names in the right
+  encoding.
  MacOS X also reorganizes the names of files so that the
  representation of accents etc is using the "combining characters",
@@ -850,7 +843,7 @@ Eshell V5.10.1 (abort with ^G)
Environment variables and their interpretation is handled much in
the same way as file names. If Unicode file names are enabled,
environment variables as well as parameters to the Erlang VM are
@@ -884,7 +877,7 @@ Eshell V5.10.1 (abort with ^G)
The module The module The The I/O-servers throughout the system are able both to handle
+ I/O-servers throughout the system are able to handle
  Unicode data and have options for converting data upon actual
output or input to/from the device. As shown earlier, the
- The actual reading and writing of files with Unicode data is
however not best done with the
The
The
The
The
The module
The module
  The fact that Erlang as such can handle Unicode data in many
  forms does not automatically mean that the content of any file can
-  be Unicode text. The external entities such as ports or io_servers
+  be Unicode text. The external entities such as ports or I/O-servers
  are not generally Unicode capable.
  Ports are always byte oriented, so before sending data that you
  are not sure is bytewise encoded to a port, make sure to encode it
@@ -951,7 +945,7 @@ Eshell V5.10.1 (abort with ^G)
  binary data (like a length indicator) or something else that shall
  not undergo character encoding, so no automatic translation is
  present.
-io_servers behave a little differently. The io_servers connected +
I/O-servers behave a little differently. The I/O-servers connected
to terminals (or stdout) can usually cope with Unicode data
regardless of the
The rule of thumb is that the
The Unicode support is controlled by both command line switches,
some standard environment variables and the version of OTP you are
using. Most options affect mainly the way Unicode data is displayed,
not the actual functionality of the API's in the standard
- libraries. This means that actual Erlang programs usually do not
+ libraries. This means that Erlang programs usually do not
need to concern themselves with these options, they are more for the
development environment. An Erlang program can be written so that it
works well regardless of the type of system or the Unicode options
@@ -1014,14 +1008,14 @@ ok
The language setting in the OS mainly affects the shell. The
-  terminal (i.e. the group_leader) will operate with
  The environment can also affect file name interpretation, if
  Erlang is started with the
  You can check the setting of this by calling
-
You can check this option by calling io:printable_range/0,
- which will in R16 return
The additional {
The file name translation mode can be read with the
This function returns the default encoding for Erlang source
files (if no encoding comment is present) in the currently
@@ -1092,31 +1071,31 @@ ok
The encoding of each file can be specified using comments as
described in
-
When Erlang is started with
With the
Opening files with
You can retrieve the
You can retrieve the
When starting with Unicode, one often stumbles over some common issues. I try to outline some methods of dealing with Unicode data in this section.
@@ -1129,7 +1108,7 @@ ok
  on encoding) is not part of the actual text. This code outlines
  how to open a file which is believed to have a BOM and set the
  file's encoding and position for further sequential reading
-  (preferably using the
open_bom_file_for_reading(File) ->
@@ -1140,8 +1119,15 @@ open_bom_file_for_reading(File) ->
io:setopts(F,[{encoding,Type}]),
{ok,F}.
-The
To open a file for writing and putting the BOM first is even simpler:
+The
To open a file for writing and putting the BOM first is even + simpler:
open_bom_file_for_writing(File,Encoding) ->
{ok,F} = file:open(File,[write,binary]),
@@ -1149,21 +1135,53 @@ open_bom_file_for_writing(File,Encoding) ->
io:setopts(F,[{encoding,Encoding}]),
{ok,F}.
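A hypothetical round trip using the two helpers above (the file name and the encoding choice here are made up for illustration):

```erlang
%% Write a BOM followed by some text, then read it back; the helper
%% for reading detects the encoding from the BOM and skips past it.
{ok,Out} = open_bom_file_for_writing("bom_example.txt", {utf16,big}),
io:put_chars(Out, "Юникод"),   %% encoded as UTF-16 big endian on disk
ok = file:close(Out),
{ok,In} = open_bom_file_for_reading("bom_example.txt"),
Text = io:get_line(In, ''),    %% the text, with the BOM already skipped
ok = file:close(In).
```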
-In both cases the file is then best processed using the
When reading and writing to Unicode-aware entities, like the User or a file opened for Unicode translation, you will probably want to format text strings using the functions in
+In both cases the file is then best processed using the +
+io module, as the functions inio can handle code + points beyond the ISO-latin-1 range.
When reading and writing to Unicode-aware entities, like the
+ User or a file opened for Unicode translation, you will probably
+ want to format text strings using the functions in
1> io:format("~ts~n",[<<"åäö"/utf8>>]). åäö ok 2> io:format("~s~n",[<<"åäö"/utf8>>]). åäö ok-
Obviously the second
As long as the data is always lists, the
The function
Obviously the second
As long as the data is always lists, the
The function
$ erl +pc unicode Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -1174,26 +1192,38 @@ Eshell V5.10.1 (abort with ^G) 2> io:put_chars(io_lib:format("~ts~n", ["Γιούνικοντ"])). Γιούνικοντ ok-
The Unicode string is returned as a Unicode list, which is
-recognized as such since the Erlang shell uses the Unicode encoding
-(and is started with all Unicode characters considered printable). The
-Unicode list is valid input to the
-
-While it is strongly encouraged that the actual encoding of
-characters in binary data is known prior to processing, that is not
-always possible. On a typical Linux® system, there is a mix of UTF-8
-and ISO-latin-1 text files and there are seldom any BOM's in the
-files to identify them.
-UTF-8 is designed in such a way that ISO-latin-1 characters with
-numbers beyond the 7-bit ASCII range are seldom considered valid when
-decoded as UTF-8. Therefore one can usually use heuristics to
-determine if a file is in UTF-8 or if it is encoded in ISO-latin-1
-(one byte per character) encoding. The
+ The Unicode string is returned as a Unicode list, which is
+ recognized as such since the Erlang shell uses the Unicode
+ encoding (and is started with all Unicode characters considered
+ printable). The Unicode list is valid input to the io:put_chars/2 function,
+ so data can be output on any Unicode capable device. If the device
+ is a terminal, characters will be output in the \x{ H
+ ...} format if encoding is latin1 otherwise in UTF-8
+ (for the non-interactive terminal - "oldshell" or "noshell") or
+ whatever is suitable to show the character properly (for an
+ interactive terminal - the regular shell). The bottom line is that
+ you can always send Unicode data to the standard_io
+ device. Files will however only accept Unicode code points beyond
+ ISO-latin-1 if encoding is set to something else than
+ latin1 .
+
While it is + strongly encouraged that the actual encoding of characters in + binary data is known prior to processing, that is not always + possible. On a typical Linux system, there is a mix of UTF-8 + and ISO-latin-1 text files and there are seldom any BOM's in the + files to identify them.
+UTF-8 is designed in such a way that ISO-latin-1 characters
+ with numbers beyond the 7-bit ASCII range are seldom considered
+ valid when decoded as UTF-8. Therefore one can usually use
+ heuristics to determine if a file is in UTF-8 or if it is encoded
+ in ISO-latin-1 (one byte per character) encoding. The
+
heuristic_encoding_bin(Bin) when is_binary(Bin) ->
case unicode:characters_to_binary(Bin,utf8,utf8) of
Bin ->
@@ -1201,9 +1231,16 @@ heuristic_encoding_bin(Bin) when is_binary(Bin) ->
_ ->
latin1
end.
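A usage sketch of the heuristic above (the sample binaries are made up; assume heuristic_encoding_bin/1 is in scope):

```erlang
%% A valid UTF-8 binary survives a UTF-8 -> UTF-8 round trip unchanged,
%% so it is classified as utf8; the bytewise latin1 binary is not valid
%% UTF-8 and falls through to latin1.
Utf8Bin = unicode:characters_to_binary("åäö"),                    %% UTF-8 bytes
Latin1Bin = unicode:characters_to_binary("åäö", unicode, latin1), %% one byte/char
utf8 = heuristic_encoding_bin(Utf8Bin),
latin1 = heuristic_encoding_bin(Latin1Bin).
```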
-
-If one does not have a complete binary of the file content, one
-could instead chunk through the file and check part by part. The
-return-tuple
+
+ If one does not have a complete binary of the file content, one
+ could instead chunk through the file and check part by part. The
+ return-tuple
heuristic_encoding_file(FileName) ->
{ok,F} = file:open(FileName,[read,binary]),
loop_through_file(F,<<>>,file:read(F,1024)).
@@ -1221,9 +1258,12 @@ loop_through_file(F,Acc,{ok,Bin}) when is_binary(Bin) ->
Res when is_binary(Res) ->
loop_through_file(F,<<>>,file:read(F,1024))
end.
-
-Another option is to try to read the whole file in utf8 encoding
-and see if it fails. Here we need to read the file using
+
+ Another option is to try to read the whole file in UTF-8
+ encoding and see if it fails. Here we need to read the file using
+
heuristic_encoding_file2(FileName) ->
{ok,F} = file:open(FileName,[read,binary,{encoding,utf8}]),
loop_through_file2(F,io:get_chars(F,'',1024)).
@@ -1234,42 +1274,42 @@ loop_through_file2(_,{error,_Err}) ->
latin1;
loop_through_file2(F,Bin) when is_binary(Bin) ->
loop_through_file2(F,io:get_chars(F,'',1024)).
-
-For various reasons, you may find yourself having a list of UTF-8
-bytes. This is not a regular string of Unicode characters as each
-element in the list does not contain one character. Instead you get
-the "raw" UTF-8 encoding that you have in binaries. This is easily
-converted to a proper Unicode string by first converting byte per
-byte into a binary and then converting the binary of UTF-8 encoded
-characters back to a Unicode string:
-
+
+For various reasons, you may find yourself having a list of
+  UTF-8 bytes. This is not a regular string of Unicode characters as
+  each element in the list does not contain one character. Instead
+  you get the "raw" UTF-8 encoding that you have in binaries. This
+  is easily converted to a proper Unicode string by first converting
+  byte per byte into a binary and then converting the binary of
+  UTF-8 encoded characters back to a Unicode string:
+
utf8_list_to_string(StrangeList) ->
unicode:characters_to_list(list_to_binary(StrangeList)).
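For example, with the byte list being the UTF-8 encoding of "åäö" (sample data made up; assume utf8_list_to_string/1 from above is in scope):

```erlang
%% [195,165,195,164,195,182] is "åäö" encoded as UTF-8, one byte per
%% list element; the helper turns it into a proper code point list.
"åäö" = utf8_list_to_string([195,165,195,164,195,182]).
```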
-
-When working with binaries, you may get the horrible "double
- UTF-8 encoding", where strange characters are encoded in your
- binaries or files that you did not expect. What you may have got, is
- an UTF-8 encoded binary that is for the second time encoded as
- UTF-8. A common situation is where you read a file, byte by byte,
- but the actual content is already UTF-8. If you then convert the
- bytes to UTF-8, using the i.e. the
-The by far most common situation where this happens, is when you
-get lists of UTF-8 instead of proper Unicode strings, and then
-convert them to UTF-8 in a binary or on a file:
-
+
+ When working with binaries, you may get the horrible "double
+ UTF-8 encoding", where strange characters are encoded in your
+ binaries or files that you did not expect. What you may have got,
+ is a UTF-8 encoded binary that is for the second time encoded as
+ UTF-8. A common situation is where you read a file, byte by byte,
+ but the actual content is already UTF-8. If you then convert the
+ bytes to UTF-8, using i.e. the
The by far most common situation where this happens, is when + you get lists of UTF-8 instead of proper Unicode strings, and then + convert them to UTF-8 in a binary or on a file:
+
wrong_thing_to_do() ->
{ok,Bin} = file:read_file("an_utf8_encoded_file.txt"),
MyList = binary_to_list(Bin), %% Wrong! It is an utf8 binary!
@@ -1278,10 +1318,11 @@ loop_through_file2(F,Bin) when is_binary(Bin) ->
%% bytes in a list!
file:close(C). %% The file catastrophe.txt contains more or less unreadable
%% garbage!
-
-Make very sure you know what a binary contains before converting
-it to a string. If no other option exists, try heuristics:
-
+
+Make very sure you know what a binary contains before
+  converting it to a string. If no other option exists, try
+  heuristics:
+
if_you_can_not_know() ->
{ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"),
MyList = case unicode:characters_to_list(Bin) of
@@ -1294,7 +1335,7 @@ loop_through_file2(F,Bin) when is_binary(Bin) ->
{ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]),
io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets!
file:close(G). %% The file contains valid UTF-8 encoded Unicode characters!
-
-