path: root/lib/stdlib/doc/src/unicode_usage.xml
Diffstat (limited to 'lib/stdlib/doc/src/unicode_usage.xml')
-rw-r--r--  lib/stdlib/doc/src/unicode_usage.xml | 2420
1 file changed, 1278 insertions(+), 1142 deletions(-)
diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml
index b4c9385e33..7f79ac88a1 100644
--- a/lib/stdlib/doc/src/unicode_usage.xml
+++ b/lib/stdlib/doc/src/unicode_usage.xml
@@ -33,427 +33,495 @@
<rev>PA1</rev>
<file>unicode_usage.xml</file>
</header>
-<section>
-<title>Unicode Implementation</title>
- <p>Implementing support for Unicode character sets is an ongoing
- process. The Erlang Enhancement Proposal (EEP) 10 outlined the
- basics of Unicode support and also specified a default encoding in
- binaries that all Unicode-aware modules should handle in the
- future.</p>
-
- <p>The functionality described in EEP10 was implemented in Erlang/OTP
- R13A, but that was by no means the end of it. In Erlang/OTP R14B01 support
- for Unicode file names was added, although it was in no way complete
- and was by default disabled on platforms where no guarantee was given
- for the file name encoding. With Erlang/OTP R16A came support for UTF-8 encoded
- source code, among with enhancements to many of the applications to
- support both Unicode encoded file names as well as support for UTF-8
- encoded files in several circumstances. Most notable is the support
- for UTF-8 in files read by <c>file:consult/1</c>, release handler support
- for UTF-8 and more support for Unicode character sets in the
- I/O-system. In Erlang/OTP 17.0, the encoding default for Erlang source files was
- switched to UTF-8.</p>
-
- <p>This guide outlines the current Unicode support and gives a couple
- of recipes for working with Unicode data.</p>
-</section>
-<section>
-<title>Understanding Unicode</title>
- <p>Experience with the Unicode support in Erlang has made it
- painfully clear that understanding Unicode characters and encodings
- is not as easy as one would expect. The complexity of the field as
- well as the implications of the standard requires thorough
- understanding of concepts rarely before thought of.</p>
-
- <p>Furthermore the Erlang implementation requires understanding of
- concepts that never were an issue for many (Erlang) programmers. To
- understand and use Unicode characters requires that you study the
- subject thoroughly, even if you're an experienced programmer.</p>
-
- <p>As an example, one could contemplate the issue of converting
- between upper and lower case letters. Reading the standard will make
- you realize that, to begin with, there's not a simple one to one
- mapping in all scripts. Take German as an example, where there's a
- letter "ß" (Sharp s) in lower case, but the uppercase equivalent is
- "SS". Or Greek, where "Σ" has two different lowercase forms: "ς" in
- word-final position and "σ" elsewhere. Or Turkish where dotted and
- dot-less "i" both exist in lower case and upper case forms, or
- Cyrillic "I" which usually has no lowercase form. Or of course
- languages that have no concept of upper case (or lower case). So, a
- conversion function will need to know not only one character at a
- time, but possibly the whole sentence, maybe the natural language
- the translation should be in and also take into account differences
- in input and output string length and so on. There is at the time of
- writing no Unicode to_upper/to_lower functionality in Erlang/OTP, but
- there are publicly available libraries that address these issues.</p>
-
- <p>Another example is the accented characters where the same glyph
- has two different representations. Let's look at the Swedish
- "ö". There's a code point for that in the Unicode standard, but you
- can also write it as "o" followed by U+0308 (Combining Diaeresis,
- with the simplified meaning that the last letter should have a "¨"
- above). They have exactly the same glyph. They are for most
- purposes the same, but they have completely different
- representations. For example MacOS X converts all file names to use
- Combining Diaeresis, while most other programs (including Erlang)
- try to hide that by doing the opposite when for example listing
- directories. However it's done, it's usually important to normalize
- such characters to avoid utter confusion.</p>
-
- <p>The list of examples can be made as long as the Unicode standard, I
- suspect. The point is that one need a kind of knowledge that was
- never needed when programs only took one or two languages into
- account. The complexity of human languages and scripts, certainly
- has made this a challenge when constructing a universal
- standard. Supporting Unicode properly in your program <em>will</em> require
- effort.</p>
-
-</section>
-<section>
-<title>What Unicode Is</title>
- <p>Unicode is a standard defining code points (numbers) for all
- known, living or dead, scripts. In principle, every known symbol
- used in any language has a Unicode code point.</p>
- <p>Unicode code points are defined and published by the <em>Unicode
- Consortium</em>, which is a non profit organization.</p>
- <p>Support for Unicode is increasing throughout the world of
- computing, as the benefits of one common character set are
- overwhelming when programs are used in a global environment.</p>
- <p>Along with the base of the standard: the code points for all the
- scripts, there are a couple of <em>encoding standards</em> available.</p>
- <p>It is vital to understand the difference between encodings and
- Unicode characters. Unicode characters are code points according to
- the Unicode standard, while the encodings are ways to represent such
- code points. An encoding is just a standard for representation,
- UTF-8 can for example be used to represent a very limited part of
- the Unicode character set (e.g. ISO-Latin-1), or the full Unicode
- range. It's just an encoding format.</p>
- <p>As long as all character sets were limited to 256 characters,
- each character could be stored in one single byte, so there was more
- or less only one practical encoding for the characters. Encoding
- each character in one byte was so common that the encoding wasn't
- even named. When we now, with the Unicode system, have a lot more
- than 256 characters, we need a common way to represent these. The
- common ways of representing the code points are the encodings. This
- means a whole new concept to the programmer, the concept of
- character representation, which was before a non-issue.</p>
-
- <p>Different operating systems and tools support different
- encodings. For example Linux and MacOS X has chosen the UTF-8
- encoding, which is backwards compatible with 7-bit ASCII and
- therefore affects programs written in plain English the
- least. Windows on the other hand supports a limited version of
- UTF-16, namely all the code planes where the characters can be
- stored in one single 16-bit entity, which includes most living
- languages.</p>
-
- <p>The most widely spread encodings are:</p>
- <taglist>
- <tag>Bytewise representation</tag>
- <item>This is not a proper Unicode representation, but the
- representation used for characters before the Unicode standard. It
- can still be used to represent character code points in the Unicode
- standard that have numbers below 256, which corresponds exactly to
- the ISO-Latin-1 character set. In Erlang, this is commonly denoted
- <c>latin1</c> encoding, which is slightly misleading as ISO-Latin-1 is
- a character code range, not an encoding.</item>
- <tag>UTF-8</tag>
- <item>Each character is stored in one to four bytes depending on
- code point. The encoding is backwards compatible with bytewise
- representation of 7-bit ASCII as all 7-bit characters are stored
- in one single byte in UTF-8. The characters beyond code point 127
- are stored in more bytes, letting the most significant bit in the
- first character indicate a multi-byte character. For details on
- the encoding, the RFC is publicly available. Note that UTF-8 is
- <em>not</em> compatible with bytewise representation for
- code points between 128 and 255, so a ISO-Latin-1 bytewise
- representation is not generally compatible with UTF-8.</item>
- <tag>UTF-16</tag>
- <item>This encoding has many similarities to UTF-8, but the basic
- unit is a 16-bit number. This means that all characters occupy at
- least two bytes, some high numbers even four bytes. Some programs,
- libraries and operating systems claiming to use UTF-16 only allows
- for characters that can be stored in one 16-bit entity, which is
- usually sufficient to handle living languages. As the basic unit
- is more than one byte, byte-order issues occur, why UTF-16 exists
- in both a big-endian and little-endian variant. In Erlang, the
- full UTF-16 range is supported when applicable, like in the
- <c>unicode</c> module and in the bit syntax.</item>
- <tag>UTF-32</tag>
- <item>The most straight forward representation. Each character is
- stored in one single 32-bit number. There is no need for escapes
- or any variable amount of entities for one character, all Unicode
- code points can be stored in one single 32-bit entity. As with
- UTF-16, there are byte-order issues, UTF-32 can be both big- and
- little-endian.</item>
- <tag>UCS-4</tag>
- <item>Basically the same as UTF-32, but without some Unicode
- semantics, defined by IEEE and has little use as a separate
- encoding standard. For all normal (and possibly abnormal) usages,
- UTF-32 and UCS-4 are interchangeable.</item>
- </taglist>
- <p>Certain ranges of numbers are left unused in the Unicode standard
- and certain ranges are even deemed invalid. The most notable invalid
- range is 16#D800 - 16#DFFF, as the UTF-16 encoding does not allow
- for encoding of these numbers. It can be speculated that the UTF-16
- encoding standard was, from the beginning, expected to be able to
- hold all Unicode characters in one 16-bit entity, but then had to be
- extended, leaving a hole in the Unicode range to cope with backward
- compatibility.</p>
- <p>Additionally, the code point 16#FEFF is used for byte order marks
- (BOM's) and use of that character is not encouraged in other
- contexts than that. It actually is valid though, as the character
- "ZWNBS" (Zero Width Non Breaking Space). BOM's are used to identify
- encodings and byte order for programs where such parameters are not
- known in advance. Byte order marks are more seldom used than one
- could expect, but their use might become more widely spread as they
- provide the means for programs to make educated guesses about the
- Unicode format of a certain file.</p>
-</section>
-<section>
- <title>Areas of Unicode Support</title>
- <p>To support Unicode in Erlang, problems in several areas have been
- addressed. Each area is described briefly in this section and more
- thoroughly further down in this document:</p>
- <taglist>
- <tag>Representation</tag>
- <item>To handle Unicode characters in Erlang, we have to have a
- common representation both in lists and binaries. The EEP (10) and
- the subsequent initial implementation in Erlang/OTP R13A settled a standard
- representation of Unicode characters in Erlang.</item>
- <tag>Manipulation</tag>
- <item>The Unicode characters need to be processed by the Erlang
- program, why library functions need to be able to handle them. In
- some cases functionality was added to already existing interfaces
- (as the string module now can handle lists with arbitrary code points),
- in some cases new functionality or options need to be added (as in
- the <c>io</c>-module, the file handling, the <c>unicode</c> module
- and the bit syntax). Today most modules in kernel and STDLIB, as
- well as the VM are Unicode aware.</item>
- <tag>File I/O</tag>
- <item>I/O is by far the most problematic area for Unicode. A file
- is an entity where bytes are stored and the lore of programming
- has been to treat characters and bytes as interchangeable. With
- Unicode characters, you need to decide on an encoding as soon as
- you want to store the data in a file. In Erlang you can open a
- text file with an encoding option, so that you can read characters
- from it rather than bytes, but you can also open a file for
- bytewise I/O. The I/O-system of Erlang has been designed (or at
- least used) in a way where you expect any I/O-server to be
- able to cope with any string data, but that is no longer the case
- when you work with Unicode characters. Handling the fact that you
- need to know the capabilities of the device where your data ends
- up is something new to the Erlang programmer. Furthermore, ports
- in Erlang are byte oriented, so an arbitrary string of (Unicode)
- characters can not be sent to a port without first converting it
- to an encoding of choice.</item>
- <tag>Terminal I/O</tag>
- <item>Terminal I/O is slightly easier than file I/O. The output is
- meant for human reading and is usually Erlang syntax (e.g. in the
- shell). There exists syntactic representation of any Unicode
- character without actually displaying the glyph (instead written
- as <c>\x{</c>HHH<c>}</c>), so Unicode data can usually be displayed
- even if the terminal as such do not support the whole Unicode
- range.</item>
- <tag>File names</tag>
- <item>File names can be stored as Unicode strings, in different
- ways depending on the underlying OS and file system. This can be
- handled fairly easy by a program. The problems arise when the file
- system is not consistent in it's encodings, like for example
- Linux. Linux allows files to be named with any sequence of bytes,
- leaving to each program to interpret those bytes. On systems where
- these "transparent" file names are used, Erlang has to be informed
- about the file name encoding by a startup flag. The default is
- bytewise interpretation, which is actually usually wrong, but
- allows for interpretation of <em>all</em> file names. The concept
- of "raw file names" can be used to handle wrongly encoded
- file names if one enables Unicode file name translation
- (<c>+fnu</c>) on platforms where this is not the default.</item>
- <tag>Source code encoding</tag>
- <item>When it comes to the Erlang source code, there is support
- for the UTF-8 encoding and bytewise encoding. The default in
- Erlang/OTP R16B was bytewise (or latin1) encoding; in Erlang/OTP 17.0
- it was changed to UTF-8. You can control the encoding by a comment like:
-<code>
-%% -*- coding: utf-8 -*-
-</code>
- in the beginning of the file. This of course requires your editor to
- support UTF-8 as well. The same comment is also interpreted by
- functions like <c>file:consult/1</c>, the release handler etc, so that
- you can have all text files in your source directories in UTF-8
- encoding.
- </item>
- <tag>The language</tag>
- <item>Having the source code in UTF-8 also allows you to write
- string literals containing Unicode characters with code points &gt;
- 255, although atoms, module names and function names are
- restricted to the ISO-Latin-1 range. Binary
- literals where you use the <c>/utf8</c> type, can also be
- expressed using Unicode characters &gt; 255. Having module names
- using characters other than 7-bit ASCII can cause trouble on
- operating systems with inconsistent file naming schemes, and might
- also hurt portability, so it's not really recommended. It is
- suggested in EEP 40 that the language should also allow for
- Unicode characters &gt; 255 in variable names. Whether to
- implement that EEP or not is yet to be decided.</item>
- </taglist>
-</section>
-<section>
- <title>Standard Unicode Representation</title>
- <p>In Erlang, strings are actually lists of integers. A string was
- up until Erlang/OTP R13 defined to be encoded in the ISO-latin-1 (ISO8859-1)
- character set, which is, code point by code point, a sub-range of
- the Unicode character set.</p>
- <p>The standard list encoding for strings was therefore easily
- extended to cope with the whole Unicode range: A Unicode string in
- Erlang is simply a list containing integers, each integer being a
- valid Unicode code point and representing one character in the
- Unicode character set.</p>
- <p>Erlang strings in ISO-latin-1 are a subset of Unicode
- strings.</p>
- <p>Only if a string contains code points &lt; 256, can it be
- directly converted to a binary by using
- i.e. <c>erlang:iolist_to_binary/1</c> or can be sent directly to a
- port. If the string contains Unicode characters &gt; 255, an
- encoding has to be decided upon and the string should be converted
- to a binary in the preferred encoding using
- <c>unicode:characters_to_binary/{1,2,3}</c>. Strings are not
- generally lists of bytes, as they were before Erlang/OTP R13. They are lists of
- characters. Characters are not generally bytes, they are Unicode
- code points.</p>
-
- <p>Binaries are more troublesome. For performance reasons, programs
- often store textual data in binaries instead of lists, mainly
- because they are more compact (one byte per character instead of two
- words per character, as is the case with lists). Using
- <c>erlang:list_to_binary/1</c>, an ISO-Latin-1 Erlang string could
- be converted into a binary, effectively using bytewise encoding -
- one byte per character. This was very convenient for those limited
- Erlang strings, but cannot be done for arbitrary Unicode lists.</p>
- <p>As the UTF-8 encoding is widely spread and provides some backward
- compatibility in the 7-bit ASCII range, it is selected as the
- standard encoding for Unicode characters in binaries for Erlang.</p>
- <p>The standard binary encoding is used whenever a library function
- in Erlang should cope with Unicode data in binaries, but is of
- course not enforced when communicating externally. Functions and
- bit-syntax exist to encode and decode both UTF-8, UTF-16 and UTF-32
- in binaries. Library functions dealing with binaries and Unicode in
- general, however, only deal with the default encoding.</p>
-
- <p>Character data may be combined from several sources, sometimes
- available in a mix of strings and binaries. Erlang has for long had
- the concept of <c>iodata</c> or <c>iolist</c>s, where binaries and
- lists can be combined to represent a sequence of bytes. In the same
- way, the Unicode aware modules often allow for combinations of
- binaries and lists where the binaries have characters encoded in
- UTF-8 and the lists contain such binaries or numbers representing
- Unicode code points:</p>
- <code type="none">
+ <section>
+ <title>Unicode Implementation</title>
+ <p>Implementing support for Unicode character sets is an ongoing process.
+ The Erlang Enhancement Proposal (EEP) 10 outlined the basics of Unicode
+ support and specified a default encoding in binaries that all
+ Unicode-aware modules are to handle in the future.</p>
+
+    <p>Here is an overview of what has been done so far:</p>
+
+ <list type="bulleted">
+ <item><p>The functionality described in EEP10 was implemented
+ in Erlang/OTP R13A.</p></item>
+
+ <item><p>Erlang/OTP R14B01 added support for Unicode
+ filenames, but it was not complete and was by default
+ disabled on platforms where no guarantee was given for the
+ filename encoding.</p></item>
+
+ <item><p>With Erlang/OTP R16A came support for UTF-8 encoded
+ source code, with enhancements to many of the applications to
+ support both Unicode encoded filenames and support for UTF-8
+ encoded files in many circumstances. Most notable is the
+ support for UTF-8 in files read by <seealso
+ marker="kernel:file#consult/1"><c>file:consult/1</c></seealso>,
+ release handler support for UTF-8, and more support for
+ Unicode character sets in the I/O system.</p></item>
+
+ <item><p>In Erlang/OTP 17.0, the encoding default for Erlang
+ source files was switched to UTF-8.</p></item>
+ </list>
+
+ <p>This section outlines the current Unicode support and gives some
+ recipes for working with Unicode data.</p>
+ </section>
+
+ <section>
+ <title>Understanding Unicode</title>
+ <p>Experience with the Unicode support in Erlang has made it clear that
+ understanding Unicode characters and encodings is not as easy as one
+ would expect. The complexity of the field and the implications of the
+ standard require thorough understanding of concepts rarely before
+ thought of.</p>
+
+ <p>Also, the Erlang implementation requires understanding of
+ concepts that were never an issue for many (Erlang) programmers. To
+ understand and use Unicode characters requires that you study the
+ subject thoroughly, even if you are an experienced programmer.</p>
+
+ <p>As an example, contemplate the issue of converting between upper and
+ lower case letters. Reading the standard makes you realize that there is
+ not a simple one to one mapping in all scripts, for example:</p>
+
+ <list type="bulleted">
+ <item>
+ <p>In German, the letter "ß" (sharp s) is in lower case, but the
+ uppercase equivalent is "SS".</p>
+ </item>
+ <item>
+ <p>In Greek, the letter "Σ" has two different lowercase forms,
+ "ς" in word-final position and "σ" elsewhere.</p>
+ </item>
+ <item>
+ <p>In Turkish, both dotted and dotless "i" exist in lower case and
+ upper case forms.</p>
+ </item>
+ <item>
+        <p>Cyrillic "I" usually has no lowercase form.</p>
+ </item>
+ <item>
+        <p>Some languages have no concept of upper case (or lower case) at all.</p>
+ </item>
+ </list>
+
+ <p>So, a conversion function must know not only one character at a time,
+ but possibly the whole sentence, the natural language to translate to,
+ the differences in input and output string length, and so on.
+ Erlang/OTP has currently no Unicode <c>to_upper</c>/<c>to_lower</c>
+ functionality, but publicly available libraries address these issues.</p>
+
+ <p>Another example is the accented characters, where the same glyph has two
+ different representations. The Swedish letter "ö" is one example.
+ The Unicode standard has a code point for it, but you can also write it
+ as "o" followed by "U+0308" (Combining Diaeresis, with the simplified
+ meaning that the last letter is to have "¨" above). They have the same
+ glyph. They are for most purposes the same, but have different
+ representations. For example, MacOS X converts all filenames to use
+ Combining Diaeresis, while most other programs (including Erlang) try to
+ hide that by doing the opposite when, for example, listing directories.
+ However it is done, it is usually important to normalize such
+ characters to avoid confusion.</p>
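+
+    <p>As an illustration (a minimal sketch, written as plain code point lists),
+      the two representations of "ö" compare as different terms even though
+      they render as the same glyph:</p>
+
+    <code>
+Precomposed = [16#00F6],       %% "ö" as one code point (LATIN SMALL LETTER O WITH DIAERESIS)
+Decomposed  = [$o, 16#0308],   %% "o" followed by U+0308 (Combining Diaeresis)
+Precomposed =:= Decomposed.    %% false - same glyph, different representations</code>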
+
+    <p>The list of examples can be made long. One needs a kind of knowledge that
+ was not needed when programs only considered one or two languages. The
+ complexity of human languages and scripts has certainly made this a
+ challenge when constructing a universal standard. Supporting Unicode
+ properly in your program will require effort.</p>
+ </section>
+
+ <section>
+ <title>What Unicode Is</title>
+ <p>Unicode is a standard defining code points (numbers) for all known,
+ living or dead, scripts. In principle, every symbol used in any
+ language has a Unicode code point. Unicode code points are defined and
+ published by the Unicode Consortium, which is a non-profit
+ organization.</p>
+
+ <p>Support for Unicode is increasing throughout the world of computing, as
+ the benefits of one common character set are overwhelming when programs
+ are used in a global environment. Along with the base of the standard,
+ the code points for all the scripts, some <em>encoding standards</em> are
+ available.</p>
+
+ <p>It is vital to understand the difference between encodings and Unicode
+ characters. Unicode characters are code points according to the Unicode
+ standard, while the encodings are ways to represent such code points. An
+ encoding is only a standard for representation. UTF-8 can, for example,
+ be used to represent a very limited part of the Unicode character set
+ (for example ISO-Latin-1) or the full Unicode range. It is only an
+ encoding format.</p>
+
+ <p>As long as all character sets were limited to 256 characters, each
+ character could be stored in one single byte, so there was more or less
+ only one practical encoding for the characters. Encoding each character
+ in one byte was so common that the encoding was not even named. With the
+      Unicode system there are many more than 256 characters, so a common way
+ is needed to represent these. The common ways of representing the code
+ points are the encodings. This means a whole new concept to the
+ programmer, the concept of character representation, which was a
+ non-issue earlier.</p>
+
+ <p>Different operating systems and tools support different encodings. For
+ example, Linux and MacOS X have chosen the UTF-8 encoding, which is
+ backward compatible with 7-bit ASCII and therefore affects programs
+ written in plain English the least. Windows supports a limited version
+ of UTF-16, namely all the code planes where the characters can be
+ stored in one single 16-bit entity, which includes most living
+ languages.</p>
+
+ <p>The following are the most widely spread encodings:</p>
+
+ <taglist>
+ <tag>Bytewise representation</tag>
+ <item>
+ <p>This is not a proper Unicode representation, but the representation
+ used for characters before the Unicode standard. It can still be used
+ to represent character code points in the Unicode standard with
+ numbers &lt; 256, which exactly corresponds to the ISO Latin-1
+ character set. In Erlang, this is commonly denoted <c>latin1</c>
+ encoding, which is slightly misleading as ISO Latin-1 is a
+ character code range, not an encoding.</p>
+ </item>
+ <tag>UTF-8</tag>
+ <item>
+ <p>Each character is stored in one to four bytes depending on code
+ point. The encoding is backward compatible with bytewise
+ representation of 7-bit ASCII, as all 7-bit characters are stored in
+ one single byte in UTF-8. The characters beyond code point 127 are
+ stored in more bytes, letting the most significant bit in the first
+ character indicate a multi-byte character. For details on the
+ encoding, the RFC is publicly available.</p>
+ <p>Notice that UTF-8 is <em>not</em> compatible with bytewise
+ representation for code points from 128 through 255, so an ISO
+ Latin-1 bytewise representation is generally incompatible with
+ UTF-8.</p>
+ </item>
+ <tag>UTF-16</tag>
+ <item>
+ <p>This encoding has many similarities to UTF-8, but the basic
+ unit is a 16-bit number. This means that all characters occupy
+ at least two bytes, and some high numbers four bytes. Some
+ programs, libraries, and operating systems claiming to use
+ UTF-16 only allow for characters that can be stored in one
+ 16-bit entity, which is usually sufficient to handle living
+ languages. As the basic unit is more than one byte, byte-order
+ issues occur, which is why UTF-16 exists in both a big-endian
+ and a little-endian variant.</p>
+ <p>In Erlang, the full UTF-16 range is supported when applicable, like
+ in the <seealso marker="stdlib:unicode"><c>unicode</c></seealso>
+ module and in the bit syntax.</p>
+ </item>
+ <tag>UTF-32</tag>
+ <item>
+ <p>The most straightforward representation. Each character is stored in
+ one single 32-bit number. There is no need for escapes or any
+ variable number of entities for one character. All Unicode code
+ points can be stored in one single 32-bit entity. As with UTF-16,
+ there are byte-order issues. UTF-32 can be both big-endian and
+ little-endian.</p>
+ </item>
+ <tag>UCS-4</tag>
+ <item>
+ <p>Basically the same as UTF-32, but without some Unicode semantics,
+ defined by IEEE, and has little use as a separate encoding standard.
+ For all normal (and possibly abnormal) use, UTF-32 and UCS-4 are
+ interchangeable.</p>
+ </item>
+ </taglist>
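+
+    <p>As an illustration of how the same characters end up as different byte
+      sequences in these encodings, the following sketch uses
+      <c>unicode:characters_to_binary/3</c> (the byte values shown in the
+      comments assume the big-endian variants of UTF-16 and UTF-32):</p>
+
+    <code>
+Utf8  = unicode:characters_to_binary("åäö", unicode, utf8),  %% &lt;&lt;195,165,195,164,195,182&gt;&gt;
+Utf16 = unicode:characters_to_binary("åäö", unicode, utf16), %% &lt;&lt;0,229,0,228,0,246&gt;&gt;
+Utf32 = unicode:characters_to_binary("åäö", unicode, utf32). %% &lt;&lt;0,0,0,229,0,0,0,228,0,0,0,246&gt;&gt;</code>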
+
+ <p>Certain number ranges are unused in the Unicode standard and certain
+ ranges are even deemed invalid. The most notable invalid range is
+ 16#D800-16#DFFF, as the UTF-16 encoding does not allow for encoding of
+ these numbers. This is possibly because the UTF-16 encoding standard,
+ from the beginning, was expected to be able to hold all Unicode
+ characters in one 16-bit entity, but was then extended, leaving a hole
+ in the Unicode range to handle backward compatibility.</p>
+
+ <p>Code point 16#FEFF is used for Byte Order Marks (BOMs) and use of that
+ character is not encouraged in other contexts. It is valid though, as
+ the character "ZWNBS" (Zero Width Non Breaking Space). BOMs are used to
+ identify encodings and byte order for programs where such parameters are
+      not known in advance. BOMs are used less often than expected, but can
+ become more widely spread as they provide the means for programs to make
+ educated guesses about the Unicode format of a certain file.</p>
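+
+    <p>As a sketch of how a program can use a BOM, the following guesses the
+      encoding of a file from its (optional) BOM using
+      <c>unicode:bom_to_encoding/1</c>; the file name is hypothetical, and the
+      encoding defaults to <c>latin1</c> when no BOM is found:</p>
+
+    <code>
+{ok, Bin} = file:read_file("some_text_file.txt"),     %% hypothetical file
+{Encoding, BomLength} = unicode:bom_to_encoding(Bin), %% for example {utf8,3} or {latin1,0}
+&lt;&lt;_Bom:BomLength/binary, Rest/binary&gt;&gt; = Bin,
+Chars = unicode:characters_to_list(Rest, Encoding).</code>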
+ </section>
+
+ <section>
+ <title>Areas of Unicode Support</title>
+ <p>To support Unicode in Erlang, problems in various areas have been
+ addressed. This section describes each area briefly and more
+ thoroughly later in this User's Guide.</p>
+
+ <taglist>
+ <tag>Representation</tag>
+ <item>
+ <p>To handle Unicode characters in Erlang, a common representation
+ in both lists and binaries is needed. EEP (10) and the subsequent
+ initial implementation in Erlang/OTP R13A settled a standard
+ representation of Unicode characters in Erlang.</p>
+ </item>
+ <tag>Manipulation</tag>
+ <item>
+ <p>The Unicode characters need to be processed by the Erlang
+ program, which is why library functions must be able to handle
+ them. In some cases functionality has been added to already
+ existing interfaces (as the <seealso
+ marker="stdlib:string"><c>string</c></seealso> module now can
+ handle lists with any code points). In some cases new
+ functionality or options have been added (as in the <seealso
+ marker="stdlib:io"><c>io</c></seealso> module, the file
+ handling, the <seealso
+ marker="stdlib:unicode"><c>unicode</c></seealso> module, and
+ the bit syntax). Today most modules in <c>Kernel</c> and
+ <c>STDLIB</c>, as well as the VM are Unicode-aware.</p>
+ </item>
+ <tag>File I/O</tag>
+ <item>
+ <p>I/O is by far the most problematic area for Unicode. A file is an
+ entity where bytes are stored, and the lore of programming has been
+ to treat characters and bytes as interchangeable. With Unicode
+ characters, you must decide on an encoding when you want to store
+ the data in a file. In Erlang, you can open a text file with an
+ encoding option, so that you can read characters from it rather than
+        bytes, but you can also open a file for bytewise I/O (see the sketch
+        after this list).</p>
+ <p>The Erlang I/O-system has been designed (or at least used) in a way
+ where you expect any I/O server to handle any string data.
+ That is, however, no longer the case when working with Unicode
+ characters. The Erlang programmer must now know the
+ capabilities of the device where the data ends up. Also, ports in
+ Erlang are byte-oriented, so an arbitrary string of (Unicode)
+ characters cannot be sent to a port without first converting it to an
+ encoding of choice.</p>
+ </item>
+ <tag>Terminal I/O</tag>
+ <item>
+ <p>Terminal I/O is slightly easier than file I/O. The output is meant
+ for human reading and is usually Erlang syntax (for example, in the
+ shell). There exists syntactic representation of any Unicode
+ character without displaying the glyph (instead written as
+        <c>\x{</c>HHH<c>}</c>). Unicode data can therefore usually be
+ displayed even if the terminal as such does not support the whole
+ Unicode range.</p>
+ </item>
+ <tag>Filenames</tag>
+ <item>
+ <p>Filenames can be stored as Unicode strings in different ways
+ depending on the underlying operating system and file system. This
+        can be handled fairly easily by a program. The problems arise when the
+ file system is inconsistent in its encodings. For example, Linux
+        allows files to be named with any sequence of bytes, leaving it to each
+ program to interpret those bytes. On systems where these
+ "transparent" filenames are used, Erlang must be informed about the
+ filename encoding by a startup flag. The default is bytewise
+ interpretation, which is usually wrong, but allows for interpretation
+ of <em>all</em> filenames.</p>
+ <p>The concept of "raw filenames" can be used to handle wrongly encoded
+ filenames if one enables Unicode filename translation (<c>+fnu</c>)
+ on platforms where this is not the default.</p>
+ </item>
+ <tag>Source code encoding</tag>
+ <item>
+ <p>The Erlang source code has support for the UTF-8 encoding
+ and bytewise encoding. The default in Erlang/OTP R16B was bytewise
+ (<c>latin1</c>) encoding. It was changed to UTF-8 in Erlang/OTP 17.0.
+ You can control the encoding by a comment like the following in the
+ beginning of the file:</p>
+ <code>
+%% -*- coding: utf-8 -*-</code>
+ <p>This of course requires your editor to support UTF-8 as well. The
+ same comment is also interpreted by functions like
+ <seealso marker="kernel:file#consult/1"><c>file:consult/1</c></seealso>,
+ the release handler, and so on, so that you can have all text files
+ in your source directories in UTF-8 encoding.</p>
+ </item>
+ <tag>The language</tag>
+ <item>
+ <p>Having the source code in UTF-8 also allows you to write string
+ literals containing Unicode characters with code points &gt; 255,
+ although atoms, module names, and function names are restricted to
+ the ISO Latin-1 range. Binary literals, where you use type
+ <c>/utf8</c>, can also be expressed using Unicode characters &gt; 255.
+ Having module names using characters other than 7-bit ASCII can cause
+ trouble on operating systems with inconsistent file naming schemes,
+ and can hurt portability, so it is not recommended.</p>
+ <p>EEP 40 suggests that the language is also to allow for Unicode
+ characters &gt; 255 in variable names. Whether to implement that EEP
+ is yet to be decided.</p>
+ </item>
+ </taglist>
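+
+    <p>As a sketch of the File I/O item above, the following opens a
+      (hypothetical) file for reading characters rather than bytes by passing
+      an <c>encoding</c> option to <c>file:open/2</c>, and then reads one line
+      as a list of Unicode code points:</p>
+
+    <code>
+{ok, File} = file:open("utf8_text.txt", [read, {encoding, utf8}]), %% hypothetical file
+Line = io:get_line(File, ''),   %% a list of Unicode code points
+ok = file:close(File).</code>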
+ </section>
+
+ <section>
+ <title>Standard Unicode Representation</title>
+ <p>In Erlang, strings are lists of integers. A string was until
+ Erlang/OTP R13 defined to be encoded in the ISO Latin-1 (ISO 8859-1)
+ character set, which is, code point by code point, a subrange of the
+ Unicode character set.</p>
+
+ <p>The standard list encoding for strings was therefore easily extended to
+ handle the whole Unicode range. A Unicode string in Erlang is a list
+ containing integers, where each integer is a valid Unicode code point and
+ represents one character in the Unicode character set.</p>
+
+ <p>Erlang strings in ISO Latin-1 are a subset of Unicode strings.</p>
+
+ <p>Only if a string contains code points &lt; 256, can it be directly
+ converted to a binary by using, for example,
+ <seealso marker="erts:erlang#iolist_to_binary/1"><c>erlang:iolist_to_binary/1</c></seealso>
+ or can be sent directly to a port. If the string contains Unicode
+ characters &gt; 255, an encoding must be decided upon and the string is to
+ be converted to a binary in the preferred encoding using
+ <seealso marker="stdlib:unicode#characters_to_binary/1"><c>unicode:characters_to_binary/1,2,3</c></seealso>.
+ Strings are not generally lists of bytes, as they were before
+ Erlang/OTP R13, they are lists of characters. Characters are not
+ generally bytes, they are Unicode code points.</p>
+
+ <p>Binaries are more troublesome. For performance reasons, programs often
+ store textual data in binaries instead of lists, mainly because they are
+ more compact (one byte per character instead of two words per character,
+ as is the case with lists). Using
+ <seealso marker="erts:erlang#list_to_binary/1"><c>erlang:list_to_binary/1</c></seealso>,
+ an ISO Latin-1 Erlang string can be converted into a binary, effectively
+ using bytewise encoding: one byte per character. This was convenient for
+ those limited Erlang strings, but cannot be done for arbitrary Unicode
+ lists.</p>
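+
+    <p>The following sketch contrasts the two conversions; the bytewise
+      conversion only works as long as all code points are &lt; 256:</p>
+
+    <code>
+Latin1List = "abcåäö",                              %% all code points &lt; 256
+Latin1Bin  = erlang:list_to_binary(Latin1List),     %% bytewise (latin1) encoding
+UniList    = [1024|"abcåäö"],                       %% contains a code point &gt; 255
+UniBin     = unicode:characters_to_binary(UniList). %% UTF-8 encoded binary
+%% erlang:list_to_binary(UniList) would fail with badarg</code>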
+
+ <p>As the UTF-8 encoding is widely spread and provides some backward
+ compatibility in the 7-bit ASCII range, it is selected as the standard
+ encoding for Unicode characters in binaries for Erlang.</p>
+
+ <p>The standard binary encoding is used whenever a library function in
+ Erlang is to handle Unicode data in binaries, but is of course not
+ enforced when communicating externally. Functions and bit syntax exist to
+ encode and decode both UTF-8, UTF-16, and UTF-32 in binaries. However,
+ library functions dealing with binaries and Unicode in general only deal
+ with the default encoding.</p>
+
+ <p>Character data can be combined from many sources, sometimes available in
+ a mix of strings and binaries. Erlang has for long had the concept of
+ <c>iodata</c> or <c>iolist</c>s, where binaries and lists can be combined
+ to represent a sequence of bytes. In the same way, the Unicode-aware
+ modules often allow for combinations of binaries and lists, where the
+ binaries have characters encoded in UTF-8 and the lists contain such
+ binaries or numbers representing Unicode code points:</p>
+
+ <code type="none">
unicode_binary() = binary() with characters encoded in UTF-8 coding standard
chardata() = charlist() | unicode_binary()
charlist() = maybe_improper_list(char() | unicode_binary() | charlist(),
- unicode_binary() | nil())</code>
- <p>The module <seealso
- marker="stdlib:unicode"><c>unicode</c></seealso> in STDLIB even
- supports similar mixes with binaries containing other encodings than
- UTF-8, but that is a special case to allow for conversions to and
- from external data:</p>
- <code type="none">
-external_unicode_binary() = binary() with characters coded in
- a user specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32)
+ unicode_binary() | nil())</code>
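+
+  <p>For example (a sketch), such a mixed <c>chardata()</c> term can be
+    flattened to a list of Unicode code points with
+    <c>unicode:characters_to_list/1</c>:</p>
+
+  <code>
+Mix = ["abc", &lt;&lt;"åäö"/utf8&gt;&gt;, [1089, &lt;&lt;" def"/utf8&gt;&gt;]],
+unicode:characters_to_list(Mix).
+%% yields the code points of "abcåäöс def"</code>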
+
+ <p>The module <seealso marker="stdlib:unicode"><c>unicode</c></seealso>
+ even supports similar mixes with binaries containing other encodings than
+ UTF-8, but that is a special case to allow for conversions to and from
+ external data:</p>
+
+ <code type="none">
+external_unicode_binary() = binary() with characters coded in a user-specified
+ Unicode encoding other than UTF-8 (UTF-16 or UTF-32)
external_chardata() = external_charlist() | external_unicode_binary()
-external_charlist() = maybe_improper_list(char() |
- external_unicode_binary() |
- external_charlist(),
- external_unicode_binary() | nil())</code>
-</section>
-<section>
- <title>Basic Language Support</title>
- <p><marker id="unicode_in_erlang"/>As of Erlang/OTP R16 Erlang
- source files can be written in either UTF-8 or bytewise encoding
- (a.k.a. <c>latin1</c> encoding). The details on how to state the encoding
- of an Erlang source file can be found in
- <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso>. Strings and comments
- can be written using Unicode, but functions still have to be named
- using characters from the ISO-latin-1 character set and atoms are
- restricted to the same ISO-latin-1 range. These restrictions in the
- language are of course independent of the encoding of the source
- file.</p>
+external_charlist() = maybe_improper_list(char() | external_unicode_binary() |
+ external_charlist(), external_unicode_binary() | nil())</code>
+ </section>
+
<section>
- <title>Bit-syntax</title>
- <p>The bit-syntax contains types for coping with binary data in the
- three main encodings. The types are named <c>utf8</c>, <c>utf16</c>
- and <c>utf32</c> respectively. The <c>utf16</c> and <c>utf32</c> types
- can be in a big- or little-endian variant:</p>
- <code>
+ <title>Basic Language Support</title>
+ <p><marker id="unicode_in_erlang"/>As from Erlang/OTP R16, Erlang source
+ files can be written in UTF-8 or bytewise (<c>latin1</c>) encoding. For
+ information about how to state the encoding of an Erlang source file, see
+ the <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso> module.
+ Strings and comments can be written using Unicode, but functions must
+ still be named using characters from the ISO Latin-1 character set, and
+ atoms are restricted to the same ISO Latin-1 range. These restrictions in
+ the language are of course independent of the encoding of the source
+ file.</p>
+
+ <section>
+ <title>Bit Syntax</title>
+ <p>The bit syntax contains types for handling binary data in the
+ three main encodings. The types are named <c>utf8</c>, <c>utf16</c>,
+ and <c>utf32</c>. The <c>utf16</c> and <c>utf32</c> types can be in a
+ big-endian or a little-endian variant:</p>
+
+ <code>
&lt;&lt;Ch/utf8,_/binary&gt;&gt; = Bin1,
&lt;&lt;Ch/utf16-little,_/binary&gt;&gt; = Bin2,
Bin3 = &lt;&lt;$H/utf32-little, $e/utf32-little, $l/utf32-little, $l/utf32-little,
$o/utf32-little&gt;&gt;,</code>
- <p>For convenience, literal strings can be encoded with a Unicode
- encoding in binaries using the following (or similar) syntax:</p>
- <code>
+
+ <p>For convenience, literal strings can be encoded with a Unicode
+ encoding in binaries using the following (or similar) syntax:</p>
+
+ <code>
Bin4 = &lt;&lt;"Hello"/utf16&gt;&gt;,</code>
- </section>
- <section>
- <title>String and Character Literals</title>
- <p>For source code, there is an extension to the <c>\</c>OOO
- (backslash followed by three octal numbers) and <c>\x</c>HH
- (backslash followed by <c>x</c>, followed by two hexadecimal
- characters) syntax, namely <c>\x{</c>H ...<c>}</c> (a backslash
- followed by an <c>x</c>, followed by left curly bracket, any
- number of hexadecimal digits and a terminating right curly
- bracket). This allows for entering characters of any code point
- literally in a string even when the encoding of the source file is
- bytewise (<c>latin1</c>).</p>
- <p>In the shell, if using a Unicode input device, or in source
- code stored in UTF-8, <c>$</c> can be followed directly by a
- Unicode character producing an integer. In the following example
- the code point of a Cyrillic <c>с</c> is output:</p>
- <pre>
+ </section>
+
+ <section>
+ <title>String and Character Literals</title>
+ <p>For source code, there is an extension to syntax <c>\</c>OOO
+ (backslash followed by three octal numbers) and <c>\x</c>HH (backslash
+ followed by <c>x</c>, followed by two hexadecimal characters), namely
+ <c>\x{</c>H ...<c>}</c> (backslash followed by <c>x</c>, followed by
+ left curly bracket, any number of hexadecimal digits, and a terminating
+ right curly bracket). This allows for entering characters of any code
+ point literally in a string even when the encoding of the source file
+ is bytewise (<c>latin1</c>).</p>
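+
+      <p>For example, the Cyrillic letter "с" (code point 16#441) can be
+        written in a string literal as follows, regardless of the source file
+        encoding (a sketch; how the resulting list is displayed depends on the
+        <c>+pc</c> setting described later):</p>
+
+      <code>
+S = "\x{441}",    %% the same as [1089]</code>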
+
+ <p>In the shell, if using a Unicode input device, or in source code
+ stored in UTF-8, <c>$</c> can be followed directly by a Unicode
+ character producing an integer. In the following example, the code
+ point of a Cyrillic <c>с</c> is output:</p>
+
+ <pre>
7> <input>$с.</input>
1089</pre>
- </section>
- <section>
- <title>Heuristic String Detection</title>
- <p>In certain output functions and in the output of return values
- in the shell, Erlang tries to heuristically detect string data in
- lists and binaries. Typically you will see heuristic detection in
- a situation like this:</p>
- <pre>
+ </section>
+
+ <section>
+ <title>Heuristic String Detection</title>
+ <p>In certain output functions and in the output of return values in
+ the shell, Erlang tries to detect string data in lists and binaries
+ heuristically. Typically you will see heuristic detection in a
+ situation like this:</p>
+
+ <pre>
1> <input>[97,98,99].</input>
"abc"
2> <input>&lt;&lt;97,98,99&gt;&gt;.</input>
&lt;&lt;"abc"&gt;&gt;
3> <input>&lt;&lt;195,165,195,164,195,182&gt;&gt;.</input>
&lt;&lt;"åäö"/utf8&gt;&gt;</pre>
- <p>Here the shell will detect lists containing printable
- characters or binaries containing printable characters either in
- bytewise or UTF-8 encoding. The question here is: what is a
- printable character? One view would be that anything the Unicode
- standard thinks is printable, will also be printable according to
- the heuristic detection. The result would be that almost any list
- of integers will be deemed a string, resulting in all sorts of
- characters being printed, maybe even characters your terminal does
- not have in its font set (resulting in some generic output you
- probably will not appreciate). Another way is to keep it backwards
- compatible so that only the ISO-Latin-1 character set is used to
- detect a string. A third way would be to let the user decide
- exactly what Unicode ranges are to be viewed as characters. Since
- Erlang/OTP R16B you can select either the whole Unicode range or the
- ISO-Latin-1 range by supplying the startup flag <c>+pc
- </c><i>Range</i>, where <i>Range</i> is either <c>latin1</c> or
- <c>unicode</c>. For backwards compatibility, the default is
- <c>latin1</c>. This only controls how heuristic string detection
- is done. In the future, more ranges are expected to be added, so
- that one can tailor the heuristics to the language and region
- relevant to the user.</p>
- <p>Lets look at an example with the two different startup options:</p>
-<pre>
+
+ <p>Here the shell detects lists containing printable characters or
+ binaries containing printable characters in bytewise or UTF-8 encoding.
+ But what is a printable character? One view is that anything the Unicode
+ standard thinks is printable, is also printable according to the
+ heuristic detection. The result is then that almost any list of
+      integers is deemed a string, and all sorts of characters are printed,
+ maybe also characters that your terminal lacks in its font set
+ (resulting in some unappreciated generic output).
+ Another way is to keep it backward compatible so that only the ISO
+ Latin-1 character set is used to detect a string. A third way is to let
+      the user decide exactly what Unicode ranges are to be viewed as
+ characters.</p>
+
+ <p>As from Erlang/OTP R16B you can select the ISO Latin-1 range or the
+ whole Unicode range by supplying startup flag <c>+pc latin1</c> or
+ <c>+pc unicode</c>, respectively. For backward compatibility,
+      <c>latin1</c> is the default. This only controls how heuristic string
+ detection is done. More ranges are expected to be added in the future,
+ enabling tailoring of the heuristics to the language and region
+ relevant to the user.</p>
+
+ <p>The following examples show the two startup options:</p>
+
+ <pre>
$ <input>erl +pc latin1</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -467,9 +535,9 @@ Eshell V5.10.1 (abort with ^G)
4> <input>&lt;&lt;208,174,208,189,208,184,208,186,208,190,208,180&gt;&gt;.</input>
&lt;&lt;208,174,208,189,208,184,208,186,208,190,208,180&gt;&gt;
5> <input>&lt;&lt;229/utf8,228/utf8,246/utf8&gt;&gt;.</input>
-&lt;&lt;"åäö"/utf8&gt;&gt;
-</pre>
-<pre>
+&lt;&lt;"åäö"/utf8&gt;&gt;</pre>
+
+ <pre>
$ <input>erl +pc unicode</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -483,78 +551,88 @@ Eshell V5.10.1 (abort with ^G)
4> <input>&lt;&lt;208,174,208,189,208,184,208,186,208,190,208,180&gt;&gt;.</input>
&lt;&lt;"Юникод"/utf8&gt;&gt;
5> <input>&lt;&lt;229/utf8,228/utf8,246/utf8&gt;&gt;.</input>
-&lt;&lt;"åäö"/utf8&gt;&gt;
-</pre>
- <p>In the examples, we can see that the default Erlang shell will
- only interpret characters from the ISO-Latin1 range as printable
- and will only detect lists or binaries with those "printable"
- characters as containing string data. The valid UTF-8 binary
- containing "Юникод", will not be printed as a string. When, on the
- other hand, started with all Unicode characters printable (<c>+pc
- unicode</c>), the shell will output anything containing printable
- Unicode data (in binaries either UTF-8 or bytewise encoded) as
- string data.</p>
-
- <p>These heuristics are also used by
- <c>io</c>(<c>_lib</c>)<c>:format/2</c> and friends when the
- <c>t</c> modifier is used in conjunction with <c>~p</c> or
- <c>~P</c>:</p>
-<pre>
+&lt;&lt;"åäö"/utf8&gt;&gt;</pre>
+
+ <p>In the examples, you can see that the default Erlang shell interprets
+      only characters from the ISO Latin-1 range as printable and only detects
+ lists or binaries with those "printable" characters as containing
+ string data. The valid UTF-8 binary containing the Russian word
+      "Юникод" is not printed as a string. When started with all Unicode
+ characters printable (<c>+pc unicode</c>), the shell outputs anything
+ containing printable Unicode data (in binaries, either UTF-8 or
+ bytewise encoded) as string data.</p>
+
+ <p>These heuristics are also used by
+ <seealso marker="stdlib:io#format/2"><c>io:format/2</c></seealso>,
+ <seealso marker="stdlib:io_lib#format/2"><c>io_lib:format/2</c></seealso>,
+ and friends when modifier <c>t</c> is used with <c>~p</c> or
+ <c>~P</c>:</p>
+
+ <pre>
$ <input>erl +pc latin1</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> <input>io:format("~tp~n",[{&lt;&lt;"åäö"&gt;&gt;, &lt;&lt;"åäö"/utf8&gt;&gt;, &lt;&lt;208,174,208,189,208,184,208,186,208,190,208,180&gt;&gt;}]).</input>
{&lt;&lt;"åäö"&gt;&gt;,&lt;&lt;"åäö"/utf8&gt;&gt;,&lt;&lt;208,174,208,189,208,184,208,186,208,190,208,180&gt;&gt;}
-ok
-</pre>
-<pre>
+ok</pre>
+
+ <pre>
$ <input>erl +pc unicode</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
Eshell V5.10.1 (abort with ^G)
1> <input>io:format("~tp~n",[{&lt;&lt;"åäö"&gt;&gt;, &lt;&lt;"åäö"/utf8&gt;&gt;, &lt;&lt;208,174,208,189,208,184,208,186,208,190,208,180&gt;&gt;}]).</input>
{&lt;&lt;"åäö"&gt;&gt;,&lt;&lt;"åäö"/utf8&gt;&gt;,&lt;&lt;"Юникод"/utf8&gt;&gt;}
-ok
-</pre>
- <p>Please observe that this only affects <i>heuristic</i> interpretation
- of lists and binaries on output. For example the <c>~ts</c> format
- sequence does always output a valid lists of characters,
- regardless of the <c>+pc</c> setting, as the programmer has
- explicitly requested string output.</p>
+ok</pre>
+
+ <p>Notice that this only affects <em>heuristic</em> interpretation of
+ lists and binaries on output. For example, the <c>~ts</c> format
+ sequence always outputs a valid list of characters, regardless of the
+ <c>+pc</c> setting, as the programmer has explicitly requested string
+ output.</p>
+ </section>
</section>
-</section>
-<section>
- <title>The Interactive Shell</title>
- <p>The interactive Erlang shell, when started towards a terminal or
- started using the <c>werl</c> command on windows, can support
- Unicode input and output.</p>
- <p>On Windows, proper operation requires that a suitable font
- is installed and selected for the Erlang application to use. If no
- suitable font is available on your system, try installing the DejaVu
- fonts (<c>dejavu-fonts.org</c>), which are freely available and then
- select that font in the Erlang shell application.</p>
- <p>On Unix-like operating systems, the terminal should be able
- to handle UTF-8 on input and output (modern versions of XTerm, KDE
- konsole and the Gnome terminal do for example) and your locale
- settings have to be proper. As an example, my <c>LANG</c>
- environment variable is set as this:</p>
- <pre>
+
+ <section>
+ <title>The Interactive Shell</title>
+    <p>The interactive Erlang shell, when started towards a terminal or started
+ using command <c>werl</c> on Windows, can support Unicode input and
+ output.</p>
+
+ <p>On Windows, proper operation requires that a suitable font is
+ installed and selected for the Erlang application to use. If no suitable
+ font is available on your system, try installing the
+ <url href="http://dejavu-fonts.org">DejaVu fonts</url>, which are freely
+ available, and then select that font in the Erlang shell application.</p>
+
+ <p>On Unix-like operating systems, the terminal is to be able to handle
+ UTF-8 on input and output (this is done by, for example, modern versions
+ of XTerm, KDE Konsole, and the Gnome terminal)
+ and your locale settings must be proper. As
+ an example, a <c>LANG</c> environment variable can be set as follows:</p>
+
+ <pre>
$ <input>echo $LANG</input>
en_US.UTF-8</pre>
- <p>Actually, most systems handle the <c>LC_CTYPE</c> variable before
- <c>LANG</c>, so if that is set, it has to be set to
- <c>UTF-8</c>:</p>
- <pre>
+
+ <p>Most systems handle variable <c>LC_CTYPE</c> before <c>LANG</c>, so if
+ that is set, it must be set to <c>UTF-8</c>:</p>
+
+ <pre>
$ echo <input>$LC_CTYPE</input>
en_US.UTF-8</pre>
- <p>The <c>LANG</c> or <c>LC_CTYPE</c> setting should be consistent
- with what the terminal is capable of, there is no portable way for
- Erlang to ask the actual terminal about its UTF-8 capacity, we have
- to rely on the language and character type settings.</p>
- <p>To investigate what Erlang thinks about the terminal, the
- <c>io:getopts()</c> call can be used when the shell is started:</p>
- <pre>
+
+    <p>The <c>LANG</c> or <c>LC_CTYPE</c> setting is to be consistent with
+      what the terminal is capable of. There is no portable way for Erlang to
+      ask the terminal about its UTF-8 capability, so we have to rely on the
+      language and character type settings.</p>
+
+ <p>To investigate what Erlang thinks about the terminal, the call
+ <seealso marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso>
+ can be used when the shell is started:</p>
+
+ <pre>
$ <input>LC_CTYPE=en_US.ISO-8859-1 erl</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -571,27 +649,31 @@ Eshell V5.10.1 (abort with ^G)
{encoding,unicode}
2></pre>
- <p>When (finally?) everything is in order with the locale settings,
- fonts and the terminal emulator, you probably also have discovered a
- way to input characters in the script you desire. For testing, the
- simplest way is to add some keyboard mappings for other languages,
- usually done with some applet in your desktop environment. In my KDE
- environment, I start the KDE Control Center (Personal Settings),
- select "Regional and Accessibility" and then "Keyboard Layout". On
- Windows XP, I start Control Panel->Regional and Language
- Options, select the Language tab and click the Details... button in
- the square named "Text services and input Languages". Your
- environment probably provides similar means of changing the keyboard
- layout. Make sure you have a way to easily switch back and forth
- between keyboards if you are not used to this, entering commands
- using a Cyrillic character set is, as an example, not easily done in
- the Erlang shell.</p>
-
- <p>Now you are set up for some Unicode input and output. The
- simplest thing to do is of course to enter a string in the
- shell:</p>
-
- <pre>
+    <p>When (finally?) everything is in order with the locale settings, fonts,
+ and the terminal emulator, you have probably found a way to input
+ characters in the script you desire. For testing, the simplest way is to
+ add some keyboard mappings for other languages, usually done with some
+ applet in your desktop environment.</p>
+
+ <p>In a KDE environment, select <em>KDE Control Center (Personal
+ Settings)</em> > <em>Regional and Accessibility</em> > <em>Keyboard
+ Layout</em>.</p>
+
+ <p>On Windows XP, select <em>Control Panel</em> > <em>Regional and Language
+ Options</em>, select tab <em>Language</em>, and click button
+ <em>Details...</em> in the square named <em>Text Services and Input
+ Languages</em>.</p>
+
+ <p>Your environment
+ probably provides similar means of changing the keyboard layout. Ensure
+ that you have a way to switch back and forth between keyboards easily if
+ you are not used to this. For example, entering commands using a Cyrillic
+ character set is not easily done in the Erlang shell.</p>
+
+ <p>Now you are set up for some Unicode input and output. The simplest thing
+ to do is to enter a string in the shell:</p>
+
+ <pre>
$ <input>erl</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -603,12 +685,13 @@ Eshell V5.10.1 (abort with ^G)
3> <input>io:format("~ts~n", [v(2)]).</input>
Юникод
ok
-4> </pre>
- <p>While strings can be input as Unicode characters, the language
- elements are still limited to the ISO-latin-1 character set. Only
- character constants and strings are allowed to be beyond that
- range:</p>
- <pre>
+4></pre>
+
+ <p>While strings can be input as Unicode characters, the language elements
+ are still limited to the ISO Latin-1 character set. Only character
+ constants and strings are allowed to be beyond that range:</p>
+
+ <pre>
$ <input>erl</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -618,371 +701,398 @@ Eshell V5.10.1 (abort with ^G)
2> <input>Юникод.</input>
* 1: illegal character
2> </pre>
-</section>
-<section>
- <title>Unicode File Names</title>
- <marker id="unicode_file_names"/>
- <p>Most modern operating systems support Unicode file names in some
- way or another. There are several different ways to do this and
- Erlang by default treats the different approaches differently:</p>
- <taglist>
- <tag>Mandatory Unicode file naming</tag>
- <item>
- <p>Windows and, for most common uses, MacOS X enforces Unicode
- support for file names. All files created in the file system have
- names that can consistently be interpreted. In MacOS X, all file
- names are retrieved in UTF-8 encoding, while Windows has
- selected an approach where each system call handling file names
- has a special Unicode aware variant, giving much the same
- effect. There are no file names on these systems that are not
- Unicode file names, why the default behavior of the Erlang VM is
- to work in &quot;Unicode file name translation mode&quot;,
- meaning that a file name can be given as a Unicode list and that
- will be automatically translated to the proper name encoding for
- the underlying operating and file system.</p>
- <p>Doing i.e. a <c>file:list_dir/1</c> on one of these systems
- may return Unicode lists with code points beyond 255, depending
- on the content of the actual file system.</p>
- <p>As the feature is fairly new, you may still stumble upon non
- core applications that cannot handle being provided with file
- names containing characters with code points larger than 255, but
- the core Erlang system should have no problems with Unicode file
- names.</p>
- </item>
- <tag>Transparent file naming</tag>
- <item>
- <p>Most Unix operating systems have adopted a simpler approach,
- namely that Unicode file naming is not enforced, but by
- convention. Those systems usually use UTF-8 encoding for Unicode
- file names, but do not enforce it. On such a system, a file name
- containing characters having code points between 128 and 255 may
- be named either as plain ISO-latin-1 or using UTF-8 encoding. As
- no consistency is enforced, the Erlang VM can do no consistent
- translation of all file names.</p>
-
- <p>By default on such systems, Erlang starts in <c>utf8</c> file
- name mode if the terminal supports UTF-8, otherwise in
- <c>latin1</c> mode.</p>
-
- <p>In the <c>latin1</c> mode, file names are bytewise endcoded.
- This allows for list representation of all file names in
- the system, but, for example, a file named "Östersund.txt", will
- appear in <c>file:list_dir/1</c> as either "Östersund.txt" (if
- the file name was encoded in bytewise ISO-Latin-1 by the program
- creating the file, or more probably as
- <c>[195,150,115,116,101,114,115,117,110,100]</c>, which is a
- list containing UTF-8 bytes - not what you would want... If you
- on the other hand use Unicode file name translation on such a
- system, non-UTF-8 file names will simply be ignored by functions
- like <c>file:list_dir/1</c>. They can be retrieved with
- <c>file:list_dir_all/1</c>, but wrongly encoded file names will
- appear as &quot;raw file names&quot;.</p>
-
- </item>
- </taglist>
-
- <p>The Unicode file naming support was introduced with Erlang/OTP
- R14B01. A VM operating in Unicode file name translation mode can
- work with files having names in any language or character set (as
- long as it is supported by the underlying OS and file system). The
- Unicode character list is used to denote file or directory names and
- if the file system content is listed, you will also get
- Unicode lists as return value. The support lies in the Kernel and
- STDLIB modules, why most applications (that does not explicitly
- require the file names to be in the ISO-latin-1 range) will benefit
- from the Unicode support without change.</p>
-
- <p>On operating systems with mandatory Unicode file names, this
- means that you more easily conform to the file names of other (non
- Erlang) applications, and you can also process file names that, at
- least on Windows, were completely inaccessible (due to having names
- that could not be represented in ISO-latin-1). Also you will avoid
- creating incomprehensible file names on MacOS X as the vfs layer of
- the OS will accept all your file names as UTF-8 and will not rewrite
- them.</p>
-
- <p>For most systems, turning on Unicode file name translation is no
- problem even if it uses transparent file naming. Very few systems
- have mixed file name encodings. A consistent UTF-8 named system will
- work perfectly in Unicode file name mode. It was still however
- considered experimental in Erlang/OTP R14B01 and is still not the default on
- such systems. Unicode file name translation is turned on with the
- <c>+fnu</c> switch to the On Linux, a VM started without explicitly
- stating the file name translation mode will default to <c>latin1</c>
- as the native file name encoding. On Windows and MacOS X, the
- default behavior is that of Unicode file name translation, why the
- <c>file:native_name_encoding/0</c> by default returns <c>utf8</c> on
- those systems (the fact that Windows actually does not use UTF-8 on
- the file system level can safely be ignored by the Erlang
- programmer). The default behavior can, as stated before, be
- changed using the <c>+fnu</c> or <c>+fnl</c> options to the VM, see
- the <seealso marker="erts:erl"><c>erl</c></seealso> program. If the
- VM is started in Unicode file name translation mode,
- <c>file:native_name_encoding/0</c> will return the atom
- <c>utf8</c>. The <c>+fnu</c> switch can be followed by <c>w</c>,
- <c>i</c> or <c>e</c>, to control how wrongly encoded file names are
- to be reported. <c>w</c> means that a warning is sent to the
- <c>error_logger</c> whenever a wrongly encoded file name is
- "skipped" in directory listings, <c>i</c> means that those wrongly
- encoded file names are silently ignored and <c>e</c> means that the
- API function will return an error whenever a wrongly encoded file
- (or directory) name is encountered. <c>w</c> is the default. Note
- that <c>file:read_link/1</c> will always return an error if the link
- points to an invalid file name.</p>
-
- <p>In Unicode file name mode, file names given to the BIF
- <c>open_port/2</c> with the option <c>{spawn_executable,...}</c> are
- also interpreted as Unicode. So is the parameter list given in the
- <c>args</c> option available when using <c>spawn_executable</c>. The
- UTF-8 translation of arguments can be avoided using binaries, see
- the discussion about raw file names below.</p>
-
- <p>It is worth noting that the file <c>encoding</c> options given
- when opening a file has nothing to do with the file <em>name</em>
- encoding convention. You can very well open files containing data
- encoded in UTF-8 but having file names in bytewise (<c>latin1</c>) encoding
- or vice versa.</p>
-
- <note><p>Erlang drivers and NIF shared objects still can not be
- named with names containing code points beyond 127. This is a known
- limitation to be removed in a future release. Erlang modules however
- can, but it is definitely not a good idea and is still considered
- experimental.</p></note>
-
-<section>
- <title>Notes About Raw File Names</title>
- <marker id="notes-about-raw-filenames"/>
- <p>Raw file names were introduced together with Unicode file name
- support in erts-5.8.2 (Erlang/OTP R14B01). The reason &quot;raw file
- names&quot; was introduced in the system was to be able to
- consistently represent file names given in different encodings on
- the same system. Having the VM automatically translate a file name
- that is not in UTF-8 to a list of Unicode characters might seem
- practical, but this would open up for both duplicate file names and
- other inconsistent behavior. Consider a directory containing a file
- named &quot;björn&quot; in ISO-latin-1, while the Erlang VM is
- operating in Unicode file name mode (and therefore expecting UTF-8
- file naming). The ISO-latin-1 name is not valid UTF-8 and one could
- be tempted to think that automatic conversion in for example
- <c>file:list_dir/1</c> is a good idea. But what would happen if we
- later tried to open the file and have the name as a Unicode list
- (magically converted from the ISO-latin-1 file name)? The VM will
- convert the file name given to UTF-8, as this is the encoding
- expected. Effectively this means trying to open the file named
- &lt;&lt;&quot;björn&quot;/utf8&gt;&gt;. This file does not exist,
- and even if it existed it would not be the same file as the one that
- was listed. We could even create two files named &quot;björn&quot;,
- one named in the UTF-8 encoding and one not. If
- <c>file:list_dir/1</c> would automatically convert the ISO-latin-1
- file name to a list, we would get two identical file names as the
- result. To avoid this, we need to differentiate between file names
- being properly encoded according to the Unicode file naming
- convention (i.e. UTF-8) and file names being invalid under the
- encoding. By the common <c>file:list_dir/1</c> function, the wrongly
- encoded file names are simply ignored in Unicode file name
- translation mode, but by the <c>file:list_dir_all/1</c> function,
- the file names with invalid encoding are returned as &quot;raw&quot;
- file names, i.e. as binaries.</p>
-
- <p>The Erlang <c>file</c> module accepts raw file names as
- input. <c>open_port({spawn_executable, ...} ...)</c> also accepts
- them. As mentioned earlier, the arguments given in the option list
- to <c>open_port({spawn_executable, ...} ...)</c> undergo the same
- conversion as the file names, meaning that the executable will be
- provided with arguments in UTF-8 as well. This translation is
- avoided consistently with how the file names are treated, by giving
- the argument as a binary.</p>
-
- <p>To force Unicode file name translation mode on systems where this
- is not the default was considered experimental in Erlang/OTP R14B01 due to
- the fact that the initial implementation did not ignore wrongly
- encoded file names, so that raw file names could spread unexpectedly
- throughout the system. Beginning with Erlang/OTP R16B, the wrongly encoded file
- names are only retrieved by special functions
- (e.g. <c>file:list_dir_all/1</c>), so the impact on existing code is
- much lower, why it is now supported. Unicode file name translation
- is expected to be default in future releases.</p>
-
- <p>Even if you are operating without Unicode file naming translation
- automatically done by the VM, you can access and create files with
- names in UTF-8 encoding by using raw file names encoded as
- UTF-8. Enforcing the UTF-8 encoding regardless of the mode the
- Erlang VM is started in might, in some circumstances be a good idea,
- as the convention of using UTF-8 file names is spreading.</p>
-</section>
-<section>
- <title>Notes About MacOS X</title>
- <p>MacOS X's vfs layer enforces UTF-8 file names in a quite
- aggressive way. Older versions did this by simply refusing to create
- non UTF-8 conforming file names, while newer versions replace
- offending bytes with the sequence &quot;%HH&quot;, where HH is the
- original character in hexadecimal notation. As Unicode translation
- is enabled by default on MacOS X, the only way to come up against
- this is to either start the VM with the <c>+fnl</c> flag or to use a
- raw file name in bytewise (<c>latin1</c>) encoding. If using a raw
- filename, with a bytewise encoding containing characters between 127
- and 255, to create a file, the file can not be opened using the same
- name as the one used to create it. There is no remedy for this
- behaviour, other than keeping the file names in the right
- encoding.</p>
-
- <p>MacOS X also reorganizes the names of files so that the
- representation of accents etc is using the "combining characters",
- i.e. the character <c>ö</c> is represented as the code points
- [111,776], where 111 is the character <c>o</c> and 776 is the
- special accent character "combining diaeresis". This way of
- normalizing Unicode is otherwise very seldom used and Erlang
- normalizes those file names in the opposite way upon retrieval, so
- that file names using combining accents are not passed up to the
- Erlang application. In Erlang the file name &quot;björn&quot; is
- retrieved as [98,106,246,114,110], not as [98,106,117,776,114,110],
- even though the file system might think differently. The
- normalization into combining accents are redone when actually
- accessing files, so this can usually be ignored by the Erlang
- programmer.</p>
-</section>
-</section>
-<section>
- <title>Unicode in Environment and Parameters</title>
- <marker id="unicode_in_environment_and_parameters"/>
- <p>Environment variables and their interpretation is handled much in
- the same way as file names. If Unicode file names are enabled,
- environment variables as well as parameters to the Erlang VM are
- expected to be in Unicode.</p>
- <p>If Unicode file names are enabled, the calls to
- <seealso marker="kernel:os#getenv/0"><c>os:getenv/0</c></seealso>,
- <seealso marker="kernel:os#getenv/1"><c>os:getenv/1</c></seealso>,
- <seealso marker="kernel:os#putenv/2"><c>os:putenv/2</c></seealso> and
- <seealso marker="kernel:os#unsetenv/1"><c>os:unsetenv/1</c></seealso>
- will handle Unicode strings. On Unix-like platforms, the built-in
- functions will translate environment variables in UTF-8 to/from
- Unicode strings, possibly with code points > 255. On Windows the
- Unicode versions of the environment system API will be used, also
- allowing for code points > 255.</p>
- <p>On Unix-like operating systems, parameters are expected to be
- UTF-8 without translation if Unicode file names are enabled.</p>
-</section>
-<section>
- <title>Unicode-aware Modules</title>
- <p>Most of the modules in Erlang/OTP are of course Unicode-unaware
- in the sense that they have no notion of Unicode and really should
- not have. Typically they handle non-textual or byte-oriented data
- (like <c>gen_tcp</c> etc).</p>
- <p>Modules that actually handle textual data (like <c>io_lib</c>,
- <c>string</c> etc) are sometimes subject to conversion or extension
- to be able to handle Unicode characters.</p>
- <p>Fortunately, most textual data has been stored in lists and range
- checking has been sparse, why modules like <c>string</c> works well
- for Unicode lists with little need for conversion or extension.</p>
- <p>Some modules are however changed to be explicitly
- Unicode-aware. These modules include:</p>
- <taglist>
- <tag><c>unicode</c></tag>
- <item>
- <p>The module <seealso marker="stdlib:unicode"><c>unicode</c></seealso>
- is obviously Unicode-aware. It contains functions for conversion
- between different Unicode formats as well as some utilities for
- identifying byte order marks. Few programs handling Unicode data
- will survive without this module.</p>
- </item>
- <tag><c>io</c></tag>
- <item>
- <p>The <seealso marker="stdlib:io"><c>io</c></seealso> module has been
- extended along with the actual I/O-protocol to handle Unicode
- data. This means that several functions require binaries to be
- in UTF-8 and there are modifiers to formatting control sequences
- to allow for outputting of Unicode strings.</p>
- </item>
- <tag><c>file</c>, <c>group</c>, <c>user</c></tag>
- <item>
- <p>I/O-servers throughout the system are able to handle
- Unicode data and has options for converting data upon actual
- output or input to/from the device. As shown earlier, the
- <seealso marker="stdlib:shell"><c>shell</c></seealso> has support for
- Unicode terminals and the <seealso
- marker="kernel:file"><c>file</c></seealso> module allows for
- translation to and from various Unicode formats on disk.</p>
- <p>The actual reading and writing of files with Unicode data is
- however not best done with the <c>file</c> module as its
- interface is byte oriented. A file opened with a Unicode
- encoding (like UTF-8), is then best read or written using the
- <seealso marker="stdlib:io"><c>io</c></seealso> module.</p>
- </item>
- <tag><c>re</c></tag>
- <item>
- <p>The <seealso marker="stdlib:re"><c>re</c></seealso> module allows
- for matching Unicode strings as a special option. As the library
- is actually centered on matching in binaries, the Unicode
- support is UTF-8-centered.</p>
- </item>
- <tag><c>wx</c></tag>
- <item>
- <p>The <seealso marker="wx:wx"><c>wx</c></seealso> graphical library
- has extensive support for Unicode text</p>
- </item>
- </taglist>
- <p>The module <seealso
- marker="stdlib:string"><c>string</c></seealso> works perfectly for
- Unicode strings as well as for ISO-latin-1 strings with the
- exception of the language-dependent <seealso
- marker="stdlib:string#to_upper/1"><c>to_upper</c></seealso> and
- <seealso marker="stdlib:string#to_lower/1"><c>to_lower</c></seealso>
- functions, which are only correct for the ISO-latin-1 character
- set. Actually they can never function correctly for Unicode
- characters in their current form, as there are language and locale
- issues as well as multi-character mappings to consider when
- converting text between cases. Converting case in an international
- environment is a big subject not yet addressed in OTP.</p>
-</section>
-<section>
- <title>Unicode Data in Files</title>
- <p>The fact that Erlang as such can handle Unicode data in many forms
- does not automatically mean that the content of any file can be
- Unicode text. The external entities such as ports or I/O-servers are
- not generally Unicode capable.</p>
- <p>Ports are always byte oriented, so before sending data that you
- are not sure is bytewise encoded to a port, make sure to encode it
- in a proper Unicode encoding. Sometimes this will mean that only
- part of the data shall be encoded as e.g. UTF-8, some parts may be
- binary data (like a length indicator) or something else that shall
- not undergo character encoding, so no automatic translation is
- present.</p>
- <p>I/O-servers behave a little differently. The I/O-servers connected
- to terminals (or stdout) can usually cope with Unicode data
- regardless of the <c>encoding</c> option. This is convenient when
- one expects a modern environment but do not want to crash when
- writing to a archaic terminal or pipe. Files on the other hand are
- more picky. A file can have an encoding option which makes it
- generally usable by the io-module (e.g. <c>{encoding,utf8}</c>), but
- is by default opened as a byte oriented file. The <seealso
- marker="kernel:file"><c>file</c></seealso> module is byte oriented, why only
- ISO-Latin-1 characters can be written using that module. The
- <seealso marker="stdlib:io"><c>io</c></seealso> module is the one to use if
- Unicode data is to be output to a file with other <c>encoding</c>
- than <c>latin1</c> (a.k.a. bytewise encoding). It is slightly
- confusing that a file opened with
- e.g. <c>file:open(Name,[read,{encoding,utf8}])</c>, cannot be
- properly read using <c>file:read(File,N)</c> but you have to use the
- <c>io</c> module to retrieve the Unicode data from it. The reason is
- that <c>file:read</c> and <c>file:write</c> (and friends) are purely
- byte oriented, and should so be, as that is the way to access
- files other than text files - byte by byte. Just as with ports, you
- can of course write encoded data into a file by "manually" converting
- the data to the encoding of choice (using the <seealso
- marker="stdlib:unicode"><c>unicode</c></seealso> module or the bit syntax)
- and then output it on a bytewise encoded (<c>latin1</c>) file.</p>
- <p>The rule of thumb is that the <seealso
- marker="kernel:file"><c>file</c></seealso> module should be used for files
- opened for bytewise access (<c>{encoding,latin1}</c>) and the
- <seealso marker="stdlib:io"><c>io</c></seealso> module should be used when
- accessing files with any other encoding
- (e.g. <c>{encoding,uf8}</c>).</p>
-
- <p>Functions reading Erlang syntax from files generally recognize
- the <c>coding:</c> comment and can therefore handle Unicode data on
- input. When writing Erlang Terms to a file, you should insert
- such comments when applicable:</p>
- <pre>
+ </section>
+
+ <section>
+ <title>Unicode Filenames</title>
+ <marker id="unicode_file_names"/>
+ <p>Most modern operating systems support Unicode filenames in some way.
+ There are many different ways to do this and Erlang by default treats the
+ different approaches differently:</p>
+
+ <taglist>
+ <tag>Mandatory Unicode file naming</tag>
+ <item>
+ <p>Windows and, for most common uses, MacOS X enforce Unicode support
+ for filenames. All files created in the file system have names that
+ can consistently be interpreted. In MacOS X, all filenames are
+ retrieved in UTF-8 encoding. In Windows, each system call handling
+ filenames has a special Unicode-aware variant, giving much the same
+ effect. There are no filenames on these systems that are not Unicode
+ filenames. So, the default behavior of the Erlang VM is to work in
+ &quot;Unicode filename translation mode&quot;. This means that a
+ filename can be specified as a Unicode list, which is automatically
+ translated to the proper name encoding for the underlying operating
+ system and file system.</p>
+ <p>Doing, for example, a
+ <seealso marker="kernel:file#list_dir/1"><c>file:list_dir/1</c></seealso>
+ on one of these systems can return Unicode lists with code points
+ &gt; 255, depending on the content of the file system.</p>
+ </item>
+ <tag>Transparent file naming</tag>
+ <item>
+ <p>Most Unix operating systems have adopted a simpler approach, namely
+ that Unicode file naming is not enforced, but by convention. Those
+ systems usually use UTF-8 encoding for Unicode filenames, but do not
+ enforce it. On such a system, a filename containing characters with
+ code points from 128 through 255 can be named as plain ISO Latin-1 or
+ use UTF-8 encoding. As no consistency is enforced, the Erlang VM
+ cannot do consistent translation of all filenames.</p>
+ <p>By default on such systems, Erlang starts in <c>utf8</c> filename
+ mode if the terminal supports UTF-8, otherwise in <c>latin1</c>
+ mode.</p>
+ <p>In <c>latin1</c> mode, filenames are bytewise encoded. This allows
+          for list representation of all filenames in the system. However, a
+          file named "Östersund.txt" appears in
+ <seealso marker="kernel:file#list_dir/1"><c>file:list_dir/1</c></seealso>
+ either as "Östersund.txt" (if the filename was encoded in bytewise
+ ISO Latin-1 by the program creating the file) or more probably as
+ <c>[195,150,115,116,101,114,115,117,110,100]</c>, which is a list
+ containing UTF-8 bytes (not what you want). If you use Unicode
+ filename translation on such a system, non-UTF-8 filenames are
+ ignored by functions like <c>file:list_dir/1</c>. They can be
+ retrieved with function
+ <seealso marker="kernel:file#list_dir_all/1"><c>file:list_dir_all/1</c></seealso>,
+ but wrongly encoded filenames appear as &quot;raw filenames&quot;.
+ </p>
+ </item>
+ </taglist>
+
+ <p>The Unicode file naming support was introduced in Erlang/OTP
+ R14B01. A VM operating in Unicode filename translation mode can
+ work with files having names in any language or character set (as
+ long as it is supported by the underlying operating system and
+ file system). The Unicode character list is used to denote
+ filenames or directory names. If the file system content is
+ listed, you also get Unicode lists as return value. The support
+ lies in the <c>Kernel</c> and <c>STDLIB</c> modules, which is why
+      most applications (that do not explicitly require the filenames
+ to be in the ISO Latin-1 range) benefit from the Unicode support
+ without change.</p>
+
+ <p>On operating systems with mandatory Unicode filenames, this means that
+ you more easily conform to the filenames of other (non-Erlang)
+ applications. You can also process filenames that, at least on Windows,
+ were inaccessible (because of having names that could not be represented
+ in ISO Latin-1). Also, you avoid creating incomprehensible filenames
+      on MacOS X, as the <c>vfs</c> layer of the operating system accepts all
+      your filenames as UTF-8 and does not rewrite them.</p>
+
+ <p>For most systems, turning on Unicode filename translation is no problem
+ even if it uses transparent file naming. Very few systems have mixed
+ filename encodings. A consistent UTF-8 named system works perfectly in
+ Unicode filename mode. It was still, however, considered experimental in
+ Erlang/OTP R14B01 and is still not the default on such systems.</p>
+
+ <p>Unicode filename translation is turned on with switch <c>+fnu</c>. On
+ Linux, a VM started without explicitly stating the filename translation
+ mode defaults to <c>latin1</c> as the native filename encoding. On
+ Windows and MacOS X, the default behavior is that of Unicode filename
+ translation. Therefore
+ <seealso marker="kernel:file#native_name_encoding/0"><c>file:native_name_encoding/0</c></seealso>
+ by default returns <c>utf8</c> on those systems (Windows does not use
+ UTF-8 on the file system level, but this can safely be ignored by the
+ Erlang programmer). The default behavior can, as stated earlier, be
+ changed using option <c>+fnu</c> or <c>+fnl</c> to the VM, see the
+ <seealso marker="erts:erl"><c>erl</c></seealso> program. If the VM is
+ started in Unicode filename translation mode,
+ <c>file:native_name_encoding/0</c> returns atom <c>utf8</c>. Switch
+ <c>+fnu</c> can be followed by <c>w</c>, <c>i</c>, or <c>e</c> to control
+ how wrongly encoded filenames are to be reported.</p>
+
+ <list type="bulleted">
+ <item>
+ <p><c>w</c> means that a warning is sent to the <c>error_logger</c>
+ whenever a wrongly encoded filename is "skipped" in directory
+ listings. <c>w</c> is the default.</p>
+ </item>
+ <item>
+ <p><c>i</c> means that wrongly encoded filenames are silently ignored.
+ </p>
+ </item>
+ <item>
+ <p><c>e</c> means that the API function returns an error whenever a
+ wrongly encoded filename (or directory name) is encountered.</p>
+ </item>
+ </list>
+
+ <p>Notice that
+ <seealso marker="kernel:file#read_link/1"><c>file:read_link/1</c></seealso>
+ always returns an error if the link points to an invalid filename.</p>
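+
+    <p>To check which filename translation mode your VM ended up in, you can
+      call <c>file:native_name_encoding/0</c> from the shell. The following is
+      a minimal sketch; the return value depends on the operating system and on
+      how the VM was started:</p>
+
+    <pre>
+1> <input>file:native_name_encoding().</input>
+utf8</pre>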
+
+ <p>In Unicode filename mode, filenames given to BIF <c>open_port/2</c> with
+ option <c>{spawn_executable,...}</c> are also interpreted as Unicode. So
+ is the parameter list specified in option <c>args</c> available when
+ using <c>spawn_executable</c>. The UTF-8 translation of arguments can be
+ avoided using binaries, see section
+ <seealso marker="#notes-about-raw-filenames">Notes About Raw Filenames</seealso>.
+ </p>
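+
+    <p>The following is a minimal sketch of this, assuming a Unix-like system
+      where the executable <c>/bin/echo</c> exists; the path and the arguments
+      are chosen only for illustration. The first argument is specified as a
+      Unicode list and is translated by the VM, while the second is specified
+      as a UTF-8 binary and is passed on unchanged:</p>
+
+    <code>
+start_echo_port() -&gt;
+    %% In Unicode filename mode, the executable name and arguments given
+    %% as lists are translated to the native (UTF-8) encoding by the VM.
+    %% An argument given as a binary is passed to the program as it is.
+    open_port({spawn_executable, "/bin/echo"},
+              [{args, ["Юникод", &lt;&lt;"Юникод"/utf8&gt;&gt;]},
+               exit_status]).</code>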
+
+    <p>Notice that the file encoding options specified when opening a file have
+      nothing to do with the filename encoding convention. You can very well
+      open files containing data encoded in UTF-8, but having filenames in
+      bytewise (<c>latin1</c>) encoding, or vice versa.</p>
+
+    <note><p>Erlang drivers and NIF shared objects still cannot be named with
+      names containing code points &gt; 127. This limitation will be removed in
+      a future release. Erlang modules, however, can have such names, but it is
+      definitely not a good idea and is still considered experimental.</p>
+ </note>
+
+ <section>
+ <title>Notes About Raw Filenames</title>
+ <marker id="notes-about-raw-filenames"/>
+ <p>Raw filenames were introduced together with Unicode filename support
+ in <c>ERTS</c> 5.8.2 (Erlang/OTP R14B01). The reason &quot;raw
+ filenames&quot; were introduced in the system was
+ to be able to represent
+ filenames, specified in different encodings on the same system,
+ consistently. It can seem practical to have the VM automatically
+ translate a filename that is not in UTF-8 to a list of Unicode
+ characters, but this would open up for both duplicate filenames and
+ other inconsistent behavior.</p>
+
+ <p>Consider a directory containing a file named &quot;björn&quot; in ISO
+ Latin-1, while the Erlang VM is operating in Unicode filename mode (and
+ therefore expects UTF-8 file naming). The ISO Latin-1 name is not valid
+ UTF-8 and one can be tempted to think that automatic conversion in, for
+ example,
+ <seealso marker="kernel:file#list_dir/1"><c>file:list_dir/1</c></seealso>
+ is a good idea. But what would happen if we later tried to open the file
+ and have the name as a Unicode list (magically converted from the ISO
+ Latin-1 filename)? The VM converts the filename to UTF-8, as this is
+ the encoding expected. Effectively this means trying to open the file
+ named &lt;&lt;&quot;björn&quot;/utf8&gt;&gt;. This file does not exist,
+ and even if it existed it would not be the same file as the one that was
+ listed. We could even create two files named &quot;björn&quot;, one
+ named in UTF-8 encoding and one not. If <c>file:list_dir/1</c> would
+ automatically convert the ISO Latin-1 filename to a list, we would get
+ two identical filenames as the result. To avoid this, we must
+ differentiate between filenames that are properly encoded according to
+ the Unicode file naming convention (that is, UTF-8) and filenames that
+ are invalid under the encoding. By the common function
+ <c>file:list_dir/1</c>, the wrongly encoded filenames are ignored in
+ Unicode filename translation mode, but by function
+ <seealso marker="kernel:file#list_dir_all/1"><c>file:list_dir_all/1</c></seealso>
+ the filenames with invalid encoding are returned as &quot;raw&quot;
+ filenames, that is, as binaries.</p>
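+
+      <p>As a sketch, consider a directory containing both a correctly
+        UTF-8-encoded file named &quot;björn&quot; and the wrongly encoded
+        (bytewise ISO Latin-1) file discussed above, with the VM running in
+        Unicode filename translation mode. The directory name is hypothetical
+        and how the shell prints the raw binary depends on the <c>+pc</c>
+        setting:</p>
+
+      <pre>
+1> <input>file:list_dir("test_dir").</input>
+{ok,["björn"]}
+2> <input>file:list_dir_all("test_dir").</input>
+{ok,[&lt;&lt;98,106,246,114,110&gt;&gt;,"björn"]}</pre>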
+
+ <p>The <c>file</c> module accepts raw filenames as input.
+ <c>open_port({spawn_executable, ...} ...)</c> also accepts them. As
+ mentioned earlier, the arguments specified in the option list to
+ <c>open_port({spawn_executable, ...} ...)</c> undergo the same
+ conversion as the filenames, meaning that the executable is provided
+ with arguments in UTF-8 as well. This translation is avoided
+ consistently with how the filenames are treated, by giving the argument
+ as a binary.</p>
+
+      <p>Forcing Unicode filename translation mode on systems where this is not
+        the default was considered experimental in Erlang/OTP R14B01. This was
+        because the initial implementation did not ignore wrongly encoded
+        filenames, so that raw filenames could spread unexpectedly throughout
+        the system. As from Erlang/OTP R16B, the wrongly encoded filenames are
+        only retrieved by special functions (such as
+        <c>file:list_dir_all/1</c>). As the impact on existing code is
+        therefore much lower, it is now supported. Unicode filename translation
+        is expected to be the default in future releases.</p>
+
+ <p>Even if you are operating without Unicode file naming translation
+ automatically done by the VM, you can access and create files with
+ names in UTF-8 encoding by using raw filenames encoded as UTF-8.
+ Enforcing the UTF-8 encoding regardless of the mode the Erlang VM is
+ started in can in some circumstances be a good idea, as the convention
+ of using UTF-8 filenames is spreading.</p>
+ </section>
+
+ <section>
+ <title>Notes About MacOS X</title>
+ <p>The <c>vfs</c> layer of MacOS X enforces UTF-8 filenames in an
+ aggressive way. Older versions did this by refusing to create non-UTF-8
+ conforming filenames, while newer versions replace offending bytes with
+ the sequence &quot;%HH&quot;, where HH is the original character in
+ hexadecimal notation. As Unicode translation is enabled by default on
+ MacOS X, the only way to come up against this is to either start the VM
+ with flag <c>+fnl</c> or to use a raw filename in bytewise
+ (<c>latin1</c>) encoding. If using a raw filename, with a bytewise
+ encoding containing characters from 127 through 255, to create a file,
+ the file cannot be opened using the same name as the one used to create
+ it. There is no remedy for this behavior, except keeping the filenames
+ in the correct encoding.</p>
+
+ <p>MacOS X reorganizes the filenames so that the representation of
+ accents, and so on, uses the "combining characters". For example,
+ character <c>ö</c> is represented as code points <c>[111,776]</c>,
+ where <c>111</c> is character <c>o</c> and <c>776</c> is the special
+ accent character "Combining Diaeresis". This way of normalizing Unicode
+ is otherwise very seldom used. Erlang normalizes those filenames in the
+ opposite way upon retrieval, so that filenames using combining accents
+ are not passed up to the Erlang application. In Erlang, filename
+ &quot;björn&quot; is retrieved as <c>[98,106,246,114,110]</c>, not as
+ <c>[98,106,117,776,114,110]</c>, although the file system can think
+ differently. The normalization into combining accents is redone when
+ accessing files, so this can usually be ignored by the Erlang
+ programmer.</p>
+ </section>
+ </section>
+
+ <section>
+ <title>Unicode in Environment and Parameters</title>
+ <marker id="unicode_in_environment_and_parameters"/>
+ <p>Environment variables and their interpretation are handled much in the
+ same way as filenames. If Unicode filenames are enabled, environment
+ variables as well as parameters to the Erlang VM are expected to be in
+ Unicode.</p>
+
+ <p>If Unicode filenames are enabled, the calls to
+ <seealso marker="kernel:os#getenv/0"><c>os:getenv/0,1</c></seealso>,
+ <seealso marker="kernel:os#putenv/2"><c>os:putenv/2</c></seealso>, and
+ <seealso marker="kernel:os#unsetenv/1"><c>os:unsetenv/1</c></seealso>
+ handle Unicode strings. On Unix-like platforms, the built-in functions
+ translate environment variables in UTF-8 to/from Unicode strings, possibly
+ with code points &gt; 255. On Windows, the Unicode versions of the
+ environment system API are used, and code points &gt; 255 are allowed.</p>
+ <p>On Unix-like operating systems, parameters are expected to be UTF-8
+ without translation if Unicode filenames are enabled.</p>
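+
+    <p>The following is a small sketch of this, run in a shell started with
+      <c>+pc unicode</c> on a system where Unicode filenames (and thereby
+      Unicode environment variables) are enabled; the variable name and value
+      are chosen only for illustration:</p>
+
+    <pre>
+1> <input>os:putenv("GREETING", "Юникод").</input>
+true
+2> <input>os:getenv("GREETING").</input>
+"Юникод"</pre>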
+ </section>
+
+ <section>
+ <title>Unicode-Aware Modules</title>
+ <p>Most of the modules in Erlang/OTP are Unicode-unaware in the sense that
+ they have no notion of Unicode and should not have. Typically they handle
+ non-textual or byte-oriented data (such as <c>gen_tcp</c>).</p>
+
+ <p>Modules handling textual data (such as
+ <seealso marker="stdlib:io_lib"><c>io_lib</c></seealso> and
+      <seealso marker="stdlib:string"><c>string</c></seealso>) are sometimes
+ subject to conversion or extension to be able to handle Unicode
+ characters.</p>
+
+ <p>Fortunately, most textual data has been stored in lists and range
+ checking has been sparse, so modules like <c>string</c> work well for
+ Unicode lists with little need for conversion or extension.</p>
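+
+    <p>As an illustration, here is a small sketch of two existing <c>string</c>
+      functions applied to a Unicode list. The shell is assumed to be started
+      with <c>+pc unicode</c> so that the result is printed as a string:</p>
+
+    <pre>
+1> <input>string:tokens("Юникод текст", " ").</input>
+["Юникод","текст"]
+2> <input>string:len("Юникод").</input>
+6</pre>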
+
+ <p>Some modules are, however, changed to be explicitly Unicode-aware. These
+ modules include:</p>
+
+ <taglist>
+ <tag><c>unicode</c></tag>
+ <item>
+ <p>The <seealso marker="stdlib:unicode"><c>unicode</c></seealso>
+ module is clearly Unicode-aware. It contains functions for conversion
+ between different Unicode formats and some utilities for identifying
+ byte order marks. Few programs handling Unicode data survive without
+ this module.</p>
+ </item>
+ <tag><c>io</c></tag>
+ <item>
+ <p>The <seealso marker="stdlib:io"><c>io</c></seealso> module has been
+ extended along with the actual I/O protocol to handle Unicode data.
+ This means that many functions require binaries to be in UTF-8, and
+ there are modifiers to format control sequences to allow for output
+ of Unicode strings.</p>
+ </item>
+ <tag><c>file</c>, <c>group</c>, <c>user</c></tag>
+ <item>
+ <p>I/O-servers throughout the system can handle Unicode data and have
+ options for converting data upon output or input to/from the device.
+ As shown earlier, the
+ <seealso marker="stdlib:shell"><c>shell</c></seealso> module has
+ support for Unicode terminals and the
+ <seealso marker="kernel:file"><c>file</c></seealso> module
+ allows for translation to and from various Unicode formats on
+ disk.</p>
+ <p>Reading and writing of files with Unicode data is, however, not best
+ done with the <c>file</c> module, as its interface is
+ byte-oriented. A file opened with a Unicode encoding (like UTF-8) is
+ best read or written using the
+ <seealso marker="stdlib:io"><c>io</c></seealso> module.</p>
+ </item>
+ <tag><c>re</c></tag>
+ <item>
+ <p>The <seealso marker="stdlib:re"><c>re</c></seealso> module allows
+ for matching Unicode strings as a special option. As the library is
+ centered on matching in binaries, the Unicode support is
+ UTF-8-centered.</p>
+ </item>
+ <tag><c>wx</c></tag>
+ <item>
+ <p>The graphical library <seealso marker="wx:wx"><c>wx</c></seealso>
+ has extensive support for Unicode text.</p></item>
+ </taglist>
+
+ <p>The <seealso marker="stdlib:string"><c>string</c></seealso> module works
+ perfectly for Unicode strings and ISO Latin-1 strings, except the
+ language-dependent functions
+ <seealso marker="stdlib:string#to_upper/1"><c>string:to_upper/1</c></seealso>
+ and
+ <seealso marker="stdlib:string#to_lower/1"><c>string:to_lower/1</c></seealso>,
+ which are only correct for the ISO Latin-1 character set. These two
+ functions can never function correctly for Unicode characters in their
+ current form, as there are language and locale issues as well as
+ multi-character mappings to consider when converting text between cases.
+ Converting case in an international environment is a large subject not
+ yet addressed in OTP.</p>
+ </section>
+
+ <section>
+ <title>Unicode Data in Files</title>
+    <p>The fact that Erlang can handle Unicode data in many forms does not
+      automatically mean that the content of any file is Unicode text. The
+      external entities, such as ports and I/O servers, are not generally
+      Unicode capable.</p>
+
+ <p>Ports are always byte-oriented, so before sending data that you are not
+      sure is bytewise-encoded to a port, be sure to encode it in a proper
+ Unicode encoding. Sometimes this means that only part of the data must
+ be encoded as, for example, UTF-8. Some parts can be binary data (like a
+ length indicator) or something else that must not undergo character
+ encoding, so no automatic translation is present.</p>
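+
+    <p>The following sketch shows one way to do such a manual encoding before
+      sending text to a port. The port and the framing are hypothetical; the
+      point is only that the text part is converted explicitly with the
+      <seealso marker="stdlib:unicode"><c>unicode</c></seealso> module while
+      the length indicator is left as plain binary data:</p>
+
+    <code>
+send_text(Port, Text) -&gt;
+    %% Encode the text part explicitly as UTF-8...
+    Utf8 = unicode:characters_to_binary(Text),
+    %% ...and prepend a 4-byte length indicator, which is plain binary
+    %% data and must not undergo any character encoding.
+    port_command(Port, &lt;&lt;(byte_size(Utf8)):32, Utf8/binary&gt;&gt;).</code>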
+
+ <p>I/O servers behave a little differently. The I/O servers connected to
+ terminals (or <c>stdout</c>) can usually cope with Unicode data
+      regardless of the <c>encoding</c> option. This is convenient when one
+      expects a modern environment but does not want to crash when writing to
+      an archaic terminal or pipe.</p>
+
+ <p>A file can have an encoding option that makes it generally usable by the
+ <seealso marker="stdlib:io"><c>io</c></seealso> module (for example
+ <c>{encoding,utf8}</c>), but is by default opened as a byte-oriented file.
+ The <seealso marker="kernel:file"><c>file</c></seealso> module is
+ byte-oriented, so only ISO Latin-1 characters can be written using that
+ module. Use the <c>io</c> module if Unicode data is to be output to a
+ file with other <c>encoding</c> than <c>latin1</c> (bytewise encoding).
+ It is slightly confusing that a file opened with, for example,
+      <c>file:open(Name,[read,{encoding,utf8}])</c> cannot be properly read
+      using <c>file:read(File,N)</c>; the <c>io</c> module must be used to
+      retrieve the Unicode data from it. The reason is that <c>file:read</c> and
+ <c>file:write</c> (and friends) are purely byte-oriented, and should be,
+ as that is the way to access files other than text files, byte by byte.
+ As with ports, you can write encoded data into a file by "manually"
+ converting the data to the encoding of choice (using the
+ <seealso marker="stdlib:unicode"><c>unicode</c></seealso> module or the
+ bit syntax) and then output it on a bytewise (<c>latin1</c>) encoded
+ file.</p>
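+
+    <p>The following sketch shows the recommended division of work when writing
+      and reading a UTF-8 encoded text file: the <c>file</c> module opens and
+      closes the file, while the <c>io</c> module reads and writes the Unicode
+      data. The filename and the contents are only examples:</p>
+
+    <code>
+write_and_read_utf8() -&gt;
+    %% Open the file with a Unicode encoding and write through io.
+    {ok,F} = file:open("unicode.txt", [write, {encoding,utf8}]),
+    ok = io:put_chars(F, "Юникод\n"),
+    ok = file:close(F),
+    %% Reopen it for reading; io:get_line/2 returns a list of code points.
+    {ok,F2} = file:open("unicode.txt", [read, {encoding,utf8}]),
+    Line = io:get_line(F2, ''),
+    ok = file:close(F2),
+    Line.</code>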
+
+ <p>Recommendations:</p>
+
+ <list type="bulleted">
+ <item><p>Use the
+ <seealso marker="kernel:file"><c>file</c></seealso> module for
+ files opened for bytewise access (<c>{encoding,latin1}</c>).</p>
+ </item>
+ <item><p>Use the <seealso marker="stdlib:io"><c>io</c></seealso> module
+ when accessing files with any other encoding (for example
+        <c>{encoding,utf8}</c>).</p>
+ </item>
+ </list>
+
+ <p>Functions reading Erlang syntax from files recognize the <c>coding:</c>
+ comment and can therefore handle Unicode data on input. When writing
+ Erlang terms to a file, you are advised to insert such comments when
+ applicable:</p>
+
+ <pre>
$ <input>erl +fna +pc unicode</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -990,202 +1100,224 @@ Eshell V5.10.1 (abort with ^G)
1> <input>file:write_file("test.term",&lt;&lt;"%% coding: utf-8\n[{\"Юникод\",4711}].\n"/utf8&gt;&gt;).</input>
ok
2> <input>file:consult("test.term").</input>
-{ok,[[{"Юникод",4711}]]}
- </pre>
-</section>
-<section>
- <title>Summary of Options</title>
- <marker id="unicode_options_summary"/>
- <p>The Unicode support is controlled by both command line switches,
- some standard environment variables and the version of OTP you are
- using. Most options affect mainly the way Unicode data is displayed,
- not the actual functionality of the API's in the standard
- libraries. This means that Erlang programs usually do not
- need to concern themselves with these options, they are more for the
- development environment. An Erlang program can be written so that it
- works well regardless of the type of system or the Unicode options
- that are in effect.</p>
-
- <p>Here follows a summary of the settings affecting Unicode:</p>
- <taglist>
- <tag>The <c>LANG</c> and <c>LC_CTYPE</c> environment variables</tag>
- <item>
- <p>The language setting in the OS mainly affects the shell. The
- terminal (i.e. the group leader) will operate with <c>{encoding,
- unicode}</c> only if the environment tells it that UTF-8 is
- allowed. This setting should correspond to the actual terminal
- you are using.</p>
- <p>The environment can also affect file name interpretation, if
- Erlang is started with the <c>+fna</c> flag (which is default from
- Erlang/OTP 17.0).</p>
- <p>You can check the setting of this by calling
- <c>io:getopts()</c>, which will give you an option list
- containing <c>{encoding,unicode}</c> or
- <c>{encoding,latin1}</c>.</p>
- </item>
- <tag>The <c>+pc </c>{<c>unicode</c>|<c>latin1</c>} flag to
- <seealso marker="erts:erl"><c>erl(1)</c></seealso></tag>
- <item>
- <p>This flag affects what is interpreted as string data when
- doing heuristic string detection in the shell and in
- <c>io</c>/<c>io_lib:format</c> with the <c>"~tp"</c> and
- <c>~tP</c> formatting instructions, as described above.</p>
- <p>You can check this option by calling io:printable_range/0,
- which will return <c>unicode</c> or <c>latin1</c>. To be
- compatible with future (expected) extensions to the settings,
- one should rather use <c>io_lib:printable_list/1</c> to check if
- a list is printable according to the setting. That function will
- take into account new possible settings returned from
- <c>io:printable_range/0</c>.</p>
- </item>
- <tag>The <c>+fn</c>{<c>l</c>|<c>a</c>|<c>u</c>}
- [{<c>w</c>|<c>i</c>|<c>e</c>}]
- flag to <seealso marker="erts:erl"><c>erl(1)</c></seealso></tag>
- <item>
- <p>This flag affects how the file names are to be interpreted. On
- operating systems with transparent file naming, this has to be
- specified to allow for file naming in Unicode characters (and
- for correct interpretation of file names containing characters
- &gt; 255.</p>
- <p><c>+fnl</c> means bytewise interpretation of file names, which
- was the usual way to represent ISO-Latin-1 file names before
- UTF-8 file naming got widespread.</p>
- <p><c>+fnu</c> means that file names are encoded in UTF-8, which
- is nowadays the common scheme (although not enforced).</p>
- <p><c>+fna</c> means that you automatically select between
- <c>+fnl</c> and <c>+fnu</c>, based on the <c>LANG</c> and
- <c>LC_CTYPE</c> environment variables. This is optimistic
- heuristics indeed, nothing enforces a user to have a terminal
- with the same encoding as the file system, but usually, this is
- the case. This is the default on all Unix-like operating
- systems except MacOS X.</p>
-
- <p>The file name translation mode can be read with the
- <c>file:native_name_encoding/0</c> function, which returns
- <c>latin1</c> (meaning bytewise encoding) or <c>utf8</c>.</p>
- </item>
- <tag><seealso marker="stdlib:epp#default_encoding/0">
- <c>epp:default_encoding/0</c></seealso></tag>
- <item>
- <p>This function returns the default encoding for Erlang source
- files (if no encoding comment is present) in the currently
- running release. In Erlang/OTP R16B <c>latin1</c> was returned (meaning
- bytewise encoding). In Erlang/OTP 17.0 and forward it returns
- <c>utf8</c>.</p>
- <p>The encoding of each file can be specified using comments as
- described in
- <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso>.</p>
- </item>
- <tag><seealso marker="stdlib:io#setopts/1"><c>io:setopts/</c>{<c>1</c>,<c>2</c>}</seealso> and the <c>-oldshell</c>/<c>-noshell</c> flags.</tag>
- <item>
- <p>When Erlang is started with <c>-oldshell</c> or
- <c>-noshell</c>, the I/O-server for <c>standard_io</c> is default
- set to bytewise encoding, while an interactive shell defaults to
- what the environment variables says.</p>
- <p>With the <c>io:setopts/2</c> function you can set the
- encoding of a file or other I/O-server. This can also be set when
- opening a file. Setting the terminal (or other
- <c>standard_io</c> server) unconditionally to the option
- <c>{encoding,utf8}</c> will for example make UTF-8 encoded characters
- being written to the device regardless of how Erlang was started or
- the users environment.</p>
- <p>Opening files with <c>encoding</c> option is convenient when
- writing or reading text files in a known encoding.</p>
- <p>You can retrieve the <c>encoding</c> setting for an I/O-server
- using <seealso
- marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso>.</p>
- </item>
- </taglist>
-</section>
-<section>
- <title>Recipes</title>
- <p>When starting with Unicode, one often stumbles over some common
- issues. I try to outline some methods of dealing with Unicode data
- in this section.</p>
+{ok,[[{"Юникод",4711}]]}</pre>
+ </section>
+
+ <section>
+ <title>Summary of Options</title>
+ <marker id="unicode_options_summary"/>
+    <p>The Unicode support is controlled by command-line switches, some
+      standard environment variables, and the OTP version you are using. Most
+      options affect mainly how Unicode data is displayed, not the
+      functionality of the APIs in the standard libraries. This means that
+      Erlang programs usually do not need to concern themselves with these
+      options; they are more for the development environment. An Erlang program
+      can be written so that it works well regardless of the type of system or
+      the Unicode options that are in effect.</p>
+
+ <p>Here follows a summary of the settings affecting Unicode:</p>
+
+ <taglist>
+ <tag>The <c>LANG</c> and <c>LC_CTYPE</c> environment variables</tag>
+ <item>
+ <p>The language setting in the operating system mainly affects the
+ shell. The terminal (that is, the group leader) operates with
+ <c>{encoding, unicode}</c> only if the environment tells it that
+ UTF-8 is allowed. This setting is to correspond to the terminal you
+ are using.</p>
+ <p>The environment can also affect filename interpretation, if Erlang
+ is started with flag <c>+fna</c> (which is default from
+ Erlang/OTP 17.0).</p>
+ <p>You can check the setting of this by calling
+ <seealso marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso>,
+ which gives you an option list containing <c>{encoding,unicode}</c>
+ or <c>{encoding,latin1}</c>.</p>
+ </item>
+ <tag>The <c>+pc</c> {<c>unicode</c>|<c>latin1</c>} flag to
+ <seealso marker="erts:erl"><c>erl(1)</c></seealso></tag>
+ <item>
+ <p>This flag affects what is interpreted as string data when doing
+ heuristic string detection in the shell and in
+ <seealso marker="stdlib:io"><c>io</c></seealso>/
+ <seealso marker="stdlib:io_lib#format/2"><c>io_lib:format</c></seealso>
+          with the <c>~tp</c> and <c>~tP</c> formatting instructions, as
+ described earlier.</p>
+ <p>You can check this option by calling
+ <seealso marker="stdlib:io#printable_range/0"><c>io:printable_range/0</c></seealso>,
+ which returns <c>unicode</c> or <c>latin1</c>. To be compatible with
+ future (expected) extensions to the settings, rather use
+ <seealso marker="stdlib:io_lib#printable_list/1"><c>io_lib:printable_list/1</c></seealso>
+ to check if a list is printable according to the setting. That
+ function takes into account new possible settings returned from
+ <c>io:printable_range/0</c>.</p>
+ </item>
+ <tag>The <c>+fn</c>{<c>l</c>|<c>u</c>|<c>a</c>}
+ [{<c>w</c>|<c>i</c>|<c>e</c>}] flag to
+ <seealso marker="erts:erl"><c>erl(1)</c></seealso></tag>
+ <item>
+ <p>This flag affects how the filenames are to be interpreted. On
+ operating systems with transparent file naming, this must be
+ specified to allow for file naming in Unicode characters (and for
+ correct interpretation of filenames containing characters &gt; 255).
+ </p>
+ <list type="bulleted">
+ <item>
+ <p><c>+fnl</c> means bytewise interpretation of filenames, which was
+ the usual way to represent ISO Latin-1 filenames before UTF-8
+ file naming got widespread.</p>
+ </item>
+ <item>
+ <p><c>+fnu</c> means that filenames are encoded in UTF-8, which is
+ nowadays the common scheme (although not enforced).</p>
+ </item>
+ <item>
+ <p><c>+fna</c> means that you automatically select between
+ <c>+fnl</c> and <c>+fnu</c>, based on environment variables
+            <c>LANG</c> and <c>LC_CTYPE</c>. This is indeed an optimistic
+            heuristic; nothing forces a user to have a terminal with
+            the same encoding as the file system, but this is usually the
+            case. This is the default on all Unix-like operating systems,
+ except MacOS X.</p>
+ </item>
+ </list>
+ <p>The filename translation mode can be read with function
+ <seealso marker="kernel:file#native_name_encoding/0"><c>file:native_name_encoding/0</c></seealso>,
+ which returns <c>latin1</c> (bytewise encoding) or <c>utf8</c>.</p>
+ </item>
+ <tag><seealso marker="stdlib:epp#default_encoding/0"><c>epp:default_encoding/0</c></seealso></tag>
+ <item>
+ <p>This function returns the default encoding for Erlang source files
+ (if no encoding comment is present) in the currently running release.
+ In Erlang/OTP R16B, <c>latin1</c> (bytewise encoding) was returned.
+ As from Erlang/OTP 17.0, <c>utf8</c> is returned.</p>
+ <p>The encoding of each file can be specified using comments as
+ described in the
+ <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso> module.
+ </p>
+ </item>
+ <tag><seealso marker="stdlib:io#setopts/1"><c>io:setopts/1,2</c></seealso>
+ and flags <c>-oldshell</c>/<c>-noshell</c></tag>
+ <item>
+ <p>When Erlang is started with <c>-oldshell</c> or <c>-noshell</c>, the
+ I/O server for <c>standard_io</c> is by default set to bytewise
+ encoding, while an interactive shell defaults to what the
+          environment variables say.</p>
+ <p>You can set the encoding of a file or other I/O server with function
+ <seealso marker="stdlib:io#setopts/1"><c>io:setopts/2</c></seealso>.
+ This can also be set when opening a file. Setting the terminal (or
+ other <c>standard_io</c> server) unconditionally to option
+ <c>{encoding,utf8}</c> implies that UTF-8 encoded characters are
+ written to the device, regardless of how Erlang was started or the
+ user's environment.</p>
+ <p>Opening files with option <c>encoding</c> is convenient when
+ writing or reading text files in a known encoding.</p>
+        <p>You can retrieve the <c>encoding</c> setting for an I/O server with
+          function
+          <seealso marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso>,
+          as shown in the example after this list.</p>
+ </item>
+ </taglist>
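+
+    <p>The following shell sketch shows how the encoding setting of
+      <c>standard_io</c> can be inspected with <c>io:getopts()</c> and changed
+      with <c>io:setopts/2</c>. The values shown depend on how the VM was
+      started and on the environment, so they are only examples:</p>
+
+    <pre>
+1> <input>lists:keyfind(encoding, 1, io:getopts()).</input>
+{encoding,unicode}
+2> <input>io:setopts(standard_io, [{encoding,utf8}]).</input>
+ok</pre>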
+ </section>
+
<section>
- <title>Byte Order Marks</title>
- <p>A common method of identifying encoding in text-files is to put
- a byte order mark (BOM) first in the file. The BOM is the
- code point 16#FEFF encoded in the same way as the rest of the
- file. If such a file is to be read, the first few bytes (depending
- on encoding) is not part of the actual text. This code outlines
- how to open a file which is believed to have a BOM and set the
- files encoding and position for further sequential reading
- (preferably using the <seealso marker="stdlib:io"><c>io</c></seealso>
- module). Note that error handling is omitted from the code:</p>
-<code>
+ <title>Recipes</title>
+ <p>When starting with Unicode, one often stumbles over some common issues.
+ This section describes some methods of dealing with Unicode data.</p>
+
+ <section>
+ <title>Byte Order Marks</title>
+ <p>A common method of identifying encoding in text files is to put a Byte
+ Order Mark (BOM) first in the file. The BOM is the code point 16#FEFF
+ encoded in the same way as the remaining file. If such a file is to be
+ read, the first few bytes (depending on encoding) are not part of the
+ text. This code outlines how to open a file that is believed to
+        have a BOM, and sets the file's encoding and position for further
+ sequential reading (preferably using the
+ <seealso marker="stdlib:io"><c>io</c></seealso> module).</p>
+
+ <p>Notice that error handling is omitted from the code:</p>
+
+ <code>
open_bom_file_for_reading(File) -&gt;
{ok,F} = file:open(File,[read,binary]),
{ok,Bin} = file:read(F,4),
{Type,Bytes} = unicode:bom_to_encoding(Bin),
file:position(F,Bytes),
io:setopts(F,[{encoding,Type}]),
- {ok,F}.
-</code>
- <p>The <c>unicode:bom_to_encoding/1</c> function identifies the
- encoding from a binary of at least four bytes. It returns, along
- with an term suitable for setting the encoding of the file, the
- actual length of the BOM, so that the file position can be set
- accordingly. Note that <c>file:position/2</c> always works on
- byte-offsets, so that the actual byte-length of the BOM is
- needed.</p>
- <p>To open a file for writing and putting the BOM first is even
- simpler:</p>
-<code>
+ {ok,F}.</code>
+
+ <p>Function
+ <seealso marker="stdlib:unicode#bom_to_encoding/1"><c>unicode:bom_to_encoding/1</c></seealso>
+ identifies the encoding from a binary of at least four bytes. It
+ returns, along with a term suitable for setting the encoding of the
+ file, the byte length of the BOM, so that the file position can be set
+ accordingly. Notice that function
+ <seealso marker="kernel:file#position/2"><c>file:position/2</c></seealso>
+ always works on byte-offsets, so that the byte length of the BOM is
+ needed.</p>
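+
+      <p>Assuming a file that actually starts with a BOM, the function above can
+        be used as in the following minimal sketch (error handling is still
+        omitted):</p>
+
+      <code>
+read_first_line(FileName) -&gt;
+    {ok,F} = open_bom_file_for_reading(FileName),
+    %% The encoding is now set on the file, so the io functions
+    %% return Unicode data.
+    Line = io:get_line(F, ''),
+    ok = file:close(F),
+    Line.</code>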
+
+ <p>To open a file for writing and place the BOM first is even simpler:</p>
+
+ <code>
open_bom_file_for_writing(File,Encoding) -&gt;
{ok,F} = file:open(File,[write,binary]),
    ok = file:write(F,unicode:encoding_to_bom(Encoding)),
io:setopts(F,[{encoding,Encoding}]),
- {ok,F}.
-</code>
- <p>In both cases the file is then best processed using the
- <c>io</c> module, as the functions in <c>io</c> can handle code
- points beyond the ISO-latin-1 range.</p>
- </section>
- <section>
- <title>Formatted I/O</title>
- <p>When reading and writing to Unicode-aware entities, like the
- User or a file opened for Unicode translation, you will probably
- want to format text strings using the functions in <seealso
- marker="stdlib:io"><c>io</c></seealso> or <seealso
- marker="stdlib:io_lib"><c>io_lib</c></seealso>. For backward
- compatibility reasons, these functions do not accept just any list
- as a string, but require a special <em>translation modifier</em>
- when working with Unicode texts. The modifier is <c>t</c>. When
- applied to the <c>s</c> control character in a formatting string,
- it accepts all Unicode code points and expect binaries to be in
- UTF-8:</p>
- <pre>
+ {ok,F}.</code>
+
+      <p>In both these cases, the file is then best processed using the
+ <seealso marker="stdlib:io"><c>io</c></seealso> module, as the functions
+ in that module can handle code points beyond the ISO Latin-1 range.</p>
+ </section>
+
+ <section>
+ <title>Formatted I/O</title>
+ <p>When reading and writing to Unicode-aware entities, like a
+ file opened for Unicode translation, you probably want to format text
+ strings using the functions in the
+ <seealso marker="stdlib:io"><c>io</c></seealso> module or the
+ <seealso marker="stdlib:io_lib"><c>io_lib</c></seealso> module. For
+        backward compatibility reasons, these functions do not accept just any
+        list as a string, but require a special <em>translation modifier</em> when
+ working with Unicode texts. The modifier is <c>t</c>. When applied to
+ control character <c>s</c> in a formatting string, it accepts all
+ Unicode code points and expects binaries to be in UTF-8:</p>
+
+ <pre>
1> <input>io:format("~ts~n",[&lt;&lt;"åäö"/utf8&gt;&gt;]).</input>
åäö
ok
2> <input>io:format("~s~n",[&lt;&lt;"åäö"/utf8&gt;&gt;]).</input>
åäö
ok</pre>
- <p>Obviously the second <c>io:format/2</c> gives undesired output
- because the UTF-8 binary is not in latin1. For backward
- compatibility, the non prefixed <c>s</c> control character expects
- bytewise encoded ISO-latin-1 characters in binaries and lists
- containing only code points &lt; 256.</p>
- <p>As long as the data is always lists, the <c>t</c> modifier can
- be used for any string, but when binary data is involved, care
- must be taken to make the right choice of formatting characters. A
- bytewise encoded binary will also be interpreted as a string and
- printed even when using <c>~ts</c>, but it might be mistaken for a
- valid UTF-8 string and one should therefore avoid using the
- <c>~ts</c> control if the binary contains bytewise encoded
- characters and not UTF-8.</p>
- <p>The function <c>format/2</c> in <c>io_lib</c> behaves
- similarly. This function is defined to return a deep list of
- characters and the output could easily be converted to binary data
- for outputting on a device of any kind by a simple
- <c>erlang:list_to_binary/1</c>. When the translation modifier is
- used, the list can however contain characters that cannot be
- stored in one byte. The call to <c>erlang:list_to_binary/1</c>
- will in that case fail. However, if the I/O server you want to
- communicate with is Unicode-aware, the list returned can still be
- used directly:</p>
-<pre>
+
+ <p>Clearly, the second <c>io:format/2</c> gives undesired output, as the
+ UTF-8 binary is not in <c>latin1</c>. For backward compatibility, the
+ non-prefixed control character <c>s</c> expects bytewise-encoded ISO
+ Latin-1 characters in binaries and lists containing only code points
+ &lt; 256.</p>
+
+ <p>As long as the data is always lists, modifier <c>t</c> can be used for
+ any string, but when binary data is involved, care must be taken to
+ make the correct choice of formatting characters. A bytewise-encoded
+ binary is also interpreted as a string, and printed even when using
+        <c>~ts</c>, but it can then be mistaken for a valid UTF-8 string.
+        Therefore, avoid using the <c>~ts</c> control if the binary contains
+        bytewise-encoded characters and not UTF-8.</p>
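+
+      <p>If a binary is known to contain bytewise-encoded ISO Latin-1
+        characters, one way to still print it with <c>~ts</c> is to convert it
+        to UTF-8 first. A minimal sketch (the function name is only an
+        example), assuming that the input really is <c>latin1</c>-encoded:</p>
+
+      <code>
+print_latin1_binary(Bin) when is_binary(Bin) ->
+    %% Convert the bytewise (latin1) binary to UTF-8 before using ~ts
+    Utf8 = unicode:characters_to_binary(Bin, latin1, utf8),
+    io:format("~ts~n", [Utf8]).</code>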
+
+ <p>Function
+ <seealso marker="stdlib:io_lib#format/2"><c>io_lib:format/2</c></seealso>
+ behaves similarly. It is defined to return a deep list of characters
+ and the output can easily be converted to binary data for outputting on
+ any device by a simple
+ <seealso marker="erts:erlang#list_to_binary/1"><c>erlang:list_to_binary/1</c></seealso>.
+ When the translation modifier is used, the list can, however, contain
+ characters that cannot be stored in one byte. The call to
+ <c>erlang:list_to_binary/1</c> then fails. However, if the I/O server
+ you want to communicate with is Unicode-aware, the returned list can
+ still be used directly:</p>
+
+ <pre>
$ <input>erl +pc unicode</input>
Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -1195,55 +1327,56 @@ Eshell V5.10.1 (abort with ^G)
2> <input>io:put_chars(io_lib:format("~ts~n", ["Γιούνικοντ"])).</input>
Γιούνικοντ
ok</pre>
- <p>The Unicode string is returned as a Unicode list, which is
- recognized as such since the Erlang shell uses the Unicode
- encoding (and is started with all Unicode characters considered
- printable). The Unicode list is valid input to the <seealso
- marker="stdlib:io#put_chars/2"><c>io:put_chars/2</c></seealso> function,
- so data can be output on any Unicode capable device. If the device
- is a terminal, characters will be output in the <c>\x{</c>H
- ...<c>}</c> format if encoding is <c>latin1</c> otherwise in UTF-8
- (for the non-interactive terminal - "oldshell" or "noshell") or
- whatever is suitable to show the character properly (for an
- interactive terminal - the regular shell). The bottom line is that
- you can always send Unicode data to the <c>standard_io</c>
- device. Files will however only accept Unicode code points beyond
- ISO-latin-1 if <c>encoding</c> is set to something else than
- <c>latin1</c>.</p>
- </section>
- <section>
- <title>Heuristic Identification of UTF-8</title>
- <p>While it is
- strongly encouraged that the actual encoding of characters in
- binary data is known prior to processing, that is not always
- possible. On a typical Linux system, there is a mix of UTF-8
- and ISO-latin-1 text files and there are seldom any BOM's in the
- files to identify them.</p>
- <p>UTF-8 is designed in such a way that ISO-latin-1 characters
- with numbers beyond the 7-bit ASCII range are seldom considered
- valid when decoded as UTF-8. Therefore one can usually use
- heuristics to determine if a file is in UTF-8 or if it is encoded
- in ISO-latin-1 (one byte per character) encoding. The
- <c>unicode</c> module can be used to determine if data can be
- interpreted as UTF-8:</p>
- <code>
+
+ <p>The Unicode string is returned as a Unicode list, which is recognized
+        as such because the Erlang shell uses the Unicode encoding (and is started
+ with all Unicode characters considered printable). The Unicode list is
+ valid input to function
+ <seealso marker="stdlib:io#put_chars/2"><c>io:put_chars/2</c></seealso>,
+ so data can be output on any Unicode-capable device. If the device is a
+        terminal, characters are output in the format <c>\x{</c>H...<c>}</c>
+        if the encoding is <c>latin1</c>. Otherwise they are output in UTF-8
+        (for a non-interactive terminal: "oldshell" or "noshell") or in
+        whatever way is suitable to show the character properly (for an
+        interactive terminal: the regular shell).</p>
+
+ <p>So, you can always send Unicode data to the <c>standard_io</c> device.
+ Files, however, accept only Unicode code points beyond ISO Latin-1 if
+        <c>encoding</c> is set to something other than <c>latin1</c>.</p>
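+
+      <p>For example, to write a string with code points beyond ISO Latin-1 to
+        a file, you can either open the file with an <c>encoding</c> option and
+        use <c>io:put_chars/2</c>, or encode the string yourself and write the
+        resulting bytes with
+        <seealso marker="kernel:file#write/2"><c>file:write/2</c></seealso>.
+        A sketch of both alternatives (the file names are only examples):</p>
+
+      <code>
+write_unicode_to_files(String) ->
+    %% Alternative 1: let the I/O system encode the characters
+    {ok,F1} = file:open("utf8_encoded.txt",[write,{encoding,utf8}]),
+    ok = io:put_chars(F1, String),
+    ok = file:close(F1),
+    %% Alternative 2: encode the string yourself and write the bytes
+    {ok,F2} = file:open("also_utf8_encoded.txt",[write,binary]),
+    ok = file:write(F2, unicode:characters_to_binary(String)),
+    ok = file:close(F2).</code>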
+ </section>
+
+ <section>
+ <title>Heuristic Identification of UTF-8</title>
+ <p>While it is strongly encouraged that the encoding of characters
+ in binary data is known before processing, that is not always possible.
+ On a typical Linux system, there is a mix of UTF-8 and ISO Latin-1 text
+ files, and there are seldom any BOMs in the files to identify them.</p>
+
+ <p>UTF-8 is designed so that ISO Latin-1 characters with numbers beyond
+ the 7-bit ASCII range are seldom considered valid when decoded as UTF-8.
+ Therefore one can usually use heuristics to determine if a file is in
+ UTF-8 or if it is encoded in ISO Latin-1 (one byte per character).
+ The <seealso marker="stdlib:unicode"><c>unicode</c></seealso>
+ module can be used to determine if data can be interpreted as UTF-8:</p>
+
+ <code>
heuristic_encoding_bin(Bin) when is_binary(Bin) -&gt;
case unicode:characters_to_binary(Bin,utf8,utf8) of
Bin ->
utf8;
_ ->
latin1
- end.
- </code>
- <p>If one does not have a complete binary of the file content, one
- could instead chunk through the file and check part by part. The
- return-tuple <c>{incomplete,Decoded,Rest}</c> from
- <c>unicode:characters_to_binary/{1,2,3}</c> comes in handy. The
- incomplete rest from one chunk of data read from the file is
- prepended to the next chunk and we therefore circumvent the
- problem of character boundaries when reading chunks of bytes in
- UTF-8 encoding:</p>
- <code>
+ end.</code>
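+
+      <p>For example, to apply this heuristic to a whole file that fits in
+        memory (the function name is only an example):</p>
+
+      <code>
+heuristic_encoding_of_file(FileName) ->
+    %% Read the complete file content and reuse the heuristic above
+    {ok,Bin} = file:read_file(FileName),
+    heuristic_encoding_bin(Bin).</code>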
+
+ <p>If you do not have a complete binary of the file content, you can
+ instead chunk through the file and check part by part. The return-tuple
+ <c>{incomplete,Decoded,Rest}</c> from function
+ <seealso marker="stdlib:unicode#characters_to_binary/1"><c>unicode:characters_to_binary/1,2,3</c></seealso>
+ comes in handy. The incomplete rest from one chunk of data read from the
+ file is prepended to the next chunk and we therefore avoid the problem
+ of character boundaries when reading chunks of bytes in UTF-8
+ encoding:</p>
+
+ <code>
heuristic_encoding_file(FileName) -&gt;
{ok,F} = file:open(FileName,[read,binary]),
loop_through_file(F,&lt;&lt;&gt;&gt;,file:read(F,1024)).
@@ -1260,13 +1393,14 @@ loop_through_file(F,Acc,{ok,Bin}) when is_binary(Bin) -&gt;
loop_through_file(F,Rest,file:read(F,1024));
Res when is_binary(Res) ->
loop_through_file(F,&lt;&lt;&gt;&gt;,file:read(F,1024))
- end.
- </code>
- <p>Another option is to try to read the whole file in UTF-8
- encoding and see if it fails. Here we need to read the file using
- <c>io:get_chars/3</c>, as we have to succeed in reading characters
- with a code point over 255:</p>
- <code>
+ end.</code>
+
+ <p>Another option is to try to read the whole file in UTF-8 encoding and
+ see if it fails. Here we need to read the file using function
+ <seealso marker="stdlib:io#get_chars/3"><c>io:get_chars/3</c></seealso>,
+        as we must be able to read characters with a code point &gt; 255:</p>
+
+ <code>
heuristic_encoding_file2(FileName) -&gt;
{ok,F} = file:open(FileName,[read,binary,{encoding,utf8}]),
loop_through_file2(F,io:get_chars(F,'',1024)).
@@ -1276,69 +1410,71 @@ loop_through_file2(_,eof) -&gt;
loop_through_file2(_,{error,_Err}) -&gt;
latin1;
loop_through_file2(F,Bin) when is_binary(Bin) -&gt;
- loop_through_file2(F,io:get_chars(F,'',1024)).
- </code>
- </section>
- <section>
- <title>Lists of UTF-8 Bytes</title>
- <p>For various reasons, you may find yourself having a list of
- UTF-8 bytes. This is not a regular string of Unicode characters as
- each element in the list does not contain one character. Instead
- you get the "raw" UTF-8 encoding that you have in binaries. This
- is easily converted to a proper Unicode string by first converting
- byte per byte into a binary and then converting the binary of
- UTF-8 encoded characters back to a Unicode string:</p>
- <code>
- utf8_list_to_string(StrangeList) ->
- unicode:characters_to_list(list_to_binary(StrangeList)).
- </code>
- </section>
- <section>
- <title>Double UTF-8 Encoding</title>
- <p>When working with binaries, you may get the horrible "double
- UTF-8 encoding", where strange characters are encoded in your
- binaries or files that you did not expect. What you may have got,
- is a UTF-8 encoded binary that is for the second time encoded as
- UTF-8. A common situation is where you read a file, byte by byte,
- but the actual content is already UTF-8. If you then convert the
- bytes to UTF-8, using i.e. the <c>unicode</c> module or by
- writing to a file opened with the <c>{encoding,utf8}</c>
- option. You will have each <i>byte</i> in the in the input file
- encoded as UTF-8, not each character of the original text (one
- character may have been encoded in several bytes). There is no
- real remedy for this other than being very sure of which data is
- actually encoded in which format, and never convert UTF-8 data
- (possibly read byte by byte from a file) into UTF-8 again.</p>
- <p>The by far most common situation where this happens, is when
- you get lists of UTF-8 instead of proper Unicode strings, and then
- convert them to UTF-8 in a binary or on a file:</p>
- <code>
- wrong_thing_to_do() ->
- {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"),
- MyList = binary_to_list(Bin), %% Wrong! It is an utf8 binary!
- {ok,C} = file:open("catastrophe.txt",[write,{encoding,utf8}]),
- io:put_chars(C,MyList), %% Expects a Unicode string, but get UTF-8
- %% bytes in a list!
- file:close(C). %% The file catastrophe.txt contains more or less unreadable
- %% garbage!
- </code>
- <p>Make very sure you know what a binary contains before
- converting it to a string. If no other option exists, try
- heuristics:</p>
- <code>
- if_you_can_not_know() ->
- {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"),
- MyList = case unicode:characters_to_list(Bin) of
- L when is_list(L) ->
- L;
- _ ->
- binary_to_list(Bin) %% The file was bytewise encoded
- end,
- %% Now we know that the list is a Unicode string, not a list of UTF-8 bytes
- {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]),
- io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets!
- file:close(G). %% The file contains valid UTF-8 encoded Unicode characters!
- </code>
+ loop_through_file2(F,io:get_chars(F,'',1024)).</code>
+ </section>
+
+ <section>
+ <title>Lists of UTF-8 Bytes</title>
+ <p>For various reasons, you can sometimes have a list of UTF-8
+ bytes. This is not a regular string of Unicode characters, as each list
+ element does not contain one character. Instead you get the "raw" UTF-8
+ encoding that you have in binaries. This is easily converted to a proper
+        Unicode string by first converting it, byte by byte, into a binary, and then
+ converting the binary of UTF-8 encoded characters back to a Unicode
+ string:</p>
+
+ <code>
+utf8_list_to_string(StrangeList) ->
+ unicode:characters_to_list(list_to_binary(StrangeList)).</code>
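+
+      <p>The conversion above assumes that the list really contains valid
+        UTF-8 bytes. If that is not certain, you can check the result, as
+        <c>unicode:characters_to_list/1</c> returns an <c>error</c> or
+        <c>incomplete</c> tuple instead of a string for invalid or truncated
+        input. A sketch (the function name is only an example):</p>
+
+      <code>
+utf8_list_to_string_checked(StrangeList) ->
+    case unicode:characters_to_list(list_to_binary(StrangeList)) of
+        String when is_list(String) ->
+            {ok,String};
+        {error,_Converted,_Rest} ->
+            {error,not_utf8};
+        {incomplete,_Converted,_Rest} ->
+            {error,truncated}
+    end.</code>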
+ </section>
+
+ <section>
+ <title>Double UTF-8 Encoding</title>
+ <p>When working with binaries, you can get the horrible "double UTF-8
+ encoding", where strange characters are encoded in your binaries or
+        files. In other words, you can get a UTF-8 encoded binary that is
+        encoded as UTF-8 a second time. A common situation is when you read a
+ file, byte by byte, but the content is already UTF-8. If you then
+ convert the bytes to UTF-8, using, for example, the
+ <seealso marker="stdlib:unicode"><c>unicode</c></seealso> module, or by
+ writing to a file opened with option <c>{encoding,utf8}</c>, you have
+ each <em>byte</em> in the input file encoded as UTF-8, not each
+        character of the original text (one character can be encoded in
+        several bytes). There is no real remedy for this other than to be sure of
+ which data is encoded in which format, and never convert UTF-8 data
+ (possibly read byte by byte from a file) into UTF-8 again.</p>
+
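+      <p>To see what goes wrong, consider the character "å", which has code
+        point 229 and is correctly encoded in UTF-8 as the two bytes 195 and
+        165. If those bytes are mistakenly treated as two characters and
+        encoded as UTF-8 once more, the result is the four bytes 195, 131, 194,
+        and 165, which typically display as "Ã¥". A minimal illustration (the
+        function name is only an example):</p>
+
+      <code>
+double_encoding_example() ->
+    %% "å" (code point 229) is the two bytes 195,165 when encoded as UTF-8
+    &lt;&lt;195,165&gt;&gt; = &lt;&lt;"å"/utf8&gt;&gt;,
+    %% Encoding those two bytes again, as if each byte were a character,
+    %% gives four bytes that typically display as "Ã¥"
+    &lt;&lt;195,131,194,165&gt;&gt; = unicode:characters_to_binary([195,165]),
+    ok.</code>
+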
+      <p>By far the most common situation where this occurs is when you get
+        lists of UTF-8 bytes instead of proper Unicode strings, and then
+        convert them to UTF-8 in a binary or on a file:</p>
+
+ <code>
+wrong_thing_to_do() ->
+ {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"),
+    MyList = binary_to_list(Bin), %% Wrong! It is a UTF-8 encoded binary!
+ {ok,C} = file:open("catastrophe.txt",[write,{encoding,utf8}]),
+    io:put_chars(C,MyList), %% Expects a Unicode string, but gets UTF-8
+ %% bytes in a list!
+ file:close(C). %% The file catastrophe.txt contains more or less unreadable
+ %% garbage!</code>
+
+ <p>Ensure you know what a binary contains before converting it to a
+ string. If no other option exists, try heuristics:</p>
+
+ <code>
+if_you_can_not_know() ->
+ {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"),
+ MyList = case unicode:characters_to_list(Bin) of
+ L when is_list(L) ->
+ L;
+ _ ->
+ binary_to_list(Bin) %% The file was bytewise encoded
+ end,
+ %% Now we know that the list is a Unicode string, not a list of UTF-8 bytes
+ {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]),
+ io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets!
+ file:close(G). %% The file contains valid UTF-8 encoded Unicode characters!</code>
+ </section>
</section>
-</section>
</chapter>
+