author | Björn Gustavsson <[email protected]> | 2016-05-18 15:53:35 +0200
committer | Björn Gustavsson <[email protected]> | 2016-06-13 12:05:57 +0200
commit | 68d53c01b0b8e9a007a6a30158c19e34b2d2a34e (patch)
tree | 4613f513b9465beb7febec6c74c8ef0502f861fe /lib/stdlib/doc/src/unicode_usage.xml
parent | 99b379365981e14e2c8dde7b1a337c8ff856bd4a (diff)
Update STDLIB documentation
Language cleaned up by the technical writers xsipewe and tmanevik
from Combitech. Proofreading and corrections by Björn Gustavsson
and Hans Bolinder.
Diffstat (limited to 'lib/stdlib/doc/src/unicode_usage.xml')
-rw-r--r-- | lib/stdlib/doc/src/unicode_usage.xml | 2420
1 file changed, 1278 insertions(+), 1142 deletions(-)
diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml index b4c9385e33..7f79ac88a1 100644 --- a/lib/stdlib/doc/src/unicode_usage.xml +++ b/lib/stdlib/doc/src/unicode_usage.xml @@ -33,427 +33,495 @@ <rev>PA1</rev> <file>unicode_usage.xml</file> </header> -<section> -<title>Unicode Implementation</title> - <p>Implementing support for Unicode character sets is an ongoing - process. The Erlang Enhancement Proposal (EEP) 10 outlined the - basics of Unicode support and also specified a default encoding in - binaries that all Unicode-aware modules should handle in the - future.</p> - - <p>The functionality described in EEP10 was implemented in Erlang/OTP - R13A, but that was by no means the end of it. In Erlang/OTP R14B01 support - for Unicode file names was added, although it was in no way complete - and was by default disabled on platforms where no guarantee was given - for the file name encoding. With Erlang/OTP R16A came support for UTF-8 encoded - source code, among with enhancements to many of the applications to - support both Unicode encoded file names as well as support for UTF-8 - encoded files in several circumstances. Most notable is the support - for UTF-8 in files read by <c>file:consult/1</c>, release handler support - for UTF-8 and more support for Unicode character sets in the - I/O-system. In Erlang/OTP 17.0, the encoding default for Erlang source files was - switched to UTF-8.</p> - - <p>This guide outlines the current Unicode support and gives a couple - of recipes for working with Unicode data.</p> -</section> -<section> -<title>Understanding Unicode</title> - <p>Experience with the Unicode support in Erlang has made it - painfully clear that understanding Unicode characters and encodings - is not as easy as one would expect. The complexity of the field as - well as the implications of the standard requires thorough - understanding of concepts rarely before thought of.</p> - - <p>Furthermore the Erlang implementation requires understanding of - concepts that never were an issue for many (Erlang) programmers. To - understand and use Unicode characters requires that you study the - subject thoroughly, even if you're an experienced programmer.</p> - - <p>As an example, one could contemplate the issue of converting - between upper and lower case letters. Reading the standard will make - you realize that, to begin with, there's not a simple one to one - mapping in all scripts. Take German as an example, where there's a - letter "ß" (Sharp s) in lower case, but the uppercase equivalent is - "SS". Or Greek, where "Σ" has two different lowercase forms: "ς" in - word-final position and "σ" elsewhere. Or Turkish where dotted and - dot-less "i" both exist in lower case and upper case forms, or - Cyrillic "I" which usually has no lowercase form. Or of course - languages that have no concept of upper case (or lower case). So, a - conversion function will need to know not only one character at a - time, but possibly the whole sentence, maybe the natural language - the translation should be in and also take into account differences - in input and output string length and so on. There is at the time of - writing no Unicode to_upper/to_lower functionality in Erlang/OTP, but - there are publicly available libraries that address these issues.</p> - - <p>Another example is the accented characters where the same glyph - has two different representations. Let's look at the Swedish - "ö". 
There's a code point for that in the Unicode standard, but you - can also write it as "o" followed by U+0308 (Combining Diaeresis, - with the simplified meaning that the last letter should have a "¨" - above). They have exactly the same glyph. They are for most - purposes the same, but they have completely different - representations. For example MacOS X converts all file names to use - Combining Diaeresis, while most other programs (including Erlang) - try to hide that by doing the opposite when for example listing - directories. However it's done, it's usually important to normalize - such characters to avoid utter confusion.</p> - - <p>The list of examples can be made as long as the Unicode standard, I - suspect. The point is that one need a kind of knowledge that was - never needed when programs only took one or two languages into - account. The complexity of human languages and scripts, certainly - has made this a challenge when constructing a universal - standard. Supporting Unicode properly in your program <em>will</em> require - effort.</p> - -</section> -<section> -<title>What Unicode Is</title> - <p>Unicode is a standard defining code points (numbers) for all - known, living or dead, scripts. In principle, every known symbol - used in any language has a Unicode code point.</p> - <p>Unicode code points are defined and published by the <em>Unicode - Consortium</em>, which is a non profit organization.</p> - <p>Support for Unicode is increasing throughout the world of - computing, as the benefits of one common character set are - overwhelming when programs are used in a global environment.</p> - <p>Along with the base of the standard: the code points for all the - scripts, there are a couple of <em>encoding standards</em> available.</p> - <p>It is vital to understand the difference between encodings and - Unicode characters. Unicode characters are code points according to - the Unicode standard, while the encodings are ways to represent such - code points. An encoding is just a standard for representation, - UTF-8 can for example be used to represent a very limited part of - the Unicode character set (e.g. ISO-Latin-1), or the full Unicode - range. It's just an encoding format.</p> - <p>As long as all character sets were limited to 256 characters, - each character could be stored in one single byte, so there was more - or less only one practical encoding for the characters. Encoding - each character in one byte was so common that the encoding wasn't - even named. When we now, with the Unicode system, have a lot more - than 256 characters, we need a common way to represent these. The - common ways of representing the code points are the encodings. This - means a whole new concept to the programmer, the concept of - character representation, which was before a non-issue.</p> - - <p>Different operating systems and tools support different - encodings. For example Linux and MacOS X has chosen the UTF-8 - encoding, which is backwards compatible with 7-bit ASCII and - therefore affects programs written in plain English the - least. Windows on the other hand supports a limited version of - UTF-16, namely all the code planes where the characters can be - stored in one single 16-bit entity, which includes most living - languages.</p> - - <p>The most widely spread encodings are:</p> - <taglist> - <tag>Bytewise representation</tag> - <item>This is not a proper Unicode representation, but the - representation used for characters before the Unicode standard. 
It - can still be used to represent character code points in the Unicode - standard that have numbers below 256, which corresponds exactly to - the ISO-Latin-1 character set. In Erlang, this is commonly denoted - <c>latin1</c> encoding, which is slightly misleading as ISO-Latin-1 is - a character code range, not an encoding.</item> - <tag>UTF-8</tag> - <item>Each character is stored in one to four bytes depending on - code point. The encoding is backwards compatible with bytewise - representation of 7-bit ASCII as all 7-bit characters are stored - in one single byte in UTF-8. The characters beyond code point 127 - are stored in more bytes, letting the most significant bit in the - first character indicate a multi-byte character. For details on - the encoding, the RFC is publicly available. Note that UTF-8 is - <em>not</em> compatible with bytewise representation for - code points between 128 and 255, so a ISO-Latin-1 bytewise - representation is not generally compatible with UTF-8.</item> - <tag>UTF-16</tag> - <item>This encoding has many similarities to UTF-8, but the basic - unit is a 16-bit number. This means that all characters occupy at - least two bytes, some high numbers even four bytes. Some programs, - libraries and operating systems claiming to use UTF-16 only allows - for characters that can be stored in one 16-bit entity, which is - usually sufficient to handle living languages. As the basic unit - is more than one byte, byte-order issues occur, why UTF-16 exists - in both a big-endian and little-endian variant. In Erlang, the - full UTF-16 range is supported when applicable, like in the - <c>unicode</c> module and in the bit syntax.</item> - <tag>UTF-32</tag> - <item>The most straight forward representation. Each character is - stored in one single 32-bit number. There is no need for escapes - or any variable amount of entities for one character, all Unicode - code points can be stored in one single 32-bit entity. As with - UTF-16, there are byte-order issues, UTF-32 can be both big- and - little-endian.</item> - <tag>UCS-4</tag> - <item>Basically the same as UTF-32, but without some Unicode - semantics, defined by IEEE and has little use as a separate - encoding standard. For all normal (and possibly abnormal) usages, - UTF-32 and UCS-4 are interchangeable.</item> - </taglist> - <p>Certain ranges of numbers are left unused in the Unicode standard - and certain ranges are even deemed invalid. The most notable invalid - range is 16#D800 - 16#DFFF, as the UTF-16 encoding does not allow - for encoding of these numbers. It can be speculated that the UTF-16 - encoding standard was, from the beginning, expected to be able to - hold all Unicode characters in one 16-bit entity, but then had to be - extended, leaving a hole in the Unicode range to cope with backward - compatibility.</p> - <p>Additionally, the code point 16#FEFF is used for byte order marks - (BOM's) and use of that character is not encouraged in other - contexts than that. It actually is valid though, as the character - "ZWNBS" (Zero Width Non Breaking Space). BOM's are used to identify - encodings and byte order for programs where such parameters are not - known in advance. 
Byte order marks are more seldom used than one - could expect, but their use might become more widely spread as they - provide the means for programs to make educated guesses about the - Unicode format of a certain file.</p> -</section> -<section> - <title>Areas of Unicode Support</title> - <p>To support Unicode in Erlang, problems in several areas have been - addressed. Each area is described briefly in this section and more - thoroughly further down in this document:</p> - <taglist> - <tag>Representation</tag> - <item>To handle Unicode characters in Erlang, we have to have a - common representation both in lists and binaries. The EEP (10) and - the subsequent initial implementation in Erlang/OTP R13A settled a standard - representation of Unicode characters in Erlang.</item> - <tag>Manipulation</tag> - <item>The Unicode characters need to be processed by the Erlang - program, why library functions need to be able to handle them. In - some cases functionality was added to already existing interfaces - (as the string module now can handle lists with arbitrary code points), - in some cases new functionality or options need to be added (as in - the <c>io</c>-module, the file handling, the <c>unicode</c> module - and the bit syntax). Today most modules in kernel and STDLIB, as - well as the VM are Unicode aware.</item> - <tag>File I/O</tag> - <item>I/O is by far the most problematic area for Unicode. A file - is an entity where bytes are stored and the lore of programming - has been to treat characters and bytes as interchangeable. With - Unicode characters, you need to decide on an encoding as soon as - you want to store the data in a file. In Erlang you can open a - text file with an encoding option, so that you can read characters - from it rather than bytes, but you can also open a file for - bytewise I/O. The I/O-system of Erlang has been designed (or at - least used) in a way where you expect any I/O-server to be - able to cope with any string data, but that is no longer the case - when you work with Unicode characters. Handling the fact that you - need to know the capabilities of the device where your data ends - up is something new to the Erlang programmer. Furthermore, ports - in Erlang are byte oriented, so an arbitrary string of (Unicode) - characters can not be sent to a port without first converting it - to an encoding of choice.</item> - <tag>Terminal I/O</tag> - <item>Terminal I/O is slightly easier than file I/O. The output is - meant for human reading and is usually Erlang syntax (e.g. in the - shell). There exists syntactic representation of any Unicode - character without actually displaying the glyph (instead written - as <c>\x{</c>HHH<c>}</c>), so Unicode data can usually be displayed - even if the terminal as such do not support the whole Unicode - range.</item> - <tag>File names</tag> - <item>File names can be stored as Unicode strings, in different - ways depending on the underlying OS and file system. This can be - handled fairly easy by a program. The problems arise when the file - system is not consistent in it's encodings, like for example - Linux. Linux allows files to be named with any sequence of bytes, - leaving to each program to interpret those bytes. On systems where - these "transparent" file names are used, Erlang has to be informed - about the file name encoding by a startup flag. The default is - bytewise interpretation, which is actually usually wrong, but - allows for interpretation of <em>all</em> file names. 
The concept - of "raw file names" can be used to handle wrongly encoded - file names if one enables Unicode file name translation - (<c>+fnu</c>) on platforms where this is not the default.</item> - <tag>Source code encoding</tag> - <item>When it comes to the Erlang source code, there is support - for the UTF-8 encoding and bytewise encoding. The default in - Erlang/OTP R16B was bytewise (or latin1) encoding; in Erlang/OTP 17.0 - it was changed to UTF-8. You can control the encoding by a comment like: -<code> -%% -*- coding: utf-8 -*- -</code> - in the beginning of the file. This of course requires your editor to - support UTF-8 as well. The same comment is also interpreted by - functions like <c>file:consult/1</c>, the release handler etc, so that - you can have all text files in your source directories in UTF-8 - encoding. - </item> - <tag>The language</tag> - <item>Having the source code in UTF-8 also allows you to write - string literals containing Unicode characters with code points > - 255, although atoms, module names and function names are - restricted to the ISO-Latin-1 range. Binary - literals where you use the <c>/utf8</c> type, can also be - expressed using Unicode characters > 255. Having module names - using characters other than 7-bit ASCII can cause trouble on - operating systems with inconsistent file naming schemes, and might - also hurt portability, so it's not really recommended. It is - suggested in EEP 40 that the language should also allow for - Unicode characters > 255 in variable names. Whether to - implement that EEP or not is yet to be decided.</item> - </taglist> -</section> -<section> - <title>Standard Unicode Representation</title> - <p>In Erlang, strings are actually lists of integers. A string was - up until Erlang/OTP R13 defined to be encoded in the ISO-latin-1 (ISO8859-1) - character set, which is, code point by code point, a sub-range of - the Unicode character set.</p> - <p>The standard list encoding for strings was therefore easily - extended to cope with the whole Unicode range: A Unicode string in - Erlang is simply a list containing integers, each integer being a - valid Unicode code point and representing one character in the - Unicode character set.</p> - <p>Erlang strings in ISO-latin-1 are a subset of Unicode - strings.</p> - <p>Only if a string contains code points < 256, can it be - directly converted to a binary by using - i.e. <c>erlang:iolist_to_binary/1</c> or can be sent directly to a - port. If the string contains Unicode characters > 255, an - encoding has to be decided upon and the string should be converted - to a binary in the preferred encoding using - <c>unicode:characters_to_binary/{1,2,3}</c>. Strings are not - generally lists of bytes, as they were before Erlang/OTP R13. They are lists of - characters. Characters are not generally bytes, they are Unicode - code points.</p> - - <p>Binaries are more troublesome. For performance reasons, programs - often store textual data in binaries instead of lists, mainly - because they are more compact (one byte per character instead of two - words per character, as is the case with lists). Using - <c>erlang:list_to_binary/1</c>, an ISO-Latin-1 Erlang string could - be converted into a binary, effectively using bytewise encoding - - one byte per character. 
This was very convenient for those limited - Erlang strings, but cannot be done for arbitrary Unicode lists.</p> - <p>As the UTF-8 encoding is widely spread and provides some backward - compatibility in the 7-bit ASCII range, it is selected as the - standard encoding for Unicode characters in binaries for Erlang.</p> - <p>The standard binary encoding is used whenever a library function - in Erlang should cope with Unicode data in binaries, but is of - course not enforced when communicating externally. Functions and - bit-syntax exist to encode and decode both UTF-8, UTF-16 and UTF-32 - in binaries. Library functions dealing with binaries and Unicode in - general, however, only deal with the default encoding.</p> - - <p>Character data may be combined from several sources, sometimes - available in a mix of strings and binaries. Erlang has for long had - the concept of <c>iodata</c> or <c>iolist</c>s, where binaries and - lists can be combined to represent a sequence of bytes. In the same - way, the Unicode aware modules often allow for combinations of - binaries and lists where the binaries have characters encoded in - UTF-8 and the lists contain such binaries or numbers representing - Unicode code points:</p> - <code type="none"> + <section> + <title>Unicode Implementation</title> + <p>Implementing support for Unicode character sets is an ongoing process. + The Erlang Enhancement Proposal (EEP) 10 outlined the basics of Unicode + support and specified a default encoding in binaries that all + Unicode-aware modules are to handle in the future.</p> + + <p>Here is an overview what has been done so far:</p> + + <list type="bulleted"> + <item><p>The functionality described in EEP10 was implemented + in Erlang/OTP R13A.</p></item> + + <item><p>Erlang/OTP R14B01 added support for Unicode + filenames, but it was not complete and was by default + disabled on platforms where no guarantee was given for the + filename encoding.</p></item> + + <item><p>With Erlang/OTP R16A came support for UTF-8 encoded + source code, with enhancements to many of the applications to + support both Unicode encoded filenames and support for UTF-8 + encoded files in many circumstances. Most notable is the + support for UTF-8 in files read by <seealso + marker="kernel:file#consult/1"><c>file:consult/1</c></seealso>, + release handler support for UTF-8, and more support for + Unicode character sets in the I/O system.</p></item> + + <item><p>In Erlang/OTP 17.0, the encoding default for Erlang + source files was switched to UTF-8.</p></item> + </list> + + <p>This section outlines the current Unicode support and gives some + recipes for working with Unicode data.</p> + </section> + + <section> + <title>Understanding Unicode</title> + <p>Experience with the Unicode support in Erlang has made it clear that + understanding Unicode characters and encodings is not as easy as one + would expect. The complexity of the field and the implications of the + standard require thorough understanding of concepts rarely before + thought of.</p> + + <p>Also, the Erlang implementation requires understanding of + concepts that were never an issue for many (Erlang) programmers. To + understand and use Unicode characters requires that you study the + subject thoroughly, even if you are an experienced programmer.</p> + + <p>As an example, contemplate the issue of converting between upper and + lower case letters. 
Reading the standard makes you realize that there is + not a simple one to one mapping in all scripts, for example:</p> + + <list type="bulleted"> + <item> + <p>In German, the letter "ß" (sharp s) is in lower case, but the + uppercase equivalent is "SS".</p> + </item> + <item> + <p>In Greek, the letter "Σ" has two different lowercase forms, + "ς" in word-final position and "σ" elsewhere.</p> + </item> + <item> + <p>In Turkish, both dotted and dotless "i" exist in lower case and + upper case forms.</p> + </item> + <item> + <p>Cyrillic "I" has usually no lowercase form.</p> + </item> + <item> + <p>Languages with no concept of upper case (or lower case).</p> + </item> + </list> + + <p>So, a conversion function must know not only one character at a time, + but possibly the whole sentence, the natural language to translate to, + the differences in input and output string length, and so on. + Erlang/OTP has currently no Unicode <c>to_upper</c>/<c>to_lower</c> + functionality, but publicly available libraries address these issues.</p> + + <p>Another example is the accented characters, where the same glyph has two + different representations. The Swedish letter "ö" is one example. + The Unicode standard has a code point for it, but you can also write it + as "o" followed by "U+0308" (Combining Diaeresis, with the simplified + meaning that the last letter is to have "¨" above). They have the same + glyph. They are for most purposes the same, but have different + representations. For example, MacOS X converts all filenames to use + Combining Diaeresis, while most other programs (including Erlang) try to + hide that by doing the opposite when, for example, listing directories. + However it is done, it is usually important to normalize such + characters to avoid confusion.</p> + + <p>The list of examples can be made long. One need a kind of knowledge that + was not needed when programs only considered one or two languages. The + complexity of human languages and scripts has certainly made this a + challenge when constructing a universal standard. Supporting Unicode + properly in your program will require effort.</p> + </section> + + <section> + <title>What Unicode Is</title> + <p>Unicode is a standard defining code points (numbers) for all known, + living or dead, scripts. In principle, every symbol used in any + language has a Unicode code point. Unicode code points are defined and + published by the Unicode Consortium, which is a non-profit + organization.</p> + + <p>Support for Unicode is increasing throughout the world of computing, as + the benefits of one common character set are overwhelming when programs + are used in a global environment. Along with the base of the standard, + the code points for all the scripts, some <em>encoding standards</em> are + available.</p> + + <p>It is vital to understand the difference between encodings and Unicode + characters. Unicode characters are code points according to the Unicode + standard, while the encodings are ways to represent such code points. An + encoding is only a standard for representation. UTF-8 can, for example, + be used to represent a very limited part of the Unicode character set + (for example ISO-Latin-1) or the full Unicode range. It is only an + encoding format.</p> + + <p>As long as all character sets were limited to 256 characters, each + character could be stored in one single byte, so there was more or less + only one practical encoding for the characters. 
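The following sketch (an added illustration, not part of the original text) shows what this means in practice: the same three code points stored bytewise, in UTF-8, and in UTF-16. It assumes a shell or source file where "åäö" is read as the code points 229, 228, and 246:

%% Bytewise ("latin1") representation: one byte per character.
Bytewise = list_to_binary("åäö"),                            %% <<229,228,246>>
%% UTF-8, the default Unicode encoding in Erlang binaries.
Utf8 = unicode:characters_to_binary("åäö"),                  %% <<195,165,195,164,195,182>>
%% UTF-16, big-endian by default.
Utf16 = unicode:characters_to_binary("åäö", unicode, utf16). %% <<0,229,0,228,0,246>>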
Encoding each character + in one byte was so common that the encoding was not even named. With the + Unicode system there are much more than 256 characters, so a common way + is needed to represent these. The common ways of representing the code + points are the encodings. This means a whole new concept to the + programmer, the concept of character representation, which was a + non-issue earlier.</p> + + <p>Different operating systems and tools support different encodings. For + example, Linux and MacOS X have chosen the UTF-8 encoding, which is + backward compatible with 7-bit ASCII and therefore affects programs + written in plain English the least. Windows supports a limited version + of UTF-16, namely all the code planes where the characters can be + stored in one single 16-bit entity, which includes most living + languages.</p> + + <p>The following are the most widely spread encodings:</p> + + <taglist> + <tag>Bytewise representation</tag> + <item> + <p>This is not a proper Unicode representation, but the representation + used for characters before the Unicode standard. It can still be used + to represent character code points in the Unicode standard with + numbers < 256, which exactly corresponds to the ISO Latin-1 + character set. In Erlang, this is commonly denoted <c>latin1</c> + encoding, which is slightly misleading as ISO Latin-1 is a + character code range, not an encoding.</p> + </item> + <tag>UTF-8</tag> + <item> + <p>Each character is stored in one to four bytes depending on code + point. The encoding is backward compatible with bytewise + representation of 7-bit ASCII, as all 7-bit characters are stored in + one single byte in UTF-8. The characters beyond code point 127 are + stored in more bytes, letting the most significant bit in the first + character indicate a multi-byte character. For details on the + encoding, the RFC is publicly available.</p> + <p>Notice that UTF-8 is <em>not</em> compatible with bytewise + representation for code points from 128 through 255, so an ISO + Latin-1 bytewise representation is generally incompatible with + UTF-8.</p> + </item> + <tag>UTF-16</tag> + <item> + <p>This encoding has many similarities to UTF-8, but the basic + unit is a 16-bit number. This means that all characters occupy + at least two bytes, and some high numbers four bytes. Some + programs, libraries, and operating systems claiming to use + UTF-16 only allow for characters that can be stored in one + 16-bit entity, which is usually sufficient to handle living + languages. As the basic unit is more than one byte, byte-order + issues occur, which is why UTF-16 exists in both a big-endian + and a little-endian variant.</p> + <p>In Erlang, the full UTF-16 range is supported when applicable, like + in the <seealso marker="stdlib:unicode"><c>unicode</c></seealso> + module and in the bit syntax.</p> + </item> + <tag>UTF-32</tag> + <item> + <p>The most straightforward representation. Each character is stored in + one single 32-bit number. There is no need for escapes or any + variable number of entities for one character. All Unicode code + points can be stored in one single 32-bit entity. As with UTF-16, + there are byte-order issues. UTF-32 can be both big-endian and + little-endian.</p> + </item> + <tag>UCS-4</tag> + <item> + <p>Basically the same as UTF-32, but without some Unicode semantics, + defined by IEEE, and has little use as a separate encoding standard. 
+ For all normal (and possibly abnormal) use, UTF-32 and UCS-4 are + interchangeable.</p> + </item> + </taglist> + + <p>Certain number ranges are unused in the Unicode standard and certain + ranges are even deemed invalid. The most notable invalid range is + 16#D800-16#DFFF, as the UTF-16 encoding does not allow for encoding of + these numbers. This is possibly because the UTF-16 encoding standard, + from the beginning, was expected to be able to hold all Unicode + characters in one 16-bit entity, but was then extended, leaving a hole + in the Unicode range to handle backward compatibility.</p> + + <p>Code point 16#FEFF is used for Byte Order Marks (BOMs) and use of that + character is not encouraged in other contexts. It is valid though, as + the character "ZWNBS" (Zero Width Non Breaking Space). BOMs are used to + identify encodings and byte order for programs where such parameters are + not known in advance. BOMs are more seldom used than expected, but can + become more widely spread as they provide the means for programs to make + educated guesses about the Unicode format of a certain file.</p> + </section> + + <section> + <title>Areas of Unicode Support</title> + <p>To support Unicode in Erlang, problems in various areas have been + addressed. This section describes each area briefly and more + thoroughly later in this User's Guide.</p> + + <taglist> + <tag>Representation</tag> + <item> + <p>To handle Unicode characters in Erlang, a common representation + in both lists and binaries is needed. EEP (10) and the subsequent + initial implementation in Erlang/OTP R13A settled a standard + representation of Unicode characters in Erlang.</p> + </item> + <tag>Manipulation</tag> + <item> + <p>The Unicode characters need to be processed by the Erlang + program, which is why library functions must be able to handle + them. In some cases functionality has been added to already + existing interfaces (as the <seealso + marker="stdlib:string"><c>string</c></seealso> module now can + handle lists with any code points). In some cases new + functionality or options have been added (as in the <seealso + marker="stdlib:io"><c>io</c></seealso> module, the file + handling, the <seealso + marker="stdlib:unicode"><c>unicode</c></seealso> module, and + the bit syntax). Today most modules in <c>Kernel</c> and + <c>STDLIB</c>, as well as the VM are Unicode-aware.</p> + </item> + <tag>File I/O</tag> + <item> + <p>I/O is by far the most problematic area for Unicode. A file is an + entity where bytes are stored, and the lore of programming has been + to treat characters and bytes as interchangeable. With Unicode + characters, you must decide on an encoding when you want to store + the data in a file. In Erlang, you can open a text file with an + encoding option, so that you can read characters from it rather than + bytes, but you can also open a file for bytewise I/O.</p> + <p>The Erlang I/O-system has been designed (or at least used) in a way + where you expect any I/O server to handle any string data. + That is, however, no longer the case when working with Unicode + characters. The Erlang programmer must now know the + capabilities of the device where the data ends up. Also, ports in + Erlang are byte-oriented, so an arbitrary string of (Unicode) + characters cannot be sent to a port without first converting it to an + encoding of choice.</p> + </item> + <tag>Terminal I/O</tag> + <item> + <p>Terminal I/O is slightly easier than file I/O. 
The output is meant + for human reading and is usually Erlang syntax (for example, in the + shell). There exists syntactic representation of any Unicode + character without displaying the glyph (instead written as + <c>\x</c>{<c>HHH</c>}). Unicode data can therefore usually be + displayed even if the terminal as such does not support the whole + Unicode range.</p> + </item> + <tag>Filenames</tag> + <item> + <p>Filenames can be stored as Unicode strings in different ways + depending on the underlying operating system and file system. This + can be handled fairly easy by a program. The problems arise when the + file system is inconsistent in its encodings. For example, Linux + allows files to be named with any sequence of bytes, leaving to each + program to interpret those bytes. On systems where these + "transparent" filenames are used, Erlang must be informed about the + filename encoding by a startup flag. The default is bytewise + interpretation, which is usually wrong, but allows for interpretation + of <em>all</em> filenames.</p> + <p>The concept of "raw filenames" can be used to handle wrongly encoded + filenames if one enables Unicode filename translation (<c>+fnu</c>) + on platforms where this is not the default.</p> + </item> + <tag>Source code encoding</tag> + <item> + <p>The Erlang source code has support for the UTF-8 encoding + and bytewise encoding. The default in Erlang/OTP R16B was bytewise + (<c>latin1</c>) encoding. It was changed to UTF-8 in Erlang/OTP 17.0. + You can control the encoding by a comment like the following in the + beginning of the file:</p> + <code> +%% -*- coding: utf-8 -*-</code> + <p>This of course requires your editor to support UTF-8 as well. The + same comment is also interpreted by functions like + <seealso marker="kernel:file#consult/1"><c>file:consult/1</c></seealso>, + the release handler, and so on, so that you can have all text files + in your source directories in UTF-8 encoding.</p> + </item> + <tag>The language</tag> + <item> + <p>Having the source code in UTF-8 also allows you to write string + literals containing Unicode characters with code points > 255, + although atoms, module names, and function names are restricted to + the ISO Latin-1 range. Binary literals, where you use type + <c>/utf8</c>, can also be expressed using Unicode characters > 255. + Having module names using characters other than 7-bit ASCII can cause + trouble on operating systems with inconsistent file naming schemes, + and can hurt portability, so it is not recommended.</p> + <p>EEP 40 suggests that the language is also to allow for Unicode + characters > 255 in variable names. Whether to implement that EEP + is yet to be decided.</p> + </item> + </taglist> + </section> + + <section> + <title>Standard Unicode Representation</title> + <p>In Erlang, strings are lists of integers. A string was until + Erlang/OTP R13 defined to be encoded in the ISO Latin-1 (ISO 8859-1) + character set, which is, code point by code point, a subrange of the + Unicode character set.</p> + + <p>The standard list encoding for strings was therefore easily extended to + handle the whole Unicode range. 
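For example (an added illustration; the last line assumes a UTF-8 encoded source file or a Unicode-capable shell):

"abc"    = [97,98,99],                       %% 7-bit ASCII code points
"åäö"    = [229,228,246],                    %% ISO Latin-1 subrange of Unicode
"Юникод" = [1070,1085,1080,1082,1086,1076].  %% code points > 255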
A Unicode string in Erlang is a list + containing integers, where each integer is a valid Unicode code point and + represents one character in the Unicode character set.</p> + + <p>Erlang strings in ISO Latin-1 are a subset of Unicode strings.</p> + + <p>Only if a string contains code points < 256, can it be directly + converted to a binary by using, for example, + <seealso marker="erts:erlang#iolist_to_binary/1"><c>erlang:iolist_to_binary/1</c></seealso> + or can be sent directly to a port. If the string contains Unicode + characters > 255, an encoding must be decided upon and the string is to + be converted to a binary in the preferred encoding using + <seealso marker="stdlib:unicode#characters_to_binary/1"><c>unicode:characters_to_binary/1,2,3</c></seealso>. + Strings are not generally lists of bytes, as they were before + Erlang/OTP R13, they are lists of characters. Characters are not + generally bytes, they are Unicode code points.</p> + + <p>Binaries are more troublesome. For performance reasons, programs often + store textual data in binaries instead of lists, mainly because they are + more compact (one byte per character instead of two words per character, + as is the case with lists). Using + <seealso marker="erts:erlang#list_to_binary/1"><c>erlang:list_to_binary/1</c></seealso>, + an ISO Latin-1 Erlang string can be converted into a binary, effectively + using bytewise encoding: one byte per character. This was convenient for + those limited Erlang strings, but cannot be done for arbitrary Unicode + lists.</p> + + <p>As the UTF-8 encoding is widely spread and provides some backward + compatibility in the 7-bit ASCII range, it is selected as the standard + encoding for Unicode characters in binaries for Erlang.</p> + + <p>The standard binary encoding is used whenever a library function in + Erlang is to handle Unicode data in binaries, but is of course not + enforced when communicating externally. Functions and bit syntax exist to + encode and decode both UTF-8, UTF-16, and UTF-32 in binaries. However, + library functions dealing with binaries and Unicode in general only deal + with the default encoding.</p> + + <p>Character data can be combined from many sources, sometimes available in + a mix of strings and binaries. Erlang has for long had the concept of + <c>iodata</c> or <c>iolist</c>s, where binaries and lists can be combined + to represent a sequence of bytes. 
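As a sketch of these conversions (an added illustration; it assumes a UTF-8 capable shell or source file):

Str = "Юникод",                               %% [1070,1085,1080,1082,1086,1076]
Utf8Bin = unicode:characters_to_binary(Str),  %% <<208,174,...>>, encoded as UTF-8
Str = unicode:characters_to_list(Utf8Bin),    %% back to a list of code points
%% Code points > 255 are not bytes, so a plain iolist conversion fails:
{'EXIT',{badarg,_}} = (catch erlang:iolist_to_binary(Str)).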
In the same way, the Unicode-aware + modules often allow for combinations of binaries and lists, where the + binaries have characters encoded in UTF-8 and the lists contain such + binaries or numbers representing Unicode code points:</p> + + <code type="none"> unicode_binary() = binary() with characters encoded in UTF-8 coding standard chardata() = charlist() | unicode_binary() charlist() = maybe_improper_list(char() | unicode_binary() | charlist(), - unicode_binary() | nil())</code> - <p>The module <seealso - marker="stdlib:unicode"><c>unicode</c></seealso> in STDLIB even - supports similar mixes with binaries containing other encodings than - UTF-8, but that is a special case to allow for conversions to and - from external data:</p> - <code type="none"> -external_unicode_binary() = binary() with characters coded in - a user specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32) + unicode_binary() | nil())</code> + + <p>The module <seealso marker="stdlib:unicode"><c>unicode</c></seealso> + even supports similar mixes with binaries containing other encodings than + UTF-8, but that is a special case to allow for conversions to and from + external data:</p> + + <code type="none"> +external_unicode_binary() = binary() with characters coded in a user-specified + Unicode encoding other than UTF-8 (UTF-16 or UTF-32) external_chardata() = external_charlist() | external_unicode_binary() -external_charlist() = maybe_improper_list(char() | - external_unicode_binary() | - external_charlist(), - external_unicode_binary() | nil())</code> -</section> -<section> - <title>Basic Language Support</title> - <p><marker id="unicode_in_erlang"/>As of Erlang/OTP R16 Erlang - source files can be written in either UTF-8 or bytewise encoding - (a.k.a. <c>latin1</c> encoding). The details on how to state the encoding - of an Erlang source file can be found in - <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso>. Strings and comments - can be written using Unicode, but functions still have to be named - using characters from the ISO-latin-1 character set and atoms are - restricted to the same ISO-latin-1 range. These restrictions in the - language are of course independent of the encoding of the source - file.</p> +external_charlist() = maybe_improper_list(char() | external_unicode_binary() | + external_charlist(), external_unicode_binary() | nil())</code> + </section> + <section> - <title>Bit-syntax</title> - <p>The bit-syntax contains types for coping with binary data in the - three main encodings. The types are named <c>utf8</c>, <c>utf16</c> - and <c>utf32</c> respectively. The <c>utf16</c> and <c>utf32</c> types - can be in a big- or little-endian variant:</p> - <code> + <title>Basic Language Support</title> + <p><marker id="unicode_in_erlang"/>As from Erlang/OTP R16, Erlang source + files can be written in UTF-8 or bytewise (<c>latin1</c>) encoding. For + information about how to state the encoding of an Erlang source file, see + the <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso> module. + Strings and comments can be written using Unicode, but functions must + still be named using characters from the ISO Latin-1 character set, and + atoms are restricted to the same ISO Latin-1 range. These restrictions in + the language are of course independent of the encoding of the source + file.</p> + + <section> + <title>Bit Syntax</title> + <p>The bit syntax contains types for handling binary data in the + three main encodings. 
The types are named <c>utf8</c>, <c>utf16</c>, + and <c>utf32</c>. The <c>utf16</c> and <c>utf32</c> types can be in a + big-endian or a little-endian variant:</p> + + <code> <<Ch/utf8,_/binary>> = Bin1, <<Ch/utf16-little,_/binary>> = Bin2, Bin3 = <<$H/utf32-little, $e/utf32-little, $l/utf32-little, $l/utf32-little, $o/utf32-little>>,</code> - <p>For convenience, literal strings can be encoded with a Unicode - encoding in binaries using the following (or similar) syntax:</p> - <code> + + <p>For convenience, literal strings can be encoded with a Unicode + encoding in binaries using the following (or similar) syntax:</p> + + <code> Bin4 = <<"Hello"/utf16>>,</code> - </section> - <section> - <title>String and Character Literals</title> - <p>For source code, there is an extension to the <c>\</c>OOO - (backslash followed by three octal numbers) and <c>\x</c>HH - (backslash followed by <c>x</c>, followed by two hexadecimal - characters) syntax, namely <c>\x{</c>H ...<c>}</c> (a backslash - followed by an <c>x</c>, followed by left curly bracket, any - number of hexadecimal digits and a terminating right curly - bracket). This allows for entering characters of any code point - literally in a string even when the encoding of the source file is - bytewise (<c>latin1</c>).</p> - <p>In the shell, if using a Unicode input device, or in source - code stored in UTF-8, <c>$</c> can be followed directly by a - Unicode character producing an integer. In the following example - the code point of a Cyrillic <c>с</c> is output:</p> - <pre> + </section> + + <section> + <title>String and Character Literals</title> + <p>For source code, there is an extension to syntax <c>\</c>OOO + (backslash followed by three octal numbers) and <c>\x</c>HH (backslash + followed by <c>x</c>, followed by two hexadecimal characters), namely + <c>\x{</c>H ...<c>}</c> (backslash followed by <c>x</c>, followed by + left curly bracket, any number of hexadecimal digits, and a terminating + right curly bracket). This allows for entering characters of any code + point literally in a string even when the encoding of the source file + is bytewise (<c>latin1</c>).</p> + + <p>In the shell, if using a Unicode input device, or in source code + stored in UTF-8, <c>$</c> can be followed directly by a Unicode + character producing an integer. In the following example, the code + point of a Cyrillic <c>с</c> is output:</p> + + <pre> 7> <input>$с.</input> 1089</pre> - </section> - <section> - <title>Heuristic String Detection</title> - <p>In certain output functions and in the output of return values - in the shell, Erlang tries to heuristically detect string data in - lists and binaries. Typically you will see heuristic detection in - a situation like this:</p> - <pre> + </section> + + <section> + <title>Heuristic String Detection</title> + <p>In certain output functions and in the output of return values in + the shell, Erlang tries to detect string data in lists and binaries + heuristically. Typically you will see heuristic detection in a + situation like this:</p> + + <pre> 1> <input>[97,98,99].</input> "abc" 2> <input><<97,98,99>>.</input> <<"abc">> 3> <input><<195,165,195,164,195,182>>.</input> <<"åäö"/utf8>></pre> - <p>Here the shell will detect lists containing printable - characters or binaries containing printable characters either in - bytewise or UTF-8 encoding. The question here is: what is a - printable character? 
One view would be that anything the Unicode - standard thinks is printable, will also be printable according to - the heuristic detection. The result would be that almost any list - of integers will be deemed a string, resulting in all sorts of - characters being printed, maybe even characters your terminal does - not have in its font set (resulting in some generic output you - probably will not appreciate). Another way is to keep it backwards - compatible so that only the ISO-Latin-1 character set is used to - detect a string. A third way would be to let the user decide - exactly what Unicode ranges are to be viewed as characters. Since - Erlang/OTP R16B you can select either the whole Unicode range or the - ISO-Latin-1 range by supplying the startup flag <c>+pc - </c><i>Range</i>, where <i>Range</i> is either <c>latin1</c> or - <c>unicode</c>. For backwards compatibility, the default is - <c>latin1</c>. This only controls how heuristic string detection - is done. In the future, more ranges are expected to be added, so - that one can tailor the heuristics to the language and region - relevant to the user.</p> - <p>Lets look at an example with the two different startup options:</p> -<pre> + + <p>Here the shell detects lists containing printable characters or + binaries containing printable characters in bytewise or UTF-8 encoding. + But what is a printable character? One view is that anything the Unicode + standard thinks is printable, is also printable according to the + heuristic detection. The result is then that almost any list of + integers are deemed a string, and all sorts of characters are printed, + maybe also characters that your terminal lacks in its font set + (resulting in some unappreciated generic output). + Another way is to keep it backward compatible so that only the ISO + Latin-1 character set is used to detect a string. A third way is to let + the user decide exactly what Unicode ranges that are to be viewed as + characters.</p> + + <p>As from Erlang/OTP R16B you can select the ISO Latin-1 range or the + whole Unicode range by supplying startup flag <c>+pc latin1</c> or + <c>+pc unicode</c>, respectively. For backward compatibility, + <c>latin1</c> is default. This only controls how heuristic string + detection is done. More ranges are expected to be added in the future, + enabling tailoring of the heuristics to the language and region + relevant to the user.</p> + + <p>The following examples show the two startup options:</p> + + <pre> $ <input>erl +pc latin1</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -467,9 +535,9 @@ Eshell V5.10.1 (abort with ^G) 4> <input><<208,174,208,189,208,184,208,186,208,190,208,180>>.</input> <<208,174,208,189,208,184,208,186,208,190,208,180>> 5> <input><<229/utf8,228/utf8,246/utf8>>.</input> -<<"åäö"/utf8>> -</pre> -<pre> +<<"åäö"/utf8>></pre> + + <pre> $ <input>erl +pc unicode</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -483,78 +551,88 @@ Eshell V5.10.1 (abort with ^G) 4> <input><<208,174,208,189,208,184,208,186,208,190,208,180>>.</input> <<"Юникод"/utf8>> 5> <input><<229/utf8,228/utf8,246/utf8>>.</input> -<<"åäö"/utf8>> -</pre> - <p>In the examples, we can see that the default Erlang shell will - only interpret characters from the ISO-Latin1 range as printable - and will only detect lists or binaries with those "printable" - characters as containing string data. The valid UTF-8 binary - containing "Юникод", will not be printed as a string. 
When, on the - other hand, started with all Unicode characters printable (<c>+pc - unicode</c>), the shell will output anything containing printable - Unicode data (in binaries either UTF-8 or bytewise encoded) as - string data.</p> - - <p>These heuristics are also used by - <c>io</c>(<c>_lib</c>)<c>:format/2</c> and friends when the - <c>t</c> modifier is used in conjunction with <c>~p</c> or - <c>~P</c>:</p> -<pre> +<<"åäö"/utf8>></pre> + + <p>In the examples, you can see that the default Erlang shell interprets + only characters from the ISO Latin1 range as printable and only detects + lists or binaries with those "printable" characters as containing + string data. The valid UTF-8 binary containing the Russian word + "Юникод", is not printed as a string. When started with all Unicode + characters printable (<c>+pc unicode</c>), the shell outputs anything + containing printable Unicode data (in binaries, either UTF-8 or + bytewise encoded) as string data.</p> + + <p>These heuristics are also used by + <seealso marker="stdlib:io#format/2"><c>io:format/2</c></seealso>, + <seealso marker="stdlib:io_lib#format/2"><c>io_lib:format/2</c></seealso>, + and friends when modifier <c>t</c> is used with <c>~p</c> or + <c>~P</c>:</p> + + <pre> $ <input>erl +pc latin1</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> <input>io:format("~tp~n",[{<<"åäö">>, <<"åäö"/utf8>>, <<208,174,208,189,208,184,208,186,208,190,208,180>>}]).</input> {<<"åäö">>,<<"åäö"/utf8>>,<<208,174,208,189,208,184,208,186,208,190,208,180>>} -ok -</pre> -<pre> +ok</pre> + + <pre> $ <input>erl +pc unicode</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> <input>io:format("~tp~n",[{<<"åäö">>, <<"åäö"/utf8>>, <<208,174,208,189,208,184,208,186,208,190,208,180>>}]).</input> {<<"åäö">>,<<"åäö"/utf8>>,<<"Юникод"/utf8>>} -ok -</pre> - <p>Please observe that this only affects <i>heuristic</i> interpretation - of lists and binaries on output. For example the <c>~ts</c> format - sequence does always output a valid lists of characters, - regardless of the <c>+pc</c> setting, as the programmer has - explicitly requested string output.</p> +ok</pre> + + <p>Notice that this only affects <em>heuristic</em> interpretation of + lists and binaries on output. For example, the <c>~ts</c> format + sequence always outputs a valid list of characters, regardless of the + <c>+pc</c> setting, as the programmer has explicitly requested string + output.</p> + </section> </section> -</section> -<section> - <title>The Interactive Shell</title> - <p>The interactive Erlang shell, when started towards a terminal or - started using the <c>werl</c> command on windows, can support - Unicode input and output.</p> - <p>On Windows, proper operation requires that a suitable font - is installed and selected for the Erlang application to use. If no - suitable font is available on your system, try installing the DejaVu - fonts (<c>dejavu-fonts.org</c>), which are freely available and then - select that font in the Erlang shell application.</p> - <p>On Unix-like operating systems, the terminal should be able - to handle UTF-8 on input and output (modern versions of XTerm, KDE - konsole and the Gnome terminal do for example) and your locale - settings have to be proper. 
As an example, my <c>LANG</c> - environment variable is set as this:</p> - <pre> + + <section> + <title>The Interactive Shell</title> + <p>The interactive Erlang shell, when started to a terminal or started + using command <c>werl</c> on Windows, can support Unicode input and + output.</p> + + <p>On Windows, proper operation requires that a suitable font is + installed and selected for the Erlang application to use. If no suitable + font is available on your system, try installing the + <url href="http://dejavu-fonts.org">DejaVu fonts</url>, which are freely + available, and then select that font in the Erlang shell application.</p> + + <p>On Unix-like operating systems, the terminal is to be able to handle + UTF-8 on input and output (this is done by, for example, modern versions + of XTerm, KDE Konsole, and the Gnome terminal) + and your locale settings must be proper. As + an example, a <c>LANG</c> environment variable can be set as follows:</p> + + <pre> $ <input>echo $LANG</input> en_US.UTF-8</pre> - <p>Actually, most systems handle the <c>LC_CTYPE</c> variable before - <c>LANG</c>, so if that is set, it has to be set to - <c>UTF-8</c>:</p> - <pre> + + <p>Most systems handle variable <c>LC_CTYPE</c> before <c>LANG</c>, so if + that is set, it must be set to <c>UTF-8</c>:</p> + + <pre> $ echo <input>$LC_CTYPE</input> en_US.UTF-8</pre> - <p>The <c>LANG</c> or <c>LC_CTYPE</c> setting should be consistent - with what the terminal is capable of, there is no portable way for - Erlang to ask the actual terminal about its UTF-8 capacity, we have - to rely on the language and character type settings.</p> - <p>To investigate what Erlang thinks about the terminal, the - <c>io:getopts()</c> call can be used when the shell is started:</p> - <pre> + + <p>The <c>LANG</c> or <c>LC_CTYPE</c> setting are to be consistent with + what the terminal is capable of. There is no portable way for Erlang to + ask the terminal about its UTF-8 capacity, we have to rely on the + language and character type settings.</p> + + <p>To investigate what Erlang thinks about the terminal, the call + <seealso marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso> + can be used when the shell is started:</p> + + <pre> $ <input>LC_CTYPE=en_US.ISO-8859-1 erl</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -571,27 +649,31 @@ Eshell V5.10.1 (abort with ^G) {encoding,unicode} 2></pre> - <p>When (finally?) everything is in order with the locale settings, - fonts and the terminal emulator, you probably also have discovered a - way to input characters in the script you desire. For testing, the - simplest way is to add some keyboard mappings for other languages, - usually done with some applet in your desktop environment. In my KDE - environment, I start the KDE Control Center (Personal Settings), - select "Regional and Accessibility" and then "Keyboard Layout". On - Windows XP, I start Control Panel->Regional and Language - Options, select the Language tab and click the Details... button in - the square named "Text services and input Languages". Your - environment probably provides similar means of changing the keyboard - layout. Make sure you have a way to easily switch back and forth - between keyboards if you are not used to this, entering commands - using a Cyrillic character set is, as an example, not easily done in - the Erlang shell.</p> - - <p>Now you are set up for some Unicode input and output. 
The - simplest thing to do is of course to enter a string in the - shell:</p> - - <pre> + <p>When (finally?) everything is in order with the locale settings, fonts. + and the terminal emulator, you have probably found a way to input + characters in the script you desire. For testing, the simplest way is to + add some keyboard mappings for other languages, usually done with some + applet in your desktop environment.</p> + + <p>In a KDE environment, select <em>KDE Control Center (Personal + Settings)</em> > <em>Regional and Accessibility</em> > <em>Keyboard + Layout</em>.</p> + + <p>On Windows XP, select <em>Control Panel</em> > <em>Regional and Language + Options</em>, select tab <em>Language</em>, and click button + <em>Details...</em> in the square named <em>Text Services and Input + Languages</em>.</p> + + <p>Your environment + probably provides similar means of changing the keyboard layout. Ensure + that you have a way to switch back and forth between keyboards easily if + you are not used to this. For example, entering commands using a Cyrillic + character set is not easily done in the Erlang shell.</p> + + <p>Now you are set up for some Unicode input and output. The simplest thing + to do is to enter a string in the shell:</p> + + <pre> $ <input>erl</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -603,12 +685,13 @@ Eshell V5.10.1 (abort with ^G) 3> <input>io:format("~ts~n", [v(2)]).</input> Юникод ok -4> </pre> - <p>While strings can be input as Unicode characters, the language - elements are still limited to the ISO-latin-1 character set. Only - character constants and strings are allowed to be beyond that - range:</p> - <pre> +4></pre> + + <p>While strings can be input as Unicode characters, the language elements + are still limited to the ISO Latin-1 character set. Only character + constants and strings are allowed to be beyond that range:</p> + + <pre> $ <input>erl</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -618,371 +701,398 @@ Eshell V5.10.1 (abort with ^G) 2> <input>Юникод.</input> * 1: illegal character 2> </pre> -</section> -<section> - <title>Unicode File Names</title> - <marker id="unicode_file_names"/> - <p>Most modern operating systems support Unicode file names in some - way or another. There are several different ways to do this and - Erlang by default treats the different approaches differently:</p> - <taglist> - <tag>Mandatory Unicode file naming</tag> - <item> - <p>Windows and, for most common uses, MacOS X enforces Unicode - support for file names. All files created in the file system have - names that can consistently be interpreted. In MacOS X, all file - names are retrieved in UTF-8 encoding, while Windows has - selected an approach where each system call handling file names - has a special Unicode aware variant, giving much the same - effect. There are no file names on these systems that are not - Unicode file names, why the default behavior of the Erlang VM is - to work in "Unicode file name translation mode", - meaning that a file name can be given as a Unicode list and that - will be automatically translated to the proper name encoding for - the underlying operating and file system.</p> - <p>Doing i.e. 
a <c>file:list_dir/1</c> on one of these systems - may return Unicode lists with code points beyond 255, depending - on the content of the actual file system.</p> - <p>As the feature is fairly new, you may still stumble upon non - core applications that cannot handle being provided with file - names containing characters with code points larger than 255, but - the core Erlang system should have no problems with Unicode file - names.</p> - </item> - <tag>Transparent file naming</tag> - <item> - <p>Most Unix operating systems have adopted a simpler approach, - namely that Unicode file naming is not enforced, but by - convention. Those systems usually use UTF-8 encoding for Unicode - file names, but do not enforce it. On such a system, a file name - containing characters having code points between 128 and 255 may - be named either as plain ISO-latin-1 or using UTF-8 encoding. As - no consistency is enforced, the Erlang VM can do no consistent - translation of all file names.</p> - - <p>By default on such systems, Erlang starts in <c>utf8</c> file - name mode if the terminal supports UTF-8, otherwise in - <c>latin1</c> mode.</p> - - <p>In the <c>latin1</c> mode, file names are bytewise endcoded. - This allows for list representation of all file names in - the system, but, for example, a file named "Östersund.txt", will - appear in <c>file:list_dir/1</c> as either "Östersund.txt" (if - the file name was encoded in bytewise ISO-Latin-1 by the program - creating the file, or more probably as - <c>[195,150,115,116,101,114,115,117,110,100]</c>, which is a - list containing UTF-8 bytes - not what you would want... If you - on the other hand use Unicode file name translation on such a - system, non-UTF-8 file names will simply be ignored by functions - like <c>file:list_dir/1</c>. They can be retrieved with - <c>file:list_dir_all/1</c>, but wrongly encoded file names will - appear as "raw file names".</p> - - </item> - </taglist> - - <p>The Unicode file naming support was introduced with Erlang/OTP - R14B01. A VM operating in Unicode file name translation mode can - work with files having names in any language or character set (as - long as it is supported by the underlying OS and file system). The - Unicode character list is used to denote file or directory names and - if the file system content is listed, you will also get - Unicode lists as return value. The support lies in the Kernel and - STDLIB modules, why most applications (that does not explicitly - require the file names to be in the ISO-latin-1 range) will benefit - from the Unicode support without change.</p> - - <p>On operating systems with mandatory Unicode file names, this - means that you more easily conform to the file names of other (non - Erlang) applications, and you can also process file names that, at - least on Windows, were completely inaccessible (due to having names - that could not be represented in ISO-latin-1). Also you will avoid - creating incomprehensible file names on MacOS X as the vfs layer of - the OS will accept all your file names as UTF-8 and will not rewrite - them.</p> - - <p>For most systems, turning on Unicode file name translation is no - problem even if it uses transparent file naming. Very few systems - have mixed file name encodings. A consistent UTF-8 named system will - work perfectly in Unicode file name mode. It was still however - considered experimental in Erlang/OTP R14B01 and is still not the default on - such systems. 
Unicode file name translation is turned on with the - <c>+fnu</c> switch to the On Linux, a VM started without explicitly - stating the file name translation mode will default to <c>latin1</c> - as the native file name encoding. On Windows and MacOS X, the - default behavior is that of Unicode file name translation, why the - <c>file:native_name_encoding/0</c> by default returns <c>utf8</c> on - those systems (the fact that Windows actually does not use UTF-8 on - the file system level can safely be ignored by the Erlang - programmer). The default behavior can, as stated before, be - changed using the <c>+fnu</c> or <c>+fnl</c> options to the VM, see - the <seealso marker="erts:erl"><c>erl</c></seealso> program. If the - VM is started in Unicode file name translation mode, - <c>file:native_name_encoding/0</c> will return the atom - <c>utf8</c>. The <c>+fnu</c> switch can be followed by <c>w</c>, - <c>i</c> or <c>e</c>, to control how wrongly encoded file names are - to be reported. <c>w</c> means that a warning is sent to the - <c>error_logger</c> whenever a wrongly encoded file name is - "skipped" in directory listings, <c>i</c> means that those wrongly - encoded file names are silently ignored and <c>e</c> means that the - API function will return an error whenever a wrongly encoded file - (or directory) name is encountered. <c>w</c> is the default. Note - that <c>file:read_link/1</c> will always return an error if the link - points to an invalid file name.</p> - - <p>In Unicode file name mode, file names given to the BIF - <c>open_port/2</c> with the option <c>{spawn_executable,...}</c> are - also interpreted as Unicode. So is the parameter list given in the - <c>args</c> option available when using <c>spawn_executable</c>. The - UTF-8 translation of arguments can be avoided using binaries, see - the discussion about raw file names below.</p> - - <p>It is worth noting that the file <c>encoding</c> options given - when opening a file has nothing to do with the file <em>name</em> - encoding convention. You can very well open files containing data - encoded in UTF-8 but having file names in bytewise (<c>latin1</c>) encoding - or vice versa.</p> - - <note><p>Erlang drivers and NIF shared objects still can not be - named with names containing code points beyond 127. This is a known - limitation to be removed in a future release. Erlang modules however - can, but it is definitely not a good idea and is still considered - experimental.</p></note> - -<section> - <title>Notes About Raw File Names</title> - <marker id="notes-about-raw-filenames"/> - <p>Raw file names were introduced together with Unicode file name - support in erts-5.8.2 (Erlang/OTP R14B01). The reason "raw file - names" was introduced in the system was to be able to - consistently represent file names given in different encodings on - the same system. Having the VM automatically translate a file name - that is not in UTF-8 to a list of Unicode characters might seem - practical, but this would open up for both duplicate file names and - other inconsistent behavior. Consider a directory containing a file - named "björn" in ISO-latin-1, while the Erlang VM is - operating in Unicode file name mode (and therefore expecting UTF-8 - file naming). The ISO-latin-1 name is not valid UTF-8 and one could - be tempted to think that automatic conversion in for example - <c>file:list_dir/1</c> is a good idea. 
But what would happen if we - later tried to open the file and have the name as a Unicode list - (magically converted from the ISO-latin-1 file name)? The VM will - convert the file name given to UTF-8, as this is the encoding - expected. Effectively this means trying to open the file named - <<"björn"/utf8>>. This file does not exist, - and even if it existed it would not be the same file as the one that - was listed. We could even create two files named "björn", - one named in the UTF-8 encoding and one not. If - <c>file:list_dir/1</c> would automatically convert the ISO-latin-1 - file name to a list, we would get two identical file names as the - result. To avoid this, we need to differentiate between file names - being properly encoded according to the Unicode file naming - convention (i.e. UTF-8) and file names being invalid under the - encoding. By the common <c>file:list_dir/1</c> function, the wrongly - encoded file names are simply ignored in Unicode file name - translation mode, but by the <c>file:list_dir_all/1</c> function, - the file names with invalid encoding are returned as "raw" - file names, i.e. as binaries.</p> - - <p>The Erlang <c>file</c> module accepts raw file names as - input. <c>open_port({spawn_executable, ...} ...)</c> also accepts - them. As mentioned earlier, the arguments given in the option list - to <c>open_port({spawn_executable, ...} ...)</c> undergo the same - conversion as the file names, meaning that the executable will be - provided with arguments in UTF-8 as well. This translation is - avoided consistently with how the file names are treated, by giving - the argument as a binary.</p> - - <p>To force Unicode file name translation mode on systems where this - is not the default was considered experimental in Erlang/OTP R14B01 due to - the fact that the initial implementation did not ignore wrongly - encoded file names, so that raw file names could spread unexpectedly - throughout the system. Beginning with Erlang/OTP R16B, the wrongly encoded file - names are only retrieved by special functions - (e.g. <c>file:list_dir_all/1</c>), so the impact on existing code is - much lower, why it is now supported. Unicode file name translation - is expected to be default in future releases.</p> - - <p>Even if you are operating without Unicode file naming translation - automatically done by the VM, you can access and create files with - names in UTF-8 encoding by using raw file names encoded as - UTF-8. Enforcing the UTF-8 encoding regardless of the mode the - Erlang VM is started in might, in some circumstances be a good idea, - as the convention of using UTF-8 file names is spreading.</p> -</section> -<section> - <title>Notes About MacOS X</title> - <p>MacOS X's vfs layer enforces UTF-8 file names in a quite - aggressive way. Older versions did this by simply refusing to create - non UTF-8 conforming file names, while newer versions replace - offending bytes with the sequence "%HH", where HH is the - original character in hexadecimal notation. As Unicode translation - is enabled by default on MacOS X, the only way to come up against - this is to either start the VM with the <c>+fnl</c> flag or to use a - raw file name in bytewise (<c>latin1</c>) encoding. If using a raw - filename, with a bytewise encoding containing characters between 127 - and 255, to create a file, the file can not be opened using the same - name as the one used to create it. 
There is no remedy for this - behaviour, other than keeping the file names in the right - encoding.</p> - - <p>MacOS X also reorganizes the names of files so that the - representation of accents etc is using the "combining characters", - i.e. the character <c>ö</c> is represented as the code points - [111,776], where 111 is the character <c>o</c> and 776 is the - special accent character "combining diaeresis". This way of - normalizing Unicode is otherwise very seldom used and Erlang - normalizes those file names in the opposite way upon retrieval, so - that file names using combining accents are not passed up to the - Erlang application. In Erlang the file name "björn" is - retrieved as [98,106,246,114,110], not as [98,106,117,776,114,110], - even though the file system might think differently. The - normalization into combining accents are redone when actually - accessing files, so this can usually be ignored by the Erlang - programmer.</p> -</section> -</section> -<section> - <title>Unicode in Environment and Parameters</title> - <marker id="unicode_in_environment_and_parameters"/> - <p>Environment variables and their interpretation is handled much in - the same way as file names. If Unicode file names are enabled, - environment variables as well as parameters to the Erlang VM are - expected to be in Unicode.</p> - <p>If Unicode file names are enabled, the calls to - <seealso marker="kernel:os#getenv/0"><c>os:getenv/0</c></seealso>, - <seealso marker="kernel:os#getenv/1"><c>os:getenv/1</c></seealso>, - <seealso marker="kernel:os#putenv/2"><c>os:putenv/2</c></seealso> and - <seealso marker="kernel:os#unsetenv/1"><c>os:unsetenv/1</c></seealso> - will handle Unicode strings. On Unix-like platforms, the built-in - functions will translate environment variables in UTF-8 to/from - Unicode strings, possibly with code points > 255. On Windows the - Unicode versions of the environment system API will be used, also - allowing for code points > 255.</p> - <p>On Unix-like operating systems, parameters are expected to be - UTF-8 without translation if Unicode file names are enabled.</p> -</section> -<section> - <title>Unicode-aware Modules</title> - <p>Most of the modules in Erlang/OTP are of course Unicode-unaware - in the sense that they have no notion of Unicode and really should - not have. Typically they handle non-textual or byte-oriented data - (like <c>gen_tcp</c> etc).</p> - <p>Modules that actually handle textual data (like <c>io_lib</c>, - <c>string</c> etc) are sometimes subject to conversion or extension - to be able to handle Unicode characters.</p> - <p>Fortunately, most textual data has been stored in lists and range - checking has been sparse, why modules like <c>string</c> works well - for Unicode lists with little need for conversion or extension.</p> - <p>Some modules are however changed to be explicitly - Unicode-aware. These modules include:</p> - <taglist> - <tag><c>unicode</c></tag> - <item> - <p>The module <seealso marker="stdlib:unicode"><c>unicode</c></seealso> - is obviously Unicode-aware. It contains functions for conversion - between different Unicode formats as well as some utilities for - identifying byte order marks. Few programs handling Unicode data - will survive without this module.</p> - </item> - <tag><c>io</c></tag> - <item> - <p>The <seealso marker="stdlib:io"><c>io</c></seealso> module has been - extended along with the actual I/O-protocol to handle Unicode - data. 
This means that several functions require binaries to be - in UTF-8 and there are modifiers to formatting control sequences - to allow for outputting of Unicode strings.</p> - </item> - <tag><c>file</c>, <c>group</c>, <c>user</c></tag> - <item> - <p>I/O-servers throughout the system are able to handle - Unicode data and has options for converting data upon actual - output or input to/from the device. As shown earlier, the - <seealso marker="stdlib:shell"><c>shell</c></seealso> has support for - Unicode terminals and the <seealso - marker="kernel:file"><c>file</c></seealso> module allows for - translation to and from various Unicode formats on disk.</p> - <p>The actual reading and writing of files with Unicode data is - however not best done with the <c>file</c> module as its - interface is byte oriented. A file opened with a Unicode - encoding (like UTF-8), is then best read or written using the - <seealso marker="stdlib:io"><c>io</c></seealso> module.</p> - </item> - <tag><c>re</c></tag> - <item> - <p>The <seealso marker="stdlib:re"><c>re</c></seealso> module allows - for matching Unicode strings as a special option. As the library - is actually centered on matching in binaries, the Unicode - support is UTF-8-centered.</p> - </item> - <tag><c>wx</c></tag> - <item> - <p>The <seealso marker="wx:wx"><c>wx</c></seealso> graphical library - has extensive support for Unicode text</p> - </item> - </taglist> - <p>The module <seealso - marker="stdlib:string"><c>string</c></seealso> works perfectly for - Unicode strings as well as for ISO-latin-1 strings with the - exception of the language-dependent <seealso - marker="stdlib:string#to_upper/1"><c>to_upper</c></seealso> and - <seealso marker="stdlib:string#to_lower/1"><c>to_lower</c></seealso> - functions, which are only correct for the ISO-latin-1 character - set. Actually they can never function correctly for Unicode - characters in their current form, as there are language and locale - issues as well as multi-character mappings to consider when - converting text between cases. Converting case in an international - environment is a big subject not yet addressed in OTP.</p> -</section> -<section> - <title>Unicode Data in Files</title> - <p>The fact that Erlang as such can handle Unicode data in many forms - does not automatically mean that the content of any file can be - Unicode text. The external entities such as ports or I/O-servers are - not generally Unicode capable.</p> - <p>Ports are always byte oriented, so before sending data that you - are not sure is bytewise encoded to a port, make sure to encode it - in a proper Unicode encoding. Sometimes this will mean that only - part of the data shall be encoded as e.g. UTF-8, some parts may be - binary data (like a length indicator) or something else that shall - not undergo character encoding, so no automatic translation is - present.</p> - <p>I/O-servers behave a little differently. The I/O-servers connected - to terminals (or stdout) can usually cope with Unicode data - regardless of the <c>encoding</c> option. This is convenient when - one expects a modern environment but do not want to crash when - writing to a archaic terminal or pipe. Files on the other hand are - more picky. A file can have an encoding option which makes it - generally usable by the io-module (e.g. <c>{encoding,utf8}</c>), but - is by default opened as a byte oriented file. The <seealso - marker="kernel:file"><c>file</c></seealso> module is byte oriented, why only - ISO-Latin-1 characters can be written using that module. 
The - <seealso marker="stdlib:io"><c>io</c></seealso> module is the one to use if - Unicode data is to be output to a file with other <c>encoding</c> - than <c>latin1</c> (a.k.a. bytewise encoding). It is slightly - confusing that a file opened with - e.g. <c>file:open(Name,[read,{encoding,utf8}])</c>, cannot be - properly read using <c>file:read(File,N)</c> but you have to use the - <c>io</c> module to retrieve the Unicode data from it. The reason is - that <c>file:read</c> and <c>file:write</c> (and friends) are purely - byte oriented, and should so be, as that is the way to access - files other than text files - byte by byte. Just as with ports, you - can of course write encoded data into a file by "manually" converting - the data to the encoding of choice (using the <seealso - marker="stdlib:unicode"><c>unicode</c></seealso> module or the bit syntax) - and then output it on a bytewise encoded (<c>latin1</c>) file.</p> - <p>The rule of thumb is that the <seealso - marker="kernel:file"><c>file</c></seealso> module should be used for files - opened for bytewise access (<c>{encoding,latin1}</c>) and the - <seealso marker="stdlib:io"><c>io</c></seealso> module should be used when - accessing files with any other encoding - (e.g. <c>{encoding,uf8}</c>).</p> - - <p>Functions reading Erlang syntax from files generally recognize - the <c>coding:</c> comment and can therefore handle Unicode data on - input. When writing Erlang Terms to a file, you should insert - such comments when applicable:</p> - <pre> + </section> + + <section> + <title>Unicode Filenames</title> + <marker id="unicode_file_names"/> + <p>Most modern operating systems support Unicode filenames in some way. + There are many different ways to do this and Erlang by default treats the + different approaches differently:</p> + + <taglist> + <tag>Mandatory Unicode file naming</tag> + <item> + <p>Windows and, for most common uses, MacOS X enforce Unicode support + for filenames. All files created in the file system have names that + can consistently be interpreted. In MacOS X, all filenames are + retrieved in UTF-8 encoding. In Windows, each system call handling + filenames has a special Unicode-aware variant, giving much the same + effect. There are no filenames on these systems that are not Unicode + filenames. So, the default behavior of the Erlang VM is to work in + "Unicode filename translation mode". This means that a + filename can be specified as a Unicode list, which is automatically + translated to the proper name encoding for the underlying operating + system and file system.</p> + <p>Doing, for example, a + <seealso marker="kernel:file#list_dir/1"><c>file:list_dir/1</c></seealso> + on one of these systems can return Unicode lists with code points + > 255, depending on the content of the file system.</p> + </item> + <tag>Transparent file naming</tag> + <item> + <p>Most Unix operating systems have adopted a simpler approach, namely + that Unicode file naming is not enforced, but by convention. Those + systems usually use UTF-8 encoding for Unicode filenames, but do not + enforce it. On such a system, a filename containing characters with + code points from 128 through 255 can be named as plain ISO Latin-1 or + use UTF-8 encoding. 
As no consistency is enforced, the Erlang VM + cannot do consistent translation of all filenames.</p> + <p>By default on such systems, Erlang starts in <c>utf8</c> filename + mode if the terminal supports UTF-8, otherwise in <c>latin1</c> + mode.</p> + <p>In <c>latin1</c> mode, filenames are bytewise encoded. This allows + for list representation of all filenames in the system. However, + a file named &quot;Östersund.txt&quot; appears in + <seealso marker="kernel:file#list_dir/1"><c>file:list_dir/1</c></seealso> + either as "Östersund.txt" (if the filename was encoded in bytewise + ISO Latin-1 by the program creating the file) or more probably as + <c>[195,150,115,116,101,114,115,117,110,100]</c>, which is a list + containing UTF-8 bytes (not what you want). If you use Unicode + filename translation on such a system, non-UTF-8 filenames are + ignored by functions like <c>file:list_dir/1</c>. They can be + retrieved with function + <seealso marker="kernel:file#list_dir_all/1"><c>file:list_dir_all/1</c></seealso>, + but wrongly encoded filenames appear as &quot;raw filenames&quot;. + </p> + </item> + </taglist> + + <p>The Unicode file naming support was introduced in Erlang/OTP + R14B01. A VM operating in Unicode filename translation mode can + work with files having names in any language or character set (as + long as it is supported by the underlying operating system and + file system). The Unicode character list is used to denote + filenames or directory names. If the file system content is + listed, you also get Unicode lists as return value. The support + lies in the <c>Kernel</c> and <c>STDLIB</c> modules, which is why + most applications (that do not explicitly require the filenames + to be in the ISO Latin-1 range) benefit from the Unicode support + without change.</p> + + <p>On operating systems with mandatory Unicode filenames, this means that + you more easily conform to the filenames of other (non-Erlang) + applications. You can also process filenames that, at least on Windows, + were inaccessible (because of having names that could not be represented + in ISO Latin-1). Also, you avoid creating incomprehensible filenames + on MacOS X, as the <c>vfs</c> layer of the operating system accepts all + your filenames as UTF-8 and does not rewrite them.</p> + + <p>For most systems, turning on Unicode filename translation is no problem + even if it uses transparent file naming. Very few systems have mixed + filename encodings. A consistent UTF-8 named system works perfectly in + Unicode filename mode. It was still, however, considered experimental in + Erlang/OTP R14B01 and is still not the default on such systems.</p> + + <p>Unicode filename translation is turned on with switch <c>+fnu</c>. On + Linux, a VM started without explicitly stating the filename translation + mode defaults to <c>latin1</c> as the native filename encoding. On + Windows and MacOS X, the default behavior is that of Unicode filename + translation. Therefore + <seealso marker="kernel:file#native_name_encoding/0"><c>file:native_name_encoding/0</c></seealso> + by default returns <c>utf8</c> on those systems (Windows does not use + UTF-8 on the file system level, but this can safely be ignored by the + Erlang programmer). The default behavior can, as stated earlier, be + changed using option <c>+fnu</c> or <c>+fnl</c> to the VM, see the + <seealso marker="erts:erl"><c>erl</c></seealso> program. If the VM is + started in Unicode filename translation mode, + <c>file:native_name_encoding/0</c> returns atom <c>utf8</c>.
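<p>As a minimal sketch (the function name below is only an example), a program that depends on Unicode filenames can check the translation mode at runtime and warn if bytewise naming is in effect:</p>
<code>
%% Sketch only: warn if the VM is not in Unicode filename translation
%% mode. file:native_name_encoding/0 returns latin1 or utf8.
warn_if_bytewise_filenames() ->
    case file:native_name_encoding() of
        utf8 ->
            ok;
        latin1 ->
            io:format("Warning: bytewise (latin1) filename mode; "
                      "Unicode filenames are not translated~n")
    end.</code>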
Switch + <c>+fnu</c> can be followed by <c>w</c>, <c>i</c>, or <c>e</c> to control + how wrongly encoded filenames are to be reported.</p> + + <list type="bulleted"> + <item> + <p><c>w</c> means that a warning is sent to the <c>error_logger</c> + whenever a wrongly encoded filename is "skipped" in directory + listings. <c>w</c> is the default.</p> + </item> + <item> + <p><c>i</c> means that wrongly encoded filenames are silently ignored. + </p> + </item> + <item> + <p><c>e</c> means that the API function returns an error whenever a + wrongly encoded filename (or directory name) is encountered.</p> + </item> + </list> + + <p>Notice that + <seealso marker="kernel:file#read_link/1"><c>file:read_link/1</c></seealso> + always returns an error if the link points to an invalid filename.</p> + + <p>In Unicode filename mode, filenames given to BIF <c>open_port/2</c> with + option <c>{spawn_executable,...}</c> are also interpreted as Unicode. So + is the parameter list specified in option <c>args</c> available when + using <c>spawn_executable</c>. The UTF-8 translation of arguments can be + avoided using binaries, see section + <seealso marker="#notes-about-raw-filenames">Notes About Raw Filenames</seealso>. + </p> + + <p>Notice that the file encoding options specified when opening a file has + nothing to do with the filename encoding convention. You can very well + open files containing data encoded in UTF-8, but having filenames in + bytewise (<c>latin1</c>) encoding or conversely.</p> + + <note><p>Erlang drivers and NIF-shared objects still cannot be named with + names containing code points > 127. This limitation will be removed in + a future release. However, Erlang modules can, but it is definitely not a + good idea and is still considered experimental.</p> + </note> + + <section> + <title>Notes About Raw Filenames</title> + <marker id="notes-about-raw-filenames"/> + <p>Raw filenames were introduced together with Unicode filename support + in <c>ERTS</c> 5.8.2 (Erlang/OTP R14B01). The reason "raw + filenames" were introduced in the system was + to be able to represent + filenames, specified in different encodings on the same system, + consistently. It can seem practical to have the VM automatically + translate a filename that is not in UTF-8 to a list of Unicode + characters, but this would open up for both duplicate filenames and + other inconsistent behavior.</p> + + <p>Consider a directory containing a file named "björn" in ISO + Latin-1, while the Erlang VM is operating in Unicode filename mode (and + therefore expects UTF-8 file naming). The ISO Latin-1 name is not valid + UTF-8 and one can be tempted to think that automatic conversion in, for + example, + <seealso marker="kernel:file#list_dir/1"><c>file:list_dir/1</c></seealso> + is a good idea. But what would happen if we later tried to open the file + and have the name as a Unicode list (magically converted from the ISO + Latin-1 filename)? The VM converts the filename to UTF-8, as this is + the encoding expected. Effectively this means trying to open the file + named <<"björn"/utf8>>. This file does not exist, + and even if it existed it would not be the same file as the one that was + listed. We could even create two files named "björn", one + named in UTF-8 encoding and one not. If <c>file:list_dir/1</c> would + automatically convert the ISO Latin-1 filename to a list, we would get + two identical filenames as the result. 
To avoid this, we must + differentiate between filenames that are properly encoded according to + the Unicode file naming convention (that is, UTF-8) and filenames that + are invalid under the encoding. By the common function + <c>file:list_dir/1</c>, the wrongly encoded filenames are ignored in + Unicode filename translation mode, but by function + <seealso marker="kernel:file#list_dir_all/1"><c>file:list_dir_all/1</c></seealso> + the filenames with invalid encoding are returned as "raw" + filenames, that is, as binaries.</p> + + <p>The <c>file</c> module accepts raw filenames as input. + <c>open_port({spawn_executable, ...} ...)</c> also accepts them. As + mentioned earlier, the arguments specified in the option list to + <c>open_port({spawn_executable, ...} ...)</c> undergo the same + conversion as the filenames, meaning that the executable is provided + with arguments in UTF-8 as well. This translation is avoided + consistently with how the filenames are treated, by giving the argument + as a binary.</p> + + <p>To force Unicode filename translation mode on systems where this is not + the default was considered experimental in Erlang/OTP R14B01. This was + because the initial implementation did not ignore wrongly encoded + filenames, so that raw filenames could spread unexpectedly throughout + the system. As from Erlang/OTP R16B, the wrongly encoded + filenames are only retrieved by special functions (such as + <c>file:list_dir_all/1</c>). Since the impact on existing code is + therefore much lower it is now supported. + Unicode filename translation is + expected to be default in future releases.</p> + + <p>Even if you are operating without Unicode file naming translation + automatically done by the VM, you can access and create files with + names in UTF-8 encoding by using raw filenames encoded as UTF-8. + Enforcing the UTF-8 encoding regardless of the mode the Erlang VM is + started in can in some circumstances be a good idea, as the convention + of using UTF-8 filenames is spreading.</p> + </section> + + <section> + <title>Notes About MacOS X</title> + <p>The <c>vfs</c> layer of MacOS X enforces UTF-8 filenames in an + aggressive way. Older versions did this by refusing to create non-UTF-8 + conforming filenames, while newer versions replace offending bytes with + the sequence "%HH", where HH is the original character in + hexadecimal notation. As Unicode translation is enabled by default on + MacOS X, the only way to come up against this is to either start the VM + with flag <c>+fnl</c> or to use a raw filename in bytewise + (<c>latin1</c>) encoding. If using a raw filename, with a bytewise + encoding containing characters from 127 through 255, to create a file, + the file cannot be opened using the same name as the one used to create + it. There is no remedy for this behavior, except keeping the filenames + in the correct encoding.</p> + + <p>MacOS X reorganizes the filenames so that the representation of + accents, and so on, uses the "combining characters". For example, + character <c>ö</c> is represented as code points <c>[111,776]</c>, + where <c>111</c> is character <c>o</c> and <c>776</c> is the special + accent character "Combining Diaeresis". This way of normalizing Unicode + is otherwise very seldom used. Erlang normalizes those filenames in the + opposite way upon retrieval, so that filenames using combining accents + are not passed up to the Erlang application. 
In Erlang, filename + "björn" is retrieved as <c>[98,106,246,114,110]</c>, not as + <c>[98,106,117,776,114,110]</c>, although the file system can think + differently. The normalization into combining accents is redone when + accessing files, so this can usually be ignored by the Erlang + programmer.</p> + </section> + </section> + + <section> + <title>Unicode in Environment and Parameters</title> + <marker id="unicode_in_environment_and_parameters"/> + <p>Environment variables and their interpretation are handled much in the + same way as filenames. If Unicode filenames are enabled, environment + variables as well as parameters to the Erlang VM are expected to be in + Unicode.</p> + + <p>If Unicode filenames are enabled, the calls to + <seealso marker="kernel:os#getenv/0"><c>os:getenv/0,1</c></seealso>, + <seealso marker="kernel:os#putenv/2"><c>os:putenv/2</c></seealso>, and + <seealso marker="kernel:os#unsetenv/1"><c>os:unsetenv/1</c></seealso> + handle Unicode strings. On Unix-like platforms, the built-in functions + translate environment variables in UTF-8 to/from Unicode strings, possibly + with code points > 255. On Windows, the Unicode versions of the + environment system API are used, and code points > 255 are allowed.</p> + <p>On Unix-like operating systems, parameters are expected to be UTF-8 + without translation if Unicode filenames are enabled.</p> + </section> + + <section> + <title>Unicode-Aware Modules</title> + <p>Most of the modules in Erlang/OTP are Unicode-unaware in the sense that + they have no notion of Unicode and should not have. Typically they handle + non-textual or byte-oriented data (such as <c>gen_tcp</c>).</p> + + <p>Modules handling textual data (such as + <seealso marker="stdlib:io_lib"><c>io_lib</c></seealso> and + <seealso marker="stdlib:string"><c>string</c></seealso> are sometimes + subject to conversion or extension to be able to handle Unicode + characters.</p> + + <p>Fortunately, most textual data has been stored in lists and range + checking has been sparse, so modules like <c>string</c> work well for + Unicode lists with little need for conversion or extension.</p> + + <p>Some modules are, however, changed to be explicitly Unicode-aware. These + modules include:</p> + + <taglist> + <tag><c>unicode</c></tag> + <item> + <p>The <seealso marker="stdlib:unicode"><c>unicode</c></seealso> + module is clearly Unicode-aware. It contains functions for conversion + between different Unicode formats and some utilities for identifying + byte order marks. Few programs handling Unicode data survive without + this module.</p> + </item> + <tag><c>io</c></tag> + <item> + <p>The <seealso marker="stdlib:io"><c>io</c></seealso> module has been + extended along with the actual I/O protocol to handle Unicode data. + This means that many functions require binaries to be in UTF-8, and + there are modifiers to format control sequences to allow for output + of Unicode strings.</p> + </item> + <tag><c>file</c>, <c>group</c>, <c>user</c></tag> + <item> + <p>I/O-servers throughout the system can handle Unicode data and have + options for converting data upon output or input to/from the device. 
+ As shown earlier, the + <seealso marker="stdlib:shell"><c>shell</c></seealso> module has + support for Unicode terminals and the + <seealso marker="kernel:file"><c>file</c></seealso> module + allows for translation to and from various Unicode formats on + disk.</p> + <p>Reading and writing of files with Unicode data is, however, not best + done with the <c>file</c> module, as its interface is + byte-oriented. A file opened with a Unicode encoding (like UTF-8) is + best read or written using the + <seealso marker="stdlib:io"><c>io</c></seealso> module.</p> + </item> + <tag><c>re</c></tag> + <item> + <p>The <seealso marker="stdlib:re"><c>re</c></seealso> module allows + for matching Unicode strings as a special option. As the library is + centered on matching in binaries, the Unicode support is + UTF-8-centered.</p> + </item> + <tag><c>wx</c></tag> + <item> + <p>The graphical library <seealso marker="wx:wx"><c>wx</c></seealso> + has extensive support for Unicode text.</p></item> + </taglist> + + <p>The <seealso marker="stdlib:string"><c>string</c></seealso> module works + perfectly for Unicode strings and ISO Latin-1 strings, except the + language-dependent functions + <seealso marker="stdlib:string#to_upper/1"><c>string:to_upper/1</c></seealso> + and + <seealso marker="stdlib:string#to_lower/1"><c>string:to_lower/1</c></seealso>, + which are only correct for the ISO Latin-1 character set. These two + functions can never function correctly for Unicode characters in their + current form, as there are language and locale issues as well as + multi-character mappings to consider when converting text between cases. + Converting case in an international environment is a large subject not + yet addressed in OTP.</p> + </section> + + <section> + <title>Unicode Data in Files</title> + <p>The fact that Erlang can handle Unicode data in many forms does not + automatically mean that the content of any file can be Unicode text. The + external entities, such as ports and I/O servers, are not generally + Unicode capable.</p> + + <p>Ports are always byte-oriented, so before sending data that you are not + sure is bytewise-encoded to a port, ensure that you encode it in a proper + Unicode encoding. Sometimes this means that only part of the data must + be encoded as, for example, UTF-8. Some parts can be binary data (like a + length indicator) or something else that must not undergo character + encoding, so no automatic translation is present.</p> + + <p>I/O servers behave a little differently. The I/O servers connected to + terminals (or <c>stdout</c>) can usually cope with Unicode data + regardless of the encoding option. This is convenient when one expects + a modern environment but does not want to crash when writing to an archaic + terminal or pipe.</p> + + <p>A file can have an encoding option that makes it generally usable by the + <seealso marker="stdlib:io"><c>io</c></seealso> module (for example + <c>{encoding,utf8}</c>), but is by default opened as a byte-oriented file. + The <seealso marker="kernel:file"><c>file</c></seealso> module is + byte-oriented, so only ISO Latin-1 characters can be written using that + module. Use the <c>io</c> module if Unicode data is to be output to a + file with an <c>encoding</c> other than <c>latin1</c> (bytewise encoding). + It is slightly confusing that a file opened with, for example, + <c>file:open(Name,[read,{encoding,utf8}])</c> cannot be properly read + using <c>file:read(File,N)</c>; the <c>io</c> module must be used to retrieve + the Unicode data from it.
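<p>A small sketch of this, assuming a text file that is known to be UTF-8 encoded (error handling omitted, the function name is only an example): the file is opened with an <c>encoding</c> option and then read through the <c>io</c> module, which returns decoded Unicode data rather than raw bytes:</p>
<code>
%% Sketch only; error handling omitted. Reads up to 1024 characters from
%% a UTF-8 encoded text file using the io module. Because the file is
%% opened in binary mode, the result is Unicode data as a binary (or eof).
read_some_utf8_text(FileName) ->
    {ok,F} = file:open(FileName,[read,binary,{encoding,utf8}]),
    Data = io:get_chars(F,'',1024),
    ok = file:close(F),
    Data.</code>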
The reason is that <c>file:read</c> and + <c>file:write</c> (and friends) are purely byte-oriented, and should be, + as that is the way to access files other than text files, byte by byte. + As with ports, you can write encoded data into a file by &quot;manually&quot; + converting the data to the encoding of choice (using the + <seealso marker="stdlib:unicode"><c>unicode</c></seealso> module or the + bit syntax) and then output it on a bytewise (<c>latin1</c>) encoded + file.</p> + + <p>Recommendations:</p> + + <list type="bulleted"> + <item><p>Use the + <seealso marker="kernel:file"><c>file</c></seealso> module for + files opened for bytewise access (<c>{encoding,latin1}</c>).</p> + </item> + <item><p>Use the <seealso marker="stdlib:io"><c>io</c></seealso> module + when accessing files with any other encoding (for example + <c>{encoding,utf8}</c>).</p> + </item> + </list> + + <p>Functions reading Erlang syntax from files recognize the <c>coding:</c> + comment and can therefore handle Unicode data on input. When writing + Erlang terms to a file, you are advised to insert such comments when + applicable:</p> + + <pre> $ <input>erl +fna +pc unicode</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -990,202 +1100,224 @@ Eshell V5.10.1 (abort with ^G) 1> <input>file:write_file("test.term",<<"%% coding: utf-8\n[{\"Юникод\",4711}].\n"/utf8>>).</input> ok 2> <input>file:consult("test.term").</input> -{ok,[[{"Юникод",4711}]]} - </pre> -</section> -<section> - <title>Summary of Options</title> - <marker id="unicode_options_summary"/> - <p>The Unicode support is controlled by both command line switches, - some standard environment variables and the version of OTP you are - using. Most options affect mainly the way Unicode data is displayed, - not the actual functionality of the API's in the standard - libraries. This means that Erlang programs usually do not - need to concern themselves with these options, they are more for the - development environment. An Erlang program can be written so that it - works well regardless of the type of system or the Unicode options - that are in effect.</p> - - <p>Here follows a summary of the settings affecting Unicode:</p> - <taglist> - <tag>The <c>LANG</c> and <c>LC_CTYPE</c> environment variables</tag> - <item> - <p>The language setting in the OS mainly affects the shell. The - terminal (i.e. the group leader) will operate with <c>{encoding, - unicode}</c> only if the environment tells it that UTF-8 is - allowed. This setting should correspond to the actual terminal - you are using.</p> - <p>The environment can also affect file name interpretation, if - Erlang is started with the <c>+fna</c> flag (which is default from - Erlang/OTP 17.0).</p> - <p>You can check the setting of this by calling - <c>io:getopts()</c>, which will give you an option list - containing <c>{encoding,unicode}</c> or - <c>{encoding,latin1}</c>.</p> - </item> - <tag>The <c>+pc </c>{<c>unicode</c>|<c>latin1</c>} flag to - <seealso marker="erts:erl"><c>erl(1)</c></seealso></tag> - <item> - <p>This flag affects what is interpreted as string data when - doing heuristic string detection in the shell and in - <c>io</c>/<c>io_lib:format</c> with the <c>"~tp"</c> and - <c>~tP</c> formatting instructions, as described above.</p> - <p>You can check this option by calling io:printable_range/0, - which will return <c>unicode</c> or <c>latin1</c>.
To be - compatible with future (expected) extensions to the settings, - one should rather use <c>io_lib:printable_list/1</c> to check if - a list is printable according to the setting. That function will - take into account new possible settings returned from - <c>io:printable_range/0</c>.</p> - </item> - <tag>The <c>+fn</c>{<c>l</c>|<c>a</c>|<c>u</c>} - [{<c>w</c>|<c>i</c>|<c>e</c>}] - flag to <seealso marker="erts:erl"><c>erl(1)</c></seealso></tag> - <item> - <p>This flag affects how the file names are to be interpreted. On - operating systems with transparent file naming, this has to be - specified to allow for file naming in Unicode characters (and - for correct interpretation of file names containing characters - > 255.</p> - <p><c>+fnl</c> means bytewise interpretation of file names, which - was the usual way to represent ISO-Latin-1 file names before - UTF-8 file naming got widespread.</p> - <p><c>+fnu</c> means that file names are encoded in UTF-8, which - is nowadays the common scheme (although not enforced).</p> - <p><c>+fna</c> means that you automatically select between - <c>+fnl</c> and <c>+fnu</c>, based on the <c>LANG</c> and - <c>LC_CTYPE</c> environment variables. This is optimistic - heuristics indeed, nothing enforces a user to have a terminal - with the same encoding as the file system, but usually, this is - the case. This is the default on all Unix-like operating - systems except MacOS X.</p> - - <p>The file name translation mode can be read with the - <c>file:native_name_encoding/0</c> function, which returns - <c>latin1</c> (meaning bytewise encoding) or <c>utf8</c>.</p> - </item> - <tag><seealso marker="stdlib:epp#default_encoding/0"> - <c>epp:default_encoding/0</c></seealso></tag> - <item> - <p>This function returns the default encoding for Erlang source - files (if no encoding comment is present) in the currently - running release. In Erlang/OTP R16B <c>latin1</c> was returned (meaning - bytewise encoding). In Erlang/OTP 17.0 and forward it returns - <c>utf8</c>.</p> - <p>The encoding of each file can be specified using comments as - described in - <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso>.</p> - </item> - <tag><seealso marker="stdlib:io#setopts/1"><c>io:setopts/</c>{<c>1</c>,<c>2</c>}</seealso> and the <c>-oldshell</c>/<c>-noshell</c> flags.</tag> - <item> - <p>When Erlang is started with <c>-oldshell</c> or - <c>-noshell</c>, the I/O-server for <c>standard_io</c> is default - set to bytewise encoding, while an interactive shell defaults to - what the environment variables says.</p> - <p>With the <c>io:setopts/2</c> function you can set the - encoding of a file or other I/O-server. This can also be set when - opening a file. Setting the terminal (or other - <c>standard_io</c> server) unconditionally to the option - <c>{encoding,utf8}</c> will for example make UTF-8 encoded characters - being written to the device regardless of how Erlang was started or - the users environment.</p> - <p>Opening files with <c>encoding</c> option is convenient when - writing or reading text files in a known encoding.</p> - <p>You can retrieve the <c>encoding</c> setting for an I/O-server - using <seealso - marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso>.</p> - </item> - </taglist> -</section> -<section> - <title>Recipes</title> - <p>When starting with Unicode, one often stumbles over some common - issues. 
I try to outline some methods of dealing with Unicode data - in this section.</p> +{ok,[[{"Юникод",4711}]]}</pre> + </section> + + <section> + <title>Summary of Options</title> + <marker id="unicode_options_summary"/> + <p>The Unicode support is controlled by both command-line switches, some + standard environment variables, and the OTP version you are using. Most + options affect mainly how Unicode data is displayed, not the + functionality of the APIs in the standard libraries. This means that + Erlang programs usually do not need to concern themselves with these + options, they are more for the development environment. An Erlang program + can be written so that it works well regardless of the type of system or + the Unicode options that are in effect.</p> + + <p>Here follows a summary of the settings affecting Unicode:</p> + + <taglist> + <tag>The <c>LANG</c> and <c>LC_CTYPE</c> environment variables</tag> + <item> + <p>The language setting in the operating system mainly affects the + shell. The terminal (that is, the group leader) operates with + <c>{encoding, unicode}</c> only if the environment tells it that + UTF-8 is allowed. This setting is to correspond to the terminal you + are using.</p> + <p>The environment can also affect filename interpretation, if Erlang + is started with flag <c>+fna</c> (which is default from + Erlang/OTP 17.0).</p> + <p>You can check the setting of this by calling + <seealso marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso>, + which gives you an option list containing <c>{encoding,unicode}</c> + or <c>{encoding,latin1}</c>.</p> + </item> + <tag>The <c>+pc</c> {<c>unicode</c>|<c>latin1</c>} flag to + <seealso marker="erts:erl"><c>erl(1)</c></seealso></tag> + <item> + <p>This flag affects what is interpreted as string data when doing + heuristic string detection in the shell and in + <seealso marker="stdlib:io"><c>io</c></seealso>/ + <seealso marker="stdlib:io_lib#format/2"><c>io_lib:format</c></seealso> + with the <c>"~tp"</c> and <c>~tP</c> formatting instructions, as + described earlier.</p> + <p>You can check this option by calling + <seealso marker="stdlib:io#printable_range/0"><c>io:printable_range/0</c></seealso>, + which returns <c>unicode</c> or <c>latin1</c>. To be compatible with + future (expected) extensions to the settings, rather use + <seealso marker="stdlib:io_lib#printable_list/1"><c>io_lib:printable_list/1</c></seealso> + to check if a list is printable according to the setting. That + function takes into account new possible settings returned from + <c>io:printable_range/0</c>.</p> + </item> + <tag>The <c>+fn</c>{<c>l</c>|<c>u</c>|<c>a</c>} + [{<c>w</c>|<c>i</c>|<c>e</c>}] flag to + <seealso marker="erts:erl"><c>erl(1)</c></seealso></tag> + <item> + <p>This flag affects how the filenames are to be interpreted. On + operating systems with transparent file naming, this must be + specified to allow for file naming in Unicode characters (and for + correct interpretation of filenames containing characters > 255). 
+ </p> + <list type="bulleted"> + <item> + <p><c>+fnl</c> means bytewise interpretation of filenames, which was + the usual way to represent ISO Latin-1 filenames before UTF-8 + file naming got widespread.</p> + </item> + <item> + <p><c>+fnu</c> means that filenames are encoded in UTF-8, which is + nowadays the common scheme (although not enforced).</p> + </item> + <item> + <p><c>+fna</c> means that you automatically select between + <c>+fnl</c> and <c>+fnu</c>, based on environment variables + <c>LANG</c> and <c>LC_CTYPE</c>. This is optimistic + heuristics indeed, nothing enforces a user to have a terminal with + the same encoding as the file system, but this is usually the + case. This is the default on all Unix-like operating systems, + except MacOS X.</p> + </item> + </list> + <p>The filename translation mode can be read with function + <seealso marker="kernel:file#native_name_encoding/0"><c>file:native_name_encoding/0</c></seealso>, + which returns <c>latin1</c> (bytewise encoding) or <c>utf8</c>.</p> + </item> + <tag><seealso marker="stdlib:epp#default_encoding/0"><c>epp:default_encoding/0</c></seealso></tag> + <item> + <p>This function returns the default encoding for Erlang source files + (if no encoding comment is present) in the currently running release. + In Erlang/OTP R16B, <c>latin1</c> (bytewise encoding) was returned. + As from Erlang/OTP 17.0, <c>utf8</c> is returned.</p> + <p>The encoding of each file can be specified using comments as + described in the + <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso> module. + </p> + </item> + <tag><seealso marker="stdlib:io#setopts/1"><c>io:setopts/1,2</c></seealso> + and flags <c>-oldshell</c>/<c>-noshell</c></tag> + <item> + <p>When Erlang is started with <c>-oldshell</c> or <c>-noshell</c>, the + I/O server for <c>standard_io</c> is by default set to bytewise + encoding, while an interactive shell defaults to what the + environment variables says.</p> + <p>You can set the encoding of a file or other I/O server with function + <seealso marker="stdlib:io#setopts/1"><c>io:setopts/2</c></seealso>. + This can also be set when opening a file. Setting the terminal (or + other <c>standard_io</c> server) unconditionally to option + <c>{encoding,utf8}</c> implies that UTF-8 encoded characters are + written to the device, regardless of how Erlang was started or the + user's environment.</p> + <p>Opening files with option <c>encoding</c> is convenient when + writing or reading text files in a known encoding.</p> + <p>You can retrieve the <c>encoding</c> setting for an I/O server with + function + <seealso marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso>. + </p> + </item> + </taglist> + </section> + <section> - <title>Byte Order Marks</title> - <p>A common method of identifying encoding in text-files is to put - a byte order mark (BOM) first in the file. The BOM is the - code point 16#FEFF encoded in the same way as the rest of the - file. If such a file is to be read, the first few bytes (depending - on encoding) is not part of the actual text. This code outlines - how to open a file which is believed to have a BOM and set the - files encoding and position for further sequential reading - (preferably using the <seealso marker="stdlib:io"><c>io</c></seealso> - module). Note that error handling is omitted from the code:</p> -<code> + <title>Recipes</title> + <p>When starting with Unicode, one often stumbles over some common issues. 
+ This section describes some methods of dealing with Unicode data.</p> + + <section> + <title>Byte Order Marks</title> + <p>A common method of identifying encoding in text files is to put a Byte + Order Mark (BOM) first in the file. The BOM is the code point 16#FEFF + encoded in the same way as the remaining file. If such a file is to be + read, the first few bytes (depending on encoding) are not part of the + text. This code outlines how to open a file that is believed to + have a BOM, and sets the file's encoding and position for further + sequential reading (preferably using the + <seealso marker="stdlib:io"><c>io</c></seealso> module).</p> + + <p>Notice that error handling is omitted from the code:</p> + + <code> open_bom_file_for_reading(File) -> {ok,F} = file:open(File,[read,binary]), {ok,Bin} = file:read(F,4), {Type,Bytes} = unicode:bom_to_encoding(Bin), file:position(F,Bytes), io:setopts(F,[{encoding,Type}]), - {ok,F}. -</code> - <p>The <c>unicode:bom_to_encoding/1</c> function identifies the - encoding from a binary of at least four bytes. It returns, along - with an term suitable for setting the encoding of the file, the - actual length of the BOM, so that the file position can be set - accordingly. Note that <c>file:position/2</c> always works on - byte-offsets, so that the actual byte-length of the BOM is - needed.</p> - <p>To open a file for writing and putting the BOM first is even - simpler:</p> -<code> + {ok,F}.</code> + + <p>Function + <seealso marker="stdlib:unicode#bom_to_encoding/1"><c>unicode:bom_to_encoding/1</c></seealso> + identifies the encoding from a binary of at least four bytes. It + returns, along with a term suitable for setting the encoding of the + file, the byte length of the BOM, so that the file position can be set + accordingly. Notice that function + <seealso marker="kernel:file#position/2"><c>file:position/2</c></seealso> + always works on byte-offsets, so that the byte length of the BOM is + needed.</p> + + <p>To open a file for writing and place the BOM first is even simpler:</p> + + <code> open_bom_file_for_writing(File,Encoding) -> {ok,F} = file:open(File,[write,binary]), ok = file:write(F,unicode:encoding_to_bom(Encoding)), io:setopts(F,[{encoding,Encoding}]), - {ok,F}. -</code> - <p>In both cases the file is then best processed using the - <c>io</c> module, as the functions in <c>io</c> can handle code - points beyond the ISO-latin-1 range.</p> - </section> - <section> - <title>Formatted I/O</title> - <p>When reading and writing to Unicode-aware entities, like the - User or a file opened for Unicode translation, you will probably - want to format text strings using the functions in <seealso - marker="stdlib:io"><c>io</c></seealso> or <seealso - marker="stdlib:io_lib"><c>io_lib</c></seealso>. For backward - compatibility reasons, these functions do not accept just any list - as a string, but require a special <em>translation modifier</em> - when working with Unicode texts. The modifier is <c>t</c>.
When - applied to the <c>s</c> control character in a formatting string, - it accepts all Unicode code points and expect binaries to be in - UTF-8:</p> - <pre> + {ok,F}.</code> + + <p>The file is in both these cases then best processed using the + <seealso marker="stdlib:io"><c>io</c></seealso> module, as the functions + in that module can handle code points beyond the ISO Latin-1 range.</p> + </section> + + <section> + <title>Formatted I/O</title> + <p>When reading and writing to Unicode-aware entities, like a + file opened for Unicode translation, you probably want to format text + strings using the functions in the + <seealso marker="stdlib:io"><c>io</c></seealso> module or the + <seealso marker="stdlib:io_lib"><c>io_lib</c></seealso> module. For + backward compatibility reasons, these functions do not accept any list + as a string, but require a special <em>translation modifier</em> when + working with Unicode texts. The modifier is <c>t</c>. When applied to + control character <c>s</c> in a formatting string, it accepts all + Unicode code points and expects binaries to be in UTF-8:</p> + + <pre> 1> <input>io:format("~ts~n",[<<"åäö"/utf8>>]).</input> åäö ok 2> <input>io:format("~s~n",[<<"åäö"/utf8>>]).</input> åäö ok</pre> - <p>Obviously the second <c>io:format/2</c> gives undesired output - because the UTF-8 binary is not in latin1. For backward - compatibility, the non prefixed <c>s</c> control character expects - bytewise encoded ISO-latin-1 characters in binaries and lists - containing only code points < 256.</p> - <p>As long as the data is always lists, the <c>t</c> modifier can - be used for any string, but when binary data is involved, care - must be taken to make the right choice of formatting characters. A - bytewise encoded binary will also be interpreted as a string and - printed even when using <c>~ts</c>, but it might be mistaken for a - valid UTF-8 string and one should therefore avoid using the - <c>~ts</c> control if the binary contains bytewise encoded - characters and not UTF-8.</p> - <p>The function <c>format/2</c> in <c>io_lib</c> behaves - similarly. This function is defined to return a deep list of - characters and the output could easily be converted to binary data - for outputting on a device of any kind by a simple - <c>erlang:list_to_binary/1</c>. When the translation modifier is - used, the list can however contain characters that cannot be - stored in one byte. The call to <c>erlang:list_to_binary/1</c> - will in that case fail. However, if the I/O server you want to - communicate with is Unicode-aware, the list returned can still be - used directly:</p> -<pre> + + <p>Clearly, the second <c>io:format/2</c> gives undesired output, as the + UTF-8 binary is not in <c>latin1</c>. For backward compatibility, the + non-prefixed control character <c>s</c> expects bytewise-encoded ISO + Latin-1 characters in binaries and lists containing only code points + < 256.</p> + + <p>As long as the data is always lists, modifier <c>t</c> can be used for + any string, but when binary data is involved, care must be taken to + make the correct choice of formatting characters. A bytewise-encoded + binary is also interpreted as a string, and printed even when using + <c>~ts</c>, but it can be mistaken for a valid UTF-8 string. Avoid + therefore using the <c>~ts</c> control if the binary contains + bytewise-encoded characters and not UTF-8.</p> + + <p>Function + <seealso marker="stdlib:io_lib#format/2"><c>io_lib:format/2</c></seealso> + behaves similarly. 
+
+    <p>Function
+      <seealso marker="stdlib:io_lib#format/2"><c>io_lib:format/2</c></seealso>
+      behaves similarly. It is defined to return a deep list of characters
+      and the output can easily be converted to binary data for outputting on
+      any device by a simple
+      <seealso marker="erts:erlang#list_to_binary/1"><c>erlang:list_to_binary/1</c></seealso>.
+      When the translation modifier is used, the list can, however, contain
+      characters that cannot be stored in one byte. The call to
+      <c>erlang:list_to_binary/1</c> then fails. However, if the I/O server
+      you want to communicate with is Unicode-aware, the returned list can
+      still be used directly:</p>
+
+    <pre>
 $ <input>erl +pc unicode</input>
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
@@ -1195,55 +1327,56 @@ Eshell V5.10.1 (abort with ^G)
 2> <input>io:put_chars(io_lib:format("~ts~n", ["Γιούνικοντ"])).</input>
 Γιούνικοντ
 ok</pre>
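<p>If the formatted result must end up as bytes anyway (for example, to be
  sent on a socket or written with <c>file:write_file/2</c>), a common
  workaround, sketched here with a helper name of my own, is to encode the
  deep list explicitly instead of calling <c>erlang:list_to_binary/1</c>:</p>

<code>
format_to_utf8(Format, Args) ->
    %% io_lib:format/2 with ~ts can yield code points > 255 in the list;
    %% unicode:characters_to_binary/1 encodes them as UTF-8 instead of failing.
    unicode:characters_to_binary(io_lib:format(Format, Args)).</code>

<p>For example, <c>file:write_file("out.txt", format_to_utf8("~ts~n", ["Γιούνικοντ"]))</c>
  writes the text as UTF-8 without involving any device encoding.</p>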
-    <p>The Unicode string is returned as a Unicode list, which is
-    recognized as such since the Erlang shell uses the Unicode
-    encoding (and is started with all Unicode characters considered
-    printable). The Unicode list is valid input to the <seealso
-    marker="stdlib:io#put_chars/2"><c>io:put_chars/2</c></seealso> function,
-    so data can be output on any Unicode capable device. If the device
-    is a terminal, characters will be output in the <c>\x{</c>H
-    ...<c>}</c> format if encoding is <c>latin1</c> otherwise in UTF-8
-    (for the non-interactive terminal - "oldshell" or "noshell") or
-    whatever is suitable to show the character properly (for an
-    interactive terminal - the regular shell). The bottom line is that
-    you can always send Unicode data to the <c>standard_io</c>
-    device. Files will however only accept Unicode code points beyond
-    ISO-latin-1 if <c>encoding</c> is set to something else than
-    <c>latin1</c>.</p>
-  </section>
-  <section>
-    <title>Heuristic Identification of UTF-8</title>
-    <p>While it is
-    strongly encouraged that the actual encoding of characters in
-    binary data is known prior to processing, that is not always
-    possible. On a typical Linux system, there is a mix of UTF-8
-    and ISO-latin-1 text files and there are seldom any BOM's in the
-    files to identify them.</p>
-    <p>UTF-8 is designed in such a way that ISO-latin-1 characters
-    with numbers beyond the 7-bit ASCII range are seldom considered
-    valid when decoded as UTF-8. Therefore one can usually use
-    heuristics to determine if a file is in UTF-8 or if it is encoded
-    in ISO-latin-1 (one byte per character) encoding. The
-    <c>unicode</c> module can be used to determine if data can be
-    interpreted as UTF-8:</p>
-    <code>
+
+    <p>The Unicode string is returned as a Unicode list, which is recognized
+      as such, as the Erlang shell uses the Unicode encoding (and is started
+      with all Unicode characters considered printable). The Unicode list is
+      valid input to function
+      <seealso marker="stdlib:io#put_chars/2"><c>io:put_chars/2</c></seealso>,
+      so data can be output on any Unicode-capable device. If the device is a
+      terminal, characters are output in the format <c>\x{</c>H...<c>}</c> if
+      the encoding is <c>latin1</c>. Otherwise they are output in UTF-8 (for
+      the non-interactive terminal: "oldshell" or "noshell") or in whatever
+      way is suitable to show the character properly (for an interactive
+      terminal: the regular shell).</p>
+
+    <p>So, you can always send Unicode data to the <c>standard_io</c> device.
+      Files, however, accept only Unicode code points beyond ISO Latin-1 if
+      <c>encoding</c> is set to something other than <c>latin1</c>.</p>
+  </section>
+
+  <section>
+    <title>Heuristic Identification of UTF-8</title>
+    <p>While it is strongly encouraged that the encoding of characters
+      in binary data is known before processing, that is not always possible.
+      On a typical Linux system, there is a mix of UTF-8 and ISO Latin-1 text
+      files, and there are seldom any BOMs in the files to identify them.</p>
+
+    <p>UTF-8 is designed so that ISO Latin-1 characters with numbers beyond
+      the 7-bit ASCII range are seldom considered valid when decoded as UTF-8.
+      Therefore one can usually use heuristics to determine if a file is in
+      UTF-8 or if it is encoded in ISO Latin-1 (one byte per character).
+      The <seealso marker="stdlib:unicode"><c>unicode</c></seealso>
+      module can be used to determine if data can be interpreted as UTF-8:</p>
+
+    <code>
 heuristic_encoding_bin(Bin) when is_binary(Bin) ->
     case unicode:characters_to_binary(Bin,utf8,utf8) of
         Bin -> utf8;
         _ -> latin1
-    end.
-    </code>
-    <p>If one does not have a complete binary of the file content, one
-    could instead chunk through the file and check part by part. The
-    return-tuple <c>{incomplete,Decoded,Rest}</c> from
-    <c>unicode:characters_to_binary/{1,2,3}</c> comes in handy. The
-    incomplete rest from one chunk of data read from the file is
-    prepended to the next chunk and we therefore circumvent the
-    problem of character boundaries when reading chunks of bytes in
-    UTF-8 encoding:</p>
-    <code>
+    end.</code>
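<p>One way to put the heuristic to use, sketched here with a function name of
  my own (<c>read_text_file/1</c>), is to pick the decoding based on what
  <c>heuristic_encoding_bin/1</c> reports and return a Unicode string in both
  cases:</p>

<code>
read_text_file(FileName) ->
    {ok,Bin} = file:read_file(FileName),
    case heuristic_encoding_bin(Bin) of
        utf8   -> unicode:characters_to_list(Bin, utf8);
        latin1 -> unicode:characters_to_list(Bin, latin1)
    end.</code>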
+
+    <p>If you do not have a complete binary of the file content, you can
+      instead chunk through the file and check part by part. The return-tuple
+      <c>{incomplete,Decoded,Rest}</c> from function
+      <seealso marker="stdlib:unicode#characters_to_binary/1"><c>unicode:characters_to_binary/1,2,3</c></seealso>
+      comes in handy. The incomplete rest from one chunk of data read from the
+      file is prepended to the next chunk, and we therefore avoid the problem
+      of character boundaries when reading chunks of bytes in UTF-8
+      encoding:</p>
+
+    <code>
 heuristic_encoding_file(FileName) ->
     {ok,F} = file:open(FileName,[read,binary]),
     loop_through_file(F,<<>>,file:read(F,1024)).
@@ -1260,13 +1393,14 @@ loop_through_file(F,Acc,{ok,Bin}) when is_binary(Bin) ->
            loop_through_file(F,Rest,file:read(F,1024));
        Res when is_binary(Res) ->
            loop_through_file(F,<<>>,file:read(F,1024))
-    end.
-    </code>
-    <p>Another option is to try to read the whole file in UTF-8
-    encoding and see if it fails. Here we need to read the file using
-    <c>io:get_chars/3</c>, as we have to succeed in reading characters
-    with a code point over 255:</p>
-    <code>
+    end.</code>
+
+    <p>Another option is to try to read the whole file in UTF-8 encoding and
+      see if it fails. Here we need to read the file using function
+      <seealso marker="stdlib:io#get_chars/3"><c>io:get_chars/3</c></seealso>,
+      as we have to read characters with a code point > 255:</p>
+
+    <code>
 heuristic_encoding_file2(FileName) ->
     {ok,F} = file:open(FileName,[read,binary,{encoding,utf8}]),
     loop_through_file2(F,io:get_chars(F,'',1024)).
@@ -1276,69 +1410,71 @@ loop_through_file2(_,eof) ->
 loop_through_file2(_,{error,_Err}) ->
     latin1;
 loop_through_file2(F,Bin) when is_binary(Bin) ->
-    loop_through_file2(F,io:get_chars(F,'',1024)).
-    </code>
-  </section>
-  <section>
-    <title>Lists of UTF-8 Bytes</title>
-    <p>For various reasons, you may find yourself having a list of
-    UTF-8 bytes. This is not a regular string of Unicode characters as
-    each element in the list does not contain one character. Instead
-    you get the "raw" UTF-8 encoding that you have in binaries. This
-    is easily converted to a proper Unicode string by first converting
-    byte per byte into a binary and then converting the binary of
-    UTF-8 encoded characters back to a Unicode string:</p>
-    <code>
-    utf8_list_to_string(StrangeList) ->
-      unicode:characters_to_list(list_to_binary(StrangeList)).
-    </code>
-  </section>
-  <section>
-    <title>Double UTF-8 Encoding</title>
-    <p>When working with binaries, you may get the horrible "double
-    UTF-8 encoding", where strange characters are encoded in your
-    binaries or files that you did not expect. What you may have got,
-    is a UTF-8 encoded binary that is for the second time encoded as
-    UTF-8. A common situation is where you read a file, byte by byte,
-    but the actual content is already UTF-8. If you then convert the
-    bytes to UTF-8, using i.e. the <c>unicode</c> module or by
-    writing to a file opened with the <c>{encoding,utf8}</c>
-    option. You will have each <i>byte</i> in the in the input file
-    encoded as UTF-8, not each character of the original text (one
-    character may have been encoded in several bytes). There is no
-    real remedy for this other than being very sure of which data is
-    actually encoded in which format, and never convert UTF-8 data
-    (possibly read byte by byte from a file) into UTF-8 again.</p>
-    <p>The by far most common situation where this happens, is when
-    you get lists of UTF-8 instead of proper Unicode strings, and then
-    convert them to UTF-8 in a binary or on a file:</p>
-    <code>
-    wrong_thing_to_do() ->
-      {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"),
-      MyList = binary_to_list(Bin), %% Wrong! It is an utf8 binary!
-      {ok,C} = file:open("catastrophe.txt",[write,{encoding,utf8}]),
-      io:put_chars(C,MyList), %% Expects a Unicode string, but get UTF-8
-                              %% bytes in a list!
-      file:close(C). %% The file catastrophe.txt contains more or less unreadable
-                     %% garbage!
-    </code>
-    <p>Make very sure you know what a binary contains before
-    converting it to a string. If no other option exists, try
-    heuristics:</p>
-    <code>
-    if_you_can_not_know() ->
-      {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"),
-      MyList = case unicode:characters_to_list(Bin) of
-                 L when is_list(L) ->
-                   L;
-                 _ ->
-                   binary_to_list(Bin) %% The file was bytewise encoded
-               end,
-      %% Now we know that the list is a Unicode string, not a list of UTF-8 bytes
-      {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]),
-      io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets!
-      file:close(G). %% The file contains valid UTF-8 encoded Unicode characters!
-    </code>
+    loop_through_file2(F,io:get_chars(F,'',1024)).</code>
+  </section>
+
+  <section>
+    <title>Lists of UTF-8 Bytes</title>
+    <p>For various reasons, you can sometimes have a list of UTF-8
+      bytes. This is not a regular string of Unicode characters, as each list
+      element holds one byte of the encoding, not one character. Instead you
+      get the "raw" UTF-8 encoding that you have in binaries. This is easily
+      converted to a proper Unicode string by first converting byte per byte
+      into a binary, and then converting the binary of UTF-8 encoded
+      characters back to a Unicode string:</p>
+
+    <code>
+utf8_list_to_string(StrangeList) ->
+    unicode:characters_to_list(list_to_binary(StrangeList)).</code>
+  </section>
+
+  <section>
+    <title>Double UTF-8 Encoding</title>
+    <p>When working with binaries, you can get the horrible "double UTF-8
+      encoding", where strange characters are encoded in your binaries or
+      files. In other words, you can get a UTF-8 encoded binary that has been
+      encoded as UTF-8 a second time. A common situation is where you read a
+      file, byte by byte, but the content is already UTF-8. If you then
+      convert the bytes to UTF-8, using, for example, the
+      <seealso marker="stdlib:unicode"><c>unicode</c></seealso> module, or by
+      writing to a file opened with option <c>{encoding,utf8}</c>, you have
+      each <em>byte</em> in the input file encoded as UTF-8, not each
+      character of the original text (one character may have been encoded in
+      several bytes). There is no real remedy for this other than to be sure
+      of which data is encoded in which format, and never convert UTF-8 data
+      (possibly read byte by byte from a file) into UTF-8 again.</p>
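<p>As an illustration only (<c>double_encoding_demo/0</c> is a name of my
  own), the effect is easy to reproduce: the two UTF-8 bytes of "å" are
  treated as two characters and encoded again, which a UTF-8 terminal then
  typically shows as "Ã¥":</p>

<code>
double_encoding_demo() ->
    Correct = <<"å"/utf8>>,                        %% <<195,165>>, "å" encoded once
    Bytes   = binary_to_list(Correct),             %% [195,165], UTF-8 bytes, not a string
    Wrong   = unicode:characters_to_binary(Bytes), %% <<195,131,194,165>>, encoded twice
    io:format("~ts vs ~ts~n", [Correct, Wrong]).   %% typically prints: å vs Ã¥</code>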
+
+    <p>By far the most common situation where this occurs is when you get
+      lists of UTF-8 instead of proper Unicode strings, and then convert them
+      to UTF-8 in a binary or on a file:</p>
+
+    <code>
+wrong_thing_to_do() ->
+    {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"),
+    MyList = binary_to_list(Bin), %% Wrong! It is a UTF-8 binary!
+    {ok,C} = file:open("catastrophe.txt",[write,{encoding,utf8}]),
+    io:put_chars(C,MyList), %% Expects a Unicode string, but gets UTF-8
+                            %% bytes in a list!
+    file:close(C). %% The file catastrophe.txt contains more or less unreadable
+                   %% garbage!</code>
+
+    <p>Ensure you know what a binary contains before converting it to a
+      string. If no other option exists, try heuristics:</p>
+
+    <code>
+if_you_can_not_know() ->
+    {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"),
+    MyList = case unicode:characters_to_list(Bin) of
+                 L when is_list(L) ->
+                     L;
+                 _ ->
+                     binary_to_list(Bin) %% The file was bytewise encoded
+             end,
+    %% Now we know that the list is a Unicode string, not a list of UTF-8 bytes
+    {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]),
+    io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets!
+    file:close(G). %% The file contains valid UTF-8 encoded Unicode characters!</code>
+  </section>
 </section>
-</section>
 </chapter>