Diffstat (limited to 'lib')
-rw-r--r-- | lib/stdlib/doc/src/unicode_usage.xml | 669 |
1 files changed, 355 insertions, 314 deletions
diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml index 85bb778fc4..c5d476e54b 100644 --- a/lib/stdlib/doc/src/unicode_usage.xml +++ b/lib/stdlib/doc/src/unicode_usage.xml @@ -33,7 +33,7 @@ <file>unicode_usage.xml</file> </header> <section> -<title>Unicode implementation in Erlang/OTP</title> +<title>Unicode Implementation</title> <p>Implementing support for Unicode character sets is an ongoing process. The Erlang Enhancement Proposal (EEP) 10 outlined the basics of Unicode support and also specified a default encoding in @@ -48,13 +48,13 @@ source code, among with enhancements to many of the applications to support both Unicode encoded file names as well as support for UTF-8 encoded files in several circumstances. Most notable is the support - for UTF-8 in files read by file:consult/1, release handler support + for UTF-8 in files read by <c>file:consult/1</c>, release handler support for UTF-8 and more support for Unicode character sets in the - io-system.</p> + I/O-system.</p> <p>In R17, the encoding default for Erlang source files will be switched to UTF-8 and in R18 Erlang will support atoms in the full - Unicode range, meaning full Unicode function names and module + Unicode range, meaning full Unicode function and module names</p> <p>This guide outlines the current Unicode support and gives a couple @@ -88,14 +88,14 @@ the translation should be in and also take into account differences in input and output string length and so on. There is at the time of writing no Unicode to_upper/to_lower functionality in Erlang/OTP, but - there are publicly available libraries that addresses these issues.</p> + there are publicly available libraries that address these issues.</p> <p>Another example is the accented characters where the same glyph has two different representations. Let's look at the Swedish "ö". 
There's a code point for that in the Unicode standard, but you can also write it as "o" followed by U+0308 (Combining Diaeresis, with the simplified meaning that the last letter should have a "¨" - above). They have exactly the same glyph. they are for most + above). They have exactly the same glyph. They are for most purposes the same, but they have completely different representations. For example MacOS X converts all file names to use Combining Diaeresis, while most other programs (including Erlang) @@ -113,7 +113,7 @@ </section> <section> -<title>What Unicode is</title> +<title>What Unicode Is</title> <p>Unicode is a standard defining code points (numbers) for all known, living or dead, scripts. In principle, every known symbol used in any language has a Unicode code point.</p> @@ -127,7 +127,7 @@ <p>It is vital to understand the difference between encodings and Unicode characters. Unicode characters are code points according to the Unicode standard, while the encodings are ways to represent such - code points. An encoding is just an standard for representation, + code points. An encoding is just a standard for representation, UTF-8 can for example be used to represent a very limited part of the Unicode character set (e.g. ISO-Latin-1), or the full Unicode range. It's just an encoding format.</p> @@ -145,7 +145,7 @@ encodings. For example Linux and MacOS X has chosen the UTF-8 encoding, which is backwards compatible with 7-bit ASCII and therefore affects programs written in plain English the - least. Windows® on the other hand supports a limited version of + least. Windows on the other hand supports a limited version of UTF-16, namely all the code planes where the characters can be stored in one single 16-bit entity, which includes most living languages.</p> @@ -158,18 +158,18 @@ can still be used to represent character code points in the Unicode standard that have numbers below 256, which corresponds exactly to the ISO-Latin-1 character set. 
In Erlang, this is commonly denoted - 'latin1' encoding, which is slightly misleading as ISO-Latin-1 is + <c>latin1</c> encoding, which is slightly misleading as ISO-Latin-1 is a character code range, not an encoding.</item> <tag>UTF-8</tag> <item>Each character is stored in one to four bytes depending on code point. The encoding is backwards compatible with bytewise representation of 7-bit ASCII as all 7-bit characters are stored - in one single byte in UTF-8. The characters beyond code point 126 + in one single byte in UTF-8. The characters beyond code point 127 are stored in more bytes, letting the most significant bit in the first character indicate a multi-byte character. For details on the encoding, the RFC is publicly available. Note that UTF-8 is <em>not</em> compatible with bytewise representation for - code points between 127 and 255, so a ISO-Latin-1 bytewise + code points between 128 and 255, so a ISO-Latin-1 bytewise representation is not generally compatible with UTF-8.</item> <tag>UTF-16</tag> <item>This encoding has many similarities to UTF-8, but the basic @@ -181,7 +181,7 @@ is more than one byte, byte-order issues occur, why UTF-16 exists in both a big-endian and little-endian variant. In Erlang, the full UTF-16 range is supported when applicable, like in the - 'unicode' module and in the bit syntax.</item> + <c>unicode</c> module and in the bit syntax.</item> <tag>UTF-32</tag> <item>The most straight forward representation. Each character is stored in one single 32-bit number. There is no need for escapes @@ -214,7 +214,7 @@ Unicode format of a certain file.</p> </section> <section> - <title>Areas where Erlang support Unicode</title> + <title>Areas of Unicode Support</title> <p>To support Unicode in Erlang, problems in several areas have been addressed. 
Each area is described briefly in this section and more thoroughly further down in this document:</p> @@ -231,7 +231,7 @@ (as the string module now can handle lists with arbitrary code points), in some cases new functionality or options need to be added (as in the <c>io</c>-module, the file handling, the <c>unicode</c> module - and the bit syntax). Today most modules in kernel and stdlib, as + and the bit syntax). Today most modules in kernel and STDLIB, as well as the VM are Unicode aware.</item> <tag>File I/O</tag> <item>I/O is by far the most problematic area for Unicode. A file @@ -242,7 +242,7 @@ text file with an encoding option, so that you can read characters from it rather than bytes, but you can also open a file for bytewise I/O. The I/O-system of Erlang has been designed (or at - least used) in a way where you expect any <c>io_device</c> to be + least used) in a way where you expect any I/O-server to be able to cope with any string data, but that is no longer the case when you work with Unicode characters. Handling the fact that you need to know the capabilities of the device where your data ends @@ -262,14 +262,14 @@ <item>File names can be stored as Unicode strings, in different ways depending on the underlying OS and file system. This can be handled fairly easy by a program. The problems arise when the file - system is not consequent in it's encodings, like for example + system is not consistent in it's encodings, like for example Linux. Linux allows files to be named with any sequence of bytes, leaving to each program to interpret those bytes. On systems where these "transparent" file names are used, Erlang has to be informed about the file name encoding by a startup flag. The default is bytewise interpretation, which is actually usually wrong, but allows for interpretation of <em>all</em> file names. 
The concept - of "raw file names" has to be used to handle wrongly encoded + of "raw file names" can be used to handle wrongly encoded file names if one enables Unicode file name translation (<c>+fnu</c>) on platforms where this is not the default.</item> <tag>Source code encoding</tag> @@ -280,9 +280,9 @@ <code> %% -*- coding: utf-8 -*- </code> - in the beginning of the file. It of course requires your editor to + in the beginning of the file. This of course requires your editor to support UTF-8 as well. The same comment is also interpreted by - functions like file:consult/1 , the release handler etc, so that + functions like <c>file:consult/1</c>, the release handler etc, so that you can have all text files in your source directories in UTF-8 encoding. </item> @@ -297,69 +297,73 @@ operating systems with inconsistent file naming schemes, and might also hurt portability, so it's not really recommended. It is suggested in EEP 40 that the language should also allow for - Unicode characters > 255 in variable names. Weather to + Unicode characters > 255 in variable names. Whether to implement that EEP or not is yet to be decided.</item> </taglist> </section> <section> -<title>Standard Unicode Representation in Erlang</title> -<p>In Erlang, strings are actually lists of integers. A string was up -until R13 defined to be encoded in the ISO-latin-1 (ISO8859-1) -character set, which is, code point by code point, a sub-range of the -Unicode character set.</p> -<p>The standard list encoding for strings was therefore easily -extended to cope with the whole Unicode range: A Unicode string in -Erlang is simply a list containing integers, each integer being a -valid Unicode code point and representing one character in the Unicode -character set.</p> -<p>Erlang strings in ISO-latin-1 are a subset of Unicode strings.</p> -<p>Only if a string contains code points < 256, can it be directly -converted to a binary by using i.e. 
<c>erlang:iolist_to_binary/1</c> -or can be sent directly to a port. If the string contains Unicode -characters > 255, an encoding has to be decided upon and the -string should be converted to a binary in the preferred encoding using -<c>unicode:characters_to_binary/{1,2,3}</c>. Strings are not generally -lists of bytes, as they were before R13. They are lists of -characters. Characters are not generally bytes, they are Unicode -code points.</p> + <title>Standard Unicode Representation</title> + <p>In Erlang, strings are actually lists of integers. A string was + up until R13 defined to be encoded in the ISO-latin-1 (ISO8859-1) + character set, which is, code point by code point, a sub-range of + the Unicode character set.</p> + <p>The standard list encoding for strings was therefore easily + extended to cope with the whole Unicode range: A Unicode string in + Erlang is simply a list containing integers, each integer being a + valid Unicode code point and representing one character in the + Unicode character set.</p> + <p>Erlang strings in ISO-latin-1 are a subset of Unicode + strings.</p> + <p>Only if a string contains code points < 256, can it be + directly converted to a binary by using + i.e. <c>erlang:iolist_to_binary/1</c> or can be sent directly to a + port. If the string contains Unicode characters > 255, an + encoding has to be decided upon and the string should be converted + to a binary in the preferred encoding using + <c>unicode:characters_to_binary/{1,2,3}</c>. Strings are not + generally lists of bytes, as they were before R13. They are lists of + characters. Characters are not generally bytes, they are Unicode + code points.</p> -<p>Binaries are more troublesome. For performance reasons, programs -often store textual data in binaries instead of lists, mainly because -they are more compact (one byte per character instead of two words per -character, as is the case with lists). 
Using -<c>erlang:list_to_binary/1</c>, an ISO-Latin-1 Erlang string could be -converted into a binary, effectively using bytewise encoding - one -byte per character. This was very convenient for those limited Erlang -strings, but cannot be done for arbitrary Unicode lists.</p> -<p>As the UTF-8 encoding is widely spread and provides some backward -compatibility in the 7-bit ASCII range, it is selected as the standard -encoding for Unicode characters in binaries for Erlang.</p> -<p>The standard binary encoding is used whenever a library function in -Erlang should cope with Unicode data in binaries, but is of course not -enforced when communicating externally. Functions and bit-syntax exist -to encode and decode both UTF-8, UTF-16 and UTF-32 in -binaries. Library functions dealing with binaries and Unicode in -general, however, only deal with the default encoding.</p> + <p>Binaries are more troublesome. For performance reasons, programs + often store textual data in binaries instead of lists, mainly + because they are more compact (one byte per character instead of two + words per character, as is the case with lists). Using + <c>erlang:list_to_binary/1</c>, an ISO-Latin-1 Erlang string could + be converted into a binary, effectively using bytewise encoding - + one byte per character. This was very convenient for those limited + Erlang strings, but cannot be done for arbitrary Unicode lists.</p> + <p>As the UTF-8 encoding is widely spread and provides some backward + compatibility in the 7-bit ASCII range, it is selected as the + standard encoding for Unicode characters in binaries for Erlang.</p> + <p>The standard binary encoding is used whenever a library function + in Erlang should cope with Unicode data in binaries, but is of + course not enforced when communicating externally. Functions and + bit-syntax exist to encode and decode both UTF-8, UTF-16 and UTF-32 + in binaries. 
Library functions dealing with binaries and Unicode in + general, however, only deal with the default encoding.</p> -<p>Character data may be combined from several sources, sometimes -available in a mix of strings and binaries. Erlang has for long had -the concept of <c>iodata</c> or <c>iolists</c>, where binaries and -lists can be combined to represent a sequence of bytes. In the same -way, the Unicode aware modules often allow for combinations of -binaries and lists where the binaries have characters encoded in UTF-8 -and the lists contain such binaries or numbers representing Unicode -code points:</p> -<code type="none"> + <p>Character data may be combined from several sources, sometimes + available in a mix of strings and binaries. Erlang has for long had + the concept of <c>iodata</c> or <c>iolist</c>s, where binaries and + lists can be combined to represent a sequence of bytes. In the same + way, the Unicode aware modules often allow for combinations of + binaries and lists where the binaries have characters encoded in + UTF-8 and the lists contain such binaries or numbers representing + Unicode code points:</p> + <code type="none"> unicode_binary() = binary() with characters encoded in UTF-8 coding standard chardata() = charlist() | unicode_binary() charlist() = maybe_improper_list(char() | unicode_binary() | charlist(), unicode_binary() | nil())</code> -<p>The module <c>unicode</c> in STDLIB even supports similar mixes -with binaries containing other encodings than UTF-8, but that is a -special case to allow for conversions to and from external data:</p> - <code type="none"> + <p>The module <seealso + marker="stdlib:unicode"><c>unicode</c></seealso> in STDLIB even + supports similar mixes with binaries containing other encodings than + UTF-8, but that is a special case to allow for conversions to and + from external data:</p> + <code type="none"> external_unicode_binary() = binary() with characters coded in a user specified Unicode encoding other than UTF-8 
(UTF-16 or UTF-32) @@ -371,12 +375,12 @@ external_charlist() = maybe_improper_list(char() | external_unicode_binary() | nil())</code> </section> <section> - <title>Basic Language Support for Unicode</title> + <title>Basic Language Support</title> <p><marker id="unicode_in_erlang"/>As of Erlang/OTP R16 Erlang source files can be written in either UTF-8 or bytewise encoding - (a.k.a. latin1 encoding). The details on how to state the encoding + (a.k.a. <c>latin1</c> encoding). The details on how to state the encoding of an Erlang source file can be found in - <seealso marker="stdlib:epp#encoding">epp(3)</seealso>. Strings and comments + <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso>. Strings and comments can be written using Unicode, but functions still have to be named using characters from the ISO-latin-1 character set and atoms are restricted to the same ISO-latin-1 range. These restrictions in the @@ -400,7 +404,7 @@ $o/utf32-little>>,</code> Bin4 = <<"Hello"/utf16>>,</code> </section> <section> - <title>String- and Character-literals</title> + <title>String and Character Literals</title> <p>For source code, there is an extension to the <c>\</c>OOO (backslash followed by three octal numbers) and <c>\x</c>HH (backslash followed by <c>x</c>, followed by two hexadecimal @@ -409,7 +413,7 @@ Bin4 = <<"Hello"/utf16>>,</code> number of hexadecimal digits and a terminating right curly bracket). This allows for entering characters of any code point literally in a string even when the encoding of the source file is - bytewise (latin1).</p> + bytewise (<c>latin1</c>).</p> <p>In the shell, if using a Unicode input device, or in source code stored in UTF-8, <c>$</c> can be followed directly by a Unicode character producing an integer. 
In the following example @@ -419,7 +423,7 @@ Bin4 = <<"Hello"/utf16>>,</code> 1089</pre> </section> <section> - <title>Heuristic string detection</title> + <title>Heuristic String Detection</title> <p>In certain output functions and in the output of return values in the shell, Erlang tries to heuristically detect string data in lists and binaries. Typically you will see heuristic detection in @@ -429,7 +433,7 @@ Bin4 = <<"Hello"/utf16>>,</code> "abc" 2> <input><<97,98,99>>.</input> <<"abc">> -3> <input><<195,165,195,164,195,182>></input> +3> <input><<195,165,195,164,195,182>>.</input> <<"åäö"/utf8>></pre> <p>Here the shell will detect lists containing printable characters or binaries containing printable characters either in @@ -439,7 +443,7 @@ Bin4 = <<"Hello"/utf16>>,</code> the heuristic detection. The result would be that almost any list of integers will be deemed a string, resulting in all sorts of characters being printed, maybe even characters your terminal does - not have in it's font set (resulting in some generic output you + not have in its font set (resulting in some generic output you probably will not appreciate). Another way is to keep it backwards compatible so that only the ISO-Latin-1 character set is used to detect a string. A third way would be to let the user decide @@ -489,7 +493,7 @@ Eshell V5.10.1 (abort with ^G) only interpret characters from the ISO-Latin1 range as printable and will only detect lists or binaries with those "printable" characters as containing string data. The valid UTF-8 binary - containing "Юникод", will not be print as a string. When, on the + containing "Юникод", will not be printed as a string. 
When, on the other hand, started with all Unicode characters printable (<c>+pc unicode</c>), the shell will output anything containing printable Unicode data (in binaries either UTF-8 or bytewise encoded) as @@ -525,35 +529,36 @@ ok </section> </section> <section> -<title>The Interactive Shell</title> -<p>The interactive Erlang shell, when started towards a terminal or -started using the <c>werl</c> command on windows, can support Unicode -input and output.</p> -<p>On Windows®, proper operation requires that a suitable font is -installed and selected for the Erlang application to use. If no -suitable font is available on your system, try installing the DejaVu -fonts (<c>dejavu-fonts.org</c>), which are freely available and then -select that font in the Erlang shell application.</p> -<p>On Unix®-like operating systems, the terminal should be able to -handle UTF-8 on input and output (modern versions of XTerm, KDE -konsole and the Gnome terminal do for example) and your locale -settings have to be proper. As an example, my <c>LANG</c> environment -variable is set as this:</p> -<pre> + <title>The Interactive Shell</title> + <p>The interactive Erlang shell, when started towards a terminal or + started using the <c>werl</c> command on windows, can support + Unicode input and output.</p> + <p>On Windows, proper operation requires that a suitable font + is installed and selected for the Erlang application to use. If no + suitable font is available on your system, try installing the DejaVu + fonts (<c>dejavu-fonts.org</c>), which are freely available and then + select that font in the Erlang shell application.</p> + <p>On Unix-like operating systems, the terminal should be able + to handle UTF-8 on input and output (modern versions of XTerm, KDE + konsole and the Gnome terminal do for example) and your locale + settings have to be proper. 
As an example, my <c>LANG</c> + environment variable is set as this:</p> + <pre> $ <input>echo $LANG</input> en_US.UTF-8</pre> -<p>Actually, most systems handle the <c>LC_CTYPE</c> variable before -<c>LANG</c>, so if that is set, it has to be set to <c>UTF-8</c>:</p> -<pre> + <p>Actually, most systems handle the <c>LC_CTYPE</c> variable before + <c>LANG</c>, so if that is set, it has to be set to + <c>UTF-8</c>:</p> + <pre> $ echo <input>$LC_CTYPE</input> en_US.UTF-8</pre> -<p>The <c>LANG</c> or <c>LC_CTYPE</c> setting should be consistent -with what the terminal is capable of, there is no portable way for -Erlang to ask the actual terminal about its UTF-8 capacity, we have to -rely on the language and character type settings.</p> -<p>To investigate what Erlang thinks about the terminal, the -<c>io:getopts()</c> call can be used when the shell is started:</p> -<pre> + <p>The <c>LANG</c> or <c>LC_CTYPE</c> setting should be consistent + with what the terminal is capable of, there is no portable way for + Erlang to ask the actual terminal about its UTF-8 capacity, we have + to rely on the language and character type settings.</p> + <p>To investigate what Erlang thinks about the terminal, the + <c>io:getopts()</c> call can be used when the shell is started:</p> + <pre> $ <input>LC_CTYPE=en_US.ISO-8859-1 erl</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -570,47 +575,49 @@ Eshell V5.10.1 (abort with ^G) {encoding,unicode} 2></pre> -<p>When (finally?) everything is in order with the locale settings, -fonts and the terminal emulator, you probably also have discovered a -way to input characters in the script you desire. For testing, the -simplest way is to add some keyboard mappings for other languages, -usually done with some applet in your desktop environment. In my KDE -environment, I start the KDE Control Center (Personal Settings), -select "Regional and Accessibility" and then "Keyboard Layout". 
On -Windows XP®, I start Control Panel->Regional and Language Options, -select the Language tab and click the Details... button in the square -named "Text services and input Languages". Your environment probably -provides similar means of changing the keyboard layout. Make sure you -have a way to easily switch back and forth between keyboards if you -are not used to this, entering commands using a Cyrillic character set -is, as an example, not easily done in the Erlang shell.</p> + <p>When (finally?) everything is in order with the locale settings, + fonts and the terminal emulator, you probably also have discovered a + way to input characters in the script you desire. For testing, the + simplest way is to add some keyboard mappings for other languages, + usually done with some applet in your desktop environment. In my KDE + environment, I start the KDE Control Center (Personal Settings), + select "Regional and Accessibility" and then "Keyboard Layout". On + Windows XP, I start Control Panel->Regional and Language + Options, select the Language tab and click the Details... button in + the square named "Text services and input Languages". Your + environment probably provides similar means of changing the keyboard + layout. Make sure you have a way to easily switch back and forth + between keyboards if you are not used to this, entering commands + using a Cyrillic character set is, as an example, not easily done in + the Erlang shell.</p> -<p>Now you are set up for some Unicode input and output. The simplest -thing to do is of course to enter a string in the shell:</p> + <p>Now you are set up for some Unicode input and output. 
The + simplest thing to do is of course to enter a string in the + shell:</p> -<pre> + <pre> $ <input>erl</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) 1> <input>lists:keyfind(encoding, 1, io:getopts()).</input> {encoding,unicode} -2> <input>"Юникод"</input> +2> <input>"Юникод".</input> "Юникод" 3> <input>io:format("~ts~n", [v(2)]).</input> Юникод ok 4> </pre> -<p>While strings can be input as Unicode characters, the language -elements are still limited to the ISO-latin-1 character set. Only -character constants and strings are allowed to be beyond that -range:</p> -<pre> + <p>While strings can be input as Unicode characters, the language + elements are still limited to the ISO-latin-1 character set. Only + character constants and strings are allowed to be beyond that + range:</p> + <pre> $ <input>erl</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] Eshell V5.10.1 (abort with ^G) -1> <input>$ξ</input> +1> <input>$ξ.</input> 958 2> <input>Юникод.</input> * 1: illegal character @@ -666,23 +673,11 @@ Eshell V5.10.1 (abort with ^G) <c>[195,150,115,116,101,114,115,117,110,100]</c>, which is a list containing UTF-8 bytes - not what you would want... If you on the other hand use Unicode file name translation on such a - system, nun-UTF-8 file names will simply be ignored by functions + system, non-UTF-8 file names will simply be ignored by functions like <c>file:list_dir/1</c>. They can be retrieved with <c>file:list_dir_all/1</c>, but wrongly encoded file names will appear as "raw file names".</p> - <p>A raw file name is not a list, but a binary with undefined - encoding. Many non core applications still do not handle file - names given as binaries, why such raw names are avoided by - default. 
All functions in the <c>file</c> module taking - file names as input will handle raw file names, sending them more - or less uninterpreted to the underlying OS API, but only the - functions with names ending in <c>_all</c> will produce raw file - names. As special considerations will have to be taken by tools - etc to be able to handle non-UTF-8 encoded file names when - Unicode file name translation is activated on systems with - transparent file naming, the default is to leave such - translation off on such operating systems.</p> </item> </taglist> @@ -691,13 +686,13 @@ Eshell V5.10.1 (abort with ^G) work with files having names in any language or character set (as long as it is supported by the underlying OS and file system). The Unicode character list is used to denote file or directory names and - if the file system content is listed, you will also be able to get + if the file system content is listed, you will also get Unicode lists as return value. The support lies in the Kernel and STDLIB modules, why most applications (that does not explicitly require the file names to be in the ISO-latin-1 range) will benefit from the Unicode support without change.</p> - <p>On Operating systems with mandatory Unicode file names, this + <p>On operating systems with mandatory Unicode file names, this means that you more easily conform to the file names of other (non Erlang) applications, and you can also process file names that, at least on Windows, were completely inaccessible (due to having names @@ -712,8 +707,17 @@ Eshell V5.10.1 (abort with ^G) work perfectly in Unicode file name mode. It was still however considered experimental in R14B01 and is still not the default on such systems. Unicode file name translation is turned on with the - <c>+fnu</c> switch to the <c>erl</c> program. 
If the VM is started - in Unicode file name translation mode, + <c>+fnu</c> switch to the On Linux, a VM started without explicitly + stating the file name translation mode will default to <c>latin1</c> + as the native file name encoding. On Windows and MacOS X, the + default behavior is that of Unicode file name translation, why the + <c>file:native_name_encoding/0</c> by default returns <c>utf8</c> on + those systems (the fact that Windows actually does not use UTF-8 on + the file system level can safely be ignored by the Erlang + programmer). The default behavior can, as stated before, be + changed using the <c>+fnu</c> or <c>+fnl</c> options to the VM, see + the <seealso marker="erts:erl"><c>erl</c></seealso> program. If the + VM is started in Unicode file name translation mode, <c>file:native_name_encoding/0</c> will return the atom <c>utf8</c>. The <c>+fnu</c> switch can be followed by <c>w</c>, <c>i</c> or <c>e</c>, to control how wrongly encoded file names are @@ -722,7 +726,9 @@ Eshell V5.10.1 (abort with ^G) "skipped" in directory listings, <c>i</c> means that those wrongly encoded file names are silently ignored and <c>e</c> means that the API function will return an error whenever a wrongly encoded file - (or directory) name is encountered. <c>w</c> is the default.</p> + (or directory) name is encountered. <c>w</c> is the default. Note + that <c>file:read_link/1</c> will always return an error if the link + points to an invalid file name.</p> <p>In Unicode file name mode, file names given to the BIF <c>open_port/2</c> with the option <c>{spawn_executable,...}</c> are @@ -734,7 +740,7 @@ Eshell V5.10.1 (abort with ^G) <p>It is worth noting that the file <c>encoding</c> options given when opening a file has nothing to do with the file <em>name</em> encoding convention. 
You can very well open files containing data - encoded in UTF-8 but having file names in bytewise (latin1) encoding + encoded in UTF-8 but having file names in bytewise (<c>latin1</c>) encoding or vice versa.</p> <note><p>Erlang drivers and NIF shared objects still can not be @@ -744,9 +750,9 @@ Eshell V5.10.1 (abort with ^G) experimental.</p></note> <section> - <title>Notes About Raw File Names and Automatic File Name Conversion</title> + <title>Notes About Raw File Names</title> - <p>Raw file names was introduced together with Unicode file name + <p>Raw file names were introduced together with Unicode file name support in erts-5.8.2 (OTP R14B01). The reason "raw file names" was introduced in the system was to be able to consistently represent file names given in different encodings on @@ -793,26 +799,10 @@ Eshell V5.10.1 (abort with ^G) encoded file names, so that raw file names could spread unexpectedly throughout the system. Beginning with R16B, the wrongly encoded file names are only retrieved by special functions - (e.g. <c>file:list_dir_all/1</c>, so the impact on existing code is + (e.g. <c>file:list_dir_all/1</c>), so the impact on existing code is much lower, why it is now supported. Unicode file name translation is expected to be default in future releases.</p> - <p>If working with raw file names, one can still conform to the - encoding convention of the Erlang VM by using the - <c>file:native_name_encoding/0</c> function, which returns either - the atom <c>latin1</c> or the atom <c>utf8</c> depending on the file - name translation mode. On Linux, a VM started without explicitly - stating the file name translation mode will default to <c>latin1</c> - as the native file name encoding. 
On Windows and MacOS X, the default - behavior is that of Unicode file name translation, why the - <c>file:native_name_encoding/0</c> by default returns <c>utf8</c> on - those systems (the fact that Windows actually does not use UTF-8 on - the file system level can safely be ignored by the Erlang - programmer). The default behavior can, as been stated before, be - changed using the <c>+fnu</c> or <c>+fnl</c> options to the VM, see - the <seealso marker="erts:erl"><c>erl(1)</c></seealso> command - manual page.</p> - <p>Even if you are operating without Unicode file naming translation automatically done by the VM, you can access and create files with names in UTF-8 encoding by using raw file names encoded as @@ -822,16 +812,19 @@ Eshell V5.10.1 (abort with ^G) </section> <section> <title>Notes About MacOS X</title> - <p>MacOS X's vfs layer enforces UTF-8 file names in a quite aggressive - way. Older versions did this by simply refusing to create non UTF-8 - conforming file names, while newer versions replace offending bytes - with the sequence "%HH", where HH is the original - character in hexadecimal notation. As Unicode translation is enabled - by default on MacOS X, the only way to come up against this is to - either start the VM with the <c>+fnl</c> flag or to use a raw file - name in <c>latin1</c> encoding. In that case, the file can not be - opened with the same name as the one used to create this. The - problem is by design in newer versions of MacOS X.</p> + <p>MacOS X's vfs layer enforces UTF-8 file names in a quite + aggressive way. Older versions did this by simply refusing to create + non UTF-8 conforming file names, while newer versions replace + offending bytes with the sequence "%HH", where HH is the + original character in hexadecimal notation. 
As Unicode translation + is enabled by default on MacOS X, the only way to come up against + this is to either start the VM with the <c>+fnl</c> flag or to use a + raw file name in bytewise (<c>latin1</c>) encoding. If using a raw + filename, with a bytewise encoding containing characters between 127 + and 255, to create a file, the file can not be opened using the same + name as the one used to create it. There is no remedy for this + behaviour, other than keeping the file names in the right + encoding.</p> <p>MacOS X also reorganizes the names of files so that the representation of accents etc is using the "combining characters", @@ -850,7 +843,7 @@ Eshell V5.10.1 (abort with ^G) </section> </section> <section> - <title>Unicode in Environment Variables and Parameters to erl</title> + <title>Unicode in Environment and Parameters</title> <p>Environment variables and their interpretation is handled much in the same way as file names. If Unicode file names are enabled, environment variables as well as parameters to the Erlang VM are @@ -884,7 +877,7 @@ Eshell V5.10.1 (abort with ^G) <taglist> <tag><c>unicode</c></tag> <item> - <p>The module <seealso marker="stdlib:unicode">unicode</seealso> + <p>The module <seealso marker="stdlib:unicode"><c>unicode</c></seealso> is obviously Unicode-aware. It contains functions for conversion between different Unicode formats as well as some utilities for identifying byte order marks. Few programs handling Unicode data @@ -892,7 +885,7 @@ Eshell V5.10.1 (abort with ^G) </item> <tag><c>io</c></tag> <item> - <p>The <seealso marker="stdlib:io">io</seealso> module has been + <p>The <seealso marker="stdlib:io"><c>io</c></seealso> module has been extended along with the actual I/O-protocol to handle Unicode data. 
This means that several functions require binaries to be in UTF-8 and there are modifiers to formatting control sequences @@ -900,49 +893,50 @@ Eshell V5.10.1 (abort with ^G) </item> <tag><c>file</c>, <c>group</c>, <c>user</c></tag> <item> - <p>I/O-servers throughout the system are able both to handle + <p>I/O-servers throughout the system are able to handle Unicode data and have options for converting data upon actual output or input to/from the device. As shown earlier, the - <seealso marker="stdlib:shell">shell</seealso> has support for + <seealso marker="stdlib:shell"><c>shell</c></seealso> has support for Unicode terminals and the <seealso - marker="kernel:file">file</seealso> module allows for + marker="kernel:file"><c>file</c></seealso> module allows for translation to and from various Unicode formats on disk.</p> <p>The actual reading and writing of files with Unicode data is however not best done with the <c>file</c> module as its interface is byte oriented. A file opened with a Unicode encoding (like UTF-8), is then best read or written using the - <seealso marker="stdlib:io">io</seealso> module.</p> + <seealso marker="stdlib:io"><c>io</c></seealso> module.</p> </item> <tag><c>re</c></tag> <item> - <p>The <seealso marker="stdlib:re">re</seealso> module allows + <p>The <seealso marker="stdlib:re"><c>re</c></seealso> module allows for matching Unicode strings as a special option.
As the library is actually centered on matching in binaries, the Unicode support is UTF-8-centered.</p> </item> <tag><c>wx</c></tag> <item> - <p>The <seealso marker="wx:wx">wx</seealso> graphical library + <p>The <seealso marker="wx:wx"><c>wx</c></seealso> graphical library has extensive support for Unicode text</p> </item> </taglist> - <p>The module <seealso marker="stdlib:string">string</seealso> works - perfect for Unicode strings as well as for ISO-latin-1 strings with - the exception of the language-dependent - <seealso marker="stdlib:string#to_upper/1">to_upper</seealso> and - <seealso marker="stdlib:string#to_lower/1">to_lower</seealso> functions, - which are only correct for the ISO-latin-1 character set. Actually - they can never function correctly for Unicode characters in their - current form, there are language and locale issues as well as - multi-character mappings to consider when conversion text between - cases. Converting case in an international environment is a big - subject not yet addressed in OTP.</p> + <p>The module <seealso + marker="stdlib:string"><c>string</c></seealso> works perfectly for + Unicode strings as well as for ISO-latin-1 strings with the + exception of the language-dependent <seealso + marker="stdlib:string#to_upper/1"><c>to_upper</c></seealso> and + <seealso marker="stdlib:string#to_lower/1"><c>to_lower</c></seealso> + functions, which are only correct for the ISO-latin-1 character + set. Actually they can never function correctly for Unicode + characters in their current form, as there are language and locale + issues as well as multi-character mappings to consider when + converting text between cases. 
Converting case in an international + environment is a big subject not yet addressed in OTP.</p> </section> <section> - <title>Unicode data in files</title> + <title>Unicode Data in Files</title> <p>The fact that Erlang as such can handle Unicode data in many forms does not automatically mean that the content of any file can be - Unicode text. The external entities such as ports or io_servers are + Unicode text. The external entities such as ports or I/O-servers are not generally Unicode capable.</p> <p>Ports are always byte oriented, so before sending data that you are not sure is bytewise encoded to a port, make sure to encode it @@ -951,7 +945,7 @@ Eshell V5.10.1 (abort with ^G) binary data (like a length indicator) or something else that shall not undergo character encoding, so no automatic translation is present.</p> - <p>io_servers behave a little differently. The io_servers connected + <p>I/O-servers behave a little differently. The I/O-servers connected to terminals (or stdout) can usually cope with Unicode data regardless of the <c>encoding</c> option. This is convenient when one expects a modern environment but does not want to crash when @@ -959,9 +953,9 @@ Eshell V5.10.1 (abort with ^G) more picky. A file can have an encoding option which makes it generally usable by the io-module (e.g. <c>{encoding,utf8}</c>), but is by default opened as a byte oriented file. The <seealso - marker="kernel:file">file</seealso> module is byte oriented, why only + marker="kernel:file"><c>file</c></seealso> module is byte oriented, why only ISO-Latin-1 characters can be written using that module. The - <seealso marker="stdlib:io">io</seealso> module is the one to use if + <seealso marker="stdlib:io"><c>io</c></seealso> module is the one to use if Unicode data is to be output to a file with other <c>encoding</c> than <c>latin1</c> (a.k.a. bytewise encoding).
It is slightly confusing that a file opened with @@ -973,12 +967,12 @@ Eshell V5.10.1 (abort with ^G) files other than text files - byte by byte. Just as with ports, you can of course write encoded data into a file by "manually" converting the data to the encoding of choice (using the <seealso - marker="stdlib:unicode">unicode</seealso> module or the bit syntax) + marker="stdlib:unicode"><c>unicode</c></seealso> module or the bit syntax) and then output it on a bytewise encoded (<c>latin1</c>) file.</p> <p>The rule of thumb is that the <seealso - marker="kernel:file">file</seealso> module should be used for files + marker="kernel:file"><c>file</c></seealso> module should be used for files opened for bytewise access (<c>{encoding,latin1}</c>) and the - <seealso marker="stdlib:io">io</seealso> module should be used when + <seealso marker="stdlib:io"><c>io</c></seealso> module should be used when accessing files with any other encoding (e.g. <c>{encoding,utf8}</c>).</p> @@ -998,12 +992,12 @@ ok </pre> </section> <section> - <title>Summary of options and environment variables concerning Unicode</title> + <title><marker id="unicode_options_summary"/>Summary of Options</title> <p>The Unicode support is controlled by both command line switches, some standard environment variables and the version of OTP you are using. Most options affect mainly the way Unicode data is displayed, not the actual functionality of the APIs in the standard - libraries. This means that actual Erlang programs usually do not + libraries. This means that Erlang programs usually do not need to concern themselves with these options, they are more for the development environment. An Erlang program can be written so that it works well regardless of the type of system or the Unicode options @@ -1014,14 +1008,14 @@ ok <tag>The <c>LANG</c> and <c>LC_CTYPE</c> environment variables</tag> <item> <p>The language setting in the OS mainly affects the shell. The terminal (i.e.
the group_leader) will operate with <c>{encoding, + terminal (i.e. the group leader) will operate with <c>{encoding, unicode}</c> only if the environment tells it that UTF-8 is allowed. This setting should correspond to the actual terminal you are using.</p> <p>The environment can also affect file name interpretation, if Erlang is started with the <c>+fna</c> flag.</p> <p>You can check the setting of this by calling - <c>io:getopts(group_leader()).</c>, you will get an option list + <c>io:getopts()</c>, which will give you an option list containing <c>{encoding,unicode}</c> or <c>{encoding,latin1}</c>.</p> </item> @@ -1033,7 +1027,7 @@ ok <c>io</c>/<c>io_lib:format</c> with the <c>"~tp"</c> and <c>~tP</c> formatting instructions, as described above.</p> <p>You can check this option by calling io:printable_range/0, - which will in R16 return <c>unicode</c> or <c>latin1</c>. To be + which in R16B will return <c>unicode</c> or <c>latin1</c>. To be compatible with future (expected) extensions to the settings, one should rather use <c>io_lib:printable_list/1</c> to check if a list is printable according to the setting. That function will @@ -1063,27 +1057,12 @@ ok the case. This might be the default behavior in a future release.</p> - <p>The additional {<c>w</c>|<c>i</c>|<c>e</c>} tells the - file module how to handle file names that are not interpretable - in the expected encoding. This affects <c>file:list_dir/1</c> - and <c>file:read_link/1</c>, that will never return such - file names. If <c>+fnuw</c> (or <c>+fnaw</c> in an UTF-8 - environment) is given, invalid file names encountered will result - in a warning being sent to the error logger (and all correctly - encoded names in a directory will be returned by - <c>list_dir/1</c>). If <c>+fnui</c> (or <c>+fnai</c> in an - UTF-8 environment) is given, all wrongly encoded file names are - silently ignored. 
If <c>+fnue</c> (or <c>+fnae</c> in an UTF-8 - environment) is given, directories containing wrongly encoded - file names will result in an error tuple being returned from - <c>file:list_dir/1</c>. Note that <c>file:read_link/1</c> will always - return an error if the link points to an invalid file name.</p> - <p>The file name translation mode can be read with the <c>file:native_name_encoding/0</c> function, which returns <c>latin1</c> (meaning bytewise encoding) or <c>utf8</c>.</p> </item> - <tag><seealso marker="stdlib:epp#default_encoding/0">epp:default_encoding()</seealso></tag> + <tag><seealso marker="stdlib:epp#default_encoding/0"> + <c>epp:default_encoding/0</c></seealso></tag> <item> <p>This function returns the default encoding for Erlang source files (if no encoding comment is present) in the currently @@ -1092,31 +1071,31 @@ ok <c>utf8</c>.</p> <p>The encoding of each file can be specified using comments as described in - <seealso marker="stdlib:epp#encoding">epp(3)</seealso>.</p> + <seealso marker="stdlib:epp#encoding"><c>epp(3)</c></seealso>.</p> </item> - <tag><seealso marker="stdlib:io#setopts/1">io:setopts</seealso> and the <c>-oldshell</c>/<c>-noshell</c> flags.</tag> + <tag><seealso marker="stdlib:io#setopts/1"><c>io:setopts/</c>{<c>1</c>,<c>2</c>}</seealso> and the <c>-oldshell</c>/<c>-noshell</c> flags.</tag> <item> <p>When Erlang is started with <c>-oldshell</c> or - <c>-noshell</c>, the io_server for <c>standard_io</c> is default + <c>-noshell</c>, the I/O-server for <c>standard_io</c> is default set to bytewise encoding, while an interactive shell defaults to what the environment variables says.</p> <p>With the <c>io:setopts/2</c> function you can set the - encoding of a file or other io_server. This can also be set when + encoding of a file or other I/O-server. This can also be set when opening a file. 
Setting the terminal (or other <c>standard_io</c> server) unconditionally to the option - <c>[{encoding,utf8}]</c> will for example make UTF-8 encoded characters - be written to the device regardless of how Erlang was started or + <c>{encoding,utf8}</c> will for example make UTF-8 encoded characters + being written to the device regardless of how Erlang was started or the user's environment.</p> <p>Opening files with <c>encoding</c> option is convenient when writing or reading text files in a known encoding.</p> - <p>You can retrieve the <c>encoding</c> setting for an io_server + <p>You can retrieve the <c>encoding</c> setting for an I/O-server using <seealso - marker="stdlib:io#getopts/1">io:getopts</seealso>.</p> + marker="stdlib:io#getopts/1"><c>io:getopts()</c></seealso>.</p> </item> </taglist> </section> <section> - <title>Unicode Recipes</title> + <title>Recipes</title> <p>When starting with Unicode, one often stumbles over some common issues. I try to outline some methods of dealing with Unicode data in this section.</p> @@ -1129,7 +1108,7 @@ ok on encoding) is not part of the actual text. This code outlines how to open a file which is believed to have a BOM and set the file's encoding and position for further sequential reading - (preferably using the <seealso marker="stdlib:io">io</seealso> + (preferably using the <seealso marker="stdlib:io"><c>io</c></seealso> module). Note that error handling is omitted from the code:</p> <code> open_bom_file_for_reading(File) -> @@ -1140,8 +1119,15 @@ open_bom_file_for_reading(File) -> io:setopts(F,[{encoding,Type}]), {ok,F}. </code> -<p>The <c>unicode:bom_to_encoding/1</c> function identifies the encoding from a binary of at least four bytes. It returns, along with an term suitable for setting the encoding of the file, the actual length of the BOM, so that the file position can be set accordingly.
Note that <c>file:position/2</c> always works on byte-offsets, so that the actual byte-length of the BOM is needed.</p> -<p>To open a file for writing and putting the BOM first is even simpler:</p> + <p>The <c>unicode:bom_to_encoding/1</c> function identifies the + encoding from a binary of at least four bytes. It returns, along + with an term suitable for setting the encoding of the file, the + actual length of the BOM, so that the file position can be set + accordingly. Note that <c>file:position/2</c> always works on + byte-offsets, so that the actual byte-length of the BOM is + needed.</p> + <p>To open a file for writing and putting the BOM first is even + simpler:</p> <code> open_bom_file_for_writing(File,Encoding) -> {ok,F} = file:open(File,[write,binary]), @@ -1149,21 +1135,53 @@ open_bom_file_for_writing(File,Encoding) -> io:setopts(F,[{encoding,Encoding}]), {ok,F}. </code> -<p>In both cases the file is then best processed using the <c>io</c> module, as the functions in <c>io</c> can handle code points beyond the ISO-latin-1 range.</p> -</section> -<section> -<title>Formatted Input and Output</title> -<p>When reading and writing to Unicode-aware entities, like the User or a file opened for Unicode translation, you will probably want to format text strings using the functions in <seealso marker="stdlib:io">io</seealso> or <seealso marker="stdlib:io_lib">io_lib</seealso>. For backward compatibility reasons, these functions do not accept just any list as a string, but require a special <em>translation modifier</em> when working with Unicode texts. The modifier is <c>t</c>. 
When applied to the <c>s</c> control character in a formatting string, it accepts all Unicode code points and expect binaries to be in UTF-8:</p> -<pre> + <p>In both cases the file is then best processed using the + <c>io</c> module, as the functions in <c>io</c> can handle code + points beyond the ISO-latin-1 range.</p> + </section> + <section> + <title>Formatted I/O</title> + <p>When reading and writing to Unicode-aware entities, like the + User or a file opened for Unicode translation, you will probably + want to format text strings using the functions in <seealso + marker="stdlib:io"><c>io</c></seealso> or <seealso + marker="stdlib:io_lib"><c>io_lib</c></seealso>. For backward + compatibility reasons, these functions do not accept just any list + as a string, but require a special <em>translation modifier</em> + when working with Unicode texts. The modifier is <c>t</c>. When + applied to the <c>s</c> control character in a formatting string, + it accepts all Unicode code points and expect binaries to be in + UTF-8:</p> + <pre> 1> <input>io:format("~ts~n",[<<"åäö"/utf8>>]).</input> åäö ok 2> <input>io:format("~s~n",[<<"åäö"/utf8>>]).</input> åäö ok</pre> -<p>Obviously the second <c>io:format/2</c> gives undesired output because the UTF-8 binary is not in latin1. For backward compatibility, the non prefixed <c>s</c> control character expects bytewise encoded ISO-latin-1 characters in binaries and lists containing only code points < 256.</p> -<p>As long as the data is always lists, the <c>t</c> modifier can be used for any string, but when binary data is involved, care must be taken to make the right choice of formatting characters. 
A bytewise encoded binary will also be interpreted as a string and printed even when using <c>~ts</c>, but it might be mistaken for a valid UTF-8 string and one should therefore avoid using the <c>~ts</c> control if the binary contains bytewise encoded characters and not UTF-8.</p> -<p>The function <c>format/2</c> in <c>io_lib</c> behaves similarly. This function is defined to return a deep list of characters and the output could easily be converted to binary data for outputting on a device of any kind by a simple <c>erlang:list_to_binary/1</c>. When the translation modifier is used, the list can however contain characters that cannot be stored in one byte. The call to <c>erlang:list_to_binary/1</c> will in that case fail. However, if the I/O server you want to communicate with is Unicode-aware, the list returned can still be used directly:</p> + <p>Obviously the second <c>io:format/2</c> gives undesired output + because the UTF-8 binary is not in latin1. For backward + compatibility, the non prefixed <c>s</c> control character expects + bytewise encoded ISO-latin-1 characters in binaries and lists + containing only code points < 256.</p> + <p>As long as the data is always lists, the <c>t</c> modifier can + be used for any string, but when binary data is involved, care + must be taken to make the right choice of formatting characters. A + bytewise encoded binary will also be interpreted as a string and + printed even when using <c>~ts</c>, but it might be mistaken for a + valid UTF-8 string and one should therefore avoid using the + <c>~ts</c> control if the binary contains bytewise encoded + characters and not UTF-8.</p> + <p>The function <c>format/2</c> in <c>io_lib</c> behaves + similarly. This function is defined to return a deep list of + characters and the output could easily be converted to binary data + for outputting on a device of any kind by a simple + <c>erlang:list_to_binary/1</c>. 
When the translation modifier is + used, the list can however contain characters that cannot be + stored in one byte. The call to <c>erlang:list_to_binary/1</c> + will in that case fail. However, if the I/O server you want to + communicate with is Unicode-aware, the list returned can still be + used directly:</p> <pre> $ <input>erl +pc unicode</input> Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false] @@ -1174,26 +1192,38 @@ Eshell V5.10.1 (abort with ^G) 2> <input>io:put_chars(io_lib:format("~ts~n", ["Γιούνικοντ"])).</input> Γιούνικοντ ok</pre> -<p>The Unicode string is returned as a Unicode list, which is -recognized as such since the Erlang shell uses the Unicode encoding -(and is started with all Unicode characters considered printable). The -Unicode list is valid input to the -<seealso marker="stdlib:io#put_chars/2">io:put_chars/2</seealso> function, so -data can be output on any Unicode capable device. If the device is a -terminal, characters will be output in the <c>\x{</c>H ...<c>}</c> -format if encoding is <c>latin1</c> otherwise in UTF-8 (for the -non-interactive terminal - "oldshell" or "noshell") or whatever is -suitable to show the character properly (for an interactive terminal - -the regular shell). The bottom line is that you can always send -Unicode data to the <c>standard_io</c> device. Files will however only -accept Unicode code points beyond ISO-latin-1 if <c>encoding</c> is set -to something else than <c>latin1</c>.</p> -</section> -<section> -<title>Heuristic Identification of UTF-8</title> -<p>While it is strongly encouraged that the actual encoding of characters in binary data is known prior to processing, that is not always possible. 
On a typical Linux® system, there is a mix of UTF-8 and ISO-latin-1 text files and there are seldom any BOM's in the files to identify them.</p> -<p>UTF-8 is designed in such a way that ISO-latin-1 characters with numbers beyond the 7-bit ASCII range are seldom considered valid when decoded as UTF-8. Therefore one can usually use heuristics to determine if a file is in UTF-8 or if it is encoded in ISO-latin-1 (one byte per character) encoding. The <c>unicode</c> module can be used to determine if data can be interpreted as UTF-8:</p> -<code> + <p>The Unicode string is returned as a Unicode list, which is + recognized as such since the Erlang shell uses the Unicode + encoding (and is started with all Unicode characters considered + printable). The Unicode list is valid input to the <seealso + marker="stdlib:io#put_chars/2"><c>io:put_chars/2</c></seealso> function, + so data can be output on any Unicode capable device. If the device + is a terminal, characters will be output in the <c>\x{</c>H + ...<c>}</c> format if encoding is <c>latin1</c> otherwise in UTF-8 + (for the non-interactive terminal - "oldshell" or "noshell") or + whatever is suitable to show the character properly (for an + interactive terminal - the regular shell). The bottom line is that + you can always send Unicode data to the <c>standard_io</c> + device. Files will however only accept Unicode code points beyond + ISO-latin-1 if <c>encoding</c> is set to something else than + <c>latin1</c>.</p> + </section> + <section> + <title>Heuristic Identification of UTF-8</title> + <p>While it is + strongly encouraged that the actual encoding of characters in + binary data is known prior to processing, that is not always + possible. 
On a typical Linux system, there is a mix of UTF-8 + and ISO-latin-1 text files and there are seldom any BOM's in the + files to identify them.</p> + <p>UTF-8 is designed in such a way that ISO-latin-1 characters + with numbers beyond the 7-bit ASCII range are seldom considered + valid when decoded as UTF-8. Therefore one can usually use + heuristics to determine if a file is in UTF-8 or if it is encoded + in ISO-latin-1 (one byte per character) encoding. The + <c>unicode</c> module can be used to determine if data can be + interpreted as UTF-8:</p> + <code> heuristic_encoding_bin(Bin) when is_binary(Bin) -> case unicode:characters_to_binary(Bin,utf8,utf8) of Bin -> @@ -1201,9 +1231,16 @@ heuristic_encoding_bin(Bin) when is_binary(Bin) -> _ -> latin1 end. -</code> -<p>If one does not have a complete binary of the file content, one could instead chunk through the file and check part by part. The return-tuple <c>{incomplete,Decoded,Rest}</c> from <c>unicode:characters_to_binary/{1,2,3}</c> comes in handy. The incomplete rest from one chunk of data read from the file is prepended to the next chunk and we therefore circumvent the problem of character boundaries when reading chunks of bytes in UTF-8 encoding:</p> -<code> + </code> + <p>If one does not have a complete binary of the file content, one + could instead chunk through the file and check part by part. The + return-tuple <c>{incomplete,Decoded,Rest}</c> from + <c>unicode:characters_to_binary/{1,2,3}</c> comes in handy. The + incomplete rest from one chunk of data read from the file is + prepended to the next chunk and we therefore circumvent the + problem of character boundaries when reading chunks of bytes in + UTF-8 encoding:</p> + <code> heuristic_encoding_file(FileName) -> {ok,F} = file:open(FileName,[read,binary]), loop_through_file(F,<<>>,file:read(F,1024)). 
@@ -1221,9 +1258,12 @@ loop_through_file(F,Acc,{ok,Bin}) when is_binary(Bin) -> Res when is_binary(Res) -> loop_through_file(F,<<>>,file:read(F,1024)) end. -</code> -<p>Another option is to try to read the whole file in utf8 encoding and see if it fails. Here we need to read the file using <c>io:get_chars/3</c>, as we have to succeed in reading characters with a code point over 255:</p> -<code> + </code> + <p>Another option is to try to read the whole file in UTF-8 + encoding and see if it fails. Here we need to read the file using + <c>io:get_chars/3</c>, as we have to succeed in reading characters + with a code point over 255:</p> + <code> heuristic_encoding_file2(FileName) -> {ok,F} = file:open(FileName,[read,binary,{encoding,utf8}]), loop_through_file2(F,io:get_chars(F,'',1024)). @@ -1234,42 +1274,42 @@ loop_through_file2(_,{error,_Err}) -> latin1; loop_through_file2(F,Bin) when is_binary(Bin) -> loop_through_file2(F,io:get_chars(F,'',1024)). -</code> -</section> -<section> - <title>When you get a list of UTF-8 bytes</title> - <p>For various reasons, you may find yourself having a list of UTF-8 - bytes. This is not a regular string of Unicode characters as each - element in the list does not contain one character. Instead you get - the "raw" UTF-8 encoding that you have in binaries. This is easily - converted to a proper Unicode string by first converting byte per - byte into a binary and then converting the binary of UTF-8 encoded - characters back to a Unicode string:</p> -<code> + </code> + </section> + <section> + <title>Lists of UTF-8 Bytes</title> + <p>For various reasons, you may find yourself having a list of + UTF-8 bytes. This is not a regular string of Unicode characters as + each element in the list does not contain one character. Instead + you get the "raw" UTF-8 encoding that you have in binaries. 
This + is easily converted to a proper Unicode string by first converting + byte per byte into a binary and then converting the binary of + UTF-8 encoded characters back to a Unicode string:</p> + <code> utf8_list_to_string(StrangeList) -> unicode:characters_to_list(list_to_binary(StrangeList)). -</code> -</section> -<section> - <title>Double UTF-8 encoding</title> - <p>When working with binaries, you may get the horrible "double - UTF-8 encoding", where strange characters are encoded in your - binaries or files that you did not expect. What you may have got, is - an UTF-8 encoded binary that is for the second time encoded as - UTF-8. A common situation is where you read a file, byte by byte, - but the actual content is already UTF-8. If you then convert the - bytes to UTF-8, using the i.e. the <c>unicode</c> module or by - writing to a file opened with the <c>{encoding,utf8}</c> option. You - will have each <i>byte</i> in the in the input file encoded as - UTF-8, not each character of the original text (one character may - have been encoded in several bytes). There is no real remedy for - this other than being very sure of which data is actually encoded - in which format, and never convert UTF-8 data (possibly read byte by - byte from a file) into UTF-8 again.</p> - <p>The by far most common situation where this happens, is when you - get lists of UTF-8 instead of proper Unicode strings, and then convert - them to UTF-8 in a binary or on a file:</p> -<code> + </code> + </section> + <section> + <title>Double UTF-8 Encoding</title> + <p>When working with binaries, you may get the horrible "double + UTF-8 encoding", where strange characters are encoded in your + binaries or files that you did not expect. What you may have got, + is a UTF-8 encoded binary that is for the second time encoded as + UTF-8. A common situation is where you read a file, byte by byte, + but the actual content is already UTF-8. If you then convert the + bytes to UTF-8, using i.e. 
the <c>unicode</c> module or by + writing to a file opened with the <c>{encoding,utf8}</c> + option. You will have each <i>byte</i> in the in the input file + encoded as UTF-8, not each character of the original text (one + character may have been encoded in several bytes). There is no + real remedy for this other than being very sure of which data is + actually encoded in which format, and never convert UTF-8 data + (possibly read byte by byte from a file) into UTF-8 again.</p> + <p>The by far most common situation where this happens, is when + you get lists of UTF-8 instead of proper Unicode strings, and then + convert them to UTF-8 in a binary or on a file:</p> + <code> wrong_thing_to_do() -> {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"), MyList = binary_to_list(Bin), %% Wrong! It is an utf8 binary! @@ -1278,10 +1318,11 @@ loop_through_file2(F,Bin) when is_binary(Bin) -> %% bytes in a list! file:close(C). %% The file catastrophe.txt contains more or less unreadable %% garbage! -</code> - <p>Make very sure you know what a binary contains before converting - it to a string. If no other option exists, try heuristics:</p> -<code> + </code> + <p>Make very sure you know what a binary contains before + converting it to a string. If no other option exists, try + heuristics:</p> + <code> if_you_can_not_know() -> {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"), MyList = case unicode:characters_to_list(Bin) of @@ -1294,7 +1335,7 @@ loop_through_file2(F,Bin) when is_binary(Bin) -> {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]), io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets! file:close(G). %% The file contains valid UTF-8 encoded Unicode characters! -</code> -</section> + </code> + </section> </section> </chapter> |
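The "Double UTF-8 Encoding" section touched by this patch can be illustrated with a minimal sketch. The module name `double_utf8_demo` and the function `demo/0` are hypothetical and are not part of OTP or of this patch; only `unicode:characters_to_binary/3` and `binary_to_list/1` are existing library calls. The sketch encodes the single character "ö" once, then treats the resulting UTF-8 bytes as if they were characters and encodes them a second time:

```erlang
-module(double_utf8_demo).
-export([demo/0]).

%% Hypothetical illustration module; not part of OTP or of this patch.
demo() ->
    %% "ö" (code point 246) as a proper Unicode string, encoded once.
    Once = unicode:characters_to_binary([246], unicode, utf8),
    %% Wrong step: the UTF-8 bytes become a list of integers that
    %% looks like a string but holds bytes, not characters.
    Bytes = binary_to_list(Once),
    %% Encoding that list again treats every byte as a character,
    %% which produces the double UTF-8 encoding.
    Twice = unicode:characters_to_binary(Bytes, latin1, utf8),
    {Once, Twice}.
```

Calling `double_utf8_demo:demo()` should return `{<<195,182>>,<<195,131,194,182>>}`: the two bytes of the correctly encoded "ö" have each been encoded again as a separate character, which is the garbage described in the section.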