Diffstat (limited to 'lib/stdlib/doc/src/unicode_usage.xml')
-rw-r--r-- | lib/stdlib/doc/src/unicode_usage.xml | 110 |
1 files changed, 56 insertions, 54 deletions
diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml index 1f64b38554..c4cb193b07 100644 --- a/lib/stdlib/doc/src/unicode_usage.xml +++ b/lib/stdlib/doc/src/unicode_usage.xml @@ -1,24 +1,25 @@ -<?xml version="1.0" encoding="utf8" ?> +<?xml version="1.0" encoding="utf-8" ?> <!DOCTYPE chapter SYSTEM "chapter.dtd"> <chapter> <header> <copyright> <year>1999</year> - <year>2013</year> + <year>2014</year> <holder>Ericsson AB. All Rights Reserved.</holder> </copyright> <legalnotice> - The contents of this file are subject to the Erlang Public License, - Version 1.1, (the "License"); you may not use this file except in - compliance with the License. You should have received a copy of the - Erlang Public License along with this software. If not, it can be - retrieved online at http://www.erlang.org/. - - Software distributed under the License is distributed on an "AS IS" - basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See - the License for the specific language governing rights and limitations - under the License. + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. </legalnotice> @@ -41,21 +42,17 @@ future.</p> <p>The functionality described in EEP10 was implemented in Erlang/OTP - as of R13A, but that was by no means the end of it. In R14B01 support + R13A, but that was by no means the end of it. 
In Erlang/OTP R14B01 support for Unicode file names was added, although it was in no way complete and was by default disabled on platforms where no guarantee was given - for the file name encoding. With R16A came support for UTF-8 encoded + for the file name encoding. With Erlang/OTP R16A came support for UTF-8 encoded source code, among with enhancements to many of the applications to support both Unicode encoded file names as well as support for UTF-8 encoded files in several circumstances. Most notable is the support for UTF-8 in files read by <c>file:consult/1</c>, release handler support for UTF-8 and more support for Unicode character sets in the - I/O-system.</p> - - <p>In R17, the encoding default for Erlang source files will be - switched to UTF-8 and in R18 Erlang will support atoms in the full - Unicode range, meaning full Unicode function and module - names</p> + I/O-system. In Erlang/OTP 17.0, the encoding default for Erlang source files was + switched to UTF-8.</p> <p>This guide outlines the current Unicode support and gives a couple of recipes for working with Unicode data.</p> @@ -222,7 +219,7 @@ <tag>Representation</tag> <item>To handle Unicode characters in Erlang, we have to have a common representation both in lists and binaries. The EEP (10) and - the subsequent initial implementation in R13A settled a standard + the subsequent initial implementation in Erlang/OTP R13A settled a standard representation of Unicode characters in Erlang.</item> <tag>Manipulation</tag> <item>The Unicode characters need to be processed by the Erlang @@ -274,9 +271,9 @@ (<c>+fnu</c>) on platforms where this is not the default.</item> <tag>Source code encoding</tag> <item>When it comes to the Erlang source code, there is support - for the UTF-8 encoding and bytewise encoding. The default in R16B - is bytewise (or latin1) encoding. You can control the encoding by - a comment like: + for the UTF-8 encoding and bytewise encoding. 
The default in + Erlang/OTP R16B was bytewise (or latin1) encoding; in Erlang/OTP 17.0 + it was changed to UTF-8. You can control the encoding by a comment like: <code> %% -*- coding: utf-8 -*- </code> @@ -289,8 +286,8 @@ <tag>The language</tag> <item>Having the source code in UTF-8 also allows you to write string literals containing Unicode characters with code points > - 255, although atoms, module names and function names will be - restricted to the ISO-Latin-1 range until the R18 release. Binary + 255, although atoms, module names and function names are + restricted to the ISO-Latin-1 range. Binary literals where you use the <c>/utf8</c> type, can also be expressed using Unicode characters > 255. Having module names using characters other than 7-bit ASCII can cause trouble on @@ -304,7 +301,7 @@ <section> <title>Standard Unicode Representation</title> <p>In Erlang, strings are actually lists of integers. A string was - up until R13 defined to be encoded in the ISO-latin-1 (ISO8859-1) + up until Erlang/OTP R13 defined to be encoded in the ISO-latin-1 (ISO8859-1) character set, which is, code point by code point, a sub-range of the Unicode character set.</p> <p>The standard list encoding for strings was therefore easily @@ -321,7 +318,7 @@ encoding has to be decided upon and the string should be converted to a binary in the preferred encoding using <c>unicode:characters_to_binary/{1,2,3}</c>. Strings are not - generally lists of bytes, as they were before R13. They are lists of + generally lists of bytes, as they were before Erlang/OTP R13. They are lists of characters. Characters are not generally bytes, they are Unicode code points.</p> @@ -385,8 +382,7 @@ external_charlist() = maybe_improper_list(char() | using characters from the ISO-latin-1 character set and atoms are restricted to the same ISO-latin-1 range. These restrictions in the language are of course independent of the encoding of the source - file. 
Erlang/OTP R18 is expected to handle functions named in - Unicode as well as Unicode atoms.</p> + file.</p> <section> <title>Bit-syntax</title> <p>The bit-syntax contains types for coping with binary data in the @@ -447,8 +443,8 @@ Bin4 = <<"Hello"/utf16>>,</code> probably will not appreciate). Another way is to keep it backwards compatible so that only the ISO-Latin-1 character set is used to detect a string. A third way would be to let the user decide - exactly what Unicode ranges are to be viewed as characters. In - R16B you can select either the whole Unicode range or the + exactly what Unicode ranges are to be viewed as characters. Since + Erlang/OTP R16B you can select either the whole Unicode range or the ISO-Latin-1 range by supplying the startup flag <c>+pc </c><i>Range</i>, where <i>Range</i> is either <c>latin1</c> or <c>unicode</c>. For backwards compatibility, the default is @@ -662,11 +658,14 @@ Eshell V5.10.1 (abort with ^G) containing characters having code points between 128 and 255 may be named either as plain ISO-latin-1 or using UTF-8 encoding. As no consistency is enforced, the Erlang VM can do no consistent - translation of all file names. If the VM would automatically - select encoding based on heuristics, one could get unexpected - behavior on these systems. By default, Erlang starts in "latin1" - file name mode on such systems, meaning bytewise encoding in file - names. This allows for list representation of all file names in + translation of all file names.</p> + + <p>By default on such systems, Erlang starts in <c>utf8</c> file + name mode if the terminal supports UTF-8, otherwise in + <c>latin1</c> mode.</p> + + <p>In the <c>latin1</c> mode, file names are bytewise endcoded. 
+ This allows for list representation of all file names in the system, but, for example, a file named "Östersund.txt", will appear in <c>file:list_dir/1</c> as either "Östersund.txt" (if the file name was encoded in bytewise ISO-Latin-1 by the program @@ -682,7 +681,7 @@ Eshell V5.10.1 (abort with ^G) </item> </taglist> - <p>The Unicode file naming support was introduced with OTP release + <p>The Unicode file naming support was introduced with Erlang/OTP R14B01. A VM operating in Unicode file name translation mode can work with files having names in any language or character set (as long as it is supported by the underlying OS and file system). The @@ -706,7 +705,7 @@ Eshell V5.10.1 (abort with ^G) problem even if it uses transparent file naming. Very few systems have mixed file name encodings. A consistent UTF-8 named system will work perfectly in Unicode file name mode. It was still however - considered experimental in R14B01 and is still not the default on + considered experimental in Erlang/OTP R14B01 and is still not the default on such systems. Unicode file name translation is turned on with the <c>+fnu</c> switch to the On Linux, a VM started without explicitly stating the file name translation mode will default to <c>latin1</c> @@ -752,9 +751,9 @@ Eshell V5.10.1 (abort with ^G) <section> <title>Notes About Raw File Names</title> - + <marker id="notes-about-raw-filenames"/> <p>Raw file names were introduced together with Unicode file name - support in erts-5.8.2 (OTP R14B01). The reason "raw file + support in erts-5.8.2 (Erlang/OTP R14B01). The reason "raw file names" was introduced in the system was to be able to consistently represent file names given in different encodings on the same system.
Having the VM automatically translate a file name @@ -795,10 +794,10 @@ Eshell V5.10.1 (abort with ^G) the argument as a binary.</p> <p>To force Unicode file name translation mode on systems where this - is not the default was considered experimental in OTP R14B01 due to + is not the default was considered experimental in Erlang/OTP R14B01 due to the fact that the initial implementation did not ignore wrongly encoded file names, so that raw file names could spread unexpectedly - throughout the system. Beginning with R16B, the wrongly encoded file + throughout the system. Beginning with Erlang/OTP R16B, the wrongly encoded file names are only retrieved by special functions (e.g. <c>file:list_dir_all/1</c>), so the impact on existing code is much lower, why it is now supported. Unicode file name translation @@ -845,14 +844,16 @@ Eshell V5.10.1 (abort with ^G) </section> <section> <title>Unicode in Environment and Parameters</title> + <marker id="unicode_in_environment_and_parameters"/> <p>Environment variables and their interpretation is handled much in the same way as file names. If Unicode file names are enabled, environment variables as well as parameters to the Erlang VM are expected to be in Unicode.</p> <p>If Unicode file names are enabled, the calls to <seealso marker="kernel:os#getenv/0"><c>os:getenv/0</c></seealso>, - <seealso marker="kernel:os#getenv/1"><c>os:getenv/1</c></seealso> and - <seealso marker="kernel:os#putenv/2"><c>os:putenv/2</c></seealso> + <seealso marker="kernel:os#getenv/1"><c>os:getenv/1</c></seealso>, + <seealso marker="kernel:os#putenv/2"><c>os:putenv/2</c></seealso> and + <seealso marker="kernel:os#unsetenv/1"><c>os:unsetenv/1</c></seealso> will handle Unicode strings. On Unix-like platforms, the built-in functions will translate environment variables in UTF-8 to/from Unicode strings, possibly with code points > 255. 
On Windows the @@ -993,7 +994,8 @@ ok </pre> </section> <section> - <title><marker id="unicode_options_summary"/>Summary of Options</title> + <title>Summary of Options</title> + <marker id="unicode_options_summary"/> <p>The Unicode support is controlled by both command line switches, some standard environment variables and the version of OTP you are using. Most options affect mainly the way Unicode data is displayed, @@ -1014,7 +1016,8 @@ ok allowed. This setting should correspond to the actual terminal you are using.</p> <p>The environment can also affect file name interpretation, if - Erlang is started with the <c>+fna</c> flag.</p> + Erlang is started with the <c>+fna</c> flag (which is default from + Erlang/OTP 17.0).</p> <p>You can check the setting of this by calling <c>io:getopts()</c>, which will give you an option list containing <c>{encoding,unicode}</c> or @@ -1028,7 +1031,7 @@ ok <c>io</c>/<c>io_lib:format</c> with the <c>"~tp"</c> and <c>~tP</c> formatting instructions, as described above.</p> <p>You can check this option by calling io:printable_range/0, - which in R16B will return <c>unicode</c> or <c>latin1</c>. To be + which will return <c>unicode</c> or <c>latin1</c>. To be compatible with future (expected) extensions to the settings, one should rather use <c>io_lib:printable_list/1</c> to check if a list is printable according to the setting. That function will @@ -1046,8 +1049,7 @@ ok > 255.</p> <p><c>+fnl</c> means bytewise interpretation of file names, which was the usual way to represent ISO-Latin-1 file names before - UTF-8 file naming got widespread. This is the default on all - Unix-like operating systems except MacOS X.</p> + UTF-8 file naming got widespread.</p> <p><c>+fnu</c> means that file names are encoded in UTF-8, which is nowadays the common scheme (although not enforced).</p> <p><c>+fna</c> means that you automatically select between @@ -1055,8 +1057,8 @@ ok <c>LC_CTYPE</c> environment variables. 
This is optimistic heuristics indeed, nothing enforces a user to have a terminal with the same encoding as the file system, but usually, this is - the case. This might be the default behavior in a future - release.</p> + the case. This is the default on all Unix-like operating + systems except MacOS X.</p> <p>The file name translation mode can be read with the <c>file:native_name_encoding/0</c> function, which returns @@ -1067,8 +1069,8 @@ ok <item> <p>This function returns the default encoding for Erlang source files (if no encoding comment is present) in the currently - running release. For R16 this returns <c>latin1</c> (meaning - bytewise encoding). In R17 and forward it is expected to return + running release. In Erlang/OTP R16B <c>latin1</c> was returned (meaning + bytewise encoding). In Erlang/OTP 17.0 and forward it returns <c>utf8</c>.</p> <p>The encoding of each file can be specified using comments as described in |