-rw-r--r-- | erts/doc/src/erl.xml | 14 |
-rw-r--r-- | erts/emulator/sys/common/erl_sys_common_misc.c | 2 |
-rw-r--r-- | lib/kernel/doc/src/file.xml | 102 |
-rw-r--r-- | lib/stdlib/doc/src/unicode_usage.xml | 35 |
4 files changed, 81 insertions, 72 deletions
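For orientation: the only code change in the diff below is in erl_sys_common_misc.c, where the non-Windows, non-MacOS X default goes from ERL_FILENAME_LATIN1 to ERL_FILENAME_UNKNOWN, and the documentation hunks describe the resulting behaviour as choosing the file name mode from the OS locale settings (the +fna behaviour). The following is a minimal sketch of what such a locale check could look like. It is not the ERTS implementation: the names fn_encoding, pick_encoding and pick_encoding_from_env are hypothetical, and the sketch assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and that a UTF-8 locale can be recognised by the substring "UTF-8" or "utf8".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names for this sketch; these are not the ERTS constants. */
enum fn_encoding { FN_LATIN1, FN_UTF8 };

/*
 * Pick a file name mode from the three locale values.
 * Assumption: the first value that is set and non-empty wins,
 * in the order LC_ALL, LC_CTYPE, LANG.
 */
enum fn_encoding pick_encoding(const char *lc_all,
                               const char *lc_ctype,
                               const char *lang)
{
    const char *loc = NULL;

    if (lc_all != NULL && *lc_all != '\0') {
        loc = lc_all;
    } else if (lc_ctype != NULL && *lc_ctype != '\0') {
        loc = lc_ctype;
    } else if (lang != NULL && *lang != '\0') {
        loc = lang;
    }

    /* Assumption: a UTF-8 locale name contains "UTF-8" or "utf8". */
    if (loc != NULL &&
        (strstr(loc, "UTF-8") != NULL || strstr(loc, "utf8") != NULL)) {
        return FN_UTF8;
    }

    /* No locale set, or not a UTF-8 locale: fall back to bytewise names. */
    return FN_LATIN1;
}

/* Convenience wrapper that reads the values from the process environment. */
enum fn_encoding pick_encoding_from_env(void)
{
    return pick_encoding(getenv("LC_ALL"), getenv("LC_CTYPE"), getenv("LANG"));
}
```

Keeping the decision in a function that takes the three values as arguments, with the getenv() calls in a thin wrapper, makes the heuristic easy to check without touching the process environment. For example, pick_encoding(NULL, NULL, "en_US.UTF-8") selects the UTF-8 mode, while pick_encoding("C", NULL, "en_US.UTF-8") selects the latin1 mode because LC_ALL takes precedence in this sketch.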
diff --git a/erts/doc/src/erl.xml b/erts/doc/src/erl.xml index e737727941..6428a24209 100644 --- a/erts/doc/src/erl.xml +++ b/erts/doc/src/erl.xml @@ -535,12 +535,15 @@ </item> <tag><marker id="file_name_encoding"></marker><c><![CDATA[+fnl]]></c></tag> <item> - <p>The VM works with file names as if they are encoded using the ISO-latin-1 encoding, disallowing Unicode characters with codepoints beyond 255. This is default on operating systems that have transparent file naming, i.e. all Unixes except MacOSX.</p> + <p>The VM works with file names as if they are encoded using the ISO-latin-1 encoding, disallowing Unicode characters with codepoints beyond 255.</p> <p>See <seealso marker="stdlib:unicode_usage#unicode_file_names">STDLIB User's Guide</seealso> for more infomation about unicode file names.</p> </item> <tag><c><![CDATA[+fnu[{w|i|e}]]]></c></tag> <item> - <p>The VM works with file names as if they are encoded using UTF-8 (or some other system specific Unicode encoding). This is the default on operating systems that enforce Unicode encoding, i.e. Windows and MacOSX.</p> + <p>The VM works with file names as if they are encoded using + UTF-8 (or some other system specific Unicode encoding). This + is the default on operating systems that enforce Unicode + encoding, i.e. Windows and MacOS X.</p> <p>The <c>+fnu</c> switch can be followed by <c>w</c>, <c>i</c>, or <c>e</c> to control the way wrongly encoded file names are to be reported. 
<c>w</c> means that a warning is @@ -556,7 +559,12 @@ </item> <tag><c><![CDATA[+fna[{w|i|e}]]]></c></tag> <item> - <p>Selection between <c>+fnl</c> and <c>+fnu</c> is done based on the current locale settings in the OS, meaning that if you have set your terminal for UTF-8 encoding, the filesystem is expected to use the same encoding for file names (use with care).</p> + <p>Selection between <c>+fnl</c> and <c>+fnu</c> is done based + on the current locale settings in the OS, meaning that if you + have set your terminal for UTF-8 encoding, the filesystem is + expected to use the same encoding for file names. This is + default on all operating systems except MacOS X and + Windows.</p> <p>The <c>+fna</c> switch can be followed by <c>w</c>, <c>i</c>, or <c>e</c>. This will have effect if the locale settings cause the behavior of <c>+fnu</c> to be selected. diff --git a/erts/emulator/sys/common/erl_sys_common_misc.c b/erts/emulator/sys/common/erl_sys_common_misc.c index 31ad3b82d5..e3ba741058 100644 --- a/erts/emulator/sys/common/erl_sys_common_misc.c +++ b/erts/emulator/sys/common/erl_sys_common_misc.c @@ -52,7 +52,7 @@ static int filename_warning = ERL_FILENAME_WARNING_WARNING; /* Default unicode on windows and MacOS X */ static int user_filename_encoding = ERL_FILENAME_UTF8; #else -static int user_filename_encoding = ERL_FILENAME_LATIN1; +static int user_filename_encoding = ERL_FILENAME_UNKNOWN; #endif /* This controls the heuristic in printing characters in shell and w/ io:format("~tp", ...) etc. */ diff --git a/lib/kernel/doc/src/file.xml b/lib/kernel/doc/src/file.xml index 0a4dd3ba47..305b9fed2b 100644 --- a/lib/kernel/doc/src/file.xml +++ b/lib/kernel/doc/src/file.xml @@ -37,54 +37,48 @@ the file operations. See the command line flag <c>+A</c> in <seealso marker="erts:erl">erl(1)</seealso>.</p> - <p>The Erlang VM supports file names in Unicode to a limited - extent. 
Depending on how the VM is started (with the parameter - <c>+fnu</c> or <c>+fnl</c>), file names given can contain - characters > 255 and the VM system will convert file names - back and forth to the native file name encoding.</p> + <p>With regard to file name encoding, the Erlang VM can operate in + two modes. The current mode can be queried using the <seealso + marker="#native_name_encoding">native_name_encoding/0</seealso> + function. It returns either <c>latin1</c> or <c>utf8</c>.</p> - <p>The default behavior for Unicode character translation depends - on to what extent the underlying OS/filesystem enforces consistent - naming. On OSes where all file names are ensured to be in one or - another encoding, Unicode is the default (currently this holds for - Windows and MacOSX). On OSes with completely transparent file - naming (i.e. all Unixes except MacOSX), ISO-latin-1 file naming is - the default. The reason for the ISO-latin-1 default is that - file names are not guaranteed to be possible to interpret according to - the Unicode encoding expected (i.e. UTF-8), and file names that - cannot be decoded will only be accessible by using "raw - file names", in other word file names given as binaries.</p> - - <p>As file names are traditionally not binaries in Erlang, - applications that need to handle raw file names need to be - converted, why the Unicode mode for file names is not default on - systems having completely transparent file naming.</p> + <p>In the <c>latin1</c> mode, the Erlang VM does not change the + encoding of file names. In the <c>utf8</c> mode, file names can + contain Unicode characters greater than 255 and the VM will + convert file names back and forth to the native file name encoding + (usually UTF-8, but UTF-16 on Windows).</p> - <p>Raw file names is a new feature in OTP R14B01, which allows the - user to supply completely uninterpreted file names to the - underlying OS/filesystem. 
They are supplied as binaries, where it - is up to the user to supply a correct encoding for the - environment. The function <c>file:native_name_encoding()</c> can - be used to check what encoding the VM is working in. If the - function returns <c>latin1</c> file names are not in any way - converted to Unicode, if it is <c>utf8</c>, raw file names should - be encoded as UTF-8 if they are to follow the convention of the VM - (and usually the convention of the OS as well). Using raw - file names is useful if you have a filesystem with inconsistent - file naming, where some files are named in UTF-8 encoding while - others are not. A file:list_dir on such mixed file name systems - when the VM is in Unicode file name mode might return file names as - raw binaries as they cannot be interpreted as Unicode - file names. Raw file names can also be used to give UTF-8 encoded - file names even though the VM is not started in Unicode file name - translation mode.</p> + <p>The default mode depends on the operating system. Windows and + MacOS X enforce consistent file name encoding and therefore the + VM uses the <c>utf8</c> mode.</p> + + <p>On operating systems with transparent naming (i.e. all Unix + systems except MacOS X), the default will be <c>utf8</c> if the + terminal supports UTF-8, otherwise <c>latin1</c>. The default may + be overridden using the <c>+fnl</c> (to force <c>latin1</c> mode) + or <c>+fnu</c> (to force <c>utf8</c> mode) when starting <seealso + marker="erts:erl">erl</seealso>.</p> + + <p>On operating systems with transparent naming, files could be + inconsistently named, i.e. some files are encoded in UTF-8 while + others are encoded in (for example) iso-latin1. To be able to + handle file systems with inconsistent naming when running in the + <c>utf8</c> mode, the concept of "raw file names" has been + introduced.</p> + + <p>A raw file name is a file name given as a binary. 
The Erlang VM + will perform no translation of a file name given as a binary on + systems with transparent naming.</p> + + <p>When running in the <c>utf8</c> mode, the + <c>file:list_dir/1</c> and <c>file:read_link/1</c> functions will + never return raw file names. Use the <seealso + marker="#list_dir_all">list_dir_all/1</seealso> and <seealso + marker="#read_link_all">read_link_all/1</seealso> functions to + return all file names including raw file names.</p> + + <p>Also see <seealso marker="stdlib:unicode_usage#notes-about-raw-filenames">Notes about raw file names</seealso>.</p> - <p>Note that on Windows, <c>file:native_name_encoding()</c> - returns <c>utf8</c> per default, which is the format for raw - file names even on Windows, although the underlying OS specific - code works in a limited version of little endian UTF16. As far as - the Erlang programmer is concerned, Windows native Unicode format - is UTF-8...</p> </description> <datatypes> @@ -535,8 +529,8 @@ <name name="list_dir_all" arity="1"/> <fsummary>List all files in a directory</fsummary> <desc> - <p>Lists all the files in a directory, including files with - "raw" names. + <p><marker id="list_dir_all"/>Lists all the files in a directory, + including files with "raw" names. Returns <c>{ok, <anno>Filenames</anno>}</c> if successful. Otherwise, it returns <c>{error, <anno>Reason</anno>}</c>. <c><anno>Filenames</anno></c> is a list of @@ -653,11 +647,14 @@ </func> <func> <name name="native_name_encoding" arity="0"/> - <fsummary>Return the VM's configured filename encoding.</fsummary> + <fsummary>Return the VM's configured filename encoding</fsummary> <desc> - <p>This function returns the configured default file name encoding to use for raw file names. 
Generally an application supplying file names raw (as binaries), should obey the character encoding returned by this function.</p> - <p>By default, the VM uses ISO-latin-1 file name encoding on filesystems and/or OSes that use completely transparent file naming. This includes all Unix versions except MacOSX, where the vfs layer enforces UTF-8 file naming. By giving the experimental option <c>+fnu</c> when starting Erlang, UTF-8 translation of file names can be turned on even for those systems. If Unicode file name translation is in effect, the system behaves as usual as long as file names conform to the encoding, but will return file names that are not properly encoded in UTF-8 as raw file names (i.e. binaries).</p> - <p>On Windows, this function also returns <c>utf8</c> by default. The OS uses a pure Unicode naming scheme and file names are always possible to interpret as valid Unicode. The fact that the underlying Windows OS actually encodes file names using little endian UTF-16 can be ignored by the Erlang programmer. Windows and MacOSX are the only operating systems where the VM operates in Unicode file name mode by default.</p> + <p><marker id="native_name_encoding"/>This function returns + the file name encoding mode. If it is <c>latin1</c>, the + system does no translation of file names. If it is + <c>utf8</c>, file names will be converted back and forth to + the native file name encoding (usually UTF-8, but UTF-16 on + Windows).</p> </desc> </func> <func> @@ -1450,7 +1447,8 @@ <name name="read_link" arity="1"/> <fsummary>See what a link is pointing to</fsummary> <desc> - <p>This function returns <c>{ok, <anno>Filename</anno>}</c> if + <p><marker id="read_link_all"/>This function returns + <c>{ok, <anno>Filename</anno>}</c> if <c><anno>Name</anno></c> refers to a symbolic link that is not a "raw" file name, or <c>{error, <anno>Reason</anno>}</c> otherwise. 
diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml index 33cd70e0b7..ee7dd128f1 100644 --- a/lib/stdlib/doc/src/unicode_usage.xml +++ b/lib/stdlib/doc/src/unicode_usage.xml @@ -52,8 +52,8 @@ for UTF-8 and more support for Unicode character sets in the I/O-system.</p> - <p>In R17, the encoding default for Erlang source files will be - switched to UTF-8 and in R18 Erlang will support atoms in the full + <p>In 17.0, the encoding default for Erlang source files was + switched to UTF-8 and in 18.0 Erlang will support atoms in the full Unicode range, meaning full Unicode function and module names</p> @@ -290,7 +290,7 @@ <item>Having the source code in UTF-8 also allows you to write string literals containing Unicode characters with code points > 255, although atoms, module names and function names will be - restricted to the ISO-Latin-1 range until the R18 release. Binary + restricted to the ISO-Latin-1 range until the 18.0 release. Binary literals where you use the <c>/utf8</c> type, can also be expressed using Unicode characters > 255. Having module names using characters other than 7-bit ASCII can cause trouble on @@ -385,7 +385,7 @@ external_charlist() = maybe_improper_list(char() | using characters from the ISO-latin-1 character set and atoms are restricted to the same ISO-latin-1 range. These restrictions in the language are of course independent of the encoding of the source - file. Erlang/OTP R18 is expected to handle functions named in + file. Erlang/OTP 18.0 is expected to handle functions named in Unicode as well as Unicode atoms.</p> <section> <title>Bit-syntax</title> @@ -662,11 +662,14 @@ Eshell V5.10.1 (abort with ^G) containing characters having code points between 128 and 255 may be named either as plain ISO-latin-1 or using UTF-8 encoding. As no consistency is enforced, the Erlang VM can do no consistent - translation of all file names. 
If the VM would automatically - select encoding based on heuristics, one could get unexpected - behavior on these systems. By default, Erlang starts in "latin1" - file name mode on such systems, meaning bytewise encoding in file - names. This allows for list representation of all file names in + translation of all file names.</p> + + <p>By default on such systems, Erlang starts in <c>utf8</c> file + name mode if the terminal supports UTF-8, otherwise in + <c>latin1</c> mode.</p> + + <p>In the <c>latin1</c> mode, file names are bytewise endcoded. + This allows for list representation of all file names in the system, but, for example, a file named "Östersund.txt", will appear in <c>file:list_dir/1</c> as either "Östersund.txt" (if the file name was encoded in bytewise ISO-Latin-1 by the program @@ -752,7 +755,7 @@ Eshell V5.10.1 (abort with ^G) <section> <title>Notes About Raw File Names</title> - + <marker id="notes-about-raw-filenames"/> <p>Raw file names were introduced together with Unicode file name support in erts-5.8.2 (OTP R14B01). The reason "raw file names" was introduced in the system was to be able to @@ -1014,7 +1017,8 @@ ok allowed. This setting should correspond to the actual terminal you are using.</p> <p>The environment can also affect file name interpretation, if - Erlang is started with the <c>+fna</c> flag.</p> + Erlang is started with the <c>+fna</c> flag (which is default from + Erlang/OTP 17.0).</p> <p>You can check the setting of this by calling <c>io:getopts()</c>, which will give you an option list containing <c>{encoding,unicode}</c> or @@ -1046,8 +1050,7 @@ ok > 255.</p> <p><c>+fnl</c> means bytewise interpretation of file names, which was the usual way to represent ISO-Latin-1 file names before - UTF-8 file naming got widespread.
This is the default on all - Unix-like operating systems except MacOS X.</p> + UTF-8 file naming got widespread.</p> <p><c>+fnu</c> means that file names are encoded in UTF-8, which is nowadays the common scheme (although not enforced).</p> <p><c>+fna</c> means that you automatically select between @@ -1055,8 +1058,8 @@ ok <c>LC_CTYPE</c> environment variables. This is optimistic heuristics indeed, nothing enforces a user to have a terminal with the same encoding as the file system, but usually, this is - the case. This might be the default behavior in a future - release.</p> + the case. This is the default on all Unix-like operating + systems except MacOS X.</p> <p>The file name translation mode can be read with the <c>file:native_name_encoding/0</c> function, which returns @@ -1068,7 +1071,7 @@ ok <p>This function returns the default encoding for Erlang source files (if no encoding comment is present) in the currently running release. For R16 this returns <c>latin1</c> (meaning - bytewise encoding). In R17 and forward it is expected to return + bytewise encoding). In 17.0 and forward it returns <c>utf8</c>.</p> <p>The encoding of each file can be specified using comments as described in