From 3e6877b06ae395a9d4310ef664d0360867a47f62 Mon Sep 17 00:00:00 2001 From: Patrik Nyblom Date: Wed, 1 Dec 2010 17:35:40 +0100 Subject: Add documentation about raw filenames and Unicode file name translation mode --- erts/doc/src/erl.xml | 13 ++++++++ lib/kernel/doc/src/file.xml | 72 ++++++++++++++++++++++++++++++++++++++++- lib/stdlib/doc/src/filelib.xml | 27 ++++++++++++++-- lib/stdlib/doc/src/filename.xml | 15 +++++++-- 4 files changed, 120 insertions(+), 7 deletions(-) diff --git a/erts/doc/src/erl.xml b/erts/doc/src/erl.xml index 9224d73b6f..77bd952d41 100644 --- a/erts/doc/src/erl.xml +++ b/erts/doc/src/erl.xml @@ -550,6 +550,19 @@

Force the compressed option on all ETS tables. Only intended for test and evaluation.

+ + +

The VM works with file names as if they are encoded using the ISO-latin-1 encoding, disallowing Unicode characters with codepoints beyond 255. This is default on operating systems that have transparent file naming, i.e. all Unixes except MacOSX.

+
+ + +

The VM works with file names as if they are encoded using UTF-8 (or some other system specific Unicode encoding). This is the default on operating systems that enforce Unicode encoding, i.e. Windows and MacOSX.

+

By enabling Unicode file name translation on systems where this is not default, you open up to the possibility that some file names can not be interpreted by the VM and therefore will be returned to the program as raw binaries. The option is therefore considered experimental.

+
+ + +

Selection between +fnl and +fnu is done based on the current locale settings in the OS, meaning that if you have set your terminal for UTF-8 encoding, the filesystem is expected to use the same encoding for filenames (use with care).

+

Sets the default heap size of processes to the size diff --git a/lib/kernel/doc/src/file.xml b/lib/kernel/doc/src/file.xml index 64cdd3a8ea..d3441d3623 100644 --- a/lib/kernel/doc/src/file.xml +++ b/lib/kernel/doc/src/file.xml @@ -36,6 +36,61 @@ other Erlang processes to continue executing in parallel with the file operations. See the command line flag +A in erl(1).

+ +

The Erlang VM supports file names in Unicode to a limited + extent. Depending on how the VM is started (with the parameter + +fnu or +fnl), file names given can contain + characters > 255 and the VM system will convert file names + back and forth to the native file name encoding.

+ +

The default behavior for Unicode character translation depends + on to what extent the underlying OS/filesystem enforces consistent + naming. On OSes where all file names are ensured to be in one or + another encoding, Unicode is the default (currently this holds for + Windows and MacOSX). On OSes with completely transparent file + naming (i.e. all Unixes except MacOSX), ISO-latin-1 file naming is + the default. The reason for the ISO-latin-1 default is that + file names are not guaranteed to be possible to interpret according to + the Unicode encoding expected (i.e. UTF-8), and file names that + cannot be decoded will only be accessible by using "raw + file names", in other word file names given as binaries.

+ +

As file names are traditionally not binaries in Erlang, + applications that need to handle raw file names need to be + converted, why the Unicode mode for file names is not default on + systems having completely transparent file naming.

+ + As of R14B01, the most basic file handling modules + (file, prim_file, filelib and + filename) accept raw file names, but the rest of OTP is not + guaranteed to handle them, why Unicode file naming on systems + where it is not default is still considered experimental. + +

Raw file names is a new feature in OTP R14B01, which allows the + user to supply completely uninterpreted file names to the + underlying OS/filesystem. They are supplied as binaries, where it + is up to the user to supply a correct encoding for the + environment. The function file:native_name_encoding() can + be used to check what encoding the VM is working in. If the + function returns latin1 file names are not in any way + converted to Unicode, if it is utf8, raw file names should + be encoded as UTF-8 if they are to follow the convention of the VM + (and usually the convention of the OS as well). Using raw + file names is useful if you have a filesystem with inconsistent + file naming, where some files are named in UTF-8 encoding while + others are not. A file:list_dir on such mixed file name systems + when the VM is in Unicode file name mode might return file names as + raw binaries as they cannot be interpreted as Unicode + file names. Raw file names can also be used to give UTF-8 encoded + file names even though the VM is not started in Unicode file name + translation mode.

+ +

Note that on Windows, file:native_name_encoding() + returns utf8 per default, which is the format for raw + file names even on Windows, although the underlying OS specific + code works in a limited version of little endian UTF16. As far as + the Erlang programmer is concerned, Windows native Unicode format + is UTF-8...

@@ -47,8 +102,14 @@ iodata() = iolist() | binary() io_device() as returned by file:open/2, a process handling IO protocols -name() = string() | atom() | DeepList +name() = string() | atom() | DeepList | RawFilename DeepList = [char() | atom() | DeepList] + RawFilename = binary() + If VM is in unicode filename mode, string() and char() are allowed to be > 255. + RawFilename is a filename not subject to Unicode translation, meaning that it + can contain characters not conforming to the Unicode encoding expected from the + filesystem (i.e. non-UTF-8 characters although the VM is started in Unicode + filename mode). posix() an atom which is named from the POSIX error codes used in @@ -597,6 +658,15 @@ f.txt: {person, "kalle", 25}. + + native_name_encoding() -> latin1 | utf8 + Retunr the VMs configure filename encoding. + +

This function returns the configured default file name encoding to use for raw file names. Generally an application supplying file names raw (as binaries), should obey the character encoding returned by this function.

+

By default, the VM uses ISO-latin-1 file name encoding on filesystems and/or OSes that use completely transparent file naming. This includes all Unix versions except for MacOSX, where the vfs layer enforces UTF-8 file naming. By giving the experimental option +fnu when starting Erlang, UTF-8 translation of file names can be turned on even for those systems. If Unicode file name translation is in effect, the system behaves as usual as long as file names conform to the encoding, but will return file names that are not properly encoded in UTF-8 as raw file names (i.e. binaries).

+

On Windows, this function also returns utf8 by default. The OS uses a pure Unicode naming scheme and file names are always possible to interpret as valid Unicode. The fact that the underlying Windows OS actually encodes file names using little endian UTF-16 can be ignored by the Erlang programmer. Windows and MacOSX are the only operating systems where the VM operates in Unicode file name mode by default.

+
+
open(Filename, Modes) -> {ok, IoDevice} | {error, Reason} Open a file diff --git a/lib/stdlib/doc/src/filelib.xml b/lib/stdlib/doc/src/filelib.xml index 4ff3b22f32..969aff4fcb 100644 --- a/lib/stdlib/doc/src/filelib.xml +++ b/lib/stdlib/doc/src/filelib.xml @@ -36,14 +36,23 @@

This module contains utilities on a higher level than the file module.

+

The module supports Unicode file names, so that it will match against regular expressions given in Unicode and that it will find and process raw file names (i.e. files named in a way that does not confirm to the expected encoding).

+

If the VM operates in Unicode file naming mode on a machine with transparent file naming, the fun() provided to fold_files/5 needs to be prepared to handle binary file names.

+

For more information about raw file names, see the file module.

DATA TYPES -filename() = string() | atom() | DeepList -dirname() = filename() -DeepList = [char() | atom() | DeepList] +filename() = = string() | atom() | DeepList | RawFilename + DeepList = [char() | atom() | DeepList] + RawFilename = binary() + If VM is in unicode filename mode, string() and char() are allowed to be > 255. + RawFilename is a filename not subject to Unicode translation, meaning that it + can contain characters not conforming to the Unicode encoding expected from the + filesystem (i.e. non-UTF-8 characters although the VM is started in Unicode + filename mode). +dirname() = filename()
@@ -90,6 +99,18 @@ DeepList = [char() | atom() | DeepList] If Recursive is true all sub-directories to Dir are processed. The regular expression matching is done on just the filename without the directory part.

+ +

If Unicode file name translation is in effect and the file + system is completely transparent, file names that cannot be + interpreted as Unicode may be encountered, in which case the + fun() must be prepared to handle raw file names + (i.e. binaries). If the regular expression contains + codepoints beyond 255, it will not match file names that does + not conform to the expected character encoding (i.e. are not + encoded in valid UTF-8).

+ +

For more information about raw file names, see the + file module.

diff --git a/lib/stdlib/doc/src/filename.xml b/lib/stdlib/doc/src/filename.xml index fe6c6f898e..cdee6e4a81 100644 --- a/lib/stdlib/doc/src/filename.xml +++ b/lib/stdlib/doc/src/filename.xml @@ -4,7 +4,7 @@
- 19972009 + 19972010 Ericsson AB. All Rights Reserved. @@ -43,13 +43,22 @@ only, even if the arguments contain back slashes. Use join/1 to normalize a file name by removing redundant directory separators.

+

The module supports raw file names in the way that if a binary is present, or the file name cannot be interpreted according to the return value of + file:native_name_encoding/0, a raw file name will also be returned. For example filename:join/1 provided with a path component being a binary (and also not being possible to interpret under the current native file name encoding) will result in a raw file name being returned (the join operation will have been performed of course). For more information about raw file names, see the file module.

DATA TYPES -name() = string() | atom() | DeepList -DeepList = [char() | atom() | DeepList] +name() = string() | atom() | DeepList | RawFilename + DeepList = [char() | atom() | DeepList] + RawFilename = binary() + If VM is in unicode filename mode, string() and char() are allowed to be > 255. + RawFilename is a filename not subject to Unicode translation, meaning that it + can contain characters not conforming to the Unicode encoding expected from the + filesystem (i.e. non-UTF-8 characters although the VM is started in Unicode + filename mode). +
-- cgit v1.2.3