From bb0b43eae854125688f3143e53c8974cafed4ad2 Mon Sep 17 00:00:00 2001
From: Rickard Green
Regarding filename encoding, the Erlang VM can operate in
two modes. The current mode can be queried using function
- In
On operating systems with transparent naming, files can be inconsistently named, for example, some files are encoded in UTF-8 while
@@ -81,6 +81,23 @@
See also section
+ File operations used to accept filenames containing
+ null characters (integer value zero). This caused
+ the name to be truncated and in some cases arguments
+ to primitive operations to be mixed up. Filenames
+ containing null characters inside the filename
+ are now rejected and will cause primitive
+ file operations to fail.
+
+ Currently null characters at the end of the filename
+ will be accepted by primitive file operations. Such
+ filenames are however still documented as invalid. The
+ implementation will also change in the future and
+ reject such filenames.
+
+ See also the documentation of the
+
+ See also the documentation of the
+
If VM is in Unicode filename mode,
If VM is in Unicode filename mode,
If VM is in Unicode filename mode, characters
are allowed to be > 255.
The functions in the module
Typical error reasons are:
diff --git a/lib/kernel/doc/src/os.xml b/lib/kernel/doc/src/os.xml
index 0e9add4161..0a08e2c78a 100644
--- a/lib/kernel/doc/src/os.xml
+++ b/lib/kernel/doc/src/os.xml
@@ -36,8 +36,99 @@
 only run on a specific platform. On the other hand, with careful use, these functions can be of help in enabling a program to run on most platforms.
+
+
+ File operations used to accept filenames containing
+ null characters (integer value zero). This caused
+ the name to be truncated and in some cases arguments
+ to primitive operations to be mixed up. Filenames
+ containing null characters inside the filename
+ are now rejected and will cause primitive
+ file operations to fail.
+
+
+ Also environment variable operations used to accept
+ names and values of environment variables containing
+ null characters (integer value zero). This caused
+ operations to silently produce erroneous results.
+ Environment variable names and values containing
+ null characters inside the name or value are now
+ rejected and will cause environment variable
+ operations to fail.
+
+
+ Currently null characters at the end of filenames,
+ environment variable names and values will be accepted
+ by the primitive operations. Such filenames, environment
+ variable names and values are however still documented as
+ invalid. The implementation will also change in the
+ future and reject such filenames, environment variable
+ names and values.
+
+A string containing valid characters on the specific
+ OS for environment variable names using
+
A string containing valid characters on the specific
+ OS for environment variable values using
+
+ Assuming that environment variables have been correctly
+ set, a string containing valid characters on the specific
+ OS for environment variable names and values using
+
All characters need to be valid characters on the
+ specific OS using
+
The previous implementation allowed all characters
+ as long as they were integer values greater than or equal to zero.
+ This sometimes led to unwanted results since null characters
+ (integer value zero) are often interpreted as string termination.
+ The current implementation still accepts null characters at the end
+ of
Examples:
LsOut = os:cmd("ls"), % on unix platform
@@ -152,6 +252,15 @@ DirOut = os:cmd("dir"), % on Win32 platform
On Unix platforms, the environment is set using UTF-8 encoding if Unicode filename translation is in effect. On Windows, the environment is set using wide character interfaces.
+
+
For more information about raw filenames, see the
+ Functionality in this module generally assumes valid input and
+ does not necessarily fail on input that does not use a valid
+ encoding, but may instead very likely produce invalid output.
+
+
+ File operations used to accept filenames containing
+ null characters (integer value zero). This caused
+ the name to be truncated and in some cases arguments
+ to primitive operations to be mixed up. Filenames
+ containing null characters inside the filename
+ are now rejected and will cause primitive
+ file operations to fail.
+
+
+ Currently null characters at the end of the filename
+ will be accepted by primitive file operations. Such
+ filenames are however still documented as invalid. The
+ implementation will also change in the future and
+ reject such filenames.
+
The module supports raw filenames in the way that if a binary is +
+ The module supports
+
+ Functionality in this module generally assumes valid input and
+ does not necessarily fail on input that does not use a valid
+ encoding, but may instead very likely produce invalid output.
+
+
+ File operations used to accept filenames containing
+ null characters (integer value zero). This caused
+ the name to be truncated and in some cases arguments
+ to primitive operations to be mixed up. Filenames
+ containing null characters inside the filename
+ are now rejected and will cause primitive
+ file operations to fail.
+
+
+ Currently null characters at the end of the filename
+ will be accepted by primitive file operations. Such
+ filenames are however still documented as invalid. The
+ implementation will also change in the future and
+ reject such filenames.
+
Most modern operating systems support Unicode filenames in some way. There are many different ways to do this and Erlang by default treats the different approaches differently:
@@ -855,8 +855,12 @@ Eshell V5.10.1 (abort with ^G)
+ Note that raw filenames are not necessarily encoded the
+ same way as on the OS level.
+
Raw filenames were introduced together with Unicode filename support in ERTS 5.8.2 (Erlang/OTP R14B01). The reason "raw filenames" were introduced in the system was
diff --git a/lib/stdlib/src/filename.erl b/lib/stdlib/src/filename.erl
index 9bf4290916..9a85642c17 100644
--- a/lib/stdlib/src/filename.erl
+++ b/lib/stdlib/src/filename.erl
@@ -34,6 +34,38 @@
 %% we flatten the arguments immediately on function entry as that makes
 %% it easier to ensure that the code works.

+%%
+%% *** Requirements on Raw Filename Format ***
+%%
+%% These requirements are due to the 'filename' module
+%% in stdlib. This since it is documented that it
+%% should be able to operate on raw filenames as well
+%% as ordinary filenames.
+%%
+%% A raw filename *must* be a byte sequence where:
+%% 1. Codepoints 0-127 (7-bit ascii) *must* be encoded
+%%    as a byte with the corresponding value. That is,
+%%    the most significant bit in the byte encoding the
+%%    codepoint is never set.
+%% 2. Codepoints greater than 127 *must* be encoded
+%%    with the most significant bit set in *every* byte
+%%    encoding it.
+%%
+%% Latin1 and UTF-8 meet these requirements while
+%% UTF-16 and UTF-32 don't.
+%%
+%% On Windows filenames are natively stored as malformed
+%% UTF-16LE (lonely surrogates may appear). A more correct
+%% description than UTF-16 would be an array of 16-bit
+%% words... In order to meet the requirements of the
+%% raw file format we convert the malformed UTF-16LE to
+%% malformed UTF-8 which meet the requirements.
+%%
+%% Note that these requirements are today only OTP
+%% internal (erts-stdlib internal) requirements that
+%% could be changed.
+%%
+
 -export([absname/1, absname/2, absname_join/2,
          basename/1, basename/2, dirname/1,
          extension/1, join/1, join/2, pathtype/1,
@@ -41,6 +73,7 @@
          safe_relative_path/1]).
 -export([find_src/1, find_src/2]). % deprecated
 -export([basedir/2, basedir/3]).
+-export([validate/1]).

 %% Undocumented and unsupported exports.
 -export([append/2]).
@@ -1135,3 +1168,72 @@ basedir_os_type() ->
         {win32,_} -> windows;
         _ -> linux
     end.
+
+%%
+%% validate/1
+%%
+
+-spec validate(FileName) -> boolean() when
+      FileName :: file:name_all().
+
+validate(FileName) when is_binary(FileName) ->
+    %% Raw filename...
+    validate_bin(FileName);
+validate(FileName) when is_list(FileName);
+                        is_atom(FileName) ->
+    validate_list(FileName,
+                  file:native_name_encoding(),
+                  os:type()).
+
+validate_list(FileName, Enc, Os) ->
+    try
+        true = validate_list(FileName, Enc, Os, 0) > 0
+    catch
+        _ : _ -> false
+    end.
+
+validate_list([], _Enc, _Os, Chars) ->
+    Chars;
+validate_list(C, Enc, Os, Chars) when is_integer(C) ->
+    validate_char(C, Enc, Os),
+    Chars+1;
+validate_list(A, Enc, Os, Chars) when is_atom(A) ->
+    validate_list(atom_to_list(A), Enc, Os, Chars);
+validate_list([H|T], Enc, Os, Chars) ->
+    NewChars = validate_list(H, Enc, Os, Chars),
+    validate_list(T, Enc, Os, NewChars).
+
+%% C is always an integer...
+% validate_char(C, _, _) when not is_integer(C) ->
+%     throw(invalid);
+validate_char(C, _, _) when C < 1 ->
+    throw(invalid); %% No negative or null characters...
+validate_char(C, latin1, _) when C > 255 ->
+    throw(invalid);
+validate_char(C, utf8, _) when C >= 16#110000 ->
+    throw(invalid);
+validate_char(C, utf8, {win32, _}) when C > 16#ffff ->
+    throw(invalid); %% invalid win wchar...
+validate_char(_C, utf8, {win32, _}) ->
+    ok; %% Range below is accepted on windows...
+validate_char(C, utf8, _) when 16#D800 =< C, C =< 16#DFFF ->
+    throw(invalid); %% invalid unicode range...
+validate_char(_, _, _) ->
+    ok.
+
+validate_bin(Bin) ->
+    %% Raw filename. That is, we do not interpret
+    %% the encoding, but we still do not accept
+    %% null characters...
+    try
+        true = validate_bin(Bin, 0) > 0
+    catch
+        _ : _ -> false
+    end.
+
+validate_bin(<<>>, Bs) ->
+    Bs;
+validate_bin(<<0, _Rest/binary>>, _Bs) ->
+    throw(invalid); %% No null characters allowed...
+validate_bin(<<_B, Rest/binary>>, Bs) ->
+    validate_bin(Rest, Bs+1).
diff --git a/lib/stdlib/src/stdlib.app.src b/lib/stdlib/src/stdlib.app.src
index 3c449d3cb9..ab0824ca17 100644
--- a/lib/stdlib/src/stdlib.app.src
+++ b/lib/stdlib/src/stdlib.app.src
@@ -107,7 +107,7 @@
              dets]},
   {applications, [kernel]},
   {env, []},
-  {runtime_dependencies, ["sasl-3.0","kernel-5.0","erts-9.0","crypto-3.3",
+  {runtime_dependencies, ["sasl-3.0","kernel-6.0","erts-10.0","crypto-3.3",
                           "compiler-5.0"]}
 ]}.
-- 
cgit v1.2.3
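
Usage note (not part of the patch): the sketch below exercises filename:validate/1 as it is written in the filename.erl hunk above. It assumes an OTP build that includes this patch, so that the undocumented export is available. The module name validate_demo and the function run/0 are hypothetical names used only for illustration. The expected values are read directly off the clauses of validate_list/4, validate_char/3 and validate_bin/2; cases whose result depends on the native filename encoding or on the OS (for example characters above 255, or surrogates on Windows) are deliberately left out.

```erlang
-module(validate_demo).
-export([run/0]).

%% Hypothetical demo module. Assumes filename:validate/1 from the
%% patch above is present in the running stdlib.
run() ->
    Cases = [{"foo.txt", true},             % ordinary filename
             {<<"foo.txt">>, true},         % raw filename (binary)
             {foo, true},                   % atoms are converted with atom_to_list/1
             {["dir", $/, 'file'], true},   % deep list of strings, chars and atoms
             {"foo\0bar", false},           % embedded null character
             {"foo\0", false},              % trailing null character
             {<<"foo", 0, "bar">>, false},  % null byte in a raw filename
             {"", false},                   % empty name: character count is 0
             {<<>>, false}],                % empty raw name: byte count is 0
    %% Expected is already bound by the generator, so the match below
    %% acts as an assertion and crashes with badmatch on any mismatch.
    [Expected = filename:validate(Name) || {Name, Expected} <- Cases],
    ok.
```

One detail visible in this sketch: validate/1 rejects a trailing null character, whereas the documentation text in this patch states that the primitive file operations currently still accept one. Code that uses validate/1 as a pre-check is therefore stricter than the primitives themselves.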