Don't allow null chars in various strings

Various places now reject null chars inside strings:
- Primitive file operations reject them in filenames.
- Primitive environment variable operations reject them in names
  and values.
- os:cmd() rejects them in its input.

Also, '=' characters in environment variable names are rejected by
primitive environment variable operations.

Documentation has been updated to document null characters in these
types of data as invalid.

Currently these operations accept null chars at the end of strings,
but that will change in the future.
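
The distinction drawn above between null characters inside a string and null characters at its end can be sketched as follows. This is a minimal illustration, not code from this commit: the module name `null_check` and the function `classify/1` are hypothetical, and the sketch only looks at raw (binary) filenames.

```erlang
%% Hypothetical helper module; not part of OTP or of this commit.
-module(null_check).
-export([classify/1]).

%% Classify a raw (binary) filename by where null bytes occur:
%%   ok            - no null bytes at all
%%   trailing_null - null bytes only at the end of the name
%%   inner_null    - at least one null byte before a non-null byte
-spec classify(binary()) -> ok | trailing_null | inner_null.
classify(Bin) when is_binary(Bin) ->
    Stripped = strip_trailing_nulls(Bin),
    case binary:match(Stripped, <<0>>) of
        nomatch when byte_size(Stripped) =:= byte_size(Bin) -> ok;
        nomatch -> trailing_null;
        _ -> inner_null
    end.

%% Remove every null byte from the end of the binary.
strip_trailing_nulls(<<>>) ->
    <<>>;
strip_trailing_nulls(Bin) ->
    case binary:last(Bin) of
        0 -> strip_trailing_nulls(binary:part(Bin, 0, byte_size(Bin) - 1));
        _ -> Bin
    end.
```

Under the behaviour described in the message, only names in the `inner_null` class are rejected by the primitive operations today, while `trailing_null` names are documented as invalid and are expected to be rejected in the future.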
author: Rickard Green <[email protected]> 2017-09-06 17:00:14 +0200
committer: Rickard Green <[email protected]> 2017-10-11 16:59:40 +0200
commit: bb0b43eae854125688f3143e53c8974cafed4ad2 (patch)
tree: 648c103b0777d951b0e4a44b6615753d2748a2ea /lib/stdlib
parent: 02fd746c40e829adbe77cc526c7df904698e2534 (diff)
download: otp-bb0b43eae854125688f3143e53c8974cafed4ad2.tar.gz
otp-bb0b43eae854125688f3143e53c8974cafed4ad2.tar.bz2
otp-bb0b43eae854125688f3143e53c8974cafed4ad2.zip
5 files changed, 162 insertions, 4 deletions
diff --git a/lib/stdlib/doc/src/filelib.xml b/lib/stdlib/doc/src/filelib.xml
index 80c4acffdb..c531b10d50 100644
--- a/lib/stdlib/doc/src/filelib.xml
+++ b/lib/stdlib/doc/src/filelib.xml
@@ -45,6 +45,30 @@
 
     <p>For more information about raw filenames, see the
       <seealso marker="kernel:file"><c>file</c></seealso> module.</p>
+
+    <note>
+      <p>
+	Functionality in this module generally assumes valid input and
+	does not necessarily fail on input that is not validly encoded;
+	such input is instead very likely to produce invalid output.
+      </p>
+      <p>
+	File operations used to accept filenames containing
+	null characters (integer value zero). This caused
+	the name to be truncated and in some cases arguments
+	to primitive operations to be mixed up. Filenames
+	containing null characters inside the filename
+	are now <em>rejected</em> and will cause primitive
+	file operations to fail.
+      </p>
+    </note>
+    <warning><p>
+      Currently null characters at the end of the filename
+      will be accepted by primitive file operations. Such
+      filenames are however still documented as invalid. The
+      implementation will also change in the future and
+      reject such filenames.
+    </p></warning>
   </description>
 
   <datatypes>
diff --git a/lib/stdlib/doc/src/filename.xml b/lib/stdlib/doc/src/filename.xml
index 14fd5ef787..d2608ad542 100644
--- a/lib/stdlib/doc/src/filename.xml
+++ b/lib/stdlib/doc/src/filename.xml
@@ -46,7 +46,10 @@
       filename by removing redundant directory separators, use
       <seealso marker="#join/1"><c>join/1</c></seealso>.</p>
 
-    <p>The module supports raw filenames in the way that if a binary is
+    <p>
+      The module supports
+      <seealso marker="unicode_usage#notes-about-raw-filenames">raw
+      filenames</seealso> in the way that if a binary is
       present, or the filename cannot be interpreted according to the return
       value of <seealso marker="kernel:file#native_name_encoding/0">
       <c>file:native_name_encoding/0</c></seealso>, a raw filename is also
@@ -56,6 +59,30 @@
       (the join operation is performed of course). For more information
       about raw filenames, see the
       <seealso marker="kernel:file"><c>file</c></seealso> module.</p>
+
+    <note>
+      <p>
+	Functionality in this module generally assumes valid input and
+	does not necessarily fail on input that is not validly encoded;
+	such input is instead very likely to produce invalid output.
+      </p>
+      <p>
+	File operations used to accept filenames containing
+	null characters (integer value zero). This caused
+	the name to be truncated and in some cases arguments
+	to primitive operations to be mixed up. Filenames
+	containing null characters inside the filename
+	are now <em>rejected</em> and will cause primitive
+	file operations to fail.
+      </p>
+    </note>
+    <warning><p>
+      Currently null characters at the end of the filename
+      will be accepted by primitive file operations. Such
+      filenames are however still documented as invalid. The
+      implementation will also change in the future and
+      reject such filenames.
+    </p></warning>
   </description>
   <datatypes>
       <datatype>
@@ -555,6 +582,7 @@ unsafe</pre>
 ["a:/","msdev","include"]</pre>
       </desc>
     </func>
+
   </funcs>
 </erlref>
 
diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml
index 26dc46719e..789e063c12 100644
--- a/lib/stdlib/doc/src/unicode_usage.xml
+++ b/lib/stdlib/doc/src/unicode_usage.xml
@@ -719,8 +719,8 @@ Eshell V5.10.1  (abort with ^G)
   </section> 
 
   <section>
-    <title>Unicode Filenames</title>
     <marker id="unicode_file_names"/>
+    <title>Unicode Filenames</title>
     <p>Most modern operating systems support Unicode filenames in some way.
       There are many different ways to do this and Erlang by default treats the
       different approaches differently:</p>
@@ -855,8 +855,12 @@ Eshell V5.10.1  (abort with ^G)
     </note>
 
     <section>
-      <title>Notes About Raw Filenames</title>
       <marker id="notes-about-raw-filenames"/>
+      <title>Notes About Raw Filenames</title>
+      <note><p>
+	Note that raw filenames are <em>not</em> necessarily encoded the
+	same way as on the OS level.
+      </p></note>
       <p>Raw filenames were introduced together with Unicode filename support
         in ERTS 5.8.2 (Erlang/OTP R14B01). The reason &quot;raw
         filenames&quot; were introduced in the system was
diff --git a/lib/stdlib/src/filename.erl b/lib/stdlib/src/filename.erl
index 9bf4290916..9a85642c17 100644
--- a/lib/stdlib/src/filename.erl
+++ b/lib/stdlib/src/filename.erl
@@ -34,6 +34,38 @@
 %% we flatten the arguments immediately on function entry as that makes
 %% it easier to ensure that the code works.
 
+%%
+%% *** Requirements on Raw Filename Format ***
+%%
+%% These requirements are due to the 'filename' module
+%% in stdlib, since it is documented as being able to
+%% operate on raw filenames as well as ordinary
+%% filenames.
+%%
+%% A raw filename *must* be a byte sequence where:
+%% 1. Codepoints 0-127 (7-bit ascii) *must* be encoded
+%%    as a byte with the corresponding value. That is,
+%%    the most significant bit in the byte encoding the
+%%    codepoint is never set.
+%% 2. Codepoints greater than 127 *must* be encoded
+%%    with the most significant bit set in *every* byte
+%%    encoding it.
+%%
+%% Latin1 and UTF-8 meet these requirements while
+%% UTF-16 and UTF-32 don't.
+%%
+%% On Windows filenames are natively stored as malformed
+%% UTF-16LE (lonely surrogates may appear). A more correct
+%% description than UTF-16 would be an array of 16-bit
+%% words... In order to meet the requirements of the
+%% raw filename format we convert the malformed UTF-16LE to
+%% malformed UTF-8, which meets the requirements.
+%%
+%% Note that these requirements are today only OTP
+%% internal (erts-stdlib internal) requirements that
+%% could be changed.
+%%
+
 -export([absname/1, absname/2, absname_join/2, 
 	 basename/1, basename/2, dirname/1,
 	 extension/1, join/1, join/2, pathtype/1,
@@ -41,6 +73,7 @@
          safe_relative_path/1]).
 -export([find_src/1, find_src/2]). % deprecated
 -export([basedir/2, basedir/3]).
+-export([validate/1]).
 
 %% Undocumented and unsupported exports.
 -export([append/2]).
@@ -1135,3 +1168,72 @@ basedir_os_type() ->
         {win32,_}     -> windows;
         _             -> linux
     end.
+
+%%
+%% validate/1
+%%
+
+-spec validate(FileName) -> boolean() when
+      FileName :: file:name_all().
+
+validate(FileName) when is_binary(FileName) ->
+    %% Raw filename...
+    validate_bin(FileName);
+validate(FileName) when is_list(FileName);
+                        is_atom(FileName) ->
+    validate_list(FileName,
+                  file:native_name_encoding(),
+                  os:type()).
+
+validate_list(FileName, Enc, Os) ->
+    try
+        true = validate_list(FileName, Enc, Os, 0) > 0
+    catch
+        _ : _ -> false
+    end.
+
+validate_list([], _Enc, _Os, Chars) ->
+    Chars;
+validate_list(C, Enc, Os, Chars) when is_integer(C) ->
+    validate_char(C, Enc, Os),
+    Chars+1;
+validate_list(A, Enc, Os, Chars) when is_atom(A) ->
+    validate_list(atom_to_list(A), Enc, Os, Chars);
+validate_list([H|T], Enc, Os, Chars) ->
+    NewChars = validate_list(H, Enc, Os, Chars),
+    validate_list(T, Enc, Os, NewChars).
+
+%% C is always an integer...
+% validate_char(C, _, _) when not is_integer(C) ->
+%     throw(invalid);
+validate_char(C, _, _) when C < 1 ->
+    throw(invalid); %% No negative or null characters...
+validate_char(C, latin1, _) when C > 255 ->
+    throw(invalid);
+validate_char(C, utf8, _) when C >= 16#110000 ->
+    throw(invalid);
+validate_char(C, utf8, {win32, _}) when C > 16#ffff ->
+    throw(invalid); %% invalid win wchar...
+validate_char(_C, utf8, {win32, _}) ->
+    ok; %% Range below is accepted on windows...
+validate_char(C, utf8, _) when 16#D800 =< C, C =< 16#DFFF ->
+    throw(invalid); %% invalid unicode range...
+validate_char(_, _, _) ->
+    ok.
+
+validate_bin(Bin) ->
+    %% Raw filename. That is, we do not interpret
+    %% the encoding, but we still do not accept
+    %% null characters...
+    try
+        true = validate_bin(Bin, 0) > 0
+    catch
+        _ : _ -> false
+    end.
+
+validate_bin(<<>>, Bs) ->
+    Bs;
+validate_bin(<<0, _Rest/binary>>, _Bs) ->
+    throw(invalid); %% No null characters allowed...
+validate_bin(<<_B, Rest/binary>>, Bs) ->
+    validate_bin(Rest, Bs+1).
diff --git a/lib/stdlib/src/stdlib.app.src b/lib/stdlib/src/stdlib.app.src
index 3c449d3cb9..ab0824ca17 100644
--- a/lib/stdlib/src/stdlib.app.src
+++ b/lib/stdlib/src/stdlib.app.src
@@ -107,7 +107,7 @@
                dets]},
   {applications, [kernel]},
   {env, []},
-  {runtime_dependencies, ["sasl-3.0","kernel-5.0","erts-9.0","crypto-3.3",
+  {runtime_dependencies, ["sasl-3.0","kernel-6.0","erts-10.0","crypto-3.3",
 			  "compiler-5.0"]}
 ]}.
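
Reading the new `filename:validate/1` clauses in the `filename.erl` hunk above, its behaviour on a few sample names can be sketched as below. This assumes an OTP build that contains this commit; the module name `validate_demo` is hypothetical, and `validate/1` is exported by this patch without a corresponding entry in `filename.xml`, so it should be treated as internal.

```erlang
%% Hypothetical demo module; only filename:validate/1 comes from the patch.
-module(validate_demo).
-export([run/0]).

%% Returns each sample name paired with the result of filename:validate/1.
run() ->
    Samples = [
        "foo.txt",            % plain list filename
        "foo\0bar",           % null character inside the name
        "",                   % empty name: no characters counted
        'foo.txt',            % atom, converted with atom_to_list/1
        ["dir", $/, "file"],  % deep list, walked element by element
        <<"raw">>,            % raw filename without null bytes
        <<"raw", 0>>          % raw filename with a trailing null byte
    ],
    [{Name, filename:validate(Name)} || Name <- Samples].
```

Going by the clauses in the diff, the results should be `true`, `false`, `false`, `true`, `true`, `true`, `false` in that order. The empty name is rejected because both `validate_list/3` and `validate_bin/1` require a character count greater than zero. The trailing null byte is also rejected here, even though the primitive file operations currently still accept it.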