Age | Commit message | Author |
|
The field width calculation did not handle grapheme clusters well.
|
|
As of the introduction of Unicode characters in atoms, the control
sequences 'w' and 'W' can return non-Latin-1 characters, unless some
measure is taken.
This commit makes sure that '~w' and '~W' always return Latin-1
characters, or bytes, which can be output to ports or written to raw
files.
The Unicode translation modifier 't' is needed to return non-Latin-1
characters.
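The difference between '~w' and '~tw' can be checked with a minimal sketch such as the one below. The module and function names are hypothetical, and the atom is built from the code points quoted elsewhere in this log so that the example does not depend on the source file encoding. It assumes an OTP release that supports Unicode atoms.

```erlang
-module(w_latin1_demo).
-export([check/0]).

%% Hypothetical helper: compares ~w and ~tw for a non-Latin-1 atom.
check() ->
    %% Code points of the atom 'спутник'.
    Atom = list_to_atom([1089,1087,1091,1090,1085,1080,1082]),
    %% Without 't', every character of the result should fit in a byte.
    Plain = lists:flatten(io_lib:format("~w", [Atom])),
    true = lists:all(fun(C) -> C =< 255 end, Plain),
    %% The result should therefore be acceptable to iolist_to_binary/1.
    true = is_binary(iolist_to_binary(Plain)),
    %% With 't', characters above 255 may appear in the result.
    Unicode = lists:flatten(io_lib:format("~tw", [Atom])),
    true = lists:any(fun(C) -> C > 255 end, Unicode),
    ok.
```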
|
|
26b59dfe67e introduced support for arbitrary Unicode characters in
atoms. After that commit, it is possible to print any atom with
a "~s" format string:
    1> io:format("~s\n", ['спутник']).
    спутник
Note that the same text as a string will fail:
    2> io:format("~s\n", ["спутник"]).
    ** exception error: bad argument
         in function  io:format/3
            called as io:format(<0.53.0>,"~s\n",
                                [[1089,1087,1091,1090,1085,1080,1082]])
Being more permissive for atoms is probably beneficial for io:format/2.
However, for io_lib:format/2, the new behavior breaks this guarantee
in the documentation for io_lib:format/2:
    If and only if the Unicode translation modifier is used in
    the format string (that is, ~ts or ~tc), the resulting list
    can contain characters beyond the ISO Latin-1 character range
    (that is, numbers > 255).
The problem is that you can no longer be sure whether io_lib:format/2
will return an iolist that can be successfully passed to a port
or iolist_to_binary/1.
We see three solutions:
1. Keep the new behavior. That means that you can get non-iolist data
   when you use ~s for printing an atom, but a 'badarg' when printing
   Unicode strings. That is inconsistent, and it delays error detection
   if the result is passed to a port or iolist_to_binary/1.
2. Always allow Unicode characters for ~s. That would be incompatible,
   because ~s says that any binary is encoded in latin1, while ~ts says
   that any binary is encoded in UTF-8. To implement this solution, we
   could no longer support latin1 binaries; all binaries would have to
   be encoded in UTF-8.
3. Only allow ~s for atoms where all characters are less than 256.
   Require ~ts to print atoms such as 'спутник'.
We reject solution 1 because it is slightly incompatible and is
inconsistent.
We reject solution 2 because it is too incompatible.
Therefore, this commit implements solution 3.
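The effect of solution 3 can be sketched as follows. The module and function names are hypothetical; the code points are the ones shown in the error message above. This assumes an OTP release that includes this commit.

```erlang
-module(s_atom_demo).
-export([check/0]).

%% Hypothetical helper: ~s rejects a non-Latin-1 atom, ~ts accepts it.
check() ->
    Chars = [1089,1087,1091,1090,1085,1080,1082],
    Atom = list_to_atom(Chars),
    %% ~s is expected to fail with badarg for this atom.
    {'EXIT', {badarg, _}} = (catch io_lib:format("~s", [Atom])),
    %% ~ts returns the characters of the atom unchanged.
    Chars = lists:flatten(io_lib:format("~ts", [Atom])),
    %% An atom whose characters are all below 256 still works with ~s.
    "abc" = lists:flatten(io_lib:format("~s", [abc])),
    ok.
```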
|
|
This adds three new functions to io_lib - scan_format/2, unscan_format/1,
and build_text/1 - which expose the parsed form of the format control
sequences to make it possible to easily modify or filter the input to
io_lib:format/2. For example, this can be used to replace unbounded-size
control sequences such as ~w or ~p with the corresponding depth-limited
~W and ~P before the actual formatting is done.
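A minimal sketch of the depth-limiting use case is shown below. The module name and the functions limit_depth/3 and limit/2 are hypothetical. The sketch assumes the map representation of parsed control sequences found in recent OTP releases (keys control_char and args); releases close to this commit may use a different representation.

```erlang
-module(depth_limit_demo).
-export([limit_depth/3]).

%% Hypothetical helper: formats like io_lib:format/2, but rewrites
%% ~p to ~P and ~w to ~W with the given depth.
limit_depth(Format, Args, Depth) ->
    Parsed = io_lib:scan_format(Format, Args),
    io_lib:build_text([limit(Item, Depth) || Item <- Parsed]).

%% A parsed control sequence is assumed to be a map; literal text
%% is passed through untouched by the last clause.
limit(#{control_char := $p, args := [Arg]} = Spec, Depth) ->
    Spec#{control_char := $P, args := [Arg, Depth]};
limit(#{control_char := $w, args := [Arg]} = Spec, Depth) ->
    Spec#{control_char := $W, args := [Arg, Depth]};
limit(Other, _Depth) ->
    Other.
```

Calling depth_limit_demo:limit_depth("x ~p", [lists:seq(1, 100)], 3) should then give the same text as io_lib:format("x ~P", [lists:seq(1, 100), 3]).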
|
|
Values for which the precision or field width were too small in io_lib_format
could trigger an infinite loop or crash in term/5.
Reported-by: Richard Carlsson
|
|
Use the new function shell:strings/1 to toggle how the Erlang shell
outputs lists of integers.
|
|
The modifier 'l' can be used for turning off the string recognition of
~p and ~P.
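As an illustration, a list of printable integers is shown as a string by ~p but as a list of integers by ~lp. The module and function names below are hypothetical.

```erlang
-module(l_modifier_demo).
-export([check/0]).

%% Hypothetical helper: compares ~p and ~lp on a printable list.
check() ->
    %% ~p recognizes the list as a string and prints it quoted.
    "\"ABC\"" = lists:flatten(io_lib:format("~p", ["ABC"])),
    %% ~lp turns string recognition off.
    "[65,66,67]" = lists:flatten(io_lib:format("~lp", ["ABC"])),
    ok.
```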
|
|
Make sure io_lib:fwrite() with a format string including "~ts" does
not crash when given binaries that cannot be interpreted as
UTF-8-encoded strings.
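A minimal sketch of the intended behavior, with hypothetical module and function names. The exact characters returned for the invalid binary are not asserted here, only that the call returns instead of raising.

```erlang
-module(ts_binary_demo).
-export([check/0]).

%% Hypothetical helper: ~ts with valid and invalid UTF-8 binaries.
check() ->
    %% A valid UTF-8 binary is decoded to its code points.
    Chars = [1089,1087,1091,1090,1085,1080,1082],
    Chars = lists:flatten(io_lib:format("~ts", [<<"спутник"/utf8>>])),
    %% <<255,254>> is not valid UTF-8; the call should still return a list.
    Result = io_lib:format("~ts", [<<255,254>>]),
    true = is_list(lists:flatten(Result)),
    ok.
```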
We want to avoid crashes caused by excessive use of the 't' modifier.
|
|
The code related to the introduction of unicode_string() and
unicode_char() has been removed. The types char() and string() have
been extended to include Unicode characters.
In fact, char() was changed some time ago; this commit is about
cleaning up the documentation and introducing better names for some
functions.
|
|
Expect modifications, additions and corrections.
There is a kludge in file_io_server and
erl_scan:continuation_location() that's not so pleasing.
|
|