aboutsummaryrefslogtreecommitdiffstats
path: root/lib/stdlib/doc/src
diff options
context:
space:
mode:
authorDan Gudmundsson <[email protected]>2017-04-24 12:41:10 +0200
committerDan Gudmundsson <[email protected]>2017-04-24 12:41:10 +0200
commit739bca3fc267c55d84c8f5c193d16c0b2a7eee13 (patch)
treea7ac2a4c90bdeadd2a313e7d49cecf84d3e1ba28 /lib/stdlib/doc/src
parent515dc2d603d449ed3621d96028ba403aef05ea7f (diff)
parenta9d0d119837fb0bc52d2bb3d48a47568de9100b4 (diff)
downloadotp-739bca3fc267c55d84c8f5c193d16c0b2a7eee13.tar.gz
otp-739bca3fc267c55d84c8f5c193d16c0b2a7eee13.tar.bz2
otp-739bca3fc267c55d84c8f5c193d16c0b2a7eee13.zip
Merge branch 'dgud/stdlib/unicode-string/OTP-10289'
* dgud/stdlib/unicode-string/OTP-10289: Handle chardata in string:to_float and string:to_list New unicode aware string module that works with unicode:chardata() Add nf(k)d, nf(k)c conversion functions to unicode module Reorder code and whitespace fixes Add unicode_util
Diffstat (limited to 'lib/stdlib/doc/src')
-rw-r--r--lib/stdlib/doc/src/string.xml741
-rw-r--r--lib/stdlib/doc/src/unicode.xml179
-rw-r--r--lib/stdlib/doc/src/unicode_usage.xml70
3 files changed, 897 insertions, 93 deletions
diff --git a/lib/stdlib/doc/src/string.xml b/lib/stdlib/doc/src/string.xml
index dddedf1132..dc83c40a9a 100644
--- a/lib/stdlib/doc/src/string.xml
+++ b/lib/stdlib/doc/src/string.xml
@@ -36,8 +36,613 @@
<modulesummary>String processing functions.</modulesummary>
<description>
<p>This module provides functions for string processing.</p>
+ <p>A string in this module is represented by <seealso marker="unicode#type-chardata">
+ <c>unicode:chardata()</c></seealso>, that is, a list of codepoints,
+ binaries with UTF-8-encoded codepoints
+ (<em>UTF-8 binaries</em>), or a mix of the two.</p>
+ <code>
+"abcd" is a valid string
+&lt;&lt;"abcd">> is a valid string
+["abcd"] is a valid string
+&lt;&lt;"abc..åäö"/utf8>> is a valid string
+&lt;&lt;"abc..åäö">> is NOT a valid string,
+ but a binary with Latin-1-encoded codepoints
+[&lt;&lt;"abc">>, "..åäö"] is a valid string
+[atom] is NOT a valid string</code>
+ <p>
+ This module operates on grapheme clusters. A <em>grapheme cluster</em>
+ is a user-perceived character, which can be represented by several
+ codepoints.
+ </p>
+ <code>
+"å" [229] or [97, 778]
+"e̊" [101, 778]</code>
+ <p>
+ The string length of "ß↑e̊" is 3, even though it is represented by the
+ codepoints <c>[223,8593,101,778]</c> or the UTF-8 binary
+ <c>&lt;&lt;195,159,226,134,145,101,204,138>></c>.
+ </p>
+ <p>
+ Grapheme clusters for codepoints of class <c>prepend</c>
+ and non-modern (or decomposed) Hangul is not handled for performance
+ reasons in
+ <seealso marker="#find/3"><c>find/3</c></seealso>,
+ <seealso marker="#replace/3"><c>replace/3</c></seealso>,
+ <seealso marker="#split/2"><c>split/2</c></seealso>,
+ <seealso marker="#lexemes/2"><c>split/2</c></seealso> and
+ <seealso marker="#trim/3"><c>trim/3</c></seealso>.
+ </p>
+ <p>
+ Splitting and appending strings is to be done on grapheme clusters
+ borders.
+ There is no verification that the results of appending strings are
+ valid or normalized.
+ </p>
+ <p>
+ Most of the functions expect all input to be normalized to one form,
+ see for example <seealso marker="unicode#characters_to_nfc_list/1">
+ <c>unicode:characters_to_nfc_list/1</c></seealso>.
+ </p>
+ <p>
+ Language or locale specific handling of input is not considered
+ in any function.
+ </p>
+ <p>
+ The functions can crash for non-valid input strings. For example,
+ the functions expect UTF-8 binaries but not all functions
+ verify that all binaries are encoded correctly.
+ </p>
+ <p>
+ Unless otherwise specified the return value type is the same as
+ the input type. That is, binary input returns binary output,
+ list input returns a list output, and mixed input can return a
+ mixed output.</p>
+ <code>
+1> string:trim(" sarah ").
+"sarah"
+2> string:trim(&lt;&lt;" sarah ">>).
+&lt;&lt;"sarah">>
+3> string:lexemes("foo bar", " ").
+["foo","bar"]
+4> string:lexemes(&lt;&lt;"foo bar">>, " ").
+[&lt;&lt;"foo">>,&lt;&lt;"bar">>]</code>
+ <p>This module has been reworked in Erlang/OTP 20 to
+ handle <seealso marker="unicode#type-chardata">
+ <c>unicode:chardata()</c></seealso> and operate on grapheme
+ clusters. The <seealso marker="#oldapi"> <c>old
+ functions</c></seealso> that only work on Latin-1 lists as input
+ are still available but should not be
+ used. They will be deprecated in Erlang/OTP 21.
+ </p>
</description>
+ <datatypes>
+ <datatype>
+ <name name="direction"/>
+ <name name="grapheme_cluster"/>
+ <desc>
+ <p>A user-perceived character, consisting of one or more
+ codepoints.</p>
+ </desc>
+ </datatype>
+ </datatypes>
+
+ <funcs>
+
+ <func>
+ <name name="casefold" arity="1"/>
+ <fsummary>Convert a string to a comparable string.</fsummary>
+ <desc>
+ <p>
+ Converts <c><anno>String</anno></c> to a case-agnostic
+ comparable string. Function <c>casefold/1</c> is preferred
+ over <c>lowercase/1</c> when two strings are to be compared
+ for equality. See also <seealso marker="#equal/4"><c>equal/4</c></seealso>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:casefold("Ω and ẞ SHARP S").</input>
+"ω and ss sharp s"</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="chomp" arity="1"/>
+ <fsummary>Remove trailing end of line control characters.</fsummary>
+ <desc>
+ <p>
+ Returns a string where any trailing <c>\n</c> or
+ <c>\r\n</c> have been removed from <c><anno>String</anno></c>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+182> <input>string:chomp(&lt;&lt;"\nHello\n\n">>).</input>
+&lt;&lt;"\nHello">>
+183> <input>string:chomp("\nHello\r\r\n").</input>
+"\nHello\r"</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="equal" arity="2"/>
+ <name name="equal" arity="3"/>
+ <name name="equal" arity="4"/>
+ <fsummary>Test string equality.</fsummary>
+ <desc>
+ <p>
+ Returns <c>true</c> if <c><anno>A</anno></c> and
+ <c><anno>B</anno></c> are equal, otherwise <c>false</c>.
+ </p>
+ <p>
+ If <c><anno>IgnoreCase</anno></c> is <c>true</c>
+ the function does <seealso marker="#casefold/1">
+ <c>casefold</c>ing</seealso> on the fly before the equality test.
+ </p>
+ <p>If <c><anno>Norm</anno></c> is not <c>none</c>
+ the function applies normalization on the fly before the equality test.
+ There are four available normalization forms:
+ <seealso marker="unicode#characters_to_nfc_list/1"> <c>nfc</c></seealso>,
+ <seealso marker="unicode#characters_to_nfd_list/1"> <c>nfd</c></seealso>,
+ <seealso marker="unicode#characters_to_nfkc_list/1"> <c>nfkc</c></seealso>, and
+ <seealso marker="unicode#characters_to_nfkd_list/1"> <c>nfkd</c></seealso>.
+ </p>
+ <p>By default,
+ <c><anno>IgnoreCase</anno></c> is <c>false</c> and
+ <c><anno>Norm</anno></c> is <c>none</c>.</p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:equal("åäö", &lt;&lt;"åäö"/utf8>>).</input>
+true
+2> <input>string:equal("åäö", unicode:characters_to_nfd_binary("åäö")).</input>
+false
+3> <input>string:equal("åäö", unicode:characters_to_nfd_binary("ÅÄÖ"), true, nfc).</input>
+true</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="find" arity="2"/>
+ <name name="find" arity="3"/>
+ <fsummary>Find start of substring.</fsummary>
+ <desc>
+ <p>
+ Removes anything before <c><anno>SearchPattern</anno></c> in <c><anno>String</anno></c>
+ and returns the remainder of the string or <c>nomatch</c> if <c><anno>SearchPattern</anno></c> is not
+ found.
+ <c><anno>Dir</anno></c>, which can be <c>leading</c> or
+ <c>trailing</c>, indicates from which direction characters
+ are to be searched.
+ </p>
+ <p>
+ By default, <c><anno>Dir</anno></c> is <c>leading</c>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:find("ab..cd..ef", ".").</input>
+"..cd..ef"
+2> <input>string:find(&lt;&lt;"ab..cd..ef">>, "..", trailing).</input>
+&lt;&lt;"..ef">>
+3> <input>string:find(&lt;&lt;"ab..cd..ef">>, "x", leading).</input>
+nomatch
+4> <input>string:find("ab..cd..ef", "x", trailing).</input>
+nomatch</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="is_empty" arity="1"/>
+ <fsummary>Check if the string is empty.</fsummary>
+ <desc>
+ <p>Returns <c>true</c> if <c><anno>String</anno></c> is the
+ empty string, otherwise <c>false</c>.</p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:is_empty("foo").</input>
+false
+2> <input>string:is_empty(["",&lt;&lt;>>]).</input>
+true</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="length" arity="1"/>
+ <fsummary>Calculate length of the string.</fsummary>
+ <desc>
+ <p>
+ Returns the number of grapheme clusters in <c><anno>String</anno></c>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:length("ß↑e̊").</input>
+3
+2> <input>string:length(&lt;&lt;195,159,226,134,145,101,204,138>>).</input>
+3</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="lexemes" arity="2"/>
+ <fsummary>Split string into lexemes.</fsummary>
+ <desc>
+ <p>
+ Returns a list of lexemes in <c><anno>String</anno></c>, separated
+ by the grapheme clusters in <c><anno>SeparatorList</anno></c>.
+ </p>
+ <p>
+ Notice that, as shown in this example, two or more
+ adjacent separator graphemes clusters in <c><anno>String</anno></c>
+ are treated as one. That is, there are no empty
+ strings in the resulting list of lexemes.
+ See also <seealso marker="#split/3"><c>split/3</c></seealso> which returns
+ empty strings.
+ </p>
+ <p>Notice that <c>[$\r,$\n]</c> is one grapheme cluster.</p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:lexemes("abc de̊fxxghix jkl\r\nfoo", "x e" ++ [[$\r,$\n]]).</input>
+["abc","de̊f","ghi","jkl","foo"]
+2> <input>string:lexemes(&lt;&lt;"abc de̊fxxghix jkl\r\nfoo"/utf8>>, "x e" ++ [$\r,$\n]).</input>
+[&lt;&lt;"abc">>,&lt;&lt;"de̊f"/utf8>>,&lt;&lt;"ghi">>,&lt;&lt;"jkl\r\nfoo">>]</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="lowercase" arity="1"/>
+ <fsummary>Convert a string to lowercase</fsummary>
+ <desc>
+ <p>
+ Converts <c><anno>String</anno></c> to lowercase.
+ </p>
+ <p>
+ Notice that function <seealso marker="#casefold/1"><c>casefold/1</c></seealso>
+ should be used when converting a string to
+ be tested for equality.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+2> <input>string:lowercase(string:uppercase("Michał")).</input>
+"michał"</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="next_codepoint" arity="1"/>
+ <fsummary>Pick the first codepoint.</fsummary>
+ <desc>
+ <p>
+ Returns the first codepoint in <c><anno>String</anno></c>
+ and the rest of <c><anno>String</anno></c> in the tail.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:next_codepoint(unicode:characters_to_binary("e̊fg")).</input>
+[101|&lt;&lt;"̊fg"/utf8>>]</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="next_grapheme" arity="1"/>
+ <fsummary>Pick the first grapheme cluster.</fsummary>
+ <desc>
+ <p>
+ Returns the first grapheme cluster in <c><anno>String</anno></c>
+ and the rest of <c><anno>String</anno></c> in the tail.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:next_grapheme(unicode:characters_to_binary("e̊fg")).</input>
+["e̊"|&lt;&lt;"fg">>]</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="nth_lexeme" arity="3"/>
+ <fsummary>Pick the nth lexeme.</fsummary>
+ <desc>
+ <p>Returns lexeme number <c><anno>N</anno></c> in
+ <c><anno>String</anno></c>, where lexemes are separated by
+ the grapheme clusters in <c><anno>SeparatorList</anno></c>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:nth_lexeme("abc.de̊f.ghiejkl", 3, ".e").</input>
+"ghi"</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="pad" arity="2"/>
+ <name name="pad" arity="3"/>
+ <name name="pad" arity="4"/>
+ <fsummary>Pad a string to given length.</fsummary>
+ <desc>
+ <p>
+ Pads <c><anno>String</anno></c> to <c><anno>Length</anno></c> with
+ grapheme cluster <c><anno>Char</anno></c>.
+ <c><anno>Dir</anno></c>, which can be <c>leading</c>, <c>trailing</c>,
+ or <c>both</c>, indicates where the padding should be added.
+ </p>
+ <p>By default, <c><anno>Char</anno></c> is <c>$\s</c> and
+ <c><anno>Dir</anno></c> is <c>trailing</c>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:pad(&lt;&lt;"He̊llö"/utf8>>, 8).</input>
+[&lt;&lt;72,101,204,138,108,108,195,182>>,32,32,32]
+2> <input>io:format("'~ts'~n",[string:pad("He̊llö", 8, leading)]).</input>
+' He̊llö'
+3> <input>io:format("'~ts'~n",[string:pad("He̊llö", 8, both)]).</input>
+' He̊llö '</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="prefix" arity="2"/>
+ <fsummary>Remove prefix from string.</fsummary>
+ <desc>
+ <p>
+ If <c><anno>Prefix</anno></c> is the prefix of
+ <c><anno>String</anno></c>, removes it and returns the
+ remainder of <c><anno>String</anno></c>, otherwise returns
+ <c>nomatch</c>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:prefix(&lt;&lt;"prefix of string">>, "pre").</input>
+&lt;&lt;"fix of string">>
+2> <input>string:prefix("pre", "prefix").</input>
+nomatch</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="replace" arity="3"/>
+ <name name="replace" arity="4"/>
+ <fsummary>Replace a pattern in string.</fsummary>
+ <desc>
+ <p>
+ Replaces <c><anno>SearchPattern</anno></c> in <c><anno>String</anno></c>
+ with <c><anno>Replacement</anno></c>.
+ <c><anno>Where</anno></c>, default <c>leading</c>, indicates whether
+ the <c>leading</c>, the <c>trailing</c> or <c>all</c> encounters of
+ <c><anno>SearchPattern</anno></c> are to be replaced.
+ </p>
+ <p>Can be implemented as:</p>
+ <pre>lists:join(Replacement, split(String, SearchPattern, Where)).</pre>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:replace(&lt;&lt;"ab..cd..ef">>, "..", "*").</input>
+[&lt;&lt;"ab">>,"*",&lt;&lt;"cd..ef">>]
+2> <input>string:replace(&lt;&lt;"ab..cd..ef">>, "..", "*", all).</input>
+[&lt;&lt;"ab">>,"*",&lt;&lt;"cd">>,"*",&lt;&lt;"ef">>]</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="reverse" arity="1"/>
+ <fsummary>Reverses a string</fsummary>
+ <desc>
+ <p>
+ Returns the reverse list of the grapheme clusters in <c><anno>String</anno></c>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> Reverse = <input>string:reverse(unicode:characters_to_nfd_binary("ÅÄÖ")).</input>
+[[79,776],[65,776],[65,778]]
+2> <input>io:format("~ts~n",[Reverse]).</input>
+ÖÄÅ</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="slice" arity="2"/>
+ <name name="slice" arity="3"/>
+ <fsummary>Extract a part of string</fsummary>
+ <desc>
+ <p>Returns a substring of <c><anno>String</anno></c> of
+ at most <c><anno>Length</anno></c> grapheme clusters, starting at position
+ <c><anno>Start</anno></c>.</p>
+ <p>By default, <c><anno>Length</anno></c> is <c>infinity</c>.</p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:slice(&lt;&lt;"He̊llö Wörld"/utf8>>, 4).</input>
+&lt;&lt;"ö Wörld"/utf8>>
+2> <input>string:slice(["He̊llö ", &lt;&lt;"Wörld"/utf8>>], 4,4).</input>
+"ö Wö"
+3> <input>string:slice(["He̊llö ", &lt;&lt;"Wörld"/utf8>>], 4,50).</input>
+"ö Wörld"</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="split" arity="2"/>
+ <name name="split" arity="3"/>
+ <fsummary>Split a string into substrings.</fsummary>
+ <desc>
+ <p>
+ Splits <c><anno>String</anno></c> where <c><anno>SearchPattern</anno></c>
+ is encountered and return the remaining parts.
+ <c><anno>Where</anno></c>, default <c>leading</c>, indicates whether
+ the <c>leading</c>, the <c>trailing</c> or <c>all</c> encounters of
+ <c><anno>SearchPattern</anno></c> will split <c><anno>String</anno></c>.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+0> <input>string:split("ab..bc..cd", "..").</input>
+["ab","bc..cd"]
+1> <input>string:split(&lt;&lt;"ab..bc..cd">>, "..", trailing).</input>
+[&lt;&lt;"ab..bc">>,&lt;&lt;"cd">>]
+2> <input>string:split(&lt;&lt;"ab..bc....cd">>, "..", all).</input>
+[&lt;&lt;"ab">>,&lt;&lt;"bc">>,&lt;&lt;>>,&lt;&lt;"cd">>]</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="take" arity="2"/>
+ <name name="take" arity="3"/>
+ <name name="take" arity="4"/>
+ <fsummary>Take leading or trailing parts.</fsummary>
+ <desc>
+ <p>Takes characters from <c><anno>String</anno></c> as long as
+ the characters are members of set <c><anno>Characters</anno></c>
+ or the complement of set <c><anno>Characters</anno></c>.
+ <c><anno>Dir</anno></c>,
+ which can be <c>leading</c> or <c>trailing</c>, indicates from
+ which direction characters are to be taken.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+5> <input>string:take("abc0z123", lists:seq($a,$z)).</input>
+{"abc","0z123"}
+6> <input>string:take(&lt;&lt;"abc0z123">>, lists:seq($0,$9), true, leading).</input>
+{&lt;&lt;"abc">>,&lt;&lt;"0z123">>}
+7> <input>string:take("abc0z123", lists:seq($0,$9), false, trailing).</input>
+{"abc0z","123"}
+8> <input>string:take(&lt;&lt;"abc0z123">>, lists:seq($a,$z), true, trailing).</input>
+{&lt;&lt;"abc0z">>,&lt;&lt;"123">>}</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="titlecase" arity="1"/>
+ <fsummary>Convert a string to titlecase.</fsummary>
+ <desc>
+ <p>
+ Converts <c><anno>String</anno></c> to titlecase.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:titlecase("ß is a SHARP s").</input>
+"Ss is a SHARP s"</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="to_float" arity="1"/>
+ <fsummary>Return a float whose text representation is the integers
+ (ASCII values) of a string.</fsummary>
+ <desc>
+ <p>Argument <c><anno>String</anno></c> is expected to start with a
+ valid text represented float (the digits are ASCII values).
+ Remaining characters in the string after the float are returned in
+ <c><anno>Rest</anno></c>.</p>
+ <p><em>Example:</em></p>
+ <pre>
+> <input>{F1,Fs} = string:to_float("1.0-1.0e-1"),</input>
+> <input>{F2,[]} = string:to_float(Fs),</input>
+> <input>F1+F2.</input>
+0.9
+> <input>string:to_float("3/2=1.5").</input>
+{error,no_float}
+> <input>string:to_float("-1.5eX").</input>
+{-1.5,"eX"}</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="to_integer" arity="1"/>
+ <fsummary>Return an integer whose text representation is the integers
+ (ASCII values) of a string.</fsummary>
+ <desc>
+ <p>Argument <c><anno>String</anno></c> is expected to start with a
+ valid text represented integer (the digits are ASCII values).
+ Remaining characters in the string after the integer are returned in
+ <c><anno>Rest</anno></c>.</p>
+ <p><em>Example:</em></p>
+ <pre>
+> <input>{I1,Is} = string:to_integer("33+22"),</input>
+> <input>{I2,[]} = string:to_integer(Is),</input>
+> <input>I1-I2.</input>
+11
+> <input>string:to_integer("0.5").</input>
+{0,".5"}
+> <input>string:to_integer("x=2").</input>
+{error,no_integer}</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="to_graphemes" arity="1"/>
+ <fsummary>Convert a string to a list of grapheme clusters.</fsummary>
+ <desc>
+ <p>
+ Converts <c><anno>String</anno></c> to a list of grapheme clusters.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:to_graphemes("ß↑e̊").</input>
+[223,8593,[101,778]]
+2> <input>string:to_graphemes(&lt;&lt;"ß↑e̊"/utf8>>).</input>
+[223,8593,[101,778]]</pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="trim" arity="1"/>
+ <name name="trim" arity="2"/>
+ <name name="trim" arity="3"/>
+ <fsummary>Trim leading or trailing, or both, characters.</fsummary>
+ <desc>
+ <p>
+ Returns a string, where leading or trailing, or both,
+ <c><anno>Characters</anno></c> have been removed.
+ <c><anno>Dir</anno></c> which can be <c>leading</c>, <c>trailing</c>,
+ or <c>both</c>, indicates from which direction characters
+ are to be removed.
+ </p>
+ <p> Default <c><anno>Characters</anno></c> are the set of
+ nonbreakable whitespace codepoints, defined as
+ Pattern_White_Space in
+ <url href="http://unicode.org/reports/tr31/">Unicode Standard Annex #31</url>.
+ <c>By default, <anno>Dir</anno></c> is <c>both</c>.
+ </p>
+ <p>
+ Notice that <c>[$\r,$\n]</c> is one grapheme cluster according
+ to the Unicode Standard.
+ </p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:trim("\t Hello \n").</input>
+"Hello"
+2> <input>string:trim(&lt;&lt;"\t Hello \n">>, leading).</input>
+&lt;&lt;"Hello \n">>
+3> <input>string:trim(&lt;&lt;".Hello.\n">>, trailing, "\n.").</input>
+&lt;&lt;".Hello">></pre>
+ </desc>
+ </func>
+
+ <func>
+ <name name="uppercase" arity="1"/>
+ <fsummary>Convert a string to uppercase.</fsummary>
+ <desc>
+ <p>
+ Converts <c><anno>String</anno></c> to uppercase.
+ </p>
+ <p>See also <seealso marker="#titlecase/1"><c>titlecase/1</c></seealso>.</p>
+ <p><em>Example:</em></p>
+ <pre>
+1> <input>string:uppercase("Michał").</input>
+"MICHAŁ"</pre>
+ </desc>
+ </func>
+
+ </funcs>
+
+ <section>
+ <marker id="oldapi"/>
+ <title>Obsolete API functions</title>
+ <p>Here follows the function of the old API.
+ These functions only work on a list of Latin-1 characters.
+ </p>
+ <note><p>
+ The functions are kept for backward compatibility, but are
+ not recommended.
+ They will be deprecated in Erlang/OTP 21.
+ </p>
+ <p>Any undocumented functions in <c>string</c> are not to be used.</p>
+ </note>
+ </section>
+
<funcs>
<func>
<name name="centre" arity="2"/>
@@ -47,17 +652,24 @@
<p>Returns a string, where <c><anno>String</anno></c> is centered in the
string and surrounded by blanks or <c><anno>Character</anno></c>.
The resulting string has length <c><anno>Number</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#pad/3"><c>pad/3</c></seealso>.
+ </p>
</desc>
</func>
<func>
<name name="chars" arity="2"/>
<name name="chars" arity="3"/>
- <fsummary>Returns a string consisting of numbers of characters.</fsummary>
+ <fsummary>Return a string consisting of numbers of characters.</fsummary>
<desc>
<p>Returns a string consisting of <c><anno>Number</anno></c> characters
<c><anno>Character</anno></c>. Optionally, the string can end with
string <c><anno>Tail</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="lists#duplicate/2"><c>lists:duplicate/2</c></seealso>.</p>
</desc>
</func>
@@ -69,6 +681,9 @@
<p>Returns the index of the first occurrence of
<c><anno>Character</anno></c> in <c><anno>String</anno></c>. Returns
<c>0</c> if <c><anno>Character</anno></c> does not occur.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#find/2"><c>find/2</c></seealso>.</p>
</desc>
</func>
@@ -79,6 +694,16 @@
<p>Concatenates <c><anno>String1</anno></c> and
<c><anno>String2</anno></c> to form a new string
<c><anno>String3</anno></c>, which is returned.</p>
+ <p>
+ This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use <c>[<anno>String1</anno>, <anno>String2</anno>]</c> as
+ <c>Data</c> argument, and call
+ <seealso marker="unicode#characters_to_list/2">
+ <c>unicode:characters_to_list/2</c></seealso> or
+ <seealso marker="unicode#characters_to_binary/2">
+ <c>unicode:characters_to_binary/2</c></seealso>
+ to flatten the output.
+ </p>
</desc>
</func>
@@ -88,6 +713,9 @@
<desc>
<p>Returns a string containing <c><anno>String</anno></c> repeated
<c><anno>Number</anno></c> times.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="lists#duplicate/2"><c>lists:duplicate/2</c></seealso>.</p>
</desc>
</func>
@@ -98,6 +726,9 @@
<p>Returns the length of the maximum initial segment of
<c><anno>String</anno></c>, which consists entirely of characters
not from <c><anno>Chars</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#take/3"><c>take/3</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> string:cspan("\t abcdef", " \t").
@@ -106,20 +737,14 @@
</func>
<func>
- <name name="equal" arity="2"/>
- <fsummary>Test string equality.</fsummary>
- <desc>
- <p>Returns <c>true</c> if <c><anno>String1</anno></c> and
- <c><anno>String2</anno></c> are equal, otherwise <c>false</c>.</p>
- </desc>
- </func>
-
- <func>
<name name="join" arity="2"/>
<fsummary>Join a list of strings with separator.</fsummary>
<desc>
<p>Returns a string with the elements of <c><anno>StringList</anno></c>
separated by the string in <c><anno>Separator</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="lists#join/2"><c>lists:join/2</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> join(["one", "two", "three"], ", ").
@@ -137,6 +762,10 @@
fixed. If <c>length(<anno>String</anno>)</c> &lt;
<c><anno>Number</anno></c>, then <c><anno>String</anno></c> is padded
with blanks or <c><anno>Character</anno></c>s.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#pad/2"><c>pad/2</c></seealso> or
+ <seealso marker="#pad/3"><c>pad/3</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> string:left("Hello",10,$.).
@@ -149,6 +778,9 @@
<fsummary>Return the length of a string.</fsummary>
<desc>
<p>Returns the number of characters in <c><anno>String</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#length/1"><c>length/1</c></seealso>.</p>
</desc>
</func>
@@ -160,6 +792,9 @@
<p>Returns the index of the last occurrence of
<c><anno>Character</anno></c> in <c><anno>String</anno></c>. Returns
<c>0</c> if <c><anno>Character</anno></c> does not occur.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#find/3"><c>find/3</c></seealso>.</p>
</desc>
</func>
@@ -173,6 +808,9 @@
fixed. If the length of <c>(<anno>String</anno>)</c> &lt;
<c><anno>Number</anno></c>, then <c><anno>String</anno></c> is padded
with blanks or <c><anno>Character</anno></c>s.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#pad/3"><c>pad/3</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> string:right("Hello", 10, $.).
@@ -188,6 +826,9 @@
<c><anno>SubString</anno></c> begins in <c><anno>String</anno></c>.
Returns <c>0</c> if <c><anno>SubString</anno></c>
does not exist in <c><anno>String</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#find/3"><c>find/3</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> string:rstr(" Hello Hello World World ", "Hello World").
@@ -202,6 +843,9 @@
<p>Returns the length of the maximum initial segment of
<c><anno>String</anno></c>, which consists entirely of characters
from <c><anno>Chars</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#take/2"><c>take/2</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> string:span("\t abcdef", " \t").
@@ -217,6 +861,9 @@
<c><anno>SubString</anno></c> begins in <c><anno>String</anno></c>.
Returns <c>0</c> if <c><anno>SubString</anno></c>
does not exist in <c><anno>String</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#find/2"><c>find/2</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> string:str(" Hello Hello World World ", "Hello World").
@@ -230,12 +877,15 @@
<name name="strip" arity="3"/>
<fsummary>Strip leading or trailing characters.</fsummary>
<desc>
- <p>Returns a string, where leading and/or trailing blanks or a
+ <p>Returns a string, where leading or trailing, or both, blanks or a
number of <c><anno>Character</anno></c> have been removed.
<c><anno>Direction</anno></c>, which can be <c>left</c>, <c>right</c>,
or <c>both</c>, indicates from which direction blanks are to be
removed. <c>strip/1</c> is equivalent to
<c>strip(String, both)</c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#trim/3"><c>trim/3</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> string:strip("...Hello.....", both, $.).
@@ -251,6 +901,9 @@
<p>Returns a substring of <c><anno>String</anno></c>, starting at
position <c><anno>Start</anno></c> to the end of the string, or to
and including position <c><anno>Stop</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#slice/3"><c>slice/3</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
sub_string("Hello World", 4, 8).
@@ -266,6 +919,9 @@ sub_string("Hello World", 4, 8).
<p>Returns a substring of <c><anno>String</anno></c>, starting at
position <c><anno>Start</anno></c>, and ending at the end of the
string or at length <c><anno>Length</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#slice/3"><c>slice/3</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> substr("Hello World", 4, 5).
@@ -281,6 +937,9 @@ sub_string("Hello World", 4, 8).
<p>Returns the word in position <c><anno>Number</anno></c> of
<c><anno>String</anno></c>. Words are separated by blanks or
<c><anno>Character</anno></c>s.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#nth_lexeme/3"><c>nth_lexeme/3</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> string:sub_word(" Hello old boy !",3,$o).
@@ -289,50 +948,6 @@ sub_string("Hello World", 4, 8).
</func>
<func>
- <name name="to_float" arity="1"/>
- <fsummary>Returns a float whose text representation is the integers
- (ASCII values) in a string.</fsummary>
- <desc>
- <p>Argument <c><anno>String</anno></c> is expected to start with a
- valid text represented float (the digits are ASCII values).
- Remaining characters in the string after the float are returned in
- <c><anno>Rest</anno></c>.</p>
- <p><em>Example:</em></p>
- <code type="none">
-> {F1,Fs} = string:to_float("1.0-1.0e-1"),
-> {F2,[]} = string:to_float(Fs),
-> F1+F2.
-0.9
-> string:to_float("3/2=1.5").
-{error,no_float}
-> string:to_float("-1.5eX").
-{-1.5,"eX"}</code>
- </desc>
- </func>
-
- <func>
- <name name="to_integer" arity="1"/>
- <fsummary>Returns an integer whose text representation is the integers
- (ASCII values) in a string.</fsummary>
- <desc>
- <p>Argument <c><anno>String</anno></c> is expected to start with a
- valid text represented integer (the digits are ASCII values).
- Remaining characters in the string after the integer are returned in
- <c><anno>Rest</anno></c>.</p>
- <p><em>Example:</em></p>
- <code type="none">
-> {I1,Is} = string:to_integer("33+22"),
-> {I2,[]} = string:to_integer(Is),
-> I1-I2.
-11
-> string:to_integer("0.5").
-{0,".5"}
-> string:to_integer("x=2").
-{error,no_integer}</code>
- </desc>
- </func>
-
- <func>
<name name="to_lower" arity="1" clause_i="1"/>
<name name="to_lower" arity="1" clause_i="2"/>
<name name="to_upper" arity="1" clause_i="1"/>
@@ -346,6 +961,11 @@ sub_string("Hello World", 4, 8).
<p>The specified string or character is case-converted. Notice that
the supported character set is ISO/IEC 8859-1 (also called Latin 1);
all values outside this set are unchanged</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso> use
+ <seealso marker="#lowercase/1"><c>lowercase/1</c></seealso>,
+ <seealso marker="#uppercase/1"><c>uppercase/1</c></seealso>,
+ <seealso marker="#titlecase/1"><c>titlecase/1</c></seealso> or
+ <seealso marker="#casefold/1"><c>casefold/1</c></seealso>.</p>
</desc>
</func>
@@ -363,6 +983,9 @@ sub_string("Hello World", 4, 8).
adjacent separator characters in <c><anno>String</anno></c>
are treated as one. That is, there are no empty
strings in the resulting list of tokens.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#lexemes/2"><c>lexemes/2</c></seealso>.</p>
</desc>
</func>
@@ -373,6 +996,9 @@ sub_string("Hello World", 4, 8).
<desc>
<p>Returns the number of words in <c><anno>String</anno></c>, separated
by blanks or <c><anno>Character</anno></c>.</p>
+ <p>This function is <seealso marker="#oldapi">obsolete</seealso>.
+ Use
+ <seealso marker="#lexemes/2"><c>lexemes/2</c></seealso>.</p>
<p><em>Example:</em></p>
<code type="none">
> words(" Hello old boy!", $o).
@@ -387,10 +1013,7 @@ sub_string("Hello World", 4, 8).
other. The reason is that this string package is the
combination of two earlier packages and all functions of
both packages have been retained.</p>
-
- <note>
- <p>Any undocumented functions in <c>string</c> are not to be used.</p>
- </note>
</section>
+
</erlref>
diff --git a/lib/stdlib/doc/src/unicode.xml b/lib/stdlib/doc/src/unicode.xml
index 93d0d37456..382b253ba1 100644
--- a/lib/stdlib/doc/src/unicode.xml
+++ b/lib/stdlib/doc/src/unicode.xml
@@ -50,8 +50,35 @@
external entities where this is required. When working inside the
Erlang/OTP environment, it is recommended to keep binaries in UTF-8 when
representing Unicode characters. ISO Latin-1 encoding is supported both
- for backward compatibility and for communication
- with external entities not supporting Unicode character sets.</p>
+ for backward compatibility and for communication
+ with external entities not supporting Unicode character sets.</p>
+ <p>Programs should always operate on a normalized form and compare
+ canonical-equivalent Unicode characters as equal. All characters
+ should thus be normalized to one form once on the system borders.
+ One of the following functions can convert characters to their
+ normalized forms <seealso marker="#characters_to_nfc_list/1">
+ <c>characters_to_nfc_list/1</c></seealso>,
+ <seealso marker="#characters_to_nfc_binary/1">
+ <c>characters_to_nfc_binary/1</c></seealso>,
+ <seealso marker="#characters_to_nfd_list/1">
+ <c>characters_to_nfd_list/1</c></seealso> or
+ <seealso marker="#characters_to_nfd_binary/1">
+ <c>characters_to_nfd_binary/1</c></seealso>.
+ For general text
+ <seealso marker="#characters_to_nfc_list/1">
+ <c>characters_to_nfc_list/1</c></seealso> or
+ <seealso marker="#characters_to_nfc_binary/1">
+ <c>characters_to_nfc_binary/1</c></seealso> is preferred, and
+ for identifiers one of the compatibility normalization
+ functions, such as
+ <seealso marker="#characters_to_nfkc_list/1">
+ <c>characters_to_nfkc_list/1</c></seealso>,
+ is preferred for security reasons.
+ The normalization functions where introduced in OTP 20.
+ Additional information on normalization can be found in the
+ <url href="http://unicode.org/faq/normalization.html">Unicode FAQ</url>.
+ </p>
+
</description>
<datatypes>
@@ -335,6 +362,154 @@ decode_data(Data) ->
</func>
<func>
+ <name name="characters_to_nfc_list" arity="1"/>
+ <fsummary>Normalize characters to a list of canonical equivalent
+ composed Unicode characters.</fsummary>
+ <desc>
+ <p>Converts a possibly deep list of characters and binaries
+ into a Normalized Form of canonical equivalent Composed
+ characters according to the Unicode standard.</p>
+ <p>Any binaries in the input must be encoded with utf8
+ encoding.
+ </p>
+ <p>The result is a list of characters.</p>
+ <code>
+3> unicode:characters_to_nfc_list([&lt;&lt;"abc..a">>,[778],$a,[776],$o,[776]]).
+"abc..åäö"
+</code>
+ </desc>
+ </func>
+
+ <func>
+ <name name="characters_to_nfc_binary" arity="1"/>
+ <fsummary>Normalize characters to a utf8 binary of canonical equivalent
+ composed Unicode characters.</fsummary>
+ <desc>
+ <p>Converts a possibly deep list of characters and binaries
+ into a Normalized Form of canonical equivalent Composed
+ characters according to the Unicode standard.</p>
+ <p>Any binaries in the input must be encoded with utf8
+ encoding.</p>
+ <p>The result is an utf8 encoded binary.</p>
+ <code>
+4> unicode:characters_to_nfc_binary([&lt;&lt;"abc..a">>,[778],$a,[776],$o,[776]]).
+&lt;&lt;"abc..åäö"/utf8>>
+</code>
+ </desc>
+ </func>
+
+ <func>
+ <name name="characters_to_nfd_list" arity="1"/>
+ <fsummary>Normalize characters to a list of canonical equivalent
+ decomposed Unicode characters.</fsummary>
+ <desc>
+ <p>Converts a possibly deep list of characters and binaries
+ into a Normalized Form of canonical equivalent Decomposed
+ characters according to the Unicode standard.</p>
+ <p>Any binaries in the input must be encoded with utf8
+ encoding.
+ </p>
+ <p>The result is a list of characters.</p>
+ <code>
+1> unicode:characters_to_nfd_list("abc..åäö").
+[97,98,99,46,46,97,778,97,776,111,776]
+</code>
+ </desc>
+ </func>
+
+ <func>
+ <name name="characters_to_nfd_binary" arity="1"/>
+ <fsummary>Normalize characters to a utf8 binary of canonical equivalent
+ decomposed Unicode characters.</fsummary>
+ <desc>
+ <p>Converts a possibly deep list of characters and binaries
+ into a Normalized Form of canonical equivalent Decomposed
+ characters according to the Unicode standard.</p>
+ <p>Any binaries in the input must be encoded with utf8
+ encoding.</p>
+ <p>The result is an utf8 encoded binary.</p>
+ <code>
+2> unicode:characters_to_nfd_binary("abc..åäö").
+&lt;&lt;97,98,99,46,46,97,204,138,97,204,136,111,204,136>>
+</code>
+ </desc>
+ </func>
+
+ <func>
+ <name name="characters_to_nfkc_list" arity="1"/>
+ <fsummary>Normalize characters to a list of canonical equivalent
+ composed Unicode characters.</fsummary>
+ <desc>
+ <p>Converts a possibly deep list of characters and binaries
+ into a Normalized Form of compatibly equivalent Composed
+ characters according to the Unicode standard.</p>
+ <p>Any binaries in the input must be encoded with utf8
+ encoding.
+ </p>
+ <p>The result is a list of characters.</p>
+ <code>
+3> unicode:characters_to_nfkc_list([&lt;&lt;"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]).
+"abc..åäö32"
+</code>
+ </desc>
+ </func>
+
+ <func>
+ <name name="characters_to_nfkc_binary" arity="1"/>
+ <fsummary>Normalize characters to a utf8 binary of compatibly equivalent
+ composed Unicode characters.</fsummary>
+ <desc>
+ <p>Converts a possibly deep list of characters and binaries
+ into a Normalized Form of compatibly equivalent Composed
+ characters according to the Unicode standard.</p>
+ <p>Any binaries in the input must be encoded with utf8
+ encoding.</p>
+ <p>The result is an utf8 encoded binary.</p>
+ <code>
+4> unicode:characters_to_nfkc_binary([&lt;&lt;"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]).
+&lt;&lt;"abc..åäö32"/utf8>>
+</code>
+ </desc>
+ </func>
+
+ <func>
+ <name name="characters_to_nfkd_list" arity="1"/>
+ <fsummary>Normalize characters to a list of compatibly equivalent
+ decomposed Unicode characters.</fsummary>
+ <desc>
+ <p>Converts a possibly deep list of characters and binaries
+ into a Normalized Form of compatibly equivalent Decomposed
+ characters according to the Unicode standard.</p>
+ <p>Any binaries in the input must be encoded with utf8
+ encoding.
+ </p>
+ <p>The result is a list of characters.</p>
+ <code>
+1> unicode:characters_to_nfkd_list(["abc..åäö",[65299,65298]]).
+[97,98,99,46,46,97,778,97,776,111,776,51,50]
+</code>
+ </desc>
+ </func>
+
+ <func>
+ <name name="characters_to_nfkd_binary" arity="1"/>
+ <fsummary>Normalize characters to a utf8 binary of compatibly equivalent
+ decomposed Unicode characters.</fsummary>
+ <desc>
+ <p>Converts a possibly deep list of characters and binaries
+ into a Normalized Form of compatibly equivalent Decomposed
+ characters according to the Unicode standard.</p>
+ <p>Any binaries in the input must be encoded with utf8
+ encoding.</p>
+ <p>The result is an utf8 encoded binary.</p>
+ <code>
+2> unicode:characters_to_nfkd_binary(["abc..åäö",[65299,65298]]).
+&lt;&lt;97,98,99,46,46,97,204,138,97,204,136,111,204,136,51,50>>
+</code>
+ </desc>
+ </func>
+
+ <func>
<name name="encoding_to_bom" arity="1"/>
<fsummary>Create a binary UTF byte order mark from encoding.</fsummary>
<type_desc variable="Bin">
diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml
index a8ef8ff5c5..11b84f552a 100644
--- a/lib/stdlib/doc/src/unicode_usage.xml
+++ b/lib/stdlib/doc/src/unicode_usage.xml
@@ -65,7 +65,10 @@
<item><p>In Erlang/OTP 20.0, atoms and function can contain
Unicode characters. Module names are still restricted to
- the ISO-Latin-1 range.</p></item>
+ the ISO-Latin-1 range.</p>
+ <p>Support was added for normalizations forms in
+ <c>unicode</c> and the <c>string</c> module now handles
+ utf8-encoded binaries.</p></item>
</list>
<p>This section outlines the current Unicode support and gives some
@@ -110,23 +113,27 @@
</item>
</list>
- <p>So, a conversion function must know not only one character at a time,
- but possibly the whole sentence, the natural language to translate to,
- the differences in input and output string length, and so on.
- Erlang/OTP has currently no Unicode <c>to_upper</c>/<c>to_lower</c>
- functionality, but publicly available libraries address these issues.</p>
-
- <p>Another example is the accented characters, where the same glyph has two
- different representations. The Swedish letter "ö" is one example.
- The Unicode standard has a code point for it, but you can also write it
- as "o" followed by "U+0308" (Combining Diaeresis, with the simplified
- meaning that the last letter is to have "¨" above). They have the same
- glyph. They are for most purposes the same, but have different
- representations. For example, MacOS X converts all filenames to use
- Combining Diaeresis, while most other programs (including Erlang) try to
- hide that by doing the opposite when, for example, listing directories.
- However it is done, it is usually important to normalize such
- characters to avoid confusion.</p>
+ <p>So, a conversion function must know not only one character at a
+ time, but possibly the whole sentence, the natural language to
+ translate to, the differences in input and output string length,
+ and so on. Erlang/OTP has currently no Unicode
+ <c>uppercase</c>/<c>lowercase</c> functionality with language
+ specific handling, but publicly available libraries address these
+ issues.</p>
+
+ <p>Another example is the accented characters, where the same
+ glyph has two different representations. The Swedish letter "ö" is
+ one example. The Unicode standard has a code point for it, but
+ you can also write it as "o" followed by "U+0308" (Combining
+ Diaeresis, with the simplified meaning that the last letter is to
+ have "¨" above). They have the same glyph, user perceived
+ character. They are for most purposes the same, but have different
+ representations. For example, MacOS X converts all filenames to
+ use Combining Diaeresis, while most other programs (including
+ Erlang) try to hide that by doing the opposite when, for example,
+ listing directories. However it is done, it is usually important
+ to normalize such characters to avoid confusion.
+ </p>
<p>The list of examples can be made long. One need a kind of knowledge that
was not needed when programs only considered one or two languages. The
@@ -273,7 +280,7 @@
them. In some cases functionality has been added to already
existing interfaces (as the <seealso
marker="stdlib:string"><c>string</c></seealso> module now can
- handle lists with any code points). In some cases new
+ handle strings with any code points). In some cases new
functionality or options have been added (as in the <seealso
marker="stdlib:io"><c>io</c></seealso> module, the file
handling, the <seealso
@@ -977,7 +984,7 @@ Eshell V5.10.1 (abort with ^G)
<p>Fortunately, most textual data has been stored in lists and range
checking has been sparse, so modules like <c>string</c> work well for
- Unicode lists with little need for conversion or extension.</p>
+ Unicode strings with little need for conversion or extension.</p>
<p>Some modules are, however, changed to be explicitly Unicode-aware. These
modules include:</p>
@@ -1028,18 +1035,17 @@ Eshell V5.10.1 (abort with ^G)
has extensive support for Unicode text.</p></item>
</taglist>
- <p>The <seealso marker="stdlib:string"><c>string</c></seealso> module works
- perfectly for Unicode strings and ISO Latin-1 strings, except the
- language-dependent functions
- <seealso marker="stdlib:string#to_upper/1"><c>string:to_upper/1</c></seealso>
- and
- <seealso marker="stdlib:string#to_lower/1"><c>string:to_lower/1</c></seealso>,
- which are only correct for the ISO Latin-1 character set. These two
- functions can never function correctly for Unicode characters in their
- current form, as there are language and locale issues as well as
- multi-character mappings to consider when converting text between cases.
- Converting case in an international environment is a large subject not
- yet addressed in OTP.</p>
+ <p>The <seealso marker="stdlib:string"><c>string</c></seealso>
+ module works perfectly for Unicode strings and ISO Latin-1
+ strings, except the language-dependent functions <seealso
+ marker="stdlib:string#uppercase/1"><c>string:uppercase/1</c></seealso>
+ and <seealso
+ marker="stdlib:string#lowercase/1"><c>string:lowercase/1</c></seealso>.
+ These two functions can never function correctly for Unicode
+ characters in their current form, as there are language and locale
+ issues to consider when converting text between cases. Converting
+ case in an international environment is a large subject not yet
+ addressed in OTP.</p>
</section>
<section>