author | Dan Gudmundsson <[email protected]> | 2017-04-24 12:41:10 +0200 |
---|---|---|
committer | Dan Gudmundsson <[email protected]> | 2017-04-24 12:41:10 +0200 |
commit | 739bca3fc267c55d84c8f5c193d16c0b2a7eee13 (patch) | |
tree | a7ac2a4c90bdeadd2a313e7d49cecf84d3e1ba28 /lib/stdlib/doc | |
parent | 515dc2d603d449ed3621d96028ba403aef05ea7f (diff) | |
parent | a9d0d119837fb0bc52d2bb3d48a47568de9100b4 (diff) | |
Merge branch 'dgud/stdlib/unicode-string/OTP-10289'
* dgud/stdlib/unicode-string/OTP-10289:
  Handle chardata in string:to_float and string:to_list
  New unicode aware string module that works with unicode:chardata()
  Add nf(k)d, nf(k)c conversion functions to unicode module
  Reorder code and whitespace fixes
  Add unicode_util
Diffstat (limited to 'lib/stdlib/doc')
-rw-r--r-- | lib/stdlib/doc/src/string.xml | 741 |
-rw-r--r-- | lib/stdlib/doc/src/unicode.xml | 179 |
-rw-r--r-- | lib/stdlib/doc/src/unicode_usage.xml | 70 |
3 files changed, 897 insertions, 93 deletions
diff --git a/lib/stdlib/doc/src/string.xml b/lib/stdlib/doc/src/string.xml index dddedf1132..dc83c40a9a 100644 --- a/lib/stdlib/doc/src/string.xml +++ b/lib/stdlib/doc/src/string.xml @@ -36,8 +36,613 @@ <modulesummary>String processing functions.</modulesummary> <description> <p>This module provides functions for string processing.</p> + <p>A string in this module is represented by <seealso marker="unicode#type-chardata"> + <c>unicode:chardata()</c></seealso>, that is, a list of codepoints, + binaries with UTF-8-encoded codepoints + (<em>UTF-8 binaries</em>), or a mix of the two.</p> + <code> +"abcd" is a valid string +<<"abcd">> is a valid string +["abcd"] is a valid string +<<"abc..åäö"/utf8>> is a valid string +<<"abc..åäö">> is NOT a valid string, + but a binary with Latin-1-encoded codepoints +[<<"abc">>, "..åäö"] is a valid string +[atom] is NOT a valid string</code> + <p> + This module operates on grapheme clusters. A <em>grapheme cluster</em> + is a user-perceived character, which can be represented by several + codepoints. + </p> + <code> +"å" [229] or [97, 778] +"e̊" [101, 778]</code> + <p> + The string length of "ß↑e̊" is 3, even though it is represented by the + codepoints <c>[223,8593,101,778]</c> or the UTF-8 binary + <c><<195,159,226,134,145,101,204,138>></c>. + </p> + <p> + Grapheme clusters for codepoints of class <c>prepend</c> + and non-modern (or decomposed) Hangul are not handled for performance + reasons in + <seealso marker="#find/3"><c>find/3</c></seealso>, + <seealso marker="#replace/3"><c>replace/3</c></seealso>, + <seealso marker="#split/2"><c>split/2</c></seealso>, + <seealso marker="#lexemes/2"><c>lexemes/2</c></seealso> and + <seealso marker="#trim/3"><c>trim/3</c></seealso>. + </p> + <p> + Splitting and appending strings is to be done on grapheme cluster + borders. + There is no verification that the results of appending strings are + valid or normalized. 
+ </p> + <p> + Most of the functions expect all input to be normalized to one form, + see for example <seealso marker="unicode#characters_to_nfc_list/1"> + <c>unicode:characters_to_nfc_list/1</c></seealso>. + </p> + <p> + Language or locale specific handling of input is not considered + in any function. + </p> + <p> + The functions can crash for non-valid input strings. For example, + the functions expect UTF-8 binaries but not all functions + verify that all binaries are encoded correctly. + </p> + <p> + Unless otherwise specified the return value type is the same as + the input type. That is, binary input returns binary output, + list input returns a list output, and mixed input can return a + mixed output.</p> + <code> +1> string:trim(" sarah "). +"sarah" +2> string:trim(<<" sarah ">>). +<<"sarah">> +3> string:lexemes("foo bar", " "). +["foo","bar"] +4> string:lexemes(<<"foo bar">>, " "). +[<<"foo">>,<<"bar">>]</code> + <p>This module has been reworked in Erlang/OTP 20 to + handle <seealso marker="unicode#type-chardata"> + <c>unicode:chardata()</c></seealso> and operate on grapheme + clusters. The <seealso marker="#oldapi"> <c>old + functions</c></seealso> that only work on Latin-1 lists as input + are still available but should not be + used. They will be deprecated in Erlang/OTP 21. + </p> </description> + <datatypes> + <datatype> + <name name="direction"/> + <name name="grapheme_cluster"/> + <desc> + <p>A user-perceived character, consisting of one or more + codepoints.</p> + </desc> + </datatype> + </datatypes> + + <funcs> + + <func> + <name name="casefold" arity="1"/> + <fsummary>Convert a string to a comparable string.</fsummary> + <desc> + <p> + Converts <c><anno>String</anno></c> to a case-agnostic + comparable string. Function <c>casefold/1</c> is preferred + over <c>lowercase/1</c> when two strings are to be compared + for equality. See also <seealso marker="#equal/4"><c>equal/4</c></seealso>. 
+ </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:casefold("Ω and ẞ SHARP S").</input> +"ω and ss sharp s"</pre> + </desc> + </func> + + <func> + <name name="chomp" arity="1"/> + <fsummary>Remove trailing end of line control characters.</fsummary> + <desc> + <p> + Returns a string where any trailing <c>\n</c> or + <c>\r\n</c> have been removed from <c><anno>String</anno></c>. + </p> + <p><em>Example:</em></p> + <pre> +182> <input>string:chomp(<<"\nHello\n\n">>).</input> +<<"\nHello">> +183> <input>string:chomp("\nHello\r\r\n").</input> +"\nHello\r"</pre> + </desc> + </func> + + <func> + <name name="equal" arity="2"/> + <name name="equal" arity="3"/> + <name name="equal" arity="4"/> + <fsummary>Test string equality.</fsummary> + <desc> + <p> + Returns <c>true</c> if <c><anno>A</anno></c> and + <c><anno>B</anno></c> are equal, otherwise <c>false</c>. + </p> + <p> + If <c><anno>IgnoreCase</anno></c> is <c>true</c> + the function does <seealso marker="#casefold/1"> + <c>casefold</c>ing</seealso> on the fly before the equality test. + </p> + <p>If <c><anno>Norm</anno></c> is not <c>none</c> + the function applies normalization on the fly before the equality test. + There are four available normalization forms: + <seealso marker="unicode#characters_to_nfc_list/1"> <c>nfc</c></seealso>, + <seealso marker="unicode#characters_to_nfd_list/1"> <c>nfd</c></seealso>, + <seealso marker="unicode#characters_to_nfkc_list/1"> <c>nfkc</c></seealso>, and + <seealso marker="unicode#characters_to_nfkd_list/1"> <c>nfkd</c></seealso>. 
+ </p> + <p>By default, + <c><anno>IgnoreCase</anno></c> is <c>false</c> and + <c><anno>Norm</anno></c> is <c>none</c>.</p> + <p><em>Example:</em></p> + <pre> +1> <input>string:equal("åäö", <<"åäö"/utf8>>).</input> +true +2> <input>string:equal("åäö", unicode:characters_to_nfd_binary("åäö")).</input> +false +3> <input>string:equal("åäö", unicode:characters_to_nfd_binary("ÅÄÖ"), true, nfc).</input> +true</pre> + </desc> + </func> + + <func> + <name name="find" arity="2"/> + <name name="find" arity="3"/> + <fsummary>Find start of substring.</fsummary> + <desc> + <p> + Removes anything before <c><anno>SearchPattern</anno></c> in <c><anno>String</anno></c> + and returns the remainder of the string or <c>nomatch</c> if <c><anno>SearchPattern</anno></c> is not + found. + <c><anno>Dir</anno></c>, which can be <c>leading</c> or + <c>trailing</c>, indicates from which direction characters + are to be searched. + </p> + <p> + By default, <c><anno>Dir</anno></c> is <c>leading</c>. + </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:find("ab..cd..ef", ".").</input> +"..cd..ef" +2> <input>string:find(<<"ab..cd..ef">>, "..", trailing).</input> +<<"..ef">> +3> <input>string:find(<<"ab..cd..ef">>, "x", leading).</input> +nomatch +4> <input>string:find("ab..cd..ef", "x", trailing).</input> +nomatch</pre> + </desc> + </func> + + <func> + <name name="is_empty" arity="1"/> + <fsummary>Check if the string is empty.</fsummary> + <desc> + <p>Returns <c>true</c> if <c><anno>String</anno></c> is the + empty string, otherwise <c>false</c>.</p> + <p><em>Example:</em></p> + <pre> +1> <input>string:is_empty("foo").</input> +false +2> <input>string:is_empty(["",<<>>]).</input> +true</pre> + </desc> + </func> + + <func> + <name name="length" arity="1"/> + <fsummary>Calculate length of the string.</fsummary> + <desc> + <p> + Returns the number of grapheme clusters in <c><anno>String</anno></c>. 
+ </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:length("ß↑e̊").</input> +3 +2> <input>string:length(<<195,159,226,134,145,101,204,138>>).</input> +3</pre> + </desc> + </func> + + <func> + <name name="lexemes" arity="2"/> + <fsummary>Split string into lexemes.</fsummary> + <desc> + <p> + Returns a list of lexemes in <c><anno>String</anno></c>, separated + by the grapheme clusters in <c><anno>SeparatorList</anno></c>. + </p> + <p> + Notice that, as shown in this example, two or more + adjacent separator grapheme clusters in <c><anno>String</anno></c> + are treated as one. That is, there are no empty + strings in the resulting list of lexemes. + See also <seealso marker="#split/3"><c>split/3</c></seealso>, which returns + empty strings. + </p> + <p>Notice that <c>[$\r,$\n]</c> is one grapheme cluster.</p> + <p><em>Example:</em></p> + <pre> +1> <input>string:lexemes("abc de̊fxxghix jkl\r\nfoo", "x e" ++ [[$\r,$\n]]).</input> +["abc","de̊f","ghi","jkl","foo"] +2> <input>string:lexemes(<<"abc de̊fxxghix jkl\r\nfoo"/utf8>>, "x e" ++ [$\r,$\n]).</input> +[<<"abc">>,<<"de̊f"/utf8>>,<<"ghi">>,<<"jkl\r\nfoo">>]</pre> + </desc> + </func> + + <func> + <name name="lowercase" arity="1"/> + <fsummary>Convert a string to lowercase.</fsummary> + <desc> + <p> + Converts <c><anno>String</anno></c> to lowercase. + </p> + <p> + Notice that function <seealso marker="#casefold/1"><c>casefold/1</c></seealso> + should be used when converting a string to + be tested for equality. + </p> + <p><em>Example:</em></p> + <pre> +2> <input>string:lowercase(string:uppercase("Michał")).</input> +"michał"</pre> + </desc> + </func> + + <func> + <name name="next_codepoint" arity="1"/> + <fsummary>Pick the first codepoint.</fsummary> + <desc> + <p> + Returns the first codepoint in <c><anno>String</anno></c> + and the rest of <c><anno>String</anno></c> in the tail. 
+ </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:next_codepoint(unicode:characters_to_binary("e̊fg")).</input> +[101|<<"̊fg"/utf8>>]</pre> + </desc> + </func> + + <func> + <name name="next_grapheme" arity="1"/> + <fsummary>Pick the first grapheme cluster.</fsummary> + <desc> + <p> + Returns the first grapheme cluster in <c><anno>String</anno></c> + and the rest of <c><anno>String</anno></c> in the tail. + </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:next_grapheme(unicode:characters_to_binary("e̊fg")).</input> +["e̊"|<<"fg">>]</pre> + </desc> + </func> + + <func> + <name name="nth_lexeme" arity="3"/> + <fsummary>Pick the nth lexeme.</fsummary> + <desc> + <p>Returns lexeme number <c><anno>N</anno></c> in + <c><anno>String</anno></c>, where lexemes are separated by + the grapheme clusters in <c><anno>SeparatorList</anno></c>. + </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:nth_lexeme("abc.de̊f.ghiejkl", 3, ".e").</input> +"ghi"</pre> + </desc> + </func> + + <func> + <name name="pad" arity="2"/> + <name name="pad" arity="3"/> + <name name="pad" arity="4"/> + <fsummary>Pad a string to given length.</fsummary> + <desc> + <p> + Pads <c><anno>String</anno></c> to <c><anno>Length</anno></c> with + grapheme cluster <c><anno>Char</anno></c>. + <c><anno>Dir</anno></c>, which can be <c>leading</c>, <c>trailing</c>, + or <c>both</c>, indicates where the padding should be added. + </p> + <p>By default, <c><anno>Char</anno></c> is <c>$\s</c> and + <c><anno>Dir</anno></c> is <c>trailing</c>. 
+ </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:pad(<<"He̊llö"/utf8>>, 8).</input> +[<<72,101,204,138,108,108,195,182>>,32,32,32] +2> <input>io:format("'~ts'~n",[string:pad("He̊llö", 8, leading)]).</input> +' He̊llö' +3> <input>io:format("'~ts'~n",[string:pad("He̊llö", 8, both)]).</input> +' He̊llö '</pre> + </desc> + </func> + + <func> + <name name="prefix" arity="2"/> + <fsummary>Remove prefix from string.</fsummary> + <desc> + <p> + If <c><anno>Prefix</anno></c> is the prefix of + <c><anno>String</anno></c>, removes it and returns the + remainder of <c><anno>String</anno></c>, otherwise returns + <c>nomatch</c>. + </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:prefix(<<"prefix of string">>, "pre").</input> +<<"fix of string">> +2> <input>string:prefix("pre", "prefix").</input> +nomatch</pre> + </desc> + </func> + + <func> + <name name="replace" arity="3"/> + <name name="replace" arity="4"/> + <fsummary>Replace a pattern in string.</fsummary> + <desc> + <p> + Replaces <c><anno>SearchPattern</anno></c> in <c><anno>String</anno></c> + with <c><anno>Replacement</anno></c>. + <c><anno>Where</anno></c>, default <c>leading</c>, indicates whether + the <c>leading</c>, the <c>trailing</c> or <c>all</c> encounters of + <c><anno>SearchPattern</anno></c> are to be replaced. + </p> + <p>Can be implemented as:</p> + <pre>lists:join(Replacement, split(String, SearchPattern, Where)).</pre> + <p><em>Example:</em></p> + <pre> +1> <input>string:replace(<<"ab..cd..ef">>, "..", "*").</input> +[<<"ab">>,"*",<<"cd..ef">>] +2> <input>string:replace(<<"ab..cd..ef">>, "..", "*", all).</input> +[<<"ab">>,"*",<<"cd">>,"*",<<"ef">>]</pre> + </desc> + </func> + + <func> + <name name="reverse" arity="1"/> + <fsummary>Reverses a string</fsummary> + <desc> + <p> + Returns the reverse list of the grapheme clusters in <c><anno>String</anno></c>. 
+ </p> + <p><em>Example:</em></p> + <pre> +1> Reverse = <input>string:reverse(unicode:characters_to_nfd_binary("ÅÄÖ")).</input> +[[79,776],[65,776],[65,778]] +2> <input>io:format("~ts~n",[Reverse]).</input> +ÖÄÅ</pre> + </desc> + </func> + + <func> + <name name="slice" arity="2"/> + <name name="slice" arity="3"/> + <fsummary>Extract a part of a string.</fsummary> + <desc> + <p>Returns a substring of <c><anno>String</anno></c> of + at most <c><anno>Length</anno></c> grapheme clusters, starting at position + <c><anno>Start</anno></c>.</p> + <p>By default, <c><anno>Length</anno></c> is <c>infinity</c>.</p> + <p><em>Example:</em></p> + <pre> +1> <input>string:slice(<<"He̊llö Wörld"/utf8>>, 4).</input> +<<"ö Wörld"/utf8>> +2> <input>string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,4).</input> +"ö Wö" +3> <input>string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,50).</input> +"ö Wörld"</pre> + </desc> + </func> + + <func> + <name name="split" arity="2"/> + <name name="split" arity="3"/> + <fsummary>Split a string into substrings.</fsummary> + <desc> + <p> + Splits <c><anno>String</anno></c> where <c><anno>SearchPattern</anno></c> + is encountered and returns the remaining parts. + <c><anno>Where</anno></c>, default <c>leading</c>, indicates whether + the <c>leading</c>, the <c>trailing</c> or <c>all</c> encounters of + <c><anno>SearchPattern</anno></c> will split <c><anno>String</anno></c>. 
+ </p> + <p><em>Example:</em></p> + <pre> +0> <input>string:split("ab..bc..cd", "..").</input> +["ab","bc..cd"] +1> <input>string:split(<<"ab..bc..cd">>, "..", trailing).</input> +[<<"ab..bc">>,<<"cd">>] +2> <input>string:split(<<"ab..bc....cd">>, "..", all).</input> +[<<"ab">>,<<"bc">>,<<>>,<<"cd">>]</pre> + </desc> + </func> + + <func> + <name name="take" arity="2"/> + <name name="take" arity="3"/> + <name name="take" arity="4"/> + <fsummary>Take leading or trailing parts.</fsummary> + <desc> + <p>Takes characters from <c><anno>String</anno></c> as long as + the characters are members of set <c><anno>Characters</anno></c> + or the complement of set <c><anno>Characters</anno></c>. + <c><anno>Dir</anno></c>, + which can be <c>leading</c> or <c>trailing</c>, indicates from + which direction characters are to be taken. + </p> + <p><em>Example:</em></p> + <pre> +5> <input>string:take("abc0z123", lists:seq($a,$z)).</input> +{"abc","0z123"} +6> <input>string:take(<<"abc0z123">>, lists:seq($0,$9), true, leading).</input> +{<<"abc">>,<<"0z123">>} +7> <input>string:take("abc0z123", lists:seq($0,$9), false, trailing).</input> +{"abc0z","123"} +8> <input>string:take(<<"abc0z123">>, lists:seq($a,$z), true, trailing).</input> +{<<"abc0z">>,<<"123">>}</pre> + </desc> + </func> + + <func> + <name name="titlecase" arity="1"/> + <fsummary>Convert a string to titlecase.</fsummary> + <desc> + <p> + Converts <c><anno>String</anno></c> to titlecase. + </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:titlecase("ß is a SHARP s").</input> +"Ss is a SHARP s"</pre> + </desc> + </func> + + <func> + <name name="to_float" arity="1"/> + <fsummary>Return a float whose text representation is the integers + (ASCII values) of a string.</fsummary> + <desc> + <p>Argument <c><anno>String</anno></c> is expected to start with a + valid text represented float (the digits are ASCII values). 
+ Remaining characters in the string after the float are returned in + <c><anno>Rest</anno></c>.</p> + <p><em>Example:</em></p> + <pre> +> <input>{F1,Fs} = string:to_float("1.0-1.0e-1"),</input> +> <input>{F2,[]} = string:to_float(Fs),</input> +> <input>F1+F2.</input> +0.9 +> <input>string:to_float("3/2=1.5").</input> +{error,no_float} +> <input>string:to_float("-1.5eX").</input> +{-1.5,"eX"}</pre> + </desc> + </func> + + <func> + <name name="to_integer" arity="1"/> + <fsummary>Return an integer whose text representation is the integers + (ASCII values) of a string.</fsummary> + <desc> + <p>Argument <c><anno>String</anno></c> is expected to start with a + valid text represented integer (the digits are ASCII values). + Remaining characters in the string after the integer are returned in + <c><anno>Rest</anno></c>.</p> + <p><em>Example:</em></p> + <pre> +> <input>{I1,Is} = string:to_integer("33+22"),</input> +> <input>{I2,[]} = string:to_integer(Is),</input> +> <input>I1-I2.</input> +11 +> <input>string:to_integer("0.5").</input> +{0,".5"} +> <input>string:to_integer("x=2").</input> +{error,no_integer}</pre> + </desc> + </func> + + <func> + <name name="to_graphemes" arity="1"/> + <fsummary>Convert a string to a list of grapheme clusters.</fsummary> + <desc> + <p> + Converts <c><anno>String</anno></c> to a list of grapheme clusters. + </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:to_graphemes("ß↑e̊").</input> +[223,8593,[101,778]] +2> <input>string:to_graphemes(<<"ß↑e̊"/utf8>>).</input> +[223,8593,[101,778]]</pre> + </desc> + </func> + + <func> + <name name="trim" arity="1"/> + <name name="trim" arity="2"/> + <name name="trim" arity="3"/> + <fsummary>Trim leading or trailing, or both, characters.</fsummary> + <desc> + <p> + Returns a string, where leading or trailing, or both, + <c><anno>Characters</anno></c> have been removed. 
+ <c><anno>Dir</anno></c>, which can be <c>leading</c>, <c>trailing</c>, + or <c>both</c>, indicates from which direction characters + are to be removed. + </p> + <p> Default <c><anno>Characters</anno></c> are the set of + nonbreakable whitespace codepoints, defined as + Pattern_White_Space in + <url href="http://unicode.org/reports/tr31/">Unicode Standard Annex #31</url>. + By default, <c><anno>Dir</anno></c> is <c>both</c>. + </p> + <p> + Notice that <c>[$\r,$\n]</c> is one grapheme cluster according + to the Unicode Standard. + </p> + <p><em>Example:</em></p> + <pre> +1> <input>string:trim("\t Hello \n").</input> +"Hello" +2> <input>string:trim(<<"\t Hello \n">>, leading).</input> +<<"Hello \n">> +3> <input>string:trim(<<".Hello.\n">>, trailing, "\n.").</input> +<<".Hello">></pre> + </desc> + </func> + + <func> + <name name="uppercase" arity="1"/> + <fsummary>Convert a string to uppercase.</fsummary> + <desc> + <p> + Converts <c><anno>String</anno></c> to uppercase. + </p> + <p>See also <seealso marker="#titlecase/1"><c>titlecase/1</c></seealso>.</p> + <p><em>Example:</em></p> + <pre> +1> <input>string:uppercase("Michał").</input> +"MICHAŁ"</pre> + </desc> + </func> + + </funcs> + + <section> + <marker id="oldapi"/> + <title>Obsolete API functions</title> + <p>Here follow the functions of the old API. + These functions only work on a list of Latin-1 characters. + </p> + <note><p> + The functions are kept for backward compatibility, but are + not recommended. + They will be deprecated in Erlang/OTP 21. + </p> + <p>Any undocumented functions in <c>string</c> are not to be used.</p> + </note> + </section> + <funcs> <func> <name name="centre" arity="2"/> @@ -47,17 +652,24 @@ <p>Returns a string, where <c><anno>String</anno></c> is centered in the string and surrounded by blanks or <c><anno>Character</anno></c>. The resulting string has length <c><anno>Number</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. 
+ Use + <seealso marker="#pad/3"><c>pad/3</c></seealso>. + </p> </desc> </func> <func> <name name="chars" arity="2"/> <name name="chars" arity="3"/> - <fsummary>Returns a string consisting of numbers of characters.</fsummary> + <fsummary>Return a string consisting of numbers of characters.</fsummary> <desc> <p>Returns a string consisting of <c><anno>Number</anno></c> characters <c><anno>Character</anno></c>. Optionally, the string can end with string <c><anno>Tail</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="lists#duplicate/2"><c>lists:duplicate/2</c></seealso>.</p> </desc> </func> @@ -69,6 +681,9 @@ <p>Returns the index of the first occurrence of <c><anno>Character</anno></c> in <c><anno>String</anno></c>. Returns <c>0</c> if <c><anno>Character</anno></c> does not occur.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#find/2"><c>find/2</c></seealso>.</p> </desc> </func> @@ -79,6 +694,16 @@ <p>Concatenates <c><anno>String1</anno></c> and <c><anno>String2</anno></c> to form a new string <c><anno>String3</anno></c>, which is returned.</p> + <p> + This function is <seealso marker="#oldapi">obsolete</seealso>. + Use <c>[<anno>String1</anno>, <anno>String2</anno>]</c> as + <c>Data</c> argument, and call + <seealso marker="unicode#characters_to_list/2"> + <c>unicode:characters_to_list/2</c></seealso> or + <seealso marker="unicode#characters_to_binary/2"> + <c>unicode:characters_to_binary/2</c></seealso> + to flatten the output. + </p> </desc> </func> @@ -88,6 +713,9 @@ <desc> <p>Returns a string containing <c><anno>String</anno></c> repeated <c><anno>Number</anno></c> times.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. 
+ Use + <seealso marker="lists#duplicate/2"><c>lists:duplicate/2</c></seealso>.</p> </desc> </func> @@ -98,6 +726,9 @@ <p>Returns the length of the maximum initial segment of <c><anno>String</anno></c>, which consists entirely of characters not from <c><anno>Chars</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#take/3"><c>take/3</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > string:cspan("\t abcdef", " \t"). @@ -106,20 +737,14 @@ </func> <func> - <name name="equal" arity="2"/> - <fsummary>Test string equality.</fsummary> - <desc> - <p>Returns <c>true</c> if <c><anno>String1</anno></c> and - <c><anno>String2</anno></c> are equal, otherwise <c>false</c>.</p> - </desc> - </func> - - <func> <name name="join" arity="2"/> <fsummary>Join a list of strings with separator.</fsummary> <desc> <p>Returns a string with the elements of <c><anno>StringList</anno></c> separated by the string in <c><anno>Separator</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="lists#join/2"><c>lists:join/2</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > join(["one", "two", "three"], ", "). @@ -137,6 +762,10 @@ fixed. If <c>length(<anno>String</anno>)</c> < <c><anno>Number</anno></c>, then <c><anno>String</anno></c> is padded with blanks or <c><anno>Character</anno></c>s.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#pad/2"><c>pad/2</c></seealso> or + <seealso marker="#pad/3"><c>pad/3</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > string:left("Hello",10,$.). @@ -149,6 +778,9 @@ <fsummary>Return the length of a string.</fsummary> <desc> <p>Returns the number of characters in <c><anno>String</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. 
+ Use + <seealso marker="#length/1"><c>length/1</c></seealso>.</p> </desc> </func> @@ -160,6 +792,9 @@ <p>Returns the index of the last occurrence of <c><anno>Character</anno></c> in <c><anno>String</anno></c>. Returns <c>0</c> if <c><anno>Character</anno></c> does not occur.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#find/3"><c>find/3</c></seealso>.</p> </desc> </func> @@ -173,6 +808,9 @@ fixed. If the length of <c>(<anno>String</anno>)</c> < <c><anno>Number</anno></c>, then <c><anno>String</anno></c> is padded with blanks or <c><anno>Character</anno></c>s.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#pad/3"><c>pad/3</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > string:right("Hello", 10, $.). @@ -188,6 +826,9 @@ <c><anno>SubString</anno></c> begins in <c><anno>String</anno></c>. Returns <c>0</c> if <c><anno>SubString</anno></c> does not exist in <c><anno>String</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#find/3"><c>find/3</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > string:rstr(" Hello Hello World World ", "Hello World"). @@ -202,6 +843,9 @@ <p>Returns the length of the maximum initial segment of <c><anno>String</anno></c>, which consists entirely of characters from <c><anno>Chars</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#take/2"><c>take/2</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > string:span("\t abcdef", " \t"). @@ -217,6 +861,9 @@ <c><anno>SubString</anno></c> begins in <c><anno>String</anno></c>. Returns <c>0</c> if <c><anno>SubString</anno></c> does not exist in <c><anno>String</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. 
+ Use + <seealso marker="#find/2"><c>find/2</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > string:str(" Hello Hello World World ", "Hello World"). @@ -230,12 +877,15 @@ <name name="strip" arity="3"/> <fsummary>Strip leading or trailing characters.</fsummary> <desc> - <p>Returns a string, where leading and/or trailing blanks or a + <p>Returns a string, where leading or trailing, or both, blanks or a number of <c><anno>Character</anno></c> have been removed. <c><anno>Direction</anno></c>, which can be <c>left</c>, <c>right</c>, or <c>both</c>, indicates from which direction blanks are to be removed. <c>strip/1</c> is equivalent to <c>strip(String, both)</c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#trim/3"><c>trim/3</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > string:strip("...Hello.....", both, $.). @@ -251,6 +901,9 @@ <p>Returns a substring of <c><anno>String</anno></c>, starting at position <c><anno>Start</anno></c> to the end of the string, or to and including position <c><anno>Stop</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#slice/3"><c>slice/3</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> sub_string("Hello World", 4, 8). @@ -266,6 +919,9 @@ sub_string("Hello World", 4, 8). <p>Returns a substring of <c><anno>String</anno></c>, starting at position <c><anno>Start</anno></c>, and ending at the end of the string or at length <c><anno>Length</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#slice/3"><c>slice/3</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > substr("Hello World", 4, 5). @@ -281,6 +937,9 @@ sub_string("Hello World", 4, 8). <p>Returns the word in position <c><anno>Number</anno></c> of <c><anno>String</anno></c>. 
Words are separated by blanks or <c><anno>Character</anno></c>s.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#nth_lexeme/3"><c>nth_lexeme/3</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > string:sub_word(" Hello old boy !",3,$o). @@ -289,50 +948,6 @@ sub_string("Hello World", 4, 8). </func> <func> - <name name="to_float" arity="1"/> - <fsummary>Returns a float whose text representation is the integers - (ASCII values) in a string.</fsummary> - <desc> - <p>Argument <c><anno>String</anno></c> is expected to start with a - valid text represented float (the digits are ASCII values). - Remaining characters in the string after the float are returned in - <c><anno>Rest</anno></c>.</p> - <p><em>Example:</em></p> - <code type="none"> -> {F1,Fs} = string:to_float("1.0-1.0e-1"), -> {F2,[]} = string:to_float(Fs), -> F1+F2. -0.9 -> string:to_float("3/2=1.5"). -{error,no_float} -> string:to_float("-1.5eX"). -{-1.5,"eX"}</code> - </desc> - </func> - - <func> - <name name="to_integer" arity="1"/> - <fsummary>Returns an integer whose text representation is the integers - (ASCII values) in a string.</fsummary> - <desc> - <p>Argument <c><anno>String</anno></c> is expected to start with a - valid text represented integer (the digits are ASCII values). - Remaining characters in the string after the integer are returned in - <c><anno>Rest</anno></c>.</p> - <p><em>Example:</em></p> - <code type="none"> -> {I1,Is} = string:to_integer("33+22"), -> {I2,[]} = string:to_integer(Is), -> I1-I2. -11 -> string:to_integer("0.5"). -{0,".5"} -> string:to_integer("x=2"). -{error,no_integer}</code> - </desc> - </func> - - <func> <name name="to_lower" arity="1" clause_i="1"/> <name name="to_lower" arity="1" clause_i="2"/> <name name="to_upper" arity="1" clause_i="1"/> @@ -346,6 +961,11 @@ sub_string("Hello World", 4, 8). <p>The specified string or character is case-converted. 
Notice that the supported character set is ISO/IEC 8859-1 (also called Latin 1); all values outside this set are unchanged</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. Use + <seealso marker="#lowercase/1"><c>lowercase/1</c></seealso>, + <seealso marker="#uppercase/1"><c>uppercase/1</c></seealso>, + <seealso marker="#titlecase/1"><c>titlecase/1</c></seealso> or + <seealso marker="#casefold/1"><c>casefold/1</c></seealso>.</p> </desc> </func> @@ -363,6 +983,9 @@ sub_string("Hello World", 4, 8). adjacent separator characters in <c><anno>String</anno></c> are treated as one. That is, there are no empty strings in the resulting list of tokens.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#lexemes/2"><c>lexemes/2</c></seealso>.</p> </desc> </func> @@ -373,6 +996,9 @@ sub_string("Hello World", 4, 8). <desc> <p>Returns the number of words in <c><anno>String</anno></c>, separated by blanks or <c><anno>Character</anno></c>.</p> + <p>This function is <seealso marker="#oldapi">obsolete</seealso>. + Use + <seealso marker="#lexemes/2"><c>lexemes/2</c></seealso>.</p> <p><em>Example:</em></p> <code type="none"> > words(" Hello old boy!", $o). @@ -387,10 +1013,7 @@ sub_string("Hello World", 4, 8). other. The reason is that this string package is the combination of two earlier packages and all functions of both packages have been retained.</p> - - <note> - <p>Any undocumented functions in <c>string</c> are not to be used.</p> - </note> </section> + </erlref> diff --git a/lib/stdlib/doc/src/unicode.xml b/lib/stdlib/doc/src/unicode.xml index 93d0d37456..382b253ba1 100644 --- a/lib/stdlib/doc/src/unicode.xml +++ b/lib/stdlib/doc/src/unicode.xml @@ -50,8 +50,35 @@ external entities where this is required. When working inside the Erlang/OTP environment, it is recommended to keep binaries in UTF-8 when representing Unicode characters. 
ISO Latin-1 encoding is supported both - for backward compatibility and for communication - with external entities not supporting Unicode character sets.</p> + for backward compatibility and for communication + with external entities not supporting Unicode character sets.</p> + <p>Programs should always operate on a normalized form and compare + canonical-equivalent Unicode characters as equal. All characters + should thus be normalized to one form once on the system borders. + One of the following functions can convert characters to their + normalized forms: <seealso marker="#characters_to_nfc_list/1"> + <c>characters_to_nfc_list/1</c></seealso>, + <seealso marker="#characters_to_nfc_binary/1"> + <c>characters_to_nfc_binary/1</c></seealso>, + <seealso marker="#characters_to_nfd_list/1"> + <c>characters_to_nfd_list/1</c></seealso> or + <seealso marker="#characters_to_nfd_binary/1"> + <c>characters_to_nfd_binary/1</c></seealso>. + For general text + <seealso marker="#characters_to_nfc_list/1"> + <c>characters_to_nfc_list/1</c></seealso> or + <seealso marker="#characters_to_nfc_binary/1"> + <c>characters_to_nfc_binary/1</c></seealso> is preferred, and + for identifiers one of the compatibility normalization + functions, such as + <seealso marker="#characters_to_nfkc_list/1"> + <c>characters_to_nfkc_list/1</c></seealso>, + is preferred for security reasons. + The normalization functions were introduced in OTP 20. + Additional information on normalization can be found in the + <url href="http://unicode.org/faq/normalization.html">Unicode FAQ</url>. 
+ </p> + </description> <datatypes> @@ -335,6 +362,154 @@ decode_data(Data) -> </func> <func> + <name name="characters_to_nfc_list" arity="1"/> + <fsummary>Normalize characters to a list of canonical equivalent + composed Unicode characters.</fsummary> + <desc> + <p>Converts a possibly deep list of characters and binaries + into a Normalized Form of canonical equivalent Composed + characters according to the Unicode standard.</p> + <p>Any binaries in the input must be encoded with utf8 + encoding. + </p> + <p>The result is a list of characters.</p> + <code> +3> unicode:characters_to_nfc_list([<<"abc..a">>,[778],$a,[776],$o,[776]]). +"abc..åäö" +</code> + </desc> + </func> + + <func> + <name name="characters_to_nfc_binary" arity="1"/> + <fsummary>Normalize characters to a utf8 binary of canonical equivalent + composed Unicode characters.</fsummary> + <desc> + <p>Converts a possibly deep list of characters and binaries + into a Normalized Form of canonical equivalent Composed + characters according to the Unicode standard.</p> + <p>Any binaries in the input must be encoded with utf8 + encoding.</p> + <p>The result is an utf8 encoded binary.</p> + <code> +4> unicode:characters_to_nfc_binary([<<"abc..a">>,[778],$a,[776],$o,[776]]). +<<"abc..åäö"/utf8>> +</code> + </desc> + </func> + + <func> + <name name="characters_to_nfd_list" arity="1"/> + <fsummary>Normalize characters to a list of canonical equivalent + decomposed Unicode characters.</fsummary> + <desc> + <p>Converts a possibly deep list of characters and binaries + into a Normalized Form of canonical equivalent Decomposed + characters according to the Unicode standard.</p> + <p>Any binaries in the input must be encoded with utf8 + encoding. + </p> + <p>The result is a list of characters.</p> + <code> +1> unicode:characters_to_nfd_list("abc..åäö"). 
+[97,98,99,46,46,97,778,97,776,111,776] +</code> + </desc> + </func> + + <func> + <name name="characters_to_nfd_binary" arity="1"/> + <fsummary>Normalize characters to a utf8 binary of canonical equivalent + decomposed Unicode characters.</fsummary> + <desc> + <p>Converts a possibly deep list of characters and binaries + into a Normalized Form of canonical equivalent Decomposed + characters according to the Unicode standard.</p> + <p>Any binaries in the input must be encoded with utf8 + encoding.</p> + <p>The result is an utf8 encoded binary.</p> + <code> +2> unicode:characters_to_nfd_binary("abc..åäö"). +<<97,98,99,46,46,97,204,138,97,204,136,111,204,136>> +</code> + </desc> + </func> + + <func> + <name name="characters_to_nfkc_list" arity="1"/> + <fsummary>Normalize characters to a list of canonical equivalent + composed Unicode characters.</fsummary> + <desc> + <p>Converts a possibly deep list of characters and binaries + into a Normalized Form of compatibly equivalent Composed + characters according to the Unicode standard.</p> + <p>Any binaries in the input must be encoded with utf8 + encoding. + </p> + <p>The result is a list of characters.</p> + <code> +3> unicode:characters_to_nfkc_list([<<"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]). +"abc..åäö32" +</code> + </desc> + </func> + + <func> + <name name="characters_to_nfkc_binary" arity="1"/> + <fsummary>Normalize characters to a utf8 binary of compatibly equivalent + composed Unicode characters.</fsummary> + <desc> + <p>Converts a possibly deep list of characters and binaries + into a Normalized Form of compatibly equivalent Composed + characters according to the Unicode standard.</p> + <p>Any binaries in the input must be encoded with utf8 + encoding.</p> + <p>The result is an utf8 encoded binary.</p> + <code> +4> unicode:characters_to_nfkc_binary([<<"abc..a">>,[778],$a,[776],$o,[776],[65299,65298]]). 
+<<"abc..åäö32"/utf8>> +</code> + </desc> + </func> + + <func> + <name name="characters_to_nfkd_list" arity="1"/> + <fsummary>Normalize characters to a list of compatibly equivalent + decomposed Unicode characters.</fsummary> + <desc> + <p>Converts a possibly deep list of characters and binaries + into a Normalized Form of compatibly equivalent Decomposed + characters according to the Unicode standard.</p> + <p>Any binaries in the input must be encoded with utf8 + encoding. + </p> + <p>The result is a list of characters.</p> + <code> +1> unicode:characters_to_nfkd_list(["abc..åäö",[65299,65298]]). +[97,98,99,46,46,97,778,97,776,111,776,51,50] +</code> + </desc> + </func> + + <func> + <name name="characters_to_nfkd_binary" arity="1"/> + <fsummary>Normalize characters to a utf8 binary of compatibly equivalent + decomposed Unicode characters.</fsummary> + <desc> + <p>Converts a possibly deep list of characters and binaries + into a Normalized Form of compatibly equivalent Decomposed + characters according to the Unicode standard.</p> + <p>Any binaries in the input must be encoded with utf8 + encoding.</p> + <p>The result is an utf8 encoded binary.</p> + <code> +2> unicode:characters_to_nfkd_binary(["abc..åäö",[65299,65298]]). +<<97,98,99,46,46,97,204,138,97,204,136,111,204,136,51,50>> +</code> + </desc> + </func> + + <func> <name name="encoding_to_bom" arity="1"/> <fsummary>Create a binary UTF byte order mark from encoding.</fsummary> <type_desc variable="Bin"> diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml index a8ef8ff5c5..11b84f552a 100644 --- a/lib/stdlib/doc/src/unicode_usage.xml +++ b/lib/stdlib/doc/src/unicode_usage.xml @@ -65,7 +65,10 @@ <item><p>In Erlang/OTP 20.0, atoms and function can contain Unicode characters. 
Module names are still restricted to - the ISO-Latin-1 range.</p></item> + the ISO-Latin-1 range.</p> + <p>Support was added for normalizations forms in + <c>unicode</c> and the <c>string</c> module now handles + utf8-encoded binaries.</p></item> </list> <p>This section outlines the current Unicode support and gives some @@ -110,23 +113,27 @@ </item> </list> - <p>So, a conversion function must know not only one character at a time, - but possibly the whole sentence, the natural language to translate to, - the differences in input and output string length, and so on. - Erlang/OTP has currently no Unicode <c>to_upper</c>/<c>to_lower</c> - functionality, but publicly available libraries address these issues.</p> - - <p>Another example is the accented characters, where the same glyph has two - different representations. The Swedish letter "ö" is one example. - The Unicode standard has a code point for it, but you can also write it - as "o" followed by "U+0308" (Combining Diaeresis, with the simplified - meaning that the last letter is to have "¨" above). They have the same - glyph. They are for most purposes the same, but have different - representations. For example, MacOS X converts all filenames to use - Combining Diaeresis, while most other programs (including Erlang) try to - hide that by doing the opposite when, for example, listing directories. - However it is done, it is usually important to normalize such - characters to avoid confusion.</p> + <p>So, a conversion function must know not only one character at a + time, but possibly the whole sentence, the natural language to + translate to, the differences in input and output string length, + and so on. Erlang/OTP has currently no Unicode + <c>uppercase</c>/<c>lowercase</c> functionality with language + specific handling, but publicly available libraries address these + issues.</p> + + <p>Another example is the accented characters, where the same + glyph has two different representations. 
The Swedish letter "ö" is + one example. The Unicode standard has a code point for it, but + you can also write it as "o" followed by "U+0308" (Combining + Diaeresis, with the simplified meaning that the last letter is to + have "¨" above). They have the same glyph, user perceived + character. They are for most purposes the same, but have different + representations. For example, MacOS X converts all filenames to + use Combining Diaeresis, while most other programs (including + Erlang) try to hide that by doing the opposite when, for example, + listing directories. However it is done, it is usually important + to normalize such characters to avoid confusion. + </p> <p>The list of examples can be made long. One need a kind of knowledge that was not needed when programs only considered one or two languages. The @@ -273,7 +280,7 @@ them. In some cases functionality has been added to already existing interfaces (as the <seealso marker="stdlib:string"><c>string</c></seealso> module now can - handle lists with any code points). In some cases new + handle strings with any code points). In some cases new functionality or options have been added (as in the <seealso marker="stdlib:io"><c>io</c></seealso> module, the file handling, the <seealso @@ -977,7 +984,7 @@ Eshell V5.10.1 (abort with ^G) <p>Fortunately, most textual data has been stored in lists and range checking has been sparse, so modules like <c>string</c> work well for - Unicode lists with little need for conversion or extension.</p> + Unicode strings with little need for conversion or extension.</p> <p>Some modules are, however, changed to be explicitly Unicode-aware. 
These modules include:</p> @@ -1028,18 +1035,17 @@ Eshell V5.10.1 (abort with ^G) has extensive support for Unicode text.</p></item> </taglist> - <p>The <seealso marker="stdlib:string"><c>string</c></seealso> module works - perfectly for Unicode strings and ISO Latin-1 strings, except the - language-dependent functions - <seealso marker="stdlib:string#to_upper/1"><c>string:to_upper/1</c></seealso> - and - <seealso marker="stdlib:string#to_lower/1"><c>string:to_lower/1</c></seealso>, - which are only correct for the ISO Latin-1 character set. These two - functions can never function correctly for Unicode characters in their - current form, as there are language and locale issues as well as - multi-character mappings to consider when converting text between cases. - Converting case in an international environment is a large subject not - yet addressed in OTP.</p> + <p>The <seealso marker="stdlib:string"><c>string</c></seealso> + module works perfectly for Unicode strings and ISO Latin-1 + strings, except the language-dependent functions <seealso + marker="stdlib:string#uppercase/1"><c>string:uppercase/1</c></seealso> + and <seealso + marker="stdlib:string#lowercase/1"><c>string:lowercase/1</c></seealso>. + These two functions can never function correctly for Unicode + characters in their current form, as there are language and locale + issues to consider when converting text between cases. Converting + case in an international environment is a large subject not yet + addressed in OTP.</p> </section> <section> |
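The documentation added in this merge recommends normalizing text once at the system border and treating canonically equivalent characters as equal. The following is a minimal sketch of that idea, not code from the commit: the module name `nfc_demo` and both helper functions are hypothetical, and it assumes OTP 20 or later, where `unicode:characters_to_nfc_binary/1` and `string:length/1` are available.

```erlang
-module(nfc_demo).
-export([same_text/2, lengths/1]).

%% Hypothetical helper (not part of OTP): true when two chardata values
%% are canonically equivalent, checked by normalizing both to NFC.
%% Invalid input makes the unicode function return an error tuple, so
%% this sketch is only meaningful for well-formed chardata.
same_text(A, B) ->
    unicode:characters_to_nfc_binary(A) =:= unicode:characters_to_nfc_binary(B).

%% Hypothetical helper: for a flat list of codepoints, returns
%% {NumberOfCodepoints, NumberOfGraphemeClusters}.
lengths(Chars) when is_list(Chars) ->
    {length(Chars), string:length(Chars)}.
```

Under these assumptions, `nfc_demo:same_text([$o, 776], [246])` returns `true`, because "o" followed by U+0308 composes to the single code point for "ö", and `nfc_demo:lengths([$o, 776])` returns `{2,1}`: two codepoints, but one user-perceived character.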