From 2c72e662bad11a41839780f86680d4bb05367c78 Mon Sep 17 00:00:00 2001 From: Dan Gudmundsson Date: Mon, 3 Apr 2017 12:19:21 +0200 Subject: New unicode aware string module that works with unicode:chardata() Works with unicode:chardata() as input as was decided on OTP board meeting as response to EEP-35 a long time ago. Works on graphemes clusters as base, with a few exceptions, does not handle classic (nor nfd'ified) Hangul nor the extended grapheme clusters such as the prepend class. That would make handling binaries as input/output very slow. List input => list output, binary input => binary output and mixed input => mixed output for all find/split functions. So that results can be post-processed without the need to invoke unicode:characters_to_list|binary for intermediate data. pad functions return lists of unicode:chardata() for performance. --- lib/stdlib/doc/src/string.xml | 741 ++++++++++++++++++++++++++++++++--- lib/stdlib/doc/src/unicode_usage.xml | 70 ++-- 2 files changed, 720 insertions(+), 91 deletions(-) (limited to 'lib/stdlib/doc') diff --git a/lib/stdlib/doc/src/string.xml b/lib/stdlib/doc/src/string.xml index dddedf1132..dc83c40a9a 100644 --- a/lib/stdlib/doc/src/string.xml +++ b/lib/stdlib/doc/src/string.xml @@ -36,8 +36,613 @@ String processing functions.

This module provides functions for string processing.

+

A string in this module is represented by + unicode:chardata(), that is, a list of codepoints, + binaries with UTF-8-encoded codepoints + (UTF-8 binaries), or a mix of the two.

+ +"abcd" is a valid string +<<"abcd">> is a valid string +["abcd"] is a valid string +<<"abc..åäö"/utf8>> is a valid string +<<"abc..åäö">> is NOT a valid string, + but a binary with Latin-1-encoded codepoints +[<<"abc">>, "..åäö"] is a valid string +[atom] is NOT a valid string +

+ This module operates on grapheme clusters. A grapheme cluster + is a user-perceived character, which can be represented by several + codepoints. +

+ +"å" [229] or [97, 778] +"e̊" [101, 778] +

+ The string length of "ß↑e̊" is 3, even though it is represented by the + codepoints [223,8593,101,778] or the UTF-8 binary + <<195,159,226,134,145,101,204,138>>. +

+

+ Grapheme clusters for codepoints of class prepend + and non-modern (or decomposed) Hangul is not handled for performance + reasons in + find/3, + replace/3, + split/2, + split/2 and + trim/3. +

+

+ Splitting and appending strings is to be done on grapheme clusters + borders. + There is no verification that the results of appending strings are + valid or normalized. +

+

+ Most of the functions expect all input to be normalized to one form, + see for example + unicode:characters_to_nfc_list/1. +

+

+ Language or locale specific handling of input is not considered + in any function. +

+

+ The functions can crash for non-valid input strings. For example, + the functions expect UTF-8 binaries but not all functions + verify that all binaries are encoded correctly. +

+

+ Unless otherwise specified the return value type is the same as + the input type. That is, binary input returns binary output, + list input returns a list output, and mixed input can return a + mixed output.

+ +1> string:trim(" sarah "). +"sarah" +2> string:trim(<<" sarah ">>). +<<"sarah">> +3> string:lexemes("foo bar", " "). +["foo","bar"] +4> string:lexemes(<<"foo bar">>, " "). +[<<"foo">>,<<"bar">>] +

This module has been reworked in Erlang/OTP 20 to + handle + unicode:chardata() and operate on grapheme + clusters. The old + functions that only work on Latin-1 lists as input + are still available but should not be + used. They will be deprecated in Erlang/OTP 21. +

+ + + + + +

A user-perceived character, consisting of one or more + codepoints.

+
+
+
+ + + + + + Convert a string to a comparable string. + +

+ Converts String to a case-agnostic + comparable string. Function casefold/1 is preferred + over lowercase/1 when two strings are to be compared + for equality. See also equal/4. +

+

Example:

+
+1> string:casefold("Ω and ẞ SHARP S").
+"ω and ss sharp s"
+
+
+ + + + Remove trailing end of line control characters. + +

+ Returns a string where any trailing \n or + \r\n have been removed from String. +

+

Example:

+
+182> string:chomp(<<"\nHello\n\n">>).
+<<"\nHello">>
+183> string:chomp("\nHello\r\r\n").
+"\nHello\r"
+
+
+ + + + + + Test string equality. + +

+ Returns true if A and + B are equal, otherwise false. +

+

+ If IgnoreCase is true + the function does + casefolding on the fly before the equality test. +

+

If Norm is not none + the function applies normalization on the fly before the equality test. + There are four available normalization forms: + nfc, + nfd, + nfkc, and + nfkd. +

+

By default, + IgnoreCase is false and + Norm is none.

+

Example:

+
+1> string:equal("åäö", <<"åäö"/utf8>>).
+true
+2> string:equal("åäö", unicode:characters_to_nfd_binary("åäö")).
+false
+3> string:equal("åäö", unicode:characters_to_nfd_binary("ÅÄÖ"), true, nfc).
+true
+
+
+ + + + + Find start of substring. + +

+ Removes anything before SearchPattern in String + and returns the remainder of the string or nomatch if SearchPattern is not + found. + Dir, which can be leading or + trailing, indicates from which direction characters + are to be searched. +

+

+ By default, Dir is leading. +

+

Example:

+
+1> string:find("ab..cd..ef", ".").
+"..cd..ef"
+2> string:find(<<"ab..cd..ef">>, "..", trailing).
+<<"..ef">>
+3> string:find(<<"ab..cd..ef">>, "x", leading).
+nomatch
+4> string:find("ab..cd..ef", "x", trailing).
+nomatch
+
+
+ + + + Check if the string is empty. + +

Returns true if String is the + empty string, otherwise false.

+

Example:

+
+1> string:is_empty("foo").
+false
+2> string:is_empty(["",<<>>]).
+true
+
+
+ + + + Calculate length of the string. + +

+ Returns the number of grapheme clusters in String. +

+

Example:

+
+1> string:length("ß↑e̊").
+3
+2> string:length(<<195,159,226,134,145,101,204,138>>).
+3
+
+
+ + + + Split string into lexemes. + +

+ Returns a list of lexemes in String, separated + by the grapheme clusters in SeparatorList. +

+

+ Notice that, as shown in this example, two or more + adjacent separator graphemes clusters in String + are treated as one. That is, there are no empty + strings in the resulting list of lexemes. + See also split/3 which returns + empty strings. +

+

Notice that [$\r,$\n] is one grapheme cluster.

+

Example:

+
+1> string:lexemes("abc de̊fxxghix jkl\r\nfoo", "x e" ++ [[$\r,$\n]]).
+["abc","de̊f","ghi","jkl","foo"]
+2> string:lexemes(<<"abc de̊fxxghix jkl\r\nfoo"/utf8>>, "x e" ++ [$\r,$\n]).
+[<<"abc">>,<<"de̊f"/utf8>>,<<"ghi">>,<<"jkl\r\nfoo">>]
+
+
+ + + + Convert a string to lowercase + +

+ Converts String to lowercase. +

+

+ Notice that function casefold/1 + should be used when converting a string to + be tested for equality. +

+

Example:

+
+2> string:lowercase(string:uppercase("Michał")).
+"michał"
+
+
+ + + + Pick the first codepoint. + +

+ Returns the first codepoint in String + and the rest of String in the tail. +

+

Example:

+
+1> string:next_codepoint(unicode:characters_to_binary("e̊fg")).
+[101|<<"̊fg"/utf8>>]
+
+
+ + + + Pick the first grapheme cluster. + +

+ Returns the first grapheme cluster in String + and the rest of String in the tail. +

+

Example:

+
+1> string:next_grapheme(unicode:characters_to_binary("e̊fg")).
+["e̊"|<<"fg">>]
+
+
+ + + + Pick the nth lexeme. + +

Returns lexeme number N in + String, where lexemes are separated by + the grapheme clusters in SeparatorList. +

+

Example:

+
+1> string:nth_lexeme("abc.de̊f.ghiejkl", 3, ".e").
+"ghi"
+
+
+ + + + + + Pad a string to given length. + +

+ Pads String to Length with + grapheme cluster Char. + Dir, which can be leading, trailing, + or both, indicates where the padding should be added. +

+

By default, Char is $\s and + Dir is trailing. +

+

Example:

+
+1> string:pad(<<"He̊llö"/utf8>>, 8).
+[<<72,101,204,138,108,108,195,182>>,32,32,32]
+2> io:format("'~ts'~n",[string:pad("He̊llö", 8, leading)]).
+'   He̊llö'
+3> io:format("'~ts'~n",[string:pad("He̊llö", 8, both)]).
+' He̊llö  '
+
+
+ + + + Remove prefix from string. + +

+ If Prefix is the prefix of + String, removes it and returns the + remainder of String, otherwise returns + nomatch. +

+

Example:

+
+1> string:prefix(<<"prefix of string">>, "pre").
+<<"fix of string">>
+2> string:prefix("pre", "prefix").
+nomatch
+
+
+ + + + + Replace a pattern in string. + +

+ Replaces SearchPattern in String + with Replacement. + Where, default leading, indicates whether + the leading, the trailing or all encounters of + SearchPattern are to be replaced. +

+

Can be implemented as:

+
lists:join(Replacement, split(String, SearchPattern, Where)).
+

Example:

+
+1> string:replace(<<"ab..cd..ef">>, "..", "*").
+[<<"ab">>,"*",<<"cd..ef">>]
+2> string:replace(<<"ab..cd..ef">>, "..", "*", all).
+[<<"ab">>,"*",<<"cd">>,"*",<<"ef">>]
+
+
+ + + + Reverses a string + +

+ Returns the reverse list of the grapheme clusters in String. +

+

Example:

+
+1> Reverse = string:reverse(unicode:characters_to_nfd_binary("ÅÄÖ")).
+[[79,776],[65,776],[65,778]]
+2> io:format("~ts~n",[Reverse]).
+ÖÄÅ
+
+
+ + + + + Extract a part of string + +

Returns a substring of String of + at most Length grapheme clusters, starting at position + Start.

+

By default, Length is infinity.

+

Example:

+
+1> string:slice(<<"He̊llö Wörld"/utf8>>, 4).
+<<"ö Wörld"/utf8>>
+2> string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,4).
+"ö Wö"
+3> string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,50).
+"ö Wörld"
+
+
+ + + + + Split a string into substrings. + +

+ Splits String where SearchPattern + is encountered and return the remaining parts. + Where, default leading, indicates whether + the leading, the trailing or all encounters of + SearchPattern will split String. +

+

Example:

+
+0> string:split("ab..bc..cd", "..").
+["ab","bc..cd"]
+1> string:split(<<"ab..bc..cd">>, "..", trailing).
+[<<"ab..bc">>,<<"cd">>]
+2> string:split(<<"ab..bc....cd">>, "..", all).
+[<<"ab">>,<<"bc">>,<<>>,<<"cd">>]
+
+
+ + + + + + Take leading or trailing parts. + +

Takes characters from String as long as + the characters are members of set Characters + or the complement of set Characters. + Dir, + which can be leading or trailing, indicates from + which direction characters are to be taken. +

+

Example:

+
+5> string:take("abc0z123", lists:seq($a,$z)).
+{"abc","0z123"}
+6> string:take(<<"abc0z123">>, lists:seq($0,$9), true, leading).
+{<<"abc">>,<<"0z123">>}
+7> string:take("abc0z123", lists:seq($0,$9), false, trailing).
+{"abc0z","123"}
+8> string:take(<<"abc0z123">>, lists:seq($a,$z), true, trailing).
+{<<"abc0z">>,<<"123">>}
+
+
+ + + + Convert a string to titlecase. + +

+ Converts String to titlecase. +

+

Example:

+
+1> string:titlecase("ß is a SHARP s").
+"Ss is a SHARP s"
+
+
+ + + + Return a float whose text representation is the integers + (ASCII values) of a string. + +

Argument String is expected to start with a + valid text represented float (the digits are ASCII values). + Remaining characters in the string after the float are returned in + Rest.

+

Example:

+
+> {F1,Fs} = string:to_float("1.0-1.0e-1"),
+> {F2,[]} = string:to_float(Fs),
+> F1+F2.
+0.9
+> string:to_float("3/2=1.5").
+{error,no_float}
+> string:to_float("-1.5eX").
+{-1.5,"eX"}
+
+
+ + + + Return an integer whose text representation is the integers + (ASCII values) of a string. + +

Argument String is expected to start with a + valid text represented integer (the digits are ASCII values). + Remaining characters in the string after the integer are returned in + Rest.

+

Example:

+
+> {I1,Is} = string:to_integer("33+22"),
+> {I2,[]} = string:to_integer(Is),
+> I1-I2.
+11
+> string:to_integer("0.5").
+{0,".5"}
+> string:to_integer("x=2").
+{error,no_integer}
+
+
+ + + + Convert a string to a list of grapheme clusters. + +

+ Converts String to a list of grapheme clusters. +

+

Example:

+
+1> string:to_graphemes("ß↑e̊").
+[223,8593,[101,778]]
+2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
+[223,8593,[101,778]]
+
+
+ + + + + + Trim leading or trailing, or both, characters. + +

+ Returns a string, where leading or trailing, or both, + Characters have been removed. + Dir which can be leading, trailing, + or both, indicates from which direction characters + are to be removed. +

+

Default Characters are the set of + nonbreakable whitespace codepoints, defined as + Pattern_White_Space in + Unicode Standard Annex #31. + By default, Dir is both. +

+

+ Notice that [$\r,$\n] is one grapheme cluster according + to the Unicode Standard. +

+

Example:

+
+1> string:trim("\t  Hello  \n").
+"Hello"
+2> string:trim(<<"\t  Hello  \n">>, leading).
+<<"Hello  \n">>
+3> string:trim(<<".Hello.\n">>, trailing, "\n.").
+<<".Hello">>
+
+
+ + + + Convert a string to uppercase. + +

+ Converts String to uppercase. +

+

See also titlecase/1.

+

Example:

+
+1> string:uppercase("Michał").
+"MICHAŁ"
+
+
+ +
+ +
+ + Obsolete API functions +

Here follows the function of the old API. + These functions only work on a list of Latin-1 characters. +

+

+ The functions are kept for backward compatibility, but are + not recommended. + They will be deprecated in Erlang/OTP 21. +

+

Any undocumented functions in string are not to be used.

+
+
+ @@ -47,17 +652,24 @@

Returns a string, where String is centered in the string and surrounded by blanks or Character. The resulting string has length Number.

+

This function is obsolete. + Use + pad/3. +

- Returns a string consisting of numbers of characters. + Return a string consisting of numbers of characters.

Returns a string consisting of Number characters Character. Optionally, the string can end with string Tail.

+

This function is obsolete. + Use + lists:duplicate/2.

@@ -69,6 +681,9 @@

Returns the index of the first occurrence of Character in String. Returns 0 if Character does not occur.

+

This function is obsolete. + Use + find/2.

@@ -79,6 +694,16 @@

Concatenates String1 and String2 to form a new string String3, which is returned.

+

+ This function is obsolete. + Use [String1, String2] as + Data argument, and call + + unicode:characters_to_list/2 or + + unicode:characters_to_binary/2 + to flatten the output. +

@@ -88,6 +713,9 @@

Returns a string containing String repeated Number times.

+

This function is obsolete. + Use + lists:duplicate/2.

@@ -98,6 +726,9 @@

Returns the length of the maximum initial segment of String, which consists entirely of characters not from Chars.

+

This function is obsolete. + Use + take/3.

Example:

> string:cspan("\t abcdef", " \t"). @@ -105,21 +736,15 @@ - - - Test string equality. - -

Returns true if String1 and - String2 are equal, otherwise false.

-
-
- Join a list of strings with separator.

Returns a string with the elements of StringList separated by the string in Separator.

+

This function is obsolete. + Use + lists:join/2.

Example:

> join(["one", "two", "three"], ", "). @@ -137,6 +762,10 @@ fixed. If length(String) < Number, then String is padded with blanks or Characters.

+

This function is obsolete. + Use + pad/2 or + pad/3.

Example:

> string:left("Hello",10,$.). @@ -149,6 +778,9 @@ Return the length of a string.

Returns the number of characters in String.

+

This function is obsolete. + Use + length/1.

@@ -160,6 +792,9 @@

Returns the index of the last occurrence of Character in String. Returns 0 if Character does not occur.

+

This function is obsolete. + Use + find/3.

@@ -173,6 +808,9 @@ fixed. If the length of (String) < Number, then String is padded with blanks or Characters.

+

This function is obsolete. + Use + pad/3.

Example:

> string:right("Hello", 10, $.). @@ -188,6 +826,9 @@ SubString begins in String. Returns 0 if SubString does not exist in String.

+

This function is obsolete. + Use + find/3.

Example:

> string:rstr(" Hello Hello World World ", "Hello World"). @@ -202,6 +843,9 @@

Returns the length of the maximum initial segment of String, which consists entirely of characters from Chars.

+

This function is obsolete. + Use + take/2.

Example:

> string:span("\t abcdef", " \t"). @@ -217,6 +861,9 @@ SubString begins in String. Returns 0 if SubString does not exist in String.

+

This function is obsolete. + Use + find/2.

Example:

> string:str(" Hello Hello World World ", "Hello World"). @@ -230,12 +877,15 @@ Strip leading or trailing characters. -

Returns a string, where leading and/or trailing blanks or a +

Returns a string, where leading or trailing, or both, blanks or a number of Character have been removed. Direction, which can be left, right, or both, indicates from which direction blanks are to be removed. strip/1 is equivalent to strip(String, both).

+

This function is obsolete. + Use + trim/3.

Example:

> string:strip("...Hello.....", both, $.). @@ -251,6 +901,9 @@

Returns a substring of String, starting at position Start to the end of the string, or to and including position Stop.

+

This function is obsolete. + Use + slice/3.

Example:

sub_string("Hello World", 4, 8). @@ -266,6 +919,9 @@ sub_string("Hello World", 4, 8).

Returns a substring of String, starting at position Start, and ending at the end of the string or at length Length.

+

This function is obsolete. + Use + slice/3.

Example:

> substr("Hello World", 4, 5). @@ -281,6 +937,9 @@ sub_string("Hello World", 4, 8).

Returns the word in position Number of String. Words are separated by blanks or Characters.

+

This function is obsolete. + Use + nth_lexeme/3.

Example:

> string:sub_word(" Hello old boy !",3,$o). @@ -288,50 +947,6 @@ sub_string("Hello World", 4, 8).
- - - Returns a float whose text representation is the integers - (ASCII values) in a string. - -

Argument String is expected to start with a - valid text represented float (the digits are ASCII values). - Remaining characters in the string after the float are returned in - Rest.

-

Example:

- -> {F1,Fs} = string:to_float("1.0-1.0e-1"), -> {F2,[]} = string:to_float(Fs), -> F1+F2. -0.9 -> string:to_float("3/2=1.5"). -{error,no_float} -> string:to_float("-1.5eX"). -{-1.5,"eX"} -
-
- - - - Returns an integer whose text representation is the integers - (ASCII values) in a string. - -

Argument String is expected to start with a - valid text represented integer (the digits are ASCII values). - Remaining characters in the string after the integer are returned in - Rest.

-

Example:

- -> {I1,Is} = string:to_integer("33+22"), -> {I2,[]} = string:to_integer(Is), -> I1-I2. -11 -> string:to_integer("0.5"). -{0,".5"} -> string:to_integer("x=2"). -{error,no_integer} -
-
- @@ -346,6 +961,11 @@ sub_string("Hello World", 4, 8).

The specified string or character is case-converted. Notice that the supported character set is ISO/IEC 8859-1 (also called Latin 1); all values outside this set are unchanged

+

This function is obsolete use + lowercase/1, + uppercase/1, + titlecase/1 or + casefold/1.

@@ -363,6 +983,9 @@ sub_string("Hello World", 4, 8). adjacent separator characters in String are treated as one. That is, there are no empty strings in the resulting list of tokens.

+

This function is obsolete. + Use + lexemes/2.

@@ -373,6 +996,9 @@ sub_string("Hello World", 4, 8).

Returns the number of words in String, separated by blanks or Character.

+

This function is obsolete. + Use + lexemes/2.

Example:

> words(" Hello old boy!", $o). @@ -387,10 +1013,7 @@ sub_string("Hello World", 4, 8). other. The reason is that this string package is the combination of two earlier packages and all functions of both packages have been retained.

- - -

Any undocumented functions in string are not to be used.

-
+ diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml index a8ef8ff5c5..11b84f552a 100644 --- a/lib/stdlib/doc/src/unicode_usage.xml +++ b/lib/stdlib/doc/src/unicode_usage.xml @@ -65,7 +65,10 @@

In Erlang/OTP 20.0, atoms and function can contain Unicode characters. Module names are still restricted to - the ISO-Latin-1 range.

+ the ISO-Latin-1 range.

+

Support was added for normalizations forms in + unicode and the string module now handles + utf8-encoded binaries.

This section outlines the current Unicode support and gives some @@ -110,23 +113,27 @@ -

So, a conversion function must know not only one character at a time, - but possibly the whole sentence, the natural language to translate to, - the differences in input and output string length, and so on. - Erlang/OTP has currently no Unicode to_upper/to_lower - functionality, but publicly available libraries address these issues.

- -

Another example is the accented characters, where the same glyph has two - different representations. The Swedish letter "ö" is one example. - The Unicode standard has a code point for it, but you can also write it - as "o" followed by "U+0308" (Combining Diaeresis, with the simplified - meaning that the last letter is to have "¨" above). They have the same - glyph. They are for most purposes the same, but have different - representations. For example, MacOS X converts all filenames to use - Combining Diaeresis, while most other programs (including Erlang) try to - hide that by doing the opposite when, for example, listing directories. - However it is done, it is usually important to normalize such - characters to avoid confusion.

+

So, a conversion function must know not only one character at a + time, but possibly the whole sentence, the natural language to + translate to, the differences in input and output string length, + and so on. Erlang/OTP has currently no Unicode + uppercase/lowercase functionality with language + specific handling, but publicly available libraries address these + issues.

+ +

Another example is the accented characters, where the same + glyph has two different representations. The Swedish letter "ö" is + one example. The Unicode standard has a code point for it, but + you can also write it as "o" followed by "U+0308" (Combining + Diaeresis, with the simplified meaning that the last letter is to + have "¨" above). They have the same glyph, user perceived + character. They are for most purposes the same, but have different + representations. For example, MacOS X converts all filenames to + use Combining Diaeresis, while most other programs (including + Erlang) try to hide that by doing the opposite when, for example, + listing directories. However it is done, it is usually important + to normalize such characters to avoid confusion. +

The list of examples can be made long. One need a kind of knowledge that was not needed when programs only considered one or two languages. The @@ -273,7 +280,7 @@ them. In some cases functionality has been added to already existing interfaces (as the string module now can - handle lists with any code points). In some cases new + handle strings with any code points). In some cases new functionality or options have been added (as in the io module, the file handling, the Fortunately, most textual data has been stored in lists and range checking has been sparse, so modules like string work well for - Unicode lists with little need for conversion or extension.

+ Unicode strings with little need for conversion or extension.

Some modules are, however, changed to be explicitly Unicode-aware. These modules include:

@@ -1028,18 +1035,17 @@ Eshell V5.10.1 (abort with ^G) has extensive support for Unicode text.

-

The string module works - perfectly for Unicode strings and ISO Latin-1 strings, except the - language-dependent functions - string:to_upper/1 - and - string:to_lower/1, - which are only correct for the ISO Latin-1 character set. These two - functions can never function correctly for Unicode characters in their - current form, as there are language and locale issues as well as - multi-character mappings to consider when converting text between cases. - Converting case in an international environment is a large subject not - yet addressed in OTP.

+

The string + module works perfectly for Unicode strings and ISO Latin-1 + strings, except the language-dependent functions string:uppercase/1 + and string:lowercase/1. + These two functions can never function correctly for Unicode + characters in their current form, as there are language and locale + issues to consider when converting text between cases. Converting + case in an international environment is a large subject not yet + addressed in OTP.

-- cgit v1.2.3