From 2c72e662bad11a41839780f86680d4bb05367c78 Mon Sep 17 00:00:00 2001
From: Dan Gudmundsson
Date: Mon, 3 Apr 2017 12:19:21 +0200
Subject: New unicode aware string module that works with unicode:chardata()

Works with unicode:chardata() as input as was decided on OTP board meeting
as response to EEP-35 a long time ago.

Works on grapheme clusters as base, with a few exceptions, does not handle
classic (nor nfd'ified) Hangul nor the extended grapheme clusters such as the
prepend class. That would make handling binaries as input/output very slow.

List input => list output, binary input => binary output and mixed
input => mixed output for all find/split functions.
So that results can be post-processed without the need to invoke
unicode:characters_to_list|binary for intermediate data.

pad functions return lists of unicode:chardata() for performance.
---
 lib/stdlib/doc/src/string.xml | 741 ++++++++++++++++++++++++++++++++++++++----
 1 file changed, 682 insertions(+), 59 deletions(-)

diff --git a/lib/stdlib/doc/src/string.xml b/lib/stdlib/doc/src/string.xml
index dddedf1132..dc83c40a9a 100644
--- a/lib/stdlib/doc/src/string.xml
+++ b/lib/stdlib/doc/src/string.xml
@@ -36,8 +36,613 @@
 String processing functions.

This module provides functions for string processing.

+

A string in this module is represented by + unicode:chardata(), that is, a list of codepoints, + binaries with UTF-8-encoded codepoints + (UTF-8 binaries), or a mix of the two.

+ +"abcd" is a valid string +<<"abcd">> is a valid string +["abcd"] is a valid string +<<"abc..åäö"/utf8>> is a valid string +<<"abc..åäö">> is NOT a valid string, + but a binary with Latin-1-encoded codepoints +[<<"abc">>, "..åäö"] is a valid string +[atom] is NOT a valid string +

+ This module operates on grapheme clusters. A grapheme cluster + is a user-perceived character, which can be represented by several + codepoints. +

+ +"å" [229] or [97, 778] +"e̊" [101, 778] +

+ The string length of "ß↑e̊" is 3, even though it is represented by the + codepoints [223,8593,101,778] or the UTF-8 binary + <<195,159,226,134,145,101,204,138>>. +
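The difference between codepoints, bytes, and grapheme clusters can be checked with a small sketch like the one below. The module name `grapheme_length_demo` is hypothetical and only serves the illustration; the calls themselves are the standard `string`, `unicode`, and `erlang` functions.

```erlang
-module(grapheme_length_demo).
-export([run/0]).

%% Compares codepoint, byte, and grapheme cluster counts for "ß↑e̊".
run() ->
    S = [223, 8593, 101, 778],            % the string as a codepoint list
    4 = erlang:length(S),                 % four codepoints
    3 = string:length(S),                 % three grapheme clusters
    B = unicode:characters_to_binary(S),  % the same string as a UTF-8 binary
    8 = byte_size(B),                     % eight bytes
    3 = string:length(B),                 % still three grapheme clusters
    ok.
```

Calling `grapheme_length_demo:run()` returns `ok` if every match succeeds.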

+

+ Grapheme clusters for codepoints of class prepend + and non-modern (or decomposed) Hangul are not handled for performance + reasons in + find/3, + replace/3, + split/2, + split/3 and + trim/3. +

+

+ Splitting and appending strings is to be done on grapheme cluster + boundaries. + There is no verification that the results of appending strings are + valid or normalized. +

+

+ Most of the functions expect all input to be normalized to one form, + see for example + unicode:characters_to_nfc_list/1. +
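As a hedged illustration of why normalization matters, the sketch below compares the composed and decomposed forms of "å". The module name `normalize_demo` is an assumption made for the example; it is not part of this patch.

```erlang
-module(normalize_demo).
-export([run/0]).

%% Shows that equal-looking strings differ until they are normalized.
run() ->
    Composed = [229],         % "å" as a single codepoint
    Decomposed = [97, 778],   % "a" followed by COMBINING RING ABOVE
    %% Without normalization the two forms are not equal.
    false = string:equal(Composed, Decomposed),
    %% Normalizing to NFC up front makes them comparable.
    Composed = unicode:characters_to_nfc_list(Decomposed),
    %% equal/4 can also normalize on the fly.
    true = string:equal(Composed, Decomposed, false, nfc),
    ok.
```

Normalizing once at the input boundary is usually cheaper than normalizing in every comparison, but either approach gives the same result here.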

+

+ Language or locale specific handling of input is not considered + in any function. +

+

+ The functions can crash for invalid input strings. For example, + the functions expect UTF-8 binaries but not all functions + verify that all binaries are encoded correctly. +

+

+ Unless otherwise specified, the return value type is the same as + the input type. That is, binary input returns binary output, + list input returns list output, and mixed input can return + mixed output.

+ +1> string:trim(" sarah "). +"sarah" +2> string:trim(<<" sarah ">>). +<<"sarah">> +3> string:lexemes("foo bar", " "). +["foo","bar"] +4> string:lexemes(<<"foo bar">>, " "). +[<<"foo">>,<<"bar">>] +

This module has been reworked in Erlang/OTP 20 to + handle + unicode:chardata() and operate on grapheme + clusters. The old + functions that only work on Latin-1 lists as input + are still available but should not be + used. They will be deprecated in Erlang/OTP 21. +

+ + + + + +

A user-perceived character, consisting of one or more + codepoints.

+
+
+
+ + + + + + Convert a string to a comparable string. + +

+ Converts String to a case-agnostic + comparable string. Function casefold/1 is preferred + over lowercase/1 when two strings are to be compared + for equality. See also equal/4. +

+

Example:

+
+1> string:casefold("Ω and ẞ SHARP S").
+"ω and ss sharp s"
+
+
+ + + + Remove trailing end of line control characters. + +

+ Returns a string where any trailing \n or + \r\n have been removed from String. +

+

Example:

+
+182> string:chomp(<<"\nHello\n\n">>).
+<<"\nHello">>
+183> string:chomp("\nHello\r\r\n").
+"\nHello\r"
+
+
+ + + + + + Test string equality. + +

+ Returns true if A and + B are equal, otherwise false. +

+

+ If IgnoreCase is true + the function does + casefolding on the fly before the equality test. +

+

If Norm is not none + the function applies normalization on the fly before the equality test. + There are four available normalization forms: + nfc, + nfd, + nfkc, and + nfkd. +

+

By default, + IgnoreCase is false and + Norm is none.

+

Example:

+
+1> string:equal("åäö", <<"åäö"/utf8>>).
+true
+2> string:equal("åäö", unicode:characters_to_nfd_binary("åäö")).
+false
+3> string:equal("åäö", unicode:characters_to_nfd_binary("ÅÄÖ"), true, nfc).
+true
+
+
+ + + + + Find start of substring. + +

+ Removes anything before SearchPattern in String + and returns the remainder of the string or nomatch if SearchPattern is not + found. + Dir, which can be leading or + trailing, indicates from which direction characters + are to be searched. +

+

+ By default, Dir is leading. +

+

Example:

+
+1> string:find("ab..cd..ef", ".").
+"..cd..ef"
+2> string:find(<<"ab..cd..ef">>, "..", trailing).
+<<"..ef">>
+3> string:find(<<"ab..cd..ef">>, "x", leading).
+nomatch
+4> string:find("ab..cd..ef", "x", trailing).
+nomatch
+
+
+ + + + Check if the string is empty. + +

Returns true if String is the + empty string, otherwise false.

+

Example:

+
+1> string:is_empty("foo").
+false
+2> string:is_empty(["",<<>>]).
+true
+
+
+ + + + Calculate length of the string. + +

+ Returns the number of grapheme clusters in String. +

+

Example:

+
+1> string:length("ß↑e̊").
+3
+2> string:length(<<195,159,226,134,145,101,204,138>>).
+3
+
+
+ + + + Split string into lexemes. + +

+ Returns a list of lexemes in String, separated + by the grapheme clusters in SeparatorList. +

+

+ Notice that, as shown in this example, two or more + adjacent separator grapheme clusters in String + are treated as one. That is, there are no empty + strings in the resulting list of lexemes. + See also split/3, which returns + empty strings. +

+

Notice that [$\r,$\n] is one grapheme cluster.

+

Example:

+
+1> string:lexemes("abc de̊fxxghix jkl\r\nfoo", "x e" ++ [[$\r,$\n]]).
+["abc","de̊f","ghi","jkl","foo"]
+2> string:lexemes(<<"abc de̊fxxghix jkl\r\nfoo"/utf8>>, "x e" ++ [$\r,$\n]).
+[<<"abc">>,<<"de̊f"/utf8>>,<<"ghi">>,<<"jkl\r\nfoo">>]
+
+
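The difference between `lexemes/2` and `split/3` on adjacent separators can be seen in a minimal sketch such as the following. The module name `lexemes_vs_split_demo` is hypothetical.

```erlang
-module(lexemes_vs_split_demo).
-export([run/0]).

%% lexemes/2 collapses adjacent separators; split/3 keeps empty parts.
run() ->
    ["a", "b"] = string:lexemes("a,,b", ","),
    ["a", [], "b"] = string:split("a,,b", ",", all),
    %% Binary input gives binary output in both cases.
    [<<"a">>, <<"b">>] = string:lexemes(<<"a,,b">>, ","),
    [<<"a">>, <<>>, <<"b">>] = string:split(<<"a,,b">>, ",", all),
    ok.
```

Note also that the second argument means different things: `lexemes/2` takes a list of separator grapheme clusters, while `split/3` takes one search pattern.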
+ + + + Convert a string to lowercase + +

+ Converts String to lowercase. +

+

+ Notice that function casefold/1 + should be used when converting a string to + be tested for equality. +

+

Example:

+
+2> string:lowercase(string:uppercase("Michał")).
+"michał"
+
+
+ + + + Pick the first codepoint. + +

+ Returns the first codepoint in String + and the rest of String in the tail. +

+

Example:

+
+1> string:next_codepoint(unicode:characters_to_binary("e̊fg")).
+[101|<<"̊fg"/utf8>>]
+
+
+ + + + Pick the first grapheme cluster. + +

+ Returns the first grapheme cluster in String + and the rest of String in the tail. +

+

Example:

+
+1> string:next_grapheme(unicode:characters_to_binary("e̊fg")).
+["e̊"|<<"fg">>]
+
+
+ + + + Pick the nth lexeme. + +

Returns lexeme number N in + String, where lexemes are separated by + the grapheme clusters in SeparatorList. +

+

Example:

+
+1> string:nth_lexeme("abc.de̊f.ghiejkl", 3, ".e").
+"ghi"
+
+
+ + + + + + Pad a string to given length. + +

+ Pads String to Length with + grapheme cluster Char. + Dir, which can be leading, trailing, + or both, indicates where the padding should be added. +

+

By default, Char is $\s and + Dir is trailing. +

+

Example:

+
+1> string:pad(<<"He̊llö"/utf8>>, 8).
+[<<72,101,204,138,108,108,195,182>>,32,32,32]
+2> io:format("'~ts'~n",[string:pad("He̊llö", 8, leading)]).
+'   He̊llö'
+3> io:format("'~ts'~n",[string:pad("He̊llö", 8, both)]).
+' He̊llö  '
+
+
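Because the pad functions return `unicode:chardata()` rather than a flat string, a caller that needs a flat list or binary has to convert the result. The sketch below shows one way to do that; the module name `pad_flatten_demo` is an assumption for the example.

```erlang
-module(pad_flatten_demo).
-export([run/0]).

%% Flattens the chardata returned by string:pad/2,3.
run() ->
    Padded = string:pad(<<"abc">>, 5),   % chardata, not a flat binary
    5 = string:length(Padded),           % length is counted over the chardata
    <<"abc  ">> = unicode:characters_to_binary(Padded),
    "abc  " = unicode:characters_to_list(Padded),
    "  abc" = unicode:characters_to_list(string:pad("abc", 5, leading)),
    ok.
```

When the result is only passed on to `io:format/2` with `~ts` or to another `string` function, no flattening is needed.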
+ + + + Remove prefix from string. + +

+ If Prefix is the prefix of + String, removes it and returns the + remainder of String, otherwise returns + nomatch. +

+

Example:

+
+1> string:prefix(<<"prefix of string">>, "pre").
+<<"fix of string">>
+2> string:prefix("pre", "prefix").
+nomatch
+
+
+ + + + + Replace a pattern in string. + +

+ Replaces SearchPattern in String + with Replacement. + Where, default leading, indicates whether + the leading, the trailing or all encounters of + SearchPattern are to be replaced. +

+

Can be implemented as:

+
lists:join(Replacement, split(String, SearchPattern, Where)).
+

Example:

+
+1> string:replace(<<"ab..cd..ef">>, "..", "*").
+[<<"ab">>,"*",<<"cd..ef">>]
+2> string:replace(<<"ab..cd..ef">>, "..", "*", all).
+[<<"ab">>,"*",<<"cd">>,"*",<<"ef">>]
+
+
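The stated relationship between `replace/4` and `split/3` can be checked with a short sketch. The module name `replace_via_split_demo` is hypothetical, and the sketch only compares results; it makes no claim about how `replace/4` is written internally.

```erlang
-module(replace_via_split_demo).
-export([run/0]).

%% Checks that joining the split parts gives the same result as replace/4.
run() ->
    S = <<"ab..cd..ef">>,
    ViaSplit = lists:join("*", string:split(S, "..", all)),
    ViaSplit = string:replace(S, "..", "*", all),
    %% The result is chardata; flatten it when a plain binary is needed.
    <<"ab*cd*ef">> = unicode:characters_to_binary(ViaSplit),
    ok.
```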
+ + + + Reverses a string + +

+ Returns the reverse list of the grapheme clusters in String. +

+

Example:

+
+1> Reverse = string:reverse(unicode:characters_to_nfd_binary("ÅÄÖ")).
+[[79,776],[65,776],[65,778]]
+2> io:format("~ts~n",[Reverse]).
+ÖÄÅ
+
+
+ + + + + Extract a part of string + +

Returns a substring of String of + at most Length grapheme clusters, starting at position + Start.

+

By default, Length is infinity.

+

Example:

+
+1> string:slice(<<"He̊llö Wörld"/utf8>>, 4).
+<<"ö Wörld"/utf8>>
+2> string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,4).
+"ö Wö"
+3> string:slice(["He̊llö ", <<"Wörld"/utf8>>], 4,50).
+"ö Wörld"
+
+
+ + + + + Split a string into substrings. + +

+ Splits String where SearchPattern + is encountered and returns the remaining parts. + Where, default leading, indicates whether + the leading, the trailing or all encounters of + SearchPattern will split String. +

+

Example:

+
+1> string:split("ab..bc..cd", "..").
+["ab","bc..cd"]
+2> string:split(<<"ab..bc..cd">>, "..", trailing).
+[<<"ab..bc">>,<<"cd">>]
+3> string:split(<<"ab..bc....cd">>, "..", all).
+[<<"ab">>,<<"bc">>,<<>>,<<"cd">>]
+
+
+ + + + + + Take leading or trailing parts. + +

Takes characters from String as long as + the characters are members of set Characters + or the complement of set Characters. + Dir, + which can be leading or trailing, indicates from + which direction characters are to be taken. +

+

Example:

+
+5> string:take("abc0z123", lists:seq($a,$z)).
+{"abc","0z123"}
+6> string:take(<<"abc0z123">>, lists:seq($0,$9), true, leading).
+{<<"abc">>,<<"0z123">>}
+7> string:take("abc0z123", lists:seq($0,$9), false, trailing).
+{"abc0z","123"}
+8> string:take(<<"abc0z123">>, lists:seq($a,$z), true, trailing).
+{<<"abc0z">>,<<"123">>}
+
+
+ + + + Convert a string to titlecase. + +

+ Converts String to titlecase. +

+

Example:

+
+1> string:titlecase("ß is a SHARP s").
+"Ss is a SHARP s"
+
+
+ + + + Return a float whose text representation is the integers + (ASCII values) of a string. + +

Argument String is expected to start with a + valid text represented float (the digits are ASCII values). + Remaining characters in the string after the float are returned in + Rest.

+

Example:

+
+> {F1,Fs} = string:to_float("1.0-1.0e-1"),
+> {F2,[]} = string:to_float(Fs),
+> F1+F2.
+0.9
+> string:to_float("3/2=1.5").
+{error,no_float}
+> string:to_float("-1.5eX").
+{-1.5,"eX"}
+
+
+ + + + Return an integer whose text representation is the integers + (ASCII values) of a string. + +

Argument String is expected to start with a + valid text represented integer (the digits are ASCII values). + Remaining characters in the string after the integer are returned in + Rest.

+

Example:

+
+> {I1,Is} = string:to_integer("33+22"),
+> {I2,[]} = string:to_integer(Is),
+> I1-I2.
+11
+> string:to_integer("0.5").
+{0,".5"}
+> string:to_integer("x=2").
+{error,no_integer}
+
+
+ + + + Convert a string to a list of grapheme clusters. + +

+ Converts String to a list of grapheme clusters. +

+

Example:

+
+1> string:to_graphemes("ß↑e̊").
+[223,8593,[101,778]]
+2> string:to_graphemes(<<"ß↑e̊"/utf8>>).
+[223,8593,[101,778]]
+
+
+ + + + + + Trim leading or trailing, or both, characters. + +

+ Returns a string, where leading or trailing, or both, + Characters have been removed. + Dir which can be leading, trailing, + or both, indicates from which direction characters + are to be removed. +

+

Default Characters are the set of + nonbreakable whitespace codepoints, defined as + Pattern_White_Space in + Unicode Standard Annex #31. + By default, Dir is both. +

+

+ Notice that [$\r,$\n] is one grapheme cluster according + to the Unicode Standard. +

+

Example:

+
+1> string:trim("\t  Hello  \n").
+"Hello"
+2> string:trim(<<"\t  Hello  \n">>, leading).
+<<"Hello  \n">>
+3> string:trim(<<".Hello.\n">>, trailing, "\n.").
+<<".Hello">>
+
+
+ + + + Convert a string to uppercase. + +

+ Converts String to uppercase. +

+

See also titlecase/1.

+

Example:

+
+1> string:uppercase("Michał").
+"MICHAŁ"
+
+
+ +
+ +
+ + Obsolete API functions +

Here follow the functions of the old API. + These functions only work on a list of Latin-1 characters. +

+

+ The functions are kept for backward compatibility, but are + not recommended. + They will be deprecated in Erlang/OTP 21. +

+

Any undocumented functions in string are not to be used.
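As a rough migration sketch, the following compares a few obsolete calls with the replacements suggested in this patch. The module name `string_migration_demo` is hypothetical. Note two differences that are easy to miss: `trim/3` takes a list of grapheme clusters rather than a single character, and `slice/3` counts positions from 0 while `substr/3` counts from 1.

```erlang
-module(string_migration_demo).
-export([run/0]).

%% Pairs of obsolete calls and their suggested replacements.
run() ->
    %% strip/3 takes a character; trim/3 takes a list of grapheme clusters.
    "Hello" = string:strip("...Hello.....", both, $.),
    "Hello" = string:trim("...Hello.....", both, "."),
    %% tokens/2 and lexemes/2 both skip runs of separators.
    ["a", "b"] = string:tokens("a  b", " "),
    ["a", "b"] = string:lexemes("a  b", " "),
    %% substr/3 is 1-based; slice/3 is 0-based.
    "lo Wo" = string:substr("Hello World", 4, 5),
    "lo Wo" = string:slice("Hello World", 3, 5),
    ok.
```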

+
+
+ @@ -47,17 +652,24 @@

Returns a string, where String is centered in the string and surrounded by blanks or Character. The resulting string has length Number.

+

This function is obsolete. + Use + pad/3. +

- Returns a string consisting of numbers of characters. + Return a string consisting of numbers of characters.

Returns a string consisting of Number characters Character. Optionally, the string can end with string Tail.

+

This function is obsolete. + Use + lists:duplicate/2.

@@ -69,6 +681,9 @@

Returns the index of the first occurrence of Character in String. Returns 0 if Character does not occur.

+

This function is obsolete. + Use + find/2.

@@ -79,6 +694,16 @@

Concatenates String1 and String2 to form a new string String3, which is returned.

+

+ This function is obsolete. + Use [String1, String2] as + Data argument, and call + + unicode:characters_to_list/2 or + + unicode:characters_to_binary/2 + to flatten the output. +

@@ -88,6 +713,9 @@

Returns a string containing String repeated Number times.

+

This function is obsolete. + Use + lists:duplicate/2.

@@ -98,6 +726,9 @@

Returns the length of the maximum initial segment of String, which consists entirely of characters not from Chars.

+

This function is obsolete. + Use + take/3.

Example:

> string:cspan("\t abcdef", " \t"). @@ -105,21 +736,15 @@ - - - Test string equality. - -

Returns true if String1 and - String2 are equal, otherwise false.

-
-
- Join a list of strings with separator.

Returns a string with the elements of StringList separated by the string in Separator.

+

This function is obsolete. + Use + lists:join/2.

Example:

> join(["one", "two", "three"], ", "). @@ -137,6 +762,10 @@ fixed. If length(String) < Number, then String is padded with blanks or Characters.

+

This function is obsolete. + Use + pad/2 or + pad/3.

Example:

> string:left("Hello",10,$.). @@ -149,6 +778,9 @@ Return the length of a string.

Returns the number of characters in String.

+

This function is obsolete. + Use + length/1.

@@ -160,6 +792,9 @@

Returns the index of the last occurrence of Character in String. Returns 0 if Character does not occur.

+

This function is obsolete. + Use + find/3.

@@ -173,6 +808,9 @@ fixed. If the length of (String) < Number, then String is padded with blanks or Characters.

+

This function is obsolete. + Use + pad/3.

Example:

> string:right("Hello", 10, $.). @@ -188,6 +826,9 @@ SubString begins in String. Returns 0 if SubString does not exist in String.

+

This function is obsolete. + Use + find/3.

Example:

> string:rstr(" Hello Hello World World ", "Hello World"). @@ -202,6 +843,9 @@

Returns the length of the maximum initial segment of String, which consists entirely of characters from Chars.

+

This function is obsolete. + Use + take/2.

Example:

> string:span("\t abcdef", " \t"). @@ -217,6 +861,9 @@ SubString begins in String. Returns 0 if SubString does not exist in String.

+

This function is obsolete. + Use + find/2.

Example:

> string:str(" Hello Hello World World ", "Hello World"). @@ -230,12 +877,15 @@ Strip leading or trailing characters. -

Returns a string, where leading and/or trailing blanks or a +

Returns a string, where leading or trailing, or both, blanks or a number of Character have been removed. Direction, which can be left, right, or both, indicates from which direction blanks are to be removed. strip/1 is equivalent to strip(String, both).

+

This function is obsolete. + Use + trim/3.

Example:

> string:strip("...Hello.....", both, $.). @@ -251,6 +901,9 @@

Returns a substring of String, starting at position Start to the end of the string, or to and including position Stop.

+

This function is obsolete. + Use + slice/3.

Example:

sub_string("Hello World", 4, 8). @@ -266,6 +919,9 @@ sub_string("Hello World", 4, 8).

Returns a substring of String, starting at position Start, and ending at the end of the string or at length Length.

+

This function is obsolete. + Use + slice/3.

Example:

> substr("Hello World", 4, 5). @@ -281,6 +937,9 @@ sub_string("Hello World", 4, 8).

Returns the word in position Number of String. Words are separated by blanks or Characters.

+

This function is obsolete. + Use + nth_lexeme/3.

Example:

> string:sub_word(" Hello old boy !",3,$o). @@ -288,50 +947,6 @@ sub_string("Hello World", 4, 8).
- - - Returns a float whose text representation is the integers - (ASCII values) in a string. - -

Argument String is expected to start with a - valid text represented float (the digits are ASCII values). - Remaining characters in the string after the float are returned in - Rest.

-

Example:

- -> {F1,Fs} = string:to_float("1.0-1.0e-1"), -> {F2,[]} = string:to_float(Fs), -> F1+F2. -0.9 -> string:to_float("3/2=1.5"). -{error,no_float} -> string:to_float("-1.5eX"). -{-1.5,"eX"} -
-
- - - - Returns an integer whose text representation is the integers - (ASCII values) in a string. - -

Argument String is expected to start with a - valid text represented integer (the digits are ASCII values). - Remaining characters in the string after the integer are returned in - Rest.

-

Example:

- -> {I1,Is} = string:to_integer("33+22"), -> {I2,[]} = string:to_integer(Is), -> I1-I2. -11 -> string:to_integer("0.5"). -{0,".5"} -> string:to_integer("x=2"). -{error,no_integer} -
-
- @@ -346,6 +961,11 @@ sub_string("Hello World", 4, 8).

The specified string or character is case-converted. Notice that the supported character set is ISO/IEC 8859-1 (also called Latin 1); all values outside this set are unchanged

+

This function is obsolete. Use + lowercase/1, + uppercase/1, + titlecase/1 or + casefold/1.

@@ -363,6 +983,9 @@ sub_string("Hello World", 4, 8). adjacent separator characters in String are treated as one. That is, there are no empty strings in the resulting list of tokens.

+

This function is obsolete. + Use + lexemes/2.

@@ -373,6 +996,9 @@ sub_string("Hello World", 4, 8).

Returns the number of words in String, separated by blanks or Character.

+

This function is obsolete. + Use + lexemes/2.

Example:

> words(" Hello old boy!", $o). @@ -387,10 +1013,7 @@ sub_string("Hello World", 4, 8). other. The reason is that this string package is the combination of two earlier packages and all functions of both packages have been retained.

- - -

Any undocumented functions in string are not to be used.

-
+