From 2c72e662bad11a41839780f86680d4bb05367c78 Mon Sep 17 00:00:00 2001 From: Dan Gudmundsson Date: Mon, 3 Apr 2017 12:19:21 +0200 Subject: New unicode aware string module that works with unicode:chardata() Works with unicode:chardata() as input as was decided on OTP board meeting as response to EEP-35 a long time ago. Works on graphemes clusters as base, with a few exceptions, does not handle classic (nor nfd'ified) Hangul nor the extended grapheme clusters such as the prepend class. That would make handling binaries as input/output very slow. List input => list output, binary input => binary output and mixed input => mixed output for all find/split functions. So that results can be post-processed without the need to invoke unicode:characters_to_list|binary for intermediate data. pad functions return lists of unicode:chardata() for performance. --- lib/stdlib/doc/src/unicode_usage.xml | 70 +++++++++++++++++++----------------- 1 file changed, 38 insertions(+), 32 deletions(-) (limited to 'lib/stdlib/doc/src/unicode_usage.xml') diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml index a8ef8ff5c5..11b84f552a 100644 --- a/lib/stdlib/doc/src/unicode_usage.xml +++ b/lib/stdlib/doc/src/unicode_usage.xml @@ -65,7 +65,10 @@

In Erlang/OTP 20.0, atoms and function can contain Unicode characters. Module names are still restricted to - the ISO-Latin-1 range.

+ the ISO-Latin-1 range.

+

Support was added for normalizations forms in + unicode and the string module now handles + utf8-encoded binaries.

This section outlines the current Unicode support and gives some @@ -110,23 +113,27 @@ -

So, a conversion function must know not only one character at a time, - but possibly the whole sentence, the natural language to translate to, - the differences in input and output string length, and so on. - Erlang/OTP has currently no Unicode to_upper/to_lower - functionality, but publicly available libraries address these issues.

- -

Another example is the accented characters, where the same glyph has two - different representations. The Swedish letter "ö" is one example. - The Unicode standard has a code point for it, but you can also write it - as "o" followed by "U+0308" (Combining Diaeresis, with the simplified - meaning that the last letter is to have "¨" above). They have the same - glyph. They are for most purposes the same, but have different - representations. For example, MacOS X converts all filenames to use - Combining Diaeresis, while most other programs (including Erlang) try to - hide that by doing the opposite when, for example, listing directories. - However it is done, it is usually important to normalize such - characters to avoid confusion.

+

So, a conversion function must know not only one character at a + time, but possibly the whole sentence, the natural language to + translate to, the differences in input and output string length, + and so on. Erlang/OTP has currently no Unicode + uppercase/lowercase functionality with language + specific handling, but publicly available libraries address these + issues.

+ +

Another example is the accented characters, where the same + glyph has two different representations. The Swedish letter "ö" is + one example. The Unicode standard has a code point for it, but + you can also write it as "o" followed by "U+0308" (Combining + Diaeresis, with the simplified meaning that the last letter is to + have "¨" above). They have the same glyph, user perceived + character. They are for most purposes the same, but have different + representations. For example, MacOS X converts all filenames to + use Combining Diaeresis, while most other programs (including + Erlang) try to hide that by doing the opposite when, for example, + listing directories. However it is done, it is usually important + to normalize such characters to avoid confusion. +

The list of examples can be made long. One need a kind of knowledge that was not needed when programs only considered one or two languages. The @@ -273,7 +280,7 @@ them. In some cases functionality has been added to already existing interfaces (as the string module now can - handle lists with any code points). In some cases new + handle strings with any code points). In some cases new functionality or options have been added (as in the io module, the file handling, the Fortunately, most textual data has been stored in lists and range checking has been sparse, so modules like string work well for - Unicode lists with little need for conversion or extension.

+ Unicode strings with little need for conversion or extension.

Some modules are, however, changed to be explicitly Unicode-aware. These modules include:

@@ -1028,18 +1035,17 @@ Eshell V5.10.1 (abort with ^G) has extensive support for Unicode text.

-

The string module works - perfectly for Unicode strings and ISO Latin-1 strings, except the - language-dependent functions - string:to_upper/1 - and - string:to_lower/1, - which are only correct for the ISO Latin-1 character set. These two - functions can never function correctly for Unicode characters in their - current form, as there are language and locale issues as well as - multi-character mappings to consider when converting text between cases. - Converting case in an international environment is a large subject not - yet addressed in OTP.

+

The string + module works perfectly for Unicode strings and ISO Latin-1 + strings, except the language-dependent functions string:uppercase/1 + and string:lowercase/1. + These two functions can never function correctly for Unicode + characters in their current form, as there are language and locale + issues to consider when converting text between cases. Converting + case in an international environment is a large subject not yet + addressed in OTP.

-- cgit v1.2.3