author Dan Gudmundsson <[email protected]> 2017-04-03 12:19:21 +0200
committer Dan Gudmundsson <[email protected]> 2017-04-24 12:16:56 +0200
commit 2c72e662bad11a41839780f86680d4bb05367c78
tree 01e9ae9b32fdb953392e571a0773fb2cd059c498 /lib/stdlib/doc/src/unicode_usage.xml
parent 75fc94b8b462d7b7f6dd4b706bbe32cff77ee575
New unicode aware string module that works with unicode:chardata()
Works with unicode:chardata() as input, as was decided at an OTP board meeting in response to EEP-35 a long time ago.

Works on grapheme clusters as the base unit, with a few exceptions: it does not handle classic (nor NFD-normalized) Hangul, nor extended grapheme clusters such as the prepend class, because that would make handling binaries as input/output very slow.

List input => list output, binary input => binary output, and mixed input => mixed output for all find/split functions, so that results can be post-processed without invoking unicode:characters_to_list|binary on intermediate data. The pad functions return lists of unicode:chardata() for performance.
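
The behaviour described in the commit message can be sketched with a few calls, assuming an OTP 20 or later runtime. The module name chardata_demo and the function run/0 are hypothetical and exist only for illustration; the string and unicode calls are the ones this commit introduces or builds on. Non-ASCII characters are written as code point numbers so that the file encoding does not matter.

```erlang
-module(chardata_demo).
-export([run/0]).

%% Hypothetical demo module; only the string/unicode calls are OTP APIs.
run() ->
    %% List input gives list output.
    ["ab", "cd"] = string:split("ab cd", " "),
    %% Binary input gives binary output.
    [<<"ab">>, <<"cd">>] = string:split(<<"ab cd">>, " "),

    %% "o" followed by U+0308 is two code points but one grapheme cluster.
    Decomposed = [$o, 16#0308],
    2 = length(Decomposed),
    1 = string:length(Decomposed),

    %% NFC composes the pair to U+00F6; NFD decomposes it again.
    [16#00F6] = unicode:characters_to_nfc_list(Decomposed),
    Decomposed = unicode:characters_to_nfd_list([16#00F6]),

    %% pad returns chardata, so flatten it before comparing.
    "abc  " = unicode:characters_to_list(string:pad("abc", 5)),
    ok.
```

Each pattern match acts as an assertion, so chardata_demo:run() returns ok only if every call behaves as described. The pad result is flattened with unicode:characters_to_list/1 because the exact nesting of the returned chardata is not something callers should rely on.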
Diffstat (limited to 'lib/stdlib/doc/src/unicode_usage.xml')
-rw-r--r-- lib/stdlib/doc/src/unicode_usage.xml 70
1 files changed, 38 insertions, 32 deletions
diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml
index a8ef8ff5c5..11b84f552a 100644
--- a/lib/stdlib/doc/src/unicode_usage.xml
+++ b/lib/stdlib/doc/src/unicode_usage.xml
@@ -65,7 +65,10 @@
<item><p>In Erlang/OTP 20.0, atoms and function can contain
Unicode characters. Module names are still restricted to
- the ISO-Latin-1 range.</p></item>
+ the ISO-Latin-1 range.</p>
+ <p>Support was added for normalizations forms in
+ <c>unicode</c> and the <c>string</c> module now handles
+ utf8-encoded binaries.</p></item>
</list>
<p>This section outlines the current Unicode support and gives some
@@ -110,23 +113,27 @@
</item>
</list>
- <p>So, a conversion function must know not only one character at a time,
- but possibly the whole sentence, the natural language to translate to,
- the differences in input and output string length, and so on.
- Erlang/OTP has currently no Unicode <c>to_upper</c>/<c>to_lower</c>
- functionality, but publicly available libraries address these issues.</p>
-
- <p>Another example is the accented characters, where the same glyph has two
- different representations. The Swedish letter "ö" is one example.
- The Unicode standard has a code point for it, but you can also write it
- as "o" followed by "U+0308" (Combining Diaeresis, with the simplified
- meaning that the last letter is to have "¨" above). They have the same
- glyph. They are for most purposes the same, but have different
- representations. For example, MacOS X converts all filenames to use
- Combining Diaeresis, while most other programs (including Erlang) try to
- hide that by doing the opposite when, for example, listing directories.
- However it is done, it is usually important to normalize such
- characters to avoid confusion.</p>
+ <p>So, a conversion function must know not only one character at a
+ time, but possibly the whole sentence, the natural language to
+ translate to, the differences in input and output string length,
+ and so on. Erlang/OTP has currently no Unicode
+ <c>uppercase</c>/<c>lowercase</c> functionality with language
+ specific handling, but publicly available libraries address these
+ issues.</p>
+
+ <p>Another example is the accented characters, where the same
+ glyph has two different representations. The Swedish letter "ö" is
+ one example. The Unicode standard has a code point for it, but
+ you can also write it as "o" followed by "U+0308" (Combining
+ Diaeresis, with the simplified meaning that the last letter is to
+ have "¨" above). They have the same glyph, user perceived
+ character. They are for most purposes the same, but have different
+ representations. For example, MacOS X converts all filenames to
+ use Combining Diaeresis, while most other programs (including
+ Erlang) try to hide that by doing the opposite when, for example,
+ listing directories. However it is done, it is usually important
+ to normalize such characters to avoid confusion.
+ </p>
<p>The list of examples can be made long. One need a kind of knowledge that
was not needed when programs only considered one or two languages. The
@@ -273,7 +280,7 @@
them. In some cases functionality has been added to already
existing interfaces (as the <seealso
marker="stdlib:string"><c>string</c></seealso> module now can
- handle lists with any code points). In some cases new
+ handle strings with any code points). In some cases new
functionality or options have been added (as in the <seealso
marker="stdlib:io"><c>io</c></seealso> module, the file
handling, the <seealso
@@ -977,7 +984,7 @@ Eshell V5.10.1 (abort with ^G)
<p>Fortunately, most textual data has been stored in lists and range
checking has been sparse, so modules like <c>string</c> work well for
- Unicode lists with little need for conversion or extension.</p>
+ Unicode strings with little need for conversion or extension.</p>
<p>Some modules are, however, changed to be explicitly Unicode-aware. These
modules include:</p>
@@ -1028,18 +1035,17 @@ Eshell V5.10.1 (abort with ^G)
has extensive support for Unicode text.</p></item>
</taglist>
- <p>The <seealso marker="stdlib:string"><c>string</c></seealso> module works
- perfectly for Unicode strings and ISO Latin-1 strings, except the
- language-dependent functions
- <seealso marker="stdlib:string#to_upper/1"><c>string:to_upper/1</c></seealso>
- and
- <seealso marker="stdlib:string#to_lower/1"><c>string:to_lower/1</c></seealso>,
- which are only correct for the ISO Latin-1 character set. These two
- functions can never function correctly for Unicode characters in their
- current form, as there are language and locale issues as well as
- multi-character mappings to consider when converting text between cases.
- Converting case in an international environment is a large subject not
- yet addressed in OTP.</p>
+ <p>The <seealso marker="stdlib:string"><c>string</c></seealso>
+ module works perfectly for Unicode strings and ISO Latin-1
+ strings, except the language-dependent functions <seealso
+ marker="stdlib:string#uppercase/1"><c>string:uppercase/1</c></seealso>
+ and <seealso
+ marker="stdlib:string#lowercase/1"><c>string:lowercase/1</c></seealso>.
+ These two functions can never function correctly for Unicode
+ characters in their current form, as there are language and locale
+ issues to consider when converting text between cases. Converting
+ case in an international environment is a large subject not yet
+ addressed in OTP.</p>
</section>
<section>