From 68d53c01b0b8e9a007a6a30158c19e34b2d2a34e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bj=C3=B6rn=20Gustavsson?= Date: Wed, 18 May 2016 15:53:35 +0200 Subject: Update STDLIB documentation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Language cleaned up by the technical writers xsipewe and tmanevik from Combitech. Proofreading and corrections by Björn Gustavsson and Hans Bolinder. --- lib/stdlib/doc/src/unicode_usage.xml | 2420 ++++++++++++++++++---------------- 1 file changed, 1278 insertions(+), 1142 deletions(-) (limited to 'lib/stdlib/doc/src/unicode_usage.xml') diff --git a/lib/stdlib/doc/src/unicode_usage.xml b/lib/stdlib/doc/src/unicode_usage.xml index b4c9385e33..7f79ac88a1 100644 --- a/lib/stdlib/doc/src/unicode_usage.xml +++ b/lib/stdlib/doc/src/unicode_usage.xml @@ -33,427 +33,495 @@ PA1 unicode_usage.xml -
-Unicode Implementation -

Implementing support for Unicode character sets is an ongoing - process. The Erlang Enhancement Proposal (EEP) 10 outlined the - basics of Unicode support and also specified a default encoding in - binaries that all Unicode-aware modules should handle in the - future.

- -

The functionality described in EEP10 was implemented in Erlang/OTP - R13A, but that was by no means the end of it. In Erlang/OTP R14B01 support - for Unicode file names was added, although it was in no way complete - and was by default disabled on platforms where no guarantee was given - for the file name encoding. With Erlang/OTP R16A came support for UTF-8 encoded - source code, among with enhancements to many of the applications to - support both Unicode encoded file names as well as support for UTF-8 - encoded files in several circumstances. Most notable is the support - for UTF-8 in files read by file:consult/1, release handler support - for UTF-8 and more support for Unicode character sets in the - I/O-system. In Erlang/OTP 17.0, the encoding default for Erlang source files was - switched to UTF-8.

- -

This guide outlines the current Unicode support and gives a couple - of recipes for working with Unicode data.

-
-
-Understanding Unicode -

Experience with the Unicode support in Erlang has made it - painfully clear that understanding Unicode characters and encodings - is not as easy as one would expect. The complexity of the field as - well as the implications of the standard requires thorough - understanding of concepts rarely before thought of.

- -

Furthermore the Erlang implementation requires understanding of - concepts that never were an issue for many (Erlang) programmers. To - understand and use Unicode characters requires that you study the - subject thoroughly, even if you're an experienced programmer.

- -

As an example, one could contemplate the issue of converting - between upper and lower case letters. Reading the standard will make - you realize that, to begin with, there's not a simple one to one - mapping in all scripts. Take German as an example, where there's a - letter "ß" (Sharp s) in lower case, but the uppercase equivalent is - "SS". Or Greek, where "Σ" has two different lowercase forms: "ς" in - word-final position and "σ" elsewhere. Or Turkish where dotted and - dot-less "i" both exist in lower case and upper case forms, or - Cyrillic "I" which usually has no lowercase form. Or of course - languages that have no concept of upper case (or lower case). So, a - conversion function will need to know not only one character at a - time, but possibly the whole sentence, maybe the natural language - the translation should be in and also take into account differences - in input and output string length and so on. There is at the time of - writing no Unicode to_upper/to_lower functionality in Erlang/OTP, but - there are publicly available libraries that address these issues.

- -

Another example is the accented characters where the same glyph - has two different representations. Let's look at the Swedish - "ö". There's a code point for that in the Unicode standard, but you - can also write it as "o" followed by U+0308 (Combining Diaeresis, - with the simplified meaning that the last letter should have a "¨" - above). They have exactly the same glyph. They are for most - purposes the same, but they have completely different - representations. For example MacOS X converts all file names to use - Combining Diaeresis, while most other programs (including Erlang) - try to hide that by doing the opposite when for example listing - directories. However it's done, it's usually important to normalize - such characters to avoid utter confusion.

- -

The list of examples can be made as long as the Unicode standard, I - suspect. The point is that one need a kind of knowledge that was - never needed when programs only took one or two languages into - account. The complexity of human languages and scripts, certainly - has made this a challenge when constructing a universal - standard. Supporting Unicode properly in your program will require - effort.

- -
-
-What Unicode Is -

Unicode is a standard defining code points (numbers) for all - known, living or dead, scripts. In principle, every known symbol - used in any language has a Unicode code point.

-

Unicode code points are defined and published by the Unicode - Consortium, which is a non profit organization.

-

Support for Unicode is increasing throughout the world of - computing, as the benefits of one common character set are - overwhelming when programs are used in a global environment.

-

Along with the base of the standard: the code points for all the - scripts, there are a couple of encoding standards available.

-

It is vital to understand the difference between encodings and - Unicode characters. Unicode characters are code points according to - the Unicode standard, while the encodings are ways to represent such - code points. An encoding is just a standard for representation, - UTF-8 can for example be used to represent a very limited part of - the Unicode character set (e.g. ISO-Latin-1), or the full Unicode - range. It's just an encoding format.

-

As long as all character sets were limited to 256 characters, - each character could be stored in one single byte, so there was more - or less only one practical encoding for the characters. Encoding - each character in one byte was so common that the encoding wasn't - even named. When we now, with the Unicode system, have a lot more - than 256 characters, we need a common way to represent these. The - common ways of representing the code points are the encodings. This - means a whole new concept to the programmer, the concept of - character representation, which was before a non-issue.

- -

Different operating systems and tools support different - encodings. For example Linux and MacOS X has chosen the UTF-8 - encoding, which is backwards compatible with 7-bit ASCII and - therefore affects programs written in plain English the - least. Windows on the other hand supports a limited version of - UTF-16, namely all the code planes where the characters can be - stored in one single 16-bit entity, which includes most living - languages.

- -

The most widely spread encodings are:

- - Bytewise representation - This is not a proper Unicode representation, but the - representation used for characters before the Unicode standard. It - can still be used to represent character code points in the Unicode - standard that have numbers below 256, which corresponds exactly to - the ISO-Latin-1 character set. In Erlang, this is commonly denoted - latin1 encoding, which is slightly misleading as ISO-Latin-1 is - a character code range, not an encoding. - UTF-8 - Each character is stored in one to four bytes depending on - code point. The encoding is backwards compatible with bytewise - representation of 7-bit ASCII as all 7-bit characters are stored - in one single byte in UTF-8. The characters beyond code point 127 - are stored in more bytes, letting the most significant bit in the - first character indicate a multi-byte character. For details on - the encoding, the RFC is publicly available. Note that UTF-8 is - not compatible with bytewise representation for - code points between 128 and 255, so a ISO-Latin-1 bytewise - representation is not generally compatible with UTF-8. - UTF-16 - This encoding has many similarities to UTF-8, but the basic - unit is a 16-bit number. This means that all characters occupy at - least two bytes, some high numbers even four bytes. Some programs, - libraries and operating systems claiming to use UTF-16 only allows - for characters that can be stored in one 16-bit entity, which is - usually sufficient to handle living languages. As the basic unit - is more than one byte, byte-order issues occur, why UTF-16 exists - in both a big-endian and little-endian variant. In Erlang, the - full UTF-16 range is supported when applicable, like in the - unicode module and in the bit syntax. - UTF-32 - The most straight forward representation. Each character is - stored in one single 32-bit number. There is no need for escapes - or any variable amount of entities for one character, all Unicode - code points can be stored in one single 32-bit entity. As with - UTF-16, there are byte-order issues, UTF-32 can be both big- and - little-endian. - UCS-4 - Basically the same as UTF-32, but without some Unicode - semantics, defined by IEEE and has little use as a separate - encoding standard. For all normal (and possibly abnormal) usages, - UTF-32 and UCS-4 are interchangeable. - -

Certain ranges of numbers are left unused in the Unicode standard - and certain ranges are even deemed invalid. The most notable invalid - range is 16#D800 - 16#DFFF, as the UTF-16 encoding does not allow - for encoding of these numbers. It can be speculated that the UTF-16 - encoding standard was, from the beginning, expected to be able to - hold all Unicode characters in one 16-bit entity, but then had to be - extended, leaving a hole in the Unicode range to cope with backward - compatibility.

-

Additionally, the code point 16#FEFF is used for byte order marks - (BOM's) and use of that character is not encouraged in other - contexts than that. It actually is valid though, as the character - "ZWNBS" (Zero Width Non Breaking Space). BOM's are used to identify - encodings and byte order for programs where such parameters are not - known in advance. Byte order marks are more seldom used than one - could expect, but their use might become more widely spread as they - provide the means for programs to make educated guesses about the - Unicode format of a certain file.

-
-
- Areas of Unicode Support -

To support Unicode in Erlang, problems in several areas have been - addressed. Each area is described briefly in this section and more - thoroughly further down in this document:

- - Representation - To handle Unicode characters in Erlang, we have to have a - common representation both in lists and binaries. The EEP (10) and - the subsequent initial implementation in Erlang/OTP R13A settled a standard - representation of Unicode characters in Erlang. - Manipulation - The Unicode characters need to be processed by the Erlang - program, why library functions need to be able to handle them. In - some cases functionality was added to already existing interfaces - (as the string module now can handle lists with arbitrary code points), - in some cases new functionality or options need to be added (as in - the io-module, the file handling, the unicode module - and the bit syntax). Today most modules in kernel and STDLIB, as - well as the VM are Unicode aware. - File I/O - I/O is by far the most problematic area for Unicode. A file - is an entity where bytes are stored and the lore of programming - has been to treat characters and bytes as interchangeable. With - Unicode characters, you need to decide on an encoding as soon as - you want to store the data in a file. In Erlang you can open a - text file with an encoding option, so that you can read characters - from it rather than bytes, but you can also open a file for - bytewise I/O. The I/O-system of Erlang has been designed (or at - least used) in a way where you expect any I/O-server to be - able to cope with any string data, but that is no longer the case - when you work with Unicode characters. Handling the fact that you - need to know the capabilities of the device where your data ends - up is something new to the Erlang programmer. Furthermore, ports - in Erlang are byte oriented, so an arbitrary string of (Unicode) - characters can not be sent to a port without first converting it - to an encoding of choice. - Terminal I/O - Terminal I/O is slightly easier than file I/O. The output is - meant for human reading and is usually Erlang syntax (e.g. in the - shell). There exists syntactic representation of any Unicode - character without actually displaying the glyph (instead written - as \x{HHH}), so Unicode data can usually be displayed - even if the terminal as such do not support the whole Unicode - range. - File names - File names can be stored as Unicode strings, in different - ways depending on the underlying OS and file system. This can be - handled fairly easy by a program. The problems arise when the file - system is not consistent in it's encodings, like for example - Linux. Linux allows files to be named with any sequence of bytes, - leaving to each program to interpret those bytes. On systems where - these "transparent" file names are used, Erlang has to be informed - about the file name encoding by a startup flag. The default is - bytewise interpretation, which is actually usually wrong, but - allows for interpretation of all file names. The concept - of "raw file names" can be used to handle wrongly encoded - file names if one enables Unicode file name translation - (+fnu) on platforms where this is not the default. - Source code encoding - When it comes to the Erlang source code, there is support - for the UTF-8 encoding and bytewise encoding. The default in - Erlang/OTP R16B was bytewise (or latin1) encoding; in Erlang/OTP 17.0 - it was changed to UTF-8. You can control the encoding by a comment like: - -%% -*- coding: utf-8 -*- - - in the beginning of the file. This of course requires your editor to - support UTF-8 as well. The same comment is also interpreted by - functions like file:consult/1, the release handler etc, so that - you can have all text files in your source directories in UTF-8 - encoding. - - The language - Having the source code in UTF-8 also allows you to write - string literals containing Unicode characters with code points > - 255, although atoms, module names and function names are - restricted to the ISO-Latin-1 range. Binary - literals where you use the /utf8 type, can also be - expressed using Unicode characters > 255. Having module names - using characters other than 7-bit ASCII can cause trouble on - operating systems with inconsistent file naming schemes, and might - also hurt portability, so it's not really recommended. It is - suggested in EEP 40 that the language should also allow for - Unicode characters > 255 in variable names. Whether to - implement that EEP or not is yet to be decided. - -
-
- Standard Unicode Representation -

In Erlang, strings are actually lists of integers. A string was - up until Erlang/OTP R13 defined to be encoded in the ISO-latin-1 (ISO8859-1) - character set, which is, code point by code point, a sub-range of - the Unicode character set.

-

The standard list encoding for strings was therefore easily - extended to cope with the whole Unicode range: A Unicode string in - Erlang is simply a list containing integers, each integer being a - valid Unicode code point and representing one character in the - Unicode character set.

-

Erlang strings in ISO-latin-1 are a subset of Unicode - strings.

-

Only if a string contains code points < 256, can it be - directly converted to a binary by using - i.e. erlang:iolist_to_binary/1 or can be sent directly to a - port. If the string contains Unicode characters > 255, an - encoding has to be decided upon and the string should be converted - to a binary in the preferred encoding using - unicode:characters_to_binary/{1,2,3}. Strings are not - generally lists of bytes, as they were before Erlang/OTP R13. They are lists of - characters. Characters are not generally bytes, they are Unicode - code points.

- -

Binaries are more troublesome. For performance reasons, programs - often store textual data in binaries instead of lists, mainly - because they are more compact (one byte per character instead of two - words per character, as is the case with lists). Using - erlang:list_to_binary/1, an ISO-Latin-1 Erlang string could - be converted into a binary, effectively using bytewise encoding - - one byte per character. This was very convenient for those limited - Erlang strings, but cannot be done for arbitrary Unicode lists.

-

As the UTF-8 encoding is widely spread and provides some backward - compatibility in the 7-bit ASCII range, it is selected as the - standard encoding for Unicode characters in binaries for Erlang.

-

The standard binary encoding is used whenever a library function - in Erlang should cope with Unicode data in binaries, but is of - course not enforced when communicating externally. Functions and - bit-syntax exist to encode and decode both UTF-8, UTF-16 and UTF-32 - in binaries. Library functions dealing with binaries and Unicode in - general, however, only deal with the default encoding.

- -

Character data may be combined from several sources, sometimes - available in a mix of strings and binaries. Erlang has for long had - the concept of iodata or iolists, where binaries and - lists can be combined to represent a sequence of bytes. In the same - way, the Unicode aware modules often allow for combinations of - binaries and lists where the binaries have characters encoded in - UTF-8 and the lists contain such binaries or numbers representing - Unicode code points:

- +
+ Unicode Implementation +

Implementing support for Unicode character sets is an ongoing process. + The Erlang Enhancement Proposal (EEP) 10 outlined the basics of Unicode + support and specified a default encoding in binaries that all + Unicode-aware modules are to handle in the future.

+ +

Here is an overview what has been done so far:

+ + +

The functionality described in EEP10 was implemented + in Erlang/OTP R13A.

+ +

Erlang/OTP R14B01 added support for Unicode + filenames, but it was not complete and was by default + disabled on platforms where no guarantee was given for the + filename encoding.

+ +

With Erlang/OTP R16A came support for UTF-8 encoded + source code, with enhancements to many of the applications to + support both Unicode encoded filenames and support for UTF-8 + encoded files in many circumstances. Most notable is the + support for UTF-8 in files read by file:consult/1, + release handler support for UTF-8, and more support for + Unicode character sets in the I/O system.

+ +

In Erlang/OTP 17.0, the encoding default for Erlang + source files was switched to UTF-8.

+
+ +

This section outlines the current Unicode support and gives some + recipes for working with Unicode data.

+
+ +
+ Understanding Unicode +

Experience with the Unicode support in Erlang has made it clear that + understanding Unicode characters and encodings is not as easy as one + would expect. The complexity of the field and the implications of the + standard require thorough understanding of concepts rarely before + thought of.

+ +

Also, the Erlang implementation requires understanding of + concepts that were never an issue for many (Erlang) programmers. To + understand and use Unicode characters requires that you study the + subject thoroughly, even if you are an experienced programmer.

+ +

As an example, contemplate the issue of converting between upper and + lower case letters. Reading the standard makes you realize that there is + not a simple one to one mapping in all scripts, for example:

+ + + +

In German, the letter "ß" (sharp s) is in lower case, but the + uppercase equivalent is "SS".

+
+ +

In Greek, the letter "Σ" has two different lowercase forms, + "ς" in word-final position and "σ" elsewhere.

+
+ +

In Turkish, both dotted and dotless "i" exist in lower case and + upper case forms.

+
+ +

Cyrillic "I" has usually no lowercase form.

+
+ +

Languages with no concept of upper case (or lower case).

+
+
+ +

So, a conversion function must know not only one character at a time, + but possibly the whole sentence, the natural language to translate to, + the differences in input and output string length, and so on. + Erlang/OTP has currently no Unicode to_upper/to_lower + functionality, but publicly available libraries address these issues.

+ +

Another example is the accented characters, where the same glyph has two + different representations. The Swedish letter "ö" is one example. + The Unicode standard has a code point for it, but you can also write it + as "o" followed by "U+0308" (Combining Diaeresis, with the simplified + meaning that the last letter is to have "¨" above). They have the same + glyph. They are for most purposes the same, but have different + representations. For example, MacOS X converts all filenames to use + Combining Diaeresis, while most other programs (including Erlang) try to + hide that by doing the opposite when, for example, listing directories. + However it is done, it is usually important to normalize such + characters to avoid confusion.

+ +

The list of examples can be made long. One need a kind of knowledge that + was not needed when programs only considered one or two languages. The + complexity of human languages and scripts has certainly made this a + challenge when constructing a universal standard. Supporting Unicode + properly in your program will require effort.

+
+ +
+ What Unicode Is +

Unicode is a standard defining code points (numbers) for all known, + living or dead, scripts. In principle, every symbol used in any + language has a Unicode code point. Unicode code points are defined and + published by the Unicode Consortium, which is a non-profit + organization.

+ +

Support for Unicode is increasing throughout the world of computing, as + the benefits of one common character set are overwhelming when programs + are used in a global environment. Along with the base of the standard, + the code points for all the scripts, some encoding standards are + available.

+ +

It is vital to understand the difference between encodings and Unicode + characters. Unicode characters are code points according to the Unicode + standard, while the encodings are ways to represent such code points. An + encoding is only a standard for representation. UTF-8 can, for example, + be used to represent a very limited part of the Unicode character set + (for example ISO-Latin-1) or the full Unicode range. It is only an + encoding format.

+ +

As long as all character sets were limited to 256 characters, each + character could be stored in one single byte, so there was more or less + only one practical encoding for the characters. Encoding each character + in one byte was so common that the encoding was not even named. With the + Unicode system there are much more than 256 characters, so a common way + is needed to represent these. The common ways of representing the code + points are the encodings. This means a whole new concept to the + programmer, the concept of character representation, which was a + non-issue earlier.

+ +

Different operating systems and tools support different encodings. For + example, Linux and MacOS X have chosen the UTF-8 encoding, which is + backward compatible with 7-bit ASCII and therefore affects programs + written in plain English the least. Windows supports a limited version + of UTF-16, namely all the code planes where the characters can be + stored in one single 16-bit entity, which includes most living + languages.

+ +

The following are the most widely spread encodings:

+ + + Bytewise representation + +

This is not a proper Unicode representation, but the representation + used for characters before the Unicode standard. It can still be used + to represent character code points in the Unicode standard with + numbers < 256, which exactly corresponds to the ISO Latin-1 + character set. In Erlang, this is commonly denoted latin1 + encoding, which is slightly misleading as ISO Latin-1 is a + character code range, not an encoding.

+
+ UTF-8 + +

Each character is stored in one to four bytes depending on code + point. The encoding is backward compatible with bytewise + representation of 7-bit ASCII, as all 7-bit characters are stored in + one single byte in UTF-8. The characters beyond code point 127 are + stored in more bytes, letting the most significant bit in the first + character indicate a multi-byte character. For details on the + encoding, the RFC is publicly available.

+

Notice that UTF-8 is not compatible with bytewise + representation for code points from 128 through 255, so an ISO + Latin-1 bytewise representation is generally incompatible with + UTF-8.

+
+ UTF-16 + +

This encoding has many similarities to UTF-8, but the basic + unit is a 16-bit number. This means that all characters occupy + at least two bytes, and some high numbers four bytes. Some + programs, libraries, and operating systems claiming to use + UTF-16 only allow for characters that can be stored in one + 16-bit entity, which is usually sufficient to handle living + languages. As the basic unit is more than one byte, byte-order + issues occur, which is why UTF-16 exists in both a big-endian + and a little-endian variant.

+

In Erlang, the full UTF-16 range is supported when applicable, like + in the unicode + module and in the bit syntax.

+
+ UTF-32 + +

The most straightforward representation. Each character is stored in + one single 32-bit number. There is no need for escapes or any + variable number of entities for one character. All Unicode code + points can be stored in one single 32-bit entity. As with UTF-16, + there are byte-order issues. UTF-32 can be both big-endian and + little-endian.

+
+ UCS-4 + +

Basically the same as UTF-32, but without some Unicode semantics, + defined by IEEE, and has little use as a separate encoding standard. + For all normal (and possibly abnormal) use, UTF-32 and UCS-4 are + interchangeable.

+
+
+ +

Certain number ranges are unused in the Unicode standard and certain + ranges are even deemed invalid. The most notable invalid range is + 16#D800-16#DFFF, as the UTF-16 encoding does not allow for encoding of + these numbers. This is possibly because the UTF-16 encoding standard, + from the beginning, was expected to be able to hold all Unicode + characters in one 16-bit entity, but was then extended, leaving a hole + in the Unicode range to handle backward compatibility.

+ +

Code point 16#FEFF is used for Byte Order Marks (BOMs) and use of that + character is not encouraged in other contexts. It is valid though, as + the character "ZWNBS" (Zero Width Non Breaking Space). BOMs are used to + identify encodings and byte order for programs where such parameters are + not known in advance. BOMs are more seldom used than expected, but can + become more widely spread as they provide the means for programs to make + educated guesses about the Unicode format of a certain file.

+
+ +
+ Areas of Unicode Support +

To support Unicode in Erlang, problems in various areas have been + addressed. This section describes each area briefly and more + thoroughly later in this User's Guide.

+ + + Representation + +

To handle Unicode characters in Erlang, a common representation + in both lists and binaries is needed. EEP (10) and the subsequent + initial implementation in Erlang/OTP R13A settled a standard + representation of Unicode characters in Erlang.

+
+ Manipulation + +

The Unicode characters need to be processed by the Erlang + program, which is why library functions must be able to handle + them. In some cases functionality has been added to already + existing interfaces (as the string module now can + handle lists with any code points). In some cases new + functionality or options have been added (as in the io module, the file + handling, the unicode module, and + the bit syntax). Today most modules in Kernel and + STDLIB, as well as the VM are Unicode-aware.

+
+ File I/O + +

I/O is by far the most problematic area for Unicode. A file is an + entity where bytes are stored, and the lore of programming has been + to treat characters and bytes as interchangeable. With Unicode + characters, you must decide on an encoding when you want to store + the data in a file. In Erlang, you can open a text file with an + encoding option, so that you can read characters from it rather than + bytes, but you can also open a file for bytewise I/O.

+

The Erlang I/O-system has been designed (or at least used) in a way + where you expect any I/O server to handle any string data. + That is, however, no longer the case when working with Unicode + characters. The Erlang programmer must now know the + capabilities of the device where the data ends up. Also, ports in + Erlang are byte-oriented, so an arbitrary string of (Unicode) + characters cannot be sent to a port without first converting it to an + encoding of choice.

+
+ Terminal I/O + +

Terminal I/O is slightly easier than file I/O. The output is meant + for human reading and is usually Erlang syntax (for example, in the + shell). There exists syntactic representation of any Unicode + character without displaying the glyph (instead written as + \x{HHH}). Unicode data can therefore usually be + displayed even if the terminal as such does not support the whole + Unicode range.

+
+ Filenames + +

Filenames can be stored as Unicode strings in different ways + depending on the underlying operating system and file system. This + can be handled fairly easy by a program. The problems arise when the + file system is inconsistent in its encodings. For example, Linux + allows files to be named with any sequence of bytes, leaving to each + program to interpret those bytes. On systems where these + "transparent" filenames are used, Erlang must be informed about the + filename encoding by a startup flag. The default is bytewise + interpretation, which is usually wrong, but allows for interpretation + of all filenames.

+

The concept of "raw filenames" can be used to handle wrongly encoded + filenames if one enables Unicode filename translation (+fnu) + on platforms where this is not the default.

+
+ Source code encoding + +

The Erlang source code has support for the UTF-8 encoding + and bytewise encoding. The default in Erlang/OTP R16B was bytewise + (latin1) encoding. It was changed to UTF-8 in Erlang/OTP 17.0. + You can control the encoding by a comment like the following in the + beginning of the file:

+ +%% -*- coding: utf-8 -*- +

This of course requires your editor to support UTF-8 as well. The + same comment is also interpreted by functions like + file:consult/1, + the release handler, and so on, so that you can have all text files + in your source directories in UTF-8 encoding.

+
+ The language + +

Having the source code in UTF-8 also allows you to write string + literals containing Unicode characters with code points > 255, + although atoms, module names, and function names are restricted to + the ISO Latin-1 range. Binary literals, where you use type + /utf8, can also be expressed using Unicode characters > 255. + Having module names using characters other than 7-bit ASCII can cause + trouble on operating systems with inconsistent file naming schemes, + and can hurt portability, so it is not recommended.

+

EEP 40 suggests that the language is also to allow for Unicode + characters > 255 in variable names. Whether to implement that EEP + is yet to be decided.

+
+
+
+ +
+ Standard Unicode Representation +

In Erlang, strings are lists of integers. A string was until + Erlang/OTP R13 defined to be encoded in the ISO Latin-1 (ISO 8859-1) + character set, which is, code point by code point, a subrange of the + Unicode character set.

+ +

The standard list encoding for strings was therefore easily extended to + handle the whole Unicode range. A Unicode string in Erlang is a list + containing integers, where each integer is a valid Unicode code point and + represents one character in the Unicode character set.

+ +

Erlang strings in ISO Latin-1 are a subset of Unicode strings.

+ +

Only if a string contains code points < 256, can it be directly + converted to a binary by using, for example, + erlang:iolist_to_binary/1 + or can be sent directly to a port. If the string contains Unicode + characters > 255, an encoding must be decided upon and the string is to + be converted to a binary in the preferred encoding using + unicode:characters_to_binary/1,2,3. + Strings are not generally lists of bytes, as they were before + Erlang/OTP R13, they are lists of characters. Characters are not + generally bytes, they are Unicode code points.

+ +

Binaries are more troublesome. For performance reasons, programs often + store textual data in binaries instead of lists, mainly because they are + more compact (one byte per character instead of two words per character, + as is the case with lists). Using + erlang:list_to_binary/1, + an ISO Latin-1 Erlang string can be converted into a binary, effectively + using bytewise encoding: one byte per character. This was convenient for + those limited Erlang strings, but cannot be done for arbitrary Unicode + lists.

+ +

As the UTF-8 encoding is widely spread and provides some backward + compatibility in the 7-bit ASCII range, it is selected as the standard + encoding for Unicode characters in binaries for Erlang.

+ +

The standard binary encoding is used whenever a library function in + Erlang is to handle Unicode data in binaries, but is of course not + enforced when communicating externally. Functions and bit syntax exist to + encode and decode both UTF-8, UTF-16, and UTF-32 in binaries. However, + library functions dealing with binaries and Unicode in general only deal + with the default encoding.

+ +

Character data can be combined from many sources, sometimes available in + a mix of strings and binaries. Erlang has for long had the concept of + iodata or iolists, where binaries and lists can be combined + to represent a sequence of bytes. In the same way, the Unicode-aware + modules often allow for combinations of binaries and lists, where the + binaries have characters encoded in UTF-8 and the lists contain such + binaries or numbers representing Unicode code points:

+ + unicode_binary() = binary() with characters encoded in UTF-8 coding standard chardata() = charlist() | unicode_binary() charlist() = maybe_improper_list(char() | unicode_binary() | charlist(), - unicode_binary() | nil()) -

The module unicode in STDLIB even - supports similar mixes with binaries containing other encodings than - UTF-8, but that is a special case to allow for conversions to and - from external data:

- -external_unicode_binary() = binary() with characters coded in - a user specified Unicode encoding other than UTF-8 (UTF-16 or UTF-32) + unicode_binary() | nil()) + +

The module unicode + even supports similar mixes with binaries containing other encodings than + UTF-8, but that is a special case to allow for conversions to and from + external data:

+ + +external_unicode_binary() = binary() with characters coded in a user-specified + Unicode encoding other than UTF-8 (UTF-16 or UTF-32) external_chardata() = external_charlist() | external_unicode_binary() -external_charlist() = maybe_improper_list(char() | - external_unicode_binary() | - external_charlist(), - external_unicode_binary() | nil()) -
-
- Basic Language Support -

As of Erlang/OTP R16 Erlang - source files can be written in either UTF-8 or bytewise encoding - (a.k.a. latin1 encoding). The details on how to state the encoding - of an Erlang source file can be found in - epp(3). Strings and comments - can be written using Unicode, but functions still have to be named - using characters from the ISO-latin-1 character set and atoms are - restricted to the same ISO-latin-1 range. These restrictions in the - language are of course independent of the encoding of the source - file.

+external_charlist() = maybe_improper_list(char() | external_unicode_binary() | + external_charlist(), external_unicode_binary() | nil())
+
+
- Bit-syntax -

The bit-syntax contains types for coping with binary data in the - three main encodings. The types are named utf8, utf16 - and utf32 respectively. The utf16 and utf32 types - can be in a big- or little-endian variant:

- + Basic Language Support +

As from Erlang/OTP R16, Erlang source + files can be written in UTF-8 or bytewise (latin1) encoding. For + information about how to state the encoding of an Erlang source file, see + the epp(3) module. + Strings and comments can be written using Unicode, but functions must + still be named using characters from the ISO Latin-1 character set, and + atoms are restricted to the same ISO Latin-1 range. These restrictions in + the language are of course independent of the encoding of the source + file.

+ +
+ Bit Syntax +

The bit syntax contains types for handling binary data in the + three main encodings. The types are named utf8, utf16, + and utf32. The utf16 and utf32 types can be in a + big-endian or a little-endian variant:

+ + <<Ch/utf8,_/binary>> = Bin1, <<Ch/utf16-little,_/binary>> = Bin2, Bin3 = <<$H/utf32-little, $e/utf32-little, $l/utf32-little, $l/utf32-little, $o/utf32-little>>, -

For convenience, literal strings can be encoded with a Unicode - encoding in binaries using the following (or similar) syntax:

- + +

For convenience, literal strings can be encoded with a Unicode + encoding in binaries using the following (or similar) syntax:

+ + Bin4 = <<"Hello"/utf16>>, -
-
- String and Character Literals -

For source code, there is an extension to the \OOO - (backslash followed by three octal numbers) and \xHH - (backslash followed by x, followed by two hexadecimal - characters) syntax, namely \x{H ...} (a backslash - followed by an x, followed by left curly bracket, any - number of hexadecimal digits and a terminating right curly - bracket). This allows for entering characters of any code point - literally in a string even when the encoding of the source file is - bytewise (latin1).

-

In the shell, if using a Unicode input device, or in source - code stored in UTF-8, $ can be followed directly by a - Unicode character producing an integer. In the following example - the code point of a Cyrillic с is output:

-
+    
+ +
+ String and Character Literals +

For source code, there is an extension to syntax \OOO + (backslash followed by three octal numbers) and \xHH (backslash + followed by x, followed by two hexadecimal characters), namely + \x{H ...} (backslash followed by x, followed by + left curly bracket, any number of hexadecimal digits, and a terminating + right curly bracket). This allows for entering characters of any code + point literally in a string even when the encoding of the source file + is bytewise (latin1).

+ +

In the shell, if using a Unicode input device, or in source code + stored in UTF-8, $ can be followed directly by a Unicode + character producing an integer. In the following example, the code + point of a Cyrillic с is output:

+ +
 7> $с.
 1089
-
-
- Heuristic String Detection -

In certain output functions and in the output of return values - in the shell, Erlang tries to heuristically detect string data in - lists and binaries. Typically you will see heuristic detection in - a situation like this:

-
+    
+ +
+ Heuristic String Detection +

In certain output functions and in the output of return values in + the shell, Erlang tries to detect string data in lists and binaries + heuristically. Typically you will see heuristic detection in a + situation like this:

+ +
 1> [97,98,99].
 "abc"
 2> <<97,98,99>>.
 <<"abc">>    
 3> <<195,165,195,164,195,182>>.
 <<"åäö"/utf8>>
-

Here the shell will detect lists containing printable - characters or binaries containing printable characters either in - bytewise or UTF-8 encoding. The question here is: what is a - printable character? One view would be that anything the Unicode - standard thinks is printable, will also be printable according to - the heuristic detection. The result would be that almost any list - of integers will be deemed a string, resulting in all sorts of - characters being printed, maybe even characters your terminal does - not have in its font set (resulting in some generic output you - probably will not appreciate). Another way is to keep it backwards - compatible so that only the ISO-Latin-1 character set is used to - detect a string. A third way would be to let the user decide - exactly what Unicode ranges are to be viewed as characters. Since - Erlang/OTP R16B you can select either the whole Unicode range or the - ISO-Latin-1 range by supplying the startup flag +pc - Range, where Range is either latin1 or - unicode. For backwards compatibility, the default is - latin1. This only controls how heuristic string detection - is done. In the future, more ranges are expected to be added, so - that one can tailor the heuristics to the language and region - relevant to the user.

-

Lets look at an example with the two different startup options:

-
+
+      

Here the shell detects lists containing printable characters or + binaries containing printable characters in bytewise or UTF-8 encoding. + But what is a printable character? One view is that anything the Unicode + standard thinks is printable, is also printable according to the + heuristic detection. The result is then that almost any list of + integers are deemed a string, and all sorts of characters are printed, + maybe also characters that your terminal lacks in its font set + (resulting in some unappreciated generic output). + Another way is to keep it backward compatible so that only the ISO + Latin-1 character set is used to detect a string. A third way is to let + the user decide exactly what Unicode ranges that are to be viewed as + characters.

+ +

As from Erlang/OTP R16B you can select the ISO Latin-1 range or the + whole Unicode range by supplying startup flag +pc latin1 or + +pc unicode, respectively. For backward compatibility, + latin1 is default. This only controls how heuristic string + detection is done. More ranges are expected to be added in the future, + enabling tailoring of the heuristics to the language and region + relevant to the user.

+ +

The following examples show the two startup options:

+ +
 $ erl +pc latin1
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 
@@ -467,9 +535,9 @@ Eshell V5.10.1  (abort with ^G)
 4> <<208,174,208,189,208,184,208,186,208,190,208,180>>.
 <<208,174,208,189,208,184,208,186,208,190,208,180>>
 5> <<229/utf8,228/utf8,246/utf8>>.
-<<"åäö"/utf8>>
-
-
+<<"åäö"/utf8>>
+ +
 $ erl +pc unicode
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 
@@ -483,78 +551,88 @@ Eshell V5.10.1  (abort with ^G)
 4> <<208,174,208,189,208,184,208,186,208,190,208,180>>.
 <<"Юникод"/utf8>>
 5> <<229/utf8,228/utf8,246/utf8>>.
-<<"åäö"/utf8>>
-
-

In the examples, we can see that the default Erlang shell will - only interpret characters from the ISO-Latin1 range as printable - and will only detect lists or binaries with those "printable" - characters as containing string data. The valid UTF-8 binary - containing "Юникод", will not be printed as a string. When, on the - other hand, started with all Unicode characters printable (+pc - unicode), the shell will output anything containing printable - Unicode data (in binaries either UTF-8 or bytewise encoded) as - string data.

- -

These heuristics are also used by - io(_lib):format/2 and friends when the - t modifier is used in conjunction with ~p or - ~P:

-
+<<"åäö"/utf8>>
+ +

In the examples, you can see that the default Erlang shell interprets + only characters from the ISO Latin1 range as printable and only detects + lists or binaries with those "printable" characters as containing + string data. The valid UTF-8 binary containing the Russian word + "Юникод", is not printed as a string. When started with all Unicode + characters printable (+pc unicode), the shell outputs anything + containing printable Unicode data (in binaries, either UTF-8 or + bytewise encoded) as string data.

+ +

These heuristics are also used by + io:format/2, + io_lib:format/2, + and friends when modifier t is used with ~p or + ~P:

+ +
 $ erl +pc latin1
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 
 Eshell V5.10.1  (abort with ^G)  
 1> io:format("~tp~n",[{<<"åäö">>, <<"åäö"/utf8>>, <<208,174,208,189,208,184,208,186,208,190,208,180>>}]).
 {<<"åäö">>,<<"åäö"/utf8>>,<<208,174,208,189,208,184,208,186,208,190,208,180>>}
-ok
-
-
+ok
+ +
 $ erl +pc unicode
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 
 Eshell V5.10.1  (abort with ^G)  
 1> io:format("~tp~n",[{<<"åäö">>, <<"åäö"/utf8>>, <<208,174,208,189,208,184,208,186,208,190,208,180>>}]).
 {<<"åäö">>,<<"åäö"/utf8>>,<<"Юникод"/utf8>>}
-ok
-
-

Please observe that this only affects heuristic interpretation - of lists and binaries on output. For example the ~ts format - sequence does always output a valid lists of characters, - regardless of the +pc setting, as the programmer has - explicitly requested string output.

+ok
+ +

Notice that this only affects heuristic interpretation of + lists and binaries on output. For example, the ~ts format + sequence always outputs a valid list of characters, regardless of the + +pc setting, as the programmer has explicitly requested string + output.

+
-
-
- The Interactive Shell -

The interactive Erlang shell, when started towards a terminal or - started using the werl command on windows, can support - Unicode input and output.

-

On Windows, proper operation requires that a suitable font - is installed and selected for the Erlang application to use. If no - suitable font is available on your system, try installing the DejaVu - fonts (dejavu-fonts.org), which are freely available and then - select that font in the Erlang shell application.

-

On Unix-like operating systems, the terminal should be able - to handle UTF-8 on input and output (modern versions of XTerm, KDE - konsole and the Gnome terminal do for example) and your locale - settings have to be proper. As an example, my LANG - environment variable is set as this:

-
+
+  
+ The Interactive Shell +

The interactive Erlang shell, when started to a terminal or started + using command werl on Windows, can support Unicode input and + output.

+ +

On Windows, proper operation requires that a suitable font is + installed and selected for the Erlang application to use. If no suitable + font is available on your system, try installing the + DejaVu fonts, which are freely + available, and then select that font in the Erlang shell application.

+ +

On Unix-like operating systems, the terminal is to be able to handle + UTF-8 on input and output (this is done by, for example, modern versions + of XTerm, KDE Konsole, and the Gnome terminal) + and your locale settings must be proper. As + an example, a LANG environment variable can be set as follows:

+ +
 $ echo $LANG
 en_US.UTF-8
-

Actually, most systems handle the LC_CTYPE variable before - LANG, so if that is set, it has to be set to - UTF-8:

-
+
+    

Most systems handle variable LC_CTYPE before LANG, so if + that is set, it must be set to UTF-8:

+ +
 $ echo $LC_CTYPE
 en_US.UTF-8
-

The LANG or LC_CTYPE setting should be consistent - with what the terminal is capable of, there is no portable way for - Erlang to ask the actual terminal about its UTF-8 capacity, we have - to rely on the language and character type settings.

-

To investigate what Erlang thinks about the terminal, the - io:getopts() call can be used when the shell is started:

-
+
+    

The LANG or LC_CTYPE setting are to be consistent with + what the terminal is capable of. There is no portable way for Erlang to + ask the terminal about its UTF-8 capacity, we have to rely on the + language and character type settings.

+ +

To investigate what Erlang thinks about the terminal, the call + io:getopts() + can be used when the shell is started:

+ +
 $ LC_CTYPE=en_US.ISO-8859-1 erl
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 
@@ -571,27 +649,31 @@ Eshell V5.10.1  (abort with ^G)
 {encoding,unicode}
 2>
-

When (finally?) everything is in order with the locale settings, - fonts and the terminal emulator, you probably also have discovered a - way to input characters in the script you desire. For testing, the - simplest way is to add some keyboard mappings for other languages, - usually done with some applet in your desktop environment. In my KDE - environment, I start the KDE Control Center (Personal Settings), - select "Regional and Accessibility" and then "Keyboard Layout". On - Windows XP, I start Control Panel->Regional and Language - Options, select the Language tab and click the Details... button in - the square named "Text services and input Languages". Your - environment probably provides similar means of changing the keyboard - layout. Make sure you have a way to easily switch back and forth - between keyboards if you are not used to this, entering commands - using a Cyrillic character set is, as an example, not easily done in - the Erlang shell.

- -

Now you are set up for some Unicode input and output. The - simplest thing to do is of course to enter a string in the - shell:

- -
+    

When (finally?) everything is in order with the locale settings, fonts. + and the terminal emulator, you have probably found a way to input + characters in the script you desire. For testing, the simplest way is to + add some keyboard mappings for other languages, usually done with some + applet in your desktop environment.

+ +

In a KDE environment, select KDE Control Center (Personal + Settings) > Regional and Accessibility > Keyboard + Layout.

+ +

On Windows XP, select Control Panel > Regional and Language + Options, select tab Language, and click button + Details... in the square named Text Services and Input + Languages.

+ +

Your environment + probably provides similar means of changing the keyboard layout. Ensure + that you have a way to switch back and forth between keyboards easily if + you are not used to this. For example, entering commands using a Cyrillic + character set is not easily done in the Erlang shell.

+ +

Now you are set up for some Unicode input and output. The simplest thing + to do is to enter a string in the shell:

+ +
 $ erl
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 
@@ -603,12 +685,13 @@ Eshell V5.10.1  (abort with ^G)
 3> io:format("~ts~n", [v(2)]).
 Юникод
 ok
-4> 
-

While strings can be input as Unicode characters, the language - elements are still limited to the ISO-latin-1 character set. Only - character constants and strings are allowed to be beyond that - range:

-
+4>
+ +

While strings can be input as Unicode characters, the language elements + are still limited to the ISO Latin-1 character set. Only character + constants and strings are allowed to be beyond that range:

+ +
 $ erl
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 
@@ -618,371 +701,398 @@ Eshell V5.10.1  (abort with ^G)
 2> Юникод.
 * 1: illegal character
 2> 
-
-
- Unicode File Names - -

Most modern operating systems support Unicode file names in some - way or another. There are several different ways to do this and - Erlang by default treats the different approaches differently:

- - Mandatory Unicode file naming - -

Windows and, for most common uses, MacOS X enforces Unicode - support for file names. All files created in the file system have - names that can consistently be interpreted. In MacOS X, all file - names are retrieved in UTF-8 encoding, while Windows has - selected an approach where each system call handling file names - has a special Unicode aware variant, giving much the same - effect. There are no file names on these systems that are not - Unicode file names, why the default behavior of the Erlang VM is - to work in "Unicode file name translation mode", - meaning that a file name can be given as a Unicode list and that - will be automatically translated to the proper name encoding for - the underlying operating and file system.

-

Doing i.e. a file:list_dir/1 on one of these systems - may return Unicode lists with code points beyond 255, depending - on the content of the actual file system.

-

As the feature is fairly new, you may still stumble upon non - core applications that cannot handle being provided with file - names containing characters with code points larger than 255, but - the core Erlang system should have no problems with Unicode file - names.

-
- Transparent file naming - -

Most Unix operating systems have adopted a simpler approach, - namely that Unicode file naming is not enforced, but by - convention. Those systems usually use UTF-8 encoding for Unicode - file names, but do not enforce it. On such a system, a file name - containing characters having code points between 128 and 255 may - be named either as plain ISO-latin-1 or using UTF-8 encoding. As - no consistency is enforced, the Erlang VM can do no consistent - translation of all file names.

- -

By default on such systems, Erlang starts in utf8 file - name mode if the terminal supports UTF-8, otherwise in - latin1 mode.

- -

In the latin1 mode, file names are bytewise endcoded. - This allows for list representation of all file names in - the system, but, for example, a file named "Östersund.txt", will - appear in file:list_dir/1 as either "Östersund.txt" (if - the file name was encoded in bytewise ISO-Latin-1 by the program - creating the file, or more probably as - [195,150,115,116,101,114,115,117,110,100], which is a - list containing UTF-8 bytes - not what you would want... If you - on the other hand use Unicode file name translation on such a - system, non-UTF-8 file names will simply be ignored by functions - like file:list_dir/1. They can be retrieved with - file:list_dir_all/1, but wrongly encoded file names will - appear as "raw file names".

- -
-
- -

The Unicode file naming support was introduced with Erlang/OTP - R14B01. A VM operating in Unicode file name translation mode can - work with files having names in any language or character set (as - long as it is supported by the underlying OS and file system). The - Unicode character list is used to denote file or directory names and - if the file system content is listed, you will also get - Unicode lists as return value. The support lies in the Kernel and - STDLIB modules, why most applications (that does not explicitly - require the file names to be in the ISO-latin-1 range) will benefit - from the Unicode support without change.

- -

On operating systems with mandatory Unicode file names, this - means that you more easily conform to the file names of other (non - Erlang) applications, and you can also process file names that, at - least on Windows, were completely inaccessible (due to having names - that could not be represented in ISO-latin-1). Also you will avoid - creating incomprehensible file names on MacOS X as the vfs layer of - the OS will accept all your file names as UTF-8 and will not rewrite - them.

- -

For most systems, turning on Unicode file name translation is no - problem even if it uses transparent file naming. Very few systems - have mixed file name encodings. A consistent UTF-8 named system will - work perfectly in Unicode file name mode. It was still however - considered experimental in Erlang/OTP R14B01 and is still not the default on - such systems. Unicode file name translation is turned on with the - +fnu switch to the On Linux, a VM started without explicitly - stating the file name translation mode will default to latin1 - as the native file name encoding. On Windows and MacOS X, the - default behavior is that of Unicode file name translation, why the - file:native_name_encoding/0 by default returns utf8 on - those systems (the fact that Windows actually does not use UTF-8 on - the file system level can safely be ignored by the Erlang - programmer). The default behavior can, as stated before, be - changed using the +fnu or +fnl options to the VM, see - the erl program. If the - VM is started in Unicode file name translation mode, - file:native_name_encoding/0 will return the atom - utf8. The +fnu switch can be followed by w, - i or e, to control how wrongly encoded file names are - to be reported. w means that a warning is sent to the - error_logger whenever a wrongly encoded file name is - "skipped" in directory listings, i means that those wrongly - encoded file names are silently ignored and e means that the - API function will return an error whenever a wrongly encoded file - (or directory) name is encountered. w is the default. Note - that file:read_link/1 will always return an error if the link - points to an invalid file name.

- -

In Unicode file name mode, file names given to the BIF - open_port/2 with the option {spawn_executable,...} are - also interpreted as Unicode. So is the parameter list given in the - args option available when using spawn_executable. The - UTF-8 translation of arguments can be avoided using binaries, see - the discussion about raw file names below.

- -

It is worth noting that the file encoding options given - when opening a file has nothing to do with the file name - encoding convention. You can very well open files containing data - encoded in UTF-8 but having file names in bytewise (latin1) encoding - or vice versa.

- -

Erlang drivers and NIF shared objects still can not be - named with names containing code points beyond 127. This is a known - limitation to be removed in a future release. Erlang modules however - can, but it is definitely not a good idea and is still considered - experimental.

- -
- Notes About Raw File Names - -

Raw file names were introduced together with Unicode file name - support in erts-5.8.2 (Erlang/OTP R14B01). The reason "raw file - names" was introduced in the system was to be able to - consistently represent file names given in different encodings on - the same system. Having the VM automatically translate a file name - that is not in UTF-8 to a list of Unicode characters might seem - practical, but this would open up for both duplicate file names and - other inconsistent behavior. Consider a directory containing a file - named "björn" in ISO-latin-1, while the Erlang VM is - operating in Unicode file name mode (and therefore expecting UTF-8 - file naming). The ISO-latin-1 name is not valid UTF-8 and one could - be tempted to think that automatic conversion in for example - file:list_dir/1 is a good idea. But what would happen if we - later tried to open the file and have the name as a Unicode list - (magically converted from the ISO-latin-1 file name)? The VM will - convert the file name given to UTF-8, as this is the encoding - expected. Effectively this means trying to open the file named - <<"björn"/utf8>>. This file does not exist, - and even if it existed it would not be the same file as the one that - was listed. We could even create two files named "björn", - one named in the UTF-8 encoding and one not. If - file:list_dir/1 would automatically convert the ISO-latin-1 - file name to a list, we would get two identical file names as the - result. To avoid this, we need to differentiate between file names - being properly encoded according to the Unicode file naming - convention (i.e. UTF-8) and file names being invalid under the - encoding. By the common file:list_dir/1 function, the wrongly - encoded file names are simply ignored in Unicode file name - translation mode, but by the file:list_dir_all/1 function, - the file names with invalid encoding are returned as "raw" - file names, i.e. as binaries.

- -

The Erlang file module accepts raw file names as - input. open_port({spawn_executable, ...} ...) also accepts - them. As mentioned earlier, the arguments given in the option list - to open_port({spawn_executable, ...} ...) undergo the same - conversion as the file names, meaning that the executable will be - provided with arguments in UTF-8 as well. This translation is - avoided consistently with how the file names are treated, by giving - the argument as a binary.

- -

To force Unicode file name translation mode on systems where this - is not the default was considered experimental in Erlang/OTP R14B01 due to - the fact that the initial implementation did not ignore wrongly - encoded file names, so that raw file names could spread unexpectedly - throughout the system. Beginning with Erlang/OTP R16B, the wrongly encoded file - names are only retrieved by special functions - (e.g. file:list_dir_all/1), so the impact on existing code is - much lower, why it is now supported. Unicode file name translation - is expected to be default in future releases.

- -

Even if you are operating without Unicode file naming translation - automatically done by the VM, you can access and create files with - names in UTF-8 encoding by using raw file names encoded as - UTF-8. Enforcing the UTF-8 encoding regardless of the mode the - Erlang VM is started in might, in some circumstances be a good idea, - as the convention of using UTF-8 file names is spreading.

-
-
- Notes About MacOS X -

MacOS X's vfs layer enforces UTF-8 file names in a quite - aggressive way. Older versions did this by simply refusing to create - non UTF-8 conforming file names, while newer versions replace - offending bytes with the sequence "%HH", where HH is the - original character in hexadecimal notation. As Unicode translation - is enabled by default on MacOS X, the only way to come up against - this is to either start the VM with the +fnl flag or to use a - raw file name in bytewise (latin1) encoding. If using a raw - filename, with a bytewise encoding containing characters between 127 - and 255, to create a file, the file can not be opened using the same - name as the one used to create it. There is no remedy for this - behaviour, other than keeping the file names in the right - encoding.

- -

MacOS X also reorganizes the names of files so that the - representation of accents etc is using the "combining characters", - i.e. the character ö is represented as the code points - [111,776], where 111 is the character o and 776 is the - special accent character "combining diaeresis". This way of - normalizing Unicode is otherwise very seldom used and Erlang - normalizes those file names in the opposite way upon retrieval, so - that file names using combining accents are not passed up to the - Erlang application. In Erlang the file name "björn" is - retrieved as [98,106,246,114,110], not as [98,106,117,776,114,110], - even though the file system might think differently. The - normalization into combining accents are redone when actually - accessing files, so this can usually be ignored by the Erlang - programmer.

-
-
-
- Unicode in Environment and Parameters - -

Environment variables and their interpretation is handled much in - the same way as file names. If Unicode file names are enabled, - environment variables as well as parameters to the Erlang VM are - expected to be in Unicode.

-

If Unicode file names are enabled, the calls to - os:getenv/0, - os:getenv/1, - os:putenv/2 and - os:unsetenv/1 - will handle Unicode strings. On Unix-like platforms, the built-in - functions will translate environment variables in UTF-8 to/from - Unicode strings, possibly with code points > 255. On Windows the - Unicode versions of the environment system API will be used, also - allowing for code points > 255.

-

On Unix-like operating systems, parameters are expected to be - UTF-8 without translation if Unicode file names are enabled.

-
-
- Unicode-aware Modules -

Most of the modules in Erlang/OTP are of course Unicode-unaware - in the sense that they have no notion of Unicode and really should - not have. Typically they handle non-textual or byte-oriented data - (like gen_tcp etc).

-

Modules that actually handle textual data (like io_lib, - string etc) are sometimes subject to conversion or extension - to be able to handle Unicode characters.

-

Fortunately, most textual data has been stored in lists and range - checking has been sparse, why modules like string works well - for Unicode lists with little need for conversion or extension.

-

Some modules are however changed to be explicitly - Unicode-aware. These modules include:

- - unicode - -

The module unicode - is obviously Unicode-aware. It contains functions for conversion - between different Unicode formats as well as some utilities for - identifying byte order marks. Few programs handling Unicode data - will survive without this module.

-
- io - -

The io module has been - extended along with the actual I/O-protocol to handle Unicode - data. This means that several functions require binaries to be - in UTF-8 and there are modifiers to formatting control sequences - to allow for outputting of Unicode strings.

-
- file, group, user - -

I/O-servers throughout the system are able to handle - Unicode data and has options for converting data upon actual - output or input to/from the device. As shown earlier, the - shell has support for - Unicode terminals and the file module allows for - translation to and from various Unicode formats on disk.

-

The actual reading and writing of files with Unicode data is - however not best done with the file module as its - interface is byte oriented. A file opened with a Unicode - encoding (like UTF-8), is then best read or written using the - io module.

-
- re - -

The re module allows - for matching Unicode strings as a special option. As the library - is actually centered on matching in binaries, the Unicode - support is UTF-8-centered.

-
- wx - -

The wx graphical library - has extensive support for Unicode text

-
-
-

The module string works perfectly for - Unicode strings as well as for ISO-latin-1 strings with the - exception of the language-dependent to_upper and - to_lower - functions, which are only correct for the ISO-latin-1 character - set. Actually they can never function correctly for Unicode - characters in their current form, as there are language and locale - issues as well as multi-character mappings to consider when - converting text between cases. Converting case in an international - environment is a big subject not yet addressed in OTP.

-
-
- Unicode Data in Files -

The fact that Erlang as such can handle Unicode data in many forms - does not automatically mean that the content of any file can be - Unicode text. The external entities such as ports or I/O-servers are - not generally Unicode capable.

-

Ports are always byte oriented, so before sending data that you - are not sure is bytewise encoded to a port, make sure to encode it - in a proper Unicode encoding. Sometimes this will mean that only - part of the data shall be encoded as e.g. UTF-8, some parts may be - binary data (like a length indicator) or something else that shall - not undergo character encoding, so no automatic translation is - present.

-

I/O-servers behave a little differently. The I/O-servers connected - to terminals (or stdout) can usually cope with Unicode data - regardless of the encoding option. This is convenient when - one expects a modern environment but do not want to crash when - writing to a archaic terminal or pipe. Files on the other hand are - more picky. A file can have an encoding option which makes it - generally usable by the io-module (e.g. {encoding,utf8}), but - is by default opened as a byte oriented file. The file module is byte oriented, why only - ISO-Latin-1 characters can be written using that module. The - io module is the one to use if - Unicode data is to be output to a file with other encoding - than latin1 (a.k.a. bytewise encoding). It is slightly - confusing that a file opened with - e.g. file:open(Name,[read,{encoding,utf8}]), cannot be - properly read using file:read(File,N) but you have to use the - io module to retrieve the Unicode data from it. The reason is - that file:read and file:write (and friends) are purely - byte oriented, and should so be, as that is the way to access - files other than text files - byte by byte. Just as with ports, you - can of course write encoded data into a file by "manually" converting - the data to the encoding of choice (using the unicode module or the bit syntax) - and then output it on a bytewise encoded (latin1) file.

-

The rule of thumb is that the file module should be used for files - opened for bytewise access ({encoding,latin1}) and the - io module should be used when - accessing files with any other encoding - (e.g. {encoding,uf8}).

- -

Functions reading Erlang syntax from files generally recognize - the coding: comment and can therefore handle Unicode data on - input. When writing Erlang Terms to a file, you should insert - such comments when applicable:

-
+  
+ +
+ Unicode Filenames + +

Most modern operating systems support Unicode filenames in some way. + There are many different ways to do this and Erlang by default treats the + different approaches differently:

+ + + Mandatory Unicode file naming + +

Windows and, for most common uses, MacOS X enforce Unicode support + for filenames. All files created in the file system have names that + can consistently be interpreted. In MacOS X, all filenames are + retrieved in UTF-8 encoding. In Windows, each system call handling + filenames has a special Unicode-aware variant, giving much the same + effect. There are no filenames on these systems that are not Unicode + filenames. So, the default behavior of the Erlang VM is to work in + "Unicode filename translation mode". This means that a + filename can be specified as a Unicode list, which is automatically + translated to the proper name encoding for the underlying operating + system and file system.

+

Doing, for example, a + file:list_dir/1 + on one of these systems can return Unicode lists with code points + > 255, depending on the content of the file system.

+
+ Transparent file naming + +

Most Unix operating systems have adopted a simpler approach, namely + that Unicode file naming is not enforced, but by convention. Those + systems usually use UTF-8 encoding for Unicode filenames, but do not + enforce it. On such a system, a filename containing characters with + code points from 128 through 255 can be named as plain ISO Latin-1 or + use UTF-8 encoding. As no consistency is enforced, the Erlang VM + cannot do consistent translation of all filenames.

+

By default on such systems, Erlang starts in utf8 filename + mode if the terminal supports UTF-8, otherwise in latin1 + mode.

+

In latin1 mode, filenames are bytewise encoded. This allows + for list representation of all filenames in the system. However, a + a file named "Östersund.txt", appears in + file:list_dir/1 + either as "Östersund.txt" (if the filename was encoded in bytewise + ISO Latin-1 by the program creating the file) or more probably as + [195,150,115,116,101,114,115,117,110,100], which is a list + containing UTF-8 bytes (not what you want). If you use Unicode + filename translation on such a system, non-UTF-8 filenames are + ignored by functions like file:list_dir/1. They can be + retrieved with function + file:list_dir_all/1, + but wrongly encoded filenames appear as "raw filenames". +

+
+
+ +

The Unicode file naming support was introduced in Erlang/OTP + R14B01. A VM operating in Unicode filename translation mode can + work with files having names in any language or character set (as + long as it is supported by the underlying operating system and + file system). The Unicode character list is used to denote + filenames or directory names. If the file system content is + listed, you also get Unicode lists as return value. The support + lies in the Kernel and STDLIB modules, which is why + most applications (that does not explicitly require the filenames + to be in the ISO Latin-1 range) benefit from the Unicode support + without change.

+ +

On operating systems with mandatory Unicode filenames, this means that + you more easily conform to the filenames of other (non-Erlang) + applications. You can also process filenames that, at least on Windows, + were inaccessible (because of having names that could not be represented + in ISO Latin-1). Also, you avoid creating incomprehensible filenames + on MacOS X, as the vfs layer of the operating system accepts all + your filenames as UTF-8 does not rewrite them.

+ +

For most systems, turning on Unicode filename translation is no problem + even if it uses transparent file naming. Very few systems have mixed + filename encodings. A consistent UTF-8 named system works perfectly in + Unicode filename mode. It was still, however, considered experimental in + Erlang/OTP R14B01 and is still not the default on such systems.

+ +

Unicode filename translation is turned on with switch +fnu. On + Linux, a VM started without explicitly stating the filename translation + mode defaults to latin1 as the native filename encoding. On + Windows and MacOS X, the default behavior is that of Unicode filename + translation. Therefore + file:native_name_encoding/0 + by default returns utf8 on those systems (Windows does not use + UTF-8 on the file system level, but this can safely be ignored by the + Erlang programmer). The default behavior can, as stated earlier, be + changed using option +fnu or +fnl to the VM, see the + erl program. If the VM is + started in Unicode filename translation mode, + file:native_name_encoding/0 returns atom utf8. Switch + +fnu can be followed by w, i, or e to control + how wrongly encoded filenames are to be reported.

+ + + +

w means that a warning is sent to the error_logger + whenever a wrongly encoded filename is "skipped" in directory + listings. w is the default.

+
+ +

i means that wrongly encoded filenames are silently ignored. +

+
+ +

e means that the API function returns an error whenever a + wrongly encoded filename (or directory name) is encountered.

+
+
+ +

Notice that + file:read_link/1 + always returns an error if the link points to an invalid filename.

+ +

In Unicode filename mode, filenames given to BIF open_port/2 with + option {spawn_executable,...} are also interpreted as Unicode. So + is the parameter list specified in option args available when + using spawn_executable. The UTF-8 translation of arguments can be + avoided using binaries, see section + Notes About Raw Filenames. +

+ +

Notice that the file encoding options specified when opening a file has + nothing to do with the filename encoding convention. You can very well + open files containing data encoded in UTF-8, but having filenames in + bytewise (latin1) encoding or conversely.

+ +

Erlang drivers and NIF-shared objects still cannot be named with + names containing code points > 127. This limitation will be removed in + a future release. However, Erlang modules can, but it is definitely not a + good idea and is still considered experimental.

+
+ +
+ Notes About Raw Filenames + +

Raw filenames were introduced together with Unicode filename support + in ERTS 5.8.2 (Erlang/OTP R14B01). The reason "raw + filenames" were introduced in the system was + to be able to represent + filenames, specified in different encodings on the same system, + consistently. It can seem practical to have the VM automatically + translate a filename that is not in UTF-8 to a list of Unicode + characters, but this would open up for both duplicate filenames and + other inconsistent behavior.

+ +

Consider a directory containing a file named "björn" in ISO + Latin-1, while the Erlang VM is operating in Unicode filename mode (and + therefore expects UTF-8 file naming). The ISO Latin-1 name is not valid + UTF-8 and one can be tempted to think that automatic conversion in, for + example, + file:list_dir/1 + is a good idea. But what would happen if we later tried to open the file + and have the name as a Unicode list (magically converted from the ISO + Latin-1 filename)? The VM converts the filename to UTF-8, as this is + the encoding expected. Effectively this means trying to open the file + named <<"björn"/utf8>>. This file does not exist, + and even if it existed it would not be the same file as the one that was + listed. We could even create two files named "björn", one + named in UTF-8 encoding and one not. If file:list_dir/1 would + automatically convert the ISO Latin-1 filename to a list, we would get + two identical filenames as the result. To avoid this, we must + differentiate between filenames that are properly encoded according to + the Unicode file naming convention (that is, UTF-8) and filenames that + are invalid under the encoding. By the common function + file:list_dir/1, the wrongly encoded filenames are ignored in + Unicode filename translation mode, but by function + file:list_dir_all/1 + the filenames with invalid encoding are returned as "raw" + filenames, that is, as binaries.

+ +

The file module accepts raw filenames as input. + open_port({spawn_executable, ...} ...) also accepts them. As + mentioned earlier, the arguments specified in the option list to + open_port({spawn_executable, ...} ...) undergo the same + conversion as the filenames, meaning that the executable is provided + with arguments in UTF-8 as well. This translation is avoided + consistently with how the filenames are treated, by giving the argument + as a binary.

+ +

To force Unicode filename translation mode on systems where this is not + the default was considered experimental in Erlang/OTP R14B01. This was + because the initial implementation did not ignore wrongly encoded + filenames, so that raw filenames could spread unexpectedly throughout + the system. As from Erlang/OTP R16B, the wrongly encoded + filenames are only retrieved by special functions (such as + file:list_dir_all/1). Since the impact on existing code is + therefore much lower it is now supported. + Unicode filename translation is + expected to be default in future releases.

+ +

Even if you are operating without Unicode file naming translation + automatically done by the VM, you can access and create files with + names in UTF-8 encoding by using raw filenames encoded as UTF-8. + Enforcing the UTF-8 encoding regardless of the mode the Erlang VM is + started in can in some circumstances be a good idea, as the convention + of using UTF-8 filenames is spreading.

+
+ +
+ Notes About MacOS X +

The vfs layer of MacOS X enforces UTF-8 filenames in an + aggressive way. Older versions did this by refusing to create non-UTF-8 + conforming filenames, while newer versions replace offending bytes with + the sequence "%HH", where HH is the original character in + hexadecimal notation. As Unicode translation is enabled by default on + MacOS X, the only way to come up against this is to either start the VM + with flag +fnl or to use a raw filename in bytewise + (latin1) encoding. If using a raw filename, with a bytewise + encoding containing characters from 127 through 255, to create a file, + the file cannot be opened using the same name as the one used to create + it. There is no remedy for this behavior, except keeping the filenames + in the correct encoding.

+ +

MacOS X reorganizes the filenames so that the representation of + accents, and so on, uses the "combining characters". For example, + character ö is represented as code points [111,776], + where 111 is character o and 776 is the special + accent character "Combining Diaeresis". This way of normalizing Unicode + is otherwise very seldom used. Erlang normalizes those filenames in the + opposite way upon retrieval, so that filenames using combining accents + are not passed up to the Erlang application. In Erlang, filename + "björn" is retrieved as [98,106,246,114,110], not as + [98,106,117,776,114,110], although the file system can think + differently. The normalization into combining accents is redone when + accessing files, so this can usually be ignored by the Erlang + programmer.

+
+
+ +
+ Unicode in Environment and Parameters + +

Environment variables and their interpretation are handled much in the + same way as filenames. If Unicode filenames are enabled, environment + variables as well as parameters to the Erlang VM are expected to be in + Unicode.

+ +

If Unicode filenames are enabled, the calls to + os:getenv/0,1, + os:putenv/2, and + os:unsetenv/1 + handle Unicode strings. On Unix-like platforms, the built-in functions + translate environment variables in UTF-8 to/from Unicode strings, possibly + with code points > 255. On Windows, the Unicode versions of the + environment system API are used, and code points > 255 are allowed.

+

On Unix-like operating systems, parameters are expected to be UTF-8 + without translation if Unicode filenames are enabled.

+
+ +
+ Unicode-Aware Modules +

Most of the modules in Erlang/OTP are Unicode-unaware in the sense that + they have no notion of Unicode and should not have. Typically they handle + non-textual or byte-oriented data (such as gen_tcp).

+ +

Modules handling textual data (such as + io_lib and + string are sometimes + subject to conversion or extension to be able to handle Unicode + characters.

+ +

Fortunately, most textual data has been stored in lists and range + checking has been sparse, so modules like string work well for + Unicode lists with little need for conversion or extension.

+ +

Some modules are, however, changed to be explicitly Unicode-aware. These + modules include:

+ + + unicode + +

The unicode + module is clearly Unicode-aware. It contains functions for conversion + between different Unicode formats and some utilities for identifying + byte order marks. Few programs handling Unicode data survive without + this module.

+
+ io + +

The io module has been + extended along with the actual I/O protocol to handle Unicode data. + This means that many functions require binaries to be in UTF-8, and + there are modifiers to format control sequences to allow for output + of Unicode strings.

+
+ file, group, user + +

I/O-servers throughout the system can handle Unicode data and have + options for converting data upon output or input to/from the device. + As shown earlier, the + shell module has + support for Unicode terminals and the + file module + allows for translation to and from various Unicode formats on + disk.

+

Reading and writing of files with Unicode data is, however, not best + done with the file module, as its interface is + byte-oriented. A file opened with a Unicode encoding (like UTF-8) is + best read or written using the + io module.

+
+ re + +

The re module allows + for matching Unicode strings as a special option. As the library is + centered on matching in binaries, the Unicode support is + UTF-8-centered.

+
+ wx + +

The graphical library wx + has extensive support for Unicode text.

+
+ +

The string module works + perfectly for Unicode strings and ISO Latin-1 strings, except the + language-dependent functions + string:to_upper/1 + and + string:to_lower/1, + which are only correct for the ISO Latin-1 character set. These two + functions can never function correctly for Unicode characters in their + current form, as there are language and locale issues as well as + multi-character mappings to consider when converting text between cases. + Converting case in an international environment is a large subject not + yet addressed in OTP.

+
+ +
+ Unicode Data in Files +

Although Erlang can handle Unicode data in many forms does not + automatically mean that the content of any file can be Unicode text. The + external entities, such as ports and I/O servers, are not generally + Unicode capable.

+ +

Ports are always byte-oriented, so before sending data that you are not + sure is bytewise-encoded to a port, ensure to encode it in a proper + Unicode encoding. Sometimes this means that only part of the data must + be encoded as, for example, UTF-8. Some parts can be binary data (like a + length indicator) or something else that must not undergo character + encoding, so no automatic translation is present.

+ +

I/O servers behave a little differently. The I/O servers connected to + terminals (or stdout) can usually cope with Unicode data + regardless of the encoding option. This is convenient when one expects + a modern environment but do not want to crash when writing to an archaic + terminal or pipe.

+ +

A file can have an encoding option that makes it generally usable by the + io module (for example + {encoding,utf8}), but is by default opened as a byte-oriented file. + The file module is + byte-oriented, so only ISO Latin-1 characters can be written using that + module. Use the io module if Unicode data is to be output to a + file with other encoding than latin1 (bytewise encoding). + It is slightly confusing that a file opened with, for example, + file:open(Name,[read,{encoding,utf8}]) cannot be properly read + using file:read(File,N), but using the io module to retrieve + the Unicode data from it. The reason is that file:read and + file:write (and friends) are purely byte-oriented, and should be, + as that is the way to access files other than text files, byte by byte. + As with ports, you can write encoded data into a file by "manually" + converting the data to the encoding of choice (using the + unicode module or the + bit syntax) and then output it on a bytewise (latin1) encoded + file.

+ +

Recommendations:

+ + +

Use the + file module for + files opened for bytewise access ({encoding,latin1}).

+
+

Use the io module + when accessing files with any other encoding (for example + {encoding,uf8}).

+
+
+ +

Functions reading Erlang syntax from files recognize the coding: + comment and can therefore handle Unicode data on input. When writing + Erlang terms to a file, you are advised to insert such comments when + applicable:

+ +
 $ erl +fna +pc unicode
 Erlang R16B (erts-5.10.1) [source]  [async-threads:0] [hipe] [kernel-poll:false]
 
@@ -990,202 +1100,224 @@ Eshell V5.10.1  (abort with ^G)
 1> file:write_file("test.term",<<"%% coding: utf-8\n[{\"Юникод\",4711}].\n"/utf8>>).
 ok
 2> file:consult("test.term").   
-{ok,[[{"Юникод",4711}]]}
-  
-
-
- Summary of Options - -

The Unicode support is controlled by both command line switches, - some standard environment variables and the version of OTP you are - using. Most options affect mainly the way Unicode data is displayed, - not the actual functionality of the API's in the standard - libraries. This means that Erlang programs usually do not - need to concern themselves with these options, they are more for the - development environment. An Erlang program can be written so that it - works well regardless of the type of system or the Unicode options - that are in effect.

- -

Here follows a summary of the settings affecting Unicode:

- - The LANG and LC_CTYPE environment variables - -

The language setting in the OS mainly affects the shell. The - terminal (i.e. the group leader) will operate with {encoding, - unicode} only if the environment tells it that UTF-8 is - allowed. This setting should correspond to the actual terminal - you are using.

-

The environment can also affect file name interpretation, if - Erlang is started with the +fna flag (which is default from - Erlang/OTP 17.0).

-

You can check the setting of this by calling - io:getopts(), which will give you an option list - containing {encoding,unicode} or - {encoding,latin1}.

-
- The +pc {unicode|latin1} flag to - erl(1) - -

This flag affects what is interpreted as string data when - doing heuristic string detection in the shell and in - io/io_lib:format with the "~tp" and - ~tP formatting instructions, as described above.

-

You can check this option by calling io:printable_range/0, - which will return unicode or latin1. To be - compatible with future (expected) extensions to the settings, - one should rather use io_lib:printable_list/1 to check if - a list is printable according to the setting. That function will - take into account new possible settings returned from - io:printable_range/0.

-
- The +fn{l|a|u} - [{w|i|e}] - flag to erl(1) - -

This flag affects how the file names are to be interpreted. On - operating systems with transparent file naming, this has to be - specified to allow for file naming in Unicode characters (and - for correct interpretation of file names containing characters - > 255.

-

+fnl means bytewise interpretation of file names, which - was the usual way to represent ISO-Latin-1 file names before - UTF-8 file naming got widespread.

-

+fnu means that file names are encoded in UTF-8, which - is nowadays the common scheme (although not enforced).

-

+fna means that you automatically select between - +fnl and +fnu, based on the LANG and - LC_CTYPE environment variables. This is optimistic - heuristics indeed, nothing enforces a user to have a terminal - with the same encoding as the file system, but usually, this is - the case. This is the default on all Unix-like operating - systems except MacOS X.

- -

The file name translation mode can be read with the - file:native_name_encoding/0 function, which returns - latin1 (meaning bytewise encoding) or utf8.

-
- - epp:default_encoding/0 - -

This function returns the default encoding for Erlang source - files (if no encoding comment is present) in the currently - running release. In Erlang/OTP R16B latin1 was returned (meaning - bytewise encoding). In Erlang/OTP 17.0 and forward it returns - utf8.

-

The encoding of each file can be specified using comments as - described in - epp(3).

-
- io:setopts/{1,2} and the -oldshell/-noshell flags. - -

When Erlang is started with -oldshell or - -noshell, the I/O-server for standard_io is default - set to bytewise encoding, while an interactive shell defaults to - what the environment variables says.

-

With the io:setopts/2 function you can set the - encoding of a file or other I/O-server. This can also be set when - opening a file. Setting the terminal (or other - standard_io server) unconditionally to the option - {encoding,utf8} will for example make UTF-8 encoded characters - being written to the device regardless of how Erlang was started or - the users environment.

-

Opening files with encoding option is convenient when - writing or reading text files in a known encoding.

-

You can retrieve the encoding setting for an I/O-server - using io:getopts().

-
-
-
-
- Recipes -

When starting with Unicode, one often stumbles over some common - issues. I try to outline some methods of dealing with Unicode data - in this section.

+{ok,[[{"Юникод",4711}]]}
+
+ +
+ Summary of Options + +

The Unicode support is controlled by both command-line switches, some + standard environment variables, and the OTP version you are using. Most + options affect mainly how Unicode data is displayed, not the + functionality of the APIs in the standard libraries. This means that + Erlang programs usually do not need to concern themselves with these + options, they are more for the development environment. An Erlang program + can be written so that it works well regardless of the type of system or + the Unicode options that are in effect.

+ +

Here follows a summary of the settings affecting Unicode:

+ + + The LANG and LC_CTYPE environment variables + +

The language setting in the operating system mainly affects the + shell. The terminal (that is, the group leader) operates with + {encoding, unicode} only if the environment tells it that + UTF-8 is allowed. This setting is to correspond to the terminal you + are using.

+

The environment can also affect filename interpretation, if Erlang + is started with flag +fna (which is default from + Erlang/OTP 17.0).

+

You can check the setting of this by calling + io:getopts(), + which gives you an option list containing {encoding,unicode} + or {encoding,latin1}.

+
+ The +pc {unicode|latin1} flag to + erl(1) + +

This flag affects what is interpreted as string data when doing + heuristic string detection in the shell and in + io/ + io_lib:format + with the "~tp" and ~tP formatting instructions, as + described earlier.

+

You can check this option by calling + io:printable_range/0, + which returns unicode or latin1. To be compatible with + future (expected) extensions to the settings, rather use + io_lib:printable_list/1 + to check if a list is printable according to the setting. That + function takes into account new possible settings returned from + io:printable_range/0.

+
+ The +fn{l|u|a} + [{w|i|e}] flag to + erl(1) + +

This flag affects how the filenames are to be interpreted. On + operating systems with transparent file naming, this must be + specified to allow for file naming in Unicode characters (and for + correct interpretation of filenames containing characters > 255). +

+ + +

+fnl means bytewise interpretation of filenames, which was + the usual way to represent ISO Latin-1 filenames before UTF-8 + file naming got widespread.

+
+ +

+fnu means that filenames are encoded in UTF-8, which is + nowadays the common scheme (although not enforced).

+
+ +

+fna means that you automatically select between + +fnl and +fnu, based on environment variables + LANG and LC_CTYPE. This is optimistic + heuristics indeed, nothing enforces a user to have a terminal with + the same encoding as the file system, but this is usually the + case. This is the default on all Unix-like operating systems, + except MacOS X.

+
+
+

The filename translation mode can be read with function + file:native_name_encoding/0, + which returns latin1 (bytewise encoding) or utf8.

+
+ epp:default_encoding/0 + +

This function returns the default encoding for Erlang source files + (if no encoding comment is present) in the currently running release. + In Erlang/OTP R16B, latin1 (bytewise encoding) was returned. + As from Erlang/OTP 17.0, utf8 is returned.

+

The encoding of each file can be specified using comments as + described in the + epp(3) module. +

+
+ io:setopts/1,2 + and flags -oldshell/-noshell + +

When Erlang is started with -oldshell or -noshell, the + I/O server for standard_io is by default set to bytewise + encoding, while an interactive shell defaults to what the + environment variables says.

+

You can set the encoding of a file or other I/O server with function + io:setopts/2. + This can also be set when opening a file. Setting the terminal (or + other standard_io server) unconditionally to option + {encoding,utf8} implies that UTF-8 encoded characters are + written to the device, regardless of how Erlang was started or the + user's environment.

+

Opening files with option encoding is convenient when + writing or reading text files in a known encoding.

+

You can retrieve the encoding setting for an I/O server with + function + io:getopts(). +

+
+
+
+
- Byte Order Marks -

A common method of identifying encoding in text-files is to put - a byte order mark (BOM) first in the file. The BOM is the - code point 16#FEFF encoded in the same way as the rest of the - file. If such a file is to be read, the first few bytes (depending - on encoding) is not part of the actual text. This code outlines - how to open a file which is believed to have a BOM and set the - files encoding and position for further sequential reading - (preferably using the io - module). Note that error handling is omitted from the code:

- + Recipes +

When starting with Unicode, one often stumbles over some common issues. + This section describes some methods of dealing with Unicode data.

+ +
+ Byte Order Marks +

A common method of identifying encoding in text files is to put a Byte + Order Mark (BOM) first in the file. The BOM is the code point 16#FEFF + encoded in the same way as the remaining file. If such a file is to be + read, the first few bytes (depending on encoding) are not part of the + text. This code outlines how to open a file that is believed to + have a BOM, and sets the files encoding and position for further + sequential reading (preferably using the + io module).

+ +

Notice that error handling is omitted from the code:

+ + open_bom_file_for_reading(File) -> {ok,F} = file:open(File,[read,binary]), {ok,Bin} = file:read(F,4), {Type,Bytes} = unicode:bom_to_encoding(Bin), file:position(F,Bytes), io:setopts(F,[{encoding,Type}]), - {ok,F}. - -

The unicode:bom_to_encoding/1 function identifies the - encoding from a binary of at least four bytes. It returns, along - with an term suitable for setting the encoding of the file, the - actual length of the BOM, so that the file position can be set - accordingly. Note that file:position/2 always works on - byte-offsets, so that the actual byte-length of the BOM is - needed.

-

To open a file for writing and putting the BOM first is even - simpler:

- + {ok,F}. + +

Function + unicode:bom_to_encoding/1 + identifies the encoding from a binary of at least four bytes. It + returns, along with a term suitable for setting the encoding of the + file, the byte length of the BOM, so that the file position can be set + accordingly. Notice that function + file:position/2 + always works on byte-offsets, so that the byte length of the BOM is + needed.

+ +

To open a file for writing and place the BOM first is even simpler:

+ + open_bom_file_for_writing(File,Encoding) -> {ok,F} = file:open(File,[write,binary]), ok = file:write(File,unicode:encoding_to_bom(Encoding)), io:setopts(F,[{encoding,Encoding}]), - {ok,F}. - -

In both cases the file is then best processed using the - io module, as the functions in io can handle code - points beyond the ISO-latin-1 range.

-
-
- Formatted I/O -

When reading and writing to Unicode-aware entities, like the - User or a file opened for Unicode translation, you will probably - want to format text strings using the functions in io or io_lib. For backward - compatibility reasons, these functions do not accept just any list - as a string, but require a special translation modifier - when working with Unicode texts. The modifier is t. When - applied to the s control character in a formatting string, - it accepts all Unicode code points and expect binaries to be in - UTF-8:

-
+    {ok,F}.
+
+      

The file is in both these cases then best processed using the + io module, as the functions + in that module can handle code points beyond the ISO Latin-1 range.

+
+ +
+ Formatted I/O +

When reading and writing to Unicode-aware entities, like a + file opened for Unicode translation, you probably want to format text + strings using the functions in the + io module or the + io_lib module. For + backward compatibility reasons, these functions do not accept any list + as a string, but require a special translation modifier when + working with Unicode texts. The modifier is t. When applied to + control character s in a formatting string, it accepts all + Unicode code points and expects binaries to be in UTF-8:

+ +
 1> io:format("~ts~n",[<<"åäö"/utf8>>]).
 åäö
 ok
 2> io:format("~s~n",[<<"åäö"/utf8>>]).
 åäö
 ok
-

Obviously the second io:format/2 gives undesired output - because the UTF-8 binary is not in latin1. For backward - compatibility, the non prefixed s control character expects - bytewise encoded ISO-latin-1 characters in binaries and lists - containing only code points < 256.

-

As long as the data is always lists, the t modifier can - be used for any string, but when binary data is involved, care - must be taken to make the right choice of formatting characters. A - bytewise encoded binary will also be interpreted as a string and - printed even when using ~ts, but it might be mistaken for a - valid UTF-8 string and one should therefore avoid using the - ~ts control if the binary contains bytewise encoded - characters and not UTF-8.

-

The function format/2 in io_lib behaves - similarly. This function is defined to return a deep list of - characters and the output could easily be converted to binary data - for outputting on a device of any kind by a simple - erlang:list_to_binary/1. When the translation modifier is - used, the list can however contain characters that cannot be - stored in one byte. The call to erlang:list_to_binary/1 - will in that case fail. However, if the I/O server you want to - communicate with is Unicode-aware, the list returned can still be - used directly:

-
+
+      

Clearly, the second io:format/2 gives undesired output, as the + UTF-8 binary is not in latin1. For backward compatibility, the + non-prefixed control character s expects bytewise-encoded ISO + Latin-1 characters in binaries and lists containing only code points + < 256.

+ +

As long as the data is always lists, modifier t can be used for + any string, but when binary data is involved, care must be taken to + make the correct choice of formatting characters. A bytewise-encoded + binary is also interpreted as a string, and printed even when using + ~ts, but it can be mistaken for a valid UTF-8 string. Avoid + therefore using the ~ts control if the binary contains + bytewise-encoded characters and not UTF-8.

+ +

Function + io_lib:format/2 + behaves similarly. It is defined to return a deep list of characters + and the output can easily be converted to binary data for outputting on + any device by a simple + erlang:list_to_binary/1. + When the translation modifier is used, the list can, however, contain + characters that cannot be stored in one byte. The call to + erlang:list_to_binary/1 then fails. However, if the I/O server + you want to communicate with is Unicode-aware, the returned list can + still be used directly:

+ +
 $ erl +pc unicode
 Erlang R16B (erts-5.10.1) [source] [async-threads:0] [hipe] [kernel-poll:false]
 
@@ -1195,55 +1327,56 @@ Eshell V5.10.1 (abort with ^G)
 2> io:put_chars(io_lib:format("~ts~n", ["Γιούνικοντ"])).
 Γιούνικοντ
 ok
-

The Unicode string is returned as a Unicode list, which is - recognized as such since the Erlang shell uses the Unicode - encoding (and is started with all Unicode characters considered - printable). The Unicode list is valid input to the io:put_chars/2 function, - so data can be output on any Unicode capable device. If the device - is a terminal, characters will be output in the \x{H - ...} format if encoding is latin1 otherwise in UTF-8 - (for the non-interactive terminal - "oldshell" or "noshell") or - whatever is suitable to show the character properly (for an - interactive terminal - the regular shell). The bottom line is that - you can always send Unicode data to the standard_io - device. Files will however only accept Unicode code points beyond - ISO-latin-1 if encoding is set to something else than - latin1.

-
-
- Heuristic Identification of UTF-8 -

While it is - strongly encouraged that the actual encoding of characters in - binary data is known prior to processing, that is not always - possible. On a typical Linux system, there is a mix of UTF-8 - and ISO-latin-1 text files and there are seldom any BOM's in the - files to identify them.

-

UTF-8 is designed in such a way that ISO-latin-1 characters - with numbers beyond the 7-bit ASCII range are seldom considered - valid when decoded as UTF-8. Therefore one can usually use - heuristics to determine if a file is in UTF-8 or if it is encoded - in ISO-latin-1 (one byte per character) encoding. The - unicode module can be used to determine if data can be - interpreted as UTF-8:

- + +

The Unicode string is returned as a Unicode list, which is recognized + as such, as the Erlang shell uses the Unicode encoding (and is started + with all Unicode characters considered printable). The Unicode list is + valid input to function + io:put_chars/2, + so data can be output on any Unicode-capable device. If the device is a + terminal, characters are output in format \x{H...} if + encoding is latin1. Otherwise in UTF-8 (for the non-interactive + terminal: "oldshell" or "noshell") or whatever is suitable to show the + character properly (for an interactive terminal: the regular shell).

+ +

So, you can always send Unicode data to the standard_io device. + Files, however, accept only Unicode code points beyond ISO Latin-1 if + encoding is set to something else than latin1.

+
+ +
+ Heuristic Identification of UTF-8 +

While it is strongly encouraged that the encoding of characters + in binary data is known before processing, that is not always possible. + On a typical Linux system, there is a mix of UTF-8 and ISO Latin-1 text + files, and there are seldom any BOMs in the files to identify them.

+ +

UTF-8 is designed so that ISO Latin-1 characters with numbers beyond + the 7-bit ASCII range are seldom considered valid when decoded as UTF-8. + Therefore one can usually use heuristics to determine if a file is in + UTF-8 or if it is encoded in ISO Latin-1 (one byte per character). + The unicode + module can be used to determine if data can be interpreted as UTF-8:

+ + heuristic_encoding_bin(Bin) when is_binary(Bin) -> case unicode:characters_to_binary(Bin,utf8,utf8) of Bin -> utf8; _ -> latin1 - end. - -

If one does not have a complete binary of the file content, one - could instead chunk through the file and check part by part. The - return-tuple {incomplete,Decoded,Rest} from - unicode:characters_to_binary/{1,2,3} comes in handy. The - incomplete rest from one chunk of data read from the file is - prepended to the next chunk and we therefore circumvent the - problem of character boundaries when reading chunks of bytes in - UTF-8 encoding:

- + end. + +

If you do not have a complete binary of the file content, you can + instead chunk through the file and check part by part. The return-tuple + {incomplete,Decoded,Rest} from function + unicode:characters_to_binary/1,2,3 + comes in handy. The incomplete rest from one chunk of data read from the + file is prepended to the next chunk and we therefore avoid the problem + of character boundaries when reading chunks of bytes in UTF-8 + encoding:

+ + heuristic_encoding_file(FileName) -> {ok,F} = file:open(FileName,[read,binary]), loop_through_file(F,<<>>,file:read(F,1024)). @@ -1260,13 +1393,14 @@ loop_through_file(F,Acc,{ok,Bin}) when is_binary(Bin) -> loop_through_file(F,Rest,file:read(F,1024)); Res when is_binary(Res) -> loop_through_file(F,<<>>,file:read(F,1024)) - end. - -

Another option is to try to read the whole file in UTF-8 - encoding and see if it fails. Here we need to read the file using - io:get_chars/3, as we have to succeed in reading characters - with a code point over 255:

- + end. + +

Another option is to try to read the whole file in UTF-8 encoding and + see if it fails. Here we need to read the file using function + io:get_chars/3, + as we have to read characters with a code point > 255:

+ + heuristic_encoding_file2(FileName) -> {ok,F} = file:open(FileName,[read,binary,{encoding,utf8}]), loop_through_file2(F,io:get_chars(F,'',1024)). @@ -1276,69 +1410,71 @@ loop_through_file2(_,eof) -> loop_through_file2(_,{error,_Err}) -> latin1; loop_through_file2(F,Bin) when is_binary(Bin) -> - loop_through_file2(F,io:get_chars(F,'',1024)). - -
-
- Lists of UTF-8 Bytes -

For various reasons, you may find yourself having a list of - UTF-8 bytes. This is not a regular string of Unicode characters as - each element in the list does not contain one character. Instead - you get the "raw" UTF-8 encoding that you have in binaries. This - is easily converted to a proper Unicode string by first converting - byte per byte into a binary and then converting the binary of - UTF-8 encoded characters back to a Unicode string:

- - utf8_list_to_string(StrangeList) -> - unicode:characters_to_list(list_to_binary(StrangeList)). - -
-
- Double UTF-8 Encoding -

When working with binaries, you may get the horrible "double - UTF-8 encoding", where strange characters are encoded in your - binaries or files that you did not expect. What you may have got, - is a UTF-8 encoded binary that is for the second time encoded as - UTF-8. A common situation is where you read a file, byte by byte, - but the actual content is already UTF-8. If you then convert the - bytes to UTF-8, using i.e. the unicode module or by - writing to a file opened with the {encoding,utf8} - option. You will have each byte in the in the input file - encoded as UTF-8, not each character of the original text (one - character may have been encoded in several bytes). There is no - real remedy for this other than being very sure of which data is - actually encoded in which format, and never convert UTF-8 data - (possibly read byte by byte from a file) into UTF-8 again.

-

The by far most common situation where this happens, is when - you get lists of UTF-8 instead of proper Unicode strings, and then - convert them to UTF-8 in a binary or on a file:

- - wrong_thing_to_do() -> - {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"), - MyList = binary_to_list(Bin), %% Wrong! It is an utf8 binary! - {ok,C} = file:open("catastrophe.txt",[write,{encoding,utf8}]), - io:put_chars(C,MyList), %% Expects a Unicode string, but get UTF-8 - %% bytes in a list! - file:close(C). %% The file catastrophe.txt contains more or less unreadable - %% garbage! - -

Make very sure you know what a binary contains before - converting it to a string. If no other option exists, try - heuristics:

- - if_you_can_not_know() -> - {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"), - MyList = case unicode:characters_to_list(Bin) of - L when is_list(L) -> - L; - _ -> - binary_to_list(Bin) %% The file was bytewise encoded - end, - %% Now we know that the list is a Unicode string, not a list of UTF-8 bytes - {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]), - io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets! - file:close(G). %% The file contains valid UTF-8 encoded Unicode characters! - + loop_through_file2(F,io:get_chars(F,'',1024)).
+
+ +
+ Lists of UTF-8 Bytes +

For various reasons, you can sometimes have a list of UTF-8 + bytes. This is not a regular string of Unicode characters, as each list + element does not contain one character. Instead you get the "raw" UTF-8 + encoding that you have in binaries. This is easily converted to a proper + Unicode string by first converting byte per byte into a binary, and then + converting the binary of UTF-8 encoded characters back to a Unicode + string:

+ + +utf8_list_to_string(StrangeList) -> + unicode:characters_to_list(list_to_binary(StrangeList)). +
+ +
+ Double UTF-8 Encoding +

When working with binaries, you can get the horrible "double UTF-8 + encoding", where strange characters are encoded in your binaries or + files. In other words, you can get a UTF-8 encoded binary that for the + second time is encoded as UTF-8. A common situation is where you read a + file, byte by byte, but the content is already UTF-8. If you then + convert the bytes to UTF-8, using, for example, the + unicode module, or by + writing to a file opened with option {encoding,utf8}, you have + each byte in the input file encoded as UTF-8, not each + character of the original text (one character can have been encoded in + many bytes). There is no real remedy for this other than to be sure of + which data is encoded in which format, and never convert UTF-8 data + (possibly read byte by byte from a file) into UTF-8 again.

+ +

By far the most common situation where this occurs, is when you get + lists of UTF-8 instead of proper Unicode strings, and then convert them + to UTF-8 in a binary or on a file:

+ + +wrong_thing_to_do() -> + {ok,Bin} = file:read_file("an_utf8_encoded_file.txt"), + MyList = binary_to_list(Bin), %% Wrong! It is an utf8 binary! + {ok,C} = file:open("catastrophe.txt",[write,{encoding,utf8}]), + io:put_chars(C,MyList), %% Expects a Unicode string, but get UTF-8 + %% bytes in a list! + file:close(C). %% The file catastrophe.txt contains more or less unreadable + %% garbage! + +

Ensure you know what a binary contains before converting it to a + string. If no other option exists, try heuristics:

+ + +if_you_can_not_know() -> + {ok,Bin} = file:read_file("maybe_utf8_encoded_file.txt"), + MyList = case unicode:characters_to_list(Bin) of + L when is_list(L) -> + L; + _ -> + binary_to_list(Bin) %% The file was bytewise encoded + end, + %% Now we know that the list is a Unicode string, not a list of UTF-8 bytes + {ok,G} = file:open("greatness.txt",[write,{encoding,utf8}]), + io:put_chars(G,MyList), %% Expects a Unicode string, which is what it gets! + file:close(G). %% The file contains valid UTF-8 encoded Unicode characters! +
- + -- cgit v1.2.3