From b553664f54034e8c04ae6f9cc44f16b7f516518b Mon Sep 17 00:00:00 2001
From: Sverker Eriksson
Date: Fri, 11 Jan 2013 17:27:29 +0100
Subject: erl_interface: utf8 atoms continued

---
 lib/erl_interface/doc/src/ei.xml        | 59 +++++++++++++++++++++++++++++++--
 lib/erl_interface/doc/src/erl_eterm.xml | 18 ++++++----
 2 files changed, 69 insertions(+), 8 deletions(-)

(limited to 'lib/erl_interface/doc/src')

diff --git a/lib/erl_interface/doc/src/ei.xml b/lib/erl_interface/doc/src/ei.xml
index 539e16d837..0b0b1eeb79 100644
--- a/lib/erl_interface/doc/src/ei.xml
+++ b/lib/erl_interface/doc/src/ei.xml
@@ -82,6 +82,22 @@
 function returns the size required (note that for strings an extra byte is needed for the 0 string terminator).

+
+ DATA TYPES + + + enum erlang_char_encoding + +

+
+enum erlang_char_encoding {
+  ERLANG_ASCII, ERLANG_LATIN1, ERLANG_UTF8, ERLANG_WHATEVER
+};
+
+

The character encoding used for atoms.

+
+
+
 void ei_set_compat_rel(release_number)
@@ -225,11 +241,31 @@
 Encode an atom

Encodes an atom in the binary format. The parameter - is the name of the atom. Only upto bytes + is the name of the atom in latin1 encoding. Only upto MAXATOMLEN-1 bytes are encoded. The name should be zero-terminated, except for the function.

+
+ int ei_encode_atom_as(char *buf, int *index, const char *p, enum erlang_char_encoding from_enc, enum erlang_char_encoding to_enc)
+ int ei_encode_atom_len_as(char *buf, int *index, const char *p, int len, enum erlang_char_encoding from_enc, enum erlang_char_encoding to_enc)
+ int ei_x_encode_atom_as(ei_x_buff* x, const char *p, enum erlang_char_encoding from_enc, enum erlang_char_encoding to_enc)
+ int ei_x_encode_atom_len_as(ei_x_buff* x, const char *p, int len, enum erlang_char_encoding from_enc, enum erlang_char_encoding to_enc)
+ Encode an atom
+
+

Encodes an atom in the binary format with character encoding + to_enc (latin1 or utf8). + The p parameter is the name of the atom with character encoding + from_enc. + The name must either be zero-terminated or a function variant with a len + parameter must be used.

+

The encoding will fail if the atom is too long or if it can not be represented + with character encoding to_enc.

+

These functions were introduced in R16 release of Erlang/OTP as part of a first step + to support UTF8 atoms. Atoms encoded with ERLANG_UTF8 + can not be decoded by earlier releases than R16.

+
+
 int ei_encode_binary(char *buf, int *index, const void *p, long len)
 int ei_x_encode_binary(ei_x_buff* x, const void *p, long len)
@@ -490,10 +526,29 @@ ei_x_encode_empty_list(&x);
 Decode an atom

This function decodes an atom from the binary format. The - name of the atom is placed at . There can be at most + null terminated name of the atom is placed at . There can be at most bytes placed in the buffer.

+
+ int ei_decode_atom_as(const char *buf, int *index, char *p, int plen, enum erlang_char_encoding want, enum erlang_char_encoding* was, enum erlang_char_encoding* result)
+ Decode an atom
+
+

This function decodes an atom from the binary format. The + null terminated name of the atom is placed in buffer at p of length + plen bytes.

+

The wanted string encoding is specified by + want. The original encoding used in the + binary format (latin1 or utf8) can be obtained from *was. The actual encoding of the resulting string + (7-bit ascii, latin1 or utf8) can be obtained from *result. Both was and result can be NULL. + *result may differ from want if want is ERLANG_WHATEVER or if + *result turn out to be pure 7-bit ascii (compatible with both latin1 and utf8).

+

This function fails if the atom is too long for the buffer + or if it can not be represented with encoding want.

+

This functions was introduced in R16 release of Erlang/OTP as part of a first step + to support UTF8 atoms.

+
+
 int ei_decode_binary(const char *buf, int *index, void *p, long *len)
 Decode a binary
diff --git a/lib/erl_interface/doc/src/erl_eterm.xml b/lib/erl_interface/doc/src/erl_eterm.xml
index f403618c59..c7840d7813 100644
--- a/lib/erl_interface/doc/src/erl_eterm.xml
+++ b/lib/erl_interface/doc/src/erl_eterm.xml
@@ -77,10 +77,12 @@

+ A string representing atom . - The length (in characters) of atom t. + + The length (in bytes) of atom t. A pointer to the contents of @@ -92,6 +94,7 @@ The floating point value of . + The Node in pid . The sequence number in pid . @@ -104,6 +107,7 @@ The creation number in port . + The node in port . The first part of the reference number in ref . Use @@ -296,7 +300,7 @@ iohead ::= Binary ETERM *erl_mk_atom(string) Creates an atom - char *string; + const char *string;

Creates an atom.

@@ -305,10 +309,12 @@ iohead ::= Binary

Returns an Erlang term containing an atom. Note that it is the callers responsibility to make sure that contains a valid name for an atom.

-

can be used to retrieve the - atom name (as a string). Note that the string is not - 0-terminated in the atom. returns - the length of the atom name.

+

and + can be used to retrieve the atom name (as a null terminated string). + and returns the length of the atom name.

+

Note that the UTF8 variants were introduced in Erlang/OTP releases R16 + and the string returned by ERL_ATOM_PTR(atom) was not null terminated on older releases.

+
-- cgit v1.2.3

From 1f4765cca4874fa92fcfad888fbe6d5f2fbf74d1 Mon Sep 17 00:00:00 2001
From: Sverker Eriksson
Date: Tue, 22 Jan 2013 19:25:36 +0100
Subject: erl_interface: even more utf8 atom stuff

---
 lib/erl_interface/doc/src/ei.xml | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

(limited to 'lib/erl_interface/doc/src')

diff --git a/lib/erl_interface/doc/src/ei.xml b/lib/erl_interface/doc/src/ei.xml
index 0b0b1eeb79..e9c7c644b5 100644
--- a/lib/erl_interface/doc/src/ei.xml
+++ b/lib/erl_interface/doc/src/ei.xml
@@ -94,7 +94,11 @@ enum erlang_char_encoding {
   ERLANG_ASCII, ERLANG_LATIN1, ERLANG_UTF8, ERLANG_WHATEVER
 };
-

The character encoding used for atoms.

+

The character encoding used for atoms. ERLANG_ASCII represents 7-bit ASCII. + Latin1 and UTF8 are different extensions of 7-bit ASCII. All 7-bit ASCII characters + are valid Latin1 and UTF8 characters. ASCII and Latin1 both represent each character + by one byte. A UTF8 character can consist of one to four bytes. ERLANG_WHATEVER + is not an encoding but rather used as a wildcard.

@@ -256,11 +260,11 @@ enum erlang_char_encoding {

Encodes an atom in the binary format with character encoding to_enc (latin1 or utf8). The p parameter is the name of the atom with character encoding - from_enc. + from_enc (ascii, latin1 or utf8). The name must either be zero-terminated or a function variant with a len parameter must be used.

-

The encoding will fail if the atom is too long or if it can not be represented - with character encoding to_enc.

+

The encoding will fail if p is not a valid string in encoding from_enc, + if the string is too long or if it can not be represented with character encoding to_enc.

These functions were introduced in R16 release of Erlang/OTP as part of a first step to support UTF8 atoms. Atoms encoded with ERLANG_UTF8 can not be decoded by earlier releases than R16.

-- cgit v1.2.3

From c596e17cf3d69cf5e10d28ee2a8ee35162786da1 Mon Sep 17 00:00:00 2001
From: Sverker Eriksson
Date: Wed, 23 Jan 2013 16:04:38 +0100
Subject: erl_interface: Changed erlang_char_encoding interface to allow bitwise-or'd combinations.

---
 lib/erl_interface/doc/src/ei.xml | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

(limited to 'lib/erl_interface/doc/src')

diff --git a/lib/erl_interface/doc/src/ei.xml b/lib/erl_interface/doc/src/ei.xml
index e9c7c644b5..117c787da6 100644
--- a/lib/erl_interface/doc/src/ei.xml
+++ b/lib/erl_interface/doc/src/ei.xml
@@ -91,14 +91,13 @@

 enum erlang_char_encoding {
-  ERLANG_ASCII, ERLANG_LATIN1, ERLANG_UTF8, ERLANG_WHATEVER
+  ERLANG_ASCII, ERLANG_LATIN1, ERLANG_UTF8
 };

The character encoding used for atoms. ERLANG_ASCII represents 7-bit ASCII. Latin1 and UTF8 are different extensions of 7-bit ASCII. All 7-bit ASCII characters are valid Latin1 and UTF8 characters. ASCII and Latin1 both represent each character - by one byte. A UTF8 character can consist of one to four bytes. ERLANG_WHATEVER - is not an encoding but rather used as a wildcard.

+ by one byte. A UTF8 character can consist of one to four bytes.

@@ -545,11 +544,13 @@ ei_x_encode_empty_list(&x); want. The original encoding used in the binary format (latin1 or utf8) can be obtained from *was. The actual encoding of the resulting string (7-bit ascii, latin1 or utf8) can be obtained from *result. Both was and result can be NULL. - *result may differ from want if want is ERLANG_WHATEVER or if - *result turn out to be pure 7-bit ascii (compatible with both latin1 and utf8).

+ + *result may differ from want if want is a bitwise-or'd combination like + ERLANG_LATIN1|ERLANG_UTF8 or if *result turn out to be pure 7-bit ascii + (compatible with both latin1 and utf8).

This function fails if the atom is too long for the buffer or if it can not be represented with encoding want.

-

This functions was introduced in R16 release of Erlang/OTP as part of a first step +

This function was introduced in R16 release of Erlang/OTP as part of a first step to support UTF8 atoms.

-- cgit v1.2.3
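
Note on the encodings described in this series: the one-byte versus multi-byte distinction in the erlang_char_encoding description, and the failure cases listed for ei_encode_atom_as (name too long for the buffer, name not representable in to_enc), can be illustrated without the library. The sketch below is not the erl_interface implementation. The sketch_ names, the enum values 1, 2 and 4 (picked only so that members could be combined with bitwise or), and the choice to always reserve room for a 0 terminator are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for erlang_char_encoding. The numeric values are an
 * assumption, chosen so that members can be or'd together as a mask. */
enum sketch_char_encoding {
    SKETCH_ASCII  = 1,
    SKETCH_LATIN1 = 2,
    SKETCH_UTF8   = 4
};

/* Convert len Latin1 bytes to a 0-terminated UTF-8 string in dst.
 * Returns the number of bytes written (terminator excluded), or -1 if dst
 * is too small. If result is non-NULL it reports SKETCH_ASCII when every
 * byte was 7-bit, otherwise SKETCH_UTF8. */
int sketch_latin1_to_utf8(const unsigned char *src, int len,
                          unsigned char *dst, int dstlen,
                          enum sketch_char_encoding *result)
{
    int out = 0, pure_ascii = 1, i;

    assert(src != NULL && dst != NULL);
    for (i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            if (out + 1 >= dstlen) return -1;   /* keep room for the 0 */
            dst[out++] = src[i];
        } else {
            /* 0x80..0xFF needs two bytes in UTF-8 */
            if (out + 2 >= dstlen) return -1;
            dst[out++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[out++] = (unsigned char)(0x80 | (src[i] & 0x3F));
            pure_ascii = 0;
        }
    }
    if (out >= dstlen) return -1;
    dst[out] = 0;
    if (result != NULL) *result = pure_ascii ? SKETCH_ASCII : SKETCH_UTF8;
    return out;
}

/* Convert len UTF-8 bytes to a 0-terminated Latin1 string in dst.
 * Returns the number of bytes written, or -1 if dst is too small, the
 * input is malformed, or a code point above U+00FF cannot be represented.
 * If result is non-NULL it reports SKETCH_ASCII or SKETCH_LATIN1. */
int sketch_utf8_to_latin1(const unsigned char *src, int len,
                          unsigned char *dst, int dstlen,
                          enum sketch_char_encoding *result)
{
    int out = 0, pure_ascii = 1, i = 0;

    assert(src != NULL && dst != NULL);
    while (i < len) {
        unsigned char c = src[i];

        if (out + 1 >= dstlen) return -1;       /* keep room for the 0 */
        if (c < 0x80) {
            dst[out++] = c;
            i += 1;
        } else if ((c & 0xFE) == 0xC2 && i + 1 < len
                   && (src[i + 1] & 0xC0) == 0x80) {
            /* lead byte 0xC2 or 0xC3 covers exactly U+0080..U+00FF */
            dst[out++] = (unsigned char)(((c & 0x03) << 6)
                                         | (src[i + 1] & 0x3F));
            i += 2;
            pure_ascii = 0;
        } else {
            return -1;
        }
    }
    if (out >= dstlen) return -1;
    dst[out] = 0;
    if (result != NULL) *result = pure_ascii ? SKETCH_ASCII : SKETCH_LATIN1;
    return out;
}
```

With these helpers, the three Latin1 bytes E5 6B 65 convert to the four UTF-8 bytes C3 A5 6B 65, which shows why the erl_eterm.xml hunk counts atom length in bytes rather than characters. A name made only of 7-bit characters converts unchanged and is reported as ASCII, and a UTF-8 name containing a code point above U+00FF is rejected when Latin1 is requested.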