Copyright 2017 Ericsson AB. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

uri_string Péter Dimitrov 1 2017-10-24 A
uri_string URI processing functions.

This module contains functions for parsing and handling URIs (RFC 3986) and form-urlencoded query strings (HTML5).

A URI is an identifier consisting of a sequence of characters matching the syntax rule named URI in RFC 3986.

The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment:

    URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
    hier-part   = "//" authority path-abempty
                   / path-absolute
                   / path-rootless
                   / path-empty
    scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    authority   = [ userinfo "@" ] host [ ":" port ]
    userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )

    reserved    = gen-delims / sub-delims
    gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
    


The interpretation of a URI depends only on the characters used and not on how those characters are represented in a network protocol.

The functions implemented by this module cover the following use cases:

Parsing URIs into their components and returning a map: parse/1

Recomposing a map of URI components into a URI string: recompose/1

Changing the inbound binary and percent-encoding of URIs: transcode/2

Transforming URIs into a normalized form: normalize/1

Composing form-urlencoded query strings from a list of key-value pairs: compose_query/1, compose_query/2

Dissecting form-urlencoded query strings into a list of key-value pairs: dissect_query/1

There are four different encodings present during the handling of URIs:

Inbound binary encoding in binaries

Inbound percent-encoding in lists and binaries

Outbound binary encoding in binaries

Outbound percent-encoding in lists and binaries

Functions with a uri_string() argument accept lists, binaries, and mixed lists (lists with binary elements) as input. All functions except transcode/2 expect input as lists of Unicode code points, UTF-8 encoded binaries, and UTF-8 percent-encoded URI parts ("%C3%B6" corresponds to the Unicode character "ö").

Unless otherwise specified, the return value has the same type and encoding as the input: binary input returns binary output and list input returns list output. Mixed input returns list output.
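The type rule can be checked with a small helper. This is a minimal sketch, not part of the library: the module name uri_types_demo and the function path_type/1 are hypothetical, and only uri_string:parse/1 comes from this module.

```erlang
-module(uri_types_demo).
-export([path_type/1]).

%% Reports whether the path component returned by uri_string:parse/1
%% is a list or a binary for the given input. Error tuples are passed
%% through unchanged.
path_type(URI) ->
    case uri_string:parse(URI) of
        {error, _Reason, _Info} = Error -> Error;
        #{path := Path} when is_binary(Path) -> binary;
        #{path := Path} when is_list(Path) -> list
    end.
```

Calling path_type/1 with a list, a binary, and a mixed list such as ["http://", <<"example.com/a">>] should report list, binary, and list respectively.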

Lists carry only percent-encoding. In binaries, however, both the binary encoding and the percent-encoding must be considered. transcode/2 converts between the supported encodings: it takes a uri_string() and a list of options specifying the inbound and outbound encodings.

RFC 3986 does not mandate any specific character encoding; the encoding is usually defined by the protocol or the surrounding text. This library makes the same assumption: binary encoding and percent-encoding are handled as one configuration unit and cannot be set to different values.

Error tuple indicating the type of error. Possible values of the second component:

invalid_character

invalid_encoding

invalid_input

invalid_map

invalid_percent_encoding

invalid_scheme

invalid_uri

invalid_utf8

missing_value

The third component is a term providing additional information about the cause of the error.
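Because failures are reported as a three-element tuple rather than an exception, callers usually match on it explicitly. The following is a sketch under assumptions: the wrapper safe_parse/1 and the module name uri_error_demo are hypothetical, and the {ok, Map} return shape is a choice made here, not something the library provides.

```erlang
-module(uri_error_demo).
-export([safe_parse/1]).

%% Wraps uri_string:parse/1 so that success and failure are both
%% tagged. Reason is one of the atoms listed above; Info is the
%% additional term describing the cause.
safe_parse(URI) ->
    case uri_string:parse(URI) of
        {error, Reason, Info} ->
            {error, {Reason, Info}};
        Map when is_map(Map) ->
            {ok, Map}
    end.
```

A URI containing a character outside the allowed set, for example a space in the host, is expected to take the error branch.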

Map holding the main components of a URI.

List of unicode codepoints, a UTF-8 encoded binary, or a mix of the two, representing an RFC 3986 compliant URI (percent-encoded form). A URI is a sequence of characters from a very limited set: the letters of the basic Latin alphabet, digits, and a few special characters.

Compose urlencoded query string.

Composes a form-urlencoded QueryString based on a QueryList, a list of non-percent-encoded key-value pairs. Form-urlencoding is defined in section 4.10.22.6 of the HTML5 specification.

See also the opposite operation dissect_query/1.

Example:

1> uri_string:compose_query([{"foo bar","1"},{"city","örebro"}]).
"foo+bar=1&city=%C3%B6rebro"
2> uri_string:compose_query([{<<"foo bar">>,<<"1">>},
2> {<<"city">>,<<"örebro"/utf8>>}]).
<<"foo+bar=1&city=%C3%B6rebro">>
	
Compose urlencoded query string.

Same as compose_query/1 but with an additional Options parameter that controls the encoding ("charset") used by the encoding algorithm. There are two supported encodings: utf8 (or unicode) and latin1.

Each character in the entry's name and value that cannot be expressed using the selected character encoding is replaced by a string consisting of a U+0026 AMPERSAND character (&), a "#" (U+0023) character, one or more ASCII digits representing the Unicode code point of the character in base ten, and finally a ";" (U+003B) character.

Bytes other than 0x2A, 0x2D, 0x2E, 0x30 to 0x39, 0x41 to 0x5A, 0x5F, and 0x61 to 0x7A are percent-encoded (a U+0025 PERCENT SIGN character (%) followed by uppercase ASCII hex digits representing the hexadecimal value of the byte).
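The byte-level rule can be illustrated with a short sketch. This is not the library's implementation: the module form_byte_demo and its functions are hypothetical, and the mapping of the space byte to "+" is an assumption taken from the HTML5 algorithm and from the "foo+bar" results in the examples, since the rule quoted above does not state it.

```erlang
-module(form_byte_demo).
-export([encode_byte/1, encode_bytes/1]).

%% Bytes that pass through unchanged: "*", "-", ".", "_",
%% the digits 0-9 and the letters A-Z and a-z.
encode_byte(B) when B =:= 16#2A; B =:= 16#2D; B =:= 16#2E; B =:= 16#5F;
                    B >= 16#30, B =< 16#39;
                    B >= 16#41, B =< 16#5A;
                    B >= 16#61, B =< 16#7A ->
    [B];
%% Assumption (HTML5 algorithm): a space byte becomes "+".
encode_byte(16#20) ->
    "+";
%% Every other byte becomes "%" followed by two uppercase hex digits.
encode_byte(B) when is_integer(B), B >= 0, B =< 255 ->
    [$%, hex(B bsr 4), hex(B band 15)].

%% Applies encode_byte/1 to each byte of an already encoded binary.
encode_bytes(Bin) when is_binary(Bin) ->
    lists:append([encode_byte(B) || <<B>> <= Bin]).

%% Converts a value in 0..15 to an uppercase hexadecimal digit.
hex(N) when N < 10 -> $0 + N;
hex(N) -> $A + N - 10.
```

Applied to the UTF-8 bytes of "örebro", the sketch yields "%C3%B6rebro". Applied to the numeric character reference "&#26481;", it yields "%26%2326481%3B".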

See also the opposite operation dissect_query/1.

Example:

1> uri_string:compose_query([{"foo bar","1"},{"city","örebro"}],
1> [{encoding, latin1}]).
"foo+bar=1&city=%F6rebro"
2> uri_string:compose_query([{<<"foo bar">>,<<"1">>},
2> {<<"city">>,<<"東京"/utf8>>}], [{encoding, latin1}]).
<<"foo+bar=1&city=%26%2326481%3B%26%2320140%3B">>
	
Dissect query string.

Dissects a urlencoded QueryString and returns a QueryList, a list of non-percent-encoded key-value pairs. Form-urlencoding is defined in section 4.10.22.6 of the HTML5 specification.

It is less strict about its input than the decoding algorithm defined by HTML5 and accepts all Unicode characters.

See also the opposite operation compose_query/1.

Example:

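1> uri_string:dissect_query("foo+bar=1&city=%C3%B6rebro").
[{"foo bar","1"},{"city","örebro"}]
2> uri_string:dissect_query(<<"foo+bar=1&city=%26%2326481%3B%26%2320140%3B">>).
[{<<"foo bar">>,<<"1">>},
 {<<"city">>,<<230,157,177,228,186,172>>}]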
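Since compose_query/1 and dissect_query/1 are opposite operations, a round trip is a convenient sanity check. The sketch below assumes a hypothetical module query_roundtrip_demo with a helper roundtrip/1. It only calls the two documented functions.

```erlang
-module(query_roundtrip_demo).
-export([roundtrip/1]).

%% Composes a query string from QueryList and dissects it again.
%% An error tuple from compose_query/1 is returned unchanged.
roundtrip(QueryList) ->
    case uri_string:compose_query(QueryList) of
        {error, _Reason, _Info} = Error ->
            Error;
        QueryString ->
            uri_string:dissect_query(QueryString)
    end.
```

For a list such as [{"foo bar","1"},{"city","örebro"}], the result should equal the input list.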
	
Syntax-based normalization.

Transforms URIString into a normalized form using Syntax-Based Normalization as defined by RFC 3986.

This function implements case normalization, percent-encoding normalization, path segment normalization, and scheme-based normalization for HTTP(S), with basic support for FTP, SSH, SFTP, and TFTP.
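A common use of normalization is to decide whether two URIs are equivalent. The following is a minimal sketch, assuming a hypothetical module uri_equiv_demo. It treats any input that fails to normalize as not equivalent, which is a design choice made here rather than library behavior.

```erlang
-module(uri_equiv_demo).
-export([equivalent/2]).

%% Two URIs are considered equivalent when their normalized forms are
%% identical. A list and a binary never compare equal here, because
%% normalize/1 keeps the input type.
equivalent(A, B) ->
    case {uri_string:normalize(A), uri_string:normalize(B)} of
        {{error, _, _}, _} -> false;
        {_, {error, _, _}} -> false;
        {Same, Same} -> true;
        _Different -> false
    end.
```

With this helper, "/a/b/c/./../../g" and "/a/g" compare as equivalent because path segment normalization removes the dot segments.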

Example:

1> uri_string:normalize("/a/b/c/./../../g").
"/a/g"
2> uri_string:normalize(<<"mid/content=5/../6">>).
<<"mid/6">>
3> uri_string:normalize("http://localhost:80").
"http://localhost/"
	
Parse URI into a map.

Parses an RFC 3986 compliant uri_string() into a uri_map() that holds the parsed components of the URI. If parsing fails, an error tuple is returned.

See also the opposite operation recompose/1.

Example:

1> uri_string:parse("foo://user@example.com:8042/over/there?name=ferret#nose").
#{fragment => "nose",host => "example.com",
  path => "/over/there",port => 8042,query => "name=ferret",
  scheme => "foo",userinfo => "user"}
2> uri_string:parse(<<"foo://user@example.com:8042/over/there?name=ferret">>).
#{host => <<"example.com">>,path => <<"/over/there">>,
  port => 8042,query => <<"name=ferret">>,scheme => <<"foo">>,
  userinfo => <<"user">>}
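Components that do not occur in the URI are absent from the returned map, so optional keys are best read with a default. The sketch below is illustrative only: the module uri_endpoint_demo, the function host_and_port/2, and the no_host return value are hypothetical names chosen for this example.

```erlang
-module(uri_endpoint_demo).
-export([host_and_port/2]).

%% Returns {Host, Port} for a URI, falling back to DefaultPort when
%% the URI has no explicit port. Returns no_host when the URI has no
%% authority component, and passes error tuples through unchanged.
host_and_port(URI, DefaultPort) ->
    case uri_string:parse(URI) of
        {error, _Reason, _Info} = Error ->
            Error;
        #{host := Host} = Map ->
            {Host, maps:get(port, Map, DefaultPort)};
        _NoAuthority ->
            no_host
    end.
```

For the URI used in the parse/1 example, the helper returns the host "example.com" together with the explicit port 8042.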
	
Recompose URI.

Creates an RFC 3986 compliant URIString (percent-encoded), based on the components of URIMap. If the URIMap is invalid, an error tuple is returned.

See also the opposite operation parse/1.
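Combining parse/1 and recompose/1 gives a simple way to edit one component of a URI. This is a sketch under assumptions: the module uri_edit_demo and the function set_query/2 are hypothetical, and Query is expected to have the same type (list or binary) as URI.

```erlang
-module(uri_edit_demo).
-export([set_query/2]).

%% Replaces the query component of URI with Query and returns the
%% recomposed URI. Error tuples from parse/1 are returned unchanged;
%% recompose/1 may itself return an error tuple for an invalid map.
set_query(URI, Query) ->
    case uri_string:parse(URI) of
        {error, _Reason, _Info} = Error ->
            Error;
        Map ->
            uri_string:recompose(Map#{query => Query})
    end.
```

All other components, including userinfo and fragment, are carried over from the parsed map.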

Example:

1> URIMap = #{fragment => "nose", host => "example.com", path => "/over/there",
1> port => 8042, query => "name=ferret", scheme => "foo", userinfo => "user"}.
#{fragment => "nose",host => "example.com",
  path => "/over/there",port => 8042,query => "name=ferret",
  scheme => "foo",userinfo => "user"}

2> uri_string:recompose(URIMap).
"foo://user@example.com:8042/over/there?name=ferret#nose"
Transcode URI.

Transcodes an RFC 3986 compliant URIString, where Options is a list of tagged tuples specifying the inbound (in_encoding) and outbound (out_encoding) encodings. in_encoding and out_encoding specify both the binary encoding and the percent-encoding for the input and output data. Mixed encoding, where the binary encoding is not the same as the percent-encoding, is not supported. If an argument is invalid, an error tuple is returned.

Example:

1> uri_string:transcode(<<"foo%00%00%00%F6bar"/utf32>>,
1> [{in_encoding, utf32},{out_encoding, utf8}]).
<<"foo%C3%B6bar"/utf8>>
2> uri_string:transcode("foo%F6bar", [{in_encoding, latin1},
2> {out_encoding, utf8}]).
"foo%C3%B6bar"