From a8221e1844dc6f5235be9a88fd6832183871fc6c Mon Sep 17 00:00:00 2001 From: Patrik Nyblom Date: Wed, 31 Jul 2013 17:36:50 +0200 Subject: Add documentation of report_errors and match_limit(_recursion) --- lib/stdlib/doc/src/re.xml | 154 ++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 136 insertions(+), 18 deletions(-) (limited to 'lib/stdlib') diff --git a/lib/stdlib/doc/src/re.xml b/lib/stdlib/doc/src/re.xml index 8a47b1579d..b2cde3f72d 100644 --- a/lib/stdlib/doc/src/re.xml +++ b/lib/stdlib/doc/src/re.xml @@ -218,13 +218,18 @@ This option makes it possible to include comments inside complicated patterns. N

If the regular expression is previously compiled, the option list can only contain the options anchored, - global, notbol, noteol, - notempty, notempty_atstart, {offset, integer() >= 0}, {newline, - NLSpec} and {capture, ValueSpec}/{capture, ValueSpec, + global, notbol, noteol, report_errors, + notempty, notempty_atstart, {offset, integer() >= 0}, + {match_limit, integer() >= 0}, + {match_limit_recursion, integer() >= 0}, + {newline, + NLSpec} and + {capture, ValueSpec}/{capture, ValueSpec, Type}. Otherwise all options valid for the re:compile/2 function are allowed as well. Options allowed both for compilation and execution of a match, namely - anchored and {newline, NLSpec}, will affect both + anchored and {newline, NLSpec}, + will affect both the compilation and execution if present together with a non pre-compiled regular expression.

@@ -254,6 +259,17 @@ This option makes it possible to include comments inside complicated patterns. N be done either by specifying none or an empty list as ValueSpec.

+

The report_errors option adds the possibility that an + error tuple is returned. The tuple will either indicate a + matching error (match_limit or + match_limit_recursion) or a compilation error, where the + error tuple has the format {error, {compile, + CompileErr}}. Note that if the option + report_errors is not given, the function never returns + error tuples, but will report compilation errors as a badarg + exception and failed matches due to exceeded match limits simply + as nomatch.
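The return values described above can be told apart with a single case expression. The following is a minimal sketch only: the module name re_report_errors_sketch, the function classify/2 and the result atoms matched, not_matched, bad_pattern and limit_exceeded are hypothetical names chosen for illustration and are not part of the re module.

```erlang
-module(re_report_errors_sketch).
-export([classify/2]).

%% classify/2 is a hypothetical helper, not part of the re module.
%% It runs an uncompiled pattern with report_errors and maps each
%% documented return shape to an illustrative tagged result.
classify(Subject, Pattern) ->
    case re:run(Subject, Pattern, [report_errors]) of
        {match, Captured} ->
            {matched, Captured};
        nomatch ->
            not_matched;
        {error, {compile, CompileErr}} ->
            %% Without report_errors this case would be a badarg exception.
            {bad_pattern, CompileErr};
        {error, Limit} when Limit =:= match_limit;
                            Limit =:= match_limit_recursion ->
            %% Without report_errors this case would be a plain nomatch.
            {limit_exceeded, Limit}
    end.
```

Calling the helper with an unbalanced pattern such as "(" should give a bad_pattern result instead of raising an exception.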

+

The options relevant for execution are:

@@ -368,6 +384,116 @@ This option makes it possible to include comments inside complicated patterns. N behavior of the dollar metacharacter. It does not affect \Z or \z. + report_errors + +

This option gives better control of the error handling in re:run/3. When it is given, compilation errors (if the regular expression isn't already compiled) as well as run-time errors are explicitly returned as an error tuple.

+

The possible run-time errors are:

+ + match_limit + + The PCRE library sets a limit on how many times the + internal match function can be called. The default value for + this is 10000000 in the library compiled for Erlang. If + {error, match_limit} is returned, it means that the + execution of the regular expression has reached this + limit. Normally this is to be regarded as a nomatch, + which is the default return value when this happens, but by + specifying report_errors, you will get informed when + the match fails due to too many internal calls. + + match_limit_recursion + + This error is very similar to match_limit, but + occurs when the internal match function of PCRE is + "recursively" called more times than the + "match_limit_recursion" limit, which is by default 10000000 as + well. Note that as long as the match_limit and + match_limit_recursion values are kept at the default + values, the match_limit_recursion error cannot occur, + as the match_limit error will occur before that (each + recursive call is also a call, but not vice versa). Both + limits can however be changed, either by setting limits + directly in the regular expression string (see reference + section below) or by giving options to re:run/3. + + +

It is important to understand that what is referred to as + "recursion" when limiting matches is not actually recursion on + the C stack of the Erlang machine, nor is it recursion on + the Erlang process stack. The version of PCRE compiled into the + Erlang VM uses machine "heap" memory to store values that need to be + kept over recursion in regular expression matches.

+
+ {match_limit, integer() >= 0} + +

This option limits the execution time of a match in an + implementation-specific way. It is described in the following + way by the PCRE documentation:

+ + +The match_limit field provides a means of preventing PCRE from using +up a vast amount of resources when running patterns that are not going +to match, but which have a very large number of possibilities in their +search trees. The classic example is a pattern that uses nested +unlimited repeats. + +Internally, pcre_exec() uses a function called match(), which it calls +repeatedly (sometimes recursively). The limit set by match_limit is +imposed on the number of times this function is called during a match, +which has the effect of limiting the amount of backtracking that can +take place. For patterns that are not anchored, the count restarts +from zero for each position in the subject string. + +

This means that runaway regular expression matches can fail + faster if the limit is lowered using this option. The default + value compiled into the Erlang virtual machine is 10000000.

+ +

This option in no way affects the execution of the + Erlang virtual machine in terms of "long running + BIFs". re:run always gives control back to the scheduler + of Erlang processes at intervals that ensure the real-time + properties of the Erlang system.
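As an illustration of lowering the limit, the sketch below wraps re:run/3 so that a caller-chosen match_limit is always combined with report_errors. The module name re_match_limit_sketch and the function run_with_limit/3 are hypothetical and exist only for this example.

```erlang
-module(re_match_limit_sketch).
-export([run_with_limit/3]).

%% run_with_limit/3 is a hypothetical wrapper, not part of the re module.
%% It passes a caller-chosen match_limit together with report_errors, so
%% that reaching the limit is returned as {error, match_limit} and is not
%% silently reported as nomatch.
run_with_limit(Subject, Pattern, Limit) when is_integer(Limit), Limit >= 0 ->
    re:run(Subject, Pattern, [{match_limit, Limit}, report_errors]).
```

A pattern with nested unlimited repeats, such as "(a+)*z", is the kind of input where a small limit is expected to be reached; the exact number of internal calls a given pattern needs depends on the PCRE version.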

+
+ + {match_limit_recursion, integer() >= 0} + +

This option limits the execution time and memory + consumption of a match in an implementation-specific way, very + similar to match_limit. It is described in the following + way by the PCRE documentation:

+ + +The match_limit_recursion field is similar to match_limit, but instead +of limiting the total number of times that match() is called, it +limits the depth of recursion. The recursion depth is a smaller number +than the total number of calls, because not all calls to match() are +recursive. This limit is of use only if it is set smaller than +match_limit. + +Limiting the recursion depth limits the amount of machine stack that +can be used, or, when PCRE has been compiled to use memory on the heap +instead of the stack, the amount of heap memory that can be +used. + +

The Erlang virtual machine uses a PCRE library where heap + memory is used when regular expression match recursion happens, + so this option limits the usage of machine heap, not C stack.

+ +

Specifying a lower value may result in matches with deep recursion failing, when they should actually have matched:

+ +1> re:run("aaaaaaaaaaaaaz","(a+)*z"). +{match,[{0,14},{0,13}]} +2> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5}]). +nomatch +3> re:run("aaaaaaaaaaaaaz","(a+)*z",[{match_limit_recursion,5},report_errors]). +{error,match_limit_recursion} + +

This option, as well as the match_limit option, should + only be used in very rare cases. Understanding of the PCRE + library internals is recommended before tampering with these + limits.

+
+ {offset, integer() >= 0} Start matching at the offset (position) given in the @@ -442,7 +568,7 @@ This option makes it possible to include comments inside complicated patterns. N All captured subpatterns including the complete matching string. This is the default. all_names All named subpatterns in the regular expression, as if a list() - of all the names in alphabetical order was given. The list of all names can also be retrieved with the inspect/2 function. + of all the names in alphabetical order was given. The list of all names can also be retrieved with the inspect/2 function. first Only the first captured subpattern, which is always the complete matching part of the subject. All explicitly captured subpatterns are discarded. all_but_first @@ -894,27 +1020,19 @@ below. A change of \R setting can be combined with a change of newline convention.

Setting match and recursion limits

-

The internal limits on how many calls (and recursive calls) can be done to the internal matching -engine of PCRE during one call to re:run/{2,3}, -can be set by items at the start of the pattern:

+ +

The caller of re:run/3 can set a limit on the number of times the internal match() function is called and on the maximum depth of recursive calls. These facilities are provided to catch runaway matches that are provoked by patterns with huge matching trees (a typical example is a pattern with nested unlimited repeats) and to avoid running out of system stack by too much recursion. When one of these limits is reached, pcre_exec() gives an error return. The limits can also be set by items at the start of the pattern of the form

(*LIMIT_MATCH=d)

(*LIMIT_RECURSION=d)

-

where d is any number of decimal digits. However, the value of the setting must -be less than the value set by the Erlang virtual machine for it to have -any effect. In other words, the pattern writer can lower the limit set by the -VM, but not raise it. If there is more than one setting of one of these -limits, the lower value is used. -

+

where d is any number of decimal digits. However, the value of the setting must be less than the value set by the caller of re:run/3 for it to have any effect. In other words, the pattern writer can lower the limit set by the programmer, but not raise it. If there is more than one setting of one of these limits, the lower value is used.
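The interaction between a limit embedded in the pattern and a limit given by the caller can be sketched as follows. The module name re_inline_limit_sketch and the function with_inline_limit/3 are hypothetical; the sketch assumes the pattern is given as a string (list), so that the (*LIMIT_RECURSION=d) item can be prepended with ++.

```erlang
-module(re_inline_limit_sketch).
-export([with_inline_limit/3]).

%% with_inline_limit/3 is a hypothetical helper, not part of the re module.
%% It prepends a (*LIMIT_RECURSION=d) item to a string pattern and also
%% passes a caller-side match_limit_recursion option. According to the
%% text above, the pattern item only has an effect when its value is
%% lower than the caller's limit.
with_inline_limit(Subject, Pattern, {InlineLimit, CallerLimit})
  when is_integer(InlineLimit), InlineLimit >= 0,
       is_integer(CallerLimit), CallerLimit >= 0 ->
    Limited = "(*LIMIT_RECURSION=" ++ integer_to_list(InlineLimit) ++ ")"
        ++ Pattern,
    re:run(Subject, Limited,
           [{match_limit_recursion, CallerLimit}, report_errors]).
```

With an inline limit above the caller's limit, the caller's value should remain the effective one, since the pattern writer can lower the limit but not raise it.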

-

The current value for both the limits are 10000000 in the Erlang +

The current default value for both limits is 10000000 in the Erlang

+match function never does recursion on the "C-stack".

-

Basically, tampering with these limits is seldom useful.

-
Characters and metacharacters -- cgit v1.2.3