From 6513fc5eb55b306e2b1088123498e6c50b9e7273 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bj=C3=B6rn=20Gustavsson?=

- A good start when programming efficiently is to have knowledge about
+ A good start when programming efficiently is to know
how much memory different data types and operations require. It is
implementation-dependent how much memory the Erlang data types and
- other items consume, but here are some figures for the
- erts-5.2 system (OTP release R9B). (There have been no significant
- changes in R13.)
-The unit of measurement is memory words. There exists both a 32-bit
- and a 64-bit implementation, and a word is therefore, 4 bytes or
- 8 bytes, respectively.
+The unit of measurement is memory words. There exists both a
+ 32-bit and a 64-bit implementation. A word is therefore 4 bytes or
+ 8 bytes, respectively.
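A quick way to check the word size of a running emulator (a minimal
sketch; erlang:system_info/1 with the wordsize argument is a documented
BIF, and the output shown assumes a 64-bit emulator):

1> erlang:system_info(wordsize).
8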
-The Erlang language specification puts no limits on number of processes,
- length of atoms etc., but for performance and memory saving reasons,
- there will always be limits in a practical implementation of the Erlang
- language and execution environment.
-The maximum number of simultaneously alive Erlang processes is
- by default 32768. This limit can be configured at startup,
- for more information see the
-
A remote node Y has to be known to node X if there exist
- any pids, ports, references, or funs (Erlang data types) from Y
- on X, or if X and Y are connected. The maximum number of remote
- nodes simultaneously/ever known to a node is limited by the
-
The maximum number of simultaneously open Erlang ports is
- often by default 16384. This limit can be configured at startup,
- for more information see the
-
+The Erlang language specification puts no limits on the number of
+ processes, length of atoms, and so on. However, for performance and
+ memory saving reasons, there will always be limits in a practical
+ implementation of the Erlang language and execution environment.
- The maximum number of simultaneously open files and sockets
- depend on

-In R12B, the most natural way to write binary construction and matching is now
+In R12B, the most natural way to construct and match binaries is
+ significantly faster than in earlier releases.
-To construct at binary, you can simply write
+To construct a binary, you can simply write as follows:
DO (in R12B) / REALLY DO NOT (in earlier releases)
<![CDATA[
my_list_to_binary(List) ->
    my_list_to_binary(List, <<>>).

my_list_to_binary([H|T], Acc) ->
    my_list_to_binary(T, <<Acc/binary,H>>);
my_list_to_binary([], Acc) ->
    Acc.]]>
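A usage sketch of the function above (the module name example is
hypothetical; a string is a list of character codes, so the result is
a binary of those bytes):

1> example:my_list_to_binary("abc").
<<97,98,99>>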
- In releases before R12B,
+The extra space allocated (or reallocated) will be twice the size of
+ the existing binary data, or 256, whichever is larger.
In releases before R12B,
The most natural way to match binaries is now the fastest:
@@ -64,55 +63,79 @@ my_binary_to_list(<

 Internally, binaries and bitstrings are implemented in the same way.
- In this section, we will call them binaries since that is what
+ In this section, they are called binaries because that is what
 they are called in the emulator source code.
-There are four types of binary objects internally. Two of them are
- containers for binary data and two of them are merely references to
- a part of a binary.

-The binary containers are called refc binaries
- (short for reference-counted binaries) and heap binaries.
+Four types of binary objects are available internally:
+Two are containers for binary data and are called:
+Two are merely references to a part of a binary and + are called:
+Refc binaries consist of two parts:
+ an object stored on the process heap, called a ProcBin, and the
+ binary object itself, stored outside all process heaps.
 The binary object can be referenced by any number of ProcBins from any
- number of processes; the object contains a reference counter to keep track
+ number of processes. The object contains a reference counter to keep track
 of the number of references, so that it can be removed when the last
 reference disappears.
All ProcBin objects in a process are part of a linked list, so that the garbage collector can keep track of them and decrement the reference counters in the binary when a ProcBin disappears.
+Heap binaries are small binaries, up to 64 bytes, and are stored
+ directly on the process heap. They are copied when the process is
+ garbage-collected and when they are sent as a message. They do not
require any special handling by the garbage collector.
-There are two types of reference objects that can reference part of
- a refc binary or heap binary. They are called sub binaries and
- match contexts.

+The reference objects sub binaries and
+ match contexts can reference part of
+ a refc binary or heap binary.
A match context is similar to a sub binary, but is + optimized for binary matching. For example, it contains a direct + pointer to the binary data. For each field that is matched out of + a binary, the position in the match context is incremented.
In R11B, a match context was only used during a binary matching operation.
@@ -122,27 +145,28 @@ my_binary_to_list(<<>>) -> [].]]>

 context and discard the sub binary. Instead of creating a sub binary,
 the match context is kept.

-The compiler can only do this optimization if it can know for sure
The compiler can only do this optimization if it knows that the match context will not be shared. If it would be shared, the functional properties (also called referential transparency) of Erlang would break.
+In R12B, appending to a binary or bitstring
+ is specially optimized by the runtime system:

<![CDATA[<<Bin/binary, ...>>
<<Bin/bitstring, ...>>]]>
- is specially optimized by the run-time system.
- Because the run-time system handles the optimization (instead of
+As the runtime system handles the optimization (instead of
 the compiler), there are very few circumstances in which the optimization
- will not work.
+ does not work.

-To explain how it works, we will go through this code
+To explain how it works, let us examine the following code line + by line:
<![CDATA[Bin0 = <<0>>,                    %% 1
Bin1 = <<Bin0/binary,1,2,3>>,    %% 2
Bin2 = <<Bin1/binary,4,5,6>>,    %% 3

@@ -152,81 +176,81 @@ Bin3 = <<Bin2/binary,7,8,9>>,    %% 4

Bin4 = <<Bin1/binary,17>>,       %% 5 !!!
{Bin4,Bin3}                      %% 6]]>
- line by line.

-The first line (marked with the
+The first line (marked with the
The second line is an append operation. Since
It gets more interesting in the third line.
+
Same thing in the fourth line. There are 252 bytes left, - so there is no problem storing another three bytes.
+But in the fifth line something interesting happens.
- Note that we don't append to the previous result in
What will happen?
-The run-time system will see that
The runtime system sees that
 The optimization of the binary append operation requires that
 there is a single ProcBin and a single reference to the
 ProcBin for the binary. The reason is that the binary object can be
- moved (reallocated) during an append operation, and when that happens
+ moved (reallocated) during an append operation, and when that happens,
 the pointer in the ProcBin must be updated. If there would be more than
 one ProcBin pointing to the binary object, it would not be possible
 to find and update all of them.
-Therefore, certain operations on a binary will mark it so that +
Therefore, certain operations on a binary mark it so that any future append operation will be forced to copy the binary. In most cases, the binary object will be shrunk at the same time to reclaim the extra space allocated for growing.
-When appending to a binary
+When appending to a binary as follows, only the binary returned + from the latest append operation will support further cheap append + operations:
>]]>
- only the binary returned from the latest append operation will
- support further cheap append operations. In the code fragment above,
In the code fragment in the beginning of this section,
appending to
 If a binary is sent as a message to a process or port, the binary
 will be shrunk and any further append operation will copy the binary
- data into a new binary. For instance, in the following code fragment
+ data into a new binary. For example, in the following code fragment:

<![CDATA[Bin1 = <<Bin0/binary,...>>,

@@ -234,12 +258,12 @@ PortOrPid ! Bin1,

Bin = <<Bin1/binary,...>> %% Bin1 will be COPIED
]]>
- The same thing happens if you insert a binary into an ets
- table or send it to a port using
The same happens if you insert a binary into an Ets
+ table, send it to a port using
Matching a binary will also cause it to shrink and the next append operation will copy the binary data:
@@ -249,22 +273,23 @@ Bin1 = <The reason is that a
The reason is that a
+
-If a process simply keeps binaries (either in "loop data" or in the process
- dictionary), the garbage collector may eventually shrink the binaries.
- If only one such binary is kept, it will not be shrunk. If the process later
- appends to a binary that has been shrunk, the binary object will be reallocated
- to make place for the data to be appended.
+If a process simply keeps binaries (either in "loop data" or in the + process + dictionary), the garbage collector can eventually shrink the binaries. + If only one such binary is kept, it will not be shrunk. If the process + later appends to a binary that has been shrunk, the binary object will + be reallocated to make place for the data to be appended.
We will revisit the example shown earlier
+Let us revisit the example in the beginning of the previous section:
DO (in R12B)
>) ->
[H|my_binary_to_list(T)];
my_binary_to_list(<<>>) -> [].]]>
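A usage sketch of this function (the module name example is
hypothetical):

1> example:my_binary_to_list(<<"abc">>).
"abc"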
- too see what is happening under the hood.

-The very first time
The first time
In R11B, at this point a
In R11B, at this point a
+
Therefore, in R12B,
Therefore, in R12B,
When the end of the binary is reached and the second clause matches, the match context will simply be discarded (removed in the next - garbage collection, since there is no longer any reference to it).
+ garbage collection, as there is no longer any reference to it).

To summarize,
In R11B, the fastest way to match binaries is:
+In R11B, the fastest way to match binaries is as follows:
DO NOT (in R12B)
end.]]>
This function cleverly avoids building sub binaries, but it cannot
- avoid building a match context in each recursion step. Therefore,
- in both R11B and R12B,
+ avoid building a match context in each recursion step.
+ Therefore, in both R11B and R12B,
Returning to
Returning to
-Yes, it will. The compiler will remove the building of the sub binary in the
- second clause
+Yes, it will. The compiler will remove the building of the sub binary in + the second clause:
>) ->
after_zero(T);
-.
-.
-.]]>
+...]]>
- but will generate code that builds a sub binary in the first clause
+But it will generate code that builds a sub binary in the first clause:
>) ->
T;
-.
-.
-.]]>
+...]]>
- Therefore,
Therefore,
Code like the following will also be optimized:
@@ -371,12 +390,14 @@ all_but_zeroes_to_list(<<0,T/binary>>, Acc, Remaining) ->
    all_but_zeroes_to_list(<

-The compiler will remove building of sub binaries in the second and third clauses,
- and it will add an instruction to the first clause that will convert
The compiler removes building of sub binaries in the second and third
+ clauses, and it adds an instruction to the first clause that converts
+
-Before you begin to think that the compiler can optimize any binary patterns,
- here is a function that the compiler (currently, at least) is not able to optimize:
+Before you begin to think that the compiler can optimize any binary + patterns, the following function cannot be optimized by the compiler + (currently, at least):
>) ->
@@ -386,43 +407,43 @@ non_opt_eq([_|_], <<_,_/binary>>) ->
non_opt_eq([], <<>>) ->
true.]]>
- It was briefly mentioned earlier that the compiler can only delay creation of
- sub binaries if it can be sure that the binary will not be shared. In this case,
- the compiler cannot be sure.
+It was mentioned earlier that the compiler can only delay creation of + sub binaries if it knows that the binary will not be shared. In this case, + the compiler cannot know.
-We will soon show how to rewrite
Soon it is shown how to rewrite
Use the
- or passed via an environment variable
+or passed through an environment variable:
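The option name is elided in this patch; in the published compiler
documentation it is bin_opt_info, so the two ways of enabling it are
presumably these (a sketch):

erlc +bin_opt_info module.erl

export ERL_COMPILER_OPTIONS=bin_opt_info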
- Note that the
Notice that the
The warnings will look like this:
+The warnings look as follows:
- To make it clearer exactly what code the warnings refer to,
- in the examples that follow, the warnings are inserted as comments
- after the clause they refer to:
+To make it clearer exactly what code the warnings refer to, the + warnings in the following examples are inserted as comments + after the clause they refer to, for example:
>) ->
@@ -434,12 +455,12 @@ after_zero(<<_,T/binary>>) ->
after_zero(<<>>) ->
<<>>.]]>
- The warning for the first clause tells us that it is not possible to
- delay the creation of a sub binary, because it will be returned.
- The warning for the second clause tells us that a sub binary will not be
The warning for the first clause says that the creation of a sub + binary cannot be delayed, because it will be returned. + The warning for the second clause says that a sub binary will not be created (yet).
-It is time to revisit the earlier example of the code that could not +
Let us revisit the earlier example of the code that could not be optimized and find out why:
>) ->
non_opt_eq([], <<>>) ->
true.]]>
- The compiler emitted two warnings. The
The compiler emitted two warnings. The
We will soon show another example that should make the distinction between
Soon another example will show the difference between the
+
>, [H|T2]) ->
@@ -485,15 +506,13 @@ match_body([0|_], <>) ->
%% sub binary optimization;
%% SUGGEST changing argument order
done;
-.
-.
-.]]>
+...]]>
The warning means that if there is a call to
>) ->
@@ -504,10 +523,10 @@ match_head(List, <<_:10,Data/binary>>) ->
-The compiler itself figures out if a variable is unused. The same
- code is generated for each of the following functions
+The compiler figures out if a variable is unused. The same + code is generated for each of the following functions:
>, Count) -> count1(T, Count+1);
@@ -519,11 +538,9 @@ count2(<<>>, Count) -> Count.
count3(<<_H,T/binary>>, Count) -> count3(T, Count+1);
count3(<<>>, Count) -> Count.]]>
- In each iteration, the first 8 bits in the binary will be skipped, not matched out.
+In each iteration, the first 8 bits in the binary will be skipped,
+ not matched out.
-Here we list a few modules and BIFs to watch out for, and not only
- from a performance point of view.
This section lists a few modules and BIFs to watch out for, not only from a performance point of view.
Creating timers using
The functions in the
The functions in the
-Atoms are not garbage-collected. Once an atom is created, it will never
- be removed. The emulator will terminate if the limit for the number
- of atoms (1048576 by default) is reached.
+Atoms are not garbage-collected. Once an atom is created, it is never + removed. The emulator terminates if the limit for the number + of atoms (1,048,576 by default) is reached.
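If an application legitimately needs more atoms, the limit can be
raised when the emulator is started (a sketch, assuming the +t flag
documented for erl):

erl +t 2000000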
-Therefore, converting arbitrary input strings to atoms could be
- dangerous in a system that will run continuously.
- If only certain well-defined atoms are allowed as input, you can use
Therefore, converting arbitrary input strings to atoms can be
+ dangerous in a system that runs continuously.
+ If only certain well-defined atoms are allowed as input,
+ list_to_existing_atom/1 can be used to guard against
+ a denial-of-service attack.
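A minimal sketch of that guard (safe_to_atom/1 is a hypothetical
helper; list_to_existing_atom/1 is a standard BIF that fails with
badarg if the atom does not already exist):

safe_to_atom(String) ->
    try list_to_existing_atom(String) of
        Atom -> {ok, Atom}
    catch
        error:badarg -> {error, unknown_atom}
    end.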
Using
-apply(list_to_atom("some_prefix"++Var), foo, Args)
-
- is quite expensive and is not recommended in time-critical code.
+apply(list_to_atom("some_prefix"++Var), foo, Args)

-Normally you don't have to worry about the speed of
Normally, there is no need to worry about the speed of
Some uses of
foo(L) when length(L) >= 3 ->
...
- can be rewritten to
+can be rewritten to:
foo([_,_,_|_]=L) ->
...
- (One slight difference is that
One slight difference is that
There is one exception to the rule that the tuple is copied.
If the compiler clearly can see that destructively updating the tuple would
- give exactly the same result as if the tuple was copied, the call to
-
multiple_setelement(T0) ->
T1 = setelement(9, T0, bar),
T2 = setelement(7, T1, foobar),
setelement(5, T2, new_value).
- the first
The two following
-For the optimization to be applied, all of the followings conditions
- must be true:
+For the optimization to be applied, all the following conditions
+ must be true:
If it is not possible to structure the code as in the
If the code cannot be structured as in the
Using the new BIFs
Using the new BIFs
It is usually more efficient to split a binary using matching
instead of calling the
DO
<![CDATA[<<Bin1:Num/binary,Bin2/binary>> = Bin,]]>
DO NOT
- {Bin1,Bin2} = split_binary(Bin, Num)
-
+ {Bin1,Bin2} = split_binary(Bin, Num)
Note that the '
The "
DO NOT
@@ -182,42 +180,39 @@ multiple_setelement(T0) ->

 HugeList1 -- HugeList2]]>

 Instead use the
DO
HugeSet1 = ordsets:from_list(HugeList1),
HugeSet2 = ordsets:from_list(HugeList2),
- ordsets:subtract(HugeSet1, HugeSet2)
-
+ ordsets:subtract(HugeSet1, HugeSet2)
- Obviously, that code will not work if the original order
- of the list is important. If the order of the list must be
- preserved, do like this:
+Obviously, that code does not work if the original order
+ of the list is important. If the order of the list must be
+ preserved, do as follows:

DO
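The body of this DO example is elided in the patch; a sketch of an
order-preserving subtract along the lines the notes below describe,
assuming the gb_sets module in STDLIB is acceptable:

    Set = gb_sets:from_list(HugeList2),
    [E || E <- HugeList1, not gb_sets:is_element(E, Set)]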
- Subtle note 1: This code behaves differently from '
+ This code behaves differently from "

- Subtle note 2: Also, this code compares lists elements using the
- '
+ This code compares lists elements using the
+ "
Using the '
Using the "
OK
- HugeList1 -- [Element]
-
+ HugeList1 -- [Element]
-This chapter provides a (very) brief overview on how to write efficient
- drivers. It is assumed that you already have a good understanding of
This section provides a brief overview on how to write efficient drivers.
+It is assumed that you have a good understanding of drivers.
-The run-time system will always take a lock before running
- any code in a driver.
The runtime system always takes a lock before running any code in a driver.
-By default, that lock will be at the driver level, meaning that +
By default, that lock is at the driver level, that is, if several ports have been opened to the same driver, only code for one port at the same time can be running.
-A driver can be configured to instead have one lock for each port.
+A driver can be configured to have one lock for each port instead.
-If a driver is used in a functional way (i.e. it holds no state, +
+If a driver is used in a functional way (that is, holds no state,
 but only does some heavy calculation and returns a result), several
- ports with registered names can be opened beforehand and the port to
- be used can be chosen based on the scheduler ID like this:
+ ports with registered names can be opened beforehand, and the port to + be used can be chosen based on the scheduler ID as follows:
-define(PORT_NAMES(),
@@ -67,82 +66,82 @@ client_port() ->
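The code of this example is truncated in the patch; a sketch of the
pattern it describes, with hypothetical driver port names
(erlang:system_info(scheduler_id) and tuple_size/1 are standard BIFs):

-define(PORT_NAMES(),
        {some_driver_01, some_driver_02, some_driver_03,
         some_driver_04}).

client_port() ->
    element(erlang:system_info(scheduler_id)
                rem tuple_size(?PORT_NAMES()) + 1,
            ?PORT_NAMES()).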
-There are basically two ways to avoid copying a binary that is
- sent to a driver.

-If the
Therefore, if you want to send both a pre-existing binary and some
- additional data to a driver without copying the binary, you must call
-
Another way to avoid copying binaries is to implement an
If the
Therefore, if you want to send both a pre-existing binary and some
+ extra data to a driver without copying the binary, you must call
+
Implement an
-The run-time system can represent binaries up to 64 bytes as
- heap binaries. They will always be copied when sent in a messages,
- but they will require less memory if they are not sent to another
- process and garbage collection is cheaper.
The runtime system can represent binaries up to 64 bytes as + heap binaries. They are always copied when sent in messages, + but they require less memory if they are not sent to another process and garbage collection is cheaper.
-If you know that the binaries you return are always small,
- you should use driver API calls that do not require a pre-allocated
- binary, for instance
If you know that the binaries you return are always small, you
+ are advised to use driver API calls that do not require a pre-allocated
+ binary, for example,
-To avoid copying data when a big binary is sent or returned from
- the driver to an Erlang process, the driver must first allocate the
- binary and then send it to an Erlang process in some way.
To avoid copying data when a large binary is sent or returned from the driver to an Erlang process, the driver must first allocate the binary and then send it to an Erlang process in some way.
-Use
Use
+
There are several ways to send a binary created with
-
From the
A single binary can be sent with
-
Using
+
Pattern matching in function head and in
Pattern matching in function head as well as in
 One exception is pattern matching of binaries. The compiler
- will not rearrange clauses that match binaries. Placing the
- clause that matches against the empty binary last will usually
- be slightly faster than placing it first.
+ does not rearrange clauses that match binaries. Placing the
+ clause that matches against the empty binary last is usually
+ slightly faster than placing it first.

-Here is a rather contrived example to show another exception:
+The following is a rather unnatural example to show another + exception:
DO NOT
@@ -53,27 +53,30 @@ atom_map1(five) -> 5;
atom_map1(six) -> 6.
The problem is the clause with the variable
First the input value is compared to
+
If none of the first three clauses matched, the fourth clause
- will match since a variable always matches. If the guard test
-
If the guard test failed, the input value is compared to
+
Rewriting to either
+Rewriting to either:
DO
5;
atom_map2(six) -> 6;
atom_map2(Int) when is_integer(Int) -> Int.]]>
- or
+or:
DO
4;
atom_map3(five) -> 5;
atom_map3(six) -> 6.]]>
- will give slightly more efficient matching code.
+gives slightly more efficient matching code.
-Here is a less contrived example:
+Another example:
DO NOT
match anything, the compiler is not allowed to rearrange the clauses,
but must generate code that matches them in the order written.
- If the function is rewritten like this
+ If the function is rewritten as follows, the compiler is free to
+ rearrange the clauses:
DO
map_pairs2(Map, [X|Xs], [Y|Ys]) ->
[Map(X, Y)|map_pairs2(Map, Xs, Ys)].]]>
- the compiler is free to rearrange the clauses. It will generate code
- similar to this
- similar to this
+ The compiler will generate code similar to this:
DO NOT (already done by the compiler)
Ys0
end.]]>
- which should be slightly faster for presumably the most common case
+
This is slightly faster for probably the most common case
that the input lists are not empty or very short.
- (Another advantage is that Dialyzer is able to deduce a better type
- for the variable Xs .)
+ (Another advantage is that Dialyzer can deduce a better type
+ for the Xs variable.)
-Here is an intentionally rough guide to the relative costs of
- different kinds of calls. It is based on benchmark figures run on
This is an intentionally rough guide to the relative costs of + different calls. It is based on benchmark figures run on Solaris/Sparc:
Calling and applying a fun does not involve any hash-table lookup.
A fun contains an (indirect) pointer to the function that implements
@@ -178,42 +185,44 @@ explicit_map_pairs(Map, Xs0, Ys0) ->
Tuples are not fun(s).
A "tuple fun",
It no longer matters (from a performance point of view) - whether you write
+ whether you write:
Module:Function(Arg1, Arg2)
- or
+or:
apply(Module, Function, [Arg1,Arg2])
- (The compiler internally rewrites the latter code into the former.)
+The compiler internally rewrites the latter code into the + former.
-The following code
+The following code is slightly slower because the shape of the + list of arguments is unknown at compile time.
apply(Module, Function, Arguments)
- is slightly slower because the shape of the list of arguments
- is not known at compile time.
-When writing recursive functions it is preferable to make them
- tail-recursive so that they can execute in constant memory space.
+When writing recursive functions, it is preferable to make them + tail-recursive so that they can execute in constant memory space:
+DO
list_length(List) ->
@@ -224,13 +233,14 @@ list_length([], AccLen) ->
list_length([_|Tail], AccLen) ->
list_length(Tail, AccLen + 1). % Tail-recursive
+
DO NOT
+
list_length([]) ->
0. % Base case
list_length([_ | Tail]) ->
list_length(Tail) + 1. % Not tail-recursive
-Premature optimization is the root of all evil. -- D.E. Knuth

+"Premature optimization is the root of all evil"
+ (D.E. Knuth)
-Efficient code can be well-structured and clean code, based on
+Efficient code can be well-structured and clean, based on
 a sound overall architecture and sound algorithms.
 Efficient code can be highly implementation-dependent code that
 bypasses documented interfaces and takes advantage of obscure quirks
 in the current implementation.
-Ideally, your code should only contain the first kind of efficient
- code. If that turns out to be too slow, you should profile the application
Ideally, your code only contains the first type of efficient + code. If that turns out to be too slow, profile the application to find out where the performance bottlenecks are and optimize only the - bottlenecks. Other code should stay as clean as possible.
+ bottlenecks. Let other code stay as clean as possible.

-Fortunately, compiler and run-time optimizations introduced in
- R12B makes it easier to write code that is both clean and
- efficient. For instance, the ugly workarounds needed in R11B and earlier
+Fortunately, compiler and runtime optimizations introduced in
+ Erlang/OTP R12B make it easier to write code that is both clean and
+ efficient. For example, the ugly workarounds needed in R11B and earlier
 releases to get the most speed out of binary pattern matching are
 no longer necessary. In fact, the ugly code is slower
 than the clean code (because the clean code has become faster, not
 because the uglier code has become slower).
-This Efficiency Guide cannot really learn you how to write efficient +
+This Efficiency Guide cannot really teach you how to write efficient
 code. It can give you a few pointers about what to avoid and what to use,
 and some understanding of how certain language features are implemented.
- We have generally not included general tips about optimization that will
- work in any language, such as moving common calculations out of loops.
+ This guide does not include general tips about optimization that
+ works in any language, such as moving common calculations out of loops.

-It is assumed that the reader is familiar with the Erlang
- programming language and concepts of OTP.
+It is assumed that you are familiar with the Erlang programming + language and the OTP concepts.
Lists can only be built starting from the end and attaching
- list elements at the beginning. If you use the
Lists can only be built starting from the end and attaching list
+ elements at the beginning. If you use the "
List1 ++ List2
- you will create a new list which is copy of the elements in
Looking at how
append([H|T], Tail) ->
@@ -50,12 +48,12 @@ append([H|T], Tail) ->
append([], Tail) ->
Tail.
- So the important thing when recursing and building a list is to
- make sure that you attach the new elements to the beginning of the list,
- so that you build a list, and not hundreds or thousands of
- copies of the growing result list.
+When recursing and building a list, it is important to ensure + that you attach the new elements to the beginning of the list. In + this way, you will build one list, not hundreds or thousands + of copies of the growing result list.
-Let us first look at how it should not be done:
+Let us first see how it is not to be done:
DO NOT
bad_fib(N, Current, Next, Fibs) ->
bad_fib(N - 1, Next, Current + Next, Fibs ++ [Current]).]]>
- Here we are not a building a list; in each iteration step we
- create a new list that is one element longer than the new previous list.
+Here more than one list is built. In each iteration step a new list + is created that is one element longer than the new previous list.
-To avoid copying the result in each iteration, we must build the list in
- reverse order and reverse the list when we are done:
+To avoid copying the result in each iteration, build the list in + reverse order and reverse the list when you are done:
DO
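The body of this DO example is elided in the patch; a sketch that
mirrors the bad_fib/4 arguments above, but builds the result in
reverse and calls lists:reverse/1 once at the end:

fib(N) ->
    fib(N, 0, 1, []).

fib(0, _Current, _Next, Fibs) ->
    lists:reverse(Fibs);
fib(N, Current, Next, Fibs) ->
    fib(N - 1, Next, Current + Next, [Current|Fibs]).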
List comprehensions still have a reputation for being
 slow. They used to be implemented using funs, which used to be slow.
-In recent Erlang/OTP releases (including R12B), a list comprehension
+In recent Erlang/OTP releases (including R12B), a list comprehension:
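The comprehension itself is elided here; given the translation shown
below, it is presumably of this general shape (a sketch):

    [Expr(E) || E <- List]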
- is basically translated to a local function
+is basically translated to a local function:
'lc^0'([E|Tail], Expr) ->
[Expr(E)|'lc^0'(Tail, Expr)];
'lc^0'([], _Expr) -> [].
- In R12B, if the result of the list comprehension will obviously not be used,
- a list will not be constructed. For instance, in this code
+In R12B, if the result of the list comprehension will obviously + not be used, a list will not be constructed. For example, in this code:
- or in this code
+or in this code:
[io:put_chars(E) || E <- List];
... ->
end,
some_function(...),
-.
-.
-.]]>
+...]]>
- the value is neither assigned to a variable, nor passed to another function,
- nor returned, so there is no need to construct a list and the compiler will simplify
- the code for the list comprehension to
+the value is not assigned to a variable, not passed to another function, + and not returned. This means that there is no need to construct a list and + the compiler will simplify the code for the list comprehension to:
'lc^0'([E|Tail], Expr) ->
@@ -139,14 +133,15 @@ some_function(...),
In the following situations, you can easily avoid calling
In the following situations, you can easily avoid calling
+
Port example
+DO
... port_command(Port, DeepList) ...+
DO NOT
... @@ -180,7 +178,7 @@ some_function(...), port_command(Port, TerminatedStr) ...-
Instead do like this:
+Instead:
DO
@@ -188,47 +186,53 @@ some_function(...), TerminatedStr = [String, 0], % String="foo" => [[$f, $o, $o], 0] port_command(Port, TerminatedStr) ...+
Append example
DO
> lists:append([[1], [2], [3]]). [1,2,3] >+
DO NOT
> lists:flatten([[1], [2], [3]]). [1,2,3] >+
In the performance myth chapter, the following myth was exposed:
-
In Section 7.2, the following myth was exposed:
+
 To summarize, in R12B there is usually not much difference between
 a body-recursive list function and tail-recursive function that reverses
 the list at the end. Therefore, concentrate on writing beautiful code
- and forget about the performance of your list functions. In the time-critical
- parts of your code (and only there), measure before rewriting
- your code.

-Important note: This section talks about lists functions that
- construct lists. A tail-recursive function that does not construct
- a list runs in constant space, while the corresponding body-recursive
- function uses stack space proportional to the length of the list.
- For instance, a function that sums a list of integers, should not be
- written like this
+ and forget about the performance of your list functions. In the + time-critical parts of your code (and only there), measure + before rewriting your code. + +This section is about list functions that construct + lists. A tail-recursive function that does not construct a list runs + in constant space, while the corresponding body-recursive function + uses stack space proportional to the length of the list.
+For example, a function that sums a list of integers is
+ not to be written as follows:
DO NOT
recursive_sum([H|T]) -> H+recursive_sum(T);
recursive_sum([]) -> 0.
- but like this
+Instead:
DO
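The body of this DO example is elided in the patch; a sketch of the
tail-recursive counterpart of recursive_sum/1 above:

sum(L)          -> sum(L, 0).

sum([H|T], Sum) -> sum(T, Sum + H);
sum([], Sum)    -> Sum.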
diff --git a/system/doc/efficiency_guide/myths.xml b/system/doc/efficiency_guide/myths.xml
index b1108dbab2..70d2dae88e 100644
--- a/system/doc/efficiency_guide/myths.xml
+++ b/system/doc/efficiency_guide/myths.xml
@@ -31,47 +31,48 @@
myths.xml
+
Some truths seem to live on well beyond their best-before date,
- perhaps because "information" spreads more rapidly from person-to-person
- faster than a single release note that notes, for instance, that funs
+ perhaps because "information" spreads faster from person-to-person
+ than a single release note that says, for example, that funs
have become faster.
- Here we try to kill the old truths (or semi-truths) that have
+
This section tries to kill the old truths (or semi-truths) that have
become myths.
- Myth: Funs are slow
- Yes, funs used to be slow. Very slow. Slower than apply/3 .
+
Myth: Funs are Slow
+ Funs used to be very slow, slower than apply/3 .
Originally, funs were implemented using nothing more than
compiler trickery, ordinary tuples, apply/3 , and a great
deal of ingenuity.
- But that is ancient history. Funs was given its own data type
- in the R6B release and was further optimized in the R7B release.
- Now the cost for a fun call falls roughly between the cost for a call to
- local function and apply/3 .
+ But that is history. Funs was given its own data type
+ in R6B and was further optimized in R7B.
+ Now the cost for a fun call falls roughly between the cost for a call
+ to a local function and apply/3 .
- Myth: List comprehensions are slow
+ Myth: List Comprehensions are Slow
List comprehensions used to be implemented using funs, and in the
- bad old days funs were really slow.
+ old days funs were indeed slow.
- Nowadays the compiler rewrites list comprehensions into an ordinary
- recursive function. Of course, using a tail-recursive function with
+
Nowadays, the compiler rewrites list comprehensions into an ordinary
+ recursive function. Using a tail-recursive function with
 a reverse at the end would still be faster. Or would it?
That leads us to the next myth.
- Myth: Tail-recursive functions are MUCH faster
- than recursive functions
+ Myth: Tail-Recursive Functions are Much Faster
+ Than Recursive Functions
According to the myth,
recursive functions leave references
- to dead terms on the stack and the garbage collector will have to
- copy all those dead terms, while tail-recursive functions immediately
+ to dead terms on the stack and the garbage collector has to copy
+ all those dead terms, while tail-recursive functions immediately
discard those terms.
That used to be true before R7B. In R7B, the compiler started
@@ -79,48 +80,47 @@
be used with an empty list, so that the garbage collector would not
keep dead values any longer than necessary.
- Even after that optimization, a tail-recursive function would
- still most of the time be faster than a body-recursive function. Why?
+ Even after that optimization, a tail-recursive function is
+ still faster than a body-recursive function most of the time. Why?
 It has to do with how many words of stack are used in each
- recursive call. In most cases, a recursive function would use more words
+ recursive call. In most cases, a recursive function uses more words
on the stack for each recursion than the number of words a tail-recursive
- would allocate on the heap. Since more memory is used, the garbage
- collector will be invoked more frequently, and it will have more work traversing
+ would allocate on the heap. As more memory is used, the garbage
+ collector is invoked more frequently, and it has more work traversing
the stack.
- In R12B and later releases, there is an optimization that will
+
In R12B and later releases, there is an optimization that
in many cases reduces the number of words used on the stack in
- body-recursive calls, so that a body-recursive list function and
+ body-recursive calls. A body-recursive list function and a
tail-recursive function that calls lists:reverse/1 at the
- end will use exactly the same amount of memory.
+ marker="stdlib:lists#reverse/1">lists:reverse/1 at
+ the end will use the same amount of memory.
lists:map/2 , lists:filter/2 , list comprehensions,
and many other recursive functions now use the same amount of space
as their tail-recursive equivalents.
- So which is faster?
+ So, which is faster?
+ It depends. On Solaris/Sparc, the body-recursive function seems to
+ be slightly faster, even for lists with a lot of elements. On the x86
+ architecture, tail-recursion was up to about 30% faster.
- It depends. On Solaris/Sparc, the body-recursive function seems to
- be slightly faster, even for lists with very many elements. On the x86
- architecture, tail-recursion was up to about 30 percent faster.
-
- So the choice is now mostly a matter of taste. If you really do need
+
So, the choice is now mostly a matter of taste. If you really do need
the utmost speed, you must measure. You can no longer be
- absolutely sure that the tail-recursive list function will be the fastest
- in all circumstances.
+ sure that the tail-recursive list function always is the fastest.
- Note: A tail-recursive function that does not need to reverse the
- list at the end is, of course, faster than a body-recursive function,
+ A tail-recursive function that does not need to reverse the
+ list at the end is faster than a body-recursive function,
as are tail-recursive functions that do not construct any terms at all
- (for instance, a function that sums all integers in a list).
+ (for example, a function that sums all integers in a list).
- Myth: '++' is always bad
+ Myth: Operator "++" is Always Bad
- The ++ operator has, somewhat undeservedly, got a very bad reputation.
- It probably has something to do with code like
+ The ++ operator has, somewhat undeservedly, got a bad reputation.
+ It probably has something to do with code like the following,
+ which is the most inefficient way there is to reverse a list:
DO NOT
@@ -129,12 +129,10 @@ naive_reverse([H|T]) ->
naive_reverse([]) ->
[].
- which is the most inefficient way there is to reverse a list.
- Since the ++ operator copies its left operand, the result
- will be copied again and again and again... leading to quadratic
- complexity.
+ As the ++ operator copies its left operand, the result
+ is copied repeatedly, leading to quadratic complexity.
- On the other hand, using ++ like this
+ But using ++ as follows is not bad:
OK
@@ -143,11 +141,11 @@ naive_but_ok_reverse([H|T], Acc) ->
naive_but_ok_reverse([], Acc) ->
Acc.
- is not bad. Each list element will only be copied once.
+
Each list element is copied only once.
The growing result Acc is the right operand
- for the ++ operator, and it will not be copied.
+ for the ++ operator, and it is not copied.
- Of course, experienced Erlang programmers would actually write
+ Experienced Erlang programmers would write as follows:
DO
@@ -156,32 +154,34 @@ vanilla_reverse([H|T], Acc) ->
vanilla_reverse([], Acc) ->
Acc.
- which is slightly more efficient because you don't build a
- list element only to directly copy it. (Or it would be more efficient
- if the the compiler did not automatically rewrite [H]++Acc
+
This is slightly more efficient because here you do not build a
+ list element only to copy it directly. (Or it would be more efficient
+ if the compiler did not automatically rewrite [H]++Acc
to [H|Acc] .)
- Myth: Strings are slow
-
- Actually, string handling could be slow if done improperly.
- In Erlang, you'll have to think a little more about how the strings
- are used and choose an appropriate representation and use
- the re module instead of the obsolete
- regexp module if you are going to use regular expressions.
+ Myth: Strings are Slow
+
+ String handling can be slow if done improperly.
+ In Erlang, you need to think a little more about how the strings
+ are used and choose an appropriate representation. If you
+ use regular expressions, use the
+ re module in STDLIB
+ instead of the obsolete regexp module.
- Myth: Repairing a Dets file is very slow
+ Myth: Repairing a Dets File is Very Slow
The repair time is still proportional to the number of records
- in the file, but Dets repairs used to be much, much slower in the past.
+ in the file, but Dets repairs used to be much slower in the past.
Dets has been massively rewritten and improved.
- Myth: BEAM is a stack-based byte-code virtual machine (and therefore slow)
+ Myth: BEAM is a Stack-Based Byte-Code Virtual Machine
+ (and Therefore Slow)
BEAM is a register-based virtual machine. It has 1024 virtual registers
that are used for holding temporary values and for passing arguments when
@@ -193,11 +193,11 @@ vanilla_reverse([], Acc) ->
- Myth: Use '_' to speed up your program when a variable is not used
+ Myth: Use "_" to Speed Up Your Program When a Variable
+ is Not Used
- That was once true, but since R6B the BEAM compiler is quite capable of seeing itself
+
That was once true, but from R6B the BEAM compiler can see
that a variable is not used.
-
diff --git a/system/doc/efficiency_guide/processes.xml b/system/doc/efficiency_guide/processes.xml
index 86951e2dcc..3bdc314235 100644
--- a/system/doc/efficiency_guide/processes.xml
+++ b/system/doc/efficiency_guide/processes.xml
@@ -30,15 +30,15 @@
- Creation of an Erlang process
+ Creating an Erlang Process
- An Erlang process is lightweight compared to operating
- systems threads and processes.
+ An Erlang process is lightweight compared to threads and
+ processes in operating systems.
A newly spawned Erlang process uses 309 words of memory
in the non-SMP emulator without HiPE support. (SMP support
- and HiPE support will both add to this size.) The size can
- be found out like this:
+ and HiPE support both add to this size.) The size can
+ be found as follows:
Erlang (BEAM) emulator version 5.6 [async-threads:0] [kernel-poll:false]
@@ -51,11 +51,11 @@ Eshell V5.6 (abort with ^G)
3> Bytes div erlang:system_info(wordsize).
309
- The size includes 233 words for the heap area (which includes the stack).
- The garbage collector will increase the heap as needed.
+ The size includes 233 words for the heap area (which includes the
+ stack). The garbage collector increases the heap as needed.
The main (outer) loop for a process must be tail-recursive.
- If not, the stack will grow until the process terminates.
+ Otherwise, the stack grows until the process terminates.
DO NOT
@@ -74,7 +74,7 @@ loop() ->
The call to io:format/2 will never be executed, but a
return address will still be pushed to the stack each time
loop/0 is called recursively. The correct tail-recursive
- version of the function looks like this:
+ version of the function looks as follows:
DO
@@ -90,92 +90,98 @@ loop() ->
end.
- Initial heap size
+ Initial Heap Size
The default initial heap size of 233 words is quite conservative
- in order to support Erlang systems with hundreds of thousands or
- even millions of processes. The garbage collector will grow and
- shrink the heap as needed.
+ to support Erlang systems with hundreds of thousands or
+ even millions of processes. The garbage collector grows and
+ shrinks the heap as needed.
In a system that use comparatively few processes, performance
- might be improved by increasing the minimum heap size using either
- the +h option for
+ might be improved by increasing the minimum heap size
+ using either the +h option for
erl or on a process-per-process
basis using the min_heap_size option for
spawn_opt/4 .
- The gain is twofold: Firstly, although the garbage collector will
- grow the heap, it will grow it step by step, which will be more
- costly than directly establishing a larger heap when the process
- is spawned. Secondly, the garbage collector may also shrink the
- heap if it is much larger than the amount of data stored on it;
- setting the minimum heap size will prevent that.
-
- The emulator will probably use more memory, and because garbage
- collections occur less frequently, huge binaries could be
+
The gain is twofold:
+
+ - Although the garbage collector grows the heap, it grows it
+ step-by-step, which is more costly than directly establishing a
+ larger heap when the process is spawned.
+ - The garbage collector can also shrink the heap if it is
+ much larger than the amount of data stored on it;
+ setting the minimum heap size prevents that.
+
+
+ The emulator probably uses more memory, and because garbage
+ collections occur less frequently, huge binaries can be
kept much longer.
In systems with many processes, computation tasks that run
- for a short time could be spawned off into a new process with
- a higher minimum heap size. When the process is done, it will
- send the result of the computation to another process and terminate.
- If the minimum heap size is calculated properly, the process may not
- have to do any garbage collections at all.
- This optimization should not be attempted
+ for a short time can be spawned off into a new process with
+ a higher minimum heap size. When the process is done, it sends
+ the result of the computation to another process and terminates.
+ If the minimum heap size is calculated properly, the process might
+ not have to do any garbage collections at all.
+ This optimization is not to be attempted
without proper measurements.
-
- Process messages
+ Process Messages
- All data in messages between Erlang processes is copied, with
- the exception of
+
All data in messages between Erlang processes is copied,
+ except for
refc binaries
on the same Erlang node.
When a message is sent to a process on another Erlang node,
- it will first be encoded to the Erlang External Format before
- being sent via an TCP/IP socket. The receiving Erlang node decodes
- the message and distributes it to the right process.
+ it is first encoded to the Erlang External Format before
+ being sent through a TCP/IP socket. The receiving Erlang node decodes
+ the message and distributes it to the correct process.
- The constant pool
+ Constant Pool
Constant Erlang terms (also called literals) are now
kept in constant pools; each loaded module has its own pool.
- The following function
+ The following function no longer builds the tuple every time
+ it is called (only to have it discarded the next time the garbage
+ collector was run), but the tuple is located in the module's
+ constant pool:
DO (in R12B and later)
days_in_month(M) ->
- element(M, {31,28,31,30,31,30,31,31,30,31,30,31}).
-
- will no longer build the tuple every time it is called (only
- to have it discarded the next time the garbage collector was run), but
- the tuple will be located in the module's constant pool.
+ element(M, {31,28,31,30,31,30,31,31,30,31,30,31}).
 But if a constant is sent to another process (or stored in
- an ETS table), it will be copied.
- The reason is that the run-time system must be able
- to keep track of all references to constants in order to properly
- unload code containing constants. (When the code is unloaded,
- the constants will be copied to the heap of the processes that refer
+ an Ets table), it is copied.
+ The reason is that the runtime system must be able
+ to keep track of all references to constants to unload code
+ containing constants properly. (When the code is unloaded,
+ the constants are copied to the heap of the processes that refer
 to them.) The copying of constants might be eliminated in a future
- release.
+ Erlang/OTP release.
+ Erlang/OTP release.Shared sub-terms are not preserved when a term is sent
- to another process, passed as the initial process arguments in
- the
Shared subterms are not preserved in the following + cases:
+That is an optimization. Most applications do not send messages + with shared subterms.
-Here is an example of how a shared sub-term can be created:
+The following example shows how a shared subterm can be created:
kilo_byte() ->
@@ -186,32 +192,32 @@ kilo_byte(0, Acc) ->
kilo_byte(N, Acc) ->
kilo_byte(N-1, [Acc|Acc]).
- 1> byte_size(list_to_binary(efficiency_guide:kilo_byte())). 1024-
Using the
Using the
2> erts_debug:size(efficiency_guide:kilo_byte()). 22-
Using the
Using the
3> erts_debug:flat_size(efficiency_guide:kilo_byte()). 4094-
-We can verify that sharing will be lost if we insert the
- data into an ETS table:
+It can be verified that sharing will be lost if the data is + inserted into an Ets table:
4> T = ets:new(tab, []). @@ -223,21 +229,21 @@ true 7> erts_debug:flat_size(element(2, hd(ets:lookup(T, key)))). 4094-
When the data has passed through an ETS table, +
When the data has passed through an Ets table,
-In a future release of Erlang/OTP, we might implement a
- way to (optionally) preserve sharing. We have no plans to make
- preserving of sharing the default behaviour, since that would

+In a future Erlang/OTP release, a way to (optionally) preserve
+ sharing might be implemented. There are no plans to make
+ preserving of sharing the default behaviour, as that would
 penalize the vast majority of Erlang applications.
-The SMP emulator (introduced in R11B) will take advantage of a
The SMP emulator (introduced in R11B) takes advantage of a multi-core or multi-CPU computer by running several Erlang scheduler threads (typically, the same as the number of cores). Each scheduler thread schedules Erlang processes in the same way as the Erlang scheduler @@ -247,11 +253,11 @@ true must have more than one runnable Erlang process most of the time. Otherwise, the Erlang emulator can still only run one Erlang process at the time, but you must still pay the overhead for locking. Although - we try to reduce the locking overhead as much as possible, it will never - become exactly zero.
+ Erlang/OTP tries to reduce the locking overhead as much as possible,
+ it will never become exactly zero.

-Benchmarks that may seem to be concurrent are often sequential.
- The estone benchmark, for instance, is entirely sequential. So is also
Benchmarks that appear to be concurrent are often sequential.
+ The estone benchmark, for example, is entirely sequential. So is
the most common implementation of the "ring benchmark"; usually one process
is active, while the others wait in a
 Even experienced software developers often guess wrong about where
- the performance bottlenecks are in their programs.
-
-Therefore, profile your program to see where the performance
+ the performance bottlenecks are in their programs. Therefore, profile
+ your program to see where the performance
- -Therefore, profile your program to see where the performance + the performance bottlenecks are in their programs. Therefore, profile + your program to see where the performance bottlenecks are and concentrate on optimizing them.
-Erlang/OTP contains several tools to help finding bottlenecks.
+Erlang/OTP contains several tools to help finding bottlenecks:
+ +If the program is too large to be profiled by
If the program is too big to be profiled by
The tools are further described in
+
-If you have a big system it might be interesting to run profiling
+For a large system, it can be interesting to run profiling
 on a simulated and limited scenario to start with. But bottlenecks
- have a tendency to only appear or cause problems when
- there are many things going on at the same time, and when there
- are many nodes involved. Therefore it is desirable to also run
+ have a tendency to appear or cause problems only when
+ many things are going on at the same time, and when
+ many nodes are involved. Therefore, it is also desirable to run
 profiling in a system test plant on a real target system.
-When your system is big you do not want to run the profiling
- tools on the whole system. You want to concentrate on processes
- and modules that you know are central and stand for a big part of the
- execution.
+For a large system, you do not want to run the profiling
+ tools on the whole system. Instead you want to concentrate on
+ central processes and modules, which account for a big part
+ of the execution.
-When analyzing the result file from the profiling activity
- you should look for functions that are called many
- times and have a long "own" execution time (time excluding calls
- to other functions). Functions that just are called very
- many times can also be interesting, as even small things can add
- up to quite a bit if they are repeated often. Then you need to
- ask yourself what can I do to reduce this time. Appropriate
- types of questions to ask yourself are:

+When analyzing the result file from the profiling activity,
+ look for functions that are called many
+ times and have a long "own" execution time (time excluding calls
+ to other functions). Functions that are called a lot of
+ times can also be interesting, as even small things can add
+ up to quite a bit if repeated often. Also
+ ask yourself what you can do to reduce this time. The following
+ are appropriate types of questions to ask yourself:
- need to do some benchmarks to back up your theory, to avoid
- making things slower if your theory is wrong. See
These questions are not always trivial to answer. Some
+ benchmarks might be needed to back up your theory and to avoid
+ making things slower if your theory is wrong. For details, see
+
-
-
-
The primary use of
Clearly, this information can be used to determine what
+ code is run very frequently and can therefore be subject for
+ optimization. Using
Benchmarks can measure wall-clock time or CPU time.
-
+
It is probably a good idea to do both wall-clock measurements and CPU time measurements.
-Some additional advice:
+Some final advice:
 Every example using Ets has a corresponding example in
- Mnesia. In general all Ets examples also apply to Dets tables.
+ Mnesia. In general, all Ets examples also apply to Dets tables.
+ Mnesia. In general, all Ets examples also apply to Dets tables.Select/Match operations on Ets and Mnesia tables can become +
Select/match operations on Ets and Mnesia tables can become
very expensive operations. They usually need to scan the complete
- table. You should try to structure your
- data so that you minimize the need for select/match
- operations. However, if you really need a select/match operation,
- it will still be more efficient than using
There are exceptions when the complete table is not
- scanned, for instance if part of the key is bound when searching an
-
-When creating a record to be used in a select/match operation you
- want most of the fields to have the value '_'. The easiest and fastest way
- to do that is as follows:

+ table. Try to structure the data to minimize the need for select/match
+ operations. However, if you require a select/match operation,
+ it is still more efficient than using

+In some circumstances, the select/match operations do not need
+ to scan the complete table.
+ For example, if part of the key is bound when searching an
+
When creating a record to be used in a select/match operation, you + want most of the fields to have the value "_". The easiest and + fastest way to do that is as follows:
#person{age = 42, _ = '_'}.
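Such a record pattern can then be passed directly to a match
operation (a sketch; ets:match_object/2 is a standard Ets function):

    ets:match_object(Tab, #person{age = 42, _ = '_'})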
-The delete operation is considered
The
DO
... @@ -88,14 +86,16 @@ end,
Do not fetch data that you already have! Consider that you
- have a module that handles the abstract data type Person. You
- export the interface function
Do not fetch data that you already have.
+Consider that you have a module that handles the abstract data
+ type
If the functions
If the function
For non-persistent database storage, prefer Ets tables over
- Mnesia local_content tables. Even the Mnesia
Assume we have an Ets-table, which uses
Assuming an Ets table that uses
[#person{idno = 1, name = "Adam", age = 31, occupation = "mailman"}, #person{idno = 2, name = "Bryan", age = 31, occupation = "cashier"}, #person{idno = 3, name = "Bryan", age = 35, occupation = "banker"}, #person{idno = 4, name = "Carl", age = 25, occupation = "mailman"}]-
If we must return all data stored in the Ets-table we
- can use
If you must return all data stored in the Ets table, you
+ can use
DO
... @@ -192,8 +192,8 @@ ets:select(Tab,[{ #person{idno='_', TabList = ets:tab2list(Tab), lists:map(fun(X) -> X#person.age end, TabList), ...-
-If we are only interested in the age of all persons named
- Bryan, we should:
+If you are only interested in the age of all persons named + "Bryan", then:
DO
... @@ -224,8 +224,8 @@ BryanList = lists:filter(fun(X) -> X#person.name == "Bryan" end, TabList), lists:map(fun(X) -> X#person.age end, BryanList), ...-
-If we need all information stored in the Ets table about
- persons named Bryan we should:
+If you need all information stored in the Ets table about + persons named "Bryan", then:
DO
... @@ -243,60 +243,60 @@ lists:filter(fun(X) -> X#person.name == "Bryan" end, TabList),
If the data in the table should be accessed so that the order +
If the data in the table is to be accessed so that the order
of the keys in the table is significant, the table type
-
An
-An Ets table is a single key table (either a hash table or a
- tree ordered by the key) and should be used as one. In other
An Ets table is a single-key table (either a hash table or a
+ tree ordered by the key) and is to be used as one. In other
words, use the key to look up things whenever possible. A
- lookup by a known key in a set Ets table is constant and for a
- ordered_set Ets table it is O(logN). A key lookup is always
+ lookup by a known key in a
A simple solution would be to use the
 An index table for the table in the previous examples would
- have to be a bag (as keys would appear more than once) and could
+ have to be a bag (as keys would appear more than once) and can
 have the following contents:
- [#index_entry{name="Adam", idno=1}, #index_entry{name="Bryan", idno=2}, #index_entry{name="Bryan", idno=3}, #index_entry{name="Carl", idno=4}]-
Given this index table a lookup of the
Given this index table, a lookup of the
... MatchingIDs = ets:lookup(IndexTable,"Bryan"), @@ -306,30 +306,31 @@ lists:map(fun(#index_entry{idno = ID}) -> end, MatchingIDs), ...-
Note that the code above never uses
Notice that this code never uses
 Keeping an index table introduces some overhead when
- inserting records in the table, therefore the number of operations
- gained from the table has to be weighted against the number of
- operations inserting objects in the table. However, note that the gain when
- the key can be used to lookup elements is significant.
+ inserting records in the table. The number of operations gained
+ from the table must therefore be compared against the number of
+ operations inserting objects in the table. However, notice that the
+ gain is significant when the key can be used to lookup elements.

 If you frequently do a lookup on a field that is not the
- key of the table, you will lose performance using
- "mnesia:select/match_object" as this function will traverse the
- whole table. You may create a secondary index instead and
+ key of the table, you lose performance using
+ "mnesia:select/match_object" as this function traverses the
+ whole table. You can create a secondary index instead and
 use "mnesia:index_read" to get faster access, however this
- will require more memory. Example:
+ requires more memory.

+Example
-record(person, {idno, name, age, occupation}).
        ...

@@ -347,14 +348,15 @@ PersonsAge42 =

Transactions

-Transactions is a way to guarantee that the distributed
+Using transactions is a way to guarantee that the distributed
 Mnesia database remains consistent, even when many different
- processes update it in parallel. However if you have
- real time requirements it is recommended to use dirty
- operations instead of transactions. When using the dirty
- operations you lose the consistency guarantee, this is usually
+ processes update it in parallel. However, if you have
+ real-time requirements it is recommended to use dirty
+ operations instead of transactions. When using dirty
+ operations, you lose the consistency guarantee; this is usually
 solved by only letting one process update the table. Other
- processes have to send update requests to that process.
+ processes must send update requests to that process.

Example
        ... % Using transaction
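The transaction example is truncated here; a minimal sketch of the two
styles this section contrasts (assuming the person record above;
mnesia:transaction/1, mnesia:write/1, and mnesia:dirty_write/1 are
standard Mnesia API):

%% Consistent, but slower: run the update inside a transaction.
update_tx(Person) ->
    mnesia:transaction(fun() -> mnesia:write(Person) end).

%% Faster, but without the consistency guarantee: a dirty update,
%% usually safe only if a single process writes to the table.
update_dirty(Person) ->
    mnesia:dirty_write(Person).

-- 
cgit v1.2.3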