2001-2013
Ericsson AB. All Rights Reserved.
The contents of this file are subject to the Erlang Public License,
Version 1.1, (the "License"); you may not use this file except in
compliance with the License. You should have received a copy of the
Erlang Public License along with this software. If not, it can be
retrieved online at http://www.erlang.org/.
Software distributed under the License is distributed on an "AS IS"
basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
the License for the specific language governing rights and limitations
under the License.
Profiling
Ingela Anderton
2001-11-02
profiling.xml
Do Not Guess About Performance - Profile
Even experienced software developers often guess wrong about where
the performance bottlenecks are in their programs. Therefore, profile
your program to see where the performance
bottlenecks are and concentrate on optimizing them.
Erlang/OTP contains several tools to help find bottlenecks:
- fprof provides the most detailed information about
where the program time is spent, but it significantly slows down the
program it profiles.
- eprof provides time information for each function
used in the program. No call graph is produced, but eprof has
considerably less impact on the program it profiles.
If the program is too large to be profiled by fprof or
eprof, the cover and cprof tools can be used
to locate the parts of the code that should be profiled more thoroughly with
fprof or eprof.
- cover provides execution counts per line per
process, with less overhead than fprof. Execution counts
can, with some caution, be used to locate potential performance
bottlenecks.
- cprof is the most lightweight tool, but it only
provides execution counts on a function basis (for all processes,
not per process).
The tools are further described in
Tools.
Large Systems
For a large system, it can be useful to start by profiling
a simulated and limited scenario. However, bottlenecks
tend to appear or cause problems only when
many things are going on at the same time and when
many nodes are involved. Therefore, it is also desirable to run
profiling in a system test plant on a real target system.
For a large system, you do not want to run the profiling
tools on the whole system. Instead, concentrate on the
central processes and modules that account for a large part
of the execution.
What to Look For
When analyzing the result file from the profiling activity,
look for functions that are called many
times and have a long "own" execution time (time excluding calls
to other functions). Functions that are simply called very
often can also be interesting, because even small costs add
up when repeated many times. Also
ask yourself what you can do to reduce this time. The following
are appropriate questions to ask:
- Is it possible to reduce the number of times the function
is called?
- Can any test be run less often if the order of tests is
changed?
- Can any redundant tests be removed?
- Does any calculated expression give the same result
each time?
- Are there other ways to do this that are equivalent and
more efficient?
- Can another internal data representation be used to make
things more efficient?
These questions are not always trivial to answer. Some
benchmarks might be needed to back up your theory and to avoid
making things slower if your theory is wrong. For details, see
Benchmarking.
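As an illustration of the question about expressions that give the same result each time, the following minimal sketch moves an invariant calculation out of a list comprehension. The module and function names are hypothetical, and whether the change pays off in a given program is an assumption that should be confirmed by measurement.

```erlang
-module(hoist_sketch).
-export([scale_slow/2, scale_fast/2]).

%% Recomputes math:pow(2, Bits) for every element of Xs,
%% although the value never changes inside the comprehension.
scale_slow(Xs, Bits) ->
    [X * math:pow(2, Bits) || X <- Xs].

%% Computes the invariant factor once, before the comprehension.
scale_fast(Xs, Bits) ->
    Factor = math:pow(2, Bits),
    [X * Factor || X <- Xs].
```

Both functions return the same list; only the number of calls to math:pow/2 differs.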
Tools
fprof
fprof measures the execution time for each function,
both own time, that is, how much time a function has used for its
own execution, and accumulated time, that is, including called
functions. The values are displayed per process. fprof also reports
how many times each function has been called.
fprof traces to a file to minimize the runtime
performance impact. Using fprof is just a matter of
calling a few library functions; see the
fprof manual page in
tools. fprof was introduced in R8.
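A typical fprof session might look like the following minimal sketch. The module name, the work/1 workload, and the output file name "fprof.analysis" are hypothetical; the fprof calls themselves are described in the fprof manual page.

```erlang
-module(fprof_sketch).
-export([run/0, work/1]).

%% Hypothetical workload: build and sort a list of N integers.
work(N) ->
    lists:sort([X rem 97 || X <- lists:seq(1, N)]).

run() ->
    %% Run the call under trace. By default the trace is written
    %% to the file "fprof.trace" in the current directory.
    _ = fprof:apply(?MODULE, work, [1000]),
    %% Read the trace file and build the profile data.
    ok = fprof:profile(),
    %% Write own time, accumulated time, and call counts to a file.
    ok = fprof:analyse([{dest, "fprof.analysis"}]),
    %% Stop the fprof server and discard its data.
    fprof:stop().
```

Calling fprof_sketch:run() leaves the analysis in "fprof.analysis", which can then be inspected for functions with a long own time.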
eprof
eprof is based on the Erlang trace_info BIFs.
eprof shows how much time has been used by each process,
and in which function calls this time has been spent. Time is
shown both as a percentage of total time and as absolute time. For more
information, see the eprof
manual page in tools.
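The following minimal sketch shows one way to profile a single fun with eprof. The module name and the sorting workload are hypothetical. Return values are deliberately not matched, because their exact form is an assumption that can differ between releases.

```erlang
-module(eprof_sketch).
-export([run/0]).

run() ->
    %% Start the eprof server; returns an error tuple if it already runs.
    _ = eprof:start(),
    %% Profile the evaluation of a fun (hypothetical workload).
    _ = eprof:profile(fun() -> lists:sort(lists:seq(10000, 1, -1)) end),
    %% Print the time spent per function to the shell.
    _ = eprof:analyze(),
    _ = eprof:stop(),
    ok.
```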
cover
The primary use of cover is coverage analysis to verify
test cases, making sure that all relevant code is covered.
cover counts how many times each executable line of code
is executed when a program is run, on a per module basis.
This information can be used to determine which
code is run very frequently and is therefore a candidate for
optimization. Using cover is just a matter of calling a
few library functions; see the
cover manual page in
tools.
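As a minimal sketch of using cover for execution counts, the following module writes a small source file, cover-compiles it, runs it, and asks for per-line call counts. The module cover_target, its sum/1 function, and the file name are hypothetical and exist only to keep the example self-contained. The sketch assumes that the current directory is writable and that the compiler application is available.

```erlang
-module(cover_sketch).
-export([run/0]).

run() ->
    %% Write a tiny hypothetical module to profile.
    Src = "cover_target.erl",
    ok = file:write_file(Src,
        <<"-module(cover_target).\n"
          "-export([sum/1]).\n"
          "sum(0) -> 0;\n"
          "sum(N) -> N + sum(N - 1).\n">>),
    _ = cover:start(),
    %% Recompile the module with cover instrumentation and load it.
    {ok, cover_target} = cover:compile_module(Src),
    10 = cover_target:sum(4),
    %% Execution counts per executable line: [{{Module, Line}, Calls}].
    {ok, Counts} = cover:analyse(cover_target, calls, line),
    _ = cover:stop(),
    Counts.
```

Lines with the highest counts in the returned list are the ones worth examining more closely with fprof or eprof.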
cprof
cprof falls between fprof and
cover in terms of features. It counts how many times each
function is called when the program is run, on a per module
basis. cprof has a low performance impact
(compared with fprof) and does not need to recompile
any modules in order to profile them (compared with cover).
For more information, see the
cprof manual page in
tools.
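A minimal cprof session could be sketched as follows. The module name cprof_sketch is hypothetical, and the call to calendar:day_of_the_week/3 merely stands in for the code to be profiled.

```erlang
-module(cprof_sketch).
-export([run/0]).

run() ->
    %% Start counting calls in all loaded modules.
    _ = cprof:start(),
    _ = calendar:day_of_the_week(1896, 4, 27),
    %% Stop counting, but keep the counters for analysis.
    _ = cprof:pause(),
    %% Total call count for the module, plus counts per function.
    {calendar, Total, PerFunction} = cprof:analyse(calendar),
    %% Remove the call counters again.
    _ = cprof:stop(),
    {Total, PerFunction}.
```

PerFunction is a list of {{Module, Function, Arity}, Count} tuples, which makes it easy to see which functions in the module are called most often.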
Tool Summary
Tool | Results | Size of Result | Effects on Program Execution Time | Records Number of Calls | Records Execution Time | Records Called by | Records Garbage Collection |
fprof | Per process to screen/file | Large | Significant slowdown | Yes | Total and own | Yes | Yes |
eprof | Per process/function to screen/file | Medium | Small slowdown | Yes | Only total | No | No |
cover | Per module to screen/file | Small | Moderate slowdown | Yes, per line | No | No | No |
cprof | Per module to caller | Small | Small slowdown | Yes | No | No | No |
Benchmarking
The main purpose of benchmarking is to find out which
implementation of a given algorithm or function is the fastest.
Benchmarking is far from an exact science. Today's operating systems
generally run background tasks that are difficult to turn off.
Caches and multiple CPU cores do not make benchmarking any easier.
It would be best to run UNIX computers in single-user mode when
benchmarking, but that is inconvenient to say the least for casual
testing.
Benchmarks can measure wall-clock time or CPU time.
- timer:tc/3 measures
wall-clock time. The advantage with wall-clock time is that I/O,
swapping, and other activities in the operating system kernel are
included in the measurements. The disadvantage is that the
measurements vary a lot. Usually it is best to run the
benchmark several times and note the shortest time, which
should be the minimum time achievable under the best of
circumstances.
- statistics/1
with argument runtime measures CPU time spent in the Erlang
virtual machine. The advantage with CPU time is that the results are more
consistent from run to run. The disadvantage is that the time
spent in the operating system kernel (such as swapping and I/O)
is not included. Therefore, measuring CPU time is misleading if
any I/O (file or socket) is involved.
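The two kinds of measurement can be combined as in the following minimal sketch. The module and function names are hypothetical. Note that statistics(runtime) reports CPU time for the whole runtime system, so the sketch assumes that nothing else of significance is running in the emulator during the measurement.

```erlang
-module(bench_sketch).
-export([measure/3, best_of/4]).

%% Returns {WallMicros, CpuMillis, Result} for one call of M:F(A...).
measure(M, F, A) ->
    %% The second element is the CPU time since the previous call,
    %% so this call resets that counter.
    {_, _} = statistics(runtime),
    %% timer:tc/3 returns wall-clock time in microseconds.
    {WallMicros, Result} = timer:tc(M, F, A),
    %% CPU time in milliseconds used since the reset above.
    {_, CpuMillis} = statistics(runtime),
    {WallMicros, CpuMillis, Result}.

%% Runs the call N times and returns the shortest wall-clock time
%% in microseconds.
best_of(N, M, F, A) when is_integer(N), N > 0 ->
    lists:min([element(1, timer:tc(M, F, A)) || _ <- lists:seq(1, N)]).
```

For example, bench_sketch:best_of(10, lists, seq, [1, 100000]) returns the shortest of ten wall-clock measurements.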
It is probably a good idea to do both wall-clock measurements and
CPU time measurements.
Some final advice:
- The granularity of both measurement types can be coarse.
Therefore, ensure that each individual measurement
lasts for at least several seconds.
- To make the test fair, run each new test in its own,
newly created Erlang process. Otherwise, if all tests run in the
same process, the later tests start out with larger heap sizes
and therefore probably do fewer garbage collections.
Also consider restarting the Erlang emulator between each test.
- Do not assume that the fastest implementation of a given algorithm
on computer architecture X is also the fastest on computer architecture
Y.
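The advice about running each test in a newly created process can be sketched as follows. The module and function names are hypothetical, and the sketch only measures wall-clock time with timer:tc/3.

```erlang
-module(bench_proc).
-export([timed_in_fresh_process/3]).

%% Runs M:F(A...) in a newly spawned process, so that the test starts
%% with a fresh heap, and returns {ok, Micros} or {error, Reason}.
timed_in_fresh_process(M, F, A) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() ->
        Parent ! {Ref, timer:tc(M, F, A)}
    end),
    receive
        {Ref, {Micros, _Result}} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(MRef, [flush]),
            {ok, Micros};
        {'DOWN', MRef, process, Pid, Reason} ->
            %% The benchmarked call crashed before reporting a time.
            {error, Reason}
    end.
```

Each candidate implementation can then be measured with a separate call, for example bench_proc:timed_in_fresh_process(lists, sort, [List]), where List is the test input.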