aboutsummaryrefslogtreecommitdiffstats
path: root/system/doc/efficiency_guide/profiling.xml
blob: 5df12eefe0907ac0b877c6ca47fa0f97003f3c58 (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE chapter SYSTEM "chapter.dtd">

<chapter>
  <header>
    <copyright>
      <year>2001</year><year>2013</year>
      <holder>Ericsson AB. All Rights Reserved.</holder>
    </copyright>
    <legalnotice>
      The contents of this file are subject to the Erlang Public License,
      Version 1.1, (the "License"); you may not use this file except in
      compliance with the License. You should have received a copy of the
      Erlang Public License along with this software. If not, it can be
      retrieved online at http://www.erlang.org/.
    
      Software distributed under the License is distributed on an "AS IS"
      basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
      the License for the specific language governing rights and limitations
      under the License.
    
    </legalnotice>

    <title>Profiling</title>
    <prepared>Ingela Anderton</prepared>
    <docno></docno>
    <date>2001-11-02</date>
    <rev></rev>
    <file>profiling.xml</file>
  </header>

  <section>
    <title>Do Not Guess About Performance - Profile</title>

    <p>Even experienced software developers often guess wrong about where
    the performance bottlenecks are in their programs. Therefore, profile
    your program to see where the performance
    bottlenecks are and concentrate on optimizing them.</p>

    <p>Erlang/OTP contains several tools to help finding bottlenecks:</p>

    <list type="bulleted">
      <item><c>fprof</c> provides the most detailed information about
      where the program time is spent, but it significantly slows down the
      program it profiles.</item>

      <item><p><c>eprof</c> provides time information of each function
      used in the program. No call graph is produced, but <c>eprof</c> has
      considerable less impact on the program it profiles.</p>
      <p>If the program is too large to be profiled by <c>fprof</c> or
      <c>eprof</c>, the <c>cover</c> and <c>cprof</c> tools can be used
      to locate code parts that are to be more thoroughly profiled using
      <c>fprof</c> or <c>eprof</c>.</p></item>

      <item><c>cover</c> provides execution counts per line per
      process, with less overhead than <c>fprof</c>. Execution counts
      can, with some caution, be used to locate potential performance
      bottlenecks.</item>

      <item><c>cprof</c> is the most lightweight tool, but it only
      provides execution counts on a function basis (for all processes,
      not per process).</item>
    </list>

    <p>The tools are further described in
    <seealso marker="#profiling_tools">Tools</seealso>.</p>
  </section>

  <section>
    <title>Large Systems</title>
    <p>For a large system, it can be interesting to run profiling
      on a simulated and limited scenario to start with. But bottlenecks
      have a tendency to appear or cause problems only when
      many things are going on at the same time, and when
      many nodes are involved. Therefore, it is also desirable to run
      profiling in a system test plant on a real target system.</p>

    <p>For a large system, you do not want to run the profiling
      tools on the whole system. Instead you want to concentrate on
      central processes and modules, which contribute for a big part
      of the execution.</p>
  </section>

  <section>
    <title>What to Look For</title>
    <p>When analyzing the result file from the profiling activity,
      look for functions that are called many
      times and have a long "own" execution time (time excluding calls
      to other functions). Functions that are called a lot of
      times can also be interesting, as even small things can add
      up to quite a bit if repeated often. Also
      ask yourself what you can do to reduce this time. The following
      are appropriate types of questions to ask yourself:</p>

    <list type="bulleted">
      <item>Is it possible to reduce the number of times the function
      is called?</item>
      <item>Can any test be run less often if the order of tests is
      changed?</item>
      <item>Can any redundant tests be removed?</item>
      <item>Does any calculated expression give the same result
       each time?</item>
      <item>Are there other ways to do this that are equivalent and
       more efficient?</item>
      <item>Can another internal data representation be used to make
       things more efficient?</item>
    </list>

    <p>These questions are not always trivial to answer. Some
    benchmarks might be needed to back up your theory and to avoid
      making things slower if your theory is wrong. For details, see
    <seealso marker="#benchmark">Benchmarking</seealso>.</p>
  </section>

  <section>
    <title>Tools</title>
    <marker id="profiling_tools"></marker>
    <section>
      <title>fprof</title>
      <p><c>fprof</c> measures the execution time for each function,
      both own time, that is, how much time a function has used for its
      own execution, and accumulated time, that is, including called
      functions. The values are displayed per process. You also get
      to know how many times each function has been called.</p>

      <p><c>fprof</c> is based on trace to file to minimize runtime
      performance impact. Using <c>fprof</c> is just a matter of
      calling a few library functions, see the
      <seealso marker="tools:fprof">fprof</seealso> manual page in
      <c>tools</c> .<c>fprof</c> was introduced in R8.</p>
    </section>

    <section>
      <title>eprof</title>
      <p><c>eprof</c> is based on the Erlang <c>trace_info</c> BIFs.
      <c>eprof</c> shows how much time has been used by each process,
      and in which function calls this time has been spent. Time is
      shown as percentage of total time and absolute time. For more
      information, see the <seealso marker="tools:eprof">eprof</seealso>
      manual page in <c>tools</c>.</p>
    </section>

    <section>
      <title>cover</title>
      <p>The primary use of <c>cover</c> is coverage analysis to verify
      test cases, making sure that all relevant code is covered.
      <c>cover</c> counts how many times each executable line of code
      is executed when a program is run, on a per module basis.</p>
      <p>Clearly, this information can be used to determine what
      code is run very frequently and can therefore be subject for
      optimization. Using <c>cover</c> is just a matter of calling a
      few library functions, see the
      <seealso marker="tools:cover">cover</seealso> manual page in
      <c>tools</c>.</p>
    </section>

    <section>
      <title>cprof</title>
      <p><c>cprof</c> is something in between <c>fprof</c> and
      <c>cover</c> regarding features. It counts how many times each
      function is called when the program is run, on a per module
      basis. <c>cprof</c> has a low performance degradation effect
      (compared with <c>fprof</c>) and does not need to recompile
      any modules to profile (compared with <c>cover</c>).
      For more information, see the
      <seealso marker="tools:cprof">cprof</seealso> manual page in
      <c>tools</c>.</p>
    </section>

    <section>
      <title>Tool Summary</title>
      <table>
        <row>
          <cell><em>Tool</em></cell>
          <cell><em>Results</em></cell>
          <cell><em>Size of Result</em></cell>
          <cell><em>Effects on Program Execution Time</em></cell>
          <cell><em>Records Number of Calls</em></cell>
          <cell><em>Records Execution Time</em></cell>
          <cell><em>Records Called by</em></cell>
          <cell><em>Records Garbage Collection</em></cell>
        </row>
        <row>
          <cell><c>fprof</c></cell>
          <cell>Per process to screen/file</cell>
          <cell>Large</cell>
          <cell>Significant slowdown</cell>
          <cell>Yes</cell>
          <cell>Total and own</cell>
          <cell>Yes</cell>
          <cell>Yes</cell>
        </row>
        <row>
          <cell><c>eprof</c></cell>
          <cell>Per process/function to screen/file</cell>
          <cell>Medium</cell>
          <cell>Small slowdown</cell>
          <cell>Yes</cell>
          <cell>Only total</cell>
          <cell>No</cell>
          <cell>No</cell>
        </row>
        <row>
          <cell><c>cover</c></cell>
          <cell>Per module to screen/file</cell>
          <cell>Small</cell>
          <cell>Moderate slowdown</cell>
          <cell>Yes, per line</cell>
          <cell>No</cell>
          <cell>No</cell>
          <cell>No</cell>
        </row>
        <row>
          <cell><c>cprof</c></cell>
          <cell>Per module to caller</cell>
          <cell>Small</cell>
          <cell>Small slowdown</cell>
          <cell>Yes</cell>
          <cell>No</cell>
          <cell>No</cell>
          <cell>No</cell>
        </row>
        <tcaption>Tool Summary</tcaption>
      </table>
    </section>
  </section>

  <section>
    <marker id="benchmark"></marker>
    <title>Benchmarking</title>

    <p>The main purpose of benchmarking is to find out which
    implementation of a given algorithm or function is the fastest.
    Benchmarking is far from an exact science. Today's operating systems
    generally run background tasks that are difficult to turn off.
    Caches and multiple CPU cores does not facilitate benchmarking.
    It would be best to run UNIX computers in single-user mode when
    benchmarking, but that is inconvenient to say the least for casual
    testing.</p>
    
    <p>Benchmarks can measure wall-clock time or CPU time.</p>

    <list type="bulleted">
    <item><seealso marker="stdlib:timer#tc/3">timer:tc/3</seealso> measures
    wall-clock time. The advantage with wall-clock time is that I/O,
    swapping, and other activities in the operating system kernel are
    included in the measurements. The disadvantage is that the
    measurements vary a lot. Usually it is best to run the
    benchmark several times and note the shortest time, which is to
    be the minimum time that is possible to achieve under the best of
    circumstances.</item>

    <item><seealso marker="erts:erlang#statistics/1">statistics/1</seealso>
    with argument <c>runtime</c> measures CPU time spent in the Erlang
    virtual machine. The advantage with CPU time is that the results are more
    consistent from run to run. The disadvantage is that the time
    spent in the operating system kernel (such as swapping and I/O)
    is not included. Therefore, measuring CPU time is misleading if
    any I/O (file or socket) is involved.</item>
    </list>

    <p>It is probably a good idea to do both wall-clock measurements and
    CPU time measurements.</p>

    <p>Some final advice:</p>

    <list type="bulleted">
    <item>The granularity of both measurement types can be high.
    Therefore, ensure that each individual measurement
    lasts for at least several seconds.</item>

    <item>To make the test fair, each new test run is to run in its own,
    newly created Erlang process. Otherwise, if all tests run in the
    same process, the later tests start out with larger heap sizes
    and therefore probably do fewer garbage collections.
    Also consider restarting the Erlang emulator between each test.</item>

    <item>Do not assume that the fastest implementation of a given algorithm
    on computer architecture X is also the fastest on computer architecture
    Y.</item>
    </list>
  </section>
</chapter>