1999-2013
Ericsson AB. All Rights Reserved.
The contents of this file are subject to the Erlang Public License,
Version 1.1, (the "License"); you may not use this file except in
compliance with the License. You should have received a copy of the
Erlang Public License along with this software. If not, it can be
retrieved online at http://www.erlang.org/.
Software distributed under the License is distributed on an "AS IS"
basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
the License for the specific language governing rights and limitations
under the License.
How to interpret the Erlang crash dumps
Patrik Nyblom
1999-11-11
PA1
crash_dump.xml
This document describes the file generated
upon abnormal exit of the Erlang runtime system.
Important: For OTP release R9C the Erlang crash dump has
had a major facelift. This means that the information in this
document will not be directly applicable for older dumps. However,
if you use the Crashdump Viewer tool on older dumps, the crash
dumps are translated into a format similar to this.
The system writes the crash dump to the current directory of
the emulator, or to the file named by the environment variable
ERL_CRASH_DUMP (how environment variables are set depends on the
operating system). For a crash dump to be written, a writable
file system must be mounted.
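The lookup rule above can be sketched as follows. This is a minimal Python sketch and not the emulator's actual logic. The function name is made up for illustration, and the default file name `erl_crash.dump` is the conventional one, which this section does not state.

```python
import os

# Assumption: conventional default file name; it is not stated in this section.
DEFAULT_DUMP_NAME = "erl_crash.dump"

def expected_dump_path(environ=None, cwd=None):
    """Return where a crash dump would be expected for the given environment."""
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    override = environ.get("ERL_CRASH_DUMP")
    if override:
        # The environment variable names the dump file directly.
        return override
    # Otherwise the dump goes to the emulator's current directory.
    return os.path.join(cwd, DEFAULT_DUMP_NAME)
```

Calling `expected_dump_path()` with no arguments uses the environment and working directory of the Python process, which only matches the emulator's if both were started the same way.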
Crash dumps are written mainly for one of two reasons: either the
builtin function erlang:halt/1 is called explicitly with a
string argument from running Erlang code, or the runtime
system has detected an error that it cannot handle. Most often
the system cannot handle the error because of an external
limitation, such as running out of memory. A
crash dump due to an internal error may be caused by the system
reaching limits in the emulator itself (like the number of atoms
in the system, or too many simultaneous ets tables). Usually the
emulator or the operating system can be reconfigured to avoid the
crash, which is why interpreting the crash dump correctly is
important.
The Erlang crash dump is a readable text file, but it might not be
very easy to read. Using the Crashdump Viewer tool in the
observer application simplifies the task. This is an
HTML based tool for browsing Erlang crash dumps.
General information
The first part of the dump shows the creation time for the dump,
a slogan indicating the reason for the dump, the system version
of the node from which the dump originates, the compile time of
the emulator running the originating node, and the number of
atoms in the atom table.
Reasons for crash dumps (slogan)
The reason for the dump is noted in the beginning of the file
as Slogan: <reason> (the word "slogan" has historical
roots). If the system is halted by the BIF
erlang:halt/1, the slogan is the string parameter
passed to the BIF; otherwise it is a description generated by
the emulator or the (Erlang) kernel. Normally the message
should be enough to understand the problem, but
some messages are nevertheless described here. Note, however, that the
reasons given for each crash are only suggestions. The exact reasons for the errors may vary
depending on the local applications and the underlying
operating system.
- "<A>: Cannot allocate <N>
bytes of memory (of type "<T>")." - The system
has run out of memory. <A> is the allocator that failed
to allocate memory, <N> is the number of bytes that
<A> tried to allocate, and <T> is the memory block
type that the memory was needed for. The most common case is
that a process stores huge amounts of data. In this case
<T> is most often heap, old_heap,
heap_frag, or binary. For more information on
allocators see
erts_alloc(3).
- "<A>: Cannot reallocate <N>
bytes of memory (of type "<T>")." - Same as
above with the exception that memory was being reallocated
instead of being allocated when the system ran out of memory.
- "Unexpected op code N" - Error in compiled
code, file damaged or error in the compiler.
- "Module Name undefined" "Function
Name undefined" "No function
Name:Name/1" "No function
Name:start/2" - The kernel/stdlib applications are
damaged or the start script is damaged.
- "Driver_select called with too large file descriptor
" - The number of file descriptors for sockets
exceeds 1024 (Unix only). The limit on file descriptors in
some Unix flavors can be set to over 1024, but only 1024
sockets/pipes can be used simultaneously by Erlang (due to
limitations in the Unix select call). The number of
open regular files is not affected by this.
- "Received SIGUSR1" - The SIGUSR1 signal was sent to the
Erlang machine (Unix only).
- "Kernel pid terminated (Who)
(Exit-reason)" - The kernel supervisor has detected
a failure, usually that the application_controller
has shut down (Who = application_controller,
Exit-reason = shutdown). The application controller
may have shut down for a number of reasons, the most usual
being that the node name of the distributed Erlang node is
already in use. A complete supervisor tree "crash" (i.e.,
the top supervisors have exited) will give about the same
result. This message comes from the Erlang code and not from
the virtual machine itself. It is always due to some kind of
failure in an application, either within OTP or a
"user-written" one. Looking at the error log for your
application is probably the first step to take.
- "Init terminating in do_boot ()" - The primitive Erlang boot
sequence was terminated, most probably because the boot
script has errors or cannot be read. This is usually a
configuration error - the system may have been started with
a faulty -boot parameter or with a boot script from
the wrong version of OTP.
- "Could not start kernel pid (Who) ()" - One of the
kernel processes could not start. This is probably due to
faulty arguments (like errors in a -config argument)
or faulty configuration files. Check that all files are in
their correct location and that the configuration files (if
any) are not damaged. Usually there are also messages
written to the controlling terminal and/or the error log
explaining what's wrong.
Errors other than the ones mentioned above may occur, as the
erlang:halt/1 BIF may generate any message. If the
message is not generated by the BIF and does not occur in the
list above, it may be due to an error in the emulator. There
may, however, be unusual messages not mentioned here that
are still connected to an application failure. There is a lot
more information available, so more thorough reading of the
crash dump may reveal the crash reason. The size of processes,
the number of ets tables and the Erlang data on each process
stack can be useful for tracking down the problem.
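As a starting point for triage, the slogan can be read from the dump and matched against the messages listed above. The following is a minimal Python sketch. The function names and the hint strings are illustrative, and the patterns cover only the slogans described in this section.

```python
import re

# Patterns follow the slogans described above; the hint texts are illustrative.
SLOGAN_HINTS = [
    (re.compile(r"Cannot (re)?allocate \d+ bytes of memory"), "out of memory"),
    (re.compile(r"Unexpected op code"), "damaged or miscompiled code"),
    (re.compile(r"Driver_select called with too large file descriptor"),
     "too many sockets/pipes"),
    (re.compile(r"Received SIGUSR1"), "SIGUSR1 signal received"),
    (re.compile(r"Kernel pid terminated"), "application failure; check the error log"),
    (re.compile(r"Init terminating in do_boot"), "boot script or start parameter problem"),
    (re.compile(r"Could not start kernel pid"), "faulty arguments or configuration files"),
]

def read_slogan(dump_text):
    """Return the text after 'Slogan: ' on the first such line, or None."""
    for line in dump_text.splitlines():
        if line.startswith("Slogan: "):
            return line[len("Slogan: "):].strip()
    return None

def hint_for(slogan):
    """Map a slogan to a rough hint; unknown slogans get a generic answer."""
    for pattern, hint in SLOGAN_HINTS:
        if pattern.search(slogan):
            return hint
    return "not in the list; possibly a halt string or an emulator error"
```

The hints are only as reliable as the suggestions in the list above; the rest of the dump still has to be read to confirm the cause.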
Number of atoms
The number of atoms in the system at the time of the crash is
shown as Atoms: <number>. A few tens of thousands of atoms is
perfectly normal, but more could indicate that the BIF
erlang:list_to_atom/1 is used to dynamically generate a
lot of different atoms, which is never a good idea.
Memory information
Under the tag =memory you will find information similar
to what you can obtain on a living node with
erlang:memory().
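A small sketch of how the =memory section could be read into a dictionary for comparison with erlang:memory() output. It assumes, without confirmation from this document, that the section consists of `name: value` lines with integer values and ends at the next line starting with `=`. The function name is hypothetical.

```python
def parse_memory_section(dump_text):
    """Collect 'name: value' lines that follow the =memory tag into a dict of ints."""
    memory = {}
    in_section = False
    for line in dump_text.splitlines():
        if line.startswith("="):
            # Every tag line starts a new section; only =memory is of interest here.
            in_section = (line.strip() == "=memory")
            continue
        if in_section and ":" in line:
            name, _, value = line.partition(":")
            value = value.strip()
            if value.isdigit():
                memory[name.strip()] = int(value)
    return memory
```

Lines whose value is not a plain integer are skipped rather than guessed at.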
Internal table information
The tags =hash_table:<table_name> and
=index_table:<table_name> present internal
tables. These are mostly of interest for runtime system
developers.
Allocated areas
Under the tag =allocated_areas you will find information
similar to what you can obtain on a living node with
erlang:system_info(allocated_areas).
Allocator
Under the tag =allocator:<A> you will find
various information about allocator <A>. The information
is similar to what you can obtain on a living node with
erlang:system_info({allocator, <A>}).
For more information see the documentation of
erlang:system_info({allocator, <A>}),
and the
erts_alloc(3)
documentation.
Process information
The Erlang crash dump contains a listing of each living Erlang
process in the system. The process information for one process
may look like this (line numbers have been added):
The following fields can exist for a process:
=proc:<pid>
- Heading, states the process identifier
State
-
The state of the process. This can be one of the following:
- Scheduled - The process was scheduled to run
but was not currently running ("in the run queue").
- Waiting - The process was waiting for
something (in receive).
- Running - The process was currently
running. If the BIF erlang:halt/1 was called, this was
the process calling it.
- Exiting - The process was on its way to
exit.
- Garbing - This is bad luck: the process was
garbage collecting when the crash dump was written, so the rest
of the information for this process is limited.
- Suspended - The process is suspended, either
by the BIF erlang:suspend_process/1 or because it is
trying to write to a busy port.
Registered name
- The registered name of the process, if any.
Spawned as
- The entry point of the process, i.e., what function was
referenced in the spawn or spawn_link call that
started the process.
Last scheduled in for | Current call
- The current function of the process. These fields will not
always exist.
Spawned by
- The parent of the process, i.e. the process which executed
spawn or spawn_link.
Started
- The date and time when the process was started.
Message queue length
- The number of messages in the process' message queue.
Number of heap fragments
- The number of allocated heap fragments.
Heap fragment data
- Size of fragmented heap data. This is data either created by
messages being sent to the process or by the Erlang BIFs. This
amount depends on so many things that this field is utterly
uninteresting.
Link list
- Process IDs of processes linked to this one. May also contain
ports. If process monitoring is used, this field also tells in
which direction the monitoring is in effect, i.e., a link
being "to" a process tells you that the "current" process was
monitoring the other and a link "from" a process tells you
that the other process was monitoring the current one.
Reductions
- The number of reductions consumed by the process.
Stack+heap
- The size of the stack and heap (they share a memory segment).
OldHeap
- The size of the "old heap". The Erlang virtual machine uses
generational garbage collection with two generations. There is
one heap for new data items and one for the data that has
survived two garbage collections. The assumption (which is
almost always correct) is that data that survives two garbage
collections can be "tenured" to a heap that is garbage collected
less often, as it will live for a long period. This is a
quite usual technique in virtual machines. The sum of the
heaps and stack together constitutes most of the process's
allocated memory.
Heap unused, OldHeap unused
- The amount of unused memory on each heap. This information is
usually useless.
Stack
- If the system uses a shared heap, the fields
Stack+heap, OldHeap, Heap unused
and OldHeap unused do not exist. Instead this field
presents the size of the process' stack.
Program counter
- The current instruction pointer. This is only interesting for
runtime system developers. The function into which the program
counter points is the current function of the process.
CP
- The continuation pointer, i.e. the return address for the
current call. Usually useless for other than runtime system
developers. This may be followed by the function into which
the CP points, which is the function calling the current
function.
Arity
- The number of live argument registers. The argument registers,
if any are live, will follow. These may contain the arguments
of the function if they are not yet moved to the stack.
See also the section about process data.
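The per-process fields described above lend themselves to simple scripted checks, for example finding the processes with the longest message queues. The sketch below is in Python and assumes each field appears as a `Field: value` line under its =proc:<pid> heading. The function names are hypothetical.

```python
def parse_processes(dump_text):
    """Return {pid: {field: value}} for every =proc:<pid> section."""
    processes = {}
    current = None
    for line in dump_text.splitlines():
        if line.startswith("="):
            tag, _, ident = line[1:].partition(":")
            # Only =proc sections are collected; =proc_stack etc. are skipped.
            current = processes.setdefault(ident.strip(), {}) if tag == "proc" else None
            continue
        if current is not None and ": " in line:
            field, _, value = line.partition(": ")
            current[field.strip()] = value.strip()
    return processes

def longest_queues(processes, top=5):
    """Return the (pid, fields) pairs with the largest 'Message queue length'."""
    def queue_len(item):
        value = item[1].get("Message queue length", "0")
        return int(value) if value.isdigit() else 0
    return sorted(processes.items(), key=queue_len, reverse=True)[:top]
```

The same dictionary can be sorted on Reductions or Stack+heap instead, since all values are kept as the strings found in the dump.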
Port information
This section lists the open ports, their owners, any linked
processes, and the name of their driver or external process.
ETS tables
This section contains information about all the ETS tables in
the system. The following fields are interesting for each table:
=ets:<owner>
- Heading, states the owner of the table (a process identifier)
Table
- The identifier for the table. If the table is a
named_table, this is the name.
Name
- The name of the table, regardless of whether it is a
named_table or not.
Buckets
- This occurs if the table is a hash table, i.e. if it is not an
ordered_set.
Ordered set (AVL tree), Elements
- This occurs only if the table is an ordered_set. (The
number of elements is the same as the number of objects in the
table.)
Objects
- The number of objects in the table
Words
- The number of words (4 bytes/word on 32-bit systems, 8 bytes/word
on 64-bit systems) allocated to data in the table.
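Since the Words field is given in machine words, converting it to bytes requires the word size of the system that produced the dump. The Python sketch below lists tables by approximate data size. It assumes `Field: value` lines under each =ets:<owner> heading, and the word size is a parameter the caller must supply. The function name is hypothetical.

```python
def ets_usage(dump_text, word_size=4):
    """Return [(owner, table_name, bytes)] sorted by size, largest first."""
    tables = []
    current = None
    for line in dump_text.splitlines():
        if line.startswith("="):
            tag, _, ident = line[1:].partition(":")
            if tag == "ets":
                # One owner may have several tables, so each heading is a new entry.
                current = {"owner": ident.strip()}
                tables.append(current)
            else:
                current = None
            continue
        if current is not None and ": " in line:
            field, _, value = line.partition(": ")
            current[field.strip()] = value.strip()
    usage = []
    for table in tables:
        words = table.get("Words", "0")
        size = int(words) * word_size if words.isdigit() else 0
        usage.append((table["owner"], table.get("Name", "?"), size))
    return sorted(usage, key=lambda row: row[2], reverse=True)
```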
Timers
This section contains information about all the timers started
with the BIFs erlang:send_after/3 and
erlang:start_timer/3. The following fields exist for each
timer:
=timer:<owner>
- Heading, states the owner of the timer (a process identifier),
i.e. the process to receive the message when the timer
expires.
Message
- The message to be sent.
Time left
- Number of milliseconds left until the message would have been
sent.
Distribution information
If the Erlang node was alive, i.e., set up for communicating
with other nodes, this section lists the connections that were
active. The following fields can exist:
=node:<node_name>
- The name of the node
no_distribution
- This will only occur if the node was not distributed.
=visible_node:<channel>
- Heading for a visible node, i.e. an alive node with a
connection to the node that crashed. States the channel number
for the node.
=hidden_node:<channel>
- Heading for a hidden node. A hidden node is the same as a
visible node, except that it is started with the "-hidden"
flag. States the channel number for the node.
=not_connected:<channel>
- Heading for a node which has been connected to the crashed
node earlier. References (i.e. process or port identifiers)
to the not connected node existed at the time of the crash.
States the channel number for the node.
Name
- The name of the remote node.
Controller
- The port which controls the communication with the remote node.
Creation
- An integer (1-3) which together with the node name identifies
a specific instance of the node.
Remote monitoring: <local_proc> <remote_proc>
- The local process was monitoring the remote process at the
time of the crash.
Remotely monitored by: <local_proc> <remote_proc>
- The remote process was monitoring the local process at the
time of the crash.
Remote link: <local_proc> <remote_proc>
- A link existed between the local process and the remote
process at the time of the crash.
Loaded module information
This section contains information about all loaded modules.
First, the memory usage by loaded code is summarized. There is
one field for "Current code", which is the code of the
latest version of the modules. There is also a field for "Old
code", which is code for which a newer version exists in the
system, but where the old version is not yet purged. The memory usage
is in bytes.
All loaded modules are then listed. The following fields exist:
=mod:<module_name>
- Heading, and the name of the module.
Current size
- Memory usage for the loaded code in bytes
Old size
- Memory usage for the old code, if any.
Current attributes
- Module attributes for the current code. This field is decoded
when looked at by the Crashdump Viewer tool.
Old attributes
- Module attributes for the old code, if any. This field is
decoded when looked at by the Crashdump Viewer tool.
Current compilation info
- Compilation information (options) for the current code. This
field is decoded when looked at by the Crashdump Viewer tool.
Old compilation info
- Compilation information (options) for the old code, if
any. This field is decoded when looked at by the Crashdump
Viewer tool.
Fun information
In this section, all funs are listed. The following fields exist
for each fun:
=fun
- Heading
Module
- The name of the module where the fun was defined.
Uniq, Index
- Identifiers
Address
- The address of the fun's code.
Native_address
- The address of the fun's code when HiPE is enabled.
Refc
- The number of references to the fun.
Process Data
For each process there will be at least one =proc_stack
and one =proc_heap tag followed by the raw memory
information for the stack and heap of the process.
For each process there will also be a =proc_messages
tag if the process' message queue is non-empty and a
=proc_dictionary tag if the process' dictionary (the
put/2 and get/1 thing) is non-empty.
The raw memory information can be decoded by the Crashdump
Viewer tool. You will then be able to see the stack dump, the
message queue (if any) and the dictionary (if any).
The stack dump is a dump of the Erlang process stack. Most of
the live data (i.e., variables currently in use) are placed on
the stack; thus this can be quite interesting. One has to
"guess" what's what, but as the information is symbolic,
thorough reading of this information can be very useful. As an
example we can find the state variable of the Erlang primitive
loader on line (5) in the example below:
)
(2) y(0) ["/view/siri_r10_dev/clearcase/otp/erts/lib/kernel/ebin","/view/siri_r10_dev/
(3) clearcase/otp/erts/lib/stdlib/ebin"]
(4) y(1) <0.1.0>
(5) y(2) {state,[],none,#Fun,undefined,#Fun,#Fun,#Port<0.2>,infinity,#Fun}
(6) y(3) infinity
When interpreting the data for a process, it is helpful to know
that anonymous function objects (funs) are given a name
constructed from the name of the function in which they are
created, and a number (starting with 0) indicating the number of
that fun within that function.
Atoms
This section lists all the atoms in the system. This is only
interesting if one suspects that dynamic generation of atoms could
be a problem; otherwise this section can be ignored.
Note that the last created atom is printed first.
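When dynamic atom generation is suspected, grouping the listed atoms by shape can show which names are being generated. The Python sketch below replaces digit runs with `#` and counts how many atoms share each resulting pattern. It assumes the section is introduced by an `=atoms` tag with one atom per line, which this document does not confirm. The function name and the threshold are illustrative.

```python
import re
from collections import Counter

def atom_families(dump_text, min_count=3):
    """Group atoms under =atoms by their name with digit runs replaced by '#'."""
    families = Counter()
    in_atoms = False
    for line in dump_text.splitlines():
        if line.startswith("="):
            in_atoms = (line.strip() == "=atoms")
            continue
        if in_atoms and line.strip():
            families[re.sub(r"\d+", "#", line.strip())] += 1
    # Keep only patterns shared by at least min_count atoms, most frequent first.
    return [(name, n) for name, n in families.most_common() if n >= min_count]
```

A pattern such as `worker_#` with a very high count would point at code that builds atoms from numbers at run time.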
Disclaimer
The format of the crash dump evolves between releases of
OTP. Some information here may not apply to your
version. A description such as this will never be complete; it is meant as
an explanation of the crash dump in general and as a help
when trying to find application errors, not as a complete
specification.