|
|
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE erlref SYSTEM "erlref.dtd">
<erlref>
<header>
<copyright>
<year>1996</year><year>2013</year>
<holder>Ericsson AB. All Rights Reserved.</holder>
</copyright>
<legalnotice>
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
</legalnotice>
<title>heart</title>
<prepared>Magnus Fröberg</prepared>
<docno></docno>
<date>1998-01-28</date>
<rev>A</rev>
</header>
<module>heart</module>
<modulesummary>Heartbeat Monitoring of an Erlang Runtime System</modulesummary>
<description>
<p>This modules contains the interface to the <c>heart</c> process.
<c>heart</c> sends periodic heartbeats to an external port
program, which is also named <c>heart</c>. The purpose of
the heart port program is to check that the Erlang runtime system
it is supervising is still running. If the port program has not
received any heartbeats within <c>HEART_BEAT_TIMEOUT</c> seconds
(default is 60 seconds), the system can be rebooted. Also, if
the system is equipped with a hardware watchdog timer and is
running Solaris, the watchdog can be used to supervise the entire
system.</p>
<p>An Erlang runtime system to be monitored by a heart program,
should be started with the command line flag <c>-heart</c> (see
also <seealso marker="erts:erl">erl(1)</seealso>). The <c>heart</c>
process is then started automatically:</p>
<pre>
% <input>erl -heart ...</input></pre>
<p>If the system should be rebooted because of missing heart-beats,
or a terminated Erlang runtime system, the environment variable
<c>HEART_COMMAND</c> has to be set before the system is started.
If this variable is not set, a warning text will be printed but
the system will not reboot. However, if the hardware watchdog is
used, it will trigger a reboot <c>HEART_BEAT_BOOT_DELAY</c>
seconds later nevertheless (default is 60).</p>
<p>To reboot on the WINDOWS platform <c>HEART_COMMAND</c> can be
set to <c>heart -shutdown</c> (included in the Erlang delivery)
or of course to any other suitable program which can activate a
reboot.</p>
<p>The hardware watchdog will not be started under Solaris if
the environment variable <c>HW_WD_DISABLE</c> is set.</p>
<p>The <c>HEART_BEAT_TIMEOUT</c> and <c>HEART_BEAT_BOOT_DELAY</c>
environment variables can be used to configure the heart timeouts,
they can be set in the operating system shell before Erlang is
started or be specified at the command line:</p>
<pre>
% <input>erl -heart -env HEART_BEAT_TIMEOUT 30 ...</input></pre>
<p>The value (in seconds) must be in the range 10 < X <= 65535.</p>
<p>It should be noted that if the system clock is adjusted with
more than <c>HEART_BEAT_TIMEOUT</c> seconds, <c>heart</c> will
timeout and try to reboot the system. This can happen, for
example, if the system clock is adjusted automatically by use of
NTP (Network Time Protocol).</p>
<p> If a crash occurs, an <c><![CDATA[erl_crash.dump]]></c> will <em>not</em> be written
unless the environment variable <c><![CDATA[ERL_CRASH_DUMP_SECONDS]]></c> is set.
</p>
<pre>
% <input>erl -heart -env ERL_CRASH_DUMP_SECONDS 10 ...</input></pre>
<p> If a regular core dump is wanted, let heart know by setting the kill signal to abort
using the environment variable <c><![CDATA[HEART_KILL_SIGNAL=SIGABRT]]></c>.
If unset, or not set to <c><![CDATA[SIGABRT]]></c>, the default behaviour will be a kill
signal using <c><![CDATA[SIGKILL]]></c>.
</p>
<pre>
% <input>erl -heart -env HEART_KILL_SIGNAL SIGABRT ...</input></pre>
<p>
Furthermore, <c><![CDATA[ERL_CRASH_DUMP_SECONDS]]></c> has the following behaviour on
<c>heart</c>:
</p>
<taglist>
<tag><c><![CDATA[ERL_CRASH_DUMP_SECONDS=0]]></c></tag>
<item><p>
Suppresses the writing a crash dump file entirely,
thus rebooting the runtime system immediately.
This is the same as not setting the environment variable.
</p>
</item>
<tag><c><![CDATA[ERL_CRASH_DUMP_SECONDS=-1]]></c></tag>
<item><p> Setting the environment variable to a negative value will not reboot
the runtime system until the crash dump file has been completly written.
</p>
</item>
<tag><c><![CDATA[ERL_CRASH_DUMP_SECONDS=S]]></c></tag>
<item><p>
Heart will wait for <c>S</c> seconds to let the crash dump file be written.
After <c>S</c> seconds <c>heart</c> will reboot the runtime system regardless of
the crash dump file has been written or not.
</p>
</item>
</taglist>
<p>In the following descriptions, all function fails with reason
<c>badarg</c> if <c>heart</c> is not started.</p>
</description>
<datatypes>
<datatype>
<name name="heart_option"/>
</datatype>
</datatypes>
<funcs>
<func>
<name name="set_cmd" arity="1"/>
<fsummary>Set a temporary reboot command</fsummary>
<desc>
<p>Sets a temporary reboot command. This command is used if
a <c>HEART_COMMAND</c> other than the one specified with
the environment variable should be used in order to reboot
the system. The new Erlang runtime system will (if it
misbehaves) use the environment variable
<c>HEART_COMMAND</c> to reboot.</p>
<p>Limitations: The <c><anno>Cmd</anno></c> command string
will be sent to the heart program as a ISO-latin-1 or UTF-8
encoded binary depending on the file name encoding mode of the
emulator (see
<seealso marker="kernel:file#native_name_encoding/0"><c>file:native_name_encoding/0</c></seealso>).
The size of the encoded binary must be less than 2047 bytes.</p>
</desc>
</func>
<func>
<name name="clear_cmd" arity="0"/>
<fsummary>Clear the temporary boot command</fsummary>
<desc>
<p>Clears the temporary boot command. If the system terminates,
the normal <c>HEART_COMMAND</c> is used to reboot.</p>
</desc>
</func>
<func>
<name name="get_cmd" arity="0"/>
<fsummary>Get the temporary reboot command</fsummary>
<desc>
<p>Get the temporary reboot command. If the command is cleared,
the empty string will be returned.</p>
</desc>
</func>
<func>
<name name="set_callback" arity="2"/>
<fsummary>Set a validation callback</fsummary>
<desc>
<p> This validation callback will be executed before any heartbeat sent
to the port program. For the validation to succeed it needs to return
with the value <c>ok</c>.
</p>
<p> An exception within the callback will be treated as a validation failure. </p>
<p> The callback will be removed if the system reboots. </p>
</desc>
</func>
<func>
<name name="clear_callback" arity="0"/>
<fsummary>Clear the validation callback</fsummary>
<desc>
<p>Removes the validation callback call before heartbeats.</p>
</desc>
</func>
<func>
<name name="get_callback" arity="0"/>
<fsummary>Get the validation callback</fsummary>
<desc>
<p>Get the validation callback. If the callback is cleared, <c>none</c> will be returned.</p>
</desc>
</func>
<func>
<name name="set_options" arity="1"/>
<fsummary>Set a list of options</fsummary>
<desc>
<p> Valid options <c>set_options</c> are: </p>
<taglist>
<tag><c>check_schedulers</c></tag>
<item>
<p>If enabled, a signal will be sent to each scheduler to check its
responsiveness. The system check occurs before any heartbeat sent
to the port program. If any scheduler is not responsive enough the
heart program will not receive its heartbeat and thus eventually terminate the node.
</p>
</item>
</taglist>
<p> Returns with the value <c>ok</c> if the options are valid.</p>
</desc>
</func>
<func>
<name name="get_options" arity="0"/>
<fsummary>Get the temporary reboot command</fsummary>
<desc>
<p>Returns <c>{ok, Options}</c> where <c>Options</c> is a list of current options enabled for heart.
If the callback is cleared, <c>none</c> will be returned.</p>
</desc>
</func>
</funcs>
</erlref>
|