Diffstat (limited to 'system/doc/design_principles/distributed_applications.xml')
-rw-r--r-- system/doc/design_principles/distributed_applications.xml 217
1 files changed, 217 insertions, 0 deletions
diff --git a/system/doc/design_principles/distributed_applications.xml b/system/doc/design_principles/distributed_applications.xml
new file mode 100644
index 0000000000..39a24b3598
--- /dev/null
+++ b/system/doc/design_principles/distributed_applications.xml
@@ -0,0 +1,217 @@
+<?xml version="1.0" encoding="latin1" ?>
+<!DOCTYPE chapter SYSTEM "chapter.dtd">
+
+<chapter>
+ <header>
+ <copyright>
+ <year>2003</year><year>2009</year>
+ <holder>Ericsson AB. All Rights Reserved.</holder>
+ </copyright>
+ <legalnotice>
+ The contents of this file are subject to the Erlang Public License,
+ Version 1.1, (the "License"); you may not use this file except in
+ compliance with the License. You should have received a copy of the
+ Erlang Public License along with this software. If not, it can be
+ retrieved online at http://www.erlang.org/.
+
+ Software distributed under the License is distributed on an "AS IS"
+ basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
+ the License for the specific language governing rights and limitations
+ under the License.
+
+ </legalnotice>
+
+ <title>Distributed Applications</title>
+ <prepared></prepared>
+ <docno></docno>
+ <date></date>
+ <rev></rev>
+ <file>distributed_applications.xml</file>
+ </header>
+
+ <section>
+ <title>Definition</title>
+ <p>In a distributed system with several Erlang nodes, there may be
+ a need to control applications in a distributed manner. If
+ the node, where a certain application is running, goes down,
+ the application should be restarted at another node.</p>
+ <p>Such an application is called a <em>distributed application</em>.
+ Note that it is the control of the application that is
+ distributed. Any application can of course be distributed in
+ the sense that it, for example, uses services on other nodes.</p>
+ <p>Because a distributed application may move between nodes, some
+ addressing mechanism is required to ensure that it can be
+ addressed by other applications, regardless of which node it
+ currently executes on. This issue is not addressed here, but the
+ Kernel module <c>global</c> or STDLIB module <c>pg</c> can be
+ used for this purpose.</p>
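+ <p>As a minimal sketch of the <c>global</c> approach, a process in
+ the application can register a name that is known on all connected
+ nodes, so that clients can reach it without knowing where it runs.
+ The module name <c>myapp_server</c> and the <c>ping</c> request are
+ hypothetical and serve only as illustration:</p>
+ <code type="none">
+-module(myapp_server).
+-behaviour(gen_server).
+-export([start_link/0, ping/0]).
+-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
+         terminate/2, code_change/3]).
+
+%% Register the server under a name known on all connected nodes.
+start_link() ->
+    gen_server:start_link({global, ?MODULE}, ?MODULE, [], []).
+
+%% Can be called from any connected node, wherever the server runs.
+ping() ->
+    gen_server:call({global, ?MODULE}, ping).
+
+init([]) -> {ok, nostate}.
+
+%% Reply with the node that currently hosts the server.
+handle_call(ping, _From, State) -> {reply, {pong, node()}, State}.
+
+handle_cast(_Msg, State) -> {noreply, State}.
+handle_info(_Info, State) -> {noreply, State}.
+terminate(_Reason, _State) -> ok.
+code_change(_OldVsn, State, _Extra) -> {ok, State}.</code>
+ <p>If the application is restarted at another node, the new instance
+ registers the same name, so callers need no change. A call made
+ while no instance is registered fails, so callers may need to
+ retry.</p>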
+ </section>
+
+ <section>
+ <title>Specifying Distributed Applications</title>
+ <p>Distributed applications are controlled by both the application
+ controller and a distributed application controller process,
+ <c>dist_ac</c>. Both these processes are part of the <c>kernel</c>
+ application. Therefore, distributed applications are specified by
+ configuring the <c>kernel</c> application, using the following
+ configuration parameter (see also <c>kernel(6)</c>):</p>
+ <taglist>
+ <tag><c>distributed = [{Application, [Timeout,] NodeDesc}]</c></tag>
+ <item>
+ <p>Specifies where the application <c>Application = atom()</c>
+ may execute. <c>NodeDesc = [Node | {Node,...,Node}]</c> is
+ a list of node names in priority order. The order between
+ nodes in a tuple is undefined.</p>
+ <p><c>Timeout = integer()</c> specifies how many milliseconds to
+ wait before restarting the application at another node.
+ Defaults to 0.</p>
+ </item>
+ </taglist>
+ <p>For distribution of application control to work properly,
+ the nodes where a distributed application may run must contact
+ each other and negotiate where to start the application. This is
+ done using the following <c>kernel</c> configuration parameters:</p>
+ <taglist>
+ <tag><c>sync_nodes_mandatory = [Node]</c></tag>
+ <item>Specifies which other nodes must be started (within
+ the timeout specified by <c>sync_nodes_timeout</c>).</item>
+ <tag><c>sync_nodes_optional = [Node]</c></tag>
+ <item>Specifies which other nodes can be started (within
+ the timeout specified by <c>sync_nodes_timeout</c>).</item>
+ <tag><c>sync_nodes_timeout = integer() | infinity</c></tag>
+ <item>Specifies how many milliseconds to wait for the other nodes
+ to start.</item>
+ </taglist>
+ <p>When started, the node will wait for all nodes specified by
+ <c>sync_nodes_mandatory</c> and <c>sync_nodes_optional</c> to
+ come up. When all nodes have come up, or when all mandatory nodes
+ have come up and the time specified by <c>sync_nodes_timeout</c>
+ has elapsed, all applications will be started. If not all
+ mandatory nodes have come up, the node will terminate.</p>
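+ <p>As an illustration of these rules, consider the following
+ fragment of a <c>kernel</c> configuration. The node names are
+ hypothetical, and one node is listed as optional rather than
+ mandatory:</p>
+ <code type="none">
+{sync_nodes_mandatory, [cp2@cave]},
+{sync_nodes_optional, [cp3@cave]},
+{sync_nodes_timeout, 5000}</code>
+ <p>With these values, the node starts its applications as soon as
+ both <c>cp2@cave</c> and <c>cp3@cave</c> are up. If only
+ <c>cp2@cave</c> is up when 5000 ms have elapsed, the applications
+ are started anyway. If <c>cp2@cave</c> is not up by then, the node
+ terminates.</p>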
+ <p>Example: An application <c>myapp</c> should run at the node
+ <c>cp1@cave</c>. If this node goes down, <c>myapp</c> should
+ be restarted at <c>cp2@cave</c> or <c>cp3@cave</c>. A system
+ configuration file <c>cp1.config</c> for <c>cp1@cave</c> could
+ look like:</p>
+ <code type="none">
+[{kernel,
+ [{distributed, [{myapp, 5000, [cp1@cave, {cp2@cave, cp3@cave}]}]},
+ {sync_nodes_mandatory, [cp2@cave, cp3@cave]},
+ {sync_nodes_timeout, 5000}
+ ]
+ }
+].</code>
+ <p>The system configuration files for <c>cp2@cave</c> and
+ <c>cp3@cave</c> are identical, except for the list of mandatory
+ nodes which should be <c>[cp1@cave, cp3@cave]</c> for
+ <c>cp2@cave</c> and <c>[cp1@cave, cp2@cave]</c> for
+ <c>cp3@cave</c>.</p>
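+ <p>Following that description, a sketch of the corresponding file
+ <c>cp2.config</c> for <c>cp2@cave</c> would be:</p>
+ <code type="none">
+[{kernel,
+  [{distributed, [{myapp, 5000, [cp1@cave, {cp2@cave, cp3@cave}]}]},
+   {sync_nodes_mandatory, [cp1@cave, cp3@cave]},
+   {sync_nodes_timeout, 5000}
+  ]
+ }
+].</code>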
+ <note>
+ <p>All involved nodes must have the same value for
+ <c>distributed</c> and <c>sync_nodes_timeout</c>, or
+ the behaviour of the system is undefined.</p>
+ </note>
+ </section>
+
+ <section>
+ <title>Starting and Stopping Distributed Applications</title>
+ <p>When all involved (mandatory) nodes have been started,
+ the distributed application can be started by calling
+ <c>application:start(Application)</c> at <em>all of these nodes</em>.</p>
+ <p>It is of course also possible to use a boot script (see
+ <seealso marker="release_structure">Releases</seealso>) which
+ automatically starts the application.</p>
+ <p>The application is started at the first node in the list given
+ by the <c>distributed</c> configuration parameter that is up
+ and running. The application is started as usual. That is, an
+ application master is created and calls the application callback
+ function:</p>
+ <code type="none">
+Module:start(normal, StartArgs)</code>
+ <p>Example: Continuing the example from the previous section,
+ the three nodes are started, specifying the system configuration
+ file:</p>
+ <pre>
+> <input>erl -sname cp1 -config cp1</input>
+> <input>erl -sname cp2 -config cp2</input>
+> <input>erl -sname cp3 -config cp3</input></pre>
+ <p>When all nodes are up and running, <c>myapp</c> can be started.
+ This is achieved by calling <c>application:start(myapp)</c> at
+ all three nodes. It is then started at <c>cp1</c>, as shown in
+ the figure below.</p>
+ <marker id="dist1"></marker>
+ <image file="../design_principles/dist1.gif">
+ <icaption>Application myapp - Situation 1</icaption>
+ </image>
+ <p>Similarly, the application must be stopped by calling
+ <c>application:stop(Application)</c> at all involved nodes.</p>
+ </section>
+
+ <section>
+ <title>Failover</title>
+ <p>If the node where the application is running goes down,
+ the application is restarted (after the specified timeout) at
+ the first node in the list given by the <c>distributed</c>
+ configuration parameter that is up and running. This is called a
+ <em>failover</em>.</p>
+ <p>The application is started the normal way at the new node,
+ that is, by the application master calling:</p>
+ <code type="none">
+Module:start(normal, StartArgs)</code>
+ <p>Exception: If the application has the <c>start_phases</c> key
+ defined (see <seealso marker="included_applications">Included Applications</seealso>), then the application is instead started
+ by calling:</p>
+ <code type="none">
+Module:start({failover, Node}, StartArgs)</code>
+ <p>where <c>Node</c> is the terminated node.</p>
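+ <p>As a sketch of when this exception applies, an application
+ resource file that defines <c>start_phases</c> could look as
+ follows. The description, the version, the phase name <c>init</c>,
+ and the callback module name <c>myapp_app</c> are hypothetical:</p>
+ <code type="none">
+{application, myapp,
+ [{description, "Example distributed application"},
+  {vsn, "1"},
+  {modules, [myapp_app]},
+  {registered, []},
+  {applications, [kernel, stdlib]},
+  {mod, {myapp_app, []}},
+  {start_phases, [{init, []}]}
+ ]}.</code>
+ <p>When <c>start_phases</c> is defined, the callback module must
+ also export <c>start_phase/3</c>.</p>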
+ <p>Example: If <c>cp1</c> goes down, the system checks which one of
+ the other nodes, <c>cp2</c> or <c>cp3</c>, has the fewest
+ running applications, but waits for 5 seconds for <c>cp1</c> to
+ restart. If <c>cp1</c> does not restart and <c>cp2</c> runs fewer
+ applications than <c>cp3</c>, then <c>myapp</c> is restarted on
+ <c>cp2</c>.</p>
+ <marker id="dist2"></marker>
+ <image file="../design_principles/dist2.gif">
+ <icaption>Application myapp - Situation 2</icaption>
+ </image>
+ <p>Suppose now that <c>cp2</c> goes down as well and does not
+ restart within 5 seconds. <c>myapp</c> is now restarted on
+ <c>cp3</c>.</p>
+ <marker id="dist3"></marker>
+ <image file="../design_principles/dist3.gif">
+ <icaption>Application myapp - Situation 3</icaption>
+ </image>
+ </section>
+
+ <section>
+ <title>Takeover</title>
+ <p>If a node is started that has higher priority, according
+ to <c>distributed</c>, than the node where a distributed
+ application is currently running, the application is
+ restarted at the new node and stopped at the old node. This is
+ called a <em>takeover</em>.</p>
+ <p>The application is started by the application master calling:</p>
+ <code type="none">
+Module:start({takeover, Node}, StartArgs)</code>
+ <p>where <c>Node</c> is the old node.</p>
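+ <p>The start types described in this chapter can be told apart by
+ pattern matching in the application callback module. The following
+ is a minimal sketch under stated assumptions: the module name
+ <c>myapp_app</c>, the registered name <c>myapp_sup</c>, and the
+ empty supervisor are hypothetical, and a real application would do
+ its own state handling in each clause:</p>
+ <code type="none">
+-module(myapp_app).
+-behaviour(application).
+-behaviour(supervisor).
+-export([start/2, stop/1, start_phase/3, init/1]).
+
+%% Ordinary start.
+start(normal, _StartArgs) ->
+    start_sup();
+%% Used only if the start_phases key is defined; _Node has gone down.
+start({failover, _Node}, _StartArgs) ->
+    start_sup();
+%% The application was running at _OldNode; state could be fetched
+%% from there at this point.
+start({takeover, _OldNode}, _StartArgs) ->
+    start_sup().
+
+stop(_State) ->
+    ok.
+
+%% Called only if the start_phases key is defined.
+start_phase(_Phase, _StartType, _PhaseArgs) ->
+    ok.
+
+start_sup() ->
+    supervisor:start_link({local, myapp_sup}, ?MODULE, []).
+
+%% Supervisor callback: no children in this sketch.
+init([]) ->
+    {ok, {{one_for_one, 1, 5}, []}}.</code>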
+ <p>Example: If <c>myapp</c> is running at <c>cp3</c>, and
+ <c>cp2</c> now restarts, <c>cp2</c> does not take over
+ <c>myapp</c>, because the order between nodes <c>cp2</c> and
+ <c>cp3</c> is undefined.</p>
+ <marker id="dist4"></marker>
+ <image file="../design_principles/dist4.gif">
+ <icaption>Application myapp - Situation 4</icaption>
+ </image>
+ <p>However, if <c>cp1</c> restarts as well, the function
+ <c>application:takeover/2</c> moves <c>myapp</c> to <c>cp1</c>,
+ because <c>cp1</c> has a higher priority than <c>cp3</c> for this
+ application. In this case,
+ <c>Module:start({takeover, cp3@cave}, StartArgs)</c> is executed
+ at <c>cp1</c> to start the application.</p>
+ <marker id="dist5"></marker>
+ <image file="../design_principles/dist5.gif">
+ <icaption>Application myapp - Situation 5</icaption>
+ </image>
+ </section>
+</chapter>
+