Diffstat (limited to 'system/doc/design_principles/distributed_applications.xml')
-rw-r--r-- | system/doc/design_principles/distributed_applications.xml | 217 |
1 files changed, 217 insertions, 0 deletions
diff --git a/system/doc/design_principles/distributed_applications.xml b/system/doc/design_principles/distributed_applications.xml new file mode 100644 index 0000000000..39a24b3598 --- /dev/null +++ b/system/doc/design_principles/distributed_applications.xml @@ -0,0 +1,217 @@ +<?xml version="1.0" encoding="latin1" ?> +<!DOCTYPE chapter SYSTEM "chapter.dtd"> + +<chapter> + <header> + <copyright> + <year>2003</year><year>2009</year> + <holder>Ericsson AB. All Rights Reserved.</holder> + </copyright> + <legalnotice> + The contents of this file are subject to the Erlang Public License, + Version 1.1, (the "License"); you may not use this file except in + compliance with the License. You should have received a copy of the + Erlang Public License along with this software. If not, it can be + retrieved online at http://www.erlang.org/. + + Software distributed under the License is distributed on an "AS IS" + basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See + the License for the specific language governing rights and limitations + under the License. + + </legalnotice> + + <title>Distributed Applications</title> + <prepared></prepared> + <docno></docno> + <date></date> + <rev></rev> + <file>distributed_applications.xml</file> + </header> + + <section> + <title>Definition</title> + <p>In a distributed system with several Erlang nodes, there may be + a need to control applications in a distributed manner. If + the node, where a certain application is running, goes down, + the application should be restarted at another node.</p> + <p>Such an application is called a <em>distributed application</em>. 
+ Note that it is the control of the application which is + distributed; all applications can of course be distributed in + the sense that they, for example, use services on other nodes.</p> + <p>Because a distributed application may move between nodes, some + addressing mechanism is required to ensure that it can be + addressed by other applications, regardless of which node it + currently executes on. This issue is not addressed here, but the + Kernel module <c>global</c> or STDLIB module <c>pg</c> can be + used for this purpose.</p> + </section> + + <section> + <title>Specifying Distributed Applications</title> + <p>Distributed applications are controlled by both the application + controller and a distributed application controller process, + <c>dist_ac</c>. Both of these processes are part of the <c>kernel</c> + application. Therefore, distributed applications are specified by + configuring the <c>kernel</c> application, using the following + configuration parameter (see also <c>kernel(6)</c>):</p> + <taglist> + <tag><c>distributed = [{Application, [Timeout,] NodeDesc}]</c></tag> + <item> + <p>Specifies where the application <c>Application = atom()</c> + may execute. <c>NodeDesc = [Node | {Node,...,Node}]</c> is + a list of node names in priority order. The order between + nodes in a tuple is undefined.</p> + <p><c>Timeout = integer()</c> specifies how many milliseconds to + wait before restarting the application at another node. + Defaults to 0.</p> + </item> + </taglist> + <p>For distribution of application control to work properly, + the nodes where a distributed application may run must contact + each other and negotiate where to start the application. 
This is + done using the following <c>kernel</c> configuration parameters:</p> + <taglist> + <tag><c>sync_nodes_mandatory = [Node]</c></tag> + <item>Specifies which other nodes must be started (within + the timeout specified by <c>sync_nodes_timeout</c>).</item> + <tag><c>sync_nodes_optional = [Node]</c></tag> + <item>Specifies which other nodes can be started (within + the timeout specified by <c>sync_nodes_timeout</c>).</item> + <tag><c>sync_nodes_timeout = integer() | infinity</c></tag> + <item>Specifies how many milliseconds to wait for the other nodes + to start.</item> + </taglist> + <p>When started, the node will wait for all nodes specified by + <c>sync_nodes_mandatory</c> and <c>sync_nodes_optional</c> to + come up. When all nodes have come up, or when all mandatory nodes + have come up and the time specified by <c>sync_nodes_timeout</c> + has elapsed, all applications will be started. If not all + mandatory nodes have come up, the node will terminate.</p> + <p>Example: An application <c>myapp</c> should run at the node + <c>cp1@cave</c>. If this node goes down, <c>myapp</c> should + be restarted at <c>cp2@cave</c> or <c>cp3@cave</c>. 
A system + configuration file <c>cp1.config</c> for <c>cp1@cave</c> could + look like:</p> + <code type="none"> +[{kernel, + [{distributed, [{myapp, 5000, [cp1@cave, {cp2@cave, cp3@cave}]}]}, + {sync_nodes_mandatory, [cp2@cave, cp3@cave]}, + {sync_nodes_timeout, 5000} + ] + } +].</code> + <p>The system configuration files for <c>cp2@cave</c> and + <c>cp3@cave</c> are identical, except for the list of mandatory + nodes which should be <c>[cp1@cave, cp3@cave]</c> for + <c>cp2@cave</c> and <c>[cp1@cave, cp2@cave]</c> for + <c>cp3@cave</c>.</p> + <note> + <p>All involved nodes must have the same value for + <c>distributed</c> and <c>sync_nodes_timeout</c>, or + the behaviour of the system is undefined.</p> + </note> + </section> + + <section> + <title>Starting and Stopping Distributed Applications</title> + <p>When all involved (mandatory) nodes have been started, + the distributed application can be started by calling + <c>application:start(Application)</c> at <em>all of these nodes.</em></p> + <p>It is of course also possible to use a boot script (see + <seealso marker="release_structure">Releases</seealso>) which + automatically starts the application.</p> + <p>The application will be started at the first node, specified + by the <c>distributed</c> configuration parameter, which is up + and running. The application is started as usual. That is, an + application master is created and calls the application callback + function:</p> + <code type="none"> +Module:start(normal, StartArgs)</code> + <p>Example: Continuing the example from the previous section, + the three nodes are started, specifying the system configuration + file:</p> + <pre> +> <input>erl -sname cp1 -config cp1</input> +> <input>erl -sname cp2 -config cp2</input> +> <input>erl -sname cp3 -config cp3</input></pre> + <p>When all nodes are up and running, <c>myapp</c> can be started. + This is achieved by calling <c>application:start(myapp)</c> at + all three nodes. 
It is then started at <c>cp1</c>, as shown in + the figure below.</p> + <marker id="dist1"></marker> + <image file="../design_principles/dist1.gif"> + <icaption>Application myapp - Situation 1</icaption> + </image> + <p>Similarly, the application must be stopped by calling + <c>application:stop(Application)</c> at all involved nodes.</p> + </section> + + <section> + <title>Failover</title> + <p>If the node where the application is running goes down, + the application is restarted (after the specified timeout) at + the first node, specified by the <c>distributed</c> configuration + parameter, which is up and running. This is called a + <em>failover</em>.</p> + <p>The application is started the normal way at the new node, + that is, by the application master calling:</p> + <code type="none"> +Module:start(normal, StartArgs)</code> + <p>Exception: If the application has the <c>start_phases</c> key + defined (see <seealso marker="included_applications">Included Applications</seealso>), then the application is instead started + by calling:</p> + <code type="none"> +Module:start({failover, Node}, StartArgs)</code> + <p>where <c>Node</c> is the terminated node.</p> + <p>Example: If <c>cp1</c> goes down, the system checks which one of + the other nodes, <c>cp2</c> or <c>cp3</c>, has the fewest + running applications, but waits for 5 seconds for <c>cp1</c> to + restart. If <c>cp1</c> does not restart and <c>cp2</c> runs fewer + applications than <c>cp3</c>, then <c>myapp</c> is restarted on + <c>cp2</c>.</p> + <marker id="dist2"></marker> + <image file="../design_principles/dist2.gif"> + <icaption>Application myapp - Situation 2</icaption> + </image> + <p>Suppose now that <c>cp2</c> goes down as well and does not + restart within 5 seconds. 
<c>myapp</c> is now restarted on + <c>cp3</c>.</p> + <marker id="dist3"></marker> + <image file="../design_principles/dist3.gif"> + <icaption>Application myapp - Situation 3</icaption> + </image> + </section> + + <section> + <title>Takeover</title> + <p>If a node is started that has higher priority, according + to <c>distributed</c>, than the node where a distributed + application is currently running, the application is + restarted at the new node and stopped at the old node. This is + called a <em>takeover</em>.</p> + <p>The application is started by the application master calling:</p> + <code type="none"> +Module:start({takeover, Node}, StartArgs)</code> + <p>where <c>Node</c> is the old node.</p> + <p>Example: If <c>myapp</c> is running at <c>cp3</c>, and if + <c>cp2</c> now restarts, it will not restart <c>myapp</c>, + because the order between nodes <c>cp2</c> and <c>cp3</c> is + undefined.</p> + <marker id="dist4"></marker> + <image file="../design_principles/dist4.gif"> + <icaption>Application myapp - Situation 4</icaption> + </image> + <p>However, if <c>cp1</c> restarts as well, the function + <c>application:takeover/2</c> moves <c>myapp</c> to <c>cp1</c>, + because <c>cp1</c> has a higher priority than <c>cp3</c> for this + application. In this case, + <c>Module:start({takeover, cp3@cave}, StartArgs)</c> is executed + at <c>cp1</c> to start the application.</p> + <marker id="dist5"></marker> + <image file="../design_principles/dist5.gif"> + <icaption>Application myapp - Situation 5</icaption> + </image> + </section> +</chapter> + |
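Note on the example configuration: the chapter shows only cp1.config and describes the other two files in prose. As a sketch of what that description implies, cp2.config would presumably look like the following. The file name cp2.config is inferred from the erl -sname cp2 -config cp2 command in the chapter; the contents follow the stated rule that only the mandatory node list differs.

```erlang
%% cp2.config (sketch derived from the chapter's prose, not part of the patch)
[{kernel,
  %% Must be identical on all involved nodes
  [{distributed, [{myapp, 5000, [cp1@cave, {cp2@cave, cp3@cave}]}]},
   %% Differs per node: lists the two other nodes
   {sync_nodes_mandatory, [cp1@cave, cp3@cave]},
   %% Must be identical on all involved nodes
   {sync_nodes_timeout, 5000}
  ]
 }
].
```

For cp3.config, the same reasoning gives [cp1@cave, cp2@cave] as the mandatory node list, with the other two entries unchanged.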