<?xml version="1.0" encoding="latin1" ?>
<!DOCTYPE chapter SYSTEM "chapter.dtd">

<chapter>
  <header>
    <copyright>
      <year>2003</year><year>2010</year>
      <holder>Ericsson AB. All Rights Reserved.</holder>
    </copyright>
    <legalnotice>
      The contents of this file are subject to the Erlang Public License,
      Version 1.1, (the "License"); you may not use this file except in
      compliance with the License. You should have received a copy of the
      Erlang Public License along with this software. If not, it can be
      retrieved online at http://www.erlang.org/.

      Software distributed under the License is distributed on an "AS IS"
      basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See
      the License for the specific language governing rights and limitations
      under the License.

    </legalnotice>

    <title>Robustness</title>
    <prepared></prepared>
    <docno></docno>
    <date></date>
    <rev></rev>
    <file>robustness.xml</file>
  </header>
  <p>There are several things which are wrong with
    the <seealso marker="conc_prog#ex">messenger example</seealso> from
    the previous chapter. For example if a node where a user is logged
    on goes down without doing a log off, the user will remain in
    the server's <c>User_List</c> but the client will disappear thus
    making it impossible for the user to log on again as the server
    thinks the user already logged on.</p>
  <p>Or what happens if the server goes down in the middle of sending a
    message leaving the sending client hanging for ever in
    the <c>await_result</c> function?</p>

  <section>
    <title>Timeouts</title>
    <p>Before improving the messenger program we will look into some
      general principles, using the ping pong program as an example.
      Recall that when "ping" finishes, it tells "pong" that it has
      done so by sending the atom <c>finished</c> as a message to "pong"
      so that "pong" could also finish. Another way to let "pong"
      finish, is to make "pong" exit if it does not receive a message
      from ping within a certain time, this can be done by adding a
      <em>timeout</em> to <c>pong</c> as shown in the following example:</p>
    <code type="none">
-module(tut19).

-export([start_ping/1, start_pong/0,  ping/2, pong/0]).

ping(0, Pong_Node) ->
    io:format("ping finished~n", []);

ping(N, Pong_Node) ->
    {pong, Pong_Node} ! {ping, self()},
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping(N - 1, Pong_Node).

pong() ->
    receive
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    after 5000 ->
            io:format("Pong timed out~n", [])
    end.

start_pong() ->
    register(pong, spawn(tut19, pong, [])).

start_ping(Pong_Node) ->
    spawn(tut19, ping, [3, Pong_Node]).</code>
    <p>After we have compiled this and copied the <c>tut19.beam</c>
      file to the necessary directories:</p>
    <p>On (pong@kosken):</p>
    <pre>
(pong@kosken)1> <input>tut19:start_pong().</input>
true
Pong received ping
Pong received ping
Pong received ping
Pong timed out</pre>
    <p>On (ping@gollum):</p>
    <pre>
(ping@gollum)1> <input>tut19:start_ping(pong@kosken).</input>
&lt;0.36.0>
Ping received pong
Ping received pong
Ping received pong
ping finished   </pre>
    <p>(The timeout is set in:</p>
    <code type="none">
pong() ->
    receive
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    after 5000 ->
            io:format("Pong timed out~n", [])
    end.</code>
    <p>We start the timeout (<c>after 5000</c>) when we enter
      <c>receive</c>. The timeout is canceled if <c>{ping,Ping_PID}</c>
      is received. If <c>{ping,Ping_PID}</c> is not received,
      the actions following the timeout will be done after 5000
      milliseconds. <c>after</c> must be last in the <c>receive</c>,
      i.e. preceded by all other message reception specifications in
      the <c>receive</c>. Of course we could also call a function which
      returned an integer for the timeout:</p>
    <code type="none">
after pong_timeout() -></code>
    <p>In general, there are better ways than using timeouts to
      supervise parts of a distributed Erlang system. Timeouts are
      usually appropriate to supervise external events, for example if
      you have expected a message from some external system within a
      specified time. For example, we could use a timeout to log a user
      out of the messenger system if they have not accessed it, for
      example, in ten minutes.</p>
  </section>

  <section>
    <title>Error Handling</title>
    <p>Before we go into details of the supervision and error handling
      in an Erlang system, we need see how Erlang processes terminate,
      or in Erlang terminology, <em>exit</em>.</p>
    <p>A process which executes <c>exit(normal)</c> or simply runs out
      of things to do has a <em>normal</em> exit.</p>
    <p>A process which encounters a runtime error (e.g. divide by zero,
      bad match, trying to call a function which doesn't exist etc)
      exits with an error, i.e. has an <em>abnormal</em> exit. A
      process which executes
      <seealso marker="erts:erlang#exit/1">exit(Reason)</seealso>
      where <c>Reason</c> is any Erlang term except the atom
      <c>normal</c>, also has an abnormal exit.</p>
    <p>An Erlang process can set up links to other Erlang processes. If
      a process calls
      <seealso marker="erts:erlang#link/1">link(Other_Pid)</seealso>
      it sets up a bidirectional link between itself and the process
      called <c>Other_Pid</c>. When a process terminates, it sends
      something called a <em>signal</em> to all the processes it has
      links to.</p>
    <p>The signal carries information about the pid it was sent from and
      the exit reason.</p>
    <p>The default behaviour of a process which receives a normal exit
      is to ignore the signal.</p>
    <p>The default behaviour in the two other cases (i.e. abnormal exit)
      above is to bypass all messages to the receiving process and to
      kill it and to propagate the same error signal to the killed
      process' links. In this way you can connect all processes in a
      transaction together using links and if one of the processes
      exits abnormally, all the processes in the transaction will be
      killed. As we often want to create a process and link to it at
      the same time, there is a special BIF,
      <seealso marker="erts:erlang#spawn_link/1">spawn_link</seealso>
      which does the same as <c>spawn</c>, but also creates a link to
      the spawned process.</p>
    <p>Now an example of the ping pong example using links to terminate
      "pong":</p>
    <code type="none">
-module(tut20).

-export([start/1,  ping/2, pong/0]).

ping(N, Pong_Pid) ->
    link(Pong_Pid),
    ping1(N, Pong_Pid).

ping1(0, _) ->
    exit(ping);

ping1(N, Pong_Pid) ->
    Pong_Pid ! {ping, self()},
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping1(N - 1, Pong_Pid).

pong() ->
    receive
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong()
    end.

start(Ping_Node) ->
    PongPID = spawn(tut20, pong, []),
    spawn(Ping_Node, tut20, ping, [3, PongPID]).</code>
    <pre>
(s1@bill)3> <input>tut20:start(s2@kosken).</input>
Pong received ping
&lt;3820.41.0>
Ping received pong
Pong received ping
Ping received pong
Pong received ping
Ping received pong</pre>
    <p>This is a slight modification of the ping pong program where both
      processes are spawned from the same <c>start/1</c> function,
      where the "ping" process can be spawned on a separate node. Note
      the use of the <c>link</c> BIF. "Ping" calls
      <c>exit(ping)</c> when it finishes and this will cause an exit
      signal to be sent to "pong" which will also terminate.</p>
    <p>It is possible to modify the default behaviour of a process so
      that it does not get killed when it receives abnormal exit
      signals, but all signals will be turned into normal messages on
      the format <c>{'EXIT',FromPID,Reason}</c> and added to the end of
      the receiving processes message queue. This behaviour is set by:</p>
    <code type="none">
process_flag(trap_exit, true)</code>
    <p>There are several other process flags, see
      <seealso marker="erts:erlang#process_flag/2">erlang(3)</seealso>.
      Changing the default behaviour of a process in this way is
      usually not done in standard user programs, but is left to
      the supervisory programs in OTP (but that's another tutorial).
      However we will modify the ping pong program to illustrate exit
      trapping.</p>
    <code type="none">
-module(tut21).

-export([start/1,  ping/2, pong/0]).

ping(N, Pong_Pid) ->
    link(Pong_Pid), 
    ping1(N, Pong_Pid).

ping1(0, _) ->
    exit(ping);

ping1(N, Pong_Pid) ->
    Pong_Pid ! {ping, self()},
    receive
        pong ->
            io:format("Ping received pong~n", [])
    end,
    ping1(N - 1, Pong_Pid).

pong() ->
    process_flag(trap_exit, true), 
    pong1().

pong1() ->
    receive
        {ping, Ping_PID} ->
            io:format("Pong received ping~n", []),
            Ping_PID ! pong,
            pong1();
        {'EXIT', From, Reason} ->
            io:format("pong exiting, got ~p~n", [{'EXIT', From, Reason}])
    end.

start(Ping_Node) ->
    PongPID = spawn(tut21, pong, []),
    spawn(Ping_Node, tut21, ping, [3, PongPID]).</code>
    <pre>
(s1@bill)1> <input>tut21:start(s2@gollum).</input>
&lt;3820.39.0>
Pong received ping
Ping received pong
Pong received ping
Ping received pong
Pong received ping
Ping received pong
pong exiting, got {'EXIT',&lt;3820.39.0>,ping}</pre>
  </section>

  <section>
    <title>The Larger Example with Robustness Added</title>
    <p>Now we return to the messenger program and add changes which
      make it more robust:</p>
    <code type="none">
%%% Message passing utility.  
%%% User interface:
%%% login(Name)
%%%     One user at a time can log in from each Erlang node in the
%%%     system messenger: and choose a suitable Name. If the Name
%%%     is already logged in at another node or if someone else is
%%%     already logged in at the same node, login will be rejected
%%%     with a suitable error message.
%%% logoff()
%%%     Logs off anybody at at node
%%% message(ToName, Message)
%%%     sends Message to ToName. Error messages if the user of this 
%%%     function is not logged on or if ToName is not logged on at
%%%     any node.
%%%
%%% One node in the network of Erlang nodes runs a server which maintains
%%% data about the logged on users. The server is registered as "messenger"
%%% Each node where there is a user logged on runs a client process registered
%%% as "mess_client" 
%%%
%%% Protocol between the client processes and the server
%%% ----------------------------------------------------
%%% 
%%% To server: {ClientPid, logon, UserName}
%%% Reply {messenger, stop, user_exists_at_other_node} stops the client
%%% Reply {messenger, logged_on} logon was successful
%%%
%%% When the client terminates for some reason
%%% To server: {'EXIT', ClientPid, Reason}
%%%
%%% To server: {ClientPid, message_to, ToName, Message} send a message
%%% Reply: {messenger, stop, you_are_not_logged_on} stops the client
%%% Reply: {messenger, receiver_not_found} no user with this name logged on
%%% Reply: {messenger, sent} Message has been sent (but no guarantee)
%%%
%%% To client: {message_from, Name, Message},
%%%
%%% Protocol between the "commands" and the client
%%% ---------------------------------------------- 
%%%
%%% Started: messenger:client(Server_Node, Name)
%%% To client: logoff
%%% To client: {message_to, ToName, Message}
%%%
%%% Configuration: change the server_node() function to return the
%%% name of the node where the messenger server runs

-module(messenger).
-export([start_server/0, server/0, 
         logon/1, logoff/0, message/2, client/2]).

%%% Change the function below to return the name of the node where the
%%% messenger server runs
server_node() ->
    messenger@super.

%%% This is the server process for the "messenger"
%%% the user list has the format [{ClientPid1, Name1},{ClientPid22, Name2},...]
server() ->
    process_flag(trap_exit, true),
    server([]).

server(User_List) ->
    receive
        {From, logon, Name} ->
            New_User_List = server_logon(From, Name, User_List),
            server(New_User_List);
        {'EXIT', From, _} ->
            New_User_List = server_logoff(From, User_List),
            server(New_User_List);
        {From, message_to, To, Message} ->
            server_transfer(From, To, Message, User_List),
            io:format("list is now: ~p~n", [User_List]),
            server(User_List)
    end.

%%% Start the server
start_server() ->
    register(messenger, spawn(messenger, server, [])).

%%% Server adds a new user to the user list
server_logon(From, Name, User_List) ->
    %% check if logged on anywhere else
    case lists:keymember(Name, 2, User_List) of
        true ->
            From ! {messenger, stop, user_exists_at_other_node},  %reject logon
            User_List;
        false ->
            From ! {messenger, logged_on},
            link(From),
            [{From, Name} | User_List]        %add user to the list
    end.

%%% Server deletes a user from the user list
server_logoff(From, User_List) ->
    lists:keydelete(From, 1, User_List).


%%% Server transfers a message between user
server_transfer(From, To, Message, User_List) ->
    %% check that the user is logged on and who he is
    case lists:keysearch(From, 1, User_List) of
        false ->
            From ! {messenger, stop, you_are_not_logged_on};
        {value, {_, Name}} ->
            server_transfer(From, Name, To, Message, User_List)
    end.

%%% If the user exists, send the message
server_transfer(From, Name, To, Message, User_List) ->
    %% Find the receiver and send the message
    case lists:keysearch(To, 2, User_List) of
        false ->
            From ! {messenger, receiver_not_found};
        {value, {ToPid, To}} ->
            ToPid ! {message_from, Name, Message}, 
            From ! {messenger, sent} 
    end.

%%% User Commands
logon(Name) ->
    case whereis(mess_client) of 
        undefined ->
            register(mess_client, 
                     spawn(messenger, client, [server_node(), Name]));
        _ -> already_logged_on
    end.

logoff() ->
    mess_client ! logoff.

message(ToName, Message) ->
    case whereis(mess_client) of % Test if the client is running
        undefined ->
            not_logged_on;
        _ -> mess_client ! {message_to, ToName, Message},
             ok
end.

%%% The client process which runs on each user node
client(Server_Node, Name) ->
    {messenger, Server_Node} ! {self(), logon, Name},
    await_result(),
    client(Server_Node).

client(Server_Node) ->
    receive
        logoff ->
            exit(normal);
        {message_to, ToName, Message} ->
            {messenger, Server_Node} ! {self(), message_to, ToName, Message},
            await_result();
        {message_from, FromName, Message} ->
            io:format("Message from ~p: ~p~n", [FromName, Message])
    end,
    client(Server_Node).

%%% wait for a response from the server
await_result() ->
    receive
        {messenger, stop, Why} -> % Stop the client 
            io:format("~p~n", [Why]),
            exit(normal);
        {messenger, What} ->  % Normal response
            io:format("~p~n", [What])
    after 5000 ->
            io:format("No response from server~n", []),
            exit(timeout)
    end.</code>
    <p>We have added the following changes:</p>
    <p>The messenger server traps exits. If it receives an exit signal,
      <c>{'EXIT',From,Reason}</c> this means that a client process has
      terminated or is unreachable because:</p>
    <list type="bulleted">
      <item>the user has logged off (we have removed the "logoff"
       message),</item>
      <item>the network connection to the client is broken,</item>
      <item>the node on which the client process resides has gone down,
       or</item>
      <item>the client processes has done some illegal operation.</item>
    </list>
    <p>If we receive an exit signal as above, we delete the tuple,
      <c>{From,Name}</c> from the servers <c>User_List</c> using
      the <c>server_logoff</c> function. If the node on which the server
      runs goes down, an exit signal (automatically generated by
      the system), will be sent to all of the client processes:
      <c>{'EXIT',MessengerPID,noconnection}</c> causing all the client
      processes to terminate.</p>
    <p>We have also introduced a timeout of five seconds in
      the <c>await_result</c> function. I.e. if the server does not
      reply within five seconds (5000 ms), the client terminates. This
      is really only needed in the logon sequence before the client and
      server are linked.</p>
    <p>An interesting case is if the client was to terminate before
      the server links to it. This is taken care of because linking to a
      non-existent process causes an exit signal,
      <c>{'EXIT',From,noproc}</c>, to be automatically generated as if
      the process terminated immediately after the link operation.</p>
  </section>
</chapter>