Diffstat (limited to 'erts/doc/src/alt_dist.xml')
-rw-r--r-- | erts/doc/src/alt_dist.xml | 1099 |
1 files changed, 1099 insertions, 0 deletions
diff --git a/erts/doc/src/alt_dist.xml b/erts/doc/src/alt_dist.xml new file mode 100644 index 0000000000..9a68b3cf40 --- /dev/null +++ b/erts/doc/src/alt_dist.xml @@ -0,0 +1,1099 @@ +<?xml version="1.0" encoding="latin1" ?> +<!DOCTYPE chapter SYSTEM "chapter.dtd"> + +<chapter> + <header> + <copyright> + <year>2000</year><year>2009</year> + <holder>Ericsson AB. All Rights Reserved.</holder> + </copyright> + <legalnotice> + The contents of this file are subject to the Erlang Public License, + Version 1.1, (the "License"); you may not use this file except in + compliance with the License. You should have received a copy of the + Erlang Public License along with this software. If not, it can be + retrieved online at http://www.erlang.org/. + + Software distributed under the License is distributed on an "AS IS" + basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See + the License for the specific language governing rights and limitations + under the License. + + </legalnotice> + + <title>How to implement an alternative carrier for the Erlang distribution</title> + <prepared>Patrik Nyblom</prepared> + <responsible></responsible> + <docno></docno> + <approved></approved> + <checked></checked> + <date>2000-10-17</date> + <rev>PA2</rev> + <file>alt_dist.xml</file> + </header> + <p>This document describes how one can implement one's own carrier + protocol for the Erlang distribution. The distribution is normally + carried by the TCP/IP protocol. What's explained here is the method for + replacing TCP/IP with another protocol. </p> + <p>The document is a step-by-step explanation of the <c><![CDATA[uds_dist]]></c> example + application (located in the kernel application's <c><![CDATA[examples]]></c> directory). + The <c><![CDATA[uds_dist]]></c> application implements distribution over Unix domain + sockets and is written for the Sun Solaris 2 operating environment. The + mechanisms are however general and apply to any operating system Erlang + runs on. 
The reason the C code is not made portable is simply readability.</p> + <note><p>This document was written a long time ago. Most of it is still + valid, but some things have changed since it was first written, + most notably the driver interface. There have been some updates + to the documentation of the driver presented in this document, + but more could be done and is planned for the future. The + reader is encouraged to also read the + <seealso marker="erl_driver">erl_driver</seealso> and the + <seealso marker="driver_entry">driver_entry</seealso> documentation. + </p></note> + + <section> + <title>Introduction</title> + <p>To implement a new carrier for the Erlang distribution, one must first + make the protocol available to the Erlang machine, which involves writing + an Erlang driver. A port program cannot be used; + there <em>has</em> to + be an Erlang driver. Erlang drivers can either be statically + linked + to the emulator, which can be an alternative when using the open source + distribution of Erlang, or dynamically loaded into the Erlang machine's + address space, which is the only alternative if a precompiled version of + Erlang is to be used. </p> + <p>Writing an Erlang driver is by no means easy. The driver is written + as a number of call-back functions called by the Erlang emulator when + data is sent to the driver or the driver has any data available on a file + descriptor. As the driver call-back routines execute in the main + thread of the Erlang machine, the call-back functions can perform + no blocking activity whatsoever. The call-backs should only set up + file descriptors for waiting and/or read/write available data. All + I/O has to be non-blocking. Driver call-backs are however executed + in sequence, so a global state can safely be updated within the + routines. 
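</p>
<p>As a minimal sketch of what non-blocking I/O means at the C level, a descriptor can be switched to non-blocking mode with <c><![CDATA[fcntl]]></c>, after which a read returns at once whether or not data is available. The sketch is not taken from the example driver; the helper names <c><![CDATA[set_nonblocking]]></c> and <c><![CDATA[read_available]]></c> and the constant <c><![CDATA[READ_WOULD_BLOCK]]></c> are hypothetical:</p>
<code type="none"><![CDATA[
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical return code meaning "no data right now, try again later". */
#define READ_WOULD_BLOCK (-2)

/* Switch fd to non-blocking mode; returns 0 on success, -1 on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Read what is available without ever blocking the calling thread.
   Returns the number of bytes read (0 on end of file),
   READ_WOULD_BLOCK if nothing is available yet, or -1 on error. */
static long read_available(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return READ_WOULD_BLOCK;
    return (long) n;
}
]]></code>
<p>In a driver, the would-block case is the point where a call-back would typically register the descriptor with <c><![CDATA[driver_select]]></c> and return to the emulator instead of waiting.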
</p> + <p>When the driver is implemented, one would preferably write an + Erlang interface for the driver to be able to test the + functionality of the driver separately. This interface can then + be used by the distribution module, which will hide the details of + the protocol from the <c><![CDATA[net_kernel]]></c>. The easiest path is to + mimic the <c><![CDATA[inet]]></c> and <c><![CDATA[inet_tcp]]></c> interfaces, but a lot of + functionality in those modules need not be implemented. In the + example application, only a few of the usual interfaces are + implemented, and they are much simplified.</p> + <p>When the protocol is available to Erlang through a driver and an + Erlang interface module, a distribution module can be + written. The distribution module is a module with well defined + call-backs, much like a <c><![CDATA[gen_server]]></c> (there is no compiler support + for checking the call-backs though). The details of finding other + nodes (i.e. talking to epmd or something similar), creating a + listen port (or similar), connecting to other nodes and performing + the handshakes/cookie verification are all implemented by this + module. There is however a utility module, <c><![CDATA[dist_util]]></c>, that + will do most of the hard work of handling handshakes, cookies, + timers and ticking. Using <c><![CDATA[dist_util]]></c> makes implementing a + distribution module much easier, and that is what we do in + the example application.</p> + <p>The last step is to create boot scripts to make the protocol + implementation available at boot time. The implementation can be + debugged by starting the distribution when all of the system is + running, but in a real system the distribution should start very + early, which is why a boot script and some command line parameters are + necessary. This last step also implies that the Erlang code in the + interface and distribution modules is written in such a way that + it can be run in the startup phase. 
Most notably there can be no + calls to the <c><![CDATA[application]]></c> module or to any modules not + loaded at boot-time (i.e. only <c><![CDATA[kernel]]></c>, <c><![CDATA[stdlib]]></c> and the + application itself can be used).</p> + </section> + + <section> + <title>The driver</title> + <p>Although Erlang drivers in general may be beyond the scope of this + document, a brief introduction seems to be in order.</p> + + <section> + <title>Drivers in general</title> + <p>An Erlang driver is a native code module written in C (or + assembler) which serves as an interface for some special operating + system service. This is a general mechanism that is used + throughout the Erlang emulator for all kinds of I/O. An Erlang + driver can be dynamically linked (or loaded) to the Erlang + emulator at runtime by using the <c><![CDATA[erl_ddll]]></c> Erlang + module. Some of the drivers in OTP are however statically linked + to the runtime system, but that's more an optimization than a + necessity.</p> + <p>The driver data-types and the functions available to the driver + writer are defined in the header file <c><![CDATA[erl_driver.h]]></c>, + located in Erlang's include directory (and in + $ERL_TOP/erts/emulator/beam in the source code + distribution). There is also a deprecated version called + <c><![CDATA[driver.h]]></c>; do not use that one. Refer to + <c><![CDATA[erl_driver.h]]></c> for function prototypes etc.</p> + <p>When writing a driver to make a communications protocol available + to Erlang, one should know just about everything worth knowing + about that particular protocol. All operations have to be + non-blocking and all possible situations should be accounted for in + the driver. An unstable driver will affect and/or crash the + whole Erlang runtime system, which is seldom what's wanted. </p> + <p>The emulator calls the driver in the following situations:</p> + <list type="bulleted"> + <item>When the driver is loaded. 
This call-back has to have a + special name and will inform the emulator of what call-backs should + be used by returning a pointer to an <c><![CDATA[ErlDrvEntry]]></c> struct, + which should be properly filled in (see below).</item> + <item>When a port to the driver is opened (by an <c><![CDATA[open_port]]></c> + call from Erlang). This routine should set up internal data + structures and return an opaque data entity of the type + <c><![CDATA[ErlDrvData]]></c>, which is a data-type large enough to hold a + pointer. The pointer returned by this function will be the first + argument to all other call-backs concerning this particular + port. It is usually called the port handle. The emulator only + stores the handle and never tries to interpret it, so it can + be virtually anything (anything not larger than a pointer, + that is) and can point to anything if it is a pointer. Usually + this pointer will refer to a structure holding information about + the particular port, as it does in our example.</item> + <item>When an Erlang process sends data to the port. The data will + arrive as a buffer of bytes; the interpretation is not defined, + but is up to the implementor. This call-back returns nothing to the + caller; answers are sent to the caller as messages (using a + routine called <c><![CDATA[driver_output]]></c> available to all + drivers). There is also a way to talk in a synchronous way to + drivers, described below. There can be an additional call-back + function for handling data that is fragmented (sent in a deep + io-list). That interface will get the data in a form suitable for + Unix <c><![CDATA[writev]]></c> rather than in a single buffer. There is no + need for a distribution driver to implement such a call-back, so + we won't.</item> + <item>When a file descriptor is signaled for input. 
This call-back + is called when the emulator detects input on a file descriptor + which the driver has marked for monitoring by using the interface + <c><![CDATA[driver_select]]></c>. The mechanism of driver select makes it + possible to read from file descriptors without blocking by calling + <c><![CDATA[driver_select]]></c> when reading is needed and then doing the actual + reading in this call-back (when reading is actually possible). The + typical scenario is that <c><![CDATA[driver_select]]></c> is called when an + Erlang process orders a read operation, and that this routine + sends the answer when data is available on the file descriptor.</item> + <item>When a file descriptor is signaled for output. This call-back + is called in a similar way as the previous, but when writing to a + file descriptor is possible. The usual scenario is that Erlang + orders writing on a file descriptor and that the driver calls + <c><![CDATA[driver_select]]></c>. When the descriptor is ready for output, + this call-back is called and the driver can try to send the + output. There may of course be queuing involved in such + operations, and there are some convenient queue routines available + to the driver writer to use in such situations.</item> + <item>When a port is closed, either by an Erlang process or by the + driver calling one of the <c><![CDATA[driver_failure_XXX]]></c> routines. This + routine should clean up everything connected to one particular + port. Note that when other call-backs call a + <c><![CDATA[driver_failure_XXX]]></c> routine, this routine will be + immediately called and the call-back routine issuing the error can + make no more use of the data structures for the port, as this + routine surely has freed all associated data and closed all file + descriptors. 
If the queue utility available to driver writers is + used, this routine will however <em>not</em> be called until the + queue is empty.</item> + <item>When an Erlang process calls <c>erlang:port_control/3</c>, + which is a synchronous interface to drivers. The control interface + is used to set driver options, change states of ports etc. We'll + use this interface quite a lot in our example.</item> + <item>When a timer expires. The driver can set timers with the + function <c><![CDATA[driver_set_timer]]></c>. When such timers expire, a + specific call-back function is called. We will not use timers in + our example.</item> + <item>When the whole driver is unloaded. Every resource allocated + by the driver should be freed.</item> + </list> + </section> + + <section> + <title>The distribution driver's data structures</title> + <p>The driver used for Erlang distribution should implement a + reliable, order maintaining, variable length packet oriented + protocol. All error correction, re-sending and such need to be + implemented in the driver or by the underlying communications + protocol. If the protocol is stream oriented (as is the case with + both TCP/IP and our streamed Unix domain sockets), some mechanism + for packaging is needed. We will use the simple method of having a + header of four bytes containing the length of the package in a big + endian 32 bit integer (as Unix domain sockets only can be used + between processes on the same machine, we actually don't need to + code the integer in some special endianness, but I'll do it anyway + because in most situations you do need to do it). 
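</p>
<p>The four byte length header described above could be written and read as in the following sketch. The helper names (<c><![CDATA[put_be32]]></c>, <c><![CDATA[get_be32]]></c> and <c><![CDATA[frame_packet]]></c>) are hypothetical and not taken from the example driver; shifting byte by byte makes the result independent of the byte order of the host:</p>
<code type="none"><![CDATA[
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store len as a big endian 32 bit integer in hdr[0] to hdr[3]. */
static void put_be32(unsigned char *hdr, uint32_t len)
{
    hdr[0] = (unsigned char) ((len >> 24) & 0xff);
    hdr[1] = (unsigned char) ((len >> 16) & 0xff);
    hdr[2] = (unsigned char) ((len >> 8) & 0xff);
    hdr[3] = (unsigned char) (len & 0xff);
}

/* Read a big endian 32 bit integer from hdr[0] to hdr[3]. */
static uint32_t get_be32(const unsigned char *hdr)
{
    return ((uint32_t) hdr[0] << 24) | ((uint32_t) hdr[1] << 16) |
           ((uint32_t) hdr[2] << 8) | (uint32_t) hdr[3];
}

/* Write the header followed by the payload into out, which must have
   room for len + 4 bytes. Returns the total number of bytes written. */
static size_t frame_packet(unsigned char *out, const void *payload,
                           uint32_t len)
{
    put_be32(out, len);
    memcpy(out + 4, payload, len);
    return (size_t) len + 4;
}
]]></code>
<p>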
Unix domain + sockets are reliable and order maintaining, so we don't need to + implement resends and such in our driver.</p> + <p>Let's start writing our example Unix domain sockets driver by + declaring prototypes and filling in a static ErlDrvEntry + structure.</p> + <code type="none"><![CDATA[ +( 1) #include <stdio.h> +( 2) #include <stdlib.h> +( 3) #include <string.h> +( 4) #include <unistd.h> +( 5) #include <errno.h> +( 6) #include <sys/types.h> +( 7) #include <sys/stat.h> +( 8) #include <sys/socket.h> +( 9) #include <sys/un.h> +(10) #include <fcntl.h> + +(11) #define HAVE_UIO_H +(12) #include "erl_driver.h" + +(13) /* +(14) ** Interface routines +(15) */ +(16) static ErlDrvData uds_start(ErlDrvPort port, char *buff); +(17) static void uds_stop(ErlDrvData handle); +(18) static void uds_command(ErlDrvData handle, char *buff, int bufflen); +(19) static void uds_input(ErlDrvData handle, ErlDrvEvent event); +(20) static void uds_output(ErlDrvData handle, ErlDrvEvent event); +(21) static void uds_finish(void); +(22) static int uds_control(ErlDrvData handle, unsigned int command, +(23) char* buf, int count, char** res, int res_size); + +(24) /* The driver entry */ +(25) static ErlDrvEntry uds_driver_entry = { +(26) NULL, /* init, N/A */ +(27) uds_start, /* start, called when port is opened */ +(28) uds_stop, /* stop, called when port is closed */ +(29) uds_command, /* output, called when erlang has sent */ +(30) uds_input, /* ready_input, called when input +(31) descriptor ready */ +(32) uds_output, /* ready_output, called when output +(33) descriptor ready */ +(34) "uds_drv", /* char *driver_name, the argument +(35) to open_port */ +(36) uds_finish, /* finish, called when unloaded */ +(37) NULL, /* void * that is not used (BC) */ +(38) uds_control, /* control, port_control callback */ +(39) NULL, /* timeout, called on timeouts */ +(40) NULL, /* outputv, vector output interface */ +(41) NULL, /* ready_async callback */ +(42) NULL, /* flush callback */ +(43) NULL, /* 
call callback */ +(44) NULL, /* event callback */ +(45) ERL_DRV_EXTENDED_MARKER, /* Extended driver interface marker */ +(46) ERL_DRV_EXTENDED_MAJOR_VERSION, /* Major version number */ +(47) ERL_DRV_EXTENDED_MINOR_VERSION, /* Minor version number */ +(48) ERL_DRV_FLAG_SOFT_BUSY, /* Driver flags. Soft busy flag is +(49) required for distribution drivers */ +(50) NULL, /* Reserved for internal use */ +(51) NULL, /* process_exit callback */ +(52) NULL /* stop_select callback */ +(53) };]]></code> + <p>On lines 1 to 10 we have included the OS headers needed for our + driver. As this driver is written for Solaris, we know that the + header <c><![CDATA[uio.h]]></c> exists, so we can define the preprocessor + variable <c><![CDATA[HAVE_UIO_H]]></c> before we include <c><![CDATA[erl_driver.h]]></c> + at line 12. The definition of <c><![CDATA[HAVE_UIO_H]]></c> will make the + I/O vectors used in Erlang's driver queues correspond to the + operating system's equivalents, which is very convenient.</p> + <p>The different call-back functions are declared ("forward + declarations") on lines 16 to 23.</p> + <p>The driver structure is similar for statically linked-in + drivers and dynamically loaded ones. However, some of the fields + should be left empty (i.e. initialized to NULL) in the + different types of drivers. The first field (the <c><![CDATA[init]]></c> + function pointer) is always left blank in a dynamically loaded + driver, which can be seen on line 26. The NULL on line 37 + should always be there; the field is no longer used and is + retained for backward compatibility. We use no timers in this + driver, so no call-back for timers is needed. The <c>outputv</c> field + (line 40) can be used to implement an interface similar to + Unix <c><![CDATA[writev]]></c> for output. The Erlang runtime + system could previously not use <c>outputv</c> for the + distribution, but since erts version 5.7.2 it can. 
+ Since this driver was written before erts version 5.7.2 it does + not use the <c>outputv</c> callback. Using the <c>outputv</c> + callback is preferred since it reduces copying of data. (We + will however use scatter/gather I/O internally in the driver).</p> + <p>As of erts version 5.5.3 the driver interface was extended with + version control and the possibility to pass capability information. + Capability flags are present at line 48. As of erts version 5.7.4 + the + <seealso marker="driver_entry#driver_flags">ERL_DRV_FLAG_SOFT_BUSY</seealso> + flag is required for drivers that are to be used by the distribution. + The soft busy flag implies that the driver is capable of handling + calls to the <c>output</c> and <c>outputv</c> callbacks even though + it has marked itself as busy. This has always been a requirement + on drivers used by the distribution, but there has previously not + been any capability information available about this. For more + information see + <seealso marker="erl_driver#set_busy_port">set_busy_port()</seealso>. +</p> + <p>This driver was written before the runtime system had SMP support. + The driver will still function in the runtime system with SMP support, + but performance will suffer from lock contention on the driver lock + used for the driver. This can be alleviated by reviewing and perhaps + rewriting the code so that each instance of the driver safely can + execute in parallel. When instances safely can execute in parallel it + is safe to enable instance specific locking on the driver. This is done + by passing + <seealso marker="driver_entry#driver_flags">ERL_DRV_FLAG_USE_PORT_LOCKING</seealso> + as a driver flag. This is left as an exercise for the reader.</p> + <p>Our defined call-backs thus are:</p> + <list type="bulleted"> + <item>uds_start, which shall initiate data for a port. 
We won't + create any actual sockets here, just initialize data structures.</item> + <item>uds_stop, the function called when a port is closed.</item> + <item>uds_command, which will handle messages from Erlang. The + messages can either be plain data to be sent or more subtle + instructions to the driver. We will use this function mostly for + data pumping.</item> + <item>uds_input, this is the call-back which is called when we have + something to read from a socket.</item> + <item>uds_output, this is the function called when we can write to a + socket.</item> + <item>uds_finish, which is called when the driver is unloaded. A + distribution driver will actually (or hopefully) never be unloaded, + but we include this for completeness. Being able to clean up after + oneself is always a good thing.</item> + <item>uds_control, the <c>erlang:port_control/3</c> call-back, which + will be used a lot in this implementation.</item> + </list> + <p>The ports implemented by this driver will operate in two major + modes, which I will call the <em>command</em> and <em>data</em> + modes. In command mode, only passive reading and writing (like + gen_tcp:recv/gen_tcp:send) can be + done, and this is the mode the port will be in during the + distribution handshake. When the connection is up, the port will + be switched to data mode and all data will be immediately read and + passed further to the Erlang emulator. In data mode, no data + arriving at uds_command will be interpreted, but just packaged + and sent out on the socket. The uds_control call-back will do the + switching between those two modes.</p> + <p>While the <c><![CDATA[net_kernel]]></c> informs different subsystems that the + connection is coming up, the port should accept data to send, but + not receive any data, so that no data arrives from another node + before every kernel subsystem is prepared to handle it. 
We have a + third mode for this intermediate stage, let's call it the + <em>intermediate</em> mode.</p> + <p>Let's define an enum for the different types of ports we have:</p> + <code type="none"><![CDATA[ +( 1) typedef enum { +( 2) portTypeUnknown, /* An uninitialized port */ +( 3) portTypeListener, /* A listening port/socket */ +( 4) portTypeAcceptor, /* An intermediate stage when accepting +( 5) on a listen port */ +( 6) portTypeConnector, /* An intermediate stage when connecting */ +( 7) portTypeCommand, /* A connected open port in command mode */ +( 8) portTypeIntermediate, /* A connected open port in special +( 9) half active mode */ +(10) portTypeData /* A connected open port in data mode */ +(11) } PortType; ]]></code> + <p>Let's look at the different types:</p> + <list type="bulleted"> + <item>portTypeUnknown - The type a port has when it's opened, but + not actually bound to any file descriptor.</item> + <item>portTypeListener - A port that is connected to a listen + socket. This port will not do very much; there will be no data + pumping done on this socket, but there will be read data available + when one is trying to do an accept on the port.</item> + <item>portTypeAcceptor - This is a port that is to represent the + result of an accept operation. It is created when one wants to + accept from a listen socket, and it will be converted to a + portTypeCommand when the accept succeeds.</item> + <item>portTypeConnector - Very similar to portTypeAcceptor, an + intermediate stage between the request for a connect operation and + the socket actually being connected to an accepting socket at the + other end. As soon as the sockets are connected, the port will + switch type to portTypeCommand.</item> + <item>portTypeCommand - A connected socket (or accepted socket if + you want) that is in the command mode mentioned earlier.</item> + <item>portTypeIntermediate - The intermediate stage for a connected + socket. 
There should be no processing of input for this socket.</item> + <item>portTypeData - The mode where data is pumped through the port + and the uds_command routine will regard every call as a call where + sending is wanted. In this mode all input available will be read and + sent to Erlang as soon as it arrives on the socket, much like in the + active mode of a <c><![CDATA[gen_tcp]]></c> socket.</item> + </list> + <p>Now let's look at the state we'll need for our ports. One can note + that not all fields are used for all types of ports and that one + could save some space by using unions, but that would clutter the + code with multiple indirections, so I simply use one struct for + all types of ports, for readability.</p> + <code type="none"><![CDATA[ +( 1) typedef unsigned char Byte; +( 2) typedef unsigned int Word; + +( 3) typedef struct uds_data { +( 4) int fd; /* File descriptor */ +( 5) ErlDrvPort port; /* The port identifier */ +( 6) int lockfd; /* The file descriptor for a lock file in +( 7) case of listen sockets */ +( 8) Byte creation; /* The creation serial derived from the +( 9) lockfile */ +(10) PortType type; /* Type of port */ +(11) char *name; /* Short name of socket for unlink */ +(12) Word sent; /* Bytes sent */ +(13) Word received; /* Bytes received */ +(14) struct uds_data *partner; /* The partner in an accept/listen pair */ +(15) struct uds_data *next; /* Next structure in list */ +(16) /* The input buffer and its data */ +(17) int buffer_size; /* The allocated size of the input buffer */ +(18) int buffer_pos; /* Current position in input buffer */ +(19) int header_pos; /* Where the current header is in the +(20) input buffer */ +(21) Byte *buffer; /* The actual input buffer */ +(22) } UdsData; ]]></code> + <p>This structure is used for all types of ports although some + fields are useless for some types. 
The least memory consuming + solution would be to arrange this structure as a union of + structures, but the multiple indirections in the code to + access a field in such a structure will clutter the code too + much for an example.</p> + <p>Let's look at the fields in our structure:</p> + <list type="bulleted"> + <item>fd - The file descriptor of the socket associated with the + port.</item> + <item>port - The port identifier for the port which this structure + corresponds to. It is needed for most <c><![CDATA[driver_XXX]]></c> + calls from the driver back to the emulator.</item> + <item> + <p>lockfd - If the socket is a listen socket, we use a separate + (regular) file for two purposes:</p> + <list type="bulleted"> + <item>We want a locking mechanism that gives no race + conditions, so that we can tell whether another Erlang + node is using the listen socket name we require or whether the + file is only left over from a previous (crashed) + session.</item> + <item> + <p>We store the <em>creation</em> serial number in the + file. The <em>creation</em> is a number that should + change between different instances of different Erlang + emulators with the same name, so that process + identifiers from one emulator won't be valid when sent + to a new emulator with the same distribution name. The + creation can be between 0 and 3 (two bits) and is stored + in every process identifier sent to another node. </p> + <p>In a system with TCP based distribution, this data is + kept in the <em>Erlang port mapper daemon</em> + (<c><![CDATA[epmd]]></c>), which is contacted when a distributed + node starts. The lock-file and a convention for the UDS + listen socket's name will remove the need for + <c><![CDATA[epmd]]></c> when using this distribution module. 
UDS + is always restricted to one host, so avoiding a port + mapper is easy.</p> + </item> + </list> + </item> + <item>creation - The creation number for a listen socket, which is + calculated as (the value found in the lock-file + 1) rem + 4. This creation value is also written back into the + lock-file, so that the next invocation of the emulator will + find our value in the file.</item> + <item>type - The current type/state of the port, which can be one + of the values declared above.</item> + <item>name - The name of the socket file (the path prefix + removed), which allows for deletion (<c><![CDATA[unlink]]></c>) when the + socket is closed.</item> + <item>sent - How many bytes have been sent over the + socket. This may wrap, but that's no problem for the + distribution, as the only thing that interests the Erlang + distribution is if this value has changed (the Erlang + net_kernel <em>ticker</em> uses this value by calling the + driver to fetch it, which is done through the + <c>erlang:port_control</c> routine).</item> + <item>received - How many bytes have been read (received) from the + socket, used in similar ways as <c><![CDATA[sent]]></c>.</item> + <item>partner - A pointer to another port structure, which is + either the listen port from which this port is accepting a + connection or the other way around. The "partner relation" + is always bidirectional.</item> + <item>next - Pointer to next structure in a linked list of all + port structures. This list is used when accepting + connections and when the driver is unloaded.</item> + <item>buffer_size, buffer_pos, header_pos, buffer - data for input + buffering. Refer to the source code (in the kernel/examples + directory) for details about the input buffering. 
That + certainly goes beyond the scope of this document.</item> + </list> + </section> + + <section> + <title>Selected parts of the distribution driver implementation</title> + <p>The distribution driver's implementation is not completely + covered in this text; details about buffering and other things + unrelated to driver writing are not explained. Likewise, + some peculiarities of the UDS protocol are not explained in + detail. The chosen protocol is not important.</p> + <p>Prototypes for the driver call-back routines can be found in + the <c><![CDATA[erl_driver.h]]></c> header file.</p> + <p>The driver initialization routine is (usually) declared with a + macro to make the driver easier to port between different + operating systems (and flavours of systems). This is the only + routine that has to have a well defined name. All other + call-backs are reached through the driver structure. The macro + to use is named <c><![CDATA[DRIVER_INIT]]></c> and takes the driver name + as parameter.</p> + <code type="none"><![CDATA[ +(1) /* Beginning of linked list of ports */ +(2) static UdsData *first_data; + + +(3) DRIVER_INIT(uds_drv) +(4) { +(5) first_data = NULL; +(6) return &uds_driver_entry; +(7) } ]]></code> + <p>The routine initializes the single global data structure and + returns a pointer to the driver entry. The routine will be + called when <c><![CDATA[erl_ddll:load_driver]]></c> is called from Erlang.</p> + <p>The <c><![CDATA[uds_start]]></c> routine is called when a port is opened + from Erlang. In our case, we only allocate a structure and + initialize it. 
Creating the actual socket is left to the + <c><![CDATA[uds_command]]></c> routine.</p> + <code type="none"><![CDATA[ +( 1) static ErlDrvData uds_start(ErlDrvPort port, char *buff) +( 2) { +( 3) UdsData *ud; +( 4) +( 5) ud = ALLOC(sizeof(UdsData)); +( 6) ud->fd = -1; +( 7) ud->lockfd = -1; +( 8) ud->creation = 0; +( 9) ud->port = port; +(10) ud->type = portTypeUnknown; +(11) ud->name = NULL; +(12) ud->buffer_size = 0; +(13) ud->buffer_pos = 0; +(14) ud->header_pos = 0; +(15) ud->buffer = NULL; +(16) ud->sent = 0; +(17) ud->received = 0; +(18) ud->partner = NULL; +(19) ud->next = first_data; +(20) first_data = ud; +(21) +(22) return((ErlDrvData) ud); +(23) } ]]></code> + <p>Every data item is initialized, so that no problems will arise + when a newly created port is closed (without there being any + corresponding socket). This routine is called when + <c><![CDATA[open_port({spawn, "uds_drv"},[])]]></c> is called from Erlang.</p> + <p>The <c><![CDATA[uds_command]]></c> routine is the routine called when an + Erlang process sends data to the port. All asynchronous + commands when the port is in <em>command mode</em> as well as + the sending of all data when the port is in <em>data mode</em> + are handled in this routine. 
Let's have a look at it:</p> + <code type="none"><![CDATA[ +( 1) static void uds_command(ErlDrvData handle, char *buff, int bufflen) +( 2) { +( 3) UdsData *ud = (UdsData *) handle; + +( 4) if (ud->type == portTypeData || ud->type == portTypeIntermediate) { +( 5) DEBUGF(("Passive do_send %d",bufflen)); +( 6) do_send(ud, buff + 1, bufflen - 1); /* XXX */ +( 7) return; +( 8) } +( 9) if (bufflen == 0) { +(10) return; +(11) } +(12) switch (*buff) { +(13) case 'L': +(14) if (ud->type != portTypeUnknown) { +(15) driver_failure_posix(ud->port, ENOTSUP); +(16) return; +(17) } +(18) uds_command_listen(ud,buff,bufflen); +(19) return; +(20) case 'A': +(21) if (ud->type != portTypeUnknown) { +(22) driver_failure_posix(ud->port, ENOTSUP); +(23) return; +(24) } +(25) uds_command_accept(ud,buff,bufflen); +(26) return; +(27) case 'C': +(28) if (ud->type != portTypeUnknown) { +(29) driver_failure_posix(ud->port, ENOTSUP); +(30) return; +(31) } +(32) uds_command_connect(ud,buff,bufflen); +(33) return; +(34) case 'S': +(35) if (ud->type != portTypeCommand) { +(36) driver_failure_posix(ud->port, ENOTSUP); +(37) return; +(38) } +(39) do_send(ud, buff + 1, bufflen - 1); +(40) return; +(41) case 'R': +(42) if (ud->type != portTypeCommand) { +(43) driver_failure_posix(ud->port, ENOTSUP); +(44) return; +(45) } +(46) do_recv(ud); +(47) return; +(48) default: +(49) return; +(50) } +(51) } ]]></code> + <p>The command routine takes three parameters: the handle + returned for the port by <c><![CDATA[uds_start]]></c>, which is a pointer + to the internal port structure, the data buffer and the length + of the data buffer. The buffer is the data sent from Erlang + (a list of bytes) converted to a C array (of bytes). </p> + <p>If Erlang sends e.g. the list <c><![CDATA[[$a,$b,$c]]]></c> to the port, + the <c><![CDATA[bufflen]]></c> variable will be <c><![CDATA[3]]></c> and the + <c><![CDATA[buff]]></c> variable will contain <c><![CDATA[{'a','b','c'}]]></c> (no + null termination). 
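</p>
<p>Because the buffer is not null terminated, it cannot be handed directly to C functions that expect a string. A minimal sketch of copying it into a terminated string first, assuming a hypothetical helper <c><![CDATA[copy_to_cstring]]></c> that is not part of the example driver:</p>
<code type="none"><![CDATA[
#include <stddef.h>
#include <string.h>

/* Copy bufflen bytes from buff into dst and add a terminating NUL.
   Returns 0 on success, -1 if the bytes plus the terminator do not
   fit in dstsize. */
static int copy_to_cstring(char *dst, size_t dstsize,
                           const char *buff, int bufflen)
{
    if (bufflen < 0 || (size_t) bufflen >= dstsize)
        return -1;
    memcpy(dst, buff, (size_t) bufflen);
    dst[bufflen] = '\0';
    return 0;
}
]]></code>
<p>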
Usually the first byte is used as an + opcode, which is the case in our driver too (at least when the + port is in command mode). The opcodes are defined as:</p> + <list type="bulleted"> + <item>'L'<socketname>: Create and listen on socket with the + given name.</item> + <item>'A'<listennumber as 32 bit bigendian>: Accept from the + listen socket identified by the given identification + number. The identification number is retrieved with the + uds_control routine.</item> + <item>'C'<socketname>: Connect to the socket named + <socketname>.</item> + <item>'S'<data>: Send the data <data> on the + connected/accepted socket (in command mode). The sending is + acked when the data has left this process.</item> + <item>'R': Receive one packet of data.</item> + </list> + <p>One may wonder what is meant by "one packet of data" in the + 'R' command. This driver always sends data packeted with a 4 + byte header containing a big endian 32 bit integer that + represents the length of the data in the packet. There is no + need for different packet sizes or some kind of streamed + mode, as this driver is for the distribution only. One may + wonder why the header word is coded explicitly in big endian + when a UDS socket is local to the host. The answer simply is + that I see it as a good practice when writing a distribution + driver, as distribution in practice usually crosses host + boundaries. </p> + <p>On lines 4-8 we handle the case where the port is in data or + intermediate mode; the rest of the routine handles the + different commands. We see (first on line 15) that the routine + uses the <c><![CDATA[driver_failure_posix()]]></c> routine to report + errors. One important thing to remember is that the failure + routines make a call to our <c><![CDATA[uds_stop]]></c> routine, which + will remove the internal port data. 
The handle (and the casted + handle <c><![CDATA[ud]]></c>) are therefore <em>invalid pointers</em> after a + <c><![CDATA[driver_failure]]></c> call and we should <em>immediately return</em>. The runtime system will send exit signals to all + linked processes.</p> + <p>The uds_input routine gets called when data is available on a + file descriptor previously passed to the <c><![CDATA[driver_select]]></c> + routine. Typically this happens when a read command is issued + and no data is available. Let's look at the <c><![CDATA[do_recv]]></c> + routine:</p> + <code type="none"><![CDATA[ +( 1) static void do_recv(UdsData *ud) +( 2) { +( 3) int res; +( 4) char *ibuf; +( 5) for(;;) { +( 6) if ((res = buffered_read_package(ud,&ibuf)) < 0) { +( 7) if (res == NORMAL_READ_FAILURE) { +( 8) driver_select(ud->port, (ErlDrvEvent) ud->fd, DO_READ, 1); +( 9) } else { +(10) driver_failure_eof(ud->port); +(11) } +(12) return; +(13) } +(14) /* Got a package */ +(15) if (ud->type == portTypeCommand) { +(16) ibuf[-1] = 'R'; /* There is always room for a single byte +(17) opcode before the actual buffer +(18) (where the packet header was) */ +(19) driver_output(ud->port,ibuf - 1, res + 1); +(20) driver_select(ud->port, (ErlDrvEvent) ud->fd, DO_READ,0); +(21) return; +(22) } else { +(23) ibuf[-1] = DIST_MAGIC_RECV_TAG; /* XXX */ +(24) driver_output(ud->port,ibuf - 1, res + 1); +(25) driver_select(ud->port, (ErlDrvEvent) ud->fd, DO_READ,1); +(26) } +(27) } +(28) } ]]></code> + <p>The routine tries to read data until a packet is read or the + <c><![CDATA[buffered_read_package]]></c> routine returns a + <c><![CDATA[NORMAL_READ_FAILURE]]></c> (an internally defined constant for + the module that means that the read operation resulted in an + <c><![CDATA[EWOULDBLOCK]]></c>). If the port is in command mode, the + reading stops when one package is read, but if it is in data + mode, the reading continues until the socket buffer is empty + (read failure). 
If no more data can be read and more is wanted + (always the case when the socket is in data mode), driver_select is + called to make the <c><![CDATA[uds_input]]></c> call-back be called when + more data is available for reading.</p> + <p>When the port is in data mode, all data is sent to Erlang in a + format that suits the distribution. In fact, the raw data will + never reach any Erlang process, but will be + translated/interpreted by the emulator itself and then + delivered in the correct format to the correct processes. In + the current emulator version, received data should be tagged + with a single byte of 100. That is the value the macro + <c><![CDATA[DIST_MAGIC_RECV_TAG]]></c> is defined as. The tagging of data + in the distribution will possibly change in the future.</p> + <p>The <c><![CDATA[uds_input]]></c> routine will handle other input events + (like nonblocking <c><![CDATA[accept]]></c>), but most importantly handle + data arriving at the socket by calling <c><![CDATA[do_recv]]></c>:</p> + <code type="none"><![CDATA[ +( 1) static void uds_input(ErlDrvData handle, ErlDrvEvent event) +( 2) { +( 3) UdsData *ud = (UdsData *) handle; + +( 4) if (ud->type == portTypeListener) { +( 5) UdsData *ad = ud->partner; +( 6) struct sockaddr_un peer; +( 7) int pl = sizeof(struct sockaddr_un); +( 8) int fd; + +( 9) if ((fd = accept(ud->fd, (struct sockaddr *) &peer, &pl)) < 0) { +(10) if (errno != EWOULDBLOCK) { +(11) driver_failure_posix(ud->port, errno); +(12) return; +(13) } +(14) return; +(15) } +(16) SET_NONBLOCKING(fd); +(17) ad->fd = fd; +(18) ad->partner = NULL; +(19) ad->type = portTypeCommand; +(20) ud->partner = NULL; +(21) driver_select(ud->port, (ErlDrvEvent) ud->fd, DO_READ, 0); +(22) driver_output(ad->port, "Aok",3); +(23) return; +(24) } +(25) do_recv(ud); +(26) } ]]></code> + <p>The important line here is the last line in the function: the + <c><![CDATA[do_recv]]></c> routine is called to handle new input. 
The rest + of the function handles input on a listen socket, which means + that it should be possible to do an accept on the + socket, which is also recognized as a read event.</p> + <p>The output mechanisms are similar to the input. Let's first + look at the <c><![CDATA[do_send]]></c> routine:</p> + <code type="none"><![CDATA[ +( 1) static void do_send(UdsData *ud, char *buff, int bufflen) +( 2) { +( 3) char header[4]; +( 4) int written; +( 5) SysIOVec iov[2]; +( 6) ErlIOVec eio; +( 7) ErlDrvBinary *binv[] = {NULL,NULL}; + +( 8) put_packet_length(header, bufflen); +( 9) iov[0].iov_base = (char *) header; +(10) iov[0].iov_len = 4; +(11) iov[1].iov_base = buff; +(12) iov[1].iov_len = bufflen; +(13) eio.iov = iov; +(14) eio.binv = binv; +(15) eio.vsize = 2; +(16) eio.size = bufflen + 4; +(17) written = 0; +(18) if (driver_sizeq(ud->port) == 0) { +(19) if ((written = writev(ud->fd, iov, 2)) == eio.size) { +(20) ud->sent += written; +(21) if (ud->type == portTypeCommand) { +(22) driver_output(ud->port, "Sok", 3); +(23) } +(24) return; +(25) } else if (written < 0) { +(26) if (errno != EWOULDBLOCK) { +(27) driver_failure_eof(ud->port); +(28) return; +(29) } else { +(30) written = 0; +(31) } +(32) } else { +(33) ud->sent += written; +(34) } +(35) /* Enqueue remaining */ +(36) } +(37) driver_enqv(ud->port, &eio, written); +(38) send_out_queue(ud); +(39) } ]]></code> + <p>This driver uses the <c><![CDATA[writev]]></c> system call to send data + onto the socket. A combination of writev and the driver output + queues is very convenient. An <em>ErlIOVec</em> structure + contains a <em>SysIOVec</em> (which is equivalent to the + <c><![CDATA[struct iovec]]></c> structure defined in <c><![CDATA[uio.h]]></c>). The + ErlIOVec also contains an array of <em>ErlDrvBinary</em> + pointers, of the same length as the number of buffers in the + I/O vector itself. 
One can use this to allocate the binaries + for the queue "manually" in the driver, but we'll just fill + the binary array with NULL values (line 7), which will make + the runtime system allocate its own buffers when we call + <c><![CDATA[driver_enqv]]></c> (line 37).</p> + <p></p> + <p>The routine builds an I/O vector containing the header bytes + and the buffer (the opcode has been removed and the buffer + length decreased by the output routine). If the queue is + empty, we'll write the data directly to the socket (or at + least try to). If any data is left, it is stored in the queue + and then we try to send the queue (line 38). An ack is sent + when the message is delivered completely (line 22). The + <c><![CDATA[send_out_queue]]></c> routine will send acks if the sending is + completed there. If the port is in command mode, the Erlang + code serializes the send operations so that only one packet + can be waiting for delivery at a time. Therefore the ack can + be sent simply whenever the queue is empty.</p> + <p></p> + <p>A short look at the <c><![CDATA[send_out_queue]]></c> routine:</p> + <code type="none"><![CDATA[ +( 1) static int send_out_queue(UdsData *ud) +( 2) { +( 3) for(;;) { +( 4) int vlen; +( 5) SysIOVec *tmp = driver_peekq(ud->port, &vlen); +( 6) int wrote; +( 7) if (tmp == NULL) { +( 8) driver_select(ud->port, (ErlDrvEvent) ud->fd, DO_WRITE, 0); +( 9) if (ud->type == portTypeCommand) { +(10) driver_output(ud->port, "Sok", 3); +(11) } +(12) return 0; +(13) } +(14) if (vlen > IO_VECTOR_MAX) { +(15) vlen = IO_VECTOR_MAX; +(16) } +(17) if ((wrote = writev(ud->fd, tmp, vlen)) < 0) { +(18) if (errno == EWOULDBLOCK) { +(19) driver_select(ud->port, (ErlDrvEvent) ud->fd, +(20) DO_WRITE, 1); +(21) return 0; +(22) } else { +(23) driver_failure_eof(ud->port); +(24) return -1; +(25) } +(26) } +(27) driver_deq(ud->port, wrote); +(28) ud->sent += wrote; +(29) } +(30) } ]]></code> + <p>What we do is simply to pick out an I/O vector from the queue + (which is the whole 
queue as a <em>SysIOVec</em>). If the I/O + vector is too long (IO_VECTOR_MAX is defined to 16), the vector + length is decreased (line 15), otherwise the <c><![CDATA[writev]]></c> + (line 17) call will + fail. Writing is tried and anything written is dequeued (line + 27). If the write fails with <c><![CDATA[EWOULDBLOCK]]></c> (note that all + sockets are in nonblocking mode), <c><![CDATA[driver_select]]></c> is + called to make the <c><![CDATA[uds_output]]></c> routine be called when + there is space to write again.</p> + <p>We will continue trying to write until the queue is empty or + the writing would block.</p> + <p>The routine above is called from the <c><![CDATA[uds_output]]></c> + routine, which looks like this:</p> + <code type="none"><![CDATA[ +( 1) static void uds_output(ErlDrvData handle, ErlDrvEvent event) +( 2) { +( 3) UdsData *ud = (UdsData *) handle; +( 4) if (ud->type == portTypeConnector) { +( 5) ud->type = portTypeCommand; +( 6) driver_select(ud->port, (ErlDrvEvent) ud->fd, DO_WRITE, 0); +( 7) driver_output(ud->port, "Cok",3); +( 8) return; +( 9) } +(10) send_out_queue(ud); +(11) } ]]></code> + <p>The routine is simple: it first handles the fact that the + output select will concern a socket in the business of + connecting (and the connecting blocked). If the socket is in + a connected state it simply sends the output queue. This + routine is called when it is possible to write to a socket + where we have an output queue, so there is no question what to + do.</p> + <p>The driver implements a control interface, which is a + synchronous interface called when Erlang calls + <c><![CDATA[erlang:port_control/3]]></c>. 
This is the only interface + that can control the driver when it is in data mode and it may + be called with the following opcodes:</p> + <list type="bulleted"> + <item>'C': Set port in command mode.</item> + <item>'I': Set port in intermediate mode.</item> + <item>'D': Set port in data mode.</item> + <item>'N': Get identification number for listen port. This + identification number is used in an accept command to the + driver. It is returned as a big endian 32 bit integer, which + happens to be the file descriptor for the listen socket.</item> + <item>'S': Get statistics, which is the number of bytes received, + the number of bytes sent and the number of bytes pending in + the output queue. This data is used when the distribution + checks that a connection is alive (ticking). The statistics + are returned as three 32 bit big endian integers.</item> + <item>'T': Send a tick message, which is a packet of length + 0. Ticking is done when the port is in data mode, so the + command for sending data cannot be used (besides it ignores + zero length packages in command mode). This is used by the + ticker to send dummy data when no other traffic is present. + <em>Note</em> that it is important that the interface for + sending ticks is not blocking. This implementation uses + <c>erlang:port_control/3</c> which does not block the caller. + If <c>erlang:port_command</c> is used, use + <c>erlang:port_command/3</c> and pass <c>[force]</c> as + option list; otherwise, the caller can be blocked indefinitely + on a busy port and prevent the system from taking down a + connection that is not functioning.</item> + <item>'R': Get creation number of listen socket, which is used to + dig out the number stored in the lock file to differentiate + between invocations of Erlang nodes with the same name.</item> + </list> + <p>The control interface gets a buffer to return its value in, + but is free to allocate its own buffer if the provided one is + too small. 
Here is the code for <c><![CDATA[uds_control]]></c>:</p> + <code type="none"><![CDATA[ +( 1) static int uds_control(ErlDrvData handle, unsigned int command, +( 2) char* buf, int count, char** res, int res_size) +( 3) { +( 4) /* Local macro to ensure large enough buffer. */ +( 5) #define ENSURE(N) \ +( 6) do { \ +( 7) if (res_size < N) { \ +( 8) *res = ALLOC(N); \ +( 9) } \ +(10) } while(0) + +(11) UdsData *ud = (UdsData *) handle; + +(12) switch (command) { +(13) case 'S': +(14) { +(15) ENSURE(13); +(16) **res = 0; +(17) put_packet_length((*res) + 1, ud->received); +(18) put_packet_length((*res) + 5, ud->sent); +(19) put_packet_length((*res) + 9, driver_sizeq(ud->port)); +(20) return 13; +(21) } +(22) case 'C': +(23) if (ud->type < portTypeCommand) { +(24) return report_control_error(res, res_size, "einval"); +(25) } +(26) ud->type = portTypeCommand; +(27) driver_select(ud->port, (ErlDrvEvent) ud->fd, DO_READ, 0); +(28) ENSURE(1); +(29) **res = 0; +(30) return 1; +(31) case 'I': +(32) if (ud->type < portTypeCommand) { +(33) return report_control_error(res, res_size, "einval"); +(34) } +(35) ud->type = portTypeIntermediate; +(36) driver_select(ud->port, (ErlDrvEvent) ud->fd, DO_READ, 0); +(37) ENSURE(1); +(38) **res = 0; +(39) return 1; +(40) case 'D': +(41) if (ud->type < portTypeCommand) { +(42) return report_control_error(res, res_size, "einval"); +(43) } +(44) ud->type = portTypeData; +(45) do_recv(ud); +(46) ENSURE(1); +(47) **res = 0; +(48) return 1; +(49) case 'N': +(50) if (ud->type != portTypeListener) { +(51) return report_control_error(res, res_size, "einval"); +(52) } +(53) ENSURE(5); +(54) (*res)[0] = 0; +(55) put_packet_length((*res) + 1, ud->fd); +(56) return 5; +(57) case 'T': /* tick */ +(58) if (ud->type != portTypeData) { +(59) return report_control_error(res, res_size, "einval"); +(60) } +(61) do_send(ud,"",0); +(62) ENSURE(1); +(63) **res = 0; +(64) return 1; +(65) case 'R': +(66) if (ud->type != portTypeListener) { +(67) return 
report_control_error(res, res_size, "einval"); +(68) } +(69) ENSURE(2); +(70) (*res)[0] = 0; +(71) (*res)[1] = ud->creation; +(72) return 2; +(73) default: +(74) return report_control_error(res, res_size, "einval"); +(75) } +(76) #undef ENSURE +(77) } ]]></code> + <p>The macro <c><![CDATA[ENSURE]]></c> (lines 5 to 10) is used to ensure that + the buffer is large enough for our answer. We switch on the + command and take actions; there is not much to say about this + routine. Worth noting is that we always have read select active + on a port in data mode (achieved by calling <c><![CDATA[do_recv]]></c> on + line 45), but turn off read selection in intermediate and + command modes (lines 27 and 36).</p> + <p>The rest of the driver is more or less UDS specific and not of + general interest.</p> + </section> + </section> + + <section> + <title>Putting it all together</title> + <p>To test the distribution, one can use the + <c><![CDATA[net_kernel:start/1]]></c> function, which is useful as it starts + the distribution on a running system, where tracing/debugging + can be performed. The <c><![CDATA[net_kernel:start/1]]></c> routine takes a + list as its single argument. The list's first element should be + the node name (without the "@hostname") as an atom, and the second (and + last) element should be one of the atoms <c><![CDATA[shortnames]]></c> or + <c><![CDATA[longnames]]></c>. In the example case <c><![CDATA[shortnames]]></c> is + preferred. </p> + <p>For net kernel to find out which distribution module to use, the + command line argument <c><![CDATA[-proto_dist]]></c> is used. The argument + is followed by one or more distribution module names, with the + "_dist" suffix removed, i.e. 
uds_dist as a distribution module + is specified as <c><![CDATA[-proto_dist uds]]></c>.</p> + <p>If no epmd (TCP port mapper daemon) is used, one should also + specify the command line option <c><![CDATA[-no_epmd]]></c>, which will make + Erlang skip the epmd startup, both as an OS process and as an + Erlang ditto.</p> + <p>The path to the directory where the distribution modules reside + must be known at boot, which can either be achieved by + specifying <c><![CDATA[-pa <path>]]></c> on the command line or by building + a boot script containing the applications used for your + distribution protocol (in the uds_dist protocol, it's only the + uds_dist application that needs to be added to the script).</p> + <p>The distribution will be started at boot if all the above is + specified and an <c><![CDATA[-sname <name>]]></c> flag is present at the + command line. Two examples follow: </p> + <pre> +$ <input>erl -pa $ERL_TOP/lib/kernel/examples/uds_dist/ebin -proto_dist uds -no_epmd</input> +Erlang (BEAM) emulator version 5.0 + +Eshell V5.0 (abort with ^G) +1> <input>net_kernel:start([bing,shortnames]).</input> +{ok,<0.30.0>} +(bing@hador)2></pre> + <p>...</p> + <pre> +$ <input>erl -pa $ERL_TOP/lib/kernel/examples/uds_dist/ebin -proto_dist uds \ </input> +<input> -no_epmd -sname bong</input> +Erlang (BEAM) emulator version 5.0 + +Eshell V5.0 (abort with ^G) +(bong@hador)1></pre> + <p>One can utilize the ERL_FLAGS environment variable to store the + complicated parameters in:</p> + <pre> +$ <input>ERL_FLAGS="-pa $ERL_TOP/lib/kernel/examples/uds_dist/ebin \ </input> +<input> -proto_dist uds -no_epmd"</input> +$ <input>export ERL_FLAGS</input> +$ <input>erl -sname bang</input> +Erlang (BEAM) emulator version 5.0 + +Eshell V5.0 (abort with ^G) +(bang@hador)1></pre> + <p>The <c><![CDATA[ERL_FLAGS]]></c> should preferably not include the name of + the node.</p> + </section> +</chapter> + |