Mnesia

1 Introduction

This document aims to give a brief introduction to the implementation of
mnesia, its data and functions. Håkan has written other mnesia papers of
interest (see ~hakan/public_html/):

  o Resource consumption (mnesia_consumption.txt)
  o What to think about when changing mnesia (mnesia_upgrade_policy.txt)
  o Mnesia internals course (mnesia_internals_slides.pdf)
  o Mnesia overview (mnesia_overview.pdf)

1.1 Basic concepts

In a mnesia cluster all nodes are equal; there is no concept of master or
backup nodes. That said, when mixing disc based nodes (which use the disc to
store meta information) and ram based nodes (which do not use the disc at
all), the disc based nodes sometimes take precedence over the ram based
ones.

2 Meta Data

Mnesia has two types of global meta data, static and dynamic. All the meta
data is stored in the ets table mnesia_gvar.

2.1 Static Meta Data

The static data is the schema information, usually kept in the 'schema.DAT'
file. It is created with mnesia:create_schema(Nodes) for disc nodes, i.e.
nodes which use the disc; ram based mnesia nodes create an empty schema at
startup. The static data, i.e. the schema, contains information about which
nodes are part of the cluster and which type (ram or disc) they have. It
also contains information about which tables exist on which nodes, and so
on. The schema information (static data) must always be the same on all
active nodes in the mnesia cluster. Schema information is updated via schema
functions, e.g. mnesia:add_table_copy/3, mnesia:change_table_copy_type/3,
and so on.

2.2 Dynamic Meta Data

The dynamic data is transient and local to each mnesia node in the cluster.
Examples of dynamic data are: the currently active mnesia nodes, which
tables are currently available, and where they are located. Dynamic data is
updated internally by each mnesia node during its lifetime, i.e. when nodes
go up and down or are added to or removed from the mnesia cluster.

3 Processes and Files

The most important processes in mnesia are mnesia_monitor,
mnesia_controller, mnesia_tm and mnesia_locker.

The mnesia_monitor process acts as supervisor and monitors all resources. It
listens for nodeup and nodedown and keeps links to other mnesia nodes; if a
node goes down it forwards that information to all processes that need it,
e.g. mnesia_controller, mnesia_locker, mnesia_tm and all transactions.
During start it negotiates the protocol version with the other nodes and
keeps track of which node uses which version. The monitor process also
detects and warns about partitioned networks; it is then up to the user to
deal with them. It is the owner of all open files, ets tables and so on.

The mnesia_controller process is responsible for loading tables, keeping the
dynamic meta data updated, and synchronizing dangerous work such as schema
transactions vs. log dump operations vs. table loading/sending.

The last two processes are involved in all transactions: mnesia_locker
manages transaction locks and mnesia_tm manages all transaction work.

4 Startup and Table Loading

The early startup is mostly driven by the mnesia_tm process/module: logs are
dumped (see log dumping), the node names of the other nodes in the cluster
are retrieved from the static meta data or from environment parameters, and
initial connections are made to the other mnesia nodes. The rest of the
startup is driven by the mnesia_controller process, where the schema (static
meta data) is merged between the nodes; this is done to keep the schema
consistent on all nodes in the cluster.
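
As a hedged sketch of what this looks like from the API side (the node
names and the table name 'person' are made up), a ram based node can be
pointed at an existing cluster via the extra_db_nodes parameter and then
wait for its tables to become available:

    %% Tell mnesia where to find the rest of the cluster before starting it
    %% (could also be given on the command line as -mnesia extra_db_nodes ...).
    ok = application:set_env(mnesia, extra_db_nodes, ['a@host1', 'b@host2']),
    ok = mnesia:start(),

    %% Block until the table has been loaded, locally or remotely, instead
    %% of racing against the table loading described below.
    ok = mnesia:wait_for_tables([person], 30000),

    %% The dynamic meta data can then be inspected at runtime:
    Running  = mnesia:system_info(running_db_nodes),
    ReadFrom = mnesia:table_info(person, where_to_read).
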
When the schema has been merged, all local tables are put in a loading
queue. Tables that are available only on the local node, or that have local
content, are loaded directly from disc (or created, if they are of type
ram_copies). The other tables are kept in the queue until mnesia decides
whether to load them from disc or from another node. If another mnesia node
has already loaded the table, i.e. has a copy in ram or an open dets file,
the table is always loaded from that node to keep the data consistent. If no
other node has a loaded copy of the table, some mnesia node has to load it
first, and the other nodes can then copy the table from it.

Mnesia keeps information about when other nodes went down. A starting mnesia
node checks which nodes have been down; if some of the nodes have not been
down, the starting node lets those nodes load the table first. If all other
nodes have been down, the starting mnesia node loads the table itself. The
node that is allowed to load the table loads it, and the other nodes copy it
from that node. If a node from which the starting node has no 'mnesia_down'
note is down, the starting node has to wait until that node comes up and the
decision can be taken; this behavior can be overruled by user settings.

The order of table loading could be described as:

  1. Mnesia downs, normally decide from where mnesia should load tables.
  2. Master nodes (override mnesia downs).
  3. Force load (overrides master nodes):
       1) if possible, load the table from an active master node,
       2) if no master node is active, load it from any active node,
       3) if no active node has an active copy of the table, use the local
          copy (if the table is ram based, create an empty one).

Currently mnesia can handle one download and one upload at the same time.
Dumping and loading/sending may run simultaneously, but neither of them may
run during a schema commit, and loaders/senders may not start while a schema
commit is enqueued. This synchronization prevents a schema transaction from
modifying the meta data and thereby changing the prerequisites of an ongoing
table load.

The actual loading of a table is implemented in 'mnesia_loader.erl'. It
currently works as follows:

   Receiver                            Sender
   --------                            ------
   Spawned
   Find sender node
   Queue sender request        ---->   Spawned
   *) Spawn real receiver      <----   Send init info
   *) Grab schema lock for             Grab write table lock
      that table, to avoid             Subscribe receiver
      deadlock with schema             to table updates
      transactions
   Create table (ets or dets)
   Release lock
   Get data (with ack          ---->
   as flow control)            <----   Burst data to receiver
                                       Send no_more
   Apply subscription messages
   Store copy on disc
   Grab read lock
   Create index, snmp data
   Update meta data info
   and checkpoints if needed
   Cleanup
   no_more                     ---->   Release lock

   *) Don't spawn or grab the schema lock if the operation is
      add_table_copy; it is already a schema operation.

5 Transactions

Transactions are normally driven from the client process, i.e. the process
that calls 'mnesia:transaction'. The client first acquires a globally unique
transaction id (tid) and a temporary transaction storage (ts, an ets table)
from mnesia_tm and then executes the transaction fun. Mnesia API calls such
as 'mnesia:write/1' and 'mnesia:read' contain code for acquiring the needed
locks. Intermediate database states and acquired locks are kept in the
transaction storage, and all mnesia operations have to be "patched" against
that store, i.e. a write operation in a transaction should be seen within
(and only within) that transaction if the same key is read after the write.
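
A minimal sketch of this read-after-write behaviour, assuming a table
'person' with attributes [id, name] already exists:

    T = fun() ->
            ok = mnesia:write({person, 1, "alice"}),
            %% The read is 'patched' against the transaction store, so the
            %% write above is already visible here ...
            [{person, 1, "alice"}] = mnesia:read(person, 1),
            ok
        end,
    %% ... but it is not visible to other transactions until this one
    %% has committed.
    {atomic, ok} = mnesia:transaction(T).
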
After the transaction fun has completed, the ts is analyzed to see which
nodes are involved in the transaction and what type of commit protocol shall
be used. Then the result is committed, additional work such as snmp,
checkpoint and index updates is performed, and the transaction is finished
by releasing all resources.

An example:

    Example = fun(X) ->
                  [{table1, key, Value}] = mnesia:read(table1, key),
                  ok = mnesia:write({table1, key, Value + X}),
                  [{table1, key, Updated}] = mnesia:read(table1, key),
                  Updated
              end,
    mnesia:transaction(Example, [10]).

A message overview of a simple, successful, asynchronous transaction:

                                                              non local
   Client Process        mnesia_tm (local)     mnesia_locker  mnesia_tm
   ------------------------------------------------------------------------
   Get tid               ---->
                         <----  Tid and ts
   Get read lock from
   available node        ------------------------------->
   Value                 <--------- Value or restart trans ---
   Patch value against ts
   Get write lock from
   all nodes             ------------------------------->
                         ------------------------------->
   ok's                  <<-------- ok's or restart trans ----
   write data in ts
   Get read lock, already done.
   Read data
   Value
   'Patch' data with ts
   Fun returns Value+X.
   If everything is ok,
   commit the transaction:
   find the nodes that the
   transaction needs to be
   committed on and collect
   every update from ts.
   Ask for commit        ----------->
                                     ----------------------------------->
                         Ok's        <<----------------------------------
                                     Commit
                                     ----------------------------------->
                         log commit decision on disk
                         Commit locally
                           update snmp
                           update checkpoints
                           notify subscribers
                           update index
                         Release locks
                         ------------------------------->
   Release transaction   ----->
   Return trans result
   ----------------------

If all needed resources are available, i.e. the needed tables are loaded
somewhere in the cluster during the transaction, and the user code does not
crash, a transaction in mnesia will not fail. If something happens in the
mnesia cluster, such as the node holding the replica the transaction was
about to read from going down, or a lock not being granted and the
transaction not being allowed to queue on that lock, the transaction is
restarted, i.e. all resources are released and the fun is called again. By
default a transaction may be restarted infinitely many times, but the user
may choose to limit the number of restarts.

The dirty operations do none of the above; they just find out where to write
the data, log the operation to disk and cast (or call, in the case of a
sync_dirty operation) the data to those nodes. The dirty operations
therefore have the drawback that each write or delete sends one message per
operation to the involved nodes.

There is also a synchronous variant of the 2-phase commit protocol which
waits for an additional ack message after the transaction has been committed
on every node. The intention is to give the user a way to handle overload
problems. A 3-phase commit protocol is used for schema transactions, or if
the transaction result is going to be committed in an asymmetrical way, i.e.
a transaction that writes to tables a and b where a and b have replicas on
different nodes. The outcome of such transactions is stored temporarily in
an ets table and in the log file.

6 Schema transactions

Schema transactions are handled differently from ordinary transactions; they
are implemented in mnesia_schema (and in mnesia_dumper). The schema
operation is always run in a spawned process, to protect against the client
process dying during the transaction.
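
As a hedged illustration from the API side (the table name 'person' and its
attributes are made up), each of the following calls is executed as a
schema transaction:

    %% Change the storage type of the replica on this node.
    {atomic, ok} = mnesia:change_table_copy_type(person, node(), disc_copies),

    %% Add a secondary index on the attribute 'name'.
    {atomic, ok} = mnesia:add_table_index(person, name),

    %% Rewrite every record to a new layout, here adding an 'age' field.
    {atomic, ok} = mnesia:transform_table(
                     person,
                     fun({person, Id, Name}) -> {person, Id, Name, 0} end,
                     [id, name, age]).
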
The actual transaction fun checks the pre-conditions, acquires the needed
locks and notes the operation in the transaction store. During the commit,
the schema transaction runs a schema prepare operation (on every node) that
does the necessary prerequisite work. Then the operation is logged to disc,
and the actual commit work is done by dumping the log. Every schema
operation has a special clause in mnesia_dumper to handle the finishing
work, and every schema prepare operation has a matching undo_prepare
operation which is invoked if the transaction is aborted.

7 Locks

"The locking algorithm is a traditional 'two-phase locking'* and the
deadlock prevention is 'wait-die'*; the time stamps for the wait-die
algorithm are a 'Lamport clock'* maintained by mnesia_tm. The Lamport clock
is kept when a transaction is restarted, to avoid starvation."

  * References can be found in mnesia_overview.pdf, the paper Klacke, Håkan
    and Hans wrote about mnesia.

What the quote above means is that read locks are acquired on the replica
that mnesia reads from, while write locks are acquired on all nodes that
have a replica. Several read locks can lock the same object, but write locks
are exclusive. The transaction identifier (tid) is an ever increasing,
system unique counter which has the same sort order on every node (a Lamport
clock), which enables mnesia_locker to order the lock requests.

When a lock request arrives, mnesia_locker checks whether the lock is
available; if it is, a 'granted' is sent back to the client and the lock is
noted as taken in an ets table. If the lock is already taken, the requesting
tid is compared with the tid of the transaction holding the lock. If the tid
of the holding transaction is greater than the tid of the asking
transaction, the request is put in the lock queue (another ets table) and no
response is sent back until the lock is released; otherwise the transaction
gets a negative response and mnesia_tm restarts it after sleeping for a
random time.

Sticky locks work almost like write locks: the first time a sticky lock is
acquired, a request is sent to all nodes. The lock is marked as taken by the
requesting node (not the transaction); when the lock is later released it is
only released on the node that holds the sticky lock, so the next time a
transaction on that node requests the lock it does not need to ask the other
nodes. If another node wants the lock it has to request a lock release
first, before it can acquire the lock.

8 Fragmented tables

Fragmented tables are used to split a large table into smaller parts. They
are implemented as a layer between the client and mnesia which extends the
meta data with additional properties and maps a {table, key} tuple to a
table fragment. The default mapping uses erlang:phash/2, but the user may
provide their own mapping function in order to predict which records are
stored in which table fragment, e.g. the client may want to steer where the
records generated by a certain device are placed. The foreign key is used to
co-locate other tables on the same node, and the other additional table
properties are also used to distribute the table fragments.

9 Log Dumping

All operations on disc tables are stored in a log, 'LATEST.LOG', on disk, so
that mnesia can redo the transactions if the node goes down. Dumping the log
means that mnesia moves the committed data from the general log to the table
specific disk storage. To avoid the log growing too large, using a lot of
disk space and making startup slow, mnesia dumps the log during its uptime.
There are two triggers that start the log dumping: a timeout and the number
of commits since the last dump; both are user configurable.
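
A hedged sketch of how these triggers can be tuned, via the
dump_log_time_threshold and dump_log_write_threshold application
parameters, and how a dump can be requested explicitly (the values below
are examples, not recommendations):

    %% In e.g. sys.config, before mnesia is started:
    %%   {mnesia, [{dump_log_write_threshold, 10000},    %% commits since last dump
    %%             {dump_log_time_threshold,  180000}]}  %% milliseconds
    %%
    %% A log dump can also be requested explicitly, e.g. before stopping a node:
    mnesia:dump_log().
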
Disc copies tables are implemented with two disk_log files, one 'table.DCD'
(disc copies data) and one 'table.DCL' (disc copies log). The dcd contains
raw records, and the dcl contains operations on that table, i.e.
'{write, {table, key, value}}' or '{delete, {table, key}}'. The first time a
record for a specific table is found when dumping the log, the sizes of both
the dcd and the dcl files are checked. If sizeof(dcl)/sizeof(dcd) is greater
than a threshold, the current ram table is dumped to the 'table.DCD' file,
the corresponding dcl file is deleted, and all other records in the general
log that belong to that table are ignored. If the threshold is not met, the
operations in the general log for that table are appended to the dcl file.
On startup both files are read: first the contents of the dcd are loaded
into an ets table, which is then modified by the operations stored in the
corresponding dcl file.

Disc only copies tables update the 'dets' file directly when the data is
committed, so those entries can be ignored during normal log dumping; they
are only applied to the 'dets' file during startup, when mnesia does not
know the state of the disk table.

10 Checkpoints and backups

Checkpoints are created to be able to take snapshots of the database, which
is useful when you want consistent backups, i.e. you do not want half of a
transaction in the backup. The checkpoint creates a shadow table (called a
retainer) for each table involved in the checkpoint. When a checkpoint is
requested it does not start until all ongoing transactions have completed.
New transactions update both the real table and the shadow table; the shadow
table records an operation to undo the change on the real table the first
time a key is modified. I.e. when the write operation '{table, a, 14}' is
made, the shadow table is checked to see whether key 'a' already has an undo
operation; if it has, nothing more is done. If not, a
{write, {table, a, OLD_VALUE}} is added to the shadow table if the real
table had an old value; otherwise a {delete, {table, a}} operation is added.
The backup is taken by copying every record in the real table and then
appending every operation in the shadow table to the backup, thus undoing
the changes that were made after the checkpoint was started.
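
As a hedged sketch of how this is used from the API (the checkpoint name,
table name and file name are made up):

    %% Activate a checkpoint covering the table 'person'; a retainer is
    %% attached to every replica of the table.
    {ok, Name, _Nodes} =
        mnesia:activate_checkpoint([{name, person_cp}, {max, [person]}]),

    %% Write the consistent snapshot to a backup file, then drop the
    %% checkpoint so that the retainers are released.
    ok = mnesia:backup_checkpoint(Name, "/tmp/person.BUP"),
    ok = mnesia:deactivate_checkpoint(Name).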