| author | Jean-Sébastien Pédron <[email protected]> | 2014-12-11 18:24:38 +0100 |
|---|---|---|
| committer | Dan Gudmundsson <[email protected]> | 2015-02-10 14:42:24 +0100 |
| commit | 70c4db1d0b57363e33a04c935b653092f68cb91a (patch) | |
| tree | 36b9ca11521f1bf02e40284b7e6c9728becb6e49 /lib/mnesia/src | |
| parent | ee17dd99f2a56499b13dc4c84578105ea1f14ff6 (diff) | |
| download | otp-70c4db1d0b57363e33a04c935b653092f68cb91a.tar.gz otp-70c4db1d0b57363e33a04c935b653092f68cb91a.tar.bz2 otp-70c4db1d0b57363e33a04c935b653092f68cb91a.zip | |
mnesia: Check nodes after protocol negotiation
During Mnesia startup, after protocol negotiation, the list of connected
nodes is written to "recover_nodes". This list is later used to merge
the schema.
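
For context, the per-node run state that the final check relies on can be queried directly. Below is a minimal sketch, not part of the patch; the module name node_probe and the rpc-based probing are illustrative assumptions. It keeps only the nodes where Mnesia reports it is running:

```erlang
%% Minimal sketch, not OTP code: probe each node's Mnesia run state over
%% rpc. mnesia:system_info(is_running) returns yes | no | starting |
%% stopping; an unreachable node yields {badrpc, _} and is filtered out.
-module(node_probe).
-export([running_nodes/1]).

running_nodes(Nodes) ->
    [N || N <- Nodes,
          rpc:call(N, mnesia, system_info, [is_running]) =:= yes].
```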
If Mnesia was stopped on a remote node between the protocol negotiation
and the moment the list is stored in "recover_nodes", the remote node
is still considered running: the value of "recover_nodes" stored by
mnesia_down/1 is overwritten. Therefore, this node may be used to
acquire a write lock on the schema in order to perform the merge. In
this case, the remote node never answers the lock request and Mnesia
hangs forever (application:start(mnesia) never returns).
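
Because application:start(mnesia) blocks indefinitely in this scenario, the hang only shows up as a caller that never returns. A hypothetical diagnostic wrapper (mnesia_start_guard is not part of OTP) can surface it as a timeout instead:

```erlang
%% Hypothetical helper, for illustration only: start Mnesia from a
%% separate process so a hung application:start/1 is reported as a
%% timeout rather than blocking the caller forever.
-module(mnesia_start_guard).
-export([start_with_timeout/1]).

start_with_timeout(Timeout) ->
    Caller = self(),
    Ref = make_ref(),
    spawn(fun() -> Caller ! {Ref, application:start(mnesia)} end),
    receive
        {Ref, Result} -> Result
    after Timeout ->
        {error, mnesia_start_timed_out}
    end.
```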
To fix the problem, we check the list one last time and remove all
nodes on which Mnesia is stopped. Because there is still a chance of
missing a mnesia_down event during this final check,
handle_cast({mnesia_down, ...}, ...) removes the down node from
recover_nodes again, in addition to mnesia_down/1.
Diffstat (limited to 'lib/mnesia/src')
| -rw-r--r-- | lib/mnesia/src/mnesia_recover.erl | 31 |
1 file changed, 28 insertions, 3 deletions
```diff
diff --git a/lib/mnesia/src/mnesia_recover.erl b/lib/mnesia/src/mnesia_recover.erl
index b6492707e2..eeb4fa0ced 100644
--- a/lib/mnesia/src/mnesia_recover.erl
+++ b/lib/mnesia/src/mnesia_recover.erl
@@ -689,12 +689,29 @@ handle_call({connect_nodes, Ns}, From, State) ->
 	    %% called from handle_info
 	    gen_server:reply(From, {[], AlreadyConnected}),
 	    {noreply, State};
-	GoodNodes ->
+	ProbablyGoodNodes ->
 	    %% Now we have agreed upon a protocol with some new nodes
-	    %% and we may use them when we recover transactions
+	    %% and we may use them when we recover transactions.
+	    %%
+	    %% Just in case Mnesia was stopped on some of those nodes
+	    %% between the protocol negotiation and now, we check one
+	    %% more time the state of Mnesia.
+	    %%
+	    %% Of course, there is still a chance that mnesia_down
+	    %% events occur during this check and we miss them. To
+	    %% prevent it, handle_cast({mnesia_down, ...}, ...) removes
+	    %% the down node again, in addition to mnesia_down/1.
+	    %%
+	    %% See a comment in handle_cast({mnesia_down, ...}, ...).
+	    Verify = fun(N) ->
+			     Run = mnesia_lib:is_running(N),
+			     Run =:= yes orelse Run =:= starting
+		     end,
+	    GoodNodes = [N || N <- ProbablyGoodNodes, Verify(N)],
+
 	    mnesia_lib:add_list(recover_nodes, GoodNodes),
 	    cast({announce_all, GoodNodes}),
-	    case get_master_nodes(schema) of
+	    case get_master_nodes(schema) of
 		[] ->
 		    Context = starting_partitioned_network,
 		    mnesia_monitor:detect_inconcistency(GoodNodes, Context);
@@ -842,6 +859,14 @@ handle_cast({what_decision, Node, OtherD}, State) ->
     {noreply, State};
 
 handle_cast({mnesia_down, Node}, State) ->
+    %% The node was already removed from recover_nodes in mnesia_down/1,
+    %% but we do it again here in the mnesia_recover process, in case
+    %% another event incorrectly added it back. This can happen during
+    %% Mnesia startup, which takes time between the connection, the
+    %% protocol negotiation and the merge of the schema.
+    %%
+    %% See a comment in handle_call({connect_nodes, ...}, ...).
+    mnesia_lib:del(recover_nodes, Node),
     case State#state.unclear_decision of
 	undefined ->
 	    {noreply, State};
```
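
A note on the design: the duplicate removal in handle_cast({mnesia_down, ...}, ...) is safe because deleting a node that is already absent from the list is a no-op. A self-contained illustration of that idempotency with a plain list (the idempotent_del module is hypothetical, not OTP code):

```erlang
%% Deleting an element twice from a list leaves the same result as
%% deleting it once, which is why mnesia_down/1 and the mnesia_recover
%% process can both remove a node without coordinating.
-module(idempotent_del).
-export([demo/0]).

demo() ->
    Nodes0 = ['a@host', 'b@host'],
    Nodes1 = lists:delete('b@host', Nodes0),  % first removal
    Nodes1 = lists:delete('b@host', Nodes1),  % second removal: no-op
    Nodes1.                                   % ['a@host']
```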