[DLM] do full recover_locks barrier
Red Hat BZ 211914

The previous patch "[DLM] fix aborted recovery during
node removal" was incomplete as discovered with further testing.  It set
the bit for the RS_LOCKS barrier but did not then wait for the barrier.
This is often ok, but sometimes it will cause yet another recovery hang.
If the node that skips the barrier wait is a new node that also has the
lowest nodeid, it misses the important step of collecting and reporting
the barrier status from the other nodes (which is the job of the low
nodeid in the barrier wait routine).
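For context, the sketch below models the barrier pattern the message
describes. It is a standalone toy in C, not the fs/dlm implementation, and
every name in it (struct node, wait_status_all, RS_LOCKS_ALL, and so on) is
invented for illustration: each node sets its RS_LOCKS bit and then waits at
the barrier, and the lowest nodeid is the one that collects the bit from the
other members and reports the result, which is exactly the step a low-nodeid
node misses if it sets the bit but skips the wait.

/*
 * Illustrative sketch only -- not the fs/dlm code.  All names are invented.
 */
#include <stdint.h>
#include <stdio.h>

#define RS_LOCKS      0x1	/* hypothetical per-node "reached barrier" bit */
#define RS_LOCKS_ALL  0x2	/* hypothetical "everyone reached it" bit */

struct node {
	int nodeid;
	uint32_t status;
};

#define NUM_NODES 3
static struct node members[NUM_NODES];	/* members[0] holds the low nodeid */

/* Low nodeid: poll every member until it shows the wanted bit. */
static void wait_status_all(uint32_t bit)
{
	for (int i = 0; i < NUM_NODES; i++)
		while (!(members[i].status & bit))
			;	/* real code would sleep and re-query over the network */
}

/* Everyone else: poll the low nodeid until it reports the barrier done. */
static void wait_status_low(uint32_t bit)
{
	while (!(members[0].status & bit))
		;
}

static void recover_locks_barrier(struct node *us)
{
	/* Step the earlier patch already did: announce we reached the barrier. */
	us->status |= RS_LOCKS;

	/* Step this patch adds: actually wait for the barrier. */
	if (us == &members[0]) {
		wait_status_all(RS_LOCKS);	/* collect status from all nodes... */
		us->status |= RS_LOCKS_ALL;	/* ...and report it to the others */
	} else {
		wait_status_low(RS_LOCKS_ALL);
	}
}

int main(void)
{
	for (int i = 0; i < NUM_NODES; i++)
		members[i].nodeid = i + 1;

	/* Single-threaded stand-in for "the other nodes reached the barrier". */
	members[1].status |= RS_LOCKS;
	members[2].status |= RS_LOCKS;

	recover_locks_barrier(&members[0]);	/* low nodeid collects and reports */
	recover_locks_barrier(&members[1]);	/* others see RS_LOCKS_ALL and proceed */
	recover_locks_barrier(&members[2]);

	printf("all %d nodes passed the recover_locks barrier\n", NUM_NODES);
	return 0;
}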

Signed-off-by: David Teigland <teigland@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
David Teigland authored and Steven Whitehouse committed Nov 30, 2006
1 parent 2cdc98a commit 4b77f2c
Showing 1 changed file with 7 additions and 1 deletion.
fs/dlm/recoverd.c
@@ -168,9 +168,15 @@ static int ls_recover(struct dlm_ls *ls, struct dlm_recover *rv)
 		/*
 		 * Other lockspace members may be going through the "neg" steps
 		 * while also adding us to the lockspace, in which case they'll
-		 * be looking for this status bit during dlm_recover_locks().
+		 * be doing the recover_locks (RS_LOCKS) barrier.
 		 */
 		dlm_set_recover_status(ls, DLM_RS_LOCKS);
+
+		error = dlm_recover_locks_wait(ls);
+		if (error) {
+			log_error(ls, "recover_locks_wait failed %d", error);
+			goto fail;
+		}
 	}
 
 	dlm_release_root_list(ls);
