Skip to content

Commit

Permalink
nvme-fc: do not ignore connectivity loss during connecting
Browse files Browse the repository at this point in the history
When a connectivity loss occurs while nvme_fc_create_assocation is
being executed, it's possible that the ctrl ends up stuck in the LIVE
state:

  1) nvme nvme10: NVME-FC{10}: create association : ...
  2) nvme nvme10: NVME-FC{10}: controller connectivity lost.
                  Awaiting Reconnect
     nvme nvme10: queue_size 128 > ctrl maxcmd 32, reducing to maxcmd
  3) nvme nvme10: Could not set queue count (880)
     nvme nvme10: Failed to configure AEN (cfg 900)
  4) nvme nvme10: NVME-FC{10}: controller connect complete
  5) nvme nvme10: failed nvme_keep_alive_end_io error=4

A connection attempt starts 1) and the ctrl is in state CONNECTING.
Shortly after the LLDD driver detects a connection lost event and calls
nvme_fc_ctrl_connectivity_loss 2). Because we are still in CONNECTING
state, this event is ignored.

nvme_fc_create_association continues to run in parallel and tries to
communicate with the controller and these commands will fail. Though
these errors are filtered out, e.g in 3) setting the I/O queues numbers
fails which leads to an early exit in nvme_fc_create_io_queues. Because
the number of IO queues is 0 at this point, there is nothing left in
nvme_fc_create_association which could detected the connection drop.
Thus the ctrl enters LIVE state 4).

Eventually the keep alive handler times out 5) but because nothing is
being done, the ctrl stays in LIVE state.

There is already the ASSOC_FAILED flag to track connectivity loss event
but this bit is set too late in the recovery code path. Move this into
the connectivity loss event handler and synchronize it with the state
change. This ensures that the ASSOC_FAILED flag is seen by
nvme_fc_create_io_queues and it does not enter the LIVE state after a
connectivity loss event. If the connectivity loss event happens after we
entered the LIVE state the normal error recovery path is executed.

Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
  • Loading branch information
Daniel Wagner authored and Keith Busch committed Jan 23, 2025
1 parent 294b2b7 commit ee59e38
Showing 1 changed file with 18 additions and 5 deletions.
23 changes: 18 additions & 5 deletions drivers/nvme/host/fc.c
Original file line number Diff line number Diff line change
Expand Up @@ -781,11 +781,19 @@ nvme_fc_abort_lsops(struct nvme_fc_rport *rport)
static void
nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
{
enum nvme_ctrl_state state;
unsigned long flags;

dev_info(ctrl->ctrl.device,
"NVME-FC{%d}: controller connectivity lost. Awaiting "
"Reconnect", ctrl->cnum);

switch (nvme_ctrl_state(&ctrl->ctrl)) {
spin_lock_irqsave(&ctrl->lock, flags);
set_bit(ASSOC_FAILED, &ctrl->flags);
state = nvme_ctrl_state(&ctrl->ctrl);
spin_unlock_irqrestore(&ctrl->lock, flags);

switch (state) {
case NVME_CTRL_NEW:
case NVME_CTRL_LIVE:
/*
Expand Down Expand Up @@ -2542,7 +2550,6 @@ nvme_fc_error_recovery(struct nvme_fc_ctrl *ctrl, char *errmsg)
*/
if (ctrl->ctrl.state == NVME_CTRL_CONNECTING) {
__nvme_fc_abort_outstanding_ios(ctrl, true);
set_bit(ASSOC_FAILED, &ctrl->flags);
dev_warn(ctrl->ctrl.device,
"NVME-FC{%d}: transport error during (re)connect\n",
ctrl->cnum);
Expand Down Expand Up @@ -3167,12 +3174,18 @@ nvme_fc_create_association(struct nvme_fc_ctrl *ctrl)
else
ret = nvme_fc_recreate_io_queues(ctrl);
}
if (!ret && test_bit(ASSOC_FAILED, &ctrl->flags))
ret = -EIO;
if (ret)
goto out_term_aen_ops;

changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
spin_lock_irqsave(&ctrl->lock, flags);
if (!test_bit(ASSOC_FAILED, &ctrl->flags))
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
else
ret = -EIO;
spin_unlock_irqrestore(&ctrl->lock, flags);

if (ret)
goto out_term_aen_ops;

ctrl->ctrl.nr_reconnects = 0;

Expand Down

0 comments on commit ee59e38

Please sign in to comment.