
Add Linux 5.15.88-444 or later fixing Broadcom/LSI logging and other things #2829

Merged
donald merged 5 commits into master on Feb 20, 2023

Conversation

pmenzel
Collaborator

@pmenzel pmenzel commented Jan 16, 2023

Tested on maleficent.

@pmenzel
Collaborator Author

pmenzel commented Jan 16, 2023

The commit message needs to be added, and it needs to be tested on the affected file servers.

  • epizootie
  • grele
  • plage
  • rabies
  • sauterelles
  • tenebres

They all have 12 TB drives. The file server lyssa with 8 TB drives is not affected.

The test file server wayofthedodo has 12 TB drives, but does not exhibit the problem – probably due to missing data and therefore not enough load.

@david
Collaborator

david commented Jan 16, 2023

The system with 8TB drives is 'lyssa'

@donald
Collaborator

donald commented Jan 17, 2023

Switch back to SRCURL (and fix the version there)?

You've sneaked in ("firmware: coreboot: Check size of table entry and use flex-array"). Was this on purpose?

@thomas
Collaborator

thomas commented Jan 17, 2023

I think there are some open questions about the nature of the issue.
a) Is it only a logging issue?
b) Why wasn't it observed before?
c) Why does e.g. 'grele' show the issue, and not 'wayofthedodo'?
d) Is the message (that was simply removed) really denoting an error condition (the word 'failed' would make me think so), was it supposed to be a warning, or a debug message?

@donald
Collaborator

donald commented Jan 17, 2023

Good questions. I doubt we are competent enough to answer them.

The warnings are removed from _base_check_pcie_native_sgl, _base_build_sg and _base_build_sg_scmd.

_base_check_pcie_native_sgl is only used internally by _base_build_sg.

All call sites of _base_build_sg and the single call site of _base_build_sg_scmd (which go via ioc->build_sg and ioc->build_sg_scmd) seem to ignore the return value of the functions. This doesn't give me confidence that the problem is handled by the caller, but I don't understand all this.

Should we change that to WARN_ONCE, so that we still get an indication of which systems this happens on without filling the disk with repeated log entries? If I/O can be lost by this condition, it might be good to have a reminder in the logfile when we stumble on a frozen system later.
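
The "report once" idea can be illustrated with a minimal userspace sketch. This is not the kernel's WARN_ONCE macro and not code from mpt3sas; `struct once_flag` and `report_map_failure` are hypothetical names made up for illustration, and only the message text is taken from this pull request.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-call-site state: remembers whether we already warned. */
struct once_flag {
    bool fired;
};

/* Returns true only on the first call for a given flag. */
static bool should_warn_once(struct once_flag *flag)
{
    if (flag->fired)
        return false;
    flag->fired = true;
    return true;
}

/*
 * Report a failed mapping at most once per flag.
 * Returns the number of log lines actually written (0 or 1).
 */
static int report_map_failure(struct once_flag *flag, int bytes)
{
    if (!should_warn_once(flag))
        return 0;
    fprintf(stderr, "scsi_dma_map failed: request for %d bytes\n", bytes);
    return 1;
}
```

With one flag per call site, the first failure leaves a reminder in the log and every later failure at that site stays silent, which is the trade-off asked about above.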

Btw: Today, I/O on "gone" froze for a second time since we have a 5.15 kernel. Again it started a few minutes after mdcheck was paused in the morning. This time it is the other md device. It blocks and shows I/Os in flight. xxd to one of the member disks also blocks.

@pmenzel
Collaborator Author

pmenzel commented Jan 17, 2023

Christoph Hellwig explicitly asked to drop the message:

I'd remove the message entirely.

@pmenzel
Collaborator Author

pmenzel commented Jan 17, 2023

Otherwise, the referenced patch by John Pittman rate-limits the message.
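
For comparison, rate limiting can be sketched in userspace as a burst counter per time window. This is an illustration of the general technique only, not the referenced patch and not the kernel's ratelimit implementation; `struct ratelimit` and `ratelimit_allow` are hypothetical names, and the caller passes the time in explicitly so the behaviour is deterministic.

```c
#include <stdbool.h>

/* Hypothetical limiter state: at most `burst` messages per `interval`. */
struct ratelimit {
    long interval;      /* window length, in seconds */
    int burst;          /* messages allowed per window */
    long window_start;  /* start time of the current window */
    int printed;        /* messages let through in this window */
    int suppressed;     /* messages dropped so far, in total */
};

/* Returns true if a message may be logged at time `now` (seconds). */
static bool ratelimit_allow(struct ratelimit *rl, long now)
{
    /* Start a fresh window once the previous one has expired. */
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;
        rl->printed = 0;
    }

    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }

    rl->suppressed++;
    return false;
}
```

Unlike a once-only warning, this keeps showing that the condition still occurs, while bounding how much log volume it can produce.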

@donald
Collaborator

donald commented Jan 17, 2023

I must confess that Christoph Hellwig might have a little more insight...

And "double completion" doesn't sound that alarming.

So leave it as it is without any messages. We can just try it out.

@thomas
Collaborator

thomas commented Jan 17, 2023

Concerning the scsi_dma_map call found in almost every SCSI driver, the snippets speak for themselves:

./scsi/ipr.c:   nseg = scsi_dma_map(scsi_cmd);
./scsi/ipr.c-   if (nseg < 0) {
./scsi/ipr.c-           if (printk_ratelimit())
./scsi/ipr.c:                   dev_err(&ioa_cfg->pdev->dev, "scsi_dma_map failed!\n");
./scsi/ipr.c-           return -1;
./scsi/ipr.c-   }

./scsi/dc395x.c:        nseg = scsi_dma_map(cmd);
./scsi/dc395x.c-        BUG_ON(nseg < 0);

./scsi/qla4xxx/ql4_iocb.c-      /* Calculate the number of request entries needed. */
./scsi/qla4xxx/ql4_iocb.c:      nseg = scsi_dma_map(cmd);
./scsi/qla4xxx/ql4_iocb.c-      if (nseg < 0)
./scsi/qla4xxx/ql4_iocb.c-              goto queuing_error;
./scsi/qla4xxx/ql4_iocb.c-      tot_dsds = nseg;

./scsi/aic7xxx/aic79xx_osm.c:   nseg = scsi_dma_map(cmd);
./scsi/aic7xxx/aic79xx_osm.c-   if (nseg < 0)
./scsi/aic7xxx/aic79xx_osm.c-           return SCSI_MLQUEUE_HOST_BUSY;

As for the function:

./scsi/scsi_lib_dma.c-/**
./scsi/scsi_lib_dma.c: * scsi_dma_map - perform DMA mapping against command's sg lists
./scsi/scsi_lib_dma.c- * @cmd:  scsi command
./scsi/scsi_lib_dma.c- *
./scsi/scsi_lib_dma.c- * Returns the number of sg lists actually used, zero if the sg lists
./scsi/scsi_lib_dma.c- * is NULL, or -ENOMEM if the mapping failed.
./scsi/scsi_lib_dma.c- */
./scsi/scsi_lib_dma.c:int scsi_dma_map(struct scsi_cmnd *cmd)
./scsi/scsi_lib_dma.c-{
./scsi/scsi_lib_dma.c-  int nseg = 0;
./scsi/scsi_lib_dma.c-
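
The pattern shared by the snippets above (check for a negative return and hand a status back to the caller instead of ignoring it) can be sketched in plain userspace C. Everything here is a hypothetical stand-in: `fake_dma_map`, `queue_command` and `enum queue_status` are invented for illustration and are not kernel or mpt3sas APIs.

```c
#include <errno.h>

/* Hypothetical status codes for the sketch. */
enum queue_status {
    QUEUE_OK = 0,
    QUEUE_HOST_BUSY = 1   /* ask the upper layer to retry later */
};

/*
 * Stand-in for a mapping call: returns the number of segments mapped,
 * 0 if there is nothing to map, or -ENOMEM if the mapping failed.
 */
static int fake_dma_map(int requested, int available)
{
    if (requested == 0)
        return 0;
    if (requested > available)
        return -ENOMEM;
    return requested;
}

/* Propagate a mapping failure to the caller instead of dropping it. */
static enum queue_status queue_command(int requested, int available,
                                       int *mapped)
{
    int nseg = fake_dma_map(requested, available);

    if (nseg < 0)
        return QUEUE_HOST_BUSY;

    *mapped = nseg;
    return QUEUE_OK;
}
```

A caller that checks the status can retry or fail the command; a caller that discards it has no way to notice that nothing was mapped.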

And as for "double completion", to me it looks like this describes the error happening on the logging site?

@donald
Collaborator

donald commented Jan 17, 2023

Can we trigger the error? If so, we could try to bisect, although this might take ages.

@pmenzel
Collaborator Author

pmenzel commented Jan 17, 2023

Switch back to SRCURL (and fix the version there)?

Yes, will do. Just wanted to get the test Linux kernel built. (Nvidia drivers are also missing.)

You've sneaked in ("firmware: coreboot: Check size of table entry and use flex-array"). Was this on purpose?

Yes, it was, as we have two coreboot machines. (It got picked for the stable series already, so I picked it too.)

@donald
Collaborator

donald commented Jan 18, 2023

Hold it, I might like to add another kernel patch, for the "request-key: Cannot find command to construct key" messages in /var/log/messages with sec=mariux.

@donald donald changed the title Add Linux 5.15.88-444 fixing Broadcom/LSI logging Add Linux 5.15.88-444 or lates fixing Broadcom/LSI logging and other things Jan 19, 2023
@donald donald changed the title Add Linux 5.15.88-444 or lates fixing Broadcom/LSI logging and other things Add Linux 5.15.88-444 or later fixing Broadcom/LSI logging and other things Jan 19, 2023
@donald
Collaborator

donald commented Jan 19, 2023

And as for "double completion", to me it looks like this describes the error happening on the logging site?

Correct.

Hmm. What to do about this?

@thomas
Collaborator

thomas commented Jan 19, 2023

And as for "double completion", to me it looks like this describes the error happening on the logging site?

Correct.

Hmm. What to do about this?

Ignore it? If I see it right, it has nothing to do with the error messages generated by mpt3sas_base.c. BTW, is there already any knowledge which of the three possible locations had thrown the error?

@donald
Collaborator

donald commented Feb 20, 2023

@thomas: Are you able to test whether the rate limit of the warnings works (for linux-5.15.94-447.x86_64)?

pmenzel and others added 5 commits February 20, 2023 13:45

Add version 5.15.89 and remove version 5.15.88.

The kernel mariux-5.15.89-445.tar.gz contains another patch which should
avoid the "request-key: Cannot find command to construct key ,,;"
messages in the syslog with a sec=mariux nfs client.

Add version 5.15.94 and remove version 5.15.89.

The kernel mariux-5.15.94-447 contains patches to rate limit the
"scsi_dma_map failed: request for %d bytes" warning messages.

Build version 510.60.02 for Linux 5.15.94-447 and remove for Linux
5.15.89-445A.

@donald
Collaborator

donald commented Feb 20, 2023

We no longer have the SAS controllers to test the rate limit of the strange warning.

Basic function is tested on sigusr2 (with Nvidia) and - accidentally - on done. I'll merge this now, so that the unwanted "request-key: Cannot find command to construct key" log messages no longer appear on the NFS clients. This is fixed by mariux64/linux@5665b3517ce3a

@donald donald merged commit 531b139 into master Feb 20, 2023

4 participants