aacraid issues in linux-4.14.56-215 #850

Closed
thomas opened this issue Jul 25, 2018 · 8 comments · Fixed by #909
Comments

@thomas
Collaborator

thomas commented Jul 25, 2018

With linux-4.14.56-215 (see #847) the HBA-1000 doesn't work in all machines (turing is ok; pitti, deadbird and kronos are not). Besides complete hangs, the failure mostly shows up as the following message:

2018-07-25T12:43:51.339178+02:00 pitti kernel: [  188.664158] aacraid: aac_fib_send: first asynchronous command timed out.
2018-07-25T12:43:51.339188+02:00 pitti kernel: [  188.664158] Usually a result of a PCI interrupt routing problem;
2018-07-25T12:43:51.339191+02:00 pitti kernel: [  188.664158] update mother board BIOS or consider utilizing one of
2018-07-25T12:43:51.339199+02:00 pitti kernel: [  188.664158] the SAFE mode kernel options (acpi, apic etc)

Since there are no obvious changes to the driver code, it must be something more obscure.
Test builds gave the following results

In addition, this version is being removed from the distmaster. Hosts kronos and pitti stay in 'nodist' limbo until 4.14.55 is ready to be disted.

pmenzel added a commit that referenced this issue Sep 11, 2018
Since 4.14.56 there is a regression with the aacraid driver, so this was
never installed on the distmaster. Therefore, remove the files too.

Fixes: #850
@pmenzel pmenzel reopened this Oct 1, 2018
@pmenzel
Collaborator

pmenzel commented Oct 1, 2018

The issue is still present in the upstream 4.14 series, so keep this open until it’s fixed there.

With Linux 4.14.73, after removing PCI_IRQ_AFFINITY from this call in drivers/scsi/aacraid/comminit.c

        if (msi_count > 1 &&
            pci_find_capability(dev->pdev, PCI_CAP_ID_MSIX)) {
                min_msix = 2;
                i = pci_alloc_irq_vectors(dev->pdev,
                                          min_msix, msi_count,
                                          PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
                if (i > 0) {
                        dev->msi_enabled = 1;
                        msi_count = i;
                } else {
                        dev->msi_enabled = 0;
                        dev_err(&dev->pdev->dev,
                        "MSIX not supported!! Will try INTX 0x%x.\n", i);
                }
        }

the driver seems to initialize correctly, and C8014 and M8002 are detected.

$ dmesg | grep aacraid
[   11.806599] Adaptec aacraid driver 1.2.1[50834]-custom
[   11.812561] aacraid 0000:04:00.0: can't disable ASPM; OS doesn't have ASPM control
[   11.824283] aacraid: Comm Interface type3 enabled
[   11.923654] aacraid 0000:04:00.0: 64 Bit DAC enabled
[   11.931387] scsi host6: aacraid
$ more /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid6] [raid5] [raid4] [multipath] 
md1 : active raid6 sdb[0] sdq[15] sdp[14] sdo[13] sdn[12] sdm[11] sdl[10] sdk[9] sdj[8] sdi[7] sdh[6] sdg[5] sdf[4] sde[3] sdd[2] sdc[1]
      109394532352 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

md0 : active raid6 sdu[0] sdaf[15] sdae[14] sdab[13] sdx[12] sdt[11] sdaa[10] sdw[9] sdag[8] sdac[7] sdy[6] sds[5] sdah[4] sdad[3] sdz[2] sdv[1]
      109394532352 blocks super 1.2 level 6, 512k chunk, algorithm 2 [16/16] [UUUUUUUUUUUUUUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

unused devices: <none>

@donald
Collaborator

donald commented Oct 5, 2018

lkml thread for reference: https://lkml.org/lkml/2018/8/10/324

@donald
Collaborator

donald commented Oct 5, 2018

Should we try to disable CONFIG_HOTPLUG_CPU until the driver is fixed? I don't think we need it anywhere. Ming Lei's explanation in https://lkml.org/lkml/2018/8/13/28 sounds solid.

On the other hand:

ssh $HOST cat /sys/devices/system/cpu/online /sys/devices/system/cpu/present /sys/devices/system/cpu/possible:

host      online  present  possible
deadbird  0-23    0-23     0-31
kronos    0-127   0-127    0-127
pitti     0-31    0-31     0-127
turing    0-15    0-15     0-15

From this output I don't understand how kronos could be affected.

Addendum: CONFIG_HOTPLUG_CPU is autoselected by SUSPEND=Y (default=Y) and SMP=Y, so we'd need to disable SUSPEND.
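
The gap between possible and online CPUs in the table above can be computed from the sysfs list strings. The following is only a minimal sketch in C: the function names cpulist_count and cpus_possible_not_online are made up for illustration and are not part of any kernel or library API, and the parser only handles the list syntax seen in the output above (comma-separated numbers and ranges).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Count the CPUs named by a sysfs-style list such as "0-23" or "0-3,8-11".
 * A trailing newline (as read from sysfs) is tolerated.
 * Returns -1 on malformed input. (Hypothetical helper, for illustration.)
 */
int cpulist_count(const char *list)
{
    const char *p = list;
    int total = 0;

    assert(list != NULL);

    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        long hi = lo;

        if (end == p)
            return -1;              /* no number where one was expected */
        if (*end == '-') {          /* range "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p)
                return -1;
        }
        if (lo < 0 || hi < lo)
            return -1;
        total += (int)(hi - lo + 1);

        if (*end == ',')
            end++;                  /* next element follows */
        else if (*end != '\0' && *end != '\n')
            return -1;              /* unexpected character */
        p = end;
    }
    return total;
}

/*
 * Number of CPU slots that are possible but not online.
 * Assumes the online set is a subset of the possible set.
 */
int cpus_possible_not_online(const char *possible, const char *online)
{
    int np = cpulist_count(possible);
    int no = cpulist_count(online);

    if (np < 0 || no < 0)
        return -1;
    return np - no;
}
```

With the values from the table, cpus_possible_not_online("0-31", "0-23") gives 8 for deadbird and cpus_possible_not_online("0-127", "0-31") gives 96 for pitti, while kronos and turing both give 0.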

@pmenzel
Collaborator

pmenzel commented Oct 6, 2018 via email

@donald
Collaborator

donald commented Oct 6, 2018

The problem is discussed in this thread, too: https://lkml.org/lkml/2018/3/8/810 (links to https://lore.kernel.org/lkml/1519311270.2535.53.camel@intel.com/T/#u )

@donald
Collaborator

donald commented Oct 6, 2018

We don't suspend or hibernate. I've never seen a shut down cpu. Does Power Management do anything useful for us?

Looking at the code, I agree that unsetting PCI_IRQ_AFFINITY in drivers/scsi/aacraid/comminit.c (suggested in #932) might be a valid workaround, too.

But all in all, I'd vote to just keep reverting "9a0ef98e186d genirq/affinity: Assign vectors to all present CPUs" and, if needed, all later commits on "kernel/irq/affinity.c". This commit changes the semantics of pci_alloc_irq_vectors for its user (the device driver): instead of mapping the interrupts to online CPUs, it maps them to possible CPUs (even offline and not-yet-plugged-in ones). If an MSI interrupt is mapped to offline CPUs only, it will not be served at all. This is not a problem if the driver is updated to select a queue (interrupt) assigned to the CPU that issues the I/O request (which is not only the best one to receive the reply NUMA- and cache-wise, but obviously is online, too). This is what "scsi: hpsa: fix selection of reply queue" and "scsi: megaraid_sas: fix selection of reply queue" did, but it has not yet been done for aacraid.
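
To illustrate how a vector can end up with offline CPUs only, here is a toy model in C. It is a deliberately simplified sketch and not the kernel's algorithm: the real spreading code in kernel/irq/affinity.c is NUMA-aware, while this model just splits CPUs 0..n_spread-1 into contiguous groups and assumes the online CPUs are 0..n_online-1. The function names and the vector count of 8 used below are assumptions made for the example.

```c
#include <assert.h>

/*
 * Toy model: CPUs 0..n_spread-1 are divided into nvec contiguous groups,
 * one group per interrupt vector. Returns the vector a CPU belongs to.
 * (Hypothetical helper; not the kernel's spreading logic.)
 */
int vector_for_cpu(int cpu, int nvec, int n_spread)
{
    assert(nvec > 0 && n_spread > 0);
    assert(cpu >= 0 && cpu < n_spread);
    return (int)((long)cpu * nvec / n_spread);
}

/*
 * Count vectors whose CPU group contains no online CPU.
 * Online CPUs are assumed to be 0..n_online-1.
 * An interrupt raised on such a vector would never be served.
 */
int count_unserved_vectors(int nvec, int n_spread, int n_online)
{
    int unserved = 0;

    for (int v = 0; v < nvec; v++) {
        int has_online = 0;

        for (int cpu = 0; cpu < n_online && cpu < n_spread; cpu++) {
            if (vector_for_cpu(cpu, nvec, n_spread) == v) {
                has_online = 1;
                break;
            }
        }
        if (!has_online)
            unserved++;
    }
    return unserved;
}
```

In this model, spreading 8 vectors over 128 possible CPUs with only 32 online (the pitti numbers) leaves count_unserved_vectors(8, 128, 32) == 6 vectors without any online CPU, whereas spreading over the 32 online CPUs only gives 0. A driver that picks the reply queue via vector_for_cpu() of the submitting CPU always lands on a vector with at least one online CPU, which is the idea behind the hpsa and megaraid_sas fixes mentioned above.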

@thomas
Collaborator Author

thomas commented Oct 6, 2018 via email

@wwwutz
Collaborator

wwwutz commented Apr 12, 2022

fixed by hardware removal.

@wwwutz wwwutz closed this as completed Apr 12, 2022