
irq-core-2022-12-10

 - Core:

   The bulk of the changes is a rework of the MSI subsystem to support
   per-device MSI interrupt domains. This solves conceptual problems of the
   current PCI/MSI design which stand in the way of supporting PCI/MSI[-X]
   and the upcoming PCI/IMS mechanism on the same device.

   IMS (Interrupt Message Store) is a new specification which allows device
   manufacturers to provide implementation-defined storage for MSI messages,
   in contrast to the uniform, specification-defined storage mechanisms of
   PCI/MSI and PCI/MSI-X. IMS not only makes it possible to overcome the size
   limitations of the MSI-X table, but also gives the device manufacturer the
   freedom to store the message in arbitrary places, even in host memory
   which is shared with the device.

   There have been several attempts to glue this into the current MSI code,
   but after lengthy discussions it turned out that there is a fundamental
   design problem in the current PCI/MSI-X implementation. This needs some
   historical background.

   When PCI/MSI[-X] support was added around 2003, interrupt management was
   completely different from what we have today in the actively developed
   architectures. Interrupt management was completely architecture specific,
   and while there were attempts to create common infrastructure, the
   commonalities were rudimentary and limited to shared data structures and
   interfaces so that drivers could be written in an architecture-agnostic
   way.

   The initial PCI/MSI[-X] support obviously plugged into this model, which
   resulted in some basic shared infrastructure in the PCI core code for
   setting up MSI descriptors (a pure software construct holding the data
   relevant for a particular MSI interrupt), but the actual association with
   Linux interrupts was completely architecture specific. This model is still
   supported today to keep museum architectures and notorious stragglers
   alive.

   In 2013 Intel tried to add support for hot-pluggable IO/APICs to the kernel,
   which created yet another architecture-specific mechanism and resulted in
   an unholy mess on top of the existing horrors of x86 interrupt handling.
   The x86 interrupt management code was already an incomprehensible maze of
   indirections between the CPU vector management, interrupt remapping and the
   actual IO/APIC and PCI/MSI[-X] implementation.

   At roughly the same time ARM struggled with the ever growing SoC specific
   extensions which were glued on top of the architected GIC interrupt
   controller.

   This resulted in a fundamental redesign of interrupt management and
   produced the concept of hierarchical interrupt domains which prevails
   today. This made it possible to disentangle the interactions between the
   x86 vector domain and interrupt remapping, and also allowed ARM to handle
   the zoo of SoC specific interrupt components in a sane way.

   The concept of hierarchical interrupt domains aims to encapsulate the
   functionality of the particular IP blocks involved in interrupt delivery
   so that they become extensible and pluggable. The x86 encapsulation looks
   like this:

                                         |--- device 1
     [Vector]---[Remapping]---[PCI/MSI]--|...
                                         |--- device N

   where the remapping domain is an optional component; when it is not
   available, the PCI/MSI[-X] domains have the vector domain as their
   parent. This reduced the required interaction between the domains
   essentially to the initialization phase, where it is obviously required
   to establish the proper parent relationships between the components of
   the hierarchy.
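
   To make the parent/child relationship concrete, here is a minimal
   userspace sketch of such a stack. It is a toy model, not kernel code:
   struct toy_domain, toy_build() and toy_alloc() are hypothetical names.
   The chain is wired up once at init time with the remapping level
   optional, and an allocation in the PCI/MSI level first recurses into
   its parent, loosely mirroring how a child domain lets its parent
   allocate before completing its own setup.

   ```c
   #include <assert.h>
   #include <stddef.h>

   /* Hypothetical toy model; not the kernel's struct irq_domain. */
   struct toy_domain {
       const char *name;
       struct toy_domain *parent;  /* NULL for the root (vector) domain */
       int allocated;              /* interrupts allocated at this level */
   };

   /* Establish the parent relationships once; @remap may be NULL. */
   static void toy_build(struct toy_domain *vector, struct toy_domain *remap,
                         struct toy_domain *msi)
   {
       assert(vector != NULL && msi != NULL);
       vector->parent = NULL;
       if (remap) {
           remap->parent = vector;
           msi->parent = remap;
       } else {
           msi->parent = vector;   /* no remapping: vector is the parent */
       }
   }

   /*
    * Allocate one interrupt starting at @d. Each level lets its parent
    * allocate first, then accounts for itself. Returns the number of
    * levels involved.
    */
   static int toy_alloc(struct toy_domain *d)
   {
       int depth = 0;

       if (d->parent)
           depth = toy_alloc(d->parent);
       d->allocated++;
       return depth + 1;
   }
   ```

   With the remapping level present, one allocation touches three levels;
   without it, the PCI/MSI level talks directly to the vector level and
   only two are involved.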

   While in most cases the model strictly represents the chain of IP blocks
   and abstracts them so they can be plugged together to form a hierarchy,
   the design stopped short on PCI/MSI[-X]. Looking at the hardware, it's
   clear that the actual PCI/MSI[-X] interrupt controller is not a global
   entity, but strictly a per-PCI-device entity.

   Here we took a shortcut on the hierarchical model and went for the easy
   solution of providing "global" PCI/MSI domains, which was possible because
   PCI/MSI[-X] handling is uniform across devices. This also made it
   possible to keep the existing PCI/MSI[-X] infrastructure mostly unchanged,
   which in turn made it simple to keep the existing architecture specific
   management alive.

   A similar problem was created in the ARM world with support for IP block
   specific message storage. Instead of going all the way and stacking an IP
   block specific domain on top of the generic MSI domain, this ended in a
   construct which provides a "global" platform MSI domain that allows
   overriding the irq_write_msi_msg() callback per allocation.

   In the course of the lengthy discussions we identified other abuses of the
   MSI infrastructure in wireless drivers, NTB etc., where support for
   implementation specific message storage was just mindlessly glued into the
   existing infrastructure. Some of this just works by chance on particular
   platforms, but will fail in hard-to-diagnose ways when the driver is used
   on platforms where the underlying MSI interrupt management code does not
   expect the creative abuse.

   Another shortcoming of today's PCI/MSI-X support is the inability to
   allocate or free individual vectors after MSI-X has been initially
   enabled. This results in a works-by-chance implementation in VFIO (PCI
   pass-through), where interrupts on the host side are not set up upfront,
   to avoid resource exhaustion. Instead, they are expanded at run-time when
   the guest actually tries to use them. This is implemented by having the
   host disable MSI-X and then re-enable it with a larger number of vectors.
   That works by chance because most device drivers set up all interrupts
   before the device actually uses them. But that's not universally true:
   some drivers allocate a sufficiently large number of vectors but do not
   use them until they are actually required, e.g. for acceleration support.
   At that point other interrupts of the device might be in active use, and
   the MSI-X disable/enable dance can result in lost interrupts and therefore
   in subtle, hard-to-diagnose problems.
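
   The failure mode can be illustrated with a deliberately simplified
   model. This is a hypothetical userspace sketch, not VFIO or PCI core
   code: it assumes that an interrupt raised while MSI-X is disabled is
   simply not delivered, and compares growing the vector count through a
   disable/re-enable cycle with growing it while MSI-X stays enabled. The
   names struct toy_msix_dev, toy_raise(), toy_grow_by_reenable() and
   toy_grow_in_place() are made up for illustration.

   ```c
   #include <assert.h>
   #include <stdbool.h>

   #define TOY_MAX_VECTORS 8

   /* Hypothetical toy model of one MSI-X capable device; not kernel code. */
   struct toy_msix_dev {
       bool enabled;                    /* MSI-X enable bit */
       int nvec;                        /* vectors currently set up */
       int delivered[TOY_MAX_VECTORS];  /* interrupts seen by handlers */
       int lost;                        /* interrupts raised while disabled */
   };

   /* The device raises vector @v; assumed delivered only while enabled. */
   static void toy_raise(struct toy_msix_dev *d, int v)
   {
       assert(v >= 0 && v < TOY_MAX_VECTORS);
       if (d->enabled && v < d->nvec)
           d->delivered[v]++;
       else
           d->lost++;
   }

   /*
    * Old approach: disable MSI-X, resize, re-enable. @raise_during
    * (or -1 for none) simulates an active vector firing in the window.
    */
   static void toy_grow_by_reenable(struct toy_msix_dev *d, int nvec,
                                    int raise_during)
   {
       assert(nvec <= TOY_MAX_VECTORS);
       d->enabled = false;
       if (raise_during >= 0)
           toy_raise(d, raise_during);  /* hits the disabled window */
       d->nvec = nvec;
       d->enabled = true;
   }

   /* Post-enable allocation: MSI-X stays enabled, one vector is added. */
   static void toy_grow_in_place(struct toy_msix_dev *d, int raise_during)
   {
       assert(d->nvec < TOY_MAX_VECTORS);
       if (raise_during >= 0)
           toy_raise(d, raise_during);  /* still delivered */
       d->nvec++;
   }
   ```

   In this model the same interrupt on vector 0 is counted as lost in the
   first case and delivered in the second; only the ordering relative to
   the enable bit differs.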

   Last but not least, the "global" PCI/MSI-X domain approach prevents using
   PCI/MSI[-X] and PCI/IMS on the same device, because IMS no longer provides
   a uniform storage and configuration model.

   The solution to this is to implement the missing step and switch from
   global PCI/MSI domains to per device PCI/MSI domains. The resulting
   hierarchy then looks like this:

                              |--- [PCI/MSI] device 1
     [Vector]---[Remapping]---|...
                              |--- [PCI/MSI] device N

   which in turn makes it possible to support multiple domains per device:

                              |--- [PCI/MSI] device 1
                              |--- [PCI/IMS] device 1
     [Vector]---[Remapping]---|...
                              |--- [PCI/MSI] device N
                              |--- [PCI/IMS] device N
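
   One way to picture the per-device model is a device that owns two
   domains, each with its own message write callback: one for the uniform
   MSI-X table and one for implementation-defined IMS storage, both below
   the same parent. The sketch below is a hypothetical userspace model; the
   struct and function names are illustrative, are not kernel interfaces,
   and the IMS storage is reduced to a plain array.

   ```c
   #include <assert.h>
   #include <stdint.h>

   #define TOY_SLOTS 4

   /* Hypothetical toy types; none of these are kernel interfaces. */
   struct toy_msg {
       uint64_t addr;
       uint32_t data;
   };

   struct toy_device;

   struct toy_dev_domain {
       const char *parent;          /* shared parent, e.g. remapping */
       struct toy_device *dev;      /* owner: the domain is per device */
       /* How and where the message is stored is up to the domain. */
       void (*write_msg)(struct toy_dev_domain *dom, int index,
                         const struct toy_msg *msg);
   };

   struct toy_device {
       struct toy_msg msix_table[TOY_SLOTS]; /* uniform, spec-defined */
       struct toy_msg ims_store[TOY_SLOTS];  /* stand-in for device storage */
       struct toy_dev_domain msi;            /* PCI/MSI-X domain */
       struct toy_dev_domain ims;            /* PCI/IMS domain, same device */
   };

   static void toy_msix_write(struct toy_dev_domain *dom, int index,
                              const struct toy_msg *msg)
   {
       assert(index >= 0 && index < TOY_SLOTS);
       dom->dev->msix_table[index] = *msg;
   }

   static void toy_ims_write(struct toy_dev_domain *dom, int index,
                             const struct toy_msg *msg)
   {
       assert(index >= 0 && index < TOY_SLOTS);
       dom->dev->ims_store[index] = *msg;
   }

   /* Create both per-device domains below the same @parent. */
   static void toy_device_init(struct toy_device *dev, const char *parent)
   {
       dev->msi = (struct toy_dev_domain){ parent, dev, toy_msix_write };
       dev->ims = (struct toy_dev_domain){ parent, dev, toy_ims_write };
   }
   ```

   Because the storage detail lives in each domain's callback, neither
   domain needs to know about the other, and a second device gets its own
   independent pair.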

   This work converts the MSI and PCI/MSI core and the x86 interrupt
   domains to the new model, provides new interfaces for post-enable
   allocation/free of MSI-X interrupts and the base framework for PCI/IMS.
   PCI/IMS has been verified with the work in progress IDXD driver.
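
   Conceptually, post-enable allocation and free can be thought of as
   managing individual slots of a per-device table while MSI-X stays
   enabled, either at a caller-chosen index or at any free one. The
   following hypothetical userspace sketch models only that bookkeeping;
   the names, including the TOY_ANY_INDEX and TOY_NO_SLOT constants, are
   illustrative and are not the new kernel interfaces.

   ```c
   #include <assert.h>
   #include <stdbool.h>

   #define TOY_TABLE_SIZE 8
   #define TOY_ANY_INDEX  (-1)  /* hypothetical: pick any free slot */
   #define TOY_NO_SLOT    (-2)  /* hypothetical: allocation failed */

   /* Hypothetical per-device MSI-X slot state; not a kernel structure. */
   struct toy_msix_table {
       bool used[TOY_TABLE_SIZE];
   };

   /*
    * Allocate one vector while MSI-X stays enabled. Returns the table
    * index, or TOY_NO_SLOT if the requested slot is taken or the table
    * is full.
    */
   static int toy_msix_alloc_at(struct toy_msix_table *t, int index)
   {
       if (index == TOY_ANY_INDEX) {
           for (int i = 0; i < TOY_TABLE_SIZE; i++) {
               if (!t->used[i]) {
                   t->used[i] = true;
                   return i;
               }
           }
           return TOY_NO_SLOT;
       }
       if (index < 0 || index >= TOY_TABLE_SIZE || t->used[index])
           return TOY_NO_SLOT;
       t->used[index] = true;
       return index;
   }

   /* Free one vector without touching any other slot. */
   static void toy_msix_free(struct toy_msix_table *t, int index)
   {
       assert(index >= 0 && index < TOY_TABLE_SIZE && t->used[index]);
       t->used[index] = false;
   }
   ```

   Freeing slot 0 here leaves the other allocated slots untouched, which
   is exactly the property the disable/re-enable approach cannot offer.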

   There is work in progress to convert ARM over, which will replace the
   platform MSI train-wreck. The cleanup of VFIO, NTB and other creative
   "solutions" is in the works as well.

 - Drivers:

   - Updates for the LoongArch interrupt chip drivers

   - Support for MTK CIRQv2

   - The usual small fixes and updates all over the place