x86-fpu-2021-11-01

tagged this 01 Nov 00:35
 - Clean up the extable fixup handling to be more robust, which in turn
   allows making the FPU exception fixups more robust as well.

 - Change the return code for signal frame related failures from explicit
   error codes to a boolean fail/success, as that's all the calling code
   evaluates.

 - A large refactoring of the FPU code to prepare for adding AMX support:

   - Disentangle the public header maze and, in particular, remove the
     misnamed kitchen sink internal.h which is, despite its name, included
     all over the place.

   - Add a proper abstraction for the register buffer storage (struct
     fpstate) which allows dynamically sizing the buffer at runtime by
     flipping the pointer to the buffer container from the default
     container which is embedded in task_struct::thread::fpu to a
     dynamically allocated container with a larger register buffer.

   - Convert the code over to the new fpstate mechanism.

   - Consolidate the KVM FPU handling by moving the FPU related code into
     the FPU core, which reduces the number of exports and avoids adding
     even more exports when AMX has to be supported in KVM. This also
     removes duplicated code which was, of course, unnecessarily different
     and incomplete in the KVM copy.

   - Simplify the KVM FPU buffer handling by utilizing the new fpstate
     container and just switching the buffer pointer from the user space
     buffer to the KVM guest buffer when entering vcpu_run() and flipping
     it back when leaving the function. This cuts the memory requirements
     of a vCPU for FPU buffers in half and avoids pointless memory copy
     operations.

     This also solves the so far unresolved problem of adding AMX support
     because the current FPU buffer handling of KVM inflicted a circular
     dependency between adding AMX support to the core and to KVM.  With
     the new scheme of switching fpstate, AMX support can be added to the
     core code without affecting KVM.

   - Replace various variables with proper data structures so the extra
     information required for adding dynamically enabled FPU features (AMX)
     can be added in one place.
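
The fpstate pointer flip described above can be modelled in plain user
space C. This is only a minimal sketch under stated assumptions: the field
names, DEFAULT_REGS_SIZE and the helpers fpu_init(), fpstate_alloc() and
fpu_swap_fpstate() are made up for illustration and are not the kernel's
actual definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative model only; names and layout are assumptions. */
#define DEFAULT_REGS_SIZE 2048  /* arbitrary size of the embedded buffer */

struct fpstate {
    size_t size;            /* usable size of the register buffer */
    unsigned char *regs;    /* start of the register buffer */
};

struct fpu {
    struct fpstate *fpstate;    /* points at the active container */
    struct fpstate dflt;        /* default container, embedded */
    unsigned char dflt_regs[DEFAULT_REGS_SIZE];
};

/* Start out with the pointer aimed at the embedded default container. */
static void fpu_init(struct fpu *fpu)
{
    fpu->dflt.size = DEFAULT_REGS_SIZE;
    fpu->dflt.regs = fpu->dflt_regs;
    fpu->fpstate = &fpu->dflt;
}

/* Allocate a container with a register buffer of the requested size. */
static struct fpstate *fpstate_alloc(size_t size)
{
    struct fpstate *fps = calloc(1, sizeof(*fps) + size);

    if (!fps)
        return NULL;
    fps->size = size;
    fps->regs = (unsigned char *)(fps + 1);  /* buffer follows the header */
    return fps;
}

/* Flip the active pointer and hand back the previous container. */
static struct fpstate *fpu_swap_fpstate(struct fpu *fpu, struct fpstate *next)
{
    struct fpstate *prev = fpu->fpstate;

    assert(next != NULL);
    fpu->fpstate = next;
    return prev;
}
```

In this model the same swap covers both cases from the list: switching to a
larger dynamically allocated container, and switching to a guest container
on entry and back to the saved one on exit.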

 - Add AMX (Advanced Matrix eXtensions) support (finally):

    AMX is a large XSTATE component which is going to be available with
    Sapphire Rapids Xeon CPUs. The feature comes with an extra MSR (MSR_XFD)
    which allows trapping the (first) use of an AMX related instruction,
    which has two benefits:

    1) It allows the kernel to control access to the feature

    2) It allows the kernel to dynamically allocate the large register
       state buffer instead of burdening every task with the extra 8K
       or larger state storage.

    It would have been great to gain this kind of control already with
    AVX512.

    The support comes with the following infrastructure components:

    1) arch_prctl() to
       - read the supported features (equivalent to XGETBV(0))
       - read the permitted features for a task
       - request permission for a dynamically enabled feature

       Permission is granted per process, inherited on fork() and cleared
       on exec(). The permission policy of the kernel is restricted to
       sigaltstack size validation, but the syscall obviously allows
       further restrictions via seccomp etc.

    2) A stronger sigaltstack size validation for sys_sigaltstack(2) which
       takes granted permissions and the potentially resulting larger
       signal frame into account. This mechanism can also be used to
       enforce factual sigaltstack validation independent of dynamic
       features to help with finding potential victims of the 2K
       sigaltstack size constant which has been broken since AVX512 support
       was added.

    3) Exception handling for #NM traps to catch first use of an extended
       feature via a new cause MSR. If the exception was caused by the use
       of such a feature, the handler checks permission for that
       feature. If permission has not been granted, the handler sends a
       SIGILL like the #UD handler would do if the feature had been
       disabled in XCR0. If permission has been granted, then a new fpstate
       which fits the larger buffer requirement is allocated.

       In the unlikely case that this allocation fails, the handler sends
       SIGSEGV to the task. That's not elegant, but unavoidable as the
       other discussed options of preallocation or full per task
       permissions come with their own set of horrors for kernel and/or
       userspace. So this is the lesser of the evils and SIGSEGV caused by
       unexpected memory allocation failures is not a fundamentally new
       concept either.

       When allocation succeeds, the fpstate properties are filled in to
       reflect the extended feature set and the resulting sizes, the
       fpu::fpstate pointer is updated accordingly and the trap is disarmed
       for this task permanently.

    4) Enumeration and size calculations

    5) Trap switching via MSR_XFD

       The XFD (eXtended Feature Disable) MSR is context switched with the
       same life time rules as the FPU register state itself. The mechanism
       is keyed off a static key which is disabled by default, so !AMX
       equipped CPUs have zero overhead. On AMX enabled CPUs the overhead
       is limited by comparing the task's XFD value with a per CPU shadow
       variable to avoid redundant MSR writes. In case of switching from an
       AMX using task to a non AMX using task or vice versa, the extra MSR
       write is obviously inevitable.

       All other places which need to be aware of the variable feature sets
       and resulting variable sizes are not affected at all because they
       retrieve the information (feature set, sizes) unconditionally from
       the fpstate properties.

    6) Enable the new AMX states
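
The first-use trap handling (item 3) and the shadow-based XFD switching
(item 5) can be modelled in user space roughly as follows. This is a sketch
under assumptions, not kernel code: struct task, handle_first_use(),
xfd_switch_to() and the counters are hypothetical names, the MSR write is
only simulated by a counter, and the signals are represented by return
values.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model; bit 18 stands for the AMX tile data component. */
#define FEATURE_AMX (1ULL << 18)

enum nm_result { NM_OK, NM_SIGILL, NM_SIGSEGV };

struct task {
    uint64_t perm;      /* features the process has permission for */
    uint64_t xfd;       /* features still armed to trap on first use */
    void *big_buf;      /* larger register buffer, allocated on first use */
};

static uint64_t xfd_shadow;     /* models the per CPU shadow of MSR_XFD */
static unsigned int msr_writes; /* counts simulated MSR_XFD writes */

/* Context switch: write the (simulated) MSR only if the value changes. */
static void xfd_switch_to(const struct task *next)
{
    if (xfd_shadow != next->xfd) {
        xfd_shadow = next->xfd;
        msr_writes++;
    }
}

/* First use of a trapped feature by the currently running task. */
static enum nm_result handle_first_use(struct task *t, uint64_t feature,
                                       size_t size)
{
    if (!(t->perm & feature))
        return NM_SIGILL;       /* no permission granted */

    t->big_buf = calloc(1, size);
    if (!t->big_buf)
        return NM_SIGSEGV;      /* allocation failed */

    t->xfd &= ~feature;         /* disarm the trap for this task */
    xfd_switch_to(t);           /* make it effective on this CPU */
    return NM_OK;
}
```

In this model, switching between two tasks which both still have the trap
armed costs no simulated MSR write, while switching between a task that has
used the feature and one that has not costs exactly one.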

  Note, this is relatively new code despite the fact that AMX support has
  been in the works for more than a year now.

  The big refactoring of the FPU code, which allowed a proper
  integration, was started exactly 3 weeks ago. Refactoring of the
  existing FPU code and of the original AMX patches took a week and has
  been subject to extensive review and testing. The only fallout which has
  not been caught in review and testing right away was restricted to AMX
  enabled systems, which is completely irrelevant for anyone outside Intel
  and their early access program. There might be dragons lurking as usual,
  but so far the fine grained refactoring has held up and any as yet
  undetected fallout is bisectable and should be easily addressable before
  the 5.16 release. Famous last words...

  Many thanks to Chang Bae and Dave Hansen for working hard on this and
  also to the various test teams at Intel who reserved extra capacity to
  follow the rapid development of this closely which provides the
  confidence level required to offer this rather large update for inclusion
  into 5.16-rc1.