- Cleanup of extable fixup handling to be more robust, which in turn
allows to make the FPU exception fixups more robust as well.
- Change the return code for signal frame related failures from explicit
error codes to a boolean fail/success as that's all what the calling
code evaluates.
- A large refactoring of the FPU code to prepare for adding AMX support:
- Distangle the public header maze and remove especially the misnomed
kitchen sink internal.h which is despite it's name included all over
the place.
- Add a proper abstraction for the register buffer storage (struct
fpstate) which allows to dynamically size the buffer at runtime by
flipping the pointer to the buffer container from the default
container which is embedded in task_struct::tread::fpu to a
dynamically allocated container with a larger register buffer.
- Convert the code over to the new fpstate mechanism.
- Consolidate the KVM FPU handling by moving the FPU related code into
the FPU core which removes the number of exports and avoids adding
even more export when AMX has to be supported in KVM. This also
removes duplicated code which was of course unnecessary different and
incomplete in the KVM copy.
- Simplify the KVM FPU buffer handling by utilizing the new fpstate
container and just switching the buffer pointer from the user space
buffer to the KVM guest buffer when entering vcpu_run() and flipping
it back when leaving the function. This cuts the memory requirements
of a vCPU for FPU buffers in half and avoids pointless memory copy
operations.
This also solves the so far unresolved problem of adding AMX support
because the current FPU buffer handling of KVM inflicted a circular
dependency between adding AMX support to the core and to KVM. With
the new scheme of switching fpstate AMX support can be added to the
core code without affecting KVM.
- Replace various variables with proper data structures so the extra
information required for adding dynamically enabled FPU features (AMX)
can be added in one place
- Add AMX (Advanved Matrix eXtensions) support (finally):
AMX is a large XSTATE component which is going to be available with
Saphire Rapids XEON CPUs. The feature comes with an extra MSR (MSR_XFD)
which allows to trap the (first) use of an AMX related instruction,
which has two benefits:
1) It allows the kernel to control access to the feature
2) It allows the kernel to dynamically allocate the large register
state buffer instead of burdening every task with the the extra 8K
or larger state storage.
It would have been great to gain this kind of control already with
AVX512.
The support comes with the following infrastructure components:
1) arch_prctl() to
- read the supported features (equivalent to XGETBV(0))
- read the permitted features for a task
- request permission for a dynamically enabled feature
Permission is granted per process, inherited on fork() and cleared
on exec(). The permission policy of the kernel is restricted to
sigaltstack size validation, but the syscall obviously allows
further restrictions via seccomp etc.
2) A stronger sigaltstack size validation for sys_sigaltstack(2) which
takes granted permissions and the potentially resulting larger
signal frame into account. This mechanism can also be used to
enforce factual sigaltstack validation independent of dynamic
features to help with finding potential victims of the 2K
sigaltstack size constant which is broken since AVX512 support was
added.
3) Exception handling for #NM traps to catch first use of a extended
feature via a new cause MSR. If the exception was caused by the use
of such a feature, the handler checks permission for that
feature. If permission has not been granted, the handler sends a
SIGILL like the #UD handler would do if the feature would have been
disabled in XCR0. If permission has been granted, then a new fpstate
which fits the larger buffer requirement is allocated.
In the unlikely case that this allocation fails, the handler sends
SIGSEGV to the task. That's not elegant, but unavoidable as the
other discussed options of preallocation or full per task
permissions come with their own set of horrors for kernel and/or
userspace. So this is the lesser of the evils and SIGSEGV caused by
unexpected memory allocation failures is not a fundamentally new
concept either.
When allocation succeeds, the fpstate properties are filled in to
reflect the extended feature set and the resulting sizes, the
fpu::fpstate pointer is updated accordingly and the trap is disarmed
for this task permanently.
4) Enumeration and size calculations
5) Trap switching via MSR_XFD
The XFD (eXtended Feature Disable) MSR is context switched with the
same life time rules as the FPU register state itself. The mechanism
is keyed off with a static key which is default disabled so !AMX
equipped CPUs have zero overhead. On AMX enabled CPUs the overhead
is limited by comparing the tasks XFD value with a per CPU shadow
variable to avoid redundant MSR writes. In case of switching from a
AMX using task to a non AMX using task or vice versa, the extra MSR
write is obviously inevitable.
All other places which need to be aware of the variable feature sets
and resulting variable sizes are not affected at all because they
retrieve the information (feature set, sizes) unconditonally from
the fpstate properties.
6) Enable the new AMX states
Note, this is relatively new code despite the fact that AMX support is in
the works for more than a year now.
The big refactoring of the FPU code, which allowed to do a proper
integration has been started exactly 3 weeks ago. Refactoring of the
existing FPU code and of the original AMX patches took a week and has
been subject to extensive review and testing. The only fallout which has
not been caught in review and testing right away was restricted to AMX
enabled systems, which is completely irrelevant for anyone outside Intel
and their early access program. There might be dragons lurking as usual,
but so far the fine grained refactoring has held up and eventual yet
undetected fallout is bisectable and should be easily addressable before
the 5.16 release. Famous last words...
Many thanks to Chang Bae and Dave Hansen for working hard on this and
also to the various test teams at Intel who reserved extra capacity to
follow the rapid development of this closely which provides the
confidence level required to offer this rather large update for inclusion
into 5.16-rc1.