Skip to content

Commit

Permalink
Merge tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux…
Browse files Browse the repository at this point in the history
…/kernel/git/tip/tip

Pull x86 FRED support from Thomas Gleixner:
 "Support for x86 Fast Return and Event Delivery (FRED).

  FRED is a replacement for IDT event delivery on x86 and addresses most
  of the technical nightmares which IDT exposes:

   1) Exception cause registers like CR2 need to be manually preserved
      in nested exception scenarios.

   2) Hardware interrupt stack switching is suboptimal for nested
      exceptions as the interrupt stack mechanism rewinds the stack on
      each entry which requires a massive effort in the low level entry
      of #NMI code to handle this.

   3) No hardware distinction between entry from kernel or from user
      which makes establishing kernel context more complex than it needs
      to be especially for unconditionally nestable exceptions like NMI.

   4) NMI nesting caused by IRET unconditionally reenabling NMIs, which
      is a problem when the perf NMI takes a fault when collecting a
      stack trace.

   5) Partial restore of ESP when returning to a 16-bit segment

   6) Limitation of the vector space which can cause vector exhaustion
      on large systems.

   7) Inability to differentiate NMI sources

  FRED addresses these shortcomings by:

   1) An extended exception stack frame which the CPU uses to save
      exception cause registers. This ensures that the meta information
      for each exception is preserved on stack and avoids the extra
      complexity of preserving it in software.

   2) Hardware interrupt stack switching is non-rewinding if a nested
      exception uses the currently interrupt stack.

   3) The entry points for kernel and user context are separate and GS
      BASE handling which is required to establish kernel context for
      per CPU variable access is done in hardware.

   4) NMIs are now nesting protected. They are only reenabled on the
      return from NMI.

   5) FRED guarantees full restore of ESP

   6) FRED does not put a limitation on the vector space by design
      because it uses a central entry points for kernel and user space
      and the CPUstores the entry type (exception, trap, interrupt,
      syscall) on the entry stack along with the vector number. The
      entry code has to demultiplex this information, but this removes
      the vector space restriction.

      The first hardware implementations will still have the current
      restricted vector space because lifting this limitation requires
      further changes to the local APIC.

   7) FRED stores the vector number and meta information on stack which
      allows having more than one NMI vector in future hardware when the
      required local APIC changes are in place.

  The series implements the initial FRED support by:

   - Reworking the existing entry and IDT handling infrastructure to
     accomodate for the alternative entry mechanism.

   - Expanding the stack frame to accomodate for the extra 16 bytes FRED
     requires to store context and meta information

   - Providing FRED specific C entry points for events which have
     information pushed to the extended stack frame, e.g. #PF and #DB.

   - Providing FRED specific C entry points for #NMI and #MCE

   - Implementing the FRED specific ASM entry points and the C code to
     demultiplex the events

   - Providing detection and initialization mechanisms and the necessary
     tweaks in context switching, GS BASE handling etc.

  The FRED integration aims for maximum code reuse vs the existing IDT
  implementation to the extent possible and the deviation in hot paths
  like context switching are handled with alternatives to minimalize the
  impact. The low level entry and exit paths are seperate due to the
  extended stack frame and the hardware based GS BASE swichting and
  therefore have no impact on IDT based systems.

  It has been extensively tested on existing systems and on the FRED
  simulation and as of now there are no outstanding problems"

* tag 'x86-fred-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
  x86/fred: Fix init_task thread stack pointer initialization
  MAINTAINERS: Add a maintainer entry for FRED
  x86/fred: Fix a build warning with allmodconfig due to 'inline' failing to inline properly
  x86/fred: Invoke FRED initialization code to enable FRED
  x86/fred: Add FRED initialization functions
  x86/syscall: Split IDT syscall setup code into idt_syscall_init()
  KVM: VMX: Call fred_entry_from_kvm() for IRQ/NMI handling
  x86/entry: Add fred_entry_from_kvm() for VMX to handle IRQ/NMI
  x86/entry/calling: Allow PUSH_AND_CLEAR_REGS being used beyond actual entry code
  x86/fred: Fixup fault on ERETU by jumping to fred_entrypoint_user
  x86/fred: Let ret_from_fork_asm() jmp to asm_fred_exit_user when FRED is enabled
  x86/traps: Add sysvec_install() to install a system interrupt handler
  x86/fred: FRED entry/exit and dispatch code
  x86/fred: Add a machine check entry stub for FRED
  x86/fred: Add a NMI entry stub for FRED
  x86/fred: Add a debug fault entry stub for FRED
  x86/idtentry: Incorporate definitions/declarations of the FRED entries
  x86/fred: Make exc_page_fault() work for FRED
  x86/fred: Allow single-step trap and NMI when starting a new task
  x86/fred: No ESPFIX needed when FRED is enabled
  ...
  • Loading branch information
Linus Torvalds committed Mar 11, 2024
2 parents ca7e917 + c416b5b commit 720c857
Show file tree
Hide file tree
Showing 56 changed files with 1,372 additions and 121 deletions.
6 changes: 6 additions & 0 deletions Documentation/admin-guide/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1525,6 +1525,12 @@
Warning: use of this parameter will taint the kernel
and may cause unknown problems.

fred= [X86-64]
Enable/disable Flexible Return and Event Delivery.
Format: { on | off }
on: enable FRED when it's present.
off: disable FRED, the default setting.

ftrace=[tracer]
[FTRACE] will set and start the specified tracer
as early as possible in order to facilitate early
Expand Down
96 changes: 96 additions & 0 deletions Documentation/arch/x86/x86_64/fred.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@
.. SPDX-License-Identifier: GPL-2.0
=========================================
Flexible Return and Event Delivery (FRED)
=========================================

Overview
========

The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:

1) Improve overall performance and response time by replacing event
delivery through the interrupt descriptor table (IDT event
delivery) and event return by the IRET instruction with lower
latency transitions.

2) Improve software robustness by ensuring that event delivery
establishes the full supervisor context and that event return
establishes the full user context.

The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is used also to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.

In addition to these transitions, the FRED architecture defines a new
instruction (LKGS) for managing the state of the GS segment register.
The LKGS instruction can be used by 64-bit operating systems that do
not use the new FRED transitions.

Furthermore, the FRED architecture is easy to extend for future CPU
architectures.

Software based event dispatching
================================

FRED operates differently from IDT in terms of event handling. Instead
of directly dispatching an event to its handler based on the event
vector, FRED requires the software to dispatch an event to its handler
based on both the event's type and vector. Therefore, an event dispatch
framework must be implemented to facilitate the event-to-handler
dispatch process. The FRED event dispatch framework takes control
once an event is delivered, and employs a two-level dispatch.

The first level dispatching is event type based, and the second level
dispatching is event vector based.

Full supervisor/user context
============================

FRED event delivery atomically save and restore full supervisor/user
context upon event delivery and return. Thus it avoids the problem of
transient states due to %cr2 and/or %dr6, and it is no longer needed
to handle all the ugly corner cases caused by half baked entry states.

FRED allows explicit unblock of NMI with new event return instructions
ERETS/ERETU, avoiding the mess caused by IRET which unconditionally
unblocks NMI, e.g., when an exception happens during NMI handling.

FRED always restores the full value of %rsp, thus ESPFIX is no longer
needed when FRED is enabled.

LKGS
====

LKGS behaves like the MOV to GS instruction except that it loads the
base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
segment’s descriptor cache. With LKGS, it ends up with avoiding
mucking with kernel GS, i.e., an operating system can always operate
with its own GS base address.

Because FRED event delivery from ring 3 and ERETU both swap the value
of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, plus
the introduction of LKGS instruction, the SWAPGS instruction is no
longer needed when FRED is enabled, thus is disallowed (#UD).

Stack levels
============

4 stack levels 0~3 are introduced to replace the nonreentrant IST for
event handling, and each stack level should be configured to use a
dedicated stack.

The current stack level could be unchanged or go higher upon FRED
event delivery. If unchanged, the CPU keeps using the current event
stack. If higher, the CPU switches to a new event stack specified by
the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP[123].

Only execution of a FRED return instruction ERET[US], could lower the
current stack level, causing the CPU to switch back to the stack it was
on before a previous event delivery that promoted the stack level.
1 change: 1 addition & 0 deletions Documentation/arch/x86/x86_64/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -15,3 +15,4 @@ x86_64 Support
cpu-hotplug-spec
machinecheck
fsgs
fred
10 changes: 10 additions & 0 deletions MAINTAINERS
Original file line number Diff line number Diff line change
Expand Up @@ -11157,6 +11157,16 @@ L: netdev@vger.kernel.org
S: Maintained
F: drivers/net/wwan/iosm/

INTEL(R) FLEXIBLE RETURN AND EVENT DELIVERY
M: Xin Li <xin@zytor.com>
M: "H. Peter Anvin" <hpa@zytor.com>
S: Supported
F: Documentation/arch/x86/x86_64/fred.rst
F: arch/x86/entry/entry_64_fred.S
F: arch/x86/entry/entry_fred.c
F: arch/x86/include/asm/fred.h
F: arch/x86/kernel/fred.c

INTEL(R) TRACE HUB
M: Alexander Shishkin <alexander.shishkin@linux.intel.com>
S: Supported
Expand Down
9 changes: 9 additions & 0 deletions arch/x86/Kconfig
Original file line number Diff line number Diff line change
Expand Up @@ -496,6 +496,15 @@ config X86_CPU_RESCTRL

Say N if unsure.

config X86_FRED
bool "Flexible Return and Event Delivery"
depends on X86_64
help
When enabled, try to use Flexible Return and Event Delivery
instead of the legacy SYSCALL/SYSENTER/IDT architecture for
ring transitions and exception/interrupt handling if the
system supports.

if X86_32
config X86_BIGSMP
bool "Support for big SMP systems with more than 8 CPUs"
Expand Down
5 changes: 4 additions & 1 deletion arch/x86/entry/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,9 @@ obj-y += vdso/
obj-y += vsyscall/

obj-$(CONFIG_PREEMPTION) += thunk_$(BITS).o
CFLAGS_entry_fred.o += -fno-stack-protector
CFLAGS_REMOVE_entry_fred.o += -pg $(CC_FLAGS_FTRACE)
obj-$(CONFIG_X86_FRED) += entry_64_fred.o entry_fred.o

obj-$(CONFIG_IA32_EMULATION) += entry_64_compat.o syscall_32.o
obj-$(CONFIG_X86_X32_ABI) += syscall_x32.o

15 changes: 10 additions & 5 deletions arch/x86/entry/calling.h
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ For 32-bit we have the following conventions - kernel is built with
* for assembly code:
*/

.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
.macro PUSH_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 unwind_hint=1
.if \save_ret
pushq %rsi /* pt_regs->si */
movq 8(%rsp), %rsi /* temporarily store the return address in %rsi */
Expand All @@ -87,14 +87,17 @@ For 32-bit we have the following conventions - kernel is built with
pushq %r13 /* pt_regs->r13 */
pushq %r14 /* pt_regs->r14 */
pushq %r15 /* pt_regs->r15 */

.if \unwind_hint
UNWIND_HINT_REGS
.endif

.if \save_ret
pushq %rsi /* return address on top of stack */
.endif
.endm

.macro CLEAR_REGS
.macro CLEAR_REGS clear_bp=1
/*
* Sanitize registers of values that a speculation attack might
* otherwise want to exploit. The lower registers are likely clobbered
Expand All @@ -109,17 +112,19 @@ For 32-bit we have the following conventions - kernel is built with
xorl %r10d, %r10d /* nospec r10 */
xorl %r11d, %r11d /* nospec r11 */
xorl %ebx, %ebx /* nospec rbx */
.if \clear_bp
xorl %ebp, %ebp /* nospec rbp */
.endif
xorl %r12d, %r12d /* nospec r12 */
xorl %r13d, %r13d /* nospec r13 */
xorl %r14d, %r14d /* nospec r14 */
xorl %r15d, %r15d /* nospec r15 */

.endm

.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0
PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret
CLEAR_REGS
.macro PUSH_AND_CLEAR_REGS rdx=%rdx rcx=%rcx rax=%rax save_ret=0 clear_bp=1 unwind_hint=1
PUSH_REGS rdx=\rdx, rcx=\rcx, rax=\rax, save_ret=\save_ret unwind_hint=\unwind_hint
CLEAR_REGS clear_bp=\clear_bp
.endm

.macro POP_REGS pop_rdi=1
Expand Down
4 changes: 0 additions & 4 deletions arch/x86/entry/entry_32.S
Original file line number Diff line number Diff line change
Expand Up @@ -649,10 +649,6 @@ SYM_CODE_START_LOCAL(asm_\cfunc)
SYM_CODE_END(asm_\cfunc)
.endm

.macro idtentry_sysvec vector cfunc
idtentry \vector asm_\cfunc \cfunc has_error_code=0
.endm

/*
* Include the defines which emit the idt entries which are shared
* shared between 32 and 64 bit and emit the __irqentry_text_* markers
Expand Down
14 changes: 6 additions & 8 deletions arch/x86/entry/entry_64.S
Original file line number Diff line number Diff line change
Expand Up @@ -248,7 +248,13 @@ SYM_CODE_START(ret_from_fork_asm)
* and unwind should work normally.
*/
UNWIND_HINT_REGS

#ifdef CONFIG_X86_FRED
ALTERNATIVE "jmp swapgs_restore_regs_and_return_to_usermode", \
"jmp asm_fred_exit_user", X86_FEATURE_FRED
#else
jmp swapgs_restore_regs_and_return_to_usermode
#endif
SYM_CODE_END(ret_from_fork_asm)
.popsection

Expand Down Expand Up @@ -371,14 +377,6 @@ SYM_CODE_END(\asmsym)
idtentry \vector asm_\cfunc \cfunc has_error_code=1
.endm

/*
* System vectors which invoke their handlers directly and are not
* going through the regular common device interrupt handling code.
*/
.macro idtentry_sysvec vector cfunc
idtentry \vector asm_\cfunc \cfunc has_error_code=0
.endm

/**
* idtentry_mce_db - Macro to generate entry stubs for #MC and #DB
* @vector: Vector number
Expand Down
131 changes: 131 additions & 0 deletions arch/x86/entry/entry_64_fred.S
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
/* SPDX-License-Identifier: GPL-2.0 */
/*
* The actual FRED entry points.
*/

#include <linux/export.h>

#include <asm/asm.h>
#include <asm/fred.h>
#include <asm/segment.h>

#include "calling.h"

.code64
.section .noinstr.text, "ax"

.macro FRED_ENTER
UNWIND_HINT_END_OF_STACK
ENDBR
PUSH_AND_CLEAR_REGS
movq %rsp, %rdi /* %rdi -> pt_regs */
.endm

.macro FRED_EXIT
UNWIND_HINT_REGS
POP_REGS
.endm

/*
* The new RIP value that FRED event delivery establishes is
* IA32_FRED_CONFIG & ~FFFH for events that occur in ring 3.
* Thus the FRED ring 3 entry point must be 4K page aligned.
*/
.align 4096

SYM_CODE_START_NOALIGN(asm_fred_entrypoint_user)
FRED_ENTER
call fred_entry_from_user
SYM_INNER_LABEL(asm_fred_exit_user, SYM_L_GLOBAL)
FRED_EXIT
1: ERETU

_ASM_EXTABLE_TYPE(1b, asm_fred_entrypoint_user, EX_TYPE_ERETU)
SYM_CODE_END(asm_fred_entrypoint_user)

/*
* The new RIP value that FRED event delivery establishes is
* (IA32_FRED_CONFIG & ~FFFH) + 256 for events that occur in
* ring 0, i.e., asm_fred_entrypoint_user + 256.
*/
.org asm_fred_entrypoint_user + 256, 0xcc
SYM_CODE_START_NOALIGN(asm_fred_entrypoint_kernel)
FRED_ENTER
call fred_entry_from_kernel
FRED_EXIT
ERETS
SYM_CODE_END(asm_fred_entrypoint_kernel)

#if IS_ENABLED(CONFIG_KVM_INTEL)
SYM_FUNC_START(asm_fred_entry_from_kvm)
push %rbp
mov %rsp, %rbp

UNWIND_HINT_SAVE

/*
* Both IRQ and NMI from VMX can be handled on current task stack
* because there is no need to protect from reentrancy and the call
* stack leading to this helper is effectively constant and shallow
* (relatively speaking). Do the same when FRED is active, i.e., no
* need to check current stack level for a stack switch.
*
* Emulate the FRED-defined redzone and stack alignment.
*/
sub $(FRED_CONFIG_REDZONE_AMOUNT << 6), %rsp
and $FRED_STACK_FRAME_RSP_MASK, %rsp

/*
* Start to push a FRED stack frame, which is always 64 bytes:
*
* +--------+-----------------+
* | Bytes | Usage |
* +--------+-----------------+
* | 63:56 | Reserved |
* | 55:48 | Event Data |
* | 47:40 | SS + Event Info |
* | 39:32 | RSP |
* | 31:24 | RFLAGS |
* | 23:16 | CS + Aux Info |
* | 15:8 | RIP |
* | 7:0 | Error Code |
* +--------+-----------------+
*/
push $0 /* Reserved, must be 0 */
push $0 /* Event data, 0 for IRQ/NMI */
push %rdi /* fred_ss handed in by the caller */
push %rbp
pushf
mov $__KERNEL_CS, %rax
push %rax

/*
* Unlike the IDT event delivery, FRED _always_ pushes an error code
* after pushing the return RIP, thus the CALL instruction CANNOT be
* used here to push the return RIP, otherwise there is no chance to
* push an error code before invoking the IRQ/NMI handler.
*
* Use LEA to get the return RIP and push it, then push an error code.
*/
lea 1f(%rip), %rax
push %rax /* Return RIP */
push $0 /* Error code, 0 for IRQ/NMI */

PUSH_AND_CLEAR_REGS clear_bp=0 unwind_hint=0
movq %rsp, %rdi /* %rdi -> pt_regs */
call __fred_entry_from_kvm /* Call the C entry point */
POP_REGS
ERETS
1:
/*
* Objtool doesn't understand what ERETS does, this hint tells it that
* yes, we'll reach here and with what stack state. A save/restore pair
* isn't strictly needed, but it's the simplest form.
*/
UNWIND_HINT_RESTORE
pop %rbp
RET

SYM_FUNC_END(asm_fred_entry_from_kvm)
EXPORT_SYMBOL_GPL(asm_fred_entry_from_kvm);
#endif
Loading

0 comments on commit 720c857

Please sign in to comment.