From b5322b6ec06a6c58650f52abcd2492000396363b Mon Sep 17 00:00:00 2001
From: "Herton R. Krzesinski"
Date: Thu, 20 Mar 2025 11:22:13 -0300
Subject: [PATCH 1/5] x86/uaccess: Improve performance by aligning writes to
 8 bytes in copy_user_generic(), on non-FSRM/ERMS CPUs

History of the performance regression:
======================================

Since the following series of user copy updates were merged upstream
~2 years ago via:

  a5624566431d ("Merge branch 'x86-rep-insns': x86 user copy clarifications")

.. copy_user_generic() on x86_64 stopped aligning the writes to the
destination to an 8-byte boundary for the non-FSRM case. Previously,
this was done through the ALIGN_DESTINATION macro used in the now
removed copy_user_generic_unrolled() function.

It turns out this change causes a loss of performance/throughput on
some use cases and specific CPUs/platforms without FSRM and ERMS.

Lately I got two reports of performance/throughput issues after a
RHEL 9 kernel pulled the same upstream series with updates to the user
copy functions. Both reports consisted of running specific
networking/TCP related testing using iperf3.

Partial upstream fix
====================

The first report was related to Linux Bridge testing using VMs on a
specific machine with an AMD CPU (EPYC 7402), and after a brief
investigation it turned out that the later change via:

  ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")

... helped/fixed the performance issue.

However, after that later commit/fix was applied, I got another
regression reported in a multistream TCP test on a 100Gbit mlx5 NIC,
also running on an AMD based platform (AMD EPYC 7302 CPU), again using
iperf3 to run the test. That regression appeared even with the later
fix/commit applied, so that fix alone did not tell the whole story.

Testing performed to pinpoint residual regression
=================================================

So I narrowed down the second regression use case, running it without
traffic through a NIC, on localhost, to focus on CPU usage and avoid
being limited by other factors like network bandwidth. I used another
system also with an AMD CPU (AMD EPYC 7742). Basically, I ran iperf3
in server and client mode on the same system, for example:

 - Start the server binding it to CPU core/thread 19:

   $ taskset -c 19 iperf3 -D -s -B 127.0.0.1 -p 12000

 - Start the client always binding/running on CPU core/thread 17,
   using perf to get statistics:

   $ perf stat -o stat.txt taskset -c 17 iperf3 -c 127.0.0.1 -b 0/1000 -V \
       -n 50G --repeating-payload -l 16384 -p 12000 --cport 12001 2>&1 \
       > stat-19.txt

The client was always running/pinned to CPU 17, while the iperf3
server ran pinned to CPU 19, 21 or 23, or not pinned to any specific
CPU. So it basically consisted of four runs of the same commands, just
changing the CPU to which the server is pinned, or removing the
taskset call before the server command to leave it unpinned. The CPUs
were chosen based on the NUMA node they were on; this is the relevant
output of lscpu on the system:

  $ lscpu
  ...
  Model name:            AMD EPYC 7742 64-Core Processor
  ...
  Caches (sum of all):
    L1d:                 2 MiB (64 instances)
    L1i:                 2 MiB (64 instances)
    L2:                  32 MiB (64 instances)
    L3:                  256 MiB (16 instances)
  NUMA:
    NUMA node(s):        4
    NUMA node0 CPU(s):   0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
    NUMA node1 CPU(s):   2,3,10,11,18,19,26,27,34,35,42,43,50,51,58,59,66,67,74,75,82,83,90,91,98,99,106,107,114,115,122,123
    NUMA node2 CPU(s):   4,5,12,13,20,21,28,29,36,37,44,45,52,53,60,61,68,69,76,77,84,85,92,93,100,101,108,109,116,117,124,125
    NUMA node3 CPU(s):   6,7,14,15,22,23,30,31,38,39,46,47,54,55,62,63,70,71,78,79,86,87,94,95,102,103,110,111,118,119,126,127
  ...

So for the server runs, when picking a CPU, I chose CPUs that were not
on the same node. The reason is that this way I was able to measure
relevant performance differences when changing the alignment of the
writes to the destination in copy_user_generic().

Testing shows up to +81% performance improvement under iperf3
=============================================================

Here's a summary of the iperf3 runs:

  # Vanilla upstream alignment:

  CPU                RATE           SYS             TIME            sender-receiver
  Server bind 19:    13.0Gbits/sec  28.371851000    33.233499566    86.9%-70.8%
  Server bind 21:    12.9Gbits/sec  28.283381000    33.586486621    85.8%-69.9%
  Server bind 23:    11.1Gbits/sec  33.660190000    39.012243176    87.7%-64.5%
  Server bind none:  18.9Gbits/sec  19.215339000    22.875117865    86.0%-80.5%

  # With the attached patch (aligning writes in the non-ERMS/FSRM case):

  CPU                RATE           SYS             TIME            sender-receiver
  Server bind 19:    20.8Gbits/sec  14.897284000    20.811101382    75.7%-89.0%
  Server bind 21:    20.4Gbits/sec  15.205055000    21.263165909    75.4%-89.7%
  Server bind 23:    20.2Gbits/sec  15.433801000    21.456175000    75.5%-89.8%
  Server bind none:  26.1Gbits/sec  12.534022000    16.632447315    79.8%-89.6%

So I consistently got better results when aligning the writes. The
results above were run on 6.14.0-rc6/rc7 based kernels. SYS is the sys
time and TIME is the total time to run/transfer 50G of data. The last
field is the CPU usage of the sender/receiver iperf3 processes.

It's also worth noting that each pair of iperf3 runs may get slightly
different results on each run, but I always got consistently higher
results with the write alignment for this specific test of running the
processes on CPUs in different NUMA nodes.

Linus Torvalds helped/provided this version of the patch. Initially I
proposed a version which aligned writes for all cases in
rep_movs_alternative, however it used two extra registers, and thus
Linus provided an enhanced version that only aligns the write in the
large_movsq case. That is sufficient, since the problem happens only
on AMD CPUs like the ones mentioned above without ERMS/FSRM, and it
also doesn't require using extra registers. I also validated that
aligning only in the large_movsq case is really enough to get the
performance back.

I also tested this patch on an old Intel based non-ERMS/FSRM system
(with a Xeon E5-2667 - Sandy Bridge based) and didn't see any
problems: no performance enhancement, but also no regression, using
the same iperf3 based benchmark. Newer Intel processors after Sandy
Bridge usually have ERMS and should not be affected by this change.

[ mingo: Updated the changelog. ]

Fixes: ca96b162bfd2 ("x86: bring back rep movsq for user access on CPUs without ERMS")
Fixes: 034ff37d3407 ("x86: rewrite '__copy_user_nocache' function")
Reported-by: Ondrej Lichtner
Co-developed-by: Linus Torvalds
Signed-off-by: Linus Torvalds
Signed-off-by: Herton R. Krzesinski
Signed-off-by: Ingo Molnar
Link: https://lore.kernel.org/r/20250320142213.2623518-1-herton@redhat.com
---
 arch/x86/lib/copy_user_64.S | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index aa8c341b2441..06296eb69fd4 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -77,6 +77,24 @@ SYM_FUNC_START(rep_movs_alternative)
 	_ASM_EXTABLE_UA( 0b, 1b)
 
 .Llarge_movsq:
+	/* Do the first possibly unaligned word */
+0:	movq (%rsi),%rax
+1:	movq %rax,(%rdi)
+
+	_ASM_EXTABLE_UA( 0b, .Lcopy_user_tail)
+	_ASM_EXTABLE_UA( 1b, .Lcopy_user_tail)
+
+	/* What would be the offset to the aligned destination? */
+	leaq 8(%rdi),%rax
+	andq $-8,%rax
+	subq %rdi,%rax
+
+	/* .. and update pointers and count to match */
+	addq %rax,%rdi
+	addq %rax,%rsi
+	subq %rax,%rcx
+
+	/* make %rcx contain the number of words, %rax the remainder */
 	movq %rcx,%rax
 	shrq $3,%rcx
 	andl $7,%eax
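For readers who want the new .Llarge_movsq prologue in higher-level
terms, here is a minimal C sketch of the same destination-alignment
arithmetic. It is illustrative only, not the kernel implementation:
copy_large_aligned() is a made-up name, memcpy() stands in for the
movq/rep movsq instructions, and it assumes the large-copy
precondition (well over 8 bytes to copy) that this path already
guarantees.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Sketch of the patched .Llarge_movsq flow: align writes to 8 bytes first. */
  static void copy_large_aligned(void *dst, const void *src, size_t len)
  {
          unsigned char *d = dst;
          const unsigned char *s = src;
          size_t adjust, words, tail, i;

          /* Do the first, possibly unaligned, word ("movq (%rsi),%rax; movq %rax,(%rdi)"). */
          memcpy(d, s, 8);

          /* Offset to the 8-byte-aligned destination: ((dst + 8) & ~7) - dst. */
          adjust = (((uintptr_t)d + 8) & ~(uintptr_t)7) - (uintptr_t)d;

          /* .. and update pointers and count to match. */
          d += adjust;
          s += adjust;
          len -= adjust;

          /* Number of whole words and the remainder ("shrq $3,%rcx; andl $7,%eax"). */
          words = len >> 3;
          tail = len & 7;

          for (i = 0; i < words; i++, d += 8, s += 8)
                  memcpy(d, s, 8);        /* stands in for "rep movsq" */
          memcpy(d, s, tail);             /* trailing 0..7 bytes */
  }

If the destination is already 8-byte aligned, adjust comes out as 8
and the word loop simply continues after the first word; otherwise the
first word and the realigned copy overlap by a few bytes, which is
harmless because the same data is rewritten.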
From f710202b2a45addea3dcdcd862770ecbaf6597ef Mon Sep 17 00:00:00 2001
From: Nathan Chancellor
Date: Tue, 18 Mar 2025 15:32:30 -0700
Subject: [PATCH 2/5] x86/tools: Drop duplicate unlikely() definition in
 insn_decoder_test.c

After commit c104c16073b7 ("Kunit to check the longest symbol length"),
there is a warning when building with clang because there is now a
definition of unlikely() from compiler.h in tools/include/linux, which
conflicts with the one in the instruction decoder selftest:

  arch/x86/tools/insn_decoder_test.c:15:9: warning: 'unlikely' macro redefined [-Wmacro-redefined]

Remove the second unlikely() definition, as it is no longer necessary,
clearing up the warning.

Fixes: c104c16073b7 ("Kunit to check the longest symbol length")
Signed-off-by: Nathan Chancellor
Signed-off-by: Ingo Molnar
Acked-by: Shuah Khan
Link: https://lore.kernel.org/r/20250318-x86-decoder-test-fix-unlikely-redef-v1-1-74c84a7bf05b@kernel.org
---
 arch/x86/tools/insn_decoder_test.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/tools/insn_decoder_test.c b/arch/x86/tools/insn_decoder_test.c
index 6c2986d2ad11..08cd913cbd4e 100644
--- a/arch/x86/tools/insn_decoder_test.c
+++ b/arch/x86/tools/insn_decoder_test.c
@@ -12,8 +12,6 @@
 #include
 #include
 
-#define unlikely(cond) (cond)
-
 #include
 #include
 #include
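The clash removed above is easy to reproduce outside the kernel tree:
once a header in the include path already defines unlikely(), a second
plain #define with a different expansion triggers the clang warning.
The first definition below is a simplified form of what
tools/include/linux/compiler.h provides; the second is the local
fallback the selftest used to carry.

  /* Simplified form of the definition now pulled in via tools/include/linux/compiler.h */
  #define unlikely(x) __builtin_expect(!!(x), 0)

  /* Former local fallback in insn_decoder_test.c -- clang warns here:
   *   'unlikely' macro redefined [-Wmacro-redefined]
   */
  #define unlikely(cond) (cond)

Dropping the second definition leaves the compiler.h version in
effect, which is also the semantically richer one.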
From 7170130e4c72ce0caa0cb42a1627c635cc262821 Mon Sep 17 00:00:00 2001
From: Balbir Singh
Date: Tue, 1 Apr 2025 11:07:52 +1100
Subject: [PATCH 3/5] x86/mm/init: Handle the special case of device private
 pages in add_pages(), to not increase max_pfn and trigger
 dma_addressing_limited() bounce buffers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

As Bert Karwatzki reported, the following recent commit causes a
performance regression on AMD iGPU and dGPU systems:

  7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")

It exposed a bug with nokaslr and zone device interaction.

The root cause of the bug is that the GPU driver registers a zone
device private memory region. When KASLR is disabled or the above
commit is applied, direct_map_physmem_end is set much higher than
10 TiB, typically to the 64 TiB address. When zone device private
memory is added to the system via add_pages(), it bumps up max_pfn to
the same value. This causes dma_addressing_limited() to return true,
since the device cannot address memory all the way up to max_pfn.

This caused a regression for games played on the iGPU, as it resulted
in the DMA32 zone being used for GPU allocations.

Fix this by not bumping up max_pfn on x86 systems when a pgmap is
passed into add_pages(). The presence of a pgmap is used to determine
whether device private memory is being added via add_pages().

More details:

devm_request_mem_region() and request_free_mem_region() request device
private memory. iomem_resource is passed as the base resource with
start and end parameters. iomem_resource's end depends on several
factors, including the platform and virtualization. On x86, for
example on bare metal, this value is set to
boot_cpu_data.x86_phys_bits, which can change depending on support for
MKTME. By default it is set to the same as
log2(direct_map_physmem_end), which is 46 to 52 bits depending on the
number of levels in the page table. The allocation routines used
iomem_resource's end and direct_map_physmem_end to figure out where to
allocate the region.

[ arch/powerpc is also impacted by this problem, but this patch does
  not fix the issue for PowerPC. ]

Testing:

 1. Tested on a virtual machine with test_hmm for zone device
    insertion.

 2. A previous version of this patch was tested by Bert, please see:
    https://lore.kernel.org/lkml/d87680bab997fdc9fb4e638983132af235d9a03a.camel@web.de/

[ mingo: Clarified the comments and the changelog. ]

Reported-by: Bert Karwatzki
Tested-by: Bert Karwatzki
Fixes: 7ffb791423c7 ("x86/kaslr: Reduce KASLR entropy on most x86 systems")
Signed-off-by: Balbir Singh
Signed-off-by: Ingo Molnar
Cc: Brian Gerst
Cc: Juergen Gross
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Christoph Hellwig
Cc: Pierre-Eric Pelloux-Prayer
Cc: Alex Deucher
Cc: Christian König
Cc: David Airlie
Cc: Simona Vetter
Link: https://lore.kernel.org/r/20250401000752.249348-1-balbirs@nvidia.com
---
 arch/x86/mm/init_64.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 519aa53114fa..821a0b53b21c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -959,9 +959,18 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
 	ret = __add_pages(nid, start_pfn, nr_pages, params);
 	WARN_ON_ONCE(ret);
 
-	/* update max_pfn, max_low_pfn and high_memory */
-	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
-				  nr_pages << PAGE_SHIFT);
+	/*
+	 * Special case: add_pages() is called by memremap_pages() for adding device
+	 * private pages. Do not bump up max_pfn in the device private path,
+	 * because max_pfn changes affect dma_addressing_limited().
+	 *
+	 * dma_addressing_limited() returning true when max_pfn is the device's
+	 * addressable memory can force device drivers to use bounce buffers
+	 * and impact their performance negatively:
+	 */
+	if (!params->pgmap)
+		/* update max_pfn, max_low_pfn and high_memory */
+		update_end_of_memory_vars(start_pfn << PAGE_SHIFT, nr_pages << PAGE_SHIFT);
 
 	return ret;
 }
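As an illustration of the failure mode described above -- not the
kernel's actual dma_addressing_limited() implementation -- the toy
model below shows how bumping max_pfn for a device-private region can
flip a device into the "addressing limited" state. All constants (the
44-bit device mask, 1 TiB of real RAM, ~64 TiB after the
device-private add) are made-up example values.

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  #define PAGE_SHIFT 12

  /* Toy check: is the highest populated address beyond what the device can reach? */
  static bool addressing_limited(uint64_t dev_dma_mask, uint64_t max_pfn)
  {
          uint64_t highest_addr = (max_pfn << PAGE_SHIFT) - 1;
          return dev_dma_mask < highest_addr;
  }

  int main(void)
  {
          uint64_t dma44    = (1ULL << 44) - 1;            /* device that can address 16 TiB */
          uint64_t ram_pfn  = 1ULL << (40 - PAGE_SHIFT);   /* 1 TiB of real RAM */
          uint64_t priv_pfn = 1ULL << (46 - PAGE_SHIFT);   /* ~64 TiB once device-private memory bumps max_pfn */

          printf("before add_pages(): limited=%d\n", addressing_limited(dma44, ram_pfn));   /* 0 */
          printf("after max_pfn bump: limited=%d\n", addressing_limited(dma44, priv_pfn));  /* 1 */
          return 0;
  }

With the patch, max_pfn keeps reflecting real RAM when a pgmap is
passed to add_pages(), so the "before" case continues to apply and
drivers are not pushed onto bounce buffers or the DMA32 zone.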
From d0ebf4c7eb91fe73981d5250b50e9d22db8fb946 Mon Sep 17 00:00:00 2001
From: "Dr. David Alan Gilbert"
Date: Wed, 25 Dec 2024 17:50:10 +0000
Subject: [PATCH 4/5] x86/platform/iosf_mbi: Remove unused
 iosf_mbi_unregister_pmic_bus_access_notifier()

The last use of iosf_mbi_unregister_pmic_bus_access_notifier() was
removed in 2017 by:

  a5266db4d314 ("drm/i915: Acquire PUNIT->PMIC bus for intel_uncore_forcewake_reset()")

Remove it.

(Note that the '_unlocked' version is still used.)

Signed-off-by: Dr. David Alan Gilbert
Signed-off-by: Ingo Molnar
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
Cc: Tvrtko Ursulin
Cc: David Airlie
Cc: Simona Vetter
Cc: intel-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Link: https://lore.kernel.org/r/20241225175010.91783-1-linux@treblig.org
---
 arch/x86/include/asm/iosf_mbi.h      |  7 -------
 arch/x86/platform/intel/iosf_mbi.c   | 13 -------------
 drivers/gpu/drm/i915/i915_iosf_mbi.h |  6 ------
 3 files changed, 26 deletions(-)

diff --git a/arch/x86/include/asm/iosf_mbi.h b/arch/x86/include/asm/iosf_mbi.h
index af7541c11821..8ace6559d399 100644
--- a/arch/x86/include/asm/iosf_mbi.h
+++ b/arch/x86/include/asm/iosf_mbi.h
@@ -167,13 +167,6 @@ void iosf_mbi_unblock_punit_i2c_access(void);
  */
 int iosf_mbi_register_pmic_bus_access_notifier(struct notifier_block *nb);
 
-/**
- * iosf_mbi_register_pmic_bus_access_notifier - Unregister PMIC bus notifier
- *
- * @nb: notifier_block to unregister
- */
-int iosf_mbi_unregister_pmic_bus_access_notifier(struct notifier_block *nb);
-
 /**
  * iosf_mbi_unregister_pmic_bus_access_notifier_unlocked - Unregister PMIC bus
  *                                                         notifier, unlocked
diff --git a/arch/x86/platform/intel/iosf_mbi.c b/arch/x86/platform/intel/iosf_mbi.c
index c81cea208c2c..40ae94db20d8 100644
--- a/arch/x86/platform/intel/iosf_mbi.c
+++ b/arch/x86/platform/intel/iosf_mbi.c
@@ -422,19 +422,6 @@ int iosf_mbi_unregister_pmic_bus_access_notifier_unlocked(
 }
 EXPORT_SYMBOL(iosf_mbi_unregister_pmic_bus_access_notifier_unlocked);
 
-int iosf_mbi_unregister_pmic_bus_access_notifier(struct notifier_block *nb)
-{
-	int ret;
-
-	/* Wait for the bus to go inactive before unregistering */
-	iosf_mbi_punit_acquire();
-	ret = iosf_mbi_unregister_pmic_bus_access_notifier_unlocked(nb);
-	iosf_mbi_punit_release();
-
-	return ret;
-}
-EXPORT_SYMBOL(iosf_mbi_unregister_pmic_bus_access_notifier);
-
 void iosf_mbi_assert_punit_acquired(void)
 {
 	WARN_ON(iosf_mbi_pmic_punit_access_count == 0);
diff --git a/drivers/gpu/drm/i915/i915_iosf_mbi.h b/drivers/gpu/drm/i915/i915_iosf_mbi.h
index 8f81b7603d37..317075d0da4e 100644
--- a/drivers/gpu/drm/i915/i915_iosf_mbi.h
+++ b/drivers/gpu/drm/i915/i915_iosf_mbi.h
@@ -31,12 +31,6 @@ iosf_mbi_unregister_pmic_bus_access_notifier_unlocked(struct notifier_block *nb)
 {
 	return 0;
 }
-
-static inline
-int iosf_mbi_unregister_pmic_bus_access_notifier(struct notifier_block *nb)
-{
-	return 0;
-}
 #endif
 
 #endif /* __I915_IOSF_MBI_H__ */

From e5f1e8af9c9e151ecd665f6d2e36fb25fec3b110 Mon Sep 17 00:00:00 2001
From: "Xin Li (Intel)"
Date: Tue, 1 Apr 2025 00:57:27 -0700
Subject: [PATCH 5/5] x86/fred: Fix system hang during S4 resume with FRED
 enabled

Upon a wakeup from S4, the restore kernel starts and initializes the
FRED MSRs as needed from its perspective. It then loads a hibernation
image, including the image kernel, and attempts to load image pages
directly into their original page frames used before hibernation,
unless those frames are currently in use. Once all pages are moved to
their original locations, it jumps to a "trampoline" page in the image
kernel.

At this point, the image kernel takes control, but the FRED MSRs still
contain values set by the restore kernel, which may differ from those
set by the image kernel before hibernation. Therefore, the image
kernel must ensure the FRED MSRs have the same values as before
hibernation. Since these values depend only on the location of the
kernel text and data, they can be recomputed from scratch.

Reported-by: Xi Pardee
Reported-by: Todd Brandt
Tested-by: Todd Brandt
Suggested-by: H. Peter Anvin (Intel)
Signed-off-by: Xin Li (Intel)
Signed-off-by: Ingo Molnar
Reviewed-by: Rafael J. Wysocki
Reviewed-by: H. Peter Anvin (Intel)
Cc: Andy Lutomirski
Cc: Brian Gerst
Cc: Juergen Gross
Cc: Linus Torvalds
Link: https://lore.kernel.org/r/20250401075728.3626147-1-xin@zytor.com
---
 arch/x86/power/cpu.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 63230ff8cf4f..08e76a5ca155 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -27,6 +27,7 @@
 #include
 #include
 #include
+#include <asm/fred.h>
 
 #ifdef CONFIG_X86_32
 __visible unsigned long saved_context_ebx;
@@ -231,6 +232,19 @@ static void notrace __restore_processor_state(struct saved_context *ctxt)
 	 */
 #ifdef CONFIG_X86_64
 	wrmsrl(MSR_GS_BASE, ctxt->kernelmode_gs_base);
+
+	/*
+	 * Reinitialize FRED to ensure the FRED MSRs contain the same values
+	 * as before hibernation.
+	 *
+	 * Note, the setup of FRED RSPs requires access to percpu data
+	 * structures.  Therefore, FRED reinitialization can only occur after
+	 * the percpu access pointer (i.e., MSR_GS_BASE) is restored.
+	 */
+	if (ctxt->cr4 & X86_CR4_FRED) {
+		cpu_init_fred_exceptions();
+		cpu_init_fred_rsps();
+	}
 #else
 	loadsegment(fs, __KERNEL_PERCPU);
 #endif
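Finally, a compilable toy model of the resume-time decision this patch
adds; it is not kernel code. Only the X86_CR4_FRED check and the idea
of recomputing FRED state from scratch come from the patch itself; the
structure layout, the stub function and main() are illustrative
assumptions.

  #include <stdio.h>

  #define X86_CR4_FRED (1ULL << 32)     /* CR4.FRED is bit 32 */

  struct saved_context { unsigned long long cr4; };

  /* Stand-in for cpu_init_fred_exceptions() + cpu_init_fred_rsps(). */
  static void reinit_fred_from_scratch(void)
  {
          puts("recomputing FRED MSRs from the kernel's text/data locations");
  }

  /* Called late in resume, after MSR_GS_BASE (percpu access) has been restored. */
  static void restore_processor_state_sketch(const struct saved_context *ctxt)
  {
          if (ctxt->cr4 & X86_CR4_FRED)
                  reinit_fred_from_scratch();
  }

  int main(void)
  {
          struct saved_context ctxt = { .cr4 = X86_CR4_FRED };
          restore_processor_state_sketch(&ctxt);
          return 0;
  }

The point is simply that the image kernel cannot trust whatever the
restore kernel left in the FRED MSRs, so it keys off its own saved CR4
value and rebuilds the state it needs.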