From d6b3ef9e7a9c2398a71978874957c88c73d270a6 Mon Sep 17 00:00:00 2001 From: Loic Poulain Date: Wed, 2 Apr 2025 15:01:37 +0200 Subject: [PATCH 01/31] mailmap: map Loic Poulain's old email addresses Map old email addresses that are no longer in use. Link: https://lkml.kernel.org/r/20250402130137.12328-1-loic.poulain@oss.qualcomm.com Signed-off-by: Loic Poulain Cc: Simon Horman Signed-off-by: Andrew Morton --- .mailmap | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.mailmap b/.mailmap index 4f7cd8e23177..6a874f9950bf 100644 --- a/.mailmap +++ b/.mailmap @@ -438,6 +438,8 @@ Linus Lüssing Li Yang Li Yang Lior David +Loic Poulain +Loic Poulain Lorenzo Pieralisi Lorenzo Stoakes Luca Ceresoli From be8254f694469e60252e35c467ac9a878d7797bf Mon Sep 17 00:00:00 2001 From: Daniel Gomez Date: Fri, 21 Mar 2025 20:24:33 +0000 Subject: [PATCH 02/31] radix-tree: add missing cleanup.h Add shared cleanup.h header for radix-tree testing tools. Fixes build error found with kdevops [1]: cc -I../shared -I. -I../../include -I../../../lib -g -Og -Wall -D_LGPL_SOURCE -fsanitize=address -fsanitize=undefined -c -o radix-tree.o radix-tree.c In file included from ../shared/linux/idr.h:1, from radix-tree.c:18: ../shared/linux/../../../../include/linux/idr.h:18:10: fatal error: linux/cleanup.h: No such file or directory 18 | #include <linux/cleanup.h> | ^~~~~~~~~~~~~~~~~ compilation terminated. make: *** [<builtin>: radix-tree.o] Error 1 [1] https://github.com/linux-kdevops/kdevops https://github.com/linux-kdevops/linux-mm-kpd/actions/runs/13971648496/job/39114756401 [akpm@linux-foundation.org: remove unneeded header guards, per Sidhartha] Link: https://lkml.kernel.org/r/20250321-fix-radix-tree-build-v1-1-838a1e6540e2@samsung.com Fixes: 6c8b0b835f00 ("perf/core: Simplify perf_pmu_register()") Signed-off-by: Daniel Gomez Cc: Daniel Gomez Cc: Ingo Molnar Cc: Liam Howlett Cc: Luis Chamberlain Cc: Matthew Wilcox (Oracle) Cc: Ravi Bangoria Cc: Peter Zijlstra Cc: Sidhartha Kumar Signed-off-by: Andrew Morton --- tools/testing/shared/linux/cleanup.h | 2 ++ 1 file changed, 2 insertions(+) create mode 100644 tools/testing/shared/linux/cleanup.h diff --git a/tools/testing/shared/linux/cleanup.h b/tools/testing/shared/linux/cleanup.h new file mode 100644 index 000000000000..ea3081426ee9 --- /dev/null +++ b/tools/testing/shared/linux/cleanup.h @@ -0,0 +1,2 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#include "../../../../include/linux/cleanup.h" From a30951d09c33c899f0e4aca80eb87fad5f10ecfa Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 2 Apr 2025 20:33:11 -0400 Subject: [PATCH 03/31] test suite: use %zu to print size_t On 32-bit, we can't use %lu to print a size_t variable and gcc warns us about it. Shame it doesn't warn about it on 64-bit. Link: https://lkml.kernel.org/r/20250403003311.359917-1-Liam.Howlett@oracle.com Fixes: cc86e0c2f306 ("radix tree test suite: add support for slab bulk APIs") Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Liam R.
Howlett Signed-off-by: Andrew Morton --- tools/testing/shared/linux.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/shared/linux.c b/tools/testing/shared/linux.c index 66dbb362385f..0f97fb0d19e1 100644 --- a/tools/testing/shared/linux.c +++ b/tools/testing/shared/linux.c @@ -150,7 +150,7 @@ void kmem_cache_free(struct kmem_cache *cachep, void *objp) void kmem_cache_free_bulk(struct kmem_cache *cachep, size_t size, void **list) { if (kmalloc_verbose) - pr_debug("Bulk free %p[0-%lu]\n", list, size - 1); + pr_debug("Bulk free %p[0-%zu]\n", list, size - 1); pthread_mutex_lock(&cachep->lock); for (int i = 0; i < size; i++) @@ -168,7 +168,7 @@ int kmem_cache_alloc_bulk(struct kmem_cache *cachep, gfp_t gfp, size_t size, size_t i; if (kmalloc_verbose) - pr_debug("Bulk alloc %lu\n", size); + pr_debug("Bulk alloc %zu\n", size); pthread_mutex_lock(&cachep->lock); if (cachep->nr_objs >= size) { From 51339d99c0131bc0d16d378e9b05bc498d2967e2 Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Wed, 2 Apr 2025 19:55:14 -0700 Subject: [PATCH 04/31] locking/local_lock, mm: replace localtry_ helpers with local_trylock_t type Partially revert commit 0aaddfb06882 ("locking/local_lock: Introduce localtry_lock_t"). Remove the localtry_*() helpers, since the localtry_lock() name might be misinterpreted as "try lock". Introduce local_trylock[_irqsave]() helpers that only work with the newly introduced local_trylock_t type. Note that an attempt to use local_trylock[_irqsave]() with local_lock_t will cause a compilation failure. Usage and behavior in !PREEMPT_RT: local_lock_t lock; // sizeof(lock) == 0 local_lock(&lock); // preempt disable local_lock_irqsave(&lock, ...); // irq save if (local_trylock_irqsave(&lock, ...)) // compilation error local_trylock_t lock; // sizeof(lock) == 4 local_lock(&lock); // preempt disable, acquired = 1 local_lock_irqsave(&lock, ...); // irq save, acquired = 1 if (local_trylock(&lock)) // if (!acquired) preempt disable, acquired = 1 if (local_trylock_irqsave(&lock, ...)) // if (!acquired) irq save, acquired = 1 The existing local_lock_*() macros can be used either with local_lock_t or local_trylock_t. With local_trylock_t they set acquired = 1 while local_unlock_*() clears it. In !PREEMPT_RT local_lock_irqsave(local_lock_t *) disables interrupts to protect the critical section, but it doesn't prevent NMI, so fully reentrant code cannot use local_lock_irqsave(local_lock_t *) for exclusive access. The local_lock_irqsave(local_trylock_t *) helper disables interrupts and sets acquired=1, so local_trylock_irqsave(local_trylock_t *) from NMI attempting to acquire the same lock will return false. In PREEMPT_RT local_lock_irqsave() maps to preemptible spin_lock(). Map local_trylock_irqsave() to preemptible spin_trylock(). When in hard IRQ or NMI, return false right away, since spin_trylock() is not safe due to explicit locking in the underlying rt_spin_trylock() implementation. Removing this explicit locking and attempting only "trylock" is undesired due to PI implications. The local_trylock() without _irqsave can be used to avoid the cost of disabling/enabling interrupts by only disabling preemption, so local_trylock() in an interrupt attempting to acquire the same lock will return false. Note there is no need to use local_inc for the acquired variable, since it's a percpu variable with strict nesting scopes. Note that guard(local_lock)(&lock) works only for "local_lock_t lock".
The patch also makes sure that local_lock_release(l) is called before WRITE_ONCE(l->acquired, 0); otherwise, even though IRQs are disabled at this point, a local_trylock() from NMI would succeed and local_lock_acquire(l) would then warn. Link: https://lkml.kernel.org/r/20250403025514.41186-1-alexei.starovoitov@gmail.com Fixes: 0aaddfb06882 ("locking/local_lock: Introduce localtry_lock_t") Signed-off-by: Alexei Starovoitov Acked-by: Vlastimil Babka Acked-by: Sebastian Andrzej Siewior Reviewed-by: Shakeel Butt Cc: Daniel Borkmann Cc: Linus Torvalds Cc: Martin KaFai Lau Cc: Michal Hocko Cc: Peter Zijlstra Cc: Steven Rostedt Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- include/linux/local_lock.h | 58 ++------ include/linux/local_lock_internal.h | 207 ++++++++++++---------------- mm/memcontrol.c | 39 +++--- 3 files changed, 114 insertions(+), 190 deletions(-) diff --git a/include/linux/local_lock.h b/include/linux/local_lock.h index 1a0bc35839e3..16a2ee4f8310 100644 --- a/include/linux/local_lock.h +++ b/include/linux/local_lock.h @@ -52,44 +52,23 @@ __local_unlock_irqrestore(lock, flags) /** - * localtry_lock_init - Runtime initialize a lock instance - */ -#define localtry_lock_init(lock) __localtry_lock_init(lock) - -/** - * localtry_lock - Acquire a per CPU local lock - * @lock: The lock variable - */ -#define localtry_lock(lock) __localtry_lock(lock) - -/** - * localtry_lock_irq - Acquire a per CPU local lock and disable interrupts - * @lock: The lock variable - */ -#define localtry_lock_irq(lock) __localtry_lock_irq(lock) - -/** - * localtry_lock_irqsave - Acquire a per CPU local lock, save and disable - * interrupts - * @lock: The lock variable - * @flags: Storage for interrupt flags + * local_lock_init - Runtime initialize a lock instance */ -#define localtry_lock_irqsave(lock, flags) \ - __localtry_lock_irqsave(lock, flags) +#define local_trylock_init(lock) __local_trylock_init(lock) /** - * localtry_trylock - Try to acquire a per CPU local lock. + * local_trylock - Try to acquire a per CPU local lock * @lock: The lock variable * * The function can be used in any context such as NMI or HARDIRQ. Due to * locking constrains it will _always_ fail to acquire the lock in NMI or * HARDIRQ context on PREEMPT_RT. */ -#define localtry_trylock(lock) __localtry_trylock(lock) +#define local_trylock(lock) __local_trylock(lock) /** - * localtry_trylock_irqsave - Try to acquire a per CPU local lock, save and disable - * interrupts if acquired + * local_trylock_irqsave - Try to acquire a per CPU local lock, save and disable + * interrupts if acquired * @lock: The lock variable * @flags: Storage for interrupt flags * * The function can be used in any context such as NMI or HARDIRQ. Due to * locking constrains it will _always_ fail to acquire the lock in NMI or * HARDIRQ context on PREEMPT_RT.
*/ -#define localtry_trylock_irqsave(lock, flags) \ - __localtry_trylock_irqsave(lock, flags) - -/** - * local_unlock - Release a per CPU local lock - * @lock: The lock variable - */ -#define localtry_unlock(lock) __localtry_unlock(lock) - -/** - * local_unlock_irq - Release a per CPU local lock and enable interrupts - * @lock: The lock variable - */ -#define localtry_unlock_irq(lock) __localtry_unlock_irq(lock) - -/** - * localtry_unlock_irqrestore - Release a per CPU local lock and restore - * interrupt flags - * @lock: The lock variable - * @flags: Interrupt flags to restore - */ -#define localtry_unlock_irqrestore(lock, flags) \ - __localtry_unlock_irqrestore(lock, flags) +#define local_trylock_irqsave(lock, flags) \ + __local_trylock_irqsave(lock, flags) DEFINE_GUARD(local_lock, local_lock_t __percpu*, local_lock(_T), diff --git a/include/linux/local_lock_internal.h b/include/linux/local_lock_internal.h index 67bd13d142fa..bf2bf40d7b18 100644 --- a/include/linux/local_lock_internal.h +++ b/include/linux/local_lock_internal.h @@ -15,10 +15,11 @@ typedef struct { #endif } local_lock_t; +/* local_trylock() and local_trylock_irqsave() only work with local_trylock_t */ typedef struct { local_lock_t llock; - unsigned int acquired; -} localtry_lock_t; + u8 acquired; +} local_trylock_t; #ifdef CONFIG_DEBUG_LOCK_ALLOC # define LOCAL_LOCK_DEBUG_INIT(lockname) \ @@ -29,6 +30,9 @@ typedef struct { }, \ .owner = NULL, +# define LOCAL_TRYLOCK_DEBUG_INIT(lockname) \ + .llock = { LOCAL_LOCK_DEBUG_INIT((lockname).llock) }, + static inline void local_lock_acquire(local_lock_t *l) { lock_map_acquire(&l->dep_map); @@ -56,6 +60,7 @@ static inline void local_lock_debug_init(local_lock_t *l) } #else /* CONFIG_DEBUG_LOCK_ALLOC */ # define LOCAL_LOCK_DEBUG_INIT(lockname) +# define LOCAL_TRYLOCK_DEBUG_INIT(lockname) static inline void local_lock_acquire(local_lock_t *l) { } static inline void local_trylock_acquire(local_lock_t *l) { } static inline void local_lock_release(local_lock_t *l) { } @@ -63,7 +68,7 @@ static inline void local_lock_debug_init(local_lock_t *l) { } #endif /* !CONFIG_DEBUG_LOCK_ALLOC */ #define INIT_LOCAL_LOCK(lockname) { LOCAL_LOCK_DEBUG_INIT(lockname) } -#define INIT_LOCALTRY_LOCK(lockname) { .llock = { LOCAL_LOCK_DEBUG_INIT(lockname.llock) }} +#define INIT_LOCAL_TRYLOCK(lockname) { LOCAL_TRYLOCK_DEBUG_INIT(lockname) } #define __local_lock_init(lock) \ do { \ @@ -76,6 +81,8 @@ do { \ local_lock_debug_init(lock); \ } while (0) +#define __local_trylock_init(lock) __local_lock_init(lock.llock) + #define __spinlock_nested_bh_init(lock) \ do { \ static struct lock_class_key __key; \ @@ -87,149 +94,117 @@ do { \ local_lock_debug_init(lock); \ } while (0) +#define __local_lock_acquire(lock) \ + do { \ + local_trylock_t *tl; \ + local_lock_t *l; \ + \ + l = (local_lock_t *)this_cpu_ptr(lock); \ + tl = (local_trylock_t *)l; \ + _Generic((lock), \ + local_trylock_t *: ({ \ + lockdep_assert(tl->acquired == 0); \ + WRITE_ONCE(tl->acquired, 1); \ + }), \ + default:(void)0); \ + local_lock_acquire(l); \ + } while (0) + #define __local_lock(lock) \ do { \ preempt_disable(); \ - local_lock_acquire(this_cpu_ptr(lock)); \ + __local_lock_acquire(lock); \ } while (0) #define __local_lock_irq(lock) \ do { \ local_irq_disable(); \ - local_lock_acquire(this_cpu_ptr(lock)); \ + __local_lock_acquire(lock); \ } while (0) #define __local_lock_irqsave(lock, flags) \ do { \ local_irq_save(flags); \ - local_lock_acquire(this_cpu_ptr(lock)); \ - } while (0) - -#define __local_unlock(lock) \ - do { \ - 
local_lock_release(this_cpu_ptr(lock)); \ - preempt_enable(); \ + __local_lock_acquire(lock); \ } while (0) -#define __local_unlock_irq(lock) \ - do { \ - local_lock_release(this_cpu_ptr(lock)); \ - local_irq_enable(); \ - } while (0) - -#define __local_unlock_irqrestore(lock, flags) \ - do { \ - local_lock_release(this_cpu_ptr(lock)); \ - local_irq_restore(flags); \ - } while (0) - -#define __local_lock_nested_bh(lock) \ - do { \ - lockdep_assert_in_softirq(); \ - local_lock_acquire(this_cpu_ptr(lock)); \ - } while (0) - -#define __local_unlock_nested_bh(lock) \ - local_lock_release(this_cpu_ptr(lock)) - -/* localtry_lock_t variants */ - -#define __localtry_lock_init(lock) \ -do { \ - __local_lock_init(&(lock)->llock); \ - WRITE_ONCE((lock)->acquired, 0); \ -} while (0) - -#define __localtry_lock(lock) \ - do { \ - localtry_lock_t *lt; \ - preempt_disable(); \ - lt = this_cpu_ptr(lock); \ - local_lock_acquire(<->llock); \ - WRITE_ONCE(lt->acquired, 1); \ - } while (0) - -#define __localtry_lock_irq(lock) \ - do { \ - localtry_lock_t *lt; \ - local_irq_disable(); \ - lt = this_cpu_ptr(lock); \ - local_lock_acquire(<->llock); \ - WRITE_ONCE(lt->acquired, 1); \ - } while (0) - -#define __localtry_lock_irqsave(lock, flags) \ - do { \ - localtry_lock_t *lt; \ - local_irq_save(flags); \ - lt = this_cpu_ptr(lock); \ - local_lock_acquire(<->llock); \ - WRITE_ONCE(lt->acquired, 1); \ - } while (0) - -#define __localtry_trylock(lock) \ +#define __local_trylock(lock) \ ({ \ - localtry_lock_t *lt; \ - bool _ret; \ + local_trylock_t *tl; \ \ preempt_disable(); \ - lt = this_cpu_ptr(lock); \ - if (!READ_ONCE(lt->acquired)) { \ - WRITE_ONCE(lt->acquired, 1); \ - local_trylock_acquire(<->llock); \ - _ret = true; \ - } else { \ - _ret = false; \ + tl = this_cpu_ptr(lock); \ + if (READ_ONCE(tl->acquired)) { \ preempt_enable(); \ + tl = NULL; \ + } else { \ + WRITE_ONCE(tl->acquired, 1); \ + local_trylock_acquire( \ + (local_lock_t *)tl); \ } \ - _ret; \ + !!tl; \ }) -#define __localtry_trylock_irqsave(lock, flags) \ +#define __local_trylock_irqsave(lock, flags) \ ({ \ - localtry_lock_t *lt; \ - bool _ret; \ + local_trylock_t *tl; \ \ local_irq_save(flags); \ - lt = this_cpu_ptr(lock); \ - if (!READ_ONCE(lt->acquired)) { \ - WRITE_ONCE(lt->acquired, 1); \ - local_trylock_acquire(<->llock); \ - _ret = true; \ - } else { \ - _ret = false; \ + tl = this_cpu_ptr(lock); \ + if (READ_ONCE(tl->acquired)) { \ local_irq_restore(flags); \ + tl = NULL; \ + } else { \ + WRITE_ONCE(tl->acquired, 1); \ + local_trylock_acquire( \ + (local_lock_t *)tl); \ } \ - _ret; \ + !!tl; \ }) -#define __localtry_unlock(lock) \ +#define __local_lock_release(lock) \ + do { \ + local_trylock_t *tl; \ + local_lock_t *l; \ + \ + l = (local_lock_t *)this_cpu_ptr(lock); \ + tl = (local_trylock_t *)l; \ + local_lock_release(l); \ + _Generic((lock), \ + local_trylock_t *: ({ \ + lockdep_assert(tl->acquired == 1); \ + WRITE_ONCE(tl->acquired, 0); \ + }), \ + default:(void)0); \ + } while (0) + +#define __local_unlock(lock) \ do { \ - localtry_lock_t *lt; \ - lt = this_cpu_ptr(lock); \ - WRITE_ONCE(lt->acquired, 0); \ - local_lock_release(<->llock); \ + __local_lock_release(lock); \ preempt_enable(); \ } while (0) -#define __localtry_unlock_irq(lock) \ +#define __local_unlock_irq(lock) \ do { \ - localtry_lock_t *lt; \ - lt = this_cpu_ptr(lock); \ - WRITE_ONCE(lt->acquired, 0); \ - local_lock_release(<->llock); \ + __local_lock_release(lock); \ local_irq_enable(); \ } while (0) -#define __localtry_unlock_irqrestore(lock, flags) \ +#define 
__local_unlock_irqrestore(lock, flags) \ do { \ - localtry_lock_t *lt; \ - lt = this_cpu_ptr(lock); \ - WRITE_ONCE(lt->acquired, 0); \ - local_lock_release(<->llock); \ + __local_lock_release(lock); \ local_irq_restore(flags); \ } while (0) +#define __local_lock_nested_bh(lock) \ + do { \ + lockdep_assert_in_softirq(); \ + local_lock_acquire(this_cpu_ptr(lock)); \ + } while (0) + +#define __local_unlock_nested_bh(lock) \ + local_lock_release(this_cpu_ptr(lock)) + #else /* !CONFIG_PREEMPT_RT */ /* @@ -237,16 +212,18 @@ do { \ * critical section while staying preemptible. */ typedef spinlock_t local_lock_t; -typedef spinlock_t localtry_lock_t; +typedef spinlock_t local_trylock_t; #define INIT_LOCAL_LOCK(lockname) __LOCAL_SPIN_LOCK_UNLOCKED((lockname)) -#define INIT_LOCALTRY_LOCK(lockname) INIT_LOCAL_LOCK(lockname) +#define INIT_LOCAL_TRYLOCK(lockname) __LOCAL_SPIN_LOCK_UNLOCKED((lockname)) #define __local_lock_init(l) \ do { \ local_spin_lock_init((l)); \ } while (0) +#define __local_trylock_init(l) __local_lock_init(l) + #define __local_lock(__lock) \ do { \ migrate_disable(); \ @@ -283,17 +260,7 @@ do { \ spin_unlock(this_cpu_ptr((lock))); \ } while (0) -/* localtry_lock_t variants */ - -#define __localtry_lock_init(lock) __local_lock_init(lock) -#define __localtry_lock(lock) __local_lock(lock) -#define __localtry_lock_irq(lock) __local_lock(lock) -#define __localtry_lock_irqsave(lock, flags) __local_lock_irqsave(lock, flags) -#define __localtry_unlock(lock) __local_unlock(lock) -#define __localtry_unlock_irq(lock) __local_unlock(lock) -#define __localtry_unlock_irqrestore(lock, flags) __local_unlock_irqrestore(lock, flags) - -#define __localtry_trylock(lock) \ +#define __local_trylock(lock) \ ({ \ int __locked; \ \ @@ -308,11 +275,11 @@ do { \ __locked; \ }) -#define __localtry_trylock_irqsave(lock, flags) \ +#define __local_trylock_irqsave(lock, flags) \ ({ \ typecheck(unsigned long, flags); \ flags = 0; \ - __localtry_trylock(lock); \ + __local_trylock(lock); \ }) #endif /* CONFIG_PREEMPT_RT */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 421740f1bcdc..c96c1f2b9cf5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1759,7 +1759,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg) } struct memcg_stock_pcp { - localtry_lock_t stock_lock; + local_trylock_t stock_lock; struct mem_cgroup *cached; /* this never be root cgroup */ unsigned int nr_pages; @@ -1774,7 +1774,7 @@ struct memcg_stock_pcp { #define FLUSHING_CACHED_CHARGE 0 }; static DEFINE_PER_CPU(struct memcg_stock_pcp, memcg_stock) = { - .stock_lock = INIT_LOCALTRY_LOCK(stock_lock), + .stock_lock = INIT_LOCAL_TRYLOCK(stock_lock), }; static DEFINE_MUTEX(percpu_charge_mutex); @@ -1805,11 +1805,10 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, if (nr_pages > MEMCG_CHARGE_BATCH) return ret; - if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) { - if (!gfpflags_allow_spinning(gfp_mask)) - return ret; - localtry_lock_irqsave(&memcg_stock.stock_lock, flags); - } + if (gfpflags_allow_spinning(gfp_mask)) + local_lock_irqsave(&memcg_stock.stock_lock, flags); + else if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) + return ret; stock = this_cpu_ptr(&memcg_stock); stock_pages = READ_ONCE(stock->nr_pages); @@ -1818,7 +1817,7 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages, ret = true; } - localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); return ret; } @@ -1857,14 +1856,14 @@ 
static void drain_local_stock(struct work_struct *dummy) * drain_stock races is that we always operate on local CPU stock * here with IRQ disabled */ - localtry_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.stock_lock, flags); stock = this_cpu_ptr(&memcg_stock); old = drain_obj_stock(stock); drain_stock(stock); clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); - localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); obj_cgroup_put(old); } @@ -1894,7 +1893,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) { unsigned long flags; - if (!localtry_trylock_irqsave(&memcg_stock.stock_lock, flags)) { + if (!local_trylock_irqsave(&memcg_stock.stock_lock, flags)) { /* * In case of unlikely failure to lock percpu stock_lock * uncharge memcg directly. @@ -1907,7 +1906,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages) return; } __refill_stock(memcg, nr_pages); - localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); } /* @@ -1964,9 +1963,9 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu) stock = &per_cpu(memcg_stock, cpu); /* drain_obj_stock requires stock_lock */ - localtry_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.stock_lock, flags); old = drain_obj_stock(stock); - localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); drain_stock(stock); obj_cgroup_put(old); @@ -2787,7 +2786,7 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, unsigned long flags; int *bytes; - localtry_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.stock_lock, flags); stock = this_cpu_ptr(&memcg_stock); /* @@ -2836,7 +2835,7 @@ static void mod_objcg_state(struct obj_cgroup *objcg, struct pglist_data *pgdat, if (nr) __mod_objcg_mlstate(objcg, pgdat, idx, nr); - localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); obj_cgroup_put(old); } @@ -2846,7 +2845,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) unsigned long flags; bool ret = false; - localtry_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.stock_lock, flags); stock = this_cpu_ptr(&memcg_stock); if (objcg == READ_ONCE(stock->cached_objcg) && stock->nr_bytes >= nr_bytes) { @@ -2854,7 +2853,7 @@ static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) ret = true; } - localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); return ret; } @@ -2946,7 +2945,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, unsigned long flags; unsigned int nr_pages = 0; - localtry_lock_irqsave(&memcg_stock.stock_lock, flags); + local_lock_irqsave(&memcg_stock.stock_lock, flags); stock = this_cpu_ptr(&memcg_stock); if (READ_ONCE(stock->cached_objcg) != objcg) { /* reset if necessary */ @@ -2960,7 +2959,7 @@ static void refill_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes, stock->nr_bytes &= (PAGE_SIZE - 1); } - localtry_unlock_irqrestore(&memcg_stock.stock_lock, flags); + local_unlock_irqrestore(&memcg_stock.stock_lock, flags); obj_cgroup_put(old); if (nr_pages) From 1b17cdbb708bf435973d0a57bee4230d242085cf Mon Sep 17 00:00:00 2001 From: Takuma Watanabe 
Date: Tue, 18 Mar 2025 20:55:19 +0900 Subject: [PATCH 05/31] mseal: fix typo and style in documentation Correct a typo in the mseal documentation. Link: https://lkml.kernel.org/r/20250318115521.11654-1-takumaw1990@gmail.com Signed-off-by: Takuma Watanabe Cc: Jeff Xu Signed-off-by: Andrew Morton --- Documentation/userspace-api/mseal.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/userspace-api/mseal.rst b/Documentation/userspace-api/mseal.rst index 1dabfc29be0d..7195a7f91107 100644 --- a/Documentation/userspace-api/mseal.rst +++ b/Documentation/userspace-api/mseal.rst @@ -27,7 +27,7 @@ SYSCALL ======= mseal syscall signature ----------------------- - ``int mseal(void \* addr, size_t len, unsigned long flags)`` + ``int mseal(void *addr, size_t len, unsigned long flags)`` **addr**/**len**: virtual memory address range. The address range set by **addr**/**len** must meet: From 770c8d55c42868239c748a3ebc57c9e37755f842 Mon Sep 17 00:00:00 2001 From: Sheng Yong Date: Tue, 1 Apr 2025 22:47:12 +0800 Subject: [PATCH 06/31] lib/iov_iter: fix to increase non slab folio refcount When testing EROFS file-backed mount over v9fs on qemu, I encountered a folio UAF issue. The page sanity check reports the following call trace. The root cause is that pages in bvec are coalesced across a folio boundary. The refcount of all non-slab folios should be increased to ensure p9_release_pages can put them correctly. BUG: Bad page state in process md5sum pfn:18300 page: refcount:0 mapcount:0 mapping:00000000d5ad8e4e index:0x60 pfn:0x18300 head: order:0 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 aops:z_erofs_aops ino:30b0f dentry name(?):"GoogleExtServicesCn.apk" flags: 0x100000000000041(locked|head|node=0|zone=1) raw: 0100000000000041 dead000000000100 dead000000000122 ffff888014b13bd0 raw: 0000000000000060 0000000000000020 00000000ffffffff 0000000000000000 head: 0100000000000041 dead000000000100 dead000000000122 ffff888014b13bd0 head: 0000000000000060 0000000000000020 00000000ffffffff 0000000000000000 head: 0100000000000000 0000000000000000 ffffffffffffffff 0000000000000000 head: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set Call Trace: dump_stack_lvl+0x53/0x70 bad_page+0xd4/0x220 __free_pages_ok+0x76d/0xf30 __folio_put+0x230/0x320 p9_release_pages+0x179/0x1f0 p9_virtio_zc_request+0xa2a/0x1230 p9_client_zc_rpc.constprop.0+0x247/0x700 p9_client_read_once+0x34d/0x810 p9_client_read+0xf3/0x150 v9fs_issue_read+0x111/0x360 netfs_unbuffered_read_iter_locked+0x927/0x1390 netfs_unbuffered_read_iter+0xa2/0xe0 vfs_iocb_iter_read+0x2c7/0x460 erofs_fileio_rq_submit+0x46b/0x5b0 z_erofs_runqueue+0x1203/0x21e0 z_erofs_readahead+0x579/0x8b0 read_pages+0x19f/0xa70 page_cache_ra_order+0x4ad/0xb80 filemap_readahead.isra.0+0xe7/0x150 filemap_get_pages+0x7aa/0x1890 filemap_read+0x320/0xc80 vfs_read+0x6c6/0xa30 ksys_read+0xf9/0x1c0 do_syscall_64+0x9e/0x1a0 entry_SYSCALL_64_after_hwframe+0x71/0x79 Link: https://lkml.kernel.org/r/20250401144712.1377719-1-shengyong1@xiaomi.com Fixes: b9c0e49abfca ("mm: decline to manipulate the refcount on a slab page") Signed-off-by: Sheng Yong Reviewed-by: Matthew Wilcox (Oracle) Acked-by: Vlastimil Babka Cc: Signed-off-by: Andrew Morton --- lib/iov_iter.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 8c7fdb7d8c8f..bc9391e55d57 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -1191,7 +1191,7 @@ static ssize_t
__iov_iter_get_pages_alloc(struct iov_iter *i, return -ENOMEM; p = *pages; for (int k = 0; k < n; k++) { - struct folio *folio = page_folio(page); + struct folio *folio = page_folio(page + k); p[k] = page + k; if (!folio_test_slab(folio)) folio_get(folio); From a84edd52f0a0fa193f0f685769939cf84510755b Mon Sep 17 00:00:00 2001 From: "Vishal Moola (Oracle)" Date: Mon, 31 Mar 2025 19:10:24 -0700 Subject: [PATCH 07/31] mm/compaction: fix bug in hugetlb handling pathway The compaction code doesn't take references on pages until we're certain we should attempt to handle them. In the hugetlb case, isolate_or_dissolve_huge_page() may return -EBUSY without taking a reference to the folio associated with our pfn. If our folio's refcount drops to 0, compound_nr() becomes unpredictable, making low_pfn and nr_scanned unreliable. The user-visible effect is minimal - this should rarely happen (if ever). Fix this by storing the folio statistics earlier on the stack (just like the THP and Buddy cases). Also revert commit 66fe1cf7f581 ("mm: compaction: use helper compound_nr in isolate_migratepages_block") to make backporting easier. Link: https://lkml.kernel.org/r/20250401021025.637333-1-vishal.moola@gmail.com Fixes: 369fa227c219 ("mm: make alloc_contig_range handle free hugetlb pages") Signed-off-by: Vishal Moola (Oracle) Acked-by: Oscar Salvador Reviewed-by: Zi Yan Cc: Miaohe Lin Cc: Oscar Salvador Cc: Signed-off-by: Andrew Morton --- mm/compaction.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/compaction.c b/mm/compaction.c index 139f00c0308a..ca71fd3c3181 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -981,13 +981,13 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, } if (PageHuge(page)) { + const unsigned int order = compound_order(page); /* * skip hugetlbfs if we are not compacting for pages * bigger than its order. THPs and other compound pages * are handled below. */ if (!cc->alloc_contig) { - const unsigned int order = compound_order(page); if (order <= MAX_PAGE_ORDER) { low_pfn += (1UL << order) - 1; @@ -1011,8 +1011,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn, /* Do not report -EBUSY down the chain */ if (ret == -EBUSY) ret = 0; - low_pfn += compound_nr(page) - 1; - nr_scanned += compound_nr(page) - 1; + low_pfn += (1UL << order) - 1; + nr_scanned += (1UL << order) - 1; goto isolate_fail; } From c5bb27e2da3a6e1006f3e0aeb36d57c1dd1144aa Mon Sep 17 00:00:00 2001 From: Alexei Starovoitov Date: Sun, 30 Mar 2025 17:28:09 -0700 Subject: [PATCH 08/31] mm/page_alloc: avoid second trylock of zone->lock spin_trylock followed by spin_lock will cause extra write cache access. If the lock is contended, it may cause unnecessary cache line bouncing and will execute a redundant irq restore/save pair. Therefore, check alloc/fpi_flags first and use spin_trylock or spin_lock.
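In essence, the trylock attempt moves under the flag check. A minimal user-space sketch of the before/after shape (a pthread spinlock stands in for zone->lock, and the made-up TRYLOCK_ONLY flag stands in for ALLOC_TRYLOCK/FPI_TRYLOCK; this is an illustration, not the kernel code):

  #include <pthread.h>
  #include <stdbool.h>

  #define TRYLOCK_ONLY 0x1          /* hypothetical stand-in flag */

  static pthread_spinlock_t lock;   /* stand-in for zone->lock */

  /* Before: trylock is attempted unconditionally, so the contended
   * !TRYLOCK_ONLY case pays for a failed trylock (an extra write to
   * the lock cacheline) before taking the lock for real. */
  static bool lock_zone_before(unsigned int flags)
  {
          if (pthread_spin_trylock(&lock) != 0) {
                  if (flags & TRYLOCK_ONLY)
                          return false;
                  pthread_spin_lock(&lock);
          }
          return true;
  }

  /* After: branch on the already-known flags first, so each path
   * touches the lock word exactly once. */
  static bool lock_zone_after(unsigned int flags)
  {
          if (flags & TRYLOCK_ONLY)
                  return pthread_spin_trylock(&lock) == 0;
          pthread_spin_lock(&lock);
          return true;
  }

  int main(void)
  {
          pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
          if (lock_zone_before(0))
                  pthread_spin_unlock(&lock);
          if (lock_zone_after(TRYLOCK_ONLY))
                  pthread_spin_unlock(&lock);
          pthread_spin_destroy(&lock);
          return 0;
  }

The irqsave handling is elided in the sketch; in the kernel both branches also save and restore interrupt flags, as the diff below shows.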
Link: https://lkml.kernel.org/r/20250331002809.94758-1-alexei.starovoitov@gmail.com Fixes: 97769a53f117 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation") Signed-off-by: Alexei Starovoitov Suggested-by: Linus Torvalds Reviewed-by: Sebastian Andrzej Siewior Acked-by: Michal Hocko Acked-by: Vlastimil Babka Reviewed-by: Harry Yoo Reviewed-by: Shakeel Butt Cc: Andrii Nakryiko Cc: Daniel Borkmann Cc: Martin KaFai Lau Cc: Michal Hocko Cc: Peter Zijlstra Cc: Steven Rostedt Signed-off-by: Andrew Morton --- mm/page_alloc.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fd6b865cb1ab..9a219fe8e130 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1400,11 +1400,12 @@ static void free_one_page(struct zone *zone, struct page *page, struct llist_head *llhead; unsigned long flags; - if (!spin_trylock_irqsave(&zone->lock, flags)) { - if (unlikely(fpi_flags & FPI_TRYLOCK)) { + if (unlikely(fpi_flags & FPI_TRYLOCK)) { + if (!spin_trylock_irqsave(&zone->lock, flags)) { add_page_to_zone_llist(zone, page, order); return; } + } else { spin_lock_irqsave(&zone->lock, flags); } @@ -2314,9 +2315,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, unsigned long flags; int i; - if (!spin_trylock_irqsave(&zone->lock, flags)) { - if (unlikely(alloc_flags & ALLOC_TRYLOCK)) + if (unlikely(alloc_flags & ALLOC_TRYLOCK)) { + if (!spin_trylock_irqsave(&zone->lock, flags)) return 0; + } else { spin_lock_irqsave(&zone->lock, flags); } for (i = 0; i < count; ++i) { @@ -2937,9 +2939,10 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, do { page = NULL; - if (!spin_trylock_irqsave(&zone->lock, flags)) { - if (unlikely(alloc_flags & ALLOC_TRYLOCK)) + if (unlikely(alloc_flags & ALLOC_TRYLOCK)) { + if (!spin_trylock_irqsave(&zone->lock, flags)) return NULL; + } else { spin_lock_irqsave(&zone->lock, flags); } if (alloc_flags & ALLOC_HIGHATOMIC) From 382360d289c1e1d89df06f33c4d5b874b38ef267 Mon Sep 17 00:00:00 2001 From: Liu Shixin Date: Thu, 3 Apr 2025 14:41:38 +0800 Subject: [PATCH 09/31] mm/hugetlb: fix nid mismatch in alloc_surplus_hugetlb_folio() It's wrong to use nid directly since the nid may be changed during allocation. Use folio_nid() to obtain the nid of the folio instead. Fixes: 2273dea6b1e1 ("mm/hugetlb: update nr_huge_pages and surplus_huge_pages together") Link: https://lkml.kernel.org/r/20250403064138.2867929-1-liushixin2@huawei.com Signed-off-by: Liu Shixin Acked-by: Oscar Salvador Cc: David Hildenbrand Cc: Kefeng Wang Cc: Muchun Song Cc: Nanyong Sun Signed-off-by: Andrew Morton --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 39f92aad7bd1..6670f9b9e07a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2271,7 +2271,7 @@ static struct folio *alloc_surplus_hugetlb_folio(struct hstate *h, * as surplus_pages, otherwise it might confuse * persistent_huge_pages() momentarily. */ - __prep_account_new_huge_page(h, nid); + __prep_account_new_huge_page(h, folio_nid(folio)); /* * We could have raced with the pool size change. From 35e214b11df8a73dc4b7256d3ca5d647be83c585 Mon Sep 17 00:00:00 2001 From: Baoquan He Date: Fri, 28 Mar 2025 18:44:02 +0800 Subject: [PATCH 10/31] MAINTAINERS: add Andrew and Baoquan as kexec maintainers Add Andrew as kexec/kdump maintainer because he has been helping to review and merge kexec/kdump patches that are ready.
And I would like to nominate myself as kexec maintainer because I always try to review generic kexec code. Link: https://lkml.kernel.org/r/20250328104402.16826-1-bhe@redhat.com Signed-off-by: Baoquan He Acked-by: Dave Young Acked-by: Simon Horman Cc: Eric Biederman Signed-off-by: Andrew Morton --- MAINTAINERS | 3 +++ 1 file changed, 3 insertions(+) diff --git a/MAINTAINERS b/MAINTAINERS index 96b827049501..de97cd54ff24 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -12810,6 +12810,7 @@ F: lib/Kconfig.kcsan F: scripts/Makefile.kcsan KDUMP +M: Andrew Morton M: Baoquan He R: Vivek Goyal R: Dave Young @@ -13111,6 +13112,8 @@ F: fs/kernfs/ F: include/linux/kernfs.h KEXEC +M: Andrew Morton +M: Baoquan He L: kexec@lists.infradead.org W: http://kernel.org/pub/linux/utils/kernel/kexec/ F: include/linux/kexec.h From 9c02223e2d9df5cb37c51aedb78f3960294e09b5 Mon Sep 17 00:00:00 2001 From: Mark Brown Date: Fri, 4 Apr 2025 17:42:32 +0100 Subject: [PATCH 11/31] selftests/mm: generate a temporary mountpoint for cgroup filesystem Currently, if the filesystem for the cgroups version they want to use is not mounted, the charge_reserved_hugetlb.sh and hugetlb_reparenting_test.sh tests will attempt to mount it on the hard-coded path /dev/cgroup/memory, deleting that directory when the test finishes. This will fail if there is not a preexisting directory at that path, and since the directory is deleted, subsequent runs of the test will fail. Instead of relying on this hard-coded directory name, use mktemp to generate a temporary directory to use as a mountpoint, fixing both the assumption and the disruption caused by deleting a preexisting directory. This means that if the relevant cgroup filesystem is not already mounted then we rely on having coreutils (which provides mktemp) installed. I suspect that many current users are relying on having things automounted by default, and given that the script relies on bash, it's probably not an unreasonable requirement.
Link: https://lkml.kernel.org/r/20250404-kselftest-mm-cgroup2-detection-v1-1-3dba6d32ba8c@kernel.org Fixes: 209376ed2a84 ("selftests/vm: make charge_reserved_hugetlb.sh work with existing cgroup setting") Signed-off-by: Mark Brown Cc: Aishwarya TCV Cc: Mark Brown Cc: Mina Almasry Cc: Shuah Khan Cc: Waiman Long Cc: Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/charge_reserved_hugetlb.sh | 4 ++-- tools/testing/selftests/mm/hugetlb_reparenting_test.sh | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh index 67df7b47087f..e1fe16bcbbe8 100755 --- a/tools/testing/selftests/mm/charge_reserved_hugetlb.sh +++ b/tools/testing/selftests/mm/charge_reserved_hugetlb.sh @@ -29,7 +29,7 @@ fi if [[ $cgroup2 ]]; then cgroup_path=$(mount -t cgroup2 | head -1 | awk '{print $3}') if [[ -z "$cgroup_path" ]]; then - cgroup_path=/dev/cgroup/memory + cgroup_path=$(mktemp -d) mount -t cgroup2 none $cgroup_path do_umount=1 fi @@ -37,7 +37,7 @@ if [[ $cgroup2 ]]; then else cgroup_path=$(mount -t cgroup | grep ",hugetlb" | awk '{print $3}') if [[ -z "$cgroup_path" ]]; then - cgroup_path=/dev/cgroup/memory + cgroup_path=$(mktemp -d) mount -t cgroup memory,hugetlb $cgroup_path do_umount=1 fi diff --git a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh b/tools/testing/selftests/mm/hugetlb_reparenting_test.sh index 11f9bbe7dc22..0b0d4ba1af27 100755 --- a/tools/testing/selftests/mm/hugetlb_reparenting_test.sh +++ b/tools/testing/selftests/mm/hugetlb_reparenting_test.sh @@ -23,7 +23,7 @@ fi if [[ $cgroup2 ]]; then CGROUP_ROOT=$(mount -t cgroup2 | head -1 | awk '{print $3}') if [[ -z "$CGROUP_ROOT" ]]; then - CGROUP_ROOT=/dev/cgroup/memory + CGROUP_ROOT=$(mktemp -d) mount -t cgroup2 none $CGROUP_ROOT do_umount=1 fi From 41e6ddcaa0f18dda4c3fadf22533775a30d6f72f Mon Sep 17 00:00:00 2001 From: Lorenzo Stoakes Date: Fri, 21 Mar 2025 10:09:37 +0000 Subject: [PATCH 12/31] mm/vma: add give_up_on_oom option on modify/merge, use in uffd release Currently, if a VMA merge fails due to an OOM condition arising on commit merge or a failure to duplicate anon_vma's, we report this so the caller can handle it. However there are cases where the caller is only ostensibly trying a merge, and doesn't mind if it fails due to this condition. Since we do not want to introduce an implicit assumption that we only actually modify VMAs after OOM conditions might arise, add a 'give up on oom' option and make an explicit contract that, should this flag be set, we absolutely will not modify any VMAs should OOM arise and just bail out. Since it'd be very unusual for a user to try to vma_modify() with this flag set but be specifying a range within a VMA which ends up being split (which can fail due to rlimit issues, not only OOM), we add a debug warning for this condition. The motivating reason for this is uffd release - syzkaller (and Pedro Falcato's VERY astute analysis) found a way in which an injected fault on allocation, triggering an OOM condition on commit merge, would result in uffd code becoming confused and treating an error value as if it were a VMA pointer. To avoid this, we make use of this new VMG flag to ensure that this never occurs, utilising the fact that, should we be clearing entire VMAs, we do not wish an OOM event to be reported to us. 
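The contract is workable because commit_merge() performs all fallible allocation before any VMA is modified, so "giving up" can simply mean returning with nothing changed. A simplified sketch of that preallocate-then-commit shape (plain C with hypothetical names, not the mm API):

  #include <errno.h>
  #include <stdbool.h>
  #include <stdlib.h>

  struct obj { int value; };

  /*
   * Everything that can fail (here: one allocation) happens before *obj
   * is touched. On allocation failure the object is guaranteed
   * untouched, so a caller that set give_up_on_oom can treat the failed
   * operation as a silent no-op rather than an error.
   */
  static int commit_change(struct obj *obj, bool give_up_on_oom)
  {
          int *scratch = malloc(64 * sizeof(*scratch)); /* fallible work up front */

          if (!scratch) {
                  /* Nothing has been modified: no harm, no foul. */
                  return give_up_on_oom ? 0 : -ENOMEM;
          }

          /* Past this point nothing can fail, so it is safe to mutate. */
          obj->value++;
          free(scratch);
          return 0;
  }

  int main(void)
  {
          struct obj o = { 0 };

          (void)commit_change(&o, true); /* on OOM: silently leaves o untouched */
          return 0;
  }

In the uffd release path below, give_up_on_oom is set exactly when the range spans the whole VMA, i.e. when no split (which can fail for reasons other than OOM) will be needed.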
Many thanks to Pedro Falcato for his excellent analysis and Jann Horn for his insightful and intelligent analysis of the situation, both of whom were instrumental in this fix. Link: https://lkml.kernel.org/r/20250321100937.46634-1-lorenzo.stoakes@oracle.com Reported-by: syzbot+20ed41006cf9d842c2b5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67dc67f0.050a0220.25ae54.001e.GAE@google.com/ Fixes: 47b16d0462a4 ("mm: abort vma_modify() on merge out of memory failure") Signed-off-by: Lorenzo Stoakes Suggested-by: Pedro Falcato Suggested-by: Jann Horn Cc: Signed-off-by: Andrew Morton --- mm/userfaultfd.c | 13 ++++++++++-- mm/vma.c | 51 ++++++++++++++++++++++++++++++++++++++++++++---- mm/vma.h | 9 ++++++++- 3 files changed, 66 insertions(+), 7 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index fbf2cf62ab9f..7d5d709cc838 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1902,6 +1902,14 @@ struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi, unsigned long end) { struct vm_area_struct *ret; + bool give_up_on_oom = false; + + /* + * If we are modifying only and not splitting, just give up on the merge + * if OOM prevents us from merging successfully. + */ + if (start == vma->vm_start && end == vma->vm_end) + give_up_on_oom = true; /* Reset ptes for the whole vma range if wr-protected */ if (userfaultfd_wp(vma)) @@ -1909,7 +1917,7 @@ struct vm_area_struct *userfaultfd_clear_vma(struct vma_iterator *vmi, ret = vma_modify_flags_uffd(vmi, prev, vma, start, end, vma->vm_flags & ~__VM_UFFD_FLAGS, - NULL_VM_UFFD_CTX); + NULL_VM_UFFD_CTX, give_up_on_oom); /* * In the vma_merge() successful mprotect-like case 8: @@ -1960,7 +1968,8 @@ int userfaultfd_register_range(struct userfaultfd_ctx *ctx, new_flags = (vma->vm_flags & ~__VM_UFFD_FLAGS) | vm_flags; vma = vma_modify_flags_uffd(&vmi, prev, vma, start, vma_end, new_flags, - (struct vm_userfaultfd_ctx){ctx}); + (struct vm_userfaultfd_ctx){ctx}, + /* give_up_on_oom = */false); if (IS_ERR(vma)) return PTR_ERR(vma); diff --git a/mm/vma.c b/mm/vma.c index 5cdc5612bfc1..839d12f02c88 100644 --- a/mm/vma.c +++ b/mm/vma.c @@ -666,6 +666,9 @@ static void vmg_adjust_set_range(struct vma_merge_struct *vmg) /* * Actually perform the VMA merge operation. * + * IMPORTANT: We guarantee that, should vmg->give_up_on_oom is set, to not + * modify any VMAs or cause inconsistent state should an OOM condition arise. + * * Returns 0 on success, or an error value on failure. */ static int commit_merge(struct vma_merge_struct *vmg) @@ -685,6 +688,12 @@ static int commit_merge(struct vma_merge_struct *vmg) init_multi_vma_prep(&vp, vma, vmg); + /* + * If vmg->give_up_on_oom is set, we're safe, because we don't actually + * manipulate any VMAs until we succeed at preallocation. + * + * Past this point, we will not return an error. + */ if (vma_iter_prealloc(vmg->vmi, vma)) return -ENOMEM; @@ -915,7 +924,13 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( if (anon_dup) unlink_anon_vmas(anon_dup); - vmg->state = VMA_MERGE_ERROR_NOMEM; + /* + * We've cleaned up any cloned anon_vma's, no VMAs have been + * modified, no harm no foul if the user requests that we not + * report this and just give up, leaving the VMAs unmerged. 
+ */ + if (!vmg->give_up_on_oom) + vmg->state = VMA_MERGE_ERROR_NOMEM; return NULL; } @@ -926,7 +941,15 @@ static __must_check struct vm_area_struct *vma_merge_existing_range( abort: vma_iter_set(vmg->vmi, start); vma_iter_load(vmg->vmi); - vmg->state = VMA_MERGE_ERROR_NOMEM; + + /* + * This means we have failed to clone anon_vma's correctly, but no + * actual changes to VMAs have occurred, so no harm no foul - if the + * user doesn't want this reported and instead just wants to give up on + * the merge, allow it. + */ + if (!vmg->give_up_on_oom) + vmg->state = VMA_MERGE_ERROR_NOMEM; return NULL; } @@ -1068,6 +1091,10 @@ int vma_expand(struct vma_merge_struct *vmg) /* This should already have been checked by this point. */ VM_WARN_ON_VMG(!can_merge_remove_vma(next), vmg); vma_start_write(next); + /* + * In this case we don't report OOM, so vmg->give_up_on_mm is + * safe. + */ ret = dup_anon_vma(middle, next, &anon_dup); if (ret) return ret; @@ -1090,9 +1117,15 @@ int vma_expand(struct vma_merge_struct *vmg) return 0; nomem: - vmg->state = VMA_MERGE_ERROR_NOMEM; if (anon_dup) unlink_anon_vmas(anon_dup); + /* + * If the user requests that we just give upon OOM, we are safe to do so + * here, as commit merge provides this contract to us. Nothing has been + * changed - no harm no foul, just don't report it. + */ + if (!vmg->give_up_on_oom) + vmg->state = VMA_MERGE_ERROR_NOMEM; return -ENOMEM; } @@ -1534,6 +1567,13 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg) if (vmg_nomem(vmg)) return ERR_PTR(-ENOMEM); + /* + * Split can fail for reasons other than OOM, so if the user requests + * this it's probably a mistake. + */ + VM_WARN_ON(vmg->give_up_on_oom && + (vma->vm_start != start || vma->vm_end != end)); + /* Split any preceding portion of the VMA. */ if (vma->vm_start < start) { int err = split_vma(vmg->vmi, vma, start, 1); @@ -1602,12 +1642,15 @@ struct vm_area_struct struct vm_area_struct *vma, unsigned long start, unsigned long end, unsigned long new_flags, - struct vm_userfaultfd_ctx new_ctx) + struct vm_userfaultfd_ctx new_ctx, + bool give_up_on_oom) { VMG_VMA_STATE(vmg, vmi, prev, vma, start, end); vmg.flags = new_flags; vmg.uffd_ctx = new_ctx; + if (give_up_on_oom) + vmg.give_up_on_oom = true; return vma_modify(&vmg); } diff --git a/mm/vma.h b/mm/vma.h index 7356ca5a22d3..149926e8a6d1 100644 --- a/mm/vma.h +++ b/mm/vma.h @@ -114,6 +114,12 @@ struct vma_merge_struct { */ bool just_expand :1; + /* + * If a merge is possible, but an OOM error occurs, give up and don't + * execute the merge, returning NULL. 
+ */ + bool give_up_on_oom :1; + /* Internal flags set during merge process: */ /* @@ -255,7 +261,8 @@ __must_check struct vm_area_struct struct vm_area_struct *vma, unsigned long start, unsigned long end, unsigned long new_flags, - struct vm_userfaultfd_ctx new_ctx); + struct vm_userfaultfd_ctx new_ctx, + bool give_up_on_oom); __must_check struct vm_area_struct *vma_merge_new_range(struct vma_merge_struct *vmg); From a5561c88cf3cddc405ae7a0ef6b606224ddabf0c Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 24 Mar 2025 18:32:27 +0100 Subject: [PATCH 13/31] ASN.1: add module description This is needed to avoid a build warning: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/asn1_decoder.o Link: https://lkml.kernel.org/r/20250324173242.1501003-2-arnd@kernel.org Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") Signed-off-by: Arnd Bergmann Tested-by: Geert Uytterhoeven Cc: Jeff Johnson Cc: Masahiro Yamada Cc: Stephen Rothwell Signed-off-by: Andrew Morton --- lib/asn1_decoder.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/asn1_decoder.c b/lib/asn1_decoder.c index 13da529e2e72..5738ae286b41 100644 --- a/lib/asn1_decoder.c +++ b/lib/asn1_decoder.c @@ -518,4 +518,5 @@ int asn1_ber_decoder(const struct asn1_decoder *decoder, } EXPORT_SYMBOL_GPL(asn1_ber_decoder); +MODULE_DESCRIPTION("Decoder for ASN.1 BER/DER/CER encoded bytestream"); MODULE_LICENSE("GPL"); From 10764175baf4531d0f4b1147fd113660ea476226 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 24 Mar 2025 18:32:28 +0100 Subject: [PATCH 14/31] samples/livepatch: add module descriptions Every module should have a description, so add one for each of these modules. [akpm@linux-foundation.org: match the livepatch-callbacks-mod.c description, per Petr] Link: https://lkml.kernel.org/r/20250324173242.1501003-3-arnd@kernel.org Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") Signed-off-by: Arnd Bergmann Reviewed-by: Petr Mladek Cc: Christophe Leroy Cc: Easwar Hariharan Cc: Jeff Johnson Cc: Jiri Kosina Cc: Joe Lawrence Cc: Josh Poimboeuf Cc: Masahiro Yamada Cc: Miroslav Benes Cc: Stephen Rothwell Signed-off-by: Andrew Morton --- samples/livepatch/livepatch-callbacks-busymod.c | 1 + samples/livepatch/livepatch-callbacks-demo.c | 1 + samples/livepatch/livepatch-callbacks-mod.c | 1 + samples/livepatch/livepatch-sample.c | 1 + samples/livepatch/livepatch-shadow-fix1.c | 1 + samples/livepatch/livepatch-shadow-fix2.c | 1 + 6 files changed, 6 insertions(+) diff --git a/samples/livepatch/livepatch-callbacks-busymod.c b/samples/livepatch/livepatch-callbacks-busymod.c index 69105596e72e..fadc2a85cb35 100644 --- a/samples/livepatch/livepatch-callbacks-busymod.c +++ b/samples/livepatch/livepatch-callbacks-busymod.c @@ -56,4 +56,5 @@ static void livepatch_callbacks_mod_exit(void) module_init(livepatch_callbacks_mod_init); module_exit(livepatch_callbacks_mod_exit); +MODULE_DESCRIPTION("Live patching demo for (un)patching callbacks, support module"); MODULE_LICENSE("GPL"); diff --git a/samples/livepatch/livepatch-callbacks-demo.c b/samples/livepatch/livepatch-callbacks-demo.c index 11c3f4357812..9e69d9caed25 100644 --- a/samples/livepatch/livepatch-callbacks-demo.c +++ b/samples/livepatch/livepatch-callbacks-demo.c @@ -192,5 +192,6 @@ static void livepatch_callbacks_demo_exit(void) module_init(livepatch_callbacks_demo_init); module_exit(livepatch_callbacks_demo_exit); +MODULE_DESCRIPTION("Live patching demo for (un)patching callbacks"); MODULE_LICENSE("GPL"); MODULE_INFO(livepatch, "Y"); diff --git
a/samples/livepatch/livepatch-callbacks-mod.c b/samples/livepatch/livepatch-callbacks-mod.c index 2a074f422a51..d1851b471ad9 100644 --- a/samples/livepatch/livepatch-callbacks-mod.c +++ b/samples/livepatch/livepatch-callbacks-mod.c @@ -38,4 +38,5 @@ static void livepatch_callbacks_mod_exit(void) module_init(livepatch_callbacks_mod_init); module_exit(livepatch_callbacks_mod_exit); +MODULE_DESCRIPTION("Live patching demo for (un)patching callbacks, support module"); MODULE_LICENSE("GPL"); diff --git a/samples/livepatch/livepatch-sample.c b/samples/livepatch/livepatch-sample.c index cd76d7ebe598..5263a2f31c48 100644 --- a/samples/livepatch/livepatch-sample.c +++ b/samples/livepatch/livepatch-sample.c @@ -66,5 +66,6 @@ static void livepatch_exit(void) module_init(livepatch_init); module_exit(livepatch_exit); +MODULE_DESCRIPTION("Kernel Live Patching Sample Module"); MODULE_LICENSE("GPL"); MODULE_INFO(livepatch, "Y"); diff --git a/samples/livepatch/livepatch-shadow-fix1.c b/samples/livepatch/livepatch-shadow-fix1.c index f3f153895d6c..cbf68ca40097 100644 --- a/samples/livepatch/livepatch-shadow-fix1.c +++ b/samples/livepatch/livepatch-shadow-fix1.c @@ -168,5 +168,6 @@ static void livepatch_shadow_fix1_exit(void) module_init(livepatch_shadow_fix1_init); module_exit(livepatch_shadow_fix1_exit); +MODULE_DESCRIPTION("Live patching demo for shadow variables"); MODULE_LICENSE("GPL"); MODULE_INFO(livepatch, "Y"); diff --git a/samples/livepatch/livepatch-shadow-fix2.c b/samples/livepatch/livepatch-shadow-fix2.c index 361046a4f10c..b99122cb221f 100644 --- a/samples/livepatch/livepatch-shadow-fix2.c +++ b/samples/livepatch/livepatch-shadow-fix2.c @@ -128,5 +128,6 @@ static void livepatch_shadow_fix2_exit(void) module_init(livepatch_shadow_fix2_init); module_exit(livepatch_shadow_fix2_exit); +MODULE_DESCRIPTION("Live patching demo for shadow variables"); MODULE_LICENSE("GPL"); MODULE_INFO(livepatch, "Y"); From 6810431bc47301376174331f6d85a70c8520c941 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 24 Mar 2025 18:32:29 +0100 Subject: [PATCH 15/31] fpga: tests: add module descriptions Modules without a description now cause a warning: WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/fpga/tests/fpga-bridge-test.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/fpga/tests/fpga-mgr-test.o WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/fpga/tests/fpga-region-test.o Link: https://lkml.kernel.org/r/20250324173242.1501003-4-arnd@kernel.org Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") Signed-off-by: Arnd Bergmann Cc: Hao Wu Cc: Jeff Johnson Cc: Marco Pagani Cc: Masahiro Yamada Cc: Moritz Fischer Cc: Russ Weight Cc: Stephen Rothwell Cc: Tom Rix Cc: Xu Yilun Signed-off-by: Andrew Morton --- drivers/fpga/tests/fpga-bridge-test.c | 1 + drivers/fpga/tests/fpga-mgr-test.c | 1 + drivers/fpga/tests/fpga-region-test.c | 1 + 3 files changed, 3 insertions(+) diff --git a/drivers/fpga/tests/fpga-bridge-test.c b/drivers/fpga/tests/fpga-bridge-test.c index b9ab29809e96..124ba40e32b1 100644 --- a/drivers/fpga/tests/fpga-bridge-test.c +++ b/drivers/fpga/tests/fpga-bridge-test.c @@ -170,4 +170,5 @@ static struct kunit_suite fpga_bridge_suite = { kunit_test_suite(fpga_bridge_suite); +MODULE_DESCRIPTION("KUnit test for the FPGA Bridge"); MODULE_LICENSE("GPL"); diff --git a/drivers/fpga/tests/fpga-mgr-test.c b/drivers/fpga/tests/fpga-mgr-test.c index 9cb37aefbac4..8748babb0504 100644 --- a/drivers/fpga/tests/fpga-mgr-test.c +++ b/drivers/fpga/tests/fpga-mgr-test.c @@ -330,4
+330,5 @@ static struct kunit_suite fpga_mgr_suite = { kunit_test_suite(fpga_mgr_suite); +MODULE_DESCRIPTION("KUnit test for the FPGA Manager"); MODULE_LICENSE("GPL"); diff --git a/drivers/fpga/tests/fpga-region-test.c b/drivers/fpga/tests/fpga-region-test.c index 6a108cafded8..020ceac48509 100644 --- a/drivers/fpga/tests/fpga-region-test.c +++ b/drivers/fpga/tests/fpga-region-test.c @@ -214,4 +214,5 @@ static struct kunit_suite fpga_region_suite = { kunit_test_suite(fpga_region_suite); +MODULE_DESCRIPTION("KUnit test for the FPGA Region"); MODULE_LICENSE("GPL"); From 75dd4975f569c179284fa644e5be24cbfa77b0ee Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 24 Mar 2025 18:32:31 +0100 Subject: [PATCH 16/31] zlib: add module description Modules without a description now cause a warning: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/zlib_inflate/zlib_inflate.o Link: https://lkml.kernel.org/r/20250324173242.1501003-6-arnd@kernel.org Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") Signed-off-by: Arnd Bergmann Cc: Jeff Johnson Cc: Masahiro Yamada Cc: Stephen Rothwell Signed-off-by: Andrew Morton --- lib/zlib_inflate/inflate_syms.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/zlib_inflate/inflate_syms.c b/lib/zlib_inflate/inflate_syms.c index 9720114c0672..b8996d90e8bc 100644 --- a/lib/zlib_inflate/inflate_syms.c +++ b/lib/zlib_inflate/inflate_syms.c @@ -18,4 +18,5 @@ EXPORT_SYMBOL(zlib_inflateEnd); EXPORT_SYMBOL(zlib_inflateReset); EXPORT_SYMBOL(zlib_inflateIncomp); EXPORT_SYMBOL(zlib_inflate_blob); +MODULE_DESCRIPTION("Data decompression using the deflation algorithm"); MODULE_LICENSE("GPL"); From 91640531b92ec63e260b9d5c681d387676ec462c Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 24 Mar 2025 18:32:32 +0100 Subject: [PATCH 17/31] ucs2_string: add module description Modules without a description now cause a warning: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/ucs2_string.o Link: https://lkml.kernel.org/r/20250324173242.1501003-7-arnd@kernel.org Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") Signed-off-by: Arnd Bergmann Cc: Jeff Johnson Cc: Masahiro Yamada Cc: Stephen Rothwell Signed-off-by: Andrew Morton --- lib/ucs2_string.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/ucs2_string.c b/lib/ucs2_string.c index 9308bcfb2ad5..dfb4f2358cab 100644 --- a/lib/ucs2_string.c +++ b/lib/ucs2_string.c @@ -165,4 +165,5 @@ ucs2_as_utf8(u8 *dest, const ucs2_char_t *src, unsigned long maxlength) } EXPORT_SYMBOL(ucs2_as_utf8); +MODULE_DESCRIPTION("UCS2 string handling"); MODULE_LICENSE("GPL v2"); From e2ffee91c40fef59ec84dd7379aa634dcefa0a42 Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 24 Mar 2025 18:32:34 +0100 Subject: [PATCH 18/31] mm/kasan: add module description Modules without a description now cause a warning: WARNING: modpost: missing MODULE_DESCRIPTION() in mm/kasan/kasan_test.o [akpm@linux-foundation.org: update description text, per Andrey] Link: https://lkml.kernel.org/r/20250324173242.1501003-9-arnd@kernel.org Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") Signed-off-by: Arnd Bergmann Reviewed-by: Andrey Konovalov Cc: Alexander Potapenko Cc: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Jann Horn Cc: Jeff Johnson Cc: Marco Elver Cc: Masahiro Yamada Cc: Nihar Chaithanya Cc: Peter Zijlstra Cc: Sabyrzhan Tasbolatov Cc: Stephen Rothwell Cc: Vincenzo Frascino Signed-off-by: Andrew Morton --- mm/kasan/kasan_test_c.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/kasan/kasan_test_c.c
b/mm/kasan/kasan_test_c.c index 3ea317837c2d..f24e3bef72a4 100644 --- a/mm/kasan/kasan_test_c.c +++ b/mm/kasan/kasan_test_c.c @@ -2127,4 +2127,5 @@ static struct kunit_suite kasan_kunit_test_suite = { kunit_test_suite(kasan_kunit_test_suite); +MODULE_DESCRIPTION("KUnit tests for checking KASAN bug-detection capabilities"); MODULE_LICENSE("GPL"); From 61c4e6ca8c9364e08c2c132d1bfae2f85af42c7a Mon Sep 17 00:00:00 2001 From: Arnd Bergmann Date: Mon, 24 Mar 2025 18:32:35 +0100 Subject: [PATCH 19/31] kunit: slub: add module description Modules without a description now cause a warning: WARNING: modpost: missing MODULE_DESCRIPTION() in lib/tests/slub_kunit.o Link: https://lkml.kernel.org/r/20250324173242.1501003-10-arnd@kernel.org Fixes: 6c6c1fc09de3 ("modpost: require a MODULE_DESCRIPTION()") Signed-off-by: Arnd Bergmann Cc: Guenter Roeck Cc: Jeff Johnson Cc: Masahiro Yamada Cc: Pei Xiao Cc: Rae Moar Cc: Stephen Rothwell Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- lib/tests/slub_kunit.c | 1 + 1 file changed, 1 insertion(+) diff --git a/lib/tests/slub_kunit.c b/lib/tests/slub_kunit.c index d47c472b0520..848b682a2d70 100644 --- a/lib/tests/slub_kunit.c +++ b/lib/tests/slub_kunit.c @@ -325,4 +325,5 @@ static struct kunit_suite test_suite = { }; kunit_test_suite(test_suite); +MODULE_DESCRIPTION("Kunit tests for slub allocator"); MODULE_LICENSE("GPL"); From 90abee6d7895d5eef18c91d870d8168be4e76e9d Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Mon, 7 Apr 2025 14:01:53 -0400 Subject: [PATCH 20/31] mm: page_alloc: speed up fallbacks in rmqueue_bulk() MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The test robot identified c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") as the root cause of a 56.4% regression in vm-scalability::lru-file-mmap-read. Carlos reports an earlier patch, c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion"), as the root cause for a regression in worst-case zone->lock+irqoff hold times. Both of these patches modify the page allocator's fallback path to be less greedy in an effort to stave off fragmentation. The flip side of this is that fallbacks are also less productive each time around, which means the fallback search can run much more frequently. Carlos' traces point to rmqueue_bulk() specifically, which tries to refill the percpu cache by allocating a large batch of pages in a loop. It highlights how once the native freelists are exhausted, the fallback code first scans orders top-down for whole blocks to claim, then falls back to a bottom-up search for the smallest buddy to steal. For the next batch page, it goes through the same thing again. This can be made more efficient. Since rmqueue_bulk() holds the zone->lock over the entire batch, the freelists are not subject to outside changes; when the search for a block to claim has already failed, there is no point in trying again for the next page. Modify __rmqueue() to remember the last successful fallback mode, and restart directly from there on the next rmqueue_bulk() iteration.
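The caching idea in isolation, as a stand-alone C model (the enum mirrors the one introduced in the patch below; the counters standing in for freelists and everything else are illustrative, not the kernel code):

  #include <stdio.h>

  enum rmqueue_mode { RMQUEUE_NORMAL, RMQUEUE_CMA, RMQUEUE_CLAIM, RMQUEUE_STEAL };

  /* Dummy freelists: pages available to each search step. */
  static int normal_pages, cma_pages, claimable_blocks, stealable_pages;

  static int take(int *n)
  {
          if (*n > 0) {
                  (*n)--;
                  return 1;
          }
          return 0;
  }

  /*
   * One allocation. *mode remembers the level at which the previous
   * attempt succeeded; because the (modeled) lock is held across the
   * whole batch, levels that already came up empty cannot refill, so
   * they are skipped on subsequent calls.
   */
  static int rmqueue_once(enum rmqueue_mode *mode)
  {
          switch (*mode) {
          case RMQUEUE_NORMAL:
                  if (take(&normal_pages))
                          return 1;
                  /* fallthrough */
          case RMQUEUE_CMA:
                  if (take(&cma_pages)) {
                          *mode = RMQUEUE_CMA;
                          return 1;
                  }
                  /* fallthrough */
          case RMQUEUE_CLAIM:
                  if (take(&claimable_blocks)) {
                          /* Claiming converts a whole block, refilling the
                           * preferred freelist: back to normal mode. */
                          normal_pages += 3;
                          *mode = RMQUEUE_NORMAL;
                          return 1;
                  }
                  /* fallthrough */
          case RMQUEUE_STEAL:
                  if (take(&stealable_pages)) {
                          *mode = RMQUEUE_STEAL;
                          return 1;
                  }
          }
          return 0;
  }

  int main(void)
  {
          enum rmqueue_mode mode = RMQUEUE_NORMAL;
          int got = 0;

          normal_pages = 2;
          stealable_pages = 16;
          while (got < 8 && rmqueue_once(&mode))
                  got++;
          printf("allocated %d pages, final mode %d\n", got, mode);
          return 0;
  }

Once a call has fallen through to stealing, subsequent calls in the same batch jump straight to that level instead of rescanning the levels that were already empty.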
Oliver confirms that this improves beyond the regression that the test robot reported against c2f6ea38fc1b:

commit:
  f3b92176f4 ("tools/selftests: add guard region test for /proc/$pid/pagemap")
  c2f6ea38fc ("mm: page_alloc: don't steal single pages from biggest buddy")
  acc4d5ff0b ("Merge tag 'net-6.15-rc0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
  2c847f27c3 ("mm: page_alloc: speed up fallbacks in rmqueue_bulk()") <--- your patch

vm-scalability.throughput (%change relative to f3b92176f4):
  f3b92176f4:   25525364 ± 3%
  c2f6ea38fc:   11135467  (-56.4%)
  acc4d5ff0b:   10779336  (-57.8%)
  2c847f27c3:   33581409  (+31.6%)

Carlos confirms that worst-case times are almost fully recovered compared to before the earlier culprit patch:

  2dd482ba627d (before freelist hygiene):    1ms
  c0cd6f557b90 (after freelist hygiene):    90ms
  next-20250319 (steal smallest buddy):    280ms
  this patch:                                8ms

[jackmanb@google.com: comment updates] Link: https://lkml.kernel.org/r/D92AC0P9594X.3BML64MUKTF8Z@google.com [hannes@cmpxchg.org: reset rmqueue_mode in rmqueue_buddy() error loop, per Yunsheng Lin] Link: https://lkml.kernel.org/r/20250409140023.GA2313@cmpxchg.org Link: https://lkml.kernel.org/r/20250407180154.63348-1-hannes@cmpxchg.org Fixes: c0cd6f557b90 ("mm: page_alloc: fix freelist movement during block conversion") Fixes: c2f6ea38fc1b ("mm: page_alloc: don't steal single pages from biggest buddy") Signed-off-by: Johannes Weiner Signed-off-by: Brendan Jackman Reported-by: kernel test robot Reported-by: Carlos Song Tested-by: Carlos Song Tested-by: kernel test robot Closes: https://lore.kernel.org/oe-lkp/202503271547.fc08b188-lkp@intel.com Reviewed-by: Brendan Jackman Tested-by: Shivank Garg Acked-by: Zi Yan Reviewed-by: Vlastimil Babka Cc: [6.10+] Signed-off-by: Andrew Morton --- mm/page_alloc.c | 113 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 79 insertions(+), 34 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 9a219fe8e130..1715e34b91af 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2183,23 +2183,15 @@ try_to_claim_block(struct zone *zone, struct page *page, } /* - * Try finding a free buddy page on the fallback list. - * - * This will attempt to claim a whole pageblock for the requested type - * to ensure grouping of such requests in the future. - * - * If a whole block cannot be claimed, steal an individual page, regressing to - * __rmqueue_smallest() logic to at least break up as little contiguity as - * possible. + * Try to allocate from some fallback migratetype by claiming the entire block, + * i.e. converting it to the allocation's start migratetype. * * The use of signed ints for order and current_order is a deliberate * deviation from the rest of this file, to make the for loop * condition simpler. - * - * Return the stolen page, or NULL if none can be found.
*/ static __always_inline struct page * -__rmqueue_fallback(struct zone *zone, int order, int start_migratetype, +__rmqueue_claim(struct zone *zone, int order, int start_migratetype, unsigned int alloc_flags) { struct free_area *area; @@ -2237,14 +2229,29 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = try_to_claim_block(zone, page, current_order, order, start_migratetype, fallback_mt, alloc_flags); - if (page) - goto got_one; + if (page) { + trace_mm_page_alloc_extfrag(page, order, current_order, + start_migratetype, fallback_mt); + return page; + } } - if (alloc_flags & ALLOC_NOFRAGMENT) - return NULL; + return NULL; +} + +/* + * Try to steal a single page from some fallback migratetype. Leave the rest of + * the block as its current migratetype, potentially causing fragmentation. + */ +static __always_inline struct page * +__rmqueue_steal(struct zone *zone, int order, int start_migratetype) +{ + struct free_area *area; + int current_order; + struct page *page; + int fallback_mt; + bool claim_block; - /* No luck claiming pageblock. Find the smallest fallback page */ for (current_order = order; current_order < NR_PAGE_ORDERS; current_order++) { area = &(zone->free_area[current_order]); fallback_mt = find_suitable_fallback(area, current_order, @@ -2254,25 +2261,28 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype, page = get_page_from_free_area(area, fallback_mt); page_del_and_expand(zone, page, order, current_order, fallback_mt); - goto got_one; + trace_mm_page_alloc_extfrag(page, order, current_order, + start_migratetype, fallback_mt); + return page; } return NULL; - -got_one: - trace_mm_page_alloc_extfrag(page, order, current_order, - start_migratetype, fallback_mt); - - return page; } +enum rmqueue_mode { + RMQUEUE_NORMAL, + RMQUEUE_CMA, + RMQUEUE_CLAIM, + RMQUEUE_STEAL, +}; + /* * Do the hard work of removing an element from the buddy allocator. * Call me with the zone->lock already held. */ static __always_inline struct page * __rmqueue(struct zone *zone, unsigned int order, int migratetype, - unsigned int alloc_flags) + unsigned int alloc_flags, enum rmqueue_mode *mode) { struct page *page; @@ -2291,16 +2301,48 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, } } - page = __rmqueue_smallest(zone, order, migratetype); - if (unlikely(!page)) { - if (alloc_flags & ALLOC_CMA) + /* + * First try the freelists of the requested migratetype, then try + * fallbacks modes with increasing levels of fragmentation risk. + * + * The fallback logic is expensive and rmqueue_bulk() calls in + * a loop with the zone->lock held, meaning the freelists are + * not subject to any outside changes. Remember in *mode where + * we found pay dirt, to save us the search on the next call. + */ + switch (*mode) { + case RMQUEUE_NORMAL: + page = __rmqueue_smallest(zone, order, migratetype); + if (page) + return page; + fallthrough; + case RMQUEUE_CMA: + if (alloc_flags & ALLOC_CMA) { page = __rmqueue_cma_fallback(zone, order); - - if (!page) - page = __rmqueue_fallback(zone, order, migratetype, - alloc_flags); + if (page) { + *mode = RMQUEUE_CMA; + return page; + } + } + fallthrough; + case RMQUEUE_CLAIM: + page = __rmqueue_claim(zone, order, migratetype, alloc_flags); + if (page) { + /* Replenished preferred freelist, back to normal mode. 
*/ + *mode = RMQUEUE_NORMAL; + return page; + } + fallthrough; + case RMQUEUE_STEAL: + if (!(alloc_flags & ALLOC_NOFRAGMENT)) { + page = __rmqueue_steal(zone, order, migratetype); + if (page) { + *mode = RMQUEUE_STEAL; + return page; + } + } } - return page; + return NULL; } /* @@ -2312,6 +2354,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, unsigned long count, struct list_head *list, int migratetype, unsigned int alloc_flags) { + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; unsigned long flags; int i; @@ -2323,7 +2366,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order, } for (i = 0; i < count; ++i) { struct page *page = __rmqueue(zone, order, migratetype, - alloc_flags); + alloc_flags, &rmqm); if (unlikely(page == NULL)) break; @@ -2948,7 +2991,9 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone, if (alloc_flags & ALLOC_HIGHATOMIC) page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC); if (!page) { - page = __rmqueue(zone, order, migratetype, alloc_flags); + enum rmqueue_mode rmqm = RMQUEUE_NORMAL; + + page = __rmqueue(zone, order, migratetype, alloc_flags, &rmqm); /* * If the allocation fails, allow OOM handling and From 60580e0bd587b1df7aa9f749ed735db8377acaab Mon Sep 17 00:00:00 2001 From: Frank van der Linden Date: Mon, 7 Apr 2025 16:54:35 +0000 Subject: [PATCH 21/31] mm/cma: report base address of single range correctly The cma_declare_contiguous_nid code was refactored by commit c009da4258f9 ("mm, cma: support multiple contiguous ranges, if requested"), so that it could use an internal function to attempt a single range area first, and then try a multi-range one. However, that meant that the actual base address used for the !fixed case (base == 0) wasn't available one level up to be printed in the informational message, and it would always end up printing a base address of 0 in the boot message. Make the internal function take a phys_addr_t pointer to the base address, so that the value is available to the caller. 
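In sketch form (a simplified rendering of the diff below, not the verbatim kernel code), the single-range attempt now passes the base by reference and reads back the address that was actually reserved:

    phys_addr_t start = 0;   /* 0 = no fixed base requested */
    int ret;

    /* First, try it the normal way, producing just one range. */
    ret = __cma_declare_contiguous_nid(&start, total_size, 0, align,
                                       order_per_bit, false, name,
                                       res_cma, nid);
    /* On success, 'start' holds the base address memblock chose,
     * so the caller can log the real address instead of 0. */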
[fvdl@google.com: v2] Link: https://lkml.kernel.org/r/20250408164000.3215690-1-fvdl@google.com Link: https://lkml.kernel.org/r/20250407165435.2567898-1-fvdl@google.com Fixes: c009da4258f9 ("mm, cma: support multiple contiguous ranges, if requested") Signed-off-by: Frank van der Linden Reported-by: Geert Uytterhoeven Closes: https://lore.kernel.org/linux-mm/CAMuHMdVWviQ7O9yBFE3f=ev0eVb1CnsQvR6SKtEROBbM6z7g3w@mail.gmail.com/ Tested-by: Geert Uytterhoeven Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/cma.c | 19 +++++++++++-------- 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index b06d5fe73399..15632939f20a 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -35,7 +35,7 @@ struct cma cma_areas[MAX_CMA_AREAS]; unsigned int cma_area_count; -static int __init __cma_declare_contiguous_nid(phys_addr_t base, +static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, phys_addr_t size, phys_addr_t limit, phys_addr_t alignment, unsigned int order_per_bit, bool fixed, const char *name, struct cma **res_cma, @@ -370,7 +370,7 @@ int __init cma_declare_contiguous_multi(phys_addr_t total_size, phys_addr_t align, unsigned int order_per_bit, const char *name, struct cma **res_cma, int nid) { - phys_addr_t start, end; + phys_addr_t start = 0, end; phys_addr_t size, sizesum, sizeleft; struct cma_init_memrange *mrp, *mlp, *failed; struct cma_memrange *cmrp; @@ -384,7 +384,7 @@ int __init cma_declare_contiguous_multi(phys_addr_t total_size, /* * First, try it the normal way, producing just one range. */ - ret = __cma_declare_contiguous_nid(0, total_size, 0, align, + ret = __cma_declare_contiguous_nid(&start, total_size, 0, align, order_per_bit, false, name, res_cma, nid); if (ret != -ENOMEM) goto out; @@ -580,7 +580,7 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, { int ret; - ret = __cma_declare_contiguous_nid(base, size, limit, alignment, + ret = __cma_declare_contiguous_nid(&base, size, limit, alignment, order_per_bit, fixed, name, res_cma, nid); if (ret != 0) pr_err("Failed to reserve %ld MiB\n", @@ -592,14 +592,14 @@ int __init cma_declare_contiguous_nid(phys_addr_t base, return ret; } -static int __init __cma_declare_contiguous_nid(phys_addr_t base, +static int __init __cma_declare_contiguous_nid(phys_addr_t *basep, phys_addr_t size, phys_addr_t limit, phys_addr_t alignment, unsigned int order_per_bit, bool fixed, const char *name, struct cma **res_cma, int nid) { phys_addr_t memblock_end = memblock_end_of_DRAM(); - phys_addr_t highmem_start; + phys_addr_t highmem_start, base = *basep; int ret; /* @@ -722,12 +722,15 @@ static int __init __cma_declare_contiguous_nid(phys_addr_t base, } ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma); - if (ret) + if (ret) { memblock_phys_free(base, size); + return ret; + } (*res_cma)->nid = nid; + *basep = base; - return ret; + return 0; } static void cma_debug_show_areas(struct cma *cma) From aabf58bfaacedb3f54801ba09c6d50daf83b74f4 Mon Sep 17 00:00:00 2001 From: Jinjiang Tu Date: Mon, 7 Apr 2025 20:47:06 +0800 Subject: [PATCH 22/31] mm/hugetlb: fix set_max_huge_pages() when there are surplus pages In set_max_huge_pages(), min_count is computed taking into account surplus huge pages, which in some cases can make it impossible to free huge pages; they end up accounted as surplus instead.
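To illustrate with round numbers (a simplified, single-pool view of the reproduction below): with no reservations, 10 hugetlb pages in total, 5 of them in-use surplus pages and 5 of them free persistent pages, the old computation min_count = resv_huge_pages + nr_huge_pages - free_huge_pages yields 0 + 10 - 5 = 5, so the pool can never shrink below 5 pages and the 5 free persistent pages end up converted to surplus instead of freed. With this patch, persistent_free_count = 5 (none of the free pages are surplus here), so min_count = resv_huge_pages + persistent_huge_pages() - persistent_free_count = 0 + 5 - 5 = 0, and all free persistent pages can be released.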
One way to solve it is to subtract surplus_huge_pages directly, but we cannot do it blindly, because there might be surplus pages that are also free pages, which can happen when we fail to restore the vmemmap for HVO-optimized pages. We could then be subtracting the same page twice. To work around this, let us first compute the number of free persistent pages, and use that along with the surplus pages to compute min_count.

Steps to reproduce:
1) create 5 hugetlb folios in Node0
2) run a program to use all the hugetlb folios
3) echo 0 > nr_hugepages for Node0 to free the hugetlb folios. Thus the 5 hugetlb folios in Node0 are accounted as surplus.
4) create 5 hugetlb folios in Node1
5) echo 0 > nr_hugepages for Node1 to free the hugetlb folios

The result:
          Node0    Node1
  Total       5        5
  Free        0        5
  Surp        5        5

The result with this patch:
          Node0    Node1
  Total       5        0
  Free        0        0
  Surp        5        0

Link: https://lkml.kernel.org/r/20250409055957.3774471-1-tujinjiang@huawei.com Link: https://lkml.kernel.org/r/20250407124706.2688092-1-tujinjiang@huawei.com Fixes: 9a30523066cd ("hugetlb: add per node hstate attributes") Signed-off-by: Jinjiang Tu Acked-by: Oscar Salvador Cc: David Hildenbrand Cc: Kefeng Wang Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/hugetlb.c | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6670f9b9e07a..ea065347866e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3825,6 +3825,7 @@ static int adjust_pool_surplus(struct hstate *h, nodemask_t *nodes_allowed, static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, nodemask_t *nodes_allowed) { + unsigned long persistent_free_count; unsigned long min_count; unsigned long allocated; struct folio *folio; @@ -3959,8 +3960,24 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, * though, we'll note that we're not allowed to exceed surplus * and won't grow the pool anywhere else. Not until one of the * sysctls are changed, or the surplus pages go out of use. + * + * min_count is the expected number of persistent pages, we + * shouldn't calculate min_count by using + * resv_huge_pages + persistent_huge_pages() - free_huge_pages, + * because there may exist free surplus huge pages, and this will + * lead to subtracting twice. Free surplus huge pages come from HVO + * failing to restore vmemmap, see comments in the callers of + * hugetlb_vmemmap_restore_folio(). Thus, we should calculate + * persistent free count first.
*/ - min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages; + persistent_free_count = h->free_huge_pages; + if (h->free_huge_pages > persistent_huge_pages(h)) { + if (h->free_huge_pages > h->surplus_huge_pages) + persistent_free_count -= h->surplus_huge_pages; + else + persistent_free_count = 0; + } + min_count = h->resv_huge_pages + persistent_huge_pages(h) - persistent_free_count; min_count = max(count, min_count); try_to_free_low(h, min_count, nodes_allowed); From 8c583e538aa681ecb293d5606054de70f44b5558 Mon Sep 17 00:00:00 2001 From: Baolin Wang Date: Mon, 7 Apr 2025 19:31:35 +0800 Subject: [PATCH 23/31] selftests: mincore: fix tmpfs mincore test failure When running mincore test cases, I encountered the following failures:

  mincore_selftest.c:359:check_tmpfs_mmap:Expected ra_pages (511) == 0 (0)
  mincore_selftest.c:360:check_tmpfs_mmap:Read-ahead pages found in memory
  check_tmpfs_mmap: Test terminated by assertion
  FAIL global.check_tmpfs_mmap
  not ok 5 global.check_tmpfs_mmap
  FAILED: 4 / 5 tests passed

The reason for the test case failure is that my system automatically enabled tmpfs large folio allocation by adding the 'transparent_hugepage_tmpfs=always' cmdline. However, the test case still expects the tmpfs mounted on /dev/shm to allocate small folios, which leads to assertion failures when verifying readahead pages. As discussed with David, there's no reason to continue checking the readahead logic for tmpfs. Drop it to fix this issue. Link: https://lkml.kernel.org/r/9a00856cc6a8b4e46f4ab8b1af11ce5fc1a31851.1744025467.git.baolin.wang@linux.alibaba.com Fixes: d635ccdb435c ("mm: shmem: add a kernel command line to change the default huge policy for tmpfs") Signed-off-by: Baolin Wang Acked-by: Zi Yan Acked-by: David Hildenbrand Cc: Barry Song <21cnbao@gmail.com> Cc: Hugh Dickins Cc: Matthew Wilcox (Oracle) Cc: Ryan Roberts Signed-off-by: Andrew Morton --- .../testing/selftests/mincore/mincore_selftest.c | 16 ++-------------- 1 file changed, 2 insertions(+), 14 deletions(-) diff --git a/tools/testing/selftests/mincore/mincore_selftest.c b/tools/testing/selftests/mincore/mincore_selftest.c index e949a43a6145..0fd4b00bd345 100644 --- a/tools/testing/selftests/mincore/mincore_selftest.c +++ b/tools/testing/selftests/mincore/mincore_selftest.c @@ -286,8 +286,7 @@ TEST(check_file_mmap) /* * Test mincore() behavior on a page backed by a tmpfs file. This test - * performs the same steps as the previous one. However, we don't expect - * any readahead in this case. + * performs the same steps as the previous one. */ TEST(check_tmpfs_mmap) { @@ -298,7 +297,6 @@ TEST(check_tmpfs_mmap) int page_size; int fd; int i; - int ra_pages = 0; page_size = sysconf(_SC_PAGESIZE); vec_size = FILE_SIZE / page_size; @@ -341,8 +339,7 @@ TEST(check_tmpfs_mmap) } /* - * Touch a page in the middle of the mapping. We expect only - * that page to be fetched into memory. + * Touch a page in the middle of the mapping.
*/ addr[FILE_SIZE / 2] = 1; retval = mincore(addr, FILE_SIZE, vec); ASSERT_EQ(retval, 0); if (!vec[FILE_SIZE / 2 / page_size]) { TH_LOG("Page not found in memory after use"); } - i = FILE_SIZE / 2 / page_size + 1; - while (i < vec_size && vec[i]) { - ra_pages++; - i++; - } - ASSERT_EQ(ra_pages, 0) { - TH_LOG("Read-ahead pages found in memory"); - } - munmap(addr, FILE_SIZE); close(fd); free(vec); From 9e2bd67773579fdd2e58de9628c2d319f3f518e7 Mon Sep 17 00:00:00 2001 From: wangxuewen Date: Mon, 7 Apr 2025 18:30:17 +0800 Subject: [PATCH 24/31] mm/hugetlb: add a line break at the end of the format string Missing line break at the end of the format string. Link: https://lkml.kernel.org/r/20250407103017.2979821-1-18810879172@163.com Signed-off-by: wangxuewen Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ea065347866e..e3e6ac991b9c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4647,7 +4647,7 @@ static void __init hugetlb_sysfs_init(void) err = hugetlb_sysfs_add_hstate(h, hugepages_kobj, hstate_kobjs, &hstate_attr_group); if (err) - pr_err("HugeTLB: Unable to add hstate %s", h->name); + pr_err("HugeTLB: Unable to add hstate %s\n", h->name); } #ifdef CONFIG_NUMA From 8ab1b16023961dc640023b10436d282f905835ad Mon Sep 17 00:00:00 2001 From: "Vishal Moola (Oracle)" Date: Thu, 3 Apr 2025 16:54:17 -0700 Subject: [PATCH 25/31] mm: fix filemap_get_folios_contig returning batches of identical folios filemap_get_folios_contig() is supposed to return distinct folios found within [start, end]. Large folios in the Xarray become multi-index entries. xas_next() can iterate through the sub-indexes before finding a sibling entry and breaking out of the loop. This can result in a returned folio_batch containing an indeterminate number of duplicate folios, which forces the callers to skeptically handle the returned batch. This is inefficient and incurs a large maintenance overhead. We can fix this by calling xas_advance() after we have successfully added a folio to the batch to ensure our Xarray is positioned such that it will correctly find the next folio - similar to filemap_get_read_batch(). Link: https://lkml.kernel.org/r/Z-8s1-kiIDkzgRbc@fedora Fixes: 35b471467f88 ("filemap: add filemap_get_folios_contig()") Signed-off-by: Vishal Moola (Oracle) Reported-by: Qu Wenruo Closes: https://lkml.kernel.org/r/b714e4de-2583-4035-b829-72cfb5eb6fc6@gmx.com Tested-by: Qu Wenruo Cc: Matthew Wilcox (Oracle) Cc: Vivek Kasireddy Cc: Signed-off-by: Andrew Morton --- mm/filemap.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/filemap.c b/mm/filemap.c index b5e784f34d98..7b90cbeb4a1a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2244,6 +2244,7 @@ unsigned filemap_get_folios_contig(struct address_space *mapping, *start = folio->index + nr; goto out; } + xas_advance(&xas, folio_next_index(folio) - 1); continue; put_folio: folio_put(folio); From 8c56c5dbcf52220cc9be7a36e7f21ebd5939e0b9 Mon Sep 17 00:00:00 2001 From: David Hildenbrand Date: Tue, 8 Apr 2025 10:59:50 +0200 Subject: [PATCH 26/31] mm: (un)track_pfn_copy() fix + doc improvements We got a late smatch warning and some additional review feedback. smatch warnings: mm/memory.c:1428 copy_page_range() error: uninitialized symbol 'pfn'. We actually use the pfn only when it is properly initialized; however, we may pass an uninitialized value to a function -- although the function will not use it, that is likely still UB in C.
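The resulting calling convention, in sketch form (a simplified rendering of the diff below, not the verbatim copy_page_range() code):

    unsigned long pfn = 0;   /* caller initializes; only written on success */

    if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
        ret = track_pfn_copy(dst_vma, src_vma, &pfn);
        if (ret)
            return ret;
    }
    /* ... on a later failure path: safe even if nothing was tracked */
    untrack_pfn_copy(dst_vma, pfn);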
So let's just fix it by always initializing pfn in the caller of track_pfn_copy(), and improving the documentation of track_pfn_copy(). While at it, clarify in the doc of untrack_pfn_copy() that internal checks determine whether there is actually anything to untrack. Link: https://lkml.kernel.org/r/20250408085950.976103-1-david@redhat.com Fixes: dc84bc2aba85 ("x86/mm/pat: Fix VM_PAT handling when fork() fails in copy_page_range()") Signed-off-by: David Hildenbrand Reported-by: kernel test robot Reported-by: Dan Carpenter Closes: https://lore.kernel.org/r/202503270941.IFILyNCX-lkp@intel.com/ Reviewed-by: Lorenzo Stoakes Acked-by: Ingo Molnar Cc: Andrew Morton Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Borislav Petkov Cc: Rik van Riel Cc: "H. Peter Anvin" Cc: Linus Torvalds Signed-off-by: Andrew Morton --- include/linux/pgtable.h | 9 ++++++--- mm/memory.c | 2 +- 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e2b705c14945..b50447ef1c92 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1511,8 +1511,9 @@ static inline void track_pfn_insert(struct vm_area_struct *vma, pgprot_t *prot, /* * track_pfn_copy is called when a VM_PFNMAP VMA is about to get the page - * tables copied during copy_page_range(). On success, stores the pfn to be - * passed to untrack_pfn_copy(). + * tables copied during copy_page_range(). Will store the pfn to be + * passed to untrack_pfn_copy() only if there is something to be untracked. + * Callers should initialize the pfn to 0. */ static inline int track_pfn_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma, unsigned long *pfn) @@ -1522,7 +1523,9 @@ static inline int track_pfn_copy(struct vm_area_struct *dst_vma, /* * untrack_pfn_copy is called when a VM_PFNMAP VMA failed to copy during - * copy_page_range(), but after track_pfn_copy() was already called. + * copy_page_range(), but after track_pfn_copy() was already called. Can + * be called even if track_pfn_copy() did not actually track anything: + * handled internally. */ static inline void untrack_pfn_copy(struct vm_area_struct *dst_vma, unsigned long pfn) diff --git a/mm/memory.c b/mm/memory.c index 2d8c265fc7d6..1a35165622e1 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1361,7 +1361,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) struct mm_struct *dst_mm = dst_vma->vm_mm; struct mm_struct *src_mm = src_vma->vm_mm; struct mmu_notifier_range range; - unsigned long next, pfn; + unsigned long next, pfn = 0; bool is_cow; int ret; From 0aa8dbe5a8dcaf0cc083f4a519a2906e1eb4609e Mon Sep 17 00:00:00 2001 From: Jean-Michel Hautbois Date: Fri, 28 Mar 2025 12:14:24 +0100 Subject: [PATCH 27/31] mailmap: add entry for Jean-Michel Hautbois As recent contributions were made with the @ideasonboard.com email, any reply would fail. Add the proper address to map this old one. Link: https://lkml.kernel.org/r/20250328-mailmap-v2-v2-1-bdc69d2193ca@yoseli.org Signed-off-by: Jean-Michel Hautbois Acked-by: Laurent Pinchart Signed-off-by: Andrew Morton --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index 6a874f9950bf..6efaee6537e4 100644 --- a/.mailmap +++ b/.mailmap @@ -322,6 +322,7 @@ Jayachandran C Jayachandran C Jayachandran C +Jean-Michel Hautbois Jean Tourrilhes Jeevan Shriram Jeff Garzik From e6e07b696da529e85d1ba880555b5df5c80a46bd Mon Sep 17 00:00:00 2001 From: "T.J.
Mercier" Date: Wed, 9 Apr 2025 22:51:11 +0000 Subject: [PATCH 28/31] alloc_tag: handle incomplete bulk allocations in vm_module_tags_populate alloc_pages_bulk_node() may partially succeed and allocate fewer than the requested nr_pages. There are several conditions under which this can occur, but we have encountered the case where CONFIG_PAGE_OWNER is enabled causing all bulk allocations to always fallback to single page allocations due to commit 187ad460b841 ("mm/page_alloc: avoid page allocator recursion with pagesets.lock held"). Currently vm_module_tags_populate() immediately fails when alloc_pages_bulk_node() returns fewer than the requested number of pages. When this happens memory allocation profiling gets disabled, for example [ 14.297583] [9: modprobe: 465] Failed to allocate memory for allocation tags in the module scsc_wlan. Memory allocation profiling is disabled! [ 14.299339] [9: modprobe: 465] modprobe: Failed to insmod '/vendor/lib/modules/scsc_wlan.ko' with args '': Out of memory This patch causes vm_module_tags_populate() to retry bulk allocations for the remaining memory instead of failing immediately which will avoid the disablement of memory allocation profiling. Link: https://lkml.kernel.org/r/20250409225111.3770347-1-tjmercier@google.com Fixes: 0f9b685626da ("alloc_tag: populate memory for module tags as needed") Signed-off-by: T.J. Mercier Reported-by: Janghyuck Kim Acked-by: Suren Baghdasaryan Cc: Kent Overstreet Cc: Signed-off-by: Andrew Morton --- lib/alloc_tag.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c index 1d893e313614..25ecc1334b67 100644 --- a/lib/alloc_tag.c +++ b/lib/alloc_tag.c @@ -422,11 +422,20 @@ static int vm_module_tags_populate(void) unsigned long old_shadow_end = ALIGN(phys_end, MODULE_ALIGN); unsigned long new_shadow_end = ALIGN(new_end, MODULE_ALIGN); unsigned long more_pages; - unsigned long nr; + unsigned long nr = 0; more_pages = ALIGN(new_end - phys_end, PAGE_SIZE) >> PAGE_SHIFT; - nr = alloc_pages_bulk_node(GFP_KERNEL | __GFP_NOWARN, - NUMA_NO_NODE, more_pages, next_page); + while (nr < more_pages) { + unsigned long allocated; + + allocated = alloc_pages_bulk_node(GFP_KERNEL | __GFP_NOWARN, + NUMA_NO_NODE, more_pages - nr, next_page + nr); + + if (!allocated) + break; + nr += allocated; + } + if (nr < more_pages || vmap_pages_range(phys_end, phys_end + (nr << PAGE_SHIFT), PAGE_KERNEL, next_page, PAGE_SHIFT) < 0) { From 92868577d05ff75f9f38c6345ed275203827faba Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Wed, 9 Apr 2025 15:20:06 +0530 Subject: [PATCH 29/31] selftests/mm: fix compiler -Wmaybe-uninitialized warning Following build warning comes up for cow test as 'transferred' variable has not been initialized. Fix the warning via zero init for the variable. 
  CC       cow
cow.c: In function `do_test_vmsplice_in_parent':
cow.c:365:61: warning: `transferred' may be used uninitialized [-Wmaybe-uninitialized]
  365 |     cur = read(fds[0], new + total, transferred - total);
      |                                     ~~~~~~~~~~~~^~~~~~~
cow.c:296:29: note: `transferred' was declared here
  296 |     ssize_t cur, total, transferred;
      |                         ^~~~~~~~~~~
  CC       compaction_test
  CC       gup_longterm

Link: https://lkml.kernel.org/r/20250409095006.1422620-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual Cc: Shuah Khan Cc: Anshuman Khandual Cc: David Hildenbrand Signed-off-by: Andrew Morton --- tools/testing/selftests/mm/cow.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/mm/cow.c b/tools/testing/selftests/mm/cow.c index f0cb14ea8608..b6cfe0a4b7df 100644 --- a/tools/testing/selftests/mm/cow.c +++ b/tools/testing/selftests/mm/cow.c @@ -293,7 +293,7 @@ static void do_test_vmsplice_in_parent(char *mem, size_t size, .iov_base = mem, .iov_len = size, }; - ssize_t cur, total, transferred; + ssize_t cur, total, transferred = 0; struct comm_pipes comm_pipes; char *old, *new; int ret, fds[2]; From a995199384347261bb3f21b2e171fa7f988bd2f8 Mon Sep 17 00:00:00 2001 From: "Kirill A. Shutemov" Date: Wed, 9 Apr 2025 12:40:43 +0300 Subject: [PATCH 30/31] mm: fix apply_to_existing_page_range() In the case of apply_to_existing_page_range(), apply_to_pte_range() is reached with 'create' set to false. When !create, the loop over the PTE page table is broken. apply_to_pte_range() will only move to the next PTE entry if 'create' is true or if the current entry is not pte_none(). This means that the user of apply_to_existing_page_range() will not have 'fn' called for any entries after the first pte_none() in the PTE page table. Fix the loop logic in apply_to_pte_range(). There are no known runtime issues from this, but the fix is trivial enough for stable@ even without a known buggy user. Link: https://lkml.kernel.org/r/20250409094043.1629234-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov Fixes: be1db4753ee6 ("mm/memory.c: add apply_to_existing_page_range() helper") Cc: Daniel Axtens Cc: David Hildenbrand Cc: Vlastimil Babka Cc: Signed-off-by: Andrew Morton --- mm/memory.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 1a35165622e1..44481fe7c629 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2938,11 +2938,11 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, if (fn) { do { if (create || !pte_none(ptep_get(pte))) { - err = fn(pte++, addr, data); + err = fn(pte, addr, data); if (err) break; } - } while (addr += PAGE_SIZE, addr != end); + } while (pte++, addr += PAGE_SIZE, addr != end); } *mask |= PGTBL_PTE_MODIFIED; From a727a83ef22591d47e2d219cd8e01bd3616f4611 Mon Sep 17 00:00:00 2001 From: Oscar Salvador Date: Wed, 9 Apr 2025 10:24:52 +0200 Subject: [PATCH 31/31] MAINTAINERS: update HUGETLB reviewers I have done quite some review of hugetlb code over the years, and some work on it as well, the latest being the in-progress hugetlb pagewalk unification, which touches hugetlb code extensively. HugeTLB does not have many reviewers, so I would like to help out by offering myself as an official Reviewer.
Signed-off-by: Oscar Salvador Link: https://lkml.kernel.org/r/20250409082452.269180-1-osalvador@suse.de Acked-by: David Hildenbrand Acked-by: Vlastimil Babka Acked-by: Muchun Song Cc: Michal Hocko Cc: Peter Xu Signed-off-by: Andrew Morton --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index de97cd54ff24..56ffc830e5f2 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -10954,6 +10954,7 @@ F: include/linux/platform_data/huawei-gaokun-ec.h HUGETLB SUBSYSTEM M: Muchun Song +R: Oscar Salvador L: linux-mm@kvack.org S: Maintained F: Documentation/ABI/testing/sysfs-kernel-mm-hugepages