From 5ad15f1b32f4a9cb7653b5ab1eccf285b4045007 Mon Sep 17 00:00:00 2001 From: Dan Carpenter Date: Wed, 12 Oct 2022 16:19:39 +0300 Subject: [PATCH 01/17] mailmap: update Dan Carpenter's email address My time at Oracle is ending at the end of the month. Update my email address accordingly. Link: https://lkml.kernel.org/r/Y0a+6+5SHMdvUnpg@kili Signed-off-by: Dan Carpenter Cc: Joe Perches Signed-off-by: Andrew Morton --- .mailmap | 1 + 1 file changed, 1 insertion(+) diff --git a/.mailmap b/.mailmap index 380378e2db368..b4e7511121f06 100644 --- a/.mailmap +++ b/.mailmap @@ -104,6 +104,7 @@ Christoph Hellwig Colin Ian King Corey Minyard Damian Hobson-Garcia +Dan Carpenter Daniel Borkmann Daniel Borkmann Daniel Borkmann From cef408e70e9b0c175a874b9d9fe6acc7e12f569f Mon Sep 17 00:00:00 2001 From: Qais Yousef Date: Fri, 14 Oct 2022 15:10:16 +0100 Subject: [PATCH 02/17] mailmap: update email for Qais Yousef Update my email address for old entry and add a new entry for my contribution while working with arm to continue support that work. Link: https://lkml.kernel.org/r/20221014141016.539625-1-qyousef@layalina.io Signed-off-by: Qais Yousef Acked-by: Qais Yousef Acked-by: Qais Yousef Signed-off-by: Andrew Morton --- .mailmap | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/.mailmap b/.mailmap index b4e7511121f06..fdd7989492fc3 100644 --- a/.mailmap +++ b/.mailmap @@ -354,7 +354,8 @@ Peter Oruba Pratyush Anand Praveen BP Punit Agrawal -Qais Yousef +Qais Yousef +Qais Yousef Quentin Monnet Quentin Perret Rafael J. Wysocki From 7329e3ebe3594b425955ab591ecea335e85842c2 Mon Sep 17 00:00:00 2001 From: Liam Howlett Date: Sat, 15 Oct 2022 02:12:33 +0000 Subject: [PATCH 03/17] mm/mempolicy: fix mbind_range() arguments to vma_merge() Fuzzing produced an invalid argument to vma_merge() which was caught by the newly added verification of the number of VMAs being removed on process exit. Analyzing the failure eventually resulted in finding an issue with the search of a VMA that started at address 0, which caused an underflow and thus the loss of many VMAs being tracked in the tree. Fix the underflow by changing the search of the maple tree to use the start address directly. Link: https://lkml.kernel.org/r/20221015021135.2816178-1-Liam.Howlett@oracle.com Fixes: 66850be55e8e ("mm/mempolicy: use vma iterator & maple state instead of vma linked list") Signed-off-by: Liam R. Howlett Reported-by: kernel test robot Link: https://lore.kernel.org/r/202210052318.5ad10912-oliver.sang@intel.com Cc: Yu Zhao Signed-off-by: Andrew Morton --- mm/mempolicy.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/mm/mempolicy.c b/mm/mempolicy.c index a937eaec5b68d..61aa9aedb7289 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -787,17 +787,22 @@ static int vma_replace_policy(struct vm_area_struct *vma, static int mbind_range(struct mm_struct *mm, unsigned long start, unsigned long end, struct mempolicy *new_pol) { - MA_STATE(mas, &mm->mm_mt, start - 1, start - 1); + MA_STATE(mas, &mm->mm_mt, start, start); struct vm_area_struct *prev; struct vm_area_struct *vma; int err = 0; pgoff_t pgoff; - prev = mas_find_rev(&mas, 0); - if (prev && (start < prev->vm_end)) - vma = prev; - else - vma = mas_next(&mas, end - 1); + prev = mas_prev(&mas, 0); + if (unlikely(!prev)) + mas_set(&mas, start); + + vma = mas_find(&mas, end - 1); + if (WARN_ON(!vma)) + return 0; + + if (start > vma->vm_start) + prev = vma; for (; vma; vma = mas_next(&mas, end - 1)) { unsigned long vmstart = max(start, vma->vm_start); From 4249a05ff670e7b1aeea77f1a5451080ea86c88d Mon Sep 17 00:00:00 2001 From: Alexey Romanov Date: Thu, 13 Oct 2022 14:28:25 +0300 Subject: [PATCH 04/17] zsmalloc: zs_destroy_pool: add size_class NULL check Inside the zs_destroy_pool() function, there can still be NULL size_class pointers: if when the next size_class is allocated, inside zs_create_pool() function, kzalloc will return NULL and handling the error condition, zs_create_pool() will call zs_destroy_pool(). Link: https://lkml.kernel.org/r/20221013112825.61869-1-avromanov@sberdevices.ru Fixes: f24263a5a076 ("zsmalloc: remove unnecessary size_class NULL check") Signed-off-by: Alexey Romanov Reviewed-by: Sergey Senozhatsky Cc: Minchan Kim Cc: Nitin Gupta Signed-off-by: Andrew Morton --- mm/zsmalloc.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 525758713a553..d03941cace2c4 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -2311,6 +2311,9 @@ void zs_destroy_pool(struct zs_pool *pool) int fg; struct size_class *class = pool->size_class[i]; + if (!class) + continue; + if (class->index != i) continue; From 977ef30a7d888eeb52fb6908f99080f33e5309a8 Mon Sep 17 00:00:00 2001 From: Martin Liska Date: Thu, 13 Oct 2022 09:40:59 +0200 Subject: [PATCH 05/17] gcov: support GCC 12.1 and newer compilers Starting with GCC 12.1, the created .gcda format can't be read by gcov tool. There are 2 significant changes to the .gcda file format that need to be supported: a) [gcov: Use system IO buffering] (23eb66d1d46a34cb28c4acbdf8a1deb80a7c5a05) changed that all sizes in the format are in bytes and not in words (4B) b) [gcov: make profile merging smarter] (72e0c742bd01f8e7e6dcca64042b9ad7e75979de) add a new checksum to the file header. Tested with GCC 7.5, 10.4, 12.2 and the current master. Link: https://lkml.kernel.org/r/624bda92-f307-30e9-9aaa-8cc678b2dfb2@suse.cz Signed-off-by: Martin Liska Tested-by: Peter Oberparleiter Reviewed-by: Peter Oberparleiter Cc: Signed-off-by: Andrew Morton --- kernel/gcov/gcc_4_7.c | 18 ++++++++++++++++-- 1 file changed, 16 insertions(+), 2 deletions(-) diff --git a/kernel/gcov/gcc_4_7.c b/kernel/gcov/gcc_4_7.c index 460c12b7dfea2..7971e989e425b 100644 --- a/kernel/gcov/gcc_4_7.c +++ b/kernel/gcov/gcc_4_7.c @@ -30,6 +30,13 @@ #define GCOV_TAG_FUNCTION_LENGTH 3 +/* Since GCC 12.1 sizes are in BYTES and not in WORDS (4B). */ +#if (__GNUC__ >= 12) +#define GCOV_UNIT_SIZE 4 +#else +#define GCOV_UNIT_SIZE 1 +#endif + static struct gcov_info *gcov_info_head; /** @@ -383,12 +390,18 @@ size_t convert_to_gcda(char *buffer, struct gcov_info *info) pos += store_gcov_u32(buffer, pos, info->version); pos += store_gcov_u32(buffer, pos, info->stamp); +#if (__GNUC__ >= 12) + /* Use zero as checksum of the compilation unit. */ + pos += store_gcov_u32(buffer, pos, 0); +#endif + for (fi_idx = 0; fi_idx < info->n_functions; fi_idx++) { fi_ptr = info->functions[fi_idx]; /* Function record. */ pos += store_gcov_u32(buffer, pos, GCOV_TAG_FUNCTION); - pos += store_gcov_u32(buffer, pos, GCOV_TAG_FUNCTION_LENGTH); + pos += store_gcov_u32(buffer, pos, + GCOV_TAG_FUNCTION_LENGTH * GCOV_UNIT_SIZE); pos += store_gcov_u32(buffer, pos, fi_ptr->ident); pos += store_gcov_u32(buffer, pos, fi_ptr->lineno_checksum); pos += store_gcov_u32(buffer, pos, fi_ptr->cfg_checksum); @@ -402,7 +415,8 @@ size_t convert_to_gcda(char *buffer, struct gcov_info *info) /* Counter record. */ pos += store_gcov_u32(buffer, pos, GCOV_TAG_FOR_COUNTER(ct_idx)); - pos += store_gcov_u32(buffer, pos, ci_ptr->num * 2); + pos += store_gcov_u32(buffer, pos, + ci_ptr->num * 2 * GCOV_UNIT_SIZE); for (cv_idx = 0; cv_idx < ci_ptr->num; cv_idx++) { pos += store_gcov_u64(buffer, pos, From 759a7c6126eef5635506453e9b9d55a6a3ac2084 Mon Sep 17 00:00:00 2001 From: Joseph Qi Date: Mon, 17 Oct 2022 21:02:26 +0800 Subject: [PATCH 06/17] ocfs2: fix BUG when iput after ocfs2_mknod fails Commit b1529a41f777 "ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error" tried to reclaim the claimed inode if __ocfs2_mknod_locked() fails later. But this introduce a race, the freed bit may be reused immediately by another thread, which will update dinode, e.g. i_generation. Then iput this inode will lead to BUG: inode->i_generation != le32_to_cpu(fe->i_generation) We could make this inode as bad, but we did want to do operations like wipe in some cases. Since the claimed inode bit can only affect that an dinode is missing and will return back after fsck, it seems not a big problem. So just leave it as is by revert the reclaim logic. Link: https://lkml.kernel.org/r/20221017130227.234480-1-joseph.qi@linux.alibaba.com Fixes: b1529a41f777 ("ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error") Signed-off-by: Joseph Qi Reported-by: Yan Wang Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Cc: Signed-off-by: Andrew Morton --- fs/ocfs2/namei.c | 11 +---------- 1 file changed, 1 insertion(+), 10 deletions(-) diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index 961d1cf54388e..1a97e167b2194 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -632,18 +632,9 @@ static int ocfs2_mknod_locked(struct ocfs2_super *osb, return status; } - status = __ocfs2_mknod_locked(dir, inode, dev, new_fe_bh, + return __ocfs2_mknod_locked(dir, inode, dev, new_fe_bh, parent_fe_bh, handle, inode_ac, fe_blkno, suballoc_loc, suballoc_bit); - if (status < 0) { - u64 bg_blkno = ocfs2_which_suballoc_group(fe_blkno, suballoc_bit); - int tmp = ocfs2_free_suballoc_bits(handle, inode_ac->ac_inode, - inode_ac->ac_bh, suballoc_bit, bg_blkno, 1); - if (tmp) - mlog_errno(tmp); - } - - return status; } static int ocfs2_mkdir(struct user_namespace *mnt_userns, From 28f4821b1b53e0649706912e810c6c232fc506f9 Mon Sep 17 00:00:00 2001 From: Joseph Qi Date: Mon, 17 Oct 2022 21:02:27 +0800 Subject: [PATCH 07/17] ocfs2: clear dinode links count in case of error In ocfs2_mknod(), if error occurs after dinode successfully allocated, ocfs2 i_links_count will not be 0. So even though we clear inode i_nlink before iput in error handling, it still won't wipe inode since we'll refresh inode from dinode during inode lock. So just like clear inode i_nlink, we clear ocfs2 i_links_count as well. Also do the same change for ocfs2_symlink(). Link: https://lkml.kernel.org/r/20221017130227.234480-2-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi Reported-by: Yan Wang Cc: Mark Fasheh Cc: Joel Becker Cc: Junxiao Bi Cc: Changwei Ge Cc: Gang He Cc: Jun Piao Cc: Signed-off-by: Andrew Morton --- fs/ocfs2/namei.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c index 1a97e167b2194..05f32989bad6f 100644 --- a/fs/ocfs2/namei.c +++ b/fs/ocfs2/namei.c @@ -232,6 +232,7 @@ static int ocfs2_mknod(struct user_namespace *mnt_userns, handle_t *handle = NULL; struct ocfs2_super *osb; struct ocfs2_dinode *dirfe; + struct ocfs2_dinode *fe = NULL; struct buffer_head *new_fe_bh = NULL; struct inode *inode = NULL; struct ocfs2_alloc_context *inode_ac = NULL; @@ -382,6 +383,7 @@ static int ocfs2_mknod(struct user_namespace *mnt_userns, goto leave; } + fe = (struct ocfs2_dinode *) new_fe_bh->b_data; if (S_ISDIR(mode)) { status = ocfs2_fill_new_dir(osb, handle, dir, inode, new_fe_bh, data_ac, meta_ac); @@ -454,8 +456,11 @@ static int ocfs2_mknod(struct user_namespace *mnt_userns, leave: if (status < 0 && did_quota_inode) dquot_free_inode(inode); - if (handle) + if (handle) { + if (status < 0 && fe) + ocfs2_set_links_count(fe, 0); ocfs2_commit_trans(osb, handle); + } ocfs2_inode_unlock(dir, 1); if (did_block_signals) @@ -2019,8 +2024,11 @@ static int ocfs2_symlink(struct user_namespace *mnt_userns, ocfs2_clusters_to_bytes(osb->sb, 1)); if (status < 0 && did_quota_inode) dquot_free_inode(inode); - if (handle) + if (handle) { + if (status < 0 && fe) + ocfs2_set_links_count(fe, 0); ocfs2_commit_trans(osb, handle); + } ocfs2_inode_unlock(dir, 1); if (did_block_signals) From eacf96d23f23e5bfd175be07048246efd0be4cc6 Mon Sep 17 00:00:00 2001 From: Colin Ian King Date: Fri, 7 Oct 2022 21:43:39 +0100 Subject: [PATCH 08/17] init: Kconfig: fix spelling mistake "satify" -> "satisfy" There is a spelling mistake in a Kconfig description. Fix it. Link: https://lkml.kernel.org/r/20221007204339.2757753-1-colin.i.king@gmail.com Signed-off-by: Colin Ian King Signed-off-by: Andrew Morton --- init/Kconfig | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/init/Kconfig b/init/Kconfig index 694f7c160c9c1..abf65098f1b6b 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -66,7 +66,7 @@ config RUST_IS_AVAILABLE This shows whether a suitable Rust toolchain is available (found). Please see Documentation/rust/quick-start.rst for instructions on how - to satify the build requirements of Rust support. + to satisfy the build requirements of Rust support. In particular, the Makefile target 'rustavailable' is useful to check why the Rust toolchain is not being detected. From 5789151e48acc3fd34d2109bf2021dc4df5e33e9 Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Mon, 17 Oct 2022 19:49:45 -0700 Subject: [PATCH 09/17] mm/mmap: undo ->mmap() when mas_preallocate() fails A memory leak in hugetlb_reserve_pages was reported in [1]. The root cause was traced to an error path in mmap_region when mas_preallocate() fails. In this case, the vma is freed after a successful call to filesystem specific mmap. The hugetlbfs mmap routine may allocate data structures pointed to by m_private_data. These need to be cleaned up by the hugetlb vm_ops->close() routine. The same issue was addressed by commit deb0f6562884 ("mm/mmap: undo ->mmap() when arch_validate_flags() fails") for the arch_validate_flags() test. Go to the same close_and_free_vma label if mas_preallocate() fails. [1] https://lore.kernel.org/linux-mm/CAKXUXMxf7OiCwbxib7MwfR4M1b5+b3cNTU7n5NV9Zm4967=FPQ@mail.gmail.com/ Link: https://lkml.kernel.org/r/20221018024945.415036-1-mike.kravetz@oracle.com Fixes: d4af56c5c7c6 ("mm: start tracking VMAs with maple tree") Signed-off-by: Mike Kravetz Reported-by: Lukas Bulwahn Reviewed-by: Liam R. Howlett Cc: Andrii Nakryiko Cc: Carlos Llamas Cc: Catalin Marinas Cc: Muchun Song Signed-off-by: Andrew Morton --- mm/mmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/mmap.c b/mm/mmap.c index bf2122af94e7a..3c9890e443a3e 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2681,7 +2681,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (mas_preallocate(&mas, vma, GFP_KERNEL)) { error = -ENOMEM; if (file) - goto unmap_and_free_vma; + goto close_and_free_vma; else goto free_vma; } From 1cd916d0340d0f45b151599c24ec40b5b2fd8e4a Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 18 Oct 2022 13:57:37 -0700 Subject: [PATCH 10/17] mm/mmap.c: __vma_adjust(): suppress uninitialized var warning The code is OK, but it fools gcc. mm/mmap.c:802 __vma_adjust() error: uninitialized symbol 'next_next'. Fixes: 524e00b36e8c5 ("mm: remove rb tree.") Reported-by: kernel test robot Cc: Liam R. Howlett Signed-off-by: Andrew Morton --- mm/mmap.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/mmap.c b/mm/mmap.c index 3c9890e443a3e..721fe5c82a0e6 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -618,7 +618,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start, struct vm_area_struct *expand) { struct mm_struct *mm = vma->vm_mm; - struct vm_area_struct *next_next, *next = find_vma(mm, vma->vm_end); + struct vm_area_struct *next_next = NULL; /* uninit var warning */ + struct vm_area_struct *next = find_vma(mm, vma->vm_end); struct vm_area_struct *orig_vma = vma; struct address_space *mapping = NULL; struct rb_root_cached *root = NULL; From a57b70519d1f7c53be98478623652738e5ac70d5 Mon Sep 17 00:00:00 2001 From: Liam Howlett Date: Tue, 18 Oct 2022 19:17:12 +0000 Subject: [PATCH 11/17] mm/mmap: fix MAP_FIXED address return on VMA merge mmap should return the start address of newly mapped area when successful. On a successful merge of a VMA, the return address was changed and thus was violating that expectation from userspace. This is a restoration of functionality provided by 309d08d9b3a3 (mm/mmap.c: fix mmap return value when vma is merged after call_mmap()). For completeness of fixing MAP_FIXED, implement the comments from the previous discussion to never update the address and fail if the address changes. Leaving the error as a WARN_ON() to avoid crashing the kernel. Link: https://lkml.kernel.org/r/20221018191613.4133459-1-Liam.Howlett@oracle.com Link: https://lore.kernel.org/all/Y06yk66SKxlrwwfb@lakrids/ Link: https://lore.kernel.org/all/20201203085350.22624-1-liuzixian4@huawei.com/ Fixes: 4dd1b84140c1 ("mm/mmap: use advanced maple tree API for mmap_region()") Signed-off-by: Liam R. Howlett Reported-by: Mark Rutland Cc: Liu Zixian Cc: David Hildenbrand Cc: Jason Gunthorpe Cc: Matthew Wilcox Signed-off-by: Andrew Morton --- mm/mmap.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 721fe5c82a0e6..e270057ed04eb 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2626,14 +2626,14 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (error) goto unmap_and_free_vma; - /* Can addr have changed?? - * - * Answer: Yes, several device drivers can do it in their - * f_op->mmap method. -DaveM + /* + * Expansion is handled above, merging is handled below. + * Drivers should not alter the address of the VMA. */ - WARN_ON_ONCE(addr != vma->vm_start); - - addr = vma->vm_start; + if (WARN_ON((addr != vma->vm_start))) { + error = -EINVAL; + goto close_and_free_vma; + } mas_reset(&mas); /* @@ -2655,7 +2655,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr, vm_area_free(vma); vma = merge; /* Update vm_flags to pick up the change. */ - addr = vma->vm_start; vm_flags = vma->vm_flags; goto unmap_writable; } From 12df140f0bdfae5dcfc81800970dd7f6f632e00c Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Mon, 17 Oct 2022 20:25:05 -0400 Subject: [PATCH 12/17] mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages The h->*_huge_pages counters are protected by the hugetlb_lock, but alloc_huge_page has a corner case where it can decrement the counter outside of the lock. This could lead to a corrupted value of h->resv_huge_pages, which we have observed on our systems. Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a potential race. Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com Fixes: a88c76954804 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count") Signed-off-by: Rik van Riel Reviewed-by: Mike Kravetz Cc: Naoya Horiguchi Cc: Glen McCready Cc: Mike Kravetz Cc: Muchun Song Cc: Signed-off-by: Andrew Morton --- mm/hugetlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index b586cdd75930b..dede0337c07c7 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -2924,11 +2924,11 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, page = alloc_buddy_huge_page_with_mpol(h, vma, addr); if (!page) goto out_uncharge_cgroup; + spin_lock_irq(&hugetlb_lock); if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) { SetHPageRestoreReserve(page); h->resv_huge_pages--; } - spin_lock_irq(&hugetlb_lock); list_add(&page->lru, &h->hugepage_activelist); set_page_refcounted(page); /* Fall through */ From 08ac85521cb2e26f25b885492180815ce8eaf4b7 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 18 Oct 2022 20:18:38 -0700 Subject: [PATCH 13/17] mm: /proc/pid/smaps_rollup: fix maple tree search /proc/pid/smaps_rollup showed 0 kB for everything: now find first vma. Link: https://lkml.kernel.org/r/3011bee7-182-97a2-1083-d5f5b688e54b@google.com Fixes: c4c84f06285e ("fs/proc/task_mmu: stop using linked list and highest_vm_end") Signed-off-by: Hugh Dickins Reviewed-by: Liam R. Howlett Cc: Alexey Dobriyan Cc: Matthew Wilcox (Oracle) Cc: Vlastimil Babka Signed-off-by: Andrew Morton --- fs/proc/task_mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 8b4f3073f8f55..8a74cdcc9af00 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -902,7 +902,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v) goto out_put_mm; hold_task_mempolicy(priv); - vma = mas_find(&mas, 0); + vma = mas_find(&mas, ULONG_MAX); if (unlikely(!vma)) goto empty_set; From df48a5f7a3bbac6a700026b554922943ecee1fb0 Mon Sep 17 00:00:00 2001 From: "Liam R. Howlett" Date: Tue, 31 May 2022 09:20:51 -0400 Subject: [PATCH 14/17] mm/page_alloc: reduce potential fragmentation in make_alloc_exact() Try to avoid using the left over split page on the next request for a page by calling __free_pages_ok() with FPI_TO_TAIL. This increases the potential of defragmenting memory when it's used for a short period of time. Link: https://lkml.kernel.org/r/20220531185626.yvlmymbxyoe5vags@revolver Signed-off-by: Liam R. Howlett Suggested-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton --- mm/page_alloc.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index e20ade858e71c..b5a6c815ae284 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -5784,14 +5784,18 @@ static void *make_alloc_exact(unsigned long addr, unsigned int order, size_t size) { if (addr) { - unsigned long alloc_end = addr + (PAGE_SIZE << order); - unsigned long used = addr + PAGE_ALIGN(size); - - split_page(virt_to_page((void *)addr), order); - while (used < alloc_end) { - free_page(used); - used += PAGE_SIZE; - } + unsigned long nr = DIV_ROUND_UP(size, PAGE_SIZE); + struct page *page = virt_to_page((void *)addr); + struct page *last = page + nr; + + split_page_owner(page, 1 << order); + split_page_memcg(page, 1 << order); + while (page < --last) + set_page_refcounted(last); + + last = page + (1UL << order); + for (page += nr; page < last; page++) + __free_pages_ok(page, 0, FPI_TO_TAIL); } return (void *)addr; } From 612b8a317023e1396965aacac43d80053c6e77db Mon Sep 17 00:00:00 2001 From: Mike Kravetz Date: Wed, 19 Oct 2022 13:19:57 -0700 Subject: [PATCH 15/17] hugetlb: fix memory leak associated with vma_lock structure The hugetlb vma_lock structure hangs off the vm_private_data pointer of sharable hugetlb vmas. The structure is vma specific and can not be shared between vmas. At fork and various other times, vmas are duplicated via vm_area_dup(). When this happens, the pointer in the newly created vma must be cleared and the structure reallocated. Two hugetlb specific routines deal with this hugetlb_dup_vma_private and hugetlb_vm_op_open. Both routines are called for newly created vmas. hugetlb_dup_vma_private would always clear the pointer and hugetlb_vm_op_open would allocate the new vms_lock structure. This did not work in the case of this calling sequence pointed out in [1]. move_vma copy_vma new_vma = vm_area_dup(vma); new_vma->vm_ops->open(new_vma); --> new_vma has its own vma lock. is_vm_hugetlb_page(vma) clear_vma_resv_huge_pages hugetlb_dup_vma_private --> vma->vm_private_data is set to NULL When clearing hugetlb_dup_vma_private we actually leak the associated vma_lock structure. The vma_lock structure contains a pointer to the associated vma. This information can be used in hugetlb_dup_vma_private and hugetlb_vm_op_open to ensure we only clear the vm_private_data of newly created (copied) vmas. In such cases, the vma->vma_lock->vma field will not point to the vma. Update hugetlb_dup_vma_private and hugetlb_vm_op_open to not clear vm_private_data if vma->vma_lock->vma == vma. Also, log a warning if hugetlb_vm_op_open ever encounters the case where vma_lock has already been correctly allocated for the vma. [1] https://lore.kernel.org/linux-mm/5154292a-4c55-28cd-0935-82441e512fc3@huawei.com/ Link: https://lkml.kernel.org/r/20221019201957.34607-1-mike.kravetz@oracle.com Fixes: 131a79b474e9 ("hugetlb: fix vma lock handling during split vma and range unmapping") Signed-off-by: Mike Kravetz Reviewed-by: Miaohe Lin Cc: Andrea Arcangeli Cc: "Aneesh Kumar K.V" Cc: Axel Rasmussen Cc: David Hildenbrand Cc: Davidlohr Bueso Cc: James Houghton Cc: "Kirill A. Shutemov" Cc: Michal Hocko Cc: Mina Almasry Cc: Muchun Song Cc: Naoya Horiguchi Cc: Pasha Tatashin Cc: Peter Xu Cc: Prakash Sangappa Cc: Sven Schnelle Signed-off-by: Andrew Morton --- mm/hugetlb.c | 35 +++++++++++++++++++++++++++-------- 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index dede0337c07c7..546df97c31e4c 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1014,15 +1014,23 @@ void hugetlb_dup_vma_private(struct vm_area_struct *vma) VM_BUG_ON_VMA(!is_vm_hugetlb_page(vma), vma); /* * Clear vm_private_data + * - For shared mappings this is a per-vma semaphore that may be + * allocated in a subsequent call to hugetlb_vm_op_open. + * Before clearing, make sure pointer is not associated with vma + * as this will leak the structure. This is the case when called + * via clear_vma_resv_huge_pages() and hugetlb_vm_op_open has already + * been called to allocate a new structure. * - For MAP_PRIVATE mappings, this is the reserve map which does * not apply to children. Faults generated by the children are * not guaranteed to succeed, even if read-only. - * - For shared mappings this is a per-vma semaphore that may be - * allocated in a subsequent call to hugetlb_vm_op_open. */ - vma->vm_private_data = (void *)0; - if (!(vma->vm_flags & VM_MAYSHARE)) - return; + if (vma->vm_flags & VM_MAYSHARE) { + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data; + + if (vma_lock && vma_lock->vma != vma) + vma->vm_private_data = NULL; + } else + vma->vm_private_data = NULL; } /* @@ -4601,6 +4609,7 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma) struct resv_map *resv = vma_resv_map(vma); /* + * HPAGE_RESV_OWNER indicates a private mapping. * This new VMA should share its siblings reservation map if present. * The VMA will only ever have a valid reservation map pointer where * it is being copied for another still existing VMA. As that VMA @@ -4615,11 +4624,21 @@ static void hugetlb_vm_op_open(struct vm_area_struct *vma) /* * vma_lock structure for sharable mappings is vma specific. - * Clear old pointer (if copied via vm_area_dup) and create new. + * Clear old pointer (if copied via vm_area_dup) and allocate + * new structure. Before clearing, make sure vma_lock is not + * for this vma. */ if (vma->vm_flags & VM_MAYSHARE) { - vma->vm_private_data = NULL; - hugetlb_vma_lock_alloc(vma); + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data; + + if (vma_lock) { + if (vma_lock->vma != vma) { + vma->vm_private_data = NULL; + hugetlb_vma_lock_alloc(vma); + } else + pr_warn("HugeTLB: vma_lock already exists in %s.\n", __func__); + } else + hugetlb_vma_lock_alloc(vma); } } From 71e2d666ef85d51834d658830f823560c402b8b6 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Wed, 19 Oct 2022 14:41:56 +0100 Subject: [PATCH 16/17] mm/huge_memory: do not clobber swp_entry_t during THP split The following has been observed when running stressng mmap since commit b653db77350c ("mm: Clear page->private when splitting or migrating a page") watchdog: BUG: soft lockup - CPU#75 stuck for 26s! [stress-ng:9546] CPU: 75 PID: 9546 Comm: stress-ng Tainted: G E 6.0.0-revert-b653db77-fix+ #29 0357d79b60fb09775f678e4f3f64ef0579ad1374 Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016 RIP: 0010:xas_descend+0x28/0x80 Code: cc cc 0f b6 0e 48 8b 57 08 48 d3 ea 83 e2 3f 89 d0 48 83 c0 04 48 8b 44 c6 08 48 89 77 18 48 89 c1 83 e1 03 48 83 f9 02 75 08 <48> 3d fd 00 00 00 76 08 88 57 12 c3 cc cc cc cc 48 c1 e8 02 89 c2 RSP: 0018:ffffbbf02a2236a8 EFLAGS: 00000246 RAX: ffff9cab7d6a0002 RBX: ffffe04b0af88040 RCX: 0000000000000002 RDX: 0000000000000030 RSI: ffff9cab60509b60 RDI: ffffbbf02a2236c0 RBP: 0000000000000000 R08: ffff9cab60509b60 R09: ffffbbf02a2236c0 R10: 0000000000000001 R11: ffffbbf02a223698 R12: 0000000000000000 R13: ffff9cab4e28da80 R14: 0000000000039c01 R15: ffff9cab4e28da88 FS: 00007fab89b85e40(0000) GS:ffff9cea3fcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fab84e00000 CR3: 00000040b73a4003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: xas_load+0x3a/0x50 __filemap_get_folio+0x80/0x370 ? put_swap_page+0x163/0x360 pagecache_get_page+0x13/0x90 __try_to_reclaim_swap+0x50/0x190 scan_swap_map_slots+0x31e/0x670 get_swap_pages+0x226/0x3c0 folio_alloc_swap+0x1cc/0x240 add_to_swap+0x14/0x70 shrink_page_list+0x968/0xbc0 reclaim_page_list+0x70/0xf0 reclaim_pages+0xdd/0x120 madvise_cold_or_pageout_pte_range+0x814/0xf30 walk_pgd_range+0x637/0xa30 __walk_page_range+0x142/0x170 walk_page_range+0x146/0x170 madvise_pageout+0xb7/0x280 ? asm_common_interrupt+0x22/0x40 madvise_vma_behavior+0x3b7/0xac0 ? find_vma+0x4a/0x70 ? find_vma+0x64/0x70 ? madvise_vma_anon_name+0x40/0x40 madvise_walk_vmas+0xa6/0x130 do_madvise+0x2f4/0x360 __x64_sys_madvise+0x26/0x30 do_syscall_64+0x5b/0x80 ? do_syscall_64+0x67/0x80 ? syscall_exit_to_user_mode+0x17/0x40 ? do_syscall_64+0x67/0x80 ? syscall_exit_to_user_mode+0x17/0x40 ? do_syscall_64+0x67/0x80 ? do_syscall_64+0x67/0x80 ? common_interrupt+0x8b/0xa0 entry_SYSCALL_64_after_hwframe+0x63/0xcd The problem can be reproduced with the mmtests config config-workload-stressng-mmap. It does not always happen and when it triggers is variable but it has happened on multiple machines. The intent of commit b653db77350c patch was to avoid the case where PG_private is clear but folio->private is not-NULL. However, THP tail pages uses page->private for "swp_entry_t if folio_test_swapcache()" as stated in the documentation for struct folio. This patch only clobbers page->private for tail pages if the head page was not in swapcache and warns once if page->private had an unexpected value. Link: https://lkml.kernel.org/r/20221019134156.zjyyn5aownakvztf@techsingularity.net Fixes: b653db77350c ("mm: Clear page->private when splitting or migrating a page") Signed-off-by: Mel Gorman Cc: Matthew Wilcox (Oracle) Cc: Mel Gorman Cc: Yang Shi Cc: Brian Foster Cc: Dan Streetman Cc: Miaohe Lin Cc: Oleksandr Natalenko Cc: Seth Jennings Cc: Vitaly Wool Cc: Signed-off-by: Andrew Morton --- mm/huge_memory.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 1cc4a5f4791e9..03fc7e5edf075 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2455,7 +2455,16 @@ static void __split_huge_page_tail(struct page *head, int tail, page_tail); page_tail->mapping = head->mapping; page_tail->index = head->index + tail; - page_tail->private = 0; + + /* + * page->private should not be set in tail pages with the exception + * of swap cache pages that store the swp_entry_t in tail pages. + * Fix up and warn once if private is unexpectedly set. + */ + if (!folio_test_swapcache(page_folio(head))) { + VM_WARN_ON_ONCE_PAGE(page_tail->private != 0, head); + page_tail->private = 0; + } /* Page flags must be visible before we make the page non-compound. */ smp_wmb(); From 97061d441110528dc02972818f2f1dad485107f9 Mon Sep 17 00:00:00 2001 From: Alistair Popple Date: Wed, 19 Oct 2022 23:29:34 +1100 Subject: [PATCH 17/17] nouveau: fix migrate_to_ram() for faulting page Commit 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") changed the migrate_to_ram() callback to take a reference on the device page to ensure it can't be freed while handling the fault. Unfortunately the corresponding update to Nouveau to accommodate this change was inadvertently dropped from that patch causing GPU to CPU migration to fail so add it here. Link: https://lkml.kernel.org/r/20221019122934.866205-1-apopple@nvidia.com Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") Signed-off-by: Alistair Popple Cc: John Hubbard Cc: Ralph Campbell Cc: Lyude Paul Cc: Ben Skeggs Signed-off-by: Andrew Morton --- drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 5fe209107246f..20fe53815b20f 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -176,6 +176,7 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf) .src = &src, .dst = &dst, .pgmap_owner = drm->dev, + .fault_page = vmf->page, .flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE, };