Merge tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
   linux-next for a couple of months without, to my knowledge, any
   negative reports (or any positive ones, come to that).

 - Also the Maple Tree from Liam Howlett. An overlapping range-based
   tree for vmas. It is apparently slightly more efficient in its own
   right, but is mainly targeted at enabling work to reduce mmap_lock
   contention.

   Liam has identified a number of other tree users in the kernel which
   could be beneficially converted to mapletrees.

   Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
   at [1]. This has yet to be addressed due to Liam's unfortunately
   timed vacation. He is now back and we'll get this fixed up.

 - Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
   clang-generated instrumentation to detect used-uninitialized bugs
   down to the single bit level.

   KMSAN keeps finding bugs. New ones, as well as the legacy ones.

 - Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
   memory into THPs.

 - Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
   support file/shmem-backed pages.

 - userfaultfd updates from Axel Rasmussen

 - zsmalloc cleanups from Alexey Romanov

 - cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
   memory-failure

 - Huang Ying adds enhancements to NUMA balancing memory tiering mode's
   page promotion, with a new way of detecting hot pages.

 - memcg updates from Shakeel Butt: charging optimizations and reduced
   memory consumption.

 - memcg cleanups from Kairui Song.

 - memcg fixes and cleanups from Johannes Weiner.

 - Vishal Moola provides more folio conversions

 - Zhang Yi removed ll_rw_block() :(

 - migration enhancements from Peter Xu

 - migration error-path bugfixes from Huang Ying

 - Aneesh Kumar added ability for a device driver to alter the memory
   tiering promotion paths. For optimizations by PMEM drivers, DRM
   drivers, etc.

 - vma merging improvements from Jakub Matěn.

 - NUMA hinting cleanups from David Hildenbrand.

 - xu xin added additional userspace visibility into KSM merging
   activity.

 - THP & KSM code consolidation from Qi Zheng.

 - more folio work from Matthew Wilcox.

 - KASAN updates from Andrey Konovalov.

 - DAMON cleanups from Kaixu Xia.

 - DAMON work from SeongJae Park: fixes, cleanups.

 - hugetlb sysfs cleanups from Muchun Song.

 - Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.

Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]

* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
  hugetlb: allocate vma lock for all sharable vmas
  hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
  hugetlb: fix vma lock handling during split vma and range unmapping
  mglru: mm/vmscan.c: fix imprecise comments
  mm/mglru: don't sync disk for each aging cycle
  mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
  mm: memcontrol: use do_memsw_account() in a few more places
  mm: memcontrol: deprecate swapaccounting=0 mode
  mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
  mm/secretmem: remove reduntant return value
  mm/hugetlb: add available_huge_pages() func
  mm: remove unused inline functions from include/linux/mm_inline.h
  selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
  selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
  selftests/vm: add thp collapse shmem testing
  selftests/vm: add thp collapse file and tmpfs testing
  selftests/vm: modularize thp collapse memory operations
  selftests/vm: dedup THP helpers
  mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
  mm/madvise: add file and shmem support to MADV_COLLAPSE
  ...
Showing 409 changed files with 65,792 additions and 8,034 deletions.
@@ -0,0 +1,25 @@
What:           /sys/devices/virtual/memory_tiering/
Date:           August 2022
Contact:        Linux memory management mailing list <linux-mm@kvack.org>
Description:    A collection of all the memory tiers allocated.

                Individual memory tier details are contained in subdirectories
                named by the abstract distance of the memory tier.

                /sys/devices/virtual/memory_tiering/memory_tierN/


What:           /sys/devices/virtual/memory_tiering/memory_tierN/
                /sys/devices/virtual/memory_tiering/memory_tierN/nodes
Date:           August 2022
Contact:        Linux memory management mailing list <linux-mm@kvack.org>
Description:    Directory with details of a specific memory tier

                This is the directory containing information about a particular
                memory tier, memory_tierN, where N is derived based on abstract
                distance.

                A smaller value of N implies a higher (faster) memory tier in the
                hierarchy.

                nodes: NUMA nodes that are part of this memory tier.

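As a rough illustration of how the layout described above might be consumed, the following shell sketch walks the tier directories and prints each tier together with the contents of its nodes file. The function name list_memory_tiers and the MEMTIER_ROOT override are assumptions made for this example only; they are not part of the ABI.

```shell
# Hypothetical helper: print "<tier directory>: <nodes>" for every tier.
# MEMTIER_ROOT is an assumption of this sketch so the function can be
# pointed at a test directory; it defaults to the documented sysfs path.
list_memory_tiers() {
	root="${MEMTIER_ROOT:-/sys/devices/virtual/memory_tiering}"
	for tier in "$root"/memory_tier*; do
		# Skip the unexpanded glob when no tier directory exists.
		[ -r "$tier/nodes" ] || continue
		printf '%s: %s\n' "${tier##*/}" "$(cat "$tier/nodes")"
	done
	return 0
}
```

Calling list_memory_tiers with no arguments reads the live sysfs tree. Since a smaller N implies a faster tier, sorting the output by N would list tiers from fastest to slowest.
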
@@ -0,0 +1,162 @@
.. SPDX-License-Identifier: GPL-2.0

=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
       batches, when MMU sets it (e.g., on x86). This behavior can
       theoretically worsen lock contention (mmap_lock). If it is
       disabled, the multi-gen LRU will suffer a minor performance
       degradation for workloads that contiguously map hot pages,
       whose accessed bits can be otherwise cleared by fewer larger
       batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
       well, when MMU sets it (e.g., on x86). This behavior was not
       verified on x86 varieties other than Intel and AMD. If it is
       disabled, the multi-gen LRU will suffer a negligible
       performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

    echo y >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0007
    echo 5 >/sys/kernel/mm/lru_gen/enabled
    cat /sys/kernel/mm/lru_gen/enabled
    0x0005

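To make the bit layout in the table above easier to read back, here is a minimal shell sketch that decodes a value of ``enabled`` into the components it switches on. The function name ``describe_lru_gen_enabled`` and the short labels it prints are made up for this example; only the bit values come from the table.

```shell
# Hypothetical helper: list the components set in an ``enabled`` value.
# Accepts decimal or hex input, e.g. 5 or 0x0005.
describe_lru_gen_enabled() {
	value=$(( $1 ))
	if [ $(( value & 0x0001 )) -ne 0 ]; then
		echo "0x0001 main switch"
	fi
	if [ $(( value & 0x0002 )) -ne 0 ]; then
		echo "0x0002 leaf accessed-bit batching"
	fi
	if [ $(( value & 0x0004 )) -ne 0 ]; then
		echo "0x0004 non-leaf accessed-bit clearing"
	fi
	return 0
}
```

One possible use is ``describe_lru_gen_enabled "$(cat /sys/kernel/mm/lru_gen/enabled)"``, which for ``0x0005`` would report the main switch and the non-leaf component but not the leaf batching.
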
Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.

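A small shell sketch of setting this knob, assuming ``min_ttl_ms`` lives in the ``/sys/kernel/mm/lru_gen/`` directory named under "Runtime options". The function name ``set_min_ttl_ms`` and the ``LRU_GEN_DIR`` override are inventions of this example; the override exists only so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: validate N and write it to min_ttl_ms.
# LRU_GEN_DIR is an assumption of this sketch; it defaults to the
# sysfs directory documented above.
set_min_ttl_ms() {
	dir="${LRU_GEN_DIR:-/sys/kernel/mm/lru_gen}"
	case "$1" in
	''|*[!0-9]*)
		# Reject empty input and anything that is not a plain integer.
		echo "min_ttl_ms must be a non-negative integer" >&2
		return 1
		;;
	esac
	echo "$1" >"$dir/min_ttl_ms"
}
```

For instance, ``set_min_ttl_ms 1000`` would protect roughly the last second of working set, and ``set_min_ttl_ms 0`` would disable the feature again.
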
Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, as is
concatenation with the delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilization.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

    memcg  memcg_id  memcg_path
       node  node_id
           min_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages
           ...
           max_gen_nr  age_in_ms  nr_anon_pages  nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.

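The histogram layout above can be flattened into one line per generation with a short ``awk`` filter. This is only a sketch: the function name ``lru_gen_summary`` is hypothetical, and it assumes the whitespace-separated field order shown in the format description (generation number, ``age_in_ms``, anon pages, file pages).

```shell
# Hypothetical helper: read lru_gen output on stdin and print
# "memcg_id node_id gen_nr age_in_ms total_pages" for each generation.
lru_gen_summary() {
	awk '
		# Remember which memcg and node the following bins belong to.
		$1 == "memcg" { memcg = $2; next }
		$1 == "node"  { node = $2; next }
		# A bin line: gen_nr age_in_ms nr_anon_pages nr_file_pages.
		NF >= 4       { print memcg, node, $1, $2, $3 + $4 }
	'
}
```

It could be fed with ``lru_gen_summary </sys/kernel/debug/lru_gen``. Because the histograms are noncumulative, a consumer that wants "pages older than T" would sum the totals of every bin whose ``age_in_ms`` exceeds T.
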
Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

    ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.

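A minimal sketch of issuing the aging command from a script follows. The function name ``lru_gen_age`` and the ``LRU_GEN_DEBUGFS`` override are assumptions of this example; the command string itself follows the syntax given above, with the optional arguments passed through unchanged.

```shell
# Hypothetical helper: write an aging command for one memcg/node pair.
# Arguments: memcg_id node_id max_gen_nr [can_swap [force_scan]]
# LRU_GEN_DEBUGFS is an assumption of this sketch; it defaults to the
# debugfs file documented above.
lru_gen_age() {
	if [ "$#" -lt 3 ] || [ "$#" -gt 5 ]; then
		echo "usage: lru_gen_age memcg_id node_id max_gen_nr [can_swap [force_scan]]" >&2
		return 1
	fi
	echo "+ $*" >"${LRU_GEN_DEBUGFS:-/sys/kernel/debug/lru_gen}"
}
```

A scheduler-side loop might call ``lru_gen_age "$memcg" "$node" "$max_gen"`` once per interval, after reading the current ``max_gen_nr`` for that memcg and node from ``lru_gen``.
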
Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``:

    ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.

A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of the overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impact on the
existing jobs.

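
The constraint that ``min_gen_nr`` be less than ``max_gen_nr-1`` can be checked before the eviction command is written. The sketch below does that; the function name ``lru_gen_evict``, the ``LRU_GEN_DEBUGFS`` override, and the extra ``max_gen_nr`` argument (which the caller is assumed to have read from ``lru_gen`` beforehand) are all inventions of this example.

```shell
# Hypothetical helper: evict generations <= min_gen_nr after a sanity check.
# Arguments: memcg_id node_id min_gen_nr max_gen_nr [swappiness [nr_to_reclaim]]
# max_gen_nr is not part of the kernel command; this sketch takes it
# only to reject generations that are not fully aged.
lru_gen_evict() {
	if [ "$#" -lt 4 ]; then
		echo "usage: lru_gen_evict memcg_id node_id min_gen_nr max_gen_nr [swappiness [nr_to_reclaim]]" >&2
		return 1
	fi
	if [ "$3" -ge $(( $4 - 1 )) ]; then
		echo "min_gen_nr must be less than max_gen_nr-1" >&2
		return 1
	fi
	# Optional arguments are appended only when they were supplied.
	echo "- $1 $2 $3${5:+ $5}${6:+ $6}" >"${LRU_GEN_DEBUGFS:-/sys/kernel/debug/lru_gen}"
}
```

A scheduler could call ``lru_gen_evict "$memcg" "$node" "$min_gen" "$max_gen" 0 "$pages"`` and move on to the next ranked server when the amount reclaimed falls short of what the new job needs.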