Skip to content

Commit

Permalink
Merge remote-tracking branch 'mm-stable/mm-stable'
Browse files Browse the repository at this point in the history
# Conflicts:
#	fs/exec.c
#	include/linux/nodemask.h
  • Loading branch information
Mark Brown committed Sep 27, 2022
2 parents dcfe749 + 21b7bdb commit ce3e5dd
Show file tree
Hide file tree
Showing 260 changed files with 56,926 additions and 4,531 deletions.
25 changes: 25 additions & 0 deletions Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
What: /sys/devices/virtual/memory_tiering/
Date: August 2022
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: A collection of all the memory tiers allocated.

Individual memory tier details are contained in subdirectories
named by the abstract distance of the memory tier.

/sys/devices/virtual/memory_tiering/memory_tierN/


What: /sys/devices/virtual/memory_tiering/memory_tierN/
/sys/devices/virtual/memory_tiering/memory_tierN/nodes
Date: August 2022
Contact: Linux memory management mailing list <linux-mm@kvack.org>
Description: Directory with details of a specific memory tier

This is the directory containing information about a particular
memory tier, memtierN, where N is derived based on abstract distance.

A smaller value of N implies a higher (faster) memory tier in the
hierarchy.

nodes: NUMA nodes that are part of this memory tier.

2 changes: 1 addition & 1 deletion Documentation/accounting/delay-accounting.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ a) waiting for a CPU (while being runnable)
b) completion of synchronous block I/O initiated by the task
c) swapping in pages
d) memory reclaim
e) thrashing page cache
e) thrashing
f) direct compact
g) write-protect copy

Expand Down
8 changes: 8 additions & 0 deletions Documentation/admin-guide/kernel-parameters.txt
Original file line number Diff line number Diff line change
Expand Up @@ -1469,6 +1469,14 @@
Permit 'security.evm' to be updated regardless of
current integrity status.

early_page_ext [KNL] Enforces page_ext initialization to earlier
stages so cover more early boot allocations.
Please note that as side effect some optimizations
might be disabled to achieve that (e.g. parallelized
memory initialization is disabled) so the boot process
might take longer, especially on systems with a lot of
memory. Available with CONFIG_PAGE_EXTENSION=y.

failslab=
fail_usercopy=
fail_page_alloc=
Expand Down
10 changes: 5 additions & 5 deletions Documentation/admin-guide/mm/cma_debugfs.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,10 @@ CMA Debugfs Interface
The CMA debugfs interface is useful to retrieve basic information out of the
different CMA areas and to test allocation/release in each of the areas.

Each CMA zone represents a directory under <debugfs>/cma/, indexed by the
kernel's CMA index. So the first CMA zone would be:
Each CMA area represents a directory under <debugfs>/cma/, represented by
its CMA name like below:

<debugfs>/cma/cma-0
<debugfs>/cma/<cma_name>

The structure of the files created under that directory is as follows:

Expand All @@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows:
- [RO] bitmap: The bitmap of page states in the zone.
- [WO] alloc: Allocate N pages from that CMA area. For example::

echo 5 > <debugfs>/cma/cma-2/alloc
echo 5 > <debugfs>/cma/<cma_name>/alloc

would try to allocate 5 pages from the cma-2 area.
would try to allocate 5 pages from the 'cma_name' area.

- [WO] free: Free N pages from that CMA area, similar to the above.
1 change: 1 addition & 0 deletions Documentation/admin-guide/mm/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,7 @@ the Linux memory management.
idle_page_tracking
ksm
memory-hotplug
multigen_lru
nommu-mmap
numa_memory_policy
numaperf
Expand Down
36 changes: 36 additions & 0 deletions Documentation/admin-guide/mm/ksm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -184,6 +184,42 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the
``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must
be increased accordingly.

Monitoring KSM profit
=====================

KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate a number of rmap_items to
save each scanned page's brief rmap information. Some of these pages may
be merged, but some may not be abled to be merged after being checked
several times, which are unprofitable memory consumed.

1) How to determine whether KSM save memory or consume memory in system-wide
range? Here is a simple approximate calculation for reference::

general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
sizeof(rmap_item);

where all_rmap_items can be easily obtained by summing ``pages_sharing``,
``pages_shared``, ``pages_unshared`` and ``pages_volatile``.

2) The KSM profit inner a single process can be similarly obtained by the
following approximate calculation::

process_profit =~ ksm_merging_pages * sizeof(page) -
ksm_rmap_items * sizeof(rmap_item).

where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``.

From the perspective of application, a high ratio of ``ksm_rmap_items`` to
``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
administrators have to rethink how to change madvise policy. Giving an example
for reference, a page's size is usually 4K, and the rmap_item's size is
separately 32B on 32-bit CPU architecture and 64B on 64-bit CPU architecture.
so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on 64-bit CPU
or exceeds 128 on 32-bit CPU, then the app's madvise policy should be dropped,
because the ksm profit is approximately zero or negative.

Monitoring KSM events
=====================

Expand Down
162 changes: 162 additions & 0 deletions Documentation/admin-guide/mm/multigen_lru.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,162 @@
.. SPDX-License-Identifier: GPL-2.0
=============
Multi-Gen LRU
=============
The multi-gen LRU is an alternative LRU implementation that optimizes
page reclaim and improves performance under memory pressure. Page
reclaim decides the kernel's caching policy and ability to overcommit
memory. It directly impacts the kswapd CPU usage and RAM efficiency.

Quick start
===========
Build the kernel with the following configurations.

* ``CONFIG_LRU_GEN=y``
* ``CONFIG_LRU_GEN_ENABLED=y``

All set!

Runtime options
===============
``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the
following subsections.

Kill switch
-----------
``enabled`` accepts different values to enable or disable the
following components. Its default value depends on
``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled
unless some of them have unforeseen side effects. Writing to
``enabled`` has no effect when a component is not supported by the
hardware, and valid values will be accepted even when the main switch
is off.

====== ===============================================================
Values Components
====== ===============================================================
0x0001 The main switch for the multi-gen LRU.
0x0002 Clearing the accessed bit in leaf page table entries in large
batches, when MMU sets it (e.g., on x86). This behavior can
theoretically worsen lock contention (mmap_lock). If it is
disabled, the multi-gen LRU will suffer a minor performance
degradation for workloads that contiguously map hot pages,
whose accessed bits can be otherwise cleared by fewer larger
batches.
0x0004 Clearing the accessed bit in non-leaf page table entries as
well, when MMU sets it (e.g., on x86). This behavior was not
verified on x86 varieties other than Intel and AMD. If it is
disabled, the multi-gen LRU will suffer a negligible
performance degradation.
[yYnN] Apply to all the components above.
====== ===============================================================

E.g.,
::

echo y >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0007
echo 5 >/sys/kernel/mm/lru_gen/enabled
cat /sys/kernel/mm/lru_gen/enabled
0x0005

Thrashing prevention
--------------------
Personal computers are more sensitive to thrashing because it can
cause janks (lags when rendering UI) and negatively impact user
experience. The multi-gen LRU offers thrashing prevention to the
majority of laptop and desktop users who do not have ``oomd``.

Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of
``N`` milliseconds from getting evicted. The OOM killer is triggered
if this working set cannot be kept in memory. In other words, this
option works as an adjustable pressure relief valve, and when open, it
terminates applications that are hopefully not being used.

Based on the average human detectable lag (~100ms), ``N=1000`` usually
eliminates intolerable janks due to thrashing. Larger values like
``N=3000`` make janks less noticeable at the risk of premature OOM
kills.

The default value ``0`` means disabled.

Experimental features
=====================
``/sys/kernel/debug/lru_gen`` accepts commands described in the
following subsections. Multiple command lines are supported, so does
concatenation with delimiters ``,`` and ``;``.

``/sys/kernel/debug/lru_gen_full`` provides additional stats for
debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from
evicted generations in this file.

Working set estimation
----------------------
Working set estimation measures how much memory an application needs
in a given time interval, and it is usually done with little impact on
the performance of the application. E.g., data centers want to
optimize job scheduling (bin packing) to improve memory utilizations.
When a new job comes in, the job scheduler needs to find out whether
each server it manages can allocate a certain amount of memory for
this new job before it can pick a candidate. To do so, the job
scheduler needs to estimate the working sets of the existing jobs.

When it is read, ``lru_gen`` returns a histogram of numbers of pages
accessed over different time intervals for each memcg and node.
``MAX_NR_GENS`` decides the number of bins for each histogram. The
histograms are noncumulative.
::

memcg memcg_id memcg_path
node node_id
min_gen_nr age_in_ms nr_anon_pages nr_file_pages
...
max_gen_nr age_in_ms nr_anon_pages nr_file_pages

Each bin contains an estimated number of pages that have been accessed
within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages
and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of
the former is the largest and that of the latter is the smallest.

Users can write the following command to ``lru_gen`` to create a new
generation ``max_gen_nr+1``:

``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]``

``can_swap`` defaults to the swap setting and, if it is set to ``1``,
it forces the scan of anon pages when swap is off, and vice versa.
``force_scan`` defaults to ``1`` and, if it is set to ``0``, it
employs heuristics to reduce the overhead, which is likely to reduce
the coverage as well.

A typical use case is that a job scheduler runs this command at a
certain time interval to create new generations, and it ranks the
servers it manages based on the sizes of their cold pages defined by
this time interval.

Proactive reclaim
-----------------
Proactive reclaim induces page reclaim when there is no memory
pressure. It usually targets cold pages only. E.g., when a new job
comes in, the job scheduler wants to proactively reclaim cold pages on
the server it selected, to improve the chance of successfully landing
this new job.

Users can write the following command to ``lru_gen`` to evict
generations less than or equal to ``min_gen_nr``.

``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]``

``min_gen_nr`` should be less than ``max_gen_nr-1``, since
``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to
the active list) and therefore cannot be evicted. ``swappiness``
overrides the default value in ``/proc/sys/vm/swappiness``.
``nr_to_reclaim`` limits the number of pages to evict.

A typical use case is that a job scheduler runs this command before it
tries to land a new job on a server. If it fails to materialize enough
cold pages because of the overestimation, it retries on the next
server according to the ranking result obtained from the working set
estimation step. This less forceful approach limits the impacts on the
existing jobs.
41 changes: 38 additions & 3 deletions Documentation/admin-guide/mm/userfaultfd.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick.
Design
======

Userfaults are delivered and resolved through the ``userfaultfd`` syscall.
Userspace creates a new userfaultfd, initializes it, and registers one or more
regions of virtual memory with it. Then, any page faults which occur within the
region(s) result in a message being delivered to the userfaultfd, notifying
userspace of the fault.

The ``userfaultfd`` (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:
Expand All @@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory
management of mremap/mprotect is that the userfaults in all their
operations never involve heavyweight structures like vmas (in fact the
``userfaultfd`` runtime load never takes the mmap_lock for writing).

Vmas are not suitable for page- (or hugepage) granular fault tracking
when dealing with virtual address spaces that could span
Terabytes. Too many vmas would be needed for that.

The ``userfaultfd`` once opened by invoking the syscall, can also be
The ``userfaultfd``, once created, can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware about what is going on
Expand All @@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``).
API
===

Creating a userfaultfd
----------------------

There are two ways to create a new userfaultfd, each of which provide ways to
restrict access to this functionality (since historically userfaultfds which
handle kernel page faults have been a useful tool for exploiting the kernel).

The first way, supported since userfaultfd was introduced, is the
userfaultfd(2) syscall. Access to this is controlled in several ways:

- Any user can always create a userfaultfd which traps userspace page faults
only. Such a userfaultfd can be created using the userfaultfd(2) syscall
with the flag UFFD_USER_MODE_ONLY.

- In order to also trap kernel page faults for the address space, either the
process needs the CAP_SYS_PTRACE capability, or the system must have
vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd
is set to 0.

The second way, added to the kernel more recently, is by opening
/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method
yields equivalent userfaultfds to the userfaultfd(2) syscall.

Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal
filesystem permissions (user/group/mode), which gives fine grained access to
userfaultfd specifically, without also granting other unrelated privileges at
the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access
to /dev/userfaultfd can always create userfaultfds that trap kernel page faults;
vm.unprivileged_userfaultfd is not considered.

Initializing a userfaultfd
--------------------------

When first opened the ``userfaultfd`` must be enabled invoking the
``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or
a later API version) which will specify the ``read/POLLIN`` protocol
Expand Down
11 changes: 11 additions & 0 deletions Documentation/admin-guide/sysctl/kernel.rst
Original file line number Diff line number Diff line change
Expand Up @@ -635,6 +635,17 @@ different types of memory (represented as different NUMA nodes) to
place the hot pages in the fast memory. This is implemented based on
unmapping and page fault too.

numa_balancing_promote_rate_limit_MBps
======================================

Too high promotion/demotion throughput between different memory types
may hurt application latency. This can be used to rate limit the
promotion throughput. The per-node max promotion throughput in MB/s
will be limited to be no more than the set value.

A rule of thumb is to set this to less than 1/10 of the PMEM node
write bandwidth.

oops_all_cpu_backtrace
======================

Expand Down
3 changes: 3 additions & 0 deletions Documentation/admin-guide/sysctl/vm.rst
Original file line number Diff line number Diff line change
Expand Up @@ -926,6 +926,9 @@ calls without any restrictions.

The default value is 0.

Another way to control permissions for userfaultfd is to use
/dev/userfaultfd instead of userfaultfd(2). See
Documentation/admin-guide/mm/userfaultfd.rst.

user_reserve_kbytes
===================
Expand Down
1 change: 1 addition & 0 deletions Documentation/core-api/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ Library functionality that is used throughout the kernel.
kref
assoc_array
xarray
maple_tree
idr
circular-buffers
rbtree
Expand Down
Loading

0 comments on commit ce3e5dd

Please sign in to comment.