Skip to content

Commit

Permalink
sched/numa-balancing: Move some document to make it consistent with t…
Browse files Browse the repository at this point in the history
…he code

After commit 8a99b68 ("sched: Move SCHED_DEBUG sysctl to
debugfs"), some NUMA balancing sysctls enclosed with SCHED_DEBUG has
been moved to debugfs.  This patch move the document for these
sysctls from

  Documentation/admin-guide/sysctl/kernel.rst

to

  Documentation/scheduler/sched-debug.rst

to make the document consistent with the code.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20220210052514.3038279-1-ying.huang@intel.com
  • Loading branch information
Huang Ying authored and Peter Zijlstra committed Feb 11, 2022
1 parent e496132 commit 3624ba7
Show file tree
Hide file tree
Showing 3 changed files with 56 additions and 45 deletions.
46 changes: 1 addition & 45 deletions Documentation/admin-guide/sysctl/kernel.rst
Original file line number Diff line number Diff line change
Expand Up @@ -609,51 +609,7 @@ be migrated to a local memory node.
The unmapping of pages and trapping faults incur additional overhead that
ideally is offset by improved memory locality but there is no universal
guarantee. If the target workload is already bound to NUMA nodes then this
feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the `numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
numa_balancing_scan_size_mb`_, and numa_balancing_settle_count sysctls.


numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
===============================================================================================================================


Automatic NUMA balancing scans tasks address space and unmaps pages to
detect if pages are properly placed or if the data should be migrated to a
memory node local to where the task is running. Every "scan delay" the task
scans the next "scan size" number of pages in its address space. When the
end of the address space is reached the scanner restarts from the beginning.

In combination, the "scan delay" and "scan size" determine the scan rate.
When "scan delay" decreases, the scan rate increases. The scan delay and
hence the scan rate of every task is adaptive and depends on historical
behaviour. If pages are properly placed then the scan delay increases,
otherwise the scan delay decreases. The "scan size" is not adaptive but
the higher the "scan size", the higher the scan rate.

Higher scan rates incur higher system overhead as page faults must be
trapped and potentially data must be migrated. However, the higher the scan
rate, the more quickly a tasks memory is migrated to a local node if the
workload pattern changes and minimises performance impact due to remote
memory accesses. These sysctls control the thresholds for scan delays and
the number of pages scanned.

``numa_balancing_scan_period_min_ms`` is the minimum time in milliseconds to
scan a tasks virtual memory. It effectively controls the maximum scanning
rate for each task.

``numa_balancing_scan_delay_ms`` is the starting "scan delay" used for a task
when it initially forks.

``numa_balancing_scan_period_max_ms`` is the maximum time in milliseconds to
scan a tasks virtual memory. It effectively controls the minimum scanning
rate for each task.

``numa_balancing_scan_size_mb`` is how many megabytes worth of pages are
scanned for a given scan.

feature should be disabled.

oops_all_cpu_backtrace
======================
Expand Down
1 change: 1 addition & 0 deletions Documentation/scheduler/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ Linux Scheduler
sched-nice-design
sched-rt-group
sched-stats
sched-debug

text_files

Expand Down
54 changes: 54 additions & 0 deletions Documentation/scheduler/sched-debug.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
=================
Scheduler debugfs
=================

Booting a kernel with CONFIG_SCHED_DEBUG=y will give access to
scheduler specific debug files under /sys/kernel/debug/sched. Some of
those files are described below.

numa_balancing
==============

`numa_balancing` directory is used to hold files to control NUMA
balancing feature. If the system overhead from the feature is too
high then the rate the kernel samples for NUMA hinting faults may be
controlled by the `scan_period_min_ms, scan_delay_ms,
scan_period_max_ms, scan_size_mb` files.


scan_period_min_ms, scan_delay_ms, scan_period_max_ms, scan_size_mb
-------------------------------------------------------------------

Automatic NUMA balancing scans tasks address space and unmaps pages to
detect if pages are properly placed or if the data should be migrated to a
memory node local to where the task is running. Every "scan delay" the task
scans the next "scan size" number of pages in its address space. When the
end of the address space is reached the scanner restarts from the beginning.

In combination, the "scan delay" and "scan size" determine the scan rate.
When "scan delay" decreases, the scan rate increases. The scan delay and
hence the scan rate of every task is adaptive and depends on historical
behaviour. If pages are properly placed then the scan delay increases,
otherwise the scan delay decreases. The "scan size" is not adaptive but
the higher the "scan size", the higher the scan rate.

Higher scan rates incur higher system overhead as page faults must be
trapped and potentially data must be migrated. However, the higher the scan
rate, the more quickly a tasks memory is migrated to a local node if the
workload pattern changes and minimises performance impact due to remote
memory accesses. These files control the thresholds for scan delays and
the number of pages scanned.

``scan_period_min_ms`` is the minimum time in milliseconds to scan a
tasks virtual memory. It effectively controls the maximum scanning
rate for each task.

``scan_delay_ms`` is the starting "scan delay" used for a task when it
initially forks.

``scan_period_max_ms`` is the maximum time in milliseconds to scan a
tasks virtual memory. It effectively controls the minimum scanning
rate for each task.

``scan_size_mb`` is how many megabytes worth of pages are scanned for
a given scan.

0 comments on commit 3624ba7

Please sign in to comment.