diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup index a4e31c4651944..b446a7154a1b5 100644 --- a/Documentation/ABI/testing/procfs-smaps_rollup +++ b/Documentation/ABI/testing/procfs-smaps_rollup @@ -22,6 +22,7 @@ Description: MMUPageSize: 4 kB Rss: 884 kB Pss: 385 kB + Pss_Dirty: 68 kB Pss_Anon: 301 kB Pss_File: 80 kB Pss_Shmem: 4 kB diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-ksm b/Documentation/ABI/testing/sysfs-kernel-mm-ksm index 1c9bed5595f5d..d244674a94806 100644 --- a/Documentation/ABI/testing/sysfs-kernel-mm-ksm +++ b/Documentation/ABI/testing/sysfs-kernel-mm-ksm @@ -41,7 +41,7 @@ Description: Kernel Samepage Merging daemon sysfs interface sleep_millisecs: how many milliseconds ksm should sleep between scans. - See Documentation/vm/ksm.rst for more information. + See Documentation/mm/ksm.rst for more information. What: /sys/kernel/mm/ksm/merge_across_nodes Date: January 2013 diff --git a/Documentation/ABI/testing/sysfs-kernel-slab b/Documentation/ABI/testing/sysfs-kernel-slab index c440f4946e121..cd5fb8fa3ddfb 100644 --- a/Documentation/ABI/testing/sysfs-kernel-slab +++ b/Documentation/ABI/testing/sysfs-kernel-slab @@ -37,7 +37,7 @@ Description: The alloc_calls file is read-only and lists the kernel code locations from which allocations for this cache were performed. The alloc_calls file only contains information if debugging is - enabled for that cache (see Documentation/vm/slub.rst). + enabled for that cache (see Documentation/mm/slub.rst). What: /sys/kernel/slab//alloc_fastpath Date: February 2008 @@ -219,7 +219,7 @@ Contact: Pekka Enberg , Description: The free_calls file is read-only and lists the locations of object frees if slab debugging is enabled (see - Documentation/vm/slub.rst). + Documentation/mm/slub.rst). What: /sys/kernel/slab//free_fastpath Date: February 2008 diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 176298f2f4def..ad9ba3ec90a5d 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1433,6 +1433,24 @@ PAGE_SIZE multiple when read back. workingset_nodereclaim Number of times a shadow node has been reclaimed + pgscan (npn) + Amount of scanned pages (in an inactive LRU list) + + pgsteal (npn) + Amount of reclaimed pages + + pgscan_kswapd (npn) + Amount of scanned pages by kswapd (in an inactive LRU list) + + pgscan_direct (npn) + Amount of scanned pages directly (in an inactive LRU list) + + pgsteal_kswapd (npn) + Amount of reclaimed pages by kswapd + + pgsteal_direct (npn) + Amount of reclaimed pages directly + pgfault (npn) Total number of page faults incurred @@ -1442,12 +1460,6 @@ PAGE_SIZE multiple when read back. pgrefill (npn) Amount of scanned pages (in an active LRU list) - pgscan (npn) - Amount of scanned pages (in an inactive LRU list) - - pgsteal (npn) - Amount of reclaimed pages - pgactivate (npn) Amount of pages moved to the active LRU list diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 89294b9fd9f42..839eb34e74898 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1732,9 +1732,11 @@ Built with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON=y, the default is on. - This is not compatible with memory_hotplug.memmap_on_memory. - If both parameters are enabled, hugetlb_free_vmemmap takes - precedence over memory_hotplug.memmap_on_memory. 
+ Note that the vmemmap pages may be allocated from the added + memory block itself when memory_hotplug.memmap_on_memory is + enabled; those vmemmap pages cannot be optimized even if this + feature is enabled. Other vmemmap pages not allocated from + the added memory block itself are not affected. hung_task_panic= [KNL] Should the hung task detector generate panics. @@ -3093,10 +3095,12 @@ [KNL,X86,ARM] Boolean flag to enable this feature. Format: {on | off (default)} When enabled, runtime hotplugged memory will - allocate its internal metadata (struct pages) - from the hotadded memory which will allow to - hotadd a lot of memory without requiring - additional memory to do so. + allocate its internal metadata (struct pages, + those vmemmap pages cannot be optimized even + if hugetlb_free_vmemmap is enabled) from the + hotadded memory which will allow to hotadd a + lot of memory without requiring additional + memory to do so. This feature is disabled by default because it has some implication on large (e.g. GB) allocations in some configurations (e.g. small @@ -3106,10 +3110,6 @@ Note that even when enabled, there are a few cases where the feature is not effective. - This is not compatible with hugetlb_free_vmemmap. If - both parameters are enabled, hugetlb_free_vmemmap takes - precedence over memory_hotplug.memmap_on_memory. - memtest= [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest Format: default : 0 @@ -5523,7 +5523,7 @@ cache (risks via metadata attacks are mostly unchanged). Debug options disable merging on their own. - For more information see Documentation/vm/slub.rst. + For more information see Documentation/mm/slub.rst. slab_max_order= [MM, SLAB] Determines the maximum allowed order for slabs. @@ -5537,13 +5537,13 @@ slub_debug can create guard zones around objects and may poison objects when not in use. Also tracks the last alloc / free. For more information see - Documentation/vm/slub.rst. + Documentation/mm/slub.rst. slub_max_order= [MM, SLUB] Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. For more information see - Documentation/vm/slub.rst. + Documentation/mm/slub.rst. slub_min_objects= [MM, SLUB] The minimum number of objects per slab. SLUB will @@ -5552,12 +5552,12 @@ the number of objects indicated. The higher the number of objects the smaller the overhead of tracking slabs and the less frequently locks need to be acquired. - For more information see Documentation/vm/slub.rst. + For more information see Documentation/mm/slub.rst. slub_min_order= [MM, SLUB] Determines the minimum page order for slabs. Must be lower than slub_max_order. - For more information see Documentation/vm/slub.rst. + For more information see Documentation/mm/slub.rst. slub_merge [MM, SLUB] Same with slab_merge. diff --git a/Documentation/admin-guide/mm/concepts.rst b/Documentation/admin-guide/mm/concepts.rst index b966fcff993b2..c79f1e3362220 100644 --- a/Documentation/admin-guide/mm/concepts.rst +++ b/Documentation/admin-guide/mm/concepts.rst @@ -125,7 +125,7 @@ processor. Each bank is referred to as a `node` and for each node Linux constructs an independent memory management subsystem. A node has its own set of zones, lists of free and used pages and various statistics counters. You can find more details about NUMA in -:ref:`Documentation/vm/numa.rst ` and in +:ref:`Documentation/mm/numa.rst ` and in :ref:`Documentation/admin-guide/mm/numa_memory_policy.rst `.
Page cache diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 61aff88347f3c..05500042f7776 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -4,7 +4,7 @@ Monitoring Data Accesses ======================== -:doc:`DAMON ` allows light-weight data access monitoring. +:doc:`DAMON ` allows light-weight data access monitoring. Using DAMON, users can analyze the memory access patterns of their systems and optimize those. @@ -14,3 +14,4 @@ optimize those. start usage reclaim + lru_sort diff --git a/Documentation/admin-guide/mm/damon/lru_sort.rst b/Documentation/admin-guide/mm/damon/lru_sort.rst new file mode 100644 index 0000000000000..c09cace806516 --- /dev/null +++ b/Documentation/admin-guide/mm/damon/lru_sort.rst @@ -0,0 +1,294 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================= +DAMON-based LRU-lists Sorting +============================= + +DAMON-based LRU-lists Sorting (DAMON_LRU_SORT) is a static kernel module that +is aimed to be used for proactive and lightweight data access pattern based +(de)prioritization of pages on their LRU-lists, for making LRU-lists a more +trustworthy data access pattern source. + +Where Proactive LRU-lists Sorting is Required? +============================================== + +As page-granularity access checking overhead could be significant on huge +systems, LRU lists are normally not proactively sorted but partially and +reactively sorted for special events including specific user requests, system +calls and memory pressure. As a result, LRU lists are sometimes not so +perfectly prepared to be used as a trustworthy access pattern source for some +situations including reclamation target pages selection under sudden memory +pressure. + +Because DAMON can identify access patterns of best-effort accuracy while +inducing only a user-specified range of overhead, proactively running +DAMON_LRU_SORT could be helpful for making LRU lists a more trustworthy access +pattern source with low and controlled overhead. + +How It Works? +============= + +DAMON_LRU_SORT finds hot pages (pages of memory regions that show access +rates higher than a user-specified threshold) and cold pages (pages of +memory regions that show no access for longer than a +user-specified threshold) using DAMON, and prioritizes hot pages while +deprioritizing cold pages on their LRU-lists. To avoid it consuming too much +CPU for the prioritizations, a CPU time usage limit can be configured. Under +the limit, it prioritizes the hotter pages and deprioritizes the colder pages +first. System administrators can also configure under what situation +this scheme should be automatically activated and deactivated with three memory +pressure watermarks. + +Its default parameters for the hotness/coldness thresholds and the CPU quota +limit are conservatively chosen. That is, the module under its default +parameters could be widely used without harm for common situations, while +providing a level of benefit for systems having clear hot/cold access patterns +under memory pressure, consuming only a small and limited portion of CPU time. + +Interface: Module Parameters +============================ + +To use this feature, you should first ensure your system is running on a kernel +that is built with ``CONFIG_DAMON_LRU_SORT=y``. + +To let sysadmins enable or disable it and tune it for the given system, +DAMON_LRU_SORT utilizes module parameters.
That is, you can put +``damon_lru_sort.=`` on the kernel boot command line or write +proper values to ``/sys/module/damon_lru_sort/parameters/`` files. + +Below are the descriptions of each parameter. + +enabled +------- + +Enable or disable DAMON_LRU_SORT. + +You can enable DAMON_LRU_SORT by setting the value of this parameter as ``Y``. +Setting it as ``N`` disables DAMON_LRU_SORT. Note that DAMON_LRU_SORT could do +no real monitoring and LRU-lists sorting due to the watermarks-based activation +condition. Refer to the descriptions of the watermark parameters below for this. + +commit_inputs +------------- + +Make DAMON_LRU_SORT read the input parameters again, except ``enabled``. + +Input parameters that are updated while DAMON_LRU_SORT is running are not applied +by default. Once this parameter is set as ``Y``, DAMON_LRU_SORT reads the values +of the parameters except ``enabled`` again. Once the re-reading is done, this +parameter is set as ``N``. If invalid parameters are found during the +re-reading, DAMON_LRU_SORT will be disabled. + +hot_thres_access_freq +--------------------- + +Access frequency threshold for hot memory regions identification in permil. + +If a memory region is accessed at this frequency or higher, DAMON_LRU_SORT +identifies the region as hot, and marks it as accessed on the LRU list, so that +it would not be reclaimed under memory pressure. 50% by default. + +cold_min_age +------------ + +Time threshold for cold memory regions identification in microseconds. + +If a memory region is not accessed for this time or longer, DAMON_LRU_SORT +identifies the region as cold, and marks it as unaccessed on the LRU list, so +that it could be reclaimed first under memory pressure. 120 seconds by +default. + +quota_ms +-------- + +Limit of time for trying the LRU lists sorting in milliseconds. + +DAMON_LRU_SORT tries to use only up to this time within a time window +(quota_reset_interval_ms) for trying LRU lists sorting. This can be used +for limiting CPU consumption of DAMON_LRU_SORT. If the value is zero, the +limit is disabled. + +10 ms by default. + +quota_reset_interval_ms +----------------------- + +The time quota charge reset interval in milliseconds. + +The charge reset interval for the quota of time (quota_ms). That is, +DAMON_LRU_SORT does not try LRU-lists sorting for more than quota_ms +milliseconds or quota_sz bytes within quota_reset_interval_ms milliseconds. + +1 second by default. + +wmarks_interval +--------------- + +The watermarks check time interval in microseconds. + +Minimal time to wait before checking the watermarks, when DAMON_LRU_SORT is +enabled but inactive due to its watermarks rule. 5 seconds by default. + +wmarks_high +----------- + +Free memory rate (per thousand) for the high watermark. + +If free memory of the system in bytes per thousand bytes is higher than this, +DAMON_LRU_SORT becomes inactive, so it does nothing but periodically check the +watermarks. 200 (20%) by default. + +wmarks_mid +---------- + +Free memory rate (per thousand) for the middle watermark. + +If free memory of the system in bytes per thousand bytes is between this and +the low watermark, DAMON_LRU_SORT becomes active, so it starts the monitoring and +the LRU-lists sorting. 150 (15%) by default. + +wmarks_low +---------- + +Free memory rate (per thousand) for the low watermark. + +If free memory of the system in bytes per thousand bytes is lower than this, +DAMON_LRU_SORT becomes inactive, so it does nothing but periodically check the +watermarks. 50 (5%) by default.
+ +sample_interval +--------------- + +Sampling interval for the monitoring in microseconds. + +The sampling interval of DAMON for the cold memory monitoring. Please refer to +the DAMON documentation (:doc:`usage`) for more detail. 5ms by default. + +aggr_interval +------------- + +Aggregation interval for the monitoring in microseconds. + +The aggregation interval of DAMON for the cold memory monitoring. Please +refer to the DAMON documentation (:doc:`usage`) for more detail. 100ms by +default. + +min_nr_regions +-------------- + +Minimum number of monitoring regions. + +The minimal number of monitoring regions of DAMON for the cold memory +monitoring. This can be used to set a lower-bound of the monitoring quality. +But, setting this too high could result in increased monitoring overhead. +Please refer to the DAMON documentation (:doc:`usage`) for more detail. 10 by +default. + +max_nr_regions +-------------- + +Maximum number of monitoring regions. + +The maximum number of monitoring regions of DAMON for the cold memory +monitoring. This can be used to set an upper-bound of the monitoring overhead. +However, setting this too low could result in bad monitoring quality. Please +refer to the DAMON documentation (:doc:`usage`) for more detail. 1000 by +default. + +monitor_region_start +-------------------- + +Start of target memory region in physical address. + +The start physical address of the memory region that DAMON_LRU_SORT will do work +against. By default, the biggest System RAM is used as the region. + +monitor_region_end +------------------ + +End of target memory region in physical address. + +The end physical address of the memory region that DAMON_LRU_SORT will do work +against. By default, the biggest System RAM is used as the region. + +kdamond_pid +----------- + +PID of the DAMON thread. + +If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread. Else, +-1. + +nr_lru_sort_tried_hot_regions +----------------------------- + +Number of hot memory regions for which LRU-sorting was tried. + +bytes_lru_sort_tried_hot_regions +-------------------------------- + +Total bytes of hot memory regions for which LRU-sorting was tried. + +nr_lru_sorted_hot_regions +------------------------- + +Number of hot memory regions that were successfully LRU-sorted. + +bytes_lru_sorted_hot_regions +---------------------------- + +Total bytes of hot memory regions that were successfully LRU-sorted. + +nr_hot_quota_exceeds +-------------------- + +Number of times that the time quota limit for hot regions has been exceeded. + +nr_lru_sort_tried_cold_regions +------------------------------ + +Number of cold memory regions for which LRU-sorting was tried. + +bytes_lru_sort_tried_cold_regions +--------------------------------- + +Total bytes of cold memory regions for which LRU-sorting was tried. + +nr_lru_sorted_cold_regions +-------------------------- + +Number of cold memory regions that were successfully LRU-sorted. + +bytes_lru_sorted_cold_regions +----------------------------- + +Total bytes of cold memory regions that were successfully LRU-sorted. + +nr_cold_quota_exceeds +--------------------- + +Number of times that the time quota limit for cold regions has been exceeded. + +Example +======= + +The runtime example commands below make DAMON_LRU_SORT find memory regions +having >=50% access frequency and LRU-prioritize those, while LRU-deprioritizing +memory regions that are not accessed for 120 seconds.
The prioritization and +deprioritization are limited to using only up to 1% of the CPU time, to avoid +DAMON_LRU_SORT consuming too much CPU time for the (de)prioritization. It also +asks DAMON_LRU_SORT to do nothing if the system's free memory rate is more than +50%, but to start the real work if it becomes lower than 40%. If DAMON_LRU_SORT +doesn't make progress and therefore the free memory rate becomes lower than +20%, it asks DAMON_LRU_SORT to do nothing again, so that we can fall back to +the LRU-list based page granularity reclamation. :: + + # cd /sys/module/damon_lru_sort/parameters + # echo 500 > hot_thres_access_freq + # echo 120000000 > cold_min_age + # echo 10 > quota_ms + # echo 1000 > quota_reset_interval_ms + # echo 500 > wmarks_high + # echo 400 > wmarks_mid + # echo 200 > wmarks_low + # echo Y > enabled diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst index 46306f1f34b1a..4f1479a11e634 100644 --- a/Documentation/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/admin-guide/mm/damon/reclaim.rst @@ -48,12 +48,6 @@ DAMON_RECLAIM utilizes module parameters. That is, you can put ``damon_reclaim.=`` on the kernel boot command line or write proper values to ``/sys/modules/damon_reclaim/parameters/`` files. -Note that the parameter values except ``enabled`` are applied only when -DAMON_RECLAIM starts. Therefore, if you want to apply new parameter values in -runtime and DAMON_RECLAIM is already enabled, you should disable and re-enable -it via ``enabled`` parameter file. Writing of the new values to proper -parameter values should be done before the re-enablement. - Below are the description of each parameter. enabled @@ -268,4 +262,4 @@ granularity reclamation. :: .. [1] https://research.google/pubs/pub48551/ .. [2] https://lwn.net/Articles/787611/ -.. [3] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html +.. [3] https://www.kernel.org/doc/html/latest/mm/free_page_reporting.html diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 1bb7b72414b24..d52f572a90298 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -30,11 +30,11 @@ DAMON provides below interfaces for different users. `. This will be removed after next LTS kernel is released, so users should move to the :ref:`sysfs interface `. - *Kernel Space Programming Interface.* - :doc:`This ` is for kernel space programmers. Using this, + :doc:`This ` is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by writing kernel space DAMON application programs for you. You can even extend DAMON for various address spaces. For detail, please refer to the interface - :doc:`document `. + :doc:`document `. .. _sysfs_interface: @@ -185,7 +185,7 @@ controls the monitoring overhead, exist. You can set and get the values by writing to and rading from the files. For more details about the intervals and monitoring regions range, please refer -to the Design document (:doc:`/vm/damon/design`). +to the Design document (:doc:`/mm/damon/design`). contexts//targets/ --------------------- @@ -264,6 +264,8 @@ that can be written to and read from the file and their meaning are as below.
- ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT`` - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE`` - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE`` + - ``lru_prio``: Prioritize the region on its LRU lists. + - ``lru_deprio``: Deprioritize the region on its LRU lists. - ``stat``: Do nothing but count the statistics schemes//access_pattern/ --------------------------- @@ -402,7 +404,7 @@ Attributes Users can get and set the ``sampling interval``, ``aggregation interval``, ``update interval``, and min/max number of monitoring target regions by reading from and writing to the ``attrs`` file. To know about the monitoring -attributes in detail, please refer to the :doc:`/vm/damon/design`. For +attributes in detail, please refer to the :doc:`/mm/damon/design`. For example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and 1000, and then check it again:: diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index c21b5823f1261..1bd11118dfb1c 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -36,6 +36,7 @@ the Linux memory management. numa_memory_policy numaperf pagemap + shrinker_debugfs soft-dirty swap_numa transhuge diff --git a/Documentation/admin-guide/mm/shrinker_debugfs.rst b/Documentation/admin-guide/mm/shrinker_debugfs.rst new file mode 100644 index 0000000000000..3887f0b294fe8 --- /dev/null +++ b/Documentation/admin-guide/mm/shrinker_debugfs.rst @@ -0,0 +1,135 @@ +.. _shrinker_debugfs: + +========================== +Shrinker Debugfs Interface +========================== + +The shrinker debugfs interface provides visibility into the kernel memory +shrinkers subsystem and allows getting information about individual shrinkers +and interacting with them. + +For each shrinker registered in the system, a directory in **/shrinker/** +is created. The directory's name is composed of the shrinker's name and a +unique id: e.g. *kfree_rcu-0* or *sb-xfs:vda1-36*. + +Each shrinker directory contains **count** and **scan** files, which allow +triggering the *count_objects()* and *scan_objects()* callbacks for each memcg and +numa node (if applicable). + +Usage: +------ + +1. *List registered shrinkers* + + :: + + $ cd /sys/kernel/debug/shrinker/ + $ ls + dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42 + mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43 + mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44 + rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49 + sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13 + sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36 + sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19 + sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10 + sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9 + sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37 + sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38 + sb-dax-11 sb-proc-45 sb-tmpfs-35 + sb-debugfs-7 sb-proc-46 sb-tmpfs-40 + +2. *Get information about a specific shrinker* + + :: + + $ cd sb-btrfs\:vda2-24/ + $ ls + count scan + +3. *Count objects* + + Each line in the output has the following format:: + + ... + ... + ... + + If there are no objects on any of the numa nodes, the line is omitted. If there + are no objects at all, the output might be empty. + + If the shrinker is not memcg-aware or CONFIG_MEMCG is off, 0 is printed + as the cgroup inode id. If the shrinker is not numa-aware, 0's are printed + for all nodes except the first one.
+ :: + + $ cat count + 1 224 2 + 21 98 0 + 55 818 10 + 2367 2 0 + 2401 30 0 + 225 13 0 + 599 35 0 + 939 124 0 + 1041 3 0 + 1075 1 0 + 1109 1 0 + 1279 60 0 + 1313 7 0 + 1347 39 0 + 1381 3 0 + 1449 14 0 + 1483 63 0 + 1517 53 0 + 1551 6 0 + 1585 1 0 + 1619 6 0 + 1653 40 0 + 1687 11 0 + 1721 8 0 + 1755 4 0 + 1789 52 0 + 1823 888 0 + 1857 1 0 + 1925 2 0 + 1959 32 0 + 2027 22 0 + 2061 9 0 + 2469 799 0 + 2537 861 0 + 2639 1 0 + 2707 70 0 + 2775 4 0 + 2877 84 0 + 293 1 0 + 735 8 0 + +4. *Scan objects* + + The expected input format:: + + + + For a non-memcg-aware shrinker or on a system with no memory + cgroups, **0** should be passed as the cgroup id. + :: + + $ cd /sys/kernel/debug/shrinker/ + $ cd sb-btrfs\:vda2-24/ + + $ cat count | head -n 5 + 1 212 0 + 21 97 0 + 55 802 5 + 2367 2 0 + 225 13 0 + + $ echo "55 0 200" > scan + + $ cat count | head -n 5 + 1 212 0 + 21 96 0 + 55 752 5 + 2367 2 0 + 225 13 0 diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 5c9aa171a0d34..f74f722ad7028 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -565,9 +565,8 @@ See Documentation/admin-guide/mm/hugetlbpage.rst hugetlb_optimize_vmemmap ======================== -This knob is not available when memory_hotplug.memmap_on_memory (kernel parameter) -is configured or the size of 'struct page' (a structure defined in -include/linux/mm_types.h) is not power of two (an unusual system config could +This knob is not available when the size of 'struct page' (a structure defined +in include/linux/mm_types.h) is not a power of two (an unusual system config could result in this). Enable (set to 1) or disable (set to 0) the feature of optimizing vmemmap pages @@ -760,7 +759,7 @@ and don't use much of it. The default value is 0. -See Documentation/vm/overcommit-accounting.rst and +See Documentation/mm/overcommit-accounting.rst and mm/util.c::__vm_enough_memory() for more information. diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst index 726065a3095e0..dc95df462eea2 100644 --- a/Documentation/core-api/index.rst +++ b/Documentation/core-api/index.rst @@ -86,7 +86,7 @@ Memory management ================= How to allocate and use memory in the kernel. Note that there is a lot -more memory-management documentation in Documentation/vm/index.rst. +more memory-management documentation in Documentation/mm/index.rst. .. toctree:: :maxdepth: 1 diff --git a/Documentation/dev-tools/kmemleak.rst b/Documentation/dev-tools/kmemleak.rst index 1c935f41cd3a2..5483fd39ef295 100644 --- a/Documentation/dev-tools/kmemleak.rst +++ b/Documentation/dev-tools/kmemleak.rst @@ -174,7 +174,6 @@ mapping: - ``kmemleak_alloc_phys`` - ``kmemleak_free_part_phys`` -- ``kmemleak_not_leak_phys`` - ``kmemleak_ignore_phys`` Dealing with false positives/negatives diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index 1bc91fb8c321a..0b5120ff506c8 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -448,6 +448,7 @@ Memory Area, or VMA) there is a series of lines such as the following:: MMUPageSize: 4 kB Rss: 892 kB Pss: 374 kB + Pss_Dirty: 0 kB Shared_Clean: 892 kB Shared_Dirty: 0 kB Private_Clean: 0 kB @@ -479,7 +480,9 @@ dirty shared and private pages in the mapping. The "proportional set size" (PSS) of a process is the count of pages it has in memory, where each page is divided by the number of processes sharing it.
So if a process has 1000 pages all to itself, and 1000 shared with one other -process, its PSS will be 1500. +process, its PSS will be 1500. "Pss_Dirty" is the portion of PSS which +consists of dirty pages. ("Pss_Clean" is not included, but it can be +calculated by subtracting "Pss_Dirty" from "Pss".) Note that even a page which is part of a MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used by only one process, is accounted @@ -1109,7 +1112,7 @@ CommitLimit yield a CommitLimit of 7.3G. For more details, see the memory overcommit documentation - in vm/overcommit-accounting. + in mm/overcommit-accounting. Committed_AS The amount of memory presently allocated on the system. The committed memory is a sum of all of the memory which diff --git a/Documentation/index.rst b/Documentation/index.rst index 35d90903242aa..00722aa20cd79 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -129,7 +129,7 @@ needed). sound/index crypto/index filesystems/index - vm/index + mm/index bpf/index usb/index PCI/index diff --git a/Documentation/vm/active_mm.rst b/Documentation/mm/active_mm.rst similarity index 100% rename from Documentation/vm/active_mm.rst rename to Documentation/mm/active_mm.rst diff --git a/Documentation/vm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst similarity index 100% rename from Documentation/vm/arch_pgtable_helpers.rst rename to Documentation/mm/arch_pgtable_helpers.rst diff --git a/Documentation/vm/balance.rst b/Documentation/mm/balance.rst similarity index 100% rename from Documentation/vm/balance.rst rename to Documentation/mm/balance.rst diff --git a/Documentation/vm/bootmem.rst b/Documentation/mm/bootmem.rst similarity index 100% rename from Documentation/vm/bootmem.rst rename to Documentation/mm/bootmem.rst diff --git a/Documentation/vm/damon/api.rst b/Documentation/mm/damon/api.rst similarity index 100% rename from Documentation/vm/damon/api.rst rename to Documentation/mm/damon/api.rst diff --git a/Documentation/vm/damon/design.rst b/Documentation/mm/damon/design.rst similarity index 100% rename from Documentation/vm/damon/design.rst rename to Documentation/mm/damon/design.rst diff --git a/Documentation/vm/damon/faq.rst b/Documentation/mm/damon/faq.rst similarity index 100% rename from Documentation/vm/damon/faq.rst rename to Documentation/mm/damon/faq.rst diff --git a/Documentation/vm/damon/index.rst b/Documentation/mm/damon/index.rst similarity index 100% rename from Documentation/vm/damon/index.rst rename to Documentation/mm/damon/index.rst diff --git a/Documentation/vm/free_page_reporting.rst b/Documentation/mm/free_page_reporting.rst similarity index 100% rename from Documentation/vm/free_page_reporting.rst rename to Documentation/mm/free_page_reporting.rst diff --git a/Documentation/vm/frontswap.rst b/Documentation/mm/frontswap.rst similarity index 100% rename from Documentation/vm/frontswap.rst rename to Documentation/mm/frontswap.rst diff --git a/Documentation/vm/highmem.rst b/Documentation/mm/highmem.rst similarity index 100% rename from Documentation/vm/highmem.rst rename to Documentation/mm/highmem.rst diff --git a/Documentation/vm/hmm.rst b/Documentation/mm/hmm.rst similarity index 100% rename from Documentation/vm/hmm.rst rename to Documentation/mm/hmm.rst diff --git a/Documentation/vm/hugetlbfs_reserv.rst b/Documentation/mm/hugetlbfs_reserv.rst similarity index 100% rename from Documentation/vm/hugetlbfs_reserv.rst rename to Documentation/mm/hugetlbfs_reserv.rst diff --git 
a/Documentation/vm/hwpoison.rst b/Documentation/mm/hwpoison.rst similarity index 100% rename from Documentation/vm/hwpoison.rst rename to Documentation/mm/hwpoison.rst diff --git a/Documentation/vm/index.rst b/Documentation/mm/index.rst similarity index 100% rename from Documentation/vm/index.rst rename to Documentation/mm/index.rst diff --git a/Documentation/vm/ksm.rst b/Documentation/mm/ksm.rst similarity index 100% rename from Documentation/vm/ksm.rst rename to Documentation/mm/ksm.rst diff --git a/Documentation/vm/memory-model.rst b/Documentation/mm/memory-model.rst similarity index 99% rename from Documentation/vm/memory-model.rst rename to Documentation/mm/memory-model.rst index 30e8fbed6914a..3779e562dc76e 100644 --- a/Documentation/vm/memory-model.rst +++ b/Documentation/mm/memory-model.rst @@ -170,7 +170,7 @@ The users of `ZONE_DEVICE` are: * hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()` event callbacks to allow a device-driver to coordinate memory management events related to device-memory, typically GPU memory. See - Documentation/vm/hmm.rst. + Documentation/mm/hmm.rst. * p2pdma: Create `struct page` objects to allow peer devices in a PCI/-E topology to coordinate direct-DMA operations between themselves, diff --git a/Documentation/vm/mmu_notifier.rst b/Documentation/mm/mmu_notifier.rst similarity index 100% rename from Documentation/vm/mmu_notifier.rst rename to Documentation/mm/mmu_notifier.rst diff --git a/Documentation/vm/numa.rst b/Documentation/mm/numa.rst similarity index 100% rename from Documentation/vm/numa.rst rename to Documentation/mm/numa.rst diff --git a/Documentation/vm/oom.rst b/Documentation/mm/oom.rst similarity index 100% rename from Documentation/vm/oom.rst rename to Documentation/mm/oom.rst diff --git a/Documentation/vm/overcommit-accounting.rst b/Documentation/mm/overcommit-accounting.rst similarity index 100% rename from Documentation/vm/overcommit-accounting.rst rename to Documentation/mm/overcommit-accounting.rst diff --git a/Documentation/vm/page_allocation.rst b/Documentation/mm/page_allocation.rst similarity index 100% rename from Documentation/vm/page_allocation.rst rename to Documentation/mm/page_allocation.rst diff --git a/Documentation/vm/page_cache.rst b/Documentation/mm/page_cache.rst similarity index 100% rename from Documentation/vm/page_cache.rst rename to Documentation/mm/page_cache.rst diff --git a/Documentation/vm/page_frags.rst b/Documentation/mm/page_frags.rst similarity index 100% rename from Documentation/vm/page_frags.rst rename to Documentation/mm/page_frags.rst diff --git a/Documentation/vm/page_migration.rst b/Documentation/mm/page_migration.rst similarity index 100% rename from Documentation/vm/page_migration.rst rename to Documentation/mm/page_migration.rst diff --git a/Documentation/vm/page_owner.rst b/Documentation/mm/page_owner.rst similarity index 100% rename from Documentation/vm/page_owner.rst rename to Documentation/mm/page_owner.rst diff --git a/Documentation/vm/page_reclaim.rst b/Documentation/mm/page_reclaim.rst similarity index 100% rename from Documentation/vm/page_reclaim.rst rename to Documentation/mm/page_reclaim.rst diff --git a/Documentation/vm/page_table_check.rst b/Documentation/mm/page_table_check.rst similarity index 100% rename from Documentation/vm/page_table_check.rst rename to Documentation/mm/page_table_check.rst diff --git a/Documentation/vm/page_tables.rst b/Documentation/mm/page_tables.rst similarity index 100% rename from Documentation/vm/page_tables.rst rename to 
Documentation/mm/page_tables.rst diff --git a/Documentation/vm/physical_memory.rst b/Documentation/mm/physical_memory.rst similarity index 100% rename from Documentation/vm/physical_memory.rst rename to Documentation/mm/physical_memory.rst diff --git a/Documentation/vm/process_addrs.rst b/Documentation/mm/process_addrs.rst similarity index 100% rename from Documentation/vm/process_addrs.rst rename to Documentation/mm/process_addrs.rst diff --git a/Documentation/vm/remap_file_pages.rst b/Documentation/mm/remap_file_pages.rst similarity index 100% rename from Documentation/vm/remap_file_pages.rst rename to Documentation/mm/remap_file_pages.rst diff --git a/Documentation/vm/shmfs.rst b/Documentation/mm/shmfs.rst similarity index 100% rename from Documentation/vm/shmfs.rst rename to Documentation/mm/shmfs.rst diff --git a/Documentation/vm/slab.rst b/Documentation/mm/slab.rst similarity index 100% rename from Documentation/vm/slab.rst rename to Documentation/mm/slab.rst diff --git a/Documentation/vm/slub.rst b/Documentation/mm/slub.rst similarity index 100% rename from Documentation/vm/slub.rst rename to Documentation/mm/slub.rst diff --git a/Documentation/vm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst similarity index 100% rename from Documentation/vm/split_page_table_lock.rst rename to Documentation/mm/split_page_table_lock.rst diff --git a/Documentation/vm/swap.rst b/Documentation/mm/swap.rst similarity index 100% rename from Documentation/vm/swap.rst rename to Documentation/mm/swap.rst diff --git a/Documentation/vm/transhuge.rst b/Documentation/mm/transhuge.rst similarity index 100% rename from Documentation/vm/transhuge.rst rename to Documentation/mm/transhuge.rst diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst similarity index 100% rename from Documentation/vm/unevictable-lru.rst rename to Documentation/mm/unevictable-lru.rst diff --git a/Documentation/vm/vmalloc.rst b/Documentation/mm/vmalloc.rst similarity index 100% rename from Documentation/vm/vmalloc.rst rename to Documentation/mm/vmalloc.rst diff --git a/Documentation/vm/vmalloced-kernel-stacks.rst b/Documentation/mm/vmalloced-kernel-stacks.rst similarity index 100% rename from Documentation/vm/vmalloced-kernel-stacks.rst rename to Documentation/mm/vmalloced-kernel-stacks.rst diff --git a/Documentation/vm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst similarity index 100% rename from Documentation/vm/vmemmap_dedup.rst rename to Documentation/mm/vmemmap_dedup.rst diff --git a/Documentation/vm/z3fold.rst b/Documentation/mm/z3fold.rst similarity index 100% rename from Documentation/vm/z3fold.rst rename to Documentation/mm/z3fold.rst diff --git a/Documentation/vm/zsmalloc.rst b/Documentation/mm/zsmalloc.rst similarity index 100% rename from Documentation/vm/zsmalloc.rst rename to Documentation/mm/zsmalloc.rst diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst index 0c8276109fc0b..30c69e1f44fec 100644 --- a/Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst +++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/index.rst @@ -13,7 +13,7 @@ 监测数据访问 ============ -:doc:`DAMON ` 允许轻量级的数据访问监测。使用DAMON, +:doc:`DAMON ` 允许轻量级的数据访问监测。使用DAMON, 用户可以分析他们系统的内存访问模式,并优化它们。 .. 
toctree:: diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst index 1500bdbf338aa..c976f3e33ffd3 100644 --- a/Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst +++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/reclaim.rst @@ -229,4 +229,4 @@ DAMON_RECLAIM再次什么都不做,这样我们就可以退回到基于LRU列 .. [1] https://research.google/pubs/pub48551/ .. [2] https://lwn.net/Articles/787611/ -.. [3] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html +.. [3] https://www.kernel.org/doc/html/latest/mm/free_page_reporting.html diff --git a/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst b/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst index eee0e8c5c368d..cd41ada4fdad8 100644 --- a/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst +++ b/Documentation/translations/zh_CN/admin-guide/mm/damon/usage.rst @@ -33,9 +33,9 @@ DAMON 为不同的用户提供了下面这些接口。 口相同。这将在下一个LTS内核发布后被移除,所以用户应该转移到 :ref:`sysfs interface `。 - *内核空间编程接口。* - :doc:`这 ` 这是为内核空间程序员准备的。使用它,用户可以通过为你编写内 + :doc:`这 ` 这是为内核空间程序员准备的。使用它,用户可以通过为你编写内 核空间的DAMON应用程序,最灵活有效地利用DAMON的每一个功能。你甚至可以为各种地址空间扩展DAMON。 - 详细情况请参考接口 :doc:`文件 `。 + 详细情况请参考接口 :doc:`文件 `。 sysfs接口 ========= @@ -148,7 +148,7 @@ contexts//monitoring_attrs/ 在 ``nr_regions`` 目录下,有两个文件分别用于DAMON监测区域的下限和上限(``min`` 和 ``max`` ), 这两个文件控制着监测的开销。你可以通过向这些文件的写入和读出来设置和获取这些值。 -关于间隔和监测区域范围的更多细节,请参考设计文件 (:doc:`/vm/damon/design`)。 +关于间隔和监测区域范围的更多细节,请参考设计文件 (:doc:`/mm/damon/design`)。 contexts//targets/ --------------------- @@ -318,7 +318,7 @@ DAMON导出了八个文件, ``attrs``, ``target_ids``, ``init_regions``, ---- 用户可以通过读取和写入 ``attrs`` 文件获得和设置 ``采样间隔`` 、 ``聚集间隔`` 、 ``更新间隔`` -以及监测目标区域的最小/最大数量。要详细了解监测属性,请参考 `:doc:/vm/damon/design` 。例如, +以及监测目标区域的最小/最大数量。要详细了解监测属性,请参考 `:doc:/mm/damon/design` 。例如, 下面的命令将这些值设置为5ms、100ms、1000ms、10和1000,然后再次检查:: # cd /damon diff --git a/Documentation/translations/zh_CN/core-api/index.rst b/Documentation/translations/zh_CN/core-api/index.rst index 10bc438504ac2..8a94ad87465dc 100644 --- a/Documentation/translations/zh_CN/core-api/index.rst +++ b/Documentation/translations/zh_CN/core-api/index.rst @@ -101,7 +101,7 @@ Todolist: ======== 如何在内核中分配和使用内存。请注意,在 -:doc:`/vm/index` 中有更多的内存管理文档。 +:doc:`/mm/index` 中有更多的内存管理文档。 .. toctree:: :maxdepth: 1 diff --git a/Documentation/translations/zh_CN/index.rst b/Documentation/translations/zh_CN/index.rst index ad7bb8c17562b..bf85baca8b3eb 100644 --- a/Documentation/translations/zh_CN/index.rst +++ b/Documentation/translations/zh_CN/index.rst @@ -118,7 +118,7 @@ TODOList: sound/index filesystems/index scheduler/index - vm/index + mm/index peci/index TODOList: diff --git a/Documentation/translations/zh_CN/vm/active_mm.rst b/Documentation/translations/zh_CN/mm/active_mm.rst similarity index 98% rename from Documentation/translations/zh_CN/vm/active_mm.rst rename to Documentation/translations/zh_CN/mm/active_mm.rst index 366609ea4f375..c2816f523bd78 100644 --- a/Documentation/translations/zh_CN/vm/active_mm.rst +++ b/Documentation/translations/zh_CN/mm/active_mm.rst @@ -1,6 +1,6 @@ .. 
include:: ../disclaimer-zh_CN.rst -:Original: Documentation/vm/active_mm.rst +:Original: Documentation/mm/active_mm.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/balance.rst b/Documentation/translations/zh_CN/mm/balance.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/balance.rst rename to Documentation/translations/zh_CN/mm/balance.rst index e98a47ef24a8f..6fd79209c3079 100644 --- a/Documentation/translations/zh_CN/vm/balance.rst +++ b/Documentation/translations/zh_CN/mm/balance.rst @@ -1,6 +1,6 @@ .. include:: ../disclaimer-zh_CN.rst -:Original: Documentation/vm/balance.rst +:Original: Documentation/mm/balance.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/damon/api.rst b/Documentation/translations/zh_CN/mm/damon/api.rst similarity index 91% rename from Documentation/translations/zh_CN/vm/damon/api.rst rename to Documentation/translations/zh_CN/mm/damon/api.rst index 21143eea4ebed..5593a83c86bc6 100644 --- a/Documentation/translations/zh_CN/vm/damon/api.rst +++ b/Documentation/translations/zh_CN/mm/damon/api.rst @@ -1,6 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0 -:Original: Documentation/vm/damon/api.rst +:Original: Documentation/mm/damon/api.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/damon/design.rst b/Documentation/translations/zh_CN/mm/damon/design.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/damon/design.rst rename to Documentation/translations/zh_CN/mm/damon/design.rst index 46128b77c2b3d..16e3db34a7dd1 100644 --- a/Documentation/translations/zh_CN/vm/damon/design.rst +++ b/Documentation/translations/zh_CN/mm/damon/design.rst @@ -1,6 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0 -:Original: Documentation/vm/damon/design.rst +:Original: Documentation/mm/damon/design.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/damon/faq.rst b/Documentation/translations/zh_CN/mm/damon/faq.rst similarity index 98% rename from Documentation/translations/zh_CN/vm/damon/faq.rst rename to Documentation/translations/zh_CN/mm/damon/faq.rst index 07b4ac19407de..de4be417494ac 100644 --- a/Documentation/translations/zh_CN/vm/damon/faq.rst +++ b/Documentation/translations/zh_CN/mm/damon/faq.rst @@ -1,6 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0 -:Original: Documentation/vm/damon/faq.rst +:Original: Documentation/mm/damon/faq.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/damon/index.rst b/Documentation/translations/zh_CN/mm/damon/index.rst similarity index 90% rename from Documentation/translations/zh_CN/vm/damon/index.rst rename to Documentation/translations/zh_CN/mm/damon/index.rst index 84d36d90c9b0f..b03bf307204f4 100644 --- a/Documentation/translations/zh_CN/vm/damon/index.rst +++ b/Documentation/translations/zh_CN/mm/damon/index.rst @@ -1,6 +1,6 @@ .. 
SPDX-License-Identifier: GPL-2.0 -:Original: Documentation/vm/damon/index.rst +:Original: Documentation/mm/damon/index.rst :翻译: @@ -14,7 +14,7 @@ DAMON:数据访问监视器 ========================== DAMON是Linux内核的一个数据访问监控框架子系统。DAMON的核心机制使其成为 -(该核心机制详见(Documentation/translations/zh_CN/vm/damon/design.rst)) +(该核心机制详见(Documentation/translations/zh_CN/mm/damon/design.rst)) - *准确度* (监测输出对DRAM级别的内存管理足够有用;但可能不适合CPU Cache级别), - *轻量级* (监控开销低到可以在线应用),以及 @@ -30,4 +30,3 @@ DAMON是Linux内核的一个数据访问监控框架子系统。DAMON的核心 faq design api - diff --git a/Documentation/translations/zh_CN/vm/free_page_reporting.rst b/Documentation/translations/zh_CN/mm/free_page_reporting.rst similarity index 97% rename from Documentation/translations/zh_CN/vm/free_page_reporting.rst rename to Documentation/translations/zh_CN/mm/free_page_reporting.rst index 14336a3aa5f4e..5bfd58014c948 100644 --- a/Documentation/translations/zh_CN/vm/free_page_reporting.rst +++ b/Documentation/translations/zh_CN/mm/free_page_reporting.rst @@ -1,6 +1,6 @@ .. include:: ../disclaimer-zh_CN.rst -:Original: Documentation/vm/free_page_reporting.rst +:Original: Documentation/mm/free_page_reporting.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/frontswap.rst b/Documentation/translations/zh_CN/mm/frontswap.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/frontswap.rst rename to Documentation/translations/zh_CN/mm/frontswap.rst index 98aa6f581ea7e..6cef3dcdaa772 100644 --- a/Documentation/translations/zh_CN/vm/frontswap.rst +++ b/Documentation/translations/zh_CN/mm/frontswap.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/free_page_reporting.rst +:Original: Documentation/mm/free_page_reporting.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/highmem.rst b/Documentation/translations/zh_CN/mm/highmem.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/highmem.rst rename to Documentation/translations/zh_CN/mm/highmem.rst index 2003217746463..f74800a6d9a70 100644 --- a/Documentation/translations/zh_CN/vm/highmem.rst +++ b/Documentation/translations/zh_CN/mm/highmem.rst @@ -1,6 +1,6 @@ .. include:: ../disclaimer-zh_CN.rst -:Original: Documentation/vm/highmem.rst +:Original: Documentation/mm/highmem.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/hmm.rst b/Documentation/translations/zh_CN/mm/hmm.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/hmm.rst rename to Documentation/translations/zh_CN/mm/hmm.rst index 2379df95aa587..5024a8a155164 100644 --- a/Documentation/translations/zh_CN/vm/hmm.rst +++ b/Documentation/translations/zh_CN/mm/hmm.rst @@ -1,6 +1,6 @@ .. include:: ../disclaimer-zh_CN.rst -:Original: Documentation/vm/hmm.rst +:Original: Documentation/mm/hmm.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/hugetlbfs_reserv.rst b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/hugetlbfs_reserv.rst rename to Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst index c6d471ce2131b..752e5696cd476 100644 --- a/Documentation/translations/zh_CN/vm/hugetlbfs_reserv.rst +++ b/Documentation/translations/zh_CN/mm/hugetlbfs_reserv.rst @@ -1,6 +1,6 @@ .. 
include:: ../disclaimer-zh_CN.rst -:Original: Documentation/vm/hugetlbfs_reserv.rst +:Original: Documentation/mm/hugetlbfs_reserv.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/hwpoison.rst b/Documentation/translations/zh_CN/mm/hwpoison.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/hwpoison.rst rename to Documentation/translations/zh_CN/mm/hwpoison.rst index c6e1e7bdb05b5..310862edc937a 100644 --- a/Documentation/translations/zh_CN/vm/hwpoison.rst +++ b/Documentation/translations/zh_CN/mm/hwpoison.rst @@ -1,5 +1,5 @@ -:Original: Documentation/vm/hwpoison.rst +:Original: Documentation/mm/hwpoison.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/index.rst b/Documentation/translations/zh_CN/mm/index.rst similarity index 97% rename from Documentation/translations/zh_CN/vm/index.rst rename to Documentation/translations/zh_CN/mm/index.rst index c77a565538453..2f53e37b80497 100644 --- a/Documentation/translations/zh_CN/vm/index.rst +++ b/Documentation/translations/zh_CN/mm/index.rst @@ -1,6 +1,6 @@ .. include:: ../disclaimer-zh_CN.rst -:Original: Documentation/vm/index.rst +:Original: Documentation/mm/index.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/ksm.rst b/Documentation/translations/zh_CN/mm/ksm.rst similarity index 98% rename from Documentation/translations/zh_CN/vm/ksm.rst rename to Documentation/translations/zh_CN/mm/ksm.rst index 83b0c73984da1..d1f82e857ad72 100644 --- a/Documentation/translations/zh_CN/vm/ksm.rst +++ b/Documentation/translations/zh_CN/mm/ksm.rst @@ -1,6 +1,6 @@ .. include:: ../disclaimer-zh_CN.rst -:Original: Documentation/vm/ksm.rst +:Original: Documentation/mm/ksm.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/memory-model.rst b/Documentation/translations/zh_CN/mm/memory-model.rst similarity index 98% rename from Documentation/translations/zh_CN/vm/memory-model.rst rename to Documentation/translations/zh_CN/mm/memory-model.rst index 013e30c88d72c..77ec149a970ca 100644 --- a/Documentation/translations/zh_CN/vm/memory-model.rst +++ b/Documentation/translations/zh_CN/mm/memory-model.rst @@ -1,6 +1,6 @@ .. 
SPDX-License-Identifier: GPL-2.0 -:Original: Documentation/vm/memory-model.rst +:Original: Documentation/mm/memory-model.rst :翻译: @@ -129,7 +129,7 @@ ZONE_DEVICE * pmem: 通过DAX映射将平台持久性内存作为直接I/O目标使用。 * hmm: 用 `->page_fault()` 和 `->page_free()` 事件回调扩展 `ZONE_DEVICE` , - 以允许设备驱动程序协调与设备内存相关的内存管理事件,通常是GPU内存。参见/vm/hmm.rst。 + 以允许设备驱动程序协调与设备内存相关的内存管理事件,通常是GPU内存。参见Documentation/mm/hmm.rst。 * p2pdma: 创建 `struct page` 对象,允许PCI/E拓扑结构中的peer设备协调它们之间的 直接DMA操作,即绕过主机内存。 diff --git a/Documentation/translations/zh_CN/vm/mmu_notifier.rst b/Documentation/translations/zh_CN/mm/mmu_notifier.rst similarity index 98% rename from Documentation/translations/zh_CN/vm/mmu_notifier.rst rename to Documentation/translations/zh_CN/mm/mmu_notifier.rst index b29a37b336286..ce3664d1a4106 100644 --- a/Documentation/translations/zh_CN/vm/mmu_notifier.rst +++ b/Documentation/translations/zh_CN/mm/mmu_notifier.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/mmu_notifier.rst +:Original: Documentation/mm/mmu_notifier.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/numa.rst b/Documentation/translations/zh_CN/mm/numa.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/numa.rst rename to Documentation/translations/zh_CN/mm/numa.rst index 6af412b924ada..b15cfeeb6dfbb 100644 --- a/Documentation/translations/zh_CN/vm/numa.rst +++ b/Documentation/translations/zh_CN/mm/numa.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/numa.rst +:Original: Documentation/mm/numa.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/overcommit-accounting.rst b/Documentation/translations/zh_CN/mm/overcommit-accounting.rst similarity index 98% rename from Documentation/translations/zh_CN/vm/overcommit-accounting.rst rename to Documentation/translations/zh_CN/mm/overcommit-accounting.rst index 8765cb118f24b..d8452d8b7fbbe 100644 --- a/Documentation/translations/zh_CN/vm/overcommit-accounting.rst +++ b/Documentation/translations/zh_CN/mm/overcommit-accounting.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/overcommit-accounting.rst +:Original: Documentation/mm/overcommit-accounting.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/page_frags.rst b/Documentation/translations/zh_CN/mm/page_frags.rst similarity index 97% rename from Documentation/translations/zh_CN/vm/page_frags.rst rename to Documentation/translations/zh_CN/mm/page_frags.rst index 38ecddb9e1c08..20bd3fafdc8c9 100644 --- a/Documentation/translations/zh_CN/vm/page_frags.rst +++ b/Documentation/translations/zh_CN/mm/page_frags.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/page_frags.rst +:Original: Documentation/mm/page_frags.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/page_migration.rst b/Documentation/translations/zh_CN/mm/page_migration.rst similarity index 100% rename from Documentation/translations/zh_CN/vm/page_migration.rst rename to Documentation/translations/zh_CN/mm/page_migration.rst diff --git a/Documentation/translations/zh_CN/vm/page_owner.rst b/Documentation/translations/zh_CN/mm/page_owner.rst similarity index 99% rename from Documentation/translations/zh_CN/vm/page_owner.rst rename to Documentation/translations/zh_CN/mm/page_owner.rst index 7bd740bc5bf4f..b7f81d7a6589c 100644 --- a/Documentation/translations/zh_CN/vm/page_owner.rst +++ b/Documentation/translations/zh_CN/mm/page_owner.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/page_owner.rst +:Original: Documentation/mm/page_owner.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/page_table_check.rst b/Documentation/translations/zh_CN/mm/page_table_check.rst 
similarity index 97% rename from Documentation/translations/zh_CN/vm/page_table_check.rst rename to Documentation/translations/zh_CN/mm/page_table_check.rst index a29fc1b360e6f..e8077310a76c6 100644 --- a/Documentation/translations/zh_CN/vm/page_table_check.rst +++ b/Documentation/translations/zh_CN/mm/page_table_check.rst @@ -1,6 +1,6 @@ .. SPDX-License-Identifier: GPL-2.0 -:Original: Documentation/vm/page_table_check.rst +:Original: Documentation/mm/page_table_check.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/remap_file_pages.rst b/Documentation/translations/zh_CN/mm/remap_file_pages.rst similarity index 97% rename from Documentation/translations/zh_CN/vm/remap_file_pages.rst rename to Documentation/translations/zh_CN/mm/remap_file_pages.rst index af6b7e28af231..31e0c54dc36f2 100644 --- a/Documentation/translations/zh_CN/vm/remap_file_pages.rst +++ b/Documentation/translations/zh_CN/mm/remap_file_pages.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/remap_file_pages.rst +:Original: Documentation/mm/remap_file_pages.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/split_page_table_lock.rst b/Documentation/translations/zh_CN/mm/split_page_table_lock.rst similarity index 98% rename from Documentation/translations/zh_CN/vm/split_page_table_lock.rst rename to Documentation/translations/zh_CN/mm/split_page_table_lock.rst index 50694d97c426a..4fb7aa666037c 100644 --- a/Documentation/translations/zh_CN/vm/split_page_table_lock.rst +++ b/Documentation/translations/zh_CN/mm/split_page_table_lock.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/split_page_table_lock.rst +:Original: Documentation/mm/split_page_table_lock.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/vmalloced-kernel-stacks.rst b/Documentation/translations/zh_CN/mm/vmalloced-kernel-stacks.rst similarity index 100% rename from Documentation/translations/zh_CN/vm/vmalloced-kernel-stacks.rst rename to Documentation/translations/zh_CN/mm/vmalloced-kernel-stacks.rst diff --git a/Documentation/translations/zh_CN/vm/z3fold.rst b/Documentation/translations/zh_CN/mm/z3fold.rst similarity index 96% rename from Documentation/translations/zh_CN/vm/z3fold.rst rename to Documentation/translations/zh_CN/mm/z3fold.rst index 57204aa08caaf..9569a6d882700 100644 --- a/Documentation/translations/zh_CN/vm/z3fold.rst +++ b/Documentation/translations/zh_CN/mm/z3fold.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/z3fold.rst +:Original: Documentation/mm/z3fold.rst :翻译: diff --git a/Documentation/translations/zh_CN/vm/zsmalloc.rst b/Documentation/translations/zh_CN/mm/zsmalloc.rst similarity index 98% rename from Documentation/translations/zh_CN/vm/zsmalloc.rst rename to Documentation/translations/zh_CN/mm/zsmalloc.rst index 45a9b7ab2a515..4c8c9b1006a92 100644 --- a/Documentation/translations/zh_CN/vm/zsmalloc.rst +++ b/Documentation/translations/zh_CN/mm/zsmalloc.rst @@ -1,4 +1,4 @@ -:Original: Documentation/vm/zsmalloc.rst +:Original: Documentation/mm/zsmalloc.rst :翻译: diff --git a/Documentation/translations/zh_TW/index.rst b/Documentation/translations/zh_TW/index.rst index e1ce9d8c06f8f..e97d7d5787515 100644 --- a/Documentation/translations/zh_TW/index.rst +++ b/Documentation/translations/zh_TW/index.rst @@ -128,7 +128,7 @@ TODOList: * security/index * sound/index * crypto/index -* vm/index +* mm/index * bpf/index * usb/index * PCI/index diff --git a/Documentation/vm/.gitignore b/Documentation/vm/.gitignore deleted file mode 100644 index bc74f56430082..0000000000000 --- a/Documentation/vm/.gitignore +++ /dev/null @@ 
-1,3 +0,0 @@ -# SPDX-License-Identifier: GPL-2.0-only -page-types -slabinfo diff --git a/MAINTAINERS b/MAINTAINERS index 414ecc3843a7c..19930e2bdd145 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -5643,7 +5643,7 @@ L: linux-mm@kvack.org S: Maintained F: Documentation/ABI/testing/sysfs-kernel-mm-damon F: Documentation/admin-guide/mm/damon/ -F: Documentation/vm/damon/ +F: Documentation/mm/damon/ F: include/linux/damon.h F: include/trace/events/damon.h F: mm/damon/ @@ -9226,7 +9226,7 @@ HMM - Heterogeneous Memory Management M: Jérôme Glisse L: linux-mm@kvack.org S: Maintained -F: Documentation/vm/hmm.rst +F: Documentation/mm/hmm.rst F: include/linux/hmm* F: lib/test_hmm* F: mm/hmm* @@ -9324,8 +9324,8 @@ L: linux-mm@kvack.org S: Maintained F: Documentation/ABI/testing/sysfs-kernel-mm-hugepages F: Documentation/admin-guide/mm/hugetlbpage.rst -F: Documentation/vm/hugetlbfs_reserv.rst -F: Documentation/vm/vmemmap_dedup.rst +F: Documentation/mm/hugetlbfs_reserv.rst +F: Documentation/mm/vmemmap_dedup.rst F: fs/hugetlbfs/ F: include/linux/hugetlb.h F: mm/hugetlb.c @@ -15292,7 +15292,7 @@ M: Pasha Tatashin M: Andrew Morton L: linux-mm@kvack.org S: Maintained -F: Documentation/vm/page_table_check.rst +F: Documentation/mm/page_table_check.rst F: include/linux/page_table_check.h F: mm/page_table_check.c @@ -22436,7 +22436,7 @@ M: Nitin Gupta R: Sergey Senozhatsky L: linux-mm@kvack.org S: Maintained -F: Documentation/vm/zsmalloc.rst +F: Documentation/mm/zsmalloc.rst F: include/linux/zsmalloc.h F: mm/zsmalloc.c diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index ec20c1004abf5..ef427a6bdd1ab 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -155,6 +155,10 @@ do_page_fault(unsigned long address, unsigned long mmcsr, if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c index dad27e4d69ff1..5ca59a482632a 100644 --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -146,6 +146,10 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) return; } + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + /* * Fault retry nuances, mmap_lock already relinquished by core mm */ diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index a062e07516dd2..46cccd6bf705a 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -322,6 +322,10 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) return 0; } + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return 0; + if (!(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h index 1fd2846dbefeb..d20f5da2d76fa 100644 --- a/arch/arm64/include/asm/hugetlb.h +++ b/arch/arm64/include/asm/hugetlb.h @@ -46,9 +46,6 @@ extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep, unsigned long sz); #define __HAVE_ARCH_HUGE_PTEP_GET extern pte_t huge_ptep_get(pte_t *ptep); -extern void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr, - pte_t *ptep, pte_t pte, unsigned long sz); -#define set_huge_swap_pte_at set_huge_swap_pte_at void __init arm64_hugetlb_cma_reserve(void); diff --git a/arch/arm64/mm/fault.c 
b/arch/arm64/mm/fault.c index cdf3ffa0c2234..c33f1fad27450 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -608,6 +608,10 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, return 0; } + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return 0; + if (fault & VM_FAULT_RETRY) { mm_flags |= FAULT_FLAG_TRIED; goto retry; diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c index 5307ffdefb8de..86db1e4e69e08 100644 --- a/arch/arm64/mm/hugetlbpage.c +++ b/arch/arm64/mm/hugetlbpage.c @@ -241,6 +241,13 @@ static void clear_flush(struct mm_struct *mm, flush_tlb_range(&vma, saddr, addr); } +static inline struct folio *hugetlb_swap_entry_to_folio(swp_entry_t entry) +{ + VM_BUG_ON(!is_migration_entry(entry) && !is_hwpoison_entry(entry)); + + return page_folio(pfn_to_page(swp_offset(entry))); +} + void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep, pte_t pte) { @@ -250,11 +257,16 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, unsigned long pfn, dpfn; pgprot_t hugeprot; - /* - * Code needs to be expanded to handle huge swap and migration - * entries. Needed for HUGETLB and MEMORY_FAILURE. - */ - WARN_ON(!pte_present(pte)); + if (!pte_present(pte)) { + struct folio *folio; + + folio = hugetlb_swap_entry_to_folio(pte_to_swp_entry(pte)); + ncontig = num_contig_ptes(folio_size(folio), &pgsize); + + for (i = 0; i < ncontig; i++, ptep++) + set_pte_at(mm, addr, ptep, pte); + return; + } if (!pte_cont(pte)) { set_pte_at(mm, addr, ptep, pte); @@ -272,18 +284,6 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot)); } -void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr, - pte_t *ptep, pte_t pte, unsigned long sz) -{ - int i, ncontig; - size_t pgsize; - - ncontig = num_contig_ptes(sz, &pgsize); - - for (i = 0; i < ncontig; i++, ptep++) - set_pte(ptep, pte); -} - pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long sz) { diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c index 7215a46b6b8eb..e15f736cca4b4 100644 --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -285,6 +285,10 @@ asmlinkage void do_page_fault(struct pt_regs *regs) return; } + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely((fault & VM_FAULT_RETRY) && (flags & FAULT_FLAG_ALLOW_RETRY))) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index 4fac4b9eb3164..f73c7cbfe3260 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -96,6 +96,10 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + /* The most common case -- we are done. 
*/ if (likely(!(fault & VM_FAULT_ERROR))) { if (fault & VM_FAULT_RETRY) { diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 07379d1a227fe..ef78c2d66cdde 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -139,6 +139,10 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { /* * We ran out of memory, or some other thing happened diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index 8e2b8f1133df7..9be4d0eef299d 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -406,7 +406,7 @@ config ARCH_SPARSEMEM_ENABLE Say Y to support efficient handling of sparse physical memory, for architectures which are either NUMA (Non-Uniform Memory Access) or have huge holes in the physical address space for other reasons. - See for more. + See for more. config ARCH_ENABLE_THP_MIGRATION def_bool y diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 71aa9f6315dc8..4d2837eb3e2a3 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -141,6 +141,10 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, if (fault_signal_pending(fault, regs)) return 0; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return 0; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index a9626e6a68af9..5c40c3ebe52f7 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -222,6 +222,10 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index b08bc556d30dd..a27045f5a556d 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -162,6 +162,10 @@ static void __do_page_fault(struct pt_regs *regs, unsigned long write, return; } + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c index a32f14cd72f28..edaca0a6c1c1c 100644 --- a/arch/nios2/mm/fault.c +++ b/arch/nios2/mm/fault.c @@ -139,6 +139,10 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index 53b760af3bb75..b4762d66e9efe 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -165,6 +165,10 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git 
a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 84bc437be5cd1..9ad80d4d3389c 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -311,6 +311,10 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { /* * We hit a shared mapping outside of the file, or some diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index cb9d5fd39d7f3..392ff48f77df3 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1273,7 +1273,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, unsigned long addr, * should return true. * We should not call this on a hugetlb entry. We should check for HugeTLB * entry using vma->vm_flags - * The page table walk rule is explained in Documentation/vm/transhuge.rst + * The page table walk rule is explained in Documentation/mm/transhuge.rst */ static inline int pmd_trans_huge(pmd_t pmd) { diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index c1cb21a008843..7c507fb48182b 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -65,6 +65,11 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, ret = 0; *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0, NULL); + + /* The fault is fully completed (including releasing mmap lock) */ + if (*flt & VM_FAULT_COMPLETED) + return 0; + if (unlikely(*flt & VM_FAULT_ERROR)) { if (*flt & VM_FAULT_OOM) { ret = -ENOMEM; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index d53fed4eccbd3..0140054286873 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -511,6 +511,10 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, if (fault_signal_pending(fault, regs)) return user_mode(regs) ? 0 : SIGBUS; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + goto out; + /* * Handle the retry right now, the mmap_lock has been released in that * case. @@ -525,6 +529,7 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, if (unlikely(fault & VM_FAULT_ERROR)) return mm_fault_error(regs, address, fault); +out: /* * Major/minor page fault accounting. 
*/ diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 40694f0cab9e5..f2fbd1400b7c9 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -326,6 +326,10 @@ asmlinkage void do_page_fault(struct pt_regs *regs) if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_RETRY)) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index e173b6187ad56..973dcd05c2935 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -433,6 +433,17 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) goto out_up; goto out; } + + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) { + if (gmap) { + mmap_read_lock(mm); + goto out_gmap; + } + fault = 0; + goto out; + } + if (unlikely(fault & VM_FAULT_ERROR)) goto out_up; @@ -452,6 +463,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access) mmap_read_lock(mm); goto retry; } +out_gmap: if (IS_ENABLED(CONFIG_PGSTE) && gmap) { address = __gmap_link(gmap, current->thread.gmap_addr, address); diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c index e175667b13637..acd2f5e50bfcd 100644 --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -485,6 +485,10 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, if (mm_fault_error(regs, error_code, address, fault)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index ad569d9bd1242..91259f291c540 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -190,6 +190,10 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write, if (fault_signal_pending(fault, regs)) return; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index 253e07043298b..4acc12eafbf54 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -427,6 +427,10 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) if (fault_signal_pending(fault, regs)) goto exit_exception; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + goto lock_released; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -449,6 +453,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs) } mmap_read_unlock(mm); +lock_released: mm_rss = get_mm_rss(mm); #if defined(CONFIG_TRANSPARENT_HUGEPAGE) mm_rss -= (mm->context.thp_pte_count * (HPAGE_SIZE / PAGE_SIZE)); diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index d1d5d0be03085..d3ce21c4ca32a 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -76,6 +76,10 @@ int handle_page_fault(unsigned long address, unsigned long ip, if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) goto out_nosemaphore; + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return 0; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) { goto out_of_memory; diff --git a/arch/x86/kvm/mmu/mmu.c 
b/arch/x86/kvm/mmu/mmu.c index bd74a287b54ad..bfb50262fd371 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -6699,7 +6699,7 @@ int kvm_mmu_vendor_module_init(void) if (percpu_counter_init(&kvm_total_used_mmu_pages, 0, GFP_KERNEL)) goto out; - ret = register_shrinker(&mmu_shrinker); + ret = register_shrinker(&mmu_shrinker, "x86-mmu"); if (ret) goto out; diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 971977c438fc1..fa71a5d12e872 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1408,6 +1408,10 @@ void do_user_addr_fault(struct pt_regs *regs, return; } + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + /* * If we need to retry the mmap_lock has already been released, * and if there is a fatal signal pending there is no guarantee diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c index a0d023cb42921..509408da0da1e 100644 --- a/arch/x86/mm/hugetlbpage.c +++ b/arch/x86/mm/hugetlbpage.c @@ -19,44 +19,6 @@ #include #include -#if 0 /* This is just for testing */ -struct page * -follow_huge_addr(struct mm_struct *mm, unsigned long address, int write) -{ - unsigned long start = address; - int length = 1; - int nr; - struct page *page; - struct vm_area_struct *vma; - - vma = find_vma(mm, addr); - if (!vma || !is_vm_hugetlb_page(vma)) - return ERR_PTR(-EINVAL); - - pte = huge_pte_offset(mm, address, vma_mmu_pagesize(vma)); - - /* hugetlb should be locked, and hence, prefaulted */ - WARN_ON(!pte || pte_none(*pte)); - - page = &pte_page(*pte)[vpfn % (HPAGE_SIZE/PAGE_SIZE)]; - - WARN_ON(!PageHead(page)); - - return page; -} - -int pmd_huge(pmd_t pmd) -{ - return 0; -} - -int pud_huge(pud_t pud) -{ - return 0; -} - -#else - /* * pmd_huge() returns 1 if @pmd is hugetlb related entry, that is normal * hugetlb entry or non-present (migration or hwpoisoned) hugetlb entry. 
@@ -72,7 +34,6 @@ int pud_huge(pud_t pud) { return !!(pud_val(pud) & _PAGE_PSE); } -#endif #ifdef CONFIG_HUGETLB_PAGE static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file, diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index 16f0a5ff57991..8c781b05c0bdd 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -172,6 +172,10 @@ void do_page_fault(struct pt_regs *regs) return; } + /* The fault is fully completed (including releasing mmap lock) */ + if (fault & VM_FAULT_COMPLETED) + return; + if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; diff --git a/drivers/android/binder_alloc.c b/drivers/android/binder_alloc.c index 5649a0371a1f2..51b502217d000 100644 --- a/drivers/android/binder_alloc.c +++ b/drivers/android/binder_alloc.c @@ -1084,7 +1084,7 @@ int binder_alloc_shrinker_init(void) int ret = list_lru_init(&binder_alloc_lru); if (ret == 0) { - ret = register_shrinker(&binder_shrinker); + ret = register_shrinker(&binder_shrinker, "android-binder"); if (ret) list_lru_destroy(&binder_alloc_lru); } diff --git a/drivers/block/zram/zcomp.c b/drivers/block/zram/zcomp.c index 052aa3f65514e..0916de952e091 100644 --- a/drivers/block/zram/zcomp.c +++ b/drivers/block/zram/zcomp.c @@ -63,12 +63,6 @@ static int zcomp_strm_init(struct zcomp_strm *zstrm, struct zcomp *comp) bool zcomp_available_algorithm(const char *comp) { - int i; - - i = sysfs_match_string(backends, comp); - if (i >= 0) - return true; - /* * Crypto does not ignore a trailing new line symbol, * so make sure you don't supply a string containing @@ -217,6 +211,11 @@ struct zcomp *zcomp_create(const char *compress) struct zcomp *comp; int error; + /* + * Crypto API will execute /sbin/modprobe if the compression module + * is not loaded yet. We must do it here, otherwise we are about to + * call /sbin/modprobe under CPU hot-plug lock. 
+ */ if (!zcomp_available_algorithm(compress)) return ERR_PTR(-EINVAL); diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c index 1030053571a20..8dc5c8874d8a2 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shrinker.c @@ -426,7 +426,8 @@ void i915_gem_driver_register__shrinker(struct drm_i915_private *i915) i915->mm.shrinker.count_objects = i915_gem_shrinker_count; i915->mm.shrinker.seeks = DEFAULT_SEEKS; i915->mm.shrinker.batch = 4096; - drm_WARN_ON(&i915->drm, register_shrinker(&i915->mm.shrinker)); + drm_WARN_ON(&i915->drm, register_shrinker(&i915->mm.shrinker, + "drm-i915_gem")); i915->mm.oom_notifier.notifier_call = i915_gem_shrinker_oom; drm_WARN_ON(&i915->drm, register_oom_notifier(&i915->mm.oom_notifier)); diff --git a/drivers/gpu/drm/msm/msm_gem_shrinker.c b/drivers/gpu/drm/msm/msm_gem_shrinker.c index 6e39d959b9f0f..0317055e32532 100644 --- a/drivers/gpu/drm/msm/msm_gem_shrinker.c +++ b/drivers/gpu/drm/msm/msm_gem_shrinker.c @@ -221,7 +221,7 @@ void msm_gem_shrinker_init(struct drm_device *dev) priv->shrinker.count_objects = msm_gem_shrinker_count; priv->shrinker.scan_objects = msm_gem_shrinker_scan; priv->shrinker.seeks = DEFAULT_SEEKS; - WARN_ON(register_shrinker(&priv->shrinker)); + WARN_ON(register_shrinker(&priv->shrinker, "drm-msm_gem")); priv->vmap_notifier.notifier_call = msm_gem_shrinker_vmap; WARN_ON(register_vmap_purge_notifier(&priv->vmap_notifier)); diff --git a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c index 77e7cb6d1ae3b..bf0170782f258 100644 --- a/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c +++ b/drivers/gpu/drm/panfrost/panfrost_gem_shrinker.c @@ -103,7 +103,7 @@ void panfrost_gem_shrinker_init(struct drm_device *dev) pfdev->shrinker.count_objects = panfrost_gem_shrinker_count; pfdev->shrinker.scan_objects = panfrost_gem_shrinker_scan; pfdev->shrinker.seeks = DEFAULT_SEEKS; - WARN_ON(register_shrinker(&pfdev->shrinker)); + WARN_ON(register_shrinker(&pfdev->shrinker, "drm-panfrost")); } /** diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c index 1bba0a0ed3f99..21b61631f73a1 100644 --- a/drivers/gpu/drm/ttm/ttm_pool.c +++ b/drivers/gpu/drm/ttm/ttm_pool.c @@ -722,7 +722,7 @@ int ttm_pool_mgr_init(unsigned long num_pages) mm_shrinker.count_objects = ttm_pool_shrinker_count; mm_shrinker.scan_objects = ttm_pool_shrinker_scan; mm_shrinker.seeks = 1; - return register_shrinker(&mm_shrinker); + return register_shrinker(&mm_shrinker, "drm-ttm_pool"); } /** diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c index e136d6edc1ed6..147c493a989a5 100644 --- a/drivers/md/bcache/btree.c +++ b/drivers/md/bcache/btree.c @@ -812,7 +812,7 @@ int bch_btree_cache_alloc(struct cache_set *c) c->shrink.seeks = 4; c->shrink.batch = c->btree_pages * 2; - if (register_shrinker(&c->shrink)) + if (register_shrinker(&c->shrink, "md-bcache:%pU", c->set_uuid)) pr_warn("bcache: %s: could not register shrinker\n", __func__); diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c index 5ffa1dcf84cfc..3ff571b20f149 100644 --- a/drivers/md/dm-bufio.c +++ b/drivers/md/dm-bufio.c @@ -1806,7 +1806,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign c->shrinker.scan_objects = dm_bufio_shrink_scan; c->shrinker.seeks = 1; c->shrinker.batch = 0; - r = register_shrinker(&c->shrinker); + r = register_shrinker(&c->shrinker, "md-%s:(%u:%u)", slab_name, + 
MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev)); if (r) goto bad; diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c index d1ea66114d142..46648f6100fbe 100644 --- a/drivers/md/dm-zoned-metadata.c +++ b/drivers/md/dm-zoned-metadata.c @@ -2944,7 +2944,9 @@ int dmz_ctr_metadata(struct dmz_dev *dev, int num_dev, zmd->mblk_shrinker.seeks = DEFAULT_SEEKS; /* Metadata cache shrinker */ - ret = register_shrinker(&zmd->mblk_shrinker); + ret = register_shrinker(&zmd->mblk_shrinker, "md-meta:(%u:%u)", + MAJOR(dev->bdev->bd_dev), + MINOR(dev->bdev->bd_dev)); if (ret) { dmz_zmd_err(zmd, "Register metadata cache shrinker failed"); goto err; diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 6ceee650b8fac..30f1a9f8227db 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -7601,7 +7601,7 @@ static struct r5conf *setup_conf(struct mddev *mddev) conf->shrinker.count_objects = raid5_cache_count; conf->shrinker.batch = 128; conf->shrinker.flags = 0; - ret = register_shrinker(&conf->shrinker); + ret = register_shrinker(&conf->shrinker, "md-raid5:%s", mdname(mddev)); if (ret) { pr_warn("md/raid:%s: couldn't register shrinker.\n", mdname(mddev)); diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c index 85dd6aa33df69..61a2be712bf7b 100644 --- a/drivers/misc/vmw_balloon.c +++ b/drivers/misc/vmw_balloon.c @@ -1585,7 +1585,7 @@ static int vmballoon_register_shrinker(struct vmballoon *b) b->shrinker.count_objects = vmballoon_shrinker_count; b->shrinker.seeks = DEFAULT_SEEKS; - r = register_shrinker(&b->shrinker); + r = register_shrinker(&b->shrinker, "vmw-balloon"); if (r == 0) b->shrinker_registered = true; diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index f974e29a7d8f1..22b3d35a5e762 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -529,7 +529,7 @@ static int __init __reserved_mem_reserve_reg(unsigned long node, pr_debug("Reserved memory: reserved region for node '%s': base %pa, size %lu MiB\n", uname, &base, (unsigned long)(size / SZ_1M)); if (!nomap) - kmemleak_alloc_phys(base, size, 0, 0); + kmemleak_alloc_phys(base, size, 0); } else pr_err("Reserved memory: failed to reserve memory for node '%s': base %pa, size %lu MiB\n", diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c index bd360b91e9d3e..3f78a3a1eb753 100644 --- a/drivers/virtio/virtio_balloon.c +++ b/drivers/virtio/virtio_balloon.c @@ -856,7 +856,7 @@ static int virtio_balloon_register_shrinker(struct virtio_balloon *vb) vb->shrinker.count_objects = virtio_balloon_shrinker_count; vb->shrinker.seeks = DEFAULT_SEEKS; - return register_shrinker(&vb->shrinker); + return register_shrinker(&vb->shrinker, "virtio-balloon"); } static int virtballoon_probe(struct virtio_device *vdev) diff --git a/drivers/xen/xenbus/xenbus_probe_backend.c b/drivers/xen/xenbus/xenbus_probe_backend.c index 5abded97e1a7e..9c09f89d8278b 100644 --- a/drivers/xen/xenbus/xenbus_probe_backend.c +++ b/drivers/xen/xenbus/xenbus_probe_backend.c @@ -305,7 +305,7 @@ static int __init xenbus_probe_backend_init(void) register_xenstore_notifier(&xenstore_notifier); - if (register_shrinker(&backend_memory_shrinker)) + if (register_shrinker(&backend_memory_shrinker, "xen-backend")) pr_warn("shrinker registration failed\n"); return 0; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 4c7089b1681b3..f89beac3c6656 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1816,6 +1816,8 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type, error = -EBUSY; } else { snprintf(s->s_id, 
sizeof(s->s_id), "%pg", bdev); + shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s", fs_type->name, + s->s_id); btrfs_sb(s)->bdev_holder = fs_type; if (!strstr(crc32c_impl(), "generic")) set_bit(BTRFS_FS_CSUM_IMPL_FAST, &fs_info->flags); diff --git a/fs/erofs/utils.c b/fs/erofs/utils.c index ec9a1d780dc14..46627cb69abe9 100644 --- a/fs/erofs/utils.c +++ b/fs/erofs/utils.c @@ -282,7 +282,7 @@ static struct shrinker erofs_shrinker_info = { int __init erofs_init_shrinker(void) { - return register_shrinker(&erofs_shrinker_info); + return register_shrinker(&erofs_shrinker_info, "erofs-shrinker"); } void erofs_exit_shrinker(void) diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c index 9a3a8996aacf7..23167efda95ee 100644 --- a/fs/ext4/extents_status.c +++ b/fs/ext4/extents_status.c @@ -1654,7 +1654,8 @@ int ext4_es_register_shrinker(struct ext4_sb_info *sbi) sbi->s_es_shrinker.scan_objects = ext4_es_scan; sbi->s_es_shrinker.count_objects = ext4_es_count; sbi->s_es_shrinker.seeks = DEFAULT_SEEKS; - err = register_shrinker(&sbi->s_es_shrinker); + err = register_shrinker(&sbi->s_es_shrinker, "ext4-es:%s", + sbi->s_sb->s_id); if (err) goto err4; diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index faf9a767d05ae..82bfacb6df466 100644 --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -4607,7 +4607,7 @@ static int __init init_f2fs_fs(void) err = f2fs_init_sysfs(); if (err) goto free_garbage_collection_cache; - err = register_shrinker(&f2fs_shrinker_info); + err = register_shrinker(&f2fs_shrinker_info, "f2fs-shrinker"); if (err) goto free_sysfs; err = register_filesystem(&f2fs_fs_type); diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index e540d12900992..1ffbe4a56ed97 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -2530,7 +2530,7 @@ int __init gfs2_glock_init(void) return -ENOMEM; } - ret = register_shrinker(&glock_shrinker); + ret = register_shrinker(&glock_shrinker, "gfs2-glock"); if (ret) { destroy_workqueue(gfs2_delete_workqueue); destroy_workqueue(glock_workqueue); diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c index 244187e3e70f7..b66a3e1ec1528 100644 --- a/fs/gfs2/main.c +++ b/fs/gfs2/main.c @@ -148,7 +148,7 @@ static int __init init_gfs2_fs(void) if (!gfs2_trans_cachep) goto fail_cachep8; - error = register_shrinker(&gfs2_qd_shrinker); + error = register_shrinker(&gfs2_qd_shrinker, "gfs2-qd"); if (error) goto fail_shrinker; diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c index 97e205a0689d2..6d58d2e15e420 100644 --- a/fs/jbd2/journal.c +++ b/fs/jbd2/journal.c @@ -1415,7 +1415,8 @@ static journal_t *journal_init_common(struct block_device *bdev, if (percpu_counter_init(&journal->j_checkpoint_jh_count, 0, GFP_KERNEL)) goto err_cleanup; - if (register_shrinker(&journal->j_shrinker)) { + if (register_shrinker(&journal->j_shrinker, "jbd2-journal:(%u:%u)", + MAJOR(bdev->bd_dev), MINOR(bdev->bd_dev))) { percpu_counter_destroy(&journal->j_checkpoint_jh_count); goto err_cleanup; } diff --git a/fs/mbcache.c b/fs/mbcache.c index 97c54d3a22276..0b833da0a9a5f 100644 --- a/fs/mbcache.c +++ b/fs/mbcache.c @@ -367,7 +367,7 @@ struct mb_cache *mb_cache_create(int bucket_bits) cache->c_shrink.count_objects = mb_cache_count; cache->c_shrink.scan_objects = mb_cache_scan; cache->c_shrink.seeks = DEFAULT_SEEKS; - if (register_shrinker(&cache->c_shrink)) { + if (register_shrinker(&cache->c_shrink, "mbcache-shrinker")) { kfree(cache->c_hash); kfree(cache); goto err_out; diff --git a/fs/nfs/nfs42xattr.c b/fs/nfs/nfs42xattr.c index e7b34f7e0614b..a9bf09fdf2c32 100644 --- a/fs/nfs/nfs42xattr.c +++ 
b/fs/nfs/nfs42xattr.c @@ -1017,15 +1017,16 @@ int __init nfs4_xattr_cache_init(void) if (ret) goto out2; - ret = register_shrinker(&nfs4_xattr_cache_shrinker); + ret = register_shrinker(&nfs4_xattr_cache_shrinker, "nfs-xattr_cache"); if (ret) goto out1; - ret = register_shrinker(&nfs4_xattr_entry_shrinker); + ret = register_shrinker(&nfs4_xattr_entry_shrinker, "nfs-xattr_entry"); if (ret) goto out; - ret = register_shrinker(&nfs4_xattr_large_entry_shrinker); + ret = register_shrinker(&nfs4_xattr_large_entry_shrinker, + "nfs-xattr_large_entry"); if (!ret) return 0; diff --git a/fs/nfs/super.c b/fs/nfs/super.c index 6ab5eeb000dc0..82944e14fcea1 100644 --- a/fs/nfs/super.c +++ b/fs/nfs/super.c @@ -149,7 +149,7 @@ int __init register_nfs_fs(void) ret = nfs_register_sysctl(); if (ret < 0) goto error_2; - ret = register_shrinker(&acl_shrinker); + ret = register_shrinker(&acl_shrinker, "nfs-acl"); if (ret < 0) goto error_3; #ifdef CONFIG_NFS_V4_2 diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index 4758c2a3fcf8f..9e8d7ee765d75 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -855,7 +855,7 @@ nfsd_file_cache_init(void) goto out_err; } - ret = register_shrinker(&nfsd_file_shrinker); + ret = register_shrinker(&nfsd_file_shrinker, "nfsd-filecache"); if (ret) { pr_err("nfsd: failed to register nfsd_file_shrinker: %d\n", ret); goto out_lru; diff --git a/fs/nfsd/nfscache.c b/fs/nfsd/nfscache.c index 7da88bdc0d6c3..9b31e1103e7b1 100644 --- a/fs/nfsd/nfscache.c +++ b/fs/nfsd/nfscache.c @@ -176,7 +176,8 @@ int nfsd_reply_cache_init(struct nfsd_net *nn) nn->nfsd_reply_cache_shrinker.scan_objects = nfsd_reply_cache_scan; nn->nfsd_reply_cache_shrinker.count_objects = nfsd_reply_cache_count; nn->nfsd_reply_cache_shrinker.seeks = 1; - status = register_shrinker(&nn->nfsd_reply_cache_shrinker); + status = register_shrinker(&nn->nfsd_reply_cache_shrinker, + "nfsd-reply:%s", nn->nfsd_name); if (status) goto out_stats_destroy; diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 2d04e3470d4cd..751c19d5bfdd9 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -406,6 +406,7 @@ struct mem_size_stats { u64 pss_anon; u64 pss_file; u64 pss_shmem; + u64 pss_dirty; u64 pss_locked; u64 swap_pss; }; @@ -427,6 +428,7 @@ static void smaps_page_accumulate(struct mem_size_stats *mss, mss->pss_locked += pss; if (dirty || PageDirty(page)) { + mss->pss_dirty += pss; if (private) mss->private_dirty += size; else @@ -808,6 +810,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss, { SEQ_PUT_DEC("Rss: ", mss->resident); SEQ_PUT_DEC(" kB\nPss: ", mss->pss >> PSS_SHIFT); + SEQ_PUT_DEC(" kB\nPss_Dirty: ", mss->pss_dirty >> PSS_SHIFT); if (rollup_mode) { /* * These are meaningful only for smaps_rollup, otherwise two of diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c index 28966da7834ea..0427b44bfee54 100644 --- a/fs/quota/dquot.c +++ b/fs/quota/dquot.c @@ -3002,7 +3002,7 @@ static int __init dquot_init(void) pr_info("VFS: Dquot-cache hash table entries: %ld (order %ld," " %ld bytes)\n", nr_hash, order, (PAGE_SIZE << order)); - if (register_shrinker(&dqcache_shrinker)) + if (register_shrinker(&dqcache_shrinker, "dquota-cache")) panic("Cannot register dquot shrinker"); return 0; diff --git a/fs/super.c b/fs/super.c index 60f57c7bc0a69..4fca6657f442b 100644 --- a/fs/super.c +++ b/fs/super.c @@ -265,7 +265,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, s->s_shrink.count_objects = super_cache_count; s->s_shrink.batch = 1024; 
s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE; - if (prealloc_shrinker(&s->s_shrink)) + if (prealloc_shrinker(&s->s_shrink, "sb-%s", type->name)) goto fail; if (list_lru_init_memcg(&s->s_dentry_lru, &s->s_shrink)) goto fail; @@ -1288,6 +1288,8 @@ int get_tree_bdev(struct fs_context *fc, } else { s->s_mode = mode; snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev); + shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s", + fc->fs_type->name, s->s_id); sb_set_blocksize(s, block_size(bdev)); error = fill_super(s, fc); if (error) { @@ -1363,6 +1365,8 @@ struct dentry *mount_bdev(struct file_system_type *fs_type, } else { s->s_mode = mode; snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev); + shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s", + fs_type->name, s->s_id); sb_set_blocksize(s, block_size(bdev)); error = fill_super(s, data, flags & SB_SILENT ? 1 : 0); if (error) { diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c index 0978d01b0ea4f..d0c9a09988bc7 100644 --- a/fs/ubifs/super.c +++ b/fs/ubifs/super.c @@ -2430,7 +2430,7 @@ static int __init ubifs_init(void) if (!ubifs_inode_slab) return -ENOMEM; - err = register_shrinker(&ubifs_shrinker_info); + err = register_shrinker(&ubifs_shrinker_info, "ubifs-slab"); if (err) goto out_slab; diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index bf4e608710682..4aa9c9cf5b6e7 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1986,7 +1986,8 @@ xfs_alloc_buftarg( btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan; btp->bt_shrinker.seeks = DEFAULT_SEEKS; btp->bt_shrinker.flags = SHRINKER_NUMA_AWARE; - if (register_shrinker(&btp->bt_shrinker)) + if (register_shrinker(&btp->bt_shrinker, "xfs-buf:%s", + mp->m_super->s_id)) goto error_pcpu; return btp; diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 2609825d53eed..02e2281009634 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -2219,5 +2219,5 @@ xfs_inodegc_register_shrinker( shrink->flags = SHRINKER_NONSLAB; shrink->batch = XFS_INODEGC_SHRINKER_BATCH; - return register_shrinker(shrink); + return register_shrinker(shrink, "xfs-inodegc:%s", mp->m_super->s_id); } diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index abf08bbf34a90..c31d57453ceba 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -677,7 +677,8 @@ xfs_qm_init_quotainfo( qinf->qi_shrinker.seeks = DEFAULT_SEEKS; qinf->qi_shrinker.flags = SHRINKER_NUMA_AWARE; - error = register_shrinker(&qinf->qi_shrinker); + error = register_shrinker(&qinf->qi_shrinker, "xfs-qm:%s", + mp->m_super->s_id); if (error) goto out_free_inos; diff --git a/include/linux/damon.h b/include/linux/damon.h index 7c62da31ce4b5..7b1f4a4882308 100644 --- a/include/linux/damon.h +++ b/include/linux/damon.h @@ -86,6 +86,8 @@ struct damon_target { * @DAMOS_PAGEOUT: Call ``madvise()`` for the region with MADV_PAGEOUT. * @DAMOS_HUGEPAGE: Call ``madvise()`` for the region with MADV_HUGEPAGE. * @DAMOS_NOHUGEPAGE: Call ``madvise()`` for the region with MADV_NOHUGEPAGE. + * @DAMOS_LRU_PRIO: Prioritize the region on its LRU lists. + * @DAMOS_LRU_DEPRIO: Deprioritize the region on its LRU lists. * @DAMOS_STAT: Do nothing but count the stat. * @NR_DAMOS_ACTIONS: Total number of DAMOS actions */ @@ -95,6 +97,8 @@ enum damos_action { DAMOS_PAGEOUT, DAMOS_HUGEPAGE, DAMOS_NOHUGEPAGE, + DAMOS_LRU_PRIO, + DAMOS_LRU_DEPRIO, DAMOS_STAT, /* Do nothing but only record the stat */ NR_DAMOS_ACTIONS, }; @@ -397,7 +401,6 @@ struct damon_callback { * detail. * * @kdamond: Kernel thread who does the monitoring. 
- * @kdamond_stop: Notifies whether kdamond should stop. * @kdamond_lock: Mutex for the synchronizations with @kdamond. * * For each monitoring context, one kernel thread for the monitoring is @@ -406,14 +409,14 @@ struct damon_callback { * Once started, the monitoring thread runs until explicitly required to be * terminated or every monitoring target is invalid. The validity of the * targets is checked via the &damon_operations.target_valid of @ops. The - * termination can also be explicitly requested by writing non-zero to - * @kdamond_stop. The thread sets @kdamond to NULL when it terminates. - * Therefore, users can know whether the monitoring is ongoing or terminated by - * reading @kdamond. Reads and writes to @kdamond and @kdamond_stop from - * outside of the monitoring thread must be protected by @kdamond_lock. + * termination can also be explicitly requested by calling damon_stop(). + * The thread sets @kdamond to NULL when it terminates. Therefore, users can + * know whether the monitoring is ongoing or terminated by reading @kdamond. + * Reads and writes to @kdamond from outside of the monitoring thread must + * be protected by @kdamond_lock. * - * Note that the monitoring thread protects only @kdamond and @kdamond_stop via - * @kdamond_lock. Accesses to other fields must be protected by themselves. + * Note that the monitoring thread protects only @kdamond via @kdamond_lock. + * Accesses to other fields must be protected by themselves. * * @ops: Set of monitoring operations for given use cases. * @callback: Set of callbacks for monitoring events notifications. @@ -526,6 +529,12 @@ bool damon_is_registered_ops(enum damon_ops_id id); int damon_register_ops(struct damon_operations *ops); int damon_select_ops(struct damon_ctx *ctx, enum damon_ops_id id); +static inline bool damon_target_has_pid(const struct damon_ctx *ctx) +{ + return ctx->ops.id == DAMON_OPS_VADDR || ctx->ops.id == DAMON_OPS_FVADDR; +} + + int damon_start(struct damon_ctx **ctxs, int nr_ctxs, bool exclusive); int damon_stop(struct damon_ctx **ctxs, int nr_ctxs); diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 56d6a01965348..177b079446407 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -243,6 +243,16 @@ static inline void clear_highpage(struct page *page) kunmap_local(kaddr); } +static inline void clear_highpage_kasan_tagged(struct page *page) +{ + u8 tag; + + tag = page_kasan_tag(page); + page_kasan_tag_reset(page); + clear_highpage(page); + page_kasan_tag_set(page, tag); +} + #ifndef __HAVE_ARCH_TAG_CLEAR_HIGHPAGE static inline void tag_clear_highpage(struct page *page) @@ -336,19 +346,6 @@ static inline void memcpy_page(struct page *dst_page, size_t dst_off, kunmap_local(dst); } -static inline void memmove_page(struct page *dst_page, size_t dst_off, - struct page *src_page, size_t src_off, - size_t len) -{ - char *dst = kmap_local_page(dst_page); - char *src = kmap_local_page(src_page); - - VM_BUG_ON(dst_off + len > PAGE_SIZE || src_off + len > PAGE_SIZE); - memmove(dst + dst_off, src + src_off, len); - kunmap_local(src); - kunmap_local(dst); -} - static inline void memset_page(struct page *page, size_t offset, int val, size_t len) { diff --git a/include/linux/hmm.h b/include/linux/hmm.h index d5a6f101f843e..126a365716676 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -4,7 +4,7 @@ * * Authors: Jérôme Glisse * - * See Documentation/vm/hmm.rst for reasons and overview of what HMM is. + * See Documentation/mm/hmm.rst for reasons and overview of what HMM is. 
*/ #ifndef LINUX_HMM_H #define LINUX_HMM_H @@ -100,7 +100,7 @@ struct hmm_range { }; /* - * Please see Documentation/vm/hmm.rst for how to use the range API. + * Please see Documentation/mm/hmm.rst for how to use the range API. */ int hmm_range_fault(struct hmm_range *range); diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index de29821231c9b..648cb3ce7099b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -117,8 +117,10 @@ extern struct kobj_attribute shmem_enabled_attr; extern unsigned long transparent_hugepage_flags; static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, - unsigned long haddr) + unsigned long addr) { + unsigned long haddr; + /* Don't have to check pgoff for anonymous vma */ if (!vma_is_anonymous(vma)) { if (!IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, @@ -126,6 +128,8 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, return false; } + haddr = addr & HPAGE_PMD_MASK; + if (haddr < vma->vm_start || haddr + HPAGE_PMD_SIZE > vma->vm_end) return false; return true; @@ -342,7 +346,7 @@ static inline bool transparent_hugepage_active(struct vm_area_struct *vma) } static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, - unsigned long haddr) + unsigned long addr) { return false; } diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index e4cff27d1198c..c6cccfaf8708b 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -170,7 +170,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to, vm_flags_t vm_flags); long hugetlb_unreserve_pages(struct inode *inode, long start, long end, long freed); -bool isolate_huge_page(struct page *page, struct list_head *list); +int isolate_hugetlb(struct page *page, struct list_head *list); int get_hwpoison_huge_page(struct page *page, bool *hugetlb); int get_huge_page_for_hwpoison(unsigned long pfn, int flags); void putback_active_hugepage(struct page *page); @@ -376,9 +376,9 @@ static inline pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, return NULL; } -static inline bool isolate_huge_page(struct page *page, struct list_head *list) +static inline int isolate_hugetlb(struct page *page, struct list_head *list) { - return false; + return -EBUSY; } static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb) @@ -903,14 +903,6 @@ static inline void hugetlb_count_sub(long l, struct mm_struct *mm) atomic_long_sub(l, &mm->hugetlb_usage); } -#ifndef set_huge_swap_pte_at -static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr, - pte_t *ptep, pte_t pte, unsigned long sz) -{ - set_huge_pte_at(mm, addr, ptep, pte); -} -#endif - #ifndef huge_ptep_modify_prot_start #define huge_ptep_modify_prot_start huge_ptep_modify_prot_start static inline pte_t huge_ptep_modify_prot_start(struct vm_area_struct *vma, @@ -1094,11 +1086,6 @@ static inline void hugetlb_count_sub(long l, struct mm_struct *mm) { } -static inline void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr, - pte_t *ptep, pte_t pte, unsigned long sz) -{ -} - static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma, unsigned long addr, pte_t *ptep) { diff --git a/include/linux/kmemleak.h b/include/linux/kmemleak.h index 34684b2026abf..6a3cd1bf4680b 100644 --- a/include/linux/kmemleak.h +++ b/include/linux/kmemleak.h @@ -29,10 +29,9 @@ extern void kmemleak_not_leak(const void *ptr) __ref; extern void kmemleak_ignore(const void *ptr) __ref; extern void kmemleak_scan_area(const void 
*ptr, size_t size, gfp_t gfp) __ref; extern void kmemleak_no_scan(const void *ptr) __ref; -extern void kmemleak_alloc_phys(phys_addr_t phys, size_t size, int min_count, +extern void kmemleak_alloc_phys(phys_addr_t phys, size_t size, gfp_t gfp) __ref; extern void kmemleak_free_part_phys(phys_addr_t phys, size_t size) __ref; -extern void kmemleak_not_leak_phys(phys_addr_t phys) __ref; extern void kmemleak_ignore_phys(phys_addr_t phys) __ref; static inline void kmemleak_alloc_recursive(const void *ptr, size_t size, @@ -107,15 +106,12 @@ static inline void kmemleak_no_scan(const void *ptr) { } static inline void kmemleak_alloc_phys(phys_addr_t phys, size_t size, - int min_count, gfp_t gfp) + gfp_t gfp) { } static inline void kmemleak_free_part_phys(phys_addr_t phys, size_t size) { } -static inline void kmemleak_not_leak_phys(phys_addr_t phys) -{ -} static inline void kmemleak_ignore_phys(phys_addr_t phys) { } diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 9ecead1042b9c..4d31ce55b1c0d 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -837,6 +837,15 @@ static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg) } struct mem_cgroup *mem_cgroup_from_id(unsigned short id); +#ifdef CONFIG_SHRINKER_DEBUG +static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg) +{ + return memcg ? cgroup_ino(memcg->css.cgroup) : 0; +} + +struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino); +#endif + static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m) { return mem_cgroup_from_css(seq_css(m)); @@ -1343,6 +1352,18 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id) return NULL; } +#ifdef CONFIG_SHRINKER_DEBUG +static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg) +{ + return 0; +} + +static inline struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino) +{ + return NULL; +} +#endif + static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m) { return NULL; @@ -1740,6 +1761,7 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg) } struct mem_cgroup *mem_cgroup_from_obj(void *p); +struct mem_cgroup *mem_cgroup_from_slab_obj(void *p); static inline void count_objcg_event(struct obj_cgroup *objcg, enum vm_event_item idx) @@ -1755,6 +1777,42 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, rcu_read_unlock(); } +/** + * get_mem_cgroup_from_obj - get a memcg associated with passed kernel object. + * @p: pointer to object from which memcg should be extracted. It can be NULL. + * + * Retrieves the memory group into which the memory of the pointed kernel + * object is accounted. If memcg is found, its reference is taken. + * If a passed kernel object is uncharged, or if proper memcg cannot be found, + * as well as if mem_cgroup is disabled, NULL is returned. + * + * Return: valid memcg pointer with taken reference or NULL. + */ +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p) +{ + struct mem_cgroup *memcg; + + rcu_read_lock(); + do { + memcg = mem_cgroup_from_obj(p); + } while (memcg && !css_tryget(&memcg->css)); + rcu_read_unlock(); + return memcg; +} + +/** + * mem_cgroup_or_root - always returns a pointer to a valid memory cgroup. + * @memcg: pointer to a valid memory cgroup or NULL. + * + * If passed argument is not NULL, returns it without any additional checks + * and changes. Otherwise, root_mem_cgroup is returned. + * + * NOTE: root_mem_cgroup can be NULL during early boot. 
+ */ +static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg) +{ + return memcg ? memcg : root_mem_cgroup; +} #else static inline bool mem_cgroup_kmem_disabled(void) { @@ -1798,7 +1856,12 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg) static inline struct mem_cgroup *mem_cgroup_from_obj(void *p) { - return NULL; + return NULL; +} + +static inline struct mem_cgroup *mem_cgroup_from_slab_obj(void *p) +{ + return NULL; } static inline void count_objcg_event(struct obj_cgroup *objcg, @@ -1806,6 +1869,15 @@ static inline void count_objcg_event(struct obj_cgroup *objcg, { } +static inline struct mem_cgroup *get_mem_cgroup_from_obj(void *p) +{ + return NULL; +} + +static inline struct mem_cgroup *mem_cgroup_or_root(struct mem_cgroup *memcg) +{ + return NULL; +} #endif /* CONFIG_MEMCG_KMEM */ #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index 20d7edf62a6a5..e0b2209ab71c2 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -351,13 +351,4 @@ void arch_remove_linear_mapping(u64 start, u64 size); extern bool mhp_supports_memmap_on_memory(unsigned long size); #endif /* CONFIG_MEMORY_HOTPLUG */ -#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY -bool mhp_memmap_on_memory(void); -#else -static inline bool mhp_memmap_on_memory(void) -{ - return false; -} -#endif - #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 8af304f6b5047..9f5ee49482ded 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -39,7 +39,7 @@ struct vmem_altmap { * must be treated as an opaque object, rather than a "normal" struct page. * * A more complete discussion of unaddressable memory may be found in - * include/linux/hmm.h and Documentation/vm/hmm.rst. + * include/linux/hmm.h and Documentation/mm/hmm.rst. * * MEMORY_DEVICE_FS_DAX: * Host memory that has similar access semantics as System RAM i.e. DMA diff --git a/include/linux/mm.h b/include/linux/mm.h index 7898e29bcfb54..125a6a3c9dbde 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -855,7 +855,7 @@ static inline struct folio *virt_to_folio(const void *x) return page_folio(page); } -void __put_page(struct page *page); +void __folio_put(struct folio *folio); void put_pages_list(struct list_head *pages); @@ -892,11 +892,7 @@ static inline void set_compound_page_dtor(struct page *page, page[1].compound_dtor = compound_dtor; } -static inline void destroy_compound_page(struct page *page) -{ - VM_BUG_ON_PAGE(page[1].compound_dtor >= NR_COMPOUND_DTORS, page); - compound_page_dtors[page[1].compound_dtor](page); -} +void destroy_large_folio(struct folio *folio); static inline int head_compound_pincount(struct page *head) { @@ -1201,7 +1197,7 @@ static inline __must_check bool try_get_page(struct page *page) static inline void folio_put(struct folio *folio) { if (folio_put_testzero(folio)) - __put_page(&folio->page); + __folio_put(folio); } /** @@ -1221,7 +1217,26 @@ static inline void folio_put(struct folio *folio) static inline void folio_put_refs(struct folio *folio, int refs) { if (folio_ref_sub_and_test(folio, refs)) - __put_page(&folio->page); + __folio_put(folio); +} + +void release_pages(struct page **pages, int nr); + +/** + * folios_put - Decrement the reference count on an array of folios. + * @folios: The folios. + * @nr: How many folios there are. + * + * Like folio_put(), but for an array of folios. 
This is more efficient + * than writing the loop yourself as it will optimise the locks which + * need to be taken if the folios are freed. + * + * Context: May be called in process or interrupt context, but not in NMI + * context. May be called while holding a spinlock. + */ +static inline void folios_put(struct folio **folios, unsigned int nr) +{ + release_pages((struct page **)folios, nr); } static inline void put_page(struct page *page) @@ -1966,8 +1981,12 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma, * for now all the callers are only use one of the flags at the same * time. */ -/* Whether we should allow dirty bit accounting */ -#define MM_CP_DIRTY_ACCT (1UL << 0) +/* + * Whether we should manually check if we can map individual PTEs writable, + * because something (e.g., COW, uffd-wp) blocks that from happening for all + * PTEs automatically in a writable mapping. + */ +#define MM_CP_TRY_CHANGE_WRITABLE (1UL << 0) /* Whether this protection change is for NUMA hints */ #define MM_CP_PROT_NUMA (1UL << 1) /* Whether this change is for write protecting */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c29ab4c0cd5c6..6b961a29bf26f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -729,6 +729,7 @@ typedef __bitwise unsigned int vm_fault_t; * @VM_FAULT_NEEDDSYNC: ->fault did not modify page tables and needs * fsync() to complete (for synchronous page faults * in DAX) + * @VM_FAULT_COMPLETED: ->fault completed, meanwhile mmap lock released * @VM_FAULT_HINDEX_MASK: mask HINDEX value * */ @@ -746,6 +747,7 @@ enum vm_fault_reason { VM_FAULT_FALLBACK = (__force vm_fault_t)0x000800, VM_FAULT_DONE_COW = (__force vm_fault_t)0x001000, VM_FAULT_NEEDDSYNC = (__force vm_fault_t)0x002000, + VM_FAULT_COMPLETED = (__force vm_fault_t)0x004000, VM_FAULT_HINDEX_MASK = (__force vm_fault_t)0x0f0000, }; diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h index 45fc2c81e370c..d6c06e1402771 100644 --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -198,7 +198,7 @@ struct mmu_notifier_ops { * invalidate_range_start()/end() notifiers, as * invalidate_range() already catches the points in time when an * external TLB range needs to be flushed. For more in depth - * discussion on this see Documentation/vm/mmu_notifier.rst + * discussion on this see Documentation/mm/mmu_notifier.rst * * Note that this function might be called with just a sub-range * of what was passed to invalidate_range_start()/end(), if diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index aab70355d64f3..735bf5b379491 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -591,8 +591,8 @@ struct zone { * give them a chance of being in the same cacheline. * * Write access to present_pages at runtime should be protected by - * mem_hotplug_begin/end(). Any reader who can't tolerant drift of - * present_pages should get_online_mems() to get a stable value. + * mem_hotplug_begin/done(). Any reader who can't tolerant drift of + * present_pages should use get_online_mems() to get a stable value. */ atomic_long_t managed_pages; unsigned long spanned_pages; @@ -870,7 +870,7 @@ typedef struct pglist_data { unsigned long nr_reclaim_start; /* nr pages written while throttled * when throttling started. 
*/ struct task_struct *kswapd; /* Protected by - mem_hotplug_begin/end() */ + mem_hotplug_begin/done() */ int kswapd_order; enum zone_type kswapd_highest_zoneidx; @@ -1418,16 +1418,32 @@ extern size_t mem_section_usage_size(void); * (equal SECTION_SIZE_BITS - PAGE_SHIFT), and the * worst combination is powerpc with 256k pages, * which results in PFN_SECTION_SHIFT equal 6. - * To sum it up, at least 6 bits are available. + * To sum it up, at least 6 bits are available on all architectures. + * However, we can exceed 6 bits on some other architectures except + * powerpc (e.g. 15 bits are available on x86_64, 13 bits are available + * with the worst case of 64K pages on arm64) if we make sure the + * exceeded bit is not applicable to powerpc. */ -#define SECTION_MARKED_PRESENT (1UL<<0) -#define SECTION_HAS_MEM_MAP (1UL<<1) -#define SECTION_IS_ONLINE (1UL<<2) -#define SECTION_IS_EARLY (1UL<<3) -#define SECTION_TAINT_ZONE_DEVICE (1UL<<4) -#define SECTION_MAP_LAST_BIT (1UL<<5) -#define SECTION_MAP_MASK (~(SECTION_MAP_LAST_BIT-1)) -#define SECTION_NID_SHIFT 6 +enum { + SECTION_MARKED_PRESENT_BIT, + SECTION_HAS_MEM_MAP_BIT, + SECTION_IS_ONLINE_BIT, + SECTION_IS_EARLY_BIT, +#ifdef CONFIG_ZONE_DEVICE + SECTION_TAINT_ZONE_DEVICE_BIT, +#endif + SECTION_MAP_LAST_BIT, +}; + +#define SECTION_MARKED_PRESENT BIT(SECTION_MARKED_PRESENT_BIT) +#define SECTION_HAS_MEM_MAP BIT(SECTION_HAS_MEM_MAP_BIT) +#define SECTION_IS_ONLINE BIT(SECTION_IS_ONLINE_BIT) +#define SECTION_IS_EARLY BIT(SECTION_IS_EARLY_BIT) +#ifdef CONFIG_ZONE_DEVICE +#define SECTION_TAINT_ZONE_DEVICE BIT(SECTION_TAINT_ZONE_DEVICE_BIT) +#endif +#define SECTION_MAP_MASK (~(BIT(SECTION_MAP_LAST_BIT) - 1)) +#define SECTION_NID_SHIFT SECTION_MAP_LAST_BIT static inline struct page *__section_mem_map_addr(struct mem_section *section) { @@ -1466,12 +1482,19 @@ static inline int online_section(struct mem_section *section) return (section && (section->section_mem_map & SECTION_IS_ONLINE)); } +#ifdef CONFIG_ZONE_DEVICE static inline int online_device_section(struct mem_section *section) { unsigned long flags = SECTION_IS_ONLINE | SECTION_TAINT_ZONE_DEVICE; return section && ((section->section_mem_map & flags) == flags); } +#else +static inline int online_device_section(struct mem_section *section) +{ + return 0; +} +#endif static inline int online_section_nr(unsigned long nr) { diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 3f5490f6f0385..cc643eb4b9f51 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -193,6 +193,11 @@ enum pageflags { /* Only valid for buddy pages. 
Used to track pages that are reported */ PG_reported = PG_uptodate, + +#ifdef CONFIG_MEMORY_HOTPLUG + /* For self-hosted memmap pages */ + PG_vmemmap_self_hosted = PG_owner_priv_1, +#endif }; #define PAGEFLAGS_MASK ((1UL << NR_PAGEFLAGS) - 1) @@ -628,6 +633,12 @@ PAGEFLAG_FALSE(SkipKASanPoison, skip_kasan_poison) */ __PAGEFLAG(Reported, reported, PF_NO_COMPOUND) +#ifdef CONFIG_MEMORY_HOTPLUG +PAGEFLAG(VmemmapSelfHosted, vmemmap_self_hosted, PF_ANY) +#else +PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted) +#endif + /* * On an anonymous page mapped into a user virtual memory area, * page->mapping points to its anon_vma, not to a struct address_space; @@ -670,6 +681,12 @@ static __always_inline bool PageAnon(struct page *page) return folio_test_anon(page_folio(page)); } +static __always_inline bool __folio_test_movable(const struct folio *folio) +{ + return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) == + PAGE_MAPPING_MOVABLE; +} + static __always_inline int __PageMovable(struct page *page) { return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) == diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index cc9adbaddb591..0178b2040ea38 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -345,8 +345,6 @@ static inline void filemap_nr_thps_dec(struct address_space *mapping) #endif } -void release_pages(struct page **pages, int nr); - struct address_space *page_mapping(struct page *); struct address_space *folio_mapping(struct folio *); struct address_space *swapcache_mapping(struct folio *); diff --git a/include/linux/pagevec.h b/include/linux/pagevec.h index 6649154a2115e..215eb6c3bdc9f 100644 --- a/include/linux/pagevec.h +++ b/include/linux/pagevec.h @@ -26,7 +26,6 @@ struct pagevec { }; void __pagevec_release(struct pagevec *pvec); -void __pagevec_lru_add(struct pagevec *pvec); unsigned pagevec_lookup_range_tag(struct pagevec *pvec, struct address_space *mapping, pgoff_t *index, pgoff_t end, xa_mark_t tag); diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 9ec23138e4107..bf80adca980b9 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -325,8 +325,8 @@ struct page_vma_mapped_walk { #define DEFINE_PAGE_VMA_WALK(name, _page, _vma, _address, _flags) \ struct page_vma_mapped_walk name = { \ .pfn = page_to_pfn(_page), \ - .nr_pages = compound_nr(page), \ - .pgoff = page_to_pgoff(page), \ + .nr_pages = compound_nr(_page), \ + .pgoff = page_to_pgoff(_page), \ .vma = _vma, \ .address = _address, \ .flags = _flags, \ diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 8cd975a8bfebc..2a243616f222d 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -29,7 +29,7 @@ extern struct mm_struct *mm_alloc(void); * * Use mmdrop() to release the reference acquired by mmgrab(). * - * See also for an in-depth explanation + * See also for an in-depth explanation * of &mm_struct.mm_count vs &mm_struct.mm_users. */ static inline void mmgrab(struct mm_struct *mm) @@ -92,7 +92,7 @@ static inline void mmdrop_sched(struct mm_struct *mm) * * Use mmput() to release the reference acquired by mmget(). * - * See also for an in-depth explanation + * See also for an in-depth explanation * of &mm_struct.mm_count vs &mm_struct.mm_users. 
*/ static inline void mmget(struct mm_struct *mm) diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h index 76fbf92b04d95..08e6054e061f3 100644 --- a/include/linux/shrinker.h +++ b/include/linux/shrinker.h @@ -72,6 +72,11 @@ struct shrinker { #ifdef CONFIG_MEMCG /* ID in shrinker_idr */ int id; +#endif +#ifdef CONFIG_SHRINKER_DEBUG + int debugfs_id; + const char *name; + struct dentry *debugfs_entry; #endif /* objs pending delete, per node */ atomic_long_t *nr_deferred; @@ -88,10 +93,32 @@ struct shrinker { */ #define SHRINKER_NONSLAB (1 << 3) -extern int prealloc_shrinker(struct shrinker *shrinker); +extern int __printf(2, 3) prealloc_shrinker(struct shrinker *shrinker, + const char *fmt, ...); extern void register_shrinker_prepared(struct shrinker *shrinker); -extern int register_shrinker(struct shrinker *shrinker); +extern int __printf(2, 3) register_shrinker(struct shrinker *shrinker, + const char *fmt, ...); extern void unregister_shrinker(struct shrinker *shrinker); extern void free_prealloced_shrinker(struct shrinker *shrinker); extern void synchronize_shrinkers(void); -#endif + +#ifdef CONFIG_SHRINKER_DEBUG +extern int shrinker_debugfs_add(struct shrinker *shrinker); +extern void shrinker_debugfs_remove(struct shrinker *shrinker); +extern int __printf(2, 3) shrinker_debugfs_rename(struct shrinker *shrinker, + const char *fmt, ...); +#else /* CONFIG_SHRINKER_DEBUG */ +static inline int shrinker_debugfs_add(struct shrinker *shrinker) +{ + return 0; +} +static inline void shrinker_debugfs_remove(struct shrinker *shrinker) +{ +} +static inline __printf(2, 3) +int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...) +{ + return 0; +} +#endif /* CONFIG_SHRINKER_DEBUG */ +#endif /* _LINUX_SHRINKER_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 8672a7123ccdc..4e84f8b553090 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -74,7 +74,7 @@ static inline int current_is_kswapd(void) /* * Unaddressable device memory support. See include/linux/hmm.h and - * Documentation/vm/hmm.rst. Short description is we need struct pages for + * Documentation/mm/hmm.rst. Short description is we need struct pages for * device memory that is unaddressable (inaccessible) by CPU, so that we can * migrate part of a process memory to device memory. 
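Back to the shrinker interface above: register_shrinker() and prealloc_shrinker() now take a printf-style name, which becomes the shrinker's identity in the new debugfs interface when CONFIG_SHRINKER_DEBUG is enabled. A hedged sketch of the updated calling convention for an imaginary cache (all names and callbacks below are placeholders, not kernel code):

    #include <linux/module.h>
    #include <linux/shrinker.h>

    static unsigned long demo_count_objects(struct shrinker *sh,
                                            struct shrink_control *sc)
    {
            return 0;               /* nothing reclaimable in this sketch */
    }

    static unsigned long demo_scan_objects(struct shrinker *sh,
                                           struct shrink_control *sc)
    {
            return SHRINK_STOP;
    }

    static struct shrinker demo_shrinker = {
            .count_objects  = demo_count_objects,
            .scan_objects   = demo_scan_objects,
            .seeks          = DEFAULT_SEEKS,
    };

    static int __init demo_init(void)
    {
            /* The name shows up under <debugfs>/shrinker/ when SHRINKER_DEBUG=y. */
            return register_shrinker(&demo_shrinker, "demo-cache");
    }
    module_init(demo_init);

Passing a format string also lets per-instance shrinkers encode an index or device name, which shrinker_debugfs_rename() can later update.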
* @@ -456,6 +456,7 @@ static inline unsigned long total_swapcache_pages(void) return global_node_page_state(NR_SWAPCACHE); } +extern void free_swap_cache(struct page *page); extern void free_page_and_swap_cache(struct page *); extern void free_pages_and_swap_cache(struct page **, int); /* linux/mm/swapfile.c */ @@ -540,6 +541,10 @@ static inline void put_swap_device(struct swap_info_struct *si) /* used to sanity check ptes in zap_pte_range when CONFIG_SWAP=0 */ #define free_swap_and_cache(e) is_pfn_swap_entry(e) +static inline void free_swap_cache(struct page *page) +{ +} + static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) { return 0; diff --git a/include/linux/swapops.h b/include/linux/swapops.h index f24775b418807..bb7afd03a324f 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -244,8 +244,10 @@ extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, spinlock_t *ptl); extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address); -extern void migration_entry_wait_huge(struct vm_area_struct *vma, - struct mm_struct *mm, pte_t *pte); +#ifdef CONFIG_HUGETLB_PAGE +extern void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl); +extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte); +#endif #else static inline swp_entry_t make_readable_migration_entry(pgoff_t offset) { @@ -271,8 +273,10 @@ static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep, spinlock_t *ptl) { } static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, unsigned long address) { } -static inline void migration_entry_wait_huge(struct vm_area_struct *vma, - struct mm_struct *mm, pte_t *pte) { } +#ifdef CONFIG_HUGETLB_PAGE +static inline void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { } +static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { } +#endif static inline int is_writable_migration_entry(swp_entry_t entry) { return 0; diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 84d2817766888..ebbf41d592406 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4584,7 +4584,7 @@ static void __init kfree_rcu_batch_init(void) INIT_DELAYED_WORK(&krcp->page_cache_work, fill_page_cache_func); krcp->initialized = true; } - if (register_shrinker(&kfree_rcu_shrinker)) + if (register_shrinker(&kfree_rcu_shrinker, "rcu-kfree")) pr_err("Failed to register kfree_rcu() shrinker!\n"); } diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index ae2bcffe5e1be..63d1e97efec60 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -699,6 +699,15 @@ config DEBUG_OBJECTS_ENABLE_DEFAULT help Debug objects boot parameter default value +config SHRINKER_DEBUG + default y + bool "Enable shrinker debugging support" + depends on DEBUG_FS + help + Say Y to enable the shrinker debugfs interface which provides + visibility into the kernel memory shrinkers subsystem. + Disable it to avoid an extra memory footprint. 
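Stepping back to the swap.h hunk above: the newly exported free_swap_cache(), together with its CONFIG_SWAP=n stub, lets callers drop their own ifdefs around swap-cache cleanup; khugepaged relies on exactly that later in this series when it open-codes releasing the copied source pages. A minimal sketch of the calling pattern (the surrounding context is assumed):

    /* Release a source page that may still be in the swap cache. */
    free_swap_cache(page);          /* compiles to a no-op when CONFIG_SWAP=n */
    putback_lru_page(page);         /* hand the page back to the LRU lists */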
+ config HAVE_DEBUG_KMEMLEAK bool diff --git a/lib/test_hmm.c b/lib/test_hmm.c index cfe6320478391..f2c3015c5c82c 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -732,7 +732,7 @@ static int dmirror_exclusive(struct dmirror *dmirror, mmap_read_lock(mm); for (addr = start; addr < end; addr = next) { - unsigned long mapped; + unsigned long mapped = 0; int i; if (end < addr + (ARRAY_SIZE(pages) << PAGE_SHIFT)) @@ -741,7 +741,13 @@ static int dmirror_exclusive(struct dmirror *dmirror, next = addr + (ARRAY_SIZE(pages) << PAGE_SHIFT); ret = make_device_exclusive_range(mm, addr, next, pages, NULL); - mapped = dmirror_atomic_map(addr, next, pages, dmirror); + /* + * Do dmirror_atomic_map() iff all pages are marked for + * exclusive access to avoid accessing uninitialized + * fields of pages. + */ + if (ret == (next - addr) >> PAGE_SHIFT) + mapped = dmirror_atomic_map(addr, next, pages, dmirror); for (i = 0; i < ret; i++) { if (pages[i]) { unlock_page(pages[i]); diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c index cf41fd6df42a0..4f2f2d1bac562 100644 --- a/lib/test_vmalloc.c +++ b/lib/test_vmalloc.c @@ -74,12 +74,13 @@ test_report_one_done(void) static int random_size_align_alloc_test(void) { - unsigned long size, align, rnd; + unsigned long size, align; + unsigned int rnd; void *ptr; int i; for (i = 0; i < test_loop_count; i++) { - get_random_bytes(&rnd, sizeof(rnd)); + rnd = prandom_u32(); /* * Maximum 1024 pages, if PAGE_SIZE is 4096. @@ -150,7 +151,7 @@ static int random_size_alloc_test(void) int i; for (i = 0; i < test_loop_count; i++) { - get_random_bytes(&n, sizeof(i)); + n = prandom_u32(); n = (n % 100) + 1; p = vmalloc(n * PAGE_SIZE); @@ -294,14 +295,14 @@ pcpu_alloc_test(void) for (i = 0; i < 35000; i++) { unsigned int r; - get_random_bytes(&r, sizeof(i)); + r = prandom_u32(); size = (r % (PAGE_SIZE / 4)) + 1; /* * Maximum PAGE_SIZE */ - get_random_bytes(&r, sizeof(i)); - align = 1 << ((i % 11) + 1); + r = prandom_u32(); + align = 1 << ((r % 11) + 1); pcpu[i] = __alloc_percpu(size, align); if (!pcpu[i]) @@ -396,7 +397,7 @@ static void shuffle_array(int *arr, int n) int i, j; for (i = n - 1; i > 0; i--) { - get_random_bytes(&rnd, sizeof(rnd)); + rnd = prandom_u32(); /* Cut the range. */ j = rnd % i; diff --git a/mm/Kconfig b/mm/Kconfig index b7a44b17c79fa..02e5bd0a8c4e1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -655,7 +655,7 @@ config KSM the many instances by a single page with that content, so saving memory until one or another app needs to modify the content. Recommended for use with KVM, or with other duplicative applications. - See Documentation/vm/ksm.rst for more information: KSM is inactive + See Documentation/mm/ksm.rst for more information: KSM is inactive until a program has madvised that an area is MADV_MERGEABLE, and root has set /sys/kernel/mm/ksm/run to 1 (if CONFIG_SYSFS is set). diff --git a/mm/Makefile b/mm/Makefile index 6f9ffa968a1a1..9a564f8364035 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -133,3 +133,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o +obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o diff --git a/mm/compaction.c b/mm/compaction.c index a2c53fcf933e6..962d05d1e187d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -3009,7 +3009,7 @@ void kcompactd_run(int nid) /* * Called by memory hotplug when all memory in a node is offlined. Caller must - * hold mem_hotplug_begin/end(). 
+ * be holding mem_hotplug_begin/done(). */ void kcompactd_stop(int nid) { diff --git a/mm/damon/Kconfig b/mm/damon/Kconfig index 9b559c76d6dd1..66265e3a9c659 100644 --- a/mm/damon/Kconfig +++ b/mm/damon/Kconfig @@ -92,4 +92,12 @@ config DAMON_RECLAIM reclamation under light memory pressure, while the traditional page scanning-based reclamation is used for heavy pressure. +config DAMON_LRU_SORT + bool "Build DAMON-based LRU-lists sorting (DAMON_LRU_SORT)" + depends on DAMON_PADDR + help + This builds the DAMON-based LRU-lists sorting subsystem. It tries to + protect frequently accessed (hot) pages while rarely accessed (cold) + pages reclaimed first under memory pressure. + endmenu diff --git a/mm/damon/Makefile b/mm/damon/Makefile index dbf7190b4144a..3e6b8ad73858a 100644 --- a/mm/damon/Makefile +++ b/mm/damon/Makefile @@ -6,3 +6,4 @@ obj-$(CONFIG_DAMON_PADDR) += ops-common.o paddr.o obj-$(CONFIG_DAMON_SYSFS) += sysfs.o obj-$(CONFIG_DAMON_DBGFS) += dbgfs.o obj-$(CONFIG_DAMON_RECLAIM) += reclaim.o +obj-$(CONFIG_DAMON_LRU_SORT) += lru_sort.o diff --git a/mm/damon/dbgfs.c b/mm/damon/dbgfs.c index a0dab8b5e45f2..cb8a7e9926a40 100644 --- a/mm/damon/dbgfs.c +++ b/mm/damon/dbgfs.c @@ -97,6 +97,31 @@ static ssize_t dbgfs_attrs_write(struct file *file, return ret; } +/* + * Return corresponding dbgfs' scheme action value (int) for the given + * damos_action if the given damos_action value is valid and supported by + * dbgfs, negative error code otherwise. + */ +static int damos_action_to_dbgfs_scheme_action(enum damos_action action) +{ + switch (action) { + case DAMOS_WILLNEED: + return 0; + case DAMOS_COLD: + return 1; + case DAMOS_PAGEOUT: + return 2; + case DAMOS_HUGEPAGE: + return 3; + case DAMOS_NOHUGEPAGE: + return 4; + case DAMOS_STAT: + return 5; + default: + return -EINVAL; + } +} + static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len) { struct damos *s; @@ -109,7 +134,7 @@ static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len) s->min_sz_region, s->max_sz_region, s->min_nr_accesses, s->max_nr_accesses, s->min_age_region, s->max_age_region, - s->action, + damos_action_to_dbgfs_scheme_action(s->action), s->quota.ms, s->quota.sz, s->quota.reset_interval, s->quota.weight_sz, @@ -160,18 +185,27 @@ static void free_schemes_arr(struct damos **schemes, ssize_t nr_schemes) kfree(schemes); } -static bool damos_action_valid(int action) +/* + * Return corresponding damos_action for the given dbgfs input for a scheme + * action if the input is valid, negative error code otherwise. 
+ */ +static enum damos_action dbgfs_scheme_action_to_damos_action(int dbgfs_action) { - switch (action) { - case DAMOS_WILLNEED: - case DAMOS_COLD: - case DAMOS_PAGEOUT: - case DAMOS_HUGEPAGE: - case DAMOS_NOHUGEPAGE: - case DAMOS_STAT: - return true; + switch (dbgfs_action) { + case 0: + return DAMOS_WILLNEED; + case 1: + return DAMOS_COLD; + case 2: + return DAMOS_PAGEOUT; + case 3: + return DAMOS_HUGEPAGE; + case 4: + return DAMOS_NOHUGEPAGE; + case 5: + return DAMOS_STAT; default: - return false; + return -EINVAL; } } @@ -189,7 +223,8 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, int pos = 0, parsed, ret; unsigned long min_sz, max_sz; unsigned int min_nr_a, max_nr_a, min_age, max_age; - unsigned int action; + unsigned int action_input; + enum damos_action action; schemes = kmalloc_array(max_nr_schemes, sizeof(scheme), GFP_KERNEL); @@ -204,7 +239,7 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u %u %lu %lu %lu %lu%n", &min_sz, &max_sz, &min_nr_a, &max_nr_a, - &min_age, &max_age, &action, "a.ms, + &min_age, &max_age, &action_input, "a.ms, "a.sz, "a.reset_interval, "a.weight_sz, "a.weight_nr_accesses, "a.weight_age, &wmarks.metric, @@ -212,7 +247,8 @@ static struct damos **str_to_schemes(const char *str, ssize_t len, &wmarks.low, &parsed); if (ret != 18) break; - if (!damos_action_valid(action)) + action = dbgfs_scheme_action_to_damos_action(action_input); + if ((int)action < 0) goto fail; if (min_sz > max_sz || min_nr_a > max_nr_a || min_age > max_age) @@ -275,11 +311,6 @@ static ssize_t dbgfs_schemes_write(struct file *file, const char __user *buf, return ret; } -static inline bool target_has_pid(const struct damon_ctx *ctx) -{ - return ctx->ops.id == DAMON_OPS_VADDR; -} - static ssize_t sprint_target_ids(struct damon_ctx *ctx, char *buf, ssize_t len) { struct damon_target *t; @@ -288,7 +319,7 @@ static ssize_t sprint_target_ids(struct damon_ctx *ctx, char *buf, ssize_t len) int rc; damon_for_each_target(t, ctx) { - if (target_has_pid(ctx)) + if (damon_target_has_pid(ctx)) /* Show pid numbers to debugfs users */ id = pid_vnr(t->pid); else @@ -415,7 +446,7 @@ static int dbgfs_set_targets(struct damon_ctx *ctx, ssize_t nr_targets, struct damon_target *t, *next; damon_for_each_target_safe(t, next, ctx) { - if (target_has_pid(ctx)) + if (damon_target_has_pid(ctx)) put_pid(t->pid); damon_destroy_target(t); } @@ -425,11 +456,11 @@ static int dbgfs_set_targets(struct damon_ctx *ctx, ssize_t nr_targets, if (!t) { damon_for_each_target_safe(t, next, ctx) damon_destroy_target(t); - if (target_has_pid(ctx)) + if (damon_target_has_pid(ctx)) dbgfs_put_pids(pids, nr_targets); return -ENOMEM; } - if (target_has_pid(ctx)) + if (damon_target_has_pid(ctx)) t->pid = pids[i]; damon_add_target(ctx, t); } @@ -722,7 +753,7 @@ static void dbgfs_before_terminate(struct damon_ctx *ctx) { struct damon_target *t, *next; - if (!target_has_pid(ctx)) + if (!damon_target_has_pid(ctx)) return; mutex_lock(&ctx->kdamond_lock); diff --git a/mm/damon/lru_sort.c b/mm/damon/lru_sort.c new file mode 100644 index 0000000000000..c276736a071c4 --- /dev/null +++ b/mm/damon/lru_sort.c @@ -0,0 +1,546 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * DAMON-based LRU-lists Sorting + * + * Author: SeongJae Park + */ + +#define pr_fmt(fmt) "damon-lru-sort: " fmt + +#include +#include +#include +#include +#include + +#ifdef MODULE_PARAM_PREFIX +#undef MODULE_PARAM_PREFIX +#endif +#define MODULE_PARAM_PREFIX "damon_lru_sort." 
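The pair of mapping functions above pins the debugfs numbering to the documented 0..5 values, so it stays stable even when enum damos_action grows new entries (the LRU actions added later in this series deliberately have no debugfs number and are rejected with -EINVAL). A stand-alone round-trip check of that property, using local stand-in values rather than the kernel enum:

    #include <assert.h>

    /* Local stand-ins; only the dbgfs numbering from the patch matters here. */
    enum action { WILLNEED, COLD, PAGEOUT, HUGEPAGE, NOHUGEPAGE, STAT, NR_ACTIONS };

    static int to_dbgfs(enum action a)
    {
            switch (a) {
            case WILLNEED:   return 0;
            case COLD:       return 1;
            case PAGEOUT:    return 2;
            case HUGEPAGE:   return 3;
            case NOHUGEPAGE: return 4;
            case STAT:       return 5;
            default:         return -1;
            }
    }

    static enum action from_dbgfs(int v)
    {
            switch (v) {
            case 0: return WILLNEED;
            case 1: return COLD;
            case 2: return PAGEOUT;
            case 3: return HUGEPAGE;
            case 4: return NOHUGEPAGE;
            case 5: return STAT;
            }
            return NR_ACTIONS;      /* invalid input */
    }

    int main(void)
    {
            for (int v = 0; v < NR_ACTIONS; v++)
                    assert(to_dbgfs(from_dbgfs(v)) == v);
            return 0;
    }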
+ +/* + * Enable or disable DAMON_LRU_SORT. + * + * You can enable DAMON_LRU_SORT by setting the value of this parameter as + * ``Y``. Setting it as ``N`` disables DAMON_LRU_SORT. Note that + * DAMON_LRU_SORT may do no real monitoring or LRU-lists sorting due to the + * watermarks-based activation condition. Refer to the descriptions of the + * watermarks parameters below for this. + */ +static bool enabled __read_mostly; + +/* + * Make DAMON_LRU_SORT read the input parameters again, except ``enabled``. + * + * Input parameters that are updated while DAMON_LRU_SORT is running are not + * applied by default. Once this parameter is set as ``Y``, DAMON_LRU_SORT + * reads the values of parameters other than ``enabled`` again. Once the re-reading is + * done, this parameter is set as ``N``. If invalid parameters are found during + * the re-reading, DAMON_LRU_SORT will be disabled. + */ +static bool commit_inputs __read_mostly; +module_param(commit_inputs, bool, 0600); + +/* + * Access frequency threshold for hot memory region identification, in permil. + * + * If a memory region is accessed at this frequency or higher, + * DAMON_LRU_SORT identifies the region as hot and marks it as accessed on the + * LRU list, so that it is not reclaimed under memory pressure. 50% by + * default. + */ +static unsigned long hot_thres_access_freq = 500; +module_param(hot_thres_access_freq, ulong, 0600); + +/* + * Time threshold for cold memory region identification, in microseconds. + * + * If a memory region is not accessed for this amount of time or longer, DAMON_LRU_SORT + * identifies the region as cold and marks it as unaccessed on the LRU list, so + * that it is reclaimed first under memory pressure. 120 seconds by + * default. + */ +static unsigned long cold_min_age __read_mostly = 120000000; +module_param(cold_min_age, ulong, 0600); + +/* + * Limit of time for trying the LRU-lists sorting, in milliseconds. + * + * DAMON_LRU_SORT tries to use only up to this time within a time window + * (quota_reset_interval_ms) for trying LRU-lists sorting. This can be used + * for limiting the CPU consumption of DAMON_LRU_SORT. If the value is zero, the + * limit is disabled. + * + * 10 ms by default. + */ +static unsigned long quota_ms __read_mostly = 10; +module_param(quota_ms, ulong, 0600); + +/* + * The time quota charge reset interval in milliseconds. + * + * The charge reset interval for the quota of time (quota_ms). That is, + * DAMON_LRU_SORT does not try LRU-lists sorting for more than quota_ms + * milliseconds within quota_reset_interval_ms milliseconds. + * + * 1 second by default. + */ +static unsigned long quota_reset_interval_ms __read_mostly = 1000; +module_param(quota_reset_interval_ms, ulong, 0600); + +/* + * The watermarks check time interval in microseconds. + * + * The minimal time to wait before checking the watermarks, when DAMON_LRU_SORT is + * enabled but inactive due to its watermarks rule. 5 seconds by default. + */ +static unsigned long wmarks_interval __read_mostly = 5000000; +module_param(wmarks_interval, ulong, 0600); + +/* + * Free memory rate (per thousand) for the high watermark. + * + * If the free memory of the system in bytes per thousand bytes is higher than + * this, DAMON_LRU_SORT becomes inactive, so it does nothing but periodically + * check the watermarks. 200 (20%) by default. + */ +static unsigned long wmarks_high __read_mostly = 200; +module_param(wmarks_high, ulong, 0600); + +/* + * Free memory rate (per thousand) for the middle watermark.
+ * + * If free memory of the system in bytes per thousand bytes is between this and + * the low watermark, DAMON_LRU_SORT becomes active, so starts the monitoring + * and the LRU-lists sorting. 150 (15%) by default. + */ +static unsigned long wmarks_mid __read_mostly = 150; +module_param(wmarks_mid, ulong, 0600); + +/* + * Free memory rate (per thousand) for the low watermark. + * + * If free memory of the system in bytes per thousand bytes is lower than this, + * DAMON_LRU_SORT becomes inactive, so it does nothing but periodically checks + * the watermarks. 50 (5%) by default. + */ +static unsigned long wmarks_low __read_mostly = 50; +module_param(wmarks_low, ulong, 0600); + +/* + * Sampling interval for the monitoring in microseconds. + * + * The sampling interval of DAMON for the hot/cold memory monitoring. Please + * refer to the DAMON documentation for more detail. 5 ms by default. + */ +static unsigned long sample_interval __read_mostly = 5000; +module_param(sample_interval, ulong, 0600); + +/* + * Aggregation interval for the monitoring in microseconds. + * + * The aggregation interval of DAMON for the hot/cold memory monitoring. + * Please refer to the DAMON documentation for more detail. 100 ms by default. + */ +static unsigned long aggr_interval __read_mostly = 100000; +module_param(aggr_interval, ulong, 0600); + +/* + * Minimum number of monitoring regions. + * + * The minimal number of monitoring regions of DAMON for the hot/cold memory + * monitoring. This can be used to set lower-bound of the monitoring quality. + * But, setting this too high could result in increased monitoring overhead. + * Please refer to the DAMON documentation for more detail. 10 by default. + */ +static unsigned long min_nr_regions __read_mostly = 10; +module_param(min_nr_regions, ulong, 0600); + +/* + * Maximum number of monitoring regions. + * + * The maximum number of monitoring regions of DAMON for the hot/cold memory + * monitoring. This can be used to set upper-bound of the monitoring overhead. + * However, setting this too low could result in bad monitoring quality. + * Please refer to the DAMON documentation for more detail. 1000 by default. + */ +static unsigned long max_nr_regions __read_mostly = 1000; +module_param(max_nr_regions, ulong, 0600); + +/* + * Start of the target memory region in physical address. + * + * The start physical address of memory region that DAMON_LRU_SORT will do work + * against. By default, biggest System RAM is used as the region. + */ +static unsigned long monitor_region_start __read_mostly; +module_param(monitor_region_start, ulong, 0600); + +/* + * End of the target memory region in physical address. + * + * The end physical address of memory region that DAMON_LRU_SORT will do work + * against. By default, biggest System RAM is used as the region. + */ +static unsigned long monitor_region_end __read_mostly; +module_param(monitor_region_end, ulong, 0600); + +/* + * PID of the DAMON thread + * + * If DAMON_LRU_SORT is enabled, this becomes the PID of the worker thread. + * Else, -1. + */ +static int kdamond_pid __read_mostly = -1; +module_param(kdamond_pid, int, 0400); + +/* + * Number of hot memory regions that tried to be LRU-sorted. + */ +static unsigned long nr_lru_sort_tried_hot_regions __read_mostly; +module_param(nr_lru_sort_tried_hot_regions, ulong, 0400); + +/* + * Total bytes of hot memory regions that tried to be LRU-sorted. 
+ */ +static unsigned long bytes_lru_sort_tried_hot_regions __read_mostly; +module_param(bytes_lru_sort_tried_hot_regions, ulong, 0400); + +/* + * Number of hot memory regions that successfully be LRU-sorted. + */ +static unsigned long nr_lru_sorted_hot_regions __read_mostly; +module_param(nr_lru_sorted_hot_regions, ulong, 0400); + +/* + * Total bytes of hot memory regions that successfully be LRU-sorted. + */ +static unsigned long bytes_lru_sorted_hot_regions __read_mostly; +module_param(bytes_lru_sorted_hot_regions, ulong, 0400); + +/* + * Number of times that the time quota limit for hot regions have exceeded + */ +static unsigned long nr_hot_quota_exceeds __read_mostly; +module_param(nr_hot_quota_exceeds, ulong, 0400); + +/* + * Number of cold memory regions that tried to be LRU-sorted. + */ +static unsigned long nr_lru_sort_tried_cold_regions __read_mostly; +module_param(nr_lru_sort_tried_cold_regions, ulong, 0400); + +/* + * Total bytes of cold memory regions that tried to be LRU-sorted. + */ +static unsigned long bytes_lru_sort_tried_cold_regions __read_mostly; +module_param(bytes_lru_sort_tried_cold_regions, ulong, 0400); + +/* + * Number of cold memory regions that successfully be LRU-sorted. + */ +static unsigned long nr_lru_sorted_cold_regions __read_mostly; +module_param(nr_lru_sorted_cold_regions, ulong, 0400); + +/* + * Total bytes of cold memory regions that successfully be LRU-sorted. + */ +static unsigned long bytes_lru_sorted_cold_regions __read_mostly; +module_param(bytes_lru_sorted_cold_regions, ulong, 0400); + +/* + * Number of times that the time quota limit for cold regions have exceeded + */ +static unsigned long nr_cold_quota_exceeds __read_mostly; +module_param(nr_cold_quota_exceeds, ulong, 0400); + +static struct damon_ctx *ctx; +static struct damon_target *target; + +struct damon_lru_sort_ram_walk_arg { + unsigned long start; + unsigned long end; +}; + +static int walk_system_ram(struct resource *res, void *arg) +{ + struct damon_lru_sort_ram_walk_arg *a = arg; + + if (a->end - a->start < resource_size(res)) { + a->start = res->start; + a->end = res->end; + } + return 0; +} + +/* + * Find biggest 'System RAM' resource and store its start and end address in + * @start and @end, respectively. If no System RAM is found, returns false. + */ +static bool get_monitoring_region(unsigned long *start, unsigned long *end) +{ + struct damon_lru_sort_ram_walk_arg arg = {}; + + walk_system_ram_res(0, ULONG_MAX, &arg, walk_system_ram); + if (arg.end <= arg.start) + return false; + + *start = arg.start; + *end = arg.end; + return true; +} + +/* Create a DAMON-based operation scheme for hot memory regions */ +static struct damos *damon_lru_sort_new_hot_scheme(unsigned int hot_thres) +{ + struct damos_watermarks wmarks = { + .metric = DAMOS_WMARK_FREE_MEM_RATE, + .interval = wmarks_interval, + .high = wmarks_high, + .mid = wmarks_mid, + .low = wmarks_low, + }; + struct damos_quota quota = { + /* + * Do not try LRU-lists sorting of hot pages for more than half + * of quota_ms milliseconds within quota_reset_interval_ms. + */ + .ms = quota_ms / 2, + .sz = 0, + .reset_interval = quota_reset_interval_ms, + /* Within the quota, mark hotter regions accessed first. 
*/ + .weight_sz = 0, + .weight_nr_accesses = 1, + .weight_age = 0, + }; + struct damos *scheme = damon_new_scheme( + /* Find regions having PAGE_SIZE or larger size */ + PAGE_SIZE, ULONG_MAX, + /* and accessed for more than the threshold */ + hot_thres, UINT_MAX, + /* no matter its age */ + 0, UINT_MAX, + /* prioritize those on LRU lists, as soon as found */ + DAMOS_LRU_PRIO, + /* under the quota. */ + "a, + /* (De)activate this according to the watermarks. */ + &wmarks); + + return scheme; +} + +/* Create a DAMON-based operation scheme for cold memory regions */ +static struct damos *damon_lru_sort_new_cold_scheme(unsigned int cold_thres) +{ + struct damos_watermarks wmarks = { + .metric = DAMOS_WMARK_FREE_MEM_RATE, + .interval = wmarks_interval, + .high = wmarks_high, + .mid = wmarks_mid, + .low = wmarks_low, + }; + struct damos_quota quota = { + /* + * Do not try LRU-lists sorting of cold pages for more than + * half of quota_ms milliseconds within + * quota_reset_interval_ms. + */ + .ms = quota_ms / 2, + .sz = 0, + .reset_interval = quota_reset_interval_ms, + /* Within the quota, mark colder regions not accessed first. */ + .weight_sz = 0, + .weight_nr_accesses = 0, + .weight_age = 1, + }; + struct damos *scheme = damon_new_scheme( + /* Find regions having PAGE_SIZE or larger size */ + PAGE_SIZE, ULONG_MAX, + /* and not accessed at all */ + 0, 0, + /* for cold_thres or more micro-seconds, and */ + cold_thres, UINT_MAX, + /* mark those as not accessed, as soon as found */ + DAMOS_LRU_DEPRIO, + /* under the quota. */ + "a, + /* (De)activate this according to the watermarks. */ + &wmarks); + + return scheme; +} + +static int damon_lru_sort_apply_parameters(void) +{ + struct damos *scheme, *next_scheme; + struct damon_addr_range addr_range; + unsigned int hot_thres, cold_thres; + int err = 0; + + err = damon_set_attrs(ctx, sample_interval, aggr_interval, 0, + min_nr_regions, max_nr_regions); + if (err) + return err; + + /* free previously set schemes */ + damon_for_each_scheme_safe(scheme, next_scheme, ctx) + damon_destroy_scheme(scheme); + + /* aggr_interval / sample_interval is the maximum nr_accesses */ + hot_thres = aggr_interval / sample_interval * hot_thres_access_freq / + 1000; + scheme = damon_lru_sort_new_hot_scheme(hot_thres); + if (!scheme) + return -ENOMEM; + damon_add_scheme(ctx, scheme); + + cold_thres = cold_min_age / aggr_interval; + scheme = damon_lru_sort_new_cold_scheme(cold_thres); + if (!scheme) + return -ENOMEM; + damon_add_scheme(ctx, scheme); + + if (monitor_region_start > monitor_region_end) + return -EINVAL; + if (!monitor_region_start && !monitor_region_end && + !get_monitoring_region(&monitor_region_start, + &monitor_region_end)) + return -EINVAL; + addr_range.start = monitor_region_start; + addr_range.end = monitor_region_end; + return damon_set_regions(target, &addr_range, 1); +} + +static int damon_lru_sort_turn(bool on) +{ + int err; + + if (!on) { + err = damon_stop(&ctx, 1); + if (!err) + kdamond_pid = -1; + return err; + } + + err = damon_lru_sort_apply_parameters(); + if (err) + return err; + + err = damon_start(&ctx, 1, true); + if (err) + return err; + kdamond_pid = ctx->kdamond->pid; + return 0; +} + +static struct delayed_work damon_lru_sort_timer; +static void damon_lru_sort_timer_fn(struct work_struct *work) +{ + static bool last_enabled; + bool now_enabled; + + now_enabled = enabled; + if (last_enabled != now_enabled) { + if (!damon_lru_sort_turn(now_enabled)) + last_enabled = now_enabled; + else + enabled = last_enabled; + } +} +static 
DECLARE_DELAYED_WORK(damon_lru_sort_timer, damon_lru_sort_timer_fn); + +static bool damon_lru_sort_initialized; + +static int damon_lru_sort_enabled_store(const char *val, + const struct kernel_param *kp) +{ + int rc = param_set_bool(val, kp); + + if (rc < 0) + return rc; + + if (!damon_lru_sort_initialized) + return rc; + + schedule_delayed_work(&damon_lru_sort_timer, 0); + + return 0; +} + +static const struct kernel_param_ops enabled_param_ops = { + .set = damon_lru_sort_enabled_store, + .get = param_get_bool, +}; + +module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); +MODULE_PARM_DESC(enabled, + "Enable or disable DAMON_LRU_SORT (default: disabled)"); + +static int damon_lru_sort_handle_commit_inputs(void) +{ + int err; + + if (!commit_inputs) + return 0; + + err = damon_lru_sort_apply_parameters(); + commit_inputs = false; + return err; +} + +static int damon_lru_sort_after_aggregation(struct damon_ctx *c) +{ + struct damos *s; + + /* update the stats parameter */ + damon_for_each_scheme(s, c) { + if (s->action == DAMOS_LRU_PRIO) { + nr_lru_sort_tried_hot_regions = s->stat.nr_tried; + bytes_lru_sort_tried_hot_regions = s->stat.sz_tried; + nr_lru_sorted_hot_regions = s->stat.nr_applied; + bytes_lru_sorted_hot_regions = s->stat.sz_applied; + nr_hot_quota_exceeds = s->stat.qt_exceeds; + } else if (s->action == DAMOS_LRU_DEPRIO) { + nr_lru_sort_tried_cold_regions = s->stat.nr_tried; + bytes_lru_sort_tried_cold_regions = s->stat.sz_tried; + nr_lru_sorted_cold_regions = s->stat.nr_applied; + bytes_lru_sorted_cold_regions = s->stat.sz_applied; + nr_cold_quota_exceeds = s->stat.qt_exceeds; + } + } + + return damon_lru_sort_handle_commit_inputs(); +} + +static int damon_lru_sort_after_wmarks_check(struct damon_ctx *c) +{ + return damon_lru_sort_handle_commit_inputs(); +} + +static int __init damon_lru_sort_init(void) +{ + ctx = damon_new_ctx(); + if (!ctx) + return -ENOMEM; + + if (damon_select_ops(ctx, DAMON_OPS_PADDR)) + return -EINVAL; + + ctx->callback.after_wmarks_check = damon_lru_sort_after_wmarks_check; + ctx->callback.after_aggregation = damon_lru_sort_after_aggregation; + + target = damon_new_target(); + if (!target) { + damon_destroy_ctx(ctx); + return -ENOMEM; + } + damon_add_target(ctx, target); + + schedule_delayed_work(&damon_lru_sort_timer, 0); + + damon_lru_sort_initialized = true; + return 0; +} + +module_init(damon_lru_sort_init); diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c index 10ef20b2003f5..b1335de200e77 100644 --- a/mm/damon/ops-common.c +++ b/mm/damon/ops-common.c @@ -130,3 +130,45 @@ int damon_pageout_score(struct damon_ctx *c, struct damon_region *r, /* Return coldness of the region */ return DAMOS_MAX_SCORE - hotness; } + +int damon_hot_score(struct damon_ctx *c, struct damon_region *r, + struct damos *s) +{ + unsigned int max_nr_accesses; + int freq_subscore; + unsigned int age_in_sec; + int age_in_log, age_subscore; + unsigned int freq_weight = s->quota.weight_nr_accesses; + unsigned int age_weight = s->quota.weight_age; + int hotness; + + max_nr_accesses = c->aggr_interval / c->sample_interval; + freq_subscore = r->nr_accesses * DAMON_MAX_SUBSCORE / max_nr_accesses; + + age_in_sec = (unsigned long)r->age * c->aggr_interval / 1000000; + for (age_in_log = 0; age_in_log < DAMON_MAX_AGE_IN_LOG && age_in_sec; + age_in_log++, age_in_sec >>= 1) + ; + + /* If frequency is 0, higher age means it's colder */ + if (freq_subscore == 0) + age_in_log *= -1; + + /* + * Now age_in_log is in [-DAMON_MAX_AGE_IN_LOG, DAMON_MAX_AGE_IN_LOG]. 
+ * Scale it to be in [0, 100] and set it as age subscore. + */ + age_in_log += DAMON_MAX_AGE_IN_LOG; + age_subscore = age_in_log * DAMON_MAX_SUBSCORE / + DAMON_MAX_AGE_IN_LOG / 2; + + hotness = (freq_weight * freq_subscore + age_weight * age_subscore); + if (freq_weight + age_weight) + hotness /= freq_weight + age_weight; + /* + * Transform it to fit in [0, DAMOS_MAX_SCORE] + */ + hotness = hotness * DAMOS_MAX_SCORE / DAMON_MAX_SUBSCORE; + + return hotness; +} diff --git a/mm/damon/ops-common.h b/mm/damon/ops-common.h index e790cb5f8fe05..52329ff361cd0 100644 --- a/mm/damon/ops-common.h +++ b/mm/damon/ops-common.h @@ -14,3 +14,5 @@ void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr); int damon_pageout_score(struct damon_ctx *c, struct damon_region *r, struct damos *s); +int damon_hot_score(struct damon_ctx *c, struct damon_region *r, + struct damos *s); diff --git a/mm/damon/paddr.c b/mm/damon/paddr.c index b40ff5811bb22..dc131c6a54038 100644 --- a/mm/damon/paddr.c +++ b/mm/damon/paddr.c @@ -204,16 +204,11 @@ static unsigned int damon_pa_check_accesses(struct damon_ctx *ctx) return max_nr_accesses; } -static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx, - struct damon_target *t, struct damon_region *r, - struct damos *scheme) +static unsigned long damon_pa_pageout(struct damon_region *r) { unsigned long addr, applied; LIST_HEAD(page_list); - if (scheme->action != DAMOS_PAGEOUT) - return 0; - for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) { struct page *page = damon_get_page(PHYS_PFN(addr)); @@ -238,6 +233,55 @@ static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx, return applied * PAGE_SIZE; } +static unsigned long damon_pa_mark_accessed(struct damon_region *r) +{ + unsigned long addr, applied = 0; + + for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) { + struct page *page = damon_get_page(PHYS_PFN(addr)); + + if (!page) + continue; + mark_page_accessed(page); + put_page(page); + applied++; + } + return applied * PAGE_SIZE; +} + +static unsigned long damon_pa_deactivate_pages(struct damon_region *r) +{ + unsigned long addr, applied = 0; + + for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) { + struct page *page = damon_get_page(PHYS_PFN(addr)); + + if (!page) + continue; + deactivate_page(page); + put_page(page); + applied++; + } + return applied * PAGE_SIZE; +} + +static unsigned long damon_pa_apply_scheme(struct damon_ctx *ctx, + struct damon_target *t, struct damon_region *r, + struct damos *scheme) +{ + switch (scheme->action) { + case DAMOS_PAGEOUT: + return damon_pa_pageout(r); + case DAMOS_LRU_PRIO: + return damon_pa_mark_accessed(r); + case DAMOS_LRU_DEPRIO: + return damon_pa_deactivate_pages(r); + default: + break; + } + return 0; +} + static int damon_pa_scheme_score(struct damon_ctx *context, struct damon_target *t, struct damon_region *r, struct damos *scheme) @@ -245,6 +289,10 @@ static int damon_pa_scheme_score(struct damon_ctx *context, switch (scheme->action) { case DAMOS_PAGEOUT: return damon_pageout_score(context, r, scheme); + case DAMOS_LRU_PRIO: + return damon_hot_score(context, r, scheme); + case DAMOS_LRU_DEPRIO: + return damon_pageout_score(context, r, scheme); default: break; } diff --git a/mm/damon/reclaim.c b/mm/damon/reclaim.c index 4b07c29effe97..e69b807fefe43 100644 --- a/mm/damon/reclaim.c +++ b/mm/damon/reclaim.c @@ -353,7 +353,6 @@ static int damon_reclaim_turn(bool on) return 0; } -#define ENABLE_CHECK_INTERVAL_MS 1000 static struct delayed_work damon_reclaim_timer; 
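Tying the DAMON_LRU_SORT pieces above together: damon_lru_sort_apply_parameters() and damon_hot_score() translate the module parameters into DAMON's native units. With the default values shown earlier, the arithmetic works out as follows (a worked example; DAMON_MAX_SUBSCORE = 100 and DAMOS_MAX_SCORE = 99 are assumed, matching current kernels):

    max_nr_accesses = aggr_interval / sample_interval = 100000 / 5000     = 20
    hot_thres       = 20 * hot_thres_access_freq / 1000 = 20 * 500 / 1000 = 10 accesses
    cold_thres      = cold_min_age / aggr_interval = 120000000 / 100000   = 1200 aggregation intervals

    /* hot scheme quota weights: weight_nr_accesses = 1, weight_age = 0 */
    freq_subscore = nr_accesses * 100 / 20      /* e.g. 10 accesses -> 50 */
    hotness       = freq_subscore * 99 / 100    /* e.g. 50 -> priority 49 */

So a region has to be accessed in at least half of the sampling checks within one aggregation interval to be treated as hot, and under the time quota the hot regions are then ordered purely by access frequency, while the cold scheme (weight_age = 1) prefers the regions that have been idle longest.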
static void damon_reclaim_timer_fn(struct work_struct *work) { @@ -367,16 +366,12 @@ static void damon_reclaim_timer_fn(struct work_struct *work) else enabled = last_enabled; } - - if (enabled) - schedule_delayed_work(&damon_reclaim_timer, - msecs_to_jiffies(ENABLE_CHECK_INTERVAL_MS)); } static DECLARE_DELAYED_WORK(damon_reclaim_timer, damon_reclaim_timer_fn); static bool damon_reclaim_initialized; -static int enabled_store(const char *val, +static int damon_reclaim_enabled_store(const char *val, const struct kernel_param *kp) { int rc = param_set_bool(val, kp); @@ -388,14 +383,12 @@ static int enabled_store(const char *val, if (!damon_reclaim_initialized) return rc; - if (enabled) - schedule_delayed_work(&damon_reclaim_timer, 0); - + schedule_delayed_work(&damon_reclaim_timer, 0); return 0; } static const struct kernel_param_ops enabled_param_ops = { - .set = enabled_store, + .set = damon_reclaim_enabled_store, .get = param_get_bool, }; @@ -403,10 +396,21 @@ module_param_cb(enabled, &enabled_param_ops, &enabled, 0600); MODULE_PARM_DESC(enabled, "Enable or disable DAMON_RECLAIM (default: disabled)"); +static int damon_reclaim_handle_commit_inputs(void) +{ + int err; + + if (!commit_inputs) + return 0; + + err = damon_reclaim_apply_parameters(); + commit_inputs = false; + return err; +} + static int damon_reclaim_after_aggregation(struct damon_ctx *c) { struct damos *s; - int err = 0; /* update the stats parameter */ damon_for_each_scheme(s, c) { @@ -417,22 +421,12 @@ static int damon_reclaim_after_aggregation(struct damon_ctx *c) nr_quota_exceeds = s->stat.qt_exceeds; } - if (commit_inputs) { - err = damon_reclaim_apply_parameters(); - commit_inputs = false; - } - return err; + return damon_reclaim_handle_commit_inputs(); } static int damon_reclaim_after_wmarks_check(struct damon_ctx *c) { - int err = 0; - - if (commit_inputs) { - err = damon_reclaim_apply_parameters(); - commit_inputs = false; - } - return err; + return damon_reclaim_handle_commit_inputs(); } static int __init damon_reclaim_init(void) diff --git a/mm/damon/sysfs.c b/mm/damon/sysfs.c index 09f9e8ca3d1fa..7488e27c87c37 100644 --- a/mm/damon/sysfs.c +++ b/mm/damon/sysfs.c @@ -762,6 +762,8 @@ static const char * const damon_sysfs_damos_action_strs[] = { "pageout", "hugepage", "nohugepage", + "lru_prio", + "lru_deprio", "stat", }; @@ -2136,8 +2138,7 @@ static void damon_sysfs_destroy_targets(struct damon_ctx *ctx) struct damon_target *t, *next; damon_for_each_target_safe(t, next, ctx) { - if (ctx->ops.id == DAMON_OPS_VADDR || - ctx->ops.id == DAMON_OPS_FVADDR) + if (damon_target_has_pid(ctx)) put_pid(t->pid); damon_destroy_target(t); } @@ -2181,8 +2182,7 @@ static int damon_sysfs_add_target(struct damon_sysfs_target *sys_target, if (!t) return -ENOMEM; - if (ctx->ops.id == DAMON_OPS_VADDR || - ctx->ops.id == DAMON_OPS_FVADDR) { + if (damon_target_has_pid(ctx)) { t->pid = find_get_pid(sys_target->pid); if (!t->pid) goto destroy_targets_out; @@ -2210,7 +2210,7 @@ static struct damon_target *damon_sysfs_existing_target( struct pid *pid; struct damon_target *t; - if (ctx->ops.id == DAMON_OPS_PADDR) { + if (!damon_target_has_pid(ctx)) { /* Up to only one target for paddr could exist */ damon_for_each_target(t, ctx) return t; @@ -2359,6 +2359,23 @@ static inline bool damon_sysfs_kdamond_running( damon_sysfs_ctx_running(kdamond->damon_ctx); } +static int damon_sysfs_apply_inputs(struct damon_ctx *ctx, + struct damon_sysfs_context *sys_ctx) +{ + int err; + + err = damon_select_ops(ctx, sys_ctx->ops_id); + if (err) + return err; + err 
= damon_sysfs_set_attrs(ctx, sys_ctx->attrs); + if (err) + return err; + err = damon_sysfs_set_targets(ctx, sys_ctx->targets); + if (err) + return err; + return damon_sysfs_set_schemes(ctx, sys_ctx->schemes); +} + /* * damon_sysfs_commit_input() - Commit user inputs to a running kdamond. * @kdamond: The kobject wrapper for the associated kdamond. @@ -2367,31 +2384,14 @@ static inline bool damon_sysfs_kdamond_running( */ static int damon_sysfs_commit_input(struct damon_sysfs_kdamond *kdamond) { - struct damon_ctx *ctx = kdamond->damon_ctx; - struct damon_sysfs_context *sys_ctx; - int err = 0; - if (!damon_sysfs_kdamond_running(kdamond)) return -EINVAL; /* TODO: Support multiple contexts per kdamond */ if (kdamond->contexts->nr != 1) return -EINVAL; - sys_ctx = kdamond->contexts->contexts_arr[0]; - - err = damon_select_ops(ctx, sys_ctx->ops_id); - if (err) - return err; - err = damon_sysfs_set_attrs(ctx, sys_ctx->attrs); - if (err) - return err; - err = damon_sysfs_set_targets(ctx, sys_ctx->targets); - if (err) - return err; - err = damon_sysfs_set_schemes(ctx, sys_ctx->schemes); - if (err) - return err; - return err; + return damon_sysfs_apply_inputs(kdamond->damon_ctx, + kdamond->contexts->contexts_arr[0]); } /* @@ -2438,27 +2438,16 @@ static struct damon_ctx *damon_sysfs_build_ctx( if (!ctx) return ERR_PTR(-ENOMEM); - err = damon_select_ops(ctx, sys_ctx->ops_id); - if (err) - goto out; - err = damon_sysfs_set_attrs(ctx, sys_ctx->attrs); - if (err) - goto out; - err = damon_sysfs_set_targets(ctx, sys_ctx->targets); - if (err) - goto out; - err = damon_sysfs_set_schemes(ctx, sys_ctx->schemes); - if (err) - goto out; + err = damon_sysfs_apply_inputs(ctx, sys_ctx); + if (err) { + damon_destroy_ctx(ctx); + return ERR_PTR(err); + } ctx->callback.after_wmarks_check = damon_sysfs_cmd_request_callback; ctx->callback.after_aggregation = damon_sysfs_cmd_request_callback; ctx->callback.before_terminate = damon_sysfs_before_terminate; return ctx; - -out: - damon_destroy_ctx(ctx); - return ERR_PTR(err); } static int damon_sysfs_turn_damon_on(struct damon_sysfs_kdamond *kdamond) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 1ab091f49fc07..dc7df1254f0a0 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -35,7 +35,7 @@ #include /* - * Please refer Documentation/vm/arch_pgtable_helpers.rst for the semantics + * Please refer Documentation/mm/arch_pgtable_helpers.rst for the semantics * expectations that are being validated here. All future changes in here * or the documentation need to be in sync. */ diff --git a/mm/frontswap.c b/mm/frontswap.c index 6f69b044a8cc7..1a97610308cbc 100644 --- a/mm/frontswap.c +++ b/mm/frontswap.c @@ -4,7 +4,7 @@ * * This code provides the generic "frontend" layer to call a matching * "backend" driver implementation of frontswap. See - * Documentation/vm/frontswap.rst for more information. + * Documentation/mm/frontswap.rst for more information. * * Copyright (C) 2009-2012 Oracle Corp. All rights reserved. * Author: Dan Magenheimer diff --git a/mm/gup.c b/mm/gup.c index e2a39e30756d5..2e663d09334e3 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -953,6 +953,25 @@ static int faultin_page(struct vm_area_struct *vma, } ret = handle_mm_fault(vma, address, fault_flags, NULL); + + if (ret & VM_FAULT_COMPLETED) { + /* + * With FAULT_FLAG_RETRY_NOWAIT we'll never release the + * mmap lock in the page fault handler. Sanity check this. 
+ */ + WARN_ON_ONCE(fault_flags & FAULT_FLAG_RETRY_NOWAIT); + if (locked) + *locked = 0; + /* + * We should do the same as VM_FAULT_RETRY, but let's not + * return -EBUSY since that's not reflecting the reality of + * what has happened - we've just fully completed a page + * fault, with the mmap lock released. Use -EAGAIN to show + * that we want to take the mmap lock _again_. + */ + return -EAGAIN; + } + if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, *flags); @@ -1179,6 +1198,7 @@ static long __get_user_pages(struct mm_struct *mm, case 0: goto retry; case -EBUSY: + case -EAGAIN: ret = 0; fallthrough; case -EFAULT: @@ -1305,6 +1325,18 @@ int fixup_user_fault(struct mm_struct *mm, return -EINTR; ret = handle_mm_fault(vma, address, fault_flags, NULL); + + if (ret & VM_FAULT_COMPLETED) { + /* + * NOTE: it's a pity that we need to retake the lock here + * to pair with the unlock() in the callers. Ideally we + * could tell the callers so they do not need to unlock. + */ + mmap_read_lock(mm); + *unlocked = true; + return 0; + } + if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, 0); @@ -1370,7 +1402,7 @@ static __always_inline long __get_user_pages_locked(struct mm_struct *mm, /* VM_FAULT_RETRY couldn't trigger, bypass */ return ret; - /* VM_FAULT_RETRY cannot return errors */ + /* VM_FAULT_RETRY or VM_FAULT_COMPLETED cannot return errors */ if (!*locked) { BUG_ON(ret < 0); BUG_ON(ret >= nr_pages); @@ -1900,7 +1932,7 @@ static long check_and_migrate_movable_pages(unsigned long nr_pages, * Try to move out any movable page before pinning the range. */ if (folio_test_hugetlb(folio)) { - if (!isolate_huge_page(&folio->page, + if (isolate_hugetlb(&folio->page, &movable_page_list)) isolation_error_count++; continue; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a00af281bfdc1..77aa6296bb799 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -424,10 +424,10 @@ static int __init hugepage_init(void) if (err) goto err_slab; - err = register_shrinker(&huge_zero_page_shrinker); + err = register_shrinker(&huge_zero_page_shrinker, "thp-zero"); if (err) goto err_hzp_shrinker; - err = register_shrinker(&deferred_split_shrinker); + err = register_shrinker(&deferred_split_shrinker, "thp-deferred_split"); if (err) goto err_split_shrinker; @@ -1938,7 +1938,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma, * replacing a zero pmd write protected page with a zero pte write * protected page. * - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ pmdp_huge_clear_flush(vma, haddr, pmd); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8a5e66d66305d..4c20521524f70 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -66,12 +66,6 @@ static bool hugetlb_cma_page(struct page *page, unsigned int order) #endif static unsigned long hugetlb_cma_size __initdata; -/* - * Minimum page order among possible hugepage sizes, set to a proper value - * at boot time. 
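On the gup.c changes above: the new VM_FAULT_COMPLETED handling converts "the fault was fully handled and the mmap lock is already gone" into -EAGAIN, and the hunk makes __get_user_pages() treat that exactly like the existing -EBUSY path. A condensed paraphrase of the resulting error ladder (not the literal function body):

    ret = faultin_page(vma, start, &foll_flags, /* unshare */ false, locked);
    switch (ret) {
    case 0:
            goto retry;     /* PTE should be present now; look the page up again */
    case -EBUSY:            /* VM_FAULT_RETRY: lock dropped, caller may retry */
    case -EAGAIN:           /* VM_FAULT_COMPLETED: lock dropped, fault already done */
            ret = 0;        /* report the pages pinned so far */
            fallthrough;
    case -EFAULT:
    case -ENOMEM:
    case -EHWPOISON:
            goto out;       /* *locked tells the caller whether mmap_lock is held */
    }

Using a distinct -EAGAIN rather than reusing -EBUSY keeps the two situations distinguishable at the faultin_page() boundary while the outer retry behaviour stays identical.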
- */ -static unsigned int minimum_order __read_mostly = UINT_MAX; - __initdata LIST_HEAD(huge_boot_pages); /* for command line parsing */ @@ -2152,11 +2146,17 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn) unsigned long pfn; struct page *page; int rc = 0; + unsigned int order; + struct hstate *h; if (!hugepages_supported()) return rc; - for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << minimum_order) { + order = huge_page_order(&default_hstate); + for_each_hstate(h) + order = min(order, huge_page_order(h)); + + for (pfn = start_pfn; pfn < end_pfn; pfn += 1 << order) { page = pfn_to_page(pfn); rc = dissolve_free_huge_page(page); if (rc) @@ -2766,8 +2766,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page, * Fail with -EBUSY if not possible. */ spin_unlock_irq(&hugetlb_lock); - if (!isolate_huge_page(old_page, list)) - ret = -EBUSY; + ret = isolate_hugetlb(old_page, list); spin_lock_irq(&hugetlb_lock); goto free_new; } else if (!HPageFreed(old_page)) { @@ -2843,7 +2842,7 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list) if (hstate_is_gigantic(h)) return -ENOMEM; - if (page_count(head) && isolate_huge_page(head, list)) + if (page_count(head) && !isolate_hugetlb(head, list)) ret = 0; else if (!page_count(head)) ret = alloc_and_dissolve_huge_page(h, head, list); @@ -3149,9 +3148,6 @@ static void __init hugetlb_init_hstates(void) struct hstate *h, *h2; for_each_hstate(h) { - if (minimum_order > huge_page_order(h)) - minimum_order = huge_page_order(h); - /* oversize hugepages were init'ed in early boot */ if (!hstate_is_gigantic(h)) hugetlb_hstate_alloc_pages(h); @@ -3176,7 +3172,6 @@ static void __init hugetlb_init_hstates(void) h->demote_order = h2->order; } } - VM_BUG_ON(minimum_order == UINT_MAX); } static void __init report_hugepages(void) @@ -4808,12 +4803,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, entry = swp_entry_to_pte(swp_entry); if (userfaultfd_wp(src_vma) && uffd_wp) entry = huge_pte_mkuffd_wp(entry); - set_huge_swap_pte_at(src, addr, src_pte, - entry, sz); + set_huge_pte_at(src, addr, src_pte, entry); } if (!userfaultfd_wp(dst_vma) && uffd_wp) entry = huge_pte_clear_uffd_wp(entry); - set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz); + set_huge_pte_at(dst, addr, dst_pte, entry); } else if (unlikely(is_pte_marker(entry))) { /* * We copy the pte marker only if the dst vma has @@ -4880,7 +4874,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src, * table protection not changing it to point * to a new page. 
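On the dissolve_free_huge_pages() change above: dropping the boot-time minimum_order only moves the computation to its single user; the stride of the PFN scan is unchanged. As a worked example, on x86-64 with 4 KiB base pages and both 2 MiB and 1 GiB hstates registered (an assumed configuration for illustration):

    order  = min(huge_page_order(2 MiB), huge_page_order(1 GiB)) = min(9, 18) = 9
    stride = 1 << order = 512 PFNs = 2 MiB per loop iteration

Recomputing the minimum per call is trivial next to the cost of walking the PFN range, and it removes the need to keep global state that is only valid after hugetlb_init_hstates().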
* - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ huge_ptep_set_wrprotect(src, addr, src_pte); entry = huge_pte_wrprotect(entry); @@ -5714,7 +5708,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, */ entry = huge_ptep_get(ptep); if (unlikely(is_hugetlb_entry_migration(entry))) { - migration_entry_wait_huge(vma, mm, ptep); + migration_entry_wait_huge(vma, ptep); return 0; } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) return VM_FAULT_HWPOISON_LARGE | @@ -6052,8 +6046,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); - (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte, - dst_vma->vm_flags & VM_WRITE); hugetlb_count_add(pages_per_huge_page(h), dst_mm); /* No need to invalidate - it was non-present before */ @@ -6363,8 +6355,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, newpte = pte_swp_mkuffd_wp(newpte); else if (uffd_wp_resolve) newpte = pte_swp_clear_uffd_wp(newpte); - set_huge_swap_pte_at(mm, address, ptep, - newpte, psize); + set_huge_pte_at(mm, address, ptep, newpte); pages++; } spin_unlock(ptl); @@ -6415,7 +6406,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, * No need to call mmu_notifier_invalidate_range() we are downgrading * page table protection not changing it to point to a new page. * - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ i_mmap_unlock_write(vma->vm_file->f_mapping); mmu_notifier_invalidate_range_end(&range); @@ -6940,7 +6931,7 @@ follow_huge_pmd(struct mm_struct *mm, unsigned long address, } else { if (is_hugetlb_entry_migration(pte)) { spin_unlock(ptl); - __migration_entry_wait(mm, (pte_t *)pmd, ptl); + __migration_entry_wait_huge((pte_t *)pmd, ptl); goto retry; } /* @@ -6972,15 +6963,15 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int fla return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT); } -bool isolate_huge_page(struct page *page, struct list_head *list) +int isolate_hugetlb(struct page *page, struct list_head *list) { - bool ret = true; + int ret = 0; spin_lock_irq(&hugetlb_lock); if (!PageHeadHuge(page) || !HPageMigratable(page) || !get_page_unless_zero(page)) { - ret = false; + ret = -EBUSY; goto unlock; } ClearHPageMigratable(page); @@ -7114,7 +7105,7 @@ void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) i_mmap_unlock_write(vma->vm_file->f_mapping); /* * No need to call mmu_notifier_invalidate_range(), see - * Documentation/vm/mmu_notifier.rst. + * Documentation/mm/mmu_notifier.rst. 
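On the isolate_hugetlb() conversion: because it now returns 0 on success and -EBUSY on failure instead of a bool, its callers flip the sense of their checks, as the gup.c and alloc_and_dissolve_huge_page() hunks above show. The new calling pattern, sketched:

    LIST_HEAD(isolate_list);

    if (isolate_hugetlb(page, &isolate_list))
            return -EBUSY;  /* not a migratable hugetlb head page, or it had no refcount */
    /* success: the page is on isolate_list with HPageMigratable cleared */

Returning an errno also lets alloc_and_dissolve_huge_page() pass the failure straight through instead of translating a bool back into -EBUSY.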
*/ mmu_notifier_invalidate_range_end(&range); } diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index 1089ea8a9c985..1362feb3c6c98 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -6,11 +6,11 @@ * * Author: Muchun Song * - * See Documentation/vm/vmemmap_dedup.rst + * See Documentation/mm/vmemmap_dedup.rst */ #define pr_fmt(fmt) "HugeTLB: " fmt -#include +#include #include "hugetlb_vmemmap.h" /* @@ -97,18 +97,68 @@ int hugetlb_vmemmap_alloc(struct hstate *h, struct page *head) return ret; } +static unsigned int vmemmap_optimizable_pages(struct hstate *h, + struct page *head) +{ + if (READ_ONCE(vmemmap_optimize_mode) == VMEMMAP_OPTIMIZE_OFF) + return 0; + + if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) { + pmd_t *pmdp, pmd; + struct page *vmemmap_page; + unsigned long vaddr = (unsigned long)head; + + /* + * Only the vmemmap page's vmemmap page can be self-hosted. + * Walking the page tables to find the backing page of the + * vmemmap page. + */ + pmdp = pmd_off_k(vaddr); + /* + * The READ_ONCE() is used to stabilize *pmdp in a register or + * on the stack so that it will stop changing under the code. + * The only concurrent operation where it can be changed is + * split_vmemmap_huge_pmd() (*pmdp will be stable after this + * operation). + */ + pmd = READ_ONCE(*pmdp); + if (pmd_leaf(pmd)) + vmemmap_page = pmd_page(pmd) + pte_index(vaddr); + else + vmemmap_page = pte_page(*pte_offset_kernel(pmdp, vaddr)); + /* + * Due to HugeTLB alignment requirements and the vmemmap pages + * being at the start of the hotplugged memory region in + * memory_hotplug.memmap_on_memory case. Checking any vmemmap + * page's vmemmap page if it is marked as VmemmapSelfHosted is + * sufficient. + * + * [ hotplugged memory ] + * [ section ][...][ section ] + * [ vmemmap ][ usable memory ] + * ^ | | | + * +---+ | | + * ^ | | + * +-------+ | + * ^ | + * +-------------------------------------------+ + */ + if (PageVmemmapSelfHosted(vmemmap_page)) + return 0; + } + + return hugetlb_optimize_vmemmap_pages(h); +} + void hugetlb_vmemmap_free(struct hstate *h, struct page *head) { unsigned long vmemmap_addr = (unsigned long)head; unsigned long vmemmap_end, vmemmap_reuse, vmemmap_pages; - vmemmap_pages = hugetlb_optimize_vmemmap_pages(h); + vmemmap_pages = vmemmap_optimizable_pages(h, head); if (!vmemmap_pages) return; - if (READ_ONCE(vmemmap_optimize_mode) == VMEMMAP_OPTIMIZE_OFF) - return; - static_branch_inc(&hugetlb_optimize_vmemmap_key); vmemmap_addr += RESERVE_VMEMMAP_SIZE; @@ -199,10 +249,10 @@ static struct ctl_table hugetlb_vmemmap_sysctls[] = { static __init int hugetlb_vmemmap_sysctls_init(void) { /* - * If "memory_hotplug.memmap_on_memory" is enabled or "struct page" - * crosses page boundaries, the vmemmap pages cannot be optimized. + * If "struct page" crosses page boundaries, the vmemmap pages cannot + * be optimized. 
*/ - if (!mhp_memmap_on_memory() && is_power_of_2(sizeof(struct page))) + if (is_power_of_2(sizeof(struct page))) register_sysctl_init("vm", hugetlb_vmemmap_sysctls); return 0; diff --git a/mm/kasan/hw_tags.c b/mm/kasan/hw_tags.c index 9e1b6544bfa8e..9ad8eff71b28d 100644 --- a/mm/kasan/hw_tags.c +++ b/mm/kasan/hw_tags.c @@ -257,27 +257,37 @@ static void unpoison_vmalloc_pages(const void *addr, u8 tag) } } +static void init_vmalloc_pages(const void *start, unsigned long size) +{ + const void *addr; + + for (addr = start; addr < start + size; addr += PAGE_SIZE) { + struct page *page = virt_to_page(addr); + + clear_highpage_kasan_tagged(page); + } +} + void *__kasan_unpoison_vmalloc(const void *start, unsigned long size, kasan_vmalloc_flags_t flags) { u8 tag; unsigned long redzone_start, redzone_size; - if (!kasan_vmalloc_enabled()) - return (void *)start; - - if (!is_vmalloc_or_module_addr(start)) + if (!kasan_vmalloc_enabled() || !is_vmalloc_or_module_addr(start)) { + if (flags & KASAN_VMALLOC_INIT) + init_vmalloc_pages(start, size); return (void *)start; + } /* - * Skip unpoisoning and assigning a pointer tag for non-VM_ALLOC - * mappings as: + * Don't tag non-VM_ALLOC mappings, as: * * 1. Unlike the software KASAN modes, hardware tag-based KASAN only * supports tagging physical memory. Therefore, it can only tag a * single mapping of normal physical pages. * 2. Hardware tag-based KASAN can only tag memory mapped with special - * mapping protection bits, see arch_vmalloc_pgprot_modify(). + * mapping protection bits, see arch_vmap_pgprot_tagged(). * As non-VM_ALLOC mappings can be mapped outside of vmalloc code, * providing these bits would require tracking all non-VM_ALLOC * mappers. @@ -289,15 +299,19 @@ void *__kasan_unpoison_vmalloc(const void *start, unsigned long size, * * For non-VM_ALLOC allocations, page_alloc memory is tagged as usual. */ - if (!(flags & KASAN_VMALLOC_VM_ALLOC)) + if (!(flags & KASAN_VMALLOC_VM_ALLOC)) { + WARN_ON(flags & KASAN_VMALLOC_INIT); return (void *)start; + } /* * Don't tag executable memory. * The kernel doesn't tolerate having the PC register tagged. 
*/ - if (!(flags & KASAN_VMALLOC_PROT_NORMAL)) + if (!(flags & KASAN_VMALLOC_PROT_NORMAL)) { + WARN_ON(flags & KASAN_VMALLOC_INIT); return (void *)start; + } tag = kasan_random_tag(); start = set_tag(start, tag); diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 16be62d493cd9..01e0d6336754e 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -147,8 +147,7 @@ static ssize_t scan_sleep_millisecs_store(struct kobject *kobj, return count; } static struct kobj_attribute scan_sleep_millisecs_attr = - __ATTR(scan_sleep_millisecs, 0644, scan_sleep_millisecs_show, - scan_sleep_millisecs_store); + __ATTR_RW(scan_sleep_millisecs); static ssize_t alloc_sleep_millisecs_show(struct kobject *kobj, struct kobj_attribute *attr, @@ -175,8 +174,7 @@ static ssize_t alloc_sleep_millisecs_store(struct kobject *kobj, return count; } static struct kobj_attribute alloc_sleep_millisecs_attr = - __ATTR(alloc_sleep_millisecs, 0644, alloc_sleep_millisecs_show, - alloc_sleep_millisecs_store); + __ATTR_RW(alloc_sleep_millisecs); static ssize_t pages_to_scan_show(struct kobject *kobj, struct kobj_attribute *attr, @@ -200,8 +198,7 @@ static ssize_t pages_to_scan_store(struct kobject *kobj, return count; } static struct kobj_attribute pages_to_scan_attr = - __ATTR(pages_to_scan, 0644, pages_to_scan_show, - pages_to_scan_store); + __ATTR_RW(pages_to_scan); static ssize_t pages_collapsed_show(struct kobject *kobj, struct kobj_attribute *attr, @@ -221,22 +218,21 @@ static ssize_t full_scans_show(struct kobject *kobj, static struct kobj_attribute full_scans_attr = __ATTR_RO(full_scans); -static ssize_t khugepaged_defrag_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) +static ssize_t defrag_show(struct kobject *kobj, + struct kobj_attribute *attr, char *buf) { return single_hugepage_flag_show(kobj, attr, buf, TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG); } -static ssize_t khugepaged_defrag_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) +static ssize_t defrag_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) { return single_hugepage_flag_store(kobj, attr, buf, count, TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG); } static struct kobj_attribute khugepaged_defrag_attr = - __ATTR(defrag, 0644, khugepaged_defrag_show, - khugepaged_defrag_store); + __ATTR_RW(defrag); /* * max_ptes_none controls if khugepaged should collapse hugepages over @@ -246,21 +242,21 @@ static struct kobj_attribute khugepaged_defrag_attr = * runs. Increasing max_ptes_none will instead potentially reduce the * free memory in the system during the khugepaged scan. 
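The khugepaged sysfs churn above follows from how __ATTR_RW() is built: it token-pastes the attribute name into <name>_show and <name>_store, so the handlers must be named after the attribute rather than carrying a khugepaged_ prefix. Roughly (quoting include/linux/sysfs.h from memory, so treat it as a sketch):

    #define __ATTR_RW(_name) __ATTR(_name, 0644, _name##_show, _name##_store)

    /* hence */
    static struct kobj_attribute khugepaged_defrag_attr = __ATTR_RW(defrag);
    /* expands to */
    static struct kobj_attribute khugepaged_defrag_attr =
            __ATTR(defrag, 0644, defrag_show, defrag_store);

The attribute variables keep their old khugepaged_*_attr names, so the attribute group arrays further down do not need to change.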
*/ -static ssize_t khugepaged_max_ptes_none_show(struct kobject *kobj, - struct kobj_attribute *attr, - char *buf) +static ssize_t max_ptes_none_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) { return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_none); } -static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) +static ssize_t max_ptes_none_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) { int err; unsigned long max_ptes_none; err = kstrtoul(buf, 10, &max_ptes_none); - if (err || max_ptes_none > HPAGE_PMD_NR-1) + if (err || max_ptes_none > HPAGE_PMD_NR - 1) return -EINVAL; khugepaged_max_ptes_none = max_ptes_none; @@ -268,25 +264,24 @@ static ssize_t khugepaged_max_ptes_none_store(struct kobject *kobj, return count; } static struct kobj_attribute khugepaged_max_ptes_none_attr = - __ATTR(max_ptes_none, 0644, khugepaged_max_ptes_none_show, - khugepaged_max_ptes_none_store); + __ATTR_RW(max_ptes_none); -static ssize_t khugepaged_max_ptes_swap_show(struct kobject *kobj, - struct kobj_attribute *attr, - char *buf) +static ssize_t max_ptes_swap_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) { return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_swap); } -static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) +static ssize_t max_ptes_swap_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) { int err; unsigned long max_ptes_swap; err = kstrtoul(buf, 10, &max_ptes_swap); - if (err || max_ptes_swap > HPAGE_PMD_NR-1) + if (err || max_ptes_swap > HPAGE_PMD_NR - 1) return -EINVAL; khugepaged_max_ptes_swap = max_ptes_swap; @@ -295,25 +290,24 @@ static ssize_t khugepaged_max_ptes_swap_store(struct kobject *kobj, } static struct kobj_attribute khugepaged_max_ptes_swap_attr = - __ATTR(max_ptes_swap, 0644, khugepaged_max_ptes_swap_show, - khugepaged_max_ptes_swap_store); + __ATTR_RW(max_ptes_swap); -static ssize_t khugepaged_max_ptes_shared_show(struct kobject *kobj, - struct kobj_attribute *attr, - char *buf) +static ssize_t max_ptes_shared_show(struct kobject *kobj, + struct kobj_attribute *attr, + char *buf) { return sysfs_emit(buf, "%u\n", khugepaged_max_ptes_shared); } -static ssize_t khugepaged_max_ptes_shared_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) +static ssize_t max_ptes_shared_store(struct kobject *kobj, + struct kobj_attribute *attr, + const char *buf, size_t count) { int err; unsigned long max_ptes_shared; err = kstrtoul(buf, 10, &max_ptes_shared); - if (err || max_ptes_shared > HPAGE_PMD_NR-1) + if (err || max_ptes_shared > HPAGE_PMD_NR - 1) return -EINVAL; khugepaged_max_ptes_shared = max_ptes_shared; @@ -322,8 +316,7 @@ static ssize_t khugepaged_max_ptes_shared_store(struct kobject *kobj, } static struct kobj_attribute khugepaged_max_ptes_shared_attr = - __ATTR(max_ptes_shared, 0644, khugepaged_max_ptes_shared_show, - khugepaged_max_ptes_shared_store); + __ATTR_RW(max_ptes_shared); static struct attribute *khugepaged_attr[] = { &khugepaged_defrag_attr.attr, @@ -599,7 +592,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, int none_or_zero = 0, shared = 0, result = 0, referenced = 0; bool writable = false; - for (_pte = pte; _pte < pte+HPAGE_PMD_NR; + for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++, address += PAGE_SIZE) { pte_t pteval = 
*_pte; if (pte_none(pteval) || (pte_present(pteval) && @@ -762,7 +755,12 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) { list_del(&src_page->lru); - release_pte_page(src_page); + mod_node_page_state(page_pgdat(src_page), + NR_ISOLATED_ANON + page_is_file_lru(src_page), + -compound_nr(src_page)); + unlock_page(src_page); + free_swap_cache(src_page); + putback_lru_page(src_page); } } @@ -972,8 +970,8 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address, * Bring missing pages in from swap, to complete THP collapse. * Only done if khugepaged_scan_pmd believes it is worthwhile. * - * Called and returns without pte mapped or spinlocks held, - * but with mmap_lock held to protect against vma changes. + * Called and returns without pte mapped or spinlocks held. + * Note that if false is returned, mmap_lock will be released. */ static bool __collapse_huge_page_swapin(struct mm_struct *mm, @@ -1000,27 +998,24 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm, pte_unmap(vmf.pte); continue; } - swapped_in++; ret = do_swap_page(&vmf); - /* do_swap_page returns VM_FAULT_RETRY with released mmap_lock */ + /* + * do_swap_page returns VM_FAULT_RETRY with released mmap_lock. + * Note we treat VM_FAULT_RETRY as VM_FAULT_ERROR here because + * we do not retry here and swap entry will remain in pagetable + * resulting in later failure. + */ if (ret & VM_FAULT_RETRY) { - mmap_read_lock(mm); - if (hugepage_vma_revalidate(mm, haddr, &vma)) { - /* vma is no longer available, don't continue to swapin */ - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0); - return false; - } - /* check if the pmd is still valid */ - if (mm_find_pmd(mm, haddr) != pmd) { - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0); - return false; - } + trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0); + return false; } if (ret & VM_FAULT_ERROR) { + mmap_read_unlock(mm); trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0); return false; } + swapped_in++; } /* Drain LRU add pagevec to remove extra pin on the swapped in pages */ @@ -1086,13 +1081,12 @@ static void collapse_huge_page(struct mm_struct *mm, } /* - * __collapse_huge_page_swapin always returns with mmap_lock locked. - * If it fails, we release mmap_lock and jump out_nolock. + * __collapse_huge_page_swapin will return with mmap_lock released + * when it fails. So we jump out_nolock directly in that case. * Continuing to collapse causes inconsistency. */ if (unmapped && !__collapse_huge_page_swapin(mm, vma, address, pmd, referenced)) { - mmap_read_unlock(mm); goto out_nolock; } @@ -1219,7 +1213,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load)); pte = pte_offset_map_lock(mm, pmd, address, &ptl); - for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR; + for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++, _address += PAGE_SIZE) { pte_t pteval = *_pte; if (is_swap_pte(pteval)) { @@ -1309,7 +1303,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, /* * Check if the page has any GUP (or other external) pins. * - * Here the check is racy it may see totmal_mapcount > refcount + * Here the check is racy it may see total_mapcount > refcount * in some cases. * For example, one process with one forked child process. 
* The parent has the PMD split due to MADV_DONTNEED, then @@ -1382,8 +1376,8 @@ static void collect_mm_slot(struct mm_slot *mm_slot) * Notify khugepaged that given addr of the mm is pte-mapped THP. Then * khugepaged should try to collapse the page table. */ -static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm, - unsigned long addr) +static void khugepaged_add_pte_mapped_thp(struct mm_struct *mm, + unsigned long addr) { struct mm_slot *mm_slot; @@ -1394,7 +1388,6 @@ static int khugepaged_add_pte_mapped_thp(struct mm_struct *mm, if (likely(mm_slot && mm_slot->nr_pte_mapped_thp < MAX_PTE_MAPPED_THP)) mm_slot->pte_mapped_thp[mm_slot->nr_pte_mapped_thp++] = addr; spin_unlock(&khugepaged_mm_lock); - return 0; } static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma, @@ -1557,7 +1550,7 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff) * mmap_write_lock(mm) as PMD-mapping is likely to be split * later. * - * Not that vma->anon_vma check is racy: it can be set up after + * Note that vma->anon_vma check is racy: it can be set up after * the check but before we took mmap_lock by the fault path. * But page lock would prevent establishing any new ptes of the * page, so we are safe. @@ -1885,8 +1878,8 @@ static void collapse_file(struct mm_struct *mm, if (nr_none) { __mod_lruvec_page_state(new_page, NR_FILE_PAGES, nr_none); - if (is_shmem) - __mod_lruvec_page_state(new_page, NR_SHMEM, nr_none); + /* nr_none is always 0 for non-shmem. */ + __mod_lruvec_page_state(new_page, NR_SHMEM, nr_none); } /* Join all the small entries into a single multi-index entry */ @@ -1950,10 +1943,10 @@ static void collapse_file(struct mm_struct *mm, /* Something went wrong: roll back page cache changes */ xas_lock_irq(&xas); - mapping->nrpages -= nr_none; - - if (is_shmem) + if (nr_none) { + mapping->nrpages -= nr_none; shmem_uncharge(mapping->host, nr_none); + } xas_set(&xas, start); xas_for_each(&xas, page, end - 1) { @@ -2145,8 +2138,6 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, if (khugepaged_scan.address < hstart) khugepaged_scan.address = hstart; VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK); - if (shmem_file(vma->vm_file) && !shmem_huge_enabled(vma)) - goto skip; while (khugepaged_scan.address < hend) { int ret; diff --git a/mm/kmemleak.c b/mm/kmemleak.c index a182f5ddaf68b..1eddc0132f7f5 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -14,14 +14,16 @@ * The following locks and mutexes are used by kmemleak: * * - kmemleak_lock (raw_spinlock_t): protects the object_list modifications and - * accesses to the object_tree_root. The object_list is the main list - * holding the metadata (struct kmemleak_object) for the allocated memory - * blocks. The object_tree_root is a red black tree used to look-up - * metadata based on a pointer to the corresponding memory block. The - * kmemleak_object structures are added to the object_list and - * object_tree_root in the create_object() function called from the - * kmemleak_alloc() callback and removed in delete_object() called from the - * kmemleak_free() callback + * accesses to the object_tree_root (or object_phys_tree_root). The + * object_list is the main list holding the metadata (struct kmemleak_object) + * for the allocated memory blocks. The object_tree_root and object_phys_tree_root + * are red black trees used to look-up metadata based on a pointer to the + * corresponding memory block. The object_phys_tree_root is for objects + * allocated with physical address. 
The kmemleak_object structures are + * added to the object_list and object_tree_root (or object_phys_tree_root) + * in the create_object() function called from the kmemleak_alloc() (or + * kmemleak_alloc_phys()) callback and removed in delete_object() called from + * the kmemleak_free() callback * - kmemleak_object.lock (raw_spinlock_t): protects a kmemleak_object. * Accesses to the metadata (e.g. count) are protected by this lock. Note * that some members of this structure may be protected by other means @@ -172,6 +174,8 @@ struct kmemleak_object { #define OBJECT_NO_SCAN (1 << 2) /* flag set to fully scan the object when scan_area allocation failed */ #define OBJECT_FULL_SCAN (1 << 3) +/* flag set for object allocated with physical address */ +#define OBJECT_PHYS (1 << 4) #define HEX_PREFIX " " /* number of bytes to print per line; must be 16 or 32 */ @@ -193,7 +197,9 @@ static int mem_pool_free_count = ARRAY_SIZE(mem_pool); static LIST_HEAD(mem_pool_free_list); /* search tree for object boundaries */ static struct rb_root object_tree_root = RB_ROOT; -/* protecting the access to object_list and object_tree_root */ +/* search tree for object (with OBJECT_PHYS flag) boundaries */ +static struct rb_root object_phys_tree_root = RB_ROOT; +/* protecting the access to object_list, object_tree_root (or object_phys_tree_root) */ static DEFINE_RAW_SPINLOCK(kmemleak_lock); /* allocation caches for kmemleak internal data */ @@ -285,6 +291,9 @@ static void hex_dump_object(struct seq_file *seq, const u8 *ptr = (const u8 *)object->pointer; size_t len; + if (WARN_ON_ONCE(object->flags & OBJECT_PHYS)) + return; + /* limit the number of lines to HEX_MAX_LINES */ len = min_t(size_t, object->size, HEX_MAX_LINES * HEX_ROW_SIZE); @@ -378,9 +387,11 @@ static void dump_object_info(struct kmemleak_object *object) * beginning of the memory block are allowed. The kmemleak_lock must be held * when calling this function. */ -static struct kmemleak_object *lookup_object(unsigned long ptr, int alias) +static struct kmemleak_object *__lookup_object(unsigned long ptr, int alias, + bool is_phys) { - struct rb_node *rb = object_tree_root.rb_node; + struct rb_node *rb = is_phys ? object_phys_tree_root.rb_node : + object_tree_root.rb_node; unsigned long untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr); while (rb) { @@ -406,6 +417,12 @@ static struct kmemleak_object *lookup_object(unsigned long ptr, int alias) return NULL; } +/* Look-up a kmemleak object which allocated with virtual address. */ +static struct kmemleak_object *lookup_object(unsigned long ptr, int alias) +{ + return __lookup_object(ptr, alias, false); +} + /* * Increment the object use_count. Return 1 if successful or 0 otherwise. Note * that once an object's use_count reached 0, the RCU freeing was already @@ -515,14 +532,15 @@ static void put_object(struct kmemleak_object *object) /* * Look up an object in the object search tree and increase its use_count. 
*/ -static struct kmemleak_object *find_and_get_object(unsigned long ptr, int alias) +static struct kmemleak_object *__find_and_get_object(unsigned long ptr, int alias, + bool is_phys) { unsigned long flags; struct kmemleak_object *object; rcu_read_lock(); raw_spin_lock_irqsave(&kmemleak_lock, flags); - object = lookup_object(ptr, alias); + object = __lookup_object(ptr, alias, is_phys); raw_spin_unlock_irqrestore(&kmemleak_lock, flags); /* check whether the object is still available */ @@ -533,28 +551,39 @@ static struct kmemleak_object *find_and_get_object(unsigned long ptr, int alias) return object; } +/* Look up and get an object which allocated with virtual address. */ +static struct kmemleak_object *find_and_get_object(unsigned long ptr, int alias) +{ + return __find_and_get_object(ptr, alias, false); +} + /* - * Remove an object from the object_tree_root and object_list. Must be called - * with the kmemleak_lock held _if_ kmemleak is still enabled. + * Remove an object from the object_tree_root (or object_phys_tree_root) + * and object_list. Must be called with the kmemleak_lock held _if_ kmemleak + * is still enabled. */ static void __remove_object(struct kmemleak_object *object) { - rb_erase(&object->rb_node, &object_tree_root); + rb_erase(&object->rb_node, object->flags & OBJECT_PHYS ? + &object_phys_tree_root : + &object_tree_root); list_del_rcu(&object->object_list); } /* * Look up an object in the object search tree and remove it from both - * object_tree_root and object_list. The returned object's use_count should be - * at least 1, as initially set by create_object(). + * object_tree_root (or object_phys_tree_root) and object_list. The + * returned object's use_count should be at least 1, as initially set + * by create_object(). */ -static struct kmemleak_object *find_and_remove_object(unsigned long ptr, int alias) +static struct kmemleak_object *find_and_remove_object(unsigned long ptr, int alias, + bool is_phys) { unsigned long flags; struct kmemleak_object *object; raw_spin_lock_irqsave(&kmemleak_lock, flags); - object = lookup_object(ptr, alias); + object = __lookup_object(ptr, alias, is_phys); if (object) __remove_object(object); raw_spin_unlock_irqrestore(&kmemleak_lock, flags); @@ -572,10 +601,12 @@ static int __save_stack_trace(unsigned long *trace) /* * Create the metadata (struct kmemleak_object) corresponding to an allocated - * memory block and add it to the object_list and object_tree_root. + * memory block and add it to the object_list and object_tree_root (or + * object_phys_tree_root). */ -static struct kmemleak_object *create_object(unsigned long ptr, size_t size, - int min_count, gfp_t gfp) +static struct kmemleak_object *__create_object(unsigned long ptr, size_t size, + int min_count, gfp_t gfp, + bool is_phys) { unsigned long flags; struct kmemleak_object *object, *parent; @@ -595,7 +626,7 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size, INIT_HLIST_HEAD(&object->area_list); raw_spin_lock_init(&object->lock); atomic_set(&object->use_count, 1); - object->flags = OBJECT_ALLOCATED; + object->flags = OBJECT_ALLOCATED | (is_phys ? 
OBJECT_PHYS : 0); object->pointer = ptr; object->size = kfence_ksize((void *)ptr) ?: size; object->excess_ref = 0; @@ -628,9 +659,16 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size, raw_spin_lock_irqsave(&kmemleak_lock, flags); untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr); - min_addr = min(min_addr, untagged_ptr); - max_addr = max(max_addr, untagged_ptr + size); - link = &object_tree_root.rb_node; + /* + * Only update min_addr and max_addr with object + * storing virtual address. + */ + if (!is_phys) { + min_addr = min(min_addr, untagged_ptr); + max_addr = max(max_addr, untagged_ptr + size); + } + link = is_phys ? &object_phys_tree_root.rb_node : + &object_tree_root.rb_node; rb_parent = NULL; while (*link) { rb_parent = *link; @@ -654,7 +692,8 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size, } } rb_link_node(&object->rb_node, rb_parent, link); - rb_insert_color(&object->rb_node, &object_tree_root); + rb_insert_color(&object->rb_node, is_phys ? &object_phys_tree_root : + &object_tree_root); list_add_tail_rcu(&object->object_list, &object_list); out: @@ -662,6 +701,20 @@ static struct kmemleak_object *create_object(unsigned long ptr, size_t size, return object; } +/* Create kmemleak object which allocated with virtual address. */ +static struct kmemleak_object *create_object(unsigned long ptr, size_t size, + int min_count, gfp_t gfp) +{ + return __create_object(ptr, size, min_count, gfp, false); +} + +/* Create kmemleak object which allocated with physical address. */ +static struct kmemleak_object *create_object_phys(unsigned long ptr, size_t size, + int min_count, gfp_t gfp) +{ + return __create_object(ptr, size, min_count, gfp, true); +} + /* * Mark the object as not allocated and schedule RCU freeing via put_object(). */ @@ -690,7 +743,7 @@ static void delete_object_full(unsigned long ptr) { struct kmemleak_object *object; - object = find_and_remove_object(ptr, 0); + object = find_and_remove_object(ptr, 0, false); if (!object) { #ifdef DEBUG kmemleak_warn("Freeing unknown object at 0x%08lx\n", @@ -706,12 +759,12 @@ static void delete_object_full(unsigned long ptr) * delete it. If the memory block is partially freed, the function may create * additional metadata for the remaining parts of the block. 
*/ -static void delete_object_part(unsigned long ptr, size_t size) +static void delete_object_part(unsigned long ptr, size_t size, bool is_phys) { struct kmemleak_object *object; unsigned long start, end; - object = find_and_remove_object(ptr, 1); + object = find_and_remove_object(ptr, 1, is_phys); if (!object) { #ifdef DEBUG kmemleak_warn("Partially freeing unknown object at 0x%08lx (size %zu)\n", @@ -728,11 +781,11 @@ static void delete_object_part(unsigned long ptr, size_t size) start = object->pointer; end = object->pointer + object->size; if (ptr > start) - create_object(start, ptr - start, object->min_count, - GFP_KERNEL); + __create_object(start, ptr - start, object->min_count, + GFP_KERNEL, is_phys); if (ptr + size < end) - create_object(ptr + size, end - ptr - size, object->min_count, - GFP_KERNEL); + __create_object(ptr + size, end - ptr - size, object->min_count, + GFP_KERNEL, is_phys); __delete_object(object); } @@ -753,11 +806,11 @@ static void paint_it(struct kmemleak_object *object, int color) raw_spin_unlock_irqrestore(&object->lock, flags); } -static void paint_ptr(unsigned long ptr, int color) +static void paint_ptr(unsigned long ptr, int color, bool is_phys) { struct kmemleak_object *object; - object = find_and_get_object(ptr, 0); + object = __find_and_get_object(ptr, 0, is_phys); if (!object) { kmemleak_warn("Trying to color unknown object at 0x%08lx as %s\n", ptr, @@ -775,16 +828,16 @@ static void paint_ptr(unsigned long ptr, int color) */ static void make_gray_object(unsigned long ptr) { - paint_ptr(ptr, KMEMLEAK_GREY); + paint_ptr(ptr, KMEMLEAK_GREY, false); } /* * Mark the object as black-colored so that it is ignored from scans and * reporting. */ -static void make_black_object(unsigned long ptr) +static void make_black_object(unsigned long ptr, bool is_phys) { - paint_ptr(ptr, KMEMLEAK_BLACK); + paint_ptr(ptr, KMEMLEAK_BLACK, is_phys); } /* @@ -990,7 +1043,7 @@ void __ref kmemleak_free_part(const void *ptr, size_t size) pr_debug("%s(0x%p)\n", __func__, ptr); if (kmemleak_enabled && ptr && !IS_ERR(ptr)) - delete_object_part((unsigned long)ptr, size); + delete_object_part((unsigned long)ptr, size, false); } EXPORT_SYMBOL_GPL(kmemleak_free_part); @@ -1078,7 +1131,7 @@ void __ref kmemleak_ignore(const void *ptr) pr_debug("%s(0x%p)\n", __func__, ptr); if (kmemleak_enabled && ptr && !IS_ERR(ptr)) - make_black_object((unsigned long)ptr); + make_black_object((unsigned long)ptr, false); } EXPORT_SYMBOL(kmemleak_ignore); @@ -1125,15 +1178,18 @@ EXPORT_SYMBOL(kmemleak_no_scan); * address argument * @phys: physical address of the object * @size: size of the object - * @min_count: minimum number of references to this object. - * See kmemleak_alloc() * @gfp: kmalloc() flags used for kmemleak internal memory allocations */ -void __ref kmemleak_alloc_phys(phys_addr_t phys, size_t size, int min_count, - gfp_t gfp) +void __ref kmemleak_alloc_phys(phys_addr_t phys, size_t size, gfp_t gfp) { - if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn) - kmemleak_alloc(__va(phys), size, min_count, gfp); + pr_debug("%s(0x%pa, %zu)\n", __func__, &phys, size); + + if (kmemleak_enabled) + /* + * Create object with OBJECT_PHYS flag and + * assume min_count 0. 
+ */ + create_object_phys((unsigned long)phys, size, 0, gfp); } EXPORT_SYMBOL(kmemleak_alloc_phys); @@ -1146,22 +1202,12 @@ EXPORT_SYMBOL(kmemleak_alloc_phys); */ void __ref kmemleak_free_part_phys(phys_addr_t phys, size_t size) { - if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn) - kmemleak_free_part(__va(phys), size); -} -EXPORT_SYMBOL(kmemleak_free_part_phys); + pr_debug("%s(0x%pa)\n", __func__, &phys); -/** - * kmemleak_not_leak_phys - similar to kmemleak_not_leak but taking a physical - * address argument - * @phys: physical address of the object - */ -void __ref kmemleak_not_leak_phys(phys_addr_t phys) -{ - if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn) - kmemleak_not_leak(__va(phys)); + if (kmemleak_enabled) + delete_object_part((unsigned long)phys, size, true); } -EXPORT_SYMBOL(kmemleak_not_leak_phys); +EXPORT_SYMBOL(kmemleak_free_part_phys); /** * kmemleak_ignore_phys - similar to kmemleak_ignore but taking a physical @@ -1170,8 +1216,10 @@ EXPORT_SYMBOL(kmemleak_not_leak_phys); */ void __ref kmemleak_ignore_phys(phys_addr_t phys) { - if (PHYS_PFN(phys) >= min_low_pfn && PHYS_PFN(phys) < max_low_pfn) - kmemleak_ignore(__va(phys)); + pr_debug("%s(0x%pa)\n", __func__, &phys); + + if (kmemleak_enabled) + make_black_object((unsigned long)phys, true); } EXPORT_SYMBOL(kmemleak_ignore_phys); @@ -1182,6 +1230,9 @@ static bool update_checksum(struct kmemleak_object *object) { u32 old_csum = object->checksum; + if (WARN_ON_ONCE(object->flags & OBJECT_PHYS)) + return false; + kasan_disable_current(); kcsan_disable_current(); object->checksum = crc32(0, kasan_reset_tag((void *)object->pointer), object->size); @@ -1335,6 +1386,7 @@ static void scan_object(struct kmemleak_object *object) { struct kmemleak_scan_area *area; unsigned long flags; + void *obj_ptr; /* * Once the object->lock is acquired, the corresponding memory block @@ -1346,10 +1398,15 @@ static void scan_object(struct kmemleak_object *object) if (!(object->flags & OBJECT_ALLOCATED)) /* already freed object */ goto out; + + obj_ptr = object->flags & OBJECT_PHYS ? 
+ __va((phys_addr_t)object->pointer) : + (void *)object->pointer; + if (hlist_empty(&object->area_list) || object->flags & OBJECT_FULL_SCAN) { - void *start = (void *)object->pointer; - void *end = (void *)(object->pointer + object->size); + void *start = obj_ptr; + void *end = obj_ptr + object->size; void *next; do { @@ -1413,18 +1470,21 @@ static void scan_gray_list(void) */ static void kmemleak_scan(void) { - unsigned long flags; struct kmemleak_object *object; struct zone *zone; int __maybe_unused i; int new_leaks = 0; + int loop1_cnt = 0; jiffies_last_scan = jiffies; /* prepare the kmemleak_object's */ rcu_read_lock(); list_for_each_entry_rcu(object, &object_list, object_list) { - raw_spin_lock_irqsave(&object->lock, flags); + bool obj_pinned = false; + + loop1_cnt++; + raw_spin_lock_irq(&object->lock); #ifdef DEBUG /* * With a few exceptions there should be a maximum of @@ -1436,12 +1496,45 @@ static void kmemleak_scan(void) dump_object_info(object); } #endif + + /* ignore objects outside lowmem (paint them black) */ + if ((object->flags & OBJECT_PHYS) && + !(object->flags & OBJECT_NO_SCAN)) { + unsigned long phys = object->pointer; + + if (PHYS_PFN(phys) < min_low_pfn || + PHYS_PFN(phys + object->size) >= max_low_pfn) + __paint_it(object, KMEMLEAK_BLACK); + } + /* reset the reference count (whiten the object) */ object->count = 0; - if (color_gray(object) && get_object(object)) + if (color_gray(object) && get_object(object)) { list_add_tail(&object->gray_list, &gray_list); + obj_pinned = true; + } - raw_spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irq(&object->lock); + + /* + * Do a cond_resched() to avoid soft lockup every 64k objects. + * Make sure a reference has been taken so that the object + * won't go away without RCU read lock. + */ + if (!(loop1_cnt & 0xffff)) { + if (!obj_pinned && !get_object(object)) { + /* Try the next object instead */ + loop1_cnt--; + continue; + } + + rcu_read_unlock(); + cond_resched(); + rcu_read_lock(); + + if (!obj_pinned) + put_object(object); + } } rcu_read_unlock(); @@ -1509,14 +1602,21 @@ static void kmemleak_scan(void) */ rcu_read_lock(); list_for_each_entry_rcu(object, &object_list, object_list) { - raw_spin_lock_irqsave(&object->lock, flags); + /* + * This is racy but we can save the overhead of lock/unlock + * calls. The missed objects, if any, should be caught in + * the next scan. + */ + if (!color_white(object)) + continue; + raw_spin_lock_irq(&object->lock); if (color_white(object) && (object->flags & OBJECT_ALLOCATED) && update_checksum(object) && get_object(object)) { /* color it gray temporarily */ object->count = object->min_count; list_add_tail(&object->gray_list, &gray_list); } - raw_spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irq(&object->lock); } rcu_read_unlock(); @@ -1536,7 +1636,14 @@ static void kmemleak_scan(void) */ rcu_read_lock(); list_for_each_entry_rcu(object, &object_list, object_list) { - raw_spin_lock_irqsave(&object->lock, flags); + /* + * This is racy but we can save the overhead of lock/unlock + * calls. The missed objects, if any, should be caught in + * the next scan. 
+ */ + if (!color_white(object)) + continue; + raw_spin_lock_irq(&object->lock); if (unreferenced_object(object) && !(object->flags & OBJECT_REPORTED)) { object->flags |= OBJECT_REPORTED; @@ -1546,7 +1653,7 @@ static void kmemleak_scan(void) new_leaks++; } - raw_spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irq(&object->lock); } rcu_read_unlock(); @@ -1748,15 +1855,14 @@ static int dump_str_object_info(const char *str) static void kmemleak_clear(void) { struct kmemleak_object *object; - unsigned long flags; rcu_read_lock(); list_for_each_entry_rcu(object, &object_list, object_list) { - raw_spin_lock_irqsave(&object->lock, flags); + raw_spin_lock_irq(&object->lock); if ((object->flags & OBJECT_REPORTED) && unreferenced_object(object)) __paint_it(object, KMEMLEAK_GREY); - raw_spin_unlock_irqrestore(&object->lock, flags); + raw_spin_unlock_irq(&object->lock); } rcu_read_unlock(); diff --git a/mm/ksm.c b/mm/ksm.c index e8f8c1a2bb396..fe3e0a39f73a0 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -1083,7 +1083,7 @@ static int write_protect_page(struct vm_area_struct *vma, struct page *page, * No need to notify as we are downgrading page table to read * only not changing it to point to a new page. * - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte); /* @@ -1186,7 +1186,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page, * No need to notify as we are replacing a read only page with another * read only page with the same content. * - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ ptep_clear_flush(vma, addr, ptep); set_pte_at_notify(mm, addr, ptep, newpte); diff --git a/mm/list_lru.c b/mm/list_lru.c index ba76428ceecea..a05e5bef3b400 100644 --- a/mm/list_lru.c +++ b/mm/list_lru.c @@ -71,7 +71,7 @@ list_lru_from_kmem(struct list_lru *lru, int nid, void *ptr, if (!list_lru_memcg_aware(lru)) goto out; - memcg = mem_cgroup_from_obj(ptr); + memcg = mem_cgroup_from_slab_obj(ptr); if (!memcg) goto out; diff --git a/mm/madvise.c b/mm/madvise.c index 0316bbc6441b2..e55108d4e4b2c 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -195,7 +195,6 @@ static int madvise_update_vma(struct vm_area_struct *vma, static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, unsigned long end, struct mm_walk *walk) { - pte_t *orig_pte; struct vm_area_struct *vma = walk->private; unsigned long index; struct swap_iocb *splug = NULL; @@ -208,12 +207,13 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start, swp_entry_t entry; struct page *page; spinlock_t *ptl; + pte_t *ptep; - orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl); - pte = *(orig_pte + ((index - start) / PAGE_SIZE)); - pte_unmap_unlock(orig_pte, ptl); + ptep = pte_offset_map_lock(vma->vm_mm, pmd, index, &ptl); + pte = *ptep; + pte_unmap_unlock(ptep, ptl); - if (pte_present(pte) || pte_none(pte)) + if (!is_swap_pte(pte)) continue; entry = pte_to_swp_entry(pte); if (unlikely(non_swap_entry(entry))) diff --git a/mm/memblock.c b/mm/memblock.c index 4cfeabc313b83..5b835bc04b098 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1359,8 +1359,8 @@ __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone, * from the regions with mirroring enabled and then retried from any * memory region. * - * In addition, function sets the min_count to 0 using kmemleak_alloc_phys for - * allocated boot memory block, so that it is never reported as leaks. 
+ * In addition, function using kmemleak_alloc_phys for allocated boot + * memory block, it is never reported as leaks. * * Return: * Physical address of allocated memory block on success, %0 on failure. @@ -1412,12 +1412,12 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, */ if (end != MEMBLOCK_ALLOC_NOLEAKTRACE) /* - * The min_count is set to 0 so that memblock allocated - * blocks are never reported as leaks. This is because many - * of these blocks are only referred via the physical - * address which is not looked up by kmemleak. + * Memblock allocated blocks are never reported as + * leaks. This is because many of these blocks are + * only referred via the physical address which is + * not looked up by kmemleak. */ - kmemleak_alloc_phys(found, size, 0, 0); + kmemleak_alloc_phys(found, size, 0); return found; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 618c366a2f074..1497affe08c4e 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -783,7 +783,7 @@ void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) struct lruvec *lruvec; rcu_read_lock(); - memcg = mem_cgroup_from_obj(p); + memcg = mem_cgroup_from_slab_obj(p); /* * Untracked pages have no memcg, no lruvec. Update only the @@ -1460,6 +1460,29 @@ static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg, return memcg_page_state(memcg, item) * memcg_page_state_unit(item); } +/* Subset of vm_event_item to report for memcg event stats */ +static const unsigned int memcg_vm_event_stat[] = { + PGSCAN_KSWAPD, + PGSCAN_DIRECT, + PGSTEAL_KSWAPD, + PGSTEAL_DIRECT, + PGFAULT, + PGMAJFAULT, + PGREFILL, + PGACTIVATE, + PGDEACTIVATE, + PGLAZYFREE, + PGLAZYFREED, +#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) + ZSWPIN, + ZSWPOUT, +#endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + THP_FAULT_ALLOC, + THP_COLLAPSE_ALLOC, +#endif +}; + static char *memory_stat_format(struct mem_cgroup *memcg) { struct seq_buf s; @@ -1495,41 +1518,17 @@ static char *memory_stat_format(struct mem_cgroup *memcg) } /* Accumulated memory events */ - - seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGFAULT), - memcg_events(memcg, PGFAULT)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT), - memcg_events(memcg, PGMAJFAULT)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGREFILL), - memcg_events(memcg, PGREFILL)); seq_buf_printf(&s, "pgscan %lu\n", memcg_events(memcg, PGSCAN_KSWAPD) + memcg_events(memcg, PGSCAN_DIRECT)); seq_buf_printf(&s, "pgsteal %lu\n", memcg_events(memcg, PGSTEAL_KSWAPD) + memcg_events(memcg, PGSTEAL_DIRECT)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGACTIVATE), - memcg_events(memcg, PGACTIVATE)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGDEACTIVATE), - memcg_events(memcg, PGDEACTIVATE)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGLAZYFREE), - memcg_events(memcg, PGLAZYFREE)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGLAZYFREED), - memcg_events(memcg, PGLAZYFREED)); - -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) - seq_buf_printf(&s, "%s %lu\n", vm_event_name(ZSWPIN), - memcg_events(memcg, ZSWPIN)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(ZSWPOUT), - memcg_events(memcg, ZSWPOUT)); -#endif -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - seq_buf_printf(&s, "%s %lu\n", vm_event_name(THP_FAULT_ALLOC), - memcg_events(memcg, THP_FAULT_ALLOC)); - seq_buf_printf(&s, "%s %lu\n", vm_event_name(THP_COLLAPSE_ALLOC), - memcg_events(memcg, THP_COLLAPSE_ALLOC)); -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ + for (i = 0; i < ARRAY_SIZE(memcg_vm_event_stat); 
i++) + seq_buf_printf(&s, "%s %lu\n", + vm_event_name(memcg_vm_event_stat[i]), + memcg_events(memcg, memcg_vm_event_stat[i])); /* The above should easily fit into one page */ WARN_ON_ONCE(seq_buf_has_overflowed(&s)); @@ -2842,27 +2841,9 @@ int memcg_alloc_slab_cgroups(struct slab *slab, struct kmem_cache *s, return 0; } -/* - * Returns a pointer to the memory cgroup to which the kernel object is charged. - * - * A passed kernel object can be a slab object or a generic kernel page, so - * different mechanisms for getting the memory cgroup pointer should be used. - * In certain cases (e.g. kernel stacks or large kmallocs with SLUB) the caller - * can not know for sure how the kernel object is implemented. - * mem_cgroup_from_obj() can be safely used in such cases. - * - * The caller must ensure the memcg lifetime, e.g. by taking rcu_read_lock(), - * cgroup_mutex, etc. - */ -struct mem_cgroup *mem_cgroup_from_obj(void *p) +static __always_inline +struct mem_cgroup *mem_cgroup_from_obj_folio(struct folio *folio, void *p) { - struct folio *folio; - - if (mem_cgroup_disabled()) - return NULL; - - folio = virt_to_folio(p); - /* * Slab objects are accounted individually, not per-page. * Memcg membership data for each individual object is saved in @@ -2895,6 +2876,53 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p) return page_memcg_check(folio_page(folio, 0)); } +/* + * Returns a pointer to the memory cgroup to which the kernel object is charged. + * + * A passed kernel object can be a slab object, vmalloc object or a generic + * kernel page, so different mechanisms for getting the memory cgroup pointer + * should be used. + * + * In certain cases (e.g. kernel stacks or large kmallocs with SLUB) the caller + * can not know for sure how the kernel object is implemented. + * mem_cgroup_from_obj() can be safely used in such cases. + * + * The caller must ensure the memcg lifetime, e.g. by taking rcu_read_lock(), + * cgroup_mutex, etc. + */ +struct mem_cgroup *mem_cgroup_from_obj(void *p) +{ + struct folio *folio; + + if (mem_cgroup_disabled()) + return NULL; + + if (unlikely(is_vmalloc_addr(p))) + folio = page_folio(vmalloc_to_page(p)); + else + folio = virt_to_folio(p); + + return mem_cgroup_from_obj_folio(folio, p); +} + +/* + * Returns a pointer to the memory cgroup to which the kernel object is charged. + * Similar to mem_cgroup_from_obj(), but faster and not suitable for objects, + * allocated using vmalloc(). + * + * A passed kernel object must be a slab object or a generic kernel page. + * + * The caller must ensure the memcg lifetime, e.g. by taking rcu_read_lock(), + * cgroup_mutex, etc. 
+ */ +struct mem_cgroup *mem_cgroup_from_slab_obj(void *p) +{ + if (mem_cgroup_disabled()) + return NULL; + + return mem_cgroup_from_obj_folio(virt_to_folio(p), p); +} + static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg) { struct obj_cgroup *objcg = NULL; @@ -5060,6 +5088,29 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id) return idr_find(&mem_cgroup_idr, id); } +#ifdef CONFIG_SHRINKER_DEBUG +struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino) +{ + struct cgroup *cgrp; + struct cgroup_subsys_state *css; + struct mem_cgroup *memcg; + + cgrp = cgroup_get_from_id(ino); + if (!cgrp) + return ERR_PTR(-ENOENT); + + css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys); + if (css) + memcg = container_of(css, struct mem_cgroup, css); + else + memcg = ERR_PTR(-ENOENT); + + cgroup_put(cgrp); + + return memcg; +} +#endif + static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) { struct mem_cgroup_per_node *pn; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index b864c2eff641a..346633797ff9d 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1007,12 +1007,13 @@ static int me_swapcache_dirty(struct page_state *ps, struct page *p) static int me_swapcache_clean(struct page_state *ps, struct page *p) { + struct folio *folio = page_folio(p); int ret; - delete_from_swap_cache(p); + delete_from_swap_cache(folio); ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED; - unlock_page(p); + folio_unlock(folio); if (has_extra_refcount(ps, p, false)) ret = MF_FAILED; @@ -2178,7 +2179,7 @@ static bool isolate_page(struct page *page, struct list_head *pagelist) bool lru = PageLRU(page); if (PageHuge(page)) { - isolated = isolate_huge_page(page, pagelist); + isolated = !isolate_hugetlb(page, pagelist); } else { if (lru) isolated = !isolate_lru_page(page); diff --git a/mm/memory.c b/mm/memory.c index 9631c5f55bac7..78225480d7b83 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3020,7 +3020,7 @@ static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf) balance_dirty_pages_ratelimited(mapping); if (fpin) { fput(fpin); - return VM_FAULT_RETRY; + return VM_FAULT_COMPLETED; } } diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 1213d0c67a535..99ecb2b3ff53e 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -43,30 +43,22 @@ #include "shuffle.h" #ifdef CONFIG_MHP_MEMMAP_ON_MEMORY -static int memmap_on_memory_set(const char *val, const struct kernel_param *kp) -{ - if (hugetlb_optimize_vmemmap_enabled()) - return 0; - return param_set_bool(val, kp); -} - -static const struct kernel_param_ops memmap_on_memory_ops = { - .flags = KERNEL_PARAM_OPS_FL_NOARG, - .set = memmap_on_memory_set, - .get = param_get_bool, -}; - /* * memory_hotplug.memmap_on_memory parameter */ static bool memmap_on_memory __ro_after_init; -module_param_cb(memmap_on_memory, &memmap_on_memory_ops, &memmap_on_memory, 0444); +module_param(memmap_on_memory, bool, 0444); MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug"); -bool mhp_memmap_on_memory(void) +static inline bool mhp_memmap_on_memory(void) { return memmap_on_memory; } +#else +static inline bool mhp_memmap_on_memory(void) +{ + return false; +} #endif enum { @@ -237,8 +229,7 @@ static void release_memory_resource(struct resource *res) kfree(res); } -static int check_pfn_span(unsigned long pfn, unsigned long nr_pages, - const char *reason) +static int check_pfn_span(unsigned long pfn, unsigned long nr_pages) { /* * Disallow all operations smaller than a 
sub-section and only @@ -255,12 +246,8 @@ static int check_pfn_span(unsigned long pfn, unsigned long nr_pages, min_align = PAGES_PER_SUBSECTION; else min_align = PAGES_PER_SECTION; - if (!IS_ALIGNED(pfn, min_align) - || !IS_ALIGNED(nr_pages, min_align)) { - WARN(1, "Misaligned __%s_pages start: %#lx end: #%lx\n", - reason, pfn, pfn + nr_pages - 1); + if (!IS_ALIGNED(pfn | nr_pages, min_align)) return -EINVAL; - } return 0; } @@ -337,9 +324,10 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, altmap->alloc = 0; } - err = check_pfn_span(pfn, nr_pages, "add"); - if (err) - return err; + if (check_pfn_span(pfn, nr_pages)) { + WARN(1, "Misaligned %s start: %#lx end: #%lx\n", __func__, pfn, pfn + nr_pages - 1); + return -EINVAL; + } for (; pfn < end_pfn; pfn += cur_nr_pages) { /* Select all remaining pages up to the next section boundary */ @@ -536,8 +524,10 @@ void __remove_pages(unsigned long pfn, unsigned long nr_pages, map_offset = vmem_altmap_offset(altmap); - if (check_pfn_span(pfn, nr_pages, "remove")) + if (check_pfn_span(pfn, nr_pages)) { + WARN(1, "Misaligned %s start: %#lx end: #%lx\n", __func__, pfn, pfn + nr_pages - 1); return; + } for (; pfn < end_pfn; pfn += cur_nr_pages) { cond_resched(); @@ -672,12 +662,18 @@ static void __meminit resize_pgdat_range(struct pglist_data *pgdat, unsigned lon } +#ifdef CONFIG_ZONE_DEVICE static void section_taint_zone_device(unsigned long pfn) { struct mem_section *ms = __pfn_to_section(pfn); ms->section_mem_map |= SECTION_TAINT_ZONE_DEVICE; } +#else +static inline void section_taint_zone_device(unsigned long pfn) +{ +} +#endif /* * Associate the pfn range with the given zone, initializing the memmaps @@ -1031,7 +1027,7 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, struct zone *zone) { unsigned long end_pfn = pfn + nr_pages; - int ret; + int ret, i; ret = kasan_add_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages)); if (ret) @@ -1039,6 +1035,9 @@ int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages, move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_UNMOVABLE); + for (i = 0; i < nr_pages; i++) + SetPageVmemmapSelfHosted(pfn_to_page(pfn + i)); + /* * It might be that the vmemmap_pages fully span sections. If that is * the case, mark those sections online here as otherwise they will be @@ -1643,7 +1642,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) if (PageHuge(page)) { pfn = page_to_pfn(head) + compound_nr(head) - 1; - isolate_huge_page(head, &source); + isolate_hugetlb(head, &source); continue; } else if (PageTransHuge(page)) pfn = page_to_pfn(head) + thp_nr_pages(page) - 1; diff --git a/mm/mempolicy.c b/mm/mempolicy.c index d39b01fd52fe4..f4cd963550c1c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -602,7 +602,7 @@ static int queue_pages_hugetlb(pte_t *pte, unsigned long hmask, /* With MPOL_MF_MOVE, we migrate only unshared hugepage. 
*/ if (flags & (MPOL_MF_MOVE_ALL) || (flags & MPOL_MF_MOVE && page_mapcount(page) == 1)) { - if (!isolate_huge_page(page, qp->pagelist) && + if (isolate_hugetlb(page, qp->pagelist) && (flags & MPOL_MF_STRICT)) /* * Failed to isolate page but allow migrating pages @@ -1388,7 +1388,7 @@ static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask, unsigned long bits = min_t(unsigned long, maxnode, BITS_PER_LONG); unsigned long t; - if (get_bitmap(&t, &nmask[maxnode / BITS_PER_LONG], bits)) + if (get_bitmap(&t, &nmask[(maxnode - 1) / BITS_PER_LONG], bits)) return -EFAULT; if (maxnode - bits >= MAX_NUMNODES) { diff --git a/mm/mempool.c b/mm/mempool.c index b933d0fc21b8b..96488b13a1ef1 100644 --- a/mm/mempool.c +++ b/mm/mempool.c @@ -379,7 +379,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask) gfp_t gfp_temp; VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO); - might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM); + might_alloc(gfp_mask); gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */ gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */ diff --git a/mm/memremap.c b/mm/memremap.c index 745eea0f99c39..5dceb6e9c973d 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -141,10 +141,10 @@ void memunmap_pages(struct dev_pagemap *pgmap) for (i = 0; i < pgmap->nr_range; i++) percpu_ref_put_many(&pgmap->ref, pfn_len(pgmap, i)); wait_for_completion(&pgmap->done); - percpu_ref_exit(&pgmap->ref); for (i = 0; i < pgmap->nr_range; i++) pageunmap_range(pgmap, i); + percpu_ref_exit(&pgmap->ref); WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); devmap_managed_enable_put(pgmap); @@ -279,8 +279,8 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, /* - * Not device managed version of dev_memremap_pages, undone by - * memunmap_pages(). Please use dev_memremap_pages if you have a struct + * Not device managed version of devm_memremap_pages, undone by + * memunmap_pages(). Please use devm_memremap_pages if you have a struct * device available. */ void *memremap_pages(struct dev_pagemap *pgmap, int nid) diff --git a/mm/migrate.c b/mm/migrate.c index 1b4b977809a1c..3d5f0262ab600 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -132,7 +132,7 @@ static void putback_movable_page(struct page *page) * * This function shall be used whenever the isolated pageset has been * built from lru, balloon, hugetlbfs page. See isolate_migratepages_range() - * and isolate_huge_page(). + * and isolate_hugetlb(). 
*/ void putback_movable_pages(struct list_head *l) { @@ -314,13 +314,28 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd, __migration_entry_wait(mm, ptep, ptl); } -void migration_entry_wait_huge(struct vm_area_struct *vma, - struct mm_struct *mm, pte_t *pte) +#ifdef CONFIG_HUGETLB_PAGE +void __migration_entry_wait_huge(pte_t *ptep, spinlock_t *ptl) { - spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), mm, pte); - __migration_entry_wait(mm, pte, ptl); + pte_t pte; + + spin_lock(ptl); + pte = huge_ptep_get(ptep); + + if (unlikely(!is_hugetlb_entry_migration(pte))) + spin_unlock(ptl); + else + migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl); } +void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) +{ + spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte); + + __migration_entry_wait_huge(pte, ptl); +} +#endif + #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd) { @@ -1132,15 +1147,10 @@ static int unmap_and_move(new_page_t get_new_page, return -ENOSYS; if (page_count(page) == 1) { - /* page was freed from under us. So we are done. */ + /* Page was freed from under us. So we are done. */ ClearPageActive(page); ClearPageUnevictable(page); - if (unlikely(__PageMovable(page))) { - lock_page(page); - if (!PageMovable(page)) - ClearPageIsolated(page); - unlock_page(page); - } + /* free_pages_prepare() will clear PG_isolated. */ goto out; } @@ -1675,8 +1685,9 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr, if (PageHuge(page)) { if (PageHead(page)) { - isolate_huge_page(page, pagelist); - err = 1; + err = isolate_hugetlb(page, pagelist); + if (!err) + err = 1; } } else { struct page *head; diff --git a/mm/mmap.c b/mm/mmap.c index 61e6135c54ef6..c14d7286a3798 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2944,7 +2944,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size, unsigned long ret = -EINVAL; struct file *file; - pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/vm/remap_file_pages.rst.\n", + pr_warn_once("%s (%d) uses deprecated remap_file_pages() syscall. See Documentation/mm/remap_file_pages.rst.\n", current->comm, current->pid); if (prot) diff --git a/mm/mprotect.c b/mm/mprotect.c index ba5592655ee3b..996a97e213adc 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -38,6 +38,39 @@ #include "internal.h" +static inline bool can_change_pte_writable(struct vm_area_struct *vma, + unsigned long addr, pte_t pte) +{ + struct page *page; + + VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte)); + + if (pte_protnone(pte) || !pte_dirty(pte)) + return false; + + /* Do we need write faults for softdirty tracking? */ + if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte)) + return false; + + /* Do we need write faults for uffd-wp tracking? */ + if (userfaultfd_pte_wp(vma, pte)) + return false; + + if (!(vma->vm_flags & VM_SHARED)) { + /* + * We can only special-case on exclusive anonymous pages, + * because we know that our write-fault handler similarly would + * map them writable without any additional checks while holding + * the PT lock. 
+ */ + page = vm_normal_page(vma, addr, pte); + if (!page || !PageAnon(page) || !PageAnonExclusive(page)) + return false; + } + + return true; +} + static unsigned long change_pte_range(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) @@ -46,7 +79,6 @@ static unsigned long change_pte_range(struct mmu_gather *tlb, spinlock_t *ptl; unsigned long pages = 0; int target_node = NUMA_NO_NODE; - bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT; bool prot_numa = cp_flags & MM_CP_PROT_NUMA; bool uffd_wp = cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE; @@ -137,21 +169,27 @@ static unsigned long change_pte_range(struct mmu_gather *tlb, ptent = pte_wrprotect(ptent); ptent = pte_mkuffd_wp(ptent); } else if (uffd_wp_resolve) { - /* - * Leave the write bit to be handled - * by PF interrupt handler, then - * things like COW could be properly - * handled. - */ ptent = pte_clear_uffd_wp(ptent); } - /* Avoid taking write faults for known dirty pages */ - if (dirty_accountable && pte_dirty(ptent) && - (pte_soft_dirty(ptent) || - !(vma->vm_flags & VM_SOFTDIRTY))) { + /* + * In some writable, shared mappings, we might want + * to catch actual write access -- see + * vma_wants_writenotify(). + * + * In all writable, private mappings, we have to + * properly handle COW. + * + * In both cases, we can sometimes still change PTEs + * writable and avoid the write-fault handler, for + * example, if a PTE is already dirty and no other + * COW or special handling is required. + */ + if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && + !pte_write(ptent) && + can_change_pte_writable(vma, addr, ptent)) ptent = pte_mkwrite(ptent); - } + ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent); if (pte_needs_flush(oldpte, ptent)) tlb_flush_pte_range(tlb, addr, PAGE_SIZE); @@ -505,9 +543,9 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma, unsigned long oldflags = vma->vm_flags; long nrpages = (end - start) >> PAGE_SHIFT; unsigned long charged = 0; + bool try_change_writable; pgoff_t pgoff; int error; - int dirty_accountable = 0; if (newflags == oldflags) { *pprev = vma; @@ -583,11 +621,20 @@ mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma, * held in write mode. */ vma->vm_flags = newflags; - dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot); + /* + * We want to check manually if we can change individual PTEs writable + * if we can't do that automatically for all PTEs in a mapping. For + * private mappings, that's always the case when we have write + * permissions as we properly have to handle COW. + */ + if (vma->vm_flags & VM_SHARED) + try_change_writable = vma_wants_writenotify(vma, vma->vm_page_prot); + else + try_change_writable = !!(vma->vm_flags & VM_WRITE); vma_set_page_prot(vma); change_protection(tlb, vma, start, end, vma->vm_page_prot, - dirty_accountable ? MM_CP_DIRTY_ACCT : 0); + try_change_writable ? 
MM_CP_TRY_CHANGE_WRITABLE : 0); /* * Private VM_LOCKED VMA becoming writable: trigger COW to avoid major diff --git a/mm/page_alloc.c b/mm/page_alloc.c index f5eecedcc2e4c..d05a8471972d9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -744,6 +744,14 @@ void prep_compound_page(struct page *page, unsigned int order) prep_compound_head(page, order); } +void destroy_large_folio(struct folio *folio) +{ + enum compound_dtor_id dtor = folio_page(folio, 1)->compound_dtor; + + VM_BUG_ON_FOLIO(dtor >= NR_COMPOUND_DTORS, folio); + compound_page_dtors[dtor](&folio->page); +} + #ifdef CONFIG_DEBUG_PAGEALLOC unsigned int _debug_guardpage_minorder; @@ -1296,18 +1304,14 @@ static inline bool should_skip_kasan_poison(struct page *page, fpi_t fpi_flags) PageSkipKASanPoison(page); } -static void kernel_init_free_pages(struct page *page, int numpages) +static void kernel_init_pages(struct page *page, int numpages) { int i; /* s390's use of memset() could override KASAN redzones. */ kasan_disable_current(); - for (i = 0; i < numpages; i++) { - u8 tag = page_kasan_tag(page + i); - page_kasan_tag_reset(page + i); - clear_highpage(page + i); - page_kasan_tag_set(page + i, tag); - } + for (i = 0; i < numpages; i++) + clear_highpage_kasan_tagged(page + i); kasan_enable_current(); } @@ -1396,7 +1400,7 @@ static __always_inline bool free_pages_prepare(struct page *page, init = false; } if (init) - kernel_init_free_pages(page, 1 << order); + kernel_init_pages(page, 1 << order); /* * arch_free_page() can make the page's contents inaccessible. s390 @@ -2442,7 +2446,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order, } /* If memory is still not initialized, do it now. */ if (init) - kernel_init_free_pages(page, 1 << order); + kernel_init_pages(page, 1 << order); /* Propagate __GFP_SKIP_KASAN_POISON to page flags. */ if (kasan_hw_tags_enabled() && (gfp_flags & __GFP_SKIP_KASAN_POISON)) SetPageSkipKASanPoison(page); @@ -5198,10 +5202,7 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order, *alloc_flags |= ALLOC_CPUSET; } - fs_reclaim_acquire(gfp_mask); - fs_reclaim_release(gfp_mask); - - might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM); + might_alloc(gfp_mask); if (should_fail_alloc_page(gfp_mask, order)) return false; @@ -5800,14 +5801,14 @@ long si_mem_available(void) /* * Estimate the amount of memory available for userspace allocations, - * without causing swapping. + * without causing swapping or OOM. */ available = global_zone_page_state(NR_FREE_PAGES) - totalreserve_pages; /* * Not all the page cache can be freed, otherwise the system will - * start swapping. Assume at least half of the page cache, or the - * low watermark worth of cache, needs to stay. + * start swapping or thrashing. Assume at least half of the page + * cache, or the low watermark worth of cache, needs to stay. */ pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE]; pagecache -= min(pagecache / 2, wmark_low); diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index c10f839fc410e..e971a467fcdfa 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -243,7 +243,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw) * cleared *pmd but not decremented compound_mapcount(). 
*/ if ((pvmw->flags & PVMW_SYNC) && - transparent_hugepage_active(vma) && + transhuge_vma_suitable(vma, pvmw->address) && (pvmw->nr_pages >= HPAGE_PMD_NR)) { spinlock_t *ptl = pmd_lock(mm, pvmw->pmd); diff --git a/mm/rmap.c b/mm/rmap.c index 746c05acad270..c3557d9c419a5 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -999,7 +999,7 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw) * downgrading page table protection not changing it to point * to a new page. * - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ if (ret) cleaned++; @@ -1537,6 +1537,8 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, PageAnonExclusive(subpage); if (folio_test_hugetlb(folio)) { + bool anon = folio_test_anon(folio); + /* * The try_to_unmap() is only passed a hugetlb page * in the case where the hugetlb page is poisoned. @@ -1551,31 +1553,28 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, */ flush_cache_range(vma, range.start, range.end); - if (!folio_test_anon(folio)) { + /* + * To call huge_pmd_unshare, i_mmap_rwsem must be + * held in write mode. Caller needs to explicitly + * do this outside rmap routines. + */ + VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED)); + if (!anon && huge_pmd_unshare(mm, vma, &address, pvmw.pte)) { + flush_tlb_range(vma, range.start, range.end); + mmu_notifier_invalidate_range(mm, range.start, + range.end); + /* - * To call huge_pmd_unshare, i_mmap_rwsem must be - * held in write mode. Caller needs to explicitly - * do this outside rmap routines. + * The ref count of the PMD page was dropped + * which is part of the way map counting + * is done for shared PMDs. Return 'true' + * here. When there is no other sharing, + * huge_pmd_unshare returns false and we will + * unmap the actual page and drop map count + * to zero. */ - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); - - if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) { - flush_tlb_range(vma, range.start, range.end); - mmu_notifier_invalidate_range(mm, range.start, - range.end); - - /* - * The ref count of the PMD page was dropped - * which is part of the way map counting - * is done for shared PMDs. Return 'true' - * here. When there is no other sharing, - * huge_pmd_unshare returns false and we will - * unmap the actual page and drop map count - * to zero. - */ - page_vma_mapped_walk_done(&pvmw); - break; - } + page_vma_mapped_walk_done(&pvmw); + break; } pteval = huge_ptep_clear_flush(vma, address, pvmw.pte); } else { @@ -1619,9 +1618,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); if (folio_test_hugetlb(folio)) { hugetlb_count_sub(folio_nr_pages(folio), mm); - set_huge_swap_pte_at(mm, address, - pvmw.pte, pteval, - vma_mmu_pagesize(vma)); + set_huge_pte_at(mm, address, pvmw.pte, pteval); } else { dec_mm_counter(mm, mm_counter(&folio->page)); set_pte_at(mm, address, pvmw.pte, pteval); @@ -1765,7 +1762,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * to point at a new folio while a device is * still using this folio. 
* - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ dec_mm_counter(mm, mm_counter_file(&folio->page)); } @@ -1775,7 +1772,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, * done above for all cases requiring it to happen under page * table lock before mmu_notifier_invalidate_range_end() * - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ page_remove_rmap(subpage, vma, folio_test_hugetlb(folio)); if (vma->vm_flags & VM_LOCKED) @@ -1921,6 +1918,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, PageAnonExclusive(subpage); if (folio_test_hugetlb(folio)) { + bool anon = folio_test_anon(folio); + /* * huge_pmd_unshare may unmap an entire PMD page. * There is no way of knowing exactly which PMDs may @@ -1930,31 +1929,28 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, */ flush_cache_range(vma, range.start, range.end); - if (!folio_test_anon(folio)) { + /* + * To call huge_pmd_unshare, i_mmap_rwsem must be + * held in write mode. Caller needs to explicitly + * do this outside rmap routines. + */ + VM_BUG_ON(!anon && !(flags & TTU_RMAP_LOCKED)); + if (!anon && huge_pmd_unshare(mm, vma, &address, pvmw.pte)) { + flush_tlb_range(vma, range.start, range.end); + mmu_notifier_invalidate_range(mm, range.start, + range.end); + /* - * To call huge_pmd_unshare, i_mmap_rwsem must be - * held in write mode. Caller needs to explicitly - * do this outside rmap routines. + * The ref count of the PMD page was dropped + * which is part of the way map counting + * is done for shared PMDs. Return 'true' + * here. When there is no other sharing, + * huge_pmd_unshare returns false and we will + * unmap the actual page and drop map count + * to zero. */ - VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); - - if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) { - flush_tlb_range(vma, range.start, range.end); - mmu_notifier_invalidate_range(mm, range.start, - range.end); - - /* - * The ref count of the PMD page was dropped - * which is part of the way map counting - * is done for shared PMDs. Return 'true' - * here. When there is no other sharing, - * huge_pmd_unshare returns false and we will - * unmap the actual page and drop map count - * to zero. 
- */ - page_vma_mapped_walk_done(&pvmw); - break; - } + page_vma_mapped_walk_done(&pvmw); + break; } /* Nuke the hugetlb page table entry */ @@ -2013,9 +2009,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, pteval = swp_entry_to_pte(make_hwpoison_entry(subpage)); if (folio_test_hugetlb(folio)) { hugetlb_count_sub(folio_nr_pages(folio), mm); - set_huge_swap_pte_at(mm, address, - pvmw.pte, pteval, - vma_mmu_pagesize(vma)); + set_huge_pte_at(mm, address, pvmw.pte, pteval); } else { dec_mm_counter(mm, mm_counter(&folio->page)); set_pte_at(mm, address, pvmw.pte, pteval); @@ -2083,8 +2077,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, if (pte_uffd_wp(pteval)) swp_pte = pte_swp_mkuffd_wp(swp_pte); if (folio_test_hugetlb(folio)) - set_huge_swap_pte_at(mm, address, pvmw.pte, - swp_pte, vma_mmu_pagesize(vma)); + set_huge_pte_at(mm, address, pvmw.pte, swp_pte); else set_pte_at(mm, address, pvmw.pte, swp_pte); trace_set_migration_pte(address, pte_val(swp_pte), @@ -2100,7 +2093,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma, * done above for all cases requiring it to happen under page * table lock before mmu_notifier_invalidate_range_end() * - * See Documentation/vm/mmu_notifier.rst + * See Documentation/mm/mmu_notifier.rst */ page_remove_rmap(subpage, vma, folio_test_hugetlb(folio)); if (vma->vm_flags & VM_LOCKED) diff --git a/mm/shmem.c b/mm/shmem.c index 7657f8e029b08..fcab2be46f721 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1690,7 +1690,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, return; folio_wait_writeback(folio); - delete_from_swap_cache(&folio->page); + delete_from_swap_cache(folio); spin_lock_irq(&info->lock); /* * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks won't @@ -1705,10 +1705,10 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index, } /* - * Swap in the page pointed to by *pagep. - * Caller has to make sure that *pagep contains a valid swapped page. - * Returns 0 and the page in pagep if success. On failure, returns the - * error code and NULL in *pagep. + * Swap in the folio pointed to by *foliop. + * Caller has to make sure that *foliop contains a valid swapped folio. + * Returns 0 and the folio in foliop if success. On failure, returns the + * error code and NULL in *foliop. 
*/ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio **foliop, enum sgp_type sgp, @@ -1748,7 +1748,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, } folio = page_folio(page); - /* We have to do this with page locked to prevent races */ + /* We have to do this with folio locked to prevent races */ folio_lock(folio); if (!folio_test_swapcache(folio) || folio_swap_entry(folio).val != swap.val || @@ -1788,7 +1788,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (sgp == SGP_WRITE) folio_mark_accessed(folio); - delete_from_swap_cache(&folio->page); + delete_from_swap_cache(folio); folio_mark_dirty(folio); swap_free(swap); diff --git a/mm/shrinker_debug.c b/mm/shrinker_debug.c new file mode 100644 index 0000000000000..e5b40c43221d0 --- /dev/null +++ b/mm/shrinker_debug.c @@ -0,0 +1,285 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/idr.h> +#include <linux/slab.h> +#include <linux/debugfs.h> +#include <linux/seq_file.h> +#include <linux/shrinker.h> +#include <linux/memcontrol.h> + +/* defined in vmscan.c */ +extern struct rw_semaphore shrinker_rwsem; +extern struct list_head shrinker_list; + +static DEFINE_IDA(shrinker_debugfs_ida); +static struct dentry *shrinker_debugfs_root; + +static unsigned long shrinker_count_objects(struct shrinker *shrinker, + struct mem_cgroup *memcg, + unsigned long *count_per_node) +{ + unsigned long nr, total = 0; + int nid; + + for_each_node(nid) { + if (nid == 0 || (shrinker->flags & SHRINKER_NUMA_AWARE)) { + struct shrink_control sc = { + .gfp_mask = GFP_KERNEL, + .nid = nid, + .memcg = memcg, + }; + + nr = shrinker->count_objects(shrinker, &sc); + if (nr == SHRINK_EMPTY) + nr = 0; + } else { + nr = 0; + } + + count_per_node[nid] = nr; + total += nr; + } + + return total; +} + +static int shrinker_debugfs_count_show(struct seq_file *m, void *v) +{ + struct shrinker *shrinker = m->private; + unsigned long *count_per_node; + struct mem_cgroup *memcg; + unsigned long total; + bool memcg_aware; + int ret, nid; + + count_per_node = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL); + if (!count_per_node) + return -ENOMEM; + + ret = down_read_killable(&shrinker_rwsem); + if (ret) { + kfree(count_per_node); + return ret; + } + rcu_read_lock(); + + memcg_aware = shrinker->flags & SHRINKER_MEMCG_AWARE; + + memcg = mem_cgroup_iter(NULL, NULL, NULL); + do { + if (memcg && !mem_cgroup_online(memcg)) + continue; + + total = shrinker_count_objects(shrinker, + memcg_aware ?
memcg : NULL, + count_per_node); + if (total) { + seq_printf(m, "%lu", mem_cgroup_ino(memcg)); + for_each_node(nid) + seq_printf(m, " %lu", count_per_node[nid]); + seq_putc(m, '\n'); + } + + if (!memcg_aware) { + mem_cgroup_iter_break(NULL, memcg); + break; + } + + if (signal_pending(current)) { + mem_cgroup_iter_break(NULL, memcg); + ret = -EINTR; + break; + } + } while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL); + + rcu_read_unlock(); + up_read(&shrinker_rwsem); + + kfree(count_per_node); + return ret; +} +DEFINE_SHOW_ATTRIBUTE(shrinker_debugfs_count); + +static int shrinker_debugfs_scan_open(struct inode *inode, struct file *file) +{ + file->private_data = inode->i_private; + return nonseekable_open(inode, file); +} + +static ssize_t shrinker_debugfs_scan_write(struct file *file, + const char __user *buf, + size_t size, loff_t *pos) +{ + struct shrinker *shrinker = file->private_data; + unsigned long nr_to_scan = 0, ino, read_len; + struct shrink_control sc = { + .gfp_mask = GFP_KERNEL, + }; + struct mem_cgroup *memcg = NULL; + int nid; + char kbuf[72]; + ssize_t ret; + + read_len = size < (sizeof(kbuf) - 1) ? size : (sizeof(kbuf) - 1); + if (copy_from_user(kbuf, buf, read_len)) + return -EFAULT; + kbuf[read_len] = '\0'; + + if (sscanf(kbuf, "%lu %d %lu", &ino, &nid, &nr_to_scan) != 3) + return -EINVAL; + + if (nid < 0 || nid >= nr_node_ids) + return -EINVAL; + + if (nr_to_scan == 0) + return size; + + if (shrinker->flags & SHRINKER_MEMCG_AWARE) { + memcg = mem_cgroup_get_from_ino(ino); + if (!memcg || IS_ERR(memcg)) + return -ENOENT; + + if (!mem_cgroup_online(memcg)) { + mem_cgroup_put(memcg); + return -ENOENT; + } + } else if (ino != 0) { + return -EINVAL; + } + + ret = down_read_killable(&shrinker_rwsem); + if (ret) { + mem_cgroup_put(memcg); + return ret; + } + + sc.nid = nid; + sc.memcg = memcg; + sc.nr_to_scan = nr_to_scan; + sc.nr_scanned = nr_to_scan; + + shrinker->scan_objects(shrinker, &sc); + + up_read(&shrinker_rwsem); + mem_cgroup_put(memcg); + + return size; +} + +static const struct file_operations shrinker_debugfs_scan_fops = { + .owner = THIS_MODULE, + .open = shrinker_debugfs_scan_open, + .write = shrinker_debugfs_scan_write, +}; + +int shrinker_debugfs_add(struct shrinker *shrinker) +{ + struct dentry *entry; + char buf[128]; + int id; + + lockdep_assert_held(&shrinker_rwsem); + + /* debugfs isn't initialized yet, add debugfs entries later. */ + if (!shrinker_debugfs_root) + return 0; + + id = ida_alloc(&shrinker_debugfs_ida, GFP_KERNEL); + if (id < 0) + return id; + shrinker->debugfs_id = id; + + snprintf(buf, sizeof(buf), "%s-%d", shrinker->name, id); + + /* create debugfs entry */ + entry = debugfs_create_dir(buf, shrinker_debugfs_root); + if (IS_ERR(entry)) { + ida_free(&shrinker_debugfs_ida, id); + return PTR_ERR(entry); + } + shrinker->debugfs_entry = entry; + + debugfs_create_file("count", 0220, entry, shrinker, + &shrinker_debugfs_count_fops); + debugfs_create_file("scan", 0440, entry, shrinker, + &shrinker_debugfs_scan_fops); + return 0; +} + +int shrinker_debugfs_rename(struct shrinker *shrinker, const char *fmt, ...) 
+{ + struct dentry *entry; + char buf[128]; + const char *new, *old; + va_list ap; + int ret = 0; + + va_start(ap, fmt); + new = kvasprintf_const(GFP_KERNEL, fmt, ap); + va_end(ap); + + if (!new) + return -ENOMEM; + + down_write(&shrinker_rwsem); + + old = shrinker->name; + shrinker->name = new; + + if (shrinker->debugfs_entry) { + snprintf(buf, sizeof(buf), "%s-%d", shrinker->name, + shrinker->debugfs_id); + + entry = debugfs_rename(shrinker_debugfs_root, + shrinker->debugfs_entry, + shrinker_debugfs_root, buf); + if (IS_ERR(entry)) + ret = PTR_ERR(entry); + else + shrinker->debugfs_entry = entry; + } + + up_write(&shrinker_rwsem); + + kfree_const(old); + + return ret; +} +EXPORT_SYMBOL(shrinker_debugfs_rename); + +void shrinker_debugfs_remove(struct shrinker *shrinker) +{ + lockdep_assert_held(&shrinker_rwsem); + + kfree_const(shrinker->name); + + if (!shrinker->debugfs_entry) + return; + + debugfs_remove_recursive(shrinker->debugfs_entry); + ida_free(&shrinker_debugfs_ida, shrinker->debugfs_id); +} + +static int __init shrinker_debugfs_init(void) +{ + struct shrinker *shrinker; + struct dentry *dentry; + int ret = 0; + + dentry = debugfs_create_dir("shrinker", NULL); + if (IS_ERR(dentry)) + return PTR_ERR(dentry); + shrinker_debugfs_root = dentry; + + /* Create debugfs entries for shrinkers registered at boot */ + down_write(&shrinker_rwsem); + list_for_each_entry(shrinker, &shrinker_list, list) + if (!shrinker->debugfs_entry) { + ret = shrinker_debugfs_add(shrinker); + if (ret) + break; + } + up_write(&shrinker_rwsem); + + return ret; +} +late_initcall(shrinker_debugfs_init); diff --git a/mm/slab.c b/mm/slab.c index 764cbadba69c2..6474c515a664c 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -2958,12 +2958,6 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags) return ac->entry[--ac->avail]; } -static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep, - gfp_t flags) -{ - might_sleep_if(gfpflags_allow_blocking(flags)); -} - #if DEBUG static void *cache_alloc_debugcheck_after(struct kmem_cache *cachep, gfp_t flags, void *objp, unsigned long caller) @@ -3205,7 +3199,6 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid, size_t orig_ if (unlikely(ptr)) goto out_hooks; - cache_alloc_debugcheck_before(cachep, flags); local_irq_save(save_flags); if (nodeid == NUMA_NO_NODE) @@ -3290,7 +3283,6 @@ slab_alloc(struct kmem_cache *cachep, struct list_lru *lru, gfp_t flags, if (unlikely(objp)) goto out; - cache_alloc_debugcheck_before(cachep, flags); local_irq_save(save_flags); objp = __do_cache_alloc(cachep, flags); local_irq_restore(save_flags); @@ -3527,8 +3519,6 @@ int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, if (!s) return 0; - cache_alloc_debugcheck_before(s, flags); - local_irq_disable(); for (i = 0; i < size; i++) { void *objp = kfence_alloc(s, s->object_size, flags) ?: __do_cache_alloc(s, flags); diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index 3f7e4bd34a5b2..5f0ed4717ed06 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -208,8 +208,8 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end, unsigned long next; pgd_t *pgd; - VM_BUG_ON(!IS_ALIGNED(start, PAGE_SIZE)); - VM_BUG_ON(!IS_ALIGNED(end, PAGE_SIZE)); + VM_BUG_ON(!PAGE_ALIGNED(start)); + VM_BUG_ON(!PAGE_ALIGNED(end)); pgd = pgd_offset_k(addr); do { @@ -556,7 +556,7 @@ pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node, } else { /* * When a PTE/PMD entry is freed from the init_mm - * 
there's a a free_pages() call to this page allocated + * there's a free_pages() call to this page allocated * above. Thus this get_page() is paired with the * put_page_testzero() on the freeing path. * This can only called by certain ZONE_DEVICE path, @@ -745,7 +745,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn, size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page)); for (addr = start; addr < end; addr += size) { - unsigned long next = addr, last = addr + size; + unsigned long next, last = addr + size; /* Populate the head page vmemmap page */ pte = vmemmap_populate_address(addr, node, NULL, NULL); @@ -760,7 +760,7 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn, /* * Reuse the previous page for the rest of tail pages - * See layout diagram in Documentation/vm/vmemmap_dedup.rst + * See layout diagram in Documentation/mm/vmemmap_dedup.rst */ next += PAGE_SIZE; rc = vmemmap_populate_range(next, last, node, NULL, diff --git a/mm/sparse.c b/mm/sparse.c index cb3bfae640365..e5a8a3a0edd74 100644 --- a/mm/sparse.c +++ b/mm/sparse.c @@ -281,7 +281,7 @@ static unsigned long sparse_encode_mem_map(struct page *mem_map, unsigned long p { unsigned long coded_mem_map = (unsigned long)(mem_map - (section_nr_to_pfn(pnum))); - BUILD_BUG_ON(SECTION_MAP_LAST_BIT > (1UL<<PFN_SECTION_SHIFT)); + BUILD_BUG_ON(SECTION_MAP_LAST_BIT > PFN_SECTION_SHIFT); BUG_ON(coded_mem_map & ~SECTION_MAP_MASK); return coded_mem_map; } diff --git a/mm/swap.c b/mm/swap.c index 275a4ea1bc660..9cee7f6a38094 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -46,30 +46,30 @@ /* How many pages do we try to swap or page in/out together? */ int page_cluster; -/* Protecting only lru_rotate.pvec which requires disabling interrupts */ +/* Protecting only lru_rotate.fbatch which requires disabling interrupts */ struct lru_rotate { local_lock_t lock; - struct pagevec pvec; + struct folio_batch fbatch; }; static DEFINE_PER_CPU(struct lru_rotate, lru_rotate) = { .lock = INIT_LOCAL_LOCK(lock), }; /* - * The following struct pagevec are grouped together because they are protected + * The following folio batches are grouped together because they are protected * by disabling preemption (and interrupts remain enabled). */ -struct lru_pvecs { +struct cpu_fbatches { local_lock_t lock; - struct pagevec lru_add; - struct pagevec lru_deactivate_file; - struct pagevec lru_deactivate; - struct pagevec lru_lazyfree; + struct folio_batch lru_add; + struct folio_batch lru_deactivate_file; + struct folio_batch lru_deactivate; + struct folio_batch lru_lazyfree; #ifdef CONFIG_SMP - struct pagevec activate_page; + struct folio_batch activate; #endif }; -static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = { +static DEFINE_PER_CPU(struct cpu_fbatches, cpu_fbatches) = { .lock = INIT_LOCAL_LOCK(lock), }; @@ -77,36 +77,35 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = { * This path almost never happens for VM activity - pages are normally freed * via pagevecs. But it gets used by networking - and for compound pages.
*/ -static void __page_cache_release(struct page *page) +static void __page_cache_release(struct folio *folio) { - if (PageLRU(page)) { - struct folio *folio = page_folio(page); + if (folio_test_lru(folio)) { struct lruvec *lruvec; unsigned long flags; lruvec = folio_lruvec_lock_irqsave(folio, &flags); - del_page_from_lru_list(page, lruvec); - __clear_page_lru_flags(page); + lruvec_del_folio(lruvec, folio); + __folio_clear_lru_flags(folio); unlock_page_lruvec_irqrestore(lruvec, flags); } - /* See comment on PageMlocked in release_pages() */ - if (unlikely(PageMlocked(page))) { - int nr_pages = thp_nr_pages(page); + /* See comment on folio_test_mlocked in release_pages() */ + if (unlikely(folio_test_mlocked(folio))) { + long nr_pages = folio_nr_pages(folio); - __ClearPageMlocked(page); - mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages); + __folio_clear_mlocked(folio); + zone_stat_mod_folio(folio, NR_MLOCK, -nr_pages); count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages); } } -static void __put_single_page(struct page *page) +static void __folio_put_small(struct folio *folio) { - __page_cache_release(page); - mem_cgroup_uncharge(page_folio(page)); - free_unref_page(page, 0); + __page_cache_release(folio); + mem_cgroup_uncharge(folio); + free_unref_page(&folio->page, 0); } -static void __put_compound_page(struct page *page) +static void __folio_put_large(struct folio *folio) { /* * __page_cache_release() is supposed to be called for thp, not for @@ -114,21 +113,21 @@ static void __put_compound_page(struct page *page) * (it's never listed to any LRU lists) and no memcg routines should * be called for hugetlb (it has a separate hugetlb_cgroup.) */ - if (!PageHuge(page)) - __page_cache_release(page); - destroy_compound_page(page); + if (!folio_test_hugetlb(folio)) + __page_cache_release(folio); + destroy_large_folio(folio); } -void __put_page(struct page *page) +void __folio_put(struct folio *folio) { - if (unlikely(is_zone_device_page(page))) - free_zone_device_page(page); - else if (unlikely(PageCompound(page))) - __put_compound_page(page); + if (unlikely(folio_is_zone_device(folio))) + free_zone_device_page(&folio->page); + else if (unlikely(folio_test_large(folio))) + __folio_put_large(folio); else - __put_single_page(page); + __folio_put_small(folio); } -EXPORT_SYMBOL(__put_page); +EXPORT_SYMBOL(__folio_put); /** * put_pages_list() - release a list of pages @@ -138,19 +137,19 @@ EXPORT_SYMBOL(__put_page); */ void put_pages_list(struct list_head *pages) { - struct page *page, *next; + struct folio *folio, *next; - list_for_each_entry_safe(page, next, pages, lru) { - if (!put_page_testzero(page)) { - list_del(&page->lru); + list_for_each_entry_safe(folio, next, pages, lru) { + if (!folio_put_testzero(folio)) { + list_del(&folio->lru); continue; } - if (PageHead(page)) { - list_del(&page->lru); - __put_compound_page(page); + if (folio_test_large(folio)) { + list_del(&folio->lru); + __folio_put_large(folio); continue; } - /* Cannot be PageLRU because it's passed to us using the lru */ + /* LRU flag must be clear because it's passed using the lru */ } free_unref_page_list(pages); @@ -188,36 +187,84 @@ int get_kernel_pages(const struct kvec *kiov, int nr_segs, int write, } EXPORT_SYMBOL_GPL(get_kernel_pages); -static void pagevec_lru_move_fn(struct pagevec *pvec, - void (*move_fn)(struct page *page, struct lruvec *lruvec)) +typedef void (*move_fn_t)(struct lruvec *lruvec, struct folio *folio); + +static void lru_add_fn(struct lruvec *lruvec, struct folio *folio) +{ + int was_unevictable = 
folio_test_clear_unevictable(folio); + long nr_pages = folio_nr_pages(folio); + + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + + /* + * Is an smp_mb__after_atomic() still required here, before + * folio_evictable() tests the mlocked flag, to rule out the possibility + * of stranding an evictable folio on an unevictable LRU? I think + * not, because __munlock_page() only clears the mlocked flag + * while the LRU lock is held. + * + * (That is not true of __page_cache_release(), and not necessarily + * true of release_pages(): but those only clear the mlocked flag after + * folio_put_testzero() has excluded any other users of the folio.) + */ + if (folio_evictable(folio)) { + if (was_unevictable) + __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); + } else { + folio_clear_active(folio); + folio_set_unevictable(folio); + /* + * folio->mlock_count = !!folio_test_mlocked(folio)? + * But that leaves __mlock_page() in doubt whether another + * actor has already counted the mlock or not. Err on the + * safe side, underestimate, let page reclaim fix it, rather + * than leaving a page on the unevictable LRU indefinitely. + */ + folio->mlock_count = 0; + if (!was_unevictable) + __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); + } + + lruvec_add_folio(lruvec, folio); + trace_mm_lru_insertion(folio); +} + +static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn) { int i; struct lruvec *lruvec = NULL; unsigned long flags = 0; - for (i = 0; i < pagevec_count(pvec); i++) { - struct page *page = pvec->pages[i]; - struct folio *folio = page_folio(page); + for (i = 0; i < folio_batch_count(fbatch); i++) { + struct folio *folio = fbatch->folios[i]; - /* block memcg migration during page moving between lru */ - if (!TestClearPageLRU(page)) + /* block memcg migration while the folio moves between lru */ + if (move_fn != lru_add_fn && !folio_test_clear_lru(folio)) continue; lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags); - (*move_fn)(page, lruvec); + move_fn(lruvec, folio); - SetPageLRU(page); + folio_set_lru(folio); } + if (lruvec) unlock_page_lruvec_irqrestore(lruvec, flags); - release_pages(pvec->pages, pvec->nr); - pagevec_reinit(pvec); + folios_put(fbatch->folios, folio_batch_count(fbatch)); + folio_batch_init(fbatch); } -static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) +static void folio_batch_add_and_move(struct folio_batch *fbatch, + struct folio *folio, move_fn_t move_fn) { - struct folio *folio = page_folio(page); + if (folio_batch_add(fbatch, folio) && !folio_test_large(folio) && + !lru_cache_disabled()) + return; + folio_batch_move_lru(fbatch, move_fn); +} +static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio) +{ if (!folio_test_unevictable(folio)) { lruvec_del_folio(lruvec, folio); folio_clear_active(folio); @@ -226,18 +273,6 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec) } } -/* return true if pagevec needs to drain */ -static bool pagevec_add_and_need_flush(struct pagevec *pvec, struct page *page) -{ - bool ret = false; - - if (!pagevec_add(pvec, page) || PageCompound(page) || - lru_cache_disabled()) - ret = true; - - return ret; -} - /* * Writeback is about to end against a folio which has been marked for * immediate reclaim. 
If it still appears to be reclaimable, move it @@ -249,14 +284,13 @@ void folio_rotate_reclaimable(struct folio *folio) { if (!folio_test_locked(folio) && !folio_test_dirty(folio) && !folio_test_unevictable(folio) && folio_test_lru(folio)) { - struct pagevec *pvec; + struct folio_batch *fbatch; unsigned long flags; folio_get(folio); local_lock_irqsave(&lru_rotate.lock, flags); - pvec = this_cpu_ptr(&lru_rotate.pvec); - if (pagevec_add_and_need_flush(pvec, &folio->page)) - pagevec_lru_move_fn(pvec, pagevec_move_tail_fn); + fbatch = this_cpu_ptr(&lru_rotate.fbatch); + folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn); local_unlock_irqrestore(&lru_rotate.lock, flags); } } @@ -307,7 +341,7 @@ void lru_note_cost_folio(struct folio *folio) folio_nr_pages(folio)); } -static void __folio_activate(struct folio *folio, struct lruvec *lruvec) +static void folio_activate_fn(struct lruvec *lruvec, struct folio *folio) { if (!folio_test_active(folio) && !folio_test_unevictable(folio)) { long nr_pages = folio_nr_pages(folio); @@ -324,41 +358,30 @@ static void __folio_activate(struct folio *folio, struct lruvec *lruvec) } #ifdef CONFIG_SMP -static void __activate_page(struct page *page, struct lruvec *lruvec) -{ - return __folio_activate(page_folio(page), lruvec); -} - -static void activate_page_drain(int cpu) +static void folio_activate_drain(int cpu) { - struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu); + struct folio_batch *fbatch = &per_cpu(cpu_fbatches.activate, cpu); - if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, __activate_page); -} - -static bool need_activate_page_drain(int cpu) -{ - return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0; + if (folio_batch_count(fbatch)) + folio_batch_move_lru(fbatch, folio_activate_fn); } static void folio_activate(struct folio *folio) { if (folio_test_lru(folio) && !folio_test_active(folio) && !folio_test_unevictable(folio)) { - struct pagevec *pvec; + struct folio_batch *fbatch; folio_get(folio); - local_lock(&lru_pvecs.lock); - pvec = this_cpu_ptr(&lru_pvecs.activate_page); - if (pagevec_add_and_need_flush(pvec, &folio->page)) - pagevec_lru_move_fn(pvec, __activate_page); - local_unlock(&lru_pvecs.lock); + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.activate); + folio_batch_add_and_move(fbatch, folio, folio_activate_fn); + local_unlock(&cpu_fbatches.lock); } } #else -static inline void activate_page_drain(int cpu) +static inline void folio_activate_drain(int cpu) { } @@ -368,7 +391,7 @@ static void folio_activate(struct folio *folio) if (folio_test_clear_lru(folio)) { lruvec = folio_lruvec_lock_irq(folio); - __folio_activate(folio, lruvec); + folio_activate_fn(lruvec, folio); unlock_page_lruvec_irq(lruvec); folio_set_lru(folio); } @@ -377,32 +400,32 @@ static void folio_activate(struct folio *folio) static void __lru_cache_activate_folio(struct folio *folio) { - struct pagevec *pvec; + struct folio_batch *fbatch; int i; - local_lock(&lru_pvecs.lock); - pvec = this_cpu_ptr(&lru_pvecs.lru_add); + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_add); /* - * Search backwards on the optimistic assumption that the page being - * activated has just been added to this pagevec. Note that only - * the local pagevec is examined as a !PageLRU page could be in the + * Search backwards on the optimistic assumption that the folio being + * activated has just been added to this batch. 
Note that only + * the local batch is examined as a !LRU folio could be in the * process of being released, reclaimed, migrated or on a remote - * pagevec that is currently being drained. Furthermore, marking - * a remote pagevec's page PageActive potentially hits a race where - * a page is marked PageActive just after it is added to the inactive + * batch that is currently being drained. Furthermore, marking + * a remote batch's folio active potentially hits a race where + * a folio is marked active just after it is added to the inactive * list causing accounting errors and BUG_ON checks to trigger. */ - for (i = pagevec_count(pvec) - 1; i >= 0; i--) { - struct page *pagevec_page = pvec->pages[i]; + for (i = folio_batch_count(fbatch) - 1; i >= 0; i--) { + struct folio *batch_folio = fbatch->folios[i]; - if (pagevec_page == &folio->page) { + if (batch_folio == folio) { folio_set_active(folio); break; } } - local_unlock(&lru_pvecs.lock); + local_unlock(&cpu_fbatches.lock); } /* @@ -427,9 +450,9 @@ void folio_mark_accessed(struct folio *folio) */ } else if (!folio_test_active(folio)) { /* - * If the page is on the LRU, queue it for activation via - * lru_pvecs.activate_page. Otherwise, assume the page is on a - * pagevec, mark it active and it'll be moved to the active + * If the folio is on the LRU, queue it for activation via + * cpu_fbatches.activate. Otherwise, assume the folio is in a + * folio_batch, mark it active and it'll be moved to the active * LRU on the next drain. */ if (folio_test_lru(folio)) @@ -450,22 +473,22 @@ EXPORT_SYMBOL(folio_mark_accessed); * * Queue the folio for addition to the LRU. The decision on whether * to add the page to the [in]active [file|anon] list is deferred until the - * pagevec is drained. This gives a chance for the caller of folio_add_lru() + * folio_batch is drained. This gives a chance for the caller of folio_add_lru() * have the folio added to the active list using folio_mark_accessed(). */ void folio_add_lru(struct folio *folio) { - struct pagevec *pvec; + struct folio_batch *fbatch; - VM_BUG_ON_FOLIO(folio_test_active(folio) && folio_test_unevictable(folio), folio); + VM_BUG_ON_FOLIO(folio_test_active(folio) && + folio_test_unevictable(folio), folio); VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); folio_get(folio); - local_lock(&lru_pvecs.lock); - pvec = this_cpu_ptr(&lru_pvecs.lru_add); - if (pagevec_add_and_need_flush(pvec, &folio->page)) - __pagevec_lru_add(pvec); - local_unlock(&lru_pvecs.lock); + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_add); + folio_batch_add_and_move(fbatch, folio, lru_add_fn); + local_unlock(&cpu_fbatches.lock); } EXPORT_SYMBOL(folio_add_lru); @@ -489,56 +512,57 @@ void lru_cache_add_inactive_or_unevictable(struct page *page, } /* - * If the page can not be invalidated, it is moved to the + * If the folio cannot be invalidated, it is moved to the * inactive list to speed up its reclaim. It is moved to the * head of the list, rather than the tail, to give the flusher * threads some time to write it out, as this is much more * effective than the single-page writeout from reclaim. * - * If the page isn't page_mapped and dirty/writeback, the page - * could reclaim asap using PG_reclaim. + * If the folio isn't mapped and dirty/writeback, the folio + * could be reclaimed asap using the reclaim flag. * - * 1. active, mapped page -> none - * 2. active, dirty/writeback page -> inactive, head, PG_reclaim - * 3. inactive, mapped page -> none - * 4. 
inactive, dirty/writeback page -> inactive, head, PG_reclaim + * 1. active, mapped folio -> none + * 2. active, dirty/writeback folio -> inactive, head, reclaim + * 3. inactive, mapped folio -> none + * 4. inactive, dirty/writeback folio -> inactive, head, reclaim * 5. inactive, clean -> inactive, tail * 6. Others -> none * - * In 4, why it moves inactive's head, the VM expects the page would - * be write it out by flusher threads as this is much more effective + * In 4, it moves to the head of the inactive list so the folio is + * written out by flusher threads as this is much more efficient * than the single-page writeout from reclaim. */ -static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) +static void lru_deactivate_file_fn(struct lruvec *lruvec, struct folio *folio) { - bool active = PageActive(page); - int nr_pages = thp_nr_pages(page); + bool active = folio_test_active(folio); + long nr_pages = folio_nr_pages(folio); - if (PageUnevictable(page)) + if (folio_test_unevictable(folio)) return; - /* Some processes are using the page */ - if (page_mapped(page)) + /* Some processes are using the folio */ + if (folio_mapped(folio)) return; - del_page_from_lru_list(page, lruvec); - ClearPageActive(page); - ClearPageReferenced(page); + lruvec_del_folio(lruvec, folio); + folio_clear_active(folio); + folio_clear_referenced(folio); - if (PageWriteback(page) || PageDirty(page)) { + if (folio_test_writeback(folio) || folio_test_dirty(folio)) { /* - * PG_reclaim could be raced with end_page_writeback - * It can make readahead confusing. But race window - * is _really_ small and it's non-critical problem. + * Setting the reclaim flag could race with + * folio_end_writeback() and confuse readahead. But the + * race window is _really_ small and it's not a critical + * problem. */ - add_page_to_lru_list(page, lruvec); - SetPageReclaim(page); + lruvec_add_folio(lruvec, folio); + folio_set_reclaim(folio); } else { /* - * The page's writeback ends up during pagevec - * We move that page into tail of inactive. + * The folio's writeback ended while it was in the batch. + * We move that folio to the tail of the inactive list. 
*/ - add_page_to_lru_list_tail(page, lruvec); + lruvec_add_folio_tail(lruvec, folio); __count_vm_events(PGROTATED, nr_pages); } @@ -549,15 +573,15 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec) } } -static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) +static void lru_deactivate_fn(struct lruvec *lruvec, struct folio *folio) { - if (PageActive(page) && !PageUnevictable(page)) { - int nr_pages = thp_nr_pages(page); + if (folio_test_active(folio) && !folio_test_unevictable(folio)) { + long nr_pages = folio_nr_pages(folio); - del_page_from_lru_list(page, lruvec); - ClearPageActive(page); - ClearPageReferenced(page); - add_page_to_lru_list(page, lruvec); + lruvec_del_folio(lruvec, folio); + folio_clear_active(folio); + folio_clear_referenced(folio); + lruvec_add_folio(lruvec, folio); __count_vm_events(PGDEACTIVATE, nr_pages); __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, @@ -565,22 +589,22 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec) } } -static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec) +static void lru_lazyfree_fn(struct lruvec *lruvec, struct folio *folio) { - if (PageAnon(page) && PageSwapBacked(page) && - !PageSwapCache(page) && !PageUnevictable(page)) { - int nr_pages = thp_nr_pages(page); + if (folio_test_anon(folio) && folio_test_swapbacked(folio) && + !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) { + long nr_pages = folio_nr_pages(folio); - del_page_from_lru_list(page, lruvec); - ClearPageActive(page); - ClearPageReferenced(page); + lruvec_del_folio(lruvec, folio); + folio_clear_active(folio); + folio_clear_referenced(folio); /* - * Lazyfree pages are clean anonymous pages. They have - * PG_swapbacked flag cleared, to distinguish them from normal - * anonymous pages + * Lazyfree folios are clean anonymous folios. They have + * the swapbacked flag cleared, to distinguish them from normal + * anonymous folios */ - ClearPageSwapBacked(page); - add_page_to_lru_list(page, lruvec); + folio_clear_swapbacked(folio); + lruvec_add_folio(lruvec, folio); __count_vm_events(PGLAZYFREE, nr_pages); __count_memcg_events(lruvec_memcg(lruvec), PGLAZYFREE, @@ -589,71 +613,67 @@ static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec) } /* - * Drain pages out of the cpu's pagevecs. + * Drain pages out of the cpu's folio_batch. * Either "cpu" is the current CPU, and preemption has already been * disabled; or "cpu" is being hot-unplugged, and is already dead. */ void lru_add_drain_cpu(int cpu) { - struct pagevec *pvec = &per_cpu(lru_pvecs.lru_add, cpu); + struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu); + struct folio_batch *fbatch = &fbatches->lru_add; - if (pagevec_count(pvec)) - __pagevec_lru_add(pvec); + if (folio_batch_count(fbatch)) + folio_batch_move_lru(fbatch, lru_add_fn); - pvec = &per_cpu(lru_rotate.pvec, cpu); + fbatch = &per_cpu(lru_rotate.fbatch, cpu); /* Disabling interrupts below acts as a compiler barrier. 
*/ - if (data_race(pagevec_count(pvec))) { + if (data_race(folio_batch_count(fbatch))) { unsigned long flags; /* No harm done if a racing interrupt already did this */ local_lock_irqsave(&lru_rotate.lock, flags); - pagevec_lru_move_fn(pvec, pagevec_move_tail_fn); + folio_batch_move_lru(fbatch, lru_move_tail_fn); local_unlock_irqrestore(&lru_rotate.lock, flags); } - pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu); - if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, lru_deactivate_file_fn); + fbatch = &fbatches->lru_deactivate_file; + if (folio_batch_count(fbatch)) + folio_batch_move_lru(fbatch, lru_deactivate_file_fn); - pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu); - if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, lru_deactivate_fn); + fbatch = &fbatches->lru_deactivate; + if (folio_batch_count(fbatch)) + folio_batch_move_lru(fbatch, lru_deactivate_fn); - pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu); - if (pagevec_count(pvec)) - pagevec_lru_move_fn(pvec, lru_lazyfree_fn); + fbatch = &fbatches->lru_lazyfree; + if (folio_batch_count(fbatch)) + folio_batch_move_lru(fbatch, lru_lazyfree_fn); - activate_page_drain(cpu); + folio_activate_drain(cpu); } /** - * deactivate_file_folio() - Forcefully deactivate a file folio. + * deactivate_file_folio() - Deactivate a file folio. * @folio: Folio to deactivate. * * This function hints to the VM that @folio is a good reclaim candidate, * for example if its invalidation fails due to the folio being dirty * or under writeback. * - * Context: Caller holds a reference on the page. + * Context: Caller holds a reference on the folio. */ void deactivate_file_folio(struct folio *folio) { - struct pagevec *pvec; + struct folio_batch *fbatch; - /* - * In a workload with many unevictable pages such as mprotect, - * unevictable folio deactivation for accelerating reclaim is pointless. 
- */ + /* Deactivating an unevictable folio will not accelerate reclaim */ if (folio_test_unevictable(folio)) return; folio_get(folio); - local_lock(&lru_pvecs.lock); - pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file); - - if (pagevec_add_and_need_flush(pvec, &folio->page)) - pagevec_lru_move_fn(pvec, lru_deactivate_file_fn); - local_unlock(&lru_pvecs.lock); + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file); + folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn); + local_unlock(&cpu_fbatches.lock); } /* @@ -666,15 +686,17 @@ void deactivate_file_folio(struct folio *folio) */ void deactivate_page(struct page *page) { - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) { - struct pagevec *pvec; - - local_lock(&lru_pvecs.lock); - pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate); - get_page(page); - if (pagevec_add_and_need_flush(pvec, page)) - pagevec_lru_move_fn(pvec, lru_deactivate_fn); - local_unlock(&lru_pvecs.lock); + struct folio *folio = page_folio(page); + + if (folio_test_lru(folio) && folio_test_active(folio) && + !folio_test_unevictable(folio)) { + struct folio_batch *fbatch; + + folio_get(folio); + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate); + folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn); + local_unlock(&cpu_fbatches.lock); } } @@ -687,24 +709,26 @@ void deactivate_page(struct page *page) */ void mark_page_lazyfree(struct page *page) { - if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) && - !PageSwapCache(page) && !PageUnevictable(page)) { - struct pagevec *pvec; - - local_lock(&lru_pvecs.lock); - pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree); - get_page(page); - if (pagevec_add_and_need_flush(pvec, page)) - pagevec_lru_move_fn(pvec, lru_lazyfree_fn); - local_unlock(&lru_pvecs.lock); + struct folio *folio = page_folio(page); + + if (folio_test_lru(folio) && folio_test_anon(folio) && + folio_test_swapbacked(folio) && !folio_test_swapcache(folio) && + !folio_test_unevictable(folio)) { + struct folio_batch *fbatch; + + folio_get(folio); + local_lock(&cpu_fbatches.lock); + fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree); + folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn); + local_unlock(&cpu_fbatches.lock); } } void lru_add_drain(void) { - local_lock(&lru_pvecs.lock); + local_lock(&cpu_fbatches.lock); lru_add_drain_cpu(smp_processor_id()); - local_unlock(&lru_pvecs.lock); + local_unlock(&cpu_fbatches.lock); mlock_page_drain_local(); } @@ -716,19 +740,19 @@ void lru_add_drain(void) */ static void lru_add_and_bh_lrus_drain(void) { - local_lock(&lru_pvecs.lock); + local_lock(&cpu_fbatches.lock); lru_add_drain_cpu(smp_processor_id()); - local_unlock(&lru_pvecs.lock); + local_unlock(&cpu_fbatches.lock); invalidate_bh_lrus_cpu(); mlock_page_drain_local(); } void lru_add_drain_cpu_zone(struct zone *zone) { - local_lock(&lru_pvecs.lock); + local_lock(&cpu_fbatches.lock); lru_add_drain_cpu(smp_processor_id()); drain_local_pages(zone); - local_unlock(&lru_pvecs.lock); + local_unlock(&cpu_fbatches.lock); mlock_page_drain_local(); } @@ -741,6 +765,21 @@ static void lru_add_drain_per_cpu(struct work_struct *dummy) lru_add_and_bh_lrus_drain(); } +static bool cpu_needs_drain(unsigned int cpu) +{ + struct cpu_fbatches *fbatches = &per_cpu(cpu_fbatches, cpu); + + /* Check these in order of likelihood that they're not zero */ + return folio_batch_count(&fbatches->lru_add) || + data_race(folio_batch_count(&per_cpu(lru_rotate.fbatch, cpu))) || + 
folio_batch_count(&fbatches->lru_deactivate_file) || + folio_batch_count(&fbatches->lru_deactivate) || + folio_batch_count(&fbatches->lru_lazyfree) || + folio_batch_count(&fbatches->activate) || + need_mlock_page_drain(cpu) || + has_bh_in_lru(cpu, NULL); +} + /* * Doesn't need any cpu hotplug locking because we do rely on per-cpu * kworkers being shut down before our page_alloc_cpu_dead callback is @@ -773,8 +812,9 @@ static inline void __lru_add_drain_all(bool force_all_cpus) return; /* - * Guarantee pagevec counter stores visible by this CPU are visible to - * other CPUs before loading the current drain generation. + * Guarantee folio_batch counter stores visible by this CPU + * are visible to other CPUs before loading the current drain + * generation. */ smp_mb(); @@ -800,8 +840,9 @@ static inline void __lru_add_drain_all(bool force_all_cpus) * (D) Increment global generation number * * Pairs with smp_load_acquire() at (B), outside of the critical - * section. Use a full memory barrier to guarantee that the new global - * drain generation number is stored before loading pagevec counters. + * section. Use a full memory barrier to guarantee that the + * new global drain generation number is stored before loading + * folio_batch counters. * * This pairing must be done here, before the for_each_online_cpu loop * below which drains the page vectors. @@ -823,14 +864,7 @@ static inline void __lru_add_drain_all(bool force_all_cpus) for_each_online_cpu(cpu) { struct work_struct *work = &per_cpu(lru_add_drain_work, cpu); - if (pagevec_count(&per_cpu(lru_pvecs.lru_add, cpu)) || - data_race(pagevec_count(&per_cpu(lru_rotate.pvec, cpu))) || - pagevec_count(&per_cpu(lru_pvecs.lru_deactivate_file, cpu)) || - pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) || - pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) || - need_activate_page_drain(cpu) || - need_mlock_page_drain(cpu) || - has_bh_in_lru(cpu, NULL)) { + if (cpu_needs_drain(cpu)) { INIT_WORK(work, lru_add_drain_per_cpu); queue_work_on(cpu, mm_percpu_wq, work); __cpumask_set_cpu(cpu, &has_work); @@ -906,8 +940,7 @@ void release_pages(struct page **pages, int nr) unsigned int lock_batch; for (i = 0; i < nr; i++) { - struct page *page = pages[i]; - struct folio *folio = page_folio(page); + struct folio *folio = page_folio(pages[i]); /* * Make sure the IRQ-safe lock-holding time does not get @@ -919,35 +952,34 @@ void release_pages(struct page **pages, int nr) lruvec = NULL; } - page = &folio->page; - if (is_huge_zero_page(page)) + if (is_huge_zero_page(&folio->page)) continue; - if (is_zone_device_page(page)) { + if (folio_is_zone_device(folio)) { if (lruvec) { unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - if (put_devmap_managed_page(page)) + if (put_devmap_managed_page(&folio->page)) continue; - if (put_page_testzero(page)) - free_zone_device_page(page); + if (folio_put_testzero(folio)) + free_zone_device_page(&folio->page); continue; } - if (!put_page_testzero(page)) + if (!folio_put_testzero(folio)) continue; - if (PageCompound(page)) { + if (folio_test_large(folio)) { if (lruvec) { unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - __put_compound_page(page); + __folio_put_large(folio); continue; } - if (PageLRU(page)) { + if (folio_test_lru(folio)) { struct lruvec *prev_lruvec = lruvec; lruvec = folio_lruvec_relock_irqsave(folio, lruvec, @@ -955,8 +987,8 @@ void release_pages(struct page **pages, int nr) if (prev_lruvec != lruvec) lock_batch = 0; - del_page_from_lru_list(page, lruvec); - 
__clear_page_lru_flags(page); + lruvec_del_folio(lruvec, folio); + __folio_clear_lru_flags(folio); } /* @@ -965,13 +997,13 @@ void release_pages(struct page **pages, int nr) * found set here. This does not indicate a problem, unless * "unevictable_pgs_cleared" appears worryingly large. */ - if (unlikely(PageMlocked(page))) { - __ClearPageMlocked(page); - dec_zone_page_state(page, NR_MLOCK); + if (unlikely(folio_test_mlocked(folio))) { + __folio_clear_mlocked(folio); + zone_stat_sub_folio(folio, NR_MLOCK); count_vm_event(UNEVICTABLE_PGCLEARED); } - list_add(&page->lru, &pages_to_free); + list_add(&folio->lru, &pages_to_free); } if (lruvec) unlock_page_lruvec_irqrestore(lruvec, flags); @@ -987,8 +1019,8 @@ EXPORT_SYMBOL(release_pages); * OK from a correctness point of view but is inefficient - those pages may be * cache-warm and we want to give them back to the page allocator ASAP. * - * So __pagevec_release() will drain those queues here. __pagevec_lru_add() - * and __pagevec_lru_add_active() call release_pages() directly to avoid + * So __pagevec_release() will drain those queues here. + * folio_batch_move_lru() calls folios_put() directly to avoid * mutual recursion. */ void __pagevec_release(struct pagevec *pvec) @@ -1002,69 +1034,6 @@ void __pagevec_release(struct pagevec *pvec) } EXPORT_SYMBOL(__pagevec_release); -static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec) -{ - int was_unevictable = folio_test_clear_unevictable(folio); - long nr_pages = folio_nr_pages(folio); - - VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); - - folio_set_lru(folio); - /* - * Is an smp_mb__after_atomic() still required here, before - * folio_evictable() tests PageMlocked, to rule out the possibility - * of stranding an evictable folio on an unevictable LRU? I think - * not, because __munlock_page() only clears PageMlocked while the LRU - * lock is held. - * - * (That is not true of __page_cache_release(), and not necessarily - * true of release_pages(): but those only clear PageMlocked after - * put_page_testzero() has excluded any other users of the page.) - */ - if (folio_evictable(folio)) { - if (was_unevictable) - __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages); - } else { - folio_clear_active(folio); - folio_set_unevictable(folio); - /* - * folio->mlock_count = !!folio_test_mlocked(folio)? - * But that leaves __mlock_page() in doubt whether another - * actor has already counted the mlock or not. Err on the - * safe side, underestimate, let page reclaim fix it, rather - * than leaving a page on the unevictable LRU indefinitely. - */ - folio->mlock_count = 0; - if (!was_unevictable) - __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages); - } - - lruvec_add_folio(lruvec, folio); - trace_mm_lru_insertion(folio); -} - -/* - * Add the passed pages to the LRU, then drop the caller's refcount - * on them. Reinitialises the caller's pagevec. - */ -void __pagevec_lru_add(struct pagevec *pvec) -{ - int i; - struct lruvec *lruvec = NULL; - unsigned long flags = 0; - - for (i = 0; i < pagevec_count(pvec); i++) { - struct folio *folio = page_folio(pvec->pages[i]); - - lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags); - __pagevec_lru_add_fn(folio, lruvec); - } - if (lruvec) - unlock_page_lruvec_irqrestore(lruvec, flags); - release_pages(pvec->pages, pvec->nr); - pagevec_reinit(pvec); -} - /** * folio_batch_remove_exceptionals() - Prune non-folios from a batch. 
* @fbatch: The batch to prune diff --git a/mm/swap.h b/mm/swap.h index 0193797b0c929..17936e068c1c8 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -36,12 +36,11 @@ bool add_to_swap(struct folio *folio); void *get_shadow_from_swap_cache(swp_entry_t entry); int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp, void **shadowp); -void __delete_from_swap_cache(struct page *page, +void __delete_from_swap_cache(struct folio *folio, swp_entry_t entry, void *shadow); -void delete_from_swap_cache(struct page *page); +void delete_from_swap_cache(struct folio *folio); void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end); -void free_swap_cache(struct page *page); struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr); @@ -61,9 +60,9 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag, struct page *swapin_readahead(swp_entry_t entry, gfp_t flag, struct vm_fault *vmf); -static inline unsigned int page_swap_flags(struct page *page) +static inline unsigned int folio_swap_flags(struct folio *folio) { - return page_swap_info(page)->flags; + return page_swap_info(&folio->page)->flags; } #else /* CONFIG_SWAP */ struct swap_iocb; @@ -81,10 +80,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry) return NULL; } -static inline void free_swap_cache(struct page *page) -{ -} - static inline void show_swap_cache_info(void) { } @@ -135,12 +130,12 @@ static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, return -1; } -static inline void __delete_from_swap_cache(struct page *page, +static inline void __delete_from_swap_cache(struct folio *folio, swp_entry_t entry, void *shadow) { } -static inline void delete_from_swap_cache(struct page *page) +static inline void delete_from_swap_cache(struct folio *folio) { } @@ -149,7 +144,7 @@ static inline void clear_shadow_from_swap_cache(int type, unsigned long begin, { } -static inline unsigned int page_swap_flags(struct page *page) +static inline unsigned int folio_swap_flags(struct folio *folio) { return 0; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 0a2021fc55ade..e166051566f42 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -59,24 +59,11 @@ static bool enable_vma_readahead __read_mostly = true; #define GET_SWAP_RA_VAL(vma) \ (atomic_long_read(&(vma)->swap_readahead_info) ? 
: 4) -#define INC_CACHE_INFO(x) data_race(swap_cache_info.x++) -#define ADD_CACHE_INFO(x, nr) data_race(swap_cache_info.x += (nr)) - -static struct { - unsigned long add_total; - unsigned long del_total; - unsigned long find_success; - unsigned long find_total; -} swap_cache_info; - static atomic_t swapin_readahead_hits = ATOMIC_INIT(4); void show_swap_cache_info(void) { printk("%lu pages in swap cache\n", total_swapcache_pages()); - printk("Swap cache stats: add %lu, delete %lu, find %lu/%lu\n", - swap_cache_info.add_total, swap_cache_info.del_total, - swap_cache_info.find_success, swap_cache_info.find_total); printk("Free swap = %ldkB\n", get_nr_swap_pages() << (PAGE_SHIFT - 10)); printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); @@ -133,7 +120,6 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, address_space->nrpages += nr; __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr); __mod_lruvec_page_state(page, NR_SWAPCACHE, nr); - ADD_CACHE_INFO(add_total, nr); unlock: xas_unlock_irq(&xas); } while (xas_nomem(&xas, gfp)); @@ -147,32 +133,32 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, } /* - * This must be called only on pages that have + * This must be called only on folios that have * been verified to be in the swap cache. */ -void __delete_from_swap_cache(struct page *page, +void __delete_from_swap_cache(struct folio *folio, swp_entry_t entry, void *shadow) { struct address_space *address_space = swap_address_space(entry); - int i, nr = thp_nr_pages(page); + int i; + long nr = folio_nr_pages(folio); pgoff_t idx = swp_offset(entry); XA_STATE(xas, &address_space->i_pages, idx); - VM_BUG_ON_PAGE(!PageLocked(page), page); - VM_BUG_ON_PAGE(!PageSwapCache(page), page); - VM_BUG_ON_PAGE(PageWriteback(page), page); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); + VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio); for (i = 0; i < nr; i++) { void *entry = xas_store(&xas, shadow); - VM_BUG_ON_PAGE(entry != page, entry); - set_page_private(page + i, 0); + VM_BUG_ON_FOLIO(entry != folio, folio); + set_page_private(folio_page(folio, i), 0); xas_next(&xas); } - ClearPageSwapCache(page); + folio_clear_swapcache(folio); address_space->nrpages -= nr; - __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr); - __mod_lruvec_page_state(page, NR_SWAPCACHE, -nr); - ADD_CACHE_INFO(del_total, nr); + __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr); + __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr); } /** @@ -237,22 +223,22 @@ bool add_to_swap(struct folio *folio) } /* - * This must be called only on pages that have + * This must be called only on folios that have * been verified to be in the swap cache and locked. - * It will never put the page into the free list, - * the caller has a reference on the page. + * It will never put the folio into the free list, + * the caller has a reference on the folio. 
*/ -void delete_from_swap_cache(struct page *page) +void delete_from_swap_cache(struct folio *folio) { - swp_entry_t entry = { .val = page_private(page) }; + swp_entry_t entry = folio_swap_entry(folio); struct address_space *address_space = swap_address_space(entry); xa_lock_irq(&address_space->i_pages); - __delete_from_swap_cache(page, entry, NULL); + __delete_from_swap_cache(folio, entry, NULL); xa_unlock_irq(&address_space->i_pages); - put_swap_page(page, entry); - page_ref_sub(page, thp_nr_pages(page)); + put_swap_page(&folio->page, entry); + folio_ref_sub(folio, folio_nr_pages(folio)); } void clear_shadow_from_swap_cache(int type, unsigned long begin, @@ -348,12 +334,10 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma, page = find_get_page(swap_address_space(entry), swp_offset(entry)); put_swap_device(si); - INC_CACHE_INFO(find_total); if (page) { bool vma_ra = swap_use_vma_readahead(); bool readahead; - INC_CACHE_INFO(find_success); /* * At the moment, we don't support PG_readahead for anon THP * so let's bail out rather than confusing the readahead stat. diff --git a/mm/swapfile.c b/mm/swapfile.c index a2e66d855b199..1fdccd2f1422e 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -695,7 +695,7 @@ static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset, si->lowest_bit += nr_entries; if (end == si->highest_bit) WRITE_ONCE(si->highest_bit, si->highest_bit - nr_entries); - si->inuse_pages += nr_entries; + WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries); if (si->inuse_pages == si->pages) { si->lowest_bit = si->max; si->highest_bit = 0; @@ -732,7 +732,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, add_to_avail_list(si); } atomic_long_add(nr_entries, &nr_swap_pages); - si->inuse_pages -= nr_entries; + WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries); if (si->flags & SWP_BLKDEV) swap_slot_free_notify = si->bdev->bd_disk->fops->swap_slot_free_notify; @@ -1568,16 +1568,15 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si, return ret; } -static bool page_swapped(struct page *page) +static bool folio_swapped(struct folio *folio) { swp_entry_t entry; struct swap_info_struct *si; - if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!PageTransCompound(page))) - return page_swapcount(page) != 0; + if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio))) + return page_swapcount(&folio->page) != 0; - page = compound_head(page); - entry.val = page_private(page); + entry = folio_swap_entry(folio); si = _swap_info_get(entry); if (si) return swap_page_trans_huge_swapped(si, entry); @@ -1590,13 +1589,14 @@ static bool page_swapped(struct page *page) */ int try_to_free_swap(struct page *page) { - VM_BUG_ON_PAGE(!PageLocked(page), page); + struct folio *folio = page_folio(page); + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - if (!PageSwapCache(page)) + if (!folio_test_swapcache(folio)) return 0; - if (PageWriteback(page)) + if (folio_test_writeback(folio)) return 0; - if (page_swapped(page)) + if (folio_swapped(folio)) return 0; /* @@ -1617,9 +1617,8 @@ int try_to_free_swap(struct page *page) if (pm_suspended_storage()) return 0; - page = compound_head(page); - delete_from_swap_cache(page); - SetPageDirty(page); + delete_from_swap_cache(folio); + folio_set_dirty(folio); return 1; } @@ -2640,7 +2639,7 @@ static int swap_show(struct seq_file *swap, void *v) } bytes = si->pages << (PAGE_SHIFT - 10); - inuse = si->inuse_pages << (PAGE_SHIFT - 10); + inuse = 
READ_ONCE(si->inuse_pages) << (PAGE_SHIFT - 10); file = si->swap_file; len = seq_file_path(swap, file, " \t\n\\"); @@ -3259,7 +3258,7 @@ void si_swapinfo(struct sysinfo *val) struct swap_info_struct *si = swap_info[type]; if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK)) - nr_to_be_unused += si->inuse_pages; + nr_to_be_unused += READ_ONCE(si->inuse_pages); } val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused; val->totalswap = total_swap_pages + nr_to_be_unused; diff --git a/mm/util.c b/mm/util.c index 53af0e79d3e47..c9439c66d8cf4 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1005,7 +1005,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed); * succeed and -ENOMEM implies there is not. * * We currently support three overcommit policies, which are set via the - * vm.overcommit_memory sysctl. See Documentation/vm/overcommit-accounting.rst + * vm.overcommit_memory sysctl. See Documentation/mm/overcommit-accounting.rst * * Strict overcommit modes added 2002 Feb 26 by Alan Cox. * Additional code 2002 Jul 20 by Robert Love. diff --git a/mm/vmalloc.c b/mm/vmalloc.c index effd1ff6a4b41..dd6cdb2011953 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -790,6 +790,7 @@ unsigned long vmalloc_nr_pages(void) return atomic_long_read(&nr_vmalloc_pages); } +/* Look up the first VA which satisfies addr < va_end, NULL if none. */ static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) { struct vmap_area *va = NULL; @@ -814,9 +815,9 @@ static struct vmap_area *find_vmap_area_exceed_addr(unsigned long addr) return va; } -static struct vmap_area *__find_vmap_area(unsigned long addr) +static struct vmap_area *__find_vmap_area(unsigned long addr, struct rb_root *root) { - struct rb_node *n = vmap_area_root.rb_node; + struct rb_node *n = root->rb_node; addr = (unsigned long)kasan_reset_tag((void *)addr); @@ -874,11 +875,9 @@ find_va_links(struct vmap_area *va, * Trigger the BUG() if there are sides(left/right) * or full overlaps. */ - if (va->va_start < tmp_va->va_end && - va->va_end <= tmp_va->va_start) + if (va->va_end <= tmp_va->va_start) link = &(*link)->rb_left; - else if (va->va_end > tmp_va->va_start && - va->va_start >= tmp_va->va_end) + else if (va->va_start >= tmp_va->va_end) link = &(*link)->rb_right; else { WARN(1, "vmalloc bug: 0x%lx-0x%lx overlaps with 0x%lx-0x%lx\n", @@ -911,8 +910,9 @@ get_va_next_sibling(struct rb_node *parent, struct rb_node **link) } static __always_inline void -link_va(struct vmap_area *va, struct rb_root *root, - struct rb_node *parent, struct rb_node **link, struct list_head *head) +__link_va(struct vmap_area *va, struct rb_root *root, + struct rb_node *parent, struct rb_node **link, + struct list_head *head, bool augment) { /* * VA is still not in the list, but we can @@ -926,12 +926,12 @@ link_va(struct vmap_area *va, struct rb_root *root, /* Insert to the rb-tree */ rb_link_node(&va->rb_node, parent, link); - if (root == &free_vmap_area_root) { + if (augment) { /* * Some explanation here. Just perform simple insertion * to the tree. We do not set va->subtree_max_size to * its current size before calling rb_insert_augmented(). - * It is because of we populate the tree from the bottom + * It is because we populate the tree from the bottom * to parent levels when the node _is_ in the tree. 
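
Stepping back to the mm/swapfile.c hunks above: si->inuse_pages is updated under the swap device lock but read locklessly by swap_show() and si_swapinfo(), so the stores become WRITE_ONCE() and the lockless loads become READ_ONCE() to mark the intentional data race. A minimal sketch of that pattern with made-up names (not code from this patch):

    #include <linux/spinlock.h>

    struct demo_stats {
    	spinlock_t lock;
    	unsigned long in_use;
    };

    /* Writer side: updates stay serialized by the lock. */
    static void demo_stats_add(struct demo_stats *s, unsigned long nr)
    {
    	spin_lock(&s->lock);
    	WRITE_ONCE(s->in_use, s->in_use + nr);
    	spin_unlock(&s->lock);
    }

    /* Lockless reader, e.g. a statistics path, takes no lock. */
    static unsigned long demo_stats_read(struct demo_stats *s)
    {
    	return READ_ONCE(s->in_use);
    }
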
* * Therefore we set subtree_max_size to zero after insertion, @@ -950,21 +950,49 @@ link_va(struct vmap_area *va, struct rb_root *root, } static __always_inline void -unlink_va(struct vmap_area *va, struct rb_root *root) +link_va(struct vmap_area *va, struct rb_root *root, + struct rb_node *parent, struct rb_node **link, + struct list_head *head) +{ + __link_va(va, root, parent, link, head, false); +} + +static __always_inline void +link_va_augment(struct vmap_area *va, struct rb_root *root, + struct rb_node *parent, struct rb_node **link, + struct list_head *head) +{ + __link_va(va, root, parent, link, head, true); +} + +static __always_inline void +__unlink_va(struct vmap_area *va, struct rb_root *root, bool augment) { if (WARN_ON(RB_EMPTY_NODE(&va->rb_node))) return; - if (root == &free_vmap_area_root) + if (augment) rb_erase_augmented(&va->rb_node, root, &free_vmap_area_rb_augment_cb); else rb_erase(&va->rb_node, root); - list_del(&va->list); + list_del_init(&va->list); RB_CLEAR_NODE(&va->rb_node); } +static __always_inline void +unlink_va(struct vmap_area *va, struct rb_root *root) +{ + __unlink_va(va, root, false); +} + +static __always_inline void +unlink_va_augment(struct vmap_area *va, struct rb_root *root) +{ + __unlink_va(va, root, true); +} + #if DEBUG_AUGMENT_PROPAGATE_CHECK /* * Gets called when remove the node and rotate. @@ -1060,7 +1088,7 @@ insert_vmap_area_augment(struct vmap_area *va, link = find_va_links(va, root, NULL, &parent); if (link) { - link_va(va, root, parent, link, head); + link_va_augment(va, root, parent, link, head); augment_tree_propagate_from(va); } } @@ -1077,8 +1105,8 @@ insert_vmap_area_augment(struct vmap_area *va, * ongoing. */ static __always_inline struct vmap_area * -merge_or_add_vmap_area(struct vmap_area *va, - struct rb_root *root, struct list_head *head) +__merge_or_add_vmap_area(struct vmap_area *va, + struct rb_root *root, struct list_head *head, bool augment) { struct vmap_area *sibling; struct list_head *next; @@ -1140,7 +1168,7 @@ merge_or_add_vmap_area(struct vmap_area *va, * "normalized" because of rotation operations. */ if (merged) - unlink_va(va, root); + __unlink_va(va, root, augment); sibling->va_end = va->va_end; @@ -1155,16 +1183,23 @@ merge_or_add_vmap_area(struct vmap_area *va, insert: if (!merged) - link_va(va, root, parent, link, head); + __link_va(va, root, parent, link, head, augment); return va; } +static __always_inline struct vmap_area * +merge_or_add_vmap_area(struct vmap_area *va, + struct rb_root *root, struct list_head *head) +{ + return __merge_or_add_vmap_area(va, root, head, false); +} + static __always_inline struct vmap_area * merge_or_add_vmap_area_augment(struct vmap_area *va, struct rb_root *root, struct list_head *head) { - va = merge_or_add_vmap_area(va, root, head); + va = __merge_or_add_vmap_area(va, root, head, true); if (va) augment_tree_propagate_from(va); @@ -1198,15 +1233,15 @@ is_within_this_va(struct vmap_area *va, unsigned long size, * overhead. */ static __always_inline struct vmap_area * -find_vmap_lowest_match(unsigned long size, unsigned long align, - unsigned long vstart, bool adjust_search_size) +find_vmap_lowest_match(struct rb_root *root, unsigned long size, + unsigned long align, unsigned long vstart, bool adjust_search_size) { struct vmap_area *va; struct rb_node *node; unsigned long length; /* Start from the root. */ - node = free_vmap_area_root.rb_node; + node = root->rb_node; /* Adjust the search size for alignment overhead. */ length = adjust_search_size ? 
size + align - 1 : size; @@ -1334,11 +1369,12 @@ classify_va_fit_type(struct vmap_area *va, } static __always_inline int -adjust_va_to_fit_type(struct vmap_area *va, - unsigned long nva_start_addr, unsigned long size, - enum fit_type type) +adjust_va_to_fit_type(struct rb_root *root, struct list_head *head, + struct vmap_area *va, unsigned long nva_start_addr, + unsigned long size) { struct vmap_area *lva = NULL; + enum fit_type type = classify_va_fit_type(va, nva_start_addr, size); if (type == FL_FIT_TYPE) { /* @@ -1348,7 +1384,7 @@ adjust_va_to_fit_type(struct vmap_area *va, * V NVA V * |---------------| */ - unlink_va(va, &free_vmap_area_root); + unlink_va_augment(va, root); kmem_cache_free(vmap_area_cachep, va); } else if (type == LE_FIT_TYPE) { /* @@ -1426,8 +1462,7 @@ adjust_va_to_fit_type(struct vmap_area *va, augment_tree_propagate_from(va); if (lva) /* type == NE_FIT_TYPE */ - insert_vmap_area_augment(lva, &va->rb_node, - &free_vmap_area_root, &free_vmap_area_list); + insert_vmap_area_augment(lva, &va->rb_node, root, head); } return 0; @@ -1438,13 +1473,13 @@ adjust_va_to_fit_type(struct vmap_area *va, * Otherwise a vend is returned that indicates failure. */ static __always_inline unsigned long -__alloc_vmap_area(unsigned long size, unsigned long align, +__alloc_vmap_area(struct rb_root *root, struct list_head *head, + unsigned long size, unsigned long align, unsigned long vstart, unsigned long vend) { bool adjust_search_size = true; unsigned long nva_start_addr; struct vmap_area *va; - enum fit_type type; int ret; /* @@ -1459,7 +1494,7 @@ __alloc_vmap_area(unsigned long size, unsigned long align, if (align <= PAGE_SIZE || (align > PAGE_SIZE && (vend - vstart) == size)) adjust_search_size = false; - va = find_vmap_lowest_match(size, align, vstart, adjust_search_size); + va = find_vmap_lowest_match(root, size, align, vstart, adjust_search_size); if (unlikely(!va)) return vend; @@ -1472,14 +1507,9 @@ __alloc_vmap_area(unsigned long size, unsigned long align, if (nva_start_addr + size > vend) return vend; - /* Classify what we have found. */ - type = classify_va_fit_type(va, nva_start_addr, size); - if (WARN_ON_ONCE(type == NOTHING_FIT)) - return vend; - /* Update the free vmap_area. */ - ret = adjust_va_to_fit_type(va, nva_start_addr, size, type); - if (ret) + ret = adjust_va_to_fit_type(root, head, va, nva_start_addr, size); + if (WARN_ON_ONCE(ret)) return vend; #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK @@ -1569,7 +1599,8 @@ static struct vmap_area *alloc_vmap_area(unsigned long size, retry: preload_this_cpu_lock(&free_vmap_area_lock, gfp_mask, node); - addr = __alloc_vmap_area(size, align, vstart, vend); + addr = __alloc_vmap_area(&free_vmap_area_root, &free_vmap_area_list, + size, align, vstart, vend); spin_unlock(&free_vmap_area_lock); /* @@ -1663,7 +1694,7 @@ static atomic_long_t vmap_lazy_nr = ATOMIC_LONG_INIT(0); /* * Serialize vmap purging. There is no actual critical section protected - * by this look, but we want to avoid concurrent calls for performance + * by this lock, but we want to avoid concurrent calls for performance * reasons and to make the pcpu_get_vm_areas more deterministic. 
*/ static DEFINE_MUTEX(vmap_purge_lock); @@ -1677,32 +1708,32 @@ static void purge_fragmented_blocks_allcpus(void); static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end) { unsigned long resched_threshold; - struct list_head local_pure_list; + struct list_head local_purge_list; struct vmap_area *va, *n_va; lockdep_assert_held(&vmap_purge_lock); spin_lock(&purge_vmap_area_lock); purge_vmap_area_root = RB_ROOT; - list_replace_init(&purge_vmap_area_list, &local_pure_list); + list_replace_init(&purge_vmap_area_list, &local_purge_list); spin_unlock(&purge_vmap_area_lock); - if (unlikely(list_empty(&local_pure_list))) + if (unlikely(list_empty(&local_purge_list))) return false; start = min(start, - list_first_entry(&local_pure_list, + list_first_entry(&local_purge_list, struct vmap_area, list)->va_start); end = max(end, - list_last_entry(&local_pure_list, + list_last_entry(&local_purge_list, struct vmap_area, list)->va_end); flush_tlb_kernel_range(start, end); resched_threshold = lazy_max_pages() << 1; spin_lock(&free_vmap_area_lock); - list_for_each_entry_safe(va, n_va, &local_pure_list, list) { + list_for_each_entry_safe(va, n_va, &local_purge_list, list) { unsigned long nr = (va->va_end - va->va_start) >> PAGE_SHIFT; unsigned long orig_start = va->va_start; unsigned long orig_end = va->va_end; @@ -1803,7 +1834,7 @@ struct vmap_area *find_vmap_area(unsigned long addr) struct vmap_area *va; spin_lock(&vmap_area_lock); - va = __find_vmap_area(addr); + va = __find_vmap_area(addr, &vmap_area_root); spin_unlock(&vmap_area_lock); return va; @@ -2546,7 +2577,7 @@ struct vm_struct *remove_vm_area(const void *addr) might_sleep(); spin_lock(&vmap_area_lock); - va = __find_vmap_area((unsigned long)addr); + va = __find_vmap_area((unsigned long)addr, &vmap_area_root); if (va && va->vm) { struct vm_struct *vm = va->vm; @@ -3168,15 +3199,15 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align, /* * Mark the pages as accessible, now that they are mapped. - * The init condition should match the one in post_alloc_hook() - * (except for the should_skip_init() check) to make sure that memory - * is initialized under the same conditions regardless of the enabled - * KASAN mode. + * The condition for setting KASAN_VMALLOC_INIT should complement the + * one in post_alloc_hook() with regards to the __GFP_SKIP_ZERO check + * to make sure that memory is initialized under the same conditions. * Tag-based KASAN modes only assign tags to normal non-executable * allocations, see __kasan_unpoison_vmalloc(). */ kasan_flags |= KASAN_VMALLOC_VM_ALLOC; - if (!want_init_on_free() && want_init_on_alloc(gfp_mask)) + if (!want_init_on_free() && want_init_on_alloc(gfp_mask) && + (gfp_mask & __GFP_SKIP_ZERO)) kasan_flags |= KASAN_VMALLOC_INIT; /* KASAN_VMALLOC_PROT_NORMAL already set if required. */ area->addr = kasan_unpoison_vmalloc(area->addr, real_size, kasan_flags); @@ -3735,7 +3766,6 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, int area, area2, last_area, term_area; unsigned long base, start, size, end, last_end, orig_start, orig_end; bool purged = false; - enum fit_type type; /* verify parameters and allocate data structures */ BUG_ON(offset_in_page(align) || !is_power_of_2(align)); @@ -3846,15 +3876,13 @@ struct vm_struct **pcpu_get_vm_areas(const unsigned long *offsets, /* It is a BUG(), but trigger recovery instead. 
*/ goto recovery; - type = classify_va_fit_type(va, start, size); - if (WARN_ON_ONCE(type == NOTHING_FIT)) + ret = adjust_va_to_fit_type(&free_vmap_area_root, + &free_vmap_area_list, + va, start, size); + if (WARN_ON_ONCE(unlikely(ret))) /* It is a BUG(), but trigger recovery instead. */ goto recovery; - ret = adjust_va_to_fit_type(va, start, size, type); - if (unlikely(ret)) - goto recovery; - /* Allocated area. */ va = vas[area]; va->va_start = start; diff --git a/mm/vmscan.c b/mm/vmscan.c index 04f8671caad9a..29846d10209ad 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -26,8 +26,7 @@ #include #include #include -#include /* for try_to_release_page(), - buffer_heads_over_limit */ +#include /* for buffer_heads_over_limit */ #include #include #include @@ -160,17 +159,17 @@ struct scan_control { }; #ifdef ARCH_HAS_PREFETCHW -#define prefetchw_prev_lru_page(_page, _base, _field) \ +#define prefetchw_prev_lru_folio(_folio, _base, _field) \ do { \ - if ((_page)->lru.prev != _base) { \ - struct page *prev; \ + if ((_folio)->lru.prev != _base) { \ + struct folio *prev; \ \ - prev = lru_to_page(&(_page->lru)); \ + prev = lru_to_folio(&(_folio->lru)); \ prefetchw(&prev->_field); \ } \ } while (0) #else -#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0) +#define prefetchw_prev_lru_folio(_folio, _base, _field) do { } while (0) #endif /* @@ -190,8 +189,8 @@ static void set_task_reclaim_state(struct task_struct *task, task->reclaim_state = rs; } -static LIST_HEAD(shrinker_list); -static DECLARE_RWSEM(shrinker_rwsem); +LIST_HEAD(shrinker_list); +DECLARE_RWSEM(shrinker_rwsem); #ifdef CONFIG_MEMCG static int shrinker_nr_max; @@ -608,7 +607,7 @@ static unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, /* * Add a shrinker callback to be called from the vm. */ -int prealloc_shrinker(struct shrinker *shrinker) +static int __prealloc_shrinker(struct shrinker *shrinker) { unsigned int size; int err; @@ -632,8 +631,36 @@ int prealloc_shrinker(struct shrinker *shrinker) return 0; } +#ifdef CONFIG_SHRINKER_DEBUG +int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...) +{ + va_list ap; + int err; + + va_start(ap, fmt); + shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap); + va_end(ap); + if (!shrinker->name) + return -ENOMEM; + + err = __prealloc_shrinker(shrinker); + if (err) + kfree_const(shrinker->name); + + return err; +} +#else +int prealloc_shrinker(struct shrinker *shrinker, const char *fmt, ...) +{ + return __prealloc_shrinker(shrinker); +} +#endif + void free_prealloced_shrinker(struct shrinker *shrinker) { +#ifdef CONFIG_SHRINKER_DEBUG + kfree_const(shrinker->name); +#endif if (shrinker->flags & SHRINKER_MEMCG_AWARE) { down_write(&shrinker_rwsem); unregister_memcg_shrinker(shrinker); @@ -650,18 +677,43 @@ void register_shrinker_prepared(struct shrinker *shrinker) down_write(&shrinker_rwsem); list_add_tail(&shrinker->list, &shrinker_list); shrinker->flags |= SHRINKER_REGISTERED; + shrinker_debugfs_add(shrinker); up_write(&shrinker_rwsem); } -int register_shrinker(struct shrinker *shrinker) +static int __register_shrinker(struct shrinker *shrinker) { - int err = prealloc_shrinker(shrinker); + int err = __prealloc_shrinker(shrinker); if (err) return err; register_shrinker_prepared(shrinker); return 0; } + +#ifdef CONFIG_SHRINKER_DEBUG +int register_shrinker(struct shrinker *shrinker, const char *fmt, ...) 
+{ + va_list ap; + int err; + + va_start(ap, fmt); + shrinker->name = kvasprintf_const(GFP_KERNEL, fmt, ap); + va_end(ap); + if (!shrinker->name) + return -ENOMEM; + + err = __register_shrinker(shrinker); + if (err) + kfree_const(shrinker->name); + return err; +} +#else +int register_shrinker(struct shrinker *shrinker, const char *fmt, ...) +{ + return __register_shrinker(shrinker); +} +#endif EXPORT_SYMBOL(register_shrinker); /* @@ -677,6 +729,7 @@ void unregister_shrinker(struct shrinker *shrinker) shrinker->flags &= ~SHRINKER_REGISTERED; if (shrinker->flags & SHRINKER_MEMCG_AWARE) unregister_memcg_shrinker(shrinker); + shrinker_debugfs_remove(shrinker); up_write(&shrinker_rwsem); kfree(shrinker->nr_deferred); @@ -1276,7 +1329,7 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio, mem_cgroup_swapout(folio, swap); if (reclaimed && !mapping_exiting(mapping)) shadow = workingset_eviction(folio, target_memcg); - __delete_from_swap_cache(&folio->page, swap, shadow); + __delete_from_swap_cache(folio, swap, shadow); xa_unlock_irq(&mapping->i_pages); put_swap_page(&folio->page, swap); } else { @@ -1519,7 +1572,7 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask) * but that will never affect SWP_FS_OPS, so the data_race * is safe. */ - return !data_race(page_swap_flags(&folio->page) & SWP_FS_OPS); + return !data_race(folio_swap_flags(folio) & SWP_FS_OPS); } /* @@ -1926,7 +1979,7 @@ static unsigned int shrink_page_list(struct list_head *page_list, * appear not as the counts should be low */ if (unlikely(folio_test_large(folio))) - destroy_compound_page(&folio->page); + destroy_large_folio(folio); else list_add(&folio->lru, &free_pages); continue; @@ -1987,7 +2040,7 @@ static unsigned int shrink_page_list(struct list_head *page_list, } unsigned int reclaim_clean_pages_from_list(struct zone *zone, - struct list_head *page_list) + struct list_head *folio_list) { struct scan_control sc = { .gfp_mask = GFP_KERNEL, @@ -1995,16 +2048,16 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, }; struct reclaim_stat stat; unsigned int nr_reclaimed; - struct page *page, *next; - LIST_HEAD(clean_pages); + struct folio *folio, *next; + LIST_HEAD(clean_folios); unsigned int noreclaim_flag; - list_for_each_entry_safe(page, next, page_list, lru) { - if (!PageHuge(page) && page_is_file_lru(page) && - !PageDirty(page) && !__PageMovable(page) && - !PageUnevictable(page)) { - ClearPageActive(page); - list_move(&page->lru, &clean_pages); + list_for_each_entry_safe(folio, next, folio_list, lru) { + if (!folio_test_hugetlb(folio) && folio_is_file_lru(folio) && + !folio_test_dirty(folio) && !__folio_test_movable(folio) && + !folio_test_unevictable(folio)) { + folio_clear_active(folio); + list_move(&folio->lru, &clean_folios); } } @@ -2015,11 +2068,11 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone, * change in the future. 
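
With the shrinker hunks above, a shrinker now gets a human-readable name at registration time; the name is what appears under /sys/kernel/debug/shrinker/ and in the memcg_shrinker.py output further down in this series. A rough sketch of a caller after the conversion (the demo_* cache and its helpers are made up for illustration; only the register/unregister calls reflect the new interface):

    #include <linux/shrinker.h>	/* struct shrinker, register_shrinker() */

    static unsigned long demo_count(struct shrinker *s, struct shrink_control *sc)
    {
    	return demo_nr_cached();		/* hypothetical helper */
    }

    static unsigned long demo_scan(struct shrinker *s, struct shrink_control *sc)
    {
    	return demo_evict(sc->nr_to_scan);	/* hypothetical helper */
    }

    static struct shrinker demo_shrinker = {
    	.count_objects	= demo_count,
    	.scan_objects	= demo_scan,
    	.seeks		= DEFAULT_SEEKS,
    };

    static int demo_cache_init(void)
    {
    	/* was: register_shrinker(&demo_shrinker); */
    	return register_shrinker(&demo_shrinker, "demo-cache");
    }

    static void demo_cache_exit(void)
    {
    	unregister_shrinker(&demo_shrinker);
    }

As in the zsmalloc and sunrpc call sites below, the name may also be a printf-style format string, e.g. register_shrinker(&pool->shrinker, "mm-zspool:%s", pool->name).
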
*/ noreclaim_flag = memalloc_noreclaim_save(); - nr_reclaimed = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc, + nr_reclaimed = shrink_page_list(&clean_folios, zone->zone_pgdat, &sc, &stat, true); memalloc_noreclaim_restore(noreclaim_flag); - list_splice(&clean_pages, page_list); + list_splice(&clean_folios, folio_list); mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -(long)nr_reclaimed); /* @@ -2085,72 +2138,72 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan, unsigned long nr_skipped[MAX_NR_ZONES] = { 0, }; unsigned long skipped = 0; unsigned long scan, total_scan, nr_pages; - LIST_HEAD(pages_skipped); + LIST_HEAD(folios_skipped); total_scan = 0; scan = 0; while (scan < nr_to_scan && !list_empty(src)) { struct list_head *move_to = src; - struct page *page; + struct folio *folio; - page = lru_to_page(src); - prefetchw_prev_lru_page(page, src, flags); + folio = lru_to_folio(src); + prefetchw_prev_lru_folio(folio, src, flags); - nr_pages = compound_nr(page); + nr_pages = folio_nr_pages(folio); total_scan += nr_pages; - if (page_zonenum(page) > sc->reclaim_idx) { - nr_skipped[page_zonenum(page)] += nr_pages; - move_to = &pages_skipped; + if (folio_zonenum(folio) > sc->reclaim_idx) { + nr_skipped[folio_zonenum(folio)] += nr_pages; + move_to = &folios_skipped; goto move; } /* - * Do not count skipped pages because that makes the function - * return with no isolated pages if the LRU mostly contains - * ineligible pages. This causes the VM to not reclaim any - * pages, triggering a premature OOM. - * Account all tail pages of THP. + * Do not count skipped folios because that makes the function + * return with no isolated folios if the LRU mostly contains + * ineligible folios. This causes the VM to not reclaim any + * folios, triggering a premature OOM. + * Account all pages in a folio. */ scan += nr_pages; - if (!PageLRU(page)) + if (!folio_test_lru(folio)) goto move; - if (!sc->may_unmap && page_mapped(page)) + if (!sc->may_unmap && folio_mapped(folio)) goto move; /* - * Be careful not to clear PageLRU until after we're - * sure the page is not being freed elsewhere -- the - * page release code relies on it. + * Be careful not to clear the lru flag until after we're + * sure the folio is not being freed elsewhere -- the + * folio release code relies on it. */ - if (unlikely(!get_page_unless_zero(page))) + if (unlikely(!folio_try_get(folio))) goto move; - if (!TestClearPageLRU(page)) { - /* Another thread is already isolating this page */ - put_page(page); + if (!folio_test_clear_lru(folio)) { + /* Another thread is already isolating this folio */ + folio_put(folio); goto move; } nr_taken += nr_pages; - nr_zone_taken[page_zonenum(page)] += nr_pages; + nr_zone_taken[folio_zonenum(folio)] += nr_pages; move_to = dst; move: - list_move(&page->lru, move_to); + list_move(&folio->lru, move_to); } /* - * Splice any skipped pages to the start of the LRU list. Note that + * Splice any skipped folios to the start of the LRU list. Note that * this disrupts the LRU order when reclaiming for lower zones but * we cannot splice to the tail. If we did then the SWAP_CLUSTER_MAX - * scanning would soon rescan the same pages to skip and waste lots + * scanning would soon rescan the same folios to skip and waste lots * of cpu cycles. 
*/ - if (!list_empty(&pages_skipped)) { + if (!list_empty(&folios_skipped)) { int zid; - list_splice(&pages_skipped, src); + list_splice(&folios_skipped, src); for (zid = 0; zid < MAX_NR_ZONES; zid++) { if (!nr_skipped[zid]) continue; @@ -2254,8 +2307,8 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, } /* - * move_pages_to_lru() moves pages from private @list to appropriate LRU list. - * On return, @list is reused as a list of pages to be freed by the caller. + * move_pages_to_lru() moves folios from private @list to appropriate LRU list. + * On return, @list is reused as a list of folios to be freed by the caller. * * Returns the number of pages moved to the given lruvec. */ @@ -2263,42 +2316,42 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec, struct list_head *list) { int nr_pages, nr_moved = 0; - LIST_HEAD(pages_to_free); - struct page *page; + LIST_HEAD(folios_to_free); while (!list_empty(list)) { - page = lru_to_page(list); - VM_BUG_ON_PAGE(PageLRU(page), page); - list_del(&page->lru); - if (unlikely(!page_evictable(page))) { + struct folio *folio = lru_to_folio(list); + + VM_BUG_ON_FOLIO(folio_test_lru(folio), folio); + list_del(&folio->lru); + if (unlikely(!folio_evictable(folio))) { spin_unlock_irq(&lruvec->lru_lock); - putback_lru_page(page); + folio_putback_lru(folio); spin_lock_irq(&lruvec->lru_lock); continue; } /* - * The SetPageLRU needs to be kept here for list integrity. + * The folio_set_lru needs to be kept here for list integrity. * Otherwise: * #0 move_pages_to_lru #1 release_pages - * if !put_page_testzero - * if (put_page_testzero()) - * !PageLRU //skip lru_lock - * SetPageLRU() - * list_add(&page->lru,) - * list_add(&page->lru,) + * if (!folio_put_testzero()) + * if (folio_put_testzero()) + * !lru //skip lru_lock + * folio_set_lru() + * list_add(&folio->lru,) + * list_add(&folio->lru,) */ - SetPageLRU(page); + folio_set_lru(folio); - if (unlikely(put_page_testzero(page))) { - __clear_page_lru_flags(page); + if (unlikely(folio_put_testzero(folio))) { + __folio_clear_lru_flags(folio); - if (unlikely(PageCompound(page))) { + if (unlikely(folio_test_large(folio))) { spin_unlock_irq(&lruvec->lru_lock); - destroy_compound_page(page); + destroy_large_folio(folio); spin_lock_irq(&lruvec->lru_lock); } else - list_add(&page->lru, &pages_to_free); + list_add(&folio->lru, &folios_to_free); continue; } @@ -2307,18 +2360,18 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec, * All pages were isolated from the same lruvec (and isolation * inhibits memcg migration). */ - VM_BUG_ON_PAGE(!folio_matches_lruvec(page_folio(page), lruvec), page); - add_page_to_lru_list(page, lruvec); - nr_pages = thp_nr_pages(page); + VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio); + lruvec_add_folio(lruvec, folio); + nr_pages = folio_nr_pages(folio); nr_moved += nr_pages; - if (PageActive(page)) + if (folio_test_active(folio)) workingset_age_nonresident(lruvec, nr_pages); } /* * To save our caller's stack, now use input list for pages to free. */ - list_splice(&pages_to_free, list); + list_splice(&folios_to_free, list); return nr_moved; } @@ -2429,21 +2482,21 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, } /* - * shrink_active_list() moves pages from the active LRU to the inactive LRU. + * shrink_active_list() moves folios from the active LRU to the inactive LRU. 
* - * We move them the other way if the page is referenced by one or more + * We move them the other way if the folio is referenced by one or more * processes. * - * If the pages are mostly unmapped, the processing is fast and it is + * If the folios are mostly unmapped, the processing is fast and it is * appropriate to hold lru_lock across the whole operation. But if - * the pages are mapped, the processing is slow (folio_referenced()), so - * we should drop lru_lock around each page. It's impossible to balance - * this, so instead we remove the pages from the LRU while processing them. - * It is safe to rely on PG_active against the non-LRU pages in here because - * nobody will play with that bit on a non-LRU page. + * the folios are mapped, the processing is slow (folio_referenced()), so + * we should drop lru_lock around each folio. It's impossible to balance + * this, so instead we remove the folios from the LRU while processing them. + * It is safe to rely on the active flag against the non-LRU folios in here + * because nobody will play with that bit on a non-LRU folio. * - * The downside is that we have to touch page->_refcount against each page. - * But we had to alter page->flags anyway. + * The downside is that we have to touch folio->_refcount against each folio. + * But we had to alter folio->flags anyway. */ static void shrink_active_list(unsigned long nr_to_scan, struct lruvec *lruvec, @@ -2453,7 +2506,7 @@ static void shrink_active_list(unsigned long nr_to_scan, unsigned long nr_taken; unsigned long nr_scanned; unsigned long vm_flags; - LIST_HEAD(l_hold); /* The pages which were snipped off */ + LIST_HEAD(l_hold); /* The folios which were snipped off */ LIST_HEAD(l_active); LIST_HEAD(l_inactive); unsigned nr_deactivate, nr_activate; @@ -2478,23 +2531,21 @@ static void shrink_active_list(unsigned long nr_to_scan, while (!list_empty(&l_hold)) { struct folio *folio; - struct page *page; cond_resched(); folio = lru_to_folio(&l_hold); list_del(&folio->lru); - page = &folio->page; - if (unlikely(!page_evictable(page))) { - putback_lru_page(page); + if (unlikely(!folio_evictable(folio))) { + folio_putback_lru(folio); continue; } if (unlikely(buffer_heads_over_limit)) { - if (page_has_private(page) && trylock_page(page)) { - if (page_has_private(page)) - try_to_release_page(page, 0); - unlock_page(page); + if (folio_get_private(folio) && folio_trylock(folio)) { + if (folio_get_private(folio)) + filemap_release_folio(folio, 0); + folio_unlock(folio); } } @@ -2502,34 +2553,34 @@ static void shrink_active_list(unsigned long nr_to_scan, if (folio_referenced(folio, 0, sc->target_mem_cgroup, &vm_flags) != 0) { /* - * Identify referenced, file-backed active pages and + * Identify referenced, file-backed active folios and * give them one more trip around the active list. So * that executable code get better chances to stay in - * memory under moderate memory pressure. Anon pages + * memory under moderate memory pressure. Anon folios * are not likely to be evicted by use-once streaming - * IO, plus JVM can create lots of anon VM_EXEC pages, + * IO, plus JVM can create lots of anon VM_EXEC folios, * so we ignore them here. 
*/ - if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) { - nr_rotated += thp_nr_pages(page); - list_add(&page->lru, &l_active); + if ((vm_flags & VM_EXEC) && folio_is_file_lru(folio)) { + nr_rotated += folio_nr_pages(folio); + list_add(&folio->lru, &l_active); continue; } } - ClearPageActive(page); /* we are de-activating */ - SetPageWorkingset(page); - list_add(&page->lru, &l_inactive); + folio_clear_active(folio); /* we are de-activating */ + folio_set_workingset(folio); + list_add(&folio->lru, &l_inactive); } /* - * Move pages back to the lru list. + * Move folios back to the lru list. */ spin_lock_irq(&lruvec->lru_lock); nr_activate = move_pages_to_lru(lruvec, &l_active); nr_deactivate = move_pages_to_lru(lruvec, &l_inactive); - /* Keep all free pages in l_active list */ + /* Keep all free folios in l_active list */ list_splice(&l_inactive, &l_active); __count_vm_events(PGDEACTIVATE, nr_deactivate); @@ -2568,34 +2619,33 @@ static unsigned int reclaim_page_list(struct list_head *page_list, return nr_reclaimed; } -unsigned long reclaim_pages(struct list_head *page_list) +unsigned long reclaim_pages(struct list_head *folio_list) { int nid; unsigned int nr_reclaimed = 0; - LIST_HEAD(node_page_list); - struct page *page; + LIST_HEAD(node_folio_list); unsigned int noreclaim_flag; - if (list_empty(page_list)) + if (list_empty(folio_list)) return nr_reclaimed; noreclaim_flag = memalloc_noreclaim_save(); - nid = page_to_nid(lru_to_page(page_list)); + nid = folio_nid(lru_to_folio(folio_list)); do { - page = lru_to_page(page_list); + struct folio *folio = lru_to_folio(folio_list); - if (nid == page_to_nid(page)) { - ClearPageActive(page); - list_move(&page->lru, &node_page_list); + if (nid == folio_nid(folio)) { + folio_clear_active(folio); + list_move(&folio->lru, &node_folio_list); continue; } - nr_reclaimed += reclaim_page_list(&node_page_list, NODE_DATA(nid)); - nid = page_to_nid(lru_to_page(page_list)); - } while (!list_empty(page_list)); + nr_reclaimed += reclaim_page_list(&node_folio_list, NODE_DATA(nid)); + nid = folio_nid(lru_to_folio(folio_list)); + } while (!list_empty(folio_list)); - nr_reclaimed += reclaim_page_list(&node_page_list, NODE_DATA(nid)); + nr_reclaimed += reclaim_page_list(&node_folio_list, NODE_DATA(nid)); memalloc_noreclaim_restore(noreclaim_flag); @@ -4595,7 +4645,7 @@ void kswapd_run(int nid) /* * Called by memory hotplug when all memory in a node is offlined. Caller must - * hold mem_hotplug_begin/end(). + * be holding mem_hotplug_begin/done(). 
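
Taken together, the swap and mm/vmscan.c hunks above apply one mechanical mapping from struct page helpers to their struct folio equivalents; collected here as a reading aid (this block is a summary, not code from the patch):

    struct folio *folio = page_folio(page);  /* folio that contains this page       */
    nr = folio_nr_pages(folio);              /* was: thp_nr_pages() / compound_nr() */
    if (folio_test_lru(folio))               /* was: PageLRU()                      */
    	folio_clear_active(folio);           /* was: ClearPageActive()              */
    folio_set_workingset(folio);             /* was: SetPageWorkingset()            */
    if (!folio_evictable(folio))             /* was: page_evictable()               */
    	folio_putback_lru(folio);            /* was: putback_lru_page()             */
    folio_put(folio);                        /* was: put_page()                     */

Each pairing appears somewhere in the preceding hunks; the conversion is intended to be behaviour-neutral, with the compound-page accounting simply moving into the folio helpers.
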
*/ void kswapd_stop(int nid) { diff --git a/mm/workingset.c b/mm/workingset.c index 592569a8974c4..a5e84862fc868 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -625,7 +625,7 @@ static int __init workingset_init(void) pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n", timestamp_bits, max_order, bucket_order); - ret = prealloc_shrinker(&workingset_shadow_shrinker); + ret = prealloc_shrinker(&workingset_shadow_shrinker, "mm-shadow"); if (ret) goto err; ret = __list_lru_init(&shadow_nodes, true, &shadow_nodes_key, diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 71d6edcbea488..a7bad79cbbed8 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -2169,7 +2169,8 @@ static int zs_register_shrinker(struct zs_pool *pool) pool->shrinker.batch = 0; pool->shrinker.seeks = DEFAULT_SEEKS; - return register_shrinker(&pool->shrinker); + return register_shrinker(&pool->shrinker, "mm-zspool:%s", + pool->name); } /** diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 0ec2f5906a27c..6b9f19122ec1a 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -18,6 +18,7 @@ #include #include #include +#include #include #include @@ -1143,7 +1144,13 @@ static int __register_pernet_operations(struct list_head *list, * setup_net() and cleanup_net() are not possible. */ for_each_net(net) { + struct mem_cgroup *old, *memcg; + + memcg = mem_cgroup_or_root(get_mem_cgroup_from_obj(net)); + old = set_active_memcg(memcg); error = ops_init(ops, net); + set_active_memcg(old); + mem_cgroup_put(memcg); if (error) goto out_undo; list_add_tail(&net->exit_list, &net_exit_list); diff --git a/net/core/page_pool.c b/net/core/page_pool.c index b74905fcc3a19..9b203d8660e47 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -16,7 +16,7 @@ #include #include #include -#include /* for __put_page() */ +#include /* for put_page() */ #include #include diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c index 682fcd24bf439..04e7b55fe0d96 100644 --- a/net/sunrpc/auth.c +++ b/net/sunrpc/auth.c @@ -874,7 +874,7 @@ int __init rpcauth_init_module(void) err = rpc_init_authunix(); if (err < 0) goto out1; - err = register_shrinker(&rpc_cred_shrinker); + err = register_shrinker(&rpc_cred_shrinker, "sunrpc_cred"); if (err < 0) goto out2; return 0; diff --git a/tools/cgroup/memcg_shrinker.py b/tools/cgroup/memcg_shrinker.py new file mode 100644 index 0000000000000..706ab27666a43 --- /dev/null +++ b/tools/cgroup/memcg_shrinker.py @@ -0,0 +1,71 @@ +#!/usr/bin/env python3 +# +# Copyright (C) 2022 Roman Gushchin +# Copyright (C) 2022 Meta + +import os +import argparse +import sys + + +def scan_cgroups(cgroup_root): + cgroups = {} + + for root, subdirs, _ in os.walk(cgroup_root): + for cgroup in subdirs: + path = os.path.join(root, cgroup) + ino = os.stat(path).st_ino + cgroups[ino] = path + + # (memcg ino, path) + return cgroups + + +def scan_shrinkers(shrinker_debugfs): + shrinkers = [] + + for root, subdirs, _ in os.walk(shrinker_debugfs): + for shrinker in subdirs: + count_path = os.path.join(root, shrinker, "count") + with open(count_path) as f: + for line in f.readlines(): + items = line.split(' ') + ino = int(items[0]) + # (count, shrinker, memcg ino) + shrinkers.append((int(items[1]), shrinker, ino)) + return shrinkers + + +def main(): + parser = argparse.ArgumentParser(description='Display biggest shrinkers') + parser.add_argument('-n', '--lines', type=int, help='Number of lines to print') + + args = parser.parse_args() + + cgroups = scan_cgroups("/sys/fs/cgroup/") + shrinkers = 
scan_shrinkers("/sys/kernel/debug/shrinker/") + shrinkers = sorted(shrinkers, reverse = True, key = lambda x: x[0]) + + n = 0 + for s in shrinkers: + count, name, ino = (s[0], s[1], s[2]) + if count == 0: + break + + if ino == 0 or ino == 1: + cg = "/" + else: + try: + cg = cgroups[ino] + except KeyError: + cg = "unknown (%d)" % ino + + print("%-8s %-20s %s" % (count, name, cg)) + + n += 1 + if args.lines and n >= args.lines: + break + + +if __name__ == '__main__': + main() diff --git a/tools/testing/memblock/linux/kmemleak.h b/tools/testing/memblock/linux/kmemleak.h index 462f8c5e8aa09..5fed13bb9ec40 100644 --- a/tools/testing/memblock/linux/kmemleak.h +++ b/tools/testing/memblock/linux/kmemleak.h @@ -7,7 +7,7 @@ static inline void kmemleak_free_part_phys(phys_addr_t phys, size_t size) } static inline void kmemleak_alloc_phys(phys_addr_t phys, size_t size, - int min_count, gfp_t gfp) + gfp_t gfp) { } diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 108587cb327a9..d9fa6a9ea5844 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -93,6 +93,7 @@ TEST_PROGS := run_vmtests.sh TEST_FILES := test_vmalloc.sh TEST_FILES += test_hmm.sh +TEST_FILES += va_128TBswitch.sh include ../lib.mk diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh index 41fce8bea9293..27c01c35c7a93 100755 --- a/tools/testing/selftests/vm/run_vmtests.sh +++ b/tools/testing/selftests/vm/run_vmtests.sh @@ -151,7 +151,7 @@ if [ $VADDR64 -ne 0 ]; then run_test ./virtual_address_range # virtual address 128TB switch test - run_test ./va_128TBswitch + run_test ./va_128TBswitch.sh fi # VADDR64 # vmalloc stability smoke test diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index 4bc24581760da..7c3f1b0ab468a 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -931,7 +931,7 @@ static int faulting_process(int signal_test) unsigned long split_nr_pages; unsigned long lastnr; struct sigaction act; - unsigned long signalled = 0; + volatile unsigned long signalled = 0; split_nr_pages = (nr_pages + 1) / 2; @@ -946,7 +946,7 @@ static int faulting_process(int signal_test) } for (nr = 0; nr < split_nr_pages; nr++) { - int steps = 1; + volatile int steps = 1; unsigned long offset = nr * page_size; if (signal_test) { diff --git a/tools/testing/selftests/vm/va_128TBswitch.sh b/tools/testing/selftests/vm/va_128TBswitch.sh new file mode 100644 index 0000000000000..41580751dc511 --- /dev/null +++ b/tools/testing/selftests/vm/va_128TBswitch.sh @@ -0,0 +1,54 @@ +#!/bin/bash +# SPDX-License-Identifier: GPL-2.0 +# +# Copyright (C) 2022 Adam Sindelar (Meta) +# +# This is a test for mmap behavior with 5-level paging. This script wraps the +# real test to check that the kernel is configured to support at least 5 +# pagetable levels. + +# 1 means the test failed +exitcode=1 + +# Kselftest framework requirement - SKIP code is 4. +ksft_skip=4 + +fail() +{ + echo "$1" + exit $exitcode +} + +check_supported_x86_64() +{ + local config="/proc/config.gz" + [[ -f "${config}" ]] || config="/boot/config-$(uname -r)" + [[ -f "${config}" ]] || fail "Cannot find kernel config in /proc or /boot" + + # gzip -dcfq automatically handles both compressed and plaintext input. + # See man 1 gzip under '-f'. 
+ local pg_table_levels=$(gzip -dcfq "${config}" | grep PGTABLE_LEVELS | cut -d'=' -f 2) + + if [[ "${pg_table_levels}" -lt 5 ]]; then + echo "$0: PGTABLE_LEVELS=${pg_table_levels}, must be >= 5 to run this test" + exit $ksft_skip + fi +} + +check_test_requirements() +{ + # The test supports x86_64 and powerpc64. We currently have no useful + # eligibility check for powerpc64, and the test itself will reject other + # architectures. + case `uname -m` in + "x86_64") + check_supported_x86_64 + ;; + *) + return 0 + ;; + esac +} + +check_test_requirements +./va_128TBswitch diff --git a/tools/vm/page_owner_sort.c b/tools/vm/page_owner_sort.c index c149427eb1c9b..74c3dcecf64d9 100644 --- a/tools/vm/page_owner_sort.c +++ b/tools/vm/page_owner_sort.c @@ -8,7 +8,7 @@ * Or sort by total memory: * ./page_owner_sort -m page_owner_full.txt sorted_page_owner.txt * - * See Documentation/vm/page_owner.rst + * See Documentation/mm/page_owner.rst */ #include diff --git a/tools/vm/slabinfo.c b/tools/vm/slabinfo.c index 5b98f3ee58a58..0fffaeedee767 100644 --- a/tools/vm/slabinfo.c +++ b/tools/vm/slabinfo.c @@ -125,7 +125,7 @@ static void usage(void) "-n|--numa Show NUMA information\n" "-N|--lines=K Show the first K slabs\n" "-o|--ops Show kmem_cache_ops\n" - "-P|--partial Sort by number of partial slabs\n" + "-P|--partial Sort by number of partial slabs\n" "-r|--report Detailed report on single slabs\n" "-s|--shrink Shrink slabs\n" "-S|--Size Sort by size\n" @@ -1067,15 +1067,27 @@ static void sort_slabs(void) for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) { int result; - if (sort_size) - result = slab_size(s1) < slab_size(s2); - else if (sort_active) - result = slab_activity(s1) < slab_activity(s2); - else if (sort_loss) - result = slab_waste(s1) < slab_waste(s2); - else if (sort_partial) - result = s1->partial < s2->partial; - else + if (sort_size) { + if (slab_size(s1) == slab_size(s2)) + result = strcasecmp(s1->name, s2->name); + else + result = slab_size(s1) < slab_size(s2); + } else if (sort_active) { + if (slab_activity(s1) == slab_activity(s2)) + result = strcasecmp(s1->name, s2->name); + else + result = slab_activity(s1) < slab_activity(s2); + } else if (sort_loss) { + if (slab_waste(s1) == slab_waste(s2)) + result = strcasecmp(s1->name, s2->name); + else + result = slab_waste(s1) < slab_waste(s2); + } else if (sort_partial) { + if (s1->partial == s2->partial) + result = strcasecmp(s1->name, s2->name); + else + result = s1->partial < s2->partial; + } else result = strcasecmp(s1->name, s2->name); if (show_inverted)