Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:
 "A large amount of MM, plenty more to come.

  Subsystems affected by this patch series:
   - tools
   - kthread
   - kbuild
   - scripts
   - ocfs2
   - vfs
   - mm: slub, kmemleak, pagecache, gup, swap, memcg, pagemap, mremap,
         sparsemem, kasan, pagealloc, vmscan, compaction, mempolicy,
         hugetlbfs, hugetlb"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (155 commits)
  include/linux/huge_mm.h: check PageTail in hpage_nr_pages even when !THP
  mm/hugetlb: fix build failure with HUGETLB_PAGE but not HUGEBTLBFS
  selftests/vm: fix map_hugetlb length used for testing read and write
  mm/hugetlb: remove unnecessary memory fetch in PageHeadHuge()
  mm/hugetlb.c: clean code by removing unnecessary initialization
  hugetlb_cgroup: add hugetlb_cgroup reservation docs
  hugetlb_cgroup: add hugetlb_cgroup reservation tests
  hugetlb: support file_region coalescing again
  hugetlb_cgroup: support noreserve mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb: disable region_add file_region coalescing
  hugetlb_cgroup: add reservation accounting for private mappings
  mm/hugetlb_cgroup: fix hugetlb_cgroup migration
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add hugetlb_cgroup reservation counter
  hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race
  hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
  mm/memblock.c: remove redundant assignment to variable max_addr
  mm: mempolicy: require at least one nodeid for MPOL_PREFERRED
  mm: mempolicy: use VM_BUG_ON_VMA in queue_pages_test_walk()
  ...
Linus Torvalds committed Apr 2, 2020
2 parents 7be9713 + 77d6b90 commit 6cad420
Showing 165 changed files with 4,901 additions and 2,257 deletions.
103 changes: 92 additions & 11 deletions Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -2,13 +2,6 @@
HugeTLB Controller
==================

The HugeTLB controller allows to limit the HugeTLB usage per control group and
enforces the controller limit during page fault. Since HugeTLB doesn't
support page reclaim, enforcing the limit at page fault time implies that,
the application will get SIGBUS signal if it tries to access HugeTLB pages
beyond its limit. This requires the application to know beforehand how much
HugeTLB pages it would require for its use.

HugeTLB controller can be created by first mounting the cgroup filesystem.

# mount -t cgroup -o hugetlb none /sys/fs/cgroup
@@ -28,10 +21,14 @@ process (bash) into it.

Brief summary of control files::

hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB limit
hugetlb.<hugepagesize>.rsvd.limit_in_bytes # set/show limit of "hugepagesize" hugetlb reservations
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults
hugetlb.<hugepagesize>.rsvd.usage_in_bytes # show current reservations and no-reserve faults for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.rsvd.failcnt # show the number of allocation failures due to HugeTLB reservation limit
hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb faults
hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
hugetlb.<hugepagesize>.failcnt # show the number of allocation failures due to HugeTLB usage limit

For a system supporting three hugepage sizes (64k, 32M and 1G), the control
files include::
@@ -40,11 +37,95 @@ files include::
hugetlb.1GB.max_usage_in_bytes
hugetlb.1GB.usage_in_bytes
hugetlb.1GB.failcnt
hugetlb.1GB.rsvd.limit_in_bytes
hugetlb.1GB.rsvd.max_usage_in_bytes
hugetlb.1GB.rsvd.usage_in_bytes
hugetlb.1GB.rsvd.failcnt
hugetlb.64KB.limit_in_bytes
hugetlb.64KB.max_usage_in_bytes
hugetlb.64KB.usage_in_bytes
hugetlb.64KB.failcnt
hugetlb.64KB.rsvd.limit_in_bytes
hugetlb.64KB.rsvd.max_usage_in_bytes
hugetlb.64KB.rsvd.usage_in_bytes
hugetlb.64KB.rsvd.failcnt
hugetlb.32MB.limit_in_bytes
hugetlb.32MB.max_usage_in_bytes
hugetlb.32MB.usage_in_bytes
hugetlb.32MB.failcnt
hugetlb.32MB.rsvd.limit_in_bytes
hugetlb.32MB.rsvd.max_usage_in_bytes
hugetlb.32MB.rsvd.usage_in_bytes
hugetlb.32MB.rsvd.failcnt


1. Page fault accounting

hugetlb.<hugepagesize>.limit_in_bytes
hugetlb.<hugepagesize>.max_usage_in_bytes
hugetlb.<hugepagesize>.usage_in_bytes
hugetlb.<hugepagesize>.failcnt

The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
control group and enforces the limit during page fault. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault time implies
that the application will get a SIGBUS signal if it tries to fault in HugeTLB
pages beyond its limit. Therefore the application needs to know exactly how many
HugeTLB pages it uses beforehand, and the sysadmin needs to make sure that
there are enough available on the machine for all the users, to avoid processes
getting SIGBUS.
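
For example, assuming the controller is mounted at /sys/fs/cgroup as shown
above and the system provides 1GB huge pages, a fault limit for a child group
could be set up along the following lines (a sketch only; the group name "app"
and the 2 GiB limit are placeholder values)::

    # mkdir /sys/fs/cgroup/app
    # echo 2147483648 > /sys/fs/cgroup/app/hugetlb.1GB.limit_in_bytes
    # echo $$ > /sys/fs/cgroup/app/tasks
    # cat /sys/fs/cgroup/app/hugetlb.1GB.usage_in_bytes
    # cat /sys/fs/cgroup/app/hugetlb.1GB.failcnt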


2. Reservation accounting

hugetlb.<hugepagesize>.rsvd.limit_in_bytes
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
hugetlb.<hugepagesize>.rsvd.failcnt

The HugeTLB controller allows users to limit HugeTLB reservations per control
group and enforces the controller limit at reservation time and at fault time
for HugeTLB memory for which no reservation exists. Since reservation limits
are enforced at reservation time (on mmap or shmget), reservation limits never
cause the application to get a SIGBUS signal if the memory was reserved
beforehand. For MAP_NORESERVE allocations, the reservation limit behaves the
same as the fault limit, enforcing memory usage at fault time and causing the
application to receive a SIGBUS if it crosses its limit.

Reservation limits are superior to the page fault limits described above, since
reservation limits are enforced at reservation time (on mmap or shmget) and
never cause the application to get a SIGBUS signal if the memory was reserved
beforehand. This allows for easier fallback to alternatives such as
non-HugeTLB memory, for example. In the case of page fault accounting, it is
very hard to avoid processes getting SIGBUS, since the sysadmin would need to
know precisely the HugeTLB usage of all the tasks in the system and make sure
there are enough pages to satisfy all requests. Avoiding tasks getting SIGBUS
on overcommitted systems is practically impossible with page fault accounting.
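
Continuing the sketch from the page fault example above (group name, hugepage
size and limit value are placeholders), a reservation limit is set and
inspected the same way; a process in the group whose mmap()/shmget() would push
reservations past the limit is expected to have that call itself fail, instead
of receiving SIGBUS later::

    # echo 2147483648 > /sys/fs/cgroup/app/hugetlb.1GB.rsvd.limit_in_bytes
    # cat /sys/fs/cgroup/app/hugetlb.1GB.rsvd.usage_in_bytes
    # cat /sys/fs/cgroup/app/hugetlb.1GB.rsvd.failcnt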


3. Caveats with shared memory

For shared HugeTLB memory, both HugeTLB reservations and page faults are charged
to the first task that causes the memory to be reserved or faulted, and all
subsequent uses of this reserved or faulted memory are done without charging.

Shared HugeTLB memory is only uncharged when it is unreserved or deallocated.
This is usually when the HugeTLB file is deleted, and not when the task that
caused the reservation or fault has exited.


4. Caveats with HugeTLB cgroup offline

When a HugeTLB cgroup goes offline with some reservations or faults still
charged to it, the behavior is as follows:

- the fault charges are charged to the parent HugeTLB cgroup (reparented), and
- the reservation charges remain on the offline HugeTLB cgroup.

This means that if a HugeTLB cgroup gets offlined while there are still HugeTLB
reservations charged to it, that cgroup persists as a zombie until all HugeTLB
reservations are uncharged. HugeTLB reservations behave in this manner to match
the memory controller, whose cgroups also persist as zombies until all charged
memory is uncharged. Also, the tracking of HugeTLB reservations is a bit more
complex compared to the tracking of HugeTLB faults, so it is significantly
harder to reparent reservations at offline time.
11 changes: 11 additions & 0 deletions Documentation/admin-guide/cgroup-v2.rst
@@ -188,6 +188,17 @@ cgroup v2 currently supports the following mount options.
modified through remount from the init namespace. The mount
option is ignored on non-init namespace mounts.

memory_recursiveprot

Recursively apply memory.min and memory.low protection to
entire subtrees, without requiring explicit downward
propagation into leaf cgroups. This allows protecting entire
subtrees from one another, while retaining free competition
within those subtrees. This should have been the default
behavior but is a mount-option to avoid regressing setups
relying on the original semantics (e.g. specifying bogusly
high 'bypass' protection values at higher tree levels).
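
As a quick sketch (the mount point is only an example), the option is passed
when mounting the cgroup2 hierarchy, or added later via remount::

    # mount -t cgroup2 -o memory_recursiveprot none /sys/fs/cgroup
    # mount -o remount,memory_recursiveprot /sys/fs/cgroup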


Organizing Processes and Threads
--------------------------------
3 changes: 3 additions & 0 deletions Documentation/admin-guide/sysctl/vm.rst
@@ -128,6 +128,9 @@ allowed to examine the unevictable lru (mlocked pages) for pages to compact.
This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory. Set to 0 to prevent
compaction from moving pages that are unevictable. Default value is 1.
On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due
to compaction, which would block the task from becoming active until the fault
is resolved.
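
For example, to disable compaction of unevictable pages on a non-RT kernel (a
sketch; either of the two equivalent forms can be used)::

    sysctl -w vm.compact_unevictable_allowed=0
    echo 0 > /proc/sys/vm/compact_unevictable_allowed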


dirty_background_bytes
3 changes: 3 additions & 0 deletions Documentation/core-api/mm-api.rst
@@ -73,6 +73,9 @@ File Mapping and Page Cache
.. kernel-doc:: mm/truncate.c
:export:

.. kernel-doc:: include/linux/pagemap.h
:internal:

Memory pools
============

86 changes: 55 additions & 31 deletions Documentation/core-api/pin_user_pages.rst
@@ -52,8 +52,22 @@ Which flags are set by each wrapper

For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup
flags the caller provides. The caller is required to pass in a non-null struct
pages* array, and the function then pin pages by incrementing each by a special
value. For now, that value is +1, just like get_user_pages*().::
pages* array, and the function then pins pages by incrementing each by a special
value: GUP_PIN_COUNTING_BIAS.

For huge pages (and in fact, any compound page of more than 2 pages), the
GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting
is achieved, by using the 3rd struct page in the compound page. A new struct
page field, hpage_pinned_refcount, has been added in order to support this.

This approach for compound pages avoids the counting upper limit problems that
are discussed below. Those limitations would have been aggravated severely by
huge pages, because each tail page adds a refcount to the head page. And in
fact, testing revealed that, without a separate hpage_pinned_refcount field,
page overflows were seen in some huge page stress tests.

This also means that huge pages and compound pages (of order > 1) do not suffer
from the false positives problem that is mentioned below.::

Function
--------
@@ -99,27 +113,6 @@ pages:
This also leads to limitations: there are only 31-10==21 bits available for a
counter that increments 10 bits at a time.

TODO: for 1GB and larger huge pages, this is cutting it close. That's because
when pin_user_pages() follows such pages, it increments the head page by "1"
(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for
pin_user_pages()) for each tail page. So if you have a 1GB huge page:

* There are 256K (18 bits) worth of 4 KB tail pages.
* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is,
10 bits at a time)
* There are 21 - 18 == 3 bits available to count. Except that there aren't,
because you need to allow for a few normal get_page() calls on the head page,
as well. Fortunately, the approach of using addition, rather than "hard"
bitfields, within page->_refcount, allows for sharing these bits gracefully.
But we're still looking at about 8 references.

This, however, is a missing feature more than anything else, because it's easily
solved by addressing an obvious inefficiency in the original get_user_pages()
approach of retrieving pages: stop treating all the pages as if they were
PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of
this, so some work is required. Once that's in place, this limitation mostly
disappears from view, because there will be ample refcounting range available.

* Callers must specifically request "dma-pinned tracking of pages". In other
words, just calling get_user_pages() will not suffice; a new set of functions,
pin_user_page() and related, must be used.
@@ -173,8 +166,8 @@ CASE 4: Pinning for struct page manipulation only
-------------------------------------------------
Here, normal GUP calls are sufficient, so neither flag needs to be set.

page_dma_pinned(): the whole point of pinning
=============================================
page_maybe_dma_pinned(): the whole point of pinning
===================================================

The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able
to query, "is this page DMA-pinned?" That allows code such as page_mkclean()
@@ -186,7 +179,7 @@ and debates (see the References at the end of this document). It's a TODO item
here: fill in the details once that's worked out. Meanwhile, it's safe to say
that having this available: ::

static inline bool page_dma_pinned(struct page *page)
static inline bool page_maybe_dma_pinned(struct page *page)

...is a prerequisite to solving the long-running gup+DMA problem.

@@ -215,18 +208,49 @@ has the following new calls to exercise the new pin*() wrapper functions:
You can monitor how many total dma-pinned pages have been acquired and released
since the system was booted, via two new /proc/vmstat entries: ::

/proc/vmstat/nr_foll_pin_requested
/proc/vmstat/nr_foll_pin_requested
/proc/vmstat/nr_foll_pin_acquired
/proc/vmstat/nr_foll_pin_released

Under normal conditions, these two values will be equal unless there are any
long-term [R]DMA pins in place, or during pin/unpin transitions.

* nr_foll_pin_acquired: This is the number of logical pins that have been
acquired since the system was powered on. For huge pages, the head page is
pinned once for each page (head page and each tail page) within the huge page.
This follows the same sort of behavior that get_user_pages() uses for huge
pages: the head page is refcounted once for each tail or head page in the huge
page, when get_user_pages() is applied to a huge page.

* nr_foll_pin_released: The number of logical pins that have been released since
the system was powered on. Note that pages are released (unpinned) on a
PAGE_SIZE granularity, even if the original pin was applied to a huge page.
Because of the pin count behavior described above in "nr_foll_pin_acquired",
the accounting balances out, so that after doing this::

pin_user_pages(huge_page);
for (each page in huge_page)
unpin_user_page(page);

...the following is expected::

nr_foll_pin_released == nr_foll_pin_acquired

(...unless it was already out of balance due to a long-term RDMA pin being in
place.)
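
For a quick check of the balance described above, both counters can be read
directly from /proc/vmstat; the command below is a small usage sketch::

    grep foll_pin /proc/vmstat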

Other diagnostics
=================

Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
because there is a noticeable performance drop in unpin_user_page(), when they
are activated.
dump_page() has been enhanced slightly, to handle these new counting fields, and
to better report on compound pages in general. Specifically, for compound pages
with order > 1, the exact (hpage_pinned_refcount) pincount is reported.

References
==========

* `Some slow progress on get_user_pages() (Apr 2, 2019) <https://lwn.net/Articles/784574/>`_
* `DMA and get_user_pages() (LPC: Dec 12, 2018) <https://lwn.net/Articles/774411/>`_
* `The trouble with get_user_pages() (Apr 30, 2018) <https://lwn.net/Articles/753027/>`_
* `LWN kernel index: get_user_pages() <https://lwn.net/Kernel/Index/#Memory_management-get_user_pages>`_

John Hubbard, October, 2019
11 changes: 0 additions & 11 deletions arch/alpha/include/asm/Kbuild
@@ -1,17 +1,6 @@
# SPDX-License-Identifier: GPL-2.0

generated-y += syscall_table.h
generic-y += compat.h
generic-y += exec.h
generic-y += export.h
generic-y += fb.h
generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += mmiowb.h
generic-y += preempt.h
generic-y += sections.h
generic-y += trace_clock.h
generic-y += current.h
generic-y += kprobes.h
6 changes: 3 additions & 3 deletions arch/alpha/mm/fault.c
@@ -89,7 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
const struct exception_table_entry *fixup;
int si_code = SEGV_MAPERR;
vm_fault_t fault;
unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
unsigned int flags = FAULT_FLAG_DEFAULT;

/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
(or is suppressed by the PALcode). Support that for older CPUs
@@ -150,7 +150,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
the fault. */
fault = handle_mm_fault(vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
if (fault_signal_pending(fault, regs))
return;

if (unlikely(fault & VM_FAULT_ERROR)) {
@@ -169,7 +169,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
else
current->min_flt++;
if (fault & VM_FAULT_RETRY) {
flags &= ~FAULT_FLAG_ALLOW_RETRY;
flags |= FAULT_FLAG_TRIED;

/* No need to up_read(&mm->mmap_sem) as we would
* have already released it in __lock_page_or_retry
21 changes: 0 additions & 21 deletions arch/arc/include/asm/Kbuild
@@ -1,28 +1,7 @@
# SPDX-License-Identifier: GPL-2.0
generic-y += bugs.h
generic-y += compat.h
generic-y += device.h
generic-y += div64.h
generic-y += dma-mapping.h
generic-y += emergency-restart.h
generic-y += extable.h
generic-y += ftrace.h
generic-y += hardirq.h
generic-y += hw_irq.h
generic-y += irq_regs.h
generic-y += irq_work.h
generic-y += kvm_para.h
generic-y += local.h
generic-y += local64.h
generic-y += mcs_spinlock.h
generic-y += mm-arch-hooks.h
generic-y += mmiowb.h
generic-y += parport.h
generic-y += percpu.h
generic-y += preempt.h
generic-y += topology.h
generic-y += trace_clock.h
generic-y += user.h
generic-y += vga.h
generic-y += word-at-a-time.h
generic-y += xor.h
