Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc things

 - ocfs2 updates

 - most of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
  hugetlbfs: dirty pages as they are added to pagecache
  mm: export add_swap_extent()
  mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
  tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
  mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
  mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
  mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
  mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
  mm/gup: cache dev_pagemap while pinning pages
  Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
  mm: return zero_resv_unavail optimization
  mm: zero remaining unavailable struct pages
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
  tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
  tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
  tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
  mm/gup_benchmark.c: add additional pinning methods
  mm/gup_benchmark.c: time put_page()
  mm: don't raise MEMCG_OOM event due to failed high-order allocation
  mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
  ...
Linus Torvalds committed Oct 27, 2018
2 parents 4904008 + 22146c3 commit 345671e
Showing 156 changed files with 3,400 additions and 1,988 deletions.
73 changes: 73 additions & 0 deletions Documentation/accounting/psi.txt
@@ -0,0 +1,73 @@
================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact it has on complex workloads
or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO:

some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.
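The two line formats shown above are regular enough to parse with a single `sscanf()` call. The following is a minimal userspace sketch, not part of this commit; `struct psi_line` and `psi_parse_line()` are hypothetical names chosen for illustration.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a pressure file ("some" or "full"). */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* averages exactly as printed */
	unsigned long long total;	/* cumulative stall time */
};

/*
 * Parse a line such as
 *   "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
 * Returns 0 on success, -1 if the line does not match the format.
 */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	if (n != 5)
		return -1;
	/* Only the two documented line types are accepted. */
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}
```

A caller would typically read /proc/pressure/memory or /proc/pressure/io line by line with `fgets()` and pass each line to the parser; the cpu file, as described above, carries only the "some" line.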

The ratios are tracked as recent trends over ten, sixty, and three
hundred second windows, which gives insight into short term events as
well as medium and long term trends. The total absolute stall time is
tracked and exported as well, to allow detection of latency spikes
which wouldn't necessarily make a dent in the time averages, or to
average trends over custom time frames.
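Averaging over a custom time frame, as mentioned above, amounts to sampling the total field twice and dividing the difference by the elapsed wall-clock time. The sketch below assumes that total counts microseconds (this revision of the text does not state the unit, so treat it as an assumption), and `psi_window_percent()` is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Share of wall-clock time spent stalled between two samples, in percent.
 *
 * total_prev_us, total_now_us: the "total=" field read at two points in
 *                              time (assumed to be in microseconds)
 * elapsed_us:                  wall-clock time between the two reads
 *
 * Returns a negative value if the inputs are unusable (zero-length
 * window, or a counter that appears to have gone backwards).
 */
static double psi_window_percent(uint64_t total_prev_us,
				 uint64_t total_now_us,
				 uint64_t elapsed_us)
{
	if (elapsed_us == 0 || total_now_us < total_prev_us)
		return -1.0;
	return 100.0 * (double)(total_now_us - total_prev_us) /
	       (double)elapsed_us;
}
```

Because the window is chosen by the caller, a short spike that barely moves avg10 can still be seen by sampling at a finer interval.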

Cgroup2 interface
=================

In a system with a CONFIG_CGROUP=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.
22 changes: 22 additions & 0 deletions Documentation/admin-guide/cgroup-v2.rst
@@ -966,6 +966,12 @@ All time durations are in microseconds.
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.

cpu.pressure
A read-only nested-key file which exists on non-root cgroups.

Shows pressure stall information for CPU. See
Documentation/accounting/psi.txt for details.


Memory
------
@@ -1127,6 +1133,10 @@ PAGE_SIZE multiple when read back.
disk readahead. For now OOM in memory cgroup kills
tasks iff shortage has happened inside page fault.

This event is not raised if the OOM killer is not
considered as an option, e.g. for failed high-order
allocations.

oom_kill
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
@@ -1271,6 +1281,12 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.

memory.pressure
A read-only nested-key file which exists on non-root cgroups.

Shows pressure stall information for memory. See
Documentation/accounting/psi.txt for details.


Usage Guidelines
~~~~~~~~~~~~~~~~
@@ -1408,6 +1424,12 @@ IO Interface Files

8:16 rbps=2097152 wbps=max riops=max wiops=max

io.pressure
A read-only nested-key file which exists on non-root cgroups.

Shows pressure stall information for IO. See
Documentation/accounting/psi.txt for details.


Writeback
~~~~~~~~~
12 changes: 12 additions & 0 deletions Documentation/admin-guide/kernel-parameters.txt
@@ -4851,6 +4851,18 @@
This is actually a boot loader parameter; the value is
passed to the kernel using a special protocol.

vm_debug[=options] [KNL] Available with CONFIG_DEBUG_VM=y.
May slow down system boot speed, especially when
enabled on systems with a large amount of memory.
All options are enabled by default, and this
interface is meant to allow for selectively
enabling or disabling specific virtual memory
debugging features.

Available options are:
P Enable page structure init time poisoning
- Disable all of the above options

vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact
size of <nn>. This can be used to increase the
minimum size (128MB on x86). It can also be used to
4 changes: 4 additions & 0 deletions Documentation/filesystems/proc.txt
@@ -858,6 +858,7 @@ Writeback: 0 kB
AnonPages: 861800 kB
Mapped: 280372 kB
Shmem: 644 kB
KReclaimable: 168048 kB
Slab: 284364 kB
SReclaimable: 159856 kB
SUnreclaim: 124508 kB
@@ -925,6 +926,9 @@ AnonHugePages: Non-file backed huge pages mapped into userspace page tables
ShmemHugePages: Memory used by shared memory (shmem) and tmpfs allocated
with huge pages
ShmemPmdMapped: Shared memory mapped into userspace with huge pages
KReclaimable: Kernel allocations that the kernel will attempt to reclaim
under memory pressure. Includes SReclaimable (below), and other
direct allocations with a shrinker.
Slab: in-kernel data structures cache
SReclaimable: Part of Slab, that might be reclaimed, such as caches
SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure
12 changes: 9 additions & 3 deletions Documentation/vm/slub.rst
@@ -36,9 +36,10 @@ debugging is enabled. Format:

slub_debug=<Debug-Options>
Enable options for all slabs
slub_debug=<Debug-Options>,<slab name>
Enable options only for select slabs

slub_debug=<Debug-Options>,<slab name1>,<slab name2>,...
Enable options only for select slabs (no spaces
after a comma)

Possible debug options are::

@@ -62,7 +63,12 @@ Trying to find an issue in the dentry cache? Try::

slub_debug=,dentry

to only enable debugging on the dentry cache.
to only enable debugging on the dentry cache. You may use an asterisk at the
end of the slab name, in order to cover all slabs with the same prefix. For
example, here's how you can poison the dentry cache as well as all kmalloc
slabs:

slub_debug=P,kmalloc-*,dentry
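The trailing-asterisk rule described above can be modelled as a plain prefix comparison over a comma-separated list. This is an illustrative userspace sketch only, not the kernel's implementation; `slab_pattern_matches()` and `slab_list_selects()` are hypothetical names, and the 64-byte name buffer is an arbitrary assumption.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if `name` is selected by `pattern`.
 * A trailing '*' matches every slab name that shares the prefix before
 * it; without an asterisk the names must match exactly.
 */
static bool slab_pattern_matches(const char *pattern, const char *name)
{
	size_t len = strlen(pattern);

	if (len > 0 && pattern[len - 1] == '*')
		return strncmp(pattern, name, len - 1) == 0;
	return strcmp(pattern, name) == 0;
}

/* Check `name` against a comma-separated list (no spaces after commas). */
static bool slab_list_selects(const char *list, const char *name)
{
	char buf[64];

	while (*list) {
		size_t n = strcspn(list, ",");	/* length of this entry */

		if (n > 0 && n < sizeof(buf)) {
			memcpy(buf, list, n);
			buf[n] = '\0';
			if (slab_pattern_matches(buf, name))
				return true;
		}
		list += n;
		if (*list == ',')
			list++;
	}
	return false;
}
```

With the list from the example above, `kmalloc-*,dentry` selects `kmalloc-64` and `dentry` but not `inode_cache`.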
Red zoning and tracking may realign the slab. We can just apply sanity checks
to the dentry cache with::
4 changes: 2 additions & 2 deletions Documentation/x86/pat.txt
@@ -90,12 +90,12 @@ pci proc | -- | -- | WC |
Advanced APIs for drivers
-------------------------
A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
vm_insert_pfn
vmf_insert_pfn

Drivers wanting to export some pages to userspace do it by using mmap
interface and a combination of
1) pgprot_noncached()
2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
2) io_remap_pfn_range() or remap_pfn_range() or vmf_insert_pfn()

With PAT support, a new API pgprot_writecombine is being added. So, drivers can
continue to use the above sequence, with either pgprot_noncached() or
2 changes: 2 additions & 0 deletions arch/alpha/Kconfig
@@ -31,6 +31,8 @@ config ALPHA
select ODD_RT_SIGACTION
select OLD_SIGSUSPEND
select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67
select HAVE_MEMBLOCK
select NO_BOOTMEM
help
The Alpha is a 64-bit general-purpose processor designed and
marketed by the Digital Equipment Corporation of blessed memory,
4 changes: 2 additions & 2 deletions arch/alpha/kernel/core_irongate.c
@@ -21,6 +21,7 @@
#include <linux/init.h>
#include <linux/initrd.h>
#include <linux/bootmem.h>
#include <linux/memblock.h>

#include <asm/ptrace.h>
#include <asm/cacheflush.h>
@@ -241,8 +242,7 @@ albacore_init_arch(void)
size / 1024);
}
#endif
reserve_bootmem_node(NODE_DATA(0), pci_mem, memtop -
pci_mem, BOOTMEM_DEFAULT);
memblock_reserve(pci_mem, memtop - pci_mem);
printk("irongate_init_arch: temporarily reserving "
"region %08lx-%08lx for PCI\n", pci_mem, memtop - 1);
}
98 changes: 12 additions & 86 deletions arch/alpha/kernel/setup.c
@@ -30,6 +30,7 @@
#include <linux/ioport.h>
#include <linux/platform_device.h>
#include <linux/bootmem.h>
#include <linux/memblock.h>
#include <linux/pci.h>
#include <linux/seq_file.h>
#include <linux/root_dev.h>
@@ -312,16 +313,16 @@ setup_memory(void *kernel_end)
{
struct memclust_struct * cluster;
struct memdesc_struct * memdesc;
unsigned long start_kernel_pfn, end_kernel_pfn;
unsigned long bootmap_size, bootmap_pages, bootmap_start;
unsigned long start, end;
unsigned long kernel_size;
unsigned long i;

/* Find free clusters, and init and free the bootmem accordingly. */
memdesc = (struct memdesc_struct *)
(hwrpb->mddt_offset + (unsigned long) hwrpb);

for_each_mem_cluster(memdesc, cluster, i) {
unsigned long end;

printk("memcluster %lu, usage %01lx, start %8lu, end %8lu\n",
i, cluster->usage, cluster->start_pfn,
cluster->start_pfn + cluster->numpages);
@@ -335,6 +336,9 @@ setup_memory(void *kernel_end)
end = cluster->start_pfn + cluster->numpages;
if (end > max_low_pfn)
max_low_pfn = end;

memblock_add(PFN_PHYS(cluster->start_pfn),
cluster->numpages << PAGE_SHIFT);
}

/*
@@ -363,87 +367,9 @@ setup_memory(void *kernel_end)
max_low_pfn = mem_size_limit;
}

/* Find the bounds of kernel memory. */
start_kernel_pfn = PFN_DOWN(KERNEL_START_PHYS);
end_kernel_pfn = PFN_UP(virt_to_phys(kernel_end));
bootmap_start = -1;

try_again:
if (max_low_pfn <= end_kernel_pfn)
panic("not enough memory to boot");

/* We need to know how many physically contiguous pages
we'll need for the bootmap. */
bootmap_pages = bootmem_bootmap_pages(max_low_pfn);

/* Now find a good region where to allocate the bootmap. */
for_each_mem_cluster(memdesc, cluster, i) {
if (cluster->usage & 3)
continue;

start = cluster->start_pfn;
end = start + cluster->numpages;
if (start >= max_low_pfn)
continue;
if (end > max_low_pfn)
end = max_low_pfn;
if (start < start_kernel_pfn) {
if (end > end_kernel_pfn
&& end - end_kernel_pfn >= bootmap_pages) {
bootmap_start = end_kernel_pfn;
break;
} else if (end > start_kernel_pfn)
end = start_kernel_pfn;
} else if (start < end_kernel_pfn)
start = end_kernel_pfn;
if (end - start >= bootmap_pages) {
bootmap_start = start;
break;
}
}

if (bootmap_start == ~0UL) {
max_low_pfn >>= 1;
goto try_again;
}

/* Allocate the bootmap and mark the whole MM as reserved. */
bootmap_size = init_bootmem(bootmap_start, max_low_pfn);

/* Mark the free regions. */
for_each_mem_cluster(memdesc, cluster, i) {
if (cluster->usage & 3)
continue;

start = cluster->start_pfn;
end = cluster->start_pfn + cluster->numpages;
if (start >= max_low_pfn)
continue;
if (end > max_low_pfn)
end = max_low_pfn;
if (start < start_kernel_pfn) {
if (end > end_kernel_pfn) {
free_bootmem(PFN_PHYS(start),
(PFN_PHYS(start_kernel_pfn)
- PFN_PHYS(start)));
printk("freeing pages %ld:%ld\n",
start, start_kernel_pfn);
start = end_kernel_pfn;
} else if (end > start_kernel_pfn)
end = start_kernel_pfn;
} else if (start < end_kernel_pfn)
start = end_kernel_pfn;
if (start >= end)
continue;

free_bootmem(PFN_PHYS(start), PFN_PHYS(end) - PFN_PHYS(start));
printk("freeing pages %ld:%ld\n", start, end);
}

/* Reserve the bootmap memory. */
reserve_bootmem(PFN_PHYS(bootmap_start), bootmap_size,
BOOTMEM_DEFAULT);
printk("reserving pages %ld:%ld\n", bootmap_start, bootmap_start+PFN_UP(bootmap_size));
/* Reserve the kernel memory. */
kernel_size = virt_to_phys(kernel_end) - KERNEL_START_PHYS;
memblock_reserve(KERNEL_START_PHYS, kernel_size);

#ifdef CONFIG_BLK_DEV_INITRD
initrd_start = INITRD_START;
@@ -459,8 +385,8 @@ setup_memory(void *kernel_end)
initrd_end,
phys_to_virt(PFN_PHYS(max_low_pfn)));
} else {
reserve_bootmem(virt_to_phys((void *)initrd_start),
INITRD_SIZE, BOOTMEM_DEFAULT);
memblock_reserve(virt_to_phys((void *)initrd_start),
INITRD_SIZE);
}
}
#endif /* CONFIG_BLK_DEV_INITRD */