Skip to content

Commit

Permalink
Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux…
Browse files Browse the repository at this point in the history
…/kernel/git/tip/tip

Pull x86 mm changes from Ingo Molnar:
 "The main change in this cycle is the rework of the TLB range flushing
  code, to simplify, fix and consolidate the code.  By Dave Hansen"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Set TLB flush tunable to sane value (33)
  x86/mm: New tunable for single vs full TLB flush
  x86/mm: Add tracepoints for TLB flushes
  x86/mm: Unify remote INVLPG code
  x86/mm: Fix missed global TLB flush stat
  x86/mm: Rip out complicated, out-of-date, buggy TLB flushing
  x86/mm: Clean up the TLB flushing code
  x86/smep: Be more informative when signalling an SMEP fault
  • Loading branch information
Linus Torvalds committed Aug 5, 2014
2 parents 76f09aa + a510247 commit ce47479
Show file tree
Hide file tree
Showing 11 changed files with 193 additions and 99 deletions.
75 changes: 75 additions & 0 deletions Documentation/x86/tlb.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
When the kernel unmaps or modified the attributes of a range of
memory, it has two choices:
1. Flush the entire TLB with a two-instruction sequence. This is
a quick operation, but it causes collateral damage: TLB entries
from areas other than the one we are trying to flush will be
destroyed and must be refilled later, at some cost.
2. Use the invlpg instruction to invalidate a single page at a
time. This could potentialy cost many more instructions, but
it is a much more precise operation, causing no collateral
damage to other TLB entries.

Which method to do depends on a few things:
1. The size of the flush being performed. A flush of the entire
address space is obviously better performed by flushing the
entire TLB than doing 2^48/PAGE_SIZE individual flushes.
2. The contents of the TLB. If the TLB is empty, then there will
be no collateral damage caused by doing the global flush, and
all of the individual flush will have ended up being wasted
work.
3. The size of the TLB. The larger the TLB, the more collateral
damage we do with a full flush. So, the larger the TLB, the
more attrative an individual flush looks. Data and
instructions have separate TLBs, as do different page sizes.
4. The microarchitecture. The TLB has become a multi-level
cache on modern CPUs, and the global flushes have become more
expensive relative to single-page flushes.

There is obviously no way the kernel can know all these things,
especially the contents of the TLB during a given flush. The
sizes of the flush will vary greatly depending on the workload as
well. There is essentially no "right" point to choose.

You may be doing too many individual invalidations if you see the
invlpg instruction (or instructions _near_ it) show up high in
profiles. If you believe that individual invalidations being
called too often, you can lower the tunable:

/sys/debug/kernel/x86/tlb_single_page_flush_ceiling

This will cause us to do the global flush for more cases.
Lowering it to 0 will disable the use of the individual flushes.
Setting it to 1 is a very conservative setting and it should
never need to be 0 under normal circumstances.

Despite the fact that a single individual flush on x86 is
guaranteed to flush a full 2MB [1], hugetlbfs always uses the full
flushes. THP is treated exactly the same as normal memory.

You might see invlpg inside of flush_tlb_mm_range() show up in
profiles, or you can use the trace_tlb_flush() tracepoints. to
determine how long the flush operations are taking.

Essentially, you are balancing the cycles you spend doing invlpg
with the cycles that you spend refilling the TLB later.

You can measure how expensive TLB refills are by using
performance counters and 'perf stat', like this:

perf stat -e
cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/

That works on an IvyBridge-era CPU (i5-3320M). Different CPUs
may have differently-named counters, but they should at least
be there in some form. You can use pmu-tools 'ocperf list'
(https://github.com/andikleen/pmu-tools) to find the right
counters for a given CPU.

1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
says: "One execution of INVLPG is sufficient even for a page
with size greater than 4 KBytes."
6 changes: 6 additions & 0 deletions arch/x86/include/asm/mmu_context.h
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,10 @@

#include <asm/desc.h>
#include <linux/atomic.h>
#include <linux/mm_types.h>

#include <trace/events/tlb.h>

#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
#include <asm/paravirt.h>
Expand Down Expand Up @@ -44,6 +48,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,

/* Re-load page tables */
load_cr3(next->pgd);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);

/* Stop flush ipis for the previous mm */
cpumask_clear_cpu(cpu, mm_cpumask(prev));
Expand Down Expand Up @@ -71,6 +76,7 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
* to make sure to use no freed page tables.
*/
load_cr3(next->pgd);
trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
load_LDT_nolock(&next->context);
}
}
Expand Down
1 change: 0 additions & 1 deletion arch/x86/include/asm/processor.h
Original file line number Diff line number Diff line change
Expand Up @@ -72,7 +72,6 @@ extern u16 __read_mostly tlb_lld_4k[NR_INFO];
extern u16 __read_mostly tlb_lld_2m[NR_INFO];
extern u16 __read_mostly tlb_lld_4m[NR_INFO];
extern u16 __read_mostly tlb_lld_1g[NR_INFO];
extern s8 __read_mostly tlb_flushall_shift;

/*
* CPU type and hardware bug flags. Kept separately for each CPU.
Expand Down
7 changes: 0 additions & 7 deletions arch/x86/kernel/cpu/amd.c
Original file line number Diff line number Diff line change
Expand Up @@ -724,11 +724,6 @@ static unsigned int amd_size_cache(struct cpuinfo_x86 *c, unsigned int size)
}
#endif

static void cpu_set_tlb_flushall_shift(struct cpuinfo_x86 *c)
{
tlb_flushall_shift = 6;
}

static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
{
u32 ebx, eax, ecx, edx;
Expand Down Expand Up @@ -776,8 +771,6 @@ static void cpu_detect_tlb_amd(struct cpuinfo_x86 *c)
tlb_lli_2m[ENTRIES] = eax & mask;

tlb_lli_4m[ENTRIES] = tlb_lli_2m[ENTRIES] >> 1;

cpu_set_tlb_flushall_shift(c);
}

static const struct cpu_dev amd_cpu_dev = {
Expand Down
13 changes: 2 additions & 11 deletions arch/x86/kernel/cpu/common.c
Original file line number Diff line number Diff line change
Expand Up @@ -481,26 +481,17 @@ u16 __read_mostly tlb_lld_2m[NR_INFO];
u16 __read_mostly tlb_lld_4m[NR_INFO];
u16 __read_mostly tlb_lld_1g[NR_INFO];

/*
* tlb_flushall_shift shows the balance point in replacing cr3 write
* with multiple 'invlpg'. It will do this replacement when
* flush_tlb_lines <= active_lines/2^tlb_flushall_shift.
* If tlb_flushall_shift is -1, means the replacement will be disabled.
*/
s8 __read_mostly tlb_flushall_shift = -1;

void cpu_detect_tlb(struct cpuinfo_x86 *c)
{
if (this_cpu->c_detect_tlb)
this_cpu->c_detect_tlb(c);

printk(KERN_INFO "Last level iTLB entries: 4KB %d, 2MB %d, 4MB %d\n"
"Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n"
"tlb_flushall_shift: %d\n",
"Last level dTLB entries: 4KB %d, 2MB %d, 4MB %d, 1GB %d\n",
tlb_lli_4k[ENTRIES], tlb_lli_2m[ENTRIES],
tlb_lli_4m[ENTRIES], tlb_lld_4k[ENTRIES],
tlb_lld_2m[ENTRIES], tlb_lld_4m[ENTRIES],
tlb_lld_1g[ENTRIES], tlb_flushall_shift);
tlb_lld_1g[ENTRIES]);
}

void detect_ht(struct cpuinfo_x86 *c)
Expand Down
26 changes: 0 additions & 26 deletions arch/x86/kernel/cpu/intel.c
Original file line number Diff line number Diff line change
Expand Up @@ -634,31 +634,6 @@ static void intel_tlb_lookup(const unsigned char desc)
}
}

static void intel_tlb_flushall_shift_set(struct cpuinfo_x86 *c)
{
switch ((c->x86 << 8) + c->x86_model) {
case 0x60f: /* original 65 nm celeron/pentium/core2/xeon, "Merom"/"Conroe" */
case 0x616: /* single-core 65 nm celeron/core2solo "Merom-L"/"Conroe-L" */
case 0x617: /* current 45 nm celeron/core2/xeon "Penryn"/"Wolfdale" */
case 0x61d: /* six-core 45 nm xeon "Dunnington" */
tlb_flushall_shift = -1;
break;
case 0x63a: /* Ivybridge */
tlb_flushall_shift = 2;
break;
case 0x61a: /* 45 nm nehalem, "Bloomfield" */
case 0x61e: /* 45 nm nehalem, "Lynnfield" */
case 0x625: /* 32 nm nehalem, "Clarkdale" */
case 0x62c: /* 32 nm nehalem, "Gulftown" */
case 0x62e: /* 45 nm nehalem-ex, "Beckton" */
case 0x62f: /* 32 nm Xeon E7 */
case 0x62a: /* SandyBridge */
case 0x62d: /* SandyBridge, "Romely-EP" */
default:
tlb_flushall_shift = 6;
}
}

static void intel_detect_tlb(struct cpuinfo_x86 *c)
{
int i, j, n;
Expand All @@ -683,7 +658,6 @@ static void intel_detect_tlb(struct cpuinfo_x86 *c)
for (j = 1 ; j < 16 ; j++)
intel_tlb_lookup(desc[j]);
}
intel_tlb_flushall_shift_set(c);
}

static const struct cpu_dev intel_cpu_dev = {
Expand Down
6 changes: 6 additions & 0 deletions arch/x86/mm/fault.c
Original file line number Diff line number Diff line change
Expand Up @@ -577,6 +577,8 @@ static int is_f00f_bug(struct pt_regs *regs, unsigned long address)

static const char nx_warning[] = KERN_CRIT
"kernel tried to execute NX-protected page - exploit attempt? (uid: %d)\n";
static const char smep_warning[] = KERN_CRIT
"unable to execute userspace code (SMEP?) (uid: %d)\n";

static void
show_fault_oops(struct pt_regs *regs, unsigned long error_code,
Expand All @@ -597,6 +599,10 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code,

if (pte && pte_present(*pte) && !pte_exec(*pte))
printk(nx_warning, from_kuid(&init_user_ns, current_uid()));
if (pte && pte_present(*pte) && pte_exec(*pte) &&
(pgd_flags(*pgd) & _PAGE_USER) &&
(read_cr4() & X86_CR4_SMEP))
printk(smep_warning, from_kuid(&init_user_ns, current_uid()));
}

printk(KERN_ALERT "BUG: unable to handle kernel ");
Expand Down
7 changes: 7 additions & 0 deletions arch/x86/mm/init.c
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,13 @@
#include <asm/dma.h> /* for MAX_DMA_PFN */
#include <asm/microcode.h>

/*
* We need to define the tracepoints somewhere, and tlb.c
* is only compied when SMP=y.
*/
#define CREATE_TRACE_POINTS
#include <trace/events/tlb.h>

#include "mm_internal.h"

static unsigned long __initdata pgt_buf_start;
Expand Down
Loading

0 comments on commit ce47479

Please sign in to comment.