Merge the lockless page table walk rework into next
This merges the lockless page table walk rework series from Aneesh.
Because it touches powerpc KVM code we are sharing it with the kvm-ppc
tree in our topic/ppc-kvm branch.

This is the cover letter from Aneesh:

Avoid IPI while updating page table entries.

Problem Summary:
Termination of a KVM guest with a large guest RAM config is slow, due
to the large number of IPIs caused by clearing level 1 PTE (THP)
entries. This is shown in the stack trace below.

- qemu-system-ppc  [kernel.vmlinux]            [k] smp_call_function_many
   - smp_call_function_many
      - 36.09% smp_call_function_many
           serialize_against_pte_lookup
           radix__pmdp_huge_get_and_clear
           zap_huge_pmd
           unmap_page_range
           unmap_vmas
           unmap_region
           __do_munmap
           __vm_munmap
           sys_munmap
          system_call
           __munmap
           qemu_ram_munmap
           qemu_anon_ram_free
           reclaim_ramblock
           call_rcu_thread
           qemu_thread_start
           start_thread
           __clone

Why we need to do an IPI when clearing PMD entries:
This was added as part of commit 13bd817 ("powerpc/thp: Serialize pmd clear against a linux page table walk").

serialize_against_pte_lookup() makes sure that all parallel lockless
page table walks complete before we convert a PMD PTE entry to a
regular PMD entry. We end up doing that conversion in the scenarios
below (a sketch of the IPI mechanism follows the list):

1) __split_huge_zero_page_pmd
2) do_huge_pmd_wp_page_fallback
3) MADV_DONTNEED running parallel to page faults.
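
For reference, here is a minimal sketch of the IPI-based serialization
being avoided (simplified from the powerpc book3s64 code, not the exact
implementation): clearing a THP PMD broadcasts a do-nothing IPI to every
CPU in the mm's cpumask, so any lockless walker running with interrupts
disabled is guaranteed to have finished before the entry is converted.

  static void do_nothing(void *unused)
  {
  }

  /* Sketch: one IPI broadcast per PMD conversion. With many THP
   * mappings this is what dominates the munmap path in the profile
   * above. */
  static void serialize_against_pte_lookup(struct mm_struct *mm)
  {
          smp_mb();
          smp_call_function_many(mm_cpumask(mm), do_nothing, NULL, 1);
  }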

local_irq_disable and lockless page table walk:

The lockless page table walk works on the assumption that we can
dereference page table contents without holding a lock. For this to
work, we need to make sure we read the page table contents atomically
and that page table pages are not freed/released while we are walking
them. We can achieve this by using RCU-based freeing for page table
pages, or, if the architecture implements broadcast tlbie, we can block
the IPI as we walk the page table pages.

To support both of the above schemes, the lockless page table walk is
done with interrupts disabled instead of under rcu_read_lock().

We have two interfaces for lockless page table walks: gup fast and
__find_linux_pte(). This patch series makes the __find_linux_pte() walk
safe against the conversion of a PMD PTE to a regular PMD.
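
As an illustration (a generic sketch, not any specific caller in this
series; do_something_with() stands in for the caller's use of the
value), the __find_linux_pte() pattern being hardened looks like this:
interrupts are disabled so page table pages cannot be freed under the
walker, the entry is read exactly once, and the pointer is never
dereferenced again.

  pte_t *ptep, pte = __pte(0);
  unsigned int shift;
  unsigned long flags;

  local_irq_save(flags);          /* blocks the freeing IPI / THP conversion */
  ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
  if (ptep)
          pte = READ_ONCE(*ptep); /* single atomic read of the entry */
  local_irq_restore(flags);

  if (pte_present(pte))
          do_something_with(pte, shift);  /* only the local, possibly stale, copy is used */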

gup fast:

gup fast is already safe against THP splits because the kernel now
differentiates between a pmd split and a compound page split. gup fast
can run in parallel to a pmd split, and we prevent gup fast from running
in parallel to a hugepage split by freezing the page refcount and
failing the speculative page ref increment.
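
The mechanism, roughly (paraphrased from mm/gup.c of this era; the
helper name and exact flow are approximations, not part of this series):
gup fast takes a speculative reference on the page and then re-reads the
PTE. A compound page split freezes the refcount, so the speculative grab
fails; a pmd split changes the entry, so the re-check fails. Either way
gup fast backs off instead of using a page that is being split.

  pte = READ_ONCE(*ptep);
  page = pte_page(pte);
  head = try_get_compound_head(page, 1);  /* fails if refcount was frozen by a split */
  if (!head)
          goto pte_unmap;
  if (unlikely(pte_val(pte) != pte_val(*ptep))) {  /* entry changed under us */
          put_page(head);
          goto pte_unmap;
  }
  /* safe to record the page */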

Similar to how gup is made safe against a parallel pmd split, this
patch series updates the __find_linux_pte() callers to be safe against
a parallel pmd split. We do that by enforcing the following rules
(see the before/after example below the list).

1) Don't reload the pte value, because it can be updated in
   parallel.
2) Code should be able to work with a stale PTE value, not just the
   most recent one; i.e. the pte value we are looking at may not be
   the latest value in the page table.
3) Before looking at the pte value, check for the _PAGE_PTE bit. We
   now do this as part of the pte_present() check.
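
As a concrete before/after, abridged from the arch/powerpc/kernel/mce_power.c
change in the diff below: the entry is read once with READ_ONCE() and only
the local copy is inspected afterwards.

  /* Before: only a NULL check, then repeated dereferences of *ptep */
  if (!ptep || pte_special(*ptep)) {
          pfn = ULONG_MAX;
          goto out;
  }
  pfn = pte_pfn(*ptep);

  /* After: one READ_ONCE(), then only the local (possibly stale) copy is
   * used; pte_present() now also checks _PAGE_PTE (rule 3). */
  pte = READ_ONCE(*ptep);
  if (!pte_present(pte) || pte_special(pte)) {
          pfn = ULONG_MAX;
          goto out;
  }
  pfn = pte_pfn(pte);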

Performance:

This speeds up QEMU guest RAM delete/unplug time, as shown below.
128 core, 496GB guest:

Without patch:
  munmap start: timer = 13162 ms, PID=7684
  munmap finish: timer = 95312 ms, PID=7684 - delta = 82150 ms

With patch (up to removing the IPI):
  munmap start: timer = 196449 ms, PID=6681
  munmap finish: timer = 196488 ms, PID=6681 - delta = 39ms

With patch (with the TLB invalidate added in pmdp_huge_get_and_clear_full()):
  munmap start: timer = 196345 ms, PID=6879
  munmap finish: timer = 196714 ms, PID=6879 - delta = 369ms

Link: https://lore.kernel.org/r/20200505071729.54912-1-aneesh.kumar@linux.ibm.com
Michael Ellerman committed May 6, 2020
2 parents 140777a + 75358ea commit 1f12096
Showing 22 changed files with 279 additions and 290 deletions.
20 changes: 15 additions & 5 deletions arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -553,6 +553,12 @@ static inline pte_t pte_clear_savedwrite(pte_t pte)
}
#endif /* CONFIG_NUMA_BALANCING */

static inline bool pte_hw_valid(pte_t pte)
{
return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE)) ==
cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE);
}

static inline int pte_present(pte_t pte)
{
/*
@@ -561,12 +567,11 @@ static inline int pte_present(pte_t pte)
* invalid during ptep_set_access_flags. Hence we look for _PAGE_INVALID
* if we find _PAGE_PRESENT cleared.
*/
return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_INVALID));
}

static inline bool pte_hw_valid(pte_t pte)
{
return !!(pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT));
if (pte_hw_valid(pte))
return true;
return (pte_raw(pte) & cpu_to_be64(_PAGE_INVALID | _PAGE_PTE)) ==
cpu_to_be64(_PAGE_INVALID | _PAGE_PTE);
}

#ifdef CONFIG_PPC_MEM_KEYS
@@ -1260,6 +1265,11 @@ static inline pmd_t pmdp_collapse_flush(struct vm_area_struct *vma,
}
#define pmdp_collapse_flush pmdp_collapse_flush

#define __HAVE_ARCH_PMDP_HUGE_GET_AND_CLEAR_FULL
pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmdp, int full);

#define __HAVE_ARCH_PGTABLE_DEPOSIT
static inline void pgtable_trans_huge_deposit(struct mm_struct *mm,
pmd_t *pmdp, pgtable_t pgtable)
3 changes: 1 addition & 2 deletions arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -113,8 +113,7 @@ static inline void hash__flush_tlb_kernel_range(unsigned long start,
struct mmu_gather;
extern void hash__tlb_flush(struct mmu_gather *tlb);
/* Private function for use by PCI IO mapping code */
extern void __flush_hash_table_range(struct mm_struct *mm, unsigned long start,
unsigned long end);
extern void __flush_hash_table_range(unsigned long start, unsigned long end);
extern void flush_tlb_pmd_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr);
#endif /* _ASM_POWERPC_BOOK3S_64_TLBFLUSH_HASH_H */
2 changes: 1 addition & 1 deletion arch/powerpc/include/asm/kvm_book3s.h
@@ -198,7 +198,7 @@ extern void kvmppc_unmap_pte(struct kvm *kvm, pte_t *pte, unsigned long gpa,
unsigned int shift,
const struct kvm_memory_slot *memslot,
unsigned int lpid);
extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable,
extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested,
bool writing, unsigned long gpa,
unsigned int lpid);
extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
34 changes: 33 additions & 1 deletion arch/powerpc/include/asm/kvm_book3s_64.h
@@ -14,6 +14,7 @@
#include <asm/book3s/64/mmu-hash.h>
#include <asm/cpu_has_feature.h>
#include <asm/ppc-opcode.h>
#include <asm/pte-walk.h>

#ifdef CONFIG_PPC_PSERIES
static inline bool kvmhv_on_pseries(void)
@@ -434,7 +435,7 @@ static inline pte_t kvmppc_read_update_linux_pte(pte_t *ptep, int writing)
continue;
}
/* If pte is not present return None */
if (unlikely(!(pte_val(old_pte) & _PAGE_PRESENT)))
if (unlikely(!pte_present(old_pte)))
return __pte(0);

new_pte = pte_mkyoung(old_pte);
@@ -634,6 +635,37 @@ extern void kvmhv_remove_nest_rmap_range(struct kvm *kvm,
unsigned long gpa, unsigned long hpa,
unsigned long nbytes);

static inline pte_t *find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
unsigned *hshift)
{
pte_t *pte;

VM_WARN(!spin_is_locked(&kvm->mmu_lock),
"%s called with kvm mmu_lock not held \n", __func__);
pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift);

return pte;
}

static inline pte_t *find_kvm_host_pte(struct kvm *kvm, unsigned long mmu_seq,
unsigned long ea, unsigned *hshift)
{
pte_t *pte;

VM_WARN(!spin_is_locked(&kvm->mmu_lock),
"%s called with kvm mmu_lock not held \n", __func__);

if (mmu_notifier_retry(kvm, mmu_seq))
return NULL;

pte = __find_linux_pte(kvm->mm->pgd, ea, NULL, hshift);

return pte;
}

extern pte_t *find_kvm_nested_guest_pte(struct kvm *kvm, unsigned long lpid,
unsigned long ea, unsigned *hshift);

#endif /* CONFIG_KVM_BOOK3S_HV_POSSIBLE */

#endif /* __ASM_KVM_BOOK3S_64_H__ */
9 changes: 0 additions & 9 deletions arch/powerpc/include/asm/mmu.h
@@ -291,15 +291,6 @@ static inline bool early_radix_enabled(void)
}
#endif

#ifdef CONFIG_PPC_MEM_KEYS
extern u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address);
#else
static inline u16 get_mm_addr_key(struct mm_struct *mm, unsigned long address)
{
return 0;
}
#endif /* CONFIG_PPC_MEM_KEYS */

#ifdef CONFIG_STRICT_KERNEL_RWX
static inline bool strict_kernel_rwx_enabled(void)
{
14 changes: 9 additions & 5 deletions arch/powerpc/kernel/mce_power.c
@@ -27,7 +27,7 @@
*/
unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)
{
pte_t *ptep;
pte_t *ptep, pte;
unsigned int shift;
unsigned long pfn, flags;
struct mm_struct *mm;
@@ -39,19 +39,23 @@ unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr)

local_irq_save(flags);
ptep = __find_linux_pte(mm->pgd, addr, NULL, &shift);
if (!ptep) {
pfn = ULONG_MAX;
goto out;
}
pte = READ_ONCE(*ptep);

if (!ptep || pte_special(*ptep)) {
if (!pte_present(pte) || pte_special(pte)) {
pfn = ULONG_MAX;
goto out;
}

if (shift <= PAGE_SHIFT)
pfn = pte_pfn(*ptep);
pfn = pte_pfn(pte);
else {
unsigned long rpnmask = (1ul << shift) - PAGE_SIZE;
pfn = pte_pfn(__pte(pte_val(*ptep) | (addr & rpnmask)));
pfn = pte_pfn(__pte(pte_val(pte) | (addr & rpnmask)));
}

out:
local_irq_restore(flags);
return pfn;
2 changes: 1 addition & 1 deletion arch/powerpc/kernel/pci_64.c
@@ -100,7 +100,7 @@ int pcibios_unmap_io_space(struct pci_bus *bus)
pci_name(bus->self));

#ifdef CONFIG_PPC_BOOK3S_64
__flush_hash_table_range(&init_mm, res->start + _IO_BASE,
__flush_hash_table_range(res->start + _IO_BASE,
res->end + _IO_BASE + 1);
#endif
return 0;
18 changes: 9 additions & 9 deletions arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -281,11 +281,10 @@ static long kvmppc_virtmode_do_h_enter(struct kvm *kvm, unsigned long flags,
{
long ret;

/* Protect linux PTE lookup from page table destruction */
rcu_read_lock_sched(); /* this disables preemption too */
preempt_disable();
ret = kvmppc_do_h_enter(kvm, flags, pte_index, pteh, ptel,
kvm->mm->pgd, false, pte_idx_ret);
rcu_read_unlock_sched();
preempt_enable();
if (ret == H_TOO_HARD) {
/* this can't happen */
pr_err("KVM: Oops, kvmppc_h_enter returned too hard!\n");
@@ -602,20 +601,21 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
* Read the PTE from the process' radix tree and use that
* so we get the shift and attribute bits.
*/
local_irq_disable();
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
spin_lock(&kvm->mmu_lock);
ptep = find_kvm_host_pte(kvm, mmu_seq, hva, &shift);
pte = __pte(0);
if (ptep)
pte = READ_ONCE(*ptep);
spin_unlock(&kvm->mmu_lock);
/*
* If the PTE disappeared temporarily due to a THP
* collapse, just return and let the guest try again.
*/
if (!ptep) {
local_irq_enable();
if (!pte_present(pte)) {
if (page)
put_page(page);
return RESUME_GUEST;
}
pte = *ptep;
local_irq_enable();
hpa = pte_pfn(pte) << PAGE_SHIFT;
pte_size = PAGE_SIZE;
if (shift)
43 changes: 22 additions & 21 deletions arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -735,7 +735,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
return ret;
}

bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing,
bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested, bool writing,
unsigned long gpa, unsigned int lpid)
{
unsigned long pgflags;
@@ -750,12 +750,12 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, pgd_t *pgtable, bool writing,
pgflags = _PAGE_ACCESSED;
if (writing)
pgflags |= _PAGE_DIRTY;
/*
* We are walking the secondary (partition-scoped) page table here.
* We can do this without disabling irq because the Linux MM
* subsystem doesn't do THP splits and collapses on this tree.
*/
ptep = __find_linux_pte(pgtable, gpa, NULL, &shift);

if (nested)
ptep = find_kvm_nested_guest_pte(kvm, lpid, gpa, &shift);
else
ptep = find_kvm_secondary_pte(kvm, gpa, &shift);

if (ptep && pte_present(*ptep) && (!writing || pte_write(*ptep))) {
kvmppc_radix_update_pte(kvm, ptep, 0, pgflags, gpa, shift);
return true;
@@ -813,20 +813,21 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
* Read the PTE from the process' radix tree and use that
* so we get the shift and attribute bits.
*/
local_irq_disable();
ptep = __find_linux_pte(vcpu->arch.pgdir, hva, NULL, &shift);
spin_lock(&kvm->mmu_lock);
ptep = find_kvm_host_pte(kvm, mmu_seq, hva, &shift);
pte = __pte(0);
if (ptep)
pte = READ_ONCE(*ptep);
spin_unlock(&kvm->mmu_lock);
/*
* If the PTE disappeared temporarily due to a THP
* collapse, just return and let the guest try again.
*/
if (!ptep) {
local_irq_enable();
if (!pte_present(pte)) {
if (page)
put_page(page);
return RESUME_GUEST;
}
pte = *ptep;
local_irq_enable();

/* If we're logging dirty pages, always map single pages */
large_enable = !(memslot->flags & KVM_MEM_LOG_DIRTY_PAGES);
@@ -948,8 +949,8 @@ int kvmppc_book3s_radix_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
/* Failed to set the reference/change bits */
if (dsisr & DSISR_SET_RC) {
spin_lock(&kvm->mmu_lock);
if (kvmppc_hv_handle_set_rc(kvm, kvm->arch.pgtable,
writing, gpa, kvm->arch.lpid))
if (kvmppc_hv_handle_set_rc(kvm, false, writing,
gpa, kvm->arch.lpid))
dsisr &= ~DSISR_SET_RC;
spin_unlock(&kvm->mmu_lock);

Expand Down Expand Up @@ -980,11 +981,11 @@ int kvm_unmap_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
return 0;
}

ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
kvm->arch.lpid);
return 0;
return 0;
}

/* Called with kvm->mmu_lock held */
Expand All @@ -1000,7 +1001,7 @@ int kvm_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ref;

ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep)) {
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_ACCESSED, 0,
gpa, shift);
@@ -1027,7 +1028,7 @@ int kvm_test_age_radix(struct kvm *kvm, struct kvm_memory_slot *memslot,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ref;

ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep) && pte_young(*ptep))
ref = 1;
return ref;
@@ -1047,7 +1048,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ret;

ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
ret = 1;
if (shift)
Expand Down Expand Up @@ -1108,7 +1109,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm,
gpa = memslot->base_gfn << PAGE_SHIFT;
spin_lock(&kvm->mmu_lock);
for (n = memslot->npages; n; --n) {
ptep = __find_linux_pte(kvm->arch.pgtable, gpa, NULL, &shift);
ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
if (ptep && pte_present(*ptep))
kvmppc_unmap_pte(kvm, ptep, gpa, shift, memslot,
kvm->arch.lpid);
