Commit

Merge branch 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm

* 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (198 commits)
  KVM: VMX: Fix host GDT.LIMIT corruption
  KVM: MMU: using __xchg_spte more smarter
  KVM: MMU: cleanup spte set and accssed/dirty tracking
  KVM: MMU: don't atomicly set spte if it's not present
  KVM: MMU: fix page dirty tracking lost while sync page
  KVM: MMU: fix broken page accessed tracking with ept enabled
  KVM: MMU: add missing reserved bits check in speculative path
  KVM: MMU: fix mmu notifier invalidate handler for huge spte
  KVM: x86 emulator: fix xchg instruction emulation
  KVM: x86: Call mask notifiers from pic
  KVM: x86: never re-execute instruction with enabled tdp
  KVM: Document KVM_GET_SUPPORTED_CPUID2 ioctl
  KVM: x86: emulator: inc/dec can have lock prefix
  KVM: MMU: Eliminate redundant temporaries in FNAME(fetch)
  KVM: MMU: Validate all gptes during fetch, not just those used for new pages
  KVM: MMU: Simplify spte fetch() function
  KVM: MMU: Add gpte_valid() helper
  KVM: MMU: Add validate_direct_spte() helper
  KVM: MMU: Add drop_large_spte() helper
  KVM: MMU: Use __set_spte to link shadow pages
  ...
Linus Torvalds committed Aug 4, 2010
2 parents fe445c6 + 3444d7d commit 5e83f6f
Showing 63 changed files with 3,328 additions and 2,103 deletions.
21 changes: 0 additions & 21 deletions Documentation/feature-removal-schedule.txt
@@ -487,17 +487,6 @@ Who: Jan Kiszka <jan.kiszka@web.de>

----------------------------

What: KVM memory aliases support
When: July 2010
Why: Memory aliasing support is used for speeding up guest vga access
through the vga windows.

Modern userspace no longer uses this feature, so it's just bitrotted
code and can be removed with no impact.
Who: Avi Kivity <avi@redhat.com>

----------------------------

What: xtime, wall_to_monotonic
When: 2.6.36+
Files: kernel/time/timekeeping.c include/linux/time.h
@@ -508,16 +497,6 @@ Who: John Stultz <johnstul@us.ibm.com>

----------------------------

What: KVM kernel-allocated memory slots
When: July 2010
Why: Since 2.6.25, kvm supports user-allocated memory slots, which are
much more flexible than kernel-allocated slots. All current userspace
supports the newer interface and this code can be removed with no
impact.
Who: Avi Kivity <avi@redhat.com>

----------------------------

What: KVM paravirt mmu host support
When: January 2011
Why: The paravirt mmu host support is slower than non-paravirt mmu, both
208 changes: 174 additions & 34 deletions Documentation/kvm/api.txt
@@ -126,6 +126,10 @@ user fills in the size of the indices array in nmsrs, and in return
kvm adjusts nmsrs to reflect the actual number of msrs and fills in
the indices array with their numbers.

Note: if kvm indicates support for MCE (KVM_CAP_MCE), then the MCE bank MSRs are
not returned in the MSR list, as different vcpus can have a different number
of banks, as set via the KVM_X86_SETUP_MCE ioctl.

4.4 KVM_CHECK_EXTENSION

Capability: basic
@@ -160,29 +164,7 @@ Type: vm ioctl
Parameters: struct kvm_memory_region (in)
Returns: 0 on success, -1 on error

-struct kvm_memory_region {
-__u32 slot;
-__u32 flags;
-__u64 guest_phys_addr;
-__u64 memory_size; /* bytes */
-};
-
-/* for kvm_memory_region::flags */
-#define KVM_MEM_LOG_DIRTY_PAGES 1UL
-
-This ioctl allows the user to create or modify a guest physical memory
-slot. When changing an existing slot, it may be moved in the guest
-physical memory space, or its flags may be modified. It may not be
-resized. Slots may not overlap.
-
-The flags field supports just one flag, KVM_MEM_LOG_DIRTY_PAGES, which
-instructs kvm to keep track of writes to memory within the slot. See
-the KVM_GET_DIRTY_LOG ioctl.
-
-It is recommended to use the KVM_SET_USER_MEMORY_REGION ioctl instead
-of this API, if available. This newer API allows placing guest memory
-at specified locations in the host address space, yielding better
-control and easy access.
+This ioctl is obsolete and has been removed.

4.6 KVM_CREATE_VCPU

@@ -226,17 +208,7 @@ Type: vm ioctl
Parameters: struct kvm_memory_alias (in)
Returns: 0 (success), -1 (error)

-struct kvm_memory_alias {
-__u32 slot; /* this has a different namespace than memory slots */
-__u32 flags;
-__u64 guest_phys_addr;
-__u64 memory_size;
-__u64 target_phys_addr;
-};
-
-Defines a guest physical address space region as an alias to another
-region. Useful for aliased address, for example the VGA low memory
-window. Should not be used with userspace memory.
+This ioctl is obsolete and has been removed.

4.9 KVM_RUN

@@ -892,6 +864,174 @@ arguments.
This ioctl is only useful after KVM_CREATE_IRQCHIP. Without an in-kernel
irqchip, the multiprocessing state must be maintained by userspace.

4.39 KVM_SET_IDENTITY_MAP_ADDR

Capability: KVM_CAP_SET_IDENTITY_MAP_ADDR
Architectures: x86
Type: vm ioctl
Parameters: unsigned long identity (in)
Returns: 0 on success, -1 on error

This ioctl defines the physical address of a one-page region in the guest
physical address space. The region must be within the first 4GB of the
guest physical address space and must not conflict with any memory slot
or any mmio address. The guest may malfunction if it accesses this memory
region.

This ioctl is required on Intel-based hosts because of a quirk in the
virtualization implementation (see the internals documentation when it pops
into existence).
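
The placement rule for the identity map page can be sketched as a small check
in C. This is only an illustration: struct slot_range and identity_addr_ok()
are hypothetical names that exist for this example, the 4 KiB size assumes the
x86 base page size, and mmio ranges would have to be tested the same way as
memory slots.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IDENTITY_PAGE_SIZE 4096ULL      /* one x86 base page (assumption) */
#define FOUR_GB            (1ULL << 32)

/* Hypothetical description of a guest physical range that is already in use. */
struct slot_range {
    uint64_t guest_phys_addr;
    uint64_t memory_size;               /* bytes */
};

/* Return true if a one-page region at addr lies below 4GB and overlaps no slot. */
bool identity_addr_ok(uint64_t addr, const struct slot_range *slots, size_t n)
{
    uint64_t end = addr + IDENTITY_PAGE_SIZE;

    if (end < addr || end > FOUR_GB)    /* wrapped around or crosses 4GB */
        return false;

    for (size_t i = 0; i < n; i++) {
        uint64_t s = slots[i].guest_phys_addr;
        uint64_t e = s + slots[i].memory_size;

        if (addr < e && s < end)        /* half-open ranges intersect */
            return false;
    }
    return true;
}
```

A caller would run a check like this before passing the address to
KVM_SET_IDENTITY_MAP_ADDR.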

4.40 KVM_SET_BOOT_CPU_ID

Capability: KVM_CAP_SET_BOOT_CPU_ID
Architectures: x86, ia64
Type: vm ioctl
Parameters: unsigned long vcpu_id
Returns: 0 on success, -1 on error

Define which vcpu is the Bootstrap Processor (BSP). Values are the same
as the vcpu id in KVM_CREATE_VCPU. If this ioctl is not called, the default
is vcpu 0.

4.41 KVM_GET_XSAVE

Capability: KVM_CAP_XSAVE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xsave (out)
Returns: 0 on success, -1 on error

struct kvm_xsave {
__u32 region[1024];
};

This ioctl copies the current vcpu's xsave struct to userspace.

4.42 KVM_SET_XSAVE

Capability: KVM_CAP_XSAVE
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xsave (in)
Returns: 0 on success, -1 on error

struct kvm_xsave {
__u32 region[1024];
};

This ioctl copies userspace's xsave struct to the kernel.

4.43 KVM_GET_XCRS

Capability: KVM_CAP_XCRS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xcrs (out)
Returns: 0 on success, -1 on error

struct kvm_xcr {
__u32 xcr;
__u32 reserved;
__u64 value;
};

struct kvm_xcrs {
__u32 nr_xcrs;
__u32 flags;
struct kvm_xcr xcrs[KVM_MAX_XCRS];
__u64 padding[16];
};

This ioctl copies the current vcpu's xcrs to userspace.
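
Once the structure has been filled in, userspace typically wants the value of
one particular register, such as XCR0. The sketch below shows such a lookup in
C. It uses local mirror types (xcr_model, xcrs_model) instead of the kernel
headers so that it stands alone; the names are hypothetical, and MAX_XCRS is
assumed to equal KVM_MAX_XCRS.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_XCRS 16                     /* assumed to match KVM_MAX_XCRS */

/* Local mirrors of struct kvm_xcr and struct kvm_xcrs as documented above. */
struct xcr_model {
    uint32_t xcr;
    uint32_t reserved;
    uint64_t value;
};

struct xcrs_model {
    uint32_t nr_xcrs;
    uint32_t flags;
    struct xcr_model xcrs[MAX_XCRS];
    uint64_t padding[16];
};

/* Look up register number `index`; store its value and return true if found. */
bool find_xcr(const struct xcrs_model *x, uint32_t index, uint64_t *value)
{
    uint32_t n = x->nr_xcrs < MAX_XCRS ? x->nr_xcrs : MAX_XCRS;

    for (uint32_t i = 0; i < n; i++) {
        if (x->xcrs[i].xcr == index) {
            *value = x->xcrs[i].value;
            return true;
        }
    }
    return false;
}
```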

4.44 KVM_SET_XCRS

Capability: KVM_CAP_XCRS
Architectures: x86
Type: vcpu ioctl
Parameters: struct kvm_xcrs (in)
Returns: 0 on success, -1 on error

struct kvm_xcr {
__u32 xcr;
__u32 reserved;
__u64 value;
};

struct kvm_xcrs {
__u32 nr_xcrs;
__u32 flags;
struct kvm_xcr xcrs[KVM_MAX_XCRS];
__u64 padding[16];
};

This ioctl sets the vcpu's xcrs to the values specified by userspace.

4.45 KVM_GET_SUPPORTED_CPUID

Capability: KVM_CAP_EXT_CPUID
Architectures: x86
Type: system ioctl
Parameters: struct kvm_cpuid2 (in/out)
Returns: 0 on success, -1 on error

struct kvm_cpuid2 {
__u32 nent;
__u32 padding;
struct kvm_cpuid_entry2 entries[0];
};

#define KVM_CPUID_FLAG_SIGNIFCANT_INDEX 1
#define KVM_CPUID_FLAG_STATEFUL_FUNC 2
#define KVM_CPUID_FLAG_STATE_READ_NEXT 4

struct kvm_cpuid_entry2 {
__u32 function;
__u32 index;
__u32 flags;
__u32 eax;
__u32 ebx;
__u32 ecx;
__u32 edx;
__u32 padding[3];
};

This ioctl returns x86 cpuid features which are supported by both the hardware
and kvm. Userspace can use the information returned by this ioctl to
construct cpuid information (for KVM_SET_CPUID2) that is consistent with
hardware, kernel, and userspace capabilities, and with user requirements (for
example, the user may wish to constrain cpuid to emulate older hardware,
or for feature consistency across a cluster).

Userspace invokes KVM_GET_SUPPORTED_CPUID by passing a kvm_cpuid2 structure
with the 'nent' field indicating the number of entries in the variable-size
array 'entries'. If the number of entries is too low to describe the cpu
capabilities, an error (E2BIG) is returned. If the number is too high,
the 'nent' field is adjusted and an error (ENOMEM) is returned. If the
number is just right, the 'nent' field is adjusted to the number of valid
entries in the 'entries' array, which is then filled.

The entries returned are the host cpuid as returned by the cpuid instruction,
with unknown or unsupported features masked out. The fields in each entry
are defined as follows:

function: the eax value used to obtain the entry
index: the ecx value used to obtain the entry (for entries that are
affected by ecx)
flags: an OR of zero or more of the following:
KVM_CPUID_FLAG_SIGNIFCANT_INDEX:
if the index field is valid
KVM_CPUID_FLAG_STATEFUL_FUNC:
if cpuid for this function returns different values for successive
invocations; there will be several entries with the same function,
all with this flag set
KVM_CPUID_FLAG_STATE_READ_NEXT:
for KVM_CPUID_FLAG_STATEFUL_FUNC entries, set if this entry is
the first entry to be read by a cpu
eax, ebx, ecx, edx: the values returned by the cpuid instruction for
this function/index combination

5. The kvm_run structure

Application code obtains a pointer to the kvm_run structure by
52 changes: 48 additions & 4 deletions Documentation/kvm/mmu.txt
@@ -77,10 +77,10 @@ Memory

Guest memory (gpa) is part of the user address space of the process that is
using kvm. Userspace defines the translation between guest addresses and user
-addresses (gpa->hva); note that two gpas may alias to the same gva, but not
+addresses (gpa->hva); note that two gpas may alias to the same hva, but not
vice versa.

-These gvas may be backed using any method available to the host: anonymous
+These hvas may be backed using any method available to the host: anonymous
memory, file backed memory, and device memory. Memory might be paged by the
host at any time.

@@ -161,7 +161,7 @@ Shadow pages contain the following information:
role.cr4_pae:
Contains the value of cr4.pae for which the page is valid (e.g. whether
32-bit or 64-bit gptes are in use).
-role.cr4_nxe:
+role.nxe:
Contains the value of efer.nxe for which the page is valid.
role.cr0_wp:
Contains the value of cr0.wp for which the page is valid.
@@ -180,7 +180,9 @@ Shadow pages contain the following information:
guest pages as leaves.
gfns:
An array of 512 guest frame numbers, one for each present pte. Used to
-perform a reverse map from a pte to a gfn.
+perform a reverse map from a pte to a gfn. When role.direct is set, any
+element of this array can be calculated from the gfn field when used; in
+this case, the array of gfns is not allocated. See role.direct and gfn.
slot_bitmap:
A bitmap containing one bit per memory slot. If the page contains a pte
mapping a page from memory slot n, then bit n of slot_bitmap will be set
@@ -296,6 +298,48 @@ Host translation updates:
- look up affected sptes through reverse map
- drop (or update) translations

Emulating cr0.wp
================

If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
works for the guest kernel, not guest userspace. When the guest
cr0.wp=1, this does not present a problem. However when the guest cr0.wp=0,
we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte (the
semantics require allowing any guest kernel access plus user read access).

We handle this by mapping the permissions to two possible sptes, depending
on fault type:

- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
disallows user access)
- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
write access)

(user write faults generate a #PF)
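
The choice between the two sptes can be written down as a small decision
function. The C sketch below models only the case discussed above (guest
cr0.wp=0, gpte.u=1, gpte.w=0, tdp disabled); struct wp_spte and
map_wp0_user_ro() are hypothetical names for illustration and do not come
from the kvm sources.

```c
#include <stdbool.h>

/* Outcome of one guest access to a page with gpte.u=1, gpte.w=0. */
struct wp_spte {
    bool inject_pf;     /* true: the guest gets a #PF, no spte is installed */
    bool u;             /* spte.u */
    bool w;             /* spte.w */
};

struct wp_spte map_wp0_user_ro(bool write_fault, bool user_fault)
{
    struct wp_spte r = { false, false, false };

    if (write_fault && user_fault)
        r.inject_pf = true;     /* user write: not permitted by the gpte */
    else if (write_fault)
        r.w = true;             /* kernel write: spte.u=0, spte.w=1 */
    else
        r.u = true;             /* any read: spte.u=1, spte.w=0 */

    return r;
}
```

Because the two sptes are mutually exclusive, a later access of the other
kind faults again and the spte is replaced by the other variant.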

Large pages
===========

The mmu supports all combinations of large and small guest and host pages.
Supported page sizes include 4k, 2M, 4M, and 1G. 4M pages are treated as
two separate 2M pages, on both guest and host, since the mmu always uses PAE
paging.

To instantiate a large spte, four constraints must be satisfied:

- the spte must point to a large host page
- the guest pte must be a large pte of at least equivalent size (if tdp is
enabled, there is no guest pte and this condition is satisfied)
- if the spte will be writeable, the large page frame may not overlap any
write-protected pages
- the guest page must be wholly contained by a single memory slot

To check the last two conditions, the mmu maintains a ->write_count set of
arrays for each memory slot and large page size. Every write protected page
causes its write_count to be incremented, thus preventing instantiation of
a large spte. The frames at the end of an unaligned memory slot have
artificially inflated ->write_counts so they can never be instantiated.
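
The write_count bookkeeping can be modelled in a few lines of C. This is a
simplified sketch for a single large page size (2M, i.e. 512 small pages), not
the kernel's data structure: struct slot_model and all function names are
hypothetical, and the slot is assumed to contain at least one page.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGES_PER_LARGE 512u    /* 2M large page / 4k small page */

/* Simplified model of one memory slot with a per-large-frame counter. */
struct slot_model {
    uint64_t base_gfn;
    uint64_t npages;            /* must be > 0 */
    size_t nr_large;            /* large frames touched by the slot */
    int *write_count;           /* one counter per large frame */
};

/* Index of the large frame that contains gfn, relative to the slot. */
size_t large_index(const struct slot_model *s, uint64_t gfn)
{
    return (size_t)(gfn / PAGES_PER_LARGE - s->base_gfn / PAGES_PER_LARGE);
}

int slot_init(struct slot_model *s, uint64_t base_gfn, uint64_t npages)
{
    uint64_t end = base_gfn + npages;

    s->base_gfn = base_gfn;
    s->npages = npages;
    s->nr_large = (size_t)((end - 1) / PAGES_PER_LARGE -
                           base_gfn / PAGES_PER_LARGE + 1);
    s->write_count = calloc(s->nr_large, sizeof(*s->write_count));
    if (!s->write_count)
        return -1;

    /* Frames only partly covered by the slot can never be mapped large. */
    if (base_gfn % PAGES_PER_LARGE)
        s->write_count[0]++;
    if (end % PAGES_PER_LARGE)
        s->write_count[s->nr_large - 1]++;
    return 0;
}

/* A small page inside the frame becomes write protected. */
void write_protect(struct slot_model *s, uint64_t gfn)
{
    s->write_count[large_index(s, gfn)]++;
}

/* The write protection of that small page is lifted again. */
void write_unprotect(struct slot_model *s, uint64_t gfn)
{
    s->write_count[large_index(s, gfn)]--;
}

/* A writeable large spte is allowed only while the counter is zero. */
int can_map_large(const struct slot_model *s, uint64_t gfn)
{
    return s->write_count[large_index(s, gfn)] == 0;
}
```

A single counter covers both of the last two constraints: a write-protected
page and a frame that extends past the slot look the same to the check.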

Further reading
===============
