Commit Graph

728 Commits

Author SHA1 Message Date
Wanpeng Li 963fee1656 KVM: nVMX: Fix virtual interrupt delivery injection
This patch fix bug reported in https://bugzilla.kernel.org/show_bug.cgi?id=73331,
after the patch http://www.spinics.net/lists/kvm/msg105230.html applied, there is
some progress and the L2 can boot up, however, slowly. The original idea of this
fix vid injection patch is from "Zhang, Yang Z" <yang.z.zhang@intel.com>.

Interrupt which delivered by vid should be injected to L1 by L0 if current is in
L1, or should be injected to L2 by L0 through the old injection way if L1 doesn't
have set External-interrupt exiting bit. The current logic doen't consider these
cases. This patch fix it by vid intr to L1 if current is L1 or L2 through old
injection way if L1 doen't have External-interrupt exiting bit set.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: "Zhang, Yang Z" <yang.z.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-17 14:45:26 +02:00
Paolo Bonzini 37ccdcbe07 KVM: x86: return all bits from get_interrupt_shadow
For the next patch we will need to know the full state of the
interrupt shadow; we will then set KVM_REQ_EVENT when one bit
is cleared.

However, right now get_interrupt_shadow only returns the one
corresponding to the emulated instruction, or an unconditional
0 if the emulated instruction does not have an interrupt shadow.
This is confusing and does not allow us to check for cleared
bits as mentioned above.

Clean the callback up, and modify toggle_interruptibility to
match the comment above the call.  As a small result, the
call to set_interrupt_shadow will be skipped in the common
case where int_shadow == 0 && mask == 0.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:13:56 +02:00
Paolo Bonzini 98eb2f8b14 KVM: vmx: speed up emulation of invalid guest state
About 25% of the time spent in emulation of invalid guest state
is wasted in checking whether emulation is required for the next
instruction.  However, this almost never changes except when a
segment register (or TR or LDTR) changes, or when there is a mode
transition (i.e. CR0 changes).

In fact, vmx_set_segment and vmx_set_cr0 already modify
vmx->emulation_required (except that the former for some reason
uses |= instead of just an assignment).  So there is no need to
call guest_state_valid in the emulation loop.

Emulation performance test results indicate 1650-2600 cycles
for common instructions, versus 2300-3200 before this patch on
a Sandy Bridge Xeon.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:13:56 +02:00
Nadav Amit 27e6fb5dae KVM: vmx: vmx instructions handling does not consider cs.l
VMX instructions use 32-bit operands in 32-bit mode, and 64-bit operands in
64-bit mode.  The current implementation is broken since it does not use the
register operands correctly, and always uses 64-bit for reads and writes.
Moreover, write to memory in vmwrite only considers long-mode, so it ignores
cs.l. This patch fixes this behavior.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:15 +02:00
Nadav Amit 1e32c07955 KVM: vmx: handle_cr ignores 32/64-bit mode
On 32-bit mode only bits [31:0] of the CR should be used for setting the CR
value.  Otherwise, the host may incorrectly assume the value is invalid if bits
[63:32] are not zero.  Moreover, the CR is currently being read twice when CR8
is used.  Last, nested mov-cr exiting is modified to handle the CR value
correctly as well.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:15 +02:00
Nadav Amit 5777392e83 KVM: x86: check DR6/7 high-bits are clear only on long-mode
When the guest sets DR6 and DR7, KVM asserts the high 32-bits are clear, and
otherwise injects a #GP exception. This exception should only be injected only
if running in long-mode.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:14 +02:00
Jan Kiszka 5381417f6a KVM: nVMX: Fix returned value of MSR_IA32_VMX_VMCS_ENUM
Many real CPUs get this wrong as well, but ours is totally off: bits 9:1
define the highest index value.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:13 +02:00
Jan Kiszka 2996fca069 KVM: nVMX: Allow to disable VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS
Allow L1 to "leak" its debug controls into L2, i.e. permit cleared
VM_{ENTRY_LOAD,EXIT_SAVE}_DEBUG_CONTROLS. This requires to manually
transfer the state of DR7 and IA32_DEBUGCTLMSR from L1 into L2 as both
run on different VMCS.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:13 +02:00
Jan Kiszka 560b7ee12c KVM: nVMX: Fix returned value of MSR_IA32_VMX_PROCBASED_CTLS
SDM says bits 1, 4-6, 8, 13-16, and 26 have to be set.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:12 +02:00
Jan Kiszka 3dcdf3ec6e KVM: nVMX: Allow to disable CR3 access interception
We already have this control enabled by exposing a broken
MSR_IA32_VMX_PROCBASED_CTLS value. This will properly advertise our
capability once the value is fixed by clearing the right bits in
MSR_IA32_VMX_TRUE_PROCBASED_CTLS. We also have to ensure to test the
right value on L2 entry.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:12 +02:00
Jan Kiszka 3dbcd8da7b KVM: nVMX: Advertise support for MSR_IA32_VMX_TRUE_*_CTLS
We already implemented them but failed to advertise them. Currently they
all return the identical values to the capability MSRs they are
augmenting. So there is no change in exposed features yet.

Drop related comments at this chance that are partially incorrect and
redundant anyway.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:11 +02:00
Fabian Frederick bc39c4db71 arch/x86/kvm/vmx.c: use PAGE_ALIGNED instead of IS_ALIGNED(PAGE_SIZE
use mm.h definition

Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:08 +02:00
Linus Torvalds b05d59dfce At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM.  Changes include:
 
 - a lot of s390 changes: optimizations, support for migration,
   GDB support and more
 
 - ARM changes are pretty small: support for the PSCI 0.2 hypercall
   interface on both the guest and the host (the latter acked by Catalin)
 
 - initial POWER8 and little-endian host support
 
 - support for running u-boot on embedded POWER targets
 
 - pretty large changes to MIPS too, completing the userspace interface
   and improving the handling of virtualized timer hardware
 
 - for x86, a larger set of changes is scheduled for 3.17.  Still,
   we have a few emulator bugfixes and support for running nested
   fully-virtualized Xen guests (para-virtualized Xen guests have
   always worked).  And some optimizations too.
 
 The only missing architecture here is ia64.  It's not a coincidence
 that support for KVM on ia64 is scheduled for removal in 3.17.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
 BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
 j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
 W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
 0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
 CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
 nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
 TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
 ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
 VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
 3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
 bAknaZpPiOrW11Et1htY
 =j5Od
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next

Pull KVM updates from Paolo Bonzini:
 "At over 200 commits, covering almost all supported architectures, this
  was a pretty active cycle for KVM.  Changes include:

   - a lot of s390 changes: optimizations, support for migration, GDB
     support and more

   - ARM changes are pretty small: support for the PSCI 0.2 hypercall
     interface on both the guest and the host (the latter acked by
     Catalin)

   - initial POWER8 and little-endian host support

   - support for running u-boot on embedded POWER targets

   - pretty large changes to MIPS too, completing the userspace
     interface and improving the handling of virtualized timer hardware

   - for x86, a larger set of changes is scheduled for 3.17.  Still, we
     have a few emulator bugfixes and support for running nested
     fully-virtualized Xen guests (para-virtualized Xen guests have
     always worked).  And some optimizations too.

  The only missing architecture here is ia64.  It's not a coincidence
  that support for KVM on ia64 is scheduled for removal in 3.17"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
  KVM: add missing cleanup_srcu_struct
  KVM: PPC: Book3S PR: Rework SLB switching code
  KVM: PPC: Book3S PR: Use SLB entry 0
  KVM: PPC: Book3S HV: Fix machine check delivery to guest
  KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
  KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
  KVM: PPC: Book3S HV: Fix dirty map for hugepages
  KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
  KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
  KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
  KVM: PPC: Book3S: Add ONE_REG register names that were missed
  KVM: PPC: Add CAP to indicate hcall fixes
  KVM: PPC: MPIC: Reset IRQ source private members
  KVM: PPC: Graciously fail broken LE hypercalls
  PPC: ePAPR: Fix hypercall on LE guest
  KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
  KVM: PPC: BOOK3S: Always use the saved DAR value
  PPC: KVM: Make NX bit available with magic page
  KVM: PPC: Disable NX for old magic page using guests
  KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
  ...
2014-06-04 08:47:12 -07:00
Nadav Amit 1f85411255 KVM: vmx: DR7 masking on task switch emulation is wrong
The DR7 masking which is done on task switch emulation should be in hex format
(clearing the local breakpoints enable bits 0,2,4 and 6).

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22 17:47:18 +02:00
Paolo Bonzini ae9fedc793 KVM: x86: get CPL from SS.DPL
CS.RPL is not equal to the CPL in the few instructions between
setting CR0.PE and reloading CS.  And CS.DPL is also not equal
to the CPL for conforming code segments.

However, SS.DPL *is* always equal to the CPL except for the weird
case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
value in the STAR MSR, but force CPL=3 (Intel instead forces
SS.DPL=SS.RPL=CPL=3).

So this patch:

- modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
the above case with SYSRET is not broken further, and the way
to fix it would be to pass the CPL to userspace and back

- modifies VMX to always return the CPL from SS.DPL (except
forcing it to 0 if we are emulating real mode via vm86 mode;
in vm86 mode all DPLs have to be 3, but real mode does allow
privileged instructions).  It also removes the CPL cache,
which becomes a duplicate of the SS access rights cache.

This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
CR0.PE=1 but before CS has been reloaded.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22 17:47:17 +02:00
Gabriel L. Somlo 87c00572ba kvm: x86: emulate monitor and mwait instructions as nop
Treat monitor and mwait instructions as nop, which is architecturally
correct (but inefficient) behavior. We do this to prevent misbehaving
guests (e.g. OS X <= 10.7) from crashing after they fail to check for
monitor/mwait availability via cpuid.

Since mwait-based idle loops relying on these nop-emulated instructions
would keep the host CPU pegged at 100%, do NOT advertise their presence
via cpuid, to prevent compliant guests from using them inadvertently.

Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-08 15:40:49 +02:00
Nadav Amit a4ab9d0cf1 KVM: vmx: handle_dr does not handle RSP correctly
The RSP register is not automatically cached, causing mov DR instruction with
RSP to fail.  Instead the regular register accessing interface should be used.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-07 17:24:59 +02:00
Paolo Bonzini 696dfd95ba KVM: vmx: disable APIC virtualization in nested guests
While running a nested guest, we should disable APIC virtualization
controls (virtualized APIC register accesses, virtual interrupt
delivery and posted interrupts), because we do not expose them to
the nested guest.

Reported-by: Hu Yaohui <loki2441@gmail.com>
Suggested-by: Abel Gordon <abel@stratoscale.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-07 13:46:02 +02:00
Bandan Das 4291b58885 KVM: nVMX: move vmclear and vmptrld pre-checks to nested_vmx_check_vmptr
Some checks are common to all, and moreover,
according to the spec, the check for whether any bits
beyond the physical address width are set are also
applicable to all of them

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-06 19:00:43 +02:00
Bandan Das 96ec146330 KVM: nVMX: fail on invalid vmclear/vmptrld pointer
The spec mandates that if the vmptrld or vmclear
address is equal to the vmxon region pointer, the
instruction should fail with error "VMPTRLD with
VMXON pointer" or "VMCLEAR with VMXON pointer"

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-06 19:00:37 +02:00
Bandan Das 3573e22cfe KVM: nVMX: additional checks on vmxon region
Currently, the vmxon region isn't used in the nested case.
However, according to the spec, the vmxon instruction performs
additional sanity checks on this region and the associated
pointer. Modify emulated vmxon to better adhere to the spec
requirements

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-06 19:00:27 +02:00
Bandan Das 19677e32fe KVM: nVMX: rearrange get_vmx_mem_address
Our common function for vmptr checks (in 2/4) needs to fetch
the memory address

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-06 18:59:57 +02:00
Bandan Das fe2b201b3b KVM: x86: Check for host supported fields in shadow vmcs
We track shadow vmcs fields through two static lists,
one for read only and another for r/w fields. However, with
addition of new vmcs fields, not all fields may be supported on
all hosts. If so, copy_vmcs12_to_shadow() trying to vmwrite on
unsupported hosts will result in a vmwrite error. For example, commit
36be0b9deb introduced GUEST_BNDCFGS, which is not supported
by all processors. Filter out host unsupported fields before
letting guests use shadow vmcs

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-28 11:14:51 +02:00
Bandan Das e0ba1a6ffc KVM: nVMX: Advertise support for interrupt acknowledgement
Some Type 1 hypervisors such as XEN won't enable VMX without it present

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22 18:41:34 -03:00
Bandan Das 77b0f5d67f KVM: nVMX: Ack and write vector info to intr_info if L1 asks us to
This feature emulates the "Acknowledge interrupt on exit" behavior.
We can safely emulate it for L1 to run L2 even if L0 itself has it
disabled (to run L1).

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22 18:41:33 -03:00
Bandan Das 4b85507860 KVM: nVMX: Don't advertise single context invalidation for invept
For single context invalidation, we fall through to global
invalidation in handle_invept() except for one case - when
the operand supplied by L1 is different from what we have in
vmcs12. However, typically hypervisors will only call invept
for the currently loaded eptp, so the condition will
never be true.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22 18:41:28 -03:00
Huw Davies fd2a445a94 KVM: VMX: Advance rip to after an ICEBP instruction
When entering an exception after an ICEBP, the saved instruction
pointer should point to after the instruction.

This fixes the bug here: https://bugs.launchpad.net/qemu/+bug/1119686

Signed-off-by: Huw Davies <huw@codeweavers.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-22 18:37:43 -03:00
Michael S. Tsirkin 68c3b4d167 KVM: VMX: speed up wildcard MMIO EVENTFD
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Unfortunately, this only works if userspace does not need to match on
access length and data.  The implementation adds a separate FAST_MMIO
bus internally. This serves two purposes:
    - minimize overhead for old userspace that does not use eventfd with lengtth = 0
    - minimize disruption in other code (since we don't know the length,
      devices on the MMIO bus only get a valid address in write, this
      way we don't need to touch all devices to teach them to handle
      an invalid length)

At the moment, this optimization only has effect for EPT on x86.

It will be possible to speed up MMIO for NPT and MMU using the same
idea in the future.

With this patch applied, on VMX MMIO EVENTFD is essentially as fast as PIO.
I was unable to detect any measureable slowdown to non-eventfd MMIO.

Making MMIO faster is important for the upcoming virtio 1.0 which
includes an MMIO signalling capability.

The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
pre-review and suggestions.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-17 14:01:43 -03:00
Feng Wu e1e746b3c5 KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
SMAP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMAP needs to be
manually disabled when guest switches to non-paging mode.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-14 17:50:35 -03:00
Paolo Bonzini 93c4adc7af KVM: x86: handle missing MPX in nested virtualization
When doing nested virtualization, we may be able to read BNDCFGS but
still not be allowed to write to GUEST_BNDCFGS in the VMCS.  Guard
writes to the field with vmx_mpx_supported(), and similarly hide the
MSR from userspace if the processor does not support the field.

We could work around this with the generic MSR save/load machinery,
but there is only a limited number of MSR save/load slots and it is
not really worthwhile to waste one for a scenario that should not
happen except in the nested virtualization case.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-17 12:21:39 +01:00
Paolo Bonzini 36be0b9deb KVM: x86: Add nested virtualization support for MPX
This is simple to do, the "host" BNDCFGS is either 0 or the guest value.
However, both controls have to be present.  We cannot provide MPX if
we only have one of the "load BNDCFGS" or "clear BNDCFGS" controls.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-17 12:21:39 +01:00
Paolo Bonzini d16c293e4e KVM: nVMX: Allow nested guests to run with dirty debug registers
When preparing the VMCS02, the CPU-based execution controls is computed
by vmx_exec_control.  Turn off DR access exits there, too, if the
KVM_DEBUGREG_WONT_EXIT bit is set in switch_db_regs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:03 +01:00
Paolo Bonzini 81908bf443 KVM: vmx: Allow the guest to run with dirty debug registers
When not running in guest-debug mode (i.e. the guest controls the debug
registers, having to take an exit for each DR access is a waste of time.
If the guest gets into a state where each context switch causes DR to be
saved and restored, this can take away as much as 40% of the execution
time from the guest.

If the guest is running with vcpu->arch.db == vcpu->arch.eff_db, we
can let it write freely to the debug registers and reload them on the
next exit.  We still need to exit on the first access, so that the
KVM_DEBUGREG_WONT_EXIT flag is set in switch_db_regs; after that, further
accesses to the debug registers will not cause a vmexit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:03 +01:00
Paolo Bonzini c845f9c646 KVM: vmx: we do rely on loading DR7 on entry
Currently, this works even if the bit is not in "min", because the bit is always
set in MSR_IA32_VMX_ENTRY_CTLS.  Mention it for the sake of documentation, and
to avoid surprises if we later switch to MSR_IA32_VMX_TRUE_ENTRY_CTLS.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:01 +01:00
Jan Kiszka c9a7953f09 KVM: x86: Remove return code from enable_irq/nmi_window
It's no longer possible to enter enable_irq_window in guest mode when
L1 intercepts external interrupts and we are entering L2. This is now
caught in vcpu_enter_guest. So we can remove the check from the VMX
version of enable_irq_window, thus the need to return an error code from
both enable_irq_window and enable_nmi_window.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:47 +01:00
Jan Kiszka 220c567297 KVM: nVMX: Do not inject NMI vmexits when L2 has a pending interrupt
According to SDM 27.2.3, IDT vectoring information will not be valid on
vmexits caused by external NMIs. So we have to avoid creating such
scenarios by delaying EXIT_REASON_EXCEPTION_NMI injection as long as we
have a pending interrupt because that one would be migrated to L1's IDT
vectoring info on nested exit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:46 +01:00
Jan Kiszka f4124500c2 KVM: nVMX: Fully emulate preemption timer
We cannot rely on the hardware-provided preemption timer support because
we are holding L2 in HLT outside non-root mode. Furthermore, emulating
the preemption will resolve tick rate errata on older Intel CPUs.

The emulation is based on hrtimer which is started on L2 entry, stopped
on L2 exit and evaluated via the new check_nested_events hook. As we no
longer rely on hardware features, we can enable both the preemption
timer support and value saving unconditionally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:45 +01:00
Jan Kiszka b6b8a1451f KVM: nVMX: Rework interception of IRQs and NMIs
Move the check for leaving L2 on pending and intercepted IRQs or NMIs
from the *_allowed handler into a dedicated callback. Invoke this
callback at the relevant points before KVM checks if IRQs/NMIs can be
injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:45 +01:00
Paolo Bonzini ccf9844e5d kvm, vmx: Really fix lazy FPU on nested guest
Commit e504c9098e (kvm, vmx: Fix lazy FPU on nested guest, 2013-11-13)
highlighted a real problem, but the fix was subtly wrong.

nested_read_cr0 is the CR0 as read by L2, but here we want to look at
the CR0 value reflecting L1's setup.  In other words, L2 might think
that TS=0 (so nested_read_cr0 has the bit clear); but if L1 is actually
running it with TS=1, we should inject the fault into L1.

The effective value of CR0 in L2 is contained in vmcs12->guest_cr0, use
it.

Fixes: e504c9098e
Reported-by: Kashyap Chamarty <kchamart@redhat.com>
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Tested-by: Kashyap Chamarty <kchamart@redhat.com>
Tested-by: Anthoine Bourgeois <bourgeois@bertin.fr>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-03 12:49:42 +01:00
Liu, Jinsong 0dd376e709 KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save
From 5d5a80cd172ea6fb51786369bcc23356b1e9e956 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:11:55 +0800
Subject: [PATCH v5 2/3] KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save

Add MSR_IA32_BNDCFGS to msrs_to_save, and corresponding logic
to kvm_get/set_msr().

Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-25 20:17:09 +01:00
Liu, Jinsong da8999d318 KVM: x86: Intel MPX vmx and msr handle
From caddc009a6d2019034af8f2346b2fd37a81608d0 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:11:11 +0800
Subject: [PATCH v5 1/3] KVM: x86: Intel MPX vmx and msr handle

This patch handle vmx and msr of Intel MPX feature.

Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-24 12:14:00 +01:00
Jan Kiszka 58cb628dbe KVM: x86: Validate guest writes to MSR_IA32_APICBASE
Check for invalid state transitions on guest-initiated updates of
MSR_IA32_APICBASE. This address both enabling of the x2APIC when it is
not supported and all invalid transitions as described in SDM section
10.12.5. It also checks that no reserved bit is set in APICBASE by the
guest.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
[Use cpuid_maxphyaddr instead of guest_cpuid_get_phys_bits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-27 14:39:44 +01:00
Linus Torvalds 7ebd3faa9b First round of KVM updates for 3.14; PPC parts will come next week.
Nothing major here, just bugfixes all over the place.  The most
 interesting part is the ARM guys' virtualized interrupt controller
 overhaul, which lets userspace get/set the state and thus enables
 migration of ARM VMs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJS3TVKAAoJEBvWZb6bTYbyIFgP/2cmt4ifCuFMaZv4+G1S8jZU
 uC9ZB/+7vzht/p6zAy+4BxurKbHmSBFkC1OKcxYuy7yB4CQkHabzj4V2vRtqFdwH
 5lExP9qh3kqaVLuhnvxLTmkktR3EW4PFy6OI53l5kRNktOXSuZ0aN6K3V7tCg/X0
 iL7ASo4bJKlxeWcDpmuVrNgAajmZVfXrjKY7robgBQno+yIsgKhRZRBQHjozA6B8
 FpCo/k48RZd/EzIbV/PDDRI4hmmry/lgrO9SKjzq56wSqff2bd/k/KYze4dbAPfd
 Ps60enPTuHmeEjjb4MMMU4EKHVdTQFUMx/xZCmT4xzoh8s4of6RHphXbfE0SUznQ
 dTveyEQAR7E3JNS0k1+3WEX5fWlFesp0hO2NeE0wzUq4TAr9ztgVO9NQ6Si15e7Z
 2HysO0T5Ojtt0lY08/PvS6i48eCAuuBomrejJS8hLW4SUZ5adn+yW4Qo7Fp9JeBR
 l9a3LsVT8BZMtUWrUuFcVhlM4MbzElUPjDbgWhR8UYU/kpfVZOQu8qWgGKR4UWXy
 X7/t9l/tjR99CmfMJBAOzJid+ScSpAfg77BdaKiQrVfVIJmsjEjlO8vUMyj5b1HF
 hPX5wNyJjHAOfridLeHSs4Rdm4a8sk8Az5d4h76pLVz8M4jyTi2v0rO3N4/dU/pu
 x7N8KR5hAj+mLBoM9/Al
 =8sYU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First round of KVM updates for 3.14; PPC parts will come next week.

  Nothing major here, just bugfixes all over the place.  The most
  interesting part is the ARM guys' virtualized interrupt controller
  overhaul, which lets userspace get/set the state and thus enables
  migration of ARM VMs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
  kvm: make KVM_MMU_AUDIT help text more readable
  KVM: s390: Fix memory access error detection
  KVM: nVMX: Update guest activity state field on L2 exits
  KVM: nVMX: Fix nested_run_pending on activity state HLT
  KVM: nVMX: Clean up handling of VMX-related MSRs
  KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
  KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
  KVM: nVMX: Leave VMX mode on clearing of feature control MSR
  KVM: VMX: Fix DR6 update on #DB exception
  KVM: SVM: Fix reading of DR6
  KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
  add support for Hyper-V reference time counter
  KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
  KVM: x86: fix tsc catchup issue with tsc scaling
  KVM: x86: limit PIT timer frequency
  KVM: x86: handle invalid root_hpa everywhere
  kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
  kvm: vfio: silence GCC warning
  KVM: ARM: Remove duplicate include
  arm/arm64: KVM: relax the requirements of VMA alignment for THP
  ...
2014-01-22 21:40:43 -08:00
Jan Kiszka 3edf1e698f KVM: nVMX: Update guest activity state field on L2 exits
Set guest activity state in L1's VMCS according to the VCPUs mp_state.
This ensures we report the correct state in case we L2 executed HLT or
if we put L2 into HLT state and it was now woken up by an event.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:18 +01:00
Jan Kiszka 7af40ad37b KVM: nVMX: Fix nested_run_pending on activity state HLT
When we suspend the guest in HLT state, the nested run is no longer
pending - we emulated it completely. So only set nested_run_pending
after checking the activity state.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:17 +01:00
Jan Kiszka cae501397a KVM: nVMX: Clean up handling of VMX-related MSRs
This simplifies the code and also stops issuing warning about writing to
unhandled MSRs when VMX is disabled or the Feature Control MSR is
locked - we do handle them all according to the spec.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:16 +01:00
Jan Kiszka 542060ea79 KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
Already used by nested SVM for tracing nested vmexit: kvm_nested_vmexit
marks exits from L2 to L0 while kvm_nested_vmexit_inject marks vmexits
that are reflected to L1.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:15 +01:00
Jan Kiszka 533558bcb6 KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
Instead of fixing up the vmcs12 after the nested vmexit, pass key
parameters already when calling nested_vmx_vmexit. This will help
tracing those vmexits.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:14 +01:00
Jan Kiszka 42124925c1 KVM: nVMX: Leave VMX mode on clearing of feature control MSR
When userspace sets MSR_IA32_FEATURE_CONTROL to 0, make sure we leave
root and non-root mode, fully disabling VMX. The register state of the
VCPU is undefined after this step, so userspace has to set it to a
proper state afterward.

This enables to reboot a VM while it is running some hypervisor code.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:13 +01:00
Jan Kiszka 8246bf52c7 KVM: VMX: Fix DR6 update on #DB exception
According to the SDM, only bits 0-3 of DR6 "may" be cleared by "certain"
debug exception. So do update them on #DB exception in KVM, but leave
the rest alone, only setting BD and BS in addition to already set bits
in DR6. This also aligns us with kvm_vcpu_check_singlestep.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:12 +01:00
Jan Kiszka 73aaf249ee KVM: SVM: Fix reading of DR6
In contrast to VMX, SVM dose not automatically transfer DR6 into the
VCPU's arch.dr6. So if we face a DR6 read, we must consult a new vendor
hook to obtain the current value. And as SVM now picks the DR6 state
from its VMCB, we also need a set callback in order to write updates of
DR6 back.

Fixes a regression of 020df0794f.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:10 +01:00
Marcelo Tosatti 26a865f4aa KVM: VMX: fix use after free of vmx->loaded_vmcs
After free_loaded_vmcs executes, the "loaded_vmcs" structure
is kfreed, and now vmx->loaded_vmcs points to a kfreed area.
Subsequent free_loaded_vmcs then attempts to manipulate
vmx->loaded_vmcs.

Switch the order to avoid the problem.

https://bugzilla.redhat.com/show_bug.cgi?id=1047892

Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-01-08 19:14:08 -02:00
Zhihui Zhang 2f0a6397dd KVM: VMX: check use I/O bitmap first before unconditional I/O exit
According to Table C-1 of Intel SDM 3C, a VM exit happens on an I/O instruction when
"use I/O bitmaps" VM-execution control was 0 _and_ the "unconditional I/O exiting"
VM-execution control was 1. So we can't just check "unconditional I/O exiting" alone.
This patch was improved by suggestion from Jan Kiszka.

Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Zhihui Zhang <zzhsuny@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-01-08 19:01:40 -02:00
Jan Kiszka 29bf08f12b KVM: nVMX: Unconditionally uninit the MMU on nested vmexit
Three reasons for doing this: 1. arch.walk_mmu points to arch.mmu anyway
in case nested EPT wasn't in use. 2. this aligns VMX with SVM. But 3. is
most important: nested_cpu_has_ept(vmcs12) queries the VMCS page, and if
one guest VCPU manipulates the page of another VCPU in L2, we may be
fooled to skip over the nested_ept_uninit_mmu_context, leaving mmu in
nested state. That can crash the host later on if nested_ept_get_cr3 is
invoked while L1 already left vmxon and nested.current_vmcs12 became
NULL therefore.

Cc: stable@kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-01-02 11:22:14 -02:00
Jan Kiszka 4c4d563b49 KVM: VMX: Do not skip the instruction if handle_dr injects a fault
If kvm_get_dr or kvm_set_dr reports that it raised a fault, we must not
advance the instruction pointer. Otherwise the exception will hit the
wrong instruction.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-20 19:20:39 +01:00
Jan Kiszka ca3f257ae5 KVM: nVMX: Support direct APIC access from L2
It's a pathological case, but still a valid one: If L1 disables APIC
virtualization and also allows L2 to directly write to the APIC page, we
have to forcibly enable APIC virtualization while in L2 if the in-kernel
APIC is in use.

This allows to run the direct interrupt test case in the vmx unit test
without x2APIC.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-18 10:27:09 +01:00
Jan Kiszka 6dfacadd58 KVM: nVMX: Add support for activity state HLT
We can easily emulate the HLT activity state for L1: If it decides that
L2 shall be halted on entry, just invoke the normal emulation of halt
after switching to L2. We do not depend on specific host features to
provide this, so we can expose the capability unconditionally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-12 10:49:56 +01:00
Gleb Natapov 2961e8764f KVM: VMX: shadow VM_(ENTRY|EXIT)_CONTROLS vmcs field
VM_(ENTRY|EXIT)_CONTROLS vmcs fields are read/written on each guest
entry but most times it can be avoided since values do not changes.
Keep fields copy in memory to avoid unnecessary reads from vmcs.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-12 10:49:49 +01:00
Anthoine Bourgeois e504c9098e kvm, vmx: Fix lazy FPU on nested guest
If a nested guest does a NM fault but its CR0 doesn't contain the TS
flag (because it was already cleared by the guest with L1 aid) then we
have to activate FPU ourselves in L0 and then continue to L2. If TS flag
is set then we fallback on the previous behavior, forward the fault to
L1 if it asked for.

Signed-off-by: Anthoine Bourgeois <bourgeois@bertin.fr>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-11-13 18:46:54 +01:00
Michael S. Tsirkin 6026620475 kvm/vmx: error message typo fix
mst can't be blamed for lack of switch entries: the
issue is with msrs actually.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-31 11:55:15 +01:00
Alex Williamson e0f0bbc527 kvm: Create non-coherent DMA registeration
We currently use some ad-hoc arch variables tied to legacy KVM device
assignment to manage emulation of instructions that depend on whether
non-coherent DMA is present.  Create an interface for this, adapting
legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
For now we assume that non-coherent DMA is possible any time we have a
VFIO group.  Eventually an interface can be developed as part of the
VFIO external user interface to query the coherency of a group.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 19:02:23 +01:00
Alex Williamson d96eb2c6f4 kvm/x86: Convert iommu_flags to iommu_noncoherent
Default to operating in coherent mode.  This simplifies the logic when
we switch to a model of registering and unregistering noncoherent I/O
with KVM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 19:02:13 +01:00
Jan Kiszka a294c9bbd0 nVMX: Report CPU_BASED_VIRTUAL_NMI_PENDING as supported
If the host supports it, we can and should expose it to the guest as
well, just like we already do with PIN_BASED_VIRTUAL_NMIS.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-28 13:14:38 +01:00
Jan Kiszka cd2633c59b nVMX: Fix pick-up of uninjected NMIs
__vmx_complete_interrupts stored uninjected NMIs in arch.nmi_injected,
not arch.nmi_pending. So we actually need to check the former field in
vmcs12_save_pending_event. This fixes the eventinj unit test when run
in nested KVM.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-28 13:14:24 +01:00
Jan Kiszka d3134dbf20 KVM: nVMX: Report 2MB EPT pages as supported
As long as the hardware provides us 2MB EPT pages, we can also expose
them to the guest because our shadow EPT code already supports this
feature.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-28 13:14:12 +01:00
Gleb Natapov 13acfd5715 Powerpc KVM work is based on a commit after rc4.
Merging master into next to satisfy the dependencies.

Conflicts:
	arch/arm/kvm/reset.c
2013-10-17 17:41:49 +03:00
Arthur Chunqi Li 7854cbca81 KVM: nVMX: Fully support nested VMX preemption timer
This patch contains the following two changes:
1. Fix the bug in nested preemption timer support. If vmexit L2->L0
with some reasons not emulated by L1, preemption timer value should
be save in such exits.
2. Add support of "Save VMX-preemption timer value" VM-Exit controls
to nVMX.

With this patch, nested VMX preemption timer features are fully
supported.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-10 18:22:54 +02:00
Gleb Natapov d0d538b9d1 KVM: nVMX: fix shadow on EPT
72f857950f broke shadow on EPT. This patch reverts it and fixes PAE
on nEPT (which reverted commit fixed) in other way.

Shadow on EPT is now broken because while L1 builds shadow page table
for L2 (which is PAE while L2 is in real mode) it never loads L2's
GUEST_PDPTR[0-3].  They do not need to be loaded because without nested
virtualization HW does this during guest entry if EPT is disabled,
but in our case L0 emulates L2's vmentry while EPT is enables, so we
cannot rely on vmcs12->guest_pdptr[0-3] to contain up-to-date values
and need to re-read PDPTEs from L2 memory. This is what kvm_set_cr3()
is doing, but by clearing cache bits during L2 vmentry we drop values
that kvm_set_cr3() read from memory.

So why the same code does not work for PAE on nEPT? kvm_set_cr3()
reads pdptes into vcpu->arch.walk_mmu->pdptrs[]. walk_mmu points to
vcpu->arch.nested_mmu while nested guest is running, but ept_load_pdptrs()
uses vcpu->arch.mmu which contain incorrect values. Fix that by using
walk_mmu in ept_(load|save)_pdptrs.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-10 11:39:57 +02:00
Paolo Bonzini 8a3c1a3347 KVM: mmu: change useless int return types to void
kvm_mmu initialization is mostly filling in function pointers, there is
no way for it to fail.  Clean up unused return values.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 15:44:02 +03:00
Gleb Natapov feaf0c7dc4 KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2
If #PF happens during delivery of an exception into L2 and L1 also do
not have the page mapped in its shadow page table then L0 needs to
generate vmexit to L2 with original event in IDT_VECTORING_INFO, but
current code combines both exception and generates #DF instead. Fix that
by providing nVMX specific function to handle page faults during page
table walk that handles this case correctly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:14:25 +02:00
Gleb Natapov e011c663b9 KVM: nVMX: Check all exceptions for intercept during delivery to L2
All exceptions should be checked for intercept during delivery to L2,
but we check only #PF currently. Drop nested_run_pending while we are
at it since exception cannot be injected during vmentry anyway.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
[Renamed the nested_vmx_check_exception function. - Paolo]
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:14:24 +02:00
Gleb Natapov 851eb6677c KVM: nVMX: Do not put exception that caused vmexit to IDT_VECTORING_INFO
If an exception causes vmexit directly it should not be reported in
IDT_VECTORING_INFO during the exit. For that we need to be able to
distinguish between exception that is injected into nested VM and one that
is reinjected because its delivery failed. Fortunately we already have
mechanism to do so for nested SVM, so here we just use correct function
to requeue exceptions and make sure that reinjected exception is not
moved to IDT_VECTORING_INFO during vmexit emulation and not re-checked
for interception during delivery.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:14:24 +02:00
Gleb Natapov e0b890d35c KVM: nVMX: Amend nested_run_pending logic
EXIT_REASON_VMLAUNCH/EXIT_REASON_VMRESUME exit does not mean that nested
VM will actually run during next entry. Move setting nested_run_pending
closer to vmentry emulation code and move its clearing close to vmexit to
minimize amount of code that will erroneously run with nested_run_pending
set.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:14:23 +02:00
Gleb Natapov bcd1c29495 KVM: VMX: do not check bit 12 of EPT violation exit qualification when undefined
Bit 12 is undefined in any of the following cases:
- If the "NMI exiting" VM-execution control is 1 and the "virtual NMIs"
  VM-execution control is 0.
- If the VM exit sets the valid bit in the IDT-vectoring information field

Signed-off-by: Gleb Natapov <gleb@redhat.com>
[Add parentheses around & within && - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-25 11:38:26 +02:00
Jan Kiszka 92fbc7b195 KVM: nVMX: Enable unrestricted guest mode support
Now that we provide EPT support, there is no reason to torture our
guests by hiding the relieving unrestricted guest mode feature. We just
need to relax CR0 checks for always-on bits as PE and PG can now be
switched off.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-24 19:12:15 +02:00
Jan Kiszka 10ba54a589 KVM: nVMX: Implement support for EFER saving on VM-exit
Implement and advertise VM_EXIT_SAVE_IA32_EFER. L0 traps EFER writes
unconditionally, so we always find the current L2 value in the
architectural state.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-24 19:12:14 +02:00
Jan Kiszka 59ab5a8f44 KVM: nVMX: Do not set identity page map for L2
Fiddling with CR3 for L2 is L1's job. It may set its own, different
identity map or simple leave it alone if unrestricted guest mode is
enabled. This also fixes reading back the current CR3 on L2 exits for
reporting it to L1.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-24 19:12:14 +02:00
Jan Kiszka 9e3e4dbf44 KVM: nVMX: Replace kvm_set_cr0 with vmx_set_cr0 in load_vmcs12_host_state
kvm_set_cr0 performs checks on the state transition that may prevent
loading L1's cr0. For now we rely on the hardware to catch invalid
states loaded by L1 into its VMCS. Still, consistency checks on the host
state part of the VMCS on guest entry will have to be improved later on.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-24 19:12:13 +02:00
Gleb Natapov 0be9c7a89f KVM: VMX: set "blocked by NMI" flag if EPT violation happens during IRET from NMI
Set "blocked by NMI" flag if EPT violation happens during IRET from NMI
otherwise NMI can be called recursively causing stack corruption.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-09-17 19:09:47 +03:00
Gleb Natapov 72f857950f KVM: nEPT: reset PDPTR register cache on nested vmentry emulation
After nested vmentry stale cache can be used to reload L2 PDPTR pointers
which will cause L2 guest to fail. Fix it by invalidating cache on nested
vmentry emulation.

https://bugzilla.kernel.org/show_bug.cgi?id=60830

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-17 12:52:42 +03:00
Paolo Bonzini 94452b9e34 KVM: vmx: count exits to userspace during invalid guest emulation
These will happen due to MMIO.

Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-08-28 17:13:15 +03:00
Arthur Chunqi Li c0dfee582e KVM: nVMX: Advertise IA32_PAT in VM exit control
Advertise VM_EXIT_SAVE_IA32_PAT and VM_EXIT_LOAD_IA32_PAT.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:47 +02:00
Jan Kiszka 5743534960 KVM: nVMX: Fix up VM_ENTRY_IA32E_MODE control feature reporting
Do not report that we can enter the guest in 64-bit mode if the host is
32-bit only. This is not supported by KVM.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:47 +02:00
Jan Kiszka ca72d970ff KVM: nEPT: Advertise WB type EPTP
At least WB must be possible.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:46 +02:00
Jan Kiszka 44811c02ed nVMX: Keep arch.pat in sync on L1-L2 switches
When asking vmx to load the PAT MSR for us while switching from L1 to L2
or vice versa, we have to update arch.pat as well as it may later be
used again to load or read out the MSR content.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Tested-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:45 +02:00
Nadav Har'El f5c4368f85 nEPT: Miscelleneous cleanups
Some trivial code cleanups not really related to nested EPT.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:44 +02:00
Nadav Har'El 2b1be67741 nEPT: Some additional comments
Some additional comments to preexisting code:
Explain who (L0 or L1) handles EPT violation and misconfiguration exits.
Don't mention "shadow on either EPT or shadow" as the only two options.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:44 +02:00
Nadav Har'El afa61f752b Advertise the support of EPT to the L1 guest, through the appropriate MSR.
This is the last patch of the basic Nested EPT feature, so as to allow
bisection through this patch series: The guest will not see EPT support until
this last patch, and will not attempt to use the half-applied feature.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:43 +02:00
Nadav Har'El bfd0a56b90 nEPT: Nested INVEPT
If we let L1 use EPT, we should probably also support the INVEPT instruction.

In our current nested EPT implementation, when L1 changes its EPT table
for L2 (i.e., EPT12), L0 modifies the shadow EPT table (EPT02), and in
the course of this modification already calls INVEPT. But if last level
of shadow page is unsync not all L1's changes to EPT12 are intercepted,
which means roots need to be synced when L1 calls INVEPT. Global INVEPT
should not be different since roots are synced by kvm_mmu_load() each
time EPTP02 changes.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:42 +02:00
Nadav Har'El 155a97a3d7 nEPT: MMU context for nested EPT
KVM's existing shadow MMU code already supports nested TDP. To use it, we
need to set up a new "MMU context" for nested EPT, and create a few callbacks
for it (nested_ept_*()). This context should also use the EPT versions of
the page table access functions (defined in the previous patch).
Then, we need to switch back and forth between this nested context and the
regular MMU context when switching between L1 and L2 (when L1 runs this L2
with EPT).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:41 +02:00
Yang Zhang 25d92081ae nEPT: Add nEPT violation/misconfigration support
Inject nEPT fault to L1 guest. This patch is original from Xinhao.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:40 +02:00
Nadav Har'El 3633cfc3e8 nEPT: Fix cr3 handling in nested exit and entry
The existing code for handling cr3 and related VMCS fields during nested
exit and entry wasn't correct in all cases:

If L2 is allowed to control cr3 (and this is indeed the case in nested EPT),
during nested exit we must copy the modified cr3 from vmcs02 to vmcs12, and
we forgot to do so. This patch adds this copy.

If L0 isn't controlling cr3 when running L2 (i.e., L0 is using EPT), and
whoever does control cr3 (L1 or L2) is using PAE, the processor might have
saved PDPTEs and we should also save them in vmcs12 (and restore later).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:34 +02:00
Nadav Har'El 8049d651e8 nEPT: Support LOAD_IA32_EFER entry/exit controls for L1
Recent KVM, since http://kerneltrap.org/mailarchive/linux-kvm/2010/5/2/6261577
switch the EFER MSR when EPT is used and the host and guest have different
NX bits. So if we add support for nested EPT (L1 guest using EPT to run L2)
and want to be able to run recent KVM as L1, we need to allow L1 to use this
EFER switching feature.

To do this EFER switching, KVM uses VM_ENTRY/EXIT_LOAD_IA32_EFER if available,
and if it isn't, it uses the generic VM_ENTRY/EXIT_MSR_LOAD. This patch adds
support for the former (the latter is still unsupported).

Nested entry and exit emulation (prepare_vmcs_02 and load_vmcs12_host_state,
respectively) already handled VM_ENTRY/EXIT_LOAD_IA32_EFER correctly. So all
that's left to do in this patch is to properly advertise this feature to L1.

Note that vmcs12's VM_ENTRY/EXIT_LOAD_IA32_EFER are emulated by L0, by using
vmx_set_efer (which itself sets one of several vmcs02 fields), so we always
support this feature, regardless of whether the host supports it.

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:34 +02:00
Gleb Natapov 205befd9a5 KVM: nVMX: correctly set tr base on nested vmexit emulation
After commit 21feb4eb64 tr base is zeroed
during vmexit. Set it to L1's HOST_TR_BASE. This should fix
https://bugzilla.kernel.org/show_bug.cgi?id=60679

Reported-by: Yongjie Ren <yongjie.ren@intel.com>
Reviewed-by: Arthur Chunqi Li <yzt356@gmail.com>
Tested-by: Yongjie Ren <yongjie.ren@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:32 +02:00
Gleb Natapov 63fbf59f8a nVMX: reset rflags register cache during nested vmentry.
During nested vmentry into vm86 mode a vcpu state is found to be incorrect
because rflags does not have VM flag set since it is read from the cache
and has L1's value instead of L2's. If emulate_invalid_guest_state=1 L0
KVM tries to emulate it, but emulation does not work for nVMX and it
never should happen anyway. Fix that by using vmx_set_rflags() to set
rflags during nested vmentry which takes care of updating register cache.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-29 09:04:22 +02:00
Paolo Bonzini ac0a48c39a KVM: x86: rename EMULATE_DO_MMIO
The next patch will reuse it for other userspace exits than MMIO,
namely debug events.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-29 09:01:14 +02:00
Arthur Chunqi Li 21feb4eb64 KVM: nVMX: Set segment infomation of L1 when L2 exits
When L2 exits to L1, segment infomations of L1 are not set correctly.
According to Intel SDM 27.5.2(Loading Host Segment and Descriptor
Table Registers), segment base/limit/access right of L1 should be
set to some designed value when L2 exits to L1. This patch fixes
this.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Reviewed-by: Gleb Natapov <gnatapov@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:31 +02:00
Nadav Har'El b3897a49e2 KVM: nVMX: Fix read/write to MSR_IA32_FEATURE_CONTROL
Fix read/write to IA32_FEATURE_CONTROL MSR in nested environment.

This patch simulate this MSR in nested_vmx and the default value is
0x0. BIOS should set it to 0x5 before VMXON. After setting the lock
bit, write to it will cause #GP(0).

Another QEMU patch is also needed to handle emulation of reset
and migration. Reset to vCPU should clear this MSR and migration
should reserve value of it.

This patch is based on Nadav's previous commit.
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/88478

Signed-off-by: Nadav Har'El <nyh@math.technion.ac.il>
Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:29 +02:00
Mathias Krause c2bae89394 KVM: VMX: Use proper types to access const arrays
Use a const pointer type instead of casting away the const qualifier
from const arrays. Keep the pointer array on the stack, nonetheless.
Making it static just increases the object size.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:28 +02:00
Arthur Chunqi Li a25eb114d5 KVM: nVMX: Set success rflags when emulate VMXON/VMXOFF in nested virt
Set rflags after successfully emulateing VMXON/VMXOFF in VMX.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:27 +02:00