Commit Graph

4427 Commits

Author SHA1 Message Date
David Hildenbrand 42cf014d38 KVM: nVMX: kmap() can't fail
kmap() can't fail, therefore it will always return a valid pointer. Let's
just get rid of the unnecessary checks.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-07 18:16:44 +01:00
Radim Krčmář 00c87e9a70 KVM: x86: do not save guest-unsupported XSAVE state
Saving unsupported state prevents migration when the new host does not
support a XSAVE feature of the original host, even if the feature is not
exposed to the guest.

We've masked host features with guest-visible features before, with
4344ee981e ("KVM: x86: only copy XSAVE state for the supported
features") and dropped it when implementing XSAVES.  Do it again.

Fixes: df1daba7d1 ("KVM: x86: support XSAVES usage in the host")
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-02-03 18:43:08 +01:00
Frederic Weisbecker 5613fda9a5 sched/cputime: Convert task/group cputime to nsecs
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:49 +01:00
Junaid Shahid d3e328f2cb kvm: x86: mmu: Verify that restored PTE has needed perms in fast page fault
Before fast page fault restores an access track PTE back to a regular PTE,
it now also verifies that the restored PTE would grant the necessary
permissions for the faulting access to succeed. If not, it falls back
to the slow page fault path.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:40 +01:00
Junaid Shahid d162f30a7c kvm: x86: mmu: Move pgtbl walk inside retry loop in fast_page_fault
Redo the page table walk in fast_page_fault when retrying so that we are
working on the latest PTE even if the hierarchy changes.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:40 +01:00
Junaid Shahid 20d65236d0 kvm: x86: mmu: Update comment in mark_spte_for_access_track
Reword the comment to hopefully make it more clear.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:39 +01:00
Junaid Shahid 312b616b30 kvm: x86: mmu: Set SPTE_SPECIAL_MASK within mmu.c
Instead of the caller including the SPTE_SPECIAL_MASK in the masks being
supplied to kvm_mmu_set_mmio_spte_mask() and kvm_mmu_set_mask_ptes(),
those functions now themselves include the SPTE_SPECIAL_MASK.

Note that bit 63 is now reset in the default MMIO mask.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:39 +01:00
Junaid Shahid ab22a4733f kvm: x86: mmu: Rename EPT_VIOLATION_READ/WRITE/INSTR constants
Rename the EPT_VIOLATION_READ/WRITE/INSTR constants to
EPT_VIOLATION_ACC_READ/WRITE/INSTR to more clearly indicate that these
signify the type of the memory access as opposed to the permissions
granted by the PTE.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:38 +01:00
Jim Mattson 0b4c208d44 Revert "KVM: nested VMX: disable perf cpuid reporting"
This reverts commit bc6134942d.

A CPUID instruction executed in VMX non-root mode always causes a
VM-exit, regardless of the leaf being queried.

Fixes: bc6134942d ("KVM: nested VMX: disable perf cpuid reporting")
Signed-off-by: Jim Mattson <jmattson@google.com>
[The issue solved by bc6134942d has been resolved with ff651cb613
 ("KVM: nVMX: Add nested msr load/restore algorithm").]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-20 22:18:55 +01:00
Piotr Luc a17f32270a kvm: x86: Expose Intel VPOPCNTDQ feature to guest
Vector population count instructions for dwords and qwords are to be
used in future Intel Xeon & Xeon Phi processors. The bit 14 of
CPUID[level:0x07, ECX] indicates that the new instructions are
supported by a processor.

The spec can be found in the Intel Software Developer Manual (SDM)
or in the Instruction Set Extensions Programming Reference (ISE).

Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-17 17:55:18 +01:00
Radim Krčmář a9ff720e0f Merge branch 'x86/cpufeature' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
For AVX512_VPOPCNTDQ.
2017-01-17 17:53:01 +01:00
Dmitry Vyukov ce2e852ecc KVM: x86: fix fixing of hypercalls
emulator_fix_hypercall() replaces hypercall with vmcall instruction,
but it does not handle GP exception properly when writes the new instruction.
It can return X86EMUL_PROPAGATE_FAULT without setting exception information.
This leads to incorrect emulation and triggers
WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn()
as discovered by syzkaller fuzzer:

WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558
Call Trace:
 warn_slowpath_null+0x2c/0x40 kernel/panic.c:582
 x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572
 x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618
 emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline]
 handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762
 vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625
 vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline]
 vcpu_run arch/x86/kvm/x86.c:6947 [inline]

Set exception information when write in emulator_fix_hypercall() fails.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: kvm@vger.kernel.org
Cc: syzkaller@googlegroups.com
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-17 15:06:05 +01:00
Paolo Bonzini 33ab91103b KVM: x86: fix emulation of "MOV SS, null selector"
This is CVE-2017-2583.  On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.

The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.

Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.

Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd
Cc: stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 15:17:13 +01:00
Wanpeng Li 546d87e5c9 KVM: x86: fix NULL deref in vcpu_scan_ioapic
Reported by syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
    IP: _raw_spin_lock+0xc/0x30
    PGD 3e28eb067
    PUD 3f0ac6067
    PMD 0
    Oops: 0002 [#1] SMP
    CPU: 0 PID: 2431 Comm: test Tainted: G           OE   4.10.0-rc1+ #3
    Call Trace:
     ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
     kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
     ? pick_next_task_fair+0xe1/0x4e0
     ? kvm_arch_vcpu_load+0xea/0x260 [kvm]
     kvm_vcpu_ioctl+0x33a/0x600 [kvm]
     ? hrtimer_try_to_cancel+0x29/0x130
     ? do_nanosleep+0x97/0xf0
     do_vfs_ioctl+0xa1/0x5d0
     ? __hrtimer_init+0x90/0x90
     ? do_nanosleep+0x5b/0xf0
     SyS_ioctl+0x79/0x90
     do_syscall_64+0x6e/0x180
     entry_SYSCALL64_slow_path+0x25/0x25
    RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0

The syzkaller folks reported a NULL pointer dereference due to
ENABLE_CAP succeeding even without an irqchip.  The Hyper-V
synthetic interrupt controller is activated, resulting in a
wrong request to rescan the ioapic and a NULL pointer dereference.

    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <linux/kvm.h>
    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef KVM_CAP_HYPERV_SYNIC
    #define KVM_CAP_HYPERV_SYNIC 123
    #endif

    void* thr(void* arg)
    {
	struct kvm_enable_cap cap;
	cap.flags = 0;
	cap.cap = KVM_CAP_HYPERV_SYNIC;
	ioctl((long)arg, KVM_ENABLE_CAP, &cap);
	return 0;
    }

    int main()
    {
	void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	int kvmfd = open("/dev/kvm", 0);
	int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
	struct kvm_userspace_memory_region memreg;
	memreg.slot = 0;
	memreg.flags = 0;
	memreg.guest_phys_addr = 0;
	memreg.memory_size = 0x1000;
	memreg.userspace_addr = (unsigned long)host_mem;
	host_mem[0] = 0xf4;
	ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
	int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
	struct kvm_sregs sregs;
	ioctl(cpufd, KVM_GET_SREGS, &sregs);
	sregs.cr0 = 0;
	sregs.cr4 = 0;
	sregs.efer = 0;
	sregs.cs.selector = 0;
	sregs.cs.base = 0;
	ioctl(cpufd, KVM_SET_SREGS, &sregs);
	struct kvm_regs regs = { .rflags = 2 };
	ioctl(cpufd, KVM_SET_REGS, &regs);
	ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
	pthread_t th;
	pthread_create(&th, 0, thr, (void*)(long)cpufd);
	usleep(rand() % 10000);
	ioctl(cpufd, KVM_RUN, 0);
	pthread_join(th, 0);
	return 0;
    }

This patch fixes it by failing ENABLE_CAP if without an irqchip.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 5c919412fe (kvm/x86: Hyper-V synthetic interrupt controller)
Cc: stable@vger.kernel.org # 4.5+
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:52:52 +01:00
Steve Rutherford 129a72a0d3 KVM: x86: Introduce segmented_write_std
Introduces segemented_write_std.

Switches from emulated reads/writes to standard read/writes in fxsave,
fxrstor, sgdt, and sidt.  This fixes CVE-2017-2584, a longstanding
kernel memory leak.

Since commit 283c95d0e3 ("KVM: x86: emulate FXSAVE and FXRSTOR",
2016-11-09), which is luckily not yet in any final release, this would
also be an exploitable kernel memory *write*!

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 96051572c8
Fixes: 283c95d0e3
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:34:58 +01:00
David Matlack cef84c302f KVM: x86: flush pending lapic jump label updates on module unload
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.

Use the new static_key_deferred_flush() API to flush pending updates on
module unload.

Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:33:17 +01:00
Jim Mattson 21e7fbe7db kvm: nVMX: Reorder error checks for emulated VMXON
Checks on the operand to VMXON are performed after the check for
legacy mode operation and the #GP checks, according to the pseudo-code
in Intel's SDM.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-09 14:48:04 +01:00
Paolo Bonzini 4d82d12b39 KVM: lapic: do not scan IRR when delivering an interrupt
On interrupt delivery the PPR can only grow (except for auto-EOI),
so it is impossible that non-auto-EOI interrupt delivery results
in KVM_REQ_EVENT.  We can therefore use __apic_update_ppr.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:48:03 +01:00
Paolo Bonzini 26fbbee581 KVM: lapic: do not set KVM_REQ_EVENT unnecessarily on PPR update
On PPR update, we set KVM_REQ_EVENT unconditionally anytime PPR is lowered.
But we can take into account IRR here already.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:48:02 +01:00
Paolo Bonzini b3c045d332 KVM: lapic: remove unnecessary KVM_REQ_EVENT on PPR update
PPR needs to be updated whenever on every IRR read because we
may have missed TPR writes that _increased_ PPR.  However, these
writes need not generate KVM_REQ_EVENT, because either KVM_REQ_EVENT
has been set already in __apic_accept_irq, or we are going to
process the interrupt right away.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:48:01 +01:00
Paolo Bonzini eb90f3417a KVM: vmx: speed up TPR below threshold vmexits
Since we're already in VCPU context, all we have to do here is recompute
the PPR value.  That will in turn generate a KVM_REQ_EVENT if necessary.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:48:00 +01:00
Paolo Bonzini 0f1e261ead KVM: x86: add VCPU stat for KVM_REQ_EVENT processing
This statistic can be useful to estimate the cost of an IRQ injection
scenario, by comparing it with irq_injections.  For example the stat
shows that sti;hlt triggers more KVM_REQ_EVENT than sti;nop.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:59 +01:00
Tom Lendacky 0f89b207b0 kvm: svm: Use the hardware provided GPA instead of page walk
When a guest causes a NPF which requires emulation, KVM sometimes walks
the guest page tables to translate the GVA to a GPA. This is unnecessary
most of the time on AMD hardware since the hardware provides the GPA in
EXITINFO2.

The only exception cases involve string operations involving rep or
operations that use two memory locations. With rep, the GPA will only be
the value of the initial NPF and with dual memory locations we won't know
which memory address was translated into EXITINFO2.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:58 +01:00
Radim Krčmář 5bd5db385b KVM: x86: allow hotplug of VCPU with APIC ID over 0xff
LAPIC after reset is in xAPIC mode, which poses a problem for hotplug of
VCPUs with high APIC ID, because reset VCPU is waiting for INIT/SIPI,
but there is no way to uniquely address it using xAPIC.

From many possible options, we chose the one that also works on real
hardware: accepting interrupts addressed to LAPIC's x2APIC ID even in
xAPIC mode.

KVM intentionally differs from real hardware, because real hardware
(Knights Landing) does just "x2apic_id & 0xff" to decide whether to
accept the interrupt in xAPIC mode and it can deliver one interrupt to
more than one physical destination, e.g. 0x123 to 0x123 and 0x23.

Fixes: 682f732ecf ("KVM: x86: bump MAX_VCPUS to 288")
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:48 +01:00
Radim Krčmář b4535b58ae KVM: x86: make interrupt delivery fast and slow path behave the same
Slow path tried to prevent IPIs from x2APIC VCPUs from being delivered
to xAPIC VCPUs and vice-versa.  Make slow path behave like fast path,
which never distinguished that.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:40 +01:00
Radim Krčmář 6e50043912 KVM: x86: replace kvm_apic_id with kvm_{x,x2}apic_id
There were three calls sites:
 - recalculate_apic_map and kvm_apic_match_physical_addr, where it would
   only complicate implementation of x2APIC hotplug;
 - in apic_debug, where it was still somewhat preserved, but keeping the
   old function just for apic_debug was not worth it

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:29 +01:00
Radim Krčmář f98a3efb28 KVM: x86: use delivery to self in hyperv synic
Interrupt to self can be sent without knowing the APIC ID.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:13 +01:00
Junaid Shahid f160c7b7bb kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:11 +01:00
Junaid Shahid 37f0e8fe6b kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs
MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
Since MMIO SPTEs use an EPT misconfiguration, so using bit 63 for them is
acceptable. However, the upcoming fast access tracking feature adds another
type of special tracking PTE, which uses not-Present PTEs and hence should
not set bit 63.

In order to use common bits to distinguish both type of special PTEs, we
now use only bit 62 as the special bit.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:10 +01:00
Junaid Shahid f39a058d0e kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update
mmu_spte_update() tracks changes in the accessed/dirty state of
the SPTE being updated and calls kvm_set_pfn_accessed/dirty
appropriately. However, in some cases (e.g. when aging the SPTE),
this shouldn't be done. mmu_spte_update_no_track() is introduced
for use in such cases.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:09 +01:00
Junaid Shahid 83ef6c8155 kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear
This simplifies mmu_spte_update() a little bit.
The checks for clearing of accessed and dirty bits are refactored into
separate functions, which are used inside both mmu_spte_update() and
mmu_spte_clear_track_bits(), as well as kvm_test_age_rmapp(). The new
helper functions handle both the case when A/D bits are supported in
hardware and the case when they are not.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:08 +01:00
Junaid Shahid 97dceba29a kvm: x86: mmu: Fast Page Fault path retries
This change adds retries into the Fast Page Fault path. Without the
retries, the code still works, but if a retry does end up being needed,
then it will result in a second page fault for the same memory access,
which will cause much more overhead compared to just retrying within the
original fault.

This would be especially useful with the upcoming fast access tracking
change, as that would make it more likely for retries to be needed
(e.g. due to read and write faults happening on different CPUs at
the same time).

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:07 +01:00
Junaid Shahid ea4114bcd3 kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
This change renames spte_is_locklessly_modifiable() to
spte_can_locklessly_be_made_writable() to distinguish it from other
forms of lockless modifications. The full set of lockless modifications
is covered by spte_has_volatile_bits().

Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:06 +01:00
Junaid Shahid 27959a4415 kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
This change adds some symbolic constants for VM Exit Qualifications
related to EPT Violations and updates handle_ept_violation() to use
these constants instead of hard-coded numbers.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:05 +01:00
David Matlack 114df303a7 kvm: x86: reduce collisions in mmu_page_hash
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs), becomes imbalanced; with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.

The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadow. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.

hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):

hash            max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits     49847
hash_64         105
perfect         97

While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).

Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").

Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:04 +01:00
David Matlack f3414bc774 kvm: x86: export maximum number of mmu_page_hash collisions
Report the maximum number of mmu_page_hash collisions as a per-VM stat.
This will make it easy to identify problems with the mmu_page_hash in
the future.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:03 +01:00
Radim Krčmář 826da32140 KVM: x86: simplify conditions with split/kernel irqchip
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:45:57 +01:00
Radim Krčmář 8231f50d98 KVM: x86: prevent setup of invalid routes
The check in kvm_set_pic_irq() and kvm_set_ioapic_irq() was just a
temporary measure until the code improved enough for us to do this.

This changes APIC in a case when KVM_SET_GSI_ROUTING is called to set up pic
and ioapic routes before KVM_CREATE_IRQCHIP.  Those rules would get overwritten
by KVM_CREATE_IRQCHIP at best, so it is pointless to allow it.  Userspaces
hopefully noticed that things don't work if they do that and don't do that.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:45:50 +01:00
Radim Krčmář e5dc48777d KVM: x86: refactor pic setup in kvm_set_routing_entry
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:45:36 +01:00
Radim Krčmář 099413664c KVM: x86: make pic setup code look like ioapic setup
We don't treat kvm->arch.vpic specially anymore, so the setup can look
like ioapic.  This gets a bit more information out of return values.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:45:28 +01:00
Radim Krčmář 49776faf93 KVM: x86: decouple irqchip_in_kernel() and pic_irqchip()
irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
just complicated the code.
Add a separate state for the irqchip mode.

Reviewed-by: David Hildenbrand <david@redhat.com>
[Used Paolo's version of condition in irqchip_in_kernel().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-09 14:42:47 +01:00
Radim Krčmář 35e6eaa3df KVM: x86: don't allow kernel irqchip with split irqchip
Split irqchip cannot be created after creating the kernel irqchip, but
we forgot to restrict the other way.  This is an API change.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:38:19 +01:00
Linus Torvalds 08289086b0 KVM fixes for v4.10-rc3
MIPS: (both for stable)
  - fix host kernel crashes when receiving a signal with 64-bit userspace
  - flush instruction cache on all vcpus after generating entry code
 
 x86:
  - fix NULL dereference in MMU caused by SMM transitions (for stable)
  - correct guest instruction pointer after emulating some VMX errors
  - minor cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJYb/N7AAoJEED/6hsPKofoa4QH/0/jwHr64lFeiOzMxqZfTF0y
 wufcTqw3zGq5iPaNlEwn+6AkKnTq2IPws92FludfPHPb7BrLUPqrXxRlSRN+XPVw
 pHVcV9u0q4yghMi7/6Flu3JASnpD6PrPZ7ezugZwgXFrR7pewd/+sTq6xBUnI9rZ
 nNEYsfh8dYiBicxSGXlmZcHLuJJHKshjsv9F6ngyBGXAAf/F+nLiJReUzPO0m2+P
 gmXi5zhVu6z05zlaCW1KAmJ1QV1UJla1vZnzrnK3twRK/05l7YX+xCbHIo1wB03R
 2YhKDnSrnG3Zt+KpXfRhADXazNgM5ASvORdvI6RvjLNVxlnOveQtAcfRyvZezT4=
 =LXLf
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Radim Krčmář:
 "MIPS:
   - fix host kernel crashes when receiving a signal with 64-bit
     userspace

   - flush instruction cache on all vcpus after generating entry code

     (both for stable)

  x86:
   - fix NULL dereference in MMU caused by SMM transitions (for stable)

   - correct guest instruction pointer after emulating some VMX errors

   - minor cleanup"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: remove duplicated declaration
  KVM: MIPS: Flush KVM entry code from icache globally
  KVM: MIPS: Don't clobber CP0_Status.UX
  KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS
  KVM: nVMX: fix instruction skipping during emulated vm-entry
2017-01-06 15:27:17 -08:00
Jan Dakinevich 69130ea1e6 KVM: VMX: remove duplicated declaration
Declaration of VMX_VPID_EXTENT_SUPPORTED_MASK occures twice in the code.
Probably, it was happened after unsuccessful merge.

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-05 15:08:48 +01:00
Linus Torvalds 3ddc76dfc7 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer type cleanups from Thomas Gleixner:
 "This series does a tree wide cleanup of types related to
  timers/timekeeping.

   - Get rid of cycles_t and use a plain u64. The type is not really
     helpful and caused more confusion than clarity

   - Get rid of the ktime union. The union has become useless as we use
     the scalar nanoseconds storage unconditionally now. The 32bit
     timespec alike storage got removed due to the Y2038 limitations
     some time ago.

     That leaves the odd union access around for no reason. Clean it up.

  Both changes have been done with coccinelle and a small amount of
  manual mopping up"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ktime: Get rid of ktime_equal()
  ktime: Cleanup ktime_set() usage
  ktime: Get rid of the union
  clocksource: Use a plain u64 instead of cycle_t
2016-12-25 14:30:04 -08:00
Thomas Gleixner 8b0e195314 ktime: Cleanup ktime_set() usage
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Thomas Gleixner a5a1d1c291 clocksource: Use a plain u64 instead of cycle_t
There is no point in having an extra type for extra confusion. u64 is
unambiguous.

Conversion was done with the following coccinelle script:

@rem@
@@
-typedef u64 cycle_t;

@fix@
typedef cycle_t;
@@
-cycle_t
+u64

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
2016-12-25 11:04:12 +01:00
Thomas Gleixner 73c1b41e63 cpu/hotplug: Cleanup state names
When the state names got added a script was used to add the extra argument
to the calls. The script basically converted the state constant to a
string, but the cleanup to convert these strings into meaningful ones did
not happen.

Replace all the useless strings with 'subsys/xxx/yyy:state' strings which
are used in all the other places already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-25 10:47:44 +01:00
Xiao Guangrong 6ef4e07ecd KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS
Otherwise, mismatch between the smm bit in hflags and the MMU role
can cause a NULL pointer dereference.

Cc: stable@vger.kernel.org
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-24 10:16:04 +01:00
David Matlack b428018a06 KVM: nVMX: fix instruction skipping during emulated vm-entry
kvm_skip_emulated_instruction() should not be called after emulating
a VM-entry failure during or after loading guest state
(nested_vmx_entry_failure()). Otherwise the L1 hypervisor is resumed
some number of bytes past vmcs->host_rip.

Fixes: eb27756217
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-21 18:55:09 +01:00
Jim Mattson ef85b67385 kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)
When L2 exits to L0 due to "exception or NMI", software exceptions
(#BP and #OF) for which L1 has requested an intercept should be
handled by L1 rather than L0. Previously, only hardware exceptions
were forwarded to L1.

Signed-off-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 16:05:31 +01:00
Andrea Arcangeli cc0d907c09 kvm: take srcu lock around kvm_steal_time_set_preempted()
kvm_memslots() will be called by kvm_write_guest_offset_cached() so
take the srcu lock.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 15:45:15 +01:00
Andrea Arcangeli 931f261b42 kvm: fix schedule in atomic in kvm_steal_time_set_preempted()
kvm_steal_time_set_preempted() isn't disabling the pagefaults before
calling __copy_to_user and the kernel debug notices.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 15:45:14 +01:00
Paolo Bonzini 3f5ad8be37 KVM: hyperv: fix locking of struct kvm_hv fields
Introduce a new mutex to avoid an AB-BA deadlock between kvm->lock and
vcpu->mutex.  Protect accesses in kvm_hv_setup_tsc_page too, as suggested
by Roman.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-16 17:53:38 +01:00
Yi Sun 83781d180b KVM: x86: Expose Intel AVX512IFMA/AVX512VBMI/SHA features to guest.
Expose AVX512IFMA/AVX512VBMI/SHA features to guest.

AVX512 spec can be found at:
https://software.intel.com/sites/default/files/managed/26/40/319433-026.pdf

SHA spec can be found at:
https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

This patch depends on below patch.
http://marc.info/?l=linux-kernel&m=147932800828178&w=2

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-15 15:02:44 +01:00
GanShun 37b9a671f3 kvm: nVMX: Correct a VMX instruction error code for VMPTRLD
When the operand passed to VMPTRLD matches the address of the VMXON
region, the VMX instruction error code should be
VMXERR_VMPTRLD_VMXON_POINTER rather than VMXERR_VMCLEAR_VMXON_POINTER.

Signed-off-by: GanShun <ganshun@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-15 15:02:44 +01:00
Linus Torvalds 93173b5bf2 Small release, the most interesting stuff is x86 nested virt improvements.
x86: userspace can now hide nested VMX features from guests; nested
 VMX can now run Hyper-V in a guest; support for AVX512_4VNNIW and
 AVX512_FMAPS in KVM; infrastructure support for virtual Intel GPUs.
 
 PPC: support for KVM guests on POWER9; improved support for interrupt
 polling; optimizations and cleanups.
 
 s390: two small optimizations, more stuff is in flight and will be
 in 4.11.
 
 ARM: support for the GICv3 ITS on 32bit platforms.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQExBAABCAAbBQJYTkP0FBxwYm9uemluaUByZWRoYXQuY29tAAoJEL/70l94x66D
 lZIH/iT1n9OQXcuTpYYnQhuCenzI3GZZOIMTbCvK2i5bo0FIJKxVn0EiAAqZSXvO
 nO185FqjOgLuJ1AD1kJuxzye5suuQp4HIPWWgNHcexLuy43WXWKZe0IQlJ4zM2Xf
 u31HakpFmVDD+Cd1qN3yDXtDrRQ79/xQn2kw7CWb8olp+pVqwbceN3IVie9QYU+3
 gCz0qU6As0aQIwq2PyalOe03sO10PZlm4XhsoXgWPG7P18BMRhNLTDqhLhu7A/ry
 qElVMANT7LSNLzlwNdpzdK8rVuKxETwjlc1UP8vSuhrwad4zM2JJ1Exk26nC2NaG
 D0j4tRSyGFIdx6lukZm7HmiSHZ0=
 =mkoB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Small release, the most interesting stuff is x86 nested virt
  improvements.

  x86:
   - userspace can now hide nested VMX features from guests
   - nested VMX can now run Hyper-V in a guest
   - support for AVX512_4VNNIW and AVX512_FMAPS in KVM
   - infrastructure support for virtual Intel GPUs.

  PPC:
   - support for KVM guests on POWER9
   - improved support for interrupt polling
   - optimizations and cleanups.

  s390:
   - two small optimizations, more stuff is in flight and will be in
     4.11.

  ARM:
   - support for the GICv3 ITS on 32bit platforms"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
  arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest
  KVM: arm/arm64: timer: Check for properly initialized timer on init
  KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs
  KVM: x86: Handle the kthread worker using the new API
  KVM: nVMX: invvpid handling improvements
  KVM: nVMX: check host CR3 on vmentry and vmexit
  KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
  KVM: nVMX: propagate errors from prepare_vmcs02
  KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
  KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
  KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
  KVM: nVMX: support restore of VMX capability MSRs
  KVM: nVMX: generate non-true VMX MSRs based on true versions
  KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
  KVM: x86: Add kvm_skip_emulated_instruction and use it.
  KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
  KVM: VMX: Reorder some skip_emulated_instruction calls
  KVM: x86: Add a return value to kvm_emulate_cpuid
  KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
  ...
2016-12-13 15:47:02 -08:00
Linus Torvalds 9439b3710d Main pull request for drm for 4.10 kernel
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYT3qqAAoJEAx081l5xIa+dLMP/2dqBybSAeWlPmAwVenIHRtS
 KFNktISezFSY/LBcIP2mHkFJmjTKBMZFxWnyEJL9NmFUD1cS2WMyNnC1282h/+rD
 +P8Bsmzmt/daV4UTFxVDpzlmVlavAyakNi6FnSQfAfmf+3PB1yzU3gn8ld9pU/if
 h7KEp9fDn9eYZreTRfCUloI2yoVpD9d0DG3uaGDN/N0kGUnCC6TZT5ig5j2JO016
 fYf/DqoYAk3ItWF9WK/uG7qJIGi37afCpQq+kbSSJk+p3HjJqu8JUe9jzqYdl7j9
 26TGSY5o9WLhZkxDgbcCIJzcFJhMmXgMdhjil9lqaHmnNG5FPFU7g8DK1CZqbel9
 m8+aRPn1EgxIahMgdl8NblW1pfO2Kco0tZmoP5vXx1uqhivd67h0hiQqp66WxOJd
 i2yMLncaCEv8M161CVEgtzuI5a7nCfaZv7J9ArzbkD/huBwu51IZgTs7Dz4njgvz
 VPB5FBTB/ZYteErUNoh6gjF0hLngWvvJSPvuzT+EFO7yypek0IJ28GTdbxYSP+jR
 13697s5Itigf/D3KUdRRGsWRzyVVN9n+djkl//sy5ddL9eOlKSKEga4ujOUjTWaW
 hTvAxpK9GmJS/Iun5jIP6f75zDbi+e8FWUeB/OI2lPtnApaSKdXBTPXsco2RnTEV
 +G6XrH8IMEIsTxOk7hWU
 =7s/c
 -----END PGP SIGNATURE-----

Merge tag 'drm-for-v4.10' of git://people.freedesktop.org/~airlied/linux

Pull drm updates from Dave Airlie:
 "This is the main pull request for drm for 4.10 kernel.

  New drivers:
   - ZTE VOU display driver (zxdrm)
   - Amlogic Meson Graphic Controller GXBB/GXL/GXM SoCs (meson)
   - MXSFB support (mxsfb)

  Core:
   - Format handling has been reworked
   - Better atomic state debugging
   - drm_mm leak debugging
   - Atomic explicit fencing support
   - fbdev helper ops
   - Documentation updates
   - MST fbcon fixes

  Bridge:
   - Silicon Image SiI8620 driver

  Panel:
   - Add support for new simple panels

  i915:
   - GVT Device model
   - Better HDMI2.0 support on skylake
   - More watermark fixes
   - GPU idling rework for suspend/resume
   - DP Audio workarounds
   - Scheduler prep-work
   - Opregion CADL handling
   - GPU scheduler and priority boosting

  amdgfx/radeon:
   - Support for virtual devices
   - New VM manager for non-contig VRAM buffers
   - UVD powergating
   - SI register header cleanup
   - Cursor fixes
   - Powermanagement fixes

  nouveau:
   - Powermangement reworks for better voltage/clock changes
   - Atomic modesetting support
   - Displayport Multistream (MST) support.
   - GP102/104 hang and cursor fixes
   - GP106 support

  hisilicon:
   - hibmc support (BMC chip for aarch64 servers)

  armada:
   - add tracing support for overlay change
   - refactor plane support
   - de-midlayer the driver

  omapdrm:
   - Timing code cleanups

  rcar-du:
   - R8A7792/R8A7796 support
   - Misc fixes.

  sunxi:
   - A31 SoC display engine support

  imx-drm:
   - YUV format support
   - Cleanup plane atomic update

  mali-dp:
   - Misc fixes

  dw-hdmi:
   - Add support for HDMI i2c master controller

  tegra:
   - IOMMU support fixes
   - Error handling fixes

  tda998x:
   - Fix connector registration
   - Improved robustness
   - Fix infoframe/audio compliance

  virtio:
   - fix busid issues
   - allocate more vbufs

  qxl:
   - misc fixes and cleanups.

  vc4:
   - Fragment shader threading
   - ETC1 support
   - VEC (tv-out) support

  msm:
   - A5XX GPU support
   - Lots of atomic changes

  tilcdc:
   - Misc fixes and cleanups.

  etnaviv:
   - Fix dma-buf export path
   - DRAW_INSTANCED support
   - fix driver on i.MX6SX

  exynos:
   - HDMI refactoring

  fsl-dcu:
   - fbdev changes"

* tag 'drm-for-v4.10' of git://people.freedesktop.org/~airlied/linux: (1343 commits)
  drm/nouveau/kms/nv50: fix atomic regression on original G80
  drm/nouveau/bl: Do not register interface if Apple GMUX detected
  drm/nouveau/bl: Assign different names to interfaces
  drm/nouveau/bios/dp: fix handling of LevelEntryTableIndex on DP table 4.2
  drm/nouveau/ltc: protect clearing of comptags with mutex
  drm/nouveau/gr/gf100-: handle GPC/TPC/MPC trap
  drm/nouveau/core: recognise GP106 chipset
  drm/nouveau/ttm: wait for bo fence to signal before unmapping vmas
  drm/nouveau/gr/gf100-: FECS intr handling is not relevant on proprietary ucode
  drm/nouveau/gr/gf100-: properly ack all FECS error interrupts
  drm/nouveau/fifo/gf100-: recover from host mmu faults
  drm: Add fake controlD* symlinks for backwards compat
  drm/vc4: Don't use drm_put_dev
  drm/vc4: Document VEC DT binding
  drm/vc4: Add support for the VEC (Video Encoder) IP
  drm: Add TV connector states to drm_connector_state
  drm: Turn DRM_MODE_SUBCONNECTOR_xx definitions into an enum
  drm/vc4: Fix ->clock_select setting for the VEC encoder
  drm/amdgpu/dce6: Set MASTER_UPDATE_MODE to 0 in resume_mc_access as well
  drm/amdgpu: use pin rather than pin_restricted in a few cases
  ...
2016-12-13 09:35:09 -08:00
Linus Torvalds 518bacf5a5 Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - do a large round of simplifications after all CPUs do 'eager' FPU
     context switching in v4.9: remove CR0 twiddling, remove leftover
     eager/lazy bts, etc (Andy Lutomirski)

   - more FPU code simplifications: remove struct fpu::counter, clarify
     nomenclature, remove unnecessary arguments/functions and better
     structure the code (Rik van Riel)"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Remove clts()
  x86/fpu: Remove stts()
  x86/fpu: Handle #NM without FPU emulation as an error
  x86/fpu, lguest: Remove CR0.TS support
  x86/fpu, kvm: Remove host CR0.TS manipulation
  x86/fpu: Remove irq_ts_save() and irq_ts_restore()
  x86/fpu: Stop saving and restoring CR0.TS in fpu__init_check_bugs()
  x86/fpu: Get rid of two redundant clts() calls
  x86/fpu: Finish excising 'eagerfpu'
  x86/fpu: Split old_fpu & new_fpu handling into separate functions
  x86/fpu: Remove 'cpu' argument from __cpu_invalidate_fpregs_state()
  x86/fpu: Split old & new FPU code paths
  x86/fpu: Remove __fpregs_(de)activate()
  x86/fpu: Rename lazy restore functions to "register state valid"
  x86/fpu, kvm: Remove KVM vcpu->fpu_counter
  x86/fpu: Remove struct fpu::counter
  x86/fpu: Remove use_eager_fpu()
  x86/fpu: Remove the XFEATURE_MASK_EAGER/LAZY distinction
  x86/fpu: Hard-disable lazy FPU mode
  x86/crypto, x86/fpu: Remove X86_FEATURE_EAGER_FPU #ifdef from the crc32c code
2016-12-12 14:27:49 -08:00
Ingo Molnar 6f38751510 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:07:13 +01:00
Petr Mladek 36da91bdf5 KVM: x86: Handle the kthread worker using the new API
Use the new API to create and destroy the "kvm-pit" kthread
worker. The API hides some implementation details.

In particular, kthread_create_worker() allocates and initializes
struct kthread_worker. It runs the kthread the right way
and stores task_struct into the worker structure.

kthread_destroy_worker() flushes all pending works, stops
the kthread and frees the structure.

This patch does not change the existing behavior except for
dynamically allocating struct kthread_worker and storing
only the pointer of this structure.

It is compile tested only because I did not find an easy
way how to run the code. Well, it should be pretty safe
given the nature of the change.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Message-Id: <1476877847-11217-1-git-send-email-pmladek@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-08 15:31:11 +01:00
Jan Dakinevich 16c2aec6a2 KVM: nVMX: invvpid handling improvements
- Expose all invalidation types to the L1

 - Reject invvpid instruction, if L1 passed zero vpid value to single
   context invalidations

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-08 15:31:11 +01:00
Ladi Prosek 1dc35dacc1 KVM: nVMX: check host CR3 on vmentry and vmexit
This commit adds missing host CR3 checks. Before entering guest mode, the value
of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
called to set the new CR3 value and check and load PDPTRs.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:10 +01:00
Ladi Prosek 9ed38ffad4 KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.

* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary

This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:10 +01:00
Ladi Prosek ee146c1c10 KVM: nVMX: propagate errors from prepare_vmcs02
It is possible that prepare_vmcs02 fails to load the guest state. This
patch adds the proper error handling for such a case. L1 will receive
an INVALID_STATE vmexit with the appropriate exit qualification if it
happens.

A failure to set guest CR3 is the only error propagated from prepare_vmcs02
at the moment.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:09 +01:00
Ladi Prosek 7ca29de213 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
(see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
related issues:

1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
overwritten in ept_load_pdptrs because the registers are believed to have
been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
running with PAE enabled but PDPTRs have been set up by L1.

2) When L2 is about to enable paging and loads its CR3, we, again, attempt
to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
that this will succeed (it's just a CR3 load, paging is not enabled yet) and
if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
then lost and L2 crashes right after it enables paging.

This patch replaces the kvm_set_cr3 call with a simple register write if PAE
and EPT are both on. CR3 is not to be interpreted in this case.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:09 +01:00
David Matlack 5a6a9748b4 KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
is more faithful to VMCS12.

This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
"IA-32e mode guest" would silently be disabled by KVM.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:08 +01:00
David Matlack 8322ebbb24 KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
MSR_IA32_CR{0,4}_FIXED1 define which bits in CR0 and CR4 are allowed to
be 1 during VMX operation. Since the set of allowed-1 bits is the same
in and out of VMX operation, we can generate these MSRs entirely from
the guest's CPUID. This lets userspace avoiding having to save/restore
these MSRs.

This patch also initializes MSR_IA32_CR{0,4}_FIXED1 from the CPU's MSRs
by default. This is a saner than the current default of -1ull, which
includes bits that the host CPU does not support.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:08 +01:00
David Matlack 3899152ccb KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
all CR0 and CR4 bits are allowed to be 1 during VMX operation.

This does not match real hardware, which disallows the high 32 bits of
CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
which are defined in the SDM but missing according to CPUID). A guest
can induce a VM-entry failure by setting these bits in GUEST_CR0 and
GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
valid.

Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
checks on these registers do not verify must-be-0 bits. Fix these checks
to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.

This patch should introduce no change in behavior in KVM, since these
MSRs are still -1ULL.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:07 +01:00
David Matlack 62cc6b9dc6 KVM: nVMX: support restore of VMX capability MSRs
The VMX capability MSRs advertise the set of features the KVM virtual
CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to addresses both sources of
differences, allowing VMs to be migrated across CPUs and KVM versions
without guest-visible changes to these MSRs. Note that cross-KVM-
version migration is only supported from this point forward.

When the VMX capability MSRs are restored, they are audited to check
that the set of features advertised are a subset of what KVM and the
CPU support.

Since the VMX capability MSRs are read-only, they do not need to be on
the default MSR save/restore lists. The userspace hypervisor can set
the values of these MSRs or read them from KVM at VCPU creation time,
and restore the same value after every save/restore.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:07 +01:00
David Matlack 0115f9cbac KVM: nVMX: generate non-true VMX MSRs based on true versions
The "non-true" VMX capability MSRs can be generated from their "true"
counterparts, by OR-ing the default1 bits. The default1 bits are fixed
and defined in the SDM.

Since we can generate the non-true VMX MSRs from the true versions,
there's no need to store both in struct nested_vmx. This also lets
userspace avoid having to restore the non-true MSRs.

Note this does not preclude emulating MSR_IA32_VMX_BASIC[55]=0. To do so,
we simply need to set all the default1 bits in the true MSRs (such that
the true MSRs and the generated non-true MSRs are equal).

Signed-off-by: David Matlack <dmatlack@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:06 +01:00
Kyle Huey ea07e42dec KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
The trap flag stays set until software clears it.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:06 +01:00
Kyle Huey 6affcbedca KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.

Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:

- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
  MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
  KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
  IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
  take precedence. The singlestep will be triggered again on the next
  instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
  the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
  which do not trigger singlestep traps as mentioned previously.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:05 +01:00
Kyle Huey eb27756217 KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
We can't return both the pass/fail boolean for the vmcs and the upcoming
continue/exit-to-userspace boolean for skip_emulated_instruction out of
nested_vmx_check_vmcs, so move skip_emulated_instruction out of it instead.

Additionally, VMENTER/VMRESUME only trigger singlestep exceptions when
they advance the IP to the following instruction, not when they a) succeed,
b) fail MSR validation or c) throw an exception. Add a separate call to
skip_emulated_instruction that will later not be converted to the variant
that checks the singlestep flag.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:04 +01:00
Kyle Huey 09ca3f2049 KVM: VMX: Reorder some skip_emulated_instruction calls
The functions being moved ahead of skip_emulated_instruction here don't
need updated IPs, and skipping the emulated instruction at the end will
make it easier to return its value.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:04 +01:00
Kyle Huey 6a908b628c KVM: x86: Add a return value to kvm_emulate_cpuid
Once skipping the emulated instruction can potentially trigger an exit to
userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
propagate a return value.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:03 +01:00
Dave Airlie f03ee46be9 Linux 4.9-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYRIGyAAoJEHm+PkMAQRiG2ksH/jwMUT9j6glbwESxbn1YTqTM
 QcBT5AMc7D0wNuidQe0hWZMtG4RbC+4ZhxzZl2wPgA2gueJ+rBnyX7bgtA7ka8ka
 Fdc3u/Q1v38HPzf8iBnxcdCs40VgsoMLjFYCXrpOxuGDNKYzRd+Q8aI2TeGvzbyi
 X8+6oAWifBwo2oA06jfcuUncEWbyDDyK9aQksmfKOpjHdb26yELPEhsPOlds1g7E
 jYLnvUVnU2CoFaumta+rZQ0kzLdc4Ntu0wEao6WzJuQKsgoID+tS/6iudi8cUhDp
 YowGAVoOfr6rAJB0mwrDVfugpamaT3386XKyocdNsK0/jR60UIJ8x+WzvvSU+lY=
 =JTBj
 -----END PGP SIGNATURE-----

Backmerge tag 'v4.9-rc8' into drm-next

Linux 4.9-rc8

Daniel requested this so we could apply some follow on fixes cleanly to -next.
2016-12-05 17:11:48 +10:00
Radim Krčmář df492896e6 KVM: x86: check for pic and ioapic presence before use
Split irqchip allows pic and ioapic routes to be used without them being
created, which results in NULL access.  Check for NULL and avoid it.
(The setup is too racy for a nicer solutions.)

Found by syzkaller:

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 3 PID: 11923 Comm: kworker/3:2 Not tainted 4.9.0-rc5+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: events irqfd_inject
  task: ffff88006a06c7c0 task.stack: ffff880068638000
  RIP: 0010:[...]  [...] __lock_acquire+0xb35/0x3380 kernel/locking/lockdep.c:3221
  RSP: 0000:ffff88006863ea20  EFLAGS: 00010006
  RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000
  RDX: 0000000000000039 RSI: 0000000000000000 RDI: 1ffff1000d0c7d9e
  RBP: ffff88006863ef58 R08: 0000000000000001 R09: 0000000000000000
  R10: 00000000000001c8 R11: 0000000000000000 R12: ffff88006a06c7c0
  R13: 0000000000000001 R14: ffffffff8baab1a0 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004abdd0 CR3: 000000003e2f2000 CR4: 00000000000026e0
  Stack:
   ffffffff894d0098 1ffff1000d0c7d56 ffff88006863ecd0 dffffc0000000000
   ffff88006a06c7c0 0000000000000000 ffff88006863ecf8 0000000000000082
   0000000000000000 ffffffff815dd7c1 ffffffff00000000 ffffffff00000000
  Call Trace:
   [...] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3746
   [...] __raw_spin_lock include/linux/spinlock_api_smp.h:144
   [...] _raw_spin_lock+0x38/0x50 kernel/locking/spinlock.c:151
   [...] spin_lock include/linux/spinlock.h:302
   [...] kvm_ioapic_set_irq+0x4c/0x100 arch/x86/kvm/ioapic.c:379
   [...] kvm_set_ioapic_irq+0x8f/0xc0 arch/x86/kvm/irq_comm.c:52
   [...] kvm_set_irq+0x239/0x640 arch/x86/kvm/../../../virt/kvm/irqchip.c:101
   [...] irqfd_inject+0xb4/0x150 arch/x86/kvm/../../../virt/kvm/eventfd.c:60
   [...] process_one_work+0xb40/0x1ba0 kernel/workqueue.c:2096
   [...] worker_thread+0x214/0x18a0 kernel/workqueue.c:2230
   [...] kthread+0x328/0x3e0 kernel/kthread.c:209
   [...] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 49df6397ed ("KVM: x86: Split the APIC from the rest of IRQCHIP.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:39:28 +01:00
Radim Krčmář 81cdb259fb KVM: x86: fix out-of-bounds accesses of rtc_eoi map
KVM was using arrays of size KVM_MAX_VCPUS with vcpu_id, but ID can be
bigger that the maximal number of VCPUs, resulting in out-of-bounds
access.

Found by syzkaller:

  BUG: KASAN: slab-out-of-bounds in __apic_accept_irq+0xb33/0xb50 at addr [...]
  Write of size 1 by task a.out/27101
  CPU: 1 PID: 27101 Comm: a.out Not tainted 4.9.0-rc5+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __apic_accept_irq+0xb33/0xb50 arch/x86/kvm/lapic.c:905
   [...] kvm_apic_set_irq+0x10e/0x180 arch/x86/kvm/lapic.c:495
   [...] kvm_irq_delivery_to_apic+0x732/0xc10 arch/x86/kvm/irq_comm.c:86
   [...] ioapic_service+0x41d/0x760 arch/x86/kvm/ioapic.c:360
   [...] ioapic_set_irq+0x275/0x6c0 arch/x86/kvm/ioapic.c:222
   [...] kvm_ioapic_inject_all arch/x86/kvm/ioapic.c:235
   [...] kvm_set_ioapic+0x223/0x310 arch/x86/kvm/ioapic.c:670
   [...] kvm_vm_ioctl_set_irqchip arch/x86/kvm/x86.c:3668
   [...] kvm_arch_vm_ioctl+0x1a08/0x23c0 arch/x86/kvm/x86.c:3999
   [...] kvm_vm_ioctl+0x1fa/0x1a70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3099

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: af1bae5497 ("KVM: x86: bump KVM_MAX_VCPU_ID to 1023")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:37:19 +01:00
Radim Krčmář 2117d5398c KVM: x86: drop error recovery in em_jmp_far and em_ret_far
em_jmp_far and em_ret_far assumed that setting IP can only fail in 64
bit mode, but syzkaller proved otherwise (and SDM agrees).
Code segment was restored upon failure, but it was left uninitialized
outside of long mode, which could lead to a leak of host kernel stack.
We could have fixed that by always saving and restoring the CS, but we
take a simpler approach and just break any guest that manages to fail
as the error recovery is error-prone and modern CPUs don't need emulator
for this.

Found by syzkaller:

  WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 2 PID: 3668 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __dump_stack lib/dump_stack.c:15
   [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
   [...] panic+0x1b7/0x3a3 kernel/panic.c:179
   [...] __warn+0x1c4/0x1e0 kernel/panic.c:542
   [...] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
   [...] em_ret_far+0x428/0x480 arch/x86/kvm/emulate.c:2217
   [...] em_ret_far_imm+0x17/0x70 arch/x86/kvm/emulate.c:2227
   [...] x86_emulate_insn+0x87a/0x3730 arch/x86/kvm/emulate.c:5294
   [...] x86_emulate_instruction+0x520/0x1ba0 arch/x86/kvm/x86.c:5545
   [...] emulate_instruction arch/x86/include/asm/kvm_host.h:1116
   [...] complete_emulated_io arch/x86/kvm/x86.c:6870
   [...] complete_emulated_mmio+0x4e9/0x710 arch/x86/kvm/x86.c:6934
   [...] kvm_arch_vcpu_ioctl_run+0x3b7a/0x5a90 arch/x86/kvm/x86.c:6978
   [...] kvm_vcpu_ioctl+0x61e/0xdd0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2557
   [...] vfs_ioctl fs/ioctl.c:43
   [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
   [...] SYSC_ioctl fs/ioctl.c:694
   [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
   [...] entry_SYSCALL_64_fastpath+0x1f/0xc2

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: d1442d85cc ("KVM: x86: Handle errors when RIP is set during far jumps")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:36:54 +01:00
Radim Krčmář 444fdad88f KVM: x86: fix out-of-bounds access in lapic
Cluster xAPIC delivery incorrectly assumed that dest_id <= 0xff.
With enabled KVM_X2APIC_API_USE_32BIT_IDS in KVM_CAP_X2APIC_API, a
userspace can send an interrupt with dest_id that results in
out-of-bounds access.

Found by syzkaller:

  BUG: KASAN: slab-out-of-bounds in kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 at addr ffff88003d9ca750
  Read of size 8 by task syz-executor/22923
  CPU: 0 PID: 22923 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __dump_stack lib/dump_stack.c:15
   [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
   [...] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
   [...] print_address_description mm/kasan/report.c:194
   [...] kasan_report_error mm/kasan/report.c:283
   [...] kasan_report+0x231/0x500 mm/kasan/report.c:303
   [...] __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:329
   [...] kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 arch/x86/kvm/lapic.c:824
   [...] kvm_irq_delivery_to_apic+0x132/0x9a0 arch/x86/kvm/irq_comm.c:72
   [...] kvm_set_msi+0x111/0x160 arch/x86/kvm/irq_comm.c:157
   [...] kvm_send_userspace_msi+0x201/0x280 arch/x86/kvm/../../../virt/kvm/irqchip.c:74
   [...] kvm_vm_ioctl+0xba5/0x1670 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3015
   [...] vfs_ioctl fs/ioctl.c:43
   [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
   [...] SYSC_ioctl fs/ioctl.c:694
   [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
   [...] entry_SYSCALL_64_fastpath+0x1f/0xc2

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: e45115b62f ("KVM: x86: use physical LAPIC array for logical x2APIC")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:35:53 +01:00
Tom Lendacky 8370c3d08b kvm: svm: Add kvm_fast_pio_in support
Update the I/O interception support to add the kvm_fast_pio_in function
to speed up the in instruction similar to the out instruction.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:45 +01:00
Tom Lendacky 147277540b kvm: svm: Add support for additional SVM NPF error codes
AMD hardware adds two additional bits to aid in nested page fault handling.

Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables

The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.

Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:26 +01:00
Ingo Molnar 064e6a8ba6 Merge branch 'linus' into x86/fpu, to resolve conflicts
Conflicts:
	arch/x86/kernel/fpu/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 07:18:09 +01:00
Bandan Das ae0f549951 kvm: x86: don't print warning messages for unimplemented msrs
Change unimplemented msrs messages to use pr_debug.
If CONFIG_DYNAMIC_DEBUG is set, then these messages can be
enabled at run time or else -DDEBUG can be used at compile
time to enable them. These messages will still be printed if
ignore_msrs=1.

Signed-off-by: Bandan Das <bsd@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-22 18:29:10 +01:00
Jan Dakinevich bcdde302b8 KVM: nVMX: invvpid handling improvements
- Expose all invalidation types to the L1

 - Reject invvpid instruction, if L1 passed zero vpid value to single
   context invalidations

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Tested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-22 17:26:42 +01:00
Jim Mattson c7dd15b337 kvm: x86: CPUID.01H:EDX.APIC[bit 9] should mirror IA32_APIC_BASE[11]
From the Intel SDM, volume 3, section 10.4.3, "Enabling or Disabling the
Local APIC,"

  When IA32_APIC_BASE[11] is 0, the processor is functionally equivalent
  to an IA-32 processor without an on-chip APIC. The CPUID feature flag
  for the APIC (see Section 10.4.2, "Presence of the Local APIC") is
  also set to 0.

Signed-off-by: Jim Mattson <jmattson@google.com>
[Changed subject tag from nVMX to x86.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-22 14:51:55 +01:00
Pan Xinhui 0b9f6c4615 x86/kvm: Support the vCPU preemption check
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vCPUs
than physical CPUs in the system) as doing busy waits for preempted
vCPUs will hurt system performance far worse than early yielding.

Use struct kvm_steal_time::preempted to indicate that if a vCPU
is running or not.

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: David.Laight@ACULAB.COM
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: benh@kernel.crashing.org
Cc: boqun.feng@gmail.com
Cc: borntraeger@de.ibm.com
Cc: bsingharora@gmail.com
Cc: dave@stgolabs.net
Cc: jgross@suse.com
Cc: kernellwp@gmail.com
Cc: konrad.wilk@oracle.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: mpe@ellerman.id.au
Cc: paulmck@linux.vnet.ibm.com
Cc: paulus@samba.org
Cc: rkrcmar@redhat.com
Cc: virtualization@lists.linux-foundation.org
Cc: will.deacon@arm.com
Cc: xen-devel-request@lists.xenproject.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1478077718-37424-9-git-send-email-xinhui.pan@linux.vnet.ibm.com
[ Typo fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 12:48:08 +01:00
Paolo Bonzini a2b07739ff kvm: x86: merge kvm_arch_set_irq and kvm_arch_set_irq_inatomic
kvm_arch_set_irq is unused since commit b97e6de9c9.  Merge
its functionality with kvm_arch_set_irq_inatomic.

Reported-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:19 +01:00
Paolo Bonzini 7301d6abae KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
Reported by syzkaller:

    [ INFO: suspicious RCU usage. ]
    4.9.0-rc4+ #47 Not tainted
    -------------------------------
    ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!

    stack backtrace:
    CPU: 1 PID: 6679 Comm: syz-executor Not tainted 4.9.0-rc4+ #47
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
     ffff880039e2f6d0 ffffffff81c2e46b ffff88003e3a5b40 0000000000000000
     0000000000000001 ffffffff83215600 ffff880039e2f700 ffffffff81334ea9
     ffffc9000730b000 0000000000000004 ffff88003c4f8420 ffff88003d3f8000
    Call Trace:
     [<     inline     >] __dump_stack lib/dump_stack.c:15
     [<ffffffff81c2e46b>] dump_stack+0xb3/0x118 lib/dump_stack.c:51
     [<ffffffff81334ea9>] lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4445
     [<     inline     >] __kvm_memslots include/linux/kvm_host.h:534
     [<     inline     >] kvm_memslots include/linux/kvm_host.h:541
     [<ffffffff8105d6ae>] kvm_gfn_to_hva_cache_init+0xa1e/0xce0 virt/kvm/kvm_main.c:1941
     [<ffffffff8112685d>] kvm_lapic_set_vapic_addr+0xed/0x140 arch/x86/kvm/lapic.c:2217

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: fda4e2e855
Cc: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:18 +01:00
Paolo Bonzini e3fd9a93a1 kvm: kvmclock: let KVM_GET_CLOCK return whether the master clock is in use
Userspace can read the exact value of kvmclock by reading the TSC
and fetching the timekeeping parameters out of guest memory.  This
however is brittle and not necessary anymore with KVM 4.11.  Provide
a mechanism that lets userspace know if the new KVM_GET_CLOCK
semantics are in effect, and---since we are at it---if the clock
is stable across all VCPUs.

Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:16 +01:00
Ignacio Alvarado 1650b4ebc9 KVM: Disable irq while unregistering user notifier
Function user_notifier_unregister should be called only once for each
registered user notifier.

Function kvm_arch_hardware_disable can be executed from an IPI context
which could cause a race condition with a VCPU returning to user mode
and attempting to unregister the notifier.

Signed-off-by: Ignacio Alvarado <ikalvarado@google.com>
Cc: stable@vger.kernel.org
Fixes: 18863bdd60 ("KVM: x86 shared msr infrastructure")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:04 +01:00
Paolo Bonzini 8b95344064 KVM: x86: do not go through vcpu in __get_kvmclock_ns
Going through the first VCPU is wrong if you follow a KVM_SET_CLOCK with
a KVM_GET_CLOCK immediately after, without letting the VCPU run and
call kvm_guest_time_update.

To fix this, compute the kvmclock value ourselves, using the master
clock (tsc, nsec) pair as the base and the host CPU frequency as
the scale.

Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 18:03:03 +01:00
Luwei Kang 4504b5c941 kvm: x86: Add AVX512_4VNNIW and AVX512_4FMAPS support
Add two new AVX512 subfeatures support for KVM guest.

AVX512_4VNNIW:
Vector instructions for deep learning enhanced word variable precision.

AVX512_4FMAPS:
Vector instructions for deep learning floating-point single precision.

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: He Chen <he.chen@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
[Changed subject tags.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-16 22:20:02 +01:00
Radim Krčmář 283c95d0e3 KVM: x86: emulate FXSAVE and FXRSTOR
Internal errors were reported on 16 bit fxsave and fxrstor with ipxe.
Old Intels don't have unrestricted_guest, so we have to emulate them.

The patch takes advantage of the hardware implementation.

AMD and Intel differ in saving and restoring other fields in first 32
bytes.  A test wrote 0xff to the fxsave area, 0 to upper bits of MCSXR
in the fxsave area, executed fxrstor, rewrote the fxsave area to 0xee,
and executed fxsave:

  Intel (Nehalem):
    7f 1f 7f 7f ff 00 ff 07 ff ff ff ff ff ff 00 00
    ff ff ff ff ff ff 00 00 ff ff 00 00 ff ff 00 00
  Intel (Haswell -- deprecated FPU CS and FPU DS):
    7f 1f 7f 7f ff 00 ff 07 ff ff ff ff 00 00 00 00
    ff ff ff ff 00 00 00 00 ff ff 00 00 ff ff 00 00
  AMD (Opteron 2300-series):
    7f 1f 7f 7f ff 00 ee ee ee ee ee ee ee ee ee ee
    ee ee ee ee ee ee ee ee ff ff 00 00 ff ff 02 00

fxsave/fxrstor will only be emulated on early Intels, so KVM can't do
much to improve the situation.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-16 22:09:46 +01:00
Radim Krčmář aabba3c6ab KVM: x86: add asm_safe wrapper
Move the existing exception handling for inline assembly into a macro
and switch its return values to X86EMUL type.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:45 +01:00
Radim Krčmář 4852018789 KVM: x86: save one bit in ctxt->d
Alignments are exclusive, so 5 modes can be expressed in 3 bits.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:45 +01:00
Radim Krčmář d3fe959f81 KVM: x86: add Align16 instruction flag
Needed for FXSAVE and FXRSTOR.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:45 +01:00
Jiang Biao 695151964d kvm: x86: remove unused but set variable
The local variable *gpa_offset* is set but not used afterwards,
which make the compiler issue a warning with option
-Wunused-but-set-variable. Remove it to avoid the warning.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:44 +01:00
Jiang Biao ecd8a8c2b4 kvm: x86: hyperv: make function static to avoid compiling warning
synic_set_irq is only used in hyperv.c, and should be static to
avoid compiling warning when with -Wmissing-prototypes option.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:44 +01:00
Jiang Biao 1e13175bd2 kvm: x86: cpuid: remove the unnecessary variable
The use of local variable *function* is not necessary here. Remove
it to avoid compiling warning with -Wunused-but-set-variable option.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:44 +01:00
Jiang Biao ae6a237560 kvm: x86: make a function in x86.c static to avoid compiling warning
kvm_emulate_wbinvd_noskip is only used in x86.c, and should be
static to avoid compiling warning when with -Wmissing-prototypes
option.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:44 +01:00
Jiang Biao 33365e7a45 kvm: x86: make function static to avoid compiling warning
vmx_arm_hv_timer is only used in vmx.c, and should be static to
avoid compiling warning when with -Wmissing-prototypes option.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:43 +01:00
Paolo Bonzini 6314a17fec The three KVM patches that KVMGT needs.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJYIy78AAoJEL/70l94x66DzbQH/34Mgm52OdLzTe+Yf8Pcae5X
 A/z2Aicto+6M+dtYoWPn2DfoevPaSpVmymryRrRzAqk5QDPHiVHZ5iCW0RaIEU7M
 iiPudqSGVGa0oFYBkxsJxCBysAQsHX5sEXEszs4egbO1TtQ8LCAxYUVuLBGlx+Fa
 qj26Fi/p2ByHVp/RX55A2kF5T8J671KT4LWUvFjzTgGFWo8Kr1bk0q4hmRYB9OBc
 /pqczV3Mc6KcmzfIg3Rd6xt8UDlEGJ4YQhpNgY6nxrQ1py3AP7vNqBPAH4RXbHJB
 /OqdAjpqa8+rrwSpQ1f58U+7v/ZO1ZTg0IW9bf60qKjG/aV4fGN6y0/iXKGgKyQ=
 =cpog
 -----END PGP SIGNATURE-----

Merge tag 'tags/for-kvmgt' into HEAD

The three KVM patches that KVMGT needs.

Conflicts:
	arch/x86/include/asm/kvm_page_track.h
	arch/x86/kvm/mmu.c
2016-11-09 15:20:31 +01:00
Jike Song 871b7ef2a1 kvm/page_track: export symbols for external usage
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-04 12:13:20 +01:00
Jike Song d126363d8f kvm/page_track: call notifiers with kvm_page_track_notifier_node
The user of page_track might needs extra information, so pass
the kvm_page_track_notifier_node to callbacks.

Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-04 12:13:20 +01:00
Xiaoguang Chen ae7cd87372 KVM: x86: add track_flush_slot page track notifier
When a memory slot is being moved or removed users of page track
can be notified. So users can drop write-protection for the pages
in that memory slot.

This notifier type is needed by KVMGT to sync up its shadow page
table when memory slot is being moved or removed.

Register the notifier type track_flush_slot to receive memslot move
and remove event.

Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com>
Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com>
[Squashed commits to avoid bisection breakage and reworded the subject.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-04 12:13:19 +01:00
Paolo Bonzini ad3610919e kvm: x86: avoid atomic operations on APICv vmentry
On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
posted interrupts turn out to be slower than interrupt injection via
KVM_REQ_EVENT.

This patch optimizes a bit the IRR update, avoiding expensive atomic
operations in the common case where PI.ON=0 at vmentry or the PIR vector
is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
measured by kvm-unit-tests' inl_from_qemu test (20 runs):

              | enable_apicv=1  |  enable_apicv=0
              | mean     stdev  |  mean     stdev
    ----------|-----------------|------------------
    before    | 5826     32.65  |  5765     47.09
    after     | 5809     43.42  |  5777     77.02

Of course, any change in the right column is just placebo effect. :)
The savings are bigger if interrupts are frequent.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-03 12:27:51 +01:00
Paolo Bonzini 1b07304c58 KVM: nVMX: support descriptor table exits
These are never used by the host, but they can still be reflected to
the guest.

Tested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 21:32:17 +01:00
Paolo Bonzini 5587859fb1 KVM: x86: use ktime_get instead of seeking the hrtimer_clock_base
The base clock for the LAPIC timer is always CLOCK_MONOTONIC.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li 8003c9ae20 KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support
Most windows guests still utilize APIC Timer periodic/oneshot mode
instead of tsc-deadline mode, and the APIC Timer periodic/oneshot
mode are still emulated by high overhead hrtimer on host. This patch
converts the expected expire time of the periodic/oneshot mode to
guest deadline tsc in order to leverage VMX preemption timer logic
for APIC Timer tsc-deadline mode. After each preemption timer vmexit
preemption timer is restarted to emulate LVTT current-count register
is automatically reloaded from the initial-count register when the
count reaches 0. This patch reduces ~5600 cycles for each APIC Timer
periodic mode operation virtualization.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Squashed with my fixes that were reviewed-by Paolo.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li 7e810a38e6 KVM: LAPIC: rename start/cancel_hv_tscdeadline to start/cancel_hv_timer
Rename start/cancel_hv_tscdeadline to start/cancel_hv_timer since
they will handle both APIC Timer periodic/oneshot mode and tsc-deadline
mode.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li 498f816219 KVM: LAPIC: introduce kvm_get_lapic_target_expiration_tsc()
Introdce kvm_get_lapic_target_expiration_tsc() to get APIC Timer target
deadline tsc.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li a10388e11f KVM: LAPIC: guarantee the timer is in tsc-deadline mode
Check apic_lvtt_tscdeadline() mode directly instead of apic_lvtt_oneshot()
and apic_lvtt_period() to guarantee the timer is in tsc-deadline mode when
rdmsr MSR_IA32_TSCDEADLINE.

Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li 7d7f7da2f1 KVM: LAPIC: extract start_sw_period() to handle periodic/oneshot mode
Extract start_sw_period() to handle periodic/oneshot mode, it will be
used by later patch.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Longpeng(Mike) 868a32f327 kvm: x86: remove the misleading comment in vmx_handle_external_intr
Since Paolo has removed irq-enable-operation in vmx_handle_external_intr
(KVM: x86: use guest_exit_irqoff), the original comment about the IF bit
in rflags is incorrect and stale now, so remove it.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Xiaoguang Chen b5f5fdca65 KVM: x86: add track_flush_slot page track notifier
When a memory slot is being moved or removed users of page track
can be notified. So users can drop write-protection for the pages
in that memory slot.

This notifier type is needed by KVMGT to sync up its shadow page
table when memory slot is being moved or removed.

Register the notifier type track_flush_slot to receive memslot move
and remove event.

Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com>
Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com>
[Squashed commits to avoid bisection breakage and reworded the subject.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Radim Krčmář 2361133293 KVM: VMX: refactor setup of global page-sized bitmaps
We've had 10 page-sized bitmaps that were being allocated and freed one
by one when we could just use a cycle.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Radim Krčmář 2e69f86561 KVM: VMX: join functions that disable x2apic msr intercepts
vmx_disable_intercept_msr_read_x2apic() and
vmx_disable_intercept_msr_write_x2apic() differed only in the type.
Pass the type to a new function.

[Ordered and commented TPR intercept according to Paolo's suggestion.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Radim Krčmář 40d8338d09 KVM: VMX: remove functions that enable msr intercepts
All intercepts are enabled at the beginning, so they can only be used if
we disabled an intercept that we wanted to have enabled.
This was done for TMCCT to simplify a loop that disables all x2APIC MSR
intercepts, but just keeping TMCCT enabled yields better results.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Jim Mattson 83bafef1a1 kvm: nVMX: Update MSR load counts on a VMCS switch
When L0 establishes (or removes) an MSR entry in the VM-entry or VM-exit
MSR load lists, the change should affect the dormant VMCS as well as the
current VMCS. Moreover, the vmcs02 MSR-load addresses should be
initialized.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Jim Mattson cf3215d939 kvm: nVMX: Fetch VM_INSTRUCTION_ERROR from vmcs02 on vmx->fail
When forwarding a hardware VM-entry failure to L1, fetch the
VM_INSTRUCTION_ERROR field from vmcs02 before loading vmcs01.

(Note that there is an implicit assumption that the VM-entry failure was
on the first VM-entry to vmcs02 after nested_vmx_run; otherwise, L1 is
going to be very confused.)

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Peter Feiner 66d73e12f2 KVM: X86: MMU: no mmu_notifier_seq++ in kvm_age_hva
The MMU notifier sequence number keeps GPA->HPA mappings in sync when
GPA->HPA lookups are done outside of the MMU lock (e.g., in
tdp_page_fault). Since kvm_age_hva doesn't change GPA->HPA, it's
unnecessary to increment the sequence number.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li c63e45635b KVM: VMX: Better name x2apic msr bitmaps
Renames x2apic_apicv_inactive msr_bitmaps to x2apic and original
x2apic bitmaps to x2apic_apicv.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Owen Hofmann d9092f52d7 kvm: x86: Check memopp before dereference (CVE-2016-8630)
Commit 41061cdb98 ("KVM: emulate: do not initialize memopp") removes a
check for non-NULL under incorrect assumptions. An undefined instruction
with a ModR/M byte with Mod=0 and R/M-5 (e.g. 0xc7 0x15) will attempt
to dereference a null pointer here.

Fixes: 41061cdb98
Message-Id: <1477592752-126650-2-git-send-email-osh@google.com>
Signed-off-by: Owen Hofmann <osh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 21:31:53 +01:00
Jim Mattson 355f4fb140 kvm: nVMX: VMCLEAR an active shadow VMCS after last use
After a successful VM-entry with the "VMCS shadowing" VM-execution
control set, the shadow VMCS referenced by the VMCS link pointer field
in the current VMCS becomes active on the logical processor.

A VMCS that is made active on more than one logical processor may become
corrupted. Therefore, before an active VMCS can be migrated to another
logical processor, the first logical processor must execute a VMCLEAR
for the active VMCS. VMCLEAR both ensures that all VMCS data are written
to memory and makes the VMCS inactive.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-By: David Matlack <dmatlack@google.com>
Message-Id: <1477668579-22555-1-git-send-email-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 20:03:17 +01:00
Paolo Bonzini ea26e4ec08 KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK
Since commit a545ab6a00 ("kvm: x86: add tsc_offset field to struct
kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
cached and need not be fished out of the VMCS or VMCB.  This means
that we can implement adjust_tsc_offset_guest and read_l1_tsc
entirely in generic code.  The simplification is particularly
significant for VMX code, where vmx->nested.vmcs01_tsc_offset
was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
the vmcs01_tsc_offset can be dropped completely.

More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
which, after commit 108b249c45 ("KVM: x86: introduce get_kvmclock_ns",
2016-09-01) called read_l1_tsc while the VMCS was not loaded.
It thus returned bogus values on Intel CPUs.

Fixes: 108b249c45
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 20:03:07 +01:00
Andy Lutomirski 04ac88abaf x86/fpu, kvm: Remove host CR0.TS manipulation
Now that x86 always uses eager FPU switching on the host, there's no
need for KVM to manipulate the host's CR0.TS.

This should be both simpler and faster.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm list <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/b212064922537c05d0c81d931fc4dbe769127ce7.1477951965.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:47:54 +01:00
Ingo Molnar c29c716662 Merge branch 'core/urgent' into x86/fpu, to merge fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:47:40 +01:00
Ido Yariv bd768e1466 KVM: x86: fix wbinvd_dirty_mask use-after-free
vcpu->arch.wbinvd_dirty_mask may still be used after freeing it,
corrupting memory. For example, the following call trace may set a bit
in an already freed cpu mask:
    kvm_arch_vcpu_load
    vcpu_load
    vmx_free_vcpu_nested
    vmx_free_vcpu
    kvm_arch_vcpu_free

Fix this by deferring freeing of wbinvd_dirty_mask.

Cc: stable@vger.kernel.org
Signed-off-by: Ido Yariv <ido@wizery.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-10-28 11:35:21 +02:00
Borislav Petkov 796f4687bd kvm/x86: Show WRMSR data is in hex
Add the "0x" prefix to the error messages format to make it unambiguous
about what kind of value we're talking about.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Message-Id: <20161027181445.25319-1-bp@alien8.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-27 20:31:25 +02:00
Jim Mattson 85c856b39b kvm: nVMX: Fix kernel panics induced by illegal INVEPT/INVVPID types
Bitwise shifts by amounts greater than or equal to the width of the left
operand are undefined. A malicious guest can exploit this to crash a
32-bit host, due to the BUG_ON(1)'s in handle_{invept,invvpid}.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <1477496318-17681-1-git-send-email-jmattson@google.com>
[Change 1UL to 1, to match the range check on the shift count. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-27 12:15:27 +02:00
Jiri Slaby 8678654e3c kvm: x86: memset whole irq_eoi
gcc 7 warns:
arch/x86/kvm/ioapic.c: In function 'kvm_ioapic_reset':
arch/x86/kvm/ioapic.c:597:2: warning: 'memset' used with length equal to number of elements without multiplication by element size [-Wmemset-elt-size]

And it is right. Memset whole array using sizeof operator.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
[Added x86 subject tag]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-10-20 14:54:11 +02:00
Borislav Petkov 758f588d6f kvm/x86: Fix unused variable warning in kvm_timer_init()
When CONFIG_CPU_FREQ is not set, int cpu is unused and gcc rightfully
warns about it:

  arch/x86/kvm/x86.c: In function ‘kvm_timer_init’:
  arch/x86/kvm/x86.c:5697:6: warning: unused variable ‘cpu’ [-Wunused-variable]
    int cpu;
        ^~~

But since it is used only in the CONFIG_CPU_FREQ block, simply move it
there, thus squashing the warning too.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-10-20 14:49:52 +02:00
Ingo Molnar 4d69f155d5 Linux 4.9-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYAoDuAAoJEHm+PkMAQRiGUeEH/03/cUjHeY5aJkcJ0JeHkoU5
 GR5nRGcjfFF6cGujw2cSXBf5NzZTcrvBBFSgGNJ/rqm4EeDBsmf6T8qSfEKky/SY
 3CNWSzayFU8Na3C8Z/a/xPTPicneX9zVnAi8XMAKXwWPmu21JCLR/hkKaxQ29qGr
 Nqe4kEdLEF80d5lFRfNjK3CX4bD6w6P7aTBaM6wuRe4u5AXKJlSF+j838o5+/tSQ
 Q1V7fyXlX+kwNmH4gViim8im0PLm7/7Li8e24pL3cAR2G6DHrUzcsYYoRMHpk5bv
 HdBeCgZL6TnIaJc0ui2FRqQsifaVfM5J+pK81wr/JhBP2hmuWIN7NMupfCYtCcM=
 =Mown
 -----END PGP SIGNATURE-----

Merge tag 'v4.9-rc1' into x86/fpu, to resolve conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-16 13:04:34 +02:00
Petr Mladek 3989144f86 kthread: kthread worker API cleanup
A good practice is to prefix the names of functions by the name
of the subsystem.

The kthread worker API is a mix of classic kthreads and workqueues.  Each
worker has a dedicated kthread.  It runs a generic function that process
queued works.  It is implemented as part of the kthread subsystem.

This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:

__init_kthread_worker()		-> __kthread_init_worker()
init_kthread_worker()		-> kthread_init_worker()
init_kthread_work()		-> kthread_init_work()
insert_kthread_work()		-> kthread_insert_work()
queue_kthread_work()		-> kthread_queue_work()
flush_kthread_work()		-> kthread_flush_work()
flush_kthread_worker()		-> kthread_flush_worker()

Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.

Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:

  + "init" in the function names stands for the verb "initialize"
    aka "initialize worker". While "INIT" in the macro names
    stands for the noun "INITIALIZER" aka "worker initializer".

  + INIT() macros are used only in DEFINE() macros

  + init() functions are used close to the other kthread()
    functions. It looks much better if all the functions
    use the same scheme.

  + There will be also kthread_destroy_worker() that will
    be used close to kthread_cancel_work(). It is related
    to the init() function. Again it looks better if all
    functions use the same naming scheme.

  + there are several precedents for such init() function
    names, e.g. amd_iommu_init_device(), free_area_init_node(),
    jump_label_init_type(),  regmap_init_mmio_clk(),

  + It is not an argument but it was inconsistent even before.

[arnd@arndb.de: fix linux-next merge conflict]
 Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Rik van Riel 3d42de25d2 x86/fpu, kvm: Remove KVM vcpu->fpu_counter
With the removal of the lazy FPU code, this field is no longer used.
Get rid of it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pbonzini@redhat.com
Link: http://lkml.kernel.org/r/1475627678-20788-7-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-07 11:14:41 +02:00
Andy Lutomirski c592b57347 x86/fpu: Remove use_eager_fpu()
This removes all the obvious code paths that depend on lazy FPU mode.
It shouldn't change the generated code at all.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pbonzini@redhat.com
Link: http://lkml.kernel.org/r/1475627678-20788-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-07 11:14:36 +02:00
Linus Torvalds 6218590bcb KVM updates for v4.9-rc1
All architectures:
   Move `make kvmconfig` stubs from x86;  use 64 bits for debugfs stats.
 
 ARM:
   Important fixes for not using an in-kernel irqchip; handle SError
   exceptions and present them to guests if appropriate; proxying of GICV
   access at EL2 if guest mappings are unsafe; GICv3 on AArch32 on ARMv8;
   preparations for GICv3 save/restore, including ABI docs; cleanups and
   a bit of optimizations.
 
 MIPS:
   A couple of fixes in preparation for supporting MIPS EVA host kernels;
   MIPS SMP host & TLB invalidation fixes.
 
 PPC:
   Fix the bug which caused guests to falsely report lockups; other minor
   fixes; a small optimization.
 
 s390:
   Lazy enablement of runtime instrumentation; up to 255 CPUs for nested
   guests; rework of machine check deliver; cleanups and fixes.
 
 x86:
   IOMMU part of AMD's AVIC for vmexit-less interrupt delivery; Hyper-V
   TSC page; per-vcpu tsc_offset in debugfs; accelerated INS/OUTS in
   nVMX; cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJX9iDrAAoJEED/6hsPKofoOPoIAIUlgojkb9l2l1XVDgsXdgQL
 sRVhYSVv7/c8sk9vFImrD5ElOPZd+CEAIqFOu45+NM3cNi7gxip9yftUVs7wI5aC
 eDZRWm1E4trDZLe54ZM9ThcqZzZZiELVGMfR1+ZndUycybwyWzafpXYsYyaXp3BW
 hyHM3qVkoWO3dxBWFwHIoO/AUJrWYkRHEByKyvlC6KPxSdBPSa5c1AQwMCoE0Mo4
 K/xUj4gBn9eMelNhg4Oqu/uh49/q+dtdoP2C+sVM8bSdquD+PmIeOhPFIcuGbGFI
 B+oRpUhIuntN39gz8wInJ4/GRSeTuR2faNPxMn4E1i1u4LiuJvipcsOjPfe0a18=
 =fZRB
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "All architectures:
   - move `make kvmconfig` stubs from x86
   - use 64 bits for debugfs stats

  ARM:
   - Important fixes for not using an in-kernel irqchip
   - handle SError exceptions and present them to guests if appropriate
   - proxying of GICV access at EL2 if guest mappings are unsafe
   - GICv3 on AArch32 on ARMv8
   - preparations for GICv3 save/restore, including ABI docs
   - cleanups and a bit of optimizations

  MIPS:
   - A couple of fixes in preparation for supporting MIPS EVA host
     kernels
   - MIPS SMP host & TLB invalidation fixes

  PPC:
   - Fix the bug which caused guests to falsely report lockups
   - other minor fixes
   - a small optimization

  s390:
   - Lazy enablement of runtime instrumentation
   - up to 255 CPUs for nested guests
   - rework of machine check deliver
   - cleanups and fixes

  x86:
   - IOMMU part of AMD's AVIC for vmexit-less interrupt delivery
   - Hyper-V TSC page
   - per-vcpu tsc_offset in debugfs
   - accelerated INS/OUTS in nVMX
   - cleanups and fixes"

* tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (140 commits)
  KVM: MIPS: Drop dubious EntryHi optimisation
  KVM: MIPS: Invalidate TLB by regenerating ASIDs
  KVM: MIPS: Split kernel/user ASID regeneration
  KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
  KVM: arm/arm64: vgic: Don't flush/sync without a working vgic
  KVM: arm64: Require in-kernel irqchip for PMU support
  KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
  KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL
  KVM: PPC: Book3S: Remove duplicate setting of the B field in tlbie
  KVM: PPC: BookE: Fix a sanity check
  KVM: PPC: Book3S HV: Take out virtual core piggybacking code
  KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread
  ARM: gic-v3: Work around definition of gic_write_bpr1
  KVM: nVMX: Fix the NMI IDT-vectoring handling
  KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
  KVM: nVMX: Fix reload apic access page warning
  kvmconfig: add virtio-gpu to config fragment
  config: move x86 kvm_guest.config to a common location
  arm64: KVM: Remove duplicating init code for setting VMID
  ARM: KVM: Support vgic-v3
  ...
2016-10-06 10:49:01 -07:00
Wanpeng Li c5a6d5f7fa KVM: nVMX: Fix the NMI IDT-vectoring handling
Run kvm-unit-tests/eventinj.flat in L1:

Sending NMI to self
After NMI to self
FAIL: NMI

This test scenario is to test whether VMM can handle NMI IDT-vectoring info correctly.

At the beginning, L2 writes LAPIC to send a self NMI, the EPT page tables on both L1
and L0 are empty so:

- The L2 accesses memory can generate EPT violation which can be intercepted by L0.

  The EPT violation vmexit occurred during delivery of this NMI, and the NMI info is
  recorded in vmcs02's IDT-vectoring info.

- L0 walks L1's EPT12 and L0 sees the mapping is invalid, it injects the EPT violation into L1.

  The vmcs02's IDT-vectoring info is reflected to vmcs12's IDT-vectoring info since
  it is a nested vmexit.

- L1 receives the EPT violation, then fixes its EPT12.
- L1 executes VMRESUME to resume L2 which generates vmexit and causes L1 exits to L0.
- L0 emulates VMRESUME which is called from L1, then return to L2.

  L0 merges the requirement of vmcs12's IDT-vectoring info and injects it to L2 through
  vmcs02.

- The L2 re-executes the fault instruction and cause EPT violation again.
- Since the L1's EPT12 is valid, L0 can fix its EPT02
- L0 resume L2

  The EPT violation vmexit occurred during delivery of this NMI again, and the NMI info
  is recorded in vmcs02's IDT-vectoring info. L0 should inject the NMI through vmentry
  event injection since it is caused by EPT02's EPT violation.

However, vmx_inject_nmi() refuses to inject NMI from IDT-vectoring info if vCPU is in
guest mode, this patch fix it by permitting to inject NMI from IDT-vectoring if it is
the L0's responsibility to inject NMI from IDT-vectoring info to L2.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Bandan Das <bsd@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 01:08:15 +02:00
Wanpeng Li f6e90f9e0e KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
I observed that kvmvapic(to optimize flexpriority=N or AMD) is used
to boost TPR access when testing kvm-unit-test/eventinj.flat tpr case
on my haswell desktop (w/ flexpriority, w/o APICv). Commit (8d14695f95
x86, apicv: add virtual x2apic support) disable virtual x2apic mode
completely if w/o APICv, and the author also told me that windows guest
can't enter into x2apic mode when he developed the APICv feature several
years ago. However, it is not truth currently, Interrupt Remapping and
vIOMMU is added to qemu and the developers from Intel test windows 8 can
work in x2apic mode w/ Interrupt Remapping enabled recently.

This patch enables TPR shadow for virtual x2apic mode to boost
windows guest in x2apic mode even if w/o APICv.

Can pass the kvm-unit-test.

Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Suggested-by: Wincy Van <fanwenyi0529@gmail.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wincy Van <fanwenyi0529@gmail.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 01:08:14 +02:00
Wanpeng Li c83b6d1594 KVM: nVMX: Fix reload apic access page warning
WARNING: CPU: 1 PID: 4230 at kernel/sched/core.c:7564 __might_sleep+0x7e/0x80
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d0de7f9>] prepare_to_swait+0x39/0xa0
CPU: 1 PID: 4230 Comm: qemu-system-x86 Not tainted 4.8.0-rc5+ #47
Call Trace:
 dump_stack+0x99/0xd0
 __warn+0xd1/0xf0
 warn_slowpath_fmt+0x4f/0x60
 ? prepare_to_swait+0x39/0xa0
 ? prepare_to_swait+0x39/0xa0
 __might_sleep+0x7e/0x80
 __gfn_to_pfn_memslot+0x156/0x480 [kvm]
 gfn_to_pfn+0x2a/0x30 [kvm]
 gfn_to_page+0xe/0x20 [kvm]
 kvm_vcpu_reload_apic_access_page+0x32/0xa0 [kvm]
 nested_vmx_vmexit+0x765/0xca0 [kvm_intel]
 ? _raw_spin_unlock_irqrestore+0x36/0x80
 vmx_check_nested_events+0x49/0x1f0 [kvm_intel]
 kvm_arch_vcpu_runnable+0x2d/0xe0 [kvm]
 kvm_vcpu_check_block+0x12/0x60 [kvm]
 kvm_vcpu_block+0x94/0x4c0 [kvm]
 kvm_arch_vcpu_ioctl_run+0x619/0x1aa0 [kvm]
 ? kvm_arch_vcpu_ioctl_run+0xdf1/0x1aa0 [kvm]
 kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]

===============================
[ INFO: suspicious RCU usage. ]
4.8.0-rc5+ #47 Not tainted
-------------------------------
./include/linux/kvm_host.h:535 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-x86/4230:
 #0:  (&vcpu->mutex){+.+.+.}, at: [<ffffffffc062975c>] vcpu_load+0x1c/0x60 [kvm]

stack backtrace:
CPU: 1 PID: 4230 Comm: qemu-system-x86 Not tainted 4.8.0-rc5+ #47
Call Trace:
 dump_stack+0x99/0xd0
 lockdep_rcu_suspicious+0xe7/0x120
 gfn_to_memslot+0x12a/0x140 [kvm]
 gfn_to_pfn+0x12/0x30 [kvm]
 gfn_to_page+0xe/0x20 [kvm]
 kvm_vcpu_reload_apic_access_page+0x32/0xa0 [kvm]
 nested_vmx_vmexit+0x765/0xca0 [kvm_intel]
 ? _raw_spin_unlock_irqrestore+0x36/0x80
 vmx_check_nested_events+0x49/0x1f0 [kvm_intel]
 kvm_arch_vcpu_runnable+0x2d/0xe0 [kvm]
 kvm_vcpu_check_block+0x12/0x60 [kvm]
 kvm_vcpu_block+0x94/0x4c0 [kvm]
 kvm_arch_vcpu_ioctl_run+0x619/0x1aa0 [kvm]
 ? kvm_arch_vcpu_ioctl_run+0xdf1/0x1aa0 [kvm]
 kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
 ? __fget+0xfd/0x210
 ? __lock_is_held+0x54/0x70
 do_vfs_ioctl+0x96/0x6a0
 ? __fget+0x11c/0x210
 ? __fget+0x5/0x210
 SyS_ioctl+0x79/0x90
 do_syscall_64+0x81/0x220
 entry_SYSCALL64_slow_path+0x25/0x25

These can be triggered by running kvm-unit-test: ./x86-run x86/vmx.flat

The nested preemption timer is based on hrtimer which is started on L2
entry, stopped on L2 exit and evaluated via the new check_nested_events
hook. The current logic adds vCPU to a simple waitqueue (TASK_INTERRUPTIBLE)
if need to yield pCPU and w/o holding srcu read lock when accesses memslots,
both can be in nested preemption timer evaluation path which results in
the warning above.

This patch fix it by leveraging request bit to async reload APIC access
page before vmentry in order to avoid to reload directly during the nested
preemption timer evaluation, it is safe since the vmcs01 is loaded and
current is nested vmexit.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 01:08:13 +02:00
Colin Ian King adad0d02a7 kvm: svm: fix unsigned compare less than zero comparison
vm_data->avic_vm_id is a u32, so the check for a error
return (less than zero) such as -EAGAIN from
avic_get_next_vm_id currently has no effect whatsoever.
Fix this by using a temporary int for the comparison
and assign vm_data->avic_vm_id to this. I used an explicit
u32 cast in the assignment to show why vm_data->avic_vm_id
cannot be used in the assign/compare steps.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:30 +02:00
Paolo Bonzini 095cf55df7 KVM: x86: Hyper-V tsc page setup
Lately tsc page was implemented but filled with empty
values. This patch setup tsc page scale and offset based
on vcpu tsc, tsc_khz and  HV_X64_MSR_TIME_REF_COUNT value.

The valid tsc page drops HV_X64_MSR_TIME_REF_COUNT msr
reads count to zero which potentially improves performance.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
[Computation of TSC page parameters rewritten to use the Linux timekeeper
 parameters. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:20 +02:00
Paolo Bonzini 108b249c45 KVM: x86: introduce get_kvmclock_ns
Introduce a function that reads the exact nanoseconds value that is
provided to the guest in kvmclock.  This crystallizes the notion of
kvmclock as a thin veneer over a stable TSC, that the guest will
(hopefully) convert with NTP.  In other words, kvmclock is *not* a
paravirtualized host-to-guest NTP.

Drop the get_kernel_ns() function, that was used both to get the base
value of the master clock and to get the current value of kvmclock.
The former use is replaced by ktime_get_boot_ns(), the latter is
the purpose of get_kernel_ns().

This also allows KVM to provide a Hyper-V time reference counter that
is synchronized with the time that is computed from the TSC page.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:15 +02:00
Paolo Bonzini 67198ac3f3 KVM: x86: initialize kvmclock_offset
Make the guest's kvmclock count up from zero, not from the host boot
time.  The guest cannot rely on that anyway because it changes on
migration, the numbers are easier on the eye and finally it matches the
desired semantics of the Hyper-V time reference counter.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:13 +02:00
Paolo Bonzini 0d6dd2ff82 KVM: x86: always fill in vcpu->arch.hv_clock
We will use it in the next patches for KVM_GET_CLOCK and as a basis for the
contents of the Hyper-V TSC page.  Get the values from the Linux
timekeeper even if kvmclock is not enabled.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:25:53 +02:00
Ingo Molnar b2c16e1efd Merge branch 'linus' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-20 08:29:21 +02:00
Linus Torvalds 3286be9480 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A couple of small fixes to x86 perf drivers:

   - Measure L2 for HW_CACHE* events on AMD

   - Fix the address filter handling in the intel/pt driver

   - Handle the BTS disabling at the proper place"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2
  perf/x86/intel/pt: Do validate the size of a kernel address filter
  perf/x86/intel/pt: Fix kernel address filter's offset validation
  perf/x86/intel/pt: Fix an off-by-one in address filter configuration
  perf/x86/intel: Don't disable "intel_bts" around "intel" event batching
2016-09-18 11:50:48 -07:00
Luiz Capitulino 4f5758fc67 kvm: x86: export TSC information to user-space
This commit exports the following information to
user-space via the newly created per-vcpu debugfs
directory:

 - TSC offset (as a signed number)
 - TSC scaling ratio
 - TSC scaling ratio fractinal bits

The original intention of this commit was to
export only the TSC offset, but the TSC scaling
information is exported for completeness.

We need to retrieve the TSC offset from user-space
in order to support the merging of host and guest
traces in trace-cmd. Today, we use the kvm_write_tsc_offset
tracepoint, but it has a number of problems (mainly,
it requires a running VM to be rebooted, ftrace setup,
and also tracepoints are not supposed to be ABIs).

The merging of host and guest traces is explained
in more detail in this thread:

 [Qemu-devel] [RFC] host and guest kernel trace merging
 https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html

This commit creates the following files in debugfs:

/sys/kernel/debug/kvm/66828-10/vcpu0/tsc-offset
/sys/kernel/debug/kvm/66828-10/vcpu0/tsc-scaling-ratio
/sys/kernel/debug/kvm/66828-10/vcpu0/tsc-scaling-ratio-frac-bits

The last two are only created if TSC scaling is supported.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:48 +02:00
Luiz Capitulino 235539b48a kvm: add stubs for arch specific debugfs support
Two stubs are added:

 o kvm_arch_has_vcpu_debugfs(): must return true if the arch
   supports creating debugfs entries in the vcpu debugfs dir
   (which will be implemented by the next commit)

 o kvm_arch_create_vcpu_debugfs(): code that creates debugfs
   entries in the vcpu debugfs dir

For x86, this commit introduces a new file to avoid growing
arch/x86/kvm/x86.c even more.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:47 +02:00
Luiz Capitulino 3e3f50262e kvm: x86: drop read_tsc_offset()
The TSC offset can now be read directly from struct kvm_arch_vcpu.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:45 +02:00
Luiz Capitulino a545ab6a00 kvm: x86: add tsc_offset field to struct kvm_vcpu_arch
A future commit will want to easily read a vCPU's TSC offset,
so we store it in struct kvm_arch_vcpu_arch for easy access.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:45 +02:00
Matt Fleming 080fe0b790 perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2
While the Intel PMU monitors the LLC when perf enables the
HW_CACHE_REFERENCES and HW_CACHE_MISSES events, these events monitor
L1 instruction cache fetches (0x0080) and instruction cache misses
(0x0081) on the AMD PMU.

This is extremely confusing when monitoring the same workload across
Intel and AMD machines, since parameters like,

  $ perf stat -e cache-references,cache-misses

measure completely different things.

Instead, make the AMD PMU measure instruction/data cache and TLB fill
requests to the L2 and instruction/data cache and TLB misses in the L2
when HW_CACHE_REFERENCES and HW_CACHE_MISSES are enabled,
respectively. That way the events measure unified caches on both
platforms.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472044328-21302-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 16:19:49 +02:00
Paolo Bonzini b0eaf4506f kvm: x86: correctly reset dest_map->vector when restoring LAPIC state
When userspace sends KVM_SET_LAPIC, KVM schedules a check between
the vCPU's IRR and ISR and the IOAPIC redirection table, in order
to re-establish the IOAPIC's dest_map (the list of CPUs servicing
the real-time clock interrupt with the corresponding vectors).

However, __rtc_irq_eoi_tracking_restore_one was forgetting to
set dest_map->vectors.  Because of this, the IOAPIC did not process
the real-time clock interrupt EOI, ioapic->rtc_status.pending_eoi
got stuck at a non-zero value, and further RTC interrupts were
reported to userspace as coalesced.

Fixes: 9e4aabe2bb
Fixes: 4d99ba898d
Cc: stable@vger.kernel.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: David Gilbert <dgilbert@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-15 18:00:32 +02:00
Ingo Molnar d4b80afbba Merge branch 'linus' into x86/asm, to pick up recent fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:24:53 +02:00
Suravee Suthikulpanit 411b44ba80 svm: Implements update_pi_irte hook to setup posted interrupt
This patch implements update_pi_irte function hook to allow SVM
communicate to IOMMU driver regarding how to set up IRTE for handling
posted interrupt.

In case AVIC is enabled, during vcpu_load/unload, SVM needs to update
IOMMU IRTE with appropriate host physical APIC ID. Also, when
vcpu_blocking/unblocking, SVM needs to update the is-running bit in
the IOMMU IRTE. Both are achieved via calling amd_iommu_update_ga().

However, if GA mode is not enabled for the pass-through device,
IOMMU driver will simply just return when calling amd_iommu_update_ga.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 13:18:58 +02:00
Suravee Suthikulpanit 5881f73757 svm: Introduce AMD IOMMU avic_ga_log_notifier
This patch introduces avic_ga_log_notifier, which will be called
by IOMMU driver whenever it handles the Guest vAPIC (GA) log entry.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 13:11:56 +02:00
Suravee Suthikulpanit 5ea11f2b31 svm: Introduces AVIC per-VM ID
Introduces per-VM AVIC ID and helper functions to manage the IDs.
Currently, the ID will be used to implement 32-bit AVIC IOMMU GA tag.

The ID is 24-bit one-based indexing value, and is managed via helper
functions to get the next ID, or to free an ID once a VM is destroyed.
There should be no ID conflict for any active VMs.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 12:57:20 +02:00
Jan Dakinevich 9ac7e3e815 KVM: nVMX: expose INS/OUTS information support
Expose the feature to L1 hypervisor if host CPU supports it, since
certain hypervisors requires it for own purposes.

According to Intel SDM A.1, if CPU supports the feature,
VMX_INSTRUCTION_INFO field of VMCS will contain detailed information
about INS/OUTS instructions handling. This field is already copied to
VMCS12 for L1 hypervisor (see prepare_vmcs12 routine) independently
feature presence.

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:30 +02:00
Paolo Bonzini 16cb025565 KVM: VMX: not use vmcs_config in setup_vmcs_config
setup_vmcs_config takes a pointer to the vmcs_config global.  The
indirection is somewhat pointless, but just keep things consistent
for now.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:30 +02:00
Paolo Bonzini 1a6982353d KVM: x86: remove stale comments
handle_external_intr does not enable interrupts anymore, vcpu_enter_guest
does it after calling guest_exit_irqoff.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:29 +02:00
Paolo Bonzini bbe41b9508 KVM: x86: ratelimit and decrease severity for guest-triggered printk
These are mostly related to nested VMX.  They needn't have
a loglevel as high as KERN_WARN, and mustn't be allowed to
pollute the host logs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:29 +02:00
Jan Dakinevich 119a9c01a5 KVM: nVMX: pass valid guest linear-address to the L1
If EPT support is exposed to L1 hypervisor, guest linear-address field
of VMCS should contain GVA of L2, the access to which caused EPT violation.

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:28 +02:00
Wanpeng Li f15a75eedc KVM: nVMX: make emulated nested preemption timer pinned
Commit 61abdbe0bc ("kvm: x86: make lapic hrtimer pinned") pins the emulated
lapic timer. This patch does the same for the emulated nested preemption
timer to avoid vmexit an unrelated vCPU and the latency of kicking IPI to
another vCPU.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:27 +02:00
Liang Li 72e0ae58a0 vmx: refine validity check for guest linear address
The validity check for the guest line address is inefficient,
check the invalid value instead of enumerating the valid ones.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:27 +02:00
Wanpeng Li e12c8f36f3 KVM: lapic: adjust preemption timer correctly when goes TSC backward
TSC_OFFSET will be adjusted if discovers TSC backward during vCPU load.
The preemption timer, which relies on the guest tsc to reprogram its
preemption timer value, is also reprogrammed if vCPU is scheded in to
a different pCPU. However, the current implementation reprogram preemption
timer before TSC_OFFSET is adjusted to the right value, resulting in the
preemption timer firing prematurely.

This patch fix it by adjusting TSC_OFFSET before reprogramming preemption
timer if TSC backward.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krċmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-05 16:14:39 +02:00
Luwei Kang 8e3562f6f8 KVM: x86: Expose more Intel AVX512 feature to guest
Expose AVX512DQ, AVX512BW, AVX512VL feature to guest.
Its spec can be found at:
https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
[Resolved a trivial conflict with removed F(PCOMMIT).]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-19 17:51:40 +02:00
Bandan Das c4f138b451 mmu: don't pass *kvm to spte_write_protect and spte_*_dirty
That parameter isn't used in these functions,
it's probably a historical artifact.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-19 17:51:40 +02:00
Wanpeng Li 187ca84b4b KVM: lapic: don't recalculate apic map table twice when enabling LAPIC
APIC map table is recalculated during reset APIC ID to the initial value
when enabling LAPIC. This patch move the recalculate_apic_map() to the
next branch since we don't need to recalculate apic map twice in current
codes.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-19 17:51:40 +02:00
Peter Feiner c95ba92afb kvm: nVMX: fix nested tsc scaling
When the host supported TSC scaling, L2 would use a TSC multiplier of
0, which causes a VM entry failure. Now L2's TSC uses the same
multiplier as L1.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-18 12:19:09 +02:00
Radim Krčmář dccbfcf52c KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the
write with vmcs02 as the current VMCS.
This will incorrectly apply modifications intended for vmcs01 to vmcs02
and L2 can use it to gain access to L0's x2APIC registers by disabling
virtualized x2APIC while using msr bitmap that assumes enabled.

Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the
current VMCS.  An alternative solution would temporarily make vmcs01 the
current VMCS, but it requires more care.

Fixes: 8d14695f95 ("x86, apicv: add virtual x2apic support")
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-18 12:19:08 +02:00
Radim Krčmář d048c09821 KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC
msr bitmap can be used to avoid a VM exit (interception) on guest MSR
accesses.  In some configurations of VMX controls, the guest can even
directly access host's x2APIC MSRs.  See SDM 29.5 VIRTUALIZING MSR-BASED
APIC ACCESSES.

L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI.
To do so, L1 would first trick KVM to disable all possible interceptions
by enabling APICv features and then would turn those features off;
nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would
not intercept previously enabled MSRs even though they were not safe
with the new configuration.

Correctly re-enabling interceptions is not enough as a second bug would
still allow L1+L2 to access host's MSRs: msr bitmap was shared for all
VMCSs, so L1 could trigger a race to get the desired combination of msr
bitmap and VMX controls.

This fix allocates a msr bitmap for every L1 VCPU, allows only safe
x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would
have to intercept everything anyway.

Fixes: 3af18d9c5f ("KVM: nVMX: Prepare for using hardware MSR bitmap")
Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Wincy Van <fanwenyi0529@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-18 12:19:07 +02:00
Kees Cook 404f6aac9b x86: Apply more __ro_after_init and const
Guided by grsecurity's analogous __read_only markings in arch/x86,
this applies several uses of __ro_after_init to structures that are
only updated during __init, and const for some structures that are
never updated.  Additionally extends __init markings to some functions
that are only used during __init, and cleans up some missing C99 style
static initializers.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20160808232906.GA29731@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:55:05 +02:00
Bandan Das 45e11817d5 nvmx: mark ept single context invalidation as supported
Commit 4b85507860 ("KVM: nVMX: Don't advertise single
context invalidation for invept") removed advertising
single context invalidation since the spec does not mandate it.
However, some hypervisors (such as ESX) require it to be present
before willing to use ept in a nested environment. Advertise it
and fallback to the global case.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-04 14:21:52 +02:00
Bandan Das 03331b4b8b nvmx: remove comment about missing nested vpid support
Nested vpid is already supported and both single/global
modes are advertised to the guest

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-04 14:21:51 +02:00
Wanpeng Li 910053002e KVM: lapic: fix access preemption timer stuff even if kernel_irqchip=off
BUG: unable to handle kernel NULL pointer dereference at 000000000000008c
IP: [<ffffffffc04e0180>] kvm_lapic_hv_timer_in_use+0x10/0x20 [kvm]
PGD 0
Oops: 0000 [#1] SMP
Call Trace:
 kvm_arch_vcpu_load+0x86/0x260 [kvm]
 vcpu_load+0x46/0x60 [kvm]
 kvm_vcpu_ioctl+0x79/0x7c0 [kvm]
 ? __lock_is_held+0x54/0x70
 do_vfs_ioctl+0x96/0x6a0
 ? __fget_light+0x2a/0x90
 SyS_ioctl+0x79/0x90
 do_syscall_64+0x7c/0x1e0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP  [<ffffffffc04e0180>] kvm_lapic_hv_timer_in_use+0x10/0x20 [kvm]
 RSP <ffff8800db1f3d70>
CR2: 000000000000008c
---[ end trace a55fb79d2b3b4ee8 ]---

This can be reproduced steadily by kernel_irqchip=off.

We should not access preemption timer stuff if lapic is emulated in userspace.
This patch fix it by avoiding access preemption timer stuff when kernel_irqchip=off.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-04 14:21:43 +02:00
Paolo Bonzini 6f49b2f341 KVM/ARM Changes for v4.8 - Take 2
Includes GSI routing support to go along with the new VGIC and a small fix that
 has been cooking in -next for a while.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXoydqAAoJEEtpOizt6ddyM3oH/1A4VeG/J9q4fBPXqY2tVWXs
 c3P7UgNcrEgUNs/F9ykQY/lb31deecUzaBt1OyTf+RlsNbihq3dQdYcBhxtUODw/
 Faok582ya3UFgLW+IRHcID0EbkVOpIzMhOStYsnU/Dz7HG1JL9HdPzwkid7iu9LT
 fI6yrrBnJFjdWAAQ4BkcEKBENRsY8NTs7jX5vnFA92MkUBby7BmariPDD3FtrB+f
 Ob9B7CxM30pNqsN7OA/QvFOHMJHxf3s1TBKwmPHe5TLIfSzV1YxcEGiMc0lWqF4v
 BT8ZeMGCtjDw94tND1DskfQQRPaMqPmGuRTrAW/IuE2n92bFtbqIqs7Cbw0fzLE=
 =Vm6Q
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-4.8-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM Changes for v4.8 - Take 2

Includes GSI routing support to go along with the new VGIC and a small fix that
has been cooking in -next for a while.
2016-08-04 13:59:56 +02:00
Linus Torvalds 221bb8a46e - ARM: GICv3 ITS emulation and various fixes. Removal of the old
VGIC implementation.
 
 - s390: support for trapping software breakpoints, nested virtualization
 (vSIE), the STHYI opcode, initial extensions for CPU model support.
 
 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups,
 preliminary to this and the upcoming support for hardware virtualization
 extensions.
 
 - x86: support for execute-only mappings in nested EPT; reduced vmexit
 latency for TSC deadline timer (by about 30%) on Intel hosts; support for
 more than 255 vCPUs.
 
 - PPC: bugfixes.
 
 The ugly bit is the conflicts.  A couple of them are simple conflicts due
 to 4.7 fixes, but most of them are with other trees. There was definitely
 too much reliance on Acked-by here.  Some conflicts are for KVM patches
 where _I_ gave my Acked-by, but the worst are for this pull request's
 patches that touch files outside arch/*/kvm.  KVM submaintainers should
 probably learn to synchronize better with arch maintainers, with the
 latter providing topic branches whenever possible instead of Acked-by.
 This is what we do with arch/x86.  And I should learn to refuse pull
 requests when linux-next sends scary signals, even if that means that
 submaintainers have to rebase their branches.
 
 Anyhow, here's the list:
 
 - arch/x86/kvm/vmx.c: handle_pcommit and EXIT_REASON_PCOMMIT was removed
 by the nvdimm tree.  This tree adds handle_preemption_timer and
 EXIT_REASON_PREEMPTION_TIMER at the same place.  In general all mentions
 of pcommit have to go.
 
 There is also a conflict between a stable fix and this patch, where the
 stable fix removed the vmx_create_pml_buffer function and its call.
 
 - virt/kvm/kvm_main.c: kvm_cpu_notifier was removed by the hotplug tree.
 This tree adds kvm_io_bus_get_dev at the same place.
 
 - virt/kvm/arm/vgic.c: a few final bugfixes went into 4.7 before the
 file was completely removed for 4.8.
 
 - include/linux/irqchip/arm-gic-v3.h: this one is entirely our fault;
 this is a change that should have gone in through the irqchip tree and
 pulled by kvm-arm.  I think I would have rejected this kvm-arm pull
 request.  The KVM version is the right one, except that it lacks
 GITS_BASER_PAGES_SHIFT.
 
 - arch/powerpc: what a mess.  For the idle_book3s.S conflict, the KVM
 tree is the right one; everything else is trivial.  In this case I am
 not quite sure what went wrong.  The commit that is causing the mess
 (fd7bacbca4, "KVM: PPC: Book3S HV: Fix TB corruption in guest exit
 path on HMI interrupt", 2016-05-15) touches both arch/powerpc/kernel/
 and arch/powerpc/kvm/.  It's large, but at 396 insertions/5 deletions
 I guessed that it wasn't really possible to split it and that the 5
 deletions wouldn't conflict.  That wasn't the case.
 
 - arch/s390: also messy.  First is hypfs_diag.c where the KVM tree
 moved some code and the s390 tree patched it.  You have to reapply the
 relevant part of commits 6c22c98637, plus all of e030c1125e, to
 arch/s390/kernel/diag.c.  Or pick the linux-next conflict
 resolution from http://marc.info/?l=kvm&m=146717549531603&w=2.
 Second, there is a conflict in gmap.c between a stable fix and 4.8.
 The KVM version here is the correct one.
 
 I have pushed my resolution at refs/heads/merge-20160802 (commit
 3d1f53419842) at git://git.kernel.org/pub/scm/virt/kvm/kvm.git.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXoGm7AAoJEL/70l94x66DugQIAIj703ePAFepB/fCrKHkZZia
 SGrsBdvAtNsOhr7FQ5qvvjLxiv/cv7CymeuJivX8H+4kuUHUllDzey+RPHYHD9X7
 U6n1PdCH9F15a3IXc8tDjlDdOMNIKJixYuq1UyNZMU6NFwl00+TZf9JF8A2US65b
 x/41W98ilL6nNBAsoDVmCLtPNWAqQ3lajaZELGfcqRQ9ZGKcAYOaLFXHv2YHf2XC
 qIDMf+slBGSQ66UoATnYV2gAopNlWbZ7n0vO6tE2KyvhHZ1m399aBX1+k8la/0JI
 69r+Tz7ZHUSFtmlmyByi5IAB87myy2WQHyAPwj+4vwJkDGPcl0TrupzbG7+T05Y=
 =42ti
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:

 - ARM: GICv3 ITS emulation and various fixes.  Removal of the
   old VGIC implementation.

 - s390: support for trapping software breakpoints, nested
   virtualization (vSIE), the STHYI opcode, initial extensions
   for CPU model support.

 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
   of cleanups, preliminary to this and the upcoming support for
   hardware virtualization extensions.

 - x86: support for execute-only mappings in nested EPT; reduced
   vmexit latency for TSC deadline timer (by about 30%) on Intel
   hosts; support for more than 255 vCPUs.

 - PPC: bugfixes.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
  KVM: PPC: Introduce KVM_CAP_PPC_HTM
  MIPS: Select HAVE_KVM for MIPS64_R{2,6}
  MIPS: KVM: Reset CP0_PageMask during host TLB flush
  MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
  MIPS: KVM: Sign extend MFC0/RDHWR results
  MIPS: KVM: Fix 64-bit big endian dynamic translation
  MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
  MIPS: KVM: Use 64-bit CP0_EBase when appropriate
  MIPS: KVM: Set CP0_Status.KX on MIPS64
  MIPS: KVM: Make entry code MIPS64 friendly
  MIPS: KVM: Use kmap instead of CKSEG0ADDR()
  MIPS: KVM: Use virt_to_phys() to get commpage PFN
  MIPS: Fix definition of KSEGX() for 64-bit
  KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
  kvm: x86: nVMX: maintain internal copy of current VMCS
  KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
  KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
  KVM: arm64: vgic-its: Simplify MAPI error handling
  KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
  KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
  ...
2016-08-02 16:11:27 -04:00
Linus Torvalds aeb35d6b74 Merge branch 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 header cleanups from Ingo Molnar:
 "This tree is a cleanup of the x86 tree reducing spurious uses of
  module.h - which should improve build performance a bit"

* 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, crypto: Restore MODULE_LICENSE() to glue_helper.c so it loads
  x86/apic: Remove duplicated include from probe_64.c
  x86/ce4100: Remove duplicated include from ce4100.c
  x86/headers: Include spinlock_types.h in x8664_ksyms_64.c for missing spinlock_t
  x86/platform: Delete extraneous MODULE_* tags fromm ts5500
  x86: Audit and remove any remaining unnecessary uses of module.h
  x86/kvm: Audit and remove any unnecessary uses of module.h
  x86/xen: Audit and remove any unnecessary uses of module.h
  x86/platform: Audit and remove any unnecessary uses of module.h
  x86/lib: Audit and remove any unnecessary uses of module.h
  x86/kernel: Audit and remove any unnecessary uses of module.h
  x86/mm: Audit and remove any unnecessary uses of module.h
  x86: Don't use module.h just for AUTHOR / LICENSE tags
2016-08-01 14:23:42 -04:00
Jim Mattson b80c76ec98 KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
Kexec needs to know the addresses of all VMCSs that are active on
each CPU, so that it can flush them from the VMCS caches. It is
safe to record superfluous addresses that are not associated with
an active VMCS, but it is not safe to omit an address associated
with an active VMCS.

After a call to vmcs_load, the VMCS that was loaded is active on
the CPU. The VMCS should be added to the CPU's list of active
VMCSs before it is loaded.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-01 15:14:24 +02:00
David Matlack 4f2777bc97 kvm: x86: nVMX: maintain internal copy of current VMCS
KVM maintains L1's current VMCS in guest memory, at the guest physical
page identified by the argument to VMPTRLD. This makes hairy
time-of-check to time-of-use bugs possible,as VCPUs can be writing
the the VMCS page in memory while KVM is emulating VMLAUNCH and
VMRESUME.

The spec documents that writing to the VMCS page while it is loaded is
"undefined". Therefore it is reasonable to load the entire VMCS into
an internal cache during VMPTRLD and ignore writes to the VMCS page
-- the guest should be using VMREAD and VMWRITE to access the current
VMCS.

To adhere to the spec, KVM should flush the current VMCS during VMPTRLD,
and the target VMCS during VMCLEAR (as given by the operand to VMCLEAR).
Since this implementation of VMCS caching only maintains the the current
VMCS, VMCLEAR will only do a flush if the operand to VMCLEAR is the
current VMCS pointer.

KVM will also flush during VMXOFF, which is not mandated by the spec,
but also not in conflict with the spec.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-01 14:49:05 +02:00
Linus Torvalds a6408f6cb6 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
 "This is the next part of the hotplug rework.

   - Convert all notifiers with a priority assigned

   - Convert all CPU_STARTING/DYING notifiers

     The final removal of the STARTING/DYING infrastructure will happen
     when the merge window closes.

  Another 700 hundred line of unpenetrable maze gone :)"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  timers/core: Correct callback order during CPU hot plug
  leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
  powerpc/numa: Convert to hotplug state machine
  arm/perf: Fix hotplug state machine conversion
  irqchip/armada: Avoid unused function warnings
  ARC/time: Convert to hotplug state machine
  clocksource/atlas7: Convert to hotplug state machine
  clocksource/armada-370-xp: Convert to hotplug state machine
  clocksource/exynos_mct: Convert to hotplug state machine
  clocksource/arm_global_timer: Convert to hotplug state machine
  rcu: Convert rcutree to hotplug state machine
  KVM/arm/arm64/vgic-new: Convert to hotplug state machine
  smp/cfd: Convert core to hotplug state machine
  x86/x2apic: Convert to CPU hotplug state machine
  profile: Convert to hotplug state machine
  timers/core: Convert to hotplug state machine
  hrtimer: Convert to hotplug state machine
  x86/tboot: Convert to hotplug state machine
  arm64/armv8 deprecated: Convert to hotplug state machine
  hwtracing/coresight-etm4x: Convert to hotplug state machine
  ...
2016-07-29 13:55:30 -07:00
Linus Torvalds f0c98ebc57 libnvdimm for 4.8
1/ Replace pcommit with ADR / directed-flushing:
    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement either
    ADR, or provide one or more flush addresses per nvdimm. ADR
    (Asynchronous DRAM Refresh) flushes data in posted write buffers to the
    memory controller on a power-fail event. Flush addresses are defined in
    ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
    "Flush Hint Address Structure". A flush hint is an mmio address that
    when written and fenced assures that all previous posted writes
    targeting a given dimm have been flushed to media.
 
 2/ On-demand ARS (address range scrub):
    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices.  When latent errors are detected we re-scrub the media
    to refresh the bad block list, userspace can also request a re-scrub at
    any time.
 
 3/ Support for the Microsoft DSM (device specific method) command format.
 
 4/ Support for EDK2/OVMF virtual disk device memory ranges.
 
 5/ Various fixes and cleanups across the subsystem.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
 NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
 ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
 trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
 KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
 qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
 WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
 DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
 b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
 cN3edFVIdxvZeMgM5Ubq
 =xCBG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:

 - Replace pcommit with ADR / directed-flushing.

   The pcommit instruction, which has not shipped on any product, is
   deprecated.  Instead, the requirement is that platforms implement
   either ADR, or provide one or more flush addresses per nvdimm.

   ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
   to the memory controller on a power-fail event.

   Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
   Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
   A flush hint is an mmio address that when written and fenced assures
   that all previous posted writes targeting a given dimm have been
   flushed to media.

 - On-demand ARS (address range scrub).

   Linux uses the results of the ACPI ARS commands to track bad blocks
   in pmem devices.  When latent errors are detected we re-scrub the
   media to refresh the bad block list, userspace can also request a
   re-scrub at any time.

 - Support for the Microsoft DSM (device specific method) command
   format.

 - Support for EDK2/OVMF virtual disk device memory ranges.

 - Various fixes and cleanups across the subsystem.

* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
  libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
  nfit: do an ARS scrub on hitting a latent media error
  nfit: move to nfit/ sub-directory
  nfit, libnvdimm: allow an ARS scrub to be triggered on demand
  libnvdimm: register nvdimm_bus devices with an nd_bus driver
  pmem: clarify a debug print in pmem_clear_poison
  x86/insn: remove pcommit
  Revert "KVM: x86: add pcommit support"
  nfit, tools/testing/nvdimm/: unify shutdown paths
  libnvdimm: move ->module to struct nvdimm_bus_descriptor
  nfit: cleanup acpi_nfit_init calling convention
  nfit: fix _FIT evaluation memory leak + use after free
  tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
  tools/testing/nvdimm: add virtual ramdisk range
  acpi, nfit: treat virtual ramdisk SPA as pmem region
  pmem: kill __pmem address space
  pmem: kill wmb_pmem()
  libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
  fs/dax: remove wmb_pmem()
  libnvdimm, pmem: flush posted-write queues on shutdown
  ...
2016-07-28 17:38:16 -07:00
Linus Torvalds 85802a49a8 This merge introduces three patches that are later reverted,
- Switching of MSR_TSC_AUX in SVM was thought to cause a host
    misbehavior, but it was later cleared of those doubts and the patch
    moved code to a hot path, so we reverted it.  That patch also needed
    a fix for 32 bit builds and both were reverted in one go.
 
  - Al Viro noticed that a fix for a leak in an error path was not valid
    with the given API and provided a better fix, so the original patch
    was reverted.
 
 Then there are two VMX fixes that move code around because VMCS was not
 accessed between vcpu_load() and vcpu_put(), a simple ARM VHE fix, and
 two one-liners for PML and MTRR.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCAAGBQJXljRwAAoJEED/6hsPKofoIZMIAIMm2h5HKplmpT007dVCt1zw
 dG8gO9hxOxstXfGVkNEZvdxyUb0ilFMO5AYySS1ctENpswVAZlKWlyc+aNGsHXOS
 KFylAUNHlibua1xgB64Sitgub8M9Ct5mDfvvqWL79aCgHTLcDxnb/0NprTqB2P3O
 TfsLaiKMDOeZs4nTcs62vNqpPJzoFc6DK2x1RltFGF9RpR7bOD7gnp7KypDWJx7S
 1LleWPHboxHQ40qf8dxAb7HwEARfXndlP6ZoCkf2stoWTwuexHJfesUnsNgEuXnX
 6YJ9mO7np/bHfSDpGMJbb9pPI5g7UDwOzmgvYQvzhak3LRmvjsZePpWchlb0yCs=
 =3VD4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM leftovers from Radim Krčmář:
 "This is a combination of two pull requests for 4.7-rc8 that were not
  merged due to looking hairy.  I have changed the tag message to focus
  on circumstances of contained reverts as they were likely the reason
  behind rejection.

  This merge introduces three patches that are later reverted,

   - Switching of MSR_TSC_AUX in SVM was thought to cause a host
     misbehavior, but it was later cleared of those doubts and the patch
     moved code to a hot path, so we reverted it.  That patch also
     needed a fix for 32 bit builds and both were reverted in one go.

   - Al Viro noticed that a fix for a leak in an error path was not
     valid with the given API and provided a better fix, so the original
     patch was reverted.

  Then there are two VMX fixes that move code around because VMCS was
  not accessed between vcpu_load() and vcpu_put(), a simple ARM VHE fix,
  and two one-liners for PML and MTRR"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  arm64: KVM: VHE: Context switch MDSCR_EL1
  KVM: VMX: handle PML full VMEXIT that occurs during event delivery
  Revert "KVM: SVM: fix trashing of MSR_TSC_AUX"
  KVM: SVM: do not set MSR_TSC_AUX on 32-bit builds
  KVM: don't use anon_inode_getfd() before possible failures
  Revert "KVM: release anon file in failure path of vm creation"
  KVM: release anon file in failure path of vm creation
  KVM: nVMX: Fix memory corruption when using VMCS shadowing
  kvm: vmx: ensure VMCS is current while enabling PML
  KVM: SVM: fix trashing of MSR_TSC_AUX
  KVM: MTRR: fix kvm_mtrr_check_gfn_range_consistency page fault
2016-07-26 11:50:42 -07:00
Dan Williams 0606263f24 Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-next 2016-07-24 08:05:44 -07:00
Dan Williams dfa169bbee Revert "KVM: x86: add pcommit support"
This reverts commit 8b3e34e46a.

Given the deprecation of the pcommit instruction, the relevant VMX
features and CPUID bits are not going to be rolled into the SDM.  Remove
their usage from KVM.

Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-23 11:04:23 -07:00
Eric Auger d9565a7399 KVM: Move kvm_setup_default/empty_irq_routing declaration in arch specific header
kvm_setup_default_irq_routing and kvm_setup_empty_irq_routing are
not used by generic code. So let's move the declarations in x86 irq.h
header instead of kvm_host.h.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2016-07-22 18:52:00 +01:00
Cao, Lei b244c9fc25 KVM: VMX: handle PML full VMEXIT that occurs during event delivery
With PML enabled, guest will shut down if a PML full VMEXIT occurs during
event delivery. According to Intel SDM 27.2.3, PML full VMEXIT can occur when
event is being delivered through IDT, so KVM should not exit to user space
with error. Instead, it should let EXIT_REASON_PML_FULL go through and the
event will be re-injected on the next VMENTRY.

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Cc: stable@vger.kernel.org
Fixes: 843e433057 ("KVM: VMX: Add PML support in VMX")
[Shortened the summary and Cc'd stable.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-07-16 15:27:40 +02:00
Radim Krčmář 6a907cd0ad Revert "KVM: SVM: fix trashing of MSR_TSC_AUX"
This reverts commit 9770404a00.

The reverted patch is not needed as only userspace uses RDTSCP and
MSR_TSC_AUX is in host_save_user_msrs[] and therefore properly saved in
svm_vcpu_load() and restored in svm_vcpu_put() before every switch to
userspace.

The reverted patch did not allow the kernel to use RDTSCP in the future,
because of missed trashing in svm_set_msr() and 64-bit ifdef.

This reverts commit 2b23c3a6e3.

2b23c3a6e3 ("KVM: SVM: do not set MSR_TSC_AUX on 32-bit builds") is a
build fix for 9770404a00 and reverting them separately would only
break more bisections.

Cc: stable@vger.kernel.org
2016-07-16 15:26:54 +02:00
Sebastian Andrzej Siewior 251a5fd64b x86/kvm/kvmclock: Convert to hotplug state machine
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.

We assumed that the priority ordering was ment to invoke the online
callback as the last step. In the original code this also invoked the
down prepare callback as the last step. With the symmetric state
machine the down prepare callback is now the first step.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153335.542880859@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:40:21 +02:00
Anna-Maria Gleixner 162e52a117 KVM/x86: Remove superfluous SMP function call
Since the following commit:

  1cf4f629d9 ("cpu/hotplug: Move online calls to hotplugged cpu")

... the CPU_ONLINE and CPU_DOWN_PREPARE notifiers are always run on the hot
plugged CPU, and as of commit:

  3b9d6da67e ("cpu/hotplug: Fix rollback during error-out in __cpu_disable()")

the CPU_DOWN_FAILED notifier also runs on the hot plugged CPU.  This patch
converts the SMP functional calls into direct calls.

smp_function_call_single() executes the function with interrupts
disabled. This calling convention is not preserved because there
is no reason to do so.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153335.452527104@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:40:21 +02:00
Paolo Bonzini 2b23c3a6e3 KVM: SVM: do not set MSR_TSC_AUX on 32-bit builds
This is unnecessary---and besides, __getcpu() is not even
available on 32-bit builds.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 19:11:29 +02:00
Jim Mattson 2f1fe81123 KVM: nVMX: Fix memory corruption when using VMCS shadowing
When freeing the nested resources of a vcpu, there is an assumption that
the vcpu's vmcs01 is the current VMCS on the CPU that executes
nested_release_vmcs12(). If this assumption is violated, the vcpu's
vmcs01 may be made active on multiple CPUs at the same time, in
violation of Intel's specification. Moreover, since the vcpu's vmcs01 is
not VMCLEARed on every CPU on which it is active, it can linger in a
CPU's VMCS cache after it has been freed and potentially
repurposed. Subsequent eviction from the CPU's VMCS cache on a capacity
miss can result in memory corruption.

It is not sufficient for vmx_free_vcpu() to call vmx_load_vmcs01(). If
the vcpu in question was last loaded on a different CPU, it must be
migrated to the current CPU before calling vmx_load_vmcs01().

Signed-off-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 19:11:20 +02:00
Peter Feiner 4e59516a12 kvm: vmx: ensure VMCS is current while enabling PML
Between loading the new VMCS and enabling PML, the CPU was unpinned.
If the vCPU thread were migrated to another CPU in the interim (e.g.,
due to preemption or sleeping alloc_page), then the VMWRITEs to enable
PML would target the wrong VMCS -- or no VMCS at all:

  [ 2087.266950] vmwrite error: reg 200e value 3fe1d52000 (err -506126336)
  [ 2087.267062] vmwrite error: reg 812 value 1ff (err 511)
  [ 2087.267125] vmwrite error: reg 401e value 12229c00 (err 304258048)

This patch ensures that the VMCS remains current while enabling PML by
doing the VMWRITEs while the CPU is pinned. Allocation of the PML buffer
is hoisted out of the critical section.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 19:11:19 +02:00
Paolo Bonzini 9770404a00 KVM: SVM: fix trashing of MSR_TSC_AUX
I don't know what I was thinking when I wrote commit 46896c73c1 ("KVM:
svm: add support for RDTSCP", 2015-11-12); I missed write_rdtscp_aux which
obviously uses MSR_TSC_AUX.

Therefore we do need to save/restore MSR_TSC_AUX in svm_vcpu_run.

Cc: stable@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Fixes: 46896c73c1 ("KVM: svm: add support for RDTSCP")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 19:11:10 +02:00
Paul Gortmaker 1767e931e3 x86/kvm: Audit and remove any unnecessary uses of module.h
Historically a lot of these existed because we did not have
a distinction between what was modular code and what was providing
support to modules via EXPORT_SYMBOL and friends.  That changed
when we forked out support for the latter into the export.h file.

This means we should be able to reduce the usage of module.h
in code that is obj-y Makefile or bool Kconfig.  In the case of
kvm where it is modular, we can extend that to also include files
that are building basic support functionality but not related
to loading or registering the final module; such files also have
no need whatsoever for module.h

The advantage in removing such instances is that module.h itself
sources about 15 other headers; adding significantly to what we feed
cpp, and it can obscure what headers we are effectively using.

Since module.h was the source for init.h (for __init) and for
export.h (for EXPORT_SYMBOL) we consider each instance for the
presence of either and replace as needed.

Several instances got replaced with moduleparam.h since that was
really all that was required for those particular files.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/20160714001901.31603-8-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 15:07:00 +02:00
Radim Krčmář af1bae5497 KVM: x86: bump KVM_MAX_VCPU_ID to 1023
kzalloc was replaced with kvm_kvzalloc to allow non-contiguous areas and
rcu had to be modified to cope with it.

The practical limit for KVM_MAX_VCPU_ID right now is INT_MAX, but lower
value was chosen in case there were bugs.  1023 is sufficient maximum
APIC ID for 288 VCPUs.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:29:35 +02:00
Radim Krčmář c519265f2a KVM: x86: add a flag to disable KVM x2apic broadcast quirk
Add KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK as a feature flag to
KVM_CAP_X2APIC_API.

The quirk made KVM interpret 0xff as a broadcast even in x2APIC mode.
The enableable capability is needed in order to support standard x2APIC and
remain backward compatible.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Expand kvm_apic_mda comment. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:29:34 +02:00
Radim Krčmář 3713131345 KVM: x86: add KVM_CAP_X2APIC_API
KVM_CAP_X2APIC_API is a capability for features related to x2APIC
enablement.  KVM_X2APIC_API_32BIT_FORMAT feature can be enabled to
extend APIC ID in get/set ioctl and MSI addresses to 32 bits.
Both are needed to support x2APIC.

The feature has to be enableable and disabled by default, because
get/set ioctl shifted and truncated APIC ID to 8 bits by using a
non-standard protocol inspired by xAPIC and the change is not
backward-compatible.

Changes to MSI addresses follow the format used by interrupt remapping
unit.  The upper address word, that used to be 0, contains upper 24 bits
of the LAPIC address in its upper 24 bits.  Lower 8 bits are reserved as
0.  Using the upper address word is not backward-compatible either as we
didn't check that userspace zeroed the word.  Reserved bits are still
not explicitly checked, but non-zero data will affect LAPIC addresses,
which will cause a bug.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:57 +02:00
Radim Krčmář c63cf538eb KVM: pass struct kvm to kvm_set_routing_entry
Arch-specific code will use it.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:56 +02:00
Radim Krčmář 4d8e772bf8 KVM: x86: reset lapic base in kvm_lapic_reset
LAPIC is reset in xAPIC mode and the surrounding code expects that.
KVM never resets after initialization.  This patch is just for sanity.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:56 +02:00
Radim Krčmář c93de59dcd KVM: VMX: optimize APIC ID read with APICv
The register is in hardware-compatible format now, so there is not need
to intercept.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:55 +02:00
Radim Krčmář 49bd29ba1d KVM: x86: reset APIC ID when enabling LAPIC
APIC ID should be set to the initial APIC ID when enabling LAPIC.
This only matters if the guest changes APIC ID.  No sane OS does that.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:54 +02:00
Radim Krčmář a92e2543d6 KVM: x86: use hardware-compatible format for APIC ID register
We currently always shift APIC ID as if APIC was in xAPIC mode.
x2APIC mode wants to use more bits and storing a hardware-compabible
value is the the sanest option.

KVM API to set the lapic expects that bottom 8 bits of APIC ID are in
top 8 bits of APIC_ID register, so the register needs to be shifted in
x2APIC mode.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:54 +02:00
Radim Krčmář 3159d36ad7 KVM: x86: use generic function for MSI parsing
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:53 +02:00
Radim Krčmář 0ca52e7b81 KVM: x86: dynamic kvm_apic_map
x2APIC supports up to 2^32-1 LAPICs, but most guest in coming years will
probably has fewer VCPUs.  Dynamic size saves memory at the cost of
turning one constant into a variable.

apic_map mutex had to be moved before allocation to avoid races with cpu
hotplug.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:53 +02:00
Radim Krčmář e45115b62f KVM: x86: use physical LAPIC array for logical x2APIC
Logical x2APIC IDs map injectively to physical x2APIC IDs, so we can
reuse the physical array for them.  This allows us to save space by
sizing the logical maps according to the needs of xAPIC.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:52 +02:00
Radim Krčmář 64aa47bfc4 KVM: x86: add kvm_apic_map_get_dest_lapic
kvm_irq_delivery_to_apic_fast and kvm_intr_is_single_vcpu_fast both
compute the interrupt destination.  Factor the code.

'struct kvm_lapic **dst = NULL' had to be added to silence GCC.
GCC might complain about potential NULL access in the future, because it
missed conditions that avoided uninitialized uses of dst.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:51 +02:00
Bandan Das 02120c45b0 kvm: vmx: advertise support for ept execute only
MMU now knows about execute only mappings, so
advertise the feature to L1 hypervisors

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:50 +02:00
Bandan Das d95c55687e kvm: mmu: track read permission explicitly for shadow EPT page tables
To support execute only mappings on behalf of L1 hypervisors,
reuse ACC_USER_MASK to signify if the L1 hypervisor has the R bit
set.

For the nested EPT case, we assumed that the U bit was always set
since there was no equivalent in EPT page tables.  Strictly
speaking, this was not necessary because handle_ept_violation
never set PFERR_USER_MASK in the error code (uf=0 in the
parlance of update_permission_bitmask).  We now have to set
both U and UF correctly, respectively in FNAME(gpte_access)
and in handle_ept_violation.

Also in handle_ept_violation bit 3 of the exit qualification is
not enough to detect a present PTE; all three bits 3-5 have to
be checked.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:50 +02:00
Bandan Das ffb128c89b kvm: mmu: don't set the present bit unconditionally
To support execute only mappings on behalf of L1
hypervisors, we need to teach set_spte() to honor all three of
L1's XWR bits.  As a start, add a new variable "shadow_present_mask"
that will be set for non-EPT shadow paging and clear for EPT.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:14 +02:00
Bandan Das 812f30b234 kvm: mmu: remove is_present_gpte()
We have two versions of the above function.
To prevent confusion and bugs in the future, remove
the non-FNAME version entirely and replace all calls
with the actual check.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:02:47 +02:00
Bandan Das 8d5cf1610d kvm: mmu: extend the is_present check to 32 bits
This is safe because this function is called
on host controlled page table and non-present/non-MMIO
sptes never use bits 1..31. For the EPT case, this
ensures that cases where only the execute bit is set
is marked valid.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:02:46 +02:00
Paolo Bonzini 8391ce447f KVM: VMX: introduce vm_{entry,exit}_control_reset_shadow
There is no reason to read the entry/exit control fields of the
VMCS and immediately write back the same value.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11 10:07:49 +02:00
Paolo Bonzini 9314006db8 KVM: nVMX: keep preemption timer enabled during L2 execution
Because the vmcs12 preemption timer is emulated through a separate hrtimer,
we can keep on using the preemption timer in the vmcs02 to emulare L1's
TSC deadline timer.

However, the corresponding bit in the pin-based execution control field
must be kept consistent between vmcs01 and vmcs02.  On vmentry we copy
it into the vmcs02; on vmexit the preemption timer must be disabled in
the vmcs01 if a preemption timer vmexit happened while in guest mode.

The preemption timer value in the vmcs02 is set by vmx_vcpu_run, so it
need not be considered in prepare_vmcs02.

Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Tested-by: Wanpeng Li <kernellwp@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11 09:49:29 +02:00
Wanpeng Li 55123e3c86 KVM: nVMX: avoid incorrect preemption timer vmexit in nested guest
The preemption timer for nested VMX is emulated by hrtimer which is started on L2
entry, stopped on L2 exit and evaluated via the check_nested_events hook. However,
nested_vmx_exit_handled is always returning true for preemption timer vmexit.  Then,
the L1 preemption timer vmexit is captured and be treated as a L2 preemption
timer vmexit, causing NULL pointer dereferences or worse in the L1 guest's
vmexit handler:

    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: [<          (null)>]           (null)
    PGD 0
    Oops: 0010 [#1] SMP
    Call Trace:
     ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
     handle_preemption_timer+0xe/0x20 [kvm_intel]
     vmx_handle_exit+0x169/0x15a0 [kvm_intel]
     ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
     kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
     ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
     ? vcpu_load+0x1c/0x60 [kvm]
     ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
     kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
     do_vfs_ioctl+0x96/0x6a0
     ? __fget_light+0x2a/0x90
     SyS_ioctl+0x79/0x90
     do_syscall_64+0x68/0x180
     entry_SYSCALL64_slow_path+0x25/0x25
    Code:  Bad RIP value.
    RIP  [<          (null)>]           (null)
     RSP <ffff8800b5263c48>
    CR2: 0000000000000000
    ---[ end trace 9c70c48b1a2bc66e ]---

This can be reproduced readily by preemption timer enabled on L0 and disabled
on L1.

Return false since preemption timer vmexits must never be reflected to L2.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11 09:49:26 +02:00
Paolo Bonzini 1c17c3e6bf KVM: VMX: reflect broken preemption timer in vmcs_config
Simplify cpu_has_vmx_preemption_timer.  This is consistent with the
rest of setup_vmcs_config and preparatory for the next patch.

Tested-by: Wanpeng Li <kernellwp@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11 09:48:49 +02:00
Ingo Molnar 08fb98f5bf Merge branch 'linus' into x86/fpu, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10 17:11:17 +02:00
Alexis Dambricourt 30b072ce03 KVM: MTRR: fix kvm_mtrr_check_gfn_range_consistency page fault
The following #PF may occurs:
[ 1403.317041] BUG: unable to handle kernel paging request at 0000000200000068
[ 1403.317045] IP: [<ffffffffc04c20b0>] __mtrr_lookup_var_next+0x10/0xa0 [kvm]

[ 1403.317123] Call Trace:
[ 1403.317134]  [<ffffffffc04c2a65>] ? kvm_mtrr_check_gfn_range_consistency+0xc5/0x120 [kvm]
[ 1403.317143]  [<ffffffffc04ac11f>] ? tdp_page_fault+0x9f/0x2c0 [kvm]
[ 1403.317152]  [<ffffffffc0498128>] ? kvm_set_msr_common+0x858/0xc00 [kvm]
[ 1403.317161]  [<ffffffffc04b8883>] ? x86_emulate_insn+0x273/0xd30 [kvm]
[ 1403.317171]  [<ffffffffc04c04e4>] ? kvm_cpuid+0x34/0x190 [kvm]
[ 1403.317180]  [<ffffffffc04a5bb9>] ? kvm_mmu_page_fault+0x59/0xe0 [kvm]
[ 1403.317183]  [<ffffffffc0d729e1>] ? vmx_handle_exit+0x1d1/0x14a0 [kvm_intel]
[ 1403.317185]  [<ffffffffc0d75f3f>] ? atomic_switch_perf_msrs+0x6f/0xa0 [kvm_intel]
[ 1403.317187]  [<ffffffffc0d7621d>] ? vmx_vcpu_run+0x2ad/0x420 [kvm_intel]
[ 1403.317196]  [<ffffffffc04a0962>] ? kvm_arch_vcpu_ioctl_run+0x622/0x1550 [kvm]
[ 1403.317204]  [<ffffffffc049abb9>] ? kvm_arch_vcpu_load+0x59/0x210 [kvm]
[ 1403.317206]  [<ffffffff81036245>] ? __kernel_fpu_end+0x35/0x100
[ 1403.317213]  [<ffffffffc0487eb6>] ? kvm_vcpu_ioctl+0x316/0x5d0 [kvm]
[ 1403.317215]  [<ffffffff81088225>] ? do_sigtimedwait+0xd5/0x220
[ 1403.317217]  [<ffffffff811f84dd>] ? do_vfs_ioctl+0x9d/0x5c0
[ 1403.317224]  [<ffffffffc04928ae>] ? kvm_on_user_return+0x3e/0x70 [kvm]
[ 1403.317225]  [<ffffffff811f8a74>] ? SyS_ioctl+0x74/0x80
[ 1403.317227]  [<ffffffff815bf0b6>] ? entry_SYSCALL_64_fastpath+0x1e/0xa8
[ 1403.317242] RIP  [<ffffffffc04c20b0>] __mtrr_lookup_var_next+0x10/0xa0 [kvm]

At mtrr_lookup_fixed_next(), when the condition
'if (iter->index >= ARRAY_SIZE(iter->mtrr_state->fixed_ranges))' becomes true,
mtrr_lookup_var_start() is called with iter->range with gargabe values from the
fixed MTRR union field. Then, list_prepare_entry() do not call list_entry()
initialization, keeping a garbage pointer in iter->range which is accessed in
the following __mtrr_lookup_var_next() call.

Fixes: f571c0973e
Signed-off-by: Alexis Dambricourt <alexis@blade-group.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-05 16:14:43 +02:00
Wei Yongjun 03f6a22a39 KVM: x86: Use ARRAY_SIZE instead of dividing sizeof array with sizeof an element
Use ARRAY_SIZE instead of dividing sizeof array with sizeof an element

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-05 14:41:45 +02:00
Wanpeng Li 196f20ca52 KVM: vmx: fix missed cancellation of TSC deadline timer
INFO: rcu_sched detected stalls on CPUs/tasks:
 1-...: (11800 GPs behind) idle=45d/140000000000000/0 softirq=0/0 fqs=21663
 (detected by 0, t=65016 jiffies, g=11500, c=11499, q=719)
Task dump for CPU 1:
qemu-system-x86 R  running task        0  3529   3525 0x00080808
 ffff8802021791a0 ffff880212895040 0000000000000001 00007f1c2c00db40
 ffff8801dd20fcd3 ffffc90002b98000 ffff8801dd20fc88 ffff8801dd20fcf8
 0000000000000286 ffff8801dd2ac538 ffff8801dd20fcc0 ffffffffc06949c9
Call Trace:
? kvm_write_guest_cached+0xb9/0x160 [kvm]
? __delay+0xf/0x20
? wait_lapic_expire+0x14a/0x200 [kvm]
? kvm_arch_vcpu_ioctl_run+0xcbe/0x1b00 [kvm]
? kvm_arch_vcpu_ioctl_run+0xe34/0x1b00 [kvm]
? kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
? __fget+0x5/0x210
? do_vfs_ioctl+0x96/0x6a0
? __fget_light+0x2a/0x90
? SyS_ioctl+0x79/0x90
? do_syscall_64+0x7c/0x1e0
? entry_SYSCALL64_slow_path+0x25/0x25

This can be reproduced readily by running a full dynticks guest(since hrtimer
in guest is heavily used) w/ lapic_timer_advance disabled.

If fail to program hardware preemption timer, we will fallback to hrtimer based
method, however, a previous programmed preemption timer miss to cancel in this
scenario which results in one hardware preemption timer and one hrtimer emulated
tsc deadline timer run simultaneously. So sometimes the target guest deadline
tsc is earlier than guest tsc, which leads to the computation in vmx_set_hv_timer
can underflow and cause delta_tsc to be set a huge value, then host soft lockup
as above.

This patch fix it by cancelling the previous programmed preemption timer if there
is once we failed to program the new preemption timer and fallback to hrtimer
based method.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:42 +02:00
Wanpeng Li bd97ad0e7e KVM: x86: introduce cancel_hv_tscdeadline
Introduce cancel_hv_tscdeadline() to encapsulate preemption
timer cancel stuff.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:41 +02:00
Paolo Bonzini 9175d2e97b KVM: vmx: fix underflow in TSC deadline calculation
If the TSC deadline timer is programmed really close to the deadline or
even in the past, the computation in vmx_set_hv_timer can underflow and
cause delta_tsc to be set to a huge value.  This generally results
in vmx_set_hv_timer returning -ERANGE, but we can fix it by limiting
delta_tsc to be positive or zero.

Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:39 +02:00
Paolo Bonzini f2485b3e0c KVM: x86: use guest_exit_irqoff
This gains a few clock cycles per vmexit.  On Intel there is no need
anymore to enable the interrupts in vmx_handle_external_intr, since
we are using the "acknowledge interrupt on exit" feature.  AMD
needs to do that, and must be careful to avoid the interrupt shadow.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:38 +02:00
Paolo Bonzini 91fa0f8e9e KVM: x86: always use "acknowledge interrupt on exit"
This is necessary to simplify handle_external_intr in the next patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:36 +02:00
Paolo Bonzini 6edaa5307f KVM: remove kvm_guest_enter/exit wrappers
Use the functions from context_tracking.h directly.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:21 +02:00
Quentin Casasnovas ff30ef40de KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode.
I couldn't get Xen to boot a L2 HVM when it was nested under KVM - it was
getting a GP(0) on a rather unspecial vmread from Xen:

     (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=n  Not tainted ]----
     (XEN) CPU:    1
     (XEN) RIP:    e008:[<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
     (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d1v0)
     (XEN) rax: ffff82d0801e6288   rbx: ffff83003ffbfb7c   rcx: fffffffffffab928
     (XEN) rdx: 0000000000000000   rsi: 0000000000000000   rdi: ffff83000bdd0000
     (XEN) rbp: ffff83000bdd0000   rsp: ffff83003ffbfab0   r8:  ffff830038813910
     (XEN) r9:  ffff83003faf3958   r10: 0000000a3b9f7640   r11: ffff83003f82d418
     (XEN) r12: 0000000000000000   r13: ffff83003ffbffff   r14: 0000000000004802
     (XEN) r15: 0000000000000008   cr0: 0000000080050033   cr4: 00000000001526e0
     (XEN) cr3: 000000003fc79000   cr2: 0000000000000000
     (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
     (XEN) Xen code around <ffff82d0801e629e> (vmx_get_segment_register+0x14e/0x450):
     (XEN)  00 00 41 be 02 48 00 00 <44> 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00
     (XEN) Xen stack trace from rsp=ffff83003ffbfab0:

     ...

     (XEN) Xen call trace:
     (XEN)    [<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
     (XEN)    [<ffff82d0801f3695>] get_page_from_gfn_p2m+0x165/0x300
     (XEN)    [<ffff82d0801bfe32>] hvmemul_get_seg_reg+0x52/0x60
     (XEN)    [<ffff82d0801bfe93>] hvm_emulate_prepare+0x53/0x70
     (XEN)    [<ffff82d0801ccacb>] handle_mmio+0x2b/0xd0
     (XEN)    [<ffff82d0801be591>] emulate.c#_hvm_emulate_one+0x111/0x2c0
     (XEN)    [<ffff82d0801cd6a4>] handle_hvm_io_completion+0x274/0x2a0
     (XEN)    [<ffff82d0801f334a>] __get_gfn_type_access+0xfa/0x270
     (XEN)    [<ffff82d08012f3bb>] timer.c#add_entry+0x4b/0xb0
     (XEN)    [<ffff82d08012f80c>] timer.c#remove_entry+0x7c/0x90
     (XEN)    [<ffff82d0801c8433>] hvm_do_resume+0x23/0x140
     (XEN)    [<ffff82d0801e4fe7>] vmx_do_resume+0xa7/0x140
     (XEN)    [<ffff82d080164aeb>] context_switch+0x13b/0xe40
     (XEN)    [<ffff82d080128e6e>] schedule.c#schedule+0x22e/0x570
     (XEN)    [<ffff82d08012c0cc>] softirq.c#__do_softirq+0x5c/0x90
     (XEN)    [<ffff82d0801602c5>] domain.c#idle_loop+0x25/0x50
     (XEN)
     (XEN)
     (XEN) ****************************************
     (XEN) Panic on CPU 1:
     (XEN) GENERAL PROTECTION FAULT
     (XEN) [error_code=0000]
     (XEN) ****************************************

Tracing my host KVM showed it was the one injecting the GP(0) when
emulating the VMREAD and checking the destination segment permissions in
get_vmx_mem_address():

     3)               |    vmx_handle_exit() {
     3)               |      handle_vmread() {
     3)               |        nested_vmx_check_permission() {
     3)               |          vmx_get_segment() {
     3)   0.074 us    |            vmx_read_guest_seg_base();
     3)   0.065 us    |            vmx_read_guest_seg_selector();
     3)   0.066 us    |            vmx_read_guest_seg_ar();
     3)   1.636 us    |          }
     3)   0.058 us    |          vmx_get_rflags();
     3)   0.062 us    |          vmx_read_guest_seg_ar();
     3)   3.469 us    |        }
     3)               |        vmx_get_cs_db_l_bits() {
     3)   0.058 us    |          vmx_read_guest_seg_ar();
     3)   0.662 us    |        }
     3)               |        get_vmx_mem_address() {
     3)   0.068 us    |          vmx_cache_reg();
     3)               |          vmx_get_segment() {
     3)   0.074 us    |            vmx_read_guest_seg_base();
     3)   0.068 us    |            vmx_read_guest_seg_selector();
     3)   0.071 us    |            vmx_read_guest_seg_ar();
     3)   1.756 us    |          }
     3)               |          kvm_queue_exception_e() {
     3)   0.066 us    |            kvm_multiple_exception();
     3)   0.684 us    |          }
     3)   4.085 us    |        }
     3)   9.833 us    |      }
     3) + 10.366 us   |    }

Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
Developper Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
Control Structure", I found that we're enforcing that the destination
operand is NOT located in a read-only data segment or any code segment when
the L1 is in long mode - BUT that check should only happen when it is in
protected mode.

Shuffling the code a bit to make our emulation follow the specification
allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
without problems.

Fixes: f9eb4af67c ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Eugene Korenevsky <ekorenevsky@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-27 15:30:44 +02:00
Marcelo Tosatti b606f189c7 KVM: LAPIC: cap __delay at lapic_timer_advance_ns
The host timer which emulates the guest LAPIC TSC deadline
timer has its expiration diminished by lapic_timer_advance_ns
nanoseconds. Therefore if, at wait_lapic_expire, a difference
larger than lapic_timer_advance_ns is encountered, delay at most
lapic_timer_advance_ns.

This fixes a problem where the guest can cause the host
to delay for large amounts of time.

Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-27 15:30:41 +02:00
Marcelo Tosatti 8d93c874ac KVM: x86: move nsec_to_cycles from x86.c to x86.h
Move the inline function nsec_to_cycles from x86.c to x86.h, as
the next patch uses it from lapic.c.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-27 15:30:38 +02:00
Arnd Bergmann 87aeb54f1b kvm: x86: use getboottime64
KVM reads the current boottime value as a struct timespec in order to
calculate the guest wallclock time, resulting in an overflow in 2038
on 32-bit systems.

The data then gets passed as an unsigned 32-bit number to the guest,
and that in turn overflows in 2106.

We cannot do much about the second overflow, which affects both 32-bit
and 64-bit hosts, but we can ensure that they both behave the same
way and don't overflow until 2106, by using getboottime64() to read
a timespec64 value.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:30 +02:00
Ashok Raj c45dcc71b7 KVM: VMX: enable guest access to LMCE related MSRs
On Intel platforms, this patch adds LMCE to KVM MCE supported
capabilities and handles guest access to LMCE related MSRs.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported
           Only enable LMCE on Intel platform
           Check MSR_IA32_FEATURE_CONTROL when handling guest
             access to MSR_IA32_MCG_EXT_CTL]
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:29 +02:00
Haozhong Zhang 37e4c997da KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL
KVM currently does not check the value written to guest
MSR_IA32_FEATURE_CONTROL, though bits corresponding to disabled features
may be set. This patch makes KVM to validate individual bits written to
guest MSR_IA32_FEATURE_CONTROL according to enabled features.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:29 +02:00
Haozhong Zhang 3b84080b95 KVM: VMX: move msr_ia32_feature_control to vcpu_vmx
msr_ia32_feature_control will be used for LMCE and not depend only on
nested anymore, so move it from struct nested_vmx to struct vcpu_vmx.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:28 +02:00
Yunhong Jiang 64672c95ea kvm: vmx: hook preemption timer support
Hook the VMX preemption timer to the "hv timer" functionality added
by the previous patch.  This includes: checking if the feature is
supported, if the feature is broken on the CPU, the hooks to
setup/clean the VMX preemption timer, arming the timer on vmentry
and handling the vmexit.

A module parameter states if the VMX preemption timer should be
utilized.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
[Move hv_deadline_tsc to struct vcpu_vmx, use -1 as the "unset" value.
 Put all VMX bits here.  Enable it by default #yolo. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:50 +02:00
Yunhong Jiang bc22512bb2 kvm: vmx: rename vmx_pre/post_block to pi_pre/post_block
Prepare to switch from preemption timer to hrtimer in the
vmx_pre/post_block. Current functions are only for posted interrupt,
rename them accordingly.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:49 +02:00
Yunhong Jiang ce7a058a21 KVM: x86: support using the vmx preemption timer for tsc deadline timer
The VMX preemption timer can be used to virtualize the TSC deadline timer.
The VMX preemption timer is armed when the vCPU is running, and a VMExit
will happen if the virtual TSC deadline timer expires.

When the vCPU thread is blocked because of HLT, KVM will switch to use
an hrtimer, and then go back to the VMX preemption timer when the vCPU
thread is unblocked.

This solution avoids the complex OS's hrtimer system, and the host
timer interrupt handling cost, replacing them with a little math
(for guest->host TSC and host TSC->preemption timer conversion)
and a cheaper VMexit.  This benefits latency for isolated pCPUs.

[A word about performance... Yunhong reported a 30% reduction in average
 latency from cyclictest.  I made a similar test with tscdeadline_latency
 from kvm-unit-tests, and measured

 - ~20 clock cycles loss (out of ~3200, so less than 1% but still
   statistically significant) in the worst case where the test halts
   just after programming the TSC deadline timer

 - ~800 clock cycles gain (25% reduction in latency) in the best case
   where the test busy waits.

 I removed the VMX bits from Yunhong's patch, to concentrate them in the
 next patch - Paolo]

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:48 +02:00
Yunhong Jiang 53f9eedff7 kvm: lapic: separate start_sw_tscdeadline from start_apic_timer
The function to start the tsc deadline timer virtualization will be used
also by the pre_block hook when we use the preemption timer; change it
to a separate function. No logic changes.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:46 +02:00
Yang Zhang a005219162 kvm: vmx: check apicv is active before using VT-d posted interrupt
VT-d posted interrupt is relying on the CPU side's posted interrupt.
Need to check whether VCPU's APICv is active before enabing VT-d
posted interrupt.

Fixes: d62caabb41
Cc: stable@vger.kernel.org
Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Shengge Ding <shengge.dsg@alibaba-inc.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 09:38:24 +02:00
Suravee Suthikulpanit 5b8abf1f33 kvm: svm: Do not support AVIC if not CONFIG_X86_LOCAL_APIC
Add logic to disable AVIC #ifndef CONFIG_X86_LOCAL_APIC.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 00:28:30 +02:00
Suravee Suthikulpanit 7d669f5084 kvm: svm: Fix implicit declaration for __default_cpu_present_to_apicid()
The commit 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
introduces a build error due to implicit function declaration
when #ifdef CONFIG_X86_32 and #ifndef CONFIG_X86_LOCAL_APIC
(as reported by Kbuild test robot i386-randconfig-x0-06121009).

So, this patch introduces kvm_cpu_get_apicid() wrapper
around __default_cpu_present_to_apicid() with additional
handling if CONFIG_X86_LOCAL_APIC is not defined.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: commit 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 00:28:24 +02:00
Paolo Bonzini 557abc40d1 KVM: remove kvm_vcpu_compatible
The new created_vcpus field makes it possible to avoid the race between
irqchip and VCPU creation in a much nicer way; just check under kvm->lock
whether a VCPU has already been created.

We can then remove KVM_APIC_ARCHITECTURE too, because at this point the
symbol is only governing the default definition of kvm_vcpu_compatible.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 00:05:00 +02:00
Andrea Gelmini bb3541f175 KVM: x86: Fix typos
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-14 11:16:28 +02:00
Dave Hansen d1898b7336 x86/fpu: Add tracepoints to dump FPU state at key points
I've been carrying this patch around for a bit and it's helped me
solve at least a couple FPU-related bugs.  In addition to using
it for debugging, I also drug it out because using AVX (and
AVX2/AVX-512) can have serious power consequences for a modern
core.  It's very important to be able to figure out who is using
it.

It's also insanely useful to go out and see who is using a given
feature, like MPX or Memory Protection Keys.  If you, for
instance, want to find all processes using protection keys, you
can do:

	echo 'xfeatures & 0x200' > filter

Since 0x200 is the protection keys feature bit.

Note that this touches the KVM code.  KVM did a CREATE_TRACE_POINTS
and then included a bunch of random headers.  If anyone one of
those included other tracepoints, it would have defined the *OTHER*
tracepoints.  That's bogus, so move it to the right place.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20160601174220.3CDFB90E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-08 13:33:33 +02:00
Kai Huang dca4d72877 kvm/x86: remove unnecessary header file inclusion
arch/x86/kvm/iommu.c includes <linux/intel-iommu.h> and <linux/dmar.h>, which
both are unnecessary, in fact incorrect to be here as they are intel specific.

Building kvm on x86 passed after removing above inclusion.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-03 15:28:19 +02:00
Paolo Bonzini 250715a617 KVM: x86: protect KVM_CREATE_PIT/KVM_CREATE_PIT2 with kvm->lock
The syzkaller folks reported a NULL pointer dereference that seems
to be cause by a race between KVM_CREATE_IRQCHIP and KVM_CREATE_PIT2.
The former takes kvm->lock (except when registering the devices,
which needs kvm->slots_lock); the latter takes kvm->slots_lock only.
Change KVM_CREATE_PIT2 to follow the same model as KVM_CREATE_IRQCHIP.

Testcase:

    #include <pthread.h>
    #include <linux/kvm.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <stdint.h>
    #include <string.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    long r[23];

    void* thr1(void* arg)
    {
        struct kvm_pit_config pitcfg = { .flags = 4 };
        switch ((long)arg) {
        case 0: r[2]  = open("/dev/kvm", O_RDONLY|O_ASYNC);    break;
        case 1: r[3]  = ioctl(r[2], KVM_CREATE_VM, 0);         break;
        case 2: r[4]  = ioctl(r[3], KVM_CREATE_IRQCHIP, 0);    break;
        case 3: r[22] = ioctl(r[3], KVM_CREATE_PIT2, &pitcfg); break;
        }
        return 0;
    }

    int main(int argc, char **argv)
    {
        long i;
        pthread_t th[4];

        memset(r, -1, sizeof(r));
        for (i = 0; i < 4; i++) {
            pthread_create(&th[i], 0, thr, (void*)i);
            if (argc > 1 && rand()%2) usleep(rand()%1000);
        }
        usleep(20000);
        return 0;
    }

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-03 15:28:19 +02:00
Paolo Bonzini ee2cd4b755 KVM: x86: rename process_smi to enter_smm, process_smi_request to process_smi
Make the function names more similar between KVM_REQ_NMI and KVM_REQ_SMI.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-03 15:28:19 +02:00
Paolo Bonzini c43203cab1 KVM: x86: avoid simultaneous queueing of both IRQ and SMI
If the processor exits to KVM while delivering an interrupt,
the hypervisor then requeues the interrupt for the next vmentry.
Trying to enter SMM in this same window causes to enter non-root
mode in emulated SMM (i.e. with IF=0) and with a request to
inject an IRQ (i.e. with a valid VM-entry interrupt info field).
This is invalid guest state (SDM 26.3.1.4 "Check on Guest RIP
and RFLAGS") and the processor fails vmentry.

The fix is to defer the injection from KVM_REQ_SMI to KVM_REQ_EVENT,
like we already do for e.g. NMIs.  This patch doesn't change the
name of the process_smi function so that it can be applied to
stable releases.  The next patch will modify the names so that
process_nmi and process_smi handle respectively KVM_REQ_NMI and
KVM_REQ_SMI.

This is especially common with Windows, probably due to the
self-IPI trick that it uses to deliver deferred procedure
calls (DPCs).

Reported-by: Laszlo Ersek <lersek@redhat.com>
Reported-by: Michał Zegan <webczat_200@poczta.onet.pl>
Fixes: 64d6067057
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-03 15:28:19 +02:00
Paolo Bonzini d14bdb553f KVM: x86: fix OOPS after invalid KVM_SET_DEBUGREGS
MOV to DR6 or DR7 causes a #GP if an attempt is made to write a 1 to
any of bits 63:32.  However, this is not detected at KVM_SET_DEBUGREGS
time, and the next KVM_RUN oopses:

   general protection fault: 0000 [#1] SMP
   CPU: 2 PID: 14987 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1
   Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
   [...]
   Call Trace:
    [<ffffffffa072c93d>] kvm_arch_vcpu_ioctl_run+0x141d/0x14e0 [kvm]
    [<ffffffffa071405d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
    [<ffffffff81241648>] do_vfs_ioctl+0x298/0x480
    [<ffffffff812418a9>] SyS_ioctl+0x79/0x90
    [<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71
   Code: 55 83 ff 07 48 89 e5 77 27 89 ff ff 24 fd 90 87 80 81 0f 23 fe 5d c3 0f 23 c6 5d c3 0f 23 ce 5d c3 0f 23 d6 5d c3 0f 23 de 5d c3 <0f> 23 f6 5d c3 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
   RIP  [<ffffffff810639eb>] native_set_debugreg+0x2b/0x40
    RSP <ffff88005836bd50>

Testcase (beautified/reduced from syzkaller output):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <linux/kvm.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>

    long r[8];

    int main()
    {
        struct kvm_debugregs dr = { 0 };

        r[2] = open("/dev/kvm", O_RDONLY);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
        r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);

        memcpy(&dr,
               "\x5d\x6a\x6b\xe8\x57\x3b\x4b\x7e\xcf\x0d\xa1\x72"
               "\xa3\x4a\x29\x0c\xfc\x6d\x44\x00\xa7\x52\xc7\xd8"
               "\x00\xdb\x89\x9d\x78\xb5\x54\x6b\x6b\x13\x1c\xe9"
               "\x5e\xd3\x0e\x40\x6f\xb4\x66\xf7\x5b\xe3\x36\xcb",
               48);
        r[7] = ioctl(r[4], KVM_SET_DEBUGREGS, &dr);
        r[6] = ioctl(r[4], KVM_RUN, 0);
    }

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02 17:38:50 +02:00
Paolo Bonzini 78e546c824 KVM: fail KVM_SET_VCPU_EVENTS with invalid exception number
This cannot be returned by KVM_GET_VCPU_EVENTS, so it is okay to return
EINVAL.  It causes a WARN from exception_type:

    WARNING: CPU: 3 PID: 16732 at arch/x86/kvm/x86.c:345 exception_type+0x49/0x50 [kvm]()
    CPU: 3 PID: 16732 Comm: a.out Tainted: G        W       4.4.6-300.fc23.x86_64 #1
    Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
     0000000000000286 000000006308a48b ffff8800bec7fcf8 ffffffff813b542e
     0000000000000000 ffffffffa0966496 ffff8800bec7fd30 ffffffff810a40f2
     ffff8800552a8000 0000000000000000 00000000002c267c 0000000000000001
    Call Trace:
     [<ffffffff813b542e>] dump_stack+0x63/0x85
     [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
     [<ffffffff810a423a>] warn_slowpath_null+0x1a/0x20
     [<ffffffffa0924809>] exception_type+0x49/0x50 [kvm]
     [<ffffffffa0934622>] kvm_arch_vcpu_ioctl_run+0x10a2/0x14e0 [kvm]
     [<ffffffffa091c04d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
     [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
     [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
     [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
    ---[ end trace b1a0391266848f50 ]---

Testcase (beautified/reduced from syzkaller output):

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    long r[31];

    int main()
    {
        memset(r, -1, sizeof(r));
        r[2] = open("/dev/kvm", O_RDONLY);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
        r[7] = ioctl(r[3], KVM_CREATE_VCPU, 0);

        struct kvm_vcpu_events ve = {
                .exception.injected = 1,
                .exception.nr = 0xd4
        };
        r[27] = ioctl(r[7], KVM_SET_VCPU_EVENTS, &ve);
        r[30] = ioctl(r[7], KVM_RUN, 0);
        return 0;
    }

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02 17:38:50 +02:00
Paolo Bonzini 83676e9238 KVM: x86: avoid vmalloc(0) in the KVM_SET_CPUID
This causes an ugly dmesg splat.  Beautified syzkaller testcase:

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <linux/kvm.h>

    long r[8];

    int main()
    {
        struct kvm_cpuid2 c = { 0 };
        r[2] = open("/dev/kvm", O_RDWR);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
        r[4] = ioctl(r[3], KVM_CREATE_VCPU, 0x8);
        r[7] = ioctl(r[4], KVM_SET_CPUID, &c);
        return 0;
    }

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02 17:38:50 +02:00
Paolo Bonzini b21629da12 kvm: x86: avoid warning on repeated KVM_SET_TSS_ADDR
Found by syzkaller:

    WARNING: CPU: 3 PID: 15175 at arch/x86/kvm/x86.c:7705 __x86_set_memory_region+0x1dc/0x1f0 [kvm]()
    CPU: 3 PID: 15175 Comm: a.out Tainted: G        W       4.4.6-300.fc23.x86_64 #1
    Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
     0000000000000286 00000000950899a7 ffff88011ab3fbf0 ffffffff813b542e
     0000000000000000 ffffffffa0966496 ffff88011ab3fc28 ffffffff810a40f2
     00000000000001fd 0000000000003000 ffff88014fc50000 0000000000000000
    Call Trace:
     [<ffffffff813b542e>] dump_stack+0x63/0x85
     [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
     [<ffffffff810a423a>] warn_slowpath_null+0x1a/0x20
     [<ffffffffa09251cc>] __x86_set_memory_region+0x1dc/0x1f0 [kvm]
     [<ffffffffa092521b>] x86_set_memory_region+0x3b/0x60 [kvm]
     [<ffffffffa09bb61c>] vmx_set_tss_addr+0x3c/0x150 [kvm_intel]
     [<ffffffffa092f4d4>] kvm_arch_vm_ioctl+0x654/0xbc0 [kvm]
     [<ffffffffa091d31a>] kvm_vm_ioctl+0x9a/0x6f0 [kvm]
     [<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
     [<ffffffff812414a9>] SyS_ioctl+0x79/0x90
     [<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71

Testcase:

    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <fcntl.h>
    #include <string.h>
    #include <linux/kvm.h>

    long r[8];

    int main()
    {
        memset(r, -1, sizeof(r));
	r[2] = open("/dev/kvm", O_RDONLY|O_TRUNC);
        r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);
        r[5] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul);
        r[7] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul);
        return 0;
    }

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02 17:38:50 +02:00
Dmitry Bilunov 0c2df2a1af KVM: Handle MSR_IA32_PERF_CTL
Intel CPUs having Turbo Boost feature implement an MSR to provide a
control interface via rdmsr/wrmsr instructions. One could detect the
presence of this feature by issuing one of these instructions and
handling the #GP exception which is generated in case the referenced MSR
is not implemented by the CPU.

KVM's vCPU model behaves exactly as a real CPU in this case by injecting
a fault when MSR_IA32_PERF_CTL is called (which KVM does not support).
However, some operating systems use this register during an early boot
stage in which their kernel is not capable of handling #GP correctly,
causing #DP and finally a triple fault effectively resetting the vCPU.

This patch implements a dummy handler for MSR_IA32_PERF_CTL to avoid the
crashes.

Signed-off-by: Dmitry Bilunov <kmeaw@yandex-team.ru>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02 17:38:50 +02:00
Nadav Amit b19ee2ff3b KVM: x86: avoid write-tearing of TDP
In theory, nothing prevents the compiler from write-tearing PTEs, or
split PTE writes. These partially-modified PTEs can be fetched by other
cores and cause mayhem. I have not really encountered such case in
real-life, but it does seem possible.

For example, the compiler may try to do something creative for
kvm_set_pte_rmapp() and perform multiple writes to the PTE.

Signed-off-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-02 17:38:50 +02:00
Linus Torvalds e28e909c36 - move kvm_stat tool from QEMU repo into tools/kvm/kvm_stat
(kvm_stat had nothing to do with QEMU in the first place -- the tool
    only interprets debugfs)
 - expose per-vm statistics in debugfs and support them in kvm_stat
   (KVM always collected per-vm statistics, but they were summarised into
    global statistics)
 
 x86:
  - fix dynamic APICv (VMX was improperly configured and a guest could
    access host's APIC MSRs, CVE-2016-4440)
  - minor fixes
 
 ARM changes from Christoffer Dall:
  "This set of changes include the new vgic, which is a reimplementation
   of our horribly broken legacy vgic implementation.  The two
   implementations will live side-by-side (with the new being the
   configured default) for one kernel release and then we'll remove the
   legacy one.
 
   Also fixes a non-critical issue with virtual abort injection to
   guests."
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCAAGBQJXRz0KAAoJEED/6hsPKofosiMIAIHmRI+9I6VMNmQe5vrZKz9/
 vt89QGxDJrFQwhEuZovenLEDaY6rMIJNguyvIbPhNuXNHIIPWbe6cO6OPwByqkdo
 WI/IIqcAJN/Bpwt4/Y2977A5RwDOwWLkaDs0LrZCEKPCgeh9GWQf+EfyxkDJClhG
 uIgbSAU+t+7b05K3c6NbiQT/qCzDTCdl6In6PI/DFSRRkXDaTcopjjp1PmMUSSsR
 AM8LGhEzMer+hGKOH7H5TIbN+HFzAPjBuDGcoZt0/w9IpmmS5OMd3ZrZ320cohz8
 zZQooRcFrT0ulAe+TilckmRMJdMZ69fyw3nzfqgAKEx+3PaqjKSY/tiEgqqDJHY=
 =EEBK
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull second batch of KVM updates from Radim Krčmář:
 "General:

   - move kvm_stat tool from QEMU repo into tools/kvm/kvm_stat (kvm_stat
     had nothing to do with QEMU in the first place -- the tool only
     interprets debugfs)

   - expose per-vm statistics in debugfs and support them in kvm_stat
     (KVM always collected per-vm statistics, but they were summarised
     into global statistics)

  x86:

   - fix dynamic APICv (VMX was improperly configured and a guest could
     access host's APIC MSRs, CVE-2016-4440)

   - minor fixes

  ARM changes from Christoffer Dall:

   - new vgic reimplementation of our horribly broken legacy vgic
     implementation.  The two implementations will live side-by-side
     (with the new being the configured default) for one kernel release
     and then we'll remove the legacy one.

   - fix for a non-critical issue with virtual abort injection to guests"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits)
  tools: kvm_stat: Add comments
  tools: kvm_stat: Introduce pid monitoring
  KVM: Create debugfs dir and stat files for each VM
  MAINTAINERS: Add kvm tools
  tools: kvm_stat: Powerpc related fixes
  tools: Add kvm_stat man page
  tools: Add kvm_stat vm monitor script
  kvm:vmx: more complete state update on APICv on/off
  KVM: SVM: Add more SVM_EXIT_REASONS
  KVM: Unify traced vector format
  svm: bitwise vs logical op typo
  KVM: arm/arm64: vgic-new: Synchronize changes to active state
  KVM: arm/arm64: vgic-new: enable build
  KVM: arm/arm64: vgic-new: implement mapped IRQ handling
  KVM: arm/arm64: vgic-new: Wire up irqfd injection
  KVM: arm/arm64: vgic-new: Add vgic_v2/v3_enable
  KVM: arm/arm64: vgic-new: vgic_init: implement map_resources
  KVM: arm/arm64: vgic-new: vgic_init: implement vgic_init
  KVM: arm/arm64: vgic-new: vgic_init: implement vgic_create
  KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init
  ...
2016-05-27 13:41:54 -07:00
Roman Kagan 3ce424e454 kvm:vmx: more complete state update on APICv on/off
The function to update APICv on/off state (in particular, to deactivate
it when enabling Hyper-V SynIC) is incomplete: it doesn't adjust
APICv-related fields among secondary processor-based VM-execution
controls.  As a result, Windows 2012 guests get stuck when SynIC-based
auto-EOI interrupt intersected with e.g. an IPI in the guest.

In addition, the MSR intercept bitmap isn't updated every time "virtualize
x2APIC mode" is toggled.  This path can only be triggered by a malicious
guest, because Windows didn't use x2APIC but rather their own synthetic
APIC access MSRs; however a guest running in a SynIC-enabled VM could
switch to x2APIC and thus obtain direct access to host APIC MSRs
(CVE-2016-4440).

The patch fixes those omissions.

Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Reported-by: Steve Rutherford <srutherford@google.com>
Reported-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-25 16:11:37 +02:00
Dan Carpenter 5446a979e0 svm: bitwise vs logical op typo
These were supposed to be a bitwise operation but there is a typo.
The result is mostly harmless, but sparse correctly complains.

Fixes: 44a95dae1d ('KVM: x86: Detect and Initialize AVIC support')
Fixes: 18f40c53e1 ('svm: Add VMEXIT handlers for AVIC')
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-24 12:11:05 +02:00
Linus Torvalds 7beaa24ba4 Small release overall.
- x86: miscellaneous fixes, AVIC support (local APIC virtualization,
 AMD version)
 
 - s390: polling for interrupts after a VCPU goes to halted state is
 now enabled for s390; use hardware provided information about facility
 bits that do not need any hypervisor activity, and other fixes for
 cpu models and facilities; improve perf output; floating interrupt
 controller improvements.
 
 - MIPS: miscellaneous fixes
 
 - PPC: bugfixes only
 
 - ARM: 16K page size support, generic firmware probing layer for
 timer and GIC
 
 Christoffer Dall (KVM-ARM maintainer) says:
 "There are a few changes in this pull request touching things outside
  KVM, but they should all carry the necessary acks and it made the
  merge process much easier to do it this way."
 
 though actually the irqchip maintainers' acks didn't make it into the
 patches.  Marc Zyngier, who is both irqchip and KVM-ARM maintainer,
 later acked at http://mid.gmane.org/573351D1.4060303@arm.com
 "more formally and for documentation purposes".
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXPJjyAAoJEL/70l94x66DhioH/j4fwQ0FmfPSM9PArzaFHQdx
 LNE3tU4+bobbsy1BJr4DiAaOUQn3DAgwUvGLWXdeLiOXtoWXBiFHKaxlqEsCA6iQ
 xcTH1TgfxsVoqGQ6bT9X/2GCx70heYpcWG3f+zqBy7ZfFmQykLAC/HwOr52VQL8f
 hUFi3YmTHcnorp0n5Xg+9r3+RBS4D/kTbtdn6+KCLnPJ0RcgNkI3/NcafTemoofw
 Tkv8+YYFNvKV13qlIfVqxMa0GwWI3pP6YaNKhaS5XO8Pu16HuuF1JthJsUBDzwBa
 RInp8R9MoXgsBYhLpz3jc9vWG7G9yDl5LehsD9KOUGOaFYJ7sQN+QZOusa6jFgA=
 =llO5
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Small release overall.

  x86:
   - miscellaneous fixes
   - AVIC support (local APIC virtualization, AMD version)

  s390:
   - polling for interrupts after a VCPU goes to halted state is now
     enabled for s390
   - use hardware provided information about facility bits that do not
     need any hypervisor activity, and other fixes for cpu models and
     facilities
   - improve perf output
   - floating interrupt controller improvements.

  MIPS:
   - miscellaneous fixes

  PPC:
   - bugfixes only

  ARM:
   - 16K page size support
   - generic firmware probing layer for timer and GIC

  Christoffer Dall (KVM-ARM maintainer) says:
    "There are a few changes in this pull request touching things
     outside KVM, but they should all carry the necessary acks and it
     made the merge process much easier to do it this way."

  though actually the irqchip maintainers' acks didn't make it into the
  patches.  Marc Zyngier, who is both irqchip and KVM-ARM maintainer,
  later acked at http://mid.gmane.org/573351D1.4060303@arm.com ('more
  formally and for documentation purposes')"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (82 commits)
  KVM: MTRR: remove MSR 0x2f8
  KVM: x86: make hwapic_isr_update and hwapic_irr_update look the same
  svm: Manage vcpu load/unload when enable AVIC
  svm: Do not intercept CR8 when enable AVIC
  svm: Do not expose x2APIC when enable AVIC
  KVM: x86: Introducing kvm_x86_ops.apicv_post_state_restore
  svm: Add VMEXIT handlers for AVIC
  svm: Add interrupt injection via AVIC
  KVM: x86: Detect and Initialize AVIC support
  svm: Introduce new AVIC VMCB registers
  KVM: split kvm_vcpu_wake_up from kvm_vcpu_kick
  KVM: x86: Introducing kvm_x86_ops VCPU blocking/unblocking hooks
  KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks
  KVM: x86: Rename kvm_apic_get_reg to kvm_lapic_get_reg
  KVM: x86: Misc LAPIC changes to expose helper functions
  KVM: shrink halt polling even more for invalid wakeups
  KVM: s390: set halt polling to 80 microseconds
  KVM: halt_polling: provide a way to qualify wakeups during poll
  KVM: PPC: Book3S HV: Re-enable XICS fast path for irqfd-generated interrupts
  kvm: Conditionally register IRQ bypass consumer
  ...
2016-05-19 11:27:09 -07:00
Andy Honig 9842df6200 KVM: MTRR: remove MSR 0x2f8
MSR 0x2f8 accessed the 124th Variable Range MTRR ever since MTRR support
was introduced by 9ba075a664 ("KVM: MTRR support").

0x2f8 became harmful when 910a6aae4e ("KVM: MTRR: exactly define the
size of variable MTRRs") shrinked the array of VR MTRRs from 256 to 8,
which made access to index 124 out of bounds.  The surrounding code only
WARNs in this situation, thus the guest gained a limited read/write
access to struct kvm_arch_vcpu.

0x2f8 is not a valid VR MTRR MSR, because KVM has/advertises only 16 VR
MTRR MSRs, 0x200-0x20f.  Every VR MTRR is set up using two MSRs, 0x2f8
was treated as a PHYSBASE and 0x2f9 would be its PHYSMASK, but 0x2f9 was
not implemented in KVM, therefore 0x2f8 could never do anything useful
and getting rid of it is safe.

This fixes CVE-2016-3713.

Fixes: 910a6aae4e ("KVM: MTRR: exactly define the size of variable MTRRs")
Cc: stable@vger.kernel.org
Reported-by: David Matlack <dmatlack@google.com>
Signed-off-by: Andy Honig <ahonig@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:32 +02:00
Paolo Bonzini 67c9dddc95 KVM: x86: make hwapic_isr_update and hwapic_irr_update look the same
Neither APICv nor AVIC actually need the first argument of
hwapic_isr_update, but the vCPU makes more sense than passing the
pointer to the whole virtual machine!  In fact in the APICv case it's
just happening that the vCPU is used implicitly, through the loaded VMCS.

The second argument instead is named differently, make it consistent.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:32 +02:00
Suravee Suthikulpanit 8221c13700 svm: Manage vcpu load/unload when enable AVIC
When a vcpu is loaded/unloaded to a physical core, we need to update
host physical APIC ID information in the Physical APIC-ID table
accordingly.

Also, when vCPU is blocking/un-blocking (due to halt instruction),
we need to make sure that the is-running bit in set accordingly in the
physical APIC-ID table.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
[Return void from new functions, add WARN_ON when they returned negative
 errno; split load and put into separate function as they have almost
 nothing in common. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:31 +02:00
Suravee Suthikulpanit 3bbf3565f4 svm: Do not intercept CR8 when enable AVIC
When enable AVIC:
    * Do not intercept CR8 since this should be handled by AVIC HW.
    * Also, we don't need to sync cr8/V_TPR and APIC backing page.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[Rename svm_in_nested_interrupt_shadow to svm_nested_virtualize_tpr. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:31 +02:00
Suravee Suthikulpanit 46781eae2e svm: Do not expose x2APIC when enable AVIC
Since AVIC only virtualizes xAPIC hardware for the guest, this patch
disable x2APIC support in guest CPUID.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:30 +02:00
Suravee Suthikulpanit be8ca170ed KVM: x86: Introducing kvm_x86_ops.apicv_post_state_restore
Adding kvm_x86_ops hooks to allow APICv to do post state restore.
This is required to support VM save and restore feature.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:30 +02:00
Suravee Suthikulpanit 18f40c53e1 svm: Add VMEXIT handlers for AVIC
This patch introduces VMEXIT handlers, avic_incomplete_ipi_interception()
and avic_unaccelerated_access_interception() along with two trace points
(trace_kvm_avic_incomplete_ipi and trace_kvm_avic_unaccelerated_access).

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:29 +02:00
Suravee Suthikulpanit 340d3bc366 svm: Add interrupt injection via AVIC
This patch introduces a new mechanism to inject interrupt using AVIC.
Since VINTR is not supported when enable AVIC, we need to inject
interrupt via APIC backing page instead.

This patch also adds support for AVIC doorbell, which is used by
KVM to signal a running vcpu to check IRR for injected interrupts.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:29 +02:00
Suravee Suthikulpanit 44a95dae1d KVM: x86: Detect and Initialize AVIC support
This patch introduces AVIC-related data structure, and AVIC
initialization code.

There are three main data structures for AVIC:
    * Virtual APIC (vAPIC) backing page (per-VCPU)
    * Physical APIC ID table (per-VM)
    * Logical APIC ID table (per-VM)

Currently, AVIC is disabled by default. Users can manually
enable AVIC via kernel boot option kvm-amd.avic=1 or during
kvm-amd module loading with parameter avic=1.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
[Avoid extra indentation (Boris). - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:28 +02:00
Suravee Suthikulpanit 03543133ce KVM: x86: Introducing kvm_x86_ops VM init/destroy hooks
Adding function pointers in struct kvm_x86_ops for processor-specific
layer to provide hooks for when KVM initialize and destroy VM.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:26 +02:00
Suravee Suthikulpanit dfb9595429 KVM: x86: Rename kvm_apic_get_reg to kvm_lapic_get_reg
Rename kvm_apic_get_reg to kvm_lapic_get_reg to be consistent with
the existing kvm_lapic_set_reg counterpart.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:25 +02:00
Suravee Suthikulpanit 1e6e2755b6 KVM: x86: Misc LAPIC changes to expose helper functions
Exporting LAPIC utility functions and macros for re-use in SVM code.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-18 18:04:25 +02:00
Linus Torvalds 16bf834805 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits)
  gitignore: fix wording
  mfd: ab8500-debugfs: fix "between" in printk
  memstick: trivial fix of spelling mistake on management
  cpupowerutils: bench: fix "average"
  treewide: Fix typos in printk
  IB/mlx4: printk fix
  pinctrl: sirf/atlas7: fix printk spelling
  serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/
  w1: comment spelling s/minmum/minimum/
  Blackfin: comment spelling s/divsor/divisor/
  metag: Fix misspellings in comments.
  ia64: Fix misspellings in comments.
  hexagon: Fix misspellings in comments.
  tools/perf: Fix misspellings in comments.
  cris: Fix misspellings in comments.
  c6x: Fix misspellings in comments.
  blackfin: Fix misspelling of 'register' in comment.
  avr32: Fix misspelling of 'definitions' in comment.
  treewide: Fix typos in printk
  Doc: treewide : Fix typos in DocBook/filesystem.xml
  ...
2016-05-17 17:05:30 -07:00
Linus Torvalds 168f1a7163 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 asm updates from Ingo Molnar:
 "The main changes in this cycle were:

   - MSR access API fixes and enhancements (Andy Lutomirski)

   - early exception handling improvements (Andy Lutomirski)

   - user-space FS/GS prctl usage fixes and improvements (Andy
     Lutomirski)

   - Remove the cpu_has_*() APIs and replace them with equivalents
     (Borislav Petkov)

   - task switch micro-optimization (Brian Gerst)

   - 32-bit entry code simplification (Denys Vlasenko)

   - enhance PAT handling in enumated CPUs (Toshi Kani)

  ... and lots of other cleanups/fixlets"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  x86/arch_prctl/64: Restore accidentally removed put_cpu() in ARCH_SET_GS
  x86/entry/32: Remove asmlinkage_protect()
  x86/entry/32: Remove GET_THREAD_INFO() from entry code
  x86/entry, sched/x86: Don't save/restore EFLAGS on task switch
  x86/asm/entry/32: Simplify pushes of zeroed pt_regs->REGs
  selftests/x86/ldt_gdt: Test set_thread_area() deletion of an active segment
  x86/tls: Synchronize segment registers in set_thread_area()
  x86/asm/64: Rename thread_struct's fs and gs to fsbase and gsbase
  x86/arch_prctl/64: Remove FSBASE/GSBASE < 4G optimization
  x86/segments/64: When load_gs_index fails, clear the base
  x86/segments/64: When loadsegment(fs, ...) fails, clear the base
  x86/asm: Make asm/alternative.h safe from assembly
  x86/asm: Stop depending on ptrace.h in alternative.h
  x86/entry: Rename is_{ia32,x32}_task() to in_{ia32,x32}_syscall()
  x86/asm: Make sure verify_cpu() has a good stack
  x86/extable: Add a comment about early exception handlers
  x86/msr: Set the return value to zero when native_rdmsr_safe() fails
  x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y
  x86/paravirt: Add paravirt_{read,write}_msr()
  x86/msr: Carry on after a non-"safe" MSR access fails
  ...
2016-05-16 15:15:17 -07:00
Christian Borntraeger 3491caf275 KVM: halt_polling: provide a way to qualify wakeups during poll
Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.

For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As  KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.

This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.

This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-13 17:29:23 +02:00
Alex Williamson 14717e2031 kvm: Conditionally register IRQ bypass consumer
If we don't support a mechanism for bypassing IRQs, don't register as
a consumer.  This eliminates meaningless dev_info()s when the connect
fails between producer and consumer, such as on AMD systems where
kvm_x86_ops->update_pi_irte is not implemented

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-11 22:37:55 +02:00
Linus Torvalds d32917eedc Two small x86 patches, improving "make kvmconfig" and fixing
an objtool warning for CONFIG_PROFILE_ALL_BRANCHES.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXM0h+AAoJEL/70l94x66DhjwH/2l24ouNC32DJGoY511yWogF
 QrBrmQnnT1r+v9IDCC5ITCCkJp+EPF0uFwLI+T6URzRdSdNUN/EqYymUyhsICWwb
 9xIsR9VraeB4OhH5IMps1Pkwyx6FMjeVqvVx8dib+NUQpV2o5KxkkFsE9oIX7mtA
 PgwRaLsV8MFZH06g/qvwYPs5NNUecQU8AFCAps0PzWSXG6KHmzsjJZjsXbDhb3n4
 oOGJTLXlRS+SSOWFjngYJNIQ7Z8V1WHM6qftZXtcDWIFbq8CuWn9dZR2WhpUu+nK
 J/I5WC99stxgljFOjI/zdPKQbRy0eBQyDAPHAblDo83vSKsDRgYGX3608gGz3xY=
 =s75+
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "Two small x86 patches, improving "make kvmconfig" and fixing an
  objtool warning for CONFIG_PROFILE_ALL_BRANCHES"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  kvmconfig: add more virtio drivers
  x86/kvm: Add stack frame dependency to fastop() inline asm
2016-05-11 10:11:44 -07:00
Josh Poimboeuf 4548f63e65 x86/kvm: Add stack frame dependency to fastop() inline asm
The kbuild test robot reported this objtool warning [1]:

  arch/x86/kvm/emulate.o: warning: objtool: fastop()+0x69: call without frame pointer save/setup

The issue seems to be caused by CONFIG_PROFILE_ALL_BRANCHES.  With that
option, for some reason gcc decides not to create a stack frame in
fastop() before doing the inline asm call, which can result in a bad
stack trace.

Force a stack frame to be created if CONFIG_FRAME_POINTER is enabled by
listing the stack pointer as an output operand for the inline asm
statement.

This change has no effect for !CONFIG_PROFILE_ALL_BRANCHES.

[1] https://lists.01.org/pipermail/kbuild-all/2016-March/018249.html

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-05-10 18:16:50 +02:00
Andrea Arcangeli 127393fbe5 mm: thp: kvm: fix memory corruption in KVM with THP enabled
After the THP refcounting change, obtaining a compound pages from
get_user_pages() no longer allows us to assume the entire compound page
is immediately mappable from a secondary MMU.

A secondary MMU doesn't want to call get_user_pages() more than once for
each compound page, in order to know if it can map the whole compound
page.  So a secondary MMU needs to know from a single get_user_pages()
invocation when it can map immediately the entire compound page to avoid
a flood of unnecessary secondary MMU faults and spurious
atomic_inc()/atomic_dec() (pages don't have to be pinned by MMU notifier
users).

Ideally instead of the page->_mapcount < 1 check, get_user_pages()
should return the granularity of the "page" mapping in the "mm" passed
to get_user_pages().  However it's non trivial change to pass the "pmd"
status belonging to the "mm" walked by get_user_pages up the stack (up
to the caller of get_user_pages).  So the fix just checks if there is
not a single pte mapping on the page returned by get_user_pages, and in
turn if the caller can assume that the whole compound page is mapped in
the current "mm" (in a pmd_trans_huge()).  In such case the entire
compound page is safe to map into the secondary MMU without additional
get_user_pages() calls on the surrounding tail/head pages.  In addition
of being faster, not having to run other get_user_pages() calls also
reduces the memory footprint of the secondary MMU fault in case the pmd
split happened as result of memory pressure.

Without this fix after a MADV_DONTNEED (like invoked by QEMU during
postcopy live migration or balloning) or after generic swapping (with a
failure in split_huge_page() that would only result in pmd splitting and
not a physical page split), KVM would map the whole compound page into
the shadow pagetables, despite regular faults or userfaults (like
UFFDIO_COPY) may map regular pages into the primary MMU as result of the
pte faults, leading to the guest mode and userland mode going out of
sync and not working on the same memory at all times.

Any other secondary MMU notifier manager (KVM is just one of the many
MMU notifier users) will need the same information if it doesn't want to
run a flood of get_user_pages_fast and it can support multiple
granularity in the secondary MMU mappings, so I think it is justified to
be exposed not just to KVM.

The other option would be to move transparent_hugepage_adjust to
mm/huge_memory.c but that currently has all kind of KVM data structures
in it, so it's definitely not a cut-and-paste work, so I couldn't do a
fix as cleaner as this one for 4.6.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Li, Liang Z" <liang.z.li@intel.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-05 17:38:53 -07:00
Ingo Molnar 1fb48f8e54 Linux 4.6-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXJoi6AAoJEHm+PkMAQRiGYKIIAIcocIV48DpGAHXFuSZbzw5D
 rp9EbE5TormtddPz1J1zcqu9tl5H8tfxS+LvHqRaDXqQkbb0BWKttmEpKTm9mrH8
 kfGNW8uwrEgTMMsar54BypAMMhHz4ITsj3VQX5QLSC5j6wixMcOmQ+IqH0Bwt3wr
 Y5JXDtZRysI1GoMkSU7/qsQBjC7aaBa5VzVUiGvhV8DdvPVFQf73P89G1vzMKqb5
 HRWbH4ieu6/mclLvW9N2QKGMHQntlB+9m2kG9nVWWbBSDxpAotwqQZFh3D52MBUy
 6DH/PNgkVyDhX4vfjua0NrmXdwTfKxLWGxe4dZ8Z+JZP5c6pqWlClIPBCkjHj50=
 =KLSM
 -----END PGP SIGNATURE-----

Merge tag 'v4.6-rc6' into x86/asm, to refresh the tree

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-05 08:35:00 +02:00
Wanpeng Li 35f3fae178 kvm: robustify steal time record
Guest should only trust data to be valid when version haven't changed
before and after reads of steal time. Besides not changing, it has to
be an even number. Hypervisor may write an odd number to version field
to indicate that an update is in progress.

kvm_steal_clock() in guest has already done the read side, make write
side in hypervisor more robust by following the above rule.

Reviewed-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-05-03 17:19:05 +02:00
Bruce Rogers f24632475d KVM: x86: fix ordering of cr0 initialization code in vmx_cpu_reset
Commit d28bc9dd25 reversed the order of two lines which initialize cr0,
allowing the current (old) cr0 value to mess up vcpu initialization.
This was observed in the checks for cr0 X86_CR0_WP bit in the context of
kvm_mmu_reset_context(). Besides, setting vcpu->arch.cr0 after vmx_set_cr0()
is completely redundant. Change the order back to ensure proper vcpu
initialization.

The combination of booting with ovmf firmware when guest vcpus > 1 and kvm's
ept=N option being set results in a VM-entry failure. This patch fixes that.

Fixes: d28bc9dd25 ("KVM: x86: INIT and reset sequences are different")
Cc: stable@vger.kernel.org
Signed-off-by: Bruce Rogers <brogers@suse.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-04-29 14:30:41 +02:00
Andy Lutomirski 296f781a4b x86/asm/64: Rename thread_struct's fs and gs to fsbase and gsbase
Unlike ds and es, these are base addresses, not selectors.  Rename
them so their meaning is more obvious.

On x86_32, the field is still called fs.  Fixing that could make sense
as a future cleanup.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/69a18a51c4cba0ce29a241e570fc618ad721d908.1461698311.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-29 11:56:42 +02:00
Alexander Shishkin 1c5ac21a0e perf/x86/intel/pt: Don't die on VMXON
Some versions of Intel PT do not support tracing across VMXON, more
specifically, VMXON will clear TraceEn control bit and any attempt to
set it before VMXOFF will throw a #GP, which in the current state of
things will crash the kernel. Namely:

  $ perf record -e intel_pt// kvm -nographic

on such a machine will kill it.

To avoid this, notify the intel_pt driver before VMXON and after
VMXOFF so that it knows when not to enable itself.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/87oa9dwrfk.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-28 10:32:42 +02:00
Paolo Bonzini 2e4682ba2e KVM: add missing memory barrier in kvm_{make,check}_request
kvm_make_request and kvm_check_request imply a producer-consumer
relationship; add implicit memory barriers to them.  There was indeed
already a place that was adding an explicit smp_mb() to order between
kvm_check_request and the processing of the request.  That memory
barrier can be removed (as an added benefit, kvm_check_request can use
smp_mb__after_atomic which is free on x86).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-20 15:29:17 +02:00
Xiao Guangrong 46971a2f59 KVM: MMU: skip obsolete sp in for_each_gfn_*()
The obsolete sp should not be used on current vCPUs and should not hurt
vCPU's running, so skip it from for_each_gfn_sp() and
for_each_gfn_indirect_valid_sp()

The side effort is we will double check role.invalid in kvm_mmu_get_page()
but i think it is okay as role is well cached

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-20 15:29:17 +02:00
Liang Chen c54cdf141c KVM: x86: optimize steal time calculation
Since accumulate_steal_time is now only called in record_steal_time, it
doesn't quite make sense to put the delta calculation in a separate
function. The function could be called thousands of times before guest
enables the steal time MSR (though the compiler may optimize out this
function call). And after it's enabled, the MSR enable bit is tested twice
every time. Removing the accumulate_steal_time function also avoids the
necessity of having the accum_steal field.

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-20 15:29:17 +02:00
Ingo Molnar 6666ea558b Linux 4.6-rc4
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXFELfAAoJEHm+PkMAQRiGRYIH+wWsUva7TR9arN1ZrURvI17b
 KqyQH8Ov9zJBsIaq/rFXOr5KfNgx7BU9BL9h7QkBy693HXTWf+GTZ1czHM4N12C3
 0ZdHGrLwTHo2zdisiQaFORZSfhSVTUNGXGHXw13bUMgEqatPgkozXEnsvXXNdt1Z
 HtlcuJn3pcj+QIY7qDXZgTLTwgn248hi1AgNag+ntFcWiz21IYaMIi7/mCY9QUIi
 AY+Y3hqFQM7/8cVyThGS5wZPTg1YzdhsLJpoCk0TbS8FvMEnA+ylcTgc15C78bwu
 AxOwM3OCmH4gMsd7Dd/O+i9lE3K6PFrgzdDisYL3P7eHap+EdiLDvVzPDPPx0xg=
 =Q7r3
 -----END PGP SIGNATURE-----

Merge tag 'v4.6-rc4' into x86/asm, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-19 10:38:52 +02:00
Masanari Iida c19ca6cb4c treewide: Fix typos in printk
This patch fix spelling typos found in printk
within various part of the kernel sources.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2016-04-18 11:23:24 +02:00
Borislav Petkov 782511b00f x86/cpufeature: Replace cpu_has_xsaves with boot_cpu_has() usage
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <kvm@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1459801503-15600-11-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13 11:37:42 +02:00
Borislav Petkov d366bf7eb9 x86/cpufeature: Replace cpu_has_xsave with boot_cpu_has() usage
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1459801503-15600-10-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13 11:37:41 +02:00
Ingo Molnar cb44d0cfc2 Merge branch 'x86/cpu' into x86/asm, to merge more patches
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13 11:15:39 +02:00
Andy Lutomirski 1ed95e52d9 x86/vdso: Remove direct HPET access through the vDSO
Allowing user code to map the HPET is problematic.  HPET
implementations are notoriously buggy, and there are probably many
machines on which even MMIO reads from bogus HPET addresses are
problematic.

We have a report that the Dell Precision M2800 with:

  ACPI: HPET 0x00000000C8FE6238 000038 (v01 DELL   CBX3  01072009 AMI. 00000005)

is either so slow when accessing the HPET or actually hangs in some
regard, causing soft lockups to be reported if users do unexpected
things to the HPET.

The vclock HPET code has also always been a questionable speedup.
Accessing an HPET is exceedingly slow (on the order of several
microseconds), so the added overhead in requiring a syscall to read
the HPET is a small fraction of the total code of accessing it.

To avoid future problems, let's just delete the code entirely.

In the long run, this could actually be a speedup.  Waiman Long as a
patch to optimize the case where multiple CPUs contend for the HPET,
but that won't help unless all the accesses are mediated by the
kernel.

Reported-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Waiman Long <Waiman.Long@hpe.com>
Cc: Waiman Long <waiman.long@hpe.com>
Link: http://lkml.kernel.org/r/d2f90bba98db9905041cff294646d290d378f67a.1460074438.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-04-13 10:28:34 +02:00
Paolo Bonzini 316314cae1 KVM: x86: mask CPUID(0xD,0x1).EAX against host value
This ensures that the guest doesn't see XSAVE extensions
(e.g. xgetbv1 or xsavec) that the host lacks.

Cc: stable@vger.kernel.org
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-10 21:53:50 +02:00
David Matlack fc5b7f3bf1 kvm: x86: do not leak guest xcr0 into host interrupt handlers
An interrupt handler that uses the fpu can kill a KVM VM, if it runs
under the following conditions:
 - the guest's xcr0 register is loaded on the cpu
 - the guest's fpu context is not loaded
 - the host is using eagerfpu

Note that the guest's xcr0 register and fpu context are not loaded as
part of the atomic world switch into "guest mode". They are loaded by
KVM while the cpu is still in "host mode".

Usage of the fpu in interrupt context is gated by irq_fpu_usable(). The
interrupt handler will look something like this:

if (irq_fpu_usable()) {
        kernel_fpu_begin();

        [... code that uses the fpu ...]

        kernel_fpu_end();
}

As long as the guest's fpu is not loaded and the host is using eager
fpu, irq_fpu_usable() returns true (interrupted_kernel_fpu_idle()
returns true). The interrupt handler proceeds to use the fpu with
the guest's xcr0 live.

kernel_fpu_begin() saves the current fpu context. If this uses
XSAVE[OPT], it may leave the xsave area in an undesirable state.
According to the SDM, during XSAVE bit i of XSTATE_BV is not modified
if bit i is 0 in xcr0. So it's possible that XSTATE_BV[i] == 1 and
xcr0[i] == 0 following an XSAVE.

kernel_fpu_end() restores the fpu context. Now if any bit i in
XSTATE_BV == 1 while xcr0[i] == 0, XRSTOR generates a #GP. The
fault is trapped and SIGSEGV is delivered to the current process.

Only pre-4.2 kernels appear to be vulnerable to this sequence of
events. Commit 653f52c ("kvm,x86: load guest FPU context more eagerly")
from 4.2 forces the guest's fpu to always be loaded on eagerfpu hosts.

This patch fixes the bug by keeping the host's xcr0 loaded outside
of the interrupts-disabled region where KVM switches into guest mode.

Cc: stable@vger.kernel.org
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: David Matlack <dmatlack@google.com>
[Move load after goto cancel_injection. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-10 21:53:49 +02:00
Xiao Guangrong 7a98205dee KVM: MMU: fix permission_fault()
kvm-unit-tests complained about the PFEC is not set properly, e.g,:
test pte.rw pte.d pte.nx pde.p pde.rw pde.pse user fetch: FAIL: error code 15
expected 5
Dump mapping: address: 0x123400000000
------L4: 3e95007
------L3: 3e96007
------L2: 2000083

It's caused by the reason that PFEC returned to guest is copied from the
PFEC triggered by shadow page table

This patch fixes it and makes the logic of updating errcode more clean

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Do not assume pfec.p=1. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-10 21:53:49 +02:00
Luiz Capitulino 61abdbe0bc kvm: x86: make lapic hrtimer pinned
When a vCPU runs on a nohz_full core, the hrtimer used by
the lapic emulation code can be migrated to another core.
When this happens, it's possible to observe milisecond
latency when delivering timer IRQs to KVM guests.

The huge latency is mainly due to the fact that
apic_timer_fn() expects to run during a kvm exit. It
sets KVM_REQ_PENDING_TIMER and let it be handled on kvm
entry. However, if the timer fires on a different core,
we have to wait until the next kvm exit for the guest
to see KVM_REQ_PENDING_TIMER set.

This problem became visible after commit 9642d18ee. This
commit changed the timer migration code to always attempt
to migrate timers away from nohz_full cores. While it's
discussable if this is correct/desirable (I don't think
it is), it's clear that the lapic emulation code has
a requirement on firing the hrtimer in the same core
where it was started. This is achieved by making the
hrtimer pinned.

Lastly, note that KVM has code to migrate timers when a
vCPU is scheduled to run in different core. However, this
forced migration may fail. When this happens, we can have
the same problem. If we want 100% correctness, we'll have
to modify apic_timer_fn() to cause a kvm exit when it runs
on a different core than the vCPU. Not sure if this is
possible.

Here's a reproducer for the issue being fixed:

 1. Set all cores but core0 to be nohz_full cores
 2. Start a guest with a single vCPU
 3. Trace apic_timer_fn() and kvm_inject_apic_timer_irqs()

You'll see that apic_timer_fn() will run in core0 while
kvm_inject_apic_timer_irqs() runs in a different core. If
you get both on core0, try running a program that takes 100%
of the CPU and pin it to core0 to force the vCPU out.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-05 14:19:08 +02:00
Yu Zhao 14f4760562 kvm: set page dirty only if page has been writable
In absence of shadow dirty mask, there is no need to set page dirty
if page has never been writable. This is a tiny optimization but
good to have for people who care much about dirty page tracking.

Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-01 12:10:10 +02:00
Paolo Bonzini a2b5c3c0c8 KVM: Hyper-V: do not do hypercall userspace exits if SynIC is disabled
If SynIC is disabled, there is nothing that userspace can do to
handle these exits; on the other hand, userspace probably will
not know about KVM_EXIT_HYPERV_HCALL and complain about it or
even exit.  Just prevent anything bad from happening by handling
the hypercall in KVM and returning an "invalid hypercall" code.

Fixes: 83326e43f2
Cc: Andrey Smetanin <irqlevel@gmail.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-01 12:10:09 +02:00
Yuki Shibuya 321c5658c5 KVM: x86: Inject pending interrupt even if pending nmi exist
Non maskable interrupts (NMI) are preferred to interrupts in current
implementation. If a NMI is pending and NMI is blocked by the result
of nmi_allowed(), pending interrupt is not injected and
enable_irq_window() is not executed, even if interrupts injection is
allowed.

In old kernel (e.g. 2.6.32), schedule() is often called in NMI context.
In this case, interrupts are needed to execute iret that intends end
of NMI. The flag of blocking new NMI is not cleared until the guest
execute the iret, and interrupts are blocked by pending NMI. Due to
this, iret can't be invoked in the guest, and the guest is starved
until block is cleared by some events (e.g. canceling injection).

This patch injects pending interrupts, when it's allowed, even if NMI
is blocked. And, If an interrupts is pending after executing
inject_pending_event(), enable_irq_window() is executed regardless of
NMI pending counter.

Cc: stable@vger.kernel.org
Signed-off-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-04-01 12:10:09 +02:00
Borislav Petkov b8291adc19 x86/cpufeature: Remove cpu_has_gbpages
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1459266123-21878-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-31 13:35:08 +02:00
Linus Torvalds d88f48e128 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Misc fixes:

   - fix hotplug bugs
   - fix irq live lock
   - fix various topology handling bugs
   - fix APIC ACK ordering
   - fix PV iopl handling
   - fix speling
   - fix/tweak memcpy_mcsafe() return value
   - fix fbcon bug
   - remove stray prototypes"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/msr: Remove unused native_read_tscp()
  x86/apic: Remove declaration of unused hw_nmi_is_cpu_stuck
  x86/oprofile/nmi: Add missing hotplug FROZEN handling
  x86/hpet: Use proper mask to modify hotplug action
  x86/apic/uv: Fix the hotplug notifier
  x86/apb/timer: Use proper mask to modify hotplug action
  x86/topology: Use total_cpus not nr_cpu_ids for logical packages
  x86/topology: Fix Intel HT disable
  x86/topology: Fix logical package mapping
  x86/irq: Cure live lock in fixup_irqs()
  x86/tsc: Prevent NULL pointer deref in calibrate_delay_is_known()
  x86/apic: Fix suspicious RCU usage in smp_trace_call_function_interrupt()
  x86/iopl: Fix iopl capability check on Xen PV
  x86/iopl/64: Properly context-switch IOPL on Xen PV
  selftests/x86: Add an iopl test
  x86/mm, x86/mce: Fix return type/value for memcpy_mcsafe()
  x86/video: Don't assume all FB devices are PCI devices
  arch/x86/irq: Purge useless handler declarations from hw_irq.h
  x86: Fix misspellings in comments
2016-03-24 09:47:32 -07:00
Paolo Bonzini a6adb10622 KVM: page_track: fix access to NULL slot
This happens when doing the reboot test from virt-tests:

[  131.833653] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  131.842461] IP: [<ffffffffa0950087>] kvm_page_track_is_active+0x17/0x60 [kvm]
[  131.850500] PGD 0
[  131.852763] Oops: 0000 [#1] SMP
[  132.007188] task: ffff880075fbc500 ti: ffff880850a3c000 task.ti: ffff880850a3c000
[  132.138891] Call Trace:
[  132.141639]  [<ffffffffa092bd11>] page_fault_handle_page_track+0x31/0x40 [kvm]
[  132.149732]  [<ffffffffa093380f>] paging64_page_fault+0xff/0x910 [kvm]
[  132.172159]  [<ffffffffa092c734>] kvm_mmu_page_fault+0x64/0x110 [kvm]
[  132.179372]  [<ffffffffa06743c2>] handle_exception+0x1b2/0x430 [kvm_intel]
[  132.187072]  [<ffffffffa067a301>] vmx_handle_exit+0x1e1/0xc50 [kvm_intel]
...

Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: 3d0c27ad6e
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 17:27:28 +01:00
Lan Tianyu 0f127d12e4 KVM/x86: update the comment of memory barrier in the vcpu_enter_guest()
The barrier also orders the write to mode from any reads
to the page tables done and so update the comment.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:35 +01:00
Lan Tianyu 7bfdf21778 KVM/x86: Call smp_wmb() before increasing tlbs_dirty
Update spte before increasing tlbs_dirty to make sure no tlb flush
in lost after spte is zapped. This pairs with the barrier in the
kvm_flush_remote_tlbs().

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:32 +01:00
Lan Tianyu 36ca7e0a57 KVM/x86: Replace smp_mb() with smp_store_mb/release() in the walk_shadow_page_lockless_begin/end()
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:29 +01:00
Lan Tianyu 9753f52915 KVM: Remove redundant smp_mb() in the kvm_mmu_commit_zap_page()
There is already a barrier inside of kvm_flush_remote_tlbs() which can
help to make sure everyone sees our modifications to the page tables and
see changes to vcpu->mode here. So remove the smp_mb in the
kvm_mmu_commit_zap_page() and update the comment.

Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:27 +01:00
Huaitong Han b9baba8614 KVM, pkeys: expose CPUID/CR4 to guest
X86_FEATURE_PKU is referred to as "PKU" in the hardware documentation:
CPUID.7.0.ECX[3]:PKU. X86_FEATURE_OSPKE is software support for pkeys,
enumerated with CPUID.7.0.ECX[4]:OSPKE, and it reflects the setting of
CR4.PKE(bit 22).

This patch disables CPUID:PKU without ept, because pkeys is not yet
implemented for shadow paging.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:38:17 +01:00
Huaitong Han be94f6b710 KVM, pkeys: add pkeys support for permission_fault
Protection keys define a new 4-bit protection key field (PKEY) in bits
62:59 of leaf entries of the page tables, the PKEY is an index to PKRU
register(16 domains), every domain has 2 bits(write disable bit, access
disable bit).

Static logic has been produced in update_pkru_bitmask, dynamic logic need
read pkey from page table entries, get pkru value, and deduce the correct
result.

[ Huaitong: Xiao helps to modify many sections. ]

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:23:37 +01:00
Huaitong Han 2d344105f5 KVM, pkeys: introduce pkru_mask to cache conditions
PKEYS defines a new status bit in the PFEC. PFEC.PK (bit 5), if some
conditions is true, the fault is considered as a PKU violation.
pkru_mask indicates if we need to check PKRU.ADi and PKRU.WDi, and
does cache some conditions for permission_fault.

[ Huaitong: Xiao helps to modify many sections. ]

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:06 +01:00
Xiao Guangrong 1be0e61c1f KVM, pkeys: save/restore PKRU when guest/host switches
Currently XSAVE state of host is not restored after VM-exit and PKRU
is managed by XSAVE so the PKRU from guest is still controlling the
memory access even if the CPU is running the code of host. This is
not safe as KVM needs to access the memory of userspace (e,g QEMU) to
do some emulation.

So we save/restore PKRU when guest/host switches.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:06 +01:00
Huaitong Han 17a511f878 KVM, pkeys: add pkeys support for xsave state
This patch adds pkeys support for xsave state.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:05 +01:00
Huaitong Han ddba262891 KVM, pkeys: disable pkeys for guests in non-paging mode
Pkeys is disabled if CPU is in non-paging mode in hardware. However KVM
always uses paging mode to emulate guest non-paging, mode with TDP. To
emulate this behavior, pkeys needs to be manually disabled when guest
switches to non-paging mode.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:04 +01:00
Huaitong Han e0b18ef718 KVM: x86: remove magic number with enum cpuid_leafs
This patch removes magic number with enum cpuid_leafs.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:21:04 +01:00
Paolo Bonzini f13577e8aa KVM: MMU: return page fault error code from permission_fault
This will help in the implementation of PKRU, where the PK bit of the page
fault error code cannot be computed in advance (unlike I/D, R/W and U/S).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 16:20:54 +01:00
Paolo Bonzini ef697a712a KVM: VMX: fix nested vpid for old KVM guests
Old KVM guests invoke single-context invvpid without actually checking
whether it is supported.  This was fixed by commit 518c8ae ("KVM: VMX:
Make sure single type invvpid is supported before issuing invvpid
instruction", 2010-08-01) and the patch after, but pre-2.6.36
kernels lack it including RHEL 6.

Reported-by: jmontleo@redhat.com
Tested-by: jmontleo@redhat.com
Cc: stable@vger.kernel.org
Fixes: 99b83ac893
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 12:02:46 +01:00
Paolo Bonzini f6870ee9e5 KVM: VMX: avoid guest hang on invalid invvpid instruction
A guest executing an invalid invvpid instruction would hang
because the instruction pointer was not updated.

Reported-by: jmontleo@redhat.com
Tested-by: jmontleo@redhat.com
Cc: stable@vger.kernel.org
Fixes: 99b83ac893
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 12:02:42 +01:00
Paolo Bonzini 2849eb4f99 KVM: VMX: avoid guest hang on invalid invept instruction
A guest executing an invalid invept instruction would hang
because the instruction pointer was not updated.

Cc: stable@vger.kernel.org
Fixes: bfd0a56b90
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-22 12:02:38 +01:00
Linus Torvalds 26660a4046 Merge branch 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull 'objtool' stack frame validation from Ingo Molnar:
 "This tree adds a new kernel build-time object file validation feature
  (ONFIG_STACK_VALIDATION=y): kernel stack frame correctness validation.
  It was written by and is maintained by Josh Poimboeuf.

  The motivation: there's a category of hard to find kernel bugs, most
  of them in assembly code (but also occasionally in C code), that
  degrades the quality of kernel stack dumps/backtraces.  These bugs are
  hard to detect at the source code level.  Such bugs result in
  incorrect/incomplete backtraces most of time - but can also in some
  rare cases result in crashes or other undefined behavior.

  The build time correctness checking is done via the new 'objtool'
  user-space utility that was written for this purpose and which is
  hosted in the kernel repository in tools/objtool/.  The tool's (very
  simple) UI and source code design is shaped after Git and perf and
  shares quite a bit of infrastructure with tools/perf (which tooling
  infrastructure sharing effort got merged via perf and is already
  upstream).  Objtool follows the well-known kernel coding style.

  Objtool does not try to check .c or .S files, it instead analyzes the
  resulting .o generated machine code from first principles: it decodes
  the instruction stream and interprets it.  (Right now objtool supports
  the x86-64 architecture.)

  From tools/objtool/Documentation/stack-validation.txt:

   "The kernel CONFIG_STACK_VALIDATION option enables a host tool named
    objtool which runs at compile time.  It has a "check" subcommand
    which analyzes every .o file and ensures the validity of its stack
    metadata.  It enforces a set of rules on asm code and C inline
    assembly code so that stack traces can be reliable.

    Currently it only checks frame pointer usage, but there are plans to
    add CFI validation for C files and CFI generation for asm files.

    For each function, it recursively follows all possible code paths
    and validates the correct frame pointer state at each instruction.

    It also follows code paths involving special sections, like
    .altinstructions, __jump_table, and __ex_table, which can add
    alternative execution paths to a given instruction (or set of
    instructions).  Similarly, it knows how to follow switch statements,
    for which gcc sometimes uses jump tables."

  When this new kernel option is enabled (it's disabled by default), the
  tool, if it finds any suspicious assembly code pattern, outputs
  warnings in compiler warning format:

    warning: objtool: rtlwifi_rate_mapping()+0x2e7: frame pointer state mismatch
    warning: objtool: cik_tiling_mode_table_init()+0x6ce: call without frame pointer save/setup
    warning: objtool:__schedule()+0x3c0: duplicate frame pointer save
    warning: objtool:__schedule()+0x3fd: sibling call from callable instruction with changed frame pointer

  ... so that scripts that pick up compiler warnings will notice them.
  All known warnings triggered by the tool are fixed by the tree, most
  of the commits in fact prepare the kernel to be warning-free.  Most of
  them are bugfixes or cleanups that stand on their own, but there are
  also some annotations of 'special' stack frames for justified cases
  such entries to JIT-ed code (BPF) or really special boot time code.

  There are two other long-term motivations behind this tool as well:

   - To improve the quality and reliability of kernel stack frames, so
     that they can be used for optimized live patching.

   - To create independent infrastructure to check the correctness of
     CFI stack frames at build time.  CFI debuginfo is notoriously
     unreliable and we cannot use it in the kernel as-is without extra
     checking done both on the kernel side and on the build side.

  The quality of kernel stack frames matters to debuggability as well,
  so IMO we can merge this without having to consider the live patching
  or CFI debuginfo angle"

* 'core-objtool-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  objtool: Only print one warning per function
  objtool: Add several performance improvements
  tools: Copy hashtable.h into tools directory
  objtool: Fix false positive warnings for functions with multiple switch statements
  objtool: Rename some variables and functions
  objtool: Remove superflous INIT_LIST_HEAD
  objtool: Add helper macros for traversing instructions
  objtool: Fix false positive warnings related to sibling calls
  objtool: Compile with debugging symbols
  objtool: Detect infinite recursion
  objtool: Prevent infinite recursion in noreturn detection
  objtool: Detect and warn if libelf is missing and don't break the build
  tools: Support relative directory path for 'O='
  objtool: Support CROSS_COMPILE
  x86/asm/decoder: Use explicitly signed chars
  objtool: Enable stack metadata validation on 64-bit x86
  objtool: Add CONFIG_STACK_VALIDATION option
  objtool: Add tool to perform compile-time stack metadata validation
  x86/kprobes: Mark kretprobe_trampoline() stack frame as non-standard
  sched: Always inline context_switch()
  ...
2016-03-20 18:23:21 -07:00
Ingo Molnar 00f5268501 Merge branch 'x86/cleanups' into x86/urgent
Pull in some merge window leftovers.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-03-17 09:44:57 +01:00
Linus Torvalds 10dc374766 One of the largest releases for KVM... Hardly any generic improvement,
but lots of architecture-specific changes.
 
 * ARM:
 - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
 - PMU support for guests
 - 32bit world switch rewritten in C
 - various optimizations to the vgic save/restore code.
 
 * PPC:
 - enabled KVM-VFIO integration ("VFIO device")
 - optimizations to speed up IPIs between vcpus
 - in-kernel handling of IOMMU hypercalls
 - support for dynamic DMA windows (DDW).
 
 * s390:
 - provide the floating point registers via sync regs;
 - separated instruction vs. data accesses
 - dirty log improvements for huge guests
 - bugfixes and documentation improvements.
 
 * x86:
 - Hyper-V VMBus hypercall userspace exit
 - alternative implementation of lowest-priority interrupts using vector
 hashing (for better VT-d posted interrupt support)
 - fixed guest debugging with nested virtualizations
 - improved interrupt tracking in the in-kernel IOAPIC
 - generic infrastructure for tracking writes to guest memory---currently
 its only use is to speedup the legacy shadow paging (pre-EPT) case, but
 in the future it will be used for virtual GPUs as well
 - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJW5r3BAAoJEL/70l94x66D2pMH/jTSWWwdTUJMctrDjPVzKzG0
 yOzHW5vSLFoFlwEOY2VpslnXzn5TUVmCAfrdmFNmQcSw6hGb3K/xA/ZX/KLwWhyb
 oZpr123ycahga+3q/ht/dFUBCCyWeIVMdsLSFwpobEBzPL0pMgc9joLgdUC6UpWX
 tmN0LoCAeS7spC4TTiTTpw3gZ/L+aB0B6CXhOMjldb9q/2CsgaGyoVvKA199nk9o
 Ngu7ImDt7l/x1VJX4/6E/17VHuwqAdUrrnbqerB/2oJ5ixsZsHMGzxQ3sHCmvyJx
 WG5L00ubB1oAJAs9fBg58Y/MdiWX99XqFhdEfxq4foZEiQuCyxygVvq3JwZTxII=
 =OUZZ
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "One of the largest releases for KVM...  Hardly any generic
  changes, but lots of architecture-specific updates.

  ARM:
   - VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
   - PMU support for guests
   - 32bit world switch rewritten in C
   - various optimizations to the vgic save/restore code.

  PPC:
   - enabled KVM-VFIO integration ("VFIO device")
   - optimizations to speed up IPIs between vcpus
   - in-kernel handling of IOMMU hypercalls
   - support for dynamic DMA windows (DDW).

  s390:
   - provide the floating point registers via sync regs;
   - separated instruction vs.  data accesses
   - dirty log improvements for huge guests
   - bugfixes and documentation improvements.

  x86:
   - Hyper-V VMBus hypercall userspace exit
   - alternative implementation of lowest-priority interrupts using
     vector hashing (for better VT-d posted interrupt support)
   - fixed guest debugging with nested virtualizations
   - improved interrupt tracking in the in-kernel IOAPIC
   - generic infrastructure for tracking writes to guest
     memory - currently its only use is to speedup the legacy shadow
     paging (pre-EPT) case, but in the future it will be used for
     virtual GPUs as well
   - much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
  KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
  KVM: x86: disable MPX if host did not enable MPX XSAVE features
  arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
  arm64: KVM: vgic-v3: Reset LRs at boot time
  arm64: KVM: vgic-v3: Do not save an LR known to be empty
  arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
  arm64: KVM: vgic-v3: Avoid accessing ICH registers
  KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
  KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
  KVM: arm/arm64: vgic-v2: Reset LRs at boot time
  KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
  KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
  KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
  KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
  KVM: s390: allocate only one DMA page per VM
  KVM: s390: enable STFLE interpretation only if enabled for the guest
  KVM: s390: wake up when the VCPU cpu timer expires
  KVM: s390: step the VCPU timer while in enabled wait
  KVM: s390: protect VCPU cpu timer with a seqcount
  KVM: s390: step VCPU cpu timer during kvm_run ioctl
  ...
2016-03-16 09:55:35 -07:00
Linus Torvalds d4e796152a Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle are:

   - Make schedstats a runtime tunable (disabled by default) and
     optimize it via static keys.

     As most distributions enable CONFIG_SCHEDSTATS=y due to its
     instrumentation value, this is a nice performance enhancement.
     (Mel Gorman)

   - Implement 'simple waitqueues' (swait): these are just pure
     waitqueues without any of the more complex features of full-blown
     waitqueues (callbacks, wake flags, wake keys, etc.).  Simple
     waitqueues have less memory overhead and are faster.

     Use simple waitqueues in the RCU code (in 4 different places) and
     for handling KVM vCPU wakeups.

     (Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
     Marcelo Tosatti)

   - sched/numa enhancements (Rik van Riel)

   - NOHZ performance enhancements (Rik van Riel)

   - Various sched/deadline enhancements (Steven Rostedt)

   - Various fixes (Peter Zijlstra)

   - ... and a number of other fixes, cleanups and smaller enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  sched/cputime: Fix steal_account_process_tick() to always return jiffies
  sched/deadline: Remove dl_new from struct sched_dl_entity
  Revert "kbuild: Add option to turn incompatible pointer check into error"
  sched/deadline: Remove superfluous call to switched_to_dl()
  sched/debug: Fix preempt_disable_ip recording for preempt_disable()
  sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
  time, acct: Drop irq save & restore from __acct_update_integrals()
  acct, time: Change indentation in __acct_update_integrals()
  sched, time: Remove non-power-of-two divides from __acct_update_integrals()
  sched/rt: Kick RT bandwidth timer immediately on start up
  sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
  sched/debug: Move sched_domain_sysctl to debug.c
  sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
  sched/rt: Fix PI handling vs. sched_setscheduler()
  sched/core: Remove duplicated sched_group_set_shares() prototype
  sched/fair: Consolidate nohz CPU load update code
  sched/fair: Avoid using decay_load_missed() with a negative value
  sched/deadline: Always calculate end of period on sched_yield()
  sched/cgroup: Fix cgroup entity load tracking tear-down
  rcu: Use simple wait queues where possible in rcutree
  ...
2016-03-14 19:14:06 -07:00
Paolo Bonzini 5f0b819995 KVM: MMU: fix reserved bit check for ept=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0
KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
CR0.WP=1.  These pages' SPTEs flip continuously between two states:
U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
and U=0/W=1 (supervisor reads and writes allowed, user writes not allowed).

When SMEP is in effect, however, U=0 will enable kernel execution of
this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
When guest EFER has the NX bit cleared, the reserved bit check thinks
that the latter state is invalid; teach it that the smep_andnot_wp case
will also use the NX bit of SPTEs.

Cc: stable@vger.kernel.org
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.inel.com>
Fixes: c258b62b26
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-10 11:26:10 +01:00
Paolo Bonzini 844a5fe219 KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo
Yes, all of these are needed. :) This is admittedly a bit odd, but
kvm-unit-tests access.flat tests this if you run it with "-cpu host"
and of course ept=0.

KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
restarts execution.  This will still cause a user write to fault, while
supervisor writes will succeed.  User reads will fault spuriously now,
and KVM will then flip U and W again in the SPTE (U=1, W=0).  User reads
will be enabled and supervisor writes disabled, going back to the
originary situation where supervisor writes fault spuriously.

When SMEP is in effect, however, U=0 will enable kernel execution of
this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0.  If the guest has not enabled NX, the result is a continuous
stream of page faults due to the NX bit being reserved.

The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
control, so they do not use user-return notifiers for EFER---if they did,
EFER.NX would be forced to the same value as the host).

There is another bug in the reserved bit check, which I've split to a
separate patch for easier application to stable kernels.

Cc: stable@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: f6577a5fa1
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-10 11:26:07 +01:00
Paolo Bonzini 5a5fbdc0e3 KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
It is now equal to use_eager_fpu(), which simply tests a cpufeature bit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-09 14:04:36 +01:00
Paolo Bonzini a87036add0 KVM: x86: disable MPX if host did not enable MPX XSAVE features
When eager FPU is disabled, KVM will still see the MPX bit in CPUID and
presumably the MPX vmentry and vmexit controls.  However, it will not
be able to expose the MPX XSAVE features to the guest, because the guest's
accessible XSAVE features are always a subset of host_xcr0.

In this case, we should disable the MPX CPUID bit, the BNDCFGS MSR,
and the MPX vmentry and vmexit controls for nested virtualization.
It is then unnecessary to enable guest eager FPU if the guest has the
MPX CPUID bit set.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-09 14:04:36 +01:00
Paolo Bonzini ab92f30875 KVM/ARM updates for 4.6
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
 - PMU support for guests
 - 32bit world switch rewritten in C
 - Various optimizations to the vgic save/restore code
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJW36xjAAoJECPQ0LrRPXpDGQkQAMDppzcTOixT3e8VPdHAX09a
 Z5PO0gyTMVV7Jyz5Ul3pedPJA2GSK9mxOCwqvIFbdxLAR6ZB00juO5FrTHkSdI91
 1XLPj4bKoMWcVvhL/g5A4Glp/pVMW1k/9Yq8zZAtYlsLRlqG5rLOutSadcqHcYaJ
 cTD/pFf7b2oPtkTPyoFml75KgHBT/8uvAvFDOWA66Id2z6T11+PsBT/6XnGDiwKg
 tpGTNzx3kPIKIzOAOHqVW6UBxFOeabebXLT8wUz3VwNn/UbG6gkumMNApMAyF2q1
 zU0nAh8+7Ek6Dr4OFWE6BfW6sgg/l7i1lA8XoAmqG7ZTrSptCc59fvaZJxPruG+Q
 dMsU6QgR77JJjbZTinf9a1jReZ/liZrx2gZXedVKdILrjmDSq0UnGcxjUOEDZOGy
 2/dbrlJhv+LhpcJtuPpxPCfoqbW5L0ynzmuYuXRdRz3lTHiOWIRx5gugrhO+wH4D
 4gvZhbw3XCiYfpYHYhl8A1EH5kanKgdXDocz9yIm7mZm89gngufF/HkeXS3ZU25T
 yThyBGulGjqN4FCdgf1HolkTfFjnfSx4qJovJ58eHga+HNLXRkTecZZcbFy2OOHv
 8Bx0PIlwj4RgSaRLWQUudAhdhKS2g22DKDDljxFwhkMPNghvqkYMJCRDKLu6GBXQ
 4YsLKM+TaShHFjSpx+ao
 =rpvb
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-4.6' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM updates for 4.6

- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- Various optimizations to the vgic save/restore code

Conflicts:
	include/uapi/linux/kvm.h
2016-03-09 11:50:42 +01:00
Radim Krčmář 7099e2e1f4 KVM: VMX: disable PEBS before a guest entry
Linux guests on Haswell (and also SandyBridge and Broadwell, at least)
would crash if you decided to run a host command that uses PEBS, like
  perf record -e 'cpu/mem-stores/pp' -a

This happens because KVM is using VMX MSR switching to disable PEBS, but
SDM [2015-12] 18.4.4.4 Re-configuring PEBS Facilities explains why it
isn't safe:
  When software needs to reconfigure PEBS facilities, it should allow a
  quiescent period between stopping the prior event counting and setting
  up a new PEBS event. The quiescent period is to allow any latent
  residual PEBS records to complete its capture at their previously
  specified buffer address (provided by IA32_DS_AREA).

There might not be a quiescent period after the MSR switch, so a CPU
ends up using host's MSR_IA32_DS_AREA to access an area in guest's
memory.  (Or MSR switching is just buggy on some models.)

The guest can learn something about the host this way:
If the guest doesn't map address pointed by MSR_IA32_DS_AREA, it results
in #PF where we leak host's MSR_IA32_DS_AREA through CR2.

After that, a malicious guest can map and configure memory where
MSR_IA32_DS_AREA is pointing and can therefore get an output from
host's tracing.

This is not a critical leak as the host must initiate with PEBS tracing
and I have not been able to get a record from more than one instruction
before vmentry in vmx_vcpu_run() (that place has most registers already
overwritten with guest's).

We could disable PEBS just few instructions before vmentry, but
disabling it earlier shouldn't affect host tracing too much.
We also don't need to switch MSR_IA32_PEBS_ENABLE on VMENTRY, but that
optimization isn't worth its code, IMO.

(If you are implementing PEBS for guests, be sure to handle the case
 where both host and guest enable PEBS, because this patch doesn't.)

Fixes: 26a4f3c08d ("perf/x86: disable PEBS on a guest entry.")
Cc: <stable@vger.kernel.org>
Reported-by: Jiří Olša <jolsa@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:46:46 +01:00
Paolo Bonzini bb9eadf0c3 KVM: MMU: micro-optimize gpte_access
Avoid AND-NOT, most x86 processor lack an instruction for it.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:39 +01:00
Paolo Bonzini 6bb69c9b69 KVM: MMU: simplify last_pte_bitmap
Branch-free code is fun and everybody knows how much Avi loves it,
but last_pte_bitmap takes it a bit to the extreme.  Since the code
is simply doing a range check, like

	(level == 1 ||
	 ((gpte & PT_PAGE_SIZE_MASK) && level < N)

we can make it branch-free without storing the entire truth table;
it is enough to cache N.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:38 +01:00
Paolo Bonzini 50c9e6f3a6 KVM: MMU: coalesce more page zapping in mmu_sync_children
mmu_sync_children can only process up to 16 pages at a time.  Check
if we need to reschedule, and do not bother zapping the pages until
that happens.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:33 +01:00
Paolo Bonzini 2a74003ae8 KVM: MMU: move zap/flush to kvm_mmu_get_page
kvm_mmu_get_page is the only caller of kvm_sync_page_transient
and kvm_sync_pages.  Moving the handling of the invalid_list there
removes the need for the underdocumented kvm_sync_page_transient
function.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:30 +01:00
Paolo Bonzini 1f50f1b3a4 KVM: MMU: invert return value of mmu.sync_page and *kvm_sync_page*
Return true if the page was synced (and the TLB must be flushed)
and false if the page was zapped.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:26 +01:00
Paolo Bonzini 9a43c5d9c3 KVM: MMU: cleanup __kvm_sync_page and its callers
Calling kvm_unlink_unsync_page in the middle of __kvm_sync_page makes
things unnecessarily tricky.  If kvm_mmu_prepare_zap_page is called,
it will call kvm_unlink_unsync_page too.  So kvm_unlink_unsync_page can
be called just as well at the beginning or the end of __kvm_sync_page...
which means that we might do it in kvm_sync_page too and remove the
parameter.

kvm_sync_page ends up being the same code that kvm_sync_pages used
to have before the previous patch.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:23 +01:00
Paolo Bonzini df748f864a KVM: MMU: use kvm_sync_page in kvm_sync_pages
If the last argument is true, kvm_unlink_unsync_page is called anyway in
__kvm_sync_page (either by kvm_mmu_prepare_zap_page or by __kvm_sync_page
itself).  Therefore, kvm_sync_pages can just call kvm_sync_page, instead
of going through kvm_unlink_unsync_page+__kvm_sync_page.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:20 +01:00
Paolo Bonzini 35a70510ee KVM: MMU: move TLB flush out of __kvm_sync_page
By doing this, kvm_sync_pages can use __kvm_sync_page instead of
reinventing it.  Because of kvm_mmu_flush_or_zap, the code does not
end up being more complex than before, and more cleanups to kvm_sync_pages
will come in the next patches.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:17 +01:00
Paolo Bonzini b8c67b7a08 KVM: MMU: introduce kvm_mmu_flush_or_zap
This is a generalization of mmu_pte_write_flush_tlb, that also
takes care of calling kvm_mmu_commit_zap_page.  The next
patches will introduce more uses.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-08 12:33:00 +01:00
Paolo Bonzini 0e4d44151a KVM: i8254: drop local copy of mul_u64_u32_div
A function that does the same as i8254.c's muldiv64 has been added
(for KVM's own use, in fact!) in include/linux/math64.h.  Use it
instead of muldiv64.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 22:39:17 +01:00
Xiao Guangrong e23d3fef83 KVM: MMU: check kvm_mmu_pages and mmu_page_path indices
Give a special invalid index to the root of the walk, so that we
can check the consistency of kvm_mmu_pages and mmu_page_path.

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Extracted from a bigger patch proposed by Guangrong. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 12:35:24 +01:00
Paolo Bonzini 0a47cd8583 KVM: MMU: Fix ubsan warnings
kvm_mmu_pages_init is doing some really yucky stuff.  It is setting
up a sentinel for mmu_page_clear_parents; however, because of a) the
way levels are numbered starting from 1 and b) the way mmu_page_path
sizes its arrays with PT64_ROOT_LEVEL-1 elements, the access can be
out of bounds.  This is harmless because the code overwrites up to the
first two elements of parents->idx and these are initialized, and
because the sentinel is not needed in this case---mmu_page_clear_parents
exits anyway when it gets to the end of the array.  However ubsan
complains, and everyone else should too.

This fix does three things.  First it makes the mmu_page_path arrays
PT64_ROOT_LEVEL elements in size, so that we can write to them without
checking the level in advance.  Second it disintegrates kvm_mmu_pages_init
between mmu_unsync_walk (to reset the struct kvm_mmu_pages) and
for_each_sp (to place the NULL sentinel at the end of the current path).
This is okay because the mmu_page_path is only used in
mmu_pages_clear_parents; mmu_pages_clear_parents itself is called within
a for_each_sp iterator, and hence always after a call to mmu_pages_next.
Third it changes mmu_pages_clear_parents to just use the sentinel to
stop iteration, without checking the bounds on level.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Mike Krinkin <krinkin.m.u@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 12:35:23 +01:00
Paolo Bonzini 798e88b31f KVM: MMU: cleanup handle_abnormal_pfn
The goto and temporary variable are unnecessary, just use return
statements.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 12:35:23 +01:00
Paolo Bonzini 8f22372f85 KVM: VMX: use vmcs_clear/set_bits for debug register exits
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 12:35:20 +01:00
Radim Krčmář a0aace5ac0 KVM: i8254: turn kvm_kpit_state.reinject into atomic_t
Document possible races between readers and concurrent update to the
ioctl.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:30:25 +01:00
Radim Krčmář ab4c14763b KVM: i8254: move PIT timer function initialization
We can do it just once.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:30:22 +01:00
Radim Krčmář 34f3941c42 KVM: i8254: don't assume layout of kvm_kpit_state
channels has offset 0 and correct size now, but that can change.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:30:18 +01:00
Radim Krčmář 4a2095df8a KVM: i8254: remove pointless dereference of PIT
PIT is known at that point.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:30:15 +01:00
Radim Krčmář a3e1311549 KVM: i8254: remove pit and kvm from kvm_kpit_state
kvm isn't ever used and pit can be accessed with container_of.
If you *really* need kvm, pit_state_to_pit(ps)->kvm.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:30:12 +01:00
Radim Krčmář 08e5ccf3ae KVM: i8254: refactor kvm_free_pit
Could be easier to read, but git history will become deeper.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:30:07 +01:00
Radim Krčmář 10d2482126 KVM: i8254: refactor kvm_create_pit
Locks are gone, so we don't need to duplicate error paths.
Use goto everywhere.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:30:04 +01:00
Radim Krčmář 71474e2f0f KVM: i8254: remove notifiers from PIT discard policy
Discard policy doesn't rely on information from notifiers, so we don't
need to register notifiers unconditionally.  We kept correct counts in
case userspace switched between policies during runtime, but that can be
avoided by reseting the state.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:30:01 +01:00
Radim Krčmář b39c90b656 KVM: i8254: remove unnecessary uses of PIT state lock
- kvm_create_pit had to lock only because it exposed kvm->arch.vpit very
  early, but initialization doesn't use kvm->arch.vpit since the last
  patch, so we can drop locking.
- kvm_free_pit is only run after there are no users of KVM and therefore
  is the sole actor.
- Locking in kvm_vm_ioctl_reinject doesn't do anything, because reinject
  is only protected at that place.
- kvm_pit_reset isn't used anywhere and its locking can be dropped if we
  hide it.

Removing useless locking allows to see what actually is being protected
by PIT state lock (values accessible from the guest).

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:29:58 +01:00
Radim Krčmář 09edea72b7 KVM: i8254: pass struct kvm_pit instead of kvm in PIT
This patch passes struct kvm_pit into internal PIT functions.
Those functions used to get PIT through kvm->arch.vpit, even though most
of them never used *kvm for other purposes.  Another benefit is that we
don't need to set kvm->arch.vpit during initialization.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:29:55 +01:00
Radim Krčmář b69d920f68 KVM: i8254: tone down WARN_ON pit.state_lock
If the guest could hit this, it would hang the host kernel, bacause of
sheer number of those reports.  Internal callers have to be sensible
anyway, so we now only check for it in an API function.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:29:51 +01:00
Radim Krčmář ddf54503e2 KVM: i8254: use atomic_t instead of pit.inject_lock
The lock was an overkill, the same can be done with atomics.

A mb() was added in kvm_pit_ack_irq, to pair with implicit barrier
between pit_timer_fn and pit_do_work.  The mb() prevents a race that
could happen if pending == 0 and irq_ack == 0:

  kvm_pit_ack_irq:                | pit_timer_fn:
   p = atomic_read(&ps->pending); |
                                  |  atomic_inc(&ps->pending);
                                  |  queue_work(pit_do_work);
                                  | pit_do_work:
                                  |  atomic_xchg(&ps->irq_ack, 0);
                                  |  return;
   atomic_set(&ps->irq_ack, 1);   |
   if (p == 0) return;            |

where the interrupt would not be delivered in this tick of pit_timer_fn.
PIT would have eventually delivered the interrupt, but we sacrifice
perofmance to make sure that interrupts are not needlessly delayed.

sfence isn't enough: atomic_dec_if_positive does atomic_read first and
x86 can reorder loads before stores.  lfence isn't enough: store can
pass lfence, turning it into a nop.  A compiler barrier would be more
than enough as CPU needs to stall for unbelievably long to use fences.

This patch doesn't do anything in kvm_pit_reset_reinject, because any
order of resets can race, but the result differs by at most one
interrupt, which is ok, because it's the same result as if the reset
happened at a slightly different time.  (Original code didn't protect
the reset path with a proper lock, so users have to be robust.)

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:29:47 +01:00
Radim Krčmář fd700a00dc KVM: i8254: add kvm_pit_reset_reinject
pit_state.pending and pit_state.irq_ack are always reset at the same
time.  Create a function for them.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:29:43 +01:00
Radim Krčmář f6e0a0c113 KVM: i8254: simplify atomics in kvm_pit_ack_irq
We already have a helper that does the same thing.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:29:37 +01:00
Radim Krčmář 7dd0fdff14 KVM: i8254: change PIT discard tick policy
Discard policy uses ack_notifiers to prevent injection of PIT interrupts
before EOI from the last one.

This patch changes the policy to always try to deliver the interrupt,
which makes a difference when its vector is in ISR.
Old implementation would drop the interrupt, but proposed one injects to
IRR, like real hardware would.

The old policy breaks legacy NMI watchdogs, where PIT is used through
virtual wire (LVT0): PIT never sends an interrupt before receiving EOI,
thus a guest deadlock with disabled interrupts will stop NMIs.

Note that NMI doesn't do EOI, so PIT also had to send a normal interrupt
through IOAPIC.  (KVM's PIT is deeply rotten and luckily not used much
in modern systems.)

Even though there is a chance of regressions, I think we can fix the
LVT0 NMI bug without introducing a new tick policy.

Cc: <stable@vger.kernel.org>
Reported-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-04 09:29:31 +01:00
Xiao Guangrong 13d268ca2c KVM: MMU: apply page track notifier
Register the notifier to receive write track event so that we can update
our shadow page table

It makes kvm_mmu_pte_write() be the callback of the notifier, no function
is changed

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:24 +01:00
Xiao Guangrong 5c520e90af KVM: MMU: simplify mmu_need_write_protect
Now, all non-leaf shadow page are page tracked, if gfn is not tracked
there is no non-leaf shadow page of gfn is existed, we can directly
make the shadow page of gfn to unsync

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:23 +01:00
Xiao Guangrong 56ca57f9fe KVM: MMU: use page track for non-leaf shadow pages
non-leaf shadow pages are always write protected, it can be the user
of page track

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:23 +01:00
Xiao Guangrong 0eb05bf290 KVM: page track: add notifier support
Notifier list is introduced so that any node wants to receive the track
event can register to the list

Two APIs are introduced here:
- kvm_page_track_register_notifier(): register the notifier to receive
  track event

- kvm_page_track_unregister_notifier(): stop receiving track event by
  unregister the notifier

The callback, node->track_write() is called when a write access on the
write tracked page happens

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:22 +01:00
Xiao Guangrong e5691a81e8 KVM: MMU: clear write-flooding on the fast path of tracked page
If the page fault is caused by write access on write tracked page, the
real shadow page walking is skipped, we lost the chance to clear write
flooding for the page structure current vcpu is using

Fix it by locklessly waking shadow page table to clear write flooding
on the shadow page structure out of mmu-lock. So that we change the
count to atomic_t

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:22 +01:00
Xiao Guangrong 3d0c27ad6e KVM: MMU: let page fault handler be aware tracked page
The page fault caused by write access on the write tracked page can not
be fixed, it always need to be emulated. page_fault_handle_page_track()
is the fast path we introduce here to skip holding mmu-lock and shadow
page table walking

However, if the page table is not present, it is worth making the page
table entry present and readonly to make the read access happy

mmu_need_write_protect() need to be cooked to avoid page becoming writable
when making page table present or sync/prefetch shadow page table entries

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:21 +01:00
Xiao Guangrong f29d4d7810 KVM: page track: introduce kvm_slot_page_track_{add,remove}_page
These two functions are the user APIs:
- kvm_slot_page_track_add_page(): add the page to the tracking pool
  after that later specified access on that page will be tracked

- kvm_slot_page_track_remove_page(): remove the page from the tracking
  pool, the specified access on the page is not tracked after the last
  user is gone

Both of these are called under the protection both of mmu-lock and
kvm->srcu or kvm->slots_lock

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:21 +01:00
Xiao Guangrong 21ebbedadd KVM: page track: add the framework of guest page tracking
The array, gfn_track[mode][gfn], is introduced in memory slot for every
guest page, this is the tracking count for the gust page on different
modes. If the page is tracked then the count is increased, the page is
not tracked after the count reaches zero

We use 'unsigned short' as the tracking count which should be enough as
shadow page table only can use 2^14 (2^3 for level, 2^1 for cr4_pae, 2^2
for quadrant, 2^3 for access, 2^1 for nxe, 2^1 for cr0_wp, 2^1 for
smep_andnot_wp, 2^1 for smap_andnot_wp, and 2^1 for smm) at most, there
is enough room for other trackers

Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot() are implemented in this patch, they are
internally used to initialize and reclaim the memory of the array

Currently, only write track mode is supported

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:20 +01:00
Xiao Guangrong aeecee2ea6 KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect
Split rmap_write_protect() and introduce the function to abstract the write
protection based on the slot

This function will be used in the later patch

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:20 +01:00
Xiao Guangrong 547ffaed87 KVM: MMU: introduce kvm_mmu_gfn_{allow,disallow}_lpage
Abstract the common operations from account_shadowed() and
unaccount_shadowed(), then introduce kvm_mmu_gfn_disallow_lpage()
and kvm_mmu_gfn_allow_lpage()

These two functions will be used by page tracking in the later patch

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:19 +01:00
Xiao Guangrong 92f94f1e9e KVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed
kvm_lpage_info->write_count is used to detect if the large page mapping
for the gfn on the specified level is allowed, rename it to disallow_lpage
to reflect its purpose, also we rename has_wrprotected_page() to
mmu_gfn_lpage_is_disallowed() to make the code more clearer

Later we will extend this mechanism for page tracking: if the gfn is
tracked then large mapping for that gfn on any level is not allowed.
The new name is more straightforward

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:19 +01:00
Joerg Roedel 4d99ba898d kvm: x86: Check dest_map->vector to match eoi signals for rtc
Using the vector stored at interrupt delivery makes the eoi
matching safe agains irq migration in the ioapic.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:18 +01:00
Joerg Roedel 9daa50076f kvm: x86: Track irq vectors in ioapic->rtc_status.dest_map
This allows backtracking later in case the rtc irq has been
moved to another vcpu/vector.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:18 +01:00
Joerg Roedel 9e4aabe2bb kvm: x86: Convert ioapic->rtc_status.dest_map to a struct
Currently this is a bitmap which tracks which CPUs we expect
an EOI from. Move this bitmap to a struct so that we can
track additional information there.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-03 14:36:17 +01:00
Owen Hofmann 2680d6da45 kvm: x86: Update tsc multiplier on change.
vmx.c writes the TSC_MULTIPLIER field in vmx_vcpu_load, but only when a
vcpu has migrated physical cpus. Record the last value written and
update in vmx_vcpu_load on any change, otherwise a cpu migration must
occur for TSC frequency scaling to take effect.

Cc: stable@vger.kernel.org
Fixes: ff2c3a1803
Signed-off-by: Owen Hofmann <osh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-02 10:37:32 +01:00
Ingo Molnar 6aa447bcbb Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-29 09:42:07 +01:00
Paolo Bonzini 70e4da7a8f KVM: x86: fix root cause for missed hardware breakpoints
Commit 172b2386ed ("KVM: x86: fix missed hardware breakpoints",
2016-02-10) worked around a case where the debug registers are not loaded
correctly on preemption and on the first entry to KVM_RUN.

However, Xiao Guangrong pointed out that the root cause must be that
KVM_DEBUGREG_BP_ENABLED is not being set correctly.  This can indeed
happen due to the lazy debug exit mechanism, which does not call
kvm_update_dr7.  Fix it by replacing the existing loop (more or less
equivalent to kvm_update_dr0123) with calls to all the kvm_update_dr*
functions.

Cc: stable@vger.kernel.org   # 4.1+
Fixes: 172b2386ed
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-26 13:03:39 +01:00
Marcelo Tosatti 8577370fb0 KVM: Use simple waitqueue for vcpu->wq
The problem:

On -rt, an emulated LAPIC timer instances has the following path:

1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled

This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.

The solution:

Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.

Normal waitqueues make use of spinlocks, which on -RT
are sleepable locks. Therefore, waking up a waitqueue
waiter involves locking a sleeping lock, which
is not allowed from hard interrupt context.

cyclictest command line:

This patch reduces the average latency in my tests from 14us to 11us.

Daniel writes:
Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
benchmark on mainline. The test was run 1000 times on
tip/sched/core 4.4.0-rc8-01134-g0905f04:

  ./x86-run x86/tscdeadline_latency.flat -cpu host

with idle=poll.

The test seems not to deliver really stable numbers though most of
them are smaller. Paolo write:

"Anything above ~10000 cycles means that the host went to C1 or
lower---the number means more or less nothing in that case.

The mean shows an improvement indeed."

Before:

               min             max         mean           std
count  1000.000000     1000.000000  1000.000000   1000.000000
mean   5162.596000  2019270.084000  5824.491541  20681.645558
std      75.431231   622607.723969    89.575700   6492.272062
min    4466.000000    23928.000000  5537.926500    585.864966
25%    5163.000000  1613252.750000  5790.132275  16683.745433
50%    5175.000000  2281919.000000  5834.654000  23151.990026
75%    5190.000000  2382865.750000  5861.412950  24148.206168
max    5228.000000  4175158.000000  6254.827300  46481.048691

After
               min            max         mean           std
count  1000.000000     1000.00000  1000.000000   1000.000000
mean   5143.511000  2076886.10300  5813.312474  21207.357565
std      77.668322   610413.09583    86.541500   6331.915127
min    4427.000000    25103.00000  5529.756600    559.187707
25%    5148.000000  1691272.75000  5784.889825  17473.518244
50%    5160.000000  2308328.50000  5832.025000  23464.837068
75%    5172.000000  2393037.75000  5853.177675  24223.969976
max    5222.000000  3922458.00000  6186.720500  42520.379830

[Patch was originaly based on the swait implementation found in the -rt
 tree. Daniel ported it to mainline's version and gathered the
 benchmark numbers for tscdeadline_latency test.]

Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-25 11:27:16 +01:00
Mike Krinkin 17e4bce0ae KVM: x86: MMU: fix ubsan index-out-of-range warning
Ubsan reports the following warning due to a typo in
update_accessed_dirty_bits template, the patch fixes
the typo:

[  168.791851] ================================================================================
[  168.791862] UBSAN: Undefined behaviour in arch/x86/kvm/paging_tmpl.h:252:15
[  168.791866] index 4 is out of range for type 'u64 [4]'
[  168.791871] CPU: 0 PID: 2950 Comm: qemu-system-x86 Tainted: G           O L  4.5.0-rc5-next-20160222 #7
[  168.791873] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 07/09/2013
[  168.791876]  0000000000000000 ffff8801cfcaf208 ffffffff81c9f780 0000000041b58ab3
[  168.791882]  ffffffff82eb2cc1 ffffffff81c9f6b4 ffff8801cfcaf230 ffff8801cfcaf1e0
[  168.791886]  0000000000000004 0000000000000001 0000000000000000 ffffffffa1981600
[  168.791891] Call Trace:
[  168.791899]  [<ffffffff81c9f780>] dump_stack+0xcc/0x12c
[  168.791904]  [<ffffffff81c9f6b4>] ? _atomic_dec_and_lock+0xc4/0xc4
[  168.791910]  [<ffffffff81da9e81>] ubsan_epilogue+0xd/0x8a
[  168.791914]  [<ffffffff81daafa2>] __ubsan_handle_out_of_bounds+0x15c/0x1a3
[  168.791918]  [<ffffffff81daae46>] ? __ubsan_handle_shift_out_of_bounds+0x2bd/0x2bd
[  168.791922]  [<ffffffff811287ef>] ? get_user_pages_fast+0x2bf/0x360
[  168.791954]  [<ffffffffa1794050>] ? kvm_largepages_enabled+0x30/0x30 [kvm]
[  168.791958]  [<ffffffff81128530>] ? __get_user_pages_fast+0x360/0x360
[  168.791987]  [<ffffffffa181b818>] paging64_walk_addr_generic+0x1b28/0x2600 [kvm]
[  168.792014]  [<ffffffffa1819cf0>] ? init_kvm_mmu+0x1100/0x1100 [kvm]
[  168.792019]  [<ffffffff8129e350>] ? debug_check_no_locks_freed+0x350/0x350
[  168.792044]  [<ffffffffa1819cf0>] ? init_kvm_mmu+0x1100/0x1100 [kvm]
[  168.792076]  [<ffffffffa181c36d>] paging64_gva_to_gpa+0x7d/0x110 [kvm]
[  168.792121]  [<ffffffffa181c2f0>] ? paging64_walk_addr_generic+0x2600/0x2600 [kvm]
[  168.792130]  [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792178]  [<ffffffffa17d9a4a>] emulator_read_write_onepage+0x27a/0x1150 [kvm]
[  168.792208]  [<ffffffffa1794d44>] ? __kvm_read_guest_page+0x54/0x70 [kvm]
[  168.792234]  [<ffffffffa17d97d0>] ? kvm_task_switch+0x160/0x160 [kvm]
[  168.792238]  [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792263]  [<ffffffffa17daa07>] emulator_read_write+0xe7/0x6d0 [kvm]
[  168.792290]  [<ffffffffa183b620>] ? em_cr_write+0x230/0x230 [kvm]
[  168.792314]  [<ffffffffa17db005>] emulator_write_emulated+0x15/0x20 [kvm]
[  168.792340]  [<ffffffffa18465f8>] segmented_write+0xf8/0x130 [kvm]
[  168.792367]  [<ffffffffa1846500>] ? em_lgdt+0x20/0x20 [kvm]
[  168.792374]  [<ffffffffa14db512>] ? vmx_read_guest_seg_ar+0x42/0x1e0 [kvm_intel]
[  168.792400]  [<ffffffffa1846d82>] writeback+0x3f2/0x700 [kvm]
[  168.792424]  [<ffffffffa1846990>] ? em_sidt+0xa0/0xa0 [kvm]
[  168.792449]  [<ffffffffa185554d>] ? x86_decode_insn+0x1b3d/0x4f70 [kvm]
[  168.792474]  [<ffffffffa1859032>] x86_emulate_insn+0x572/0x3010 [kvm]
[  168.792499]  [<ffffffffa17e71dd>] x86_emulate_instruction+0x3bd/0x2110 [kvm]
[  168.792524]  [<ffffffffa17e6e20>] ? reexecute_instruction.part.110+0x2e0/0x2e0 [kvm]
[  168.792532]  [<ffffffffa14e9a81>] handle_ept_misconfig+0x61/0x460 [kvm_intel]
[  168.792539]  [<ffffffffa14e9a20>] ? handle_pause+0x450/0x450 [kvm_intel]
[  168.792546]  [<ffffffffa15130ea>] vmx_handle_exit+0xd6a/0x1ad0 [kvm_intel]
[  168.792572]  [<ffffffffa17f6a6c>] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm]
[  168.792597]  [<ffffffffa17f6bcd>] kvm_arch_vcpu_ioctl_run+0xd3d/0x6090 [kvm]
[  168.792621]  [<ffffffffa17f6a6c>] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm]
[  168.792627]  [<ffffffff8293b530>] ? __ww_mutex_lock_interruptible+0x1630/0x1630
[  168.792651]  [<ffffffffa17f5e90>] ? kvm_arch_vcpu_runnable+0x4f0/0x4f0 [kvm]
[  168.792656]  [<ffffffff811eeb30>] ? preempt_notifier_unregister+0x190/0x190
[  168.792681]  [<ffffffffa17e0447>] ? kvm_arch_vcpu_load+0x127/0x650 [kvm]
[  168.792704]  [<ffffffffa178e9a3>] kvm_vcpu_ioctl+0x553/0xda0 [kvm]
[  168.792727]  [<ffffffffa178e450>] ? vcpu_put+0x40/0x40 [kvm]
[  168.792732]  [<ffffffff8129e350>] ? debug_check_no_locks_freed+0x350/0x350
[  168.792735]  [<ffffffff82946087>] ? _raw_spin_unlock+0x27/0x40
[  168.792740]  [<ffffffff8163a943>] ? handle_mm_fault+0x1673/0x2e40
[  168.792744]  [<ffffffff8129daa8>] ? trace_hardirqs_on_caller+0x478/0x6c0
[  168.792747]  [<ffffffff8129dcfd>] ? trace_hardirqs_on+0xd/0x10
[  168.792751]  [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90
[  168.792756]  [<ffffffff81725a80>] do_vfs_ioctl+0x1b0/0x12b0
[  168.792759]  [<ffffffff817258d0>] ? ioctl_preallocate+0x210/0x210
[  168.792763]  [<ffffffff8174aef3>] ? __fget+0x273/0x4a0
[  168.792766]  [<ffffffff8174acd0>] ? __fget+0x50/0x4a0
[  168.792770]  [<ffffffff8174b1f6>] ? __fget_light+0x96/0x2b0
[  168.792773]  [<ffffffff81726bf9>] SyS_ioctl+0x79/0x90
[  168.792777]  [<ffffffff82946880>] entry_SYSCALL_64_fastpath+0x23/0xc1
[  168.792780] ================================================================================

Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-25 09:50:35 +01:00
Paolo Bonzini 0c1d77f4ba KVM: x86: fix conversion of addresses to linear in 32-bit protected mode
Commit e8dd2d2d64 ("Silence compiler warning in arch/x86/kvm/emulate.c",
2015-09-06) broke boot of the Hurd.  The bug is that the "default:"
case actually could modify "la", but after the patch this change is
not reflected in *linear.

The bug is visible whenever a non-zero segment base causes the linear
address to wrap around the 4GB mark.

Fixes: e8dd2d2d64
Cc: stable@vger.kernel.org
Reported-by: Aurelien Jarno <aurelien@aurel32.net>
Tested-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-24 14:47:45 +01:00
Paolo Bonzini 172b2386ed KVM: x86: fix missed hardware breakpoints
Sometimes when setting a breakpoint a process doesn't stop on it.
This is because the debug registers are not loaded correctly on
VCPU load.

The following simple reproducer from Oleg Nesterov tries using debug
registers in two threads.  To see the bug, run a 2-VCPU guest with
"taskset -c 0" and run "./bp 0 1" inside the guest.

    #include <unistd.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <asm/debugreg.h>
    #include <assert.h>

    #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

    unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len)
    {
        unsigned long dr7;

        dr7 = ((len | type) & 0xf)
            << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);
        if (enable)
            dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE));

        return dr7;
    }

    int write_dr(int pid, int dr, unsigned long val)
    {
        return ptrace(PTRACE_POKEUSER, pid,
                offsetof (struct user, u_debugreg[dr]),
                val);
    }

    void set_bp(pid_t pid, void *addr)
    {
        unsigned long dr7;
        assert(write_dr(pid, 0, (long)addr) == 0);
        dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1);
        assert(write_dr(pid, 7, dr7) == 0);
    }

    void *get_rip(int pid)
    {
        return (void*)ptrace(PTRACE_PEEKUSER, pid,
                offsetof(struct user, regs.rip), 0);
    }

    void test(int nr)
    {
        void *bp_addr = &&label + nr, *bp_hit;
        int pid;

        printf("test bp %d\n", nr);
        assert(nr < 16); // see 16 asm nops below

        pid = fork();
        if (!pid) {
            assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
            kill(getpid(), SIGSTOP);
            for (;;) {
                label: asm (
                    "nop; nop; nop; nop;"
                    "nop; nop; nop; nop;"
                    "nop; nop; nop; nop;"
                    "nop; nop; nop; nop;"
                );
            }
        }

        assert(pid == wait(NULL));
        set_bp(pid, bp_addr);

        for (;;) {
            assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
            assert(pid == wait(NULL));

            bp_hit = get_rip(pid);
            if (bp_hit != bp_addr)
                fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n",
                    bp_hit - &&label, nr);
        }
    }

    int main(int argc, const char *argv[])
    {
        while (--argc) {
            int nr = atoi(*++argv);
            if (!fork())
                test(nr);
        }

        while (wait(NULL) > 0)
            ;
        return 0;
    }

Cc: stable@vger.kernel.org
Suggested-by: Nadav Amit <namit@cs.technion.ac.il>
Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-24 14:47:39 +01:00
Adam Buchbinder 6a6256f9e0 x86: Fix misspellings in comments
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-24 08:44:58 +01:00
Chris J Arges 3f62de5f6f x86/kvm: Add output operand in vmx_handle_external_intr inline asm
Stacktool generates the following warning:
  stacktool: arch/x86/kvm/vmx.o: vmx_handle_external_intr()+0x67: call without frame pointer save/setup

By adding the stackpointer as an output operand, this patch ensures that a
stack frame is created when CONFIG_FRAME_POINTER is enabled for the inline
assmebly statement.

Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: gleb@kernel.org
Cc: kvm@vger.kernel.org
Cc: live-patching@vger.kernel.org
Cc: pbonzini@redhat.com
Link: http://lkml.kernel.org/r/1453499078-9330-3-git-send-email-chris.j.arges@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-24 08:35:44 +01:00
Josh Poimboeuf cb7390fed4 x86/kvm: Make test_cc() always inline
With some configs (including allyesconfig), gcc doesn't inline
test_cc().  When that happens, test_cc() doesn't create a stack frame
before inserting the inline asm call instruction.  This breaks frame
pointer convention if CONFIG_FRAME_POINTER is enabled and can result in
a bad stack trace.

Force it to always be inlined so that its containing function's stack
frame can be used.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/20160122161612.GE20502@treble.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-24 08:35:44 +01:00
Josh Poimboeuf 1482a0825b x86/kvm: Set ELF function type for fastop functions
The callable functions created with the FOP* and FASTOP* macros are
missing ELF function annotations, which confuses tools like stacktool.
Properly annotate them.

This adds some additional labels to the assembly, but the generated
binary code is unchanged (with the exception of instructions which have
embedded references to __LINE__).

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chris J Arges <chris.j.arges@canonical.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: Pedro Alves <palves@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: live-patching@vger.kernel.org
Link: http://lkml.kernel.org/r/e399651c89ace54906c203c0557f66ed6ea3ce8d.1453405861.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-24 08:35:44 +01:00
Geliang Tang d74c0e6b54 KVM: x86: use list_last_entry
To make the intention clearer, use list_last_entry instead of
list_entry.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-23 15:40:54 +01:00
Geliang Tang 652fc08dae KVM: x86: use list_for_each_entry*
Use list_for_each_entry*() instead of list_for_each*() to simplify
the code.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-23 15:40:54 +01:00
Takuya Yoshikawa e9ee956e31 KVM: x86: MMU: Move handle_mmio_page_fault() call to kvm_mmu_page_fault()
Rather than placing a handle_mmio_page_fault() call in each
vcpu->arch.mmu.page_fault() handler, moving it up to
kvm_mmu_page_fault() makes the code better:

 - avoids code duplication
 - for kvm_arch_async_page_ready(), which is the other caller of
   vcpu->arch.mmu.page_fault(), removes an extra error_code check
 - avoids returning both RET_MMIO_PF_* values and raw integer values
   from vcpu->arch.mmu.page_fault()

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-23 14:20:27 +01:00
Takuya Yoshikawa ded5874946 KVM: x86: MMU: Consolidate quickly_check_mmio_pf() and is_mmio_page_fault()
These two have only slight differences:
 - whether 'addr' is of type u64 or of type gva_t
 - whether they have 'direct' parameter or not

Concerning the former, quickly_check_mmio_pf()'s u64 is better because
'addr' needs to be able to have both a guest physical address and a
guest virtual address.

The latter is just a stylistic issue as we can always calculate the mode
from the 'vcpu' as is_mmio_page_fault() does.  This patch keeps the
parameter to make the following patch cleaner.

In addition, the patch renames the function to mmio_info_in_cache() to
make it clear what it actually checks for.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-23 14:20:27 +01:00
Paolo Bonzini 3ae13faac4 KVM: x86: pass kvm_get_time_scale arguments in hertz
Prepare for improving the precision in the next patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:45 +01:00
Andrey Smetanin 83326e43f2 kvm/x86: Hyper-V VMBus hypercall userspace exit
The patch implements KVM_EXIT_HYPERV userspace exit
functionality for Hyper-V VMBus hypercalls:
HV_X64_HCALL_POST_MESSAGE, HV_X64_HCALL_SIGNAL_EVENT.

Changes v3:
* use vcpu->arch.complete_userspace_io to setup hypercall
result

Changes v2:
* use KVM_EXIT_HYPERV for hypercalls

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Joerg Roedel <joro@8bytes.org>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:44 +01:00
Andrey Smetanin b2fdc2570a kvm/x86: Reject Hyper-V hypercall continuation
Currently we do not support Hyper-V hypercall continuation
so reject it.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Joerg Roedel <joro@8bytes.org>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:42 +01:00
Andrey Smetanin 0d9c055eaa kvm/x86: Pass return code of kvm_emulate_hypercall
Pass the return code from kvm_emulate_hypercall on to the caller,
in order to allow it to indicate to the userspace that
the hypercall has to be handled there.

Also adjust all the existing code paths to return 1 to make sure the
hypercall isn't passed to the userspace without setting kvm_run
appropriately.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Joerg Roedel <joro@8bytes.org>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:41 +01:00
Andrey Smetanin 8ed6d76781 kvm/x86: Rename Hyper-V long spin wait hypercall
Rename HV_X64_HV_NOTIFY_LONG_SPIN_WAIT by HVCALL_NOTIFY_LONG_SPIN_WAIT,
so the name is more consistent with the other hypercalls.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Joerg Roedel <joro@8bytes.org>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
[Change name, Andrey used HV_X64_HCALL_NOTIFY_LONG_SPIN_WAIT. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:38 +01:00
Paolo Bonzini 4e422bdd2f KVM: x86: fix missed hardware breakpoints
Sometimes when setting a breakpoint a process doesn't stop on it.
This is because the debug registers are not loaded correctly on
VCPU load.

The following simple reproducer from Oleg Nesterov tries using debug
registers in both the host and the guest, for example by running "./bp
0 1" on the host and "./bp 14 15" under QEMU.

    #include <unistd.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <sys/ptrace.h>
    #include <sys/user.h>
    #include <asm/debugreg.h>
    #include <assert.h>

    #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

    unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len)
    {
        unsigned long dr7;

        dr7 = ((len | type) & 0xf)
            << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE);
        if (enable)
            dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE));

        return dr7;
    }

    int write_dr(int pid, int dr, unsigned long val)
    {
        return ptrace(PTRACE_POKEUSER, pid,
                offsetof (struct user, u_debugreg[dr]),
                val);
    }

    void set_bp(pid_t pid, void *addr)
    {
        unsigned long dr7;
        assert(write_dr(pid, 0, (long)addr) == 0);
        dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1);
        assert(write_dr(pid, 7, dr7) == 0);
    }

    void *get_rip(int pid)
    {
        return (void*)ptrace(PTRACE_PEEKUSER, pid,
                offsetof(struct user, regs.rip), 0);
    }

    void test(int nr)
    {
        void *bp_addr = &&label + nr, *bp_hit;
        int pid;

        printf("test bp %d\n", nr);
        assert(nr < 16); // see 16 asm nops below

        pid = fork();
        if (!pid) {
            assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
            kill(getpid(), SIGSTOP);
            for (;;) {
                label: asm (
                    "nop; nop; nop; nop;"
                    "nop; nop; nop; nop;"
                    "nop; nop; nop; nop;"
                    "nop; nop; nop; nop;"
                );
            }
        }

        assert(pid == wait(NULL));
        set_bp(pid, bp_addr);

        for (;;) {
            assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0);
            assert(pid == wait(NULL));

            bp_hit = get_rip(pid);
            if (bp_hit != bp_addr)
                fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n",
                    bp_hit - &&label, nr);
        }
    }

    int main(int argc, const char *argv[])
    {
        while (--argc) {
            int nr = atoi(*++argv);
            if (!fork())
                test(nr);
        }

        while (wait(NULL) > 0)
            ;
        return 0;
    }

Cc: stable@vger.kernel.org
Suggested-by: Nadadv Amit <namit@cs.technion.ac.il>
Reported-by: Andrey Wagin <avagin@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:37 +01:00
Radim Krčmář 4efd805fca KVM: x86: fix *NULL on invalid low-prio irq
Smatch noticed a NULL dereference in kvm_intr_is_single_vcpu_fast that
happens if VM already warned about invalid lowest-priority interrupt.

Create a function for common code while fixing it.

Fixes: 6228a0da80 ("KVM: x86: Add lowest-priority support for vt-d posted-interrupts")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:36 +01:00
Paolo Bonzini 78db6a5037 KVM: x86: rewrite handling of scaled TSC for kvmclock
This is the same as before:

    kvm_scale_tsc(tgt_tsc_khz)
        = tgt_tsc_khz * ratio
        = tgt_tsc_khz * user_tsc_khz / tsc_khz   (see set_tsc_khz)
        = user_tsc_khz                           (see kvm_guest_time_update)
        = vcpu->arch.virtual_tsc_khz             (see kvm_set_tsc_khz)

However, computing it through kvm_scale_tsc will make it possible
to include the NTP correction in tgt_tsc_khz.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:34 +01:00
Paolo Bonzini 4941b8cb37 KVM: x86: rename argument to kvm_set_tsc_khz
This refers to the desired (scaled) frequency, which is called
user_tsc_khz in the rest of the file.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:33 +01:00
Jan Kiszka 6f05485d3a KVM: VMX: Fix guest debugging while in L2
When we take a #DB or #BP vmexit while in guest mode, we first of all
need to check if there is ongoing guest debugging that might be
interested in the event. Currently, we unconditionally leave L2 and
inject the event into L1 if it is intercepting the exceptions. That
breaks things marvelously.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:32 +01:00
Jan Kiszka 5bb16016ce KVM: VMX: Factor out is_exception_n helper
There is quite some common code in all these is_<exception>() helpers.
Factor it out before adding even more of them.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-16 18:48:30 +01:00
Paolo Bonzini bce87cce88 KVM: x86: consolidate different ways to test for in-kernel LAPIC
Different pieces of code checked for vcpu->arch.apic being (non-)NULL,
or used kvm_vcpu_has_lapic (more optimized) or lapic_in_kernel.
Replace everything with lapic_in_kernel's name and kvm_vcpu_has_lapic's
implementation.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 16:57:45 +01:00
Paolo Bonzini 1e3161b414 KVM: x86: consolidate "has lapic" checks into irq.c
Do for kvm_cpu_has_pending_timer and kvm_inject_pending_timer_irqs
what the other irq.c routines have been doing.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 16:57:39 +01:00
Paolo Bonzini f8543d6a97 KVM: APIC: remove unnecessary double checks on APIC existence
Usually the in-kernel APIC's existence is checked in the caller.  Do not
bother checking it again in lapic.c.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 16:57:14 +01:00
Feng Wu b6ce978067 KVM/VMX: Add host irq information in trace event when updating IRTE for posted interrupts
Add host irq information in trace event, so we can better understand
which irq is in posted mode.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 13:24:43 +01:00
Feng Wu 6228a0da80 KVM: x86: Add lowest-priority support for vt-d posted-interrupts
Use vector-hashing to deliver lowest-priority interrupts for
VT-d posted-interrupts. This patch extends kvm_intr_is_single_vcpu()
to support lowest-priority handling.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 13:24:42 +01:00
Feng Wu 520040146a KVM: x86: Use vector-hashing to deliver lowest-priority interrupts
Use vector-hashing to deliver lowest-priority interrupts, As an
example, modern Intel CPUs in server platform use this method to
handle lowest-priority interrupts.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 13:24:40 +01:00
Feng Wu 23a1c2579b KVM: Recover IRTE to remapped mode if the interrupt is not single-destination
When the interrupt is not single destination any more, we need
to change back IRTE to remapped mode explicitly.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 13:24:39 +01:00
Paolo Bonzini b51012deb3 KVM: x86: introduce do_shl32_div32
This is similar to the existing div_frac function, but it returns the
remainder too.  Unlike div_frac, it can be used to implement long
division, e.g. (a << 64) / b for 32-bit a and b.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-09 13:24:37 +01:00
Dan Williams ba049e93ae kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace).  This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o.  It allows userspace to coordinate
DMA/RDMA from/to persistent memory.

The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver.  The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.

The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag.  Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.

Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array.  Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory.  The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.

This patch (of 18):

The core has developed a need for a "pfn_t" type [1].  Move the existing
pfn_t in KVM to kvm_pfn_t [2].

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-15 17:56:32 -08:00
Linus Torvalds 1baa5efbeb * s390: Support for runtime instrumentation within guests,
support of 248 VCPUs.
 
 * ARM: rewrite of the arm64 world switch in C, support for
 16-bit VM identifiers.  Performance counter virtualization
 missed the boat.
 
 * x86: Support for more Hyper-V features (synthetic interrupt
 controller), MMU cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJWlSKwAAoJEL/70l94x66DY0UIAK5vp4zfQoQOJC4KP4Xgxwdu
 kpnK2Boz3/74o1b0y5+eJZoUZCsXCVLtmP5uhmMxUYWDgByFG2X8ZDhPFwB5FYLT
 2dN+Lr4tsolgIfRdHZtrT6Svp9SDL039bWTdscnbR6l37/j9FRWvpKdhI3orloFD
 /i4CSW2dVIq1/9Xctwu/rtcOEesEx4Cad+6YV3/530eVAXFzE908nXfmqJNZTocY
 YCGcmrMVCOu0ng5QM4xSzmmYjKMLUcRs+QzZWkVBzdJtTgwZUr09yj7I2dZ1yj/i
 cxYrJy6shSwE74XkXsmvG+au3C5u3vX4tnXjBFErnPJ99oqzHatVnFWNRhj4dLQ=
 =PIj1
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "PPC changes will come next week.

   - s390: Support for runtime instrumentation within guests, support of
     248 VCPUs.

   - ARM: rewrite of the arm64 world switch in C, support for 16-bit VM
     identifiers.  Performance counter virtualization missed the boat.

   - x86: Support for more Hyper-V features (synthetic interrupt
     controller), MMU cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (115 commits)
  kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
  kvm/x86: Hyper-V SynIC timers tracepoints
  kvm/x86: Hyper-V SynIC tracepoints
  kvm/x86: Update SynIC timers on guest entry only
  kvm/x86: Skip SynIC vector check for QEMU side
  kvm/x86: Hyper-V fix SynIC timer disabling condition
  kvm/x86: Reorg stimer_expiration() to better control timer restart
  kvm/x86: Hyper-V unify stimer_start() and stimer_restart()
  kvm/x86: Drop stimer_stop() function
  kvm/x86: Hyper-V timers fix incorrect logical operation
  KVM: move architecture-dependent requests to arch/
  KVM: renumber vcpu->request bits
  KVM: document which architecture uses each request bit
  KVM: Remove unused KVM_REQ_KICK to save a bit in vcpu->requests
  kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock
  KVM: s390: implement the RI support of guest
  kvm/s390: drop unpaired smp_mb
  kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa
  KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table()
  arm/arm64: KVM: Detect vGIC presence at runtime
  ...
2016-01-12 13:22:12 -08:00
Huaitong Han 45bdbcfdf2 kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
vmx_cpuid_tries to update SECONDARY_VM_EXEC_CONTROL in the VMCS, but
it will cause a vmwrite error on older CPUs because the code does not
check for the presence of CPU_BASED_ACTIVATE_SECONDARY_CONTROLS.

This will get rid of the following trace on e.g. Core2 6600:

vmwrite error: reg 401e value 10 (err 12)
Call Trace:
[<ffffffff8116e2b9>] dump_stack+0x40/0x57
[<ffffffffa020b88d>] vmx_cpuid_update+0x5d/0x150 [kvm_intel]
[<ffffffffa01d8fdc>] kvm_vcpu_ioctl_set_cpuid2+0x4c/0x70 [kvm]
[<ffffffffa01b8363>] kvm_arch_vcpu_ioctl+0x903/0xfa0 [kvm]

Fixes: feda805fe7
Cc: stable@vger.kernel.org
Reported-by: Zdenek Kaspar <zkaspar82@gmail.com>
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-12 11:42:16 +01:00
Linus Torvalds 671d5532aa Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpu updates from Ingo Molnar:
 "The main changes in this cycle were:

   - Improved CPU ID handling code and related enhancements (Borislav
     Petkov)

   - RDRAND fix (Len Brown)"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86: Replace RDRAND forced-reseed with simple sanity check
  x86/MSR: Chop off lower 32-bit value
  x86/cpu: Fix MSR value truncation issue
  x86/cpu/amd, kvm: Satisfy guest kernel reads of IC_CFG MSR
  kvm: Add accessors for guest CPU's family, model, stepping
  x86/cpu: Unify CPU family, model, stepping calculation
2016-01-11 16:46:20 -08:00
Andrey Smetanin ac3e5fcae8 kvm/x86: Hyper-V SynIC timers tracepoints
Trace the following Hyper SynIC timers events:
* periodic timer start
* one-shot timer start
* timer callback
* timer expiration and message delivery result
* timer config setup
* timer count setup
* timer cleanup

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:43 +01:00
Andrey Smetanin 18659a9cb1 kvm/x86: Hyper-V SynIC tracepoints
Trace the following Hyper SynIC events:
* set msr
* set sint irq
* ack sint
* sint irq eoi

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:43 +01:00
Andrey Smetanin f3b138c5d8 kvm/x86: Update SynIC timers on guest entry only
Consolidate updating the Hyper-V SynIC timers in a
single place: on guest entry in processing KVM_REQ_HV_STIMER
request.  This simplifies the overall logic, and makes sure
the most current state of msrs and guest clock is used for
arming the timers (to achieve that, KVM_REQ_HV_STIMER
has to be processed after KVM_REQ_CLOCK_UPDATE).

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:42 +01:00
Andrey Smetanin 7be58a6488 kvm/x86: Skip SynIC vector check for QEMU side
QEMU zero-inits Hyper-V SynIC vectors. We should allow that,
and don't reject zero values if set by the host.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:42 +01:00
Andrey Smetanin 23a3b201fd kvm/x86: Hyper-V fix SynIC timer disabling condition
Hypervisor Function Specification(HFS) doesn't require
to disable SynIC timer at timer config write if timer->count = 0.

So drop this check, this allow to load timers MSR's
during migration restore, because config are set before count
in QEMU side.

Also fix condition according to HFS doc(15.3.1):
"It is not permitted to set the SINTx field to zero for an
enabled timer. If attempted, the timer will be
marked disabled (that is, bit 0 cleared) immediately."

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:41 +01:00
Andrey Smetanin 0cdeabb118 kvm/x86: Reorg stimer_expiration() to better control timer restart
Split stimer_expiration() into two parts - timer expiration message
sending and timer restart/cleanup based on timer state(config).

This also fixes a bug where a one-shot timer message whose delivery
failed once would get lost for good.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:41 +01:00
Andrey Smetanin f808495da5 kvm/x86: Hyper-V unify stimer_start() and stimer_restart()
This will be used in future to start Hyper-V SynIC timer
in several places by one logic in one function.

Changes v2:
* drop stimer->count == 0 check inside stimer_start()
* comment stimer_start() assumptions

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:40 +01:00
Andrey Smetanin 019b9781cc kvm/x86: Drop stimer_stop() function
The function stimer_stop() is called in one place
so remove the function and replace it's call by function
content.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:40 +01:00
Andrey Smetanin 1ac1b65ac1 kvm/x86: Hyper-V timers fix incorrect logical operation
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:39 +01:00
Paolo Bonzini 2860c4b167 KVM: move architecture-dependent requests to arch/
Since the numbers now overlap, it makes sense to enumerate
them in asm/kvm_host.h rather than linux/kvm_host.h.  Functions
that refer to architecture-specific requests are also moved
to arch/.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-08 19:04:36 +01:00
Nicholas Krause 1dab1345d8 kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock
This makes sure the wall clock is updated only after an odd version value
is successfully written to guest memory.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-07 14:51:32 +01:00
Paolo Bonzini e5e57e7a03 kvm: x86: only channel 0 of the i8254 is linked to the HPET
While setting the KVM PIT counters in 'kvm_pit_load_count', if
'hpet_legacy_start' is set, the function disables the timer on
channel[0], instead of the respective index 'channel'. This is
because channels 1-3 are not linked to the HPET.  Fix the caller
to only activate the special HPET processing for channel 0.

Reported-by: P J P <pjp@fedoraproject.org>
Fixes: 0185604c2d
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-07 13:50:38 +01:00
David Matlack 0af2593b2a kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa
The comment had the meaning of mmu.gva_to_gpa and nested_mmu.gva_to_gpa
swapped. Fix that, and also add some details describing how each translation
works.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-01-07 11:03:47 +01:00
Andrew Honig 0185604c2d KVM: x86: Reload pit counters for all channels when restoring state
Currently if userspace restores the pit counters with a count of 0
on channels 1 or 2 and the guest attempts to read the count on those
channels, then KVM will perform a mod of 0 and crash.  This will ensure
that 0 values are converted to 65536 as per the spec.

This is CVE-2015-7513.

Signed-off-by: Andy Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-22 15:36:26 +01:00
Paolo Bonzini e24dea2afc KVM: MTRR: treat memory as writeback if MTRR is disabled in guest CPUID
Virtual machines can be run with CPUID such that there are no MTRRs.
In that case, the firmware will never enable MTRRs and it is obviously
undesirable to run the guest entirely with UC memory.  Check out guest
CPUID, and use WB memory if MTRR do not exist.

Cc: qemu-stable@nongnu.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-22 15:29:00 +01:00
Paolo Bonzini fa7c4ebd5a KVM: MTRR: observe maxphyaddr from guest CPUID, not host
Conversion of MTRRs to ranges used the maxphyaddr from the boot CPU.
This is wrong, because var_mtrr_range's mask variable then is discontiguous
(like FF00FFFF000, where the first run of 0s corresponds to the bits
between host and guest maxphyaddr).  Instead always set up the masks
to be full 64-bit values---we know that the reserved bits at the top
are zero, and we can restore them when reading the MSR.  This way
var_mtrr_range gets a mask that just works.

Fixes: a13842dc66
Cc: qemu-stable@nongnu.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-22 15:28:56 +01:00
Alexis Dambricourt a7f2d78657 KVM: MTRR: fix fixed MTRR segment look up
This fixes the slow-down of VM running with pci-passthrough, since some MTRR
range changed from MTRR_TYPE_WRBACK to MTRR_TYPE_UNCACHABLE.  Memory in the
0K-640K range was incorrectly treated as uncacheable.

Fixes: f7bfb57b3e
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561
Cc: qemu-stable@nongnu.org
Signed-off-by: Alexis Dambricourt <alexis.dambricourt@gmail.com>
[Use correct BZ for "Fixes" annotation.  - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-22 15:28:37 +01:00
Takuya Yoshikawa 774926641d KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table()
Not just in order to clean up the code, but to make it faster by using
enhanced instructions: the initialization became 20-30% faster on our
testing machine.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-18 19:07:45 +01:00
Andrey Smetanin 481d2bcc84 kvm/x86: Remove Hyper-V SynIC timer stopping
It's possible that guest send us Hyper-V EOM at the middle
of Hyper-V SynIC timer running, so we start processing of Hyper-V
SynIC timers in vcpu context and stop the Hyper-V SynIC timer
unconditionally:

    host                                       guest
    ------------------------------------------------------------------------------
                                           start periodic stimer
    start periodic timer
    timer expires after 15ms
    send expiration message into guest
    restart periodic timer
    timer expires again after 15 ms
    msg slot is still not cleared so
    setup ->msg_pending
(1) restart periodic timer
                                           process timer msg and clear slot
                                           ->msg_pending was set:
                                               send EOM into host
    received EOM
      kvm_make_request(KVM_REQ_HV_STIMER)

    kvm_hv_process_stimers():
        ...
        stimer_stop()
        if (time_now >= stimer->exp_time)
                stimer_expiration(stimer);

Because the timer was rearmed at (1), time_now < stimer->exp_time
and stimer_expiration is not called.  The timer then never fires.

The patch fixes such situation by not stopping Hyper-V SynIC timer
at all, because it's safe to restart it without stop in vcpu context
and timer callback always returns HRTIMER_NORESTART.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:51:22 +01:00
Paolo Bonzini 8a86aea920 KVM: vmx: detect mismatched size in VMCS read/write
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	I am sending this as RFC because the error messages it produces are
	very ugly.  Because of inlining, the original line is lost.  The
	alternative is to change vmcs_read/write/checkXX into macros, but
	then you need to have a single huge BUILD_BUG_ON or BUILD_BUG_ON_MSG
	because multiple BUILD_BUG_ON* with the same __LINE__ are not
	supported well.
2015-12-16 18:49:47 +01:00
Paolo Bonzini 845c5b4054 KVM: VMX: fix read/write sizes of VMCS fields in dump_vmcs
This was not printing the high parts of several 64-bit fields on
32-bit kernels.  Separate from the previous one to make the patches
easier to review.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:47 +01:00
Paolo Bonzini f353105463 KVM: VMX: fix read/write sizes of VMCS fields
In theory this should have broken EPT on 32-bit kernels (due to
reading the high part of natural-width field GUEST_CR3).  Not sure
if no one noticed or the processor behaves differently from the
documentation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:46 +01:00
Li RongQing 0bcf261cc8 KVM: VMX: fix the writing POSTED_INTR_NV
POSTED_INTR_NV is 16bit, should not use 64bit write function

[ 5311.676074] vmwrite error: reg 3 value 0 (err 12)
  [ 5311.680001] CPU: 49 PID: 4240 Comm: qemu-system-i38 Tainted: G I 4.1.13-WR8.0.0.0_standard #1
  [ 5311.689343] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015
  [ 5311.699550] 00000000 00000000 e69a7e1c c1950de1 00000000 e69a7e38 fafcff45 fafebd24
  [ 5311.706924] 00000003 00000000 0000000c b6a06dfa e69a7e40 fafcff79 e69a7eb0 fafd5f57
  [ 5311.714296] e69a7ec0 c1080600 00000000 00000001 c0e18018 000001be 00000000 00000b43
  [ 5311.721651] Call Trace:
  [ 5311.722942] [<c1950de1>] dump_stack+0x4b/0x75
  [ 5311.726467] [<fafcff45>] vmwrite_error+0x35/0x40 [kvm_intel]
  [ 5311.731444] [<fafcff79>] vmcs_writel+0x29/0x30 [kvm_intel]
  [ 5311.736228] [<fafd5f57>] vmx_create_vcpu+0x337/0xb90 [kvm_intel]
  [ 5311.741600] [<c1080600>] ? dequeue_task_fair+0x2e0/0xf60
  [ 5311.746197] [<faf3b9ca>] kvm_arch_vcpu_create+0x3a/0x70 [kvm]
  [ 5311.751278] [<faf29e9d>] kvm_vm_ioctl+0x14d/0x640 [kvm]
  [ 5311.755771] [<c1129d44>] ? free_pages_prepare+0x1a4/0x2d0
  [ 5311.760455] [<c13e2842>] ? debug_smp_processor_id+0x12/0x20
  [ 5311.765333] [<c10793be>] ? sched_move_task+0xbe/0x170
  [ 5311.769621] [<c11752b3>] ? kmem_cache_free+0x213/0x230
  [ 5311.774016] [<faf29d50>] ? kvm_set_memory_region+0x60/0x60 [kvm]
  [ 5311.779379] [<c1199fa2>] do_vfs_ioctl+0x2e2/0x500
  [ 5311.783285] [<c11752b3>] ? kmem_cache_free+0x213/0x230
  [ 5311.787677] [<c104dc73>] ? __mmdrop+0x63/0xd0
  [ 5311.791196] [<c104dc73>] ? __mmdrop+0x63/0xd0
  [ 5311.794712] [<c104dc73>] ? __mmdrop+0x63/0xd0
  [ 5311.798234] [<c11a2ed7>] ? __fget+0x57/0x90
  [ 5311.801559] [<c11a2f72>] ? __fget_light+0x22/0x50
  [ 5311.805464] [<c119a240>] SyS_ioctl+0x80/0x90
  [ 5311.808885] [<c1957d30>] sysenter_do_call+0x12/0x12
  [ 5312.059280] kvm: zapping shadow pages for mmio generation wraparound
  [ 5313.678415] kvm [4231]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff
  [ 5313.726518] kvm [4231]: vcpu0 unhandled rdmsr: 0x570

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:45 +01:00
Andrey Smetanin 1f4b34f825 kvm/x86: Hyper-V SynIC timers
Per Hyper-V specification (and as required by Hyper-V-aware guests),
SynIC provides 4 per-vCPU timers.  Each timer is programmed via a pair
of MSRs, and signals expiration by delivering a special format message
to the configured SynIC message slot and triggering the corresponding
synthetic interrupt.

Note: as implemented by this patch, all periodic timers are "lazy"
(i.e. if the vCPU wasn't scheduled for more than the timer period the
timer events are lost), regardless of the corresponding configuration
MSR.  If deemed necessary, the "catch up" mode (the timer period is
shortened until the timer catches up) will be implemented later.

Changes v2:
* Use remainder to calculate periodic timer expiration time

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:45 +01:00
Andrey Smetanin 765eaa0f70 kvm/x86: Hyper-V SynIC message slot pending clearing at SINT ack
The SynIC message protocol mandates that the message slot is claimed
by atomically setting message type to something other than HVMSG_NONE.
If another message is to be delivered while the slot is still busy,
message pending flag is asserted to indicate to the guest that the
hypervisor wants to be notified when the slot is released.

To make sure the protocol works regardless of where the message
sources are (kernel or userspace), clear the pending flag on SINT ACK
notification, and let the message sources compete for the slot again.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:44 +01:00
Andrey Smetanin 93bf417248 kvm/x86: Hyper-V internal helper to read MSR HV_X64_MSR_TIME_REF_COUNT
This helper will be used also in Hyper-V SynIC timers implementation.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:43 +01:00
Andrey Smetanin 0ae80384b2 kvm/x86: Added Hyper-V vcpu_to_hv_vcpu()/hv_vcpu_to_vcpu() helpers
Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:42 +01:00
Andrey Smetanin e18eaeed2b kvm/x86: Rearrange func's declarations inside Hyper-V header
This rearrangement places functions declarations together
according to their functionality, so future additions
will be simplier.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
CC: Gleb Natapov <gleb@kernel.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: "K. Y. Srinivasan" <kys@microsoft.com>
CC: Haiyang Zhang <haiyangz@microsoft.com>
CC: Vitaly Kuznetsov <vkuznets@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
CC: qemu-devel@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-16 18:49:42 +01:00
Haozhong Zhang 81b1b9ca6d KVM: VMX: Fix host initiated access to guest MSR_TSC_AUX
The current handling of accesses to guest MSR_TSC_AUX returns error if
vcpu does not support rdtscp, though those accesses are initiated by
host. This can result in the reboot failure of some versions of
QEMU. This patch fixes this issue by passing those host initiated
accesses for further handling instead.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-14 16:24:08 +01:00
Paolo Bonzini 8b89fe1f6c kvm: x86: move tracepoints outside extended quiescent state
Invoking tracepoints within kvm_guest_enter/kvm_guest_exit causes a
lockdep splat.

Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-12-11 12:26:33 +01:00
Takuya Yoshikawa bb11c6c965 KVM: x86: MMU: Remove unused parameter parent_pte from kvm_mmu_get_page()
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-26 15:31:36 +01:00
Takuya Yoshikawa 74c4e63ab9 KVM: x86: MMU: Use for_each_rmap_spte macro instead of pte_list_walk()
As kvm_mmu_get_page() was changed so that every parent pointer would not
get into the sp->parent_ptes chain before the entry pointed to by it was
set properly, we can use the for_each_rmap_spte macro instead of
pte_list_walk().

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-26 15:31:31 +01:00
Takuya Yoshikawa 98bba23842 KVM: x86: MMU: Move parent_pte handling from kvm_mmu_get_page() to link_shadow_page()
Every time kvm_mmu_get_page() is called with a non-NULL parent_pte
argument, link_shadow_page() follows that to set the parent entry so
that the new mapping will point to the returned page table.

Moving parent_pte handling there allows to clean up the code because
parent_pte is passed to kvm_mmu_get_page() just for mark_unsync() and
mmu_page_add_parent_pte().

In addition, the patch avoids calling mark_unsync() for other parents in
the sp->parent_ptes chain than the newly added parent_pte, because they
have been there since before the current page fault handling started.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-26 15:31:24 +01:00
Takuya Yoshikawa 4700579241 KVM: x86: MMU: Move initialization of parent_ptes out from kvm_mmu_alloc_page()
Make kvm_mmu_alloc_page() do just what its name tells to do, and remove
the extra allocation error check and zero-initialization of parent_ptes:
shadow page headers allocated by kmem_cache_zalloc() are always in the
per-VCPU pools.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:27:06 +01:00
Takuya Yoshikawa 77fbbbd2f0 KVM: x86: MMU: Consolidate BUG_ON checks for reverse-mapped sptes
At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is
placed right after the call to detect unrelated sptes which must not be
found in the reverse-mapping list.

Move this check in rmap_get_first/next() so that all call sites, not
just the users of the for_each_rmap_spte() macro, will be checked the
same way.

One thing to keep in mind is that kvm_mmu_unlink_parents() also uses
rmap_get_first() to handle parent sptes.  The change will not break it
because parent sptes are present, at least until drop_parent_pte()
actually unlinks them, and not mmio-sptes.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:26:47 +01:00
Takuya Yoshikawa afd28fe1c9 KVM: x86: MMU: Remove is_rmap_spte() and use is_shadow_present_pte()
is_rmap_spte(), originally named is_rmap_pte(), was introduced when the
simple reverse mapping was implemented by commit cd4a4e5374
("[PATCH] KVM: MMU: Implement simple reverse mapping").  At that point,
its role was clear and only rmap_add() and rmap_remove() were using it
to select sptes that need to be reverse-mapped.

Independently of that, is_shadow_present_pte() was first introduced by
commit c7addb9020 ("KVM: Allow not-present guest page faults to
bypass kvm") to do bypass_guest_pf optimization, which does not exist
any more.

These two seem to have changed their roles somewhat, and is_rmap_spte()
just calls is_shadow_present_pte() now.

Since using both of them without clear distinction just makes the code
confusing, remove is_rmap_spte().

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:26:35 +01:00
Takuya Yoshikawa 029499b477 KVM: x86: MMU: Make mmu_set_spte() return emulate value
mmu_set_spte()'s code is based on the assumption that the emulate
parameter has a valid pointer value if set_spte() returns true and
write_fault is not zero.  In other cases, emulate may be NULL, so a
NULL-check is needed.

Stop passing emulate pointer and make mmu_set_spte() return the emulate
value instead to clean up this complex interface.  Prefetch functions
can just throw away the return value.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:26:28 +01:00
Takuya Yoshikawa fd9514572f KVM: x86: MMU: Add helper function to clear a bit in unsync child bitmap
Both __mmu_unsync_walk() and mmu_pages_clear_parents() have three line
code which clears a bit in the unsync child bitmap; the former places it
inside a loop block and uses a few goto statements to jump to it.

A new helper function, clear_unsync_child_bit(), makes the code cleaner.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:26:15 +01:00
Takuya Yoshikawa 7ee0e5b29d KVM: x86: MMU: Remove unused parameter of __direct_map()
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:26:03 +01:00
Takuya Yoshikawa 018aabb56d KVM: x86: MMU: Encapsulate the type of rmap-chain head in a new struct
New struct kvm_rmap_head makes the code type-safe to some extent.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:25:44 +01:00
Paolo Bonzini 0e3d0648bd KVM: x86: MMU: always set accessed bit in shadow PTEs
Commit 7a1638ce42 ("nEPT: Redefine EPT-specific link_shadow_page()",
2013-08-05) says:

    Since nEPT doesn't support A/D bit, we should not set those bit
    when building the shadow page table.

but this is not necessary.  Even though nEPT doesn't support A/D
bits, and hence the vmcs12 EPT pointer will never enable them,
we always use them for shadow page tables if available (see
construct_eptp in vmx.c).  So we can set the A/D bits freely
in the shadow page table.

This patch hence basically reverts commit 7a1638ce42.

Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:24:23 +01:00
Paolo Bonzini aba2f06c07 KVM: x86: correctly print #AC in traces
Poor #AC was so unimportant until a few days ago that we were
not even tracing its name correctly.  But now it's all over
the place.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:24:23 +01:00
Paolo Bonzini 46896c73c1 KVM: svm: add support for RDTSCP
RDTSCP was never supported for AMD CPUs, which nobody noticed because
Linux does not use it.  But exactly the fact that Linux does not
use it makes the implementation very simple; we can freely trash
MSR_TSC_AUX while running the guest.

Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-11-25 17:24:22 +01:00