Commit Graph

1282 Commits

Author SHA1 Message Date
Linus Torvalds 4e241557fc The bulk of the changes here is for x86. And for once it's not
for silicon that no one owns: these are really new features for
 everyone.
 
 * ARM: several features are in progress but missed the 4.2 deadline.
 So here is just a smattering of bug fixes, plus enabling the VFIO
 integration.
 
 * s390: Some fixes/refactorings/optimizations, plus support for
 2GB pages.
 
 * x86: 1) host and guest support for marking kvmclock as a stable
 scheduler clock. 2) support for write combining. 3) support for
 system management mode, needed for secure boot in guests. 4) a bunch
 of cleanups required for 2+3.  5) support for virtualized performance
 counters on AMD; 6) legacy PCI device assignment is deprecated and
 defaults to "n" in Kconfig; VFIO replaces it.  On top of this there are
 also bug fixes and eager FPU context loading for FPU-heavy guests.
 
 * Common code: Support for multiple address spaces; for now it is
 used only for x86 SMM but the s390 folks also have plans.
 
 There are some x86 conflicts, one with the rc8 pull request and
 the rest with Ingo's FPU rework.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJViYzhAAoJEL/70l94x66Dda0H/1IepMbfEy+o849d5G71fNTs
 F8Y8qUP2GZuL7T53FyFUGSBw+AX7kimu9ia4gR/PmDK+QYsdosYeEjwlsolZfTBf
 sHuzNtPoJhi5o1o/ur4NGameo0WjGK8f1xyzr+U8z74QDQyQv/QYCdK/4isp4BJL
 ugHNHkuROX6Zng4i7jc9rfaSRg29I3GBxQUYpMkEnD3eMYMUBWGm6Rs8pHgGAMvL
 vqzntgW00WNxehTqcAkmD/Wv+txxhkvIadZnjgaxH49e9JeXeBKTIR5vtb7Hns3s
 SuapZUyw+c95DIipXq4EznxxaOrjbebOeFgLCJo8+XMXZum8RZf/ob24KroYad0=
 =YsAR
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull first batch of KVM updates from Paolo Bonzini:
 "The bulk of the changes here is for x86.  And for once it's not for
  silicon that no one owns: these are really new features for everyone.

  Details:

   - ARM:
        several features are in progress but missed the 4.2 deadline.
        So here is just a smattering of bug fixes, plus enabling the
        VFIO integration.

   - s390:
        Some fixes/refactorings/optimizations, plus support for 2GB
        pages.

   - x86:
        * host and guest support for marking kvmclock as a stable
          scheduler clock.
        * support for write combining.
        * support for system management mode, needed for secure boot in
          guests.
        * a bunch of cleanups required for the above
        * support for virtualized performance counters on AMD
        * legacy PCI device assignment is deprecated and defaults to "n"
          in Kconfig; VFIO replaces it

        On top of this there are also bug fixes and eager FPU context
        loading for FPU-heavy guests.

   - Common code:
        Support for multiple address spaces; for now it is used only for
        x86 SMM but the s390 folks also have plans"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
  KVM: s390: clear floating interrupt bitmap and parameters
  KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
  KVM: x86/vPMU: Implement AMD vPMU code for KVM
  KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
  KVM: x86/vPMU: introduce kvm_pmu_msr_idx_to_pmc
  KVM: x86/vPMU: reorder PMU functions
  KVM: x86/vPMU: whitespace and stylistic adjustments in PMU code
  KVM: x86/vPMU: use the new macros to go between PMC, PMU and VCPU
  KVM: x86/vPMU: introduce pmu.h header
  KVM: x86/vPMU: rename a few PMU functions
  KVM: MTRR: do not map huge page for non-consistent range
  KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
  KVM: MTRR: introduce mtrr_for_each_mem_type
  KVM: MTRR: introduce fixed_mtrr_addr_* functions
  KVM: MTRR: sort variable MTRRs
  KVM: MTRR: introduce var_mtrr_range
  KVM: MTRR: introduce fixed_mtrr_segment table
  KVM: MTRR: improve kvm_mtrr_get_guest_memory_type
  KVM: MTRR: do not split 64 bits MSR content
  KVM: MTRR: clean up mtrr default type
  ...
2015-06-24 09:36:49 -07:00
Wei Huang 6912ac326d KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
This patch enables AMD guest VM to access (R/W) PMU related MSRs, which
include PERFCTR[0..3] and EVNTSEL[0..3].

Reviewed-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-23 14:12:15 +02:00
Wei Huang 474a5bb944 KVM: x86/vPMU: introduce pmu.h header
This will be used for private function used by AMD- and Intel-specific
PMU implementations.

Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:29 +02:00
Wei Huang c6702c9dcf KVM: x86/vPMU: rename a few PMU functions
Before introducing a pmu.h header for them, make the naming more
consistent.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:29 +02:00
Xiao Guangrong 19efffa244 KVM: MTRR: sort variable MTRRs
Sort all valid variable MTRRs based on its base address, it will help us to
check a range to see if it's fully contained in variable MTRRs

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Fix list insertion sort, simplify var_mtrr_range_is_valid to just
 test the V bit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:28 +02:00
Xiao Guangrong 70109e7d9d KVM: MTRR: remove mtrr_state.have_fixed
vMTRR does not depend on any host MTRR feature and fixed MTRRs have always
been implemented, so drop this field

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:27 +02:00
Xiao Guangrong eb839917a7 KVM: MTRR: handle MSR_MTRRcap in kvm_mtrr_get_msr
MSR_MTRRcap is a MTRR msr so move the handler to the common place, also
add some comments to make the hard code more readable

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:27 +02:00
Xiao Guangrong ff53604b40 KVM: x86: move MTRR related code to a separate file
MTRR code locates in x86.c and mmu.c so that move them to a separate file to
make the organization more clearer and it will be the place where we fully
implement vMTRR

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:26 +02:00
Xiao Guangrong b18d5431ac KVM: x86: fix CR0.CD virtualization
Currently, CR0.CD is not checked when we virtualize memory cache type for
noncoherent_dma guests, this patch fixes it by :

- setting UC for all memory if CR0.CD = 1
- zapping all the last sptes in MMU if CR0.CD is changed

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-19 17:16:26 +02:00
Paolo Bonzini 6d396b5520 KVM: x86: advertise KVM_CAP_X86_SMM
... and we're done. :)

Because SMBASE is usually relocated above 1M on modern chipsets, and
SMM handlers might indeed rely on 4G segment limits, we only expose it
if KVM is able to run the guest in big real mode.  This includes any
of VMX+emulate_invalid_guest_state, VMX+unrestricted_guest, or SVM.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-05 17:26:38 +02:00
Paolo Bonzini 699023e239 KVM: x86: add SMM to the MMU role, support SMRAM address space
This is now very simple to do.  The only interesting part is a simple
trick to find the right memslot in gfn_to_rmap, retrieving the address
space from the spte role word.  The same trick is used in the auditing
code.

The comment on top of union kvm_mmu_page_role has been stale forever,
so remove it.  Speaking of stale code, remove pad_for_nice_hex_output
too: it was splitting the "access" bitfield across two bytes and thus
had effectively turned into pad_for_ugly_hex_output.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-05 17:26:37 +02:00
Paolo Bonzini 9da0e4d5ac KVM: x86: work on all available address spaces
This patch has no semantic change, but it prepares for the introduction
of a second address space for system management mode.

A new function x86_set_memory_region (and the "slots_lock taken"
counterpart __x86_set_memory_region) is introduced in order to
operate on all address spaces when adding or deleting private
memory slots.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-05 17:26:37 +02:00
Paolo Bonzini 54bf36aac5 KVM: x86: use vcpu-specific functions to read/write/translate GFNs
We need to hide SMRAM from guests not running in SMM.  Therefore,
all uses of kvm_read_guest* and kvm_write_guest* must be changed to
check whether the VCPU is in system management mode and use a
different set of memslots.  Switch from kvm_* to the newly-introduced
kvm_vcpu_*, which call into kvm_arch_vcpu_memslots_id.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-05 17:26:36 +02:00
Paolo Bonzini 660a5d517a KVM: x86: save/load state on SMM switch
The big ugly one.  This patch adds support for switching in and out of
system management mode, respectively upon receiving KVM_REQ_SMI and upon
executing a RSM instruction.  Both 32- and 64-bit formats are supported
for the SMM state save area.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:17:46 +02:00
Paolo Bonzini cd7764fe9f KVM: x86: latch INITs while in system management mode
Do not process INITs immediately while in system management mode, keep
it instead in apic->pending_events.  Tell userspace if an INIT is
pending when they issue GET_VCPU_EVENTS, and similarly handle the
new field in SET_VCPU_EVENTS.

Note that the same treatment should be done while in VMX non-root mode.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:01:51 +02:00
Paolo Bonzini 64d6067057 KVM: x86: stubs for SMM support
This patch adds the interface between x86.c and the emulator: the
SMBASE register, a new emulator flag, the RSM instruction.  It also
adds a new request bit that will be used by the KVM_SMI ioctl.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:01:45 +02:00
Paolo Bonzini f077825a87 KVM: x86: API changes for SMM support
This patch includes changes to the external API for SMM support.
Userspace can predicate the availability of the new fields and
ioctls on a new capability, KVM_CAP_X86_SMM, which is added at the end
of the patch series.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:01:11 +02:00
Paolo Bonzini a584539b24 KVM: x86: pass the whole hflags field to emulator and back
The hflags field will contain information about system management mode
and will be useful for the emulator.  Pass the entire field rather than
just the guest-mode information.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:01:05 +02:00
Paolo Bonzini 609e36d372 KVM: x86: pass host_initiated to functions that read MSRs
SMBASE is only readable from SMM for the VCPU, but it must be always
accessible if userspace is accessing it.  Thus, all functions that
read MSRs are changed to accept a struct msr_data; the host_initiated
and index fields are pre-initialized, while the data field is filled
on return.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:01:00 +02:00
Paolo Bonzini 62ef68bb4d KVM: x86: introduce num_emulated_msrs
We will want to filter away MSR_IA32_SMBASE from the emulated_msrs if
the host CPU does not support SMM virtualization.  Introduce the
logic to do that, and also move paravirt MSRs to emulated_msrs for
simplicity and to get rid of KVM_SAVE_MSRS_BEGIN.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 16:00:46 +02:00
Paolo Bonzini e69fab5df4 KVM: x86: clear hidden CPU state at reset time
This was noticed by Radim while reviewing the implementation of
system management mode.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-06-04 10:44:44 +02:00
Marcelo Tosatti b7e60c5aed KVM: x86: zero kvmclock_offset when vcpu0 initializes kvmclock system MSR
Initialize kvmclock base, on kvmclock system MSR write time,
so that the guest sees kvmclock counting from zero.

This matches baremetal behaviour when kvmclock in guest
sets sched clock stable.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[Remove unnecessary comment. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-29 14:02:40 +02:00
Marcelo Tosatti 630994b3c7 KVM: x86: add module parameter to disable periodic kvmclock sync
The periodic kvmclock sync can be an undesired source of latencies.

When running cyclictest on a guest, a latency spike is visible.
With kvmclock periodic sync disabled, the spike is gone.

Guests should use ntp which means the propagations of ntp corrections
from the host clock are unnecessary.

v2:
-> Make parameter read-only (Radim)
-> Return early on kvmclock_sync_fn (Andrew)

Reported-and-tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-28 10:43:35 +02:00
Paolo Bonzini f36f3f2846 KVM: add "new" argument to kvm_arch_commit_memory_region
This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot.  It will simplify the code when more
than one address space will be supported.

Unfortunately, the "const"ness of the new argument must be casted
away in two places.  Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.

Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-28 10:42:58 +02:00
Ingo Molnar 003e2e8b57 x86/fpu: Standardize the parameter type of copy_kernel_to_fpregs()
Bring the __copy_fpstate_to_fpregs() and copy_fpstate_to_fpregs() functions
in line with the parameter passing convention of other kernel-to-FPU-registers
copying functions: pass around an in-memory FPU register state pointer,
instead of struct fpu *.

NOTE: This patch also changes the assembly constraint of the FXSAVE-leak
      workaround from 'fpu->fpregs_active' to 'fpstate' - but that is fine,
      as we only need a valid memory address there for the FILDL instruction.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27 14:11:32 +02:00
Ingo Molnar 3e1bf47e5c x86/fpu: Rename copy_fpstate_to_fpregs() to copy_kernel_to_fpregs()
Bring the __copy_fpstate_to_fpregs() and copy_fpstate_to_fpregs() functions
in line with the naming of other kernel-to-FPU-registers copying functions.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-27 14:11:29 +02:00
Paolo Bonzini 15f46015ee KVM: add memslots argument to kvm_arch_memslots_updated
Prepare for the case of multiple address spaces.

Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-26 12:40:17 +02:00
Paolo Bonzini 09170a4942 KVM: const-ify uses of struct kvm_userspace_memory_region
Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents.  Add const to
enforce this.

In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.

Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-26 12:40:13 +02:00
Paolo Bonzini 9f6b802978 KVM: use kvm_memslots whenever possible
kvm_memslots provides lockdep checking.  Use it consistently instead of
explicit dereferencing of kvm->memslots.

Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-26 12:40:08 +02:00
Ingo Molnar 3152657f10 Linux 4.1-rc5
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVYnloAAoJEHm+PkMAQRiGCgkH/j3r2djOOm4h83FXrShaHORY
 p8TBI3FNj4fzLk2PfzqbmiDw2T2CwygB+pxb2Ac9CE99epw8qPk2SRvPXBpdKR7t
 lolhhwfzApLJMZbhzNLVywUCDUhFoiEWRhmPqIfA3WXFcIW3t5VNXAoIFjV5HFr6
 sYUlaxSI1XiQ5tldVv8D6YSFHms41pisziBIZmzhIUg10P6Vv3D0FbE74fjAJwx0
 +08zj3EO7yQMv7Aeeq8F8AJ3628142rcZf0NWF5ohlKLRK3gt0cl9jO5U4Co2dDt
 29v03LP5EI6jDKkIbuWlqRMq9YxJz7N3wnkzV0EJiqXucoqPLFDqzbxB4gnS1pI=
 =7vbA
 -----END PGP SIGNATURE-----

Merge branch 'linus' into x86/fpu

Resolve semantic conflict in arch/x86/kvm/cpuid.c with:

  c447e76b4c ("kvm/fpu: Enable eager restore kvm FPU for MPX")

By removing the FPU internal include files.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-25 09:39:19 +02:00
Paolo Bonzini a9b4fb7e79 Merge branch 'kvm-master' into kvm-next
Grab MPX bugfix, and fix conflicts against Rik's adaptive FPU
deactivation patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20 12:31:37 +02:00
Liang Li c447e76b4c kvm/fpu: Enable eager restore kvm FPU for MPX
The MPX feature requires eager KVM FPU restore support. We have verified
that MPX cannot work correctly with the current lazy KVM FPU restore
mechanism. Eager KVM FPU restore should be enabled if the MPX feature is
exposed to VM.

Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
[Also activate the FPU on AMD processors. - Paolo]
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20 12:30:26 +02:00
Andrea Arcangeli e8fd5e9e99 kvm: fix crash in kvm_vcpu_reload_apic_access_page
memslot->userfault_addr is set by the kernel with a mmap executed
from the kernel but the userland can still munmap it and lead to the
below oops after memslot->userfault_addr points to a host virtual
address that has no vma or mapping.

[  327.538306] BUG: unable to handle kernel paging request at fffffffffffffffe
[  327.538407] IP: [<ffffffff811a7b55>] put_page+0x5/0x50
[  327.538474] PGD 1a01067 PUD 1a03067 PMD 0
[  327.538529] Oops: 0000 [#1] SMP
[  327.538574] Modules linked in: macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT iptable_filter ip_tables tun bridge stp llc rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xprtrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ipmi_devintf iTCO_wdt iTCO_vendor_support intel_powerclamp coretemp dcdbas intel_rapl kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sb_edac edac_core ipmi_si ipmi_msghandler acpi_pad wmi acpi_power_meter lpc_ich mfd_core mei_me
[  327.539488]  mei shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc mlx4_ib ib_sa ib_mad ib_core mlx4_en vxlan ib_addr ip_tunnel xfs libcrc32c sd_mod crc_t10dif crct10dif_common crc32c_intel mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm drm ahci i2c_core libahci mlx4_core libata tg3 ptp pps_core megaraid_sas ntb dm_mirror dm_region_hash dm_log dm_mod
[  327.539956] CPU: 3 PID: 3161 Comm: qemu-kvm Not tainted 3.10.0-240.el7.userfault19.4ca4011.x86_64.debug #1
[  327.540045] Hardware name: Dell Inc. PowerEdge R420/0CN7CM, BIOS 2.1.2 01/20/2014
[  327.540115] task: ffff8803280ccf00 ti: ffff880317c58000 task.ti: ffff880317c58000
[  327.540184] RIP: 0010:[<ffffffff811a7b55>]  [<ffffffff811a7b55>] put_page+0x5/0x50
[  327.540261] RSP: 0018:ffff880317c5bcf8  EFLAGS: 00010246
[  327.540313] RAX: 00057ffffffff000 RBX: ffff880616a20000 RCX: 0000000000000000
[  327.540379] RDX: 0000000000002014 RSI: 00057ffffffff000 RDI: fffffffffffffffe
[  327.540445] RBP: ffff880317c5bd10 R08: 0000000000000103 R09: 0000000000000000
[  327.540511] R10: 0000000000000000 R11: 0000000000000000 R12: fffffffffffffffe
[  327.540576] R13: 0000000000000000 R14: ffff880317c5bd70 R15: ffff880317c5bd50
[  327.540643] FS:  00007fd230b7f700(0000) GS:ffff880630800000(0000) knlGS:0000000000000000
[  327.540717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  327.540771] CR2: fffffffffffffffe CR3: 000000062a2c3000 CR4: 00000000000427e0
[  327.540837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  327.540904] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  327.540974] Stack:
[  327.541008]  ffffffffa05d6d0c ffff880616a20000 0000000000000000 ffff880317c5bdc0
[  327.541093]  ffffffffa05ddaa2 0000000000000000 00000000002191bf 00000042f3feab2d
[  327.541177]  00000042f3feab2d 0000000000000002 0000000000000001 0321000000000000
[  327.541261] Call Trace:
[  327.541321]  [<ffffffffa05d6d0c>] ? kvm_vcpu_reload_apic_access_page+0x6c/0x80 [kvm]
[  327.543615]  [<ffffffffa05ddaa2>] vcpu_enter_guest+0x3f2/0x10f0 [kvm]
[  327.545918]  [<ffffffffa05e2f10>] kvm_arch_vcpu_ioctl_run+0x2b0/0x5a0 [kvm]
[  327.548211]  [<ffffffffa05e2d02>] ? kvm_arch_vcpu_ioctl_run+0xa2/0x5a0 [kvm]
[  327.550500]  [<ffffffffa05ca845>] kvm_vcpu_ioctl+0x2b5/0x680 [kvm]
[  327.552768]  [<ffffffff810b8d12>] ? creds_are_invalid.part.1+0x12/0x50
[  327.555069]  [<ffffffff810b8d71>] ? creds_are_invalid+0x21/0x30
[  327.557373]  [<ffffffff812d6066>] ? inode_has_perm.isra.49.constprop.65+0x26/0x80
[  327.559663]  [<ffffffff8122d985>] do_vfs_ioctl+0x305/0x530
[  327.561917]  [<ffffffff8122dc51>] SyS_ioctl+0xa1/0xc0
[  327.564185]  [<ffffffff816de829>] system_call_fastpath+0x16/0x1b
[  327.566480] Code: 0b 31 f6 4c 89 e7 e8 4b 7f ff ff 0f 0b e8 24 fd ff ff e9 a9 fd ff ff 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 2a 8b 47 1c 85 c0 74 1e f0

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20 12:30:06 +02:00
Xiao Guangrong d81135a57a KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed
CR0.CD and CR0.NW are not used by shadow page table so that need
not adjust mmu if these two bit are changed

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-19 20:52:43 +02:00
Xiao Guangrong efdfe536d8 KVM: MMU: fix MTRR update
Currently, whenever guest MTRR registers are changed
kvm_mmu_reset_context is called to switch to the new root shadow page
table, however, it's useless since:
1) the cache type is not cached into shadow page's attribute so that
   the original root shadow page will be reused

2) the cache type is set on the last spte, that means we should sync
   the last sptes when MTRR is changed

This patch fixs this issue by drop all the spte in the gfn range which
is being updated by MTRR

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-19 20:52:42 +02:00
Xiao Guangrong edc90b7dc4 KVM: MMU: fix SMAP virtualization
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page

Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-19 20:52:36 +02:00
Ingo Molnar c47ada305d x86/fpu: Harmonize FPU register state types
Use these consistent names:

    struct fregs_state           # was: i387_fsave_struct
    struct fxregs_state          # was: i387_fxsave_struct
    struct swregs_state          # was: i387_soft_struct
    struct xregs_state           # was: xsave_struct
    union  fpregs_state          # was: thread_xstate

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:48:09 +02:00
Ingo Molnar bf935b0b52 x86/fpu: Create 'union thread_xstate' helper for fpstate_init()
fpstate_init() only uses fpu->state, so pass that in to it.

This enables the cleanup we will do in the next patch.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:48:07 +02:00
Ingo Molnar 0e75c54f17 x86/fpu: Rename restore_fpu_checking() to copy_fpstate_to_fpregs()
fpu_restore_checking() is a helper function of restore_fpu_checking(),
but this is not apparent from the naming.

Both copy fpstate contents to fpregs, while the fuller variant does
a full copy without leaking information.

So rename them to:

    copy_fpstate_to_fpregs()
  __copy_fpstate_to_fpregs()

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:48:03 +02:00
Ingo Molnar befc61ad3c x86/fpu: Move asm/xcr.h to asm/fpu/internal.h
Now that all FPU internals using drivers are converted to public APIs,
move xcr.h's definitions into fpu/internal.h and remove xcr.h.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:48:00 +02:00
Ingo Molnar c4d72e2db3 x86/fpu: Simplify fpstate_init_curr() usage
Now that fpstate_init_curr() is not doing implicit allocations
anymore, almost all uses of it involve a very simple pattern:

	if (!fpu->fpstate_active)
		fpstate_init_curr(fpu);

which is basically activating the FPU fpstate if it was not active
before.

So propagate the check into the function itself, and rename the
function according to its new purpose:

	fpu__activate_curr(fpu);

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:51 +02:00
Ingo Molnar 0ee6a51725 x86/fpu, kvm: Simplify fx_init()
Now that fpstate_init() cannot fail the error return of fx_init()
has lost its purpose. Eliminate the error return and propagate this
change to all callers.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:51 +02:00
Ingo Molnar e62bb3d894 x86/fpu: Rename fpstate_alloc_init() to fpstate_init_curr()
Now that there are no FPU context allocations, rename fpstate_alloc_init()
to fpstate_init_curr(), to signal that it initializes the fpstate and
marks it active, for the current task.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:50 +02:00
Ingo Molnar 91d93d0e20 x86/fpu: Remove failure return from fpstate_alloc_init()
Remove the failure code and propagate this down to callers.

Note that this function still has an 'init' aspect, which must be
called.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:50 +02:00
Ingo Molnar c4d6ee6e2e x86/fpu: Remove failure paths from fpstate-alloc low level functions
Now that we always allocate the FPU context as part of task_struct there's
no need for separate allocations - remove them and their primary failure
handling code.

( Note that there's still secondary error codes that have become superfluous,
  those will be removed in separate patches. )

Move the somewhat misplaced setup_xstate_comp() call to the core.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:50 +02:00
Ingo Molnar 7366ed771f x86/fpu: Simplify FPU handling by embedding the fpstate in task_struct (again)
So 6 years ago we made the FPU fpstate dynamically allocated:

  aa283f4927 ("x86, fpu: lazy allocation of FPU area - v5")
  61c4628b53 ("x86, fpu: split FPU state from task struct - v5")

In hindsight this was a mistake:

   - it complicated context allocation failure handling, such as:

		/* kthread execs. TODO: cleanup this horror. */
		if (WARN_ON(fpstate_alloc_init(fpu)))
			force_sig(SIGKILL, tsk);

   - it caused us to enable irqs in fpu__restore():

                local_irq_enable();
                /*
                 * does a slab alloc which can sleep
                 */
                if (fpstate_alloc_init(fpu)) {
                        /*
                         * ran out of memory!
                         */
                        do_group_exit(SIGKILL);
                        return;
                }
                local_irq_disable();

   - it (slightly) slowed down task creation/destruction by adding
     slab allocation/free pattens.

   - it made access to context contents (slightly) slower by adding
     one more pointer dereference.

The motivation for the dynamic allocation was two-fold:

   - reduce memory consumption by non-FPU tasks

   - allocate and handle only the necessary amount of context for
     various XSAVE processors that have varying hardware frame
     sizes.

These days, with glibc using SSE memcpy by default and GCC optimizing
for SSE/AVX by default, the scope of FPU using apps on an x86 system is
much larger than it was 6 years ago.

For example on a freshly installed Fedora 21 desktop system, with a
recent kernel, all non-kthread tasks have used the FPU shortly after
bootup.

Also, even modern embedded x86 CPUs try to support the latest vector
instruction set - so they'll too often use the larger xstate frame
sizes.

So remove the dynamic allocation complication by embedding the FPU
fpstate in task_struct again. This should make the FPU a lot more
accessible to all sorts of atomic contexts.

We could still optimize for the xstate frame size in the future,
by moving the state structure to the last element of task_struct,
and allocating only a part of that.

This change is kept minimal by still keeping the ctx_alloc()/free()
routines (that now do nothing substantial) - we'll remove them in
the following patches.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:49 +02:00
Ingo Molnar 4f83634710 x86/fpu: Rename fpu_save_init() to copy_fpregs_to_fpstate()
So fpu_save_init() is a historic name that got its name when the only
way the FPU state was FNSAVE, which cleared (well, destroyed) the FPU
state after saving it.

Nowadays the name is misleading, because ever since the introduction of
FXSAVE (and more modern FPU saving instructions) the 'we need to reload
the FPU state' part is only true if there's a pending FPU exception [*],
which is almost never the case.

So rename it to copy_fpregs_to_fpstate() to make it clear what's
happening. Also add a few comments about why we cannot keep registers
in certain cases.

Also clean up the control flow a bit, to make it more apparent when
we are dropping/keeping FP registers, and to optimize the common
case (of keeping fpregs) some more.

[*] Probably not true anymore, modern instructions always leave the FPU
    state intact, even if exceptions are pending: because pending FP
    exceptions are posted on the next FP instruction, not asynchronously.

    They were truly asynchronous back in the IRQ13 case, and we had to
    synchronize with them, but that code is not working anymore: we don't
    have IRQ13 mapped in the IDT anymore.

    But a cleanup patch is obviously not the place to change subtle behavior.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:49 +02:00
Ingo Molnar 400e4b2091 x86/fpu: Rename xsave.header::xstate_bv to 'xfeatures'
'xsave.header::xstate_bv' is a misnomer - what does 'bv' stand for?

It probably comes from the 'XGETBV' instruction name, but I could
not find in the Intel documentation where that abbreviation comes
from. It could mean 'bit vector' - or something else?

But how about - instead of guessing about a weird name - we named
the field in an obvious and descriptive way that tells us exactly
what it does?

So rename it to 'xfeatures', which is a bitmask of the
xfeatures that are fpstate_active in that context structure.

Eyesore like:

           fpu->state->xsave.xsave_hdr.xstate_bv |= XSTATE_FP;

is now much more readable:

           fpu->state->xsave.header.xfeatures |= XSTATE_FP;

Which form is not just infinitely more readable, but is also
shorter as well.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:35 +02:00
Ingo Molnar 3a54450b5e x86/fpu: Rename 'xsave_hdr' to 'header'
Code like:

           fpu->state->xsave.xsave_hdr.xstate_bv |= XSTATE_FP;

is an eyesore, because not only is the words 'xsave' and 'state'
are repeated twice times (!), but also because of the 'hdr' and 'bv'
abbreviations that are pretty meaningless at a first glance.

Start cleaning this up by renaming 'xsave_hdr' to 'header'.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:34 +02:00
Ingo Molnar 78f7f1e54b x86/fpu: Rename fpu-internal.h to fpu/internal.h
This unifies all the FPU related header files under a unified, hiearchical
naming scheme:

 - asm/fpu/types.h:      FPU related data types, needed for 'struct task_struct',
                         widely included in almost all kernel code, and hence kept
                         as small as possible.

 - asm/fpu/api.h:        FPU related 'public' methods exported to other subsystems.

 - asm/fpu/internal.h:   FPU subsystem internal methods

 - asm/fpu/xsave.h:      XSAVE support internal methods

(Also standardize the header guard in asm/fpu/internal.h.)

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:31 +02:00
Ingo Molnar db2b1d3ad1 x86/fpu: Use 'struct fpu' in fpstate_alloc_init()
Migrate this function to pure 'struct fpu' usage.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:29 +02:00
Ingo Molnar c5bedc6847 x86/fpu: Get rid of PF_USED_MATH usage, convert it to fpu->fpstate_active
Introduce a simple fpu->fpstate_active flag in the fpu context data structure
and use that instead of PF_USED_MATH in task->flags.

Testing for this flag byte should be slightly more efficient than
testing a bit in a bitmask, but the main advantage is that most
FPU functions can now be performed on a 'struct fpu' alone, they
don't need access to 'struct task_struct' anymore.

There's a slight linecount increase, mostly due to the 'fpu' local
variables and due to extra comments. The local variables will go away
once we move most of the FPU methods to pure 'struct fpu' parameters.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:25 +02:00
Ingo Molnar 4c1384100e x86/fpu: Open code PF_USED_MATH usages
PF_USED_MATH is used directly, but also in a handful of helper inlines.

To ease the elimination of PF_USED_MATH, convert all inline helpers
to open-coded PF_USED_MATH usage.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:24 +02:00
Ingo Molnar f89e32e0a3 x86/fpu: Fix header file dependencies of fpu-internal.h
Fix a minor header file dependency bug in asm/fpu-internal.h: it
relies on i387.h but does not include it. All users of fpu-internal.h
included it explicitly.

Also remove unnecessary includes, to reduce compilation time.

This also makes it easier to use it as a standalone header file
for FPU internals, such as an upcoming C module in arch/x86/kernel/fpu/.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:16 +02:00
Ingo Molnar c0ee2cf61b x86/fpu: Rename fpu_finit() to fpstate_init()
Make it clear that we are initializing the in-memory FPU context area,
no the FPU registers.

Also move it to the fpu__*() namespace.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:13 +02:00
Ingo Molnar a7c2a83364 x86/fpu: Rename fpu_free() to fpstate_free()
Use the fpu__*() namespace.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:13 +02:00
Ingo Molnar ed97b08546 x86/fpu: Rename fpu_alloc() to fpstate_alloc()
Use the fpu__*() namespace for fpstate_alloc() as well.

Also add a comment about FPU state alignment.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:13 +02:00
Ingo Molnar 97185c95f7 x86/fpu: Split an fpstate_alloc_init() function out of init_fpu()
Most init_fpu() users don't want the register-saving aspect of the
function, they are calling it for 'current' and when FPU registers
are not allocated and initialized yet.

Split out a simplified API that does just that (and add debug-checks
for these conditions): fpstate_alloc_init().

Use it where appropriate.

Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-05-19 15:47:10 +02:00
Xiao Guangrong 0be0226f07 KVM: MMU: fix SMAP virtualization
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page

Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated

Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-11 17:17:50 +02:00
Radim Krčmář 74545705cb KVM: x86: fix initial PAT value
PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
VMX used a wrong value (host's PAT) and while SVM used the right one,
it never got to arch.pat.

This is not an issue with QEMU as it will force the correct value.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:46 +02:00
Rik van Riel 653f52c316 kvm,x86: load guest FPU context more eagerly
Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to
re-load them every time the guest accesses the FPU after a switch back into
the guest from the host.

This patch copies the x86 task switch semantics for FPU loading, with the
FPU loaded eagerly after first use if the system uses eager fpu mode,
or if the guest uses the FPU frequently.

In the latter case, after loading the FPU for 255 times, the fpu_counter
will roll over, and we will revert to loading the FPU on demand, until
it has been established that the guest is still actively using the FPU.

This mirrors the x86 task switch policy, which seems to work.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:45 +02:00
James Sullivan 93bbf0b8bc kvm: x86: Extended struct kvm_lapic_irq with msi_redir_hint for MSI delivery
Extended struct kvm_lapic_irq with bool msi_redir_hint, which will
be used to determine if the delivery of the MSI should target only
the lowest priority CPU in the logical group specified for delivery.
(In physical dest mode, the RH bit is not relevant). Initialized the value
of msi_redir_hint to true when RH=1 in kvm_set_msi_irq(), and initialized
to false in all other cases.

Added value of msi_redir_hint to a debug message dump of an IRQ in
apic_send_ipi().

Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:44 +02:00
Nadav Amit d28bc9dd25 KVM: x86: INIT and reset sequences are different
x86 architecture defines differences between the reset and INIT sequences.
INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP.

References (from Intel SDM):

"If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
to a specific processor or system wide) do not cause the MP protocol to be
repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]

[Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]

"If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
changed." [9.2: X87 FPU INITIALIZATION]

"The state of the local APIC following an INIT reset is the same as it is after
a power-up or hardware reset, except that the APIC ID and arbitration ID
registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
("Wait-for-SIPI" State)]

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:43 +02:00
Nadav Amit 90de4a1875 KVM: x86: Support for disabling quirks
Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous
created in order to overcome QEMU issues. Those issue were mostly result of
invalid VM BIOS.  Currently there are two quirks that can be disabled:

1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot
2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot

These two issues are already resolved in recent releases of QEMU, and would
therefore be disabled by QEMU.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il>
[Report capability from KVM_CHECK_EXTENSION too. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:29:42 +02:00
Christian Borntraeger ccf73aaf5a KVM: arm/mips/x86/power use __kvm_guest_{enter|exit}
Use __kvm_guest_{enter|exit} instead of kvm_guest_{enter|exit}
where interrupts are disabled.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07 11:28:22 +02:00
Radim Krčmář 5dca0d9147 kvm: x86: fix kvmclock update protocol
The kvmclock spec says that the host will increment a version field to
an odd number, then update stuff, then increment it to an even number.
The host is buggy and doesn't do this, and the result is observable
when one vcpu reads another vcpu's kvmclock data.

There's no good way for a guest kernel to keep its vdso from reading
a different vcpu's kvmclock data, but we don't need to care about
changing VCPUs as long as we read a consistent data from kvmclock.
(VCPU can change outside of this loop too, so it doesn't matter if we
return a value not fit for this VCPU.)

Based on a patch by Radim Krčmář.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-27 15:48:59 +02:00
Nadav Amit 9e9c3fe40b KVM: x86: Fix MSR_IA32_BNDCFGS in msrs_to_save
kvm_init_msr_list is currently called before hardware_setup. As a result,
vmx_mpx_supported always returns false when kvm_init_msr_list checks whether to
save MSR_IA32_BNDCFGS.

Move kvm_init_msr_list after vmx_hardware_setup is called to fix this issue.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>

Message-Id: <1428864435-4732-1-git-send-email-namit@cs.technion.ac.il>
Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-14 18:09:50 +02:00
Linus Torvalds 7fd56474db Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "The main changes in this cycle were:

   - clockevents state machine cleanups and enhancements (Viresh Kumar)

   - clockevents broadcast notifier horror to state machine conversion
     and related cleanups (Thomas Gleixner, Rafael J Wysocki)

   - clocksource and timekeeping core updates (John Stultz)

   - clocksource driver updates and fixes (Ben Dooks, Dmitry Osipenko,
     Hans de Goede, Laurent Pinchart, Maxime Ripard, Xunlei Pang)

   - y2038 fixes (Xunlei Pang, John Stultz)

   - NMI-safe ktime_get_raw_fast() and general refactoring of the clock
     code, in preparation to perf's per event clock ID support (Peter
     Zijlstra)

   - generic sched/clock fixes, optimizations and cleanups (Daniel
     Thompson)

   - clockevents cpu_down() race fix (Preeti U Murthy)"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (94 commits)
  timers/PM: Drop unnecessary braces from tick_freeze()
  timers/PM: Fix up tick_unfreeze()
  timekeeping: Get rid of stale comment
  clockevents: Cleanup dead cpu explicitely
  clockevents: Make tick handover explicit
  clockevents: Remove broadcast oneshot control leftovers
  sched/idle: Use explicit broadcast oneshot control function
  ARM: Tegra: Use explicit broadcast oneshot control function
  ARM: OMAP: Use explicit broadcast oneshot control function
  intel_idle: Use explicit broadcast oneshot control function
  ACPI/idle: Use explicit broadcast control function
  ACPI/PAD: Use explicit broadcast oneshot control function
  x86/amd/idle, clockevents: Use explicit broadcast oneshot control functions
  clockevents: Provide explicit broadcast oneshot control functions
  clockevents: Remove the broadcast control leftovers
  ARM: OMAP: Use explicit broadcast control function
  intel_idle: Use explicit broadcast control function
  cpuidle: Use explicit broadcast control function
  ACPI/processor: Use explicit broadcast control function
  ACPI/PAD: Use explicit broadcast control function
  ...
2015-04-13 11:08:28 -07:00
Linus Torvalds 9003601310 The most interesting bit here is irqfd/ioeventfd support for ARM and ARM64.
ARM/ARM64: fixes for live migration, irqfd and ioeventfd support (enabling
 vhost, too), page aging
 
 s390: interrupt handling rework, allowing to inject all local interrupts
 via new ioctl and to get/set the full local irq state for migration
 and introspection.  New ioctls to access memory by virtual address,
 and to get/set the guest storage keys.  SIMD support.
 
 MIPS: FPU and MIPS SIMD Architecture (MSA) support.  Includes some patches
 from Ralf Baechle's MIPS tree.
 
 x86: bugfixes (notably for pvclock, the others are small) and cleanups.
 Another small latency improvement for the TSC deadline timer.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJVJ9vmAAoJEL/70l94x66DoMEH/R3rh8IMf4jTiWRkcqohOMPX
 k1+NaSY/lCKayaSgggJ2hcQenMbQoXEOdslvaA/H0oC+VfJGK+lmU6E63eMyyhjQ
 Y+Px6L85NENIzDzaVu/TIWWuhil5PvIRr3VO8cvntExRoCjuekTUmNdOgCvN2ObW
 wswN2qRdPIeEj2kkulbnye+9IV4G0Ne9bvsmUdOdfSSdi6ZcV43JcvrpOZT++mKj
 RrKB+3gTMZYGJXMMLBwMkdl8mK1ozriD+q0mbomT04LUyGlPwYLl4pVRDBqyksD7
 KsSSybaK2E4i5R80WEljgDMkNqrCgNfg6VZe4n9Y+CfAAOToNnkMJaFEi+yuqbs=
 =yu2b
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First batch of KVM changes for 4.1

  The most interesting bit here is irqfd/ioeventfd support for ARM and
  ARM64.

  Summary:

  ARM/ARM64:
     fixes for live migration, irqfd and ioeventfd support (enabling
     vhost, too), page aging

  s390:
     interrupt handling rework, allowing to inject all local interrupts
     via new ioctl and to get/set the full local irq state for migration
     and introspection.  New ioctls to access memory by virtual address,
     and to get/set the guest storage keys.  SIMD support.

  MIPS:
     FPU and MIPS SIMD Architecture (MSA) support.  Includes some
     patches from Ralf Baechle's MIPS tree.

  x86:
     bugfixes (notably for pvclock, the others are small) and cleanups.
     Another small latency improvement for the TSC deadline timer"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
  KVM: use slowpath for cross page cached accesses
  kvm: mmu: lazy collapse small sptes into large sptes
  KVM: x86: Clear CR2 on VCPU reset
  KVM: x86: DR0-DR3 are not clear on reset
  KVM: x86: BSP in MSR_IA32_APICBASE is writable
  KVM: x86: simplify kvm_apic_map
  KVM: x86: avoid logical_map when it is invalid
  KVM: x86: fix mixed APIC mode broadcast
  KVM: x86: use MDA for interrupt matching
  kvm/ppc/mpic: drop unused IRQ_testbit
  KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
  KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
  KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
  KVM: vmx: pass error code with internal error #2
  x86: vdso: fix pvclock races with task migration
  KVM: remove kvm_read_hva and kvm_read_hva_atomic
  KVM: x86: optimize delivery of TSC deadline timer interrupt
  KVM: x86: extract blocking logic from __vcpu_run
  kvm: x86: fix x86 eflags fixed bit
  KVM: s390: migrate vcpu interrupt state
  ...
2015-04-13 09:47:01 -07:00
Wanpeng Li 3ea3b7fa9a kvm: mmu: lazy collapse small sptes into large sptes
Dirty logging tracks sptes in 4k granularity, meaning that large sptes
have to be split.  If live migration is successful, the guest in the
source machine will be destroyed and large sptes will be created in the
destination. However, the guest continues to run in the source machine
(for example if live migration fails), small sptes will remain around
and cause bad performance.

This patch introduce lazy collapsing of small sptes into large sptes.
The rmap will be scanned in ioctl context when dirty logging is stopped,
dropping those sptes which can be collapsed into a single large-page spte.
Later page faults will create the large-page sptes.

Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Message-Id: <1428046825-6905-1-git-send-email-wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:04 +02:00
Nadav Amit 1119022c71 KVM: x86: Clear CR2 on VCPU reset
CR2 is not cleared as it should after reset.  See Intel SDM table named "IA-32
Processor States Following Power-up, Reset, or INIT".

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427933438-12782-5-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:03 +02:00
Nadav Amit ae561edeb4 KVM: x86: DR0-DR3 are not clear on reset
DR0-DR3 are not cleared as they should during reset and when they are set from
userspace.  It appears to be caused by c77fb5fe6f ("KVM: x86: Allow the guest
to run with dirty debug registers").

Force their reload on these situations.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427933438-12782-4-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:03 +02:00
Nadav Amit 58d269d8cc KVM: x86: BSP in MSR_IA32_APICBASE is writable
After reset, the CPU can change the BSP, which will be used upon INIT.  Reset
should return the BSP which QEMU asked for, and therefore handled accordingly.

To quote: "If the MP protocol has completed and a BSP is chosen, subsequent
INITs (either to a specific processor or system wide) do not cause the MP
protocol to be repeated."
[Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions]

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1427933438-12782-3-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:47:02 +02:00
Eugene Korenevsky 5a4f55cde8 KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
cpuid_maxphyaddr(), which performs lot of memory accesses is called
extensively across KVM, especially in nVMX code.

This patch adds a cached value of maxphyaddr to vcpu.arch to reduce the
pressure onto CPU cache and simplify the code of cpuid_maxphyaddr()
callers. The cached value is initialized in kvm_arch_vcpu_init() and
reloaded every time CPUID is updated by usermode. It is obvious that
these reloads occur infrequently.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Message-Id: <20150329205612.GA1223@gnote>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:46:56 +02:00
Paolo Bonzini 9c8fd1ba22 KVM: x86: optimize delivery of TSC deadline timer interrupt
The newly-added tracepoint shows the following results on
the tscdeadline_latency test:

        qemu-kvm-8387  [002]  6425.558974: kvm_vcpu_wakeup:      poll time 10407 ns
        qemu-kvm-8387  [002]  6425.558984: kvm_vcpu_wakeup:      poll time 0 ns
        qemu-kvm-8387  [002]  6425.561242: kvm_vcpu_wakeup:      poll time 10477 ns
        qemu-kvm-8387  [002]  6425.561251: kvm_vcpu_wakeup:      poll time 0 ns

and so on.  This is because we need to go through kvm_vcpu_block again
after the timer IRQ is injected.  Avoid it by polling once before
entering kvm_vcpu_block.

On my machine (Xeon E5 Sandy Bridge) this removes about 500 cycles (7%)
from the latency of the TSC deadline timer.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:46:54 +02:00
Paolo Bonzini 362c698f82 KVM: x86: extract blocking logic from __vcpu_run
Rename the old __vcpu_run to vcpu_run, and extract part of it to a new
function vcpu_block.

The next patch will add a new condition in vcpu_block, avoid extra
indentation.

Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08 10:46:53 +02:00
Paolo Bonzini bf0fb67cf9 KVM/ARM changes for v4.1:
- fixes for live migration
 - irqfd support
 - kvm-io-bus & vgic rework to enable ioeventfd
 - page ageing for stage-2 translation
 - various cleanups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVHQ0kAAoJECPQ0LrRPXpDHKQQALjw6STaZd7n20OFopNgHd4P
 qVeWYEKBxnsiSvL4p3IOSlZlEul+7x08aZqtyxWQRQcDT4ggTI+3FKKfc+8yeRpH
 WV6YJP0bGqz7039PyMLuIgs48xkSZtntePw69hPJfHZh4C1RBlP5T2SfE8mU8VZX
 fWToiU3W12QfKnmN7JFgxZopnGhrYCrG0EexdTDziAZu0GEMlDrO4wnyTR60WCvT
 4TEF73R0kpAz4yplKuhcDHuxIG7VFhQ4z7b09M1JtR0gQ3wUvfbD3Wqqi49SwHkv
 NQOStcyLsIlDosSRcLXNCwb3IxjObXTBcAxnzgm2Aoc1xMMZX1ZPQNNs6zFZzycb
 2c6QMiQ35zm7ellbvrG+bT+BP86JYWcAtHjWcaUFgqSJjb8MtqcMtsCea/DURhqx
 /kictqbPYBBwKW6SKbkNkisz59hPkuQnv35fuf992MRCbT9LAXLPRLbcirucCzkE
 p1MOotsWoO3ldJMZaVn0KYk3sQf6mCIfbYPEdOcw3fhJlvyy3NdjVkLOFbA5UUg1
 rQ7Ru2rTemBc0ExVrymngNTMpMB4XcEeJzXfhcgMl3DWbDj60Ku/O26sDtZ6bsFv
 JuDYn8FVDHz9gpEQHgiUi1YMsBKXLhnILa1ppaa6AflykU3BRfYjAk1SXmX84nQK
 mJUJEdFuxi6pHN0UKxUI
 =avA4
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into 'kvm-next'

KVM/ARM changes for v4.1:

- fixes for live migration
- irqfd support
- kvm-io-bus & vgic rework to enable ioeventfd
- page ageing for stage-2 translation
- various cleanups
2015-04-07 18:09:20 +02:00
Ingo Molnar c5e77f5216 Linux 4.0-rc6
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJVGHwjAAoJEHm+PkMAQRiG8rcIAJ6cEJ6mbqLpyz5XrGf4yNp0
 +wG/QlEpT8rgrxe9wSjB3lfW3kR2Pe69b9fVVCdiklygdkmva5vfmDrVGGzYfe3M
 QrFSSlMVBplvh6IiM/L1mVMtr3DSmCO23YZZ9R5b7FoEYatNHRpNWBCBpuXpd4aD
 sLuIvO3L/S7LqeOAFkkYWv6AuL9umicmjR8u+nsmCSRJom7At/aJ6R66WIp9vxho
 Rn7r6wcUk6B2Q/gYNjdSE8SIwdyKhuBGyvqQ9U9s6Btg9DQfM/b0vG5kw9hqeAq/
 9445jqVDP1whA2vz6GjnvltidxrqRvuDPBwzOnFmY5U+KZz4lS3x2mnWAAJ3xWs=
 =TqVJ
 -----END PGP SIGNATURE-----

Merge tag 'v4.0-rc6' into timers/core, before applying new patches

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-31 09:08:13 +02:00
Peter Zijlstra 876e78818d time: Rename timekeeper::tkr to timekeeper::tkr_mono
In preparation of adding another tkr field, rename this one to
tkr_mono. Also rename tk_read_base::base_mono to tk_read_base::base,
since the structure is not specific to CLOCK_MONOTONIC and the mono
name got added to the tk_read_base instance.

Lots of trivial churn.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.344679419@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-03-27 09:45:06 +01:00
Nikolay Nikolaev e32edf4fd0 KVM: Redesign kvm_io_bus_ API to pass VCPU structure to the callbacks.
This is needed in e.g. ARM vGIC emulation, where the MMIO handling
depends on the VCPU that does the access.

Signed-off-by: Nikolay Nikolaev <n.nikolaev@virtualopensystems.com>
Signed-off-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2015-03-26 21:43:11 +00:00
Xiubo Li 52eb5a6d57 KVM: x86: For the symbols used locally only should be static type
This patch fix the following sparse warnings:

for arch/x86/kvm/x86.c:
warning: symbol 'emulator_read_write' was not declared. Should it be static?
warning: symbol 'emulator_write_emulated' was not declared. Should it be static?
warning: symbol 'emulator_get_dr' was not declared. Should it be static?
warning: symbol 'emulator_set_dr' was not declared. Should it be static?

for arch/x86/kvm/pmu.c:
warning: symbol 'fixed_pmc_events' was not declared. Should it be static?

Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-17 22:38:28 -03:00
Xiubo Li 795a149e78 KVM: x86: Avoid using plain integer as NULL pointer warning
This patch fix the following sparse warning:

for file arch/x86/kvm/x86.c:
warning: Using plain integer as NULL pointer

Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-17 22:34:25 -03:00
Paolo Bonzini dc9be0fac7 kvm: move advertising of KVM_CAP_IRQFD to common code
POWER supports irqfds but forgot to advertise them.  Some userspace does
not check for the capability, but others check it---thus they work on
x86 and s390 but not POWER.

To avoid that other architectures in the future make the same mistake, let
common code handle KVM_CAP_IRQFD the same way as KVM_CAP_IRQFD_RESAMPLE.

Reported-and-tested-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Fixes: 297e21053a
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-10 21:18:59 -03:00
Joel Schopp 5cb56059c9 kvm: x86: make kvm_emulate_* consistant
Currently kvm_emulate() skips the instruction but kvm_emulate_* sometimes
don't.  The end reult is the caller ends up doing the skip themselves.
Let's make them consistant.

Signed-off-by: Joel Schopp <joel.schopp@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-10 20:29:15 -03:00
Thomas Huth 548ef28449 KVM: Get rid of kvm_kvfree()
kvm_kvfree() provides exactly the same functionality as the
new common kvfree() function - so let's simply replace the
kvm function with the common function.

Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-10 10:37:43 -03:00
Paolo Bonzini f781951299 kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.

This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest.  KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too.  When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.

With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest.  This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.

Of course there is some price to pay; every time an otherwise idle vCPUs
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host.  The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.

The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter.  It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.

While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls.  During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark.  The wasted time is thus very low.  Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.

The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer.  Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-06 13:08:37 +01:00
Kai Huang 843e433057 KVM: VMX: Add PML support in VMX
This patch adds PML support in VMX. A new module parameter 'enable_pml' is added
to allow user to enable/disable it manually.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-30 09:39:54 +01:00
Kai Huang 88178fd4f7 KVM: x86: Add new dirty logging kvm_x86_ops for PML
This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty
logging for particular memory slot, and to flush potentially logged dirty GPAs
before reporting slot->dirty_bitmap to userspace.

kvm x86 common code calls these hooks when they are available so PML logic can
be hidden to VMX specific. SVM won't be impacted as these hooks remain NULL
there.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:41 +01:00
Kai Huang 1c91cad423 KVM: x86: Change parameter of kvm_mmu_slot_remove_write_access
This patch changes the second parameter of kvm_mmu_slot_remove_write_access from
'slot id' to 'struct kvm_memory_slot *' to align with kvm_x86_ops dirty logging
hooks, which will be introduced in further patch.

Better way is to change second parameter of kvm_arch_commit_memory_region from
'struct kvm_userspace_memory_region *' to 'struct kvm_memory_slot * new', but it
requires changes on other non-x86 ARCH too, so avoid it now.

Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-29 15:31:37 +01:00
Nadav Amit bac155310b KVM: x86: 32-bit wraparound read/write not emulated correctly
If we got a wraparound of 32-bit operand, and the limit is 0xffffffff, read and
writes should be successful. It just needs to be done in two segments.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-26 12:15:18 +01:00
Nadav Amit 801806d956 KVM: x86: IRET emulation does not clear NMI masking
The IRET instruction should clear NMI masking, but the current implementation
does not do so.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-26 12:14:42 +01:00
Paolo Bonzini 8fff5e374a KVM: s390: fixes and features for kvm/next (3.20)
1. Generic
 - sparse warning (make function static)
 - optimize locking
 - bugfixes for interrupt injection
 - fix MVPG addressing modes
 
 2. hrtimer/wakeup fun
 A recent change can cause KVM hangs if adjtime is used in the host.
 The hrtimer might wake up too early or too late. Too early is fatal
 as vcpu_block will see that the wakeup condition is not met and
 sleep again. This CPU might never wake up again.
 This series addresses this problem. adjclock slowing down the host
 clock will result in too late wakeups. This will require more work.
 In addition to that we also change the hrtimer from REALTIME to
 MONOTONIC to avoid similar problems with timedatectl set-time.
 
 3. sigp rework
 We will move all "slow" sigps to QEMU (protected with a capability that
 can be enabled) to avoid several races between concurrent SIGP orders.
 
 4. Optimize the shadow page table
 Provide an interface to announce the maximum guest size. The kernel
 will use that to make the pagetable 2,3,4 (or theoretically) 5 levels.
 
 5. Provide an interface to set the guest TOD
 We now use two vm attributes instead of two oneregs, as oneregs are
 vcpu ioctl and we don't want to call them from other threads.
 
 6. Protected key functions
 The real HMC allows to enable/disable protected key CPACF functions.
 Lets provide an implementation + an interface for QEMU to activate
 this the protected key instructions.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.14 (GNU/Linux)
 
 iQIcBAABAgAGBQJUwj60AAoJEBF7vIC1phx8iV0QAKq1LZRTmgTLS2fd0oyWKZeN
 ShWUIUiB+7IUiuogYXZMfqOm61oogxwc95Ti+3tpSWYwkzUWagpS/RJQze7E1HOc
 3pHpXwrR01ueUT6uVV4xc/vmVIlQAIl/ScRDDPahlAT2crCleWcKVC9l0zBs/Kut
 IrfzN9pJcrkmXD178CDP8/VwXsn02ptLQEpidGibGHCd03YVFjp3X0wfwNdQxMbU
 qOwNYCz3SLfDm5gsybO2DG+aVY3AbM2ZOJt/qLv2j4Phz4XB4t4W9iJnAefSz7JA
 W4677wbMQpfZlUQYhI78H/Cl9SfWAuLug1xk83O/+lbEiR5u+8zLxB69dkFTiBaH
 442OY957T6TQZ/V9d0jDo2XxFrcaU9OONbVLsfBQ56Vwv5cAg9/7zqG8eqH7Nq9R
 gU3fQesgD4N0Kpa77T9k45TT/hBRnUEtsGixAPT6QYKyE6cK4AJATHKSjMSLbdfj
 ELbt0p2mVtKhuCcANfEx54U2CxOrg5ElBmPz8hRw0OkXdwpqh1sGKmt0govcHP1I
 BGSzE9G4mswwI1bQ7cqcyTk/lwL8g3+KQmRJoOcgCveQlnY12X5zGD5DhuPMPiIT
 VENqbcTzjlxdu+4t7Enml+rXl7ySsewT9L231SSrbLsTQVgCudD1B9m72WLu5ZUT
 9/Z6znv6tkeKV5rM9DYE
 =zLjR
 -----END PGP SIGNATURE-----

Merge tag 'kvm-s390-next-20150122' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-next

KVM: s390: fixes and features for kvm/next (3.20)

1. Generic
- sparse warning (make function static)
- optimize locking
- bugfixes for interrupt injection
- fix MVPG addressing modes

2. hrtimer/wakeup fun
A recent change can cause KVM hangs if adjtime is used in the host.
The hrtimer might wake up too early or too late. Too early is fatal
as vcpu_block will see that the wakeup condition is not met and
sleep again. This CPU might never wake up again.
This series addresses this problem. adjclock slowing down the host
clock will result in too late wakeups. This will require more work.
In addition to that we also change the hrtimer from REALTIME to
MONOTONIC to avoid similar problems with timedatectl set-time.

3. sigp rework
We will move all "slow" sigps to QEMU (protected with a capability that
can be enabled) to avoid several races between concurrent SIGP orders.

4. Optimize the shadow page table
Provide an interface to announce the maximum guest size. The kernel
will use that to make the pagetable 2,3,4 (or theoretically) 5 levels.

5. Provide an interface to set the guest TOD
We now use two vm attributes instead of two oneregs, as oneregs are
vcpu ioctl and we don't want to call them from other threads.

6. Protected key functions
The real HMC allows to enable/disable protected key CPACF functions.
Lets provide an implementation + an interface for QEMU to activate
this the protected key instructions.
2015-01-23 14:33:36 +01:00
Paolo Bonzini 1c6007d59a KVM/ARM changes for v3.20 including GICv3 emulation, dirty page logging, added
trace symbols, and adding an explicit VGIC init device control IOCTL.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJUwhsKAAoJEEtpOizt6ddyuSEH/ia2uf07N0i+C1dPKYiqhKEd
 nFqBvgrhAMVztWLmy1Wq4SnO9YNd+CrPYATrfCiYsYQ9aKc09+qDq+uo06bVpZXz
 KsHjVGUsdyJ4qRqjDixkPvZviGIXa6C//+hcwg1XH2nit1uHmXVupzB9dDz3ZM2l
 GCwApdRdaaUVDt5Ud2ljqIWZa18Qf/5/HD8MdPXpmotDOKucL6pBr/1R1XWueCU/
 ejRs/qy3EFyMWdEdfGFAMCa0ZvHbPmsJmvB/EgkyUnuJj77ptA0jNo1jtzSfEyis
 53x4ffWnIsPl9yqhk0oKerIALVUvV4A7/me2ya6tsQ5fiBX7lJ3+qwggvCkWQzw=
 =fMS2
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-3.20' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-next

KVM/ARM changes for v3.20 including GICv3 emulation, dirty page logging, added
trace symbols, and adding an explicit VGIC init device control IOCTL.

Conflicts:
	arch/arm64/include/asm/kvm_arm.h
	arch/arm64/kvm/handle_exit.c
2015-01-23 13:39:51 +01:00
Dominik Dingel 31928aa586 KVM: remove unneeded return value of vcpu_postcreate
The return value of kvm_arch_vcpu_postcreate is not checked in its
caller.  This is okay, because only x86 provides vcpu_postcreate right
now and it could only fail if vcpu_load failed.  But that is not
possible during KVM_CREATE_VCPU (kvm_arch_vcpu_load is void, too), so
just get rid of the unchecked return value.

Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2015-01-23 13:24:52 +01:00
Marcelo Tosatti 54750f2cf0 KVM: x86: workaround SuSE's 2.6.16 pvclock vs masterclock issue
SuSE's 2.6.16 kernel fails to boot if the delta between tsc_timestamp
and rdtsc is larger than a given threshold:

 * If we get more than the below threshold into the future, we rerequest
 * the real time from the host again which has only little offset then
 * that we need to adjust using the TSC.
 *
 * For now that threshold is 1/5th of a jiffie. That should be good
 * enough accuracy for completely broken systems, but also give us swing
 * to not call out to the host all the time.
 */
#define PVCLOCK_DELTA_MAX ((1000000000ULL / HZ) / 5)

Disable masterclock support (which increases said delta) in case the
boot vcpu does not use MSR_KVM_SYSTEM_TIME_NEW.

Upstreams kernels which support pvclock vsyscalls (and therefore make
use of PVCLOCK_STABLE_BIT) use MSR_KVM_SYSTEM_TIME_NEW.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-20 20:38:39 +01:00
Fengguang Wu 69b0049a89 KVM: fix "Should it be static?" warnings from sparse
arch/x86/kvm/x86.c:495:5: sparse: symbol 'kvm_read_nested_guest_page' was not declared. Should it be static?
arch/x86/kvm/x86.c:646:5: sparse: symbol '__kvm_set_xcr' was not declared. Should it be static?
arch/x86/kvm/x86.c:1183:15: sparse: symbol 'max_tsc_khz' was not declared. Should it be static?
arch/x86/kvm/x86.c:1237:6: sparse: symbol 'kvm_track_tsc_matching' was not declared. Should it be static?

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-20 20:38:35 +01:00
Paolo Bonzini e108ff2f80 KVM: x86: switch to kvm_get_dirty_log_protect
We now have a generic function that does most of the work of
kvm_vm_ioctl_get_dirty_log, now use it.

Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
2015-01-16 14:40:14 +01:00
Radim Krčmář defcf51fa9 KVM: x86: allow TSC deadline timer on all hosts
Emulation does not utilize the feature.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-09 10:23:54 +01:00
Nicholas Krause bab5bb3982 kvm: x86: Remove kvm_make_request from lapic.c
Adds a function kvm_vcpu_set_pending_timer instead of calling
kvm_make_request in lapic.c.

Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:48:08 +01:00
Marcelo Tosatti d0659d946b KVM: x86: add option to advance tscdeadline hrtimer expiration
For the hrtimer which emulates the tscdeadline timer in the guest,
add an option to advance expiration, and busy spin on VM-entry waiting
for the actual expiration time to elapse.

This allows achieving low latencies in cyclictest (or any scenario
which requires strict timing regarding timer expiration).

Reduces average cyclictest latency from 12us to 8us
on Core i5 desktop.

Note: this option requires tuning to find the appropriate value
for a particular hardware/guest combination. One method is to measure the
average delay between apic_timer_fn and VM-entry.
Another method is to start with 1000ns, and increase the value
in say 500ns increments until avg cyclictest numbers stop decreasing.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:47:30 +01:00
Wincy Van ff651cb613 KVM: nVMX: Add nested msr load/restore algorithm
Several hypervisors need MSR auto load/restore feature.
We read MSRs from VM-entry MSR load area which specified by L1,
and load them via kvm_set_msr in the nested entry.
When nested exit occurs, we get MSRs via kvm_get_msr, writing
them to L1`s MSR store area. After this, we read MSRs from VM-exit
MSR load area, and load them via kvm_set_msr.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-01-08 22:45:14 +01:00
Paolo Bonzini df1daba7d1 KVM: x86: support XSAVES usage in the host
Userspace is expecting non-compacted format for KVM_GET_XSAVE, but
struct xsave_struct might be using the compacted format.  Convert
in order to preserve userspace ABI.

Likewise, userspace is passing non-compacted format for KVM_SET_XSAVE
but the kernel will pass it to XRSTORS, and we need to convert back.

Fixes: f31a9f7c71
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: stable@vger.kernel.org
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05 13:57:05 +01:00
Paolo Bonzini 2b4a273b42 kvm: x86: avoid warning about potential shift wrapping bug
cs.base is declared as a __u64 variable and vector is a u32 so this
causes a static checker warning.  The user indeed can set "sipi_vector"
to any u32 value in kvm_vcpu_ioctl_x86_set_vcpu_events(), but the
value should really have 8-bit precision only.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-24 16:53:50 +01:00
Paolo Bonzini c9eab58f64 KVM: x86: move device assignment out of kvm_host.h
Create a new header, and hide the device assignment functions there.
Move struct kvm_assigned_dev_kernel to assigned-dev.c by modifying
arch/x86/kvm/iommu.c to take a PCI device struct.

Based on a patch by Radim Krcmar <rkrcmark@redhat.com>.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-24 16:53:50 +01:00
Radim Krčmář c274e03af7 kvm: x86: move assigned-dev.c and iommu.c to arch/x86/
Now that ia64 is gone, we can hide deprecated device assignment in x86.

Notable changes:
 - kvm_vm_ioctl_assigned_device() was moved to x86/kvm_arch_vm_ioctl()

The easy parts were removed from generic kvm code, remaining
 - kvm_iommu_(un)map_pages() would require new code to be moved
 - struct kvm_assigned_dev_kernel depends on struct kvm_irq_ack_notifier

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-23 18:33:36 +01:00
Paolo Bonzini 6ef768fac9 kvm: x86: move ioapic.c and irq_comm.c back to arch/x86/
ia64 does not need them anymore.  Ack notifiers become x86-specific
too.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-21 18:02:37 +01:00
Chris J Arges d913b90435 kvm: svm: move WARN_ON in svm_adjust_tsc_offset
When running the tsc_adjust kvm-unit-test on an AMD processor with the
IA32_TSC_ADJUST feature enabled, the WARN_ON in svm_adjust_tsc_offset can be
triggered. This WARN_ON checks for a negative adjustment in case __scale_tsc
is called; however it may trigger unnecessary warnings.

This patch moves the WARN_ON to trigger only if __scale_tsc will actually be
called from svm_adjust_tsc_offset. In addition make adj in kvm_set_msr_common
s64 since this can have signed values.

Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-13 11:56:11 +01:00
Paolo Bonzini ac146235d4 KVM: x86: fix warning on 32-bit compilation
PCIDs are only supported in 64-bit mode.  No need to clear bit 63
of CR3 unless the host is 64-bit.

Reported by Fengguang Wu's autobuilder.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-10 13:53:25 +01:00
David Matlack ce1a5e60a6 kvm: x86: add trace event for pvclock updates
The new trace event records:
  * the id of vcpu being updated
  * the pvclock_vcpu_time_info struct being written to guest memory

This is useful for debugging pvclock bugs, such as the bug fixed by
"[PATCH] kvm: x86: Fix kvm clock versioning.".

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-08 08:20:55 +01:00
Owen Hofmann 09a0c3f110 kvm: x86: Fix kvm clock versioning.
kvm updates the version number for the guest paravirt clock structure by
incrementing the version of its private copy. It does not read the guest
version, so will write version = 2 in the first update for every new VM,
including after restoring a saved state. If guest state is saved during
reading the clock, it could read and accept struct fields and guest TSC
from two different updates. This changes the code to increment the guest
version and write it back.

Signed-off-by: Owen Hofmann <osh@google.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-08 08:20:54 +01:00
Marcelo Tosatti 7f187922dd KVM: x86: update masterclock values on TSC writes
When the guest writes to the TSC, the masterclock TSC copy must be
updated as well along with the TSC_OFFSET update, otherwise a negative
tsc_timestamp is calculated at kvm_guest_time_update.

Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
"if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
becomes redundant, so remove the do_request boolean and collapse
everything into a single condition.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-08 08:20:53 +01:00
Nadav Amit 38827dbd3f KVM: x86: Do not update EFLAGS on faulting emulation
If the emulation ends in fault, eflags should not be updated.  However, several
instruction emulations (actually all the fastops) currently update eflags, if
the fault was detected afterwards (e.g., #PF during writeback).

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:08 +01:00
Nadav Amit 9d88fca71a KVM: x86: MOV to CR3 can set bit 63
Although Intel SDM mentions bit 63 is reserved, MOV to CR3 can have bit 63 set.
As Intel SDM states in section 4.10.4 "Invalidation of TLBs and
Paging-Structure Caches": " MOV to CR3. ... If CR4.PCIDE = 1 and bit 63 of the
instruction’s source operand is 0 ..."

In other words, bit 63 is not reserved. KVM emulator currently consider bit 63
as reserved. Fix it.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:07 +01:00
Nadav Amit 82b32774c2 KVM: x86: Breakpoints do not consider CS.base
x86 debug registers hold a linear address. Therefore, breakpoints detection
should consider CS.base, and check whether instruction linear address equals
(CS.base + RIP). This patch introduces a function to evaluate RIP linear
address and uses it for breakpoints detection.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:04 +01:00
Nadav Amit 3ffb24681c KVM: x86: No error-code on real-mode exceptions
Real-mode exceptions do not deliver error code. As can be seen in Intel SDM
volume 2, real-mode exceptions do not have parentheses, which indicate
error-code.  To avoid significant changes of the code, the error code is
"removed" during exception queueing.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-07 15:44:02 +01:00
Chao Peng 612263b30c KVM: x86: Enable Intel AVX-512 for guest
Expose Intel AVX-512 feature bits to guest. Also add checks for
xcr0 AVX512 related bits according to spec:
http://download-software.intel.com/sites/default/files/managed/71/2e/319433-017.pdf

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:30 +01:00
Nadav Amit 16f8a6f979 KVM: vmx: Unavailable DR4/5 is checked before CPL
If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD
should be generated even if CPL>0. This is according to Intel SDM Table 6-2:
"Priority Among Simultaneous Exceptions and Interrupts".

Note, that this may happen on the first DR access, even if the host does not
sets debug breakpoints. Obviously, it occurs when the host debugs the guest.

This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr.
The emulator already checks DR4/5 availability in check_dr_read. Nested
virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject
exceptions to the guest.

As for SVM, the patch follows the previous logic as much as possible. Anyhow,
it appears the DR interception code might be buggy - even if the DR access
may cause an exception, the instruction is skipped.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:26 +01:00
Nadav Amit 6bdf06625d KVM: x86: DR7.GD should be cleared upon any #DB exception
Intel SDM 17.2.4 (Debug Control Register (DR7)) says: "The processor clears the
GD flag upon entering to the debug exception handler." This sentence may be
misunderstood as if it happens only on #DB due to debug-register protection,
but it happens regardless to the cause of the #DB.

Fix the behavior to match both real hardware and Bochs.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03 12:07:23 +01:00
Andy Honig 8b3c3104c3 KVM: x86: Prevent host from panicking on shared MSR writes.
The previous patch blocked invalid writes directly when the MSR
is written.  As a precaution, prevent future similar mistakes by
gracefulling handle GPs caused by writes to shared MSRs.

Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
[Remove parts obsoleted by Nadav's patch. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24 13:21:08 +02:00
Nadav Amit 854e8bb1aa KVM: x86: Check non-canonical addresses upon WRMSR
Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
written to certain MSRs. The behavior is "almost" identical for AMD and Intel
(ignoring MSRs that are not implemented in either architecture since they would
anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
non-canonical address is written on Intel but not on AMD (which ignores the top
32-bits).

Accordingly, this patch injects a #GP on the MSRs which behave identically on
Intel and AMD.  To eliminate the differences between the architecutres, the
value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
canonical value before writing instead of injecting a #GP.

Some references from Intel and AMD manuals:

According to Intel SDM description of WRMSR instruction #GP is expected on
WRMSR "If the source register contains a non-canonical address and ECX
specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."

According to AMD manual instruction manual:
LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
form, a general-protection exception (#GP) occurs."
IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
base field must be in canonical form or a #GP fault will occur."
IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
be in canonical form."

This patch fixes CVE-2014-3610.

Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24 13:21:08 +02:00
Linus Torvalds 0429fbc0bd Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu consistent-ops changes from Tejun Heo:
 "Way back, before the current percpu allocator was implemented, static
  and dynamic percpu memory areas were allocated and handled separately
  and had their own accessors.  The distinction has been gone for many
  years now; however, the now duplicate two sets of accessors remained
  with the pointer based ones - this_cpu_*() - evolving various other
  operations over time.  During the process, we also accumulated other
  inconsistent operations.

  This pull request contains Christoph's patches to clean up the
  duplicate accessor situation.  __get_cpu_var() uses are replaced with
  with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

  Unfortunately, the former sometimes is tricky thanks to C being a bit
  messy with the distinction between lvalues and pointers, which led to
  a rather ugly solution for cpumask_var_t involving the introduction of
  this_cpu_cpumask_var_ptr().

  This converts most of the uses but not all.  Christoph will follow up
  with the remaining conversions in this merge window and hopefully
  remove the obsolete accessors"

* 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
  irqchip: Properly fetch the per cpu offset
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
  ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
  percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
  Revert "powerpc: Replace __get_cpu_var uses"
  percpu: Remove __this_cpu_ptr
  clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
  sparc: Replace __get_cpu_var uses
  avr32: Replace __get_cpu_var with __this_cpu_write
  blackfin: Replace __get_cpu_var uses
  tile: Use this_cpu_ptr() for hardware counters
  tile: Replace __get_cpu_var uses
  powerpc: Replace __get_cpu_var uses
  alpha: Replace __get_cpu_var
  ia64: Replace __get_cpu_var uses
  s390: cio driver &__get_cpu_var replacements
  s390: Replace __get_cpu_var uses
  mips: Replace __get_cpu_var uses
  MIPS: Replace __get_cpu_var uses in FPU emulator.
  arm: Replace __this_cpu_ptr with raw_cpu_ptr
  ...
2014-10-15 07:48:18 +02:00
Paolo Bonzini f439ed27f8 kvm: do not handle APIC access page if in-kernel irqchip is not in use
This fixes the following OOPS:

   loaded kvm module (v3.17-rc1-168-gcec26bc)
   BUG: unable to handle kernel paging request at fffffffffffffffe
   IP: [<ffffffff81168449>] put_page+0x9/0x30
   PGD 1e15067 PUD 1e17067 PMD 0
   Oops: 0000 [#1] PREEMPT SMP
    [<ffffffffa063271d>] ? kvm_vcpu_reload_apic_access_page+0x5d/0x70 [kvm]
    [<ffffffffa013b6db>] vmx_vcpu_reset+0x21b/0x470 [kvm_intel]
    [<ffffffffa0658816>] ? kvm_pmu_reset+0x76/0xb0 [kvm]
    [<ffffffffa064032a>] kvm_vcpu_reset+0x15a/0x1b0 [kvm]
    [<ffffffffa06403ac>] kvm_arch_vcpu_setup+0x2c/0x50 [kvm]
    [<ffffffffa062e540>] kvm_vm_ioctl+0x200/0x780 [kvm]
    [<ffffffff81212170>] do_vfs_ioctl+0x2d0/0x4b0
    [<ffffffff8108bd99>] ? __mmdrop+0x69/0xb0
    [<ffffffff812123d1>] SyS_ioctl+0x81/0xa0
    [<ffffffff8112a6f6>] ? __audit_syscall_exit+0x1f6/0x2a0
    [<ffffffff817229e9>] system_call_fastpath+0x16/0x1b
   Code: c6 78 ce a3 81 4c 89 e7 e8 d9 80 ff ff 0f 0b 4c 89 e7 e8 8f f6 ff ff e9 fa fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 1e 8b 47 1c 85 c0 74 27 f0
   RIP  [<ffffffff81193045>] put_page+0x5/0x50

when not using the in-kernel irqchip ("-machine kernel_irqchip=off"
with QEMU).  The fix is to make the same check in
kvm_vcpu_reload_apic_access_page that we already have
in vmx.c's vm_need_virtualize_apic_accesses().

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
Fixes: 4256f43f9f
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-02 19:26:11 +02:00
Tang Chen c24ae0dcd3 kvm: x86: Unpin and remove kvm_arch->apic_access_page
In order to make the APIC access page migratable, stop pinning it in
memory.

And because the APIC access page is not pinned in memory, we can
remove kvm_arch->apic_access_page.  When we need to write its
physical address into vmcs, we use gfn_to_page() to get its page
struct, which is needed to call page_to_phys(); the page is then
immediately unpinned.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:01 +02:00
Tang Chen 4256f43f9f kvm: x86: Add request bit to reload APIC access page address
Currently, the APIC access page is pinned by KVM for the entire life
of the guest.  We want to make it migratable in order to make memory
hot-unplug available for machines that run KVM.

This patch prepares to handle this in generic code, through a new
request bit (that will be set by the MMU notifier) and a new hook
that is called whenever the request bit is processed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:08:00 +02:00
Tang Chen fe71557afb kvm: Add arch specific mmu notifier for page invalidation
This will be used to let the guest run while the APIC access page is
not pinned.  Because subsequent patches will fill in the function
for x86, place the (still empty) x86 implementation in the x86.c file
instead of adding an inline function in kvm_host.h.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:59 +02:00
Chen Yucong 81760dccf8 kvm: x86: use macros to compute bank MSRs
Avoid open coded calculations for bank MSRs by using well-defined
macros that hide the index of higher bank MSRs.

No semantic changes.

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:56 +02:00
Nadav Amit 4566654bb9 KVM: vmx: Inject #GP on invalid PAT CR
Guest which sets the PAT CR to invalid value should get a #GP.  Currently, if
vmx supports loading PAT CR during entry, then the value is not checked.  This
patch makes the required check in that case.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:52 +02:00
Liang Chen 77c3913b74 KVM: x86: directly use kvm_make_request again
A one-line wrapper around kvm_make_request is not particularly
useful. Replace kvm_mmu_flush_tlb() with kvm_make_request().

Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:51 +02:00
Radim Krčmář a70656b63a KVM: x86: count actual tlb flushes
- we count KVM_REQ_TLB_FLUSH requests, not actual flushes
  (KVM can have multiple requests for one flush)
- flushes from kvm_flush_remote_tlbs aren't counted
- it's easy to make a direct request by mistake

Solve these by postponing the counting to kvm_check_request().

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:50 +02:00
Nadav Amit a2b9e6c1a3 KVM: x86: Don't report guest userspace emulation error to userspace
Commit fc3a9157d3 ("KVM: X86: Don't report L2 emulation failures to
user-space") disabled the reporting of L2 (nested guest) emulation failures to
userspace due to race-condition between a vmexit and the instruction emulator.
The same rational applies also to userspace applications that are permitted by
the guest OS to access MMIO area or perform PIO.

This patch extends the current behavior - of injecting a #UD instead of
reporting it to userspace - also for guest userspace code.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-24 14:07:49 +02:00
Tang Chen a255d4795f kvm: Remove ept_identity_pagetable from struct kvm_arch.
kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
it is never used to refer to the page at all.

In vcpu initialization, it indicates two things:
1. indicates if ept page is allocated
2. indicates if a memory slot for identity page is initialized

Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
identity pagetable is initialized. So we can remove ept_identity_pagetable.

NOTE: In the original code, ept identity pagetable page is pinned in memroy.
      As a result, it cannot be migrated/hot-removed. After this patch, since
      kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
      is no longer pinned in memory. And it can be migrated/hot-removed.

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-17 13:10:11 +02:00
Guo Hui Liu 105b21bbf6 KVM: x86: Use kvm_make_request when applicable
This patch replace the set_bit method by kvm_make_request
to make code more readable and consistent.

Signed-off-by: Guo Hui Liu <liuguohui@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-16 14:44:20 +02:00
Paolo Bonzini 54987b7afa KVM: x86: propagate exception from permission checks on the nested page fault
Currently, if a permission error happens during the translation of
the final GPA to HPA, walk_addr_generic returns 0 but does not fill
in walker->fault.  To avoid this, add an x86_exception* argument
to the translate_gpa function, and let it fill in walker->fault.
The nested_page_fault field will be true, since the walk_mmu is the
nested_mmu and translate_gpu instead operates on the "outer" (NPT)
instance.

Reported-by: Valentine Sinitsyn <valentine.sinitsyn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-05 12:01:13 +02:00
Paolo Bonzini ef54bcfeea KVM: x86: skip writeback on injection of nested exception
If a nested page fault happens during emulation, we will inject a vmexit,
not a page fault.  However because writeback happens after the injection,
we will write ctxt->eip from L2 into the L1 EIP.  We do not write back
if an instruction caused an interception vmexit---do the same for page
faults.

Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-05 12:01:06 +02:00
Tiejun Chen d143148383 KVM: mmio: cleanup kvm_set_mmio_spte_mask
Just reuse rsvd_bits() inside kvm_set_mmio_spte_mask()
for slightly better code.

Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-03 10:04:10 +02:00
Paolo Bonzini fd2752352b KVM: x86: use guest maxphyaddr to check MTRR values
The check introduced in commit d7a2a246a1 (KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs, 2014-08-19)
will break if the guest maxphyaddr is higher than the host's (which
sometimes happens depending on your hardware and how QEMU is
configured).

To fix this, use cpuid_maxphyaddr similar to how the APIC_BASE MSR
does already.

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 18:56:24 +02:00
Radim Krčmář 13a34e067e KVM: remove garbage arg to *hardware_{en,dis}able
In the beggining was on_each_cpu(), which required an unused argument to
kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten.

Remove unnecessary arguments that stem from this.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29 16:35:55 +02:00
Christoph Lameter 89cbc76768 x86: Replace __get_cpu_var uses
__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less registers
are used when code is generated.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

	DEFINE_PER_CPU(int, y);
	int x = __get_cpu_var(y)

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y)
	__get_cpu_var(y) = x;

   Converts to

	__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++

   Converts to

	__this_cpu_inc(y)

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-26 13:45:49 -04:00
Radim Krčmář 7b46268d29 KVM: trace kvm_ple_window grow/shrink
Tracepoint for dynamic PLE window, fired on every potential change.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:23 +02:00
Radim Krčmář ae97a3b818 KVM: x86: introduce sched_in to kvm_x86_ops
sched_in preempt notifier is available for x86, allow its use in
specific virtualization technlogies as well.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:22 +02:00
Radim Krčmář e790d9ef64 KVM: add kvm_arch_sched_in
Introduce preempt notifiers for architecture specific code.
Advantage over creating a new notifier in every arch is slightly simpler
code and guaranteed call order with respect to kvm_sched_in.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-21 18:45:21 +02:00
Paolo Bonzini e0ad0b477c KVM: emulate: warn on invalid or uninitialized exception numbers
These were reported when running Jailhouse on AMD processors.

Initialize ctxt->exception.vector with an invalid exception number,
and warn if it remained invalid even though the emulator got
an X86EMUL_PROPAGATE_FAULT return code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-20 13:01:26 +02:00
Wanpeng Li d7a2a246a1 KVM: x86: #GP when attempts to write reserved bits of Variable Range MTRRs
Section 11.11.2.3 of the SDM mentions "All other bits in the IA32_MTRR_PHYSBASEn
and IA32_MTRR_PHYSMASKn registers are reserved; the processor generates a
general-protection exception(#GP) if software attempts to write to them". This
patch do it in kvm.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19 15:12:29 +02:00
Wanpeng Li adfb5d2746 KVM: x86: fix check legal type of Variable Range MTRRs
The first entry in each pair(IA32_MTRR_PHYSBASEn) defines the base
address and memory type for the range; the second entry(IA32_MTRR_PHYSMASKn)
contains a mask used to determine the address range. The legal values
for the type field of IA32_MTRR_PHYSBASEn are 0,1,4,5, and 6. However,
IA32_MTRR_PHYSMASKn don't have type field. This patch avoid check if
the type field is legal for IA32_MTRR_PHYSMASKn.

Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19 15:12:29 +02:00
Wei Huang dc9b2d933a KVM: SVM: add rdmsr support for AMD event registers
Current KVM only supports RDMSR for K7_EVNTSEL0 and K7_PERFCTR0
MSRs. Reading the rest MSRs will trigger KVM to inject #GP into
guest VM. This causes a warning message "Failed to access perfctr
msr (MSR c0010001 is ffffffffffffffff)" on AMD host. This patch
adds RDMSR support for all K7_EVNTSELn and K7_PERFCTRn registers
and thus supresses the warning message.

Signed-off-by: Wei Huang <wehuang@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19 15:12:28 +02:00
Linus Torvalds 66bb0aa077 Here are the PPC and ARM changes for KVM, which I separated because
they had small conflicts (respectively within KVM documentation,
 and with 3.16-rc changes).  Since they were all within the subsystem,
 I took care of them.
 
 Stephen Rothwell reported some snags in PPC builds, but they are all
 fixed now; the latest linux-next report was clean.
 
 New features for ARM include:
 - KVM VGIC v2 emulation on GICv3 hardware
 - Big-Endian support for arm/arm64 (guest and host)
 - Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)
 
 And for PPC:
 - Book3S: Good number of LE host fixes, enable HV on LE
 - Book3S HV: Add in-guest debug support
 
 This release drops support for KVM on the PPC440.  As a result, the
 PPC merge removes more lines than it adds. :)
 
 I also included an x86 change, since Davidlohr tied it to an independent
 bug report and the reporter quickly provided a Tested-by; there was no
 reason to wait for -rc2.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJT4iIJAAoJEBvWZb6bTYbyZqoP/3Wxy8NWPFJ8HGt81NHlGnDS
 a9UbL7EibcOEG+aaKqmtBglTD5YDiGBDNCxxiSJaDHt+grLN4fsWIliJob1nJFoO
 90f89EWN2XjeCrJXA5nUoeg5tpc5OoYKsiP6pTgzIwkP8vvs/H1+zpcTS/UmYsr/
 qipVMMsM+zZeHWZcSbqjW88z7YqIn1sr5282wJ85cbyv4KGizb/G4dyPuDqLb6np
 hkAD8Ah6VV2suQ2FSy7G2fg20R0vglUi60hkEHLoCBPVqJCl7SmC8MvxNbjBnP8S
 J36R0R0u1wHYKzAGooLJGVOZ/o/gSiVqKX+++L2EvJBN+kuA6u/7fxLyBT+LwDAE
 IF/Aln5rpg1fe+eywvhz86WljTVEQ8bO1zVsIQUPY+/ZOPedZHMwyvXft8ogbjSp
 2m9OJ/3e8Aggh0OeHpCDoeow+QDUXvX0YdCw+2Yh0p+7VMXqkyp0QEiBu38jrusC
 rB3VNifJbDSWLKdG9LfCAPHnxZD2XYEwv2WFBo6KQOGMGHfx0GXpCOL/jQihrhA6
 HtEG5Bs3lvnHQemdpUZ58xojiABbMaUPdcnPXQQEp23WhZzrfLMLzqVG0VYnhSsC
 9pi7MJj8c31rqx5WU2oRM28i/BvNxN0NCtkDpineO5s3f89Ws1xnwxqlm38AKP0J
 irJQTYFEqec+GM9JK1rG
 =hyQP
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull second round of KVM changes from Paolo Bonzini:
 "Here are the PPC and ARM changes for KVM, which I separated because
  they had small conflicts (respectively within KVM documentation, and
  with 3.16-rc changes).  Since they were all within the subsystem, I
  took care of them.

  Stephen Rothwell reported some snags in PPC builds, but they are all
  fixed now; the latest linux-next report was clean.

  New features for ARM include:
   - KVM VGIC v2 emulation on GICv3 hardware
   - Big-Endian support for arm/arm64 (guest and host)
   - Debug Architecture support for arm64 (arm32 is on Christoffer's todo list)

  And for PPC:
   - Book3S: Good number of LE host fixes, enable HV on LE
   - Book3S HV: Add in-guest debug support

  This release drops support for KVM on the PPC440.  As a result, the
  PPC merge removes more lines than it adds.  :)

  I also included an x86 change, since Davidlohr tied it to an
  independent bug report and the reporter quickly provided a Tested-by;
  there was no reason to wait for -rc2"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (122 commits)
  KVM: Move more code under CONFIG_HAVE_KVM_IRQFD
  KVM: nVMX: fix "acknowledge interrupt on exit" when APICv is in use
  KVM: nVMX: Fix nested vmexit ack intr before load vmcs01
  KVM: PPC: Enable IRQFD support for the XICS interrupt controller
  KVM: Give IRQFD its own separate enabling Kconfig option
  KVM: Move irq notifier implementation into eventfd.c
  KVM: Move all accesses to kvm::irq_routing into irqchip.c
  KVM: irqchip: Provide and use accessors for irq routing table
  KVM: Don't keep reference to irq routing table in irqfd struct
  KVM: PPC: drop duplicate tracepoint
  arm64: KVM: fix 64bit CP15 VM access for 32bit guests
  KVM: arm64: GICv3: mandate page-aligned GICV region
  arm64: KVM: GICv3: move system register access to msr_s/mrs_s
  KVM: PPC: PR: Handle FSCR feature deselects
  KVM: PPC: HV: Remove generic instruction emulation
  KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
  KVM: PPC: Remove DCR handling
  KVM: PPC: Expose helper functions for data/inst faults
  KVM: PPC: Separate loadstore emulation from priv emulation
  KVM: PPC: Handle magic page in kvmppc_ld/st
  ...
2014-08-07 11:35:30 -07:00
Linus Torvalds e7fda6c4c3 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer and time updates from Thomas Gleixner:
 "A rather large update of timers, timekeeping & co

   - Core timekeeping code is year-2038 safe now for 32bit machines.
     Now we just need to fix all in kernel users and the gazillion of
     user space interfaces which rely on timespec/timeval :)

   - Better cache layout for the timekeeping internal data structures.

   - Proper nanosecond based interfaces for in kernel users.

   - Tree wide cleanup of code which wants nanoseconds but does hoops
     and loops to convert back and forth from timespecs.  Some of it
     definitely belongs into the ugly code museum.

   - Consolidation of the timekeeping interface zoo.

   - A fast NMI safe accessor to clock monotonic for tracing.  This is a
     long standing request to support correlated user/kernel space
     traces.  With proper NTP frequency correction it's also suitable
     for correlation of traces accross separate machines.

   - Checkpoint/restart support for timerfd.

   - A few NOHZ[_FULL] improvements in the [hr]timer code.

   - Code move from kernel to kernel/time of all time* related code.

   - New clocksource/event drivers from the ARM universe.  I'm really
     impressed that despite an architected timer in the newer chips SoC
     manufacturers insist on inventing new and differently broken SoC
     specific timers.

[ Ed. "Impressed"? I don't think that word means what you think it means ]

   - Another round of code move from arch to drivers.  Looks like most
     of the legacy mess in ARM regarding timers is sorted out except for
     a few obnoxious strongholds.

   - The usual updates and fixlets all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
  timekeeping: Fixup typo in update_vsyscall_old definition
  clocksource: document some basic timekeeping concepts
  timekeeping: Use cached ntp_tick_length when accumulating error
  timekeeping: Rework frequency adjustments to work better w/ nohz
  timekeeping: Minor fixup for timespec64->timespec assignment
  ftrace: Provide trace clocks monotonic
  timekeeping: Provide fast and NMI safe access to CLOCK_MONOTONIC
  seqcount: Add raw_write_seqcount_latch()
  seqcount: Provide raw_read_seqcount()
  timekeeping: Use tk_read_base as argument for timekeeping_get_ns()
  timekeeping: Create struct tk_read_base and use it in struct timekeeper
  timekeeping: Restructure the timekeeper some more
  clocksource: Get rid of cycle_last
  clocksource: Move cycle_last validation to core code
  clocksource: Make delta calculation a function
  wireless: ath9k: Get rid of timespec conversions
  drm: vmwgfx: Use nsec based interfaces
  drm: i915: Use nsec based interfaces
  timekeeping: Provide ktime_get_raw()
  hangcheck-timer: Use ktime_get_ns()
  ...
2014-08-05 17:46:42 -07:00
Paolo Bonzini cc568ead3c Patch queue for ppc - 2014-08-01
Highlights in this release include:
 
   - BookE: Rework instruction fetch, not racy anymore now
   - BookE HV: Fix ONE_REG accessors for some in-hardware registers
   - Book3S: Good number of LE host fixes, enable HV on LE
   - Book3S: Some misc bug fixes
   - Book3S HV: Add in-guest debug support
   - Book3S HV: Preload cache lines on context switch
   - Remove 440 support
 
 Alexander Graf (31):
       KVM: PPC: Book3s PR: Disable AIL mode with OPAL
       KVM: PPC: Book3s HV: Fix tlbie compile error
       KVM: PPC: Book3S PR: Handle hyp doorbell exits
       KVM: PPC: Book3S PR: Fix ABIv2 on LE
       KVM: PPC: Book3S PR: Fix sparse endian checks
       PPC: Add asm helpers for BE 32bit load/store
       KVM: PPC: Book3S HV: Make HTAB code LE host aware
       KVM: PPC: Book3S HV: Access guest VPA in BE
       KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
       KVM: PPC: Book3S HV: Access XICS in BE
       KVM: PPC: Book3S HV: Fix ABIv2 on LE
       KVM: PPC: Book3S HV: Enable for little endian hosts
       KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct
       KVM: PPC: Deflect page write faults properly in kvmppc_st
       KVM: PPC: Book3S: Stop PTE lookup on write errors
       KVM: PPC: Book3S: Add hack for split real mode
       KVM: PPC: Book3S: Make magic page properly 4k mappable
       KVM: PPC: Remove 440 support
       KVM: Rename and add argument to check_extension
       KVM: Allow KVM_CHECK_EXTENSION on the vm fd
       KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
       KVM: PPC: Implement kvmppc_xlate for all targets
       KVM: PPC: Move kvmppc_ld/st to common code
       KVM: PPC: Remove kvmppc_bad_hva()
       KVM: PPC: Use kvm_read_guest in kvmppc_ld
       KVM: PPC: Handle magic page in kvmppc_ld/st
       KVM: PPC: Separate loadstore emulation from priv emulation
       KVM: PPC: Expose helper functions for data/inst faults
       KVM: PPC: Remove DCR handling
       KVM: PPC: HV: Remove generic instruction emulation
       KVM: PPC: PR: Handle FSCR feature deselects
 
 Alexey Kardashevskiy (1):
       KVM: PPC: Book3S: Fix LPCR one_reg interface
 
 Aneesh Kumar K.V (4):
       KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
       KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
       KVM: PPC: BOOK3S: PR: Emulate instruction counter
       KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page
 
 Anton Blanchard (2):
       KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
       KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()
 
 Bharat Bhushan (10):
       kvm: ppc: bookehv: Added wrapper macros for shadow registers
       kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
       kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
       kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
       kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
       kvm: ppc: Add SPRN_EPR get helper function
       kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
       KVM: PPC: Booke-hv: Add one reg interface for SPRG9
       KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
       KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr
 
 Michael Neuling (1):
       KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling
 
 Mihai Caraman (8):
       KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
       KVM: PPC: e500: Fix default tlb for victim hint
       KVM: PPC: e500: Emulate power management control SPR
       KVM: PPC: e500mc: Revert "add load inst fixup"
       KVM: PPC: Book3e: Add TLBSEL/TSIZE defines for MAS0/1
       KVM: PPC: Book3s: Remove kvmppc_read_inst() function
       KVM: PPC: Allow kvmppc_get_last_inst() to fail
       KVM: PPC: Bookehv: Get vcpu's last instruction for emulation
 
 Paul Mackerras (4):
       KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
       KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
       KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
       KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication
 
 Stewart Smith (2):
       Split out struct kvmppc_vcore creation to separate function
       Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.19 (GNU/Linux)
 
 iQIcBAABAgAGBQJT21skAAoJECszeR4D/txgeFEP/AzJopN7s//W33CfyBqURHXp
 XALCyAw+S67gtcaTZbxomcG1xuT8Lj9WEw28iz3rCtAnJwIxsY63xrI1nXMzTaI2
 p1rC0ai5Qy+nlEbd6L78spZy/Nzh8DFYGWx78iUSO1mYD8xywJwtoiBA539pwp8j
 8N+mgn61Hwhv31bKtsZlmzXymVr/jbTp5LVuxsBLJwD2lgT49g+4uBnX2cG/iXkg
 Rzbh7LxoNNXrSPI8sYmTWu/81aeXteeX70ja6DHuV5dWLNTuAXJrh5EUfeAZqBrV
 aYcLWUYmIyB87txNmt6ZGVar2p3jr2Xhb9mKx+EN4dbehblanLc1PUqlHd0q3dKc
 Nt60ByqpZn+qDAK86dShSZLEe+GT3lovvE76CqVXD4Er+OUEkc9JoxhN1cof/Gb0
 o6uwZ2isXHRdGoZx5vb4s3UTOlwZGtoL/CyY/HD/ujYDSURkCGbxLj3kkecSY8ut
 QdDAWsC15BwsHtKLr5Zwjp2w+0eGq2QJgfvO0zqWFiz9k33SCBCUpwluFeqh27Hi
 aR5Wir3j+MIw9G8XlYlDJWYfi0h/SZ4G7hh7jSu26NBNBzQsDa8ow/cLzdMhdUwH
 OYSaeqVk5wiRb9to1uq1NQWPA0uRAx3BSjjvr9MCGRqmvn+FV5nj637YWUT+53Hi
 aSvg/U2npghLPPG2cihu
 =JuLr
 -----END PGP SIGNATURE-----

Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into kvm

Patch queue for ppc - 2014-08-01

Highlights in this release include:

  - BookE: Rework instruction fetch, not racy anymore now
  - BookE HV: Fix ONE_REG accessors for some in-hardware registers
  - Book3S: Good number of LE host fixes, enable HV on LE
  - Book3S: Some misc bug fixes
  - Book3S HV: Add in-guest debug support
  - Book3S HV: Preload cache lines on context switch
  - Remove 440 support

Alexander Graf (31):
      KVM: PPC: Book3s PR: Disable AIL mode with OPAL
      KVM: PPC: Book3s HV: Fix tlbie compile error
      KVM: PPC: Book3S PR: Handle hyp doorbell exits
      KVM: PPC: Book3S PR: Fix ABIv2 on LE
      KVM: PPC: Book3S PR: Fix sparse endian checks
      PPC: Add asm helpers for BE 32bit load/store
      KVM: PPC: Book3S HV: Make HTAB code LE host aware
      KVM: PPC: Book3S HV: Access guest VPA in BE
      KVM: PPC: Book3S HV: Access host lppaca and shadow slb in BE
      KVM: PPC: Book3S HV: Access XICS in BE
      KVM: PPC: Book3S HV: Fix ABIv2 on LE
      KVM: PPC: Book3S HV: Enable for little endian hosts
      KVM: PPC: Book3S: Move vcore definition to end of kvm_arch struct
      KVM: PPC: Deflect page write faults properly in kvmppc_st
      KVM: PPC: Book3S: Stop PTE lookup on write errors
      KVM: PPC: Book3S: Add hack for split real mode
      KVM: PPC: Book3S: Make magic page properly 4k mappable
      KVM: PPC: Remove 440 support
      KVM: Rename and add argument to check_extension
      KVM: Allow KVM_CHECK_EXTENSION on the vm fd
      KVM: PPC: Book3S: Provide different CAPs based on HV or PR mode
      KVM: PPC: Implement kvmppc_xlate for all targets
      KVM: PPC: Move kvmppc_ld/st to common code
      KVM: PPC: Remove kvmppc_bad_hva()
      KVM: PPC: Use kvm_read_guest in kvmppc_ld
      KVM: PPC: Handle magic page in kvmppc_ld/st
      KVM: PPC: Separate loadstore emulation from priv emulation
      KVM: PPC: Expose helper functions for data/inst faults
      KVM: PPC: Remove DCR handling
      KVM: PPC: HV: Remove generic instruction emulation
      KVM: PPC: PR: Handle FSCR feature deselects

Alexey Kardashevskiy (1):
      KVM: PPC: Book3S: Fix LPCR one_reg interface

Aneesh Kumar K.V (4):
      KVM: PPC: BOOK3S: PR: Fix PURR and SPURR emulation
      KVM: PPC: BOOK3S: PR: Emulate virtual timebase register
      KVM: PPC: BOOK3S: PR: Emulate instruction counter
      KVM: PPC: BOOK3S: HV: Update compute_tlbie_rb to handle 16MB base page

Anton Blanchard (2):
      KVM: PPC: Book3S HV: Fix ABIv2 indirect branch issue
      KVM: PPC: Assembly functions exported to modules need _GLOBAL_TOC()

Bharat Bhushan (10):
      kvm: ppc: bookehv: Added wrapper macros for shadow registers
      kvm: ppc: booke: Use the shared struct helpers of SRR0 and SRR1
      kvm: ppc: booke: Use the shared struct helpers of SPRN_DEAR
      kvm: ppc: booke: Add shared struct helpers of SPRN_ESR
      kvm: ppc: booke: Use the shared struct helpers for SPRN_SPRG0-7
      kvm: ppc: Add SPRN_EPR get helper function
      kvm: ppc: bookehv: Save restore SPRN_SPRG9 on guest entry exit
      KVM: PPC: Booke-hv: Add one reg interface for SPRG9
      KVM: PPC: Remove comment saying SPRG1 is used for vcpu pointer
      KVM: PPC: BOOKEHV: rename e500hv_spr to bookehv_spr

Michael Neuling (1):
      KVM: PPC: Book3S HV: Add H_SET_MODE hcall handling

Mihai Caraman (8):
      KVM: PPC: e500mc: Enhance tlb invalidation condition on vcpu schedule
      KVM: PPC: e500: Fix default tlb for victim hint
      KVM: PPC: e500: Emulate power management control SPR
      KVM: PPC: e500mc: Revert "add load inst fixup"
      KVM: PPC: Book3e: Add TLBSEL/TSIZE defines for MAS0/1
      KVM: PPC: Book3s: Remove kvmppc_read_inst() function
      KVM: PPC: Allow kvmppc_get_last_inst() to fail
      KVM: PPC: Bookehv: Get vcpu's last instruction for emulation

Paul Mackerras (4):
      KVM: PPC: Book3S: Controls for in-kernel sPAPR hypercall handling
      KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabled
      KVM: PPC: Book3S PR: Take SRCU read lock around RTAS kvm_read_guest() call
      KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indication

Stewart Smith (2):
      Split out struct kvmppc_vcore creation to separate function
      Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8

Conflicts:
	Documentation/virtual/kvm/api.txt
2014-08-05 09:58:11 +02:00
Linus Torvalds 8533ce7271 These are the x86, MIPS and s390 changes; PPC and ARM will come in a
few days.
 
 MIPS and s390 have little going on this release; just bugfixes, some
 small, some larger.
 
 The highlights for x86 are nested VMX improvements (Jan Kiszka), optimizations
 for old processor (up to Nehalem, by me and Bandan Das), and a lot of x86
 emulator bugfixes (Nadav Amit).
 
 Stephen Rothwell reported a trivial conflict with the tracing branch.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABAgAGBQJT300XAAoJEBvWZb6bTYby3V8QAJz+XyajnhJ8wH55Vxczz22L
 i2gtUGmBLhEXsBcaVKO4BBfek88lLzg0SGLjfW5wCMQmKtxVlrwTCXNkBoPGjapd
 NwHtWkMKym44PDhRovn7zkSumkxC43uFIBR/ebrhP6Bvhh9s+MnkQUxfw9ILB+YV
 EeKyEG8sSgxFCciuHbp3mIXpDcO6r/ldy6I7009OdyhLoMY+Kvmk7kRe9wtAivdg
 CGJi60QvGOn2RGRPOCEtF6UWr8Ae8fe1t84o0hkXPv/j3jtabzAatXKJa4dYNbIs
 7Mp4NQpxaGV6rq3WCYVeZRxGs+UReGDAS3Il4Z8C9eTOTooSfxdVr8acpM8PY6I8
 UmLT6ECLGycc4ELXrETtR+QLmiXACyJqyVxz4aiLV3kWSWfamKD3hBeQK9NizNcE
 VoPDl+PyISvR1tW4KstBuzfUWAEXi+gO78cqqFr/VW6cl7HKpA1DFQaPfGkYKDae
 2CPwcLwI5/M6RtSgkyXTkEqNZLc2BjldqSeM1lmWjhZVW56X2iqePUL46Vab3Yvt
 U+sELtwEE560NLN3hbaHUsLR1tcUix5w8vTzcXPxgoHQBszHCcAZTWd1XHulr64F
 rp/cangqtkPKcu5j1mNhQs38oLjHI1MUsbQrqFoD4tmHjQ75iXHRFzYGoIVKXyHG
 AnGbQzJzBcdAANhm3LW0
 =UXxV
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM changes from Paolo Bonzini:
 "These are the x86, MIPS and s390 changes; PPC and ARM will come in a
  few days.

  MIPS and s390 have little going on this release; just bugfixes, some
  small, some larger.

  The highlights for x86 are nested VMX improvements (Jan Kiszka),
  optimizations for old processor (up to Nehalem, by me and Bandan Das),
  and a lot of x86 emulator bugfixes (Nadav Amit).

  Stephen Rothwell reported a trivial conflict with the tracing branch"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (104 commits)
  x86/kvm: Resolve shadow warnings in macro expansion
  KVM: s390: rework broken SIGP STOP interrupt handling
  KVM: x86: always exit on EOIs for interrupts listed in the IOAPIC redir table
  KVM: vmx: remove duplicate vmx_mpx_supported() prototype
  KVM: s390: Fix memory leak on busy SIGP stop
  x86/kvm: Resolve shadow warning from min macro
  kvm: Resolve missing-field-initializers warnings
  Replace NR_VMX_MSR with its definition
  KVM: x86: Assertions to check no overrun in MSR lists
  KVM: x86: set rflags.rf during fault injection
  KVM: x86: Setting rflags.rf during rep-string emulation
  KVM: x86: DR6/7.RTM cannot be written
  KVM: nVMX: clean up nested_release_vmcs12 and code around it
  KVM: nVMX: fix lifetime issues for vmcs02
  KVM: x86: Defining missing x86 vectors
  KVM: x86: emulator injects #DB when RFLAGS.RF is set
  KVM: x86: Cleanup of rflags.rf cleaning
  KVM: x86: Clear rflags.rf on emulated instructions
  KVM: x86: popf emulation should not change RF
  KVM: x86: Clearing rflags.rf upon skipped emulated instruction
  ...
2014-08-04 12:16:46 -07:00
Alexander Graf 784aa3d7fb KVM: Rename and add argument to check_extension
In preparation to make the check_extension function available to VM scope
we add a struct kvm * argument to the function header and rename the function
accordingly. It will still be called from the /dev/kvm fd, but with a NULL
argument for struct kvm *.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-28 15:23:17 +02:00
Nadav Amit 0123be429f KVM: x86: Assertions to check no overrun in MSR lists
Currently there is no check whether shared MSRs list overrun the allocated size
which can results in bugs. In addition there is no check that vmx->guest_msrs
has sufficient space to accommodate all the VMX msrs.  This patch adds the
assertions.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-24 14:16:57 +02:00
Nadav Amit d6e8c85456 KVM: x86: set rflags.rf during fault injection
x86 does not automatically set rflags.rf during event injection. This patch
does partial job, setting rflags.rf upon fault injection.  It does not handle
the setting of RF upon interrupt injection on rep-string instruction.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-24 14:04:01 +02:00
Thomas Gleixner d28ede8379 timekeeping: Create struct tk_read_base and use it in struct timekeeper
The members of the new struct are the required ones for the new NMI
safe accessor to clcok monotonic. In order to reuse the existing
timekeeping code and to make the update of the fast NMI safe
timekeepers a simple memcpy use the struct for the timekeeper as well
and convert all users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 15:01:53 -07:00
Thomas Gleixner 4a0e637738 clocksource: Get rid of cycle_last
cycle_last was added to the clocksource to support the TSC
validation. We moved that to the core code, so we can get rid of the
extra copy.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 15:01:52 -07:00
Thomas Gleixner cbcf2dd3b3 x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based
Convert the relevant base data right away to nanoseconds instead of
doing the conversion on every readout. Reduces text size by 160 bytes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 15:01:46 -07:00
Thomas Gleixner bb0b58127c x86: kvm: Use ktime_get_boot_ns()
Use the new nanoseconds based interface and get rid of the timespec
conversion dance.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: kvm@vger.kernel.org
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 15:01:45 -07:00
Nadav Amit 6f43ed01e8 KVM: x86: DR6/7.RTM cannot be written
Haswell and newer Intel CPUs have support for RTM, and in that case DR6.RTM is
not fixed to 1 and DR7.RTM is not fixed to zero. That is not the case in the
current KVM implementation. This bug is apparent only if the MOV-DR instruction
is emulated or the host also debugs the guest.

This patch is a partial fix which enables DR6.RTM and DR7.RTM to be cleared and
set respectively. It also sets DR6.RTM upon every debug exception. Obviously,
it is not a complete fix, as debugging of RTM is still unsupported.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-21 17:17:52 +02:00
Nadav Amit 4161a56906 KVM: x86: emulator injects #DB when RFLAGS.RF is set
If the RFLAGS.RF is set, then no #DB should occur on instruction breakpoints.
However, the KVM emulator injects #DB regardless to RFLAGS.RF. This patch fixes
this behavior. KVM, however, still appears not to update RFLAGS.RF correctly,
regardless of this patch.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-21 13:43:09 +02:00
Nadav Amit bb663c7ada KVM: x86: Clearing rflags.rf upon skipped emulated instruction
When skipping an emulated instruction, rflags.rf should be cleared as it would
be on real x86 CPU.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-21 13:41:32 +02:00
Paolo Bonzini 44583cba91 KVM: x86: use kvm_read_guest_page for emulator accesses
Emulator accesses are always done a page at a time, either by the emulator
itself (for fetches) or because we need to query the MMU for address
translations.  Speed up these accesses by using kvm_read_guest_page
and, in the case of fetches, by inlining kvm_read_guest_virt_helper and
dropping the loop around kvm_read_guest_page.

This final tweak saves 30-100 more clock cycles (4-10%), bringing the
count (as measured by kvm-unit-tests) down to 720-1100 clock cycles on
a Sandy Bridge Xeon host, compared to 2300-3200 before the whole series
and 925-1700 after the first two low-hanging fruit changes.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:14:04 +02:00
Bandan Das 1498507a47 KVM: emulate: move init_decode_cache to emulate.c
Core emulator functions all belong in emulator.c,
x86 should have no knowledge of emulator internals

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:13:59 +02:00
Paolo Bonzini 6addfc4299 KVM: x86: avoid useless set of KVM_REQ_EVENT after emulation
Despite the provisions to emulate up to 130 consecutive instructions, in
practice KVM will emulate just one before exiting handle_invalid_guest_state,
because x86_emulate_instruction always sets KVM_REQ_EVENT.

However, we only need to do this if an interrupt could be injected,
which happens a) if an interrupt shadow bit (STI or MOV SS) has gone
away; b) if the interrupt flag has just been set (other instructions
than STI can set it without enabling an interrupt shadow).

This cuts another 700-900 cycles from the cost of emulating an
instruction (measured on a Sandy Bridge Xeon: 1650-2600 cycles
before the patch on kvm-unit-tests, 925-1700 afterwards).

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:13:57 +02:00
Paolo Bonzini 37ccdcbe07 KVM: x86: return all bits from get_interrupt_shadow
For the next patch we will need to know the full state of the
interrupt shadow; we will then set KVM_REQ_EVENT when one bit
is cleared.

However, right now get_interrupt_shadow only returns the one
corresponding to the emulated instruction, or an unconditional
0 if the emulated instruction does not have an interrupt shadow.
This is confusing and does not allow us to check for cleared
bits as mentioned above.

Clean the callback up, and modify toggle_interruptibility to
match the comment above the call.  As a small result, the
call to set_interrupt_shadow will be skipped in the common
case where int_shadow == 0 && mask == 0.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:13:56 +02:00
Matthias Lange 22d48b2d2a KVM: svm: writes to MSR_K7_HWCR generates GPE in guest
Since commit 575203 the MCE subsystem in the Linux kernel for AMD sets bit 18
in MSR_K7_HWCR. Running such a kernel as a guest in KVM on an AMD host results
in a GPE injected into the guest because kvm_set_msr_common returns 1. This
patch fixes this by masking bit 18 from the MSR value desired by the guest.

Signed-off-by: Matthias Lange <matthias.lange@kernkonzept.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:11:59 +02:00
Nadav Amit 5f7552d4a5 KVM: x86: Pending interrupt may be delivered after INIT
We encountered a scenario in which after an INIT is delivered, a pending
interrupt is delivered, although it was sent before the INIT.  As the SDM
states in section 10.4.7.1, the ISR and the IRR should be cleared after INIT as
KVM does.  This also means that pending interrupts should be cleared.  This
patch clears upon reset (and INIT) the pending interrupts; and at the same
occassion clears the pending exceptions, since they may cause a similar issue.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11 09:11:58 +02:00
Tomasz Grabiec 0d3da0d26e KVM: x86: fix TSC matching
I've observed kvmclock being marked as unstable on a modern
single-socket system with a stable TSC and qemu-1.6.2 or qemu-2.0.0.

The culprit was failure in TSC matching because of overflow of
kvm_arch::nr_vcpus_matched_tsc in case there were multiple TSC writes
in a single synchronization cycle.

Turns out that qemu does multiple TSC writes during init, below is the
evidence of that (qemu-2.0.0):

The first one:

 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xffffffffa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm]
 0xffffffffa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm]

The second one:

 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm]
 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm]
 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]

 #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
 #1  kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
 #2  kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
 #3  kvm_cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641
 #4  cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330
 #5  cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521
 #6  main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390

The third one:

 0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
 0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
 0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
 0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm]
 0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm]
 0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
 0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]

 #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
 #1  kvm_put_msrs  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
 #2  kvm_arch_put_registers  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
 #3  kvm_cpu_synchronize_post_reset  at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635
 #4  cpu_synchronize_post_reset  at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323
 #5  cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512
 #6  main  at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482

The fix is to count each vCPU only once when matched, so that
nr_vcpus_matched_tsc holds the size of the matched set. This is
achieved by reusing generation counters. Every vCPU with
this_tsc_generation == cur_tsc_generation is in the matched set. The
match set is cleared by setting cur_tsc_generation to a value which no
other vCPU is set to (by incrementing it).

I needed to bump up the counter size form u8 to u64 to ensure it never
overflows. Otherwise in cases TSC is not written the same number of
times on each vCPU the counter could overflow and incorrectly indicate
some vCPUs as being in the matched set. This scenario seems unlikely
but I'm not sure if it can be disregarded.

Signed-off-by: Tomasz Grabiec <tgrabiec@cloudius-systems.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09 18:09:57 +02:00
Bandan Das 9242b5b60d KVM: x86: Check for nested events if there is an injectable interrupt
With commit b6b8a1451f that introduced
vmx_check_nested_events, checks for injectable interrupts happen
at different points in time for L1 and L2 that could potentially
cause a race. The regression occurs because KVM_REQ_EVENT is always
set when nested_run_pending is set even if there's no pending interrupt.
Consequently, there could be a small window when check_nested_events
returns without exiting to L1, but an interrupt comes through soon
after and it incorrectly, gets injected to L2 by inject_pending_event
Fix this by adding a call to check for nested events too when a check
for injectable interrupt returns true

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-08 10:06:42 +02:00
Rickard Strandqvist 9f6226a762 arch: x86: kvm: x86.c: Cleaning up variable is set more than once
A struct member variable is set to the same value more than once

This was found using a static code analysis program called cppcheck.

Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-30 16:52:04 +02:00
Xiaoming Gao e1fa108d24 kvm: fix wrong address when writing Hyper-V tsc page
When kvm_write_guest writes the tsc_ref structure to the guest, or it will lead
the low HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT bits of the TSC page address
must be cleared, or the guest can see a non-zero sequence number.

Otherwise Windows guests would not be able to get a correct clocksource
(QueryPerformanceCounter will always return 0) which causes serious chaos.

Signed-off-by: Xiaoming Gao <newtongao@tencnet.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 13:43:43 +02:00
Nadav Amit a449c7aa51 KVM: x86: Hypercall handling does not considers opsize correctly
Currently, the hypercall handling routine only considers LME as an indication
to whether the guest uses 32/64-bit mode. This is incosistent with hyperv
hypercalls handling and against the common sense of considering cs.l as well.
This patch uses is_64_bit_mode instead of is_long_mode for that matter. In
addition, the result is masked in respect to the guest execution mode. Last, it
changes kvm_hv_hypercall to use is_64_bit_mode as well to simplify the code.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-19 12:52:14 +02:00
Nadav Amit 67f4d4288c KVM: x86: rdpmc emulation checks the counter incorrectly
The rdpmc emulation checks that the counter (ECX) is not higher than 2, without
taking into considerations bits 30:31 role (e.g., bit 30 marks whether the
counter is fixed). The fix uses the pmu information for checking the validity
of the pmu counter.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-18 17:46:18 +02:00
Linus Torvalds b05d59dfce At over 200 commits, covering almost all supported architectures, this
was a pretty active cycle for KVM.  Changes include:
 
 - a lot of s390 changes: optimizations, support for migration,
   GDB support and more
 
 - ARM changes are pretty small: support for the PSCI 0.2 hypercall
   interface on both the guest and the host (the latter acked by Catalin)
 
 - initial POWER8 and little-endian host support
 
 - support for running u-boot on embedded POWER targets
 
 - pretty large changes to MIPS too, completing the userspace interface
   and improving the handling of virtualized timer hardware
 
 - for x86, a larger set of changes is scheduled for 3.17.  Still,
   we have a few emulator bugfixes and support for running nested
   fully-virtualized Xen guests (para-virtualized Xen guests have
   always worked).  And some optimizations too.
 
 The only missing architecture here is ia64.  It's not a coincidence
 that support for KVM on ia64 is scheduled for removal in 3.17.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJTjtlBAAoJEBvWZb6bTYbyMOUP/2NAePghE3IjG99ikHFdn+BX
 BfrURsuR6GD0AhYQnBidBmpFbAmN/LwSJxv/M7sV7OBRWLu3qbt69DrPTU2e/FK1
 j9q25peu8jRyHzJ1q9rBroo74nD9lQYuVr3uXNxxcg0DRnw14JHGlM3y8LDEknO8
 W+gpWTeAQ+2AuOX98MpRbCRMuzziCSv5bP5FhBVnsWHiZfvMbcUrbeJt+zYSiDAZ
 0tHm/5dFKzfj/vVrrnjD4EZcRr688Bs5rztG96hY6aoVJryjZGLtLp92wCWkRRmH
 CCvZwd245NmNthuKHzcs27/duSWfU0uOlu7AMrD44QYhzeDGyB/2nbCxbGqLLoBA
 nnOviXH4cC65/CnisZ79zfo979HbZcX+Lzg747EjBgCSxJmLlwgiG8yXtDvk5otB
 TH6GUeGDiEEPj//JD3XtgSz0sF2NvjREWRyemjDMvhz6JC/bLytXKb3sn+NXSj8m
 ujzF9eQoa4qKDcBL4IQYGTJ4z5nY3Pd68dHFIPHB7n82OxFLSQUBKxXw8/1fb5og
 VVb8PL4GOcmakQlAKtTMlFPmuy4bbL2r/2iV5xJiOZKmXIu8Hs1JezBE3SFAltbl
 3cAGwSM9/dDkKxUbTFblyOE9bkKbg4WYmq0LkdzsPEomb3IZWntOT25rYnX+LrBz
 bAknaZpPiOrW11Et1htY
 =j5Od
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into next

Pull KVM updates from Paolo Bonzini:
 "At over 200 commits, covering almost all supported architectures, this
  was a pretty active cycle for KVM.  Changes include:

   - a lot of s390 changes: optimizations, support for migration, GDB
     support and more

   - ARM changes are pretty small: support for the PSCI 0.2 hypercall
     interface on both the guest and the host (the latter acked by
     Catalin)

   - initial POWER8 and little-endian host support

   - support for running u-boot on embedded POWER targets

   - pretty large changes to MIPS too, completing the userspace
     interface and improving the handling of virtualized timer hardware

   - for x86, a larger set of changes is scheduled for 3.17.  Still, we
     have a few emulator bugfixes and support for running nested
     fully-virtualized Xen guests (para-virtualized Xen guests have
     always worked).  And some optimizations too.

  The only missing architecture here is ia64.  It's not a coincidence
  that support for KVM on ia64 is scheduled for removal in 3.17"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits)
  KVM: add missing cleanup_srcu_struct
  KVM: PPC: Book3S PR: Rework SLB switching code
  KVM: PPC: Book3S PR: Use SLB entry 0
  KVM: PPC: Book3S HV: Fix machine check delivery to guest
  KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs
  KVM: PPC: Book3S HV: Make sure we don't miss dirty pages
  KVM: PPC: Book3S HV: Fix dirty map for hugepages
  KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address
  KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates()
  KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number
  KVM: PPC: Book3S: Add ONE_REG register names that were missed
  KVM: PPC: Add CAP to indicate hcall fixes
  KVM: PPC: MPIC: Reset IRQ source private members
  KVM: PPC: Graciously fail broken LE hypercalls
  PPC: ePAPR: Fix hypercall on LE guest
  KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler
  KVM: PPC: BOOK3S: Always use the saved DAR value
  PPC: KVM: Make NX bit available with magic page
  KVM: PPC: Disable NX for old magic page using guests
  KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest
  ...
2014-06-04 08:47:12 -07:00
Linus Torvalds 4efdedca93 Small fixes for x86, slightly larger fixes for PPC, and a forgotten s390 patch.
The PPC fixes are important because they fix breakage that is new in 3.15.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJTdivEAAoJEBvWZb6bTYbyw3YQAIILnflhHNtklj1mfPnnibQf
 c3BLCkJ0gtK6A0FO2aAHgSja0kpgbEEnSphE/A/cb0vkLon3n5O0pQoSKjGUUbBO
 Mo0ndjzBYNmCP4MGxhkrg49VdqD40NaR0BjJAZudb4vUOw892WLFIJMIVmIqs9eG
 8V/y6S7mPLmrooAKHZxXql9y30UC77T1VZ3r4pXwYgKtUT51BQfTyWiSfjQBa8yI
 oGOSb8uqEC7YiOYPJYUNIMsyVqW4E6Qqs46rqtP4XZmSxzWXDzzgP4nQHHyJJCdZ
 aBYkeG+sJZG7ZwleJLejAncjWUY9Oq9GkMYNj0cTAoP/zA6jBGAll96KGKRbes9z
 bZUtCNL3ifLcgbIGeAxgjmYOq0XLGahHbqm9QISYW2XdRkBI+8EJs5FCP4YEHzZn
 FSm3zcCQ+wtbqjBbZZcqqLa6A/CGzjyO26qz+BCxrZ0BQkQX/2am3UykQ0JWam3H
 vX5ZM2ewJhs6SjFisPcswd20AN+SHjPyzPvErBLDfrqnAVbwj2ehgqyN2slVsqrj
 UyGzeKCfJgA0TiEH/4K6j6hvQWynUU+/2JglIfGE6AXmWddazCzl/qx4LvuGKFoB
 b8JSQ7YaHSsq/tHc8WhHkvcP0FSDZEiHcJN2iY1pwLKTSQp9JN3aPNruPKiO8dsW
 N+LoHL5fFcDi6Uu6wS7w
 =E2fU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Small fixes for x86, slightly larger fixes for PPC, and a forgotten
  s390 patch.  The PPC fixes are important because they fix breakage
  that is new in 3.15"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: s390: announce irqfd capability
  KVM: x86: disable master clock if TSC is reset during suspend
  KVM: vmx: disable APIC virtualization in nested guests
  KVM guest: Make pv trampoline code executable
  KVM: PPC: Book3S: ifdef on CONFIG_KVM_BOOK3S_32_HANDLER for 32bit
  KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit
  KVM: PPC: Book3S: HV: make _PAGE_NUMA take effect
2014-05-28 08:08:03 -07:00
Paolo Bonzini fb5e336b97 KVM: x86: drop set_rflags callback
Not needed anymore now that the CPL is computed directly
during task switch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22 17:47:16 +02:00
Marcelo Tosatti 16a9602158 KVM: x86: disable master clock if TSC is reset during suspend
Updating system_time from the kernel clock once master clock
has been enabled can result in time backwards event, in case
kernel clock frequency is lower than TSC frequency.

Disable master clock in case it is necessary to update it
from the resume path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-14 17:59:21 +02:00
Jan Kiszka d9f89b88f5 KVM: x86: Fix CR3 reserved bits check in long mode
Regression of 346874c9: PAE is set in long mode, but that does not mean
we have valid PDPTRs.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-12 20:04:01 +02:00
Michael S. Tsirkin b63cf42fd1 kvm/x86: implement hv EOI assist
It seems that it's easy to implement the EOI assist
on top of the PV EOI feature: simply convert the
page address to the format expected by PV EOI.

Notes:
-"No EOI required" is set only if interrupt injected
 is edge triggered; this is true because level interrupts are going
 through IOAPIC which disables PV EOI.
 In any case, if guest triggers EOI the bit will get cleared on exit.
-For migration, set of HV_X64_MSR_APIC_ASSIST_PAGE sets
 KVM_PV_EOI_EN internally, so restoring HV_X64_MSR_APIC_ASSIST_PAGE
 seems sufficient
 In any case, bit is cleared on exit so worst case it's never re-enabled
-no handling of PV EOI data is performed at HV_X64_MSR_EOI write;
 HV_X64_MSR_EOI is a separate optimization - it's an X2APIC
 replacement that lets you do EOI with an MSR and not IO.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-07 18:00:49 +02:00
Andi Kleen 2605fc216f asmlinkage, x86: Add explicit __visible to arch/x86/*
As requested by Linus add explicit __visible to the asmlinkage users.
This marks all functions visible to assembler.

Tree sweep for arch/x86/*

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1398984278-29319-3-git-send-email-andi@firstfloor.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-05-05 16:07:44 -07:00
Ulrich Obergfell 1171903d89 KVM: x86: improve the usability of the 'kvm_pio' tracepoint
This patch moves the 'kvm_pio' tracepoint to emulator_pio_in_emulated()
and emulator_pio_out_emulated(), and it adds an argument (a pointer to
the 'pio_data'). A single 8-bit or 16-bit or 32-bit data item is fetched
from 'pio_data' (depending on 'size'), and the value is included in the
trace record ('val'). If 'count' is greater than one, this is indicated
by the string "(...)" in the trace output.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-05 22:42:05 +02:00
Xiao Guangrong 198c74f43f KVM: MMU: flush tlb out of mmu lock when write-protect the sptes
Now we can flush all the TLBs out of the mmu lock without TLB corruption when
write-proect the sptes, it is because:
- we have marked large sptes readonly instead of dropping them that means we
  just change the spte from writable to readonly so that we only need to care
  the case of changing spte from present to present (changing the spte from
  present to nonpresent will flush all the TLBs immediately), in other words,
  the only case we need to care is mmu_spte_update()

- in mmu_spte_update(), we haved checked
  SPTE_HOST_WRITEABLE | PTE_MMU_WRITEABLE instead of PT_WRITABLE_MASK, that
  means it does not depend on PT_WRITABLE_MASK anymore

Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23 17:49:52 -03:00
Xiao Guangrong c126d94f2c KVM: MMU: lazily drop large spte
Currently, kvm zaps the large spte if write-protected is needed, the later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making them un-present, the page fault caused by read access can
be avoided

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.  This removes the need for the return value.

This version has fixed the issue reported in 6b73a9606, the reason of that
issue is that fast_page_fault() directly sets the readonly large spte to
writable but only dirty the first page into the dirty-bitmap that means
other pages are missed. Fixed it by only the normal sptes (on the
PT_PAGE_TABLE_LEVEL level) can be fast fixed

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23 17:49:50 -03:00
Nadav Amit 42bf549f3c KVM: x86: Processor mode may be determined incorrectly
If EFER.LMA is off, cs.l does not determine execution mode.
Currently, the emulation engine assumes differently.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23 17:47:00 -03:00
Nadav Amit 346874c950 KVM: x86: Fix CR3 reserved bits
According to Intel specifications, PAE and non-PAE does not have any reserved
bits.  In long-mode, regardless to PCIDE, only the high bits (above the
physical address) are reserved.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-23 17:46:57 -03:00
Michael S. Tsirkin f848a5a8dc KVM: support any-length wildcard ioeventfd
It is sometimes benefitial to ignore IO size, and only match on address.
In hindsight this would have been a better default than matching length
when KVM_IOEVENTFD_FLAG_DATAMATCH is not set, In particular, this kind
of access can be optimized on VMX: there no need to do page lookups.
This can currently be done with many ioeventfds but in a suboptimal way.

However we can't change kernel/userspace ABI without risk of breaking
some applications.
Use len = 0 to mean "ignore length for matching" in a more optimal way.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-17 14:01:42 -03:00
Linus Torvalds 55101e2d6c Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Marcelo Tosatti:
 - Fix for guest triggerable BUG_ON (CVE-2014-0155)
 - CR4.SMAP support
 - Spurious WARN_ON() fix

* git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: remove WARN_ON from get_kernel_ns()
  KVM: Rename variable smep to cr4_smep
  KVM: expose SMAP feature to guest
  KVM: Disable SMAP for guests in EPT realmode and EPT unpaging mode
  KVM: Add SMAP support when setting CR4
  KVM: Remove SMAP bit from CR4_RESERVED_BITS
  KVM: ioapic: try to recover if pending_eoi goes out of range
  KVM: ioapic: fix assignment of ioapic->rtc_status.pending_eoi (CVE-2014-0155)
2014-04-14 16:21:28 -07:00
Marcelo Tosatti b351c39cc9 KVM: x86: remove WARN_ON from get_kernel_ns()
Function and callers can be preempted.

https://bugzilla.kernel.org/show_bug.cgi?id=73721

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-14 17:50:43 -03:00
Feng Wu 97ec8c067d KVM: Add SMAP support when setting CR4
This patch adds SMAP handling logic when setting CR4 for guests

Thanks a lot to Paolo Bonzini for his suggestion to use the branchless
way to detect SMAP violation.

Signed-off-by: Feng Wu <feng.wu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2014-04-14 17:50:34 -03:00
Linus Torvalds 467a9e1633 CPU hotplug notifiers registration fixes for 3.15-rc1
The purpose of this single series of commits from Srivatsa S Bhat (with
 a small piece from Gautham R Shenoy) touching multiple subsystems that use
 CPU hotplug notifiers is to provide a way to register them that will not
 lead to deadlocks with CPU online/offline operations as described in the
 changelog of commit 93ae4f978c (CPU hotplug: Provide lockless versions
 of callback registration functions).
 
 The first three commits in the series introduce the API and document it
 and the rest simply goes through the users of CPU hotplug notifiers and
 converts them to using the new method.
 
 /
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJTQow2AAoJEILEb/54YlRxW4QQAJlYRDUzwFJzJzYhltQYuVR+
 4D74XMtvXgoJfg3cwdSWvMKKpJZnA9BVN0f7Hcx9wYmgdexYUuHeZJmMNyc3S2+g
 KjKBIsugvgmZhHbbLd6TJ6GBbhGT5JLt9VmSfL9zIkveInU1YHFUUqL/mxdHm4J0
 BSGKjk2rN3waRJgmY+xfliFLtQjDKFwJpMuvrgtoUyfas3f4sIV43UNbqdvA/weJ
 rzedxXOlKH/id4b56lj/4iIzcoL3mwvJJ7r6n0CEMsKv87z09kqR0O+69Tsq/cgs
 j17CsvoJOmZGk3QTeKVMQWBsvk6aPoDu3zK83gLbQMt+qjOpSTbJLz/3HZw4/TrW
 ss4nuZne1DLMGS+6hoxYbTP+6Ni//Kn+l/LrHc5jb7m1X3lMO4W2aV3IROtIE1rv
 lEP1IG01NU4u9YwkVj1dyhrkSp8tLPul4SrUK8W+oNweOC5crjJV7vJbIPJgmYiM
 IZN55wln0yVRtR4TX+rmvN0PixsInE8MeaVCmReApyF9pdzul/StxlBze5BKLSJD
 cqo1kNPpsmdxoDucqUpQ/gSvy+IOl2qnlisB5PpV93sk7De6TFDYrGHxjYIW7jMf
 StXwdCDDQhzd2Q8Kfpp895A1dbIl8rKtwA6bTU2eX+BfMVFzuMdT44cvosx1+UdQ
 sWl//rg76nb13dFjvF+q
 =SW7Q
 -----END PGP SIGNATURE-----

Merge tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull CPU hotplug notifiers registration fixes from Rafael Wysocki:
 "The purpose of this single series of commits from Srivatsa S Bhat
  (with a small piece from Gautham R Shenoy) touching multiple
  subsystems that use CPU hotplug notifiers is to provide a way to
  register them that will not lead to deadlocks with CPU online/offline
  operations as described in the changelog of commit 93ae4f978c ("CPU
  hotplug: Provide lockless versions of callback registration
  functions").

  The first three commits in the series introduce the API and document
  it and the rest simply goes through the users of CPU hotplug notifiers
  and converts them to using the new method"

* tag 'cpu-hotplug-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (52 commits)
  net/iucv/iucv.c: Fix CPU hotplug callback registration
  net/core/flow.c: Fix CPU hotplug callback registration
  mm, zswap: Fix CPU hotplug callback registration
  mm, vmstat: Fix CPU hotplug callback registration
  profile: Fix CPU hotplug callback registration
  trace, ring-buffer: Fix CPU hotplug callback registration
  xen, balloon: Fix CPU hotplug callback registration
  hwmon, via-cputemp: Fix CPU hotplug callback registration
  hwmon, coretemp: Fix CPU hotplug callback registration
  thermal, x86-pkg-temp: Fix CPU hotplug callback registration
  octeon, watchdog: Fix CPU hotplug callback registration
  oprofile, nmi-timer: Fix CPU hotplug callback registration
  intel-idle: Fix CPU hotplug callback registration
  clocksource, dummy-timer: Fix CPU hotplug callback registration
  drivers/base/topology.c: Fix CPU hotplug callback registration
  acpi-cpufreq: Fix CPU hotplug callback registration
  zsmalloc: Fix CPU hotplug callback registration
  scsi, fcoe: Fix CPU hotplug callback registration
  scsi, bnx2fc: Fix CPU hotplug callback registration
  scsi, bnx2i: Fix CPU hotplug callback registration
  ...
2014-04-07 14:55:46 -07:00
Paolo Bonzini 920c837785 KVM: vmx: fix MPX detection
kvm_x86_ops is still NULL at this point.  Since kvm_init_msr_list
cannot fail, it is safe to initialize it before the call.

Fixes: 93c4adc7af
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Jet Chen <jet.chen@intel.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-27 13:06:51 +01:00
Srivatsa S. Bhat 460dd42e11 x86, kvm: Fix CPU hotplug callback registration
Subsystems that want to register CPU hotplug callbacks, as well as perform
initialization for the CPUs that are already online, often do it as shown
below:

	get_online_cpus();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	register_cpu_notifier(&foobar_cpu_notifier);

	put_online_cpus();

This is wrong, since it is prone to ABBA deadlocks involving the
cpu_add_remove_lock and the cpu_hotplug.lock (when running concurrently
with CPU hotplug operations).

Instead, the correct and race-free way of performing the callback
registration is:

	cpu_notifier_register_begin();

	for_each_online_cpu(cpu)
		init_cpu(cpu);

	/* Note the use of the double underscored version of the API */
	__register_cpu_notifier(&foobar_cpu_notifier);

	cpu_notifier_register_done();

Fix the kvm code in x86 by using this latter form of callback registration.

Cc: Gleb Natapov <gleb@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-03-20 13:43:44 +01:00
Paolo Bonzini 93c4adc7af KVM: x86: handle missing MPX in nested virtualization
When doing nested virtualization, we may be able to read BNDCFGS but
still not be allowed to write to GUEST_BNDCFGS in the VMCS.  Guard
writes to the field with vmx_mpx_supported(), and similarly hide the
MSR from userspace if the processor does not support the field.

We could work around this with the generic MSR save/load machinery,
but there is only a limited number of MSR save/load slots and it is
not really worthwhile to waste one for a scenario that should not
happen except in the nested virtualization case.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-17 12:21:39 +01:00
Paolo Bonzini 4ff417320c KVM: x86: introduce kvm_supported_xcr0()
XSAVE support for KVM is already using host_xcr0 & KVM_SUPPORTED_XCR0 as
a "dynamic" version of KVM_SUPPORTED_XCR0.

However, this is not enough because the MPX bits should not be presented
to the guest unless kvm_x86_ops confirms the support.  So, replace all
instances of host_xcr0 & KVM_SUPPORTED_XCR0 with a new function
kvm_supported_xcr0() that also has this check.

Note that here:

		if (xstate_bv & ~KVM_SUPPORTED_XCR0)
			return -EINVAL;
		if (xstate_bv & ~host_cr0)
			return -EINVAL;

the code is equivalent to

		if ((xstate_bv & ~KVM_SUPPORTED_XCR0) ||
		    (xstate_bv & ~host_cr0)
			return -EINVAL;

i.e. "xstate_bv & (~KVM_SUPPORTED_XCR0 | ~host_cr0)" which is in turn
equal to "xstate_bv & ~(KVM_SUPPORTED_XCR0 & host_cr0)".  So we should
also use the new function there.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-17 12:21:38 +01:00
Gabriel L. Somlo 100943c54e kvm: x86: ignore ioapic polarity
Both QEMU and KVM have already accumulated a significant number of
optimizations based on the hard-coded assumption that ioapic polarity
will always use the ActiveHigh convention, where the logical and
physical states of level-triggered irq lines always match (i.e.,
active(asserted) == high == 1, inactive == low == 0). QEMU guests
are expected to follow directions given via ACPI and configure the
ioapic with polarity 0 (ActiveHigh). However, even when misbehaving
guests (e.g. OS X <= 10.9) set the ioapic polarity to 1 (ActiveLow),
QEMU will still use the ActiveHigh signaling convention when
interfacing with KVM.

This patch modifies KVM to completely ignore ioapic polarity as set by
the guest OS, enabling misbehaving guests to work alongside those which
comply with the ActiveHigh polarity specified by QEMU's ACPI tables.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu>
[Move documentation to KVM_IRQ_LINE, add ia64. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-13 11:58:21 +01:00
Paolo Bonzini c77fb5fe6f KVM: x86: Allow the guest to run with dirty debug registers
When not running in guest-debug mode, the guest controls the debug
registers and having to take an exit for each DR access is a waste
of time.  If the guest gets into a state where each context switch
causes DR to be saved and restored, this can take away as much as 40%
of the execution time from the guest.

After this patch, VMX- and SVM-specific code can set a flag in
switch_db_regs, telling vcpu_enter_guest that on the next exit the debug
registers might be dirty and need to be reloaded (syncing will be taken
care of by a new callback in kvm_x86_ops).  This flag can be set on the
first access to a debug registers, so that multiple accesses to the
debug registers only cause one vmexit.

Note that since the guest will be able to read debug registers and
enable breakpoints in DR7, we need to ensure that they are synchronized
on entry to the guest---including DR6 that was not synced before.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:02 +01:00
Paolo Bonzini 360b948d88 KVM: x86: change vcpu->arch.switch_db_regs to a bit mask
The next patch will add another bit that we can test with the
same "if".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 10:46:02 +01:00
Jan Kiszka c9a7953f09 KVM: x86: Remove return code from enable_irq/nmi_window
It's no longer possible to enter enable_irq_window in guest mode when
L1 intercepts external interrupts and we are entering L2. This is now
caught in vcpu_enter_guest. So we can remove the check from the VMX
version of enable_irq_window, thus the need to return an error code from
both enable_irq_window and enable_nmi_window.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:47 +01:00
Jan Kiszka b6b8a1451f KVM: nVMX: Rework interception of IRQs and NMIs
Move the check for leaving L2 on pending and intercepted IRQs or NMIs
from the *_allowed handler into a dedicated callback. Invoke this
callback at the relevant points before KVM checks if IRQs/NMIs can be
injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11 08:41:45 +01:00
Paolo Bonzini 1c2af4968e Merge tag 'kvm-for-3.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms into kvm-next 2014-03-04 15:58:00 +01:00
Andrew Jones 332967a3ea x86: kvm: introduce periodic global clock updates
commit 0061d53daf introduced a mechanism to execute a global clock
update for a vm. We can apply this periodically in order to propagate
host NTP corrections. Also, if all vcpus of a vm are pinned, then
without an additional trigger, no guest NTP corrections can propagate
either, as the current trigger is only vcpu cpu migration.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-04 11:50:54 +01:00
Andrew Jones 7e44e4495a x86: kvm: rate-limit global clock updates
When we update a vcpu's local clock it may pick up an NTP correction.
We can't wait an indeterminate amount of time for other vcpus to pick
up that correction, so commit 0061d53daf introduced a global clock
update. However, we can't request a global clock update on every vcpu
load either (which is what happens if the tsc is marked as unstable).
The solution is to rate-limit the global clock updates. Marcelo
calculated that we should delay the global clock updates no more
than 0.1s as follows:

Assume an NTP correction c is applied to one vcpu, but not the other,
then in n seconds the delta of the vcpu system_timestamps will be
c * n. If we assume a correction of 500ppm (worst-case), then the two
vcpus will diverge 50us in 0.1s, which is a considerable amount.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-04 11:50:47 +01:00
Andrew Honig a08d3b3b99 kvm: x86: fix emulator buffer overflow (CVE-2014-0049)
The problem occurs when the guest performs a pusha with the stack
address pointing to an mmio address (or an invalid guest physical
address) to start with, but then extending into an ordinary guest
physical address.  When doing repeated emulated pushes
emulator_read_write sets mmio_needed to 1 on the first one.  On a
later push when the stack points to regular memory,
mmio_nr_fragments is set to 0, but mmio_is_needed is not set to 0.

As a result, KVM exits to userspace, and then returns to
complete_emulated_mmio.  In complete_emulated_mmio
vcpu->mmio_cur_fragment is incremented.  The termination condition of
vcpu->mmio_cur_fragment == vcpu->mmio_nr_fragments is never achieved.
The code bounces back and fourth to userspace incrementing
mmio_cur_fragment past it's buffer.  If the guest does nothing else it
eventually leads to a a crash on a memcpy from invalid memory address.

However if a guest code can cause the vm to be destroyed in another
vcpu with excellent timing, then kvm_clear_async_pf_completion_queue
can be used by the guest to control the data that's pointed to by the
call to cancel_work_item, which can be used to gain execution.

Fixes: f78146b0f9
Signed-off-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org (3.5+)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-27 19:35:22 +01:00
Takuya Yoshikawa 684851a157 KVM: x86: Break kvm_for_each_vcpu loop after finding the VP_INDEX
No need to scan the entire VCPU array.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-27 19:25:39 +01:00
Marcelo Tosatti d3714010c3 KVM: x86: emulator_cmpxchg_emulated should mark_page_dirty
emulator_cmpxchg_emulated writes to guest memory, therefore it should
update the dirty bitmap accordingly.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-26 10:11:08 +01:00
Liu, Jinsong 390bd528ae KVM: x86: Enable Intel MPX for guest
From 44c2abca2c2eadc6f2f752b66de4acc8131880c4 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:12:31 +0800
Subject: [PATCH v5 3/3] KVM: x86: Enable Intel MPX for guest

This patch enable Intel MPX feature to guest.

Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-25 20:17:12 +01:00
Liu, Jinsong 0dd376e709 KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save
From 5d5a80cd172ea6fb51786369bcc23356b1e9e956 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Mon, 24 Feb 2014 18:11:55 +0800
Subject: [PATCH v5 2/3] KVM: x86: add MSR_IA32_BNDCFGS to msrs_to_save

Add MSR_IA32_BNDCFGS to msrs_to_save, and corresponding logic
to kvm_get/set_msr().

Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-25 20:17:09 +01:00
Liu, Jinsong 56c103ec04 KVM: x86: Fix xsave cpuid exposing bug
From 00c920c96127d20d4c3bb790082700ae375c39a0 Mon Sep 17 00:00:00 2001
From: Liu Jinsong <jinsong.liu@intel.com>
Date: Fri, 21 Feb 2014 23:47:18 +0800
Subject: [PATCH] KVM: x86: Fix xsave cpuid exposing bug

EBX of cpuid(0xD, 0) is dynamic per XCR0 features enable/disable.
Bit 63 of XCR0 is reserved for future expansion.

Signed-off-by: Liu Jinsong <jinsong.liu@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-22 15:53:34 +01:00
Marcelo Tosatti 4f34d683e5 KVM: x86: remove unused last_kernel_ns variable
Remove unused last_kernel_ns variable.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-04 04:20:42 +01:00
Paolo Bonzini 5f66b62095 kvm: x86: move KVM_CAP_HYPERV_TIME outside #ifdef
Self explanatory.

Reported-by: Radim Krcmar <rkrcmar@redhat.com>
Cc: Vadim Rozenfeld <vrozenfe@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-29 18:10:45 +01:00
Jan Kiszka 58cb628dbe KVM: x86: Validate guest writes to MSR_IA32_APICBASE
Check for invalid state transitions on guest-initiated updates of
MSR_IA32_APICBASE. This address both enabling of the x2APIC when it is
not supported and all invalid transitions as described in SDM section
10.12.5. It also checks that no reserved bit is set in APICBASE by the
guest.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
[Use cpuid_maxphyaddr instead of guest_cpuid_get_phys_bits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-27 14:39:44 +01:00
Vadim Rozenfeld b3af1e889e KVM: x86: mark hyper-v vapic assist page as dirty
Signed-off-by: Vadim Rozenfeld <vrozenfe@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-24 16:51:15 +01:00
Vadim Rozenfeld b94b64c9a7 KVM: x86: mark hyper-v hypercall page as dirty
Signed-off-by: Vadim Rozenfeld <vrozenfe@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-23 19:00:04 +01:00
Linus Torvalds 7ebd3faa9b First round of KVM updates for 3.14; PPC parts will come next week.
Nothing major here, just bugfixes all over the place.  The most
 interesting part is the ARM guys' virtualized interrupt controller
 overhaul, which lets userspace get/set the state and thus enables
 migration of ARM VMs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJS3TVKAAoJEBvWZb6bTYbyIFgP/2cmt4ifCuFMaZv4+G1S8jZU
 uC9ZB/+7vzht/p6zAy+4BxurKbHmSBFkC1OKcxYuy7yB4CQkHabzj4V2vRtqFdwH
 5lExP9qh3kqaVLuhnvxLTmkktR3EW4PFy6OI53l5kRNktOXSuZ0aN6K3V7tCg/X0
 iL7ASo4bJKlxeWcDpmuVrNgAajmZVfXrjKY7robgBQno+yIsgKhRZRBQHjozA6B8
 FpCo/k48RZd/EzIbV/PDDRI4hmmry/lgrO9SKjzq56wSqff2bd/k/KYze4dbAPfd
 Ps60enPTuHmeEjjb4MMMU4EKHVdTQFUMx/xZCmT4xzoh8s4of6RHphXbfE0SUznQ
 dTveyEQAR7E3JNS0k1+3WEX5fWlFesp0hO2NeE0wzUq4TAr9ztgVO9NQ6Si15e7Z
 2HysO0T5Ojtt0lY08/PvS6i48eCAuuBomrejJS8hLW4SUZ5adn+yW4Qo7Fp9JeBR
 l9a3LsVT8BZMtUWrUuFcVhlM4MbzElUPjDbgWhR8UYU/kpfVZOQu8qWgGKR4UWXy
 X7/t9l/tjR99CmfMJBAOzJid+ScSpAfg77BdaKiQrVfVIJmsjEjlO8vUMyj5b1HF
 hPX5wNyJjHAOfridLeHSs4Rdm4a8sk8Az5d4h76pLVz8M4jyTi2v0rO3N4/dU/pu
 x7N8KR5hAj+mLBoM9/Al
 =8sYU
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "First round of KVM updates for 3.14; PPC parts will come next week.

  Nothing major here, just bugfixes all over the place.  The most
  interesting part is the ARM guys' virtualized interrupt controller
  overhaul, which lets userspace get/set the state and thus enables
  migration of ARM VMs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
  kvm: make KVM_MMU_AUDIT help text more readable
  KVM: s390: Fix memory access error detection
  KVM: nVMX: Update guest activity state field on L2 exits
  KVM: nVMX: Fix nested_run_pending on activity state HLT
  KVM: nVMX: Clean up handling of VMX-related MSRs
  KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
  KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
  KVM: nVMX: Leave VMX mode on clearing of feature control MSR
  KVM: VMX: Fix DR6 update on #DB exception
  KVM: SVM: Fix reading of DR6
  KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
  add support for Hyper-V reference time counter
  KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
  KVM: x86: fix tsc catchup issue with tsc scaling
  KVM: x86: limit PIT timer frequency
  KVM: x86: handle invalid root_hpa everywhere
  kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
  kvm: vfio: silence GCC warning
  KVM: ARM: Remove duplicate include
  arm/arm64: KVM: relax the requirements of VMA alignment for THP
  ...
2014-01-22 21:40:43 -08:00
Jan Kiszka 73aaf249ee KVM: SVM: Fix reading of DR6
In contrast to VMX, SVM dose not automatically transfer DR6 into the
VCPU's arch.dr6. So if we face a DR6 read, we must consult a new vendor
hook to obtain the current value. And as SVM now picks the DR6 state
from its VMCB, we also need a set callback in order to write updates of
DR6 back.

Fixes a regression of 020df0794f.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:10 +01:00
Jan Kiszka 9926c9fdbd KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
Whenever we change arch.dr7, we also have to call kvm_update_dr7. In
case guest debugging is off, this will synchronize the new state into
hardware.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:09 +01:00
Vadim Rozenfeld e984097b55 add support for Hyper-V reference time counter
Signed-off: Peter Lieven <pl@kamp.de>
Signed-off: Gleb Natapov
Signed-off: Vadim Rozenfeld <vrozenfe@redhat.com>

After some consideration I decided to submit only Hyper-V reference
counters support this time. I will submit iTSC support as a separate
patch as soon as it is ready.

v1 -> v2
1. mark TSC page dirty as suggested by
    Eric Northup <digitaleric@google.com> and Gleb
2. disable local irq when calling get_kernel_ns,
    as it was done by Peter Lieven <pl@amp.de>
3. move check for TSC page enable from second patch
    to this one.

v3 -> v4
    Get rid of ref counter offset.

v4 -> v5
    replace __copy_to_user with kvm_write_guest
    when updateing iTSC page.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17 10:22:08 +01:00
Paolo Bonzini aab6d7ce37 KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
After the previous patch from Marcelo, the comment before this write
became obsolete.  In fact, the write is unnecessary.  The calls to
kvm_write_tsc ultimately result in a master clock update as soon as
all TSCs agree and the master clock is re-enabled.  This master
clock update will rewrite tsc_timestamp.

So, together with the comment, delete the dead write too.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-15 18:08:25 +01:00
Marcelo Tosatti f25e656d31 KVM: x86: fix tsc catchup issue with tsc scaling
To fix a problem related to different resolution of TSC and system clock,
the offset in TSC units is approximated by

delta = vcpu->hv_clock.tsc_timestamp 	- 	vcpu->last_guest_tsc

(Guest TSC value at 			(Guest TSC value at last VM-exit)
the last kvm_guest_time_update
call)

Delta is then later scaled using mult,shift pair found in hv_clock
structure (which is correct against tsc_timestamp in that
structure).

However, if a frequency change is performed between these two points,
this delta is measured using different TSC frequencies, but scaled using
mult,shift pair for one frequency only.

The end result is an incorrect delta.

The bug which this code works around is not the only cause for
clock backwards events. The global accumulator is still
necessary, so remove the max_kernel_ns fix and rely on the
global accumulator for no clock backwards events.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-15 13:44:46 +01:00
Marcelo Tosatti 9ed96e87c5 KVM: x86: limit PIT timer frequency
Limit PIT timer frequency similarly to the limit applied by
LAPIC timer.

Cc: stable@kernel.org
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-15 12:43:54 +01:00
Takuya Yoshikawa 9357d93952 KVM: x86: Add comment on vcpu_enter_guest()'s return value
Giving proper names to the 0 and 1 was once suggested.  But since 0 is
returned to the userspace, giving it another name can introduce extra
confusion.  This patch just explains the meanings instead.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-13 14:23:54 +01:00
Takuya Yoshikawa c08ac06ab3 KVM: Use cond_resched() directly and remove useless kvm_resched()
Since the commit 15ad7146 ("KVM: Use the scheduler preemption notifiers
to make kvm preemptible"), the remaining stuff in this function is a
simple cond_resched() call with an extra need_resched() check which was
there to avoid dropping VCPUs unnecessarily.  Now it is meaningless.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-13 14:23:45 +01:00
Andy Honig fda4e2e855 KVM: x86: Convert vapic synchronization to _cached functions (CVE-2013-6368)
In kvm_lapic_sync_from_vapic and kvm_lapic_sync_to_vapic there is the
potential to corrupt kernel memory if userspace provides an address that
is at the end of a page.  This patches concerts those functions to use
kvm_write_guest_cached and kvm_read_guest_cached.  It also checks the
vapic_address specified by userspace during ioctl processing and returns
an error to userspace if the address is not a valid GPA.

This is generally not guest triggerable, because the required write is
done by firmware that runs before the guest.  Also, it only affects AMD
processors and oldish Intel that do not have the FlexPriority feature
(unless you disable FlexPriority, of course; then newer processors are
also affected).

Fixes: b93463aa59 ('KVM: Accelerated apic support')

Reported-by: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-12-12 22:39:46 +01:00
Michael S. Tsirkin 01b71917b5 kvm: optimize out smp_mb after srcu_read_unlock
I noticed that srcu_read_lock/unlock both have a memory barrier,
so just by moving srcu_read_unlock earlier we can get rid of
one call to smp_mb() using smp_mb__after_srcu_read_unlock instead.

Unsurprisingly, the gain is small but measureable using the unit test
microbenchmark:
before
        vmcall in the ballpark of 1410 cycles
after
        vmcall in the ballpark of 1360 cycles

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-11-06 09:32:31 +02:00
Gleb Natapov 95f328d3ad Merge branch 'kvm-ppc-queue' of git://github.com/agraf/linux-2.6 into queue
Conflicts:
	arch/powerpc/include/asm/processor.h
2013-11-04 10:20:57 +02:00
Paolo Bonzini c67a04cb9a KVM: x86: fix KVM_SET_XCRS loop
The loop was always using 0 as the index.  This means that
any rubbish after the first element of the array went undetected.
It seems reasonable to assume that no KVM userspace did that.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-31 11:31:19 +01:00
Paolo Bonzini 46c34cb059 KVM: x86: fix KVM_SET_XCRS for CPUs that do not support XSAVE
The KVM_SET_XCRS ioctl must accept anything that KVM_GET_XCRS
could return.  XCR0's bit 0 is always 1 in real processors with
XSAVE, and KVM_GET_XCRS will always leave bit 0 set even if the
emulated processor does not have XSAVE.  So, KVM_SET_XCRS must
ignore that bit when checking for attempts to enable unsupported
save states.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-31 11:30:46 +01:00
Alex Williamson e0f0bbc527 kvm: Create non-coherent DMA registeration
We currently use some ad-hoc arch variables tied to legacy KVM device
assignment to manage emulation of instructions that depend on whether
non-coherent DMA is present.  Create an interface for this, adapting
legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
For now we assume that non-coherent DMA is possible any time we have a
VFIO group.  Eventually an interface can be developed as part of the
VFIO external user interface to query the coherency of a group.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 19:02:23 +01:00
Alex Williamson d96eb2c6f4 kvm/x86: Convert iommu_flags to iommu_noncoherent
Default to operating in coherent mode.  This simplifies the logic when
we switch to a model of registering and unregistering noncoherent I/O
with KVM.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 19:02:13 +01:00
Borislav Petkov b51e974fcd kvm, emulator: Rename VendorSpecific flag
Call it EmulateOnUD which is exactly what we're trying to do with
vendor-specific instructions.

Rename ->only_vendor_specific_insn to something shorter, while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 18:54:40 +01:00
Borislav Petkov 1ce19dc16c kvm, emulator: Use opcode length
Add a field to the current emulation context which contains the
instruction opcode length. This will streamline handling of opcodes of
different length.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 18:54:39 +01:00
Borislav Petkov 9c15bb1d0a kvm: Add KVM_GET_EMULATED_CPUID
Add a kvm ioctl which states which system functionality kvm emulates.
The format used is that of CPUID and we return the corresponding CPUID
bits set for which we do emulate functionality.

Make sure ->padding is being passed on clean from userspace so that we
can use it for something in the future, after the ioctl gets cast in
stone.

s/kvm_dev_ioctl_get_supported_cpuid/kvm_dev_ioctl_get_cpuid/ while at
it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-30 18:54:39 +01:00
Aneesh Kumar K.V 5587027ce9 kvm: Add struct kvm arg to memslot APIs
We will use that in the later patch to find the kvm ops handler

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
2013-10-17 15:49:23 +02:00
chai wen f2e106692d KVM: Drop FOLL_GET in GUP when doing async page fault
Page pinning is not mandatory in kvm async page fault processing since
after async page fault event is delivered to a guest it accesses page once
again and does its own GUP.  Drop the FOLL_GET flag in GUP in async_pf
code, and do some simplifying in check/clear processing.

Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Gu zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: chai wen <chaiw.fnst@cn.fujitsu.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-15 13:43:37 +03:00
Paolo Bonzini 8a3c1a3347 KVM: mmu: change useless int return types to void
kvm_mmu initialization is mostly filling in function pointers, there is
no way for it to fail.  Clean up unused return values.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 15:44:02 +03:00
Paolo Bonzini d8d173dab2 KVM: mmu: remove uninteresting MMU "new_cr3" callbacks
The new_cr3 MMU callback has been a wrapper for mmu_free_roots since commit
e676505 (KVM: MMU: Force cr3 reload with two dimensional paging on mov
cr3 emulation, 2012-07-08).

The commit message mentioned that "mmu_free_roots() is somewhat of an overkill,
but fixing that is more complicated and will be done after this minimal fix".
One year has passed, and no one really felt the need to do a different fix.
Wrap the call with a kvm_mmu_new_cr3 function for clarity, but remove the
callback.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 15:43:59 +03:00
Paolo Bonzini 4344ee981e KVM: x86: only copy XSAVE state for the supported features
This makes the interface more deterministic for userspace, which can expect
(after configuring only the features it supports) to get exactly the same
state from the kernel, independent of the host CPU and kernel version.

Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 12:29:09 +03:00
Paolo Bonzini d7876f1be4 KVM: x86: prevent setting unsupported XSAVE states
A guest can still attempt to save and restore XSAVE states even if they
have been masked in CPUID leaf 0Dh.  This usually is not visible to
the guest, but is still wrong: "Any attempt to set a reserved bit (as
determined by the contents of EAX and EDX after executing CPUID with
EAX=0DH, ECX= 0H) in XCR0 for a given processor will result in a #GP
exception".

The patch also performs the same checks as __kvm_set_xcr in KVM_SET_XSAVE.
This catches migration from newer to older kernel/processor before the
guest starts running.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-10-03 12:29:07 +03:00
Paolo Bonzini 2f303b74a6 KVM: Convert kvm_lock back to non-raw spinlock
In commit e935b8372c ("KVM: Convert kvm_lock to raw_spinlock"),
the kvm_lock was made a raw lock.  However, the kvm mmu_shrink()
function tries to grab the (non-raw) mmu_lock within the scope of
the raw locked kvm_lock being held.  This leads to the following:

BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
Preemption disabled at:[<ffffffffa0376eac>] mmu_shrink+0x5c/0x1b0 [kvm]

Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
Call Trace:
 [<ffffffff8106f2ad>] __might_sleep+0xfd/0x160
 [<ffffffff817d8d64>] rt_spin_lock+0x24/0x50
 [<ffffffffa0376f3c>] mmu_shrink+0xec/0x1b0 [kvm]
 [<ffffffff8111455d>] shrink_slab+0x17d/0x3a0
 [<ffffffff81151f00>] ? mem_cgroup_iter+0x130/0x260
 [<ffffffff8111824a>] balance_pgdat+0x54a/0x730
 [<ffffffff8111fe47>] ? set_pgdat_percpu_threshold+0xa7/0xd0
 [<ffffffff811185bf>] kswapd+0x18f/0x490
 [<ffffffff81070961>] ? get_parent_ip+0x11/0x50
 [<ffffffff81061970>] ? __init_waitqueue_head+0x50/0x50
 [<ffffffff81118430>] ? balance_pgdat+0x730/0x730
 [<ffffffff81060d2b>] kthread+0xdb/0xe0
 [<ffffffff8106e122>] ? finish_task_switch+0x52/0x100
 [<ffffffff817e1e94>] kernel_thread_helper+0x4/0x10
 [<ffffffff81060c50>] ? __init_kthread_worker+0x

After the previous patch, kvm_lock need not be a raw spinlock anymore,
so change it back.

Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-09-30 09:21:51 +02:00
Marcelo Tosatti 2e762ff79f KVM: x86: update masterclock when kvmclock_offset is calculated (v2)
The offset to add to the hosts monotonic time, kvmclock_offset, is
calculated against the monotonic time at KVM_SET_CLOCK ioctl time.

Request a master clock update at this time, to reduce a potentially
unbounded difference between the values of the masterclock and
the clock value used to calculate kvmclock_offset.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-08-28 17:36:11 +03:00
Paolo Bonzini 0912c9771e KVM: x86: add comments where MMIO does not return to the emulator
Support for single-step in the emulator (new in 3.12) does not work for
MMIO or PIO writes, because they are completed without returning to
the emulator.  This is not worse than what we had in 3.11; still, add
comments so that the issue is not forgotten.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-08-28 17:15:54 +03:00
Raghavendra K T 24d2166beb kvm hypervisor: Simplify kvm_for_each_vcpu with kvm_irq_delivery_to_apic
Note that we are using APIC_DM_REMRD which has reserved usage.
In future if APIC_DM_REMRD usage is standardized, then we should
find some other way or go back to old method.

Suggested-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-08-26 12:47:09 +03:00
Srivatsa Vaddagiri 6aef266c6e kvm hypervisor : Add a hypercall to KVM hypervisor to support pv-ticketlocks
kvm_hc_kick_cpu allows the calling vcpu to kick another vcpu out of halt state.
the presence of these hypercalls is indicated to guest via
kvm_feature_pv_unhalt.

Fold pv_unhalt flag into GET_MP_STATE ioctl to aid migration
During migration, any vcpu that got kicked but did not become runnable
(still in halted state) should be runnable after migration.

Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Suzuki Poulose <suzuki@in.ibm.com>
[Raghu: Apic related changes, folding pvunhalted into vcpu_runnable
 Added flags for future use (suggested by Gleb)]
[ Raghu: fold pv_unhalt flag as suggested by Eric Northup]
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-08-26 12:47:09 +03:00
Nadav Har'El b7e914501c nEPT: Fix wrong test in kvm_set_cr3
kvm_set_cr3() attempts to check if the new cr3 is a valid guest physical
address. The problem is that with nested EPT, cr3 is an *L2* physical
address, not an L1 physical address as this test expects.

As the comment above this test explains, it isn't necessary, and doesn't
correspond to anything a real processor would do. So this patch removes it.

Note that this wrong test could have also theoretically caused problems
in nested NPT, not just in nested EPT. However, in practice, the problem
was avoided: nested_svm_vmexit()/vmrun() do not call kvm_set_cr3 in the
nested NPT case, and instead set the vmcb (and arch.cr3) directly, thus
circumventing the problem. Additional potential calls to the buggy function
are avoided in that we don't trap cr3 modifications when nested NPT is
enabled. However, because in nested VMX we did want to use kvm_set_cr3()
(as requested in Avi Kivity's review of the original nested VMX patches),
we can't avoid this problem and need to fix it.

Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Jun Nakajima <jun.nakajima@intel.com>
Signed-off-by: Xinhao Xu <xinhao.xu@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-08-07 15:57:35 +02:00
Paolo Bonzini 663f4c61b8 KVM: x86: handle singlestep during emulation
This lets debugging work better during emulation of invalid
guest state.

This time the check is done after emulation, but before writeback
of the flags; we need to check the flags *before* execution of the
instruction, we cannot check singlestep_rip because the CS base may
have already been modified.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

Conflicts:
	arch/x86/kvm/x86.c
2013-07-29 09:01:14 +02:00
Paolo Bonzini 4a1e10d5b5 KVM: x86: handle hardware breakpoints during emulation
This lets debugging work better during emulation of invalid
guest state.

The check is done before emulating the instruction, and (in the case
of guest debugging) reuses EMULATE_DO_MMIO to exit with KVM_EXIT_DEBUG.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-29 09:01:14 +02:00
Paolo Bonzini ac0a48c39a KVM: x86: rename EMULATE_DO_MMIO
The next patch will reuse it for other userspace exits than MMIO,
namely debug events.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-29 09:01:14 +02:00
Nadav Har'El b3897a49e2 KVM: nVMX: Fix read/write to MSR_IA32_FEATURE_CONTROL
Fix read/write to IA32_FEATURE_CONTROL MSR in nested environment.

This patch simulate this MSR in nested_vmx and the default value is
0x0. BIOS should set it to 0x5 before VMXON. After setting the lock
bit, write to it will cause #GP(0).

Another QEMU patch is also needed to handle emulation of reset
and migration. Reset to vCPU should clear this MSR and migration
should reserve value of it.

This patch is based on Nadav's previous commit.
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/88478

Signed-off-by: Nadav Har'El <nyh@math.technion.ac.il>
Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:29 +02:00
Mathias Krause 6b61edf765 KVM: x86: Drop useless cast
Void pointers don't need no casting, drop it.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:29 +02:00
Takuya Yoshikawa e6dff7d15e KVM: x86: Avoid zapping mmio sptes twice for generation wraparound
Now that kvm_arch_memslots_updated() catches every increment of the
memslots->generation, checking if the mmio generation has reached its
maximum value is enough.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:26 +02:00
Takuya Yoshikawa e59dbe09f8 KVM: Introduce kvm_arch_memslots_updated()
This is called right after the memslots is updated, i.e. when the result
of update_memslots() gets installed in install_new_memslots().  Since
the memslots needs to be updated twice when we delete or move a memslot,
kvm_arch_commit_memory_region() does not correspond to this exactly.

In the following patch, x86 will use this new API to check if the mmio
generation has reached its maximum value, in which case mmio sptes need
to be flushed out.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Acked-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:25 +02:00
Linus Torvalds fe489bf450 KVM fixes for 3.11
On the x86 side, there are some optimizations and documentation updates.
 The big ARM/KVM change for 3.11, support for AArch64, will come through
 Catalin Marinas's tree.  s390 and PPC have misc cleanups and bugfixes.
 
 There is a conflict due to "s390/pgtable: fix ipte notify bit" having
 entered 3.10 through Martin Schwidefsky's s390 tree.  This pull request
 has additional changes on top, so this tree's version is the correct one.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0oU6AAoJEBvWZb6bTYbynnsP/RSUrrHrA8Wu1tqVfAKu+1y5
 6OIihqZ9x11/YMaNofAfv86jqxFu0/j7CzMGphNdjzujqKI+Q1tGe7oiVCmKzoG+
 UvSctWsz0lpllgBtnnrm5tcfmG6rrddhLtpA7m320+xCVx8KV5P4VfyHZEU+Ho8h
 ziPmb2mAQ65gBNX6nLHEJ3ITTgad6gt4NNbrKIYpyXuWZQJypzaRqT/vpc4md+Ed
 dCebMXsL1xgyb98EcnOdrWH1wV30MfucR7IpObOhXnnMKeeltqAQPvaOlKzZh4dK
 +QfxJfdRZVS0cepcxzx1Q2X3dgjoKQsHq1nlIyz3qu1vhtfaqBlixLZk0SguZ/R9
 1S1YqucZiLRO57RD4q0Ak5oxwobu18ZoqJZ6nledNdWwDe8bz/W2wGAeVty19ky0
 qstBdM9jnwXrc0qrVgZp3+s5dsx3NAm/KKZBoq4sXiDLd/yBzdEdWIVkIrU3X9wU
 3X26wOmBxtsB7so/JR7ciTsQHelmLicnVeXohAEP9CjIJffB81xVXnXs0P0SYuiQ
 RzbSCwjPzET4JBOaHWT0Dhv0DTS/EaI97KzlN32US3Bn3WiLlS1oDCoPFoaLqd2K
 LxQMsXS8anAWxFvexfSuUpbJGPnKSidSQoQmJeMGBa9QhmZCht3IL16/Fb641ToN
 xBohzi49L9FDbpOnTYfz
 =1zpG
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "On the x86 side, there are some optimizations and documentation
  updates.  The big ARM/KVM change for 3.11, support for AArch64, will
  come through Catalin Marinas's tree.  s390 and PPC have misc cleanups
  and bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
  KVM: PPC: Ignore PIR writes
  KVM: PPC: Book3S PR: Invalidate SLB entries properly
  KVM: PPC: Book3S PR: Allow guest to use 1TB segments
  KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
  KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
  KVM: PPC: Book3S PR: Fix proto-VSID calculations
  KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
  KVM: Fix RTC interrupt coalescing tracking
  kvm: Add a tracepoint write_tsc_offset
  KVM: MMU: Inform users of mmio generation wraparound
  KVM: MMU: document fast invalidate all mmio sptes
  KVM: MMU: document fast invalidate all pages
  KVM: MMU: document fast page fault
  KVM: MMU: document mmio page fault
  KVM: MMU: document write_flooding_count
  KVM: MMU: document clear_spte_count
  KVM: MMU: drop kvm_mmu_zap_mmio_sptes
  KVM: MMU: init kvm generation close to mmio wrap-around value
  KVM: MMU: add tracepoint for check_mmio_spte
  KVM: MMU: fast invalidate all mmio sptes
  ...
2013-07-03 13:21:40 -07:00