Commit Graph

5285 Commits

Author SHA1 Message Date
Thomas Gleixner 3b20eb2372 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 320
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 59 temple place suite 330 boston ma 02111
  1307 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 33 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000435.254582722@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:05 +02:00
Thomas Huth a86cb413f4 KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID
KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all
architectures. However, on s390x, the amount of usable CPUs is determined
during runtime - it is depending on the features of the machine the code
is running on. Since we are using the vcpu_id as an index into the SCA
structures that are defined by the hardware (see e.g. the sca_add_vcpu()
function), it is not only the amount of CPUs that is limited by the hard-
ware, but also the range of IDs that we can use.
Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too.
So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
code into the architecture specific code, and on s390x we have to return
the same value here as for KVM_CAP_MAX_VCPUS.
This problem has been discovered with the kvm_create_max_vcpus selftest.
With this change applied, the selftest now passes on s390x, too.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20190523164309.13345-9-thuth@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2019-05-28 15:52:19 +02:00
Paolo Bonzini 66f61c9288 KVM: x86: fix return value for reserved EFER
Commit 11988499e6 ("KVM: x86: Skip EFER vs. guest CPUID checks for
host-initiated writes", 2019-04-02) introduced a "return false" in a
function returning int, and anyway set_efer has a "nonzero on error"
conventon so it should be returning 1.

Reported-by: Pavel Machek <pavel@denx.de>
Fixes: 11988499e6 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:55:02 +02:00
Paolo Bonzini 2924b52117 KVM: x86/pmu: do not mask the value that is written to fixed PMUs
According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of
each MSR may be written with any value, and the high-order 8 bits are
sign-extended according to the value of bit 31", but the fixed counters
in real hardware are limited to the width of the fixed counters ("bits
beyond the width of the fixed-function counter are reserved and must be
written as zeros").  Fix KVM to do the same.

Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:14 +02:00
Paolo Bonzini 0e6f467ee2 KVM: x86/pmu: mask the result of rdpmc according to the width of the counters
This patch will simplify the changes in the next, by enforcing the
masking of the counters to RDPMC and RDMSR.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:13 +02:00
Borislav Petkov a80c4ec10e x86/kvm/pmu: Set AMD's virt PMU version to 1
After commit:

  672ff6cff8 ("KVM: x86: Raise #GP when guest vCPU do not support PMU")

my AMD guests started #GPing like this:

  general protection fault: 0000 [#1] PREEMPT SMP
  CPU: 1 PID: 4355 Comm: bash Not tainted 5.1.0-rc6+ #3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  RIP: 0010:x86_perf_event_update+0x3b/0xa0

with Code: pointing to RDPMC. It is RDPMC because the guest has the
hardware watchdog CONFIG_HARDLOCKUP_DETECTOR_PERF enabled which uses
perf. Instrumenting kvm_pmu_rdpmc() some, showed that it fails due to:

  if (!pmu->version)
  	return 1;

which the above commit added. Since AMD's PMU leaves the version at 0,
that causes the #GP injection into the guest.

Set pmu->version arbitrarily to 1 and move it above the non-applicable
struct kvm_pmu members.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Fixes: 672ff6cff8 ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:13 +02:00
Paolo Bonzini 6f2f84532c KVM: x86: do not spam dmesg with VMCS/VMCB dumps
Userspace can easily set up invalid processor state in such a way that
dmesg will be filled with VMCS or VMCB dumps.  Disable this by default
using a module parameter.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:12 +02:00
Peter Xu 654f1f13ea kvm: Check irqchip mode before assign irqfd
When assigning kvm irqfd we didn't check the irqchip mode but we allow
KVM_IRQFD to succeed with all the irqchip modes.  However it does not
make much sense to create irqfd even without the kernel chips.  Let's
provide a arch-dependent helper to check whether a specific irqfd is
allowed by the arch.  At least for x86, it should make sense to check:

- when irqchip mode is NONE, all irqfds should be disallowed, and,

- when irqchip mode is SPLIT, irqfds that are with resamplefd should
  be disallowed.

For either of the case, previously we'll silently ignore the irq or
the irq ack event if the irqchip mode is incorrect.  However that can
cause misterious guest behaviors and it can be hard to triage.  Let's
fail KVM_IRQFD even earlier to detect these incorrect configurations.

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Radim Krčmář <rkrcmar@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:12 +02:00
Suthikulpanit, Suravee c9bcd3e333 kvm: svm/avic: fix off-by-one in checking host APIC ID
Current logic does not allow VCPU to be loaded onto CPU with
APIC ID 255. This should be allowed since the host physical APIC ID
field in the AVIC Physical APIC table entry is an 8-bit value,
and APIC ID 255 is valid in system with x2APIC enabled.
Instead, do not allow VCPU load if the host APIC ID cannot be
represented by an 8-bit value.

Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK
instead of AVIC_MAX_PHYSICAL_ID_COUNT.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:11 +02:00
Wanpeng Li 16ba3ab4e1 KVM: LAPIC: Expose per-vCPU timer_advance_ns to userspace
Expose per-vCPU timer_advance_ns to userspace, so it is able to
query the auto-adjusted value.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:09 +02:00
Wanpeng Li 0e6edceb8f KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow
After commit c3941d9e0 (KVM: lapic: Allow user to disable adaptive tuning of
timer advancement), '-1' enables adaptive tuning starting from default
advancment of 1000ns. However, we should expose an int instead of an overflow
uint module parameter.

Before patch:

/sys/module/kvm/parameters/lapic_timer_advance_ns:4294967295

After patch:

/sys/module/kvm/parameters/lapic_timer_advance_ns:-1

Fixes: c3941d9e0 (KVM: lapic: Allow user to disable adaptive tuning of timer advancement)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:09 +02:00
Yi Wang 4d25996565 kvm: vmx: Fix -Wmissing-prototypes warnings
We get a warning when build kernel W=1:
arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes]
 void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)

Add the missing declaration to fix this.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:08 +02:00
Wanpeng Li 541e886f79 KVM: nVMX: Fix using __this_cpu_read() in preemptible context
BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/4590
  caller is nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
  CPU: 4 PID: 4590 Comm: qemu-system-x86 Tainted: G           OE     5.1.0-rc4+ #1
  Call Trace:
   dump_stack+0x67/0x95
   __this_cpu_preempt_check+0xd2/0xe0
   nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
   nested_vmx_run+0xda/0x2b0 [kvm_intel]
   handle_vmlaunch+0x13/0x20 [kvm_intel]
   vmx_handle_exit+0xbd/0x660 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xa2c/0x1e50 [kvm]
   kvm_vcpu_ioctl+0x3ad/0x6d0 [kvm]
   do_vfs_ioctl+0xa5/0x6e0
   ksys_ioctl+0x6d/0x80
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x6f/0x6c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Accessing per-cpu variable should disable preemption, this patch extends the
preemption disable region for __this_cpu_read().

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Fixes: 52017608da ("KVM: nVMX: add option to perform early consistency checks via H/W")
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:08 +02:00
Jim Mattson 382409b4c4 kvm: x86: Include CPUID leaf 0x8000001e in kvm's supported CPUID
Kvm now supports extended CPUID functions through 0x8000001f.  CPUID
leaf 0x8000001e is AMD's Processor Topology Information leaf. This
contains similar information to CPUID leaf 0xb (Intel's Extended
Topology Enumeration leaf), and should be included in the output of
KVM_GET_SUPPORTED_CPUID, even though userspace is likely to override
some of this information based upon the configuration of the
particular VM.

Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Fixes: 8765d75329 ("KVM: X86: Extend CPUID range to include new leaf")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:06 +02:00
Jim Mattson 32a243df82 kvm: x86: Include multiple indices with CPUID leaf 0x8000001d
Per the APM, "CPUID Fn8000_001D_E[D,C,B,A]X reports cache topology
information for the cache enumerated by the value passed to the
instruction in ECX, referred to as Cache n in the following
description. To gather information for all cache levels, software must
repeatedly execute CPUID with 8000_001Dh in EAX and ECX set to
increasing values beginning with 0 until a value of 00h is returned in
the field CacheType (EAX[4:0]) indicating no more cache descriptions
are available for this processor."

The termination condition is the same as leaf 4, so we can reuse that
code block for leaf 0x8000001d.

Fixes: 8765d75329 ("KVM: X86: Extend CPUID range to include new leaf")
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:06 +02:00
Sean Christopherson 21be4ca1ea KVM: nVMX: Clear nested_run_pending if setting nested state fails
VMX's nested_run_pending flag is subtly consumed when stuffing state to
enter guest mode, i.e. needs to be set according before KVM knows if
setting guest state is successful.  If setting guest state fails, clear
the flag as a nested run is obviously not pending.

Reported-by: Aaron Lewis <aaronlewis@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:03 +02:00
Paolo Bonzini db80927ea1 KVM: nVMX: really fix the size checks on KVM_SET_NESTED_STATE
The offset for reading the shadow VMCS is sizeof(*kvm_state)+VMCS12_SIZE,
so the correct size must be that plus sizeof(*vmcs12).  This could lead
to KVM reading garbage data from userspace and not reporting an error,
but is otherwise not sensitive.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24 21:27:02 +02:00
Linus Torvalds 0ef0fd3515 * ARM: support for SVE and Pointer Authentication in guests, PMU improvements
* POWER: support for direct access to the POWER9 XIVE interrupt controller,
 memory and performance optimizations.
 
 * x86: support for accessing memory not backed by struct page, fixes and refactoring
 
 * Generic: dirty page tracking improvements
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJc3qV/AAoJEL/70l94x66Dn3QH/jX1Bn0P/RZAIt4w0SySklSg
 PqxUKDyBQqB9vN9Qeb9jWXAKPH2CtM3+up/rz7oRnBWp7qA6vXcC/R/QJYAvzdXE
 nklsR/oYCsflR1KdlVYuDvvPCPP2fLBU5zfN83OsaBQ8fNRkm3gN+N5XQ2SbXbLy
 Mo9tybS4otY201UAC96e8N0ipwwyCRpDneQpLcl+F5nH3RBt63cVbs04O+70MXn7
 eT4I+8K3+Go7LATzT8hglD21D/7uvE31qQb6yr5L33IfhU4GB51RZzBXTNaAdY8n
 hT1rMrRkAMAFWYZPQDfoMadjWU3i5DIfstKjDxOr9oTfuOEp5Z+GvJwvVnUDg1I=
 =D0+p
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "ARM:
   - support for SVE and Pointer Authentication in guests
   - PMU improvements

  POWER:
   - support for direct access to the POWER9 XIVE interrupt controller
   - memory and performance optimizations

  x86:
   - support for accessing memory not backed by struct page
   - fixes and refactoring

  Generic:
   - dirty page tracking improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
  kvm: fix compilation on aarch64
  Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
  kvm: x86: Fix L1TF mitigation for shadow MMU
  KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
  KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
  KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
  KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
  kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
  tests: kvm: Add tests for KVM_SET_NESTED_STATE
  KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
  tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
  tests: kvm: Add tests to .gitignore
  KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
  KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
  KVM: Fix the bitmap range to copy during clear dirty
  KVM: arm64: Fix ptrauth ID register masking logic
  KVM: x86: use direct accessors for RIP and RSP
  KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
  KVM: x86: Omit caching logic for always-available GPRs
  kvm, x86: Properly check whether a pfn is an MMIO or not
  ...
2019-05-17 10:33:30 -07:00
Sean Christopherson f93f7ede08 Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
The RDPMC-exiting control is dependent on the existence of the RDPMC
instruction itself, i.e. is not tied to the "Architectural Performance
Monitoring" feature.  For all intents and purposes, the control exists
on all CPUs with VMX support since RDPMC also exists on all VCPUs with
VMX supported.  Per Intel's SDM:

  The RDPMC instruction was introduced into the IA-32 Architecture in
  the Pentium Pro processor and the Pentium processor with MMX technology.
  The earlier Pentium processors have performance-monitoring counters, but
  they must be read with the RDMSR instruction.

Because RDPMC-exiting always exists, KVM requires the control and refuses
to load if it's not available.  As a result, hiding the PMU from a guest
breaks nested virtualization if the guest attemts to use KVM.

While it's not explicitly stated in the RDPMC pseudocode, the VM-Exit
check for RDPMC-exiting follows standard fault vs. VM-Exit prioritization
for privileged instructions, e.g. occurs after the CPL/CR0.PE/CR4.PCE
checks, but before the counter referenced in ECX is checked for validity.

In other words, the original KVM behavior of injecting a #GP was correct,
and the KVM unit test needs to be adjusted accordingly, e.g. eat the #GP
when the unit test guest (L3 in this case) executes RDPMC without
RDPMC-exiting set in the unit test host (L2).

This reverts commit e51bfdb687.

Fixes: e51bfdb687 ("KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU")
Reported-by: David Hill <hilld@binarystorm.net>
Cc: Saar Amar <saaramar@microsoft.com>
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-15 23:19:19 +02:00
Kai Huang 61455bf262 kvm: x86: Fix L1TF mitigation for shadow MMU
Currently KVM sets 5 most significant bits of physical address bits
reported by CPUID (boot_cpu_data.x86_phys_bits) for nonpresent or
reserved bits SPTE to mitigate L1TF attack from guest when using shadow
MMU. However for some particular Intel CPUs the physical address bits
of internal cache is greater than physical address bits reported by
CPUID.

Use the kernel's existing boot_cpu_data.x86_cache_bits to determine the
five most significant bits. Doing so improves KVM's L1TF mitigation in
the unlikely scenario that system RAM overlaps the high order bits of
the "real" physical address space as reported by CPUID. This aligns with
the kernel's warnings regarding L1TF mitigation, e.g. in the above
scenario the kernel won't warn the user about lack of L1TF mitigation
if x86_cache_bits is greater than x86_phys_bits.

Also initialize shadow_nonpresent_or_rsvd_mask explicitly to make it
consistent with other 'shadow_{xxx}_mask', and opportunistically add a
WARN once if KVM's L1TF mitigation cannot be applied on a system that
is marked as being susceptible to L1TF.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-15 23:15:59 +02:00
Sean Christopherson d69129b4e4 KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
If L1 is using an MSR bitmap, unconditionally merge the MSR bitmaps from
L0 and L1 for MSR_{KERNEL,}_{FS,GS}_BASE.  KVM unconditionally exposes
MSRs L1.  If KVM is also running in L1 then it's highly likely L1 is
also exposing the MSRs to L2, i.e. KVM doesn't need to intercept L2
accesses.

Based on code from Jintack Lim.

Cc: Jintack Lim <jintack@xxxxxxxxxxxxxxx>
Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-15 22:53:44 +02:00
Linus Torvalds bfbfbf7368 More power management updates for 5.2-rc1
- Fix recent regression causing kernels built with CONFIG_PM
    unset to crash on systems that support the Performance and
    Energy Bias Hint (EPB) by avoiding to compile the EPB-related
    code depending on CONFIG_PM when it is unset (Rafael Wysocki).
 
  - Clean up the transition notifier invocation code in the cpufreq
    core and change some users of cpufreq transition notifiers
    accordingly (Viresh Kumar).
 
  - Change MAINTAINERS to cover the schedutil governor as part of
    cpufreq (Viresh Kumar).
 
  - Simplify cpufreq_init_policy() to avoid redundant computations
    (Yue Hu).
 
  - Add explanatory comment to the cpufreq core (Rafael Wysocki).
 
  - Introduce a new flag, GENPD_FLAG_RPM_ALWAYS_ON, to the generic
    power domains (genpd) framework along with the first user of it
    (Leonard Crestez).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAlzb4TASHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxiEAP/37uQOx+I8J3IU7HQcPIkdI1hgksLEzo
 g2eoREekjszIjFK9xa70X3V/QnGK4YSPQ/cHCjgXfVhwkO5TJzte5T5M2z9gUCDT
 7OMYWCI6hP6Mo5UWlP4dQ9Cqce4SB3TdibadevxcVOhFAW/xz42y5Gr6s4WkexJf
 Swb2uoLS4gGANyhUhx6XEZ5NpWZkWcK2ygZ8VJZETnoIwxMSUW7FTJkF+4s2tXLZ
 GH+F5jWAbwPlg6g2c54lPL1HtiAvK+/018aF8CZMqUBec94RHDFybVOlb5sacfQW
 +Y0W/mc/6SMqT3OUcQ0H3Z/qkgwR8mL01hH6gCP1jA5OBljmTjzk0Bbc4c3n9BEN
 aRy4M8Qc/GXzEBPO3Z9AlYik6ALH9iUgL2hewGZAFN8kn9ZGPAqYsctdCVkfKL1u
 4Esz5+wOsyYmBx910PozL+p2jbTH0x89sSo1qXUQr2JEiNm2iL4I4+ndqhuiq4LO
 sQPHCpe4HhYWzIQzJLDurv6hAxxU5PUsGg8XDEGlsyowIPDoIkMgC93RRLGZ/taY
 Ivc2FSlwLTSkzBHwVfckakXPvfyFdw8DFL2n66dQbXS9FFNshOF/TFx40iV42i5H
 wusyIZIT1y1H74De0EVntUho3xBo3nrrsu1o2NaXsTBoEsYwJiCji4yOZlI1Zh+m
 A9coiXKm4hY5
 =LqTN
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull more power management updates from Rafael Wysocki:
 "These fix a recent regression causing kernels built with CONFIG_PM
  unset to crash on systems that support the Performance and Energy Bias
  Hint (EPB), clean up the cpufreq core and some users of transition
  notifiers and introduce a new power domain flag into the generic power
  domains framework (genpd).

  Specifics:

   - Fix recent regression causing kernels built with CONFIG_PM unset to
     crash on systems that support the Performance and Energy Bias Hint
     (EPB) by avoiding to compile the EPB-related code depending on
     CONFIG_PM when it is unset (Rafael Wysocki).

   - Clean up the transition notifier invocation code in the cpufreq
     core and change some users of cpufreq transition notifiers
     accordingly (Viresh Kumar).

   - Change MAINTAINERS to cover the schedutil governor as part of
     cpufreq (Viresh Kumar).

   - Simplify cpufreq_init_policy() to avoid redundant computations (Yue
     Hu).

   - Add explanatory comment to the cpufreq core (Rafael Wysocki).

   - Introduce a new flag, GENPD_FLAG_RPM_ALWAYS_ON, to the generic
     power domains (genpd) framework along with the first user of it
     (Leonard Crestez)"

* tag 'pm-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  soc: imx: gpc: Use GENPD_FLAG_RPM_ALWAYS_ON for ERR009619
  PM / Domains: Add GENPD_FLAG_RPM_ALWAYS_ON flag
  cpufreq: Update MAINTAINERS to include schedutil governor
  cpufreq: Don't find governor for setpolicy drivers in cpufreq_init_policy()
  cpufreq: Explain the kobject_put() in cpufreq_policy_alloc()
  cpufreq: Call transition notifier only once for each policy
  x86: intel_epb: Take CONFIG_PM into account
2019-05-15 08:46:44 -07:00
Linus Torvalds 318222a35b Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few misc things and hotfixes

 - ocfs2

 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (139 commits)
  kernel/memremap.c: remove the unused device_private_entry_fault() export
  mm: delete find_get_entries_tag
  mm/huge_memory.c: make __thp_get_unmapped_area static
  mm/mprotect.c: fix compilation warning because of unused 'mm' variable
  mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
  mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
  mm/Kconfig: update "Memory Model" help text
  mm/vmscan.c: don't disable irq again when count pgrefill for memcg
  mm: memblock: make keeping memblock memory opt-in rather than opt-out
  hugetlbfs: always use address space in inode for resv_map pointer
  mm/z3fold.c: support page migration
  mm/z3fold.c: add structure for buddy handles
  mm/z3fold.c: improve compression by extending search
  mm/z3fold.c: introduce helper functions
  mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
  mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
  mm/vmscan.c: simplify shrink_inactive_list()
  fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  xen/privcmd-buf.c: convert to use vm_map_pages_zero()
  xen/gntdev.c: convert to use vm_map_pages()
  ...
2019-05-14 10:10:55 -07:00
Ira Weiny 73b0140bf0 mm/gup: change GUP fast to use flags rather than a write 'bool'
To facilitate additional options to get_user_pages_fast() change the
singular write parameter to be gup_flags.

This patch does not change any functionality.  New functionality will
follow in subsequent patches.

Some of the get_user_pages_fast() call sites were unchanged because they
already passed FOLL_WRITE or 0 for the write parameter.

NOTE: It was suggested to change the ordering of the get_user_pages_fast()
arguments to ensure that callers were converted.  This breaks the current
GUP call site convention of having the returned pages be the final
parameter.  So the suggestion was rejected.

Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-05-14 09:47:46 -07:00
Linus Torvalds fa4bff1650 Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 MDS mitigations from Thomas Gleixner:
 "Microarchitectural Data Sampling (MDS) is a hardware vulnerability
  which allows unprivileged speculative access to data which is
  available in various CPU internal buffers. This new set of misfeatures
  has the following CVEs assigned:

     CVE-2018-12126  MSBDS  Microarchitectural Store Buffer Data Sampling
     CVE-2018-12130  MFBDS  Microarchitectural Fill Buffer Data Sampling
     CVE-2018-12127  MLPDS  Microarchitectural Load Port Data Sampling
     CVE-2019-11091  MDSUM  Microarchitectural Data Sampling Uncacheable Memory

  MDS attacks target microarchitectural buffers which speculatively
  forward data under certain conditions. Disclosure gadgets can expose
  this data via cache side channels.

  Contrary to other speculation based vulnerabilities the MDS
  vulnerability does not allow the attacker to control the memory target
  address. As a consequence the attacks are purely sampling based, but
  as demonstrated with the TLBleed attack samples can be postprocessed
  successfully.

  The mitigation is to flush the microarchitectural buffers on return to
  user space and before entering a VM. It's bolted on the VERW
  instruction and requires a microcode update. As some of the attacks
  exploit data structures shared between hyperthreads, full protection
  requires to disable hyperthreading. The kernel does not do that by
  default to avoid breaking unattended updates.

  The mitigation set comes with documentation for administrators and a
  deeper technical view"

* 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/speculation/mds: Fix documentation typo
  Documentation: Correct the possible MDS sysfs values
  x86/mds: Add MDSUM variant to the MDS documentation
  x86/speculation/mds: Add 'mitigations=' support for MDS
  x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off
  x86/speculation/mds: Fix comment
  x86/speculation/mds: Add SMT warning message
  x86/speculation: Move arch_smt_update() call to after mitigation decisions
  x86/speculation/mds: Add mds=full,nosmt cmdline option
  Documentation: Add MDS vulnerability documentation
  Documentation: Move L1TF to separate directory
  x86/speculation/mds: Add mitigation mode VMWERV
  x86/speculation/mds: Add sysfs reporting for MDS
  x86/speculation/mds: Add mitigation control for MDS
  x86/speculation/mds: Conditionally clear CPU buffers on idle entry
  x86/kvm/vmx: Add MDS protection when L1D Flush is not active
  x86/speculation/mds: Clear CPU buffers on exit to user
  x86/speculation/mds: Add mds_clear_cpu_buffers()
  x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
  x86/speculation/mds: Add BUG_MSBDS_ONLY
  ...
2019-05-14 07:57:29 -07:00
Viresh Kumar df24014abe cpufreq: Call transition notifier only once for each policy
Currently, the notifiers are called once for each CPU of the policy->cpus
cpumask. It would be more optimal if the notifier can be called only
once and all the relevant information be provided to it. Out of the 23
drivers that register for the transition notifiers today, only 4 of them
do per-cpu updates and the callback for the rest can be called only once
for the policy without any impact.

This would also avoid multiple function calls to the notifier callbacks
and reduce multiple iterations of notifier core's code (which does
locking as well).

This patch adds pointer to the cpufreq policy to the struct
cpufreq_freqs, so the notifier callback has all the information
available to it with a single call. The five drivers which perform
per-cpu updates are updated to use the cpufreq policy. The freqs->cpu
field is redundant now and is removed.

Acked-by: David S. Miller <davem@davemloft.net> (sparc)
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-05-10 12:20:36 +02:00
Aaron Lewis 9b5db6c762 kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
nested_run_pending=1 implies we have successfully entered guest mode.
Move setting from external state in vmx_set_nested_state() until after
all other checks are complete.

Based on a patch by Aaron Lewis.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-08 14:14:08 +02:00
Aaron Lewis 332d079735 KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
Move call to nested_enable_evmcs until after free_nested() is complete.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-08 14:12:08 +02:00
Linus Torvalds eac7078a0f pidfd patches for v5.2-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE7btrcuORLb1XUhEwjrBW1T7ssS0FAlzReuoACgkQjrBW1T7s
 sS1uvBAA16pgnhRNxNTrp3LYft6lUWmF4n0baOTVtQNLhPjpwaOxHIrCBugkQCJB
 QcQ9IQSOvIkaEW0XAQoPBaeLviiKhHOFw1Fv89OtW6xUidSfSV15lcI9f1F2pCm2
 4yCL/8XvL6M0NhxiwftJAkWOXeDNLfjFnLwyLxBfgg3EeyqMgUB8raeosEID0ORR
 gm2/g8DYS2r+KNqM/F4xvMSgabfi2bGk+8BtAaVnftJfstpRNrqKwWnSK3Wspj1l
 5gkb8gSsiY6ns3V6RgNHrFlhevFg8V+VjcJt7FR+aUEjOkcoiXas/PhvamMzdsn/
 FM1F/A0pM8FSybIUClhnnnxNPc+p8ZN/71YQAPs+Mnh3xvbtKea2lkhC+Xv4OpK3
 edutSZWFaiIery82Rk00H3vqiSF1+kRIXSpZSS4mElk4FsVljkyH+nSP7rbmE2MR
 EQe+kKnZl8QzWrVbnODC+EVvvVpA2bXDvENJmvKqus+t2G0OdV7Iku3F5E3KjF8k
 S5RRV1zuBF3ugqnjmYrVmJtpEA8mxClmqvg6okru+qW6ngO5oOgVpPLjWn1CXcdj
 wcuQ6Pe1QwAHS54e9WSWgCHVssLvm9nCdCqypdNaoyGWmbTWntwlrY7Y0JUQnAbB
 6/G/DQQiCWY9y8bMZlTEydhIpgcsdROuPYv+oHF5+eQQthsWwHc=
 =LH11
 -----END PGP SIGNATURE-----

Merge tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull pidfd updates from Christian Brauner:
 "This patchset makes it possible to retrieve pidfds at process creation
  time by introducing the new flag CLONE_PIDFD to the clone() system
  call. Linus originally suggested to implement this as a new flag to
  clone() instead of making it a separate system call.

  After a thorough review from Oleg CLONE_PIDFD returns pidfds in the
  parent_tidptr argument. This means we can give back the associated pid
  and the pidfd at the same time. Access to process metadata information
  thus becomes rather trivial.

  As has been agreed, CLONE_PIDFD creates file descriptors based on
  anonymous inodes similar to the new mount api. They are made
  unconditional by this patchset as they are now needed by core kernel
  code (vfs, pidfd) even more than they already were before (timerfd,
  signalfd, io_uring, epoll etc.). The core patchset is rather small.
  The bulky looking changelist is caused by David's very simple changes
  to Kconfig to make anon inodes unconditional.

  A pidfd comes with additional information in fdinfo if the kernel
  supports procfs. The fdinfo file contains the pid of the process in
  the callers pid namespace in the same format as the procfs status
  file, i.e. "Pid:\t%d".

  To remove worries about missing metadata access this patchset comes
  with a sample/test program that illustrates how a combination of
  CLONE_PIDFD and pidfd_send_signal() can be used to gain race-free
  access to process metadata through /proc/<pid>.

  Further work based on this patchset has been done by Joel. His work
  makes pidfds pollable. It finished too late for this merge window. I
  would prefer to have it sitting in linux-next for a while and send it
  for inclusion during the 5.3 merge window"

* tag 'pidfd-v5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  samples: show race-free pidfd metadata access
  signal: support CLONE_PIDFD with pidfd_send_signal
  clone: add CLONE_PIDFD
  Make anon_inodes unconditional
2019-05-07 12:30:24 -07:00
Linus Torvalds 8ff468c29e Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU state handling updates from Borislav Petkov:
 "This contains work started by Rik van Riel and brought to fruition by
  Sebastian Andrzej Siewior with the main goal to optimize when to load
  FPU registers: only when returning to userspace and not on every
  context switch (while the task remains in the kernel).

  In addition, this optimization makes kernel_fpu_begin() cheaper by
  requiring registers saving only on the first invocation and skipping
  that in following ones.

  What is more, this series cleans up and streamlines many aspects of
  the already complex FPU code, hopefully making it more palatable for
  future improvements and simplifications.

  Finally, there's a __user annotations fix from Jann Horn"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
  x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails
  x86/pkeys: Add PKRU value to init_fpstate
  x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath
  x86/fpu: Add a fastpath to copy_fpstate_to_sigframe()
  x86/fpu: Add a fastpath to __fpu__restore_sig()
  x86/fpu: Defer FPU state load until return to userspace
  x86/fpu: Merge the two code paths in __fpu__restore_sig()
  x86/fpu: Restore from kernel memory on the 64-bit path too
  x86/fpu: Inline copy_user_to_fpregs_zeroing()
  x86/fpu: Update xstate's PKRU value on write_pkru()
  x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD
  x86/fpu: Always store the registers in copy_fpstate_to_sigframe()
  x86/entry: Add TIF_NEED_FPU_LOAD
  x86/fpu: Eager switch PKRU state
  x86/pkeys: Don't check if PKRU is zero before writing it
  x86/fpu: Only write PKRU if it is different from current
  x86/pkeys: Provide *pkru() helpers
  x86/fpu: Use a feature number instead of mask in two more helpers
  x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask
  x86/fpu: Add an __fpregs_load_activate() internal helper
  ...
2019-05-07 10:24:10 -07:00
Jim Mattson e8ab8d24b4 KVM: nVMX: Fix size checks in vmx_set_nested_state
The size checks in vmx_nested_state are wrong because the calculations
are made based on the size of a pointer to a struct kvm_nested_state
rather than the size of a struct kvm_nested_state.

Reported-by: Felix Wilhelm  <fwilhelm@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Drew Schmitt <dasch@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Fixes: 8fcc4b5923
Cc: stable@ver.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-01 00:43:44 +02:00
Paolo Bonzini e9c16c7850 KVM: x86: use direct accessors for RIP and RSP
Use specific inline functions for RIP and RSP instead of
going through kvm_register_read and kvm_register_write,
which are quite a mouthful.  kvm_rsp_read and kvm_rsp_write
did not exist, so add them.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 22:07:26 +02:00
Sean Christopherson 2b3eaf815c KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
... now that there is no overhead when using dedicated accessors.

Opportunistically remove a bogus "FIXME" in handle_rdmsr() regarding
the upper 32 bits of RAX and RDX.  Zeroing the upper 32 bits is
architecturally correct as 32-bit writes in 64-bit mode unconditionally
clear the upper 32 bits.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:56:27 +02:00
Sean Christopherson de3cd117ed KVM: x86: Omit caching logic for always-available GPRs
Except for RSP and RIP, which are held in VMX's VMCS, GPRs are always
treated "available and dirtly" on both VMX and SVM, i.e. are
unconditionally loaded/saved immediately before/after VM-Enter/VM-Exit.

Eliminating the unnecessary caching code reduces the size of KVM by a
non-trivial amount, much of which comes from the most common code paths.
E.g. on x86_64, kvm_emulate_cpuid() is reduced from 342 to 182 bytes and
kvm_emulate_hypercall() from 1362 to 1143, with the total size of KVM
dropping by ~1000 bytes.  With CONFIG_RETPOLINE=y, the numbers are even
more pronounced, e.g.: 353->182, 1418->1172 and well over 2000 bytes.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:56:12 +02:00
KarimAllah Ahmed 0c55671f84 kvm, x86: Properly check whether a pfn is an MMIO or not
pfn_valid check is not sufficient because it only checks if a page has a struct
page or not, if "mem=" was passed to the kernel some valid pages won't have a
struct page. This means that if guests were assigned valid memory that lies
after the mem= boundary it will be passed uncached to the guest no matter what
the guest caching attributes are for this memory.

Introduce a new function e820__mapped_raw_any which is equivalent to
e820__mapped_any but uses the original e820 unmodified and use it to
identify real *RAM*.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:46 +02:00
KarimAllah Ahmed e0bf2665ca KVM/nVMX: Use page_address_valid in a few more locations
Use page_address_valid in a few more locations that is already checking for
a page aligned address that does not cross the maximum physical address.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:44 +02:00
KarimAllah Ahmed dee9c04931 KVM/nVMX: Use kvm_vcpu_map for accessing the enlightened VMCS
Use kvm_vcpu_map for accessing the enlightened VMCS since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:40 +02:00
KarimAllah Ahmed 8892530598 KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS
Use kvm_vcpu_map for accessing the shadow VMCS since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzessutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:37 +02:00
KarimAllah Ahmed 8c5fbf1a72 KVM/nSVM: Use the new mapping API for mapping guest memory
Use the new mapping API for mapping guest memory to avoid depending on
"struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:49:34 +02:00
KarimAllah Ahmed 42e35f8072 KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated
Use kvm_vcpu_map in emulator_cmpxchg_emulated since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <kjonrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:45:28 +02:00
KarimAllah Ahmed 3278e04925 KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table
Use kvm_vcpu_map when mapping the posted interrupt descriptor table since
using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory
that has a "struct page".

One additional semantic change is that the virtual host mapping lifecycle
has changed a bit. It now has the same lifetime of the pinning of the
interrupt descriptor table page on the host side.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:45:21 +02:00
KarimAllah Ahmed 96c66e87de KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page
Use kvm_vcpu_map when mapping the virtual APIC page since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

One additional semantic change is that the virtual host mapping lifecycle
has changed a bit. It now has the same lifetime of the pinning of the
virtual APIC page on the host side.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:36:43 +02:00
KarimAllah Ahmed 31f0b6c4ba KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap
Use kvm_vcpu_map when mapping the L1 MSR bitmap since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:34:34 +02:00
KarimAllah Ahmed b146b83928 X86/nVMX: handle_vmptrld: Use kvm_vcpu_map when copying VMCS12 from guest memory
Use kvm_vcpu_map to the map the VMCS12 from guest memory because
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:59 +02:00
Filippo Sironi bd53cb35a3 X86/KVM: Handle PFNs outside of kernel reach when touching GPTEs
cmpxchg_gpte() calls get_user_pages_fast() to retrieve the number of
pages and the respective struct page to map in the kernel virtual
address space.
This doesn't work if get_user_pages_fast() is invoked with a userspace
virtual address that's backed by PFNs outside of kernel reach (e.g., when
limiting the kernel memory with mem= in the command line and using
/dev/mem to map memory).

If get_user_pages_fast() fails, look up the VMA that back the userspace
virtual address, compute the PFN and the physical address, and map it in
the kernel virtual address space with memremap().

Signed-off-by: Filippo Sironi <sironi@amazon.de>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:45 +02:00
KarimAllah Ahmed 3d5f6beb74 X86/nVMX: Update the PML table without mapping and unmapping the page
Update the PML table without mapping and unmapping the page. This also
avoids using kvm_vcpu_gpa_to_page(..) which assumes that there is a "struct
page" for guest memory.

As a side-effect of using kvm_write_guest_page the page is also properly
marked as dirty.

Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:38 +02:00
KarimAllah Ahmed 2e408936b6 X86/nVMX: handle_vmon: Read 4 bytes from guest memory
Read the data directly from guest memory instead of the map->read->unmap
sequence. This also avoids using kvm_vcpu_gpa_to_page() and kmap() which
assumes that there is a "struct page" for guest memory.

Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:28 +02:00
Borislav Petkov 191c8137a9 x86/kvm: Implement HWCR support
The hardware configuration register has some useful bits which can be
used by guests. Implement McStatusWrEn which can be used by guests when
injecting MCEs with the in-kernel mce-inject module.

For that, we need to set bit 18 - McStatusWrEn - first, before writing
the MCi_STATUS registers (otherwise we #GP).

Add the required machinery to do so.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: KVM <kvm@vger.kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:22 +02:00
Sean Christopherson 19e38336d7 KVM: VMX: Include architectural defs header in capabilities.h
The capabilities header depends on asm/vmx.h but doesn't explicitly
include said file.  This currently doesn't cause problems as all users
of capbilities.h first include asm/vmx.h, but the issue often results in
build errors if someone starts moving things around the VMX files.

Fixes: 3077c19108 ("KVM: VMX: Move capabilities structs and helpers to dedicated file")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:21 +02:00
Dan Carpenter d6a85c3223 KVM: vmx: clean up some debug output
Smatch complains about this:

    arch/x86/kvm/vmx/vmx.c:5730 dump_vmcs()
    warn: KERN_* level not at start of string

The code should be using pr_cont() instead of pr_err().

Fixes: 9d609649bb ("KVM: vmx: print more APICv fields in dump_vmcs")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-04-30 21:32:20 +02:00