-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJavDxYAAoJEAx081l5xIa+pCoP/iwjuxkSTdJpZUx5g0daGkCK
O18moGqGPChb7qJovfHqCKZ1f9PGulQt7SxwFzzJXNbv0PbfMA/Og0EhMLBImb+Q
VfYgq2vJLpmkikgcI5fBrzs9DRMQKKobGIzw24VS7IkPYA7d8KgAyywBwG0+LUFR
G3sobClgapsfaUcleb3ZOeDwymGkGCuuYRpYE4giHtuMDIxCWLePKJKOaOIq8o6P
A1557EvSbKuLQGI9X50jzJOoBE3TKRQYkzuM1GthdOF8RHaMNcFy44lDNO030HwZ
hzwAIg5Izhu16PqZGyEdIQ6SJTv3isRJWEciPnOsijvjl1li3ehMdQfhGISa/jZO
ivEGd32kaactiT0jJ5OyexergEViCPVKCIORksSIk46L84luDva9L22A3yu0mf3F
ixB63bAiLH7Py77kH3DmeJdqhMxlVZXCbdBVFDvzZvY4O3Mx0Dv9mmN/nw1FVCFH
scSYnXea9/o4IY5yGASU6FAUJEEGu20HAN12oHJw7/taqV/gbbEos3F7AGmjJE0f
qe6Rt/8fwi7Lhm2va6EoOo6yltH/gL4/AgnsN76VzppNGbaIv7W8Qa4Y/ES1lAE1
SATAEUJfU8kiLrVOolIElPbgfdJwv8TzoxiKB5wK/eoH20wf4BTmOuBMviaL2qXK
Sz6wihq+IlMXW7Y7pIl/
=DrA+
-----END PGP SIGNATURE-----
Merge tag 'drm-for-v4.17' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
"Cannonlake and Vega12 support are probably the two major things. This
pull lacks nouveau, Ben had some unforseen leave and a few other
blockers so we'll see how things look or maybe leave it for this merge
window.
core:
- Device links to handle sound/gpu pm dependency
- Color encoding/range properties
- Plane clipping into plane check helper
- Backlight helpers
- DP TP4 + HBR3 helper support
amdgpu:
- Vega12 support
- Enable DC by default on all supported GPUs
- Powerplay restructuring and cleanup
- DC bandwidth calc updates
- DC backlight on pre-DCE11
- TTM backing store dropping support
- SR-IOV fixes
- Adding "wattman" like functionality
- DC crc support
- Improved DC dual-link handling
amdkfd:
- GPUVM support for dGPU
- KFD events for dGPU
- Enable PCIe atomics for dGPUs
- HSA process eviction support
- Live-lock fixes for process eviction
- VM page table allocation fix for large-bar systems
panel:
- Raydium RM68200
- AUO G104SN02 V2
- KEO TX31D200VM0BAA
- ARM Versatile panels
i915:
- Cannonlake support enabled
- AUX-F port support added
- Icelake base enabling until internal milestone of forcewake support
- Query uAPI interface (used for GPU topology information currently)
- Compressed framebuffer support for sprites
- kmem cache shrinking when GPU is idle
- Avoid boosting GPU when waited item is being processed already
- Avoid retraining LSPCON link unnecessarily
- Decrease request signaling latency
- Deprecation of I915_SET_COLORKEY_NONE
- Kerneldoc and compiler warning cleanup for upcoming CI enforcements
- Full range ycbcr toggling
- HDCP support
i915/gvt:
- Big refactor for shadow ppgtt
- KBL context save/restore via LRI cmd (Weinan)
- Properly unmap dma for guest page (Changbin)
vmwgfx:
- Lots of various improvements
etnaviv:
- Use the drm gpu scheduler
- prep work for GC7000L support
vc4:
- fix alpha blending
- Expose perf counters to userspace
pl111:
- Bandwidth checking/limiting
- Versatile panel support
sun4i:
- A83T HDMI support
- A80 support
- YUV plane support
- H3/H5 HDMI support
omapdrm:
- HPD support for DVI connector
- remove lots of static variables
msm:
- DSI updates from 10nm / SDM845
- fix for race condition with a3xx/a4xx fence completion irq
- some refactoring/prep work for eventual a6xx support (ie. when we
have a userspace)
- a5xx debugfs enhancements
- some mdp5 fixes/cleanups to prepare for eventually merging
writeback
- support (ie. when we have a userspace)
tegra:
- mmap() fixes for fbdev devices
- Overlay plane for hw cursor fix
- dma-buf cache maintenance support
mali-dp:
- YUV->RGB conversion support
rockchip:
- rk3399/chromebook fixes and improvements
rcar-du:
- LVDS support move to drm bridge
- DT bindings for R8A77995
- Driver/DT support for R8A77970
tilcdc:
- DRM panel support"
* tag 'drm-for-v4.17' of git://people.freedesktop.org/~airlied/linux: (1646 commits)
drm/i915: Fix hibernation with ACPI S0 target state
drm/i915/execlists: Use a locked clear_bit() for synchronisation with interrupt
drm/i915: Specify which engines to reset following semaphore/event lockups
drm/i915/dp: Write to SET_POWER dpcd to enable MST hub.
drm/amdkfd: Use ordered workqueue to restore processes
drm/amdgpu: Fix acquiring VM on large-BAR systems
drm/amd/pp: clean header file hwmgr.h
drm/amd/pp: use mlck_table.count for array loop index limit
drm: Fix uabi regression by allowing garbage mode->type from userspace
drm/amdgpu: Add an ATPX quirk for hybrid laptop
drm/amdgpu: fix spelling mistake: "asssert" -> "assert"
drm/amd/pp: Add new asic support in pp_psm.c
drm/amd/pp: Clean up powerplay code on Vega12
drm/amd/pp: Add smu irq handlers for legacy asics
drm/amd/pp: Fix set wrong temperature range on smu7
drm/amdgpu: Don't change preferred domian when fallback GTT v5
drm/vmwgfx: Bump version patchlevel and date
drm/vmwgfx: use monotonic event timestamps
drm/vmwgfx: Unpin the screen object backup buffer when not used
drm/vmwgfx: Stricter count of legacy surface device resources
...
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull perf fixes from Ingo Molnar:
"Two fixlets"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/hwbp: Simplify the perf-hwbp code, fix documentation
perf/x86/intel: Fix linear IP of PEBS real_ip on Haswell and later CPUs
Pull x86 PTI fixes from Ingo Molnar:
"Two fixes: a relatively simple objtool fix that makes Clang built
kernels work with ORC debug info, plus an alternatives macro fix"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Fixup alternative_call_2
objtool: Add Clang support
Trivial fix to spelling mistake in the pr_err_once() error message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-janitors@vger.kernel.org
Link: http://lkml.kernel.org/r/20180313154709.1015-1-colin.king@canonical.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some .<target>.cmd files under arch/x86 are showing two instances of
-D__KERNEL__, like arch/x86/boot/ and arch/x86/realmode/rm/.
__KERNEL__ is already defined in KBUILD_CPPFLAGS in the top Makefile,
so it can be dropped safely.
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kbuild@vger.kernel.org
Link: http://lkml.kernel.org/r/20180316084944.3997-1-caoj.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PPC:
- Fix a bug causing occasional machine check exceptions on POWER8 hosts
(introduced in 4.16-rc1)
x86:
- Fix a guest crashing regression with nested VMX and restricted guest
(introduced in 4.16-rc1)
- Fix dependency check for pv tlb flush (The wrong dependency that
effectively disabled the feature was added in 4.16-rc4, the original
feature in 4.16-rc1, so it got decent testing.)
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJavUt5AAoJEED/6hsPKofo8uQH/RuijrsAIUnymkYY+6BYFXlh
Ri8qhG8VB+C3SpWEtsqcqNVkjJTepCD2Ej5BJTL4Gc9BSTWy7Ht6kqskEgwcnzu2
xRfkg0q0vTj1+GDd+UiTZfxiinoHtB9x3fiXali5UNTCd1fweLxdidETfO+GqMMq
KDhTR+S8dXE5VG7r+iJ80LZPtHQJ94f0fh9XpQk3X2ExTG5RBxag1U2nCfiKRAZk
xRv1CNAxNaBxS38CgYfHzg31NJx38fnq/qREsIdOx0Ju9WQkglBFkhLAGUb4vL0I
nn8YX/oV9cW2G8tyPWjC245AouABOLbzu0xyj5KgCY/z1leA9tdLFX/ET6Zye+E=
=++uZ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"PPC:
- Fix a bug causing occasional machine check exceptions on POWER8
hosts (introduced in 4.16-rc1)
x86:
- Fix a guest crashing regression with nested VMX and restricted
guest (introduced in 4.16-rc1)
- Fix dependency check for pv tlb flush (the wrong dependency that
effectively disabled the feature was added in 4.16-rc4, the
original feature in 4.16-rc1, so it got decent testing)"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: Fix pv tlb flush dependencies
KVM: nVMX: sync vmcs02 segment regs prior to vmx_set_cr0
KVM: PPC: Book3S HV: Fix duplication of host SLB entries
This is a cosmetic patch that deals with the address filter structure's
ambiguous fields 'filter' and 'range'. The former stands to mean that the
filter's *action* should be to filter the traces to its address range if
it's set or stop tracing if it's unset. This is confusing and hard on the
eyes, so this patch replaces it with 'action' enum. The 'range' field is
completely redundant (meaning that the filter is an address range as
opposed to a single address trigger), as we can use zero size to mean the
same thing.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/20180329120648.11902-1-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When vCPU runs L2 and there is a pending event that requires to exit
from L2 to L1 and nested_run_pending=1, vcpu_enter_guest() will request
an immediate-exit from L2 (See req_immediate_exit).
Since now handling of req_immediate_exit also makes sure to set
KVM_REQ_EVENT, there is no need to also set it on vmx_vcpu_run() when
nested_run_pending=1.
This optimizes cases where VMRESUME was executed by L1 to enter L2 and
there is no pending events that require exit from L2 to L1. Previously,
this would have set KVM_REQ_EVENT unnecessarly.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
In case L2 VMExit to L0 during event-delivery, VMCS02 is filled with
IDT-vectoring-info which vmx_complete_interrupts() makes sure to
reinject before next resume of L2.
While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
to the L1 vCPU which currently runs L2 and exited to L0.
When L0 will reach vcpu_enter_guest() and call inject_pending_event(),
it will note that a previous event was re-injected to L2 (by
IDT-vectoring-info) and therefore won't check if there are pending L1
events which require exit from L2 to L1. Thus, L0 enters L2 without
immediate VMExit even though there are pending L1 events!
This commit fixes the issue by making sure to check for L1 pending
events even if a previous event was reinjected to L2 and bailing out
from inject_pending_event() before evaluating a new pending event in
case an event was already reinjected.
The bug was observed by the following setup:
* L0 is a 64CPU machine which runs KVM.
* L1 is a 16CPU machine which runs KVM.
* L0 & L1 runs with APICv disabled.
(Also reproduced with APICv enabled but easier to analyze below info
with APICv disabled)
* L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
During L2 boot, L1 hangs completely and analyzing the hang reveals that
one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
that he has sent for another L1 vCPU. And all other L1 vCPUs are
currently attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck
forever (as L1 runs with kernel-preemption disabled).
Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
series of events:
(1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
(2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
vec 252 (Fixed|edge)
(3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
(4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
(5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
rip 0xffffe00069202690 info 83 0
(6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
(7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
ext_int: 0x00000000 ext_int_err: 0x00000000
(8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
Which can be analyzed as follows:
(1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
a pending-interrupt of 0xd2.
(2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
vCPU 15. This will set relevant bit in LAPIC's IRR and set KVM_REQ_EVENT.
(3) L0 reach vcpu_enter_guest() which calls inject_pending_event() which
notes that interrupt 0xd2 was reinjected and therefore calls
vmx_inject_irq() and returns. Without checking for pending L1 events!
Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
before calling inject_pending_event().
(4) L0 resumes L2 without immediate-exit even though there is a pending
L1 event (The IPI pending in LAPIC's IRR).
We have already reached the buggy scenario but events could be
furthered analyzed:
(5+6) L2 VMExit to L0 on EPT_VIOLATION. This time not during
event-delivery.
(7) L0 decides to forward the VMExit to L1 for further handling.
(8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
LAPIC's IRR is not examined and therefore the IPI is still not delivered
into L1!
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The reason that exception.pending should block re-injection of
NMI/interrupt is not described correctly in comment in code.
Instead, it describes why a pending exception should be injected
before a pending NMI/interrupt.
Therefore, move currently present comment to code-block evaluating
a new pending event which explains why exception.pending is evaluated
first.
In addition, create a new comment describing that exception.pending
blocks re-injection of NMI/interrupt because the exception was
queued by handling vmexit which was due to NMI/interrupt delivery.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@orcle.com>
[Used a comment from Sean J <sean.j.christopherson@intel.com>. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
For exceptions & NMIs events, KVM code use the following
coding convention:
*) "pending" represents an event that should be injected to guest at
some point but it's side-effects have not yet occurred.
*) "injected" represents an event that it's side-effects have already
occurred.
However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when it's side-effects
have already taken place (For example, bit moved from LAPIC IRR to
ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.
This change follows logic of previous commit 664f8e26b0 ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exception to follow this coding convention as well.
It is important to note that in case !lapic_in_kernel(vcpu),
interrupt.pending usage was and still incorrect.
In this case, interrrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt
from QEMU's emulated LAPIC which reset bit in IRR and set bit in ISR
before sending ioctl to KVM. So it seems that indeed "interrupt.pending"
in this case is also suppose to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
is misusing (now named) interrupt.injected in order to return if
there is a pending interrupt.
This leads to nVMX/nSVM not be able to distinguish if it should exit
from L2 to L1 on EXTERNAL_INTERRUPT on pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.
This patch introduce no semantics change.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
kvm_inject_realmode_interrupt() is called from one of the injection
functions which writes event-injection to VMCS: vmx_queue_exception(),
vmx_inject_irq() and vmx_inject_nmi().
All these functions are called just to cause an event-injection to
guest. They are not responsible of manipulating the event-pending
flag. The only purpose of kvm_inject_realmode_interrupt() should be
to emulate real-mode interrupt-injection.
This was also incorrect when called from vmx_queue_exception().
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Enlightened VMCS is just a structure in memory, the main benefit
besides avoiding somewhat slower VMREAD/VMWRITE is using clean field
mask: we tell the underlying hypervisor which fields were modified
since VMEXIT so there's no need to inspect them all.
Tight CPUID loop test shows significant speedup:
Before: 18890 cycles
After: 8304 cycles
Static key is being used to avoid performance penalty for non-Hyper-V
deployments.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
TLFS 5.0 says: "Support for an enlightened VMCS interface is reported with
CPUID leaf 0x40000004. If an enlightened VMCS interface is supported,
additional nested enlightenments may be discovered by reading the CPUID
leaf 0x4000000A (see 2.4.11)."
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The definitions are according to the Hyper-V TLFS v5.0. KVM on Hyper-V will
use these.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Virtual Processor Assist Pages usage allows us to do optimized EOI
processing for APIC, enable Enlightened VMCS support in KVM and more.
struct hv_vp_assist_page is defined according to the Hyper-V TLFS v5.0b.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The assist page has been used only for the paravirtual EOI so far, hence
the "APIC" in the MSR name. Renaming to match the Hyper-V TLFS where it's
called "Virtual VP Assist MSR".
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
mshyperv.h now only contains fucntions/variables we define in kernel, all
definitions from TLFS should go to hyperv-tlfs.h.
'enum hv_cpuid_function' is removed as we already have this info in
hyperv-tlfs.h, code in mshyperv.c is adjusted accordingly.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
hyperv.h is not part of uapi, there are no (known) users outside of kernel.
We are making changes to this file to match current Hyper-V Hypervisor
Top-Level Functional Specification (TLFS, see:
https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs)
and we don't want to maintain backwards compatibility.
Move the file renaming to hyperv-tlfs.h to avoid confusing it with
mshyperv.h. In future, all definitions from TLFS should go to it and
all kernel objects should go to mshyperv.h or include/linux/hyperv.h.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
static_key_disable_cpuslocked(): static key 'virt_spin_lock_key+0x0/0x20' used before call to jump_label_init()
WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:161 static_key_disable_cpuslocked+0x61/0x80
RIP: 0010:static_key_disable_cpuslocked+0x61/0x80
Call Trace:
static_key_disable+0x16/0x20
start_kernel+0x192/0x4b3
secondary_startup_64+0xa5/0xb0
Qspinlock will be choosed when dedicated pCPUs are available, however, the
static virt_spin_lock_key is set in kvm_spinlock_init() before jump_label_init()
has been called, which will result in a WARN(). This patch fixes it by delaying
the virt_spin_lock_key setup to .smp_prepare_cpus().
Reported-by: Davidlohr Bueso <dbueso@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Fixes: b2798ba0b8 ("KVM: X86: Choose qspinlock when dedicated physical CPUs are available")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Bring the PLE(pause loop exit) logic to AMD svm driver.
While testing, we found this helping in situations where numerous
pauses are generated. Without these patches we could see continuos
VMEXITS due to pause interceptions. Tested it on AMD EPYC server with
boot parameter idle=poll on a VM with 32 vcpus to simulate extensive
pause behaviour. Here are VMEXITS in 10 seconds interval.
Pauses 810199 504
Total 882184 325415
Signed-off-by: Babu Moger <babu.moger@amd.com>
[Prevented the window from dropping below the initial value. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This patch adds the support for pause filtering threshold. This feature
support is indicated by CPUID Fn8000_000A_EDX. See AMD APM Vol 2 Section
15.14.4 Pause Intercept Filtering for more details.
In this mode, a 16-bit pause filter threshold field is added in VMCB.
The threshold value is a cycle count that is used to reset the pause
counter. As with simple pause filtering, VMRUN loads the pause count
value from VMCB into an internal counter. Then, on each pause instruction
the hardware checks the elapsed number of cycles since the most recent
pause instruction against the pause Filter Threshold. If the elapsed cycle
count is greater than the pause filter threshold, then the internal pause
count is reloaded from VMCB and execution continues. If the elapsed cycle
count is less than the pause filter threshold, then the internal pause
count is decremented. If the count value is less than zero and pause
intercept is enabled, a #VMEXIT is triggered. If advanced pause filtering
is supported and pause filter threshold field is set to zero, the filter
will operate in the simpler, count only mode.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This patch brings some of the code from vmx to x86.h header file. Now, we
can share this code between vmx and svm. Modified couple functions to make
it common.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Get rid of ple_window_actual_max, because its benefits are really
minuscule and the logic is complicated.
The overflows(and underflow) are controlled in __ple_window_grow
and _ple_window_shrink respectively.
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Babu Moger <babu.moger@amd.com>
[Fixed potential wraparound and change the max to UINT_MAX. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The vmx module parameters are supposed to be unsigned variants.
Also fixed the checkpatch errors like the one below.
WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
+module_param(ple_gap, uint, S_IRUGO);
Signed-off-by: Babu Moger <babu.moger@amd.com>
[Expanded uint to unsigned int in code. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This reverts commit 4b1e84276a.
Software uses the valid bits to decide if the values can be used for
further processing or other actions. So setting the valid bits will have
software act on values that it shouldn't be acting on.
The recommendation to save all the register values does not mean that
the values are always valid.
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: tony.luck@intel.com
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Cc: bp@suse.de
Cc: linux-edac@vger.kernel.org
Link: https://lkml.kernel.org/r/20180326191526.64314-1-Yazen.Ghannam@amd.com
A critical error was found testing the fixed UV4 HUB in that an MMR address
was found to be incorrect. This causes the virtual address space for
accessing the MMIOH1 region to be allocated with the incorrect size.
Fixes: 673aa20c55 ("x86/platform/UV: Update uv_mmrs.h to prepare for UV4A fixes")
Signed-off-by: Mike Travis <travis@sgi.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com>
Cc: Russ Anderson <russ.anderson@hpe.com>
Cc: Andrew Banman <andrew.banman@hpe.com>
Link: https://lkml.kernel.org/r/20180328174011.041801248@stormcage.americas.sgi.com
KVM and perf have a special backdoor mechanism to report the IP for interrupts
re-executed after vm exit. This works for the NMIs that perf normally uses.
However when perf is in timer mode it doesn't work because the timer interrupt
doesn't get this special treatment. This is common when KVM is running
nested in another hypervisor which may not implement the PMU, so only
timer mode is available.
Call the functions to set up the backdoor IP also for non NMI interrupts.
I renamed the functions to set up the backdoor IP reporting to be more
appropiate for their new use. The SVM change is only compile tested.
v2: Moved the functions inline.
For the normal interrupt case the before/after functions are now
called from x86.c, not arch specific code.
For the NMI case we still need to call it in the architecture
specific code, because it's already needed in the low level *_run
functions.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[Removed unnecessary calls from arch handle_external_intr. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
PV TLB FLUSH can only be turned on when steal time is enabled.
The condition got reversed during conflict resolution.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Fixes: 4f2f61fc50 ("KVM: X86: Avoid traversing all the cpus for pv tlb flush when steal time is disabled")
[Rebased on top of kvm/master and reworded the commit message. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
In arch/x86/boot/compressed/kaslr_64.c, CONFIG_AMD_MEM_ENCRYPT support was
initially #undef'd to support SME with minimal effort. When support for
SEV was added, the #undef remained and some minimal support for setting the
encryption bit was added for building identity mapped pagetable entries.
Commit b83ce5ee91 ("x86/mm/64: Make __PHYSICAL_MASK_SHIFT always 52")
changed __PHYSICAL_MASK_SHIFT from 46 to 52 in support of 5-level paging.
This change resulted in SEV guests failing to boot because the encryption
bit was no longer being automatically masked out. The compressed boot
path now requires sme_me_mask to be defined in order for the pagetable
functions, such as pud_present(), to properly mask out the encryption bit
(currently bit 47) when evaluating pagetable entries.
Add an sme_me_mask variable in arch/x86/boot/compressed/mem_encrypt.S,
which is set when SEV is active, delete the #undef CONFIG_AMD_MEM_ENCRYPT
from arch/x86/boot/compressed/kaslr_64.c and use sme_me_mask when building
the identify mapped pagetable entries.
Fixes: b83ce5ee91 ("x86/mm/64: Make __PHYSICAL_MASK_SHIFT always 52")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180327220711.8702.55842.stgit@tlendack-t1.amdoffice.net
BAU uses the old alloc_initr_gate90 method to setup its interrupt. This
fails silently as the BAU vector is in the range of APIC vectors that are
registered to the spurious interrupt handler. As a consequence BAU
broadcasts are not handled, and the broadcast source CPU hangs.
Update BAU to use new idt structure.
Fixes: dc20b2d526 ("x86/idt: Move interrupt gate initialization to IDT code")
Signed-off-by: Andrew Banman <abanman@hpe.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Mike Travis <mike.travis@hpe.com>
Cc: Dimitri Sivanich <sivanich@hpe.com>
Cc: Russ Anderson <rja@hpe.com>
Cc: stable@vger.kernel.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/1522188546-196177-1-git-send-email-abanman@hpe.com
When changing rdmsr_safe_on_cpu() to schedule, it was missed that
__rdmsr_safe_on_cpu() was also used by rdmsrl_safe_on_cpu()
Make rdmsrl_safe_on_cpu() a wrapper instead of copy/pasting the code which
was added for the completion handling.
Fixes: 07cde313b2 ("x86/msr: Allow rdmsr_safe_on_cpu() to schedule")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180328032233.153055-1-edumazet@google.com
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJauCZfAAoJEHm+PkMAQRiGWGUH/2rhdQDkoJpYWnjaQkolECG8
MxpGE7nmIIHxQcbSDdHTGJ8IhVm6Z5wZ7ym/PwCDTT043Y1y341sJrIwL2/nTG6d
HVidk8hFvgN6QzlzVAHT3ZZMII/V9Zt+VV5SUYLGnPAVuJNHo/6uzWlTU5g+NTFo
IquFDdQUaGBlkKqby+NoAFnkV1UAIkW0g22cfvPnlO5GMer0gusGyVNvVp7TNj3C
sqj4Hvt3RMDLMNe9RZ2pFTiOD096n8FWpYftZneUTxFImhRV3Jg5MaaYZm9SI3HW
tXrv/LChT/F1mi5Pkx6tkT5Hr8WvcrwDMJ4It1kom10RqWAgjxIR3CMm448ileY=
=YKUG
-----END PGP SIGNATURE-----
Backmerge tag 'v4.16-rc7' into drm-next
Linux 4.16-rc7
This was requested by Daniel, and things were getting
a bit hard to reconcile, most of the conflicts were
trivial though.
High latencies can be observed caused by a daemon periodically reading
CPUID on all cpus. On KASAN enabled kernels ~10ms latencies can be
observed. Even without KASAN, sending an IPI to a CPU, which is in a deep
sleep state or in a long hard IRQ disabled section, waiting for the answer
can consume hundreds of microseconds.
cpuid_read() is invoked in preemptible context, so it can be converted to
sleep instead of busy wait.
Switching to smp_call_function_single_async() and a completion allows to
reschedule and reduces CPU usage and latencies.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Link: https://lkml.kernel.org/r/20180323215818.127774-2-edumazet@google.com
High latencies can be observed caused by a daemon periodically reading
various MSR on all cpus. On KASAN enabled kernels ~10ms latencies can be
observed simply reading one MSR. Even without KASAN, sending an IPI to a
CPU, which is in a deep sleep state or in a long hard IRQ disabled section,
waiting for the answer can consume hundreds of microseconds.
All usage sites are in preemptible context, convert rdmsr_safe_on_cpu() to
use a completion instead of busy polling.
Overall daemon cpu usage was reduced by 35 %, and latencies caused by
msr_read() disappeared.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Link: https://lkml.kernel.org/r/20180323215818.127774-1-edumazet@google.com
As Kai pointed out, the primary reason for adjusting x86_phys_bits is to
reflect that the the address space is reduced and not the ability to
communicate the available physical address space to virtual machines.
Suggested-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: linux-mm@kvack.org
Link: https://lkml.kernel.org/r/20180315134907.9311-2-kirill.shutemov@linux.intel.com
dmi_get_bios_year() may return 0 when it cannot parse
the BIOS date string. Previously this has been checked in
pci_acpi_crs_quirks().
Update the code to restore old behaviour.
Reported-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: linux-acpi@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Fixes: 69c42d493d ("x86/pci: Simplify code by using the new dmi_get_bios_year() helper")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The file Documentation/x86/early-microcode.txt was renamed to
Documentation/x86/microcode.txt in 0e3258753f, but it was still
referenced by its old name in a three places:
* Documentation/x86/00-INDEX
* arch/x86/Kconfig
* arch/x86/kernel/cpu/microcode/amd.c
This commit updates these references accordingly.
Fixes: 0e3258753f ("x86/microcode: Document the three loading methods")
Signed-off-by: Jaak Ristioja <jaak@ristioja.ee>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The following pattern fails to compile while the same pattern
with alternative_call() does:
if (...)
alternative_call_2(...);
else
alternative_call_2(...);
as it expands into
if (...)
{
}; <===
else
{
};
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20180114120504.GA11368@avx2
node_memmap_size_bytes() has been unused since the v3.9 kernel, so remove it.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Fixes: f03574f2d5 ("x86-32, mm: Rip out x86_32 NUMA remapping code")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803262325540.256524@chino.kir.corp.google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
this patch fix a bug in how the pebs->real_ip is handled in the PEBS
handler. real_ip only exists in Haswell and later processor. It is
actually the eventing IP, i.e., where the event occurred. As opposed
to the pebs->ip which is the PEBS interrupt IP which is always off
by one.
The problem is that the real_ip just like the IP needs to be fixed up
because PEBS does not record all the machine state registers, and
in particular the code segement (cs). This is why we have the set_linear_ip()
function. The problem was that set_linear_ip() was only used on the pebs->ip
and not the pebs->real_ip.
We have profiles which ran into invalid callstacks because of this.
Here is an example:
..... 0: ffffffffffffff80 recent entry, marker kernel v
..... 1: 000000000040044d <= user address in kernel space!
..... 2: fffffffffffffe00 marker enter user v
..... 3: 000000000040044d
..... 4: 00000000004004b6 oldest entry
Debugging output in get_perf_callchain():
[ 857.769909] CALLCHAIN: CPU8 ip=40044d regs->cs=10 user_mode(regs)=0
The problem is that the kernel entry in 1: points to a user level
address. How can that be?
The reason is that with PEBS sampling the instruction that caused the event
to occur and the instruction where the CPU was when the interrupt was posted
may be far apart. And sometime during that time window, the privilege level may
change. This happens, for instance, when the PEBS sample is taken close to a
kernel entry point. Here PEBS, eventing IP (real_ip) captured a user level
instruction. But by the time the PMU interrupt fired, the processor had already
entered kernel space. This is why the debug output shows a user address with
user_mode() false.
The problem comes from PEBS not recording the code segment (cs) register.
The register is used in x86_64 to determine if executing in kernel vs user
space. This is okay because the kernel has a software workaround called
set_linear_ip(). But the issue in setup_pebs_sample_data() is that
set_linear_ip() is never called on the real_ip value when it is available
(Haswell and later) and precise_ip > 1.
This patch fixes this problem and eliminates the callchain discrepancy.
The patch restructures the code around set_linear_ip() to minimize the number
of times the IP has to be set.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1521788507-10231-1-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current x86 Device Tree implementation does not support multiprocessing.
Use new DT bindings to describe the processors.
Signed-off-by: Ivan Gorinov <ivan.gorinov@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Link: https://lkml.kernel.org/r/c291fb2cef51b730b59916d7745be0eaa4378c6c.1521753738.git.ivan.gorinov@intel.com
Some functions definitions have either the initial open brace and/or
the closing brace outside of column 1.
Move those braces to column 1.
This allows various function analyzers like gnu complexity to work
properly for these modified functions.
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Nicolin Chen <nicoleotsuka@gmail.com>
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Pull x86 and PTI fixes from Ingo Molnar:
"Misc fixes:
- fix EFI pagetables freeing
- fix vsyscall pagetable setting on Xen PV guests
- remove ancient CONFIG_X86_PPRO_FENCE=y - x86 is TSO again
- fix two binutils (ld) development version related incompatibilities
- clean up breakpoint handling
- fix an x86 self-test"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Don't use IST entry for #BP stack
x86/efi: Free efi_pgd with free_pages()
x86/vsyscall/64: Use proper accessor to update P4D entry
x86/cpu: Remove the CONFIG_X86_PPRO_FENCE=y quirk
x86/boot/64: Verify alignment of the LOAD segment
x86/build/64: Force the linker to use 2MB page size
selftests/x86/ptrace_syscall: Fix for yet more glibc interference
The kernel build system already takes care of generating the dependency
files. Having the additional -MD in KBUILD_CFLAGS leads to stray
.<pid>.d files in the build directory when we call the cc-option macro.
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: http://lkml.kernel.org/r/alpine.LNX.2.21.1803242219380.30139@titan.int.lan.stealer.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the cherry-picked perf/urgent commit merged separately we can now
merge all the fixes without conflicts.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's nothing IST-worthy about #BP/int3. We don't allow kprobes
in the small handful of places in the kernel that run at CPL0 with
an invalid stack, and 32-bit kernels have used normal interrupt
gates for #BP forever.
Furthermore, we don't allow kprobes in places that have usergs while
in kernel mode, so "paranoid" is also unnecessary.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
"rep_done" is always zero so the "(((u64)rep_done & 0xfff) << 32)"
expression is just zero. We can remove the "res" temporary variable as
well and just use "ret" directly.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 99770737ca ("x86/asm/tsc: Add rdtscll() merge helper") added
rdtscll() in August 2015 along with the comment:
/* Deprecated, keep it for a cycle for easier merging: */
12 cycles later it's really overdue for removal.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add struct kvm_svm, which is analagous to struct vcpu_svm, along with
a helper to_kvm_svm() to retrieve kvm_svm from a struct kvm *. Move
the SVM specific variables and struct definitions out of kvm_arch
and into kvm_svm.
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add struct kvm_vmx, which wraps struct kvm, and a helper to_kvm_vmx()
that retrieves 'struct kvm_vmx *' from 'struct kvm *'. Move the VMX
specific variables out of kvm_arch and into kvm_vmx.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add kvm_x86_ops->set_identity_map_addr and set ept_identity_map_addr
in VMX specific code so that ept_identity_map_addr can be moved out
of 'struct kvm_arch' in a future patch.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Define kvm_arch_[alloc|free]_vm in x86 as pass through functions
to new kvm_x86_ops vm_alloc and vm_free, and move the current
allocation logic as-is to SVM and VMX. Vendor specific alloc/free
functions set the stage for SVM/VMX wrappers of 'struct kvm',
which will allow us to move the growing number of SVM/VMX specific
member variables out of 'struct kvm_arch'.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Segment registers must be synchronized prior to any code that may
trigger a call to emulation_required()/guest_state_valid(), e.g.
vmx_set_cr0(). Because preparing vmcs02 writes segmentation fields
directly, i.e. doesn't use vmx_set_segment(), emulation_required
will not be re-evaluated when synchronizing the segment registers,
which can result in L0 incorrectly starting emulation of L2.
Fixes: 8665c3f973 ("KVM: nVMX: initialize descriptor cache fields in prepare_vmcs02_full")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Move all of prepare_vmcs02_full earlier, not just segment registers. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fun set of conflict resolutions here...
For the mac80211 stuff, these were fortunately just parallel
adds. Trivially resolved.
In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
function phy_disable_interrupts() earlier in the file, whilst in
'net-next' the phy_error() call from this function was removed.
In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
'rt_table_id' member of rtable collided with a bug fix in 'net' that
added a new struct member "rt_mtu_locked" which needs to be copied
over here.
The mlxsw driver conflict consisted of net-next separating
the span code and definitions into separate files, whilst
a 'net' bug fix made some changes to that moved code.
The mlx5 infiniband conflict resolution was quite non-trivial,
the RDMA tree's merge commit was used as a guide here, and
here are their notes:
====================
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f95
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge misc fixes from Andrew Morton:
"13 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm, thp: do not cause memcg oom for thp
mm/vmscan: wake up flushers for legacy cgroups too
Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
mm/thp: do not wait for lock_page() in deferred_split_scan()
mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
x86/mm: implement free pmd/pte page interfaces
mm/vmalloc: add interfaces to free unmapped page table
h8300: remove extraneous __BIG_ENDIAN definition
hugetlbfs: check for pgoff value overflow
lockdep: fix fs_reclaim warning
MAINTAINERS: update Mark Fasheh's e-mail
mm/mempolicy.c: avoid use uninitialized preferred_node
Pull libnvdimm fixes from Dan Williams:
"Two regression fixes, two bug fixes for older issues, two fixes for
new functionality added this cycle that have userspace ABI concerns,
and a small cleanup. These have appeared in a linux-next release and
have a build success report from the 0day robot.
* The 4.16 rework of altmap handling led to some configurations
leaking page table allocations due to freeing from the altmap
reservation rather than the page allocator.
The impact without the fix is leaked memory and a WARN() message
when tearing down libnvdimm namespaces. The rework also missed a
place where error handling code needed to be removed that can lead
to a crash if devm_memremap_pages() fails.
* acpi_map_pxm_to_node() had a latent bug whereby it could
misidentify the closest online node to a given proximity domain.
* Block integrity handling was reworked several kernels back to allow
calling add_disk() after setting up the integrity profile.
The nd_btt and nd_blk drivers are just now catching up to fix
automatic partition detection at driver load time.
* The new peristence_domain attribute, a platform indicator of
whether cpu caches are powerfail protected for example, is meant to
be a single value enum and not a set of flags.
This oversight was caught while reviewing new userspace code in
libndctl to communicate the attribute.
Fix this new enabling up so that we are not stuck with an unwanted
userspace ABI"
* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, nfit: fix persistence domain reporting
libnvdimm, region: hide persistence_domain when unknown
acpi, numa: fix pxm to online numa node associations
x86, memremap: fix altmap accounting at free
libnvdimm: remove redundant assignment to pointer 'dev'
libnvdimm, {btt, blk}: do integrity setup before add_disk()
kernel/memremap: Remove stale devres_free() call
Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
clear a given pud/pmd entry and free up lower level page table(s).
The address range associated with the pud/pmd entry must have been
purged by INVLPG.
Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
Fixes: e61ce6ade4 ("mm: change ioremap to set up huge I/O mappings")
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reported-by: Lei Li <lious.lilei@hisilicon.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
create pud/pmd mappings. A kernel panic was observed on arm64 systems
with Cortex-A75 in the following steps as described by Hanjun Guo.
1. ioremap a 4K size, valid page table will build,
2. iounmap it, pte0 will set to 0;
3. ioremap the same address with 2M size, pgd/pmd is unchanged,
then set the a new value for pmd;
4. pte0 is leaked;
5. CPU may meet exception because the old pmd is still in TLB,
which will lead to kernel panic.
This panic is not reproducible on x86. INVLPG, called from iounmap,
purges all levels of entries associated with purged address on x86. x86
still has memory leak.
The patch changes the ioremap path to free unmapped page table(s) since
doing so in the unmap path has the following issues:
- The iounmap() path is shared with vunmap(). Since vmap() only
supports pte mappings, making vunmap() to free a pte page is an
overhead for regular vmap users as they do not need a pte page freed
up.
- Checking if all entries in a pte page are cleared in the unmap path
is racy, and serializing this check is expensive.
- The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
purge.
Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
clear a given pud/pmd entry and free up a page for the lower level
entries.
This patch implements their stub functions on x86 and arm64, which work
as workaround.
[akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
Fixes: e61ce6ade4 ("mm: change ioremap to set up huge I/O mappings")
Reported-by: Lei Li <lious.lilei@hisilicon.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Chintan Pandya <cpandya@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) Always validate XFRM esn replay attribute, from Florian Westphal.
2) Fix RCU read lock imbalance in xfrm_get_tos(), from Xin Long.
3) Don't try to get firmware dump if not loaded in iwlwifi, from Shaul
Triebitz.
4) Fix BPF helpers to deal with SCTP GSO SKBs properly, from Daniel
Axtens.
5) Fix some interrupt handling issues in e1000e driver, from Benjamin
Poitier.
6) Use strlcpy() in several ethtool get_strings methods, from Florian
Fainelli.
7) Fix rhlist dup insertion, from Paul Blakey.
8) Fix SKB leak in netem packet scheduler, from Alexey Kodanev.
9) Fix driver unload crash when link is up in smsc911x, from Jeremy
Linton.
10) Purge out invalid socket types in l2tp_tunnel_create(), from Eric
Dumazet.
11) Need to purge the write queue when TCP connections are aborted,
otherwise userspace using MSG_ZEROCOPY can't close the fd. From
Soheil Hassas Yeganeh.
12) Fix double free in error path of team driver, from Arkadi
Sharshevsky.
13) Filter fixes for hv_netvsc driver, from Stephen Hemminger.
14) Fix non-linear packet access in ipv6 ndisc code, from Lorenzo
Bianconi.
15) Properly filter out unsupported feature flags in macvlan driver,
from Shannon Nelson.
16) Don't request loading the diag module for a protocol if the protocol
itself is not even registered. From Xin Long.
17) If datagram connect fails in ipv6, make sure the socket state is
consistent afterwards. From Paolo Abeni.
18) Use after free in qed driver, from Dan Carpenter.
19) If received ipv4 PMTU is less than the min pmtu, lock the mtu in the
entry. From Sabrina Dubroca.
20) Fix sleep in atomic in tg3 driver, from Jonathan Toppins.
21) Fix vlan in vlan untagging in some situations, from Toshiaki Makita.
22) Fix double SKB free in genlmsg_mcast(). From Nicolas Dichtel.
23) Fix NULL derefs in error paths of tcf_*_init(), from Davide Caratti.
24) Unbalanced PM runtime calls in FEC driver, from Florian Fainelli.
25) Memory leak in gemini driver, from Igor Pylypiv.
26) IDR leaks in error paths of tcf_*_init() functions, from Davide
Caratti.
27) Need to use GFP_ATOMIC in seg6_build_state(), from David Lebrun.
28) Missing dev_put() in error path of macsec_newlink(), from Dan
Carpenter.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (201 commits)
macsec: missing dev_put() on error in macsec_newlink()
net: dsa: Fix functional dsa-loop dependency on FIXED_PHY
hv_netvsc: common detach logic
hv_netvsc: change GPAD teardown order on older versions
hv_netvsc: use RCU to fix concurrent rx and queue changes
hv_netvsc: disable NAPI before channel close
net/ipv6: Handle onlink flag with multipath routes
ppp: avoid loop in xmit recursion detection code
ipv6: sr: fix NULL pointer dereference when setting encap source address
ipv6: sr: fix scheduling in RCU when creating seg6 lwtunnel state
net: aquantia: driver version bump
net: aquantia: Implement pci shutdown callback
net: aquantia: Allow live mac address changes
net: aquantia: Add tx clean budget and valid budget handling logic
net: aquantia: Change inefficient wait loop on fw data reads
net: aquantia: Fix a regression with reset on old firmware
net: aquantia: Fix hardware reset when SPI may rarely hangup
s390/qeth: on channel error, reject further cmd requests
s390/qeth: lock read device while queueing next buffer
s390/qeth: when thread completes, wake up all waiters
...
Commit 2bb8cafea8 ("KVM: vVMX: signal failure for nested VMEntry if
emulation_required", 2018-03-12) introduces a new error path which does
not set *entry_failure_code. Fix that to avoid a leak of L0 stack to L1.
Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L1 IOAPIC redirection-table is written, a request of
KVM_REQ_SCAN_IOAPIC is set on all vCPUs. This is done such that
all vCPUs will now recalc their IOAPIC handled vectors and load
it to their EOI-exitmap.
However, it could be that one of the vCPUs is currently running
L2. In this case, load_eoi_exitmap() will be called which would
write to vmcs02->eoi_exit_bitmap, which is wrong because
vmcs02->eoi_exit_bitmap should always be equal to
vmcs12->eoi_exit_bitmap. Furthermore, at this point
KVM_REQ_SCAN_IOAPIC was already consumed and therefore we will
never update vmcs01->eoi_exit_bitmap. This could lead to remote_irr
of some IOAPIC level-triggered entry to remain set forever.
Fix this issue by delaying the load of EOI-exitmap to when vCPU
is running L1.
One may wonder why not just delay entire KVM_REQ_SCAN_IOAPIC
processing to when vCPU is running L1. This is done in order to handle
correctly the case where LAPIC & IO-APIC of L1 is pass-throughed into
L2. In this case, vmcs12->virtual_interrupt_delivery should be 0. In
current nVMX implementation, that results in
vmcs02->virtual_interrupt_delivery to also be 0. Thus,
vmcs02->eoi_exit_bitmap is not used. Therefore, every L2 EOI cause
a #VMExit into L0 (either on MSR_WRITE to x2APIC MSR or
APIC_ACCESS/APIC_WRITE/EPT_MISCONFIG to APIC MMIO page).
In order for such L2 EOI to be broadcasted, if needed, from LAPIC
to IO-APIC, vcpu->arch.ioapic_handled_vectors must be updated
while L2 is running. Therefore, patch makes sure to delay only the
loading of EOI-exitmap but not the update of
vcpu->arch.ioapic_handled_vectors.
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The undocumented 'icebp' instruction (aka 'int1') works pretty much like
'int3' in the absense of in-circuit probing equipment (except,
obviously, that it raises #DB instead of raising #BP), and is used by
some validation test-suites as such.
But Andy Lutomirski noticed that his test suite acted differently in kvm
than on bare hardware.
The reason is that kvm used an inexact test for the icebp instruction:
it just assumed that an all-zero VM exit qualification value meant that
the VM exit was due to icebp.
That is not unlike the guess that do_debug() does for the actual
exception handling case, but it's purely a heuristic, not an absolute
rule. do_debug() does it because it wants to ascribe _some_ reasons to
the #DB that happened, and an empty %dr6 value means that 'icebp' is the
most likely casue and we have no better information.
But kvm can just do it right, because unlike the do_debug() case, kvm
actually sees the real reason for the #DB in the VM-exit interruption
information field.
So instead of relying on an inexact heuristic, just use the actual VM
exit information that says "it was 'icebp'".
Right now the 'icebp' instruction isn't technically documented by Intel,
but that will hopefully change. The special "privileged software
exception" information _is_ actually mentioned in the Intel SDM, even
though the cause of it isn't enumerated.
Reported-by: Andy Lutomirski <luto@kernel.org>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Writing to it directly does not work for Xen PV guests.
Fixes: 49275fef98 ("x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy")
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180319143154.3742-1-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to start using asm-goto to guarantee the absence of dynamic
branches (and thus speculation).
A primary prerequisite for this is of course that the compiler
supports asm-goto. This effecively lifts the minimum GCC version to
build an x86 kernel to gcc-4.5.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: torvalds@linux-foundation.org
Link: https://lkml.kernel.org/r/20180319201327.GJ4043@hirez.programming.kicks-ass.net
Pull in pending siginfo changes from Eric Biederman as we depend on
the definition of FPE_FLTUNK for cleaning up our floating-point exception
signal delivery (which is currently broken and using FPE_FIXME).
With that in place the generic DMA-direct routines can be used to
allocate non-encrypted bounce buffers, and the x86 SEV case can use
the generic swiotlb ops including nice features such as using CMA
allocations.
Note that I'm not too happy about using sev_active() in DMA-direct, but
I couldn't come up with a good enough name for a wrapper to make it
worth adding.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-14-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Give the basic phys_to_dma() and dma_to_phys() helpers a __-prefix and add
the memory encryption mask to the non-prefixed versions. Use the
__-prefixed versions directly instead of clearing the mask again in
various places.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-13-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that set_memory_decrypted() is always available we can just call it
directly.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-12-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All dma_ops implementations used on x86 now take care of setting their own
required GFP_ masks for the allocation. And given that the common code
now clears harmful flags itself that means we can stop the flags in all
the IOMMU implementations as well.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-10-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This gains support for CMA allocations for the force_iommu case, and
cleans up the code a bit.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-7-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to phase out looking at the magic GFP_DMA flag in the DMA mapping
routines, so switch the gart driver to use the dev->coherent_dma_mask
instead, which is used to select the GFP_DMA flag in the caller.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-6-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The generic swiotlb DMA ops were based on the x86 ones and provide
equivalent functionality, so use them.
Also fix the sta2x11 case. For that SOC the DMA map ops need an
additional physical to DMA address translations. For swiotlb buffers
that is done throught the phys_to_dma helper, but the sta2x11_dma_ops
also added an additional translation on the return value from
x86_swiotlb_alloc_coherent, which is only correct if that functions
returns a direct allocation and not a swiotlb buffer. With the
generic swiotlb and DMA-direct code phys_to_dma is not always used
and the separate sta2x11_dma_ops can be replaced with a simple
bit that marks if the additional physical to DMA address translation
is needed.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-5-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The generic DMA-direct (CONFIG_DMA_DIRECT_OPS=y) implementation is now
functionally equivalent to the x86 nommu dma_map implementation, so
switch over to using it.
That includes switching from using x86_dma_supported in various IOMMU
drivers to use dma_direct_supported instead, which provides the same
functionality.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-4-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These days all devices (including the ISA fallback device) have a coherent
DMA mask set, so remove the workaround.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-3-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There were only a few Pentium Pro multiprocessors systems where this
errata applied. They are more than 20 years old now, and we've slowly
dropped places which put the workarounds in and discouraged anyone
from enabling the workaround.
Get rid of it for good.
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Muli Ben-Yehuda <mulix@mulix.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: iommu@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/20180319103826.12853-2-hch@lst.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The number of CHAs is miscalculated on multi-domain PCI Skylake server systems,
resulting in an uncore driver initialization error.
Gary Kroening explains:
"For systems with a single PCI segment, it is sufficient to look for the
bus number to change in order to determine that all of the CHa's have
been counted for a single socket.
However, for multi PCI segment systems, each socket is given a new
segment and the bus number does NOT change. So looking only for the
bus number to change ends up counting all of the CHa's on all sockets
in the system. This leads to writing CPU MSRs beyond a valid range and
causes an error in ivbep_uncore_msr_init_box()."
To fix this bug, query the number of CHAs from the CAPID6 register:
it should read bits 27:0 in the CAPID6 register located at
Device 30, Function 3, Offset 0x9C. These 28 bits form a bit vector
of available LLC slices and the CHAs that manage those slices.
Reported-by: Kroening, Gary <gary.kroening@hpe.com>
Tested-by: Kroening, Gary <gary.kroening@hpe.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: abanman@hpe.com
Cc: dimitri.sivanich@hpe.com
Cc: hpa@zytor.com
Cc: mike.travis@hpe.com
Cc: russ.anderson@hpe.com
Fixes: cd34cd97b7 ("perf/x86/intel/uncore: Add Skylake server uncore support")
Link: http://lkml.kernel.org/r/1520967094-13219-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'freerunning PEBS' and 'large PEBS' are the same thing. Both of these
names appear in the code and in the API, which causes confusion.
Rename 'freerunning PEBS' to 'large PEBS' to unify the code,
which eliminates the confusion.
No functional change.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1520865937-22910-1-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We intended to clear the lowest 6 bits but because of a type bug we
clear the high 32 bits as well. Andi says that periods are rarely more
than U32_MAX so this bug probably doesn't have a huge runtime impact.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 294fe0f52a ("perf/x86/intel: Add INST_RETIRED.ALL workarounds")
Link: http://lkml.kernel.org/r/20180317115216.GB4035@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Userspace RDPMC cannot possibly work for large PEBS, which was introduced in:
b8241d2069 ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold)")
When the PEBS interrupt threshold is larger than one, there is no way
to get exact auto-reload times and value for userspace RDPMC. Disable
the userspace RDPMC usage when large PEBS is enabled.
The only exception is when the PEBS interrupt threshold is 1, in which
case user-space RDPMC works well even with auto-reload events.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Fixes: b8241d2069 ("perf/x86/intel: Implement batched PEBS interrupt handling (large PEBS interrupt threshold)")
Link: http://lkml.kernel.org/r/1518474035-21006-6-git-send-email-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1af22eba24)
Since the x86-64 kernel must be aligned to 2MB, refuse to boot the
kernel if the alignment of the LOAD segment isn't a multiple of 2MB.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CAMe9rOrR7xSJgUfiCoZLuqWUwymRxXPoGBW38%2BpN%3D9g%2ByKNhZw@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Binutils 2.31 will enable -z separate-code by default for x86 to avoid
mixing code pages with data to improve cache performance as well as
security. To reduce x86-64 executable and shared object sizes, the
maximum page size is reduced from 2MB to 4KB. But x86-64 kernel must
be aligned to 2MB. Pass -z max-page-size=0x200000 to linker to force
2MB page size regardless of the default page size used by linker.
Tested with Linux kernel 4.15.6 on x86-64.
Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/CAMe9rOp4_%3D_8twdpTyAP2DhONOCeaTOsniJLoppzhoNptL8xzA@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>