SPAPR_TCE_SHIFT is used in few places only and since IOMMU_PAGE_SHIFT_4K
can be easily used instead, remove SPAPR_TCE_SHIFT.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
At the moment pages used for TCE tables (in addition to pages addressed
by TCEs) are not counted in locked_vm counter so a malicious userspace
tool can call ioctl(KVM_CREATE_SPAPR_TCE) as many times as
RLIMIT_NOFILE and lock a lot of memory.
This adds counting for pages used for TCE tables.
This counts the number of pages required for a table plus pages for
the kvmppc_spapr_tce_table struct (TCE table descriptor) itself.
This changes release_spapr_tce_table() to store @npages on stack to
avoid calling kvmppc_stt_npages() in the loop (tiny optimization,
probably).
This does not change the amount of used memory.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
At the moment only spapr_tce_tables updates are protected against races
but not lookups. This fixes missing protection by using RCU for the list.
As lookups also happen in real mode, this uses
list_for_each_entry_lockless() (which is expected not to access any
vmalloc'd memory).
This converts release_spapr_tce_table() to a RCU scheduled handler.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This reworks the existing H_PUT_TCE/H_GET_TCE handlers to have following
patches applied nicer.
This moves the ioba boundaries check to a helper and adds a check for
least bits which have to be zeros.
The patch is pretty mechanical (only check for least ioba bits is added)
so no change in behaviour is expected.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
ec53500f "kvm: Add VFIO device" added a special KVM pseudo-device which is
used to handle any necessary interactions between KVM and VFIO.
Currently that device is built on x86 and ARM, but not powerpc, although
powerpc does support both KVM and VFIO. This makes things awkward in
userspace
Currently qemu prints an alarming error message if you attempt to use VFIO
and it can't initialize the KVM VFIO device. We don't want to remove the
warning, because lack of the KVM VFIO device could mean coherency problems
on x86. On powerpc, however, the error is harmless but looks disturbing,
and a test based on host architecture in qemu would be ugly, and break if
we do need the KVM VFIO device for something important in future.
There's nothing preventing the KVM VFIO device from being built for
powerpc, so this patch turns it on. It won't actually do anything, since
we don't define any of the arch_*() hooks, but it will make qemu happy and
we can extend it in future if we need to.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given the memmap array for a
large pool of persistent may exhaust available DRAM introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica Gupta,
Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling, Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
- cxl: Fix possible idr warning when contexts are released from Vaibhav Jain
- cxl: use correct operator when writing pcie config space values from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma Krishnan
- Freescale updates from Scott: Highlights include moving QE code out of
arch/powerpc (to be shared with arm), device tree updates, and minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmIxeAAoJEFHr6jzI4aWAA+cQAIXAw4WfVWJ2V4ZK+1eKfB57
fdXG71PuXG+WYIWy71ly8keLHdzzD1NQ2OUB64bUVRq202nRgVc15ZYKRJ/FE/sP
SkxaQ2AG/2kI2EflWshOi0Lu9qaZ+LMHJnszIqE/9lnGSB2kUI/cwsSXgziiMKXR
XNci9v14SdDd40YV/6BSZXoxApwyq9cUbZ7rnzFLmz4hrFuKmB/L3LABDF8QcpH7
sGt/YaHGOtqP0UX7h5KQTFLGe1OPvK6NWixSXeZKQ71ED6cho1iKUEOtBA9EZeIN
QM5JdHFWgX8MMRA0OHAgidkSiqO38BXjmjkVYWoIbYz7Zax3ThmrDHB4IpFwWnk3
l7WBykEXY7KEqpZzbh0GFGehZWzVZvLnNgDdvpmpk/GkPzeYKomBj7ZZfm3H1yGD
BTHPwuWCTX+/K75yEVNO8aJO12wBg7DRl4IEwBgqhwU8ga4FvUOCJkm+SCxA1Dnn
qlpS7qPwTXNIEfKMJcxp5X0KiwDY1EoOotd4glTN0jbeY5GEYcxe+7RQ302GrYxP
zcc8EGLn8h6BtQvV3ypNHF5l6QeTW/0ZlO9c236tIuUQ5gQU39SQci7jQKsYjSzv
BB1XdLHkbtIvYDkmbnr1elbeJCDbrWL9rAXRUTRyfuCzaFWTfZmfVNe8c8qwDMLk
TUxMR/38aI7bLcIQjwj9
=R5bX
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Core:
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
Misc:
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica
Gupta, Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling,
Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de
Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions
fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica
Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from
Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from
Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from
Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing
from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc
from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
cxl:
- cxl: Fix possible idr warning when contexts are released from
Vaibhav Jain
- cxl: use correct operator when writing pcie config space values
from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav
Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma
Krishnan
Freescale:
- Freescale updates from Scott: Highlights include moving QE code out
of arch/powerpc (to be shared with arm), device tree updates, and
minor fixes"
* tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (149 commits)
powerpc/module: Handle R_PPC64_ENTRY relocations
scripts/recordmcount.pl: support data in text section on powerpc
powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages
powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff
powerpc/mm: Fix _PAGE_PTE breaking swapoff
cxl: Enable PCI device ID for future IBM CXL adapter
cxl: use -Werror only with CONFIG_PPC_WERROR
cxl: fix build for GCC 4.6.x
powerpc: Add HWCAP bits for Power9
powerpc/powernv: Reserve PE#0 on NPU
powerpc/powernv: Change NPU PE# assignment
powerpc/powernv: Fix update of NVLink DMA mask
powerpc/powernv: Remove misleading comment in pci.c
powerpc: Implement save_stack_trace_regs() to enable kprobe stack tracing
powerpc: Fix build break due to paca mm_context_t changes
cxl: Fix DSI misses when the context owning task exits
MAINTAINERS: Update Scott Wood's e-mail address
powerpc/powernv: Fix minor off-by-one error in opal_mce_check_early_recovery()
powerpc: Fix style of self-test config prompts
powerpc/powernv: Only delay opal_rtc_read() retry when necessary
...
The get and set operations got exchanged by mistake when moving the
code from book3s.c to powerpc.c.
Fixes: 3840edc803
Cc: stable@vger.kernel.org # 3.18+
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
support of 248 VCPUs.
* ARM: rewrite of the arm64 world switch in C, support for
16-bit VM identifiers. Performance counter virtualization
missed the boat.
* x86: Support for more Hyper-V features (synthetic interrupt
controller), MMU cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWlSKwAAoJEL/70l94x66DY0UIAK5vp4zfQoQOJC4KP4Xgxwdu
kpnK2Boz3/74o1b0y5+eJZoUZCsXCVLtmP5uhmMxUYWDgByFG2X8ZDhPFwB5FYLT
2dN+Lr4tsolgIfRdHZtrT6Svp9SDL039bWTdscnbR6l37/j9FRWvpKdhI3orloFD
/i4CSW2dVIq1/9Xctwu/rtcOEesEx4Cad+6YV3/530eVAXFzE908nXfmqJNZTocY
YCGcmrMVCOu0ng5QM4xSzmmYjKMLUcRs+QzZWkVBzdJtTgwZUr09yj7I2dZ1yj/i
cxYrJy6shSwE74XkXsmvG+au3C5u3vX4tnXjBFErnPJ99oqzHatVnFWNRhj4dLQ=
=PIj1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"PPC changes will come next week.
- s390: Support for runtime instrumentation within guests, support of
248 VCPUs.
- ARM: rewrite of the arm64 world switch in C, support for 16-bit VM
identifiers. Performance counter virtualization missed the boat.
- x86: Support for more Hyper-V features (synthetic interrupt
controller), MMU cleanups"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (115 commits)
kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL
kvm/x86: Hyper-V SynIC timers tracepoints
kvm/x86: Hyper-V SynIC tracepoints
kvm/x86: Update SynIC timers on guest entry only
kvm/x86: Skip SynIC vector check for QEMU side
kvm/x86: Hyper-V fix SynIC timer disabling condition
kvm/x86: Reorg stimer_expiration() to better control timer restart
kvm/x86: Hyper-V unify stimer_start() and stimer_restart()
kvm/x86: Drop stimer_stop() function
kvm/x86: Hyper-V timers fix incorrect logical operation
KVM: move architecture-dependent requests to arch/
KVM: renumber vcpu->request bits
KVM: document which architecture uses each request bit
KVM: Remove unused KVM_REQ_KICK to save a bit in vcpu->requests
kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock
KVM: s390: implement the RI support of guest
kvm/s390: drop unpaired smp_mb
kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa
KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table()
arm/arm64: KVM: Detect vGIC presence at runtime
...
Currently it is possible for userspace (e.g. QEMU) to set a value
for the MSR for a guest VCPU which has both of the TS bits set,
which is an illegal combination. The result of this is that when
we execute a hrfid (hypervisor return from interrupt doubleword)
instruction to enter the guest, the CPU will take a TM Bad Thing
type of program interrupt (vector 0x700).
Now, if PR KVM is configured in the kernel along with HV KVM, we
actually handle this without crashing the host or giving hypervisor
privilege to the guest; instead what happens is that we deliver a
program interrupt to the guest, with SRR0 reflecting the address
of the hrfid instruction and SRR1 containing the MSR value at that
point. If PR KVM is not configured in the kernel, then we try to
run the host's program interrupt handler with the MMU set to the
guest context, which almost certainly causes a host crash.
This closes the hole by making kvmppc_set_msr_hv() check for the
illegal combination and force the TS field to a safe value (00,
meaning non-transactional).
Cc: stable@vger.kernel.org # v3.9+
Signed-off-by: Paul Mackerras <paulus@samba.org>
The vcpu_book3s variable is assigned but never used. So remove it.
Found using cppcheck.
Signed-off-by: Geyslan G. Bem <geyslan@gmail.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
In the old DABR register, the BT (Breakpoint Translation) bit
is bit number 61. In the new DAWRX register, the WT (Watchpoint
Translation) bit is bit number 59. So to move the DABR-BT bit
into the position of the DAWRX-WT bit, it has to be shifted by
two, not only by one. This fixes hardware watchpoints in gdb of
older guests that only use the H_SET_DABR/X interface instead
of the new H_SET_MODE interface.
Cc: stable@vger.kernel.org # v3.14+
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
As we saw with the TM Bad Thing type of program interrupt occurring
on the hrfid that enters the guest, it is not completely impossible
to have a trap occurring in the guest entry/exit code, despite the
fact that the code has been written to avoid taking any traps.
This adds a check in the kvmppc_handle_exit_hv() function to detect
the case when a trap has occurred in the hypervisor-mode code, and
instead of treating it just like a trap in guest code, we now print
a message and return to userspace with a KVM_EXIT_INTERNAL_ERROR
exit reason.
Of the various interrupts that get handled in the assembly code in
the guest exit path and that can return directly to the guest, the
only one that can occur when MSR.HV=1 and MSR.EE=0 is machine check
(other than system call, which we can avoid just by not doing a sc
instruction). Therefore this adds code to the machine check path to
ensure that if the MCE occurred in hypervisor mode, we exit to the
host rather than trying to continue the guest.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Create a single function that flushes everything (FP, VMX, VSX, SPE).
Doing this all at once means we only do one MSR write.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Create a single function that gives everything up (FP, VMX, VSX, SPE).
Doing this all at once means we only do one MSR write.
A context switch microbenchmark using yield():
http://ozlabs.org/~anton/junkcode/context_switch2.c
./context_switch2 --test=yield --fp --altivec --vector 0 0
shows an improvement of 3% on POWER8.
Signed-off-by: Anton Blanchard <anton@samba.org>
[mpe: giveup_all() needs to be EXPORT_SYMBOL'ed]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The enable_kernel_*() functions leave the relevant MSR bits enabled
until we exit the kernel sometime later. Create disable versions
that wrap the kernel use of FP, Altivec VSX or SPE.
While we don't want to disable it normally for performance reasons
(MSR writes are slow), it will be used for a debug boot option that
does this and catches bad uses in other areas of the kernel.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Let's reuse the new common function for VPCU lookup by id.
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
[split out the new function into a separate patch]
In another patch kvm_is_visible_gfn is maken return bool due to this
function only returns zero or one as its return value, let's also make
kvmppc_visible_gpa return bool to keep consistent.
No functional change.
Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- x86: work around two nasty cases where a benign exception occurs while
another is being delivered. The endless stream of exceptions causes an
infinite loop in the processor, which not even NMIs or SMIs can interrupt;
in the virt case, there is no possibility to exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by AMD,
the patches mostly move things from there to common arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and exit
paths when context tracking is enabled. The patches are a few months
old, but we discussed them again at kernel summit. Andy will pick up
from here and, in 4.5, try to remove it from the user entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWRFb0AAoJEL/70l94x66DjjMH/31jr8d119MW0uv2x+03+wRq
6dbJ8tjQ8grvBRExKvLsUVjDmHlhCa1BQl5qjCsyYhX9UeAf4NQOmoEFpq+YTLxh
Ctveyn+yiZWC7qxbQDmauiQ4JCOp+W9ial782iqw5+ouQMajGOffq5WrojCa2ZNF
jI278JgdHJLrKj/uie//WBu3V7MJY5Apc3p4zatnSYFSQ3MA0sxl4r4zIrwOa5qs
23ZeeoqbP4sHh4X5wL/30Y6XFSCHj0qoYHHyAgzLi0PCMvBdt4DrAFUPDG/Rhlv6
o1WB/kcUfcz3DtBX85wfSOMuw0nF6patWhWv07R/3EIbYoz3dKvp9d6ORYgXqlY=
=Um9M
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second batch of kvm updates from Paolo Bonzini:
"Four changes:
- x86: work around two nasty cases where a benign exception occurs
while another is being delivered. The endless stream of exceptions
causes an infinite loop in the processor, which not even NMIs or
SMIs can interrupt; in the virt case, there is no possibility to
exit to the host either.
- x86: support for Skylake per-guest TSC rate. Long supported by
AMD, the patches mostly move things from there to common
arch/x86/kvm/ code.
- generic: remove local_irq_save/restore from the guest entry and
exit paths when context tracking is enabled. The patches are a few
months old, but we discussed them again at kernel summit. Andy
will pick up from here and, in 4.5, try to remove it from the user
entry/exit paths.
- PPC: Two bug fixes, see merge commit 370289756b for details"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: x86: rename update_db_bp_intercept to update_bp_intercept
KVM: svm: unconditionally intercept #DB
KVM: x86: work around infinite loop in microcode when #AC is delivered
context_tracking: avoid irq_save/irq_restore on guest entry and exit
context_tracking: remove duplicate enabled check
KVM: VMX: Dump TSC multiplier in dump_vmcs()
KVM: VMX: Use a scaled host TSC for guest readings of MSR_IA32_TSC
KVM: VMX: Setup TSC scaling ratio when a vcpu is loaded
KVM: VMX: Enable and initialize VMX TSC scaling
KVM: x86: Use the correct vcpu's TSC rate to compute time scale
KVM: x86: Move TSC scaling logic out of call-back read_l1_tsc()
KVM: x86: Move TSC scaling logic out of call-back adjust_tsc_offset()
KVM: x86: Replace call-back compute_tsc_offset() with a common function
KVM: x86: Replace call-back set_tsc_khz() with a common function
KVM: x86: Add a common TSC scaling function
KVM: x86: Add a common TSC scaling ratio field in kvm_vcpu_arch
KVM: x86: Collect information for setting TSC scaling ratio
KVM: x86: declare a few variables as __read_mostly
KVM: x86: merge handle_mmio_page_fault and handle_mmio_page_fault_common
KVM: PPC: Book3S HV: Don't dynamically split core when already split
...
- Kconfig: remove BE-only platforms from LE kernel build from Boqun Feng
- Refresh ps3_defconfig from Geoff Levand
- Emit GNU & SysV hashes for the vdso from Michael Ellerman
- Define an enum for the bolted SLB indexes from Anshuman Khandual
- Use a local to avoid multiple calls to get_slb_shadow() from Michael Ellerman
- Add gettimeofday() benchmark from Michael Neuling
- Avoid link stack corruption in __get_datapage() from Michael Neuling
- Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar K.V
- Add ppc64le_defconfig from Michael Ellerman
- pseries: extract of_helpers module from Andy Shevchenko
- Correct string length in pseries_of_derive_parent() from Nathan Fontenot
- Free the MSI bitmap if it was slab allocated from Denis Kirjanov
- Shorten irq_chip name for the SIU from Christophe Leroy
- Wait 1s for secondaries to enter OPAL during kexec from Samuel Mendoza-Jonas
- Fix _ALIGN_* errors due to type difference. from Aneesh Kumar K.V
- powerpc/pseries/hvcserver: don't memset pi_buff if it is null from Colin Ian King
- Disable hugepd for 64K page size. from Aneesh Kumar K.V
- Differentiate between hugetlb and THP during page walk from Aneesh Kumar K.V
- Make PCI non-optional for pseries from Michael Ellerman
- Individual System V IPC system calls from Sam bobroff
- Add selftest of unmuxed IPC calls from Michael Ellerman
- discard .exit.data at runtime from Stephen Rothwell
- Delete old orphaned PrPMC 280/2800 DTS and boot file. from Paul Gortmaker
- Use of_get_next_parent to simplify code from Christophe Jaillet
- Paginate some xmon output from Sam bobroff
- Add some more elements to the xmon PACA dump from Michael Ellerman
- Allow the tm-syscall selftest to build with old headers from Michael Ellerman
- Run EBB selftests only on POWER8 from Denis Kirjanov
- Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael Ellerman
- Avoid reference to potentially freed memory in prom.c from Christophe Jaillet
- Quieten boot wrapper output with run_cmd from Geoff Levand
- EEH fixes and cleanups from Gavin Shan
- Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
- Use of_get_next_parent() in of_get_ibm_chip_id() from Michael Ellerman
- Fix section mismatch warning in msi_bitmap_alloc() from Denis Kirjanov
- Fix ps3-lpm white space from Rudhresh Kumar J
- Fix ps3-vuart null dereference from Colin King
- nvram: Add missing kfree in error path from Christophe Jaillet
- nvram: Fix function name in some errors messages. from Christophe Jaillet
- drivers/macintosh: adb: fix misleading Kconfig help text from Aaro Koskinen
- agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
- cxl: Free virtual PHB when removing from Andrew Donnellan
- scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from Michael Ellerman
- scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building with O= from Michael Ellerman
- Freescale updates from Scott: Highlights include 64-bit book3e kexec/kdump
support, a rework of the qoriq clock driver, device tree changes including
qoriq fman nodes, support for a new 85xx board, and some fixes.
- MPC5xxx updates from Anatolij: Highlights include a driver for MPC512x
LocalPlus Bus FIFO with its device tree binding documentation, mpc512x
device tree updates and some minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWPEZgAAoJEFHr6jzI4aWANjYQAKX2Q/95hqKfCuF5FBcUmtMC
Pu/Nff027MVzxZ2ApDcvvLGps5Nz2bn3nIhc9zjkXc5E8DuL6X3Yl8ce7qyNcc3g
cJJ8RvtUo6J1OMWetXFehtPYniAAwKMhZYKnj0+WnLr2SyH/Vhl3ehDkFbGyPtuH
r+2E7krFjfVgU+bzciIFnOaDekFuFN/pXWMb6e6zQyBJe9N8ZIp96uouGCebKVd0
VDLItzdaKErT8JFfbymMPvZm3V0rMVx4WWu3kAbQX8LrD5a18NF1zrjAOHRXc61n
kkk8/DPuNOon1PbXXyiS5BcFyZRe+KE3VBnoW5sOMqMIRg5WdO1oU3e2pEfXMO8+
leXYwFLXiKzUZuOgQG2QiUhrzD2yC1o6/TJWATv0dSl9AwrecgPX+Vj6X357slAf
A9E3eMy5tgnpndBWZmvZS3W7YDKH+NkeZ+Q40+NErAlqr++ErrTcKVndk5vWlYTT
7mMZeTXagX66al/k5ATKqwB7iUSpnYHSAa9fcUYPSM2FnXsDxPyeJGkBbcoOmkGj
QrpgNYOvJaUJd076goZCV39v0c1xpfV9/9kyVch8HUadf6JcjpVZwYnbGw2qlJjh
ZanuBG2VOeSwaKQqXiRBSBetnpAg8CVpFjDmX9wOBfSek2wxEJqDX/vQExdbIDQQ
pUs7vnUxLzhmW/x+ygOI
=YwcM
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Kconfig: remove BE-only platforms from LE kernel build from Boqun
Feng
- Refresh ps3_defconfig from Geoff Levand
- Emit GNU & SysV hashes for the vdso from Michael Ellerman
- Define an enum for the bolted SLB indexes from Anshuman Khandual
- Use a local to avoid multiple calls to get_slb_shadow() from Michael
Ellerman
- Add gettimeofday() benchmark from Michael Neuling
- Avoid link stack corruption in __get_datapage() from Michael Neuling
- Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
K.V
- Add ppc64le_defconfig from Michael Ellerman
- pseries: extract of_helpers module from Andy Shevchenko
- Correct string length in pseries_of_derive_parent() from Nathan
Fontenot
- Free the MSI bitmap if it was slab allocated from Denis Kirjanov
- Shorten irq_chip name for the SIU from Christophe Leroy
- Wait 1s for secondaries to enter OPAL during kexec from Samuel
Mendoza-Jonas
- Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
- powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
Colin Ian King
- Disable hugepd for 64K page size, from Aneesh Kumar K.V
- Differentiate between hugetlb and THP during page walk from Aneesh
Kumar K.V
- Make PCI non-optional for pseries from Michael Ellerman
- Individual System V IPC system calls from Sam bobroff
- Add selftest of unmuxed IPC calls from Michael Ellerman
- discard .exit.data at runtime from Stephen Rothwell
- Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
Gortmaker
- Use of_get_next_parent to simplify code from Christophe Jaillet
- Paginate some xmon output from Sam bobroff
- Add some more elements to the xmon PACA dump from Michael Ellerman
- Allow the tm-syscall selftest to build with old headers from Michael
Ellerman
- Run EBB selftests only on POWER8 from Denis Kirjanov
- Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
Ellerman
- Avoid reference to potentially freed memory in prom.c from Christophe
Jaillet
- Quieten boot wrapper output with run_cmd from Geoff Levand
- EEH fixes and cleanups from Gavin Shan
- Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
- Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
Ellerman
- Fix section mismatch warning in msi_bitmap_alloc() from Denis
Kirjanov
- Fix ps3-lpm white space from Rudhresh Kumar J
- Fix ps3-vuart null dereference from Colin King
- nvram: Add missing kfree in error path from Christophe Jaillet
- nvram: Fix function name in some errors messages, from Christophe
Jaillet
- drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
Koskinen
- agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
- cxl: Free virtual PHB when removing from Andrew Donnellan
- scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
Michael Ellerman
- scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
with O= from Michael Ellerman
- Freescale updates from Scott: Highlights include 64-bit book3e
kexec/kdump support, a rework of the qoriq clock driver, device tree
changes including qoriq fman nodes, support for a new 85xx board, and
some fixes.
- MPC5xxx updates from Anatolij: Highlights include a driver for
MPC512x LocalPlus Bus FIFO with its device tree binding
documentation, mpc512x device tree updates and some minor fixes.
* tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
powerpc/pseries: Correct string length in pseries_of_derive_parent()
powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
powerpc: handle error case in cpm_muram_alloc()
powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
powerpc/book3e-64: Enable kexec
powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
powerpc/book3e-64/kexec: Enable SMP release
powerpc/book3e-64/kexec: create an identity TLB mapping
powerpc/book3e-64: Don't limit paca to 256 MiB
powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
powerpc/book3e: support CONFIG_RELOCATABLE
powerpc/booke64: Fix args to copy_and_flush
powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
powerpc/e6500: kexec: Handle hardware threads
...
In static micro-threading modes, the dynamic micro-threading code
is supposed to be disabled, because subcores can't make independent
decisions about what micro-threading mode to put the core in - there is
only one micro-threading mode for the whole core. The code that
implements dynamic micro-threading checks for this, except that the
check was missed in one case. This means that it is possible for a
subcore in static 2-way micro-threading mode to try to put the core
into 4-way micro-threading mode, which usually leads to stuck CPUs,
spinlock lockups, and other stalls in the host.
The problem was in the can_split_piggybacked_subcores() function, which
should always return false if the system is in a static micro-threading
mode. This fixes the problem by making can_split_piggybacked_subcores()
use subcore_config_ok() for its checks, as subcore_config_ok() includes
the necessary check for the static micro-threading modes.
Credit to Gautham Shenoy for working out that the reason for the hangs
and stalls we were seeing was that we were trying to do dynamic 4-way
micro-threading while we were in static 2-way mode.
Fixes: b4deba5c41
Cc: vger@stable.kernel.org # v4.3
Signed-off-by: Paul Mackerras <paulus@samba.org>
When handling a hypervisor data or instruction storage interrupt (HDSI
or HISI), we look up the SLB entry for the address being accessed in
order to translate the effective address to a virtual address which can
be looked up in the guest HPT. This lookup can occasionally fail due
to the guest replacing an SLB entry without invalidating the evicted
SLB entry. In this situation an ERAT (effective to real address
translation cache) entry can persist and be used by the hardware even
though there is no longer a corresponding SLB entry.
Previously we would just deliver a data or instruction storage interrupt
(DSI or ISI) to the guest in this case. However, this is not correct
and has been observed to cause guests to crash, typically with a
data storage protection interrupt on a store to the vmemmap area.
Instead, what we do now is to synthesize a data or instruction segment
interrupt. That should cause the guest to reload an appropriate entry
into the SLB and retry the faulting instruction. If it still faults,
we should find an appropriate SLB entry next time and be able to handle
the fault.
Tested-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
handling.
PPC: Mostly bug fixes.
ARM: No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite for
IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86: quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new component (in
virt/lib/) that connects VFIO and KVM together. The same infrastructure
will be used for ARM interrupt forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic interrupt
controller will have to wait for 4.5. These will let KVM expose Hyper-V
devices.
- nested virtualization now supports VPID (same as PCID but for vCPUs)
which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for clflushopt,
clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel + IOAPIC/PIC/PIT in
userspace, which reduces the attack surface of the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten to not
require help from the hypervisor.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWO2IQAAoJEL/70l94x66D/K0H/3AovAgYmJQToZlimsktMk6a
f2xhdIqfU5lIQQh5uNBCfL3o9o8H9Py1ym7aEw3fmztPHHJYc91oTatt2UEKhmEw
VtZHp/dFHt3hwaIdXmjRPEXiYctraKCyrhaUYdWmUYkoKi7lW5OL5h+S7frG2U6u
p/hFKnHRZfXHr6NSgIqvYkKqtnc+C0FWY696IZMzgCksOO8jB1xrxoSN3tANW3oJ
PDV+4og0fN/Fr1capJUFEc/fejREHneANvlKrLaa8ht0qJQutoczNADUiSFLcMPG
iHljXeDsv5eyjMtUuIL8+MPzcrIt/y4rY41ZPiKggxULrXc6H+JJL/e/zThZpXc=
=iv2z
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.4.
s390:
A bunch of fixes and optimizations for interrupt and time handling.
PPC:
Mostly bug fixes.
ARM:
No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite
for IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86:
Quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new
component (in virt/lib/) that connects VFIO and KVM together.
The same infrastructure will be used for ARM interrupt
forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic
interrupt controller will have to wait for 4.5. These will let
KVM expose Hyper-V devices.
- nested virtualization now supports VPID (same as PCID but for
vCPUs) which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for
clflushopt, clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel +
IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten
to not require help from the hypervisor"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
KVM: VMX: Fix commit which broke PML
KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
KVM: x86: allow RSM from 64-bit mode
KVM: VMX: fix SMEP and SMAP without EPT
KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
KVM: device assignment: remove pointless #ifdefs
KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
KVM: x86: zero apic_arb_prio on reset
drivers/hv: share Hyper-V SynIC constants with userspace
KVM: x86: handle SMBASE as physical address in RSM
KVM: x86: add read_phys to x86_emulate_ops
KVM: x86: removing unused variable
KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
KVM: arm/arm64: Optimize away redundant LR tracking
KVM: s390: use simple switch statement as multiplexer
KVM: s390: drop useless newline in debugging data
KVM: s390: SCA must not cross page boundaries
KVM: arm: Do not indent the arguments of DECLARE_BITMAP
...
This reverts commit 9678cdaae9 ("Use the POWER8 Micro Partition
Prefetch Engine in KVM HV on POWER8") because the original commit had
multiple, partly self-cancelling bugs, that could cause occasional
memory corruption.
In fact the logmpp instruction was incorrectly using register r0 as the
source of the buffer address and operation code, and depending on what
was in r0, it would either do nothing or corrupt the 64k page pointed to
by r0.
The logmpp instruction encoding and the operation code definitions could
be corrected, but then there is the problem that there is no clearly
defined way to know when the hardware has finished writing to the
buffer.
The original commit attempted to work around this by aborting the
write-out before starting the prefetch, but this is ineffective in the
case where the virtual core is now executing on a different physical
core from the one where the write-out was initiated.
These problems plus advice from the hardware designers not to use the
function (since the measured performance improvement from using the
feature was actually mostly negative), mean that reverting the code is
the best option.
Fixes: 9678cdaae9 ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8")
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently a CPU running a guest can receive a H_DOORBELL in the
following two cases:
1) When the CPU is napping due to CEDE or there not being a guest
vcpu.
2) The CPU is running the guest vcpu.
Case 1), the doorbell message is not cleared since we were waking up
from nap. Hence when the EE bit gets set on transition from guest to
host, the H_DOORBELL interrupt is delivered to the host and the
corresponding handler is invoked.
However in Case 2), the message gets cleared by the action of taking
the H_DOORBELL interrupt. Since the CPU was running a guest, instead
of invoking the doorbell handler, the code invokes the second-level
interrupt handler to switch the context from the guest to the host. At
this point the setting of the EE bit doesn't result in the CPU getting
the doorbell interrupt since it has already been delivered once. So,
the handler for this doorbell is never invoked!
This causes softlockups if the missed DOORBELL was an IPI sent from a
sibling subcore on the same CPU.
This patch fixes it by explitly invoking the doorbell handler on the
exit path if the exit reason is H_DOORBELL similar to the way an
EXTERNAL interrupt is handled. Since this will also handle Case 1), we
can unconditionally clear the doorbell message in
kvmppc_check_wake_reason.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
QEMU assumes 32 memslots if this extension is not implemented. Although,
current value of KVM_USER_MEM_SLOTS is 32, once KVM_USER_MEM_SLOTS
changes QEMU would take a wrong value.
Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This fixes a bug where the old HPTE value returned by H_REMOVE has
the valid bit clear if the HPTE was an absent HPTE, as happens for
HPTEs for emulated MMIO pages and for RAM pages that have been paged
out by the host. If the absent bit is set, we clear it and set the
valid bit, because from the guest's point of view, the HPTE is valid.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Currently the KVM_PPC_ALLOCATE_HTAB will try to allocate the requested
size of HPT, and if that is not possible, then try to allocate smaller
sizes (by factors of 2) until either a minimum is reached or the
allocation succeeds. This is not ideal for userspace, particularly in
migration scenarios, where the destination VM really does require the
size requested. Also, the minimum HPT size of 256kB may be
insufficient for the guest to run successfully.
This removes the fallback to smaller sizes on allocation failure for
the KVM_PPC_ALLOCATE_HTAB ioctl. The fallback still exists for the
case where the HPT is allocated at the time the first VCPU is run, if
no HPT has been allocated by ioctl by that time.
Signed-off-by: Paul Mackerras <paulus@samba.org>
For the machine check interrupt that happens while we are in the guest,
kvm layer attempts the recovery, and then delivers the machine check interrupt
directly to the guest if recovery fails. On successful recovery we go back to
normal functioning of the guest. But there can be cases where a machine check
interrupt can happen with MSR(RI=0) while we are in the guest. This means
MC interrupt is unrecoverable and we have to deliver a machine check to the
guest since the machine check interrupt might have trashed valid values in
SRR0/1. The current implementation do not handle this case, causing guest
to crash with Bad kernel stack pointer instead of machine check oops message.
[26281.490060] Bad kernel stack pointer 3fff9ccce5b0 at c00000000000490c
[26281.490434] Oops: Bad kernel stack pointer, sig: 6 [#1]
[26281.490472] SMP NR_CPUS=2048 NUMA pSeries
This patch fixes this issue by checking MSR(RI=0) in KVM layer and forwarding
unrecoverable interrupt to guest which then panics with proper machine check
Oops message.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Fix couple of cases where we shift left a 32-bit
value thus might get truncated results on 64-bit
targets.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Suggested-by: Scott Wood <scotttwood@freescale.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Emulate TMCFG0 TMRN register exposing one HW thread per vcpu.
Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
[Laurentiu.Tudor@freescale.com: rebased on latest kernel, use
define instead of hardcoded value, moved code in own function]
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Acked-by: Scott Wood <scotttwood@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The function can return negative value.
The problem has been detected using proposed semantic patch
scripts/coccinelle/tests/assign_signed_to_unsigned.cocci [1].
[1]: http://permalink.gmane.org/gmane.linux.kernel/2046107
Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
We need to properly identify whether a hugepage is an explicit or
a transparent hugepage in follow_huge_addr(). We used to depend
on hugepage shift argument to do that. But in some case that can
result in wrong results. For ex:
On finding a transparent hugepage we set hugepage shift to PMD_SHIFT.
But we can end up clearing the thp pte, via pmdp_huge_get_and_clear.
We do prevent reusing the pfn page via the usage of
kick_all_cpus_sync(). But that happens after we updated the pte to 0.
Hence in follow_huge_addr() we can find hugepage shift set, but transparent
huge page check fail for a thp pte.
NOTE: We fixed a variant of this race against thp split in commit
691e95fd73
("powerpc/mm/thp: Make page table walk safe against thp split/collapse")
Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
follow_page_mask occasionally.
In the long term, we may want to switch ppc64 64k page size config to
enable CONFIG_ARCH_WANT_GENERAL_HUGETLB
Reported-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
functions) has to be protected via the kvm->srcu lock.
The kvmppc_h_logical_ci_load() and -store() functions are missing
this lock so far, so let's add it there, too.
This fixes the problem that the kernel reports "suspicious RCU usage"
when lock debugging is enabled.
Cc: stable@vger.kernel.org # v4.1+
Fixes: 99342cf804
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
In guest_exit_cont we call kvmhv_commence_exit which expects the trap
number as the argument. However r3 doesn't contain the trap number at
this point and as a result we would be calling the function with a
spurious trap number.
Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
r12 contains the trap number.
Cc: stable@vger.kernel.org # v4.1+
Fixes: eddb60fb14
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This fixes a bug which results in stale vcore pointers being left in
the per-cpu preempted vcore lists when a VM is destroyed. The result
of the stale vcore pointers is usually either a crash or a lockup
inside collect_piggybacks() when another VM is run. A typical
lockup message looks like:
[ 472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039]
[ 472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv]
[ 472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49
[ 472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000
[ 472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130
[ 472.162337] REGS: c000001e41bff520 TRAP: 0901 Not tainted (4.2.0-kvm+)
[ 472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24848844 XER: 00000000
[ 472.162588] CFAR: c00000000096b0ac SOFTE: 1
GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001
GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821
GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740
GPR12: c000000000111130 c00000000fdae400
[ 472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130
[ 472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130
[ 472.163179] Call Trace:
[ 472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable)
[ 472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90
[ 472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv]
[ 472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv]
[ 472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm]
[ 472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
[ 472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm]
[ 472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0
[ 472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0
[ 472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0
[ 472.164060] Instruction dump:
[ 472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2
[ 472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070
The bug is that kvmppc_run_vcpu does not correctly handle the case
where a vcpu task receives a signal while its guest vcpu is executing
in the guest as a result of being piggy-backed onto the execution of
another vcore. In that case we need to wait for the vcpu to finish
executing inside the guest, and then remove this vcore from the
preempted vcores list. That way, we avoid leaving this vcpu's vcore
on the preempted vcores list when the vcpu gets interrupted.
Fixes: ec25716508
Reported-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This new statistic can help diagnosing VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.
For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
is consuming about 30% more CPU than it would use without
polling. This would show as an abnormally high number of
attempted polling compared to the successful polls.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The code that handles the case when we receive a H_DOORBELL interrupt
has a comment which says "Hypervisor doorbell - exit only if host IPI
flag set". However, the current code does not actually check if the
host IPI flag is set. This is due to a comparison instruction that
got missed.
As a result, the current code performs the exit to host only
if some sibling thread or a sibling sub-core is exiting to the
host. This implies that, an IPI sent to a sibling core in
(subcores-per-core != 1) mode will be missed by the host unless the
sibling core is on the exit path to the host.
This patch adds the missing comparison operation which will ensure
that when HOST_IPI flag is set, we unconditionally exit to the host.
Fixes: 66feed61cd
Cc: stable@vger.kernel.org # v4.1+
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The current dynamic micro-threading code has a race due to which a
secondary thread naps when it is supposed to be running a vcpu. As a
side effect of this, on a guest exit, the primary thread in
kvmppc_wait_for_nap() finds that this secondary thread hasn't cleared
its vcore pointer. This results in "CPU X seems to be stuck!"
warnings.
The race is possible since the primary thread on exiting the guests
only waits for all the secondaries to clear its vcore pointer. It
subsequently expects the secondary threads to enter nap while it
unsplits the core. A secondary thread which hasn't yet entered the nap
will loop in kvm_no_guest until its vcore pointer and the do_nap flag
are unset. Once the core has been unsplit, a new vcpu thread can grab
the core and set the do_nap flag *before* setting the vcore pointers
of the secondary. As a result, the secondary thread will now enter nap
via kvm_unsplit_nap instead of running the guest vcpu.
Fix this by setting the do_nap flag after setting the vcore pointer in
the PACA of the secondary in kvmppc_run_core. Also, ensure that a
secondary thread doesn't nap in kvm_unsplit_nap when the vcore pointer
in its PACA struct is set.
Fixes: b4deba5c41
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
In 64 bit kernels, the Fixed Point Exception Register (XER) is a 64
bit field (e.g. in kvm_regs and kvm_vcpu_arch) and in most places it is
accessed as such.
This patch corrects places where it is accessed as a 32 bit field by a
64 bit kernel. In some cases this is via a 32 bit load or store
instruction which, depending on endianness, will cause either the
lower or upper 32 bits to be missed. In another case it is cast as a
u32, causing the upper 32 bits to be cleared.
This patch corrects those places by extending the access methods to
64 bits.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Whenever a vcore state is VCORE_PREEMPT we need to be counting stolen
time for it. This currently isn't the case when we have a vcore that
no longer has any runnable threads in it but still has a runner task,
so we do an explicit call to kvmppc_core_start_stolen() in that case.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When a vcore gets preempted, we put it on the preempted vcore list for
the current CPU. The runner task then calls schedule() and comes back
some time later and takes itself off the list. We need to be careful
to lock the list that it was put onto, which may not be the list for the
current CPU since the runner task may have moved to another CPU.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds implementations for the H_CLEAR_REF (test and clear reference
bit) and H_CLEAR_MOD (test and clear changed bit) hypercalls.
When clearing the reference or change bit in the guest view of the HPTE,
we also have to clear it in the real HPTE so that we can detect future
references or changes. When we do so, we transfer the R or C bit value
to the rmap entry for the underlying host page so that kvm_age_hva_hv(),
kvm_test_age_hva_hv() and kvmppc_hv_get_dirty_log() know that the page
has been referenced and/or changed.
These hypercalls are not used by Linux guests. These implementations
have been tested using a FreeBSD guest.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This fixes a bug in the tracking of pages that get modified by the
guest. If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.
To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.
The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages. However, we also fix
this case for the sake of future-proofing.
The reference bit is also subject to the same loss of information. We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified. Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
The reference (R) and change (C) bits in a HPT entry can be set by
hardware at any time up until the HPTE is invalidated and the TLB
invalidation sequence has completed. This means that when removing
a HPTE, we need to read the HPTE after the invalidation sequence has
completed in order to obtain reliable values of R and C. The code
in kvmppc_do_h_remove() used to do this. However, commit 6f22bd3265
("KVM: PPC: Book3S HV: Make HTAB code LE host aware") removed the
read after invalidation as a side effect of other changes. This
restores the read of the HPTE after invalidation.
The user-visible effect of this bug would be that when migrating a
guest, there is a small probability that a page modified by the guest
and then unmapped by the guest might not get re-transmitted and thus
the destination might end up with a stale copy of the page.
Fixes: 6f22bd3265
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip. Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core. With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.
Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.
To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread. In addition we extend the core_info struct to
have information on each subcore. When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.
Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode. It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.
This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap. In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered. The default is 6.
With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch. These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.
It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest. In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path. In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused. This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met. This applies on POWER7, and on POWER8 to guests
with one thread per virtual core. (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)
The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution. If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.
After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.
This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on. In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
On this switch branch the regs initialization
doesn't happen so add it.
This was found with the help of a static
code analysis tool.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
When compiling the KVM code for POWER with "make C=1", sparse
complains about functions missing proper prototypes and a 64-bit
constant missing the ULL prefix. Let's fix this by making the
functions static or by including the proper header with the
prototypes, and by appending a ULL prefix to the constant
PPC_MPPE_ADDRESS_MASK.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Since the PPC970 support has been removed from the kvm-hv kernel
module recently, we should also reflect this change in the help
text of the corresponding Kconfig option.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This was signaled by a static code analysis tool.
Signed-off-by: Laurentiu Tudor <Laurentiu.Tudor@freescale.com>
Reviewed-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Even though 'compatability' has a dedicated entry in the Wiktionary,
it's listed as 'Mispelling of compatibility'. Fix it.
Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> for the atomic_helper.c
Signed-off-by: Jiri Kosina <jkosina@suse.com>
Function should_resched() is equal to (!preempt_count() && need_resched()).
In preemptive kernel preempt_count here is non-zero because of vc->lock.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Graf <agraf@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150715095203.12246.72922.stgit@buzz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
for silicon that no one owns: these are really new features for
everyone.
* ARM: several features are in progress but missed the 4.2 deadline.
So here is just a smattering of bug fixes, plus enabling the VFIO
integration.
* s390: Some fixes/refactorings/optimizations, plus support for
2GB pages.
* x86: 1) host and guest support for marking kvmclock as a stable
scheduler clock. 2) support for write combining. 3) support for
system management mode, needed for secure boot in guests. 4) a bunch
of cleanups required for 2+3. 5) support for virtualized performance
counters on AMD; 6) legacy PCI device assignment is deprecated and
defaults to "n" in Kconfig; VFIO replaces it. On top of this there are
also bug fixes and eager FPU context loading for FPU-heavy guests.
* Common code: Support for multiple address spaces; for now it is
used only for x86 SMM but the s390 folks also have plans.
There are some x86 conflicts, one with the rc8 pull request and
the rest with Ingo's FPU rework.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJViYzhAAoJEL/70l94x66Dda0H/1IepMbfEy+o849d5G71fNTs
F8Y8qUP2GZuL7T53FyFUGSBw+AX7kimu9ia4gR/PmDK+QYsdosYeEjwlsolZfTBf
sHuzNtPoJhi5o1o/ur4NGameo0WjGK8f1xyzr+U8z74QDQyQv/QYCdK/4isp4BJL
ugHNHkuROX6Zng4i7jc9rfaSRg29I3GBxQUYpMkEnD3eMYMUBWGm6Rs8pHgGAMvL
vqzntgW00WNxehTqcAkmD/Wv+txxhkvIadZnjgaxH49e9JeXeBKTIR5vtb7Hns3s
SuapZUyw+c95DIipXq4EznxxaOrjbebOeFgLCJo8+XMXZum8RZf/ob24KroYad0=
=YsAR
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull first batch of KVM updates from Paolo Bonzini:
"The bulk of the changes here is for x86. And for once it's not for
silicon that no one owns: these are really new features for everyone.
Details:
- ARM:
several features are in progress but missed the 4.2 deadline.
So here is just a smattering of bug fixes, plus enabling the
VFIO integration.
- s390:
Some fixes/refactorings/optimizations, plus support for 2GB
pages.
- x86:
* host and guest support for marking kvmclock as a stable
scheduler clock.
* support for write combining.
* support for system management mode, needed for secure boot in
guests.
* a bunch of cleanups required for the above
* support for virtualized performance counters on AMD
* legacy PCI device assignment is deprecated and defaults to "n"
in Kconfig; VFIO replaces it
On top of this there are also bug fixes and eager FPU context
loading for FPU-heavy guests.
- Common code:
Support for multiple address spaces; for now it is used only for
x86 SMM but the s390 folks also have plans"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
KVM: s390: clear floating interrupt bitmap and parameters
KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
KVM: x86/vPMU: Implement AMD vPMU code for KVM
KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
KVM: x86/vPMU: introduce kvm_pmu_msr_idx_to_pmc
KVM: x86/vPMU: reorder PMU functions
KVM: x86/vPMU: whitespace and stylistic adjustments in PMU code
KVM: x86/vPMU: use the new macros to go between PMC, PMU and VCPU
KVM: x86/vPMU: introduce pmu.h header
KVM: x86/vPMU: rename a few PMU functions
KVM: MTRR: do not map huge page for non-consistent range
KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
KVM: MTRR: introduce mtrr_for_each_mem_type
KVM: MTRR: introduce fixed_mtrr_addr_* functions
KVM: MTRR: sort variable MTRRs
KVM: MTRR: introduce var_mtrr_range
KVM: MTRR: introduce fixed_mtrr_segment table
KVM: MTRR: improve kvm_mtrr_get_guest_memory_type
KVM: MTRR: do not split 64 bits MSR content
KVM: MTRR: clean up mtrr default type
...
- Disable the 32-bit vdso when building LE, so we can build with a 64-bit only
toolchain.
- EEH fixes from Gavin & Richard.
- Enable the sys_kcmp syscall from Laurent.
- Sysfs control for fastsleep workaround from Shreyas.
- Expose OPAL events as an irq chip by Alistair.
- MSI ops moved to pci_controller_ops by Daniel.
- Fix for kernel to userspace backtraces for perf from Anton.
- Merge pseries and pseries_le defconfigs from Cyril.
- CXL in-kernel API from Mikey.
- OPAL prd driver from Jeremy.
- Fix for DSCR handling & tests from Anshuman.
- Powernv flash mtd driver from Cyril.
- Dynamic DMA Window support on powernv from Alexey.
- LLVM clang fixes & workarounds from Anton.
- Reworked version of the patch to abort syscalls when transactional.
- Fix the swap encoding to support 4TB, from Aneesh.
- Various fixes as usual.
- Freescale updates from Scott: Highlights include more 8xx optimizations, an
e6500 hugetlb optimization, QMan device tree nodes, t1024/t1023 support, and
various fixes and cleanup.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJViSZqAAoJEFHr6jzI4aWAA7kQAKq3+pejfo2rY7alpKJyeVao
vlaIEaDNOTh+ctcmu3MFF9Jy6fai8gNZziRXU5JRmE5RW4GVBN4KZiqXRbkVjdBK
uG9sCX7Y58VRsS2vnGBYLsamfTMgjaXeDvgunQHVLiechJnrDr0RHEK90F3LSi73
Axp6l8XIG63a3zFZmkhzANMCme2lm5+MWmGlSjUUNi5F+viQUgJc5iiO8xrVUgM5
RpNlV2NJSqFiU+gMQWJ226V85UIniouq4j+qtyUcu8/m9BberyolXVU0GPlPFdsx
r/Qh9uCJyZaUdSB5hzomQZj50IsSz6J6nEuJTeGRoVZOmeI8Dnc2xU9fxQF5fC8H
lUJw10WPoNOggQZTeSUKn7wTXw3i4p3KsWNUczaW68VJdhqZUVaSp0+I6mnDSqzs
9iGC+VffLYNa1OHq7mGRFrgDdLBCHes31aZ3CxlQsmyNpAPCwMzsD4TUfVnvOG6E
oJOeaQ4mZM9PvqxEYJfoIL+vgRxmQ8sdIBtNY4in+C7J6eFnZNFO9xmPnJZuVU31
PGtx60kjFCOVMXvqn34WkRNbgqGWI91IK0KcRwFO2LXVio1uY77TWL52kNK2IMsp
Az+VDDvqnT3+BoV1yz0P6SrXAkwTpvFk2y+IdmEiUUN7zZFL5ZSA2epej9AzHTAK
WID2bc5yVtIL6p6x5ICH
=d9Wh
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
- disable the 32-bit vdso when building LE, so we can build with a
64-bit only toolchain.
- EEH fixes from Gavin & Richard.
- enable the sys_kcmp syscall from Laurent.
- sysfs control for fastsleep workaround from Shreyas.
- expose OPAL events as an irq chip by Alistair.
- MSI ops moved to pci_controller_ops by Daniel.
- fix for kernel to userspace backtraces for perf from Anton.
- merge pseries and pseries_le defconfigs from Cyril.
- CXL in-kernel API from Mikey.
- OPAL prd driver from Jeremy.
- fix for DSCR handling & tests from Anshuman.
- Powernv flash mtd driver from Cyril.
- dynamic DMA Window support on powernv from Alexey.
- LLVM clang fixes & workarounds from Anton.
- reworked version of the patch to abort syscalls when transactional.
- fix the swap encoding to support 4TB, from Aneesh.
- various fixes as usual.
- Freescale updates from Scott: Highlights include more 8xx
optimizations, an e6500 hugetlb optimization, QMan device tree nodes,
t1024/t1023 support, and various fixes and cleanup.
* tag 'powerpc-4.2-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (180 commits)
cxl: Fix typo in debug print
cxl: Add CXL_KERNEL_API config option
powerpc/powernv: Fix wrong IOMMU table in pnv_ioda_setup_bus_dma()
powerpc/mm: Change the swap encoding in pte.
powerpc/mm: PTE_RPN_MAX is not used, remove the same
powerpc/tm: Abort syscalls in active transactions
powerpc/iommu/ioda2: Enable compile with IOV=on and IOMMU_API=off
powerpc/include: Add opal-prd to installed uapi headers
powerpc/powernv: fix construction of opal PRD messages
powerpc/powernv: Increase opal-irqchip initcall priority
powerpc: Make doorbell check preemption safe
powerpc/powernv: pnv_init_idle_states() should only run on powernv
macintosh/nvram: Remove as unused
powerpc: Don't use gcc specific options on clang
powerpc: Don't use -mno-strict-align on clang
powerpc: Only use -mtraceback=no, -mno-string and -msoft-float if toolchain supports it
powerpc: Only use -mabi=altivec if toolchain supports it
powerpc: Fix duplicate const clang warning in user access code
vfio: powerpc/spapr: Support Dynamic DMA windows
vfio: powerpc/spapr: Register memory and define IOMMU v2
...
PACA_DSCR offset macro tracks dscr_default element in the paca
structure. Better change the name of this macro to match that of the
data element it tracks. Makes the code more readable.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot. It will simplify the code when more
than one address space will be supported.
Unfortunately, the "const"ness of the new argument must be casted
away in two places. Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents. Add const to
enforce this.
In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_memslots provides lockdep checking. Use it consistently instead of
explicit dereferencing of kvm->memslots.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This fixes a regression introduced in commit 25fedfca94, "KVM: PPC:
Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu", which
leads to a user-triggerable oops.
In the case where we try to run a vcore on a physical core that is
not in single-threaded mode, or the vcore has too many threads for
the physical core, we iterate the list of runnable vcpus to make
each one return an EBUSY error to userspace. Since this involves
taking each vcpu off the runnable_threads list for the vcore, we
need to use list_for_each_entry_safe rather than list_for_each_entry
to traverse the list. Otherwise the kernel will crash with an oops
message like this:
Unable to handle kernel paging request for data at address 0x000fff88
Faulting instruction address: 0xd00000001e635dc8
Oops: Kernel access of bad area, sig: 11 [#2]
SMP NR_CPUS=1024 NUMA PowerNV
...
CPU: 48 PID: 91256 Comm: qemu-system-ppc Tainted: G D 3.18.0 #1
task: c00000274e507500 ti: c0000027d1924000 task.ti: c0000027d1924000
NIP: d00000001e635dc8 LR: d00000001e635df8 CTR: c00000000011ba50
REGS: c0000027d19275b0 TRAP: 0300 Tainted: G D (3.18.0)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22002824 XER: 00000000
CFAR: c000000000008468 DAR: 00000000000fff88 DSISR: 40000000 SOFTE: 1
GPR00: d00000001e635df8 c0000027d1927830 d00000001e64c850 0000000000000001
GPR04: 0000000000000001 0000000000000001 0000000000000000 0000000000000000
GPR08: 0000000000200200 0000000000000000 0000000000000000 d00000001e63e588
GPR12: 0000000000002200 c000000007dbc800 c000000fc7800000 000000000000000a
GPR16: fffffffffffffffc c000000fd5439690 c000000fc7801c98 0000000000000001
GPR20: 0000000000000003 c0000027d1927aa8 c000000fd543b348 c000000fd543b350
GPR24: 0000000000000000 c000000fa57f0000 0000000000000030 0000000000000000
GPR28: fffffffffffffff0 c000000fd543b328 00000000000fe468 c000000fd543b300
NIP [d00000001e635dc8] kvmppc_run_core+0x198/0x17c0 [kvm_hv]
LR [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv]
Call Trace:
[c0000027d1927830] [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv] (unreliable)
[c0000027d1927a30] [d00000001e638350] kvmppc_vcpu_run_hv+0x5b0/0xdd0 [kvm_hv]
[c0000027d1927b70] [d00000001e510504] kvmppc_vcpu_run+0x44/0x60 [kvm]
[c0000027d1927ba0] [d00000001e50d4a4] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
[c0000027d1927be0] [d00000001e504be8] kvm_vcpu_ioctl+0x5e8/0x7a0 [kvm]
[c0000027d1927d40] [c0000000002d6720] do_vfs_ioctl+0x490/0x780
[c0000027d1927de0] [c0000000002d6ae4] SyS_ioctl+0xd4/0xf0
[c0000027d1927e30] [c000000000009358] syscall_exit+0x0/0x98
Instruction dump:
60000000 60420000 387e1b30 38800003 38a00001 38c00000 480087d9 e8410018
ebde1c98 7fbdf040 3bdee368 419e0048 <813e1b20> 939e1b18 2f890001 409effcc
---[ end trace 8cdf50251cca6680 ]---
Fixes: 25fedfca94
Signed-off-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use __kvm_guest_{enter|exit} instead of kvm_guest_{enter|exit}
where interrupts are disabled.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 34cb7954c0 "Convert ICS mutex lock to spin lock" added an
include of asm/spinlock.h, which does not work in the SMP=n case.
It should instead include linux/spinlock.h
Fixes: 34cb7954c0 ("KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock")
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Fix for mm_dec_nr_pmds() from Scott.
- Fixes for oopses seen with KVM + THP from Aneesh.
- Build fixes from Aneesh & Shreyas.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVOsk5AAoJEFHr6jzI4aWACWMP/3EaNoeA1g8VbZWZEdRaoLvX
W7D08DI3Dt8HLxyn2JR08jYZF0gr68XrF6OiscVVki7wVXT8fbH4jSNmBkbzNH95
d9taScJyR1CUavkhsXivnR1qEE1Fi2KA2OW9RaNfoSt1MVtdsvOK6xXklUGksuJQ
XygzyrRr4Dj82kuMUAMO0YDMvknMlzi3a8dzyrWZBXBZOOTWavGB6bQKtCTaOQ99
3OFGLQ10uY7lmdHDi0t0tQ99FuYfLiJpg5fTLoUni4J5tFp8JlZ+x0Gwc0apN0cy
Ym8EO6++qWDv8FXvYEPfVUEjbF1fyPiawUgpkMnyvXgd8K5G85SIrtkGW0Ml+6sX
GfJH8w9hpDbF5EnWlC9bn/jT7sHBHFdrxZuQUc0L4M2OtM73R2a0Xr3b7ZxFCD1q
7RpYu8MKKcyvaIXNg7VBJjj8zL+WmUJKF6J5uX5bGU2xH0khmp0vTknyyjbwrlcF
uHidv5ZhMt3aAI70v14jA5BTEmLyOYRu58Ei6cT/VT/DjdbpEApdK8BMAvKSEeib
+hzh6oDFT92AM0tbg15bNmqGbGfgqtVKe4GDS2QyGaHGAFOGs1nPuSa9se1xYDcM
CCtRyABwpzJsrCfwra2fsTU6FxlatK4ONViyWFBXa6mEjBNSZ4XmyZvdWUqlwpSC
F5jNGppm5Ama6xxcLphA
=6yQx
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc fixes from Michael Ellerman:
- fix for mm_dec_nr_pmds() from Scott.
- fixes for oopses seen with KVM + THP from Aneesh.
- build fixes from Aneesh & Shreyas.
* tag 'powerpc-4.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc/mm: Fix build error with CONFIG_PPC_TRANSACTIONAL_MEM disabled
powerpc/kvm: Fix ppc64_defconfig + PPC_POWERNV=n build error
powerpc/mm/thp: Return pte address if we find trans_splitting.
powerpc/mm/thp: Make page table walk safe against thp split/collapse
KVM: PPC: Remove page table walk helpers
KVM: PPC: Use READ_ONCE when dereferencing pte_t pointer
powerpc/hugetlb: Call mm_dec_nr_pmds() in hugetlb_free_pmd_range()
Book3S HV only (debugging aids, minor performance improvements and some
cleanups). But there are also bug fixes and small cleanups for ARM,
x86 and s390.
The task_migration_notifier revert and real fix is still pending review,
but I'll send it as soon as possible after -rc1.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVOONLAAoJEL/70l94x66DbsMIAIpZPsaqgXOC1sDEiZuYay+6
rD4n4id7j8hIAzcf3AlZdyf5XgLlr6I1Zyt62s1WcoRq/CCnL7k9EljzSmw31WFX
P2y7/J0iBdkn0et+PpoNThfL2GsgTqNRCLOOQlKgEQwMP9Dlw5fnUbtC1UchOzTg
eAMeBIpYwufkWkXhdMw4PAD4lJ9WxUZ1eXHEBRzJb0o0ZxAATJ1tPZGrFJzoUOSM
WsVNTuBsNd7upT02kQdvA1TUo/OPjseTOEoksHHwfcORt6bc5qvpctL3jYfcr7sk
/L6sIhYGVNkjkuredjlKGLfT2DDJjSEdJb1k2pWrDRsY76dmottQubAE9J9cDTk=
=OAi2
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second batch of KVM changes from Paolo Bonzini:
"This mostly includes the PPC changes for 4.1, which this time cover
Book3S HV only (debugging aids, minor performance improvements and
some cleanups). But there are also bug fixes and small cleanups for
ARM, x86 and s390.
The task_migration_notifier revert and real fix is still pending
review, but I'll send it as soon as possible after -rc1"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (29 commits)
KVM: arm/arm64: check IRQ number on userland injection
KVM: arm: irqfd: fix value returned by kvm_irq_map_gsi
KVM: VMX: Preserve host CR4.MCE value while in guest mode.
KVM: PPC: Book3S HV: Use msgsnd for signalling threads on POWER8
KVM: PPC: Book3S HV: Translate kvmhv_commence_exit to C
KVM: PPC: Book3S HV: Streamline guest entry and exit
KVM: PPC: Book3S HV: Use bitmap of active threads rather than count
KVM: PPC: Book3S HV: Use decrementer to wake napping threads
KVM: PPC: Book3S HV: Don't wake thread with no vcpu on guest IPI
KVM: PPC: Book3S HV: Get rid of vcore nap_count and n_woken
KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu
KVM: PPC: Book3S HV: Minor cleanups
KVM: PPC: Book3S HV: Simplify handling of VCPUs that need a VPA update
KVM: PPC: Book3S HV: Accumulate timing information for real-mode code
KVM: PPC: Book3S HV: Create debugfs file for each guest's HPT
KVM: PPC: Book3S HV: Add ICP real mode counters
KVM: PPC: Book3S HV: Move virtual mode ICP functions to real-mode
KVM: PPC: Book3S HV: Convert ICS mutex lock to spin lock
KVM: PPC: Book3S HV: Add guest->host real mode completion counters
KVM: PPC: Book3S HV: Add helpers for lock/unlock hpte
...
This uses msgsnd where possible for signalling other threads within
the same core on POWER8 systems, rather than IPIs through the XICS
interrupt controller. This includes waking secondary threads to run
the guest, the interrupts generated by the virtual XICS, and the
interrupts to bring the other threads out of the guest when exiting.
Aggregated statistics from debugfs across vcpus for a guest with 32
vcpus, 8 threads/vcore, running on a POWER8, show this before the
change:
rm_entry: 3387.6ns (228 - 86600, 1008969 samples)
rm_exit: 4561.5ns (12 - 3477452, 1009402 samples)
rm_intr: 1660.0ns (12 - 553050, 3600051 samples)
and this after the change:
rm_entry: 3060.1ns (212 - 65138, 953873 samples)
rm_exit: 4244.1ns (12 - 9693408, 954331 samples)
rm_intr: 1342.3ns (12 - 1104718, 3405326 samples)
for a test of booting Fedora 20 big-endian to the login prompt.
The time taken for a H_PROD hcall (which is handled in the host
kernel) went down from about 35 microseconds to about 16 microseconds
with this change.
The noinline added to kvmppc_run_core turned out to be necessary for
good performance, at least with gcc 4.9.2 as packaged with Fedora 21
and a little-endian POWER8 host.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This replaces the assembler code for kvmhv_commence_exit() with C code
in book3s_hv_builtin.c. It also moves the IPI sending code that was
in book3s_hv_rm_xics.c into a new kvmhv_rm_send_ipi() function so it
can be used by kvmhv_commence_exit() as well as icp_rm_set_vcpu_irq().
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
On entry to the guest, secondary threads now wait for the primary to
switch the MMU after loading up most of their state, rather than before.
This means that the secondary threads get into the guest sooner, in the
common case where the secondary threads get to kvmppc_hv_entry before
the primary thread.
On exit, the first thread out increments the exit count and interrupts
the other threads (to get them out of the guest) before saving most
of its state, rather than after. That means that the other threads
exit sooner and means that the first thread doesn't spend so much
time waiting for the other threads at the point where the MMU gets
switched back to the host.
This pulls out the code that increments the exit count and interrupts
other threads into a separate function, kvmhv_commence_exit().
This also makes sure that r12 and vcpu->arch.trap are set correctly
in some corner cases.
Statistics from /sys/kernel/debug/kvm/vm*/vcpu*/timings show the
improvement. Aggregating across vcpus for a guest with 32 vcpus,
8 threads/vcore, running on a POWER8, gives this before the change:
rm_entry: avg 4537.3ns (222 - 48444, 1068878 samples)
rm_exit: avg 4787.6ns (152 - 165490, 1010717 samples)
rm_intr: avg 1673.6ns (12 - 341304, 3818691 samples)
and this after the change:
rm_entry: avg 3427.7ns (232 - 68150, 1118921 samples)
rm_exit: avg 4716.0ns (12 - 150720, 1119477 samples)
rm_intr: avg 1614.8ns (12 - 522436, 3850432 samples)
showing a substantial reduction in the time spent per guest entry in
the real-mode guest entry code, and smaller reductions in the real
mode guest exit and interrupt handling times. (The test was to start
the guest and boot Fedora 20 big-endian to the login prompt.)
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently, the entry_exit_count field in the kvmppc_vcore struct
contains two 8-bit counts, one of the threads that have started entering
the guest, and one of the threads that have started exiting the guest.
This changes it to an entry_exit_map field which contains two bitmaps
of 8 bits each. The advantage of doing this is that it gives us a
bitmap of which threads need to be signalled when exiting the guest.
That means that we no longer need to use the trick of setting the
HDEC to 0 to pull the other threads out of the guest, which led in
some cases to a spurious HDEC interrupt on the next guest entry.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This arranges for threads that are napping due to their vcpu having
ceded or due to not having a vcpu to wake up at the end of the guest's
timeslice without having to be poked with an IPI. We do that by
arranging for the decrementer to contain a value no greater than the
number of timebase ticks remaining until the end of the timeslice.
In the case of a thread with no vcpu, this number is in the hypervisor
decrementer already. In the case of a ceded vcpu, we use the smaller
of the HDEC value and the DEC value.
Using the DEC like this when ceded means we need to save and restore
the guest decrementer value around the nap.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When running a multi-threaded guest and vcpu 0 in a virtual core
is not running in the guest (i.e. it is busy elsewhere in the host),
thread 0 of the physical core will switch the MMU to the guest and
then go to nap mode in the code at kvm_do_nap. If the guest sends
an IPI to thread 0 using the msgsndp instruction, that will wake
up thread 0 and cause all the threads in the guest to exit to the
host unnecessarily. To avoid the unnecessary exit, this arranges
for the PECEDP bit to be cleared in this situation. When napping
due to a H_CEDE from the guest, we still set PECEDP so that the
thread will wake up on an IPI sent using msgsndp.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
We can tell when a secondary thread has finished running a guest by
the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there
is no real need for the nap_count field in the kvmppc_vcore struct.
This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu
pointers of the secondary threads rather than polling vc->nap_count.
Besides reducing the size of the kvmppc_vcore struct by 8 bytes,
this also means that we can tell which secondary threads have got
stuck and thus print a more informative error message.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Rather than calling cond_resched() in kvmppc_run_core() before doing
the post-processing for the vcpus that we have just run (that is,
calling kvmppc_handle_exit_hv(), kvmppc_set_timer(), etc.), we now do
that post-processing before calling cond_resched(), and that post-
processing is moved out into its own function, post_guest_process().
The reschedule point is now in kvmppc_run_vcpu() and we define a new
vcore state, VCORE_PREEMPT, to indicate that that the vcore's runner
task is runnable but not running. (Doing the reschedule with the
vcore in VCORE_INACTIVE state would be bad because there are potentially
other vcpus waiting for the runner in kvmppc_wait_for_exec() which
then wouldn't get woken up.)
Also, we make use of the handy cond_resched_lock() function, which
unlocks and relocks vc->lock for us around the reschedule.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
* Remove unused kvmppc_vcore::n_busy field.
* Remove setting of RMOR, since it was only used on PPC970 and the
PPC970 KVM support has been removed.
* Don't use r1 or r2 in setting the runlatch since they are
conventionally reserved for other things; use r0 instead.
* Streamline the code a little and remove the ext_interrupt_to_host
label.
* Add some comments about register usage.
* hcall_try_real_mode doesn't need to be global, and can't be
called from C code anyway.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Previously, if kvmppc_run_core() was running a VCPU that needed a VPA
update (i.e. one of its 3 virtual processor areas needed to be pinned
in memory so the host real mode code can update it on guest entry and
exit), we would drop the vcore lock and do the update there and then.
Future changes will make it inconvenient to drop the lock, so instead
we now remove it from the list of runnable VCPUs and wake up its
VCPU task. This will have the effect that the VCPU task will exit
kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and
re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call
to kvmppc_update_vpas() and then rejoin the vcore.
The one complication is that the runner VCPU (whose VCPU task is the
current task) might be one of the ones that gets removed from the
runnable list. In that case we just return from kvmppc_run_core()
and let the code in kvmppc_run_vcpu() wake up another VCPU task to be
the runner if necessary.
This all means that the VCORE_STARTING state is no longer used, so we
remove it.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This reads the timebase at various points in the real-mode guest
entry/exit code and uses that to accumulate total, minimum and
maximum time spent in those parts of the code. Currently these
times are accumulated per vcpu in 5 parts of the code:
* rm_entry - time taken from the start of kvmppc_hv_entry() until
just before entering the guest.
* rm_intr - time from when we take a hypervisor interrupt in the
guest until we either re-enter the guest or decide to exit to the
host. This includes time spent handling hcalls in real mode.
* rm_exit - time from when we decide to exit the guest until the
return from kvmppc_hv_entry().
* guest - time spend in the guest
* cede - time spent napping in real mode due to an H_CEDE hcall
while other threads in the same vcore are active.
These times are exposed in debugfs in a directory per vcpu that
contains a file called "timings". This file contains one line for
each of the 5 timings above, with the name followed by a colon and
4 numbers, which are the count (number of times the code has been
executed), the total time, the minimum time, and the maximum time,
all in nanoseconds.
The overhead of the extra code amounts to about 30ns for an hcall that
is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since
production environments may not wish to incur this overhead, the new
code is conditional on a new config symbol,
CONFIG_KVM_BOOK3S_HV_EXIT_TIMING.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This creates a debugfs directory for each HV guest (assuming debugfs
is enabled in the kernel config), and within that directory, a file
by which the contents of the guest's HPT (hashed page table) can be
read. The directory is named vmnnnn, where nnnn is the PID of the
process that created the guest. The file is named "htab". This is
intended to help in debugging problems in the host's management
of guest memory.
The contents of the file consist of a series of lines like this:
3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196
The first field is the index of the entry in the HPT, the second and
third are the HPT entry, so the third entry contains the real page
number that is mapped by the entry if the entry's valid bit is set.
The fourth field is the guest's view of the second doubleword of the
entry, so it contains the guest physical address. (The format of the
second through fourth fields are described in the Power ISA and also
in arch/powerpc/include/asm/mmu-hash64.h.)
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Add two counters to count how often we generate real-mode ICS resend
and reject events. The counters provide some performance statistics
that could be used in the future to consider if the real mode functions
need further optimizing. The counters are displayed as part of IPC and
ICP state provided by /sys/debug/kernel/powerpc/kvm* for each VM.
Also added two counters that count (approximately) how many times we
don't find an ICP or ICS we're looking for. These are not currently
exposed through sysfs, but can be useful when debugging crashes.
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Interrupt-based hypercalls return H_TOO_HARD to inform KVM that it needs
to switch to the host to complete the rest of hypercall function in
virtual mode. This patch ports the virtual mode ICS/ICP reject and resend
functions to be runnable in hypervisor real mode, thus avoiding the need
to switch to the host to execute these functions in virtual mode. However,
the hypercalls continue to return H_TOO_HARD for vcpu_wakeup and notify
events - these events cannot be done in real mode and they will still need
a switch to host virtual mode.
There are sufficient differences between the real mode code and the
virtual mode code for the ICS/ICP resend and reject functions that
for now the code has been duplicated instead of sharing common code.
In the future, we can look at creating common functions.
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Replaces the ICS mutex lock with a spin lock since we will be porting
these routines to real mode. Note that we need to disable interrupts
before we take the lock in anticipation of the fact that on the guest
side, we are running in the context of a hard irq and interrupts are
disabled (EE bit off) when the lock is acquired. Again, because we
will be acquiring the lock in hypervisor real mode, we need to use
an arch_spinlock_t instead of a normal spinlock here as we want to
avoid running any lockdep code (which may not be safe to execute in
real mode).
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Add counters to track number of times we switch from guest real mode
to host virtual mode during an interrupt-related hyper call because the
hypercall requires actions that cannot be completed in real mode. This
will help when making optimizations that reduce guest-host transitions.
It is safe to use an ordinary increment rather than an atomic operation
because there is one ICP per virtual CPU and kvmppc_xics_rm_complete()
only works on the ICP for the current VCPU.
The counters are displayed as part of IPC and ICP state provided by
/sys/debug/kernel/powerpc/kvm* for each VM.
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This adds helper routines for locking and unlocking HPTEs, and uses
them in the rest of the code. We don't change any locking rules in
this patch.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
We don't support real-mode areas now that 970 support is removed.
Remove the remaining details of rma from the code. Also rename
rma_setup_done to hpte_setup_done to better reflect the changes.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Some PowerNV systems include a hardware random-number generator.
This HWRNG is present on POWER7+ and POWER8 chips and is capable of
generating one 64-bit random number every microsecond. The random
numbers are produced by sampling a set of 64 unstable high-frequency
oscillators and are almost completely entropic.
PAPR defines an H_RANDOM hypercall which guests can use to obtain one
64-bit random sample from the HWRNG. This adds a real-mode
implementation of the H_RANDOM hypercall. This hypercall was
implemented in real mode because the latency of reading the HWRNG is
generally small compared to the latency of a guest exit and entry for
all the threads in the same virtual core.
Userspace can detect the presence of the HWRNG and the H_RANDOM
implementation by querying the KVM_CAP_PPC_HWRNG capability. The
H_RANDOM hypercall implementation will only be invoked when the guest
does an H_RANDOM hypercall if userspace first enables the in-kernel
H_RANDOM implementation using the KVM_CAP_PPC_ENABLE_HCALL capability.
Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
On POWER, storage caching is usually configured via the MMU - attributes
such as cache-inhibited are stored in the TLB and the hashed page table.
This makes correctly performing cache inhibited IO accesses awkward when
the MMU is turned off (real mode). Some CPU models provide special
registers to control the cache attributes of real mode load and stores but
this is not at all consistent. This is a problem in particular for SLOF,
the firmware used on KVM guests, which runs entirely in real mode, but
which needs to do IO to load the kernel.
To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD
and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to
a logical address (aka guest physical address). SLOF uses these for IO.
However, because these are implemented within qemu, not the host kernel,
these bypass any IO devices emulated within KVM itself. The simplest way
to see this problem is to attempt to boot a KVM guest from a virtio-blk
device with iothread / dataplane enabled. The iothread code relies on an
in kernel implementation of the virtio queue notification, which is not
triggered by the IO hcalls, and so the guest will stall in SLOF unable to
load the guest OS.
This patch addresses this by providing in-kernel implementations of the
2 hypercalls, which correctly scan the KVM IO bus. Any access to an
address not handled by the KVM IO bus will cause a VM exit, hitting the
qemu implementation as before.
Note that a userspace change is also required, in order to enable these
new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
[agraf: fix compilation]
Signed-off-by: Alexander Graf <agraf@suse.de>
kvm_no_guest() calls power7_wakeup_loss() to put the thread into the
deepest supported idle state. power7_wakeup_loss() is defined in
arch/powerpc/kernel/idle_power7.S, which is compiled only when
PPC_P7_NAP=y.
And PPC_P7_NAP is selected when PPC_POWERNV=y.
Hence in cases where PPC_POWERNV=n and KVM_BOOK3S_64_HV=y we see the
following error:
arch/powerpc/kvm/built-in.o: In function `kvm_no_guest':
arch/powerpc/kvm/book3s_hv_rmhandlers.o:(.text+0x42c): undefined reference to `power7_wakeup_loss'
Fix this by adding PPC_POWERNV as a dependency for KVM_BOOK3S_64_HV.
Signed-off-by: Shreyas B. Prabhu <shreyas@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For THP that is marked trans splitting, we return the pte.
This require the callers to handle the pmd_trans_splitting scenario,
if they care. All the current callers are either looking at pfn or
write_ok, hence we don't need to update them.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We can disable a THP split or a hugepage collapse by disabling irq.
We do send IPI to all the cpus in the early part of split/collapse,
and disabling local irq ensure we don't make progress with
split/collapse. If the THP is getting split we return NULL from
find_linux_pte_or_hugepte(). For all the current callers it should be ok.
We need to be careful if we want to use returned pte_t pointer outside
the irq disabled region. W.r.t to THP split, the pfn remains the same,
but then a hugepage collapse will result in a pfn change. There are
few steps we can take to avoid a hugepage collapse.One way is to take page
reference inside the irq disable region. Other option is to take
mmap_sem so that a parallel collapse will not happen. We can also
disable collapse by taking pmd_lock. Another method used by kvm
subsystem is to check whether we had a mmu_notifer update in between
using mmu_notifier_retry().
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch remove helpers which we had used only once in the code.
Limiting page table walk variants help in ensuring that we won't
end up with code walking page table with wrong assumptions.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pte can get updated from other CPUs as part of multiple activities
like THP split, huge page collapse, unmap. We need to make sure we
don't reload the pte value again and again for different checks.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- Numerous minor fixes, cleanups etc.
- More EEH work from Gavin to remove its dependency on device_nodes.
- Memory hotplug implemented entirely in the kernel from Nathan Fontenot.
- Removal of redundant CONFIG_PPC_OF by Kevin Hao.
- Rewrite of VPHN parsing logic & tests from Greg Kurz.
- A fix from Nish Aravamudan to reduce memory usage by clamping
nodes_possible_map.
- Support for pstore on powernv from Hari Bathini.
- Removal of old powerpc specific byte swap routines by David Gibson.
- Fix from Vasant Hegde to prevent the flash driver telling you it was flashing
your firmware when it wasn't.
- Patch from Ben Herrenschmidt to add an OPAL heartbeat driver.
- Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan Stancek.
- Some fixes for migration from Tyrel Datwyler.
- A new syscall to switch the cpu endian by Michael Ellerman.
- Large series from Wei Yang to implement SRIOV, reviewed and acked by Bjorn.
- A fix for the OPAL sensor driver from Cédric Le Goater.
- Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman.
- Large series from Daniel Axtens to make our PCI hooks per PHB rather than per
machine.
- Small patch from Sam Bobroff to explicitly abort non-suspended transactions
on syscalls, plus a test to exercise it.
- Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu.
- Small patch to enable the hard lockup detector from Anton Blanchard.
- Fix from Dave Olson for missing L2 cache information on some CPUs.
- Some fixes from Michael Ellerman to get Cell machines booting again.
- Freescale updates from Scott: Highlights include BMan device tree nodes, an
MSI erratum workaround, a couple minor performance improvements, config
updates, and misc fixes/cleanup.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVL2cxAAoJEFHr6jzI4aWAR8cP/19VTo/CzCE4ffPSx7qR464n
F+WFZcbNjIMXu6+B0YLuJZEsuWtKKrCit/MCg3+mSgE4iqvxmtI+HDD0445Buszj
UD4E4HMdPrXQ+KUSUDORvRjv/FFUXIa94LSv/0g2UeMsPz/HeZlhMxEu7AkXw9Nf
rTxsmRTsOWME85Y/c9ss7XHuWKXT3DJV7fOoK9roSaN3dJAuWTtG3WaKS0nUu0ok
0M81D6ZczoD6ybwh2DUMPD9K6SGxLdQ4OzQwtW6vWzcQIBDfy5Pdeo0iAFhGPvXf
T4LLPkv4cF4AwHsAC4rKDPHQNa+oZBoLlScrHClaebAlDiv+XYKNdMogawUObvSh
h7avKmQr0Ygp1OvvZAaXLhuDJI9FJJ8lf6AOIeULgHsDR9SyKMjZWxRzPe11uarO
Fyi0qj3oJaQu6LjazZraApu8mo+JBtQuD3z3o5GhLxeFtBBF60JXj6zAXJikufnl
kk1/BUF10nKUhtKcDX767AMUCtMH3fp5hx8K/z9T5v+pobJB26Wup1bbdT68pNBT
NjdKUppV6QTjZvCsA6U2/ECu6E9KeIaFtFSL2IRRoiI0dWBN5/5eYn3RGkO2ZFoL
1NdwKA2XJcchwTPkpSRrUG70sYH0uM2AldNYyaLfjzrQqza7Y6lF699ilxWmCN/H
OplzJAE5cQ8Am078veTW
=03Yh
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull powerpc updates from Michael Ellerman:
- Numerous minor fixes, cleanups etc.
- More EEH work from Gavin to remove its dependency on device_nodes.
- Memory hotplug implemented entirely in the kernel from Nathan
Fontenot.
- Removal of redundant CONFIG_PPC_OF by Kevin Hao.
- Rewrite of VPHN parsing logic & tests from Greg Kurz.
- A fix from Nish Aravamudan to reduce memory usage by clamping
nodes_possible_map.
- Support for pstore on powernv from Hari Bathini.
- Removal of old powerpc specific byte swap routines by David Gibson.
- Fix from Vasant Hegde to prevent the flash driver telling you it was
flashing your firmware when it wasn't.
- Patch from Ben Herrenschmidt to add an OPAL heartbeat driver.
- Fix for an oops causing get/put_cpu_var() imbalance in perf by Jan
Stancek.
- Some fixes for migration from Tyrel Datwyler.
- A new syscall to switch the cpu endian by Michael Ellerman.
- Large series from Wei Yang to implement SRIOV, reviewed and acked by
Bjorn.
- A fix for the OPAL sensor driver from Cédric Le Goater.
- Fixes to get STRICT_MM_TYPECHECKS building again by Michael Ellerman.
- Large series from Daniel Axtens to make our PCI hooks per PHB rather
than per machine.
- Small patch from Sam Bobroff to explicitly abort non-suspended
transactions on syscalls, plus a test to exercise it.
- Numerous reworks and fixes for the 24x7 PMU from Sukadev Bhattiprolu.
- Small patch to enable the hard lockup detector from Anton Blanchard.
- Fix from Dave Olson for missing L2 cache information on some CPUs.
- Some fixes from Michael Ellerman to get Cell machines booting again.
- Freescale updates from Scott: Highlights include BMan device tree
nodes, an MSI erratum workaround, a couple minor performance
improvements, config updates, and misc fixes/cleanup.
* tag 'powerpc-4.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux: (196 commits)
powerpc/powermac: Fix build error seen with powermac smp builds
powerpc/pseries: Fix compile of memory hotplug without CONFIG_MEMORY_HOTREMOVE
powerpc: Remove PPC32 code from pseries specific find_and_init_phbs()
powerpc/cell: Fix iommu breakage caused by controller_ops change
powerpc/eeh: Fix crash in eeh_add_device_early() on Cell
powerpc/perf: Cap 64bit userspace backtraces to PERF_MAX_STACK_DEPTH
powerpc/perf/hv-24x7: Fail 24x7 initcall if create_events_from_catalog() fails
powerpc/pseries: Correct memory hotplug locking
powerpc: Fix missing L2 cache size in /sys/devices/system/cpu
powerpc: Add ppc64 hard lockup detector support
oprofile: Disable oprofile NMI timer on ppc64
powerpc/perf/hv-24x7: Add missing put_cpu_var()
powerpc/perf/hv-24x7: Break up single_24x7_request
powerpc/perf/hv-24x7: Define update_event_count()
powerpc/perf/hv-24x7: Whitespace cleanup
powerpc/perf/hv-24x7: Define add_event_to_24x7_request()
powerpc/perf/hv-24x7: Rename hv_24x7_event_update
powerpc/perf/hv-24x7: Move debug prints to separate function
powerpc/perf/hv-24x7: Drop event_24x7_request()
powerpc/perf/hv-24x7: Use pr_devel() to log message
...
Conflicts:
tools/testing/selftests/powerpc/Makefile
tools/testing/selftests/powerpc/tm/Makefile
ARM/ARM64: fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390: interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS: FPU and MIPS SIMD Architecture (MSA) support. Includes some patches
from Ralf Baechle's MIPS tree.
x86: bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVJ9vmAAoJEL/70l94x66DoMEH/R3rh8IMf4jTiWRkcqohOMPX
k1+NaSY/lCKayaSgggJ2hcQenMbQoXEOdslvaA/H0oC+VfJGK+lmU6E63eMyyhjQ
Y+Px6L85NENIzDzaVu/TIWWuhil5PvIRr3VO8cvntExRoCjuekTUmNdOgCvN2ObW
wswN2qRdPIeEj2kkulbnye+9IV4G0Ne9bvsmUdOdfSSdi6ZcV43JcvrpOZT++mKj
RrKB+3gTMZYGJXMMLBwMkdl8mK1ozriD+q0mbomT04LUyGlPwYLl4pVRDBqyksD7
KsSSybaK2E4i5R80WEljgDMkNqrCgNfg6VZe4n9Y+CfAAOToNnkMJaFEi+yuqbs=
=yu2b
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.1
The most interesting bit here is irqfd/ioeventfd support for ARM and
ARM64.
Summary:
ARM/ARM64:
fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390:
interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS:
FPU and MIPS SIMD Architecture (MSA) support. Includes some
patches from Ralf Baechle's MIPS tree.
x86:
bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
KVM: use slowpath for cross page cached accesses
kvm: mmu: lazy collapse small sptes into large sptes
KVM: x86: Clear CR2 on VCPU reset
KVM: x86: DR0-DR3 are not clear on reset
KVM: x86: BSP in MSR_IA32_APICBASE is writable
KVM: x86: simplify kvm_apic_map
KVM: x86: avoid logical_map when it is invalid
KVM: x86: fix mixed APIC mode broadcast
KVM: x86: use MDA for interrupt matching
kvm/ppc/mpic: drop unused IRQ_testbit
KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
KVM: vmx: pass error code with internal error #2
x86: vdso: fix pvclock races with task migration
KVM: remove kvm_read_hva and kvm_read_hva_atomic
KVM: x86: optimize delivery of TSC deadline timer interrupt
KVM: x86: extract blocking logic from __vcpu_run
kvm: x86: fix x86 eflags fixed bit
KVM: s390: migrate vcpu interrupt state
...
Drop unused static procedure which doesn't have callers within its
translation unit. It had been already removed independently in QEMU[1]
from the OpenPIC implementation borrowed from the kernel.
[1] https://lists.gnu.org/archive/html/qemu-devel/2014-06/msg01812.html
Signed-off-by: Arseny Solokha <asolokha@kb.kras.ru>
Cc: Alexander Graf <agraf@suse.de>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1424768706-23150-3-git-send-email-asolokha@kb.kras.ru>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>