This allows userspace to set the desired virtual SMT (simultaneous
multithreading) mode for a VM, that is, the number of VCPUs that
get assigned to each virtual core. Previously, the virtual SMT mode
was fixed to the number of threads per subcore, and if userspace
wanted to have fewer vcpus per vcore, then it would achieve that by
using a sparse CPU numbering. This had the disadvantage that the
vcpu numbers can get quite large, particularly for SMT1 guests on
a POWER8 with 8 threads per core. With this patch, userspace can
set its desired virtual SMT mode and then use contiguous vcpu
numbering.
On POWER8, where the threading mode is "strict", the virtual SMT mode
must be less than or equal to the number of threads per subcore. On
POWER9, which implements a "loose" threading mode, the virtual SMT
mode can be any power of 2 between 1 and 8, even though there is
effectively one thread per subcore, since the threads are independent
and can all be in different partitions.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to allow us to use a different value for the HFSCR
(Hypervisor Facilities Status and Control Register) when running the
guest from that which applies in the host. The reason for doing this
is to allow us to trap the msgsndp instruction and related operations
in future so that they can be virtualized. We also save the value of
HFSCR when a hypervisor facility unavailable interrupt occurs, because
the high byte of HFSCR indicates which facility the guest attempted to
access.
We save and restore the host value on guest entry/exit because some
bits of it affect host userspace execution.
We only do all this on POWER9, not on POWER8, because we are not
intending to virtualize any of the facilities controlled by HFSCR on
POWER8. In particular, the HFSCR bit that controls execution of
msgsndp and related operations does not exist on POWER8. The HFSCR
doesn't exist at all on POWER7.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This allows userspace (e.g. QEMU) to enable large decrementer mode for
the guest when running on a POWER9 host, by setting the LPCR_LD bit in
the guest LPCR value. With this, the guest exit code saves 64 bits of
the guest DEC value on exit. Other places that use the guest DEC
value check the LPCR_LD bit in the guest LPCR value, and if it is set,
omit the 32-bit sign extension that would otherwise be done.
This doesn't change the DEC emulation used by PR KVM because PR KVM
is not supported on POWER9 yet.
This is partly based on an earlier patch by Oliver O'Halloran.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This merges in the powerpc topic/xive branch to bring in the code for
the in-kernel XICS interrupt controller emulation to use the new XIVE
(eXternal Interrupt Virtualization Engine) hardware in the POWER9 chip
directly, rather than via a XICS emulation in firmware.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch makes KVM capable of using the XIVE interrupt controller
to provide the standard PAPR "XICS" style hypercalls. It is necessary
for proper operations when the host uses XIVE natively.
This has been lightly tested on an actual system, including PCI
pass-through with a TG3 device.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Cleanup pr_xxx(), unsplit pr_xxx() strings, etc., fix build
failures by adding KVM_XIVE which depends on KVM_XICS and XIVE, and
adding empty stubs for the kvm_xive_xxx() routines, fixup subject,
integrate fixes from Paul for building PR=y HV=n]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.
This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in the real mode, if failed
it passes the request to the virtual mode to complete the operation.
If it a virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.
To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.
If we fail to update a hardware IOMMU table unexpected reason, we just
clear it and move on as there is nothing really we can do about it -
for example, if we hot plug a VFIO device to a guest, existing TCE tables
will be mirrored automatically to the hardware and there is no interface
to report to the guest about possible failures.
This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is cached and referenced so we do not have to look up for it in real mode.
This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is destroyed; this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.
This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.
This adds real mode version of WARN_ON_ONCE() as the generic version
causes problems with rcu_sched. Since we testing what vmalloc_to_phys()
returns in the code, this also adds a check for already existing
vmalloc_to_phys() call in kvmppc_rm_h_put_tce_indirect().
This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.
Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
PR KVM page fault handler performs eaddr to pte translation for a guest,
however kvmppc_mmu_book3s_64_xlate() does not preserve WIMG bits
(storage control) in the kvmppc_pte struct. If PR KVM is running as
a second level guest under HV KVM, and PR KVM tries inserting HPT entry,
this fails in HV KVM if it already has this mapping.
This preserves WIMG bits between kvmppc_mmu_book3s_64_xlate() and
kvmppc_mmu_map_page().
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch provides the MMIO load/store emulation for instructions
of 'double & vector unsigned char & vector signed char & vector
unsigned short & vector signed short & vector unsigned int & vector
signed int & vector double '.
The instructions that this adds emulation for are:
- ldx, ldux, lwax,
- lfs, lfsx, lfsu, lfsux, lfd, lfdx, lfdu, lfdux,
- stfs, stfsx, stfsu, stfsux, stfd, stfdx, stfdu, stfdux, stfiwx,
- lxsdx, lxsspx, lxsiwax, lxsiwzx, lxvd2x, lxvw4x, lxvdsx,
- stxsdx, stxsspx, stxsiwx, stxvd2x, stxvw4x
[paulus@ozlabs.org - some cleanups, fixes and rework, make it
compile for Book E, fix build when PR KVM is built in]
Signed-off-by: Bin Lu <lblulb@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Its value has never changed; we might as well make it part of the ABI instead
of using the return value of KVM_CHECK_EXTENSION(KVM_CAP_COALESCED_MMIO).
Because PPC does not always make MMIO available, the code has to be made
dependent on CONFIG_KVM_MMIO rather than KVM_COALESCED_MMIO_PAGE_OFFSET.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This adds a not yet working outline of the HPT resizing PAPR
extension. Specifically it adds the necessary ioctl() functions,
their basic steps, the work function which will handle preparation for
the resize, and synchronization between these, the guest page fault
path and guest HPT update path.
The actual guts of the implementation isn't here yet, so for now the
calls will always fail.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently the kvm_hpt_info structure stores the hashed page table's order,
and also the number of HPTEs it contains and a mask for its size. The
last two can be easily derived from the order, so remove them and just
calculate them as necessary with a couple of helper inlines.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Currently, the powerpc kvm_arch structure contains a number of variables
tracking the state of the guest's hashed page table (HPT) in KVM HV. This
patch gathers them all together into a single kvm_hpt_info substructure.
This makes life more convenient for the upcoming HPT resizing
implementation.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
With radix, the guest can do TLB invalidations itself using the tlbie
(global) and tlbiel (local) TLB invalidation instructions. Linux guests
use local TLB invalidations for translations that have only ever been
accessed on one vcpu. However, that doesn't mean that the translations
have only been accessed on one physical cpu (pcpu) since vcpus can move
around from one pcpu to another. Thus a tlbiel might leave behind stale
TLB entries on a pcpu where the vcpu previously ran, and if that task
then moves back to that previous pcpu, it could see those stale TLB
entries and thus access memory incorrectly. The usual symptom of this
is random segfaults in userspace programs in the guest.
To cope with this, we detect when a vcpu is about to start executing on
a thread in a core that is a different core from the last time it
executed. If that is the case, then we mark the core as needing a
TLB flush and then send an interrupt to any thread in the core that is
currently running a vcpu from the same guest. This will get those vcpus
out of the guest, and the first one to re-enter the guest will do the
TLB flush. The reason for interrupting the vcpus executing on the old
core is to cope with the following scenario:
CPU 0 CPU 1 CPU 4
(core 0) (core 0) (core 1)
VCPU 0 runs task X VCPU 1 runs
core 0 TLB gets
entries from task X
VCPU 0 moves to CPU 4
VCPU 0 runs task X
Unmap pages of task X
tlbiel
(still VCPU 1) task X moves to VCPU 1
task X runs
task X sees stale TLB
entries
That is, as soon as the VCPU starts executing on the new core, it
could unmap and tlbiel some page table entries, and then the task
could migrate to one of the VCPUs running on the old core and
potentially see stale TLB entries.
Since the TLB is shared between all the threads in a core, we only
use the bit of kvm->arch.need_tlb_flush corresponding to the first
thread in the core. To ensure that we don't have a window where we
can miss a flush, this moves the clearing of the bit from before the
actual flush to after it. This way, two threads might both do the
flush, but we prevent the situation where one thread can enter the
guest before the flush is finished.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds code to branch around the parts that radix guests don't
need - clearing and loading the SLB with the guest SLB contents,
saving the guest SLB contents on exit, and restoring the host SLB
contents.
Since the host is now using radix, we need to save and restore the
host value for the PID register.
On hypervisor data/instruction storage interrupts, we don't do the
guest HPT lookup on radix, but just save the guest physical address
for the fault (from the ASDR register) in the vcpu struct.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a field in struct kvm_arch and an inline helper to
indicate whether a guest is a radix guest or not, plus a new file
to contain the radix MMU code, which currently contains just a
translate function which knows how to traverse the guest page
tables to translate an address.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds the implementation of the KVM_PPC_CONFIGURE_V3_MMU ioctl
for HPT guests on POWER9. With this, we can return 1 for the
KVM_CAP_PPC_MMU_HASH_V3 capability.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
KVM_HALT_POLL_NS_DEFAULT is an arch specific constant which sets the
default value of the halt_poll_ns kvm module parameter which determines
the global maximum halt polling interval.
The current value for powerpc is 500000 (500us) which means that any
repetitive workload with a period of less than that can drive the cpu
usage to 100% where it may have been mostly idle without halt polling.
This presents the possibility of a large increase in power usage with
a comparatively small performance benefit.
Reduce the default to 10000 (10us) and a user can tune this themselves
to set their affinity for halt polling based on the trade off between power
and performance which they are willing to make.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
POWER9 adds new capabilities to the tlbie (TLB invalidate entry)
and tlbiel (local tlbie) instructions. Both instructions get a
set of new parameters (RIC, PRS and R) which appear as bits in the
instruction word. The tlbiel instruction now has a second register
operand, which contains a PID and/or LPID value if needed, and
should otherwise contain 0.
This adapts KVM-HV's usage of tlbie and tlbiel to work on POWER9
as well as older processors. Since we only handle HPT guests so
far, we need RIC=0 PRS=0 R=0, which ends up with the same instruction
word as on previous processors, so we don't need to conditionally
execute different instructions depending on the processor.
The local flush on first entry to a guest in book3s_hv_rmhandlers.S
is a loop which depends on the number of TLB sets. Rather than
using feature sections to set the number of iterations based on
which CPU we're on, we now work out this number at VM creation time
and store it in the kvm_arch struct. That will make it possible to
get the number from the device tree in future, which will help with
compatibility with future processors.
Since mmu_partition_table_set_entry() does a global flush of the
whole LPID, we don't need to do the TLB flush on first entry to the
guest on each processor. Therefore we don't set all bits in the
tlb_need_flush bitmap on VM startup on POWER9.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This adds code to handle two new guest-accessible special-purpose
registers on POWER9: TIDR (thread ID register) and PSSCR (processor
stop status and control register). They are context-switched
between host and guest, and the guest values can be read and set
via the one_reg interface.
The PSSCR contains some fields which are guest-accessible and some
which are only accessible in hypervisor mode. We only allow the
guest-accessible fields to be read or set by userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
When switching from/to a guest that has a transaction in progress,
we need to save/restore the checkpointed register state. Although
XER is part of the CPU state that gets checkpointed, the code that
does this saving and restoring doesn't save/restore XER.
This fixes it by saving and restoring the XER. To allow userspace
to read/write the checkpointed XER value, we also add a new ONE_REG
specifier.
The visible effect of this bug is that the guest may see its XER
value being corrupted when it uses transactions.
Fixes: e4e3812150 ("KVM: PPC: Book3S HV: Add transactional memory support")
Fixes: 0a8eccefcb ("KVM: PPC: Book3S HV: Add missing code for transaction reclaim on guest exit")
Cc: stable@vger.kernel.org # v3.15+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This keeps a per vcpu cache for recently page faulted MMIO entries.
On a page fault, if the entry exists in the cache, we can avoid some
time-consuming paths, for example, looking up HPT, locking HPTE twice
and searching mmio gfn from memslots, then directly call
kvmppc_hv_emulate_mmio().
In current implenment, we limit the size of cache to four. We think
it's enough to cover the high-frequency MMIO HPTEs in most case.
For example, considering the case of using virtio device, for virtio
legacy devices, one HPTE could handle notifications from up to
1024 (64K page / 64 byte Port IO register) devices, so one cache entry
is enough; for virtio modern devices, we always need one HPTE to handle
notification for each device because modern device would use a 8M MMIO
register to notify host instead of Port IO register, typically the
system's configuration should not exceed four virtio devices per
vcpu, four cache entry is also enough in this case. Of course, if needed,
we could also modify the macro to a module parameter in the future.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
POWER8 has one virtual timebase (VTB) register per subcore, not one
per CPU thread. The HV KVM code currently treats VTB as a per-thread
register, which can lead to spurious soft lockup messages from guests
which use the VTB as the time source for the soft lockup detector.
(CPUs before POWER8 did not have the VTB register.)
For HV KVM, this fixes the problem by making only the primary thread
in each virtual core save and restore the VTB value. With this,
the VTB state becomes part of the kvmppc_vcore structure. This
also means that "piggybacking" of multiple virtual cores onto one
subcore is not possible on POWER8, because then the virtual cores
would share a single VTB register.
PR KVM emulates a VTB register, which is per-vcpu because PR KVM
has no notion of CPU threads or SMT. For PR KVM we move the VTB
state into the kvmppc_vcpu_book3s struct.
Cc: stable@vger.kernel.org # v3.14+
Reported-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Add VCPU stat counters to track affinity for passthrough
interrupts.
pthru_all: Counts all passthrough interrupts whose IRQ mappings are
in the kvmppc_passthru_irq_map structure.
pthru_host: Counts all cached passthrough interrupts that were injected
from the host through kvm_set_irq (i.e. not handled in
real mode).
pthru_bad_aff: Counts how many cached passthrough interrupts have
bad affinity (receiving CPU is not running VCPU that is
the target of the virtual interrupt in the guest).
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch introduces an IRQ mapping structure, the
kvmppc_passthru_irqmap structure that is to be used
to map the real hardware IRQ in the host with the virtual
hardware IRQ (gsi) that is injected into a guest by KVM for
passthrough adapters.
Currently, we assume a separate IRQ mapping structure for
each guest. Each kvmppc_passthru_irqmap has a mapping arrays,
containing all defined real<->virtual IRQs.
[paulus@ozlabs.org - removed irq_chip field from struct
kvmppc_passthru_irqmap; changed parameter for
kvmppc_get_passthru_irqmap from struct kvm_vcpu * to struct
kvm *, removed small cached array.]
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
vcpu stats are used to collect information about a vcpu which can be viewed
in the debugfs. For example halt_attempted_poll and halt_successful_poll
are used to keep track of the number of times the vcpu attempts to and
successfully polls. These stats are currently not used on powerpc.
Implement incrementation of the halt_attempted_poll and
halt_successful_poll vcpu stats for powerpc. Since these stats are summed
over all the vcpus for all running guests it doesn't matter which vcpu
they are attributed to, thus we choose the current runner vcpu of the
vcore.
Also add new vcpu stats: halt_poll_success_ns, halt_poll_fail_ns and
halt_wait_ns to be used to accumulate the total time spend polling
successfully, polling unsuccessfully and waiting respectively, and
halt_successful_wait to accumulate the number of times the vcpu waits.
Given that halt_poll_success_ns, halt_poll_fail_ns and halt_wait_ns are
expressed in nanoseconds it is necessary to represent these as 64-bit
quantities, otherwise they would overflow after only about 4 seconds.
Given that the total time spend either polling or waiting will be known and
the number of times that each was done, it will be possible to determine
the average poll and wait times. This will give the ability to tune the kvm
module parameters based on the calculated average wait and poll times.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
vms and vcpus have statistics associated with them which can be viewed
within the debugfs. Currently it is assumed within the vcpu_stat_get() and
vm_stat_get() functions that all of these statistics are represented as
u32s, however the next patch adds some u64 vcpu statistics.
Change all vcpu statistics to u64 and modify vcpu_stat_get() accordingly.
Since vcpu statistics are per vcpu, they will only be updated by a single
vcpu at a time so this shouldn't present a problem on 32-bit machines
which can't atomically increment 64-bit numbers. However vm statistics
could potentially be updated by multiple vcpus from that vm at a time.
To avoid the overhead of atomics make all vm statistics ulong such that
they are 64-bit on 64-bit systems where they can be atomically incremented
and are 32-bit on 32-bit systems which may not be able to atomically
increment 64-bit numbers. Modify vm_stat_get() to expect ulongs.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
This patch introduces new halt polling functionality into the kvm_hv kernel
module. When a vcore is idle it will poll for some period of time before
scheduling itself out.
When all of the runnable vcpus on a vcore have ceded (and thus the vcore is
idle) we schedule ourselves out to allow something else to run. In the
event that we need to wake up very quickly (for example an interrupt
arrives), we are required to wait until we get scheduled again.
Implement halt polling so that when a vcore is idle, and before scheduling
ourselves, we poll for vcpus in the runnable_threads list which have
pending exceptions or which leave the ceded state. If we poll successfully
then we can get back into the guest very quickly without ever scheduling
ourselves, otherwise we schedule ourselves out as before.
There exists generic halt_polling code in virt/kvm_main.c, however on
powerpc the polling conditions are different to the generic case. It would
be nice if we could just implement an arch specific kvm_check_block()
function, but there is still other arch specific things which need to be
done for kvm_hv (for example manipulating vcore states) which means that a
separate implementation is the best option.
Testing of this patch with a TCP round robin test between two guests with
virtio network interfaces has found a decrease in round trip time of ~15us
on average. A performance gain is only seen when going out of and
back into the guest often and quickly, otherwise there is no net benefit
from the polling. The polling interval is adjusted such that when we are
often scheduled out for long periods of time it is reduced, and when we
often poll successfully it is increased. The rate at which the polling
interval increases or decreases, and the maximum polling interval, can
be set through module parameters.
Based on the implementation in the generic kvm module by Wanpeng Li and
Paolo Bonzini, and on direction from Paul Mackerras.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The struct kvmppc_vcore is a structure used to store various information
about a virtual core for a kvm guest. The runnable_threads element of the
struct provides a list of all of the currently runnable vcpus on the core
(those in the KVMPPC_VCPU_RUNNABLE state). The previous implementation of
this list was a linked_list. The next patch requires that the list be able
to be iterated over without holding the vcore lock.
Reimplement the runnable_threads list in the kvmppc_vcore struct as an
array. Implement function to iterate over valid entries in the array and
update access sites accordingly.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
The next commit will introduce a member to the kvmppc_vcore struct which
references MAX_SMT_THREADS which is defined in kvm_book3s_asm.h, however
this file isn't included in kvm_host.h directly. Thus compiling for
certain platforms such as pmac32_defconfig and ppc64e_defconfig with KVM
fails due to MAX_SMT_THREADS not being defined.
Move the struct kvmppc_vcore definition to kvm_book3s.h which explicitly
includes kvm_book3s_asm.h.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
It doesn't make sense to create irqfds for a VM that doesn't have
in-kernel interrupt controller emulation. There is an existing
interface for architecture code to tell the irqfd code whether or
not any interrupt controller has been initialized, called
kvm_arch_intc_initialized(), so let's implement that for powerpc.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Some wakeups should not be considered a sucessful poll. For example on
s390 I/O interrupts are usually floating, which means that _ALL_ CPUs
would be considered runnable - letting all vCPUs poll all the time for
transactional like workload, even if one vCPU would be enough.
This can result in huge CPU usage for large guests.
This patch lets architectures provide a way to qualify wakeups if they
should be considered a good/bad wakeups in regard to polls.
For s390 the implementation will fence of halt polling for anything but
known good, single vCPU events. The s390 implementation for floating
interrupts does a wakeup for one vCPU, but the interrupt will be delivered
by whatever CPU checks first for a pending interrupt. We prefer the
woken up CPU by marking the poll of this CPU as "good" poll.
This code will also mark several other wakeup reasons like IPI or
expired timers as "good". This will of course also mark some events as
not sucessful. As KVM on z runs always as a 2nd level hypervisor,
we prefer to not poll, unless we are really sure, though.
This patch successfully limits the CPU usage for cases like uperf 1byte
transactional ping pong workload or wakeup heavy workload like OLTP
while still providing a proper speedup.
This also introduced a new vcpu stat "halt_poll_no_tuning" that marks
wakeups that are considered not good for polling.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version)
Cc: David Matlack <dmatlack@google.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
[Rename config symbol. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM_MAX_VCPUS define provides the maximum number of vCPUs per guest, and
also the upper limit for vCPU ids. This is okay for all archs except PowerPC
which can have higher ids, depending on the cpu/core/thread topology. In the
worst case (single threaded guest, host with 8 threads per core), it limits
the maximum number of vCPUS to KVM_MAX_VCPUS / 8.
This patch separates the vCPU numbering from the total number of vCPUs, with
the introduction of KVM_MAX_VCPU_ID, as the maximal valid value for vCPU ids
plus one.
The corresponding KVM_CAP_MAX_VCPU_ID allows userspace to validate vCPU ids
before passing them to KVM_CREATE_VCPU.
This patch only implements KVM_MAX_VCPU_ID with a specific value for PowerPC.
Other archs continue to return KVM_MAX_VCPUS instead.
Suggested-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
but lots of architecture-specific changes.
* ARM:
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- various optimizations to the vgic save/restore code.
* PPC:
- enabled KVM-VFIO integration ("VFIO device")
- optimizations to speed up IPIs between vcpus
- in-kernel handling of IOMMU hypercalls
- support for dynamic DMA windows (DDW).
* s390:
- provide the floating point registers via sync regs;
- separated instruction vs. data accesses
- dirty log improvements for huge guests
- bugfixes and documentation improvements.
* x86:
- Hyper-V VMBus hypercall userspace exit
- alternative implementation of lowest-priority interrupts using vector
hashing (for better VT-d posted interrupt support)
- fixed guest debugging with nested virtualizations
- improved interrupt tracking in the in-kernel IOAPIC
- generic infrastructure for tracking writes to guest memory---currently
its only use is to speedup the legacy shadow paging (pre-EPT) case, but
in the future it will be used for virtual GPUs as well
- much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJW5r3BAAoJEL/70l94x66D2pMH/jTSWWwdTUJMctrDjPVzKzG0
yOzHW5vSLFoFlwEOY2VpslnXzn5TUVmCAfrdmFNmQcSw6hGb3K/xA/ZX/KLwWhyb
oZpr123ycahga+3q/ht/dFUBCCyWeIVMdsLSFwpobEBzPL0pMgc9joLgdUC6UpWX
tmN0LoCAeS7spC4TTiTTpw3gZ/L+aB0B6CXhOMjldb9q/2CsgaGyoVvKA199nk9o
Ngu7ImDt7l/x1VJX4/6E/17VHuwqAdUrrnbqerB/2oJ5ixsZsHMGzxQ3sHCmvyJx
WG5L00ubB1oAJAs9fBg58Y/MdiWX99XqFhdEfxq4foZEiQuCyxygVvq3JwZTxII=
=OUZZ
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"One of the largest releases for KVM... Hardly any generic
changes, but lots of architecture-specific updates.
ARM:
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- various optimizations to the vgic save/restore code.
PPC:
- enabled KVM-VFIO integration ("VFIO device")
- optimizations to speed up IPIs between vcpus
- in-kernel handling of IOMMU hypercalls
- support for dynamic DMA windows (DDW).
s390:
- provide the floating point registers via sync regs;
- separated instruction vs. data accesses
- dirty log improvements for huge guests
- bugfixes and documentation improvements.
x86:
- Hyper-V VMBus hypercall userspace exit
- alternative implementation of lowest-priority interrupts using
vector hashing (for better VT-d posted interrupt support)
- fixed guest debugging with nested virtualizations
- improved interrupt tracking in the in-kernel IOAPIC
- generic infrastructure for tracking writes to guest
memory - currently its only use is to speedup the legacy shadow
paging (pre-EPT) case, but in the future it will be used for
virtual GPUs as well
- much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
KVM: x86: disable MPX if host did not enable MPX XSAVE features
arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
arm64: KVM: vgic-v3: Reset LRs at boot time
arm64: KVM: vgic-v3: Do not save an LR known to be empty
arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
arm64: KVM: vgic-v3: Avoid accessing ICH registers
KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
KVM: arm/arm64: vgic-v2: Reset LRs at boot time
KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
KVM: s390: allocate only one DMA page per VM
KVM: s390: enable STFLE interpretation only if enabled for the guest
KVM: s390: wake up when the VCPU cpu timer expires
KVM: s390: step the VCPU timer while in enabled wait
KVM: s390: protect VCPU cpu timer with a seqcount
KVM: s390: step VCPU cpu timer during kvm_run ioctl
...
This enables userspace view of TCE tables to start from non-zero offset
on a bus. This will be used for huge DMA windows.
This only changes the internal structure, the user interface needs to
change in order to use an offset.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
At the moment the kvmppc_spapr_tce_table struct can only describe
4GB windows and handle fixed size (4K) pages. Dynamic DMA windows
support more so these limits need to be extended.
This replaces window_size (in bytes, 4GB max) with page_shift (32bit)
and size (64bit, in pages).
This should cause no behavioural change as this is changing
the internal structures only - the user interface still only
allows one to create a 32-bit table with 4KiB pages at this stage.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
The problem:
On -rt, an emulated LAPIC timer instances has the following path:
1) hard interrupt
2) ksoftirqd is scheduled
3) ksoftirqd wakes up vcpu thread
4) vcpu thread is scheduled
This extra context switch introduces unnecessary latency in the
LAPIC path for a KVM guest.
The solution:
Allow waking up vcpu thread from hardirq context,
thus avoiding the need for ksoftirqd to be scheduled.
Normal waitqueues make use of spinlocks, which on -RT
are sleepable locks. Therefore, waking up a waitqueue
waiter involves locking a sleeping lock, which
is not allowed from hard interrupt context.
cyclictest command line:
This patch reduces the average latency in my tests from 14us to 11us.
Daniel writes:
Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency
benchmark on mainline. The test was run 1000 times on
tip/sched/core 4.4.0-rc8-01134-g0905f04:
./x86-run x86/tscdeadline_latency.flat -cpu host
with idle=poll.
The test seems not to deliver really stable numbers though most of
them are smaller. Paolo write:
"Anything above ~10000 cycles means that the host went to C1 or
lower---the number means more or less nothing in that case.
The mean shows an improvement indeed."
Before:
min max mean std
count 1000.000000 1000.000000 1000.000000 1000.000000
mean 5162.596000 2019270.084000 5824.491541 20681.645558
std 75.431231 622607.723969 89.575700 6492.272062
min 4466.000000 23928.000000 5537.926500 585.864966
25% 5163.000000 1613252.750000 5790.132275 16683.745433
50% 5175.000000 2281919.000000 5834.654000 23151.990026
75% 5190.000000 2382865.750000 5861.412950 24148.206168
max 5228.000000 4175158.000000 6254.827300 46481.048691
After
min max mean std
count 1000.000000 1000.00000 1000.000000 1000.000000
mean 5143.511000 2076886.10300 5813.312474 21207.357565
std 77.668322 610413.09583 86.541500 6331.915127
min 4427.000000 25103.00000 5529.756600 559.187707
25% 5148.000000 1691272.75000 5784.889825 17473.518244
50% 5160.000000 2308328.50000 5832.025000 23464.837068
75% 5172.000000 2393037.75000 5853.177675 24223.969976
max 5222.000000 3922458.00000 6186.720500 42520.379830
[Patch was originaly based on the swait implementation found in the -rt
tree. Daniel ported it to mainline's version and gathered the
benchmark numbers for tscdeadline_latency test.]
Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-rt-users@vger.kernel.org
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
At the moment only spapr_tce_tables updates are protected against races
but not lookups. This fixes missing protection by using RCU for the list.
As lookups also happen in real mode, this uses
list_for_each_entry_lockless() (which is expected not to access any
vmalloc'd memory).
This converts release_spapr_tce_table() to a RCU scheduled handler.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Since the numbers now overlap, it makes sense to enumerate
them in asm/kvm_host.h rather than linux/kvm_host.h. Functions
that refer to architecture-specific requests are also moved
to arch/.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Only using 32 memslots for KVM on powerpc is way too low, you can
nowadays hit this limit quite fast by adding a couple of PCI devices
and/or pluggable memory DIMMs to the guest.
x86 already increased the KVM_USER_MEM_SLOTS to 509, to satisfy 256
pluggable DIMM slots, 3 private slots and 253 slots for other things
like PCI devices (i.e. resulting in 256 + 3 + 253 = 512 slots in
total). We should do something similar for powerpc, and since we do
not use private slots here, we can set the value to 512 directly.
While we're at it, also remove the KVM_MEM_SLOTS_NUM definition
from the powerpc-specific header since this gets defined in the
generic kvm_host.h header anyway.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
handling.
PPC: Mostly bug fixes.
ARM: No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite for
IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86: quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new component (in
virt/lib/) that connects VFIO and KVM together. The same infrastructure
will be used for ARM interrupt forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic interrupt
controller will have to wait for 4.5. These will let KVM expose Hyper-V
devices.
- nested virtualization now supports VPID (same as PCID but for vCPUs)
which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for clflushopt,
clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel + IOAPIC/PIC/PIT in
userspace, which reduces the attack surface of the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten to not
require help from the hypervisor.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJWO2IQAAoJEL/70l94x66D/K0H/3AovAgYmJQToZlimsktMk6a
f2xhdIqfU5lIQQh5uNBCfL3o9o8H9Py1ym7aEw3fmztPHHJYc91oTatt2UEKhmEw
VtZHp/dFHt3hwaIdXmjRPEXiYctraKCyrhaUYdWmUYkoKi7lW5OL5h+S7frG2U6u
p/hFKnHRZfXHr6NSgIqvYkKqtnc+C0FWY696IZMzgCksOO8jB1xrxoSN3tANW3oJ
PDV+4og0fN/Fr1capJUFEc/fejREHneANvlKrLaa8ht0qJQutoczNADUiSFLcMPG
iHljXeDsv5eyjMtUuIL8+MPzcrIt/y4rY41ZPiKggxULrXc6H+JJL/e/zThZpXc=
=iv2z
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.4.
s390:
A bunch of fixes and optimizations for interrupt and time handling.
PPC:
Mostly bug fixes.
ARM:
No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite
for IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86:
Quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new
component (in virt/lib/) that connects VFIO and KVM together.
The same infrastructure will be used for ARM interrupt
forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic
interrupt controller will have to wait for 4.5. These will let
KVM expose Hyper-V devices.
- nested virtualization now supports VPID (same as PCID but for
vCPUs) which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for
clflushopt, clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel +
IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten
to not require help from the hypervisor"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
KVM: VMX: Fix commit which broke PML
KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
KVM: x86: allow RSM from 64-bit mode
KVM: VMX: fix SMEP and SMAP without EPT
KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
KVM: device assignment: remove pointless #ifdefs
KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
KVM: x86: zero apic_arb_prio on reset
drivers/hv: share Hyper-V SynIC constants with userspace
KVM: x86: handle SMBASE as physical address in RSM
KVM: x86: add read_phys to x86_emulate_ops
KVM: x86: removing unused variable
KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
KVM: arm/arm64: Optimize away redundant LR tracking
KVM: s390: use simple switch statement as multiplexer
KVM: s390: drop useless newline in debugging data
KVM: s390: SCA must not cross page boundaries
KVM: arm: Do not indent the arguments of DECLARE_BITMAP
...
Some times it is useful for architecture implementations of KVM to know
when the VCPU thread is about to block or when it comes back from
blocking (arm/arm64 needs to know this to properly implement timers, for
example).
Therefore provide a generic architecture callback function in line with
what we do elsewhere for KVM generic-arch interactions.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
This reverts commit 9678cdaae9 ("Use the POWER8 Micro Partition
Prefetch Engine in KVM HV on POWER8") because the original commit had
multiple, partly self-cancelling bugs, that could cause occasional
memory corruption.
In fact the logmpp instruction was incorrectly using register r0 as the
source of the buffer address and operation code, and depending on what
was in r0, it would either do nothing or corrupt the 64k page pointed to
by r0.
The logmpp instruction encoding and the operation code definitions could
be corrected, but then there is the problem that there is no clearly
defined way to know when the hardware has finished writing to the
buffer.
The original commit attempted to work around this by aborting the
write-out before starting the prefetch, but this is ineffective in the
case where the virtual core is now executing on a different physical
core from the one where the write-out was initiated.
These problems plus advice from the hardware designers not to use the
function (since the measured performance improvement from using the
feature was actually mostly negative), mean that reverting the code is
the best option.
Fixes: 9678cdaae9 ("Use the POWER8 Micro Partition Prefetch Engine in KVM HV on POWER8")
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We observed some performance degradation on s390x with dynamic
halt polling. Until we can provide a proper fix, let's enable
halt_poll_ns as default only for supported architectures.
Architectures are now free to set their own halt_poll_ns
default value.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This new statistic can help diagnosing VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.
For example, say halt_poll_ns = 480000, and wakeups are spaced exactly
like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
is consuming about 30% more CPU than it would use without
polling. This would show as an abnormally high number of
attempted polling compared to the successful polls.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com<
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The size of the Problem State Priority Boost Register is only
32 bits, but the kvm_vcpu_arch->pspb variable is declared as
"ulong", ie. 64-bit. However, the assembler code accesses this
variable with 32-bit accesses, and the KVM_REG_PPC_PSPB macro
is defined with SIZE_U32, too, so that the current code is
broken on big endian hosts: kvmppc_get_one_reg_hv() will only
return zero for this register since it is using the wrong half
of the pspb variable. Let's fix this problem by adjusting the
size of the pspb field in the kvm_vcpu_arch structure.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This fixes a bug in the tracking of pages that get modified by the
guest. If the guest creates a large-page HPTE, writes to memory
somewhere within the large page, and then removes the HPTE, we only
record the modified state for the first normal page within the large
page, when in fact the guest might have modified some other normal
page within the large page.
To fix this we use some unused bits in the rmap entry to record the
order (log base 2) of the size of the page that was modified, when
removing an HPTE. Then in kvm_test_clear_dirty_npages() we use that
order to return the correct number of modified pages.
The same thing could in principle happen when removing a HPTE at the
host's request, i.e. when paging out a page, except that we never
page out large pages, and the guest can only create large-page HPTEs
if the guest RAM is backed by large pages. However, we also fix
this case for the sake of future-proofing.
The reference bit is also subject to the same loss of information. We
don't make the same fix here for the reference bit because there isn't
an interface for userspace to find out which pages the guest has
referenced, whereas there is one for userspace to find out which pages
the guest has modified. Because of this loss of information, the
kvm_age_hva_hv() and kvm_test_age_hva_hv() functions might incorrectly
say that a page has not been referenced when it has, but that doesn't
matter greatly because we never page or swap out large pages.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This builds on the ability to run more than one vcore on a physical
core by using the micro-threading (split-core) modes of the POWER8
chip. Previously, only vcores from the same VM could be run together,
and (on POWER8) only if they had just one thread per core. With the
ability to split the core on guest entry and unsplit it on guest exit,
we can run up to 8 vcpu threads from up to 4 different VMs, and we can
run multiple vcores with 2 or 4 vcpus per vcore.
Dynamic micro-threading is only available if the static configuration
of the cores is whole-core mode (unsplit), and only on POWER8.
To manage this, we introduce a new kvm_split_mode struct which is
shared across all of the subcores in the core, with a pointer in the
paca on each thread. In addition we extend the core_info struct to
have information on each subcore. When deciding whether to add a
vcore to the set already on the core, we now have two possibilities:
(a) piggyback the vcore onto an existing subcore, or (b) start a new
subcore.
Currently, when any vcpu needs to exit the guest and switch to host
virtual mode, we interrupt all the threads in all subcores and switch
the core back to whole-core mode. It may be possible in future to
allow some of the subcores to keep executing in the guest while
subcore 0 switches to the host, but that is not implemented in this
patch.
This adds a module parameter called dynamic_mt_modes which controls
which micro-threading (split-core) modes the code will consider, as a
bitmap. In other words, if it is 0, no micro-threading mode is
considered; if it is 2, only 2-way micro-threading is considered; if
it is 4, only 4-way, and if it is 6, both 2-way and 4-way
micro-threading mode will be considered. The default is 6.
With this, we now have secondary threads which are the primary thread
for their subcore and therefore need to do the MMU switch. These
threads will need to be started even if they have no vcpu to run, so
we use the vcore pointer in the PACA rather than the vcpu pointer to
trigger them.
It is now possible for thread 0 to find that an exit has been
requested before it gets to switch the subcore state to the guest. In
that case we haven't added the guest's timebase offset to the
timebase, so we need to be careful not to subtract the offset in the
guest exit path. In fact we just skip the whole path that switches
back to host context, since we haven't switched to the guest context.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When running a virtual core of a guest that is configured with fewer
threads per core than the physical cores have, the extra physical
threads are currently unused. This makes it possible to use them to
run one or more other virtual cores from the same guest when certain
conditions are met. This applies on POWER7, and on POWER8 to guests
with one thread per virtual core. (It doesn't apply to POWER8 guests
with multiple threads per vcore because they require a 1-1 virtual to
physical thread mapping in order to be able to use msgsndp and the
TIR.)
The idea is that we maintain a list of preempted vcores for each
physical cpu (i.e. each core, since the host runs single-threaded).
Then, when a vcore is about to run, it checks to see if there are
any vcores on the list for its physical cpu that could be
piggybacked onto this vcore's execution. If so, those additional
vcores are put into state VCORE_PIGGYBACK and their runnable VCPU
threads are started as well as the original vcore, which is called
the master vcore.
After the vcores have exited the guest, the extra ones are put back
onto the preempted list if any of their VCPUs are still runnable and
not idle.
This means that vcpu->arch.ptid is no longer necessarily the same as
the physical thread that the vcpu runs on. In order to make it easier
for code that wants to send an IPI to know which CPU to target, we
now store that in a new field in struct vcpu_arch, called thread_cpu.
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>