1. Remove the unnecessary 'irq' argument, because the IRQ number can be
retrieved from the LR.
2. Since cff9211eb1
("arm/arm64: KVM: Fix arch timer behavior for disabled interrupts"),
LR_STATE_PENDING is requeued by vgic_retire_lr() itself, which also
clears vlr.state on its own. Therefore, remove the now-duplicated check,
with all its accompanying bit manipulations, from vgic_unqueue_irqs().
3. vgic_retire_lr() is always accompanied by vgic_irq_clear_queued().
Since it already does more than just clearing the LR, move
vgic_irq_clear_queued() inside it.
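For reference, the resulting function looks roughly like this (a sketch
against the 4.3-era vgic internals; exact field names may differ):

	static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu)
	{
		struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);

		/* the IRQ number now comes from the LR itself */
		vgic_irq_clear_queued(vcpu, vlr.irq);

		/* transfer any pending state back to the distributor
		 * so it is not lost when the LR is recycled */
		if (vlr.state & LR_STATE_PENDING) {
			vgic_dist_irq_set_pending(vcpu, vlr.irq);
			vlr.hwirq = 0;
		}

		vlr.state = 0;
		vgic_set_lr(vcpu, lr_nr, vlr);
	}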
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently we use vgic_irq_lr_map to track which LRs hold which
IRQs, and the lr_used bitmap to track which LRs are used or free.
vgic_irq_lr_map is actually used only for the piggy-back optimization, and
can easily be replaced by iterating over lr_used. This is good because in
the future, when LPI support is introduced, the number of IRQs will grow to
at least 16384, while the numbers from 1024 to 8192 are never going to be
used, so the map would be a huge waste of memory.
In turn, lr_used is also completely redundant since
ae705930fc ("arm/arm64: KVM: Keep elrsr/aisr
in sync with software model"), because together with lr_used we also update
elrsr. This allows us to simply replace lr_used with elrsr, inverting all
conditions (because in elrsr '1' means 'free').
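A sketch of the resulting LR lookup, assuming the 4.3-era helpers
(vgic_get_elrsr(), u64_to_bitmask(); the exact names may differ):

	u64 elrsr = vgic_get_elrsr(vcpu);
	unsigned long *elrsr_ptr = u64_to_bitmask(&elrsr);
	int lr;

	/* piggy-back: a '0' in elrsr means the LR is in use */
	for_each_clear_bit(lr, elrsr_ptr, vgic->nr_lr) {
		struct vgic_lr vlr = vgic_get_lr(vcpu, lr);

		if (vlr.irq == irq && vlr.source == sgi_source_id)
			return lr;	/* reuse this LR */
	}

	/* otherwise grab any free LR ('1' in elrsr means free) */
	lr = find_first_bit(elrsr_ptr, vgic->nr_lr);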
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Besides being a coding style issue, it confuses make tags:
ctags: Warning: include/kvm/arm_vgic.h:307: null expansion of name pattern "\1"
ctags: Warning: include/kvm/arm_vgic.h:308: null expansion of name pattern "\1"
ctags: Warning: include/kvm/arm_vgic.h:309: null expansion of name pattern "\1"
ctags: Warning: include/kvm/arm_vgic.h:317: null expansion of name pattern "\1"
Cc: kvmarm@lists.cs.columbia.edu
Signed-off-by: Michal Marek <mmarek@suse.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
If we panic in hyp mode, we inject a call to panic() into the EL1N host
kernel. If a guest context is active, we first attempt to restore the
minimal amount of state necessary to execute the host kernel with
restore_sysregs.
However, the SP is restored as part of restore_common_regs, and so we
may return to the host's panic() function with the SP of the guest. Any
calculations based on the SP will be bogus, and any attempt to access
the stack will result in recursive data aborts.
When running Linux as a guest, the guest's EL1N SP is likely to be some
valid kernel address. In this case, the host kernel may use that region
as a stack for panic(), corrupting it in the process.
Avoid the problem by restoring the host SP prior to returning to the
host. To prevent misleading backtraces in the host, the FP is zeroed at
the same time. We don't need any of the other "common" registers in
order to panic successfully.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Cc: <kvmarm@lists.cs.columbia.edu>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The VGIC and timer code for KVM arm/arm64 doesn't have any tracepoints
or tracepoint infrastructure defined. Rewriting some of the timer handling
code showed me how much we need this, so let's add these simple
trace points once and for all and we can easily expand with additional
trace points in these files as we go along.
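As an example of the kind of tracepoint this adds, a timer-interrupt
event might look like this (a sketch; the exact event names and fields
are illustrative):

	TRACE_EVENT(kvm_timer_update_irq,
		TP_PROTO(unsigned long vcpu_id, __u32 irq, int level),
		TP_ARGS(vcpu_id, irq, level),

		TP_STRUCT__entry(
			__field(unsigned long, vcpu_id)
			__field(__u32, irq)
			__field(int, level)
		),

		TP_fast_assign(
			__entry->vcpu_id = vcpu_id;
			__entry->irq = irq;
			__entry->level = level;
		),

		TP_printk("VCPU: %ld, IRQ %d, level %d",
			  __entry->vcpu_id, __entry->irq, __entry->level)
	);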
Cc: Wei Huang <wei@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The ARM architecture only saves the exit class to the HSR (ESR_EL2 for
arm64) on synchronous exceptions, not on asynchronous exceptions like an
IRQ. However, we only report the exception class on kvm_exit, which is
confusing because an IRQ looks like it exited at some PC with the same
reason as the previous exit. Add a lookup table for the exception index
and prepend the kvm_exit tracepoint text with the exception type to
clarify this situation.
Also resolve the exception class (EC) to a human-friendly text version
so the trace output becomes immediately usable for debugging this code.
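The lookup table can be a simple __print_symbolic() mapping used from
the kvm_exit TRACE_EVENT; a sketch (the exception index values shown
are illustrative):

	#define kvm_arm_exception_type	\
		{ 0, "IRQ" },		\
		{ 1, "TRAP" }

	/* inside the kvm_exit TRACE_EVENT: */
	TP_printk("%s: HSR_EC: 0x%04x, PC: 0x%08lx",
		  __print_symbolic(__entry->idx, kvm_arm_exception_type),
		  __entry->exit_reason, __entry->vcpu_pc)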
Cc: Wei Huang <wei@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Correct some old mistakes in the API documentation:
1. A VCPU is identified by its index (using the kvm_get_vcpu() function),
but "cpu id" can be mistaken for an affinity ID.
2. Some error codes are wrong.
[ Slightly tweaked some grammar and did some s/CPU index/vcpu_index/
in the descriptions. -Christoffer ]
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We introduce the kvm_arm_halt_guest and kvm_arm_resume_guest functions.
They will be used for IRQ forward state changes.
Halt is synchronous and prevents the guest from being re-entered.
We use the same mechanism previously put in place for the PSCI pause,
now renamed power_off. A new flag, pause, is introduced in the arch vcpu
state, meant to be used only by these functions.
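A sketch of the two helpers, assuming the existing force_vm_exit() and
per-vcpu waitqueue infrastructure of the 4.3-era sources:

	void kvm_arm_halt_guest(struct kvm *kvm)
	{
		int i;
		struct kvm_vcpu *vcpu;

		kvm_for_each_vcpu(i, vcpu, kvm)
			vcpu->arch.pause = true;
		force_vm_exit(cpu_all_mask);	/* kick vcpus out of the guest */
	}

	void kvm_arm_resume_guest(struct kvm *kvm)
	{
		int i;
		struct kvm_vcpu *vcpu;

		kvm_for_each_vcpu(i, vcpu, kvm) {
			vcpu->arch.pause = false;
			wake_up_interruptible(kvm_arch_vcpu_wq(vcpu));
		}
	}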
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
In case a vcpu-off PSCI call is made just after we executed the
vcpu_sleep check, we can enter the guest although power_off
is set. Let's check the power_off state in the critical section,
just before entering the guest.
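A sketch of the resulting check in kvm_arch_vcpu_ioctl_run(), with
interrupts already disabled (against the 4.3-era sources; the exact
condition may differ):

	local_irq_disable();
	/* ... */
	if (ret <= 0 || need_new_vmid_gen(vcpu->kvm) ||
	    vcpu->arch.power_off || vcpu->arch.pause) {
		/* abort the entry: a PSCI CPU_OFF raced with us */
		local_irq_enable();
		preempt_enable();
		continue;
	}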
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reported-by: Christoffer Dall <christoffer.dall@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
kvm_arch_vcpu_runnable now also checks whether the power_off
flag is set.
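The resulting check might look like this (a sketch; the pause flag
joins the condition elsewhere in this series):

	int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
	{
		return ((!!v->arch.irq_lines || kvm_vgic_vcpu_pending_irq(v))
			&& !v->arch.power_off);
	}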
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The kvm_vcpu_arch pause field is renamed to power_off to prepare
for the introduction of a new pause field. Also, vcpu_pause is renamed
to vcpu_sleep since we will sleep until both power_off and pause are
false.
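For reference, the renamed helper ends up looking roughly like this
once both flags exist (a sketch; kvm_arch_vcpu_wq() per the 4.3-era
sources):

	static void vcpu_sleep(struct kvm_vcpu *vcpu)
	{
		wait_queue_head_t *wq = kvm_arch_vcpu_wq(vcpu);

		/* sleep until neither power_off nor pause is set */
		wait_event_interruptible(*wq, ((!vcpu->arch.power_off) &&
					       (!vcpu->arch.pause)));
	}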
Signed-off-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
vhost drivers provide guest VMs with better I/O performance and lower
CPU utilization. This patch allows users to select vhost devices under
the KVM configuration menu on ARM. This makes vhost support on arm/arm64
on a par with other architectures (e.g. x86, ppc).
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We mark edge-triggered interrupts with the HW bit set as queued to
prevent the VGIC code from injecting LRs with both the Active and
Pending bits set at the same time while also setting the HW bit,
because the hardware does not support this.
However, this means that we must also clear the queued flag when we sync
back an LR where the state on the physical distributor went from active
to inactive because the guest deactivated the interrupt. At this point
we must also check if the interrupt is pending on the distributor, and
tell the VGIC to queue it again if it is.
Since these actions on the sync path are extremely close to those for
level-triggered interrupts, rename process_level_irq to
process_queued_irq, allowing it to cater for both cases.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The arch timer currently uses edge-triggered semantics in the sense that
the line is never sampled by the vgic and lowering the line from the
timer to the vgic doesn't have any effect on the pending state of
virtual interrupts in the vgic. This means that we do not support a
guest with the otherwise valid behavior of (1) disable interrupts (2)
enable the timer (3) disable the timer (4) enable interrupts. Such a
guest would validly not expect to see any interrupts on real hardware,
but will see interrupts on KVM.
This patch fixes this shortcoming through the following series of
changes.
First, we change the flow of the timer/vgic sync/flush operations. Now
the timer is always flushed/synced before the vgic, because the vgic
samples the state of the timer output. This has the implication that we
move the timer operations into non-preemptible sections, but that is
fine after the previous commit getting rid of hrtimer schedules on every
entry/exit.
Second, we change the internal behavior of the timer, letting the timer
keep track of its previous output state, and only lower/raise the line
to the vgic when the state changes. Note that in theory this could have
been accomplished more simply by signalling the vgic every time the
state *potentially* changed, but we don't want to be hitting the vgic
more often than necessary.
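A sketch of the state-change filter, assuming helper names along these
lines:

	static void kvm_timer_update_state(struct kvm_vcpu *vcpu)
	{
		struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;

		/* only bother the vgic when the line changes state */
		if (kvm_timer_should_fire(vcpu) != timer->irq.level)
			kvm_timer_update_irq(vcpu, !timer->irq.level);
	}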
Third, we get rid of the use of the map->active field in the vgic and
instead simply set the interrupt as active on the physical distributor
whenever the input to the GIC is asserted and conversely clear the
physical active state when the input to the GIC is deasserted.
Fourth, and finally, we now initialize the timer PPIs (and all the other
unused PPIs for now) to be level-triggered, and modify the sync code to
sample the line state on HW sync and re-inject a new interrupt if it is
still pending at that time.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Forwarding physical interrupts on arm/arm64 is a tricky concept, and the
way we deal with them is not easy to understand simply by reading the
various specs.
Therefore, add a proper documentation file explaining the flow and
rationale of the behavior of the vgic.
Some of this text was contributed by Marc Zyngier and edited by me.
Omissions and errors are all mine.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We currently initialize the SGIs to be enabled in the VGIC code, but we
use the VGIC_NR_PPIS define for this purpose, instead of the more
natural VGIC_NR_SGIS. Change this slightly confusing use of the
defines.
Note: This should have no functional change, as both names are defined
to the number 16.
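The affected loop, sketched against the 4.3-era vgic init code (only
the define changes):

	for (i = 0; i < dist->nr_irqs; i++) {
		if (i < VGIC_NR_SGIS)	/* was: VGIC_NR_PPIS */
			vgic_bitmap_set_irq_val(&dist->irq_enabled,
						vcpu->vcpu_id, i, 1);
	}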
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
In the GICD_ICFGR, the bits for the SGIs and PPIs are read-only.
We currently simulate this behavior by writing a hardcoded value to the
register for the SGIs and PPIs on every write of these bits to the
register (ignoring what the guest actually wrote), and by writing the
same value as the reset value to the register.
This is a bit counter-intuitive, as the register is RO for these bits,
and we can just implement it that way, allowing us to control the value
of the bits purely in the reset code.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Currently vgic_process_maintenance() directly handles the work of dealing
with a completed level-triggered interrupt, but we are soon going to reuse
logic for level-triggered mapped interrupts with the HW bit set, so
move this logic into a separate static function.
Probably the scariest part of this commit is convincing yourself that
the new flow is safe compared to the old one. In the following I
list the changes and why they are harmless (a sketch of the resulting
ordering follows the list):
Move vgic_irq_clear_queued after kvm_notify_acked_irq:
Harmless because the only potential effect of clearing the queued
flag wrt. kvm_set_irq is that vgic_update_irq_pending does not set
the pending bit on the emulated CPU interface or in the
pending_on_cpu bitmask if the function is called with level=1.
However, the point of kvm_notify_acked_irq is to call kvm_set_irq
with level=0, and we set the queued flag again in
__kvm_vgic_sync_hwstate later on if the level is still high.
Move vgic_set_lr before kvm_notify_acked_irq:
Also harmless, because the LRs are cpu-local and
kvm_notify_acked_irq only affects the distributor state.
Move vgic_dist_irq_clear_soft_pend after kvm_notify_acked_irq:
Also harmless, because now we check the level state in the
clear_soft_pend function and lower the pending bits if the level is
low.
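Putting the three moves together, the new helper ends up roughly like
this (a sketch using the vgic helper names of this series; details may
differ):

	static int process_level_irq(struct kvm_vcpu *vcpu, int lr,
				     struct vgic_lr vlr)
	{
		int pending = 0;

		/* 1. retire the LR; LRs are cpu-local, so this is safe */
		vlr.state = 0;
		vlr.hwirq = 0;
		vgic_set_lr(vcpu, lr, vlr);

		/* 2. tell irqfd/userspace the IRQ was acked (level = 0) */
		kvm_notify_acked_irq(vcpu->kvm, 0,
				     vlr.irq - VGIC_NR_PRIVATE_IRQS);

		/* 3. drop soft-pending and queued state; if the line is
		 *    still high, flag it so the caller re-queues it */
		vgic_dist_irq_clear_soft_pend(vcpu, vlr.irq);
		vgic_irq_clear_queued(vcpu, vlr.irq);
		if (vgic_dist_irq_get_level(vcpu, vlr.irq)) {
			vgic_dist_irq_set_pending(vcpu, vlr.irq);
			pending = 1;
		}

		return pending;
	}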
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We currently schedule a soft timer every time we exit the guest if the
timer did not expire while running the guest. This is really not
necessary, because the only work we do in the timer work function is to
kick the vcpu.
Kicking the vcpu does two things:
(1) If the vcpu thread is on a waitqueue, make it runnable and remove it
from the waitqueue.
(2) If the vcpu is running on a different physical CPU from the one
doing the kick, it sends a reschedule IPI.
The second case cannot happen, because the soft timer is only ever
scheduled when the vcpu is not running. The first case is only relevant
when the vcpu thread is on a waitqueue, which is only the case when the
vcpu thread has called kvm_vcpu_block().
Therefore, we only need to make sure a timer is scheduled for
kvm_vcpu_block(), which we do by encapsulating all calls to
kvm_vcpu_block() with kvm_timer_{un}schedule calls.
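On arm this boils down to the following (a sketch, using the generic
blocking callbacks introduced in the preceding patch):

	void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
	{
		kvm_timer_schedule(vcpu);	/* arm a soft timer, if needed */
	}

	void kvm_arch_vcpu_unblocking(struct kvm_vcpu *vcpu)
	{
		kvm_timer_unschedule(vcpu);	/* cancel it again */
	}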
Additionally, we only schedule a soft timer if the timer is enabled and
unmasked, since it is useless otherwise.
Note that theoretically userspace can use the SET_ONE_REG interface to
change registers that should cause the timer to fire, even if the vcpu
is blocked without a scheduled timer, but this case was not supported
before this patch and we leave it for future work for now.
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Sometimes it is useful for architecture implementations of KVM to know
when the VCPU thread is about to block or when it comes back from
blocking (arm/arm64 needs to know this to properly implement timers, for
example).
Therefore provide a generic architecture callback function in line with
what we do elsewhere for KVM generic-arch interactions.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We currently do a single update of the vgic state when the distributor
enable/disable control register is accessed and then bypass updating the
state for as long as the distributor remains disabled.
This is incorrect, because updating the state does not consider the
distributor enable bit, and thus you can end up in a situation where an
interrupt is marked as pending on the CPU interface, but not pending on
the distributor, which is an impossible state to be in, and triggers a
warning. Consider for example the following sequence of events:
1. An interrupt is marked as pending on the distributor
- the interrupt is also forwarded to the CPU interface
2. The guest turns off the distributor (it's about to do a reboot)
- we stop updating the CPU interface state from now on
3. The guest disables the pending interrupt
- we remove the pending state from the distributor, but don't touch
the CPU interface, see point 2.
Since the distributor disable bit really means that no interrupts should
be forwarded to the CPU interface, we modify the code to keep updating
the internal VGIC state, but always set the CPU interface pending bits
to zero when the distributor is disabled.
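A sketch of the resulting update logic (helper names per the 4.3-era
vgic code; the exact shape may differ):

	kvm_for_each_vcpu(c, vcpu, kvm) {
		/* keep computing the internal state, but never mark
		 * anything pending on the CPU interface while the
		 * distributor is disabled */
		if (dist->enabled && compute_pending_for_cpu(vcpu))
			set_bit(c, dist->irq_pending_on_cpu);
		else
			clear_bit(c, dist->irq_pending_on_cpu);
	}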
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
When a guest reboots or offlines/onlines CPUs, it is not uncommon for it
to clear the pending and active states of an interrupt through the
emulated VGIC distributor. However, the architected timers are
defined by the architecture to be level-triggered and the guest
rightfully expects them to behave that way; since we emulate them as
edge-triggered, we have to mimic level-triggered behavior for an
edge-triggered virtual implementation.
We currently do not signal the VGIC when the map->active field is true,
because it indicates that the guest has already been signalled of the
interrupt as required. Normally this field is set to false when the
guest deactivates the virtual interrupt through the sync path.
We also need to catch the case where the guest deactivates the interrupt
through the emulated distributor, again allowing guests to boot even if
the original virtual timer signal hit before the guest's GIC
initialization sequence is run.
Reviewed-by: Eric Auger <eric.auger@linaro.org>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We have an interesting issue when the guest disables the timer interrupt
on the VGIC, which happens when turning VCPUs off using PSCI, for
example.
The problem is that because the guest disables the virtual interrupt at
the VGIC level, we never inject interrupts to the guest and therefore
never mark the interrupt as active on the physical distributor. The
host also never takes the timer interrupt (we only use the timer device
to trigger a guest exit and everything else is done in software), so the
interrupt does not become active through normal means.
The result is that we keep entering the guest with a programmed timer
that will always fire as soon as we context switch the hardware timer
state and run the guest, preventing forward progress for the VCPU.
Since the active state on the physical distributor is really part of the
timer logic, it is the job of our virtual arch timer driver to manage
this state.
The timer->map->active boolean field indicates whether we have signalled
this interrupt to the vgic and if that interrupt is still pending or
active. As long as that is the case, the hardware doesn't have to
generate physical interrupts and therefore we mark the interrupt as
active on the physical distributor.
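A sketch of the flush-path management, assuming the irq_phys_map layout
of this kernel and the genirq irq_set_irqchip_state() API:

	/* on entry to the guest: while the vgic still owns the
	 * interrupt (pending or active), keep it active on the
	 * physical distributor so the HW timer cannot fire again */
	WARN_ON(irq_set_irqchip_state(timer->map->irq,
				      IRQCHIP_STATE_ACTIVE,
				      timer->map->active));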
We also have to restore the pending state of an interrupt that was
queued to an LR but then retired from that LR for some reason while
it was still pending.
Cc: Marc Zyngier <marc.zyngier@arm.com>
Reported-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
The vgic code on ARM is built for all configurations that enable KVM,
but the parent_data field that it references is only present when
CONFIG_IRQ_DOMAIN_HIERARCHY is set:
virt/kvm/arm/vgic.c: In function 'kvm_vgic_map_phys_irq':
virt/kvm/arm/vgic.c:1781:13: error: 'struct irq_data' has no member named 'parent_data'
This flag is implied by the GIC driver, and indeed the VGIC code only
makes sense if a GIC is present. This changes the CONFIG_KVM symbol
to always select GIC, which avoids the issue.
Fixes: 662d971584 ("arm/arm64: KVM: Kill CONFIG_KVM_ARM_{VGIC,TIMER}")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Jump to the correct label and free kvm_host_cpu_state.
Reviewed-by: Wei Huang <wei@redhat.com>
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
When lowering a level-triggered line from userspace, we forgot to lower
the pending bit on the emulated CPU interface and we also did not
re-compute the pending_on_cpu bitmap for the CPU affected by the change.
Update vgic_update_irq_pending() to fix the two issues above and also
raise a warning in vgic_queue_irq_to_lr if we encounter an interrupt
pending on a CPU which is marked neither active nor pending.
[ Commit text reworked completely - Christoffer ]
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
We observed some performance degradation on s390x with dynamic
halt polling. Until we can provide a proper fix, let's enable
halt_poll_ns as default only for supported architectures.
Architectures are now free to set their own halt_poll_ns
default value.
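The mechanism is a per-arch define consumed by the generic module
parameter; a sketch (the x86 value shown is illustrative):

	/* arch/x86/include/asm/kvm_host.h */
	#define KVM_HALT_POLL_NS_DEFAULT 500000

	/* virt/kvm/kvm_main.c: architectures without an opt-in get 0,
	 * i.e. polling disabled by default */
	static unsigned int halt_poll_ns = KVM_HALT_POLL_NS_DEFAULT;
	module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);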
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel CPUID on AMD host or vice versa is a weird case, but it can
happen. Handle it by checking the host CPU vendor instead of the
guest's in reset_tdp_shadow_zero_bits_mask. For speed, the
check uses the fact that Intel EPT has an X (executable) bit while
AMD NPT has NX.
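A sketch of the vendor check (per the description above, the executable
bit doubles as the EPT/NPT discriminator):

	static inline bool boot_cpu_is_amd(void)
	{
		/* NPT has no executable bit (only NX), so a zero
		 * shadow_x_mask means we are running on an AMD host */
		return shadow_x_mask == 0;
	}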
Reported-by: Borislav Petkov <bp@alien8.de>
Tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_set_cr0 may want to call kvm_zap_gfn_range and thus access the
memslots array (SRCU protected). Using a mini SRCU critical section
is ugly, and adding it to kvm_arch_vcpu_create doesn't work because
the VMX vcpu_create callback calls synchronize_srcu.
Fixes this lockdep splat:
===============================
[ INFO: suspicious RCU usage. ]
4.3.0-rc1+ #1 Not tainted
-------------------------------
include/linux/kvm_host.h:488 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-i38/17000:
#0: (&(&kvm->mmu_lock)->rlock){+.+...}, at: kvm_zap_gfn_range+0x24/0x1a0 [kvm]
[...]
Call Trace:
dump_stack+0x4e/0x84
lockdep_rcu_suspicious+0xfd/0x130
kvm_zap_gfn_range+0x188/0x1a0 [kvm]
kvm_set_cr0+0xde/0x1e0 [kvm]
init_vmcb+0x760/0xad0 [kvm_amd]
svm_create_vcpu+0x197/0x250 [kvm_amd]
kvm_arch_vcpu_create+0x47/0x70 [kvm]
kvm_vm_ioctl+0x302/0x7e0 [kvm]
? __lock_is_held+0x51/0x70
? __fget+0x101/0x210
do_vfs_ioctl+0x2f4/0x560
? __fget_light+0x29/0x90
SyS_ioctl+0x4c/0x90
entry_SYSCALL_64_fastpath+0x16/0x73
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These have roughly the same purpose as the SMRR, which we do not need
to implement in KVM. However, Linux accesses MSR_K8_TSEG_ADDR at
boot, which causes problems when running a Xen dom0 under KVM.
Just return 0, meaning that processor protection of SMRAM is not
in effect.
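The handling amounts to a couple of new cases in kvm_get_msr_common();
a sketch:

	case MSR_K8_TSEG_ADDR:
	case MSR_K8_TSEG_MASK:
		/* SMRAM protection not in effect */
		msr_info->data = 0;
		break;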
Reported-by: M A Young <m.a.young@durham.ac.uk>
Cc: stable@vger.kernel.org
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Access to the kvm->buses (like with the kvm_io_bus_read() and -write()
functions) has to be protected via the kvm->srcu lock.
The kvmppc_h_logical_ci_load() and -store() functions are missing
this lock so far, so let's add it there, too.
This fixes the problem that the kernel reports "suspicious RCU usage"
when lock debugging is enabled.
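The fix amounts to wrapping the bus access in an SRCU read-side
critical section; a sketch:

	idx = srcu_read_lock(&vcpu->kvm->srcu);
	ret = kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, size, &buf);
	srcu_read_unlock(&vcpu->kvm->srcu, idx);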
Cc: stable@vger.kernel.org # v4.1+
Fixes: 99342cf804
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
In guest_exit_cont we call kvmhv_commence_exit which expects the trap
number as the argument. However r3 doesn't contain the trap number at
this point and as a result we would be calling the function with a
spurious trap number.
Fix this by copying r12 into r3 before calling kvmhv_commence_exit as
r12 contains the trap number.
Cc: stable@vger.kernel.org # v4.1+
Fixes: eddb60fb14
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
This fixes a bug which results in stale vcore pointers being left in
the per-cpu preempted vcore lists when a VM is destroyed. The result
of the stale vcore pointers is usually either a crash or a lockup
inside collect_piggybacks() when another VM is run. A typical
lockup message looks like:
[ 472.161074] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 22s! [qemu-system-ppc:7039]
[ 472.161204] Modules linked in: kvm_hv kvm_pr kvm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ses enclosure shpchp rtc_opal i2c_opal powernv_rng binfmt_misc dm_service_time scsi_dh_alua radeon i2c_algo_bit drm_kms_helper ttm drm tg3 ptp pps_core cxgb3 ipr i2c_core mdio dm_multipath [last unloaded: kvm_hv]
[ 472.162111] CPU: 24 PID: 7039 Comm: qemu-system-ppc Not tainted 4.2.0-kvm+ #49
[ 472.162187] task: c000001e38512750 ti: c000001e41bfc000 task.ti: c000001e41bfc000
[ 472.162262] NIP: c00000000096b094 LR: c00000000096b08c CTR: c000000000111130
[ 472.162337] REGS: c000001e41bff520 TRAP: 0901 Not tainted (4.2.0-kvm+)
[ 472.162399] MSR: 9000000100009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24848844 XER: 00000000
[ 472.162588] CFAR: c00000000096b0ac SOFTE: 1
GPR00: c000000000111170 c000001e41bff7a0 c00000000127df00 0000000000000001
GPR04: 0000000000000003 0000000000000001 0000000000000000 0000000000874821
GPR08: c000001e41bff8e0 0000000000000001 0000000000000000 d00000000efde740
GPR12: c000000000111130 c00000000fdae400
[ 472.163053] NIP [c00000000096b094] _raw_spin_lock_irqsave+0xa4/0x130
[ 472.163117] LR [c00000000096b08c] _raw_spin_lock_irqsave+0x9c/0x130
[ 472.163179] Call Trace:
[ 472.163206] [c000001e41bff7a0] [c000001e41bff7f0] 0xc000001e41bff7f0 (unreliable)
[ 472.163295] [c000001e41bff7e0] [c000000000111170] __wake_up+0x40/0x90
[ 472.163375] [c000001e41bff830] [d00000000efd6fc0] kvmppc_run_core+0x1240/0x1950 [kvm_hv]
[ 472.163465] [c000001e41bffa30] [d00000000efd8510] kvmppc_vcpu_run_hv+0x5a0/0xd90 [kvm_hv]
[ 472.163559] [c000001e41bffb70] [d00000000e9318a4] kvmppc_vcpu_run+0x44/0x60 [kvm]
[ 472.163653] [c000001e41bffba0] [d00000000e92e674] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm]
[ 472.163745] [c000001e41bffbe0] [d00000000e9263a8] kvm_vcpu_ioctl+0x538/0x7b0 [kvm]
[ 472.163834] [c000001e41bffd40] [c0000000002d0f50] do_vfs_ioctl+0x480/0x7c0
[ 472.163910] [c000001e41bffde0] [c0000000002d1364] SyS_ioctl+0xd4/0xf0
[ 472.163986] [c000001e41bffe30] [c000000000009260] system_call+0x38/0xd0
[ 472.164060] Instruction dump:
[ 472.164098] ebc1fff0 ebe1fff8 7c0803a6 4e800020 60000000 60000000 60420000 8bad02e2
[ 472.164224] 7fc3f378 4b6a57c1 60000000 7c210b78 <e92d0000> 89290009 792affe3 40820070
The bug is that kvmppc_run_vcpu does not correctly handle the case
where a vcpu task receives a signal while its guest vcpu is executing
in the guest as a result of being piggy-backed onto the execution of
another vcore. In that case we need to wait for the vcpu to finish
executing inside the guest, and then remove this vcore from the
preempted vcores list. That way, we avoid leaving this vcpu's vcore
on the preempted vcores list when the vcpu gets interrupted.
Fixes: ec25716508
Reported-by: Thomas Huth <thuth@redhat.com>
Tested-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
When an INIT/SIPI sequence is sent to a VCPU which was previously
in use by the OS, VMRUN might fail with:
KVM: entry failed, hardware error 0xffffffff
EAX=00000000 EBX=00000000 ECX=00000000 EDX=000006d3
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=00000000 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =9a00 0009a000 0000ffff 00009a00
[...]
CR0=60000010 CR2=b6f3e000 CR3=01942000 CR4=000007e0
[...]
EFER=0000000000000000
with corresponding SVM error:
KVM: FAILED VMRUN WITH VMCB:
[...]
cpl: 0 efer: 0000000000001000
cr0: 0000000080010010 cr2: 00007fd7fe85bf90
cr3: 0000000187d0c000 cr4: 0000000000000020
[...]
What happens is that the VCPU state right after offlining is:
CR0: 0x80050033 EFER: 0xd01 CR4: 0x7e0
-> long mode with CR3 pointing to longmode page tables
and when the VCPU gets INIT/SIPI, the following transition happens:
CR0: 0 -> 0x60000010 EFER: 0x0 CR4: 0x7e0
-> paging disabled with stale CR3
However, SVM under the hood puts the VCPU in Paged Real Mode*,
which effectively translates CR0 0x60000010 -> 0x80010010 after
svm_vcpu_reset()
-> init_vmcb()
-> kvm_set_cr0()
-> svm_set_cr0()
but from kvm_set_cr0()'s perspective the CR0 0 -> 0x60000010 transition
changes only caching bits, and
commit d81135a57a
("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
regressed svm_vcpu_reset(), which relied on the MMU being reset.
As a result, VMRUN after svm_vcpu_reset() tries to run the
VCPU in Paged Real Mode with a stale MMU context (long-mode page tables),
which causes some AMD CPUs** to bail out with VMEXIT_INVALID.
Fix the issue by unconditionally resetting the MMU context
at init_vmcb() time.
* AMD64 Architecture Programmer’s Manual,
Volume 2: System Programming, rev: 3.25
15.19 Paged Real Mode
** Opteron 1216
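The fix boils down to this (a sketch; the elided body is the existing
VMCB setup):

	static void init_vmcb(struct vcpu_svm *svm)
	{
		/* ... existing VMCB and control register setup ... */

		/* unconditionally drop any stale MMU context, e.g.
		 * long-mode page tables left over from before INIT */
		kvm_mmu_reset_context(&svm->vcpu);
	}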
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Fixes: d81135a57a
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 2ee507c472 ("sched: Add function single_task_running to let a task
check if it is the only task running on a cpu") referenced the current
runqueue with the smp_processor_id. When CONFIG_DEBUG_PREEMPT is enabled,
that is only allowed if preemption is disabled or the current task is
bound to the local cpu (e.g. kernel worker).
With commit f781951299 ("kvm: add halt_poll_ns module parameter") KVM
calls single_task_running. If CONFIG_DEBUG_PREEMPT is enabled that
generates a lot of kernel messages.
To avoid disabling preemption in those cases, as it would limit the
usefulness, we change single_task_running to access the cpu-local runqueue
directly.
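The resulting helper, sketched:

	bool single_task_running(void)
	{
		/* raw access: we only need a heuristic answer for the
		 * local cpu, so a stale value is acceptable */
		return raw_rq()->nr_running == 1;
	}
	EXPORT_SYMBOL(single_task_running);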
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Fixes: 2ee507c472
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'kvm-arm-for-4.3-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master
Second set of KVM/ARM changes for 4.3-rc2
- Workaround for a Cortex-A57 erratum
- Bug fix for the debugging infrastructure
- Fix for 32bit guests with more than 4GB of address space
on a 32bit host
- A number of fixes for the (unusual) case when we don't use
the in-kernel GIC emulation
- Removal of ThumbEE handling on arm64, since these have been
dropped from the architecture before anyone actually ever
built a CPU
- Remove the KVM_ARM_MAX_VCPUS limitation which has become
fairly pointless
This patch removes the KVM_ARM_MAX_VCPUS config option
and, like other architectures, just chooses the maximum
value allowed by the hardware, for the following reasons:
1) from a distribution's point of view, the option has to be
defined as the maximum allowed value because it needs to
support all kinds of virtualization applications and
most SoCs;
2) using a bigger value doesn't introduce extra memory
consumption, and the help text in Kconfig isn't accurate
because the kvm_vcpu structure isn't allocated until a request
to create a VCPU is sent from QEMU;
3) the main effect is that the vcpus[] field in 'struct kvm'
becomes a bit bigger (sizeof(void *) per vcpu) and needs more cache
lines to hold the structure, but 'struct kvm' is one generic struct,
and it has already worked well this way on other architectures. Also,
the world switch frequency is often low; for example, it is ~2000
when running a kernel build load in a VM on an APM X-Gene KVM host,
so the effect is very small, and no difference could be observed
in my tests at all.
Cc: Dann Frazier <dann.frazier@canonical.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Although the ThumbEE registers and traps were present in earlier
versions of the v8 architecture, it was retrospectively removed and so
we can do the same.
Whilst this breaks migrating a guest started on a previous version of
the kernel, it is much better to kill these (non-existent) registers
as soon as possible.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
[maz: added comment about migration]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When running a guest with the architected timer disabled (with QEMU and
the kernel_irqchip=off option, for example), it is important to make
sure the timer gets turned off. Otherwise, the guest may try to
enable it anyway, leading to a screaming HW interrupt.
The fix is to unconditionally turn off the virtual timer on guest
exit.
Cc: stable@vger.kernel.org
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Until b26e5fdac4 ("arm/arm64: KVM: introduce per-VM ops"),
kvm_vgic_map_resources() used to include a check on irqchip_in_kernel(),
and vgic_v2_map_resources() still has it.
But now vm_ops are not initialized until we call kvm_vgic_create().
Therefore kvm_vgic_map_resources() can now be called without a VGIC,
and we die because vm_ops.map_resources is NULL.
Fixing this restores QEMU's kernel-irqchip=off option to a working state,
allowing GIC emulation to be used in userspace.
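A sketch of the guarded call site (assuming the check ends up at the
first-run initialization; names per the 4.3-era sources):

	if (unlikely(irqchip_in_kernel(kvm) && !vgic_ready(kvm))) {
		ret = kvm_vgic_map_resources(kvm);
		if (ret)
			return ret;
	}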
Fixes: b26e5fdac4 ("arm/arm64: KVM: introduce per-VM ops")
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Fedin <p.fedin@samsung.com>
[maz: reworked commit message]
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
The offending commit accidentally replaces an atomic_clear with an
atomic_or instead of an atomic_andnot in kvm_s390_vcpu_request_handled.
The symptom is that kvm guests on s390 hang on startup.
This patch simply replaces the incorrect atomic_or with atomic_andnot.
Fixes: 805de8f43c (atomic: Replace atomic_{set,clear}_mask() usage)
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A critical bug has been found in device memory stage1 translation for
VMs with more than 4GB of address space. Since vm_pgoff is smaller
than pa (which is true in the LPAE case: u32 and u64 respectively), some
more significant bits of pa may be lost, as the shift operation is
performed on a u32 and only later cast to a u64.
Example: vm_pgoff(u32)=0x00210030, PAGE_SHIFT=12
expected pa(u64): 0x0000002010030000
produced pa(u64): 0x0000000010030000
The fix is to change the order of operations (first casting to phys_addr_t
and then shifting).
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
[maz: fixed changelog and patch formatting]
Cc: stable@vger.kernel.org
Signed-off-by: Marek Majtyka <marek.majtyka@tieto.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
When setting the debug register from userspace, make sure that
copy_from_user() is called with its parameters in the expected
order. It otherwise doesn't do what you think.
Fixes: 84e690bfbe ("KVM: arm64: introduce vcpu->arch.debug_ptr")
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Cc: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reference SDM 28.1:
The current VPID is 0000H in the following situations:
- Outside VMX operation. (This includes operation in system-management
mode under the default treatment of SMIs and SMM with VMX operation;
see Section 34.14.)
- In VMX root operation.
- In VMX non-root operation when the “enable VPID” VM-execution control
is 0.
The VPID should never be 0000H in non-root operation when "enable VPID"
VM-execution control is 1. However, commit 34a1cd60 ("kvm: x86: vmx:
move some vmx setting from vmx_init() to hardware_setup()") removed the
code which reserves 0000H for VMX root operation.
This patch fixes that by again reserving 0000H for VMX root operation.
Cc: stable@vger.kernel.org # 3.19+
Fixes: 34a1cd60d1
Reported-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This new statistic can help diagnose VCPUs that, for any reason,
trigger bad behavior of halt_poll_ns autotuning.
For example, say halt_poll_ns = 480000, and wakeups are spaced at exactly
479us, 481us, 479us, 481us. Then KVM always fails polling and wastes
10+20+40+80+160+320+480 = 1110 microseconds out of every
479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then
consumes about 30% more CPU than it would without
polling. This would show up as an abnormally high number of
attempted polls compared to successful polls.
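The counter is bumped once per polling attempt in kvm_vcpu_block(); a
sketch of the surrounding loop (against the 4.3-era sources; details
may differ):

	if (vcpu->halt_poll_ns) {
		ktime_t stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);

		++vcpu->stat.halt_attempted_poll;
		do {
			/* a negative return means a wakeup condition
			 * arrived: the poll succeeded */
			if (kvm_vcpu_check_block(vcpu) < 0) {
				++vcpu->stat.halt_successful_poll;
				goto out;
			}
		} while (single_task_running() &&
			 ktime_before(ktime_get(), stop));
	}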
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, if we have a zero-length mmio eventfd assigned on
KVM_MMIO_BUS, it will never be found by kvm_io_bus_cmp(), since it
always compares the kvm_io_range() with the length that the guest
wrote. This will cause, e.g. for vhost, the kick to be trapped by qemu
userspace instead of vhost. Fix this by using zero length if an
iodevice is zero length.
Cc: stable@vger.kernel.org
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>