x2apic allows destinations > 0xff and we don't want them delivered to
lower APICs. They are correctly handled by doing nothing.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Physical mode can't address more than one APIC, but lowest-prio is
allowed, so we just reuse our paths.
SDM 10.6.2.1 Physical Destination:
Also, for any non-broadcast IPI or I/O subsystem initiated interrupt
with lowest priority delivery mode, software must ensure that APICs
defined in the interrupt address are present and enabled to receive
interrupts.
We could warn on top of that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
False from kvm_irq_delivery_to_apic_fast() means that we don't handle it
in the fast path, but we still return false in cases that were perfectly
handled, fix that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
0x830 MSR is 0x300 xAPIC MMIO, which is MSR_ICR.
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2APIC has no registers for DFR and ICR2 (see Intel SDM 10.12.1.2 "x2APIC
Register Address Space"). KVM needs to cause #GP on such accesses.
Fix it (DFR and ICR2 on read, ICR2 on write, DFR already handled on writes).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Certain x86 instructions that use modrm operands only allow memory operand
(i.e., mod012), and cause a #UD exception otherwise. KVM ignores this fact.
Currently, the instructions that are such and are emulated by KVM are MOVBE,
MOVNTPS, MOVNTPD and MOVNTI. MOVBE is the most blunt example, since it may be
emulated by the host regardless of MMIO.
The fix introduces a new group for handling such instructions, marking mod3 as
illegal instruction.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Here is a bunch of fixes that deal mostly with architectural compliance:
- interrupt priorities
- interrupt handling
- intruction exit handling
We also provide a helper function for getting the guest visible storage key.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.14 (GNU/Linux)
iQIcBAABAgAGBQJUeHKOAAoJEBF7vIC1phx8fS8P/2i1zXvsB+Mp9FafU6FO3Ci6
yM/ZBLFEsX2jmU0TGR2szZCP4xuHqomJm441+2CX9bXzjf8vnA2hGiDYX0bnBraK
1/klx6Li1fQxsiSHXIgBXr0wh5ftNUdFVZiJoifY9dEdrhVI+YEiptIl7lADCFXi
SdGtjEAzrVEe8H0g6OuBXeEfPzHvxAzNJ31yHuiCKl7vbFVnXNVPeZhq3dJZwmWC
iIdlSqGIjbcHNMLXvrScLDAScBe6WruBLhPSy5aTIA2eBU6f4qkOedSFABJkIAq+
V6v5pXWbrVBIqfLXE7Vp7jxhP7+viBGzu/gPkfT8HV1pQZDa94WojF8hGU0DLGLd
vgZuFDV8cMOZMUS/onXOwnIXN5VPvP8V2v3y8gTiap3MykRiyTGEzq2auU9p0K5n
/8W6Cn1P/WBS8MOFYg726DGmMAWkzpEVz9rxpCLaTpzz7QVyLuSLq/n3SXyQNQIl
zhox/KzwUQD0t1062USoK3w4suYNvnX0BuFOwxXvS7f4bsb+6V/t0GyIBnVAL/OF
DZzJSIyzP/Ur/9krxJQxML3kEELU1CjwSLOrzDUnZA3ytaKvsLrHkTb9nK6fREDK
14AGRnp94B3EMPR6T+7T6gvwK2Y/QIo8Y/EFAa2BwXY4Q/BQPSVU3x3RK6L9+7jY
VaG9sgn4OC2ZPzxjqMTc
=MQUy
-----END PGP SIGNATURE-----
Merge tag 'kvm-s390-next-20141128' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: Several fixes,cleanups and reworks
Here is a bunch of fixes that deal mostly with architectural compliance:
- interrupt priorities
- interrupt handling
- intruction exit handling
We also provide a helper function for getting the guest visible storage key.
Allow to specify CR14, logout area, external damage code
and failed storage address.
Since more then one machine check can be indicated to the guest at
a time we need to combine all indication bits with already pending
requests.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch adapts handling of local interrupts to be more compliant with
the z/Architecture Principles of Operation and introduces a data
structure
which allows more efficient handling of interrupts.
* get rid of li->active flag, use bitmap instead
* Keep interrupts in a bitmap instead of a list
* Deliver interrupts in the order of their priority as defined in the
PoP
* Use a second bitmap for sigp emergency requests, as a CPU can have
one request pending from every other CPU in the system.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Adds a bitmap to the vcpu structure which is used to keep track
of local pending interrupts. Also add enum with all interrupt
types sorted in order of priority (highest to lowest)
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Move delivery code for cpu-local interrupt from the huge do_deliver_interrupt()
to smaller functions which handle one type of interrupt.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Get rid of open coded value for virtio and pfault completion interrupts.
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
The 32bit external interrupt parameter is only valid for timing-alert and
service-signal interrupts.
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
In preparation for the rework of the local interrupt injection code,
factor out injection routines from kvm_s390_inject_vcpu().
Signed-off-by: Jens Freimann <jfrei@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Define get_guest_storage_key which can be used to get the value of a guest
storage key. This compliments the functionality provided by the helper function
set_guest_storage_key. Both functions are needed for live migration of s390
guests that use storage keys.
Signed-off-by: Jason J. Herne <jjherne@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
When injecting a floating interrupt and no CPU is idle we
kick one CPU to do an external exit. In case of I/O we
should trigger an I/O exit instead. This does not matter
for Linux guests as external and I/O interrupts are
enabled/disabled at the same time, but play safe anyway.
The same holds true for machine checks. Since there is no
special exit, just reuse the generic stop exit. The injection
code inside the VCPU loop will recheck anyway and rearm the
proper exits (e.g. control registers) if necessary.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
A couple of our interception handlers rewind the PSW to the beginning
of the instruction to run the intercepted instruction again during the
next SIE entry. This normally works fine, but there is also the
possibility that the instruction did not get run directly but via an
EXECUTE instruction.
In this case, the PSW does not point to the instruction that caused the
interception, but to the EXECUTE instruction! So we've got to rewind the
PSW to the beginning of the EXECUTE instruction instead.
This is now accomplished with a new helper function kvm_s390_rewind_psw().
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This patch includes two small fixes for the PFMF handler: First, the
start address for PFMF has to be masked according to the current
addressing mode, which is now done with kvm_s390_logical_to_effective().
Second, the protection exceptions have a lower priority than the
specification exceptions, so the check for low-address protection
has to be moved after the last spot where we inject a specification
exception.
Signed-off-by: Thomas Huth <thuth@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
cs.base is declared as a __u64 variable and vector is a u32 so this
causes a static checker warning. The user indeed can set "sipi_vector"
to any u32 value in kvm_vcpu_ioctl_x86_set_vcpu_events(), but the
value should really have 8-bit precision only.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Create a new header, and hide the device assignment functions there.
Move struct kvm_assigned_dev_kernel to assigned-dev.c by modifying
arch/x86/kvm/iommu.c to take a PCI device struct.
Based on a patch by Radim Krcmar <rkrcmark@redhat.com>.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This feature is not supported inside KVM guests yet, because we do not emulate
MSR_IA32_XSS. Mask it out.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that ia64 is gone, we can hide deprecated device assignment in x86.
Notable changes:
- kvm_vm_ioctl_assigned_device() was moved to x86/kvm_arch_vm_ioctl()
The easy parts were removed from generic kvm code, remaining
- kvm_iommu_(un)map_pages() would require new code to be moved
- struct kvm_assigned_dev_kernel depends on struct kvm_irq_ack_notifier
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM ia64 is no longer present so new applications shouldn't use them.
The main problem is that they most likely didn't work even before,
because of a conflict in the #defines:
#define KVM_SET_GUEST_DEBUG _IOW(KVMIO, 0x9b, struct kvm_guest_debug)
#define KVM_IA64_VCPU_SET_STACK _IOW(KVMIO, 0x9b, void *)
The argument to KVM_SET_GUEST_DEBUG is:
struct kvm_guest_debug {
__u32 control;
__u32 pad;
struct kvm_guest_debug_arch arch;
};
struct kvm_guest_debug_arch {
};
meaning that sizeof(struct kvm_guest_debug) == sizeof(void *) == 8
and KVM_SET_GUEST_DEBUG == KVM_IA64_VCPU_SET_STACK.
KVM_SET_GUEST_DEBUG is handled in virt/kvm/kvm_main.c before even calling
kvm_arch_vcpu_ioctl (which would have handled KVM_IA64_VCPU_SET_STACK),
so KVM_IA64_VCPU_SET_STACK would just return -EINVAL.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ia64 does not need them anymore. Ack notifiers become x86-specific
too.
Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM for ia64 has been marked as broken not just once, but twice even,
and the last patch from the maintainer is now roughly 5 years old.
Time for it to rest in peace.
Acked-by: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove FIXME comments about needing fault addresses to be returned. These
are propaagated from walk_addr_generic to gva_to_gpa and from there to
ops->read_std and ops->write_std.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The check on the higher limit of the segment, and the check on the
maximum accessible size, is the same for both expand-up and
expand-down segments. Only the computation of "lim" varies.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
register_address has been a duplicate of address_mask ever since the
ancestor of __linearize was born in 90de84f50b (KVM: x86 emulator:
preserve an operand's segment identity, 2010-11-17).
However, we can put it to a better use by including the call to reg_read
in register_address. Similarly, the call to reg_rmw can be moved to
register_address_increment.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In __linearize there is check of the condition whether to check if masking of
the linear address is needed. It occurs immediately after switch that
evaluates the same condition. Merge them.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When SS is used using a non-canonical address, an #SS exception is generated on
real hardware. KVM emulator causes a #GP instead. Fix it to behave as real x86
CPU.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If branch (e.g., jmp, ret) causes limit violations, since the target IP >
limit, the #GP exception occurs before the branch. In other words, the RIP
pushed on the stack should be that of the branch and not that of the target.
To do so, we can call __linearize, with new EIP, which also saves us the code
which performs the canonical address checks. On the case of assigning an EIP >=
2^32 (when switching cs.l), we also safe, as __linearize will check the new EIP
does not exceed the limit and would trigger #GP(0) otherwise.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When segment is accessed, real hardware does not perform any privilege level
checks. In contrast, KVM emulator does. This causes some discrepencies from
real hardware. For instance, reading from readable code segment may fail due to
incorrect segment checks. In addition, it introduces unnecassary overhead.
To reference Intel SDM 5.5 ("Privilege Levels"): "Privilege levels are checked
when the segment selector of a segment descriptor is loaded into a segment
register." The SDM never mentions privilege level checks during memory access,
except for loading far pointers in section 5.10 ("Pointer Validation"). Those
are actually segment selector loads and are emulated in the similarily (i.e.,
regardless to __linearize checks).
This behavior was also checked using sysexit. A data-segment whose DPL=0 was
loaded, and after sysexit (CPL=3) it is still accessible.
Therefore, all the privilege level checks in __linearize are removed.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When performing segmented-read/write in the emulator for stack operations, it
ignores the stack size, and uses the ad_bytes as indication for the pointer
size. As a result, a wrong address may be accessed.
To fix this behavior, we can remove the masking of address in __linearize and
perform it beforehand. It is already done for the operands (so currently it is
inefficiently done twice). It is missing in two cases:
1. When using rip_relative
2. On fetch_bit_operand that changes the address.
This patch masks the address on these two occassions, and removes the masking
from __linearize.
Note that it does not mask EIP during fetch. In protected/legacy mode code
fetch when RIP >= 2^32 should result in #GP and not wrap-around. Since we make
limit checks within __linearize, this is the expected behavior.
Partial revert of commit 518547b32a (KVM: x86: Emulator does not
calculate address correctly, 2014-09-30).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 10e38fc7cab6 ("KVM: x86: Emulator flag for instruction that only support
16-bit addresses in real mode") introduced NoBigReal for instructions such as
MONITOR. Apparetnly, the Intel SDM description that led to this patch is
misleading. Since no instruction is using NoBigReal, it is safe to remove it,
we fully understand what the SDM means.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MMIO_MAX_GEN is the same as MMIO_GEN_MASK. Use only one.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead, just use PFERR_{FETCH, PRESENT, WRITE}_MASK
inside handle_ept_violation() for slightly better code.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
apic_find_highest_irr assumes irr_pending is set if any vector in APIC_IRR is
set. If this assumption is broken and apicv is disabled, the injection of
interrupts may be deferred until another interrupt is delivered to the guest.
Ultimately, if no other interrupt should be injected to that vCPU, the pending
interrupt may be lost.
commit 56cc2406d6 ("KVM: nVMX: fix "acknowledge interrupt on exit" when APICv
is in use") changed the behavior of apic_clear_irr so irr_pending is cleared
after setting APIC_IRR vector. After this commit, if apic_set_irr and
apic_clear_irr run simultaneously, a race may occur, resulting in APIC_IRR
vector set, and irr_pending cleared. In the following example, assume a single
vector is set in IRR prior to calling apic_clear_irr:
apic_set_irr apic_clear_irr
------------ --------------
apic->irr_pending = true;
apic_clear_vector(...);
vec = apic_search_irr(apic);
// => vec == -1
apic_set_vector(...);
apic->irr_pending = (vec != -1);
// => apic->irr_pending == false
Nonetheless, it appears the race might even occur prior to this commit:
apic_set_irr apic_clear_irr
------------ --------------
apic->irr_pending = true;
apic->irr_pending = false;
apic_clear_vector(...);
if (apic_search_irr(apic) != -1)
apic->irr_pending = true;
// => apic->irr_pending == false
apic_set_vector(...);
Fixing this issue by:
1. Restoring the previous behavior of apic_clear_irr: clear irr_pending, call
apic_clear_vector, and then if APIC_IRR is non-zero, set irr_pending.
2. On apic_set_irr: first call apic_set_vector, then set irr_pending.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Logical destination mode can be used to send NMI IPIs even when all
APICs are software disabled, so if all APICs are software disabled we
should still look at the DFRs.
So the DFRs should all be the same, even if some or all APICs are
software disabled. However, the SDM does not say this, so tweak
the logic as follows:
- if one APIC is enabled and has LDR != 0, use that one to build the map.
This picks the right DFR in case an OS is only setting it for the
software-enabled APICs, or in case an OS is using logical addressing
on some APICs while leaving the rest in reset state (using LDR was
suggested by Radim).
- if all APICs are disabled, pick a random one to build the map.
We use the last one with LDR != 0 for simplicity.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the APIC logical map does not consider VCPUs whose local-apic is
software-disabled. However, NMIs, INIT, etc. should still be delivered to such
VCPUs. Therefore, the APIC mode should first be determined, and then the map,
considering all VCPUs should be constructed.
To address this issue, first find the APIC mode, and only then construct the
logical map.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The update_memslots invocation is only needed in one case. Make
the code clearer by moving it to __kvm_set_memory_region, and
removing the wrapper around insert_memslot.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The two kmemdup invocations can be unified. I find that the new
placement of the comment makes it easier to see what happens.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This completes the optimization from the previous patch, by
removing the KVM_MEM_SLOTS_NUM-iteration loop from insert_memslot.
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
memslots is a sorted array. When a slot is changed, heapsort (lib/sort.c)
would take O(n log n) time to update it; an optimized insertion sort will
only cost O(n) on an array with just one item out of order.
Replace sort() with a custom sort that takes advantage of memslots usage
pattern and the known position of the changed slot.
performance change of 128 memslots insertions with gradually increasing
size (the worst case):
heap sort custom sort
max: 249747 2500 cycles
with custom sort alg taking ~98% less then original
update time.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
With the 3 private slots, this gives us 512 slots total.
Motivation for this is in addition to assigned devices
support more memory hotplug slots, where 1 slot is
used by a hotplugged memory stick.
It will allow to support upto 256 hotplug memory
slots and leave 253 slots for assigned devices and
other devices that use them.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When running the tsc_adjust kvm-unit-test on an AMD processor with the
IA32_TSC_ADJUST feature enabled, the WARN_ON in svm_adjust_tsc_offset can be
triggered. This WARN_ON checks for a negative adjustment in case __scale_tsc
is called; however it may trigger unnecessary warnings.
This patch moves the WARN_ON to trigger only if __scale_tsc will actually be
called from svm_adjust_tsc_offset. In addition make adj in kvm_set_msr_common
s64 since this can have signed values.
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's nothing to switch if the host and guest values are the same.
I am unable to find evidence that this makes any difference
whatsoever.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[I could see a difference on Nehalem. From 5 runs:
userspace exit, guest!=host 12200 11772 12130 12164 12327
userspace exit, guest=host 11983 11780 11920 11919 12040
lightweight exit, guest!=host 3214 3220 3238 3218 3337
lightweight exit, guest=host 3178 3193 3193 3187 3220
This passes the t-test with 99% confidence for userspace exit,
98.5% confidence for lightweight exit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
At least on Sandy Bridge, letting the CPU switch IA32_EFER is much
faster than switching it manually.
I benchmarked this using the vmexit kvm-unit-test (single run, but
GOAL multiplied by 5 to do more iterations):
Test Before After Change
cpuid 2000 1932 -3.40%
vmcall 1914 1817 -5.07%
mov_from_cr8 13 13 0.00%
mov_to_cr8 19 19 0.00%
inl_from_pmtimer 19164 10619 -44.59%
inl_from_qemu 15662 10302 -34.22%
inl_from_kernel 3916 3802 -2.91%
outl_to_kernel 2230 2194 -1.61%
mov_dr 172 176 2.33%
ipi (skipped) (skipped)
ipi+halt (skipped) (skipped)
ple-round-robin 13 13 0.00%
wr_tsc_adjust_msr 1920 1845 -3.91%
rd_tsc_adjust_msr 1892 1814 -4.12%
mmio-no-eventfd:pci-mem 16394 11165 -31.90%
mmio-wildcard-eventfd:pci-mem 4607 4645 0.82%
mmio-datamatch-eventfd:pci-mem 4601 4610 0.20%
portio-no-eventfd:pci-io 11507 7942 -30.98%
portio-wildcard-eventfd:pci-io 2239 2225 -0.63%
portio-datamatch-eventfd:pci-io 2250 2234 -0.71%
I haven't explicitly computed the significance of these numbers,
but this isn't subtle.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[The results were reproducible on all of Nehalem, Sandy Bridge and
Ivy Bridge. The slowness of manual switching is because writing
to EFER with WRMSR triggers a TLB flush, even if the only bit you're
touching is SCE (so the page table format is not affected). Doing
the write as part of vmentry/vmexit, instead, does not flush the TLB,
probably because all processors that have EPT also have VPID. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PCIDs are only supported in 64-bit mode. No need to clear bit 63
of CR3 unless the host is 64-bit.
Reported by Fengguang Wu's autobuilder.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>