Since segments need to be handled slightly differently when fetching
instructions, we add a __linearize helper that accepts a new 'fetch' boolean.
[avi: fix oops caused by wrong segmented_address initialization order]
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The last_guest_tsc is used in vcpu_load to adjust the
tsc_offset since tsc-scaling is merged. So the
last_guest_tsc needs to be updated in vcpu_put instead of
the the last_host_tsc. This is fixed with this patch.
Reported-by: Jan Kiszka <jan.kiszka@web.de>
Tested-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch fixes a bug in the nested-svm path when
decode-assists is available on the machine. After a
selective-cr0 intercept is detected the rip is advanced
unconditionally. This causes the l1-guest to continue
running with an l2-rip.
This bug was with the sel_cr0 unit-test on decode-assists
capable hardware.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently, setting a large (i.e. negative) base address for %cs does not work on
a 64-bit host. The "JOS" teaching operating system, used by MIT and other
universities, relies on such segments while bootstrapping its way to full
virtual memory management.
Signed-off-by: Nelson Elhage <nelhage@ksplice.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Just remove useless function define kvm_inject_pit_timer_irqs() from
file arch/x86/kvm/i8254.h
Signed-off-by:Duan Jiong<djduanjiong@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Just remove useless function define kvm_pic_clear_isr_ack() and
pit_has_pending_timer()
Signed-off-by: Duan Jiong<djduanjiong@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When doing a soft int, we need to bump eip before pushing it to
the stack. Otherwise we'll do the int a second time.
[apw@canonical.com: merged eip update as per Jan's recommendation.]
Signed-off-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
em_push() is a simple wrapper of emulate_push(). So this patch replaces
emulate_push() with em_push() and removes the unnecessary former.
In addition, the unused ops arguments are removed from emulate_pusha()
and emulate_grp45().
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
PUSH emulation stores the value by calling writeback() after setting
the dst operand appropriately in emulate_push().
This writeback() using dst is not needed at all because we know the
target is the stack. So this patch makes emulate_push() call, newly
introduced, segmented_write() directly.
By this, many inlined writeback()'s are removed.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
This stops "CMP r/m, reg" to write back the data into memory.
Pointed out by Avi.
The writeback suppression now covers CMP, CMPS, SCAS.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
In case certain allocations fail, vmx_create_vcpu may return 0 as error
instead of a negative value encoded via ERR_PTR. This causes a NULL
pointer dereferencing later on in kvm_vm_ioctl_vcpu_create.
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently we sync registers back and forth before/after exiting
to userspace for IO, but during IO device model shouldn't need to
read/write the registers, so we can as well skip those sync points. The
only exaception is broken vmware backdor interface. The new code sync
registers content during IO only if registers are read from/written to
by userspace in the middle of the IO operation and this almost never
happens in practise.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
When the emulation of vmload or vmsave fails because the
guest passed an unsupported physical address it gets an #GP
with rip pointing to the instruction after vmsave/vmload.
This is a bug and fixed by this patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements two new vm-ioctls to get and set the
virtual_tsc_khz if the machine supports tsc-scaling. Setting
the tsc-frequency is only possible before userspace creates
any vcpu.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
With TSC scaling in SVM the tsc-offset needs to be
calculated differently. This patch propagates this
calculation into the architecture specific modules so that
this complexity can be handled there.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements a call-back into the architecture code
to allow the propagation of changes to the virtual tsc_khz
of the vcpu.
On SVM it updates the tsc_ratio variable, on VMX it does
nothing.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The calculation of the tsc_delta value to ensure a
forward-going tsc for the guest is a function of the
host-tsc. This works as long as the guests tsc_khz is equal
to the hosts tsc_khz. With tsc-scaling hardware support this
is not longer true and the tsc_delta needs to be calculated
using guest_tsc values.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch changes the kvm_guest_time_update function to use
TSC frequency the guest actually has for updating its clock.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch enhances the kvm_amd module with functions to
support the TSC_RATE_MSR which can be used to set a given
tsc frequency for the guest vcpu.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
VMMCALL needs the VendorSpecific tag so that #UD emulation
(called if a guest running on AMD was migrated to an Intel host)
is allowed to process the instruction.
Signed-off-by: Avi Kivity <avi@redhat.com>
The mmu_seq verification can be removed since we get the pfn in the
protection of mmu_lock.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The patch below removes unsigned long base_addresss; in i8254.h
since it is unused.
Signed-off-by: Justin P. Mattock <justinmattock@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch removes all the old code which handled the nested
selective cr0 write intercepts. This code was only in place
as a work-around until the instruction emulator is capable
of doing the same. This is the case with this patch-set and
so the code can be removed.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds code to check for IOIO intercepts on
instructions decoded by the KVM instruction emulator.
[avi: fix build error due to missing #define D2bvIP]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch add intercept checks for emulated one-byte
instructions to the KVM instruction emulation path.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds intercepts checks for the remaining twobyte
instructions to the KVM instruction emulator.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements the emulator intercept checks for the
RDTSCP, MONITOR, and MWAIT instructions.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds the necessary code changes in the
instruction emulator and the extensions to svm.c to
implement intercept checks for the svm instructions.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch add intercept checks into the KVM instruction
emulator to check for the 8 instructions that access the
descriptor table addresses.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds the intercept checks for instruction
accessing the debug registers.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds all necessary intercept checks for
instructions that access the crX registers.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds a callback into kvm_x86_ops so that svm and
vmx code can do intercept checks on emulated instructions.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds a flag for the opcoded to tag instruction
which are only recognized in protected mode. The necessary
check is added too.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds a check_perm callback for each opcode into
the instruction emulator. This will be used to do all
necessary permission checks on instructions before checking
whether they are intercepted or not.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch prevents the changed CPU state to be written back
when the emulator detected that the instruction was
intercepted by the guest.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Add intercept codes for instructions defined by SVM as
interceptable.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When running in guest mode, certain instructions can be intercepted by
hardware. This also holds for nested guests running on emulated
virtualization hardware, in particular instructions emulated by kvm
itself.
This patch adds a framework for intercepting instructions. If an
instruction is marked for interception, and if we're running in guest
mode, a callback is called to check whether an intercept is needed or
not. The callback is called at three points in time: immediately after
beginning execution, after checking privilge exceptions, and after
checking memory exception. This suits the different interception points
defined for different instructions and for the various virtualization
instruction sets.
In addition, a new X86EMUL_INTERCEPT is defined, which any callback or
memory access may define, allowing the more complicated intercepts to be
implemented in existing callbacks.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Most SIMD instructions use the 66/f2/f3 prefixes to distinguish between
different variants of the same instruction. Usually the encoding is quite
regular, but in some cases (including non-SIMD instructions) the prefixes
generate very different instructions. Examples include XCHG/PAUSE,
MOVQ/MOVDQA/MOVDQU, and MOVBE/CRC32.
Allow the emulator to handle these special cases by splitting such opcodes
into groups, with different decode flags and execution functions for different
prefixes.
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently we store a rep prefix as 1 or 2 depending on whether it is a REPE or
REPNE. Since sse instructions depend on the prefix value, store it as the
original opcode to simplify things further on.
Signed-off-by: Avi Kivity <avi@redhat.com>
Since sse instructions can issue 16-byte mmios, we need to support them. We
can't increase the kvm_run mmio buffer size to 16 bytes without breaking
compatibility, so instead we break the large mmios into two smaller 8-byte
ones. Since the bus is 64-bit we aren't breaking any atomicity guarantees.
Signed-off-by: Avi Kivity <avi@redhat.com>
As Avi recently mentioned, the new standard mechanism for exposing features
is KVM_GET_SUPPORTED_CPUID, not spamming CAPs. For some reason async pf
missed that.
So expose async_pf here.
Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Gleb Natapov <gleb@redhat.com>
CC: Avi Kivity <avi@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use vmx_set_nmi_mask() instead of open-coding management of
the hardware bit and the software hint (nmi_known_unmasked).
There's a slight change of behaviour when running without
hardware virtual NMI support - we now clear the NMI mask if
NMI delivery faulted in that case as well. This improves
emulation accuracy.
Signed-off-by: Avi Kivity <avi@redhat.com>
When we haven't injected an interrupt, we don't need to recover
the nmi blocking state (since the guest can't set it by itself).
This allows us to avoid a VMREAD later on.
Signed-off-by: Avi Kivity <avi@redhat.com>
We may read the cpl quite often in the same vmexit (instruction privilege
check, memory access checks for instruction and operands), so we gain
a bit if we cache the value.
Signed-off-by: Avi Kivity <avi@redhat.com>
In long mode, vm86 mode is disallowed, so we need not check for
it. Reading rflags.vm may require a VMREAD, so it is expensive.
Signed-off-by: Avi Kivity <avi@redhat.com>
Some rflags bits are owned by the host, not guest, so we need to use
kvm_get_rflags() to strip those bits away or kvm_set_rflags() to add them
back.
Signed-off-by: Avi Kivity <avi@redhat.com>
If KVM cannot find an exact match for a requested CPUID leaf, the
code will try to find the closest match instead of simply confessing
it's failure.
The implementation was meant to satisfy the CPUID specification, but
did not properly check for extended and standard leaves and also
didn't account for the index subleaf.
Beside that this rule only applies to CPUID intercepts, which is not
the only user of the kvm_find_cpuid_entry() function.
So fix this algorithm and call it from kvm_emulate_cpuid().
This fixes a crash of newer Linux kernels as KVM guests on
AMD Bulldozer CPUs, where bogus values were returned in response to
a CPUID intercept.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When KVM scans the 0xD CPUID leaf for propagating the XSAVE save area
leaves, it assumes that the leaves are contigious and stops at the
first zero one. On AMD hardware there is a gap, though, as LWP uses
leaf 62 to announce it's state save area.
So lets iterate through all 64 possible leaves and simply skip zero
ones to also cover later features.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Flush TLB if PGD entry is changed in i386 PAE mode
x86, dumpstack: Correct stack dump info when frame pointer is available
x86: Clean up csum-copy_64.S a bit
x86: Fix common misspellings
x86: Fix misspelling and align params
x86: Use PentiumPro-optimized partial_csum() on VIA C7
They were generated by 'codespell' and then manually reviewed.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Cc: trivial@kernel.org
LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 6440e5967bc broke old userspaces that do not set tss address
before entering vcpu. Unbreak it by setting tss address to a safe
value on the first vcpu entry. New userspaces should set tss address,
so print warning in case it doesn't.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This patch does:
- call vcpu->arch.mmu.update_pte directly
- use gfn_to_pfn_atomic in update_pte path
The suggestion is from Avi.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Cleanup the code of pte_prefetch_gfn_to_memslot and mapping_level_dirty_bitmap
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
commit 387b9f97750444728962b236987fbe8ee8cc4f8c moved kvm_request_guest_time_update(vcpu),
breaking 32bit SMP guests using kvm-clock. Fix this by moving (new) clock update function
to proper place.
Signed-off-by: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently if io port + len crosses 8bit boundary in io permission bitmap the
check may allow IO that otherwise should not be allowed. The patch fixes that.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Current implementation truncates upper 32bit of TR base address during IO
permission bitmap check. The patch fixes this.
Reported-and-tested-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
With CONFIG_CC_STACKPROTECTOR, we need a valid %gs at all times, so disable
lazy reload and do an eager reload immediately after the vmexit.
Reported-by: IVAN ANGELOV <ivangotoy@gmail.com>
Acked-By: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Access to this page is mostly done through the regs member which holds
the address to this page. The exceptions are in vmx_vcpu_reset() and
kvm_free_lapic() and these both can easily be converted to using regs.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
Using __get_free_page instead of alloc_page and page_address,
using free_page instead of __free_page and virt_to_page
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
No need to record the gfn to verifier the pte has the same mode as
current vcpu, it's because we only speculatively update the pte only
if the pte and vcpu have the same mode
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_mmu_calculate_mmu_pages need to walk all memslots and it's protected by
kvm->slots_lock, so move it out of mmu spinlock
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Set spte accessed bit only if guest_initiated == 1 that means the really
accessed
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
use EFER_SCE, EFER_LME and EFER_LMA instead of magic numbers.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The hash array of async gfns may still contain some left gfns after
kvm_clear_async_pf_completion_queue() called, need to clear them.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently vm86 task is initialized on each real mode entry and vcpu
reset. Initialization is done by zeroing TSS and updating relevant
fields. But since all vcpus are using the same TSS there is a race where
one vcpu may use TSS while other vcpu is initializing it, so the vcpu
that uses TSS will see wrong TSS content and will behave incorrectly.
Fix that by initializing TSS only once.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When rmode.vm86 is active TR descriptor is updated with vm86 task values,
but selector is left intact. vmx_set_segment() makes sure that if TR
register is written into while vm86 is active the new values are saved
for use after vm86 is deactivated, but since selector is not updated on
vm86 activation/deactivation new value is lost. Fix this by writing new
selector into vmcs immediately.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The changelog of 104f226 said "adds the __noclone attribute",
but it was missing in its patch. I think it is still needed.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
isr_ack logic was added by e48258009d to avoid unnecessary IPIs. Back
then it made sense, but now the code checks that vcpu is ready to accept
interrupt before sending IPI, so this logic is no longer needed. The
patch removes it.
Fixes a regression with Debian/Hurd.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reported-and-tested-by: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch fixes the logic used to detect whether BIOS has disabled VMX, for
the case where VMX is enabled only under SMX, but tboot is not active.
Signed-off-by: Joseph Cihula <joseph.cihula@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Code under this lock requires non-preemptibility. Ensure this also over
-rt by converting it to raw spinlock.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When we enable an NMI window, we ask for an IRET intercept, since
the IRET re-enables NMIs. However, the IRET intercept happens before
the instruction executes, while the NMI window architecturally opens
afterwards.
To compensate for this mismatch, we only open the NMI window in the
following exit, assuming that the IRET has by then executed; however,
this assumption is not always correct; we may exit due to a host interrupt
or page fault, without having executed the instruction.
Fix by checking for forward progress by recording and comparing the IRET's
rip. This is somewhat of a hack, since an unchaging rip does not mean that
no forward progress has been made, but is the simplest fix for now.
Signed-off-by: Avi Kivity <avi@redhat.com>
The interrupt injection logic looks something like
if an nmi is pending, and nmi injection allowed
inject nmi
if an nmi is pending
request exit on nmi window
the problem is that "nmi is pending" can be set asynchronously by
the PIT; if it happens to fire between the two if statements, we
will request an nmi window even though nmi injection is allowed. On
SVM, this has disasterous results, since it causes eflags.TF to be
set in random guest code.
The fix is simple; make nmi_pending synchronous using the standard
vcpu->requests mechanism; this ensures the code above is completely
synchronous wrt nmi_pending.
Signed-off-by: Avi Kivity <avi@redhat.com>
Use the new support in the emulator, and drop the ad-hoc code in x86.c.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Mark some instructions as vendor specific, and allow the caller to request
emulation only of vendor specific instructions. This is useful in some
circumstances (responding to a #UD fault).
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>