Commit Graph

3009 Commits

Author SHA1 Message Date
Jan Kiszka 11f5cc0515 KVM: x86: Simplify __apic_accept_irq
If posted interrupts are enabled, we can no longer track if an IRQ was
coalesced based on IRR. So drop this logic also from the classic
software path and simplify apic_test_and_set_irr to apic_set_irr.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-25 13:42:35 +03:00
Andi Kleen 103af0a987 perf, kvm: Support the in_tx/in_tx_cp modifiers in KVM arch perfmon emulation v5
[KVM maintainers:
The underlying support for this is in perf/core now. So please merge
this patch into the KVM tree.]

This is not arch perfmon, but older CPUs will just ignore it. This makes
it possible to do at least some TSX measurements from a KVM guest

v2: Various fixes to address review feedback
v3: Ignore the bits when no CPUID. No #GP. Force raw events with TSX bits.
v4: Use reserved bits for #GP
v5: Remove obsolete argument
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-19 18:24:45 +02:00
Arthur Chunqi Li 21feb4eb64 KVM: nVMX: Set segment infomation of L1 when L2 exits
When L2 exits to L1, segment infomations of L1 are not set correctly.
According to Intel SDM 27.5.2(Loading Host Segment and Descriptor
Table Registers), segment base/limit/access right of L1 should be
set to some designed value when L2 exits to L1. This patch fixes
this.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Reviewed-by: Gleb Natapov <gnatapov@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:31 +02:00
Nadav Har'El b3897a49e2 KVM: nVMX: Fix read/write to MSR_IA32_FEATURE_CONTROL
Fix read/write to IA32_FEATURE_CONTROL MSR in nested environment.

This patch simulate this MSR in nested_vmx and the default value is
0x0. BIOS should set it to 0x5 before VMXON. After setting the lock
bit, write to it will cause #GP(0).

Another QEMU patch is also needed to handle emulation of reset
and migration. Reset to vCPU should clear this MSR and migration
should reserve value of it.

This patch is based on Nadav's previous commit.
http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/88478

Signed-off-by: Nadav Har'El <nyh@math.technion.ac.il>
Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:29 +02:00
Mathias Krause 6b61edf765 KVM: x86: Drop useless cast
Void pointers don't need no casting, drop it.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:29 +02:00
Mathias Krause c2bae89394 KVM: VMX: Use proper types to access const arrays
Use a const pointer type instead of casting away the const qualifier
from const arrays. Keep the pointer array on the stack, nonetheless.
Making it static just increases the object size.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-07-18 12:29:28 +02:00
Arthur Chunqi Li a25eb114d5 KVM: nVMX: Set success rflags when emulate VMXON/VMXOFF in nested virt
Set rflags after successfully emulateing VMXON/VMXOFF in VMX.

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:27 +02:00
Arthur Chunqi Li 0658fbaad8 KVM: nVMX: Change location of 3 functions in vmx.c
Move nested_vmx_succeed/nested_vmx_failInvalid/nested_vmx_failValid
ahead of handle_vmon to eliminate double declaration in the same
file

Signed-off-by: Arthur Chunqi Li <yzt356@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:26 +02:00
Takuya Yoshikawa e6dff7d15e KVM: x86: Avoid zapping mmio sptes twice for generation wraparound
Now that kvm_arch_memslots_updated() catches every increment of the
memslots->generation, checking if the mmio generation has reached its
maximum value is enough.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:26 +02:00
Takuya Yoshikawa e59dbe09f8 KVM: Introduce kvm_arch_memslots_updated()
This is called right after the memslots is updated, i.e. when the result
of update_memslots() gets installed in install_new_memslots().  Since
the memslots needs to be updated twice when we delete or move a memslot,
kvm_arch_commit_memory_region() does not correspond to this exactly.

In the following patch, x86 will use this new API to check if the mmio
generation has reached its maximum value, in which case mmio sptes need
to be flushed out.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Acked-by: Alexander Graf <agraf@suse.de>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:29:25 +02:00
Xiao Guangrong 1c118b8226 KVM: MMU: avoid fast page fault fixing mmio page fault
Currently, fast page fault incorrectly tries to fix mmio page fault when
the generation number is invalid (spte.gen != kvm.gen).  It then returns
to guest to retry the fault since it sees the last spte is nonpresent.
This causes an infinite loop.

Since fast page fault only works for direct mmu, the issue exists when
1) tdp is enabled. It is only triggered only on AMD host since on Intel host
   the mmio page fault is recognized as ept-misconfig whose handler call
   fault-page path with error_code = 0

2) guest paging is disabled. Under this case, the issue is hardly discovered
   since paging disable is short-lived and the sptes will be invalid after
   memslot changed for 150 times

Fix it by filtering out MMIO page faults in page_fault_can_be_fast.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Tested-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-18 12:26:57 +02:00
Gleb Natapov 03617c188f KVM: VMX: mark unusable segment as nonpresent
Some userspaces do not preserve unusable property. Since usable
segment has to be present according to VMX spec we can use present
property to amend userspace bug by making unusable segment always
nonpresent. vmx_segment_access_rights() already marks nonpresent segment
as unusable.

Cc: stable@vger.kernel.org # 3.9+
Reported-by: Stefan Pietsch <stefan.pietsch@lsexperts.de>
Tested-by: Stefan Pietsch <stefan.pietsch@lsexperts.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-07-04 14:40:36 +02:00
Linus Torvalds fe489bf450 KVM fixes for 3.11
On the x86 side, there are some optimizations and documentation updates.
 The big ARM/KVM change for 3.11, support for AArch64, will come through
 Catalin Marinas's tree.  s390 and PPC have misc cleanups and bugfixes.
 
 There is a conflict due to "s390/pgtable: fix ipte notify bit" having
 entered 3.10 through Martin Schwidefsky's s390 tree.  This pull request
 has additional changes on top, so this tree's version is the correct one.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.13 (GNU/Linux)
 
 iQIcBAABAgAGBQJR0oU6AAoJEBvWZb6bTYbynnsP/RSUrrHrA8Wu1tqVfAKu+1y5
 6OIihqZ9x11/YMaNofAfv86jqxFu0/j7CzMGphNdjzujqKI+Q1tGe7oiVCmKzoG+
 UvSctWsz0lpllgBtnnrm5tcfmG6rrddhLtpA7m320+xCVx8KV5P4VfyHZEU+Ho8h
 ziPmb2mAQ65gBNX6nLHEJ3ITTgad6gt4NNbrKIYpyXuWZQJypzaRqT/vpc4md+Ed
 dCebMXsL1xgyb98EcnOdrWH1wV30MfucR7IpObOhXnnMKeeltqAQPvaOlKzZh4dK
 +QfxJfdRZVS0cepcxzx1Q2X3dgjoKQsHq1nlIyz3qu1vhtfaqBlixLZk0SguZ/R9
 1S1YqucZiLRO57RD4q0Ak5oxwobu18ZoqJZ6nledNdWwDe8bz/W2wGAeVty19ky0
 qstBdM9jnwXrc0qrVgZp3+s5dsx3NAm/KKZBoq4sXiDLd/yBzdEdWIVkIrU3X9wU
 3X26wOmBxtsB7so/JR7ciTsQHelmLicnVeXohAEP9CjIJffB81xVXnXs0P0SYuiQ
 RzbSCwjPzET4JBOaHWT0Dhv0DTS/EaI97KzlN32US3Bn3WiLlS1oDCoPFoaLqd2K
 LxQMsXS8anAWxFvexfSuUpbJGPnKSidSQoQmJeMGBa9QhmZCht3IL16/Fb641ToN
 xBohzi49L9FDbpOnTYfz
 =1zpG
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Paolo Bonzini:
 "On the x86 side, there are some optimizations and documentation
  updates.  The big ARM/KVM change for 3.11, support for AArch64, will
  come through Catalin Marinas's tree.  s390 and PPC have misc cleanups
  and bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
  KVM: PPC: Ignore PIR writes
  KVM: PPC: Book3S PR: Invalidate SLB entries properly
  KVM: PPC: Book3S PR: Allow guest to use 1TB segments
  KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
  KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
  KVM: PPC: Book3S PR: Fix proto-VSID calculations
  KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
  KVM: Fix RTC interrupt coalescing tracking
  kvm: Add a tracepoint write_tsc_offset
  KVM: MMU: Inform users of mmio generation wraparound
  KVM: MMU: document fast invalidate all mmio sptes
  KVM: MMU: document fast invalidate all pages
  KVM: MMU: document fast page fault
  KVM: MMU: document mmio page fault
  KVM: MMU: document write_flooding_count
  KVM: MMU: document clear_spte_count
  KVM: MMU: drop kvm_mmu_zap_mmio_sptes
  KVM: MMU: init kvm generation close to mmio wrap-around value
  KVM: MMU: add tracepoint for check_mmio_spte
  KVM: MMU: fast invalidate all mmio sptes
  ...
2013-07-03 13:21:40 -07:00
Linus Torvalds 002e44bfb5 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull asm/x86 changes from Ingo Molnar:
 "Misc changes, with a bigger processor-flags cleanup/reorganization by
  Peter Anvin"

* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, asm, cleanup: Replace open-coded control register values with symbolic
  x86, processor-flags: Fix the datatypes and add bit number defines
  x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE
  x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED
  linux/const.h: Add _BITUL() and _BITULL()
  x86/vdso: Convert use of typedef ctl_table to struct ctl_table
  x86: __force_order doesn't need to be an actual variable
2013-07-02 16:21:45 -07:00
Gleb Natapov 24f7bb52e9 KVM: Fix RTC interrupt coalescing tracking
This reverts most of the f1ed0450a5. After
the commit kvm_apic_set_irq() no longer returns accurate information
about interrupt injection status if injection is done into disabled
APIC. RTC interrupt coalescing tracking relies on the information to be
accurate and cannot recover if it is not.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-27 14:20:53 +03:00
Yoshihiro YUNOMAE 489223edf2 kvm: Add a tracepoint write_tsc_offset
Add a tracepoint write_tsc_offset for tracing TSC offset change.
We want to merge ftrace's trace data of guest OSs and the host OS using
TSC for timestamp in chronological order. We need "TSC offset" values for
each guest when merge those because the TSC value on a guest is always the
host TSC plus guest's TSC offset. If we get the TSC offset values, we can
calculate the host TSC value for each guest events from the TSC offset and
the event TSC value. The host TSC values of the guest events are used when we
want to merge trace data of guests and the host in chronological order.
(Note: the trace_clock of both the host and the guest must be set x86-tsc in
this case)

This tracepoint also records vcpu_id which can be used to merge trace data for
SMP guests. A merge tool will read TSC offset for each vcpu, then the tool
converts guest TSC values to host TSC values for each vcpu.

TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
The latter function is executed when kvm clock is updated. Only host can read
TSC offset value from VMCS, so a host needs to output TSC offset value
when TSC offset is changed.

Since the TSC offset is not often changed, it could be overwritten by other
frequent events while tracing. To avoid that, I recommend to use a special
instance for getting this event:

1. set a instance before booting a guest
 # cd /sys/kernel/debug/tracing/instances
 # mkdir tsc_offset
 # cd tsc_offset
 # echo x86-tsc > trace_clock
 # echo 1 > events/kvm/kvm_write_tsc_offset/enable

2. boot a guest

Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-27 14:20:51 +03:00
Takuya Yoshikawa 7a2e8aaf0f KVM: MMU: Inform users of mmio generation wraparound
Without this information, users will just see unexpected performance
problems and there is little chance we will get good reports from them:
note that mmio generation is increased even when we just start, or stop,
dirty logging for some memory slot, in which case users cannot expect
all shadow pages to be zapped.

printk_ratelimited() is used for this taking into account the problems
that we can see the information many times when we start multiple VMs
and guests can trigger this by reading ROM in a loop for example.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:49 +03:00
Xiao Guangrong accaefe07d KVM: MMU: document clear_spte_count
Document it to Documentation/virtual/kvm/mmu.txt

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:42 +03:00
Xiao Guangrong a8eca9dcc6 KVM: MMU: drop kvm_mmu_zap_mmio_sptes
Drop kvm_mmu_zap_mmio_sptes and use kvm_mmu_invalidate_zap_all_pages
instead to handle mmio generation number overflow

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:40 +03:00
Xiao Guangrong 69c9ea93ea KVM: MMU: init kvm generation close to mmio wrap-around value
Then it has the chance to trigger mmio generation number wrap-around

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
[Change from MMIO_MAX_GEN - 13 to MMIO_MAX_GEN - 150, because 13 is
 very close to the number of calls to KVM_SET_USER_MEMORY_REGION
 before the guest is started and there is any chance to create any
 spte. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:39 +03:00
Xiao Guangrong 089504c0d4 KVM: MMU: add tracepoint for check_mmio_spte
It is useful for debug mmio spte invalidation

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:37 +03:00
Xiao Guangrong f8f559422b KVM: MMU: fast invalidate all mmio sptes
This patch tries to introduce a very simple and scale way to invalidate
all mmio sptes - it need not walk any shadow pages and hold mmu-lock

KVM maintains a global mmio valid generation-number which is stored in
kvm->memslots.generation and every mmio spte stores the current global
generation-number into his available bits when it is created

When KVM need zap all mmio sptes, it just simply increase the global
generation-number. When guests do mmio access, KVM intercepts a MMIO #PF
then it walks the shadow page table and get the mmio spte. If the
generation-number on the spte does not equal the global generation-number,
it will go to the normal #PF handler to update the mmio spte

Since 19 bits are used to store generation-number on mmio spte, we zap all
mmio sptes when the number is round

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:36 +03:00
Xiao Guangrong b37fbea6ce KVM: MMU: make return value of mmio page fault handler more readable
Define some meaningful names instead of raw code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:20:17 +03:00
Xiao Guangrong f2fd125d32 KVM: MMU: store generation-number into mmio spte
Store the generation-number into bit3 ~ bit11 and bit52 ~ bit61, totally
19 bits can be used, it should be enough for nearly all most common cases

In this patch, the generation-number is always 0, it will be changed in
the later patch

[Gleb: masking generation bits from spte in get_mmio_spte_gfn() and
       get_mmio_spte_access()]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-27 14:18:15 +03:00
H. Peter Anvin afcbf13fa6 x86: Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE
Rename X86_CR4_RDWRGSFS to X86_CR4_FSGSBASE to match the SDM.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Link: http://lkml.kernel.org/n/tip-buq1evi5dpykxx7ak6amaam0@git.kernel.org
2013-06-25 16:26:06 -07:00
H. Peter Anvin 1adfa76a95 x86, flags: Rename X86_EFLAGS_BIT1 to X86_EFLAGS_FIXED
Bit 1 in the x86 EFLAGS is always set.  Name the macro something that
actually tries to explain what it is all about, rather than being a
tautology.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: http://lkml.kernel.org/n/tip-f10rx5vjjm6tfnt8o1wseb3v@git.kernel.org
2013-06-25 16:25:32 -07:00
Xiao Guangrong 885032b910 KVM: MMU: retain more available bits on mmio spte
Let mmio spte only use bit62 and bit63 on upper 32 bits, then bit 52 ~ bit 61
can be used for other purposes

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-20 23:33:20 +02:00
Zhanghaoyu (A) 764bcbc5a6 KVM: x86: remove vcpu's CPL check in host-invoked XCR set
__kvm_set_xcr function does the CPL check when set xcr. __kvm_set_xcr is
called in two flows, one is invoked by guest, call stack shown as below,

  handle_xsetbv(or xsetbv_interception)
    kvm_set_xcr
      __kvm_set_xcr

the other one is invoked by host, for example during system reset:

  kvm_arch_vcpu_ioctl
    kvm_vcpu_ioctl_x86_set_xcrs
      __kvm_set_xcr

The former does need the CPL check, but the latter does not.

Cc: stable@vger.kernel.org
Signed-off-by: Zhang Haoyu <haoyu.zhang@huawei.com>
[Tweaks to commit message. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-06-18 09:55:35 +02:00
Marcelo Tosatti 8915aa27d5 KVM: x86: handle idiv overflow at kvm_write_tsc
Its possible that idivl overflows (due to large delta stored in usdiff,
valid scenario).

Create an exception handler to catch the overflow exception (division by zero
is protected by vcpu->arch.virtual_tsc_khz check), and interpret it accordingly
(delta is larger than USEC_PER_SEC).

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=969644

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-12 14:24:11 +03:00
Gleb Natapov 05988d728d KVM: MMU: reduce KVM_REQ_MMU_RELOAD when root page is zapped
Quote Gleb's mail:
| why don't we check for sp->role.invalid in
| kvm_mmu_prepare_zap_page before calling kvm_reload_remote_mmus()?

and

| Actually we can add check for is_obsolete_sp() there too since
| kvm_mmu_invalidate_all_pages() already calls kvm_reload_remote_mmus()
| after incrementing mmu_valid_gen.

[ Xiao: add some comments and the check of is_obsolete_sp() ]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:34:02 +03:00
Xiao Guangrong 365c886860 KVM: MMU: reclaim the zapped-obsolete page first
As Marcelo pointed out that
| "(retention of large number of pages while zapping)
| can be fatal, it can lead to OOM and host crash"

We introduce a list, kvm->arch.zapped_obsolete_pages, to link all
the pages which are deleted from the mmu cache but not actually
freed. When page reclaiming is needed, we always zap this kind of
pages first.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:33:33 +03:00
Xiao Guangrong f34d251d66 KVM: MMU: collapse TLB flushes when zap all pages
kvm_zap_obsolete_pages uses lock-break technique to zap pages,
it will flush tlb every time when it does lock-break

We can reload mmu on all vcpus after updating the generation
number so that the obsolete pages are not used on any vcpus,
after that we do not need to flush tlb when obsolete pages
are zapped

It will do kvm_mmu_prepare_zap_page many times and use one
kvm_mmu_commit_zap_page to collapse tlb flush, the side-effects
is that causes obsolete pages unlinked from active_list but leave
on hash-list, so we add the comment around the hash list walker

Note: kvm_mmu_commit_zap_page is still needed before free
the pages since other vcpus may be doing locklessly shadow
page walking

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:33:18 +03:00
Xiao Guangrong e7d11c7a89 KVM: MMU: zap pages in batch
Zap at lease 10 pages before releasing mmu-lock to reduce the overload
caused by requiring lock

After the patch, kvm_zap_obsolete_pages can forward progress anyway,
so update the comments

[ It improves the case 0.6% ~ 1% that do kernel building meanwhile read
  PCI ROM. ]

Note: i am not sure that "10" is the best speculative value, i just
guessed that '10' can make vcpu do not spend long time on
kvm_zap_obsolete_pages and do not cause mmu-lock too hungry.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:33:10 +03:00
Xiao Guangrong 7f52af7412 KVM: MMU: do not reuse the obsolete page
The obsolete page will be zapped soon, do not reuse it to
reduce future page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:33:04 +03:00
Xiao Guangrong 35006126f0 KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages
It is good for debug and development

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:32:57 +03:00
Xiao Guangrong 2248b02321 KVM: MMU: show mmu_valid_gen in shadow page related tracepoints
Show sp->mmu_valid_gen

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:32:49 +03:00
Xiao Guangrong 6ca18b6950 KVM: x86: use the fast way to invalidate all pages
Replace kvm_mmu_zap_all by kvm_mmu_invalidate_zap_all_pages

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:32:42 +03:00
Xiao Guangrong 5304b8d37c KVM: MMU: fast invalidate all pages
The current kvm_mmu_zap_all is really slow - it is holding mmu-lock to
walk and zap all shadow pages one by one, also it need to zap all guest
page's rmap and all shadow page's parent spte list. Particularly, things
become worse if guest uses more memory or vcpus. It is not good for
scalability

In this patch, we introduce a faster way to invalidate all shadow pages.
KVM maintains a global mmu invalid generation-number which is stored in
kvm->arch.mmu_valid_gen and every shadow page stores the current global
generation-number into sp->mmu_valid_gen when it is created

When KVM need zap all shadow pages sptes, it just simply increase the
global generation-number then reload root shadow pages on all vcpus.
Vcpu will create a new shadow page table according to current kvm's
generation-number. It ensures the old pages are not used any more.
Then the obsolete pages (sp->mmu_valid_gen != kvm->arch.mmu_valid_gen)
are zapped by using lock-break technique

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:32:33 +03:00
Xiao Guangrong a2ae162265 KVM: MMU: drop unnecessary kvm_reload_remote_mmus
It is the responsibility of kvm_mmu_zap_all that keeps the
consistent of mmu and tlbs. And it is also unnecessary after
zap all mmio sptes since no mmio spte exists on root shadow
page and it can not be cached into tlb

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:32:24 +03:00
Xiao Guangrong 758ccc89b8 KVM: x86: drop calling kvm_mmu_zap_all in emulator_fix_hypercall
Quote Gleb's mail:

| Back then kvm->lock protected memslot access so code like:
|
| mutex_lock(&vcpu->kvm->lock);
| kvm_mmu_zap_all(vcpu->kvm);
| mutex_unlock(&vcpu->kvm->lock);
|
| which is what 7aa81cc0 does was enough to guaranty that no vcpu will
| run while code is patched. This is no longer the case and
| mutex_lock(&vcpu->kvm->lock); is gone from that code path long time ago,
| so now kvm_mmu_zap_all() there is useless and the code is incorrect.

So we drop it and it will be fixed later

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-05 12:32:00 +03:00
Gleb Natapov 299018f44a KVM: Fix race in apic->pending_events processing
apic->pending_events processing has a race that may cause INIT and
SIPI
processing to be reordered:

vpu0:                            vcpu1:
set INIT
                               test_and_clear_bit(KVM_APIC_INIT)
                                  process INIT
set INIT
set SIPI
                               test_and_clear_bit(KVM_APIC_SIPI)
                                  process SIPI

At the end INIT is left pending in pending_events. The following patch
fixes this by latching pending event before processing them.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-03 11:32:39 +03:00
Paolo Bonzini 8acb42070e KVM: fix sil/dil/bpl/spl in the mod/rm fields
The x86-64 extended low-byte registers were fetched correctly from reg,
but not from mod/rm.

This fixes another bug in the boot of RHEL5.9 64-bit, but it is still
not enough.

Cc: <stable@vger.kernel.org> # 3.9
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-03 11:27:12 +03:00
Paolo Bonzini 103f98ea64 KVM: Emulate multibyte NOP
This is encountered when booting RHEL5.9 64-bit.  There is another bug
after this one that is not a simple emulation failure, but this one lets
the boot proceed a bit.

Cc: <stable@vger.kernel.org> # 3.9
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-03 11:20:53 +03:00
Avi Kivity e47a5f5fb7 KVM: x86 emulator: convert XADD to fastop
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21 15:43:24 +03:00
Avi Kivity 203831e8e4 KVM: x86 emulator: drop unused old-style inline emulation
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21 15:43:23 +03:00
Avi Kivity b8c0b6ae49 KVM: x86 emulator: convert DIV/IDIV to fastop
Since DIV and IDIV can generate exceptions, we need an additional output
parameter indicating whether an execption has occured.  To avoid increasing
register pressure on i386, we use %rsi, which is already allocated for
the fastop code pointer.

Gleb: added comment about fop usage as exception indication.

Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21 15:43:21 +03:00
Avi Kivity b9fa409b00 KVM: x86 emulator: convert single-operand MUL/IMUL to fastop
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21 15:43:20 +03:00
Avi Kivity 017da7b604 KVM: x86 emulator: Switch fastop src operand to RDX
This makes OpAccHi useful.

Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21 15:43:19 +03:00
Avi Kivity ab2c5ce666 KVM: x86 emulator: switch MUL/DIV to DstXacc
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21 15:43:17 +03:00
Avi Kivity 820207c8fc KVM: x86 emulator: decode extended accumulator explicity
Single-operand MUL and DIV access an extended accumulator: AX for byte
instructions, and DX:AX, EDX:EAX, or RDX:RAX for larger-sized instructions.
Add support for fetching the extended accumulator.

In order not to change things too much, RDX is loaded into Src2, which is
already loaded by fastop().  This avoids increasing register pressure on
i386.

Gleb: disable src writeback for ByteOp div/mul.

Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21 15:43:16 +03:00
Avi Kivity fb32b1eda2 KVM: x86 emulator: add support for writing back the source operand
Some instructions write back the source operand, not just the destination.
Add support for doing this via the decode flags.

Gleb: add BUG_ON() to prevent source to be memory operand.

Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-21 15:43:14 +03:00
Marc Zyngier 535cf7b3b1 KVM: get rid of $(addprefix ../../../virt/kvm/, ...) in Makefiles
As requested by the KVM maintainers, remove the addprefix used to
refer to the main KVM code from the arch code, and replace it with
a KVM variable that does the same thing.

Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Christoffer Dall <cdall@cs.columbia.edu>
Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Alexander Graf <agraf@suse.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-19 15:14:00 +03:00
Gleb Natapov 35af577aac KVM: MMU: clenaup locking in mmu_free_roots()
Do locking around each case separately instead of having one lock and two
unlocks. Move root_hpa assignment out of the lock.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-16 11:55:51 +03:00
Marcelo Tosatti 0061d53daf KVM: x86: limit difference between kvmclock updates
kvmclock updates which are isolated to a given vcpu, such as vcpu->cpu
migration, should not allow system_timestamp from the rest of the vcpus
to remain static. Otherwise ntp frequency correction applies to one
vcpu's system_timestamp but not the others.

So in those cases, request a kvmclock update for all vcpus. The worst
case for a remote vcpu to update its kvmclock is then bounded by maximum
nohz sleep latency.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-15 20:36:09 +03:00
Jan Kiszka f1ed0450a5 KVM: x86: Remove support for reporting coalesced APIC IRQs
Since the arrival of posted interrupt support we can no longer guarantee
that coalesced IRQs are always reported to the IRQ source. Moreover,
accumulated APIC timer events could cause a busy loop when a VCPU should
rather be halted. The consensus is to remove coalesced tracking from the
LAPIC.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-14 12:09:02 +03:00
Takuya Yoshikawa e2858b4ab1 KVM: MMU: Use kvm_mmu_sync_roots() in kvm_mmu_load()
No need to open-code this function.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-12 14:52:55 +03:00
Paolo Bonzini 326f578f7e KVM: emulator: emulate SALC
This is an almost-undocumented instruction available in 32-bit mode.
I say "almost" undocumented because AMD documents it in their opcode
maps just to say that it is unavailable in 64-bit mode (sections
"A.2.1 One-Byte Opcodes" and "B.3 Invalid and Reassigned Instructions
in 64-Bit Mode").

It is roughly equivalent to "sbb %al, %al" except it does not
set the flags.  Use fastop to emulate it, but do not use the opcode
directly because it would fail if the host is 64-bit!

Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@vger.kernel.org # 3.9
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-09 13:15:08 +03:00
Paolo Bonzini 7fa57952d7 KVM: emulator: emulate XLAT
This is used by SGABIOS, KVM breaks with emulate_invalid_guest_state=1.
It is just a MOV in disguise, with a funny source address.

Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@vger.kernel.org # 3.9
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-09 13:14:51 +03:00
Paolo Bonzini a035d5c64d KVM: emulator: emulate AAM
This is used by SGABIOS, KVM breaks with emulate_invalid_guest_state=1.

AAM needs the source operand to be unsigned; do the same in AAD as well
for consistency, even though it does not affect the result.

Reported-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: stable@vger.kernel.org # 3.9
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-09 13:12:58 +03:00
Gleb Natapov 8d76c49e9f KVM: VMX: fix halt emulation while emulating invalid guest sate
The invalid guest state emulation loop does not check halt_request
which causes 100% cpu loop while guest is in halt and in invalid
state, but more serious issue is that this leaves halt_request set, so
random instruction emulated by vm86 #GP exit can be interpreted
as halt which causes guest hang. Fix both problems by handling
halt_request in emulation loop.

Reported-by: Tomas Papan <tomas.papan@gmail.com>
Tested-by: Tomas Papan <tomas.papan@gmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
CC: stable@vger.kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-09 09:04:56 +03:00
Marcelo Tosatti 42bdf991f4 KVM: x86: fix maintenance of guest/host xcr0 state
Emulation of xcr0 writes zero guest_xcr0_loaded variable so that
subsequent VM-entry reloads CPU's xcr0 with guests xcr0 value.

However, this is incorrect because guest_xcr0_loaded variable is
read to decide whether to reload hosts xcr0.

In case the vcpu thread is scheduled out after the guest_xcr0_loaded = 0
assignment, and scheduler decides to preload FPU:

switch_to
{
  __switch_to
    __math_state_restore
      restore_fpu_checking
        fpu_restore_checking
          if (use_xsave())
              fpu_xrstor_checking
		xrstor64 with CPU's xcr0 == guests xcr0

Fix by properly restoring hosts xcr0 during emulation of xcr0 writes.

Analyzed-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-05-08 12:47:43 +03:00
Linus Torvalds 01227a889e Merge tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Gleb Natapov:
 "Highlights of the updates are:

  general:
   - new emulated device API
   - legacy device assignment is now optional
   - irqfd interface is more generic and can be shared between arches

  x86:
   - VMCS shadow support and other nested VMX improvements
   - APIC virtualization and Posted Interrupt hardware support
   - Optimize mmio spte zapping

  ppc:
    - BookE: in-kernel MPIC emulation with irqfd support
    - Book3S: in-kernel XICS emulation (incomplete)
    - Book3S: HV: migration fixes
    - BookE: more debug support preparation
    - BookE: e6500 support

  ARM:
   - reworking of Hyp idmaps

  s390:
   - ioeventfd for virtio-ccw

  And many other bug fixes, cleanups and improvements"

* tag 'kvm-3.10-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits)
  kvm: Add compat_ioctl for device control API
  KVM: x86: Account for failing enable_irq_window for NMI window request
  KVM: PPC: Book3S: Add API for in-kernel XICS emulation
  kvm/ppc/mpic: fix missing unlock in set_base_addr()
  kvm/ppc: Hold srcu lock when calling kvm_io_bus_read/write
  kvm/ppc/mpic: remove users
  kvm/ppc/mpic: fix mmio region lists when multiple guests used
  kvm/ppc/mpic: remove default routes from documentation
  kvm: KVM_CAP_IOMMU only available with device assignment
  ARM: KVM: iterate over all CPUs for CPU compatibility check
  KVM: ARM: Fix spelling in error message
  ARM: KVM: define KVM_ARM_MAX_VCPUS unconditionally
  KVM: ARM: Fix API documentation for ONE_REG encoding
  ARM: KVM: promote vfp_host pointer to generic host cpu context
  ARM: KVM: add architecture specific hook for capabilities
  ARM: KVM: perform HYP initilization for hotplugged CPUs
  ARM: KVM: switch to a dual-step HYP init code
  ARM: KVM: rework HYP page table freeing
  ARM: KVM: enforce maximum size for identity mapped code
  ARM: KVM: move to a KVM provided HYP idmap
  ...
2013-05-05 14:47:31 -07:00
Jan Kiszka 03b28f8133 KVM: x86: Account for failing enable_irq_window for NMI window request
With VMX, enable_irq_window can now return -EBUSY, in which case an
immediate exit shall be requested before entering the guest. Account for
this also in enable_nmi_window which uses enable_irq_window in absence
of vnmi support, e.g.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-05-02 22:17:38 -03:00
Linus Torvalds 1e2f5b598a Merge branch 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 paravirt update from Ingo Molnar:
 "Various paravirtualization related changes - the biggest one makes
  guest support optional via CONFIG_HYPERVISOR_GUEST"

* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, wakeup, sleep: Use pvops functions for changing GDT entries
  x86, xen, gdt: Remove the pvops variant of store_gdt.
  x86-32, gdt: Store/load GDT for ACPI S3 or hibernation/resume path is not needed
  x86-64, gdt: Store/load GDT for ACPI S3 or hibernate/resume path is not needed.
  x86: Make Linux guest support optional
  x86, Kconfig: Move PARAVIRT_DEBUG into the paravirt menu
2013-04-30 08:41:21 -07:00
Linus Torvalds df8edfa9af Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid changes from Ingo Molnar:
 "The biggest change is x86 CPU bug handling refactoring and cleanups,
  by Borislav Petkov"

* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, CPU, AMD: Drop useless label
  x86, AMD: Correct {rd,wr}msr_amd_safe warnings
  x86: Fold-in trivial check_config function
  x86, cpu: Convert AMD Erratum 400
  x86, cpu: Convert AMD Erratum 383
  x86, cpu: Convert Cyrix coma bug detection
  x86, cpu: Convert FDIV bug detection
  x86, cpu: Convert F00F bug detection
  x86, cpu: Expand cpufeature facility to include cpu bugs
2013-04-30 08:34:38 -07:00
Alex Williamson 4cee4b72f1 kvm: KVM_CAP_IOMMU only available with device assignment
Fix build with CONFIG_PCI unset by linking KVM_CAP_IOMMU to
device assignment config option.  It has no purpose otherwise.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-29 23:08:50 -03:00
Jan Kiszka 5a2892ce72 KVM: nVMX: Skip PF interception check when queuing during nested run
While a nested run is pending, vmx_queue_exception is only called to
requeue exceptions that were previously picked up via
vmx_cancel_injection. Therefore, we must not check for PF interception
by L1, possibly causing a bogus nested vmexit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 13:34:39 +03:00
Alex Williamson 2a5bab1004 kvm: Allow build-time configuration of KVM device assignment
We hope to at some point deprecate KVM legacy device assignment in
favor of VFIO-based assignment.  Towards that end, allow legacy
device assignment to be deconfigured.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 12:58:56 +03:00
Gleb Natapov 064d1afaa5 Merge git://github.com/agraf/linux-2.6.git kvm-ppc-next into queue 2013-04-28 12:50:07 +03:00
Jan Kiszka 730dca42c1 KVM: x86: Rework request for immediate exit
The VMX implementation of enable_irq_window raised
KVM_REQ_IMMEDIATE_EXIT after we checked it in vcpu_enter_guest. This
caused infinite loops on vmentry. Fix it by letting enable_irq_window
signal the need for an immediate exit via its return value and drop
KVM_REQ_IMMEDIATE_EXIT.

This issue only affects nested VMX scenarios.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 12:44:18 +03:00
Borislav Petkov 6614c7d042 kvm, svm: Fix typo in printk message
It is "exit_int_info". It is actually EXITINTINFO in the official docs
but we don't like screaming docs.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 12:42:41 +03:00
Jan Kiszka cb0c8cda13 KVM: VMX: remove unprintable characters from comment
Slipped in while copy&pasting from the SDM.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-28 08:55:59 +03:00
Alexander Graf 7df35f5496 KVM: Move irqfd resample cap handling to generic code
Now that we have most irqfd code completely platform agnostic, let's move
irqfd's resample capability return to generic code as well.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2013-04-26 20:27:19 +02:00
Alexander Graf 1c9f8520bd KVM: Extract generic irqchip logic into irqchip.c
The current irq_comm.c file contains pieces of code that are generic
across different irqchip implementations, as well as code that is
fully IOAPIC specific.

Split the generic bits out into irqchip.c.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2013-04-26 20:27:17 +02:00
Alexander Graf a725d56a02 KVM: Introduce CONFIG_HAVE_KVM_IRQ_ROUTING
Quite a bit of code in KVM has been conditionalized on availability of
IOAPIC emulation. However, most of it is generically applicable to
platforms that don't have an IOPIC, but a different type of irq chip.

Make code that only relies on IRQ routing, not an APIC itself, on
CONFIG_HAVE_KVM_IRQ_ROUTING, so that we can reuse it later.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
2013-04-26 20:27:14 +02:00
Gleb Natapov 660696d1d1 KVM: X86 emulator: fix source operand decoding for 8bit mov[zs]x instructions
Source operand for one byte mov[zs]x is decoded incorrectly if it is in
high byte register. Fix that.

Cc: stable@vger.kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-25 10:03:50 +03:00
Jan Kiszka d1fa0352a1 KVM: nVMX: VM_ENTRY/EXIT_LOAD_IA32_EFER overrides EFER.LMA settings
If we load the complete EFER MSR on entry or exit, EFER.LMA (and LME)
loading is skipped. Their consistency is already checked now before
starting the transition.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 12:53:52 +03:00
Jan Kiszka 384bb78327 KVM: nVMX: Validate EFER values for VM_ENTRY/EXIT_LOAD_IA32_EFER
As we may emulate the loading of EFER on VM-entry and VM-exit, implement
the checks that VMX performs on the guest and host values on vmlaunch/
vmresume. Factor out kvm_valid_efer for this purpose which checks for
set reserved bits.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 12:53:42 +03:00
Jan Kiszka ea8ceb8354 KVM: nVMX: Fix conditions for NMI injection
The logic for checking if interrupts can be injected has to be applied
also on NMIs. The difference is that if NMI interception is on these
events are consumed and blocked by the VM exit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 11:10:49 +03:00
Jan Kiszka 2505dc9fad KVM: VMX: Move vmx_nmi_allowed after vmx_set_nmi_mask
vmx_set_nmi_mask will soon be used by vmx_nmi_allowed. No functional
changes.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 11:10:49 +03:00
Andrew Honig 27469d29b3 KVM: x86: Fix memory leak in vmx.c
If userspace creates and destroys multiple VMs within the same process
we leak 20k of memory in the userspace process context per VM.  This
patch frees the memory in kvm_arch_destroy_vm.  If the process exits
without closing the VM file descriptor or the file descriptor has been
shared with another process then we don't free the memory.

It's still possible for a user space process to leak memory if the last
process to close the fd for the VM is not the process that created it.
However, this is an unexpected case that's only caused by a user space
process that's misbehaving.

Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 11:02:38 +03:00
Wei Yongjun f179735921 KVM: x86: fix error return code in kvm_arch_vcpu_init()
Fix to return a negative error code from the error handling
case instead of 0, as returned elsewhere in this function.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:56:44 +03:00
Abel Gordon 8a1b9dd000 KVM: nVMX: Enable and disable shadow vmcs functionality
Once L1 loads VMCS12 we enable shadow-vmcs capability and copy all the VMCS12
shadowed fields to the shadow vmcs.  When we release the VMCS12, we also
disable shadow-vmcs capability.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:55 +03:00
Abel Gordon 012f83cb2f KVM: nVMX: Synchronize VMCS12 content with the shadow vmcs
Synchronize between the VMCS12 software controlled structure and the
processor-specific shadow vmcs

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:45 +03:00
Abel Gordon c3114420d1 KVM: nVMX: Copy VMCS12 to processor-specific shadow vmcs
Introduce a function used to copy fields from the software controlled VMCS12
to the processor-specific shadow vmcs

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:37 +03:00
Abel Gordon 16f5b9034b KVM: nVMX: Copy processor-specific shadow-vmcs to VMCS12
Introduce a function used to copy fields from the processor-specific shadow
vmcs to the software controlled VMCS12

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:24 +03:00
Abel Gordon e7953d7fab KVM: nVMX: Release shadow vmcs
Unmap vmcs12 and release the corresponding shadow vmcs

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:17 +03:00
Abel Gordon 8de4883370 KVM: nVMX: Allocate shadow vmcs
Allocate a shadow vmcs used by the processor to shadow part of the fields
stored in the software defined VMCS12 (let L1 access fields without causing
exits). Note we keep a shadow vmcs only for the current vmcs12.  Once a vmcs12
becomes non-current, its shadow vmcs is released.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:10 +03:00
Abel Gordon 145c28dd19 KVM: nVMX: Fix VMXON emulation
handle_vmon doesn't check if L1 is already in root mode (VMXON
was previously called). This patch adds this missing check and calls
nested_vmx_failValid if VMX is already ON.
We need this check because L0 will allocate the shadow vmcs when L1
executes VMXON and we want to avoid host leaks (due to shadow vmcs
allocation) if L1 executes VMXON repeatedly.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:52:01 +03:00
Abel Gordon 20b97feaf6 KVM: nVMX: Refactor handle_vmwrite
Refactor existent code so we re-use vmcs12_write_any to copy fields from the
shadow vmcs specified by the link pointer (used by the processor,
implementation-specific) to the VMCS12 software format used by L0 to hold
the fields in L1 memory address space.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:51:44 +03:00
Abel Gordon 4607c2d7a2 KVM: nVMX: Introduce vmread and vmwrite bitmaps
Prepare vmread and vmwrite bitmaps according to a pre-specified list of fields.
These lists are intended to specifiy most frequent accessed fields so we can
minimize the number of fields that are copied from/to the software controlled
VMCS12 format to/from to processor-specific shadow vmcs. The lists were built
measuring the VMCS fields access rate after L2 Ubuntu 12.04 booted when it was
running on top of L1 KVM, also Ubuntu 12.04. Note that during boot there were
additional fields which were frequently modified but they were not added to
these lists because after boot these fields were not longer accessed by L1.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:51:34 +03:00
Abel Gordon abc4fc58c5 KVM: nVMX: Detect shadow-vmcs capability
Add logic required to detect if shadow-vmcs is supported by the
processor. Introduce a new kernel module parameter to specify if L0 should use
shadow vmcs (or not) to run L1.

Signed-off-by: Abel Gordon <abelg@il.ibm.com>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-22 10:51:21 +03:00
Zhang, Yang Z 6ffbbbbab3 KVM: x86: Fix posted interrupt with CONFIG_SMP=n
->send_IPI_mask is not defined on UP.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-17 23:11:54 -03:00
Gleb Natapov f13882d84d KVM: VMX: Fix check guest state validity if a guest is in VM86 mode
If guest vcpu is in VM86 mode the vcpu state should be checked as if in
real mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 18:34:19 -03:00
Paolo Bonzini 26539bd0e4 KVM: nVMX: check vmcs12 for valid activity state
KVM does not use the activity state VMCS field, and does not support
it in nested VMX either (the corresponding bits in the misc VMX feature
MSR are zero).  Fail entry if the activity state is set to anything but
"active".

Since the value will always be the same for L1 and L2, we do not need
to read and write the corresponding VMCS field on L1/L2 transitions,
either.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 18:22:14 -03:00
Yang Zhang 5a71785dde KVM: VMX: Use posted interrupt to deliver virtual interrupt
If posted interrupt is avaliable, then uses it to inject virtual
interrupt to guest.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:41 -03:00
Yang Zhang a20ed54d6e KVM: VMX: Add the deliver posted interrupt algorithm
Only deliver the posted interrupt when target vcpu is running
and there is no previous interrupt pending in pir.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang cf9e65b773 KVM: Set TMR when programming ioapic entry
We already know the trigger mode of a given interrupt when programming
the ioapice entry. So it's not necessary to set it in each interrupt
delivery.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang 3d81bc7e96 KVM: Call common update function when ioapic entry changed.
Both TMR and EOI exit bitmap need to be updated when ioapic changed
or vcpu's id/ldr/dfr changed. So use common function instead eoi exit
bitmap specific function.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang 01e439be77 KVM: VMX: Check the posted interrupt capability
Detect the posted interrupt feature. If it exists, then set it in vmcs_config.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:40 -03:00
Yang Zhang a547c6db4d KVM: VMX: Enable acknowledge interupt on vmexit
The "acknowledge interrupt on exit" feature controls processor behavior
for external interrupt acknowledgement. When this control is set, the
processor acknowledges the interrupt controller to acquire the
interrupt vector on VM exit.

After enabling this feature, an interrupt which arrived when target cpu is
running in vmx non-root mode will be handled by vmx handler instead of handler
in idt. Currently, vmx handler only fakes an interrupt stack and jump to idt
table to let real handler to handle it. Further, we will recognize the interrupt
and only delivery the interrupt which not belong to current vcpu through idt table.
The interrupt which belonged to current vcpu will be handled inside vmx handler.
This will reduce the interrupt handle cost of KVM.

Also, interrupt enable logic is changed if this feature is turnning on:
Before this patch, hypervior call local_irq_enable() to enable it directly.
Now IF bit is set on interrupt stack frame, and will be enabled on a return from
interrupt handler if exterrupt interrupt exists. If no external interrupt, still
call local_irq_enable() to enable it.

Refer to Intel SDM volum 3, chapter 33.2.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-16 16:32:39 -03:00
Yang Zhang aa2fbe6d44 KVM: Let ioapic know the irq line status
Userspace may deliver RTC interrupt without query the status. So we
want to track RTC EOI for this case.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:34 -03:00
Yang Zhang 106069193c KVM: Add reset/restore rtc_status support
restore rtc_status from migration or save/restore

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:34 -03:00
Yang Zhang b4f2225c07 KVM: Return destination vcpu on interrupt injection
Add a new parameter to know vcpus who received the interrupt.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:34 -03:00
Yang Zhang 1fcc7890db KVM: Add vcpu info to ioapic_update_eoi()
Add vcpu info to ioapic_update_eoi, so we can know which vcpu
issued this EOI.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-04-15 23:20:33 -03:00
Jan Kiszka c0d1c770c0 KVM: nVMX: Avoid reading VM_EXIT_INTR_ERROR_CODE needlessly on nested exits
We only need to update vm_exit_intr_error_code if there is a valid exit
interruption information and it comes with a valid error code.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:10 +03:00
Jan Kiszka e8457c67a4 KVM: nVMX: Fix conditions for interrupt injection
If we are entering guest mode, we do not want L0 to interrupt this
vmentry with all its side effects on the vmcs. Therefore, injection
shall be disallowed during L1->L2 transitions, as in the previous
version. However, this check is conceptually independent of
nested_exit_on_intr, so decouple it.

If L1 traps external interrupts, we can kick the guest from L2 to L1,
also just like the previous code worked. But we no longer need to
consider L1's idt_vectoring_info_field. It will always be empty at this
point. Instead, if L2 has pending events, those are now found in the
architectural queues and will, thus, prevent vmx_interrupt_allowed from
being called at all.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:09 +03:00
Jan Kiszka 5f3d579997 KVM: nVMX: Rework event injection and recovery
The basic idea is to always transfer the pending event injection on
vmexit into the architectural state of the VCPU and then drop it from
there if it turns out that we left L2 to enter L1, i.e. if we enter
prepare_vmcs12.

vmcs12_save_pending_events takes care to transfer pending L0 events into
the queue of L1. That is mandatory as L1 may decide to switch the guest
state completely, invalidating or preserving the pending events for
later injection (including on a different node, once we support
migration).

This concept is based on the rule that a pending vmlaunch/vmresume is
not canceled. Otherwise, we would risk to lose injected events or leak
them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the
entry of nested_vmx_vmexit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:07 +03:00
Jan Kiszka 3b656cf764 KVM: nVMX: Fix injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1
Check if the interrupt or NMI window exit is for L1 by testing if it has
the corresponding controls enabled. This is required when we allow
direct injection from L0 to L2

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 18:27:05 +03:00
Gleb Natapov 188424ba10 KVM: emulator: mark 0xff 0x7d opcode as undefined.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:17 +03:00
Gleb Natapov 1146a78b8d KVM: emulator: Do not fail on emulation of undefined opcode
Emulation of undefined opcode should inject #UD instead of causing
emulation failure. Do that by moving Undefined flag check to emulation
stage and injection #UD there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:17 +03:00
Gleb Natapov 991eebf9f8 KVM: VMX: do not try to reexecute failed instruction while emulating invalid guest state
During invalid guest state emulation vcpu cannot enter guest mode to try
to reexecute instruction that emulator failed to emulate, so emulation
will happen again and again.  Prevent that by telling the emulator that
instruction reexecution should not be attempted.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:17 +03:00
Gleb Natapov 0b789eee2c KVM: emulator: fix unimplemented instruction detection
Unimplemented instruction detection is broken for group instructions
since it relies on "flags" field of opcode to be zero, but all
instructions in a group inherit flags from a group encoding. Fix that by
having a separate flag for unimplemented instructions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-14 09:44:16 +03:00
Konrad Rzeszutek Wilk 357d122670 x86, xen, gdt: Remove the pvops variant of store_gdt.
The two use-cases where we needed to store the GDT were during ACPI S3 suspend
and resume. As the patches:
 x86/gdt/i386: store/load GDT for ACPI S3 or hibernation/resume path is not needed
 x86/gdt/64-bit: store/load GDT for ACPI S3 or hibernate/resume path is not needed.

have demonstrated - there are other mechanism by which the GDT is
saved and reloaded during early resume path.

Hence we do not need to worry about the pvops call-chain for saving the
GDT and can and can eliminate it. The other areas where the store_gdt is
used are never going to be hit when running under the pvops platforms.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Link: http://lkml.kernel.org/r/1365194544-14648-4-git-send-email-konrad.wilk@oracle.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-04-11 15:40:38 -07:00
Kevin Wolf f8da94e9e4 KVM: x86 emulator: Fix segment loading in VM86
This fixes a regression introduced in commit 03ebebeb1 ("KVM: x86
emulator: Leave segment limit and attributs alone in real mode").

The mentioned commit changed the segment descriptors for both real mode
and VM86 to only update the segment base instead of creating a
completely new descriptor with limit 0xffff so that unreal mode keeps
working across a segment register reload.

This leads to an invalid segment descriptor in the eyes of VMX, which
seems to be okay for real mode because KVM will fix it up before the
next VM entry or emulate the state, but it doesn't do this if the guest
is in VM86, so we end up with:

  KVM: entry failed, hardware error 0x80000021

Fix this by effectively reverting commit 03ebebeb1 for VM86 and leaving
it only in place for real mode, which is where it's really needed.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-11 15:53:06 +03:00
Geoff Levand e3ba45b804 KVM: Move kvm_spurious_fault to x86.c
The routine kvm_spurious_fault() is an x86 specific routine, so
move it from virt/kvm/kvm_main.c to arch/x86/kvm/x86.c.

Fixes this sparse warning when building on arm64:

  virt/kvm/kvm_main.c⚠️ symbol 'kvm_spurious_fault' was not declared. Should it be static?

Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 13:02:06 +03:00
Jan Kiszka a63cb56061 KVM: VMX: Add missing braces to avoid redundant error check
The code was already properly aligned, now also add the braces to avoid
that err is checked even if alloc_apic_access_page didn't run and change
it. Found via Coccinelle by Fengguang Wu.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 12:46:06 +03:00
Yang Zhang 458f212e36 KVM: x86: fix memory leak in vmx_init
Free vmx_msr_bitmap_longmode_x2apic and vmx_msr_bitmap_longmode if
kvm_init() fails.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-08 10:56:08 +03:00
Jan Kiszka b8c07d55d0 KVM: nVMX: Check exit control for VM_EXIT_SAVE_IA32_PAT, not entry controls
Obviously a copy&paste mistake: prepare_vmcs12 has to check L1's exit
controls for VM_EXIT_SAVE_IA32_PAT.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-07 14:06:42 +03:00
Yang Zhang 44944d4d28 KVM: Call kvm_apic_match_dest() to check destination vcpu
For a given vcpu, kvm_apic_match_dest() will tell you whether
the vcpu in the destination list quickly. Drop kvm_calculate_eoi_exitmap()
and use kvm_apic_match_dest() instead.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-07 13:55:49 +03:00
Takuya Yoshikawa 450e0b411f Revert "KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page()"
With the following commit, shadow pages can be zapped at random during
a shadow page talbe walk:
  KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page()
  7ddca7e43c

This patch reverts it and fixes __direct_map() and FNAME(fetch)().

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-07 13:13:36 +03:00
Andrew Honig 8f964525a1 KVM: Allow cross page reads and writes from cached translations.
This patch adds support for kvm_gfn_to_hva_cache_init functions for
reads and writes that will cross a page.  If the range falls within
the same memslot, then this will be a fast operation.  If the range
is split between two memslots, then the slower kvm_read_guest and
kvm_write_guest are used.

Tested: Test against kvm_clock unit tests.

Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-07 13:05:35 +03:00
Borislav Petkov e6ee94d58d x86, cpu: Convert AMD Erratum 383
Convert the AMD erratum 383 testing code to the bug infrastructure. This
allows keeping the AMD-specific erratum testing machinery private to
amd.c and not export symbols to modules needlessly.

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1363788448-31325-6-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2013-04-02 10:12:54 -07:00
Paolo Bonzini afd80d85ae pmu: prepare for migration support
In order to migrate the PMU state correctly, we need to restore the
values of MSR_CORE_PERF_GLOBAL_STATUS (a read-only register) and
MSR_CORE_PERF_GLOBAL_OVF_CTRL (which has side effects when written).
We also need to write the full 40-bit value of the performance counter,
which would only be possible with a v3 architectural PMU's full-width
counter MSRs.

To distinguish host-initiated writes from the guest's, pass the
full struct msr_data to kvm_pmu_set_msr.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-04-02 17:42:44 +03:00
Takuya Yoshikawa 81f4f76bbc KVM: MMU: Rename kvm_mmu_free_some_pages() to make_mmu_pages_available()
The current name "kvm_mmu_free_some_pages" should be used for something
that actually frees some shadow pages, as we expect from the name, but
what the function is doing is to make some, KVM_MIN_FREE_MMU_PAGES,
shadow pages available: it does nothing when there are enough.

This patch changes the name to reflect this meaning better; while doing
this renaming, the code in the wrapper function is inlined into the main
body since the whole function will be inlined into the only caller now.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-21 19:45:01 -03:00
Takuya Yoshikawa 7ddca7e43c KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page()
What this function is doing is to ensure that the number of shadow pages
does not exceed the maximum limit stored in n_max_mmu_pages: so this is
placed at every code path that can reach kvm_mmu_alloc_page().

Although it might have some sense to spread this function in each such
code path when it could be called before taking mmu_lock, the rule was
changed not to do so.

Taking this background into account, this patch moves it into
kvm_mmu_alloc_page() and simplifies the code.

Note: the unlikely hint in kvm_mmu_free_some_pages() guarantees that the
overhead of this function is almost zero except when we actually need to
allocate some shadow pages, so we do not need to care about calling it
multiple times in one path by doing kvm_mmu_get_page() a few times.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-21 19:44:56 -03:00
Marcelo Tosatti 2ae33b3896 Merge remote-tracking branch 'upstream/master' into queue
Merge reason:

From: Alexander Graf <agraf@suse.de>

"Just recently this really important patch got pulled into Linus' tree for 3.9:

commit 1674400aae
Author: Anton Blanchard <anton <at> samba.org>
Date:   Tue Mar 12 01:51:51 2013 +0000

Without that commit, I can not boot my G5, thus I can't run automated tests on it against my queue.

Could you please merge kvm/next against linus/master, so that I can base my trees against that?"

* upstream/master: (653 commits)
  PCI: Use ROM images from firmware only if no other ROM source available
  sparc: remove unused "config BITS"
  sparc: delete "if !ULTRA_HAS_POPULATION_COUNT"
  KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798)
  KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797)
  KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796)
  arm64: Kconfig.debug: Remove unused CONFIG_DEBUG_ERRORS
  arm64: Do not select GENERIC_HARDIRQS_NO_DEPRECATED
  inet: limit length of fragment queue hash table bucket lists
  qeth: Fix scatter-gather regression
  qeth: Fix invalid router settings handling
  qeth: delay feature trace
  sgy-cts1000: Remove __dev* attributes
  KVM: x86: fix deadlock in clock-in-progress request handling
  KVM: allow host header to be included even for !CONFIG_KVM
  hwmon: (lm75) Fix tcn75 prefix
  hwmon: (lm75.h) Update header inclusion
  MAINTAINERS: Remove Mark M. Hoffman
  xfs: ensure we capture IO errors correctly
  xfs: fix xfs_iomap_eof_prealloc_initial_size type
  ...

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-21 11:11:52 -03:00
Paolo Bonzini 04b66839d3 KVM: x86: correctly initialize the CS base on reset
The CS base was initialized to 0 on VMX (wrong, but usually overridden
by userspace before starting) or 0xf0000 on SVM.  The correct value is
0xffff0000, and VMX is able to emulate it now, so use it.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-20 17:34:55 -03:00
Andy Honig 0b79459b48 KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797)
There is a potential use after free issue with the handling of
MSR_KVM_SYSTEM_TIME.  If the guest specifies a GPA in a movable or removable
memory such as frame buffers then KVM might continue to write to that
address even after it's removed via KVM_SET_USER_MEMORY_REGION.  KVM pins
the page in memory so it's unlikely to cause an issue, but if the user
space component re-purposes the memory previously used for the guest, then
the guest will be able to corrupt that memory.

Tested: Tested against kvmclock unit test

Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-19 14:17:35 -03:00
Andy Honig c300aa64dd KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796)
If the guest sets the GPA of the time_page so that the request to update the
time straddles a page then KVM will write onto an incorrect page.  The
write is done byusing kmap atomic to get a pointer to the page for the time
structure and then performing a memcpy to that page starting at an offset
that the guest controls.  Well behaved guests always provide a 32-byte aligned
address, however a malicious guest could use this to corrupt host kernel
memory.

Tested: Tested against kvmclock unit test.

Signed-off-by: Andrew Honig <ahonig@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-19 14:17:31 -03:00
Marcelo Tosatti c09664bb44 KVM: x86: fix deadlock in clock-in-progress request handling
There is a deadlock in pvclock handling:

cpu0:                                               cpu1:
kvm_gen_update_masterclock()
                                              kvm_guest_time_update()
 spin_lock(pvclock_gtod_sync_lock)
                                               local_irq_save(flags)

spin_lock(pvclock_gtod_sync_lock)

 kvm_make_mclock_inprogress_request(kvm)
  make_all_cpus_request()
   smp_call_function_many()

Now if smp_call_function_many() called by cpu0 tries to call function on
cpu1 there will be a deadlock.

Fix by moving pvclock_gtod_sync_lock protected section outside irq
disabled section.

Analyzed by Gleb Natapov <gleb@redhat.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
Reported-and-Tested-by: Yongjie Ren <yongjie.ren@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-18 18:03:39 -03:00
Jan Kiszka 4918c6ca68 KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU
Very old user space (namely qemu-kvm before kvm-49) didn't set the TSS
base before running the VCPU. We always warned about this bug, but no
reports about users actually seeing this are known. Time to finally
remove the workaround that effectively prevented to call vmx_vcpu_reset
while already holding the KVM srcu lock.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-18 13:48:15 -03:00
Takuya Yoshikawa 982b3394dd KVM: x86: Optimize mmio spte zapping when creating/moving memslot
When we create or move a memory slot, we need to zap mmio sptes.
Currently, zap_all() is used for this and this is causing two problems:
 - extra page faults after zapping mmu pages
 - long mmu_lock hold time during zapping mmu pages

For the latter, Marcelo reported a disastrous mmu_lock hold time during
hot-plug, which made the guest unresponsive for a long time.

This patch takes a simple way to fix these problems: do not zap mmu
pages unless they are marked mmio cached.  On our test box, this took
only 50us for the 4GB guest and we did not see ms of mmu_lock hold time
any more.

Note that we still need to do zap_all() for other cases.  So another
work is also needed: Xiao's work may be the one.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-14 10:21:21 +02:00
Takuya Yoshikawa 95b0430d1a KVM: MMU: Mark sp mmio cached when creating mmio spte
This will be used not to zap unrelated mmu pages when creating/moving
a memory slot later.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-14 10:21:10 +02:00
Jan Kiszka 0238ea913c KVM: nVMX: Add preemption timer support
Provided the host has this feature, it's straightforward to offer it to
the guest as well. We just need to load to timer value on L2 entry if
the feature was enabled by L1 and watch out for the corresponding exit
reason.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-14 10:01:21 +02:00
Jan Kiszka c18911a23c KVM: nVMX: Provide EFER.LMA saving support
We will need EFER.LMA saving to provide unrestricted guest mode. All
what is missing for this is picking up EFER.LMA from VM_ENTRY_CONTROLS
on L2->L1 switches. If the host does not support EFER.LMA saving,
no change is performed, otherwise we properly emulate for L1 what the
hardware does for L0. Advertise the support, depending on the host
feature.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-14 10:00:55 +02:00
Jan Kiszka eabeaaccfc KVM: nVMX: Clean up and fix pin-based execution controls
Only interrupt and NMI exiting are mandatory for KVM to work, thus can
be exposed to the guest unconditionally, virtual NMI exiting is
optional. So we must not advertise it unless the host supports it.

Introduce the symbolic constant PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR at
this chance.

Reviewed-by:: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-13 16:14:40 +02:00
Jan Kiszka 66450a21f9 KVM: x86: Rework INIT and SIPI handling
A VCPU sending INIT or SIPI to some other VCPU races for setting the
remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED
was overwritten by kvm_emulate_halt and, thus, got lost.

This introduces APIC events for those two signals, keeping them in
kvm_apic until kvm_apic_accept_events is run over the target vcpu
context. kvm_apic_has_events reports to kvm_arch_vcpu_runnable if there
are pending events, thus if vcpu blocking should end.

The patch comes with the side effect of effectively obsoleting
KVM_MP_STATE_SIPI_RECEIVED. We still accept it from user space, but
immediately translate it to KVM_MP_STATE_INIT_RECEIVED + KVM_APIC_SIPI.
The vcpu itself will no longer enter the KVM_MP_STATE_SIPI_RECEIVED
state. That also means we no longer exit to user space after receiving a
SIPI event.

Furthermore, we already reset the VCPU on INIT, only fixing up the code
segment later on when SIPI arrives. Moreover, we fix INIT handling for
the BSP: it never enter wait-for-SIPI but directly starts over on INIT.

Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-13 16:08:10 +02:00
Marcelo Tosatti 5d21881432 KVM: MMU: make kvm_mmu_available_pages robust against n_used_mmu_pages > n_max_mmu_pages
As noticed by Ulrich Obergfell <uobergfe@redhat.com>, the mmu
counters are for beancounting purposes only - so n_used_mmu_pages and
n_max_mmu_pages could be relaxed (example: before f0f5933a16),
resulting in n_used_mmu_pages > n_max_mmu_pages.

Make code robust against n_used_mmu_pages > n_max_mmu_pages.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-13 11:46:09 +02:00
Jan Kiszka 57f252f229 KVM: x86: Drop unused return code from VCPU reset callback
Neither vmx nor svm nor the common part may generate an error on
kvm_vcpu_reset. So drop the return code.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-12 13:25:56 +02:00
Marcelo Tosatti 03ba32cae6 VMX: x86: handle host TSC calibration failure
If the host TSC calibration fails, tsc_khz is zero (see tsc_init.c).
Handle such case properly in KVM (instead of dividing by zero).

https://bugzilla.redhat.com/show_bug.cgi?id=859282

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-12 13:18:00 +02:00
Ioan Orghici 0fa24ce3f5 kvm: remove cast for kmalloc return value
Signed-off-by: Ioan Orghici<ioan.orghici@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-03-11 12:03:54 +02:00
Takuya Yoshikawa 5da596078f KVM: MMU: Introduce a helper function for FIFO zapping
Make the code for zapping the oldest mmu page, placed at the tail of the
active list, a separate function.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-07 17:26:27 -03:00
Takuya Yoshikawa 945315b9db KVM: MMU: Use list_for_each_entry_safe in kvm_mmu_commit_zap_page()
We are traversing the linked list, invalid_list, deleting each entry by
kvm_mmu_free_page().  _safe version is there for such a case.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-07 17:26:27 -03:00
Takuya Yoshikawa 1044b03034 KVM: MMU: Fix and clean up for_each_gfn_* macros
The expression (sp)->gfn should not be expanded using @gfn.
Although no user of these macros passes a string other than gfn now,
this should be fixed before anyone sees strange errors.

Note: ignored the following checkpatch errors:
  ERROR: Macros with complex values should be enclosed in parenthesis
  ERROR: trailing statements should be on next line

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-07 17:26:27 -03:00
Jan Kiszka 1a0d74e664 KVM: nVMX: Fix setting of CR0 and CR4 in guest mode
The logic for calculating the value with which we call kvm_set_cr0/4 was
broken (will definitely be visible with nested unrestricted guest mode
support). Also, we performed the check regarding CR0_ALWAYSON too early
when in guest mode.

What really needs to be done on both CR0 and CR4 is to mask out L1-owned
bits and merge them in from L1's guest_cr0/4. In contrast, arch.cr0/4
and arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and,
thus, are not suited as input.

For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and
refuse the update if it fails. To be fully consistent, we implement this
check now also for CR4. For CR4, we move the check into vmx_set_cr4
while we keep it in handle_set_cr0. This is because the CR0 checks for
vmxon vs. guest mode will diverge soon when adding unrestricted guest
mode support.

Finally, we have to set the shadow to the value L2 wanted to write
originally.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-07 15:48:47 -03:00
Jan Kiszka 33fb20c39e KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS
Properly set those bits to 1 that the spec demands in case bit 55 of
VMX_BASIC is 0 - like in our case.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-07 15:47:11 -03:00
Jan Kiszka c4627c72e9 KVM: nVMX: Reset RFLAGS on VM-exit
Ouch, how could this work so well that far? We need to clear RFLAGS to
the reset value as specified by the SDM. Particularly, IF must be off
after VM-exit!

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-05 20:49:17 -03:00
Jan Kiszka 503cd0c50a KVM: nVMX: Fix switching of debug state
First of all, do not blindly overwrite GUEST_DR7 on L2 entry. The host
may have guest debugging enabled. Then properly reset DR7 and DEBUG_CTL
on L2->L1 switch as specified in the SDM.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 21:37:28 -03:00
Takuya Yoshikawa 8482644aea KVM: set_memory_region: Refactor commit_memory_region()
This patch makes the parameter old a const pointer to the old memory
slot and adds a new parameter named change to know the change being
requested: the former is for removing extra copying and the latter is
for cleaning up the code.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa 7b6195a91d KVM: set_memory_region: Refactor prepare_memory_region()
This patch drops the parameter old, a copy of the old memory slot, and
adds a new parameter named change to know the change being requested.

This not only cleans up the code but also removes extra copying of the
memory slot structure.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa 47ae31e257 KVM: set_memory_region: Drop user_alloc from set_memory_region()
Except ia64's stale code, KVM_SET_MEMORY_REGION support, this is only
used for sanity checks in __kvm_set_memory_region() which can easily
be changed to use slot id instead.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Takuya Yoshikawa 462fce4606 KVM: set_memory_region: Drop user_alloc from prepare/commit_memory_region()
X86 does not use this any more.  The remaining user, s390's !user_alloc
check, can be simply removed since KVM_SET_MEMORY_REGION ioctl is no
longer supported.

Note: fixed powerpc's indentations with spaces to suppress checkpatch
errors.

Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-03-04 20:21:08 -03:00
Marcelo Tosatti ee2c25efdd Merge branch 'master' into queue
* master: (15791 commits)
  Linux 3.9-rc1
  btrfs/raid56: Add missing #include <linux/vmalloc.h>
  fix compat_sys_rt_sigprocmask()
  SUNRPC: One line comment fix
  ext4: enable quotas before orphan cleanup
  ext4: don't allow quota mount options when quota feature enabled
  ext4: fix a warning from sparse check for ext4_dir_llseek
  ext4: convert number of blocks to clusters properly
  ext4: fix possible memory leak in ext4_remount()
  jbd2: fix ERR_PTR dereference in jbd2__journal_start
  metag: Provide dma_get_sgtable()
  metag: prom.h: remove declaration of metag_dt_memblock_reserve()
  metag: copy devicetree to non-init memory
  metag: cleanup metag_ksyms.c includes
  metag: move mm/init.c exports out of metag_ksyms.c
  metag: move usercopy.c exports out of metag_ksyms.c
  metag: move setup.c exports out of metag_ksyms.c
  metag: move kick.c exports out of metag_ksyms.c
  metag: move traps.c exports out of metag_ksyms.c
  metag: move irq enable out of irqflags.h on SMP
  ...

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Conflicts:
	arch/x86/kernel/kvmclock.c
2013-03-04 20:10:32 -03:00
Jan Kiszka 3ab66e8a45 KVM: VMX: Pass vcpu to __vmx_complete_interrupts
Cleanup: __vmx_complete_interrupts has no use for the vmx structure.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-28 10:29:03 +02:00
Jan Kiszka 44ceb9d665 KVM: nVMX: Avoid one redundant vmcs_read in prepare_vmcs12
IDT_VECTORING_INFO_FIELD was already read right after vmexit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-28 10:20:06 +02:00
Sasha Levin b67bfe0d42 hlist: drop the node parameter from iterators
I'm not sure why, but the hlist for each entry iterators were conceived

        list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

        hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
they don't really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, there was some manual work required:

 - Fix up the actual hlist iterators in linux/list.h
 - Fix up the declaration of other iterators based on the hlist ones.
 - A very small amount of places were using the 'node' parameter, this
 was modified to use 'obj->member' instead.
 - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
 properly, so those had to be fixed up manually.

The semantic patch which is mostly the work of Peter Senna Tschudin is here:

@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

type T;
expression a,c,d,e;
identifier b;
statement S;
@@

-T b;
    <+... when != b
(
hlist_for_each_entry(a,
- b,
c, d) S
|
hlist_for_each_entry_continue(a,
- b,
c) S
|
hlist_for_each_entry_from(a,
- b,
c) S
|
hlist_for_each_entry_rcu(a,
- b,
c, d) S
|
hlist_for_each_entry_rcu_bh(a,
- b,
c, d) S
|
hlist_for_each_entry_continue_rcu_bh(a,
- b,
c) S
|
for_each_busy_worker(a, c,
- b,
d) S
|
ax25_uid_for_each(a,
- b,
c) S
|
ax25_for_each(a,
- b,
c) S
|
inet_bind_bucket_for_each(a,
- b,
c) S
|
sctp_for_each_hentry(a,
- b,
c) S
|
sk_for_each(a,
- b,
c) S
|
sk_for_each_rcu(a,
- b,
c) S
|
sk_for_each_from
-(a, b)
+(a)
S
+ sk_for_each_from(a) S
|
sk_for_each_safe(a,
- b,
c, d) S
|
sk_for_each_bound(a,
- b,
c) S
|
hlist_for_each_entry_safe(a,
- b,
c, d, e) S
|
hlist_for_each_entry_continue_rcu(a,
- b,
c) S
|
nr_neigh_for_each(a,
- b,
c) S
|
nr_neigh_for_each_safe(a,
- b,
c, d) S
|
nr_node_for_each(a,
- b,
c) S
|
nr_node_for_each_safe(a,
- b,
c, d) S
|
- for_each_gfn_sp(a, c, d, b) S
+ for_each_gfn_sp(a, c, d) S
|
- for_each_gfn_indirect_valid_sp(a, c, d, b) S
+ for_each_gfn_indirect_valid_sp(a, c, d) S
|
for_each_host(a,
- b,
c) S
|
for_each_host_safe(a,
- b,
c, d) S
|
for_each_mesh_entry(a,
- b,
c, d) S
)
    ...+>

[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:24 -08:00
Chen Gang 02cdb50fd7 arch/x86/kvm: beautify source code for __u32 irq which is never < 0
irp->irq is __u32 which is never < 0.

Signed-off-by: Chen Gang <gang.chen@asianux.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 16:07:45 +02:00
Jan Kiszka 957c897e8c KVM: nVMX: Use cached exit reason
No need to re-read what vmx_vcpu_run already picked up for us.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:46:07 +02:00
Jan Kiszka 36c3cc422b KVM: nVMX: Clear segment cache after switching between L1 and L2
Switching the VMCS obviously invalidates what may have been cached about
the guest segments.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:41:09 +02:00
Jan Kiszka d6851fbeee KVM: nVMX: Advertise PAUSE and WBINVD exiting support
These exits have no preconditions, and we already process the
corresponding reasons in nested_vmx_exit_handled correctly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:33:51 +02:00
Jan Kiszka 733568f9ce KVM: VMX: Make prepare_vmcs12 and load_vmcs12_host_state static
Both are only used locally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:31:15 +02:00
Linus Torvalds 89f883372f Merge tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Marcelo Tosatti:
 "KVM updates for the 3.9 merge window, including x86 real mode
  emulation fixes, stronger memory slot interface restrictions, mmu_lock
  spinlock hold time reduction, improved handling of large page faults
  on shadow, initial APICv HW acceleration support, s390 channel IO
  based virtio, amongst others"

* tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
  Revert "KVM: MMU: lazily drop large spte"
  x86: pvclock kvm: align allocation size to page size
  KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
  x86 emulator: fix parity calculation for AAD instruction
  KVM: PPC: BookE: Handle alignment interrupts
  booke: Added DBCR4 SPR number
  KVM: PPC: booke: Allow multiple exception types
  KVM: PPC: booke: use vcpu reference from thread_struct
  KVM: Remove user_alloc from struct kvm_memory_slot
  KVM: VMX: disable apicv by default
  KVM: s390: Fix handling of iscs.
  KVM: MMU: cleanup __direct_map
  KVM: MMU: remove pt_access in mmu_set_spte
  KVM: MMU: cleanup mapping-level
  KVM: MMU: lazily drop large spte
  KVM: VMX: cleanup vmx_set_cr0().
  KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
  KVM: VMX: disable SMEP feature when guest is in non-paging mode
  KVM: Remove duplicate text in api.txt
  Revert "KVM: MMU: split kvm_mmu_free_page"
  ...
2013-02-24 13:07:18 -08:00
Jan Kiszka bd31a7f557 KVM: nVMX: Trap unconditionally if msr bitmap access fails
This avoids basing decisions on uninitialized variables, potentially
leaking kernel data to the L1 guest.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-22 00:50:45 -03:00
Jan Kiszka 908a7bdd6a KVM: nVMX: Improve I/O exit handling
This prevents trapping L2 I/O exits if L1 has neither unconditional nor
bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
handling.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-22 00:50:42 -03:00
Marcelo Tosatti 6b73a96065 Revert "KVM: MMU: lazily drop large spte"
This reverts commit caf6900f2d.

It is causing migration failures, reference
https://bugzilla.kernel.org/show_bug.cgi?id=54061.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-20 18:52:02 -03:00
Borislav Petkov 2e32b71906 x86, kvm: Add MSR_AMD64_BU_CFG2 to the list of ignored MSRs
The "x86, AMD: Enable WC+ memory type on family 10 processors" patch
currently in -tip added a workaround for AMD F10h CPUs which #GPs my
guest when booted in kvm. This is because it accesses MSR_AMD64_BU_CFG2
which is not currently ignored by kvm. Do that because this MSR is only
baremetal-relevant anyway. While at it, move the ignored MSRs at the
beginning of kvm_set_msr_common so that we exit then and there.

Acked-by: Gleb Natapov <gleb@redhat.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Andre Przywara <andre@andrep.de>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1361298793-31834-2-git-send-email-bp@alien8.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2013-02-19 10:44:07 -08:00
Jan Kiszka cbd29cb6e3 KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
We already pass vmcs12 as argument.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-14 10:35:16 +02:00
Gleb Natapov f583c29b79 x86 emulator: fix parity calculation for AAD instruction
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-13 18:01:00 +02:00
Takuya Yoshikawa 7a905b1485 KVM: Remove user_alloc from struct kvm_memory_slot
This field was needed to differentiate memory slots created by the new
API, KVM_SET_USER_MEMORY_REGION, from those by the old equivalent,
KVM_SET_MEMORY_REGION, whose support was dropped long before:

  commit b74a07beed
  KVM: Remove kernel-allocated memory regions

Although we also have private memory slots to which KVM allocates
memory with vm_mmap(), !user_alloc slots in other words, the slot id
should be enough for differentiating them.

Note: corresponding function parameters will be removed later.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-11 11:52:00 +02:00
Yang Zhang 257090f702 KVM: VMX: disable apicv by default
Without Posted Interrupt, current code is broken. Just disable by
default until Posted Interrupt is ready.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-11 10:51:13 +02:00
Xiao Guangrong 24db2734ad KVM: MMU: cleanup __direct_map
Use link_shadow_page to link the sp to the spte in __direct_map

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-06 22:42:09 -02:00
Xiao Guangrong f761620377 KVM: MMU: remove pt_access in mmu_set_spte
It is only used in debug code, so drop it

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-06 22:42:08 -02:00
Xiao Guangrong 55dd98c3a8 KVM: MMU: cleanup mapping-level
Use min() to cleanup mapping_level

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-06 22:42:08 -02:00
Xiao Guangrong caf6900f2d KVM: MMU: lazily drop large spte
Currently, kvm zaps the large spte if write-protected is needed, the later
read can fault on that spte. Actually, we can make the large spte readonly
instead of making them not present, the page fault caused by read access can
be avoided

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.  This removes the need for the return value.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-06 22:28:01 -02:00
Gleb Natapov 5037878e22 KVM: VMX: cleanup vmx_set_cr0().
When calculating hw_cr0 teh current code masks bits that should be always
on and re-adds them back immediately after. Cleanup the code by masking
only those bits that should be dropped from hw_cr0. This allow us to
get rid of some defines.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-06 22:00:02 -02:00
Dongxiao Xu c08800a56c KVM: VMX: disable SMEP feature when guest is in non-paging mode
SMEP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMEP needs to be manually
disabled when guest switches to non-paging mode.

We met an issue that, SMP Linux guest with recent kernel (enable
SMEP support, for example, 3.5.3) would crash with triple fault if
setting unrestricted_guest=0. This is because KVM uses an identity
mapping page table to emulate the non-paging mode, where the page
table is set with USER flag. If SMEP is still enabled in this case,
guest will meet unhandlable page fault and then crash.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-05 23:28:07 -02:00
Gleb Natapov 834be0d83f Revert "KVM: MMU: split kvm_mmu_free_page"
This reverts commit bd4c86eaa6.

There is not user for kvm_mmu_isolate_page() any more.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-05 22:47:39 -02:00
Gleb Natapov eb3fce87cc KVM: MMU: drop superfluous is_present_gpte() check.
Gust page walker puts only present ptes into ptes[] array. No need to
check it again.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-04 23:24:28 -02:00
Gleb Natapov 116eb3d30e KVM: MMU: drop superfluous min() call.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-04 23:24:28 -02:00
Gleb Natapov 2c9afa52ef KVM: MMU: set base_role.nxe during mmu initialization.
Move base_role.nxe initialisation to where all other roles are initialized.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-04 23:24:28 -02:00
Gleb Natapov 9bb4f6b15e KVM: MMU: drop unneeded checks.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-04 23:24:28 -02:00
Gleb Natapov feb3eb704a KVM: MMU: make spte_is_locklessly_modifiable() more clear
spte_is_locklessly_modifiable() checks that both SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE are present on spte. Make it more explicit.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-04 23:24:28 -02:00
Yang Zhang c7c9c56ca2 x86, apicv: add virtual interrupt delivery support
Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
manually, which is fully taken care of by the hardware. This needs
some special awareness into existing interrupr injection path:

- for pending interrupt, instead of direct injection, we may need
  update architecture specific indicators before resuming to guest.

- A pending interrupt, which is masked by ISR, should be also
  considered in above update action, since hardware will decide
  when to inject it at right time. Current has_interrupt and
  get_interrupt only returns a valid vector from injection p.o.v.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:19 +02:00
Yang Zhang 8d14695f95 x86, apicv: add virtual x2apic support
basically to benefit from apicv, we need to enable virtualized x2apic mode.
Currently, we only enable it when guest is really using x2apic.

Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic:
0x800 - 0x8ff: no read intercept for apicv register virtualization,
               except APIC ID and TMCCT which need software's assistance to
               get right value.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:06 +02:00
Yang Zhang 83d4c28693 x86, apicv: add APICv register virtualization support
- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:47:54 +02:00
Avi Kivity 3f0c3d0bb2 KVM: x86 emulator: fix test_cc() build failure on i386
'pushq' doesn't exist on i386.  Replace with 'push', which should work
since the operand is a register.

Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-27 11:09:38 +02:00
Gleb Natapov 141687869f KVM: VMX: set vmx->emulation_required only when needed.
If emulate_invalid_guest_state=false vmx->emulation_required is never
actually used, but it ends up to be always set to true since
handle_invalid_guest_state(), the only place it is reset back to
false, is never called. This, besides been not very clean, makes vmexit
and vmentry path to check emulate_invalid_guest_state needlessly.

The patch fixes that by keeping emulation_required coherent with
emulate_invalid_guest_state setting.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:31 -02:00
Gleb Natapov 378a8b099f KVM: x86: fix use of uninitialized memory as segment descriptor in emulator.
If VMX reports segment as unusable, zero descriptor passed by the emulator
before returning. Such descriptor will be considered not present by the
emulator.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:31 -02:00
Gleb Natapov 91b0aa2ca6 KVM: VMX: rename fix_pmode_dataseg to fix_pmode_seg.
The function deals with code segment too.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:30 -02:00
Gleb Natapov 25391454e7 KVM: VMX: don't clobber segment AR of unusable segments.
Usability is returned in unusable field, so not need to clobber entire
AR. Callers have to know how to deal with unusable segments already
since if emulate_invalid_guest_state=true AR is not zeroed.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:28 -02:00
Gleb Natapov 218e763f45 KVM: VMX: skip vmx->rmode.vm86_active check on cr0 write if unrestricted guest is enabled
vmx->rmode.vm86_active is never true is unrestricted guest is enabled.
Make it more explicit that neither enter_pmode() nor enter_rmode() is
called in this case.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:28 -02:00
Gleb Natapov 286da4156d KVM: VMX: remove hack that disables emulation on vcpu reset/init
There is no reason for it. If state is suitable for vmentry it
will be detected during guest entry and no emulation will happen.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:27 -02:00
Gleb Natapov c5e97c80b5 KVM: VMX: if unrestricted guest is enabled vcpu state is always valid.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:26 -02:00
Gleb Natapov 2f143240cb KVM: VMX: reset CPL only on CS register write.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:26 -02:00
Gleb Natapov 1f3141e80b KVM: VMX: remove special CPL cache access during transition to real mode.
Since vmx_get_cpl() always returns 0 when VCPU is in real mode it is no
longer needed. Also reset CPL cache to zero during transaction to
protected mode since transaction may happen while CS.selectors & 3 != 0,
but in reality CPL is 0.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:25 -02:00
Avi Kivity 158de57f90 KVM: x86 emulator: convert a few freestanding emulations to fastop
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-23 22:15:40 -02:00
Avi Kivity 34b77652b9 KVM: x86 emulator: rearrange fastop definitions
Make fastop opcodes usable in other emulations.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-23 22:15:39 -02:00
Avi Kivity 4d7583493e KVM: x86 emulator: convert 2-operand IMUL to fastop
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-23 22:15:38 -02:00
Avi Kivity 11c363ba8f KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-23 22:15:37 -02:00
Avi Kivity 95413dc413 KVM: x86 emulator: convert INC/DEC to fastop
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-23 22:15:37 -02:00
Avi Kivity 9ae9febae9 KVM: x86 emulator: covert SETCC to fastop
This is a bit of a special case since we don't have the usual
byte/word/long/quad switch; instead we switch on the condition code embedded
in the instruction.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-23 22:15:36 -02:00
Avi Kivity 007a3b5475 KVM: x86 emulator: convert shift/rotate instructions to fastop
SHL, SHR, ROL, ROR, RCL, RCR, SAR, SAL

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-23 22:15:35 -02:00
Avi Kivity 0bdea06892 KVM: x86 emulator: Convert SHLD, SHRD to fastop
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-23 22:15:33 -02:00
Xiao Guangrong 93c05d3ef2 KVM: x86: improve reexecute_instruction
The current reexecute_instruction can not well detect the failed instruction
emulation. It allows guest to retry all the instructions except it accesses
on error pfn

For example, some cases are nested-write-protect - if the page we want to
write is used as PDE but it chains to itself. Under this case, we should
stop the emulation and report the case to userspace

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-21 22:58:33 -02:00
Xiao Guangrong 95b3cf69bd KVM: x86: let reexecute_instruction work for tdp
Currently, reexecute_instruction refused to retry all instructions if
tdp is enabled. If nested npt is used, the emulation may be caused by
shadow page, it can be fixed by dropping the shadow page. And the only
condition that tdp can not retry the instruction is the access fault
on error pfn

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-21 22:58:32 -02:00
Xiao Guangrong 22368028fe KVM: x86: clean up reexecute_instruction
Little cleanup for reexecute_instruction, also use gpa_to_gfn in
retry_instruction

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-21 22:58:31 -02:00
Takuya Yoshikawa 6b81b05e44 KVM: MMU: Conditionally reschedule when kvm_mmu_slot_remove_write_access() takes a long time
If the userspace starts dirty logging for a large slot, say 64GB of
memory, kvm_mmu_slot_remove_write_access() needs to hold mmu_lock for
a long time such as tens of milliseconds.  This patch controls the lock
hold time by asking the scheduler if we need to reschedule for others.

One penalty for this is that we need to flush TLBs before releasing
mmu_lock.  But since holding mmu_lock for a long time does affect not
only the guest, vCPU threads in other words, but also the host as a
whole, we should pay for that.

In practice, the cost will not be so high because we can protect a fair
amount of memory before being rescheduled: on my test environment,
cond_resched_lock() was called only once for protecting 12GB of memory
even without THP.  We can also revisit Avi's "unlocked TLB flush" work
later for completely suppressing extra TLB flushes if needed.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:14:28 +02:00
Takuya Yoshikawa 9d1beefb71 KVM: Make kvm_mmu_slot_remove_write_access() take mmu_lock by itself
Better to place mmu_lock handling and TLB flushing code together since
this is a self-contained function.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:14:17 +02:00
Takuya Yoshikawa b34cb590fb KVM: Make kvm_mmu_change_mmu_pages() take mmu_lock by itself
No reason to make callers take mmu_lock since we do not need to protect
kvm_mmu_change_mmu_pages() and kvm_mmu_slot_remove_write_access()
together by mmu_lock in kvm_arch_commit_memory_region(): the former
calls kvm_mmu_commit_zap_page() and flushes TLBs by itself.

Note: we do not need to protect kvm->arch.n_requested_mmu_pages by
mmu_lock as can be seen from the fact that it is read locklessly.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:14:09 +02:00
Takuya Yoshikawa e12091ce7b KVM: Remove unused slot_bitmap from kvm_mmu_page
Not needed any more.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:58 +02:00
Takuya Yoshikawa b99db1d352 KVM: MMU: Make kvm_mmu_slot_remove_write_access() rmap based
This makes it possible to release mmu_lock and reschedule conditionally
in a later patch.  Although this may increase the time needed to protect
the whole slot when we start dirty logging, the kernel should not allow
the userspace to trigger something that will hold a spinlock for such a
long time as tens of milliseconds: actually there is no limit since it
is roughly proportional to the number of guest pages.

Another point to note is that this patch removes the only user of
slot_bitmap which will cause some problems when we increase the number
of slots further.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:47 +02:00
Takuya Yoshikawa 245c3912ea KVM: MMU: Remove unused parameter level from __rmap_write_protect()
No longer need to care about the mapping level in this function.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:31 +02:00
Takuya Yoshikawa c972f3b125 KVM: Write protect the updated slot only when dirty logging is enabled
Calling kvm_mmu_slot_remove_write_access() for a deleted slot does
nothing but search for non-existent mmu pages which have mappings to
that deleted memory; this is safe but a waste of time.

Since we want to make the function rmap based in a later patch, in a
manner which makes it unsafe to be called for a deleted slot, we makes
the caller see if the slot is non-zero and being dirty logged.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-14 11:13:15 +02:00
Xiao Guangrong 7751babd3c KVM: MMU: fix infinite fault access retry
We have two issues in current code:
- if target gfn is used as its page table, guest will refault then kvm will use
  small page size to map it. We need two #PF to fix its shadow page table

- sometimes, say a exception is triggered during vm-exit caused by #PF
  (see handle_exception() in vmx.c), we remove all the shadow pages shadowed
  by the target gfn before go into page fault path, it will cause infinite
  loop:
  delete shadow pages shadowed by the gfn -> try to use large page size to map
  the gfn -> retry the access ->...

To fix these, we can adjust page size early if the target gfn is used as page
table

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-10 15:28:30 -02:00
Xiao Guangrong c22885050e KVM: MMU: fix Dirty bit missed if CR0.WP = 0
If the write-fault access is from supervisor and CR0.WP is not set on the
vcpu, kvm will fix it by adjusting pte access - it sets the W bit on pte
and clears U bit. This is the chance that kvm can change pte access from
readonly to writable

Unfortunately, the pte access is the access of 'direct' shadow page table,
means direct sp.role.access = pte_access, then we will create a writable
spte entry on the readonly shadow page table. It will cause Dirty bit is
not tracked when two guest ptes point to the same large page. Note, it
does not have other impact except Dirty bit since cr0.wp is encoded into
sp.role

It can be fixed by adjusting pte access before establishing shadow page
table. Also, after that, no mmu specified code exists in the common function
and drop two parameters in set_spte

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-10 15:28:08 -02:00
Avi Kivity fb864fbc72 KVM: x86 emulator: convert basic ALU ops to fastop
Opcodes:
	TEST
	CMP
	ADD
	ADC
	SUB
	SBB
	XOR
	OR
	AND

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-09 17:39:30 -02:00
Avi Kivity f7857f35db KVM: x86 emulator: add macros for defining 2-operand fastop emulation
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-09 17:39:28 -02:00
Avi Kivity 45a1467d7e KVM: x86 emulator: convert NOT, NEG to fastop
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-09 17:39:25 -02:00
Avi Kivity 75f728456f KVM: x86 emulator: mark CMP, CMPS, SCAS, TEST as NoWrite
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-09 17:39:21 -02:00
Avi Kivity b6744dc3fb KVM: x86 emulator: introduce NoWrite flag
Instead of disabling writeback via OP_NONE, just specify NoWrite.

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-09 17:39:18 -02:00
Avi Kivity b7d491e7f0 KVM: x86 emulator: Support for declaring single operand fastops
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-09 17:39:17 -02:00
Avi Kivity e28bbd44da KVM: x86 emulator: framework for streamlining arithmetic opcodes
We emulate arithmetic opcodes by executing a "similar" (same operation,
different operands) on the cpu.  This ensures accurate emulation, esp. wrt.
eflags.  However, the prologue and epilogue around the opcode is fairly long,
consisting of a switch (for the operand size) and code to load and save the
operands.  This is repeated for every opcode.

This patch introduces an alternative way to emulate arithmetic opcodes.
Instead of the above, we have four (three on i386) functions consisting
of just the opcode and a ret; one for each operand size.  For example:

   .align 8
   em_notb:
	not %al
	ret

   .align 8
   em_notw:
	not %ax
	ret

   .align 8
   em_notl:
	not %eax
	ret

   .align 8
   em_notq:
	not %rax
	ret

The prologue and epilogue are shared across all opcodes.  Note the functions
use a special calling convention; notably eflags is an input/output parameter
and is not clobbered.  Rather than dispatching the four functions through a
jump table, the functions are declared as a constant size (8) so their address
can be calculated.

Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-09 17:39:17 -02:00
Marcelo Tosatti b09408d00f KVM: VMX: fix incorrect cached cpl value with real/v8086 modes
CPL is always 0 when in real mode, and always 3 when virtual 8086 mode.

Using values other than those can cause failures on operations that
check CPL.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-08 17:25:35 -02:00
Gleb Natapov b0cfeb5ded KVM: x86: remove unused variable from walk_addr_generic()
Fix compilation warning.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-08 17:23:39 -02:00
Marcelo Tosatti 013f6a5d3d KVM: x86: use dynamic percpu allocations for shared msrs area
Use dynamic percpu allocations for the shared msrs structure,
to avoid using the limited reserved percpu space.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-08 12:51:56 -02:00
Gleb Natapov 908e7d7999 KVM: MMU: simplify folding of dirty bit into accessed_dirty
MMU code tries to avoid if()s HW is not able to predict reliably by using
bitwise operation to streamline code execution, but in case of a dirty bit
folding this gives us nothing since write_fault is checked right before
the folding code. Lets just piggyback onto the if() to make code more clear.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-07 20:31:35 -02:00
Gleb Natapov ee04e0cea8 KVM: mmu: remove unused trace event
trace_kvm_mmu_delay_free_pages() is no longer used.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-07 19:54:50 -02:00
Gleb Natapov 0ca1b4f4ba KVM: VMX: handle IO when emulation is due to #GP in real mode.
With emulate_invalid_guest_state=0 if a vcpu is in real mode VMX can
enter the vcpu with smaller segment limit than guest configured.  If the
guest tries to access pass this limit it will get #GP at which point
instruction will be emulated with correct segment limit applied. If
during the emulation IO is detected it is not handled correctly. Vcpu
thread should exit to userspace to serve the IO, but it returns to the
guest instead.  Since emulation is not completed till userspace completes
the IO the faulty instruction is re-executed ad infinitum.

The patch fixes that by exiting to userspace if IO happens during
instruction emulation.

Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:31 -02:00
Gleb Natapov d54d07b2ca KVM: VMX: Do not fix segment register during vcpu initialization.
Segment registers will be fixed according to current emulation policy
during switching to real mode for the first time.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:30 -02:00
Gleb Natapov d99e415275 KVM: VMX: fix emulation of invalid guest state.
Currently when emulation of invalid guest state is enable
(emulate_invalid_guest_state=1) segment registers are still fixed for
entry to vm86 mode some times. Segment register fixing is avoided in
enter_rmode(), but vmx_set_segment() still does it unconditionally.
The patch fixes it.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:29 -02:00
Gleb Natapov 89efbed02c KVM: VMX: make rmode_segment_valid() more strict.
Currently it allows entering vm86 mode if segment limit is greater than
0xffff and db bit is set. Both of those can cause incorrect execution of
instruction by cpu since in vm86 mode limit will be set to 0xffff and db
will be forced to 0.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:28 -02:00
Gleb Natapov 045a282ca4 KVM: emulator: implement fninit, fnstsw, fnstcw
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:27 -02:00
Gleb Natapov 3a78a4f463 KVM: emulator: drop RPL check from linearize() function
According to Intel SDM Vol3 Section 5.5 "Privilege Levels" and 5.6
"Privilege Level Checking When Accessing Data Segments" RPL checking is
done during loading of a segment selector, not during data access. We
already do checking during segment selector loading, so drop the check
during data access. Checking RPL during data access triggers #GP if
after transition from real mode to protected mode RPL bits in a segment
selector are set.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:26 -02:00
Gleb Natapov f924d66d27 KVM: VMX: remove unneeded temporary variable from vmx_set_segment()
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:02:00 +02:00
Gleb Natapov 1ecd50a947 KVM: VMX: clean-up vmx_set_segment()
Move all vm86_active logic into one place.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:49 +02:00
Gleb Natapov 39dcfb95de KVM: VMX: remove redundant code from vmx_set_segment()
Segment descriptor's base is fixed by call to fix_rmode_seg(). Not need
to do it twice.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:37 +02:00
Gleb Natapov beb853ffec KVM: VMX: use fix_rmode_seg() to fix all code/data segments
The code for SS and CS does the same thing fix_rmode_seg() is doing.
Use it instead of hand crafted code.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:18 +02:00
Gleb Natapov c6ad115348 KVM: VMX: return correct segment limit and flags for CS/SS registers in real mode
VMX without unrestricted mode cannot virtualize real mode, so if
emulate_invalid_guest_state=0 kvm uses vm86 mode to approximate
it. Sometimes, when guest moves from protected mode to real mode, it
leaves segment descriptors in a state not suitable for use by vm86 mode
virtualization, so we keep shadow copy of segment descriptors for internal
use and load fake register to VMCS for guest entry to succeed. Till
now we kept shadow for all segments except SS and CS (for SS and CS we
returned parameters directly from VMCS), but since commit a5625189f6
emulator enforces segment limits in real mode. This causes #GP during move
from protected mode to real mode when emulator fetches first instruction
after moving to real mode since it uses incorrect CS base and limit to
linearize the %rip. Fix by keeping shadow for SS and CS too.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:03 +02:00
Gleb Natapov 0647f4aa8c KVM: VMX: relax check for CS register in rmode_segment_valid()
rmode_segment_valid() checks if segment descriptor can be used to enter
vm86 mode. VMX spec mandates that in vm86 mode CS register will be of
type data, not code. Lets allow guest entry with vm86 mode if the only
problem with CS register is incorrect type. Otherwise entire real mode
will be emulated.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:00:47 +02:00
Gleb Natapov 07f42f5f25 KVM: VMX: cleanup rmode_segment_valid()
Set segment fields explicitly instead of using  binary operations.

No behaviour changes.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:00:36 +02:00
Nickolai Zeldovich d4b06c2d4c kvm: fix i8254 counter 0 wraparound
The kvm i8254 emulation for counter 0 (but not for counters 1 and 2)
has at least two bugs in mode 0:

1. The OUT bit, computed by pit_get_out(), is never set high.

2. The counter value, computed by pit_get_count(), wraps back around to
   the initial counter value, rather than wrapping back to 0xFFFF
   (which is the behavior described in the comment in __kpit_elapsed,
   the behavior implemented by qemu, and the behavior observed on AMD
   hardware).

The bug stems from __kpit_elapsed computing the elapsed time mod the
initial counter value (stored as nanoseconds in ps->period).  This is both
unnecessary (none of the callers of kpit_elapsed expect the value to be
at most the initial counter value) and incorrect (it causes pit_get_count
to appear to wrap around to the initial counter value rather than 0xFFFF).
Removing this mod from __kpit_elapsed fixes both of the above bugs.

Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-18 11:12:38 +02:00
Gleb Natapov e11ae1a102 KVM: remove unused variable.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-14 23:07:44 -02:00
Alex Williamson f82a8cfe93 KVM: struct kvm_memory_slot.user_alloc -> bool
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:24:38 -02:00
Alex Williamson bbacc0c111 KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM.  One is
the user accessible slots and the other is user + private.  Make this
more obvious.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:21:57 -02:00
Gleb Natapov f3200d00ea KVM: inject ExtINT interrupt before APIC interrupts
According to Intel SDM Volume 3 Section 10.8.1 "Interrupt Handling with
the Pentium 4 and Intel Xeon Processors" and Section 10.8.2 "Interrupt
Handling with the P6 Family and Pentium Processors" ExtINT interrupts are
sent directly to the processor core for handling. Currently KVM checks
APIC before it considers ExtINT interrupts for injection which is
backwards from the spec. Make code behave according to the SDM.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: "Zhang, Yang Z" <yang.z.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:05:21 -02:00
Nadav Amit 5e2c688351 KVM: x86: fix mov immediate emulation for 64-bit operands
MOV immediate instruction (opcodes 0xB8-0xBF) may take 64-bit operand.
The previous emulation implementation assumes the operand is no longer than 32.
Adding OpImm64 for this matter.

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=881579

Signed-off-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 22:30:56 -02:00
Gleb Natapov 7f662273e4 KVM: emulator: implement AAD instruction
Windows2000 uses it during boot. This fixes
https://bugzilla.kernel.org/show_bug.cgi?id=50921

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 22:27:22 -02:00
Linus Torvalds 66cdd0ceaf Merge tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Marcelo Tosatti:
 "Considerable KVM/PPC work, x86 kvmclock vsyscall support,
  IA32_TSC_ADJUST MSR emulation, amongst others."

Fix up trivial conflict in kernel/sched/core.c due to cross-cpu
migration notifier added next to rq migration call-back.

* tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (156 commits)
  KVM: emulator: fix real mode segment checks in address linearization
  VMX: remove unneeded enable_unrestricted_guest check
  KVM: VMX: fix DPL during entry to protected mode
  x86/kexec: crash_vmclear_local_vmcss needs __rcu
  kvm: Fix irqfd resampler list walk
  KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
  x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
  KVM: MMU: optimize for set_spte
  KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
  KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
  KVM: PPC: bookehv: Add guest computation mode for irq delivery
  KVM: PPC: Make EPCR a valid field for booke64 and bookehv
  KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
  KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
  KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation
  KVM: PPC: e500: Add emulation helper for getting instruction ea
  KVM: PPC: bookehv64: Add support for interrupt handling
  KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler
  KVM: PPC: booke: Fix get_tb() compile error on 64-bit
  KVM: PPC: e500: Silence bogus GCC warning in tlb code
  ...
2012-12-13 15:31:08 -08:00
Gleb Natapov 58b7825bc3 KVM: emulator: fix real mode segment checks in address linearization
In real mode CS register is writable, so do not #GP on write.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-11 21:00:28 -02:00
Gleb Natapov 0b26b588d9 VMX: remove unneeded enable_unrestricted_guest check
If enable_unrestricted_guest is true vmx->rmode.vm86_active will
always be false.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-11 21:00:28 -02:00
Gleb Natapov a4d3326c2d KVM: VMX: fix DPL during entry to protected mode
On CPUs without support for unrestricted guests DPL cannot be smaller
than RPL for data segments during guest entry, but this state can occurs
if a data segment selector changes while vcpu is in real mode to a value
with lowest two bits != 00. Fix that by forcing DPL == RPL on transition
to protected mode.

This is a regression introduced by c865c43de6.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-11 21:00:27 -02:00
Zhang Yanfei 8f536b7697 KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
The vmclear function will be assigned to the callback function pointer
when loading kvm-intel module. And the bitmap indicates whether we
should do VMCLEAR operation in kdump. The bits in the bitmap are
set/unset according to different conditions.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-06 18:26:57 +02:00
Xiao Guangrong c219346325 KVM: MMU: optimize for set_spte
There are two cases we need to adjust page size in set_spte:
1): the one is other vcpu creates new sp in the window between mapping_level()
    and acquiring mmu-lock.
2): the another case is the new sp is created by itself (page-fault path) when
    guest uses the target gfn as its page table.

In current code, set_spte drop the spte and emulate the access for these case,
it works not good:
- for the case 1, it may destroy the mapping established by other vcpu, and
  do expensive instruction emulation.
- for the case 2, it may emulate the access even if the guest is accessing
  the page which not used as page table. There is a example, 0~2M is used as
  huge page in guest, in this huge page, only page 3 used as page table, then
  guest read/writes on other pages can cause instruction emulation.

Both of these cases can be fixed by allowing guest to retry the access, it
will refault, then we can establish the mapping by using small page

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-06 09:11:25 +02:00
Julian Stecklina 66f7b72e11 KVM: x86: Make register state after reset conform to specification
VMX behaves now as SVM wrt to FPU initialization. Code has been moved to
generic code path. General-purpose registers are now cleared on reset and
INIT.  SVM code properly initializes EDX.

Signed-off-by: Julian Stecklina <jsteckli@os.inf.tu-dresden.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 18:00:07 +02:00
Zhang Xiantao 2b3c5cbc0d kvm: don't use bit24 for detecting address-specific invalidation capability
Bit24 in VMX_EPT_VPID_CAP_MASI is  not used for address-specific invalidation capability
reporting, so remove it from KVM to avoid conflicts in future.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 16:35:48 +02:00
Zhang Xiantao 0307b7b8c2 kvm: remove unnecessary bit checking for ept violation
Bit 6 in EPT vmexit's exit qualification is not defined in SDM, so remove it.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 16:35:21 +02:00
Jan Kiszka 45e3cc7d9f KVM: x86: Fix uninitialized return code
This is a regression caused by 18595411a7.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-02 17:37:04 +02:00
Will Auld ba904635d4 KVM: x86: Emulate IA32_TSC_ADJUST MSR
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the "use TSC counter offsetting" VM-execution control
is enabled as well as the IA32_TSC_ADJUST control.

However, because hardware will only return the TSC + IA32_TSC_ADJUST +
vmsc tsc_offset for a guest process when it does and rdtsc (with the correct
settings) the value of our virtualized IA32_TSC_ADJUST must be stored in one
of these three locations. The argument against storing it in the actual MSR
is performance. This is likely to be seldom used while the save/restore is
required on every transition. IA32_TSC_ADJUST was created as a way to solve
some issues with writing TSC itself so that is not an option either.

The remaining option, defined above as our solution has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. However, more problematic is that storing the
data in vmcs tsc_offset will have a different semantic effect on the system
than does using the actual MSR. This is illustrated in the following example:

The hypervisor set the IA32_TSC_ADJUST, then the guest sets it and a guest
process performs a rdtsc. In this case the guest process will get
TSC + IA32_TSC_ADJUST_hyperviser + vmsc tsc_offset including
IA32_TSC_ADJUST_guest. While the total system semantics changed the semantics
as seen by the guest do not and hence this will not cause a problem.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-30 18:29:30 -02:00
Will Auld 8fe8ab46be KVM: x86: Add code to track call origin for msr assignment
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST msr
while setting the tsc value. It is anticipated that this capability is
useful for other tasks.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-30 18:26:12 -02:00
Xiao Guangrong 5a560f8b5e KVM: VMX: fix memory order between loading vmcs and clearing vmcs
vmcs->cpu indicates whether it exists on the target cpu, -1 means the vmcs
does not exist on any vcpu

If vcpu load vmcs with vmcs.cpu = -1, it can be directly added to cpu's percpu
list. The list can be corrupted if the cpu prefetch the vmcs's list before
reading vmcs->cpu. Meanwhile, we should remove vmcs from the list before
making vmcs->vcpu == -1 be visible

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-29 21:14:46 -02:00
Xiao Guangrong e6c7d32172 KVM: VMX: fix invalid cpu passed to smp_call_function_single
In loaded_vmcs_clear, loaded_vmcs->cpu is the fist parameter passed to
smp_call_function_single, if the target cpu is downing (doing cpu hot remove),
loaded_vmcs->cpu can become -1 then -1 is passed to smp_call_function_single

It can be triggered when vcpu is being destroyed, loaded_vmcs_clear is called
in the preemptionable context

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-28 22:04:58 -02:00
Marcelo Tosatti d98d07ca7e KVM: x86: update pvclock area conditionally, on cpu migration
As requested by Glauber, do not update kvmclock area on vcpu->pcpu
migration, in case the host has stable TSC.

This is to reduce cacheline bouncing.

Acked-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:15 -02:00
Marcelo Tosatti b48aa97e38 KVM: x86: require matched TSC offsets for master clock
With master clock, a pvclock clock read calculates:

ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ]

Where 'rdtsc' is the host TSC.

system_timestamp and tsc_timestamp are unique, one tuple
per VM: the "master clock".

Given a host with synchronized TSCs, its obvious that
guest TSC must be matched for the above to guarantee monotonicity.

Allow master clock usage only if guest TSCs are synchronized.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:15 -02:00
Marcelo Tosatti 42897d866b KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization
TSC initialization will soon make use of online_vcpus.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:14 -02:00
Marcelo Tosatti d828199e84 KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
KVM added a global variable to guarantee monotonicity in the guest.
One of the reasons for that is that the time between

	1. ktime_get_ts(&timespec);
	2. rdtscll(tsc);

Is variable. That is, given a host with stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above.

The time required to execute 2. is not the same on those two instances
executing in different VCPUS (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single <system_timestamp, tsc_timestamp> tuple is visible to all
vcpus simultaneously, this problem disappears. See comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by synchronicity of the host TSCs
and guest TSCs.

Set TSC stable pvclock flag in that case, allowing the guest to read
clock from userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:13 -02:00
Marcelo Tosatti 16e8d74d2d KVM: x86: notifier for clocksource changes
Register a notifier for clocksource change event. In case
the host switches to clock other than TSC, disable master
clock usage.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:12 -02:00
Marcelo Tosatti 886b470cb1 KVM: x86: pass host_tsc to read_l1_tsc
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:11 -02:00
Marcelo Tosatti 78c0337a38 KVM: x86: retain pvclock guest stopped bit in guest memory
Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU
migration) to clear the bit.

Noticed by Paolo Bonzini.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Glauber Costa <glommer@parallels.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:05 -02:00
H. Peter Anvin cb7cb2864e x86, kvm: Remove incorrect redundant assembly constraint
In __emulate_1op_rax_rdx, we use "+a" and "+d" which are input/output
constraints, and *then* use "a" and "d" as input constraints.  This is
incorrect, but happens to work on some versions of gcc.

However, it breaks gcc with -O0 and icc, and may break on future
versions of gcc.

Reported-and-tested-by: Melanie Blower <melanie.blower@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/B3584E72CFEBED439A3ECA9BCE67A4EF1B17AF90@FMSMSX107.amr.corp.intel.com
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-26 15:52:48 -08:00
Takashi Iwai 29282fde80 KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update()
The commit [ad756a16: KVM: VMX: Implement PCID/INVPCID for guests with
EPT] introduced the unconditional access to SECONDARY_VM_EXEC_CONTROL,
and this triggers kernel warnings like below on old CPUs:

    vmwrite error: reg 401e value a0568000 (err 12)
    Pid: 13649, comm: qemu-kvm Not tainted 3.7.0-rc4-test2+ #154
    Call Trace:
     [<ffffffffa0558d86>] vmwrite_error+0x27/0x29 [kvm_intel]
     [<ffffffffa054e8cb>] vmcs_writel+0x1b/0x20 [kvm_intel]
     [<ffffffffa054f114>] vmx_cpuid_update+0x74/0x170 [kvm_intel]
     [<ffffffffa03629b6>] kvm_vcpu_ioctl_set_cpuid2+0x76/0x90 [kvm]
     [<ffffffffa0341c67>] kvm_arch_vcpu_ioctl+0xc37/0xed0 [kvm]
     [<ffffffff81143f7c>] ? __vunmap+0x9c/0x110
     [<ffffffffa0551489>] ? vmx_vcpu_load+0x39/0x1a0 [kvm_intel]
     [<ffffffffa0340ee2>] ? kvm_arch_vcpu_load+0x52/0x1a0 [kvm]
     [<ffffffffa032dcd4>] ? vcpu_load+0x74/0xd0 [kvm]
     [<ffffffffa032deb0>] kvm_vcpu_ioctl+0x110/0x5e0 [kvm]
     [<ffffffffa032e93d>] ? kvm_dev_ioctl+0x4d/0x4a0 [kvm]
     [<ffffffff8117dc6f>] do_vfs_ioctl+0x8f/0x530
     [<ffffffff81139d76>] ? remove_vma+0x56/0x60
     [<ffffffff8113b708>] ? do_munmap+0x328/0x400
     [<ffffffff81187c8c>] ? fget_light+0x4c/0x100
     [<ffffffff8117e1a1>] sys_ioctl+0x91/0xb0
     [<ffffffff815a942d>] system_call_fastpath+0x1a/0x1f

This patch adds a check for the availability of secondary exec
control to avoid these warnings.

Cc: <stable@vger.kernel.org> [v3.6+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-16 20:25:18 -02:00
Guo Chao 807f12e57c KVM: remove unnecessary return value check
No need to check return value before breaking switch.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-13 22:14:29 -02:00
Guo Chao 951179ce86 KVM: x86: fix return value of kvm_vm_ioctl_set_tss_addr()
Return value of this function will be that of ioctl().

#include <stdio.h>
#include <linux/kvm.h>

int main () {
	int fd;
	fd = open ("/dev/kvm", 0);
	fd = ioctl (fd, KVM_CREATE_VM, 0);
	ioctl (fd, KVM_SET_TSS_ADDR, 0xfffff000);
	perror ("");
	return 0;
}

Output is "Operation not permitted". That's not what
we want.

Return -EINVAL in this case.

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-13 22:14:29 -02:00
Guo Chao 18595411a7 KVM: do not kfree error pointer
We should avoid kfree()ing error pointer in kvm_vcpu_ioctl() and
kvm_arch_vcpu_ioctl().

Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-13 22:14:28 -02:00
Petr Matousek 6d1068b3a9 KVM: x86: invalid opcode oops on SET_SREGS with OSXSAVE bit set (CVE-2012-4461)
On hosts without the XSAVE support unprivileged local user can trigger
oops similar to the one below by setting X86_CR4_OSXSAVE bit in guest
cr4 register using KVM_SET_SREGS ioctl and later issuing KVM_RUN
ioctl.

invalid opcode: 0000 [#2] SMP
Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables
...
Pid: 24935, comm: zoog_kvm_monito Tainted: G      D      3.2.0-3-686-pae
EIP: 0060:[<f8b9550c>] EFLAGS: 00210246 CPU: 0
EIP is at kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm]
EAX: 00000001 EBX: 000f387e ECX: 00000000 EDX: 00000000
ESI: 00000000 EDI: 00000000 EBP: ef5a0060 ESP: d7c63e70
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process zoog_kvm_monito (pid: 24935, ti=d7c62000 task=ed84a0c0
task.ti=d7c62000)
Stack:
 00000001 f70a1200 f8b940a9 ef5a0060 00000000 00200202 f8769009 00000000
 ef5a0060 000f387e eda5c020 8722f9c8 00015bae 00000000 ed84a0c0 ed84a0c0
 c12bf02d 0000ae80 ef7f8740 fffffffb f359b740 ef5a0060 f8b85dc1 0000ae80
Call Trace:
 [<f8b940a9>] ? kvm_arch_vcpu_ioctl_set_sregs+0x2fe/0x308 [kvm]
...
 [<c12bfb44>] ? syscall_call+0x7/0xb
Code: 89 e8 e8 14 ee ff ff ba 00 00 04 00 89 e8 e8 98 48 ff ff 85 c0 74
1e 83 7d 48 00 75 18 8b 85 08 07 00 00 31 c9 8b 95 0c 07 00 00 <0f> 01
d1 c7 45 48 01 00 00 00 c7 45 1c 01 00 00 00 0f ae f0 89
EIP: [<f8b9550c>] kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] SS:ESP
0068:d7c63e70

QEMU first retrieves the supported features via KVM_GET_SUPPORTED_CPUID
and then sets them later. So guest's X86_FEATURE_XSAVE should be masked
out on hosts without X86_FEATURE_XSAVE, making kvm_set_cr4 with
X86_CR4_OSXSAVE fail. Userspaces that allow specifying guest cpuid with
X86_FEATURE_XSAVE even on hosts that do not support it, might be
susceptible to this attack from inside the guest as well.

Allow setting X86_CR4_OSXSAVE bit only if host has XSAVE support.

Signed-off-by: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-12 21:16:45 -02:00
Xiao Guangrong 87da7e66a4 KVM: x86: fix vcpu->mmio_fragments overflow
After commit b3356bf0db (KVM: emulator: optimize "rep ins" handling),
the pieces of io data can be collected and write them to the guest memory
or MMIO together

Unfortunately, kvm splits the mmio access into 8 bytes and store them to
vcpu->mmio_fragments. If the guest uses "rep ins" to move large data, it
will cause vcpu->mmio_fragments overflow

The bug can be exposed by isapc (-M isapc):

[23154.818733] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[ ......]
[23154.858083] Call Trace:
[23154.859874]  [<ffffffffa04f0e17>] kvm_get_cr8+0x1d/0x28 [kvm]
[23154.861677]  [<ffffffffa04fa6d4>] kvm_arch_vcpu_ioctl_run+0xcda/0xe45 [kvm]
[23154.863604]  [<ffffffffa04f5a1a>] ? kvm_arch_vcpu_load+0x17b/0x180 [kvm]

Actually, we can use one mmio_fragment to store a large mmio access then
split it when we pass the mmio-exit-info to userspace. After that, we only
need two entries to store mmio info for the cross-mmio pages access

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-31 20:36:30 -02:00
Xiao Guangrong 81c52c56e2 KVM: do not treat noslot pfn as a error pfn
This patch filters noslot pfn out from error pfns based on Marcelo comment:
noslot pfn is not a error pfn

After this patch,
- is_noslot_pfn indicates that the gfn is not in slot
- is_error_pfn indicates that the gfn is in slot but the error is occurred
  when translate the gfn to pfn
- is_error_noslot_pfn indicates that the pfn either it is error pfns or it
  is noslot pfn
And is_invalid_pfn can be removed, it makes the code more clean

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-29 20:31:04 -02:00
Marcelo Tosatti 19bf7f8ac3 Merge remote-tracking branch 'master' into queue
Merge reason: development work has dependency on kvm patches merged
upstream.

Conflicts:
	arch/powerpc/include/asm/Kbuild
	arch/powerpc/include/asm/kvm_para.h

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-29 19:15:32 -02:00
Christoffer Dall 8ca40a70a7 KVM: Take kvm instead of vcpu to mmu_notifier_retry
The mmu_notifier_retry is not specific to any vcpu (and never will be)
so only take struct kvm as a parameter.

The motivation is the ARM mmu code that needs to call this from
somewhere where we long let go of the vcpu pointer.

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-23 13:35:43 +02:00
Gleb Natapov 7f46ddbd48 KVM: apic: fix LDR calculation in x2apic mode
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Chegu Vinod  <chegu_vinod@hp.com>
Tested-by: Chegu Vinod <chegu_vinod@hp.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-22 18:03:27 +02:00
Xiao Guangrong f3ac1a4b66 KVM: MMU: fix release noslot pfn
We can not directly call kvm_release_pfn_clean to release the pfn
since we can meet noslot pfn which is used to cache mmio info into
spte

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-22 18:03:25 +02:00
Borislav Petkov 1f5b77f51a KVM: SVM: Cleanup error statements
Use __func__ instead of the function name in svm_hardware_enable since
those things tend to get out of sync. This also slims down printk line
length in conjunction with using pr_err.

No functionality change.

Cc: Joerg Roedel <joro@8bytes.org>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-22 16:04:59 +02:00
Xiao Guangrong bf4ca23ef5 KVM: VMX: report internal error for MMIO #PF due to delivery event
The #PF with PFEC.RSV = 1 indicates that the guest is accessing MMIO, we
can not fix it if it is caused by delivery event. Reporting internal error
for this case

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-18 16:30:32 +02:00
Xiao Guangrong b9bf6882c1 KVM: VMX: report internal error for the unhandleable event
VM exits during Event Delivery is really unexpected if it is not caused
by Exceptions/EPT-VIOLATION/TASK_SWITCH, we'd better to report an internal
and freeze the guest, the VMM has the chance to check the guest

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-18 16:30:29 +02:00
Gleb Natapov 471842ec49 KVM: do not de-cache cr4 bits needlessly
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-18 14:49:28 +02:00
Xiao Guangrong bd6360cc0a KVM: MMU: introduce FNAME(prefetch_gpte)
The only difference between FNAME(update_pte) and FNAME(pte_prefetch)
is that the former is allowed to prefetch gfn from dirty logged slot,
so introduce a common function to prefetch spte

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-17 16:39:19 +02:00
Xiao Guangrong a052b42b0e KVM: MMU: move prefetch_invalid_gpte out of pagaing_tmp.h
The function does not depend on guest mmu mode, move it out from
paging_tmpl.h

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-17 16:39:18 +02:00
Xiao Guangrong d4878f24e3 KVM: MMU: cleanup FNAME(page_fault)
Let it return emulate state instead of spte like __direct_map

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-17 16:39:16 +02:00
Xiao Guangrong bd660776da KVM: MMU: remove mmu_is_invalid
Remove mmu_is_invalid and use is_invalid_pfn instead

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-17 16:39:15 +02:00
Jan Kiszka b6785def83 KVM: x86: Make emulator_fix_hypercall static
No users outside of kvm/x86.c.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-08 14:03:16 -03:00
Jan Kiszka 8b6e4547e0 KVM: x86: Convert kvm_arch_vcpu_reset into private kvm_vcpu_reset
There are no external callers of this function as there is no concept of
resetting a vcpu from generic code.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-10-08 14:02:35 -03:00
Linus Torvalds ecefbd94b8 KVM updates for the 3.7 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h
 WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+
 R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3
 eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e
 StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z
 VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO
 VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M
 TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE
 rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO
 B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj
 0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7
 lb/nbAsXf9UJLgGir4I1
 =kN6V
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights of the changes for this release include support for vfio
  level triggered interrupts, improved big real mode support on older
  Intels, a streamlines guest page table walker, guest APIC speedups,
  PIO optimizations, better overcommit handling, and read-only memory."

* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
  KVM: s390: Fix vcpu_load handling in interrupt code
  KVM: x86: Fix guest debug across vcpu INIT reset
  KVM: Add resampling irqfds for level triggered interrupts
  KVM: optimize apic interrupt delivery
  KVM: MMU: Eliminate pointless temporary 'ac'
  KVM: MMU: Avoid access/dirty update loop if all is well
  KVM: MMU: Eliminate eperm temporary
  KVM: MMU: Optimize is_last_gpte()
  KVM: MMU: Simplify walk_addr_generic() loop
  KVM: MMU: Optimize pte permission checks
  KVM: MMU: Update accessed and dirty bits after guest pagetable walk
  KVM: MMU: Move gpte_access() out of paging_tmpl.h
  KVM: MMU: Optimize gpte_access() slightly
  KVM: MMU: Push clean gpte write protection out of gpte_access()
  KVM: clarify kvmclock documentation
  KVM: make processes waiting on vcpu mutex killable
  KVM: SVM: Make use of asm.h
  KVM: VMX: Make use of asm.h
  KVM: VMX: Make lto-friendly
  KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
  ...

Conflicts:
	arch/s390/include/asm/processor.h
	arch/x86/kvm/i8259.c
2012-10-04 09:30:33 -07:00
Linus Torvalds ac07f5c3cb Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/fpu update from Ingo Molnar:
 "The biggest change is the addition of the non-lazy (eager) FPU saving
  support model and enabling it on CPUs with optimized xsaveopt/xrstor
  FPU state saving instructions.

  There are also various Sparse fixes"

Fix up trivial add-add conflict in arch/x86/kernel/traps.c

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
  x86, fpu: remove cpu_has_xmm check in the fx_finit()
  x86, fpu: make eagerfpu= boot param tri-state
  x86, fpu: enable eagerfpu by default for xsaveopt
  x86, fpu: decouple non-lazy/eager fpu restore from xsave
  x86, fpu: use non-lazy fpu restore for processors supporting xsave
  lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
  x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
  x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
  x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
  x86, fpu: drop_fpu() before restoring new state from sigframe
  x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
  x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
  x86, signal: Cleanup ifdefs and is_ia32, is_x32
2012-10-01 11:10:52 -07:00
Linus Torvalds 08815bc267 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/cleanups from Ingo Molnar:
 "Smaller cleanups"

* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  arch/x86: Remove unecessary semicolons
  x86, boot: Remove obsolete and unused constant RAMDISK
2012-10-01 10:47:45 -07:00
Linus Torvalds 7e92daaefa Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf update from Ingo Molnar:
 "Lots of changes in this cycle as well, with hundreds of commits from
  over 30 contributors.  Most of the activity was on the tooling side.

  Higher level changes:

   - New 'perf kvm' analysis tool, from Xiao Guangrong.

   - New 'perf trace' system-wide tracing tool

   - uprobes fixes + cleanups from Oleg Nesterov.

   - Lots of patches to make perf build on Android out of box, from
     Irina Tirdea

   - Extend ftrace function tracing utility to be more dynamic for its
     users.  It allows for data passing to the callback functions, as
     well as reading regs as if a breakpoint were to trigger at function
     entry.

     The main goal of this patch series was to allow kprobes to use
     ftrace as an optimized probe point when a probe is placed on an
     ftrace nop.  With lots of help from Masami Hiramatsu, and going
     through lots of iterations, we finally came up with a good
     solution.

   - Add cpumask for uncore pmu, use it in 'stat', from Yan, Zheng.

   - Various tracing updates from Steve Rostedt

   - Clean up and improve 'perf sched' performance by elliminating lots
     of needless calls to libtraceevent.

   - Event group parsing support, from Jiri Olsa

   - UI/gtk refactorings and improvements from Namhyung Kim

   - Add support for non-tracepoint events in perf script python, from
     Feng Tang

   - Add --symbols to 'script', similar to the one in 'report', from
     Feng Tang.

  Infrastructure enhancements and fixes:

   - Convert the trace builtins to use the growing evsel/evlist
     tracepoint infrastructure, removing several open coded constructs
     like switch like series of strcmp to dispatch events, etc.
     Basically what had already been showcased in 'perf sched'.

   - Add evsel constructor for tracepoints, that uses libtraceevent just
     to parse the /format events file, use it in a new 'perf test' to
     make sure the libtraceevent format parsing regressions can be more
     readily caught.

   - Some strange errors were happening in some builds, but not on the
     next, reported by several people, problem was some parser related
     files, generated during the build, didn't had proper make deps, fix
     from Eric Sandeen.

   - Introduce struct and cache information about the environment where
     a perf.data file was captured, from Namhyung Kim.

   - Fix handling of unresolved samples when --symbols is used in
     'report', from Feng Tang.

   - Add union member access support to 'probe', from Hyeoncheol Lee.

   - Fixups to die() removal, from Namhyung Kim.

   - Render fixes for the TUI, from Namhyung Kim.

   - Don't enable annotation in non symbolic view, from Namhyung Kim.

   - Fix pipe mode in 'report', from Namhyung Kim.

   - Move related stats code from stat to util/, will be used by the
     'stat' kvm tool, from Xiao Guangrong.

   - Remove die()/exit() calls from several tools.

   - Resolve vdso callchains, from Jiri Olsa

   - Don't pass const char pointers to basename, so that we can
     unconditionally use libgen.h and thus avoid ifdef BIONIC lines,
     from David Ahern

   - Refactor hist formatting so that it can be reused with the GTK
     browser, From Namhyung Kim

   - Fix build for another rbtree.c change, from Adrian Hunter.

   - Make 'perf diff' command work with evsel hists, from Jiri Olsa.

   - Use the only field_sep var that is set up: symbol_conf.field_sep,
     fix from Jiri Olsa.

   - .gitignore compiled python binaries, from Namhyung Kim.

   - Get rid of die() in more libtraceevent places, from Namhyung Kim.

   - Rename libtraceevent 'private' struct member to 'priv' so that it
     works in C++, from Steven Rostedt

   - Remove lots of exit()/die() calls from tools so that the main perf
     exit routine can take place, from David Ahern

   - Fix x86 build on x86-64, from David Ahern.

   - {int,str,rb}list fixes from Suzuki K Poulose

   - perf.data header fixes from Namhyung Kim

   - Allow user to indicate objdump path, needed in cross environments,
     from Maciek Borzecki

   - Fix hardware cache event name generation, fix from Jiri Olsa

   - Add round trip test for sw, hw and cache event names, catching the
     problem Jiri fixed, after Jiri's patch, the test passes
     successfully.

   - Clean target should do clean for lib/traceevent too, fix from David
     Ahern

   - Check the right variable for allocation failure, fix from Namhyung
     Kim

   - Set up evsel->tp_format regardless of evsel->name being set
     already, fix from Namhyung Kim

   - Oprofile fixes from Robert Richter.

   - Remove perf_event_attr needless version inflation, from Jiri Olsa

   - Introduce libtraceevent strerror like error reporting facility,
     from Namhyung Kim

   - Add pmu mappings to perf.data header and use event names from cmd
     line, from Robert Richter

   - Fix include order for bison/flex-generated C files, from Ben
     Hutchings

   - Build fixes and documentation corrections from David Ahern

   - Assorted cleanups from Robert Richter

   - Let O= makes handle relative paths, from Steven Rostedt

   - perf script python fixes, from Feng Tang.

   - Initial bash completion support, from Frederic Weisbecker

   - Allow building without libelf, from Namhyung Kim.

   - Support DWARF CFI based unwind to have callchains when %bp based
     unwinding is not possible, from Jiri Olsa.

   - Symbol resolution fixes, while fixing support PPC64 files with an
     .opt ELF section was the end goal, several fixes for code that
     handles all architectures and cleanups are included, from Cody
     Schafer.

   - Assorted fixes for Documentation and build in 32 bit, from Robert
     Richter

   - Cache the libtraceevent event_format associated to each evsel
     early, so that we avoid relookups, i.e.  calling pevent_find_event
     repeatedly when processing tracepoint events.

     [ This is to reduce the surface contact with libtraceevents and
        make clear what is that the perf tools needs from that lib: so
        far parsing the common and per event fields.  ]

   - Don't stop the build if the audit libraries are not installed, fix
     from Namhyung Kim.

   - Fix bfd.h/libbfd detection with recent binutils, from Markus
     Trippelsdorf.

   - Improve warning message when libunwind devel packages not present,
     from Jiri Olsa"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
  perf trace: Add aliases for some syscalls
  perf probe: Print an enum type variable in "enum variable-name" format when showing accessible variables
  perf tools: Check libaudit availability for perf-trace builtin
  perf hists: Add missing period_* fields when collapsing a hist entry
  perf trace: New tool
  perf evsel: Export the event_format constructor
  perf evsel: Introduce rawptr() method
  perf tools: Use perf_evsel__newtp in the event parser
  perf evsel: The tracepoint constructor should store sys:name
  perf evlist: Introduce set_filter() method
  perf evlist: Renane set_filters method to apply_filters
  perf test: Add test to check we correctly parse and match syscall open parms
  perf evsel: Handle endianity in intval method
  perf evsel: Know if byte swap is needed
  perf tools: Allow handling a NULL cpu_map as meaning "all cpus"
  perf evsel: Improve tracepoint constructor setup
  tools lib traceevent: Fix error path on pevent_parse_event
  perf test: Fix build failure
  trace: Move trace event enable from fs_initcall to core_initcall
  tracing: Add an option for disabling markers
  ...
2012-10-01 10:28:49 -07:00
Jan Kiszka c863901075 KVM: x86: Fix guest debug across vcpu INIT reset
If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.

Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This will change to focus of the set_guest_debug vendor op to
update_dp_bp_intercept.

Found while trying to stop on start_secondary.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-23 15:00:07 +02:00
Alex Williamson 7a84428af7 KVM: Add resampling irqfds for level triggered interrupts
To emulate level triggered interrupts, add a resample option to
KVM_IRQFD.  When specified, a new resamplefd is provided that notifies
the user when the irqchip has been resampled by the VM.  This may, for
instance, indicate an EOI.  Also in this mode, posting of an interrupt
through an irqfd only asserts the interrupt.  On resampling, the
interrupt is automatically de-asserted prior to user notification.
This enables level triggered interrupts to be posted and re-enabled
from vfio with no userspace intervention.

All resampling irqfds can make use of a single irq source ID, so we
reserve a new one for this interface.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-23 13:50:15 +02:00
Suresh Siddha b1a74bf821 x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
Preemption is disabled between kernel_fpu_begin/end() and as such
it is not a good idea to use these routines in kvm_load/put_guest_fpu()
which can be very far apart.

kvm_load/put_guest_fpu() routines are already called with
preemption disabled and KVM already uses the preempt notifier to save
the guest fpu state using kvm_put_guest_fpu().

So introduce __kernel_fpu_begin/end() routines which don't touch
preemption and use them instead of kernel_fpu_begin/end()
for KVM's use model of saving/restoring guest FPU state.

Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit
state in the case of VMX. For eagerFPU case, host cr0.TS is always clear.
So no need to worry about it. For the traditional lazyFPU restore case,
change the cr0.TS bit for the host state during vm-exit to be always clear
and cr0.TS bit is set in the __vmx_load_host_state() when the FPU
(guest FPU or the host task's FPU) state is not active. This ensures
that the host/guest FPU state is properly saved, restored
during context-switch and with interrupts (using irq_fpu_usable()) not
stomping on the active FPU state.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-09-21 16:59:04 -07:00
Xiao Guangrong 26bf264e87 KVM: x86: Export svm/vmx exit code and vector code to userspace
Exporting KVM exit information to userspace to be consumed by perf.

Signed-off-by: Dong Hao <haodong@linux.vnet.ibm.com>
[ Dong Hao <haodong@linux.vnet.ibm.com>: rebase it on acme's git tree ]
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Runzhen Wang <runzhen@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/1347870675-31495-2-git-send-email-haodong@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2012-09-21 12:48:09 -03:00
Gleb Natapov 1e08ec4a13 KVM: optimize apic interrupt delivery
Most interrupt are delivered to only one vcpu. Use pre-build tables to
find interrupt destination instead of looping through all vcpus. In case
of logical mode loop only through vcpus in a logical cluster irq is sent
to.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 15:05:26 +03:00
Avi Kivity c5421519f3 KVM: MMU: Eliminate pointless temporary 'ac'
'ac' essentially reconstructs the 'access' variable we already
have, except for the PFERR_PRESENT_MASK and PFERR_RSVD_MASK.  As
these are not used by callees, just use 'access' directly.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:10 +03:00
Avi Kivity b514c30f77 KVM: MMU: Avoid access/dirty update loop if all is well
Keep track of accessed/dirty bits; if they are all set, do not
enter the accessed/dirty update loop.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:09 +03:00
Avi Kivity 71331a1da1 KVM: MMU: Eliminate eperm temporary
'eperm' is no longer used in the walker loop, so we can eliminate it.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:09 +03:00
Avi Kivity 6fd01b711b KVM: MMU: Optimize is_last_gpte()
Instead of branchy code depending on level, gpte.ps, and mmu configuration,
prepare everything in a bitmap during mode changes and look it up during
runtime.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:09 +03:00
Avi Kivity 13d22b6aeb KVM: MMU: Simplify walk_addr_generic() loop
The page table walk is coded as an infinite loop, with a special
case on the last pte.

Code it as an ordinary loop with a termination condition on the last
pte (large page or walk length exhausted), and put the last pte handling
code after the loop where it belongs.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:08 +03:00
Avi Kivity 97d64b7881 KVM: MMU: Optimize pte permission checks
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup.  It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.

Optimize this away by precalculating all variants and storing them in a
bitmap.  The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).

The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.

The result is short, branch-free code.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:08 +03:00
Avi Kivity 8cbc70696f KVM: MMU: Update accessed and dirty bits after guest pagetable walk
While unspecified, the behaviour of Intel processors is to first
perform the page table walk, then, if the walk was successful, to
atomically update the accessed and dirty bits of walked paging elements.

While we are not required to follow this exactly, doing so will allow us
to perform the access permissions check after the walk is complete, rather
than after each walk step.

(the tricky case is SMEP: a zero in any pte's U bit makes the referenced
page a supervisor page, so we can't fault on a one bit during the walk
itself).

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:08 +03:00
Avi Kivity 3d34adec70 KVM: MMU: Move gpte_access() out of paging_tmpl.h
We no longer rely on paging_tmpl.h defines; so we can move the function
to mmu.c.

Rely on zero extension to 64 bits to get the correct nx behaviour.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:08 +03:00
Avi Kivity edc2ae84eb KVM: MMU: Optimize gpte_access() slightly
If nx is disabled, then is gpte[63] is set we will hit a reserved
bit set fault before checking permissions; so we can ignore the
setting of efer.nxe.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:07 +03:00
Avi Kivity 8ea667f259 KVM: MMU: Push clean gpte write protection out of gpte_access()
gpte_access() computes the access permissions of a guest pte and also
write-protects clean gptes.  This is wrong when we are servicing a
write fault (since we'll be setting the dirty bit momentarily) but
correct when instantiating a speculative spte, or when servicing a
read fault (since we'll want to trap a following write in order to
set the dirty bit).

It doesn't seem to hurt in practice, but in order to make the code
readable, push the write protection out of gpte_access() and into
a new protect_clean_gpte() which is called explicitly when needed.

Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-20 13:00:07 +03:00
Peter Senna Tschudin 4b8073e467 arch/x86: Remove unecessary semicolons
Found by http://coccinelle.lip6.fr/

Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Cc: avi@redhat.com
Cc: mtosatti@redhat.com
Cc: a.p.zijlstra@chello.nl
Cc: rusty@rustcorp.com.au
Cc: masami.hiramatsu.pt@hitachi.com
Cc: suresh.b.siddha@intel.com
Cc: joerg.roedel@amd.com
Cc: agordeev@redhat.com
Cc: yinghai@kernel.org
Cc: bhelgaas@google.com
Cc: liuj97@gmail.com
Link: http://lkml.kernel.org/r/1347986174-30287-7-git-send-email-peter.senna@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-19 17:32:48 +02:00
Suresh Siddha 9c1c3fac53 x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
kvm's guest fpu save/restore should be wrapped around
kernel_fpu_begin/end(). This will avoid for example taking a DNA
in kvm_load_guest_fpu() when it tries to load the fpu immediately
after doing unlazy_fpu() on the host side.

More importantly this will prevent the host process fpu from being
corrupted.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-4-git-send-email-suresh.b.siddha@intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-09-18 15:52:07 -07:00
Michael S. Tsirkin 9fc77441e5 KVM: make processes waiting on vcpu mutex killable
vcpu mutex can be held for unlimited time so
taking it with mutex_lock on an ioctl is wrong:
one process could be passed a vcpu fd and
call this ioctl on the vcpu used by another process,
it will then be unkillable until the owner exits.

Call mutex_lock_killable instead and return status.
Note: mutex_lock_interruptible would be even nicer,
but I am not sure all users are prepared to handle EINTR
from these ioctls. They might misinterpret it as an error.

Cleanup paths expect a vcpu that can't be used by
any userspace so this will always succeed - catch bugs
by calling BUG_ON.

Catch callers that don't check return state by adding
__must_check.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 13:46:32 -03:00
Avi Kivity 7454766f7b KVM: SVM: Make use of asm.h
Use macros for bitness-insensitive register names, instead of
rolling our own.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 10:38:05 -03:00
Avi Kivity b188c81f2e KVM: VMX: Make use of asm.h
Use macros for bitness-insensitive register names, instead of
rolling our own.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 10:38:04 -03:00
Avi Kivity 83287ea420 KVM: VMX: Make lto-friendly
LTO (link-time optimization) doesn't like local labels to be referred to
from a different function, since the two functions may be built in separate
compilation units.  Use an external variable instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 10:38:03 -03:00
Takuya Yoshikawa ecba9a52ac KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
find_highest_vector() and count_vectors():
 - Instead of using magic values, define and use proper macros.

find_highest_vector():
 - Remove likely() which is there only for historical reasons and not
   doing correct branch predictions anymore.  Using such heuristics
   to optimize this function is not worth it now.  Let CPUs predict
   things instead.

 - Stop checking word[0] separately.  This was only needed for doing
   likely() optimization.

 - Use for loop, not while, to iterate over the register array to make
   the code clearer.

Note that we actually confirmed that the likely() did wrong predictions
by inserting debug code.

Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-12 13:38:23 -03:00
Xiao Guangrong 4484141a94 KVM: fix error paths for failed gfn_to_page() calls
This bug was triggered:
[ 4220.198458] BUG: unable to handle kernel paging request at fffffffffffffffe
[ 4220.203907] IP: [<ffffffff81104d85>] put_page+0xf/0x34
......
[ 4220.237326] Call Trace:
[ 4220.237361]  [<ffffffffa03830d0>] kvm_arch_destroy_vm+0xf9/0x101 [kvm]
[ 4220.237382]  [<ffffffffa036fe53>] kvm_put_kvm+0xcc/0x127 [kvm]
[ 4220.237401]  [<ffffffffa03702bc>] kvm_vcpu_release+0x18/0x1c [kvm]
[ 4220.237407]  [<ffffffff81145425>] __fput+0x111/0x1ed
[ 4220.237411]  [<ffffffff8114550f>] ____fput+0xe/0x10
[ 4220.237418]  [<ffffffff81063511>] task_work_run+0x5d/0x88
[ 4220.237424]  [<ffffffff8104c3f7>] do_exit+0x2bf/0x7ca

The test case:

	printf(fmt, ##args);		\
	exit(-1);} while (0)

static int create_vm(void)
{
	int sys_fd, vm_fd;

	sys_fd = open("/dev/kvm", O_RDWR);
	if (sys_fd < 0)
		die("open /dev/kvm fail.\n");

	vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0);
	if (vm_fd < 0)
		die("KVM_CREATE_VM fail.\n");

	return vm_fd;
}

static int create_vcpu(int vm_fd)
{
	int vcpu_fd;

	vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
	if (vcpu_fd < 0)
		die("KVM_CREATE_VCPU ioctl.\n");
	printf("Create vcpu.\n");
	return vcpu_fd;
}

static void *vcpu_thread(void *arg)
{
	int vm_fd = (int)(long)arg;

	create_vcpu(vm_fd);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t thread;
	int vm_fd;

	(void)argc;
	(void)argv;

	vm_fd = create_vm();
	pthread_create(&thread, NULL, vcpu_thread, (void *)(long)vm_fd);
	printf("Exit.\n");
	return 0;
}

It caused by release kvm->arch.ept_identity_map_addr which is the
error page.

The parent thread can send KILL signal to the vcpu thread when it was
exiting which stops faulting pages and potentially allocating memory.
So gfn_to_pfn/gfn_to_page may fail at this time

Fixed by checking the page before it is used

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-10 11:34:11 +03:00
Xiao Guangrong 7de5bdc96c KVM: MMU: remove unnecessary check
Checking the return of kvm_mmu_get_page is unnecessary since it is
guaranteed by memory cache

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-10 11:26:16 +03:00
Liu, Jinsong 92b5265d38 KVM: Depend on HIGH_RES_TIMERS
KVM lapic timer and tsc deadline timer based on hrtimer,
setting a leftmost node to rb tree and then do hrtimer reprogram.
If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram
do nothing and then make kvm lapic timer and tsc deadline timer fail.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-10 11:10:03 +03:00
Ren, Yongjie 4f97704555 KVM: x86: Check INVPCID feature bit in EBX of leaf 7
Checks and operations on the INVPCID feature bit should use EBX
of CPUID leaf 7 instead of ECX.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Yongjie Ren <yongjien.ren@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-09 17:34:01 +03:00
Michael S. Tsirkin a50abc3b2b KVM: use symbolic constant for nr interrupts
interrupt_bitmap is KVM_NR_INTERRUPTS bits in size,
so just use that instead of hard-coded constants
and math.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 18:37:44 +03:00
Gleb Natapov b3356bf0db KVM: emulator: optimize "rep ins" handling
Optimize "rep ins" by allowing emulator to write back more than one
datum at a time. Introduce new operand type OP_MEM_STR which tells
writeback() that dst contains pointer to an array that should be written
back as opposite to just one data element.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 18:07:38 +03:00
Gleb Natapov f3bd64c68a KVM: emulator: string_addr_inc() cleanup
Remove unneeded segment argument. Address structure already has correct
segment which was put there during decode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 18:07:01 +03:00
Gleb Natapov 9d1b39a967 KVM: emulator: make x86 emulation modes enum instead of defines
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 18:07:01 +03:00
Gleb Natapov 716d51abff KVM: Provide userspace IO exit completion callback
Current code assumes that IO exit was due to instruction emulation
and handles execution back to emulator directly. This patch adds new
userspace IO exit completion callback that can be set by any other code
that caused IO exit to userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 18:06:37 +03:00
Marcelo Tosatti 3b4dc3a031 KVM: move postcommit flush to x86, as mmio sptes are x86 specific
Other arches do not need this.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

v2: fix incorrect deletion of mmio sptes on gpa move (noticed by Takuya)
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 16:37:30 +03:00
Marcelo Tosatti 2df72e9bc4 KVM: split kvm_arch_flush_shadow
Introducing kvm_arch_flush_shadow_memslot, to invalidate the
translations of a single memory slot.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-06 16:37:25 +03:00
Mathias Krause 09941fbb71 KVM: SVM: constify lookup tables
We never modify direct_access_msrs[], msrpm_ranges[],
svm_exit_handlers[] or x86_intercept_map[] at runtime.
Mark them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:42:14 +03:00
Mathias Krause 772e031899 KVM: VMX: constify lookup tables
We use vmcs_field_to_offset_table[], kvm_vmx_segment_fields[] and
kvm_vmx_exit_handlers[] as lookup tables only -- make them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:42:09 +03:00
Mathias Krause f1d248315a KVM: x86: more constification
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:42:02 +03:00
Mathias Krause 0fbe9b0b19 KVM: x86: constify read_write_emulator_ops
We never change those, make them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:41:54 +03:00
Mathias Krause 0225fb509d KVM: x86 emulator: constify emulate_ops
We never change emulate_ops[] at runtime so it should be r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:41:48 +03:00
Mathias Krause fd0a0d8208 KVM: x86 emulator: mark opcode tables const
The opcode tables never change at runtime, therefor mark them const.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:41:28 +03:00
Mathias Krause 89a87c6779 KVM: x86 emulator: use aligned variants of SSE register ops
As the the compiler ensures that the memory operand is always aligned
to a 16 byte memory location, use the aligned variant of MOVDQ for
read_sse_reg() and write_sse_reg().

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:41:11 +03:00
Mathias Krause 326d07cb30 KVM: x86: minor size optimization
Some fields can be constified and/or made static to reduce code and data
size.

Numbers for a 32 bit build:

        text    data     bss     dec     hex filename
before: 3351      80       0    3431     d67 cpuid.o
 after: 3391       0       0    3391     d3f cpuid.o

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:41:09 +03:00
Jamie Iles 749c59fd15 KVM: PIC: fix use of uninitialised variable.
Commit aea218f3cb (KVM: PIC: call ack notifiers for irqs that are
dropped form irr) used an uninitialised variable to track whether an
appropriate apic had been found.  This could result in calling the ack
notifier incorrectly.

Cc: Gleb Natapov <gleb@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Jamie Iles <jamie@jamieiles.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-04 15:44:42 +03:00
Gleb Natapov ec798660cf KVM: cleanup pic reset
kvm_pic_reset() is not used anywhere. Move reset logic from
pic_ioport_write() there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-04 14:53:51 +03:00
Marcelo Tosatti 9a7819774e KVM: x86: remove unused variable from kvm_task_switch()
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-30 17:45:54 -03:00
Avi Kivity a81aba14dc KVM: VMX: Ignore segment G and D bits when considering whether we can virtualize
We will enter the guest with G and D cleared; as real hardware ignores D in
real mode, and G is taken care of by the limit test, we allow more code to
run in vm86 mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity ce56680347 KVM: VMX: Save all segment data in real mode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity 1390a28b27 KVM: VMX: Preserve segment limit and access rights in real mode
While this is undocumented, real processors do not reload the segment
limit and access rights when loading a segment register in real mode.
Real programs rely on it so we need to comply with this behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity 7263642028 KVM: VMX: Return real real-mode segment data even if emulate_invalid_guest_state=1
emulate_invalid_guest_state=1 doesn't mean we don't munge the segments in the
vmcs; we do.  So we need to return the real ones (maintained by vmx_set_segment).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity 0afbe2f878 KVM: x86 emulator: Fix #GP error code during linearization
We want the segment selector, nor segment number.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity a5625189f6 KVM: x86 emulator: Check segment limits in real mode too
Segment limits are verified in real mode, not just protected mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity 03ebebeb1f KVM: x86 emulator: Leave segment limit and attributs alone in real mode
When loading a segment in real mode, only the base and selector must
be modified.  The limit needs to be left alone, otherwise big real mode
users will hit a #GP due to limit checking (currently this is suppressed
because we don't check limits in real mode).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity e2a610d7fc KVM: VMX: Allow vm86 virtualization of big real mode
Usually, big real mode uses large (4GB) segments.  Currently we don't
virtualize this; if any segment has a limit other than 0xffff, we emulate.
But if we set the vmx-visible limit to 0xffff, we can use vm86 to virtualize
real mode; if an access overruns the segment limit, the guest will #GP, which
we will trap and forward to the emulator.  This results in significantly
faster execution, and less risk of hitting an unemulated instruction.

If the limit is less than 0xffff, we retain the existing behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity 495e116684 KVM: VMX: Allow real mode emulation using vm86 with dpl=0
Real mode is always entered from protected mode with dpl=0.  Since
the dpl doesn't affect execution, and we already override it to 3
in the vmcs (as vmx requires), we can allow execution in that state.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity c865c43de6 KVM: VMX: Retain limit and attributes when entering protected mode
Real processors don't change segment limits and attributes while in
real mode.  Mimic that behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity f5f7b2fe3b KVM: VMX: Use kvm_segment to save protected-mode segments when entering realmode
Instead of using struct kvm_save_segment, use struct kvm_segment, which is what
the other APIs use.  This leads to some simplification.

We replace save_rmode_seg() with a call to vmx_save_segment().  Since this depends
on rmode.vm86_active, we move the call to before setting the flag.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity 72fbefec26 KVM: VMX: Fix incorrect lookup of segment S flag in fix_pmode_dataseg()
fix_pmode_dataseg() looks up S in ->base instead of ->ar_bytes.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity baa7e81e32 KVM: VMX: Separate saving pre-realmode state from setting segments
Commit b246dd5df1 ("KVM: VMX: Fix KVM_SET_SREGS with big real mode
segments") moved fix_rmode_seg() to vmx_set_segment(), so that it is
applied not just on transitions to real mode, but also on KVM_SET_SREGS
(migration).  However fix_rmode_seg() not only munges the vmcs segments,
it also sets up the save area for us to restore when returning to
protected mode or to return in vmx_get_segment().

Move saving the segment into a new function, save_rmode_seg(), and
call it just during the transition.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity dd856efafe KVM: x86 emulator: access GPRs on demand
Instead of populating the entire register file, read in registers
as they are accessed, and write back only the modified ones.  This
saves a VMREAD and VMWRITE on Intel (for rsp, since it is not usually
used during emulation), and a two 128-byte copies for the registers.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 18:38:55 -03:00
Michael S. Tsirkin 1d92128fe9 KVM: x86: fix KVM_GET_MSR for PV EOI
KVM_GET_MSR was missing support for PV EOI,
which is needed for migration.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 18:03:05 -03:00
Marcelo Tosatti c78aa4c4b9 Merge remote-tracking branch 'upstream/master' into queue
Merging critical fixes from upstream required for development.

* upstream/master: (809 commits)
  libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry
  Revert "powerpc: Update g5_defconfig"
  powerpc/perf: Use pmc_overflow() to detect rolled back events
  powerpc: Fix VMX in interrupt check in POWER7 copy loops
  powerpc: POWER7 copy_to_user/copy_from_user patch applied twice
  powerpc: Fix personality handling in ppc64_personality()
  powerpc/dma-iommu: Fix IOMMU window check
  powerpc: Remove unnecessary ifdefs
  powerpc/kgdb: Restore current_thread_info properly
  powerpc/kgdb: Bail out of KGDB when we've been triggered
  powerpc/kgdb: Do not set kgdb_single_step on ppc
  powerpc/mpic_msgr: Add missing includes
  powerpc: Fix null pointer deref in perf hardware breakpoints
  powerpc: Fixup whitespace in xmon
  powerpc: Fix xmon dl command for new printk implementation
  xfs: check for possible overflow in xfs_ioc_trim
  xfs: unlock the AGI buffer when looping in xfs_dialloc
  xfs: fix uninitialised variable in xfs_rtbuf_get()
  powerpc/fsl: fix "Failed to mount /dev: No such device" errors
  powerpc/fsl: update defconfigs
  ...

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-26 13:58:41 -03:00
Avi Kivity 5ad105e569 KVM: x86 emulator: use stack size attribute to mask rsp in stack ops
The sub-register used to access the stack (sp, esp, or rsp) is not
determined by the address size attribute like other memory references,
but by the stack segment's B bit (if not in x86_64 mode).

Fix by using the existing stack_mask() to figure out the correct mask.

This long-existing bug was exposed by a combination of a27685c33a
(emulate invalid guest state by default), which causes many more
instructions to be emulated, and a seabios change (possibly a bug) which
causes the high 16 bits of esp to become polluted across calls to real
mode software interrupts.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-22 18:54:26 -03:00
Takuya Yoshikawa 35f2d16bb9 KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended
Although the possible race described in

  commit 85b7059169
  KVM: MMU: fix shrinking page from the empty mmu

was correct, the real cause of that issue was a more trivial bug of
mmu_shrink() introduced by

  commit 1952639665
  KVM: MMU: do not iterate over all VMs in mmu_shrink()

Here is the bug:

	if (kvm->arch.n_used_mmu_pages > 0) {
		if (!nr_to_scan--)
			break;
		continue;
	}

We skip VMs whose n_used_mmu_pages is not zero and try to shrink others:
in other words we try to shrink empty ones by mistake.

This patch reverses the logic so that mmu_shrink() can free pages from
the first VM whose n_used_mmu_pages is not zero.  Note that we also add
comments explaining the role of nr_to_scan which is not practically
important now, hoping this will be improved in the future.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:27:13 +03:00
Xiao Guangrong 4d8b81abc4 KVM: introduce readonly memslot
In current code, if we map a readonly memory space from host to guest
and the page is not currently mapped in the host, we will get a fault
pfn and async is not allowed, then the vm will crash

We introduce readonly memory region to map ROM/ROMD to the guest, read access
is happy for readonly memslot, write access on readonly memslot will cause
KVM_EXIT_MMIO exit

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:09:03 +03:00
Xiao Guangrong 037d92dc5d KVM: introduce gfn_to_pfn_memslot_atomic
It can instead of hva_to_pfn_atomic

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:08:52 +03:00
Xiao Guangrong 8e3d9d061b KVM: x86: fix possible infinite loop caused by reexecute_instruction
Currently, we reexecute all unhandleable instructions if they do not
access on the mmio, however, it can not work if host map the readonly
memory to guest. If the instruction try to write this kind of memory,
it will fault again when guest retry it, then we will goto a infinite
loop: retry instruction -> write #PF -> emulation fail ->
retry instruction -> ...

Fix it by retrying the instruction only when it faults on the writable
memory

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-22 15:08:49 +03:00
Michael S. Tsirkin 28a6fdabb3 KVM: x86: drop parameter validation in ioapic/pic
We validate irq pin number when routing is setup, so
code handling illegal irq # in pic and ioapic on each injection
is never called.
Drop it, replace with BUG_ON to catch out of bounds access bugs.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-14 22:35:22 -03:00
Avi Kivity dbcb4e7980 KVM: VMX: Advertize RDTSC exiting to nested guests
All processors that support VMX have that feature, and guests (Xen) depend on
it.  As we already implement it, advertize it to the guest.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 19:08:28 -03:00
Gleb Natapov 2a7921b7a0 KVM: VMX: restore MSR_IA32_DEBUGCTLMSR after VMEXIT
MSR_IA32_DEBUGCTLMSR is zeroed on VMEXIT. Restore it to the correct
value.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 19:07:58 -03:00
Marcelo Tosatti 51d59c6b42 KVM: x86: fix pvclock guest stopped flag reporting
kvm_guest_time_update unconditionally clears hv_clock.flags field,
so the notification never reaches the guest.

Fix it by allowing PVCLOCK_GUEST_STOPPED to passthrough.

Reviewed-by: Eric B Munson <emunson@mgebm.net>
Reviewed-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 16:10:45 -03:00
Gleb Natapov 64eb062029 KVM: correctly detect APIC SW state in kvm_apic_post_state_restore()
For apic_set_spiv() to track APIC SW state correctly it needs to see
previous and next values of the spurious vector register, but currently
memset() overwrite the old value before apic_set_spiv() get a chance to
do tracking. Fix it by calling apic_set_spiv() before overwriting old
value.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-09 12:44:46 +03:00
Gleb Natapov c48f14966c KVM: inline kvm_apic_present() and kvm_lapic_enabled()
Those functions are used during interrupt injection. When inlined they
become nops on the fast path.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 19:00:45 +03:00
Gleb Natapov 54e9818f39 KVM: use jump label to optimize checking for in kernel local apic presence
Usually all vcpus have local apic pointer initialized, so the check may
be completely skipped.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 19:00:44 +03:00
Gleb Natapov f8c1ea1039 KVM: use jump label to optimize checking for SW enabled apic in spurious interrupt register
Usually all APICs are SW enabled so the check can be optimized out.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 19:00:43 +03:00
Gleb Natapov c5cc421ba3 KVM: use jump label to optimize checking for HW enabled APIC in APIC_BASE MSR
Usually all APICs are HW enabled so the check can be optimized out.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 19:00:43 +03:00
Gleb Natapov 6aed64a8a4 KVM: mark apic enabled on start up
According to SDM apic is enabled on start up.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:20:14 +03:00
Gleb Natapov 5dbc8f3fed KVM: use kvm_lapic_set_base() to change apic_base
Do not change apic_base directly. Use kvm_lapic_set_base() instead.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:20:05 +03:00
Gleb Natapov 8a5a87d9b7 KVM: clean up kvm_(set|get)_apic_base
kvm_get_apic_base() needlessly checks irqchip_in_kernel although it does
the same no matter what result of the check is. kvm_set_apic_base() also
checks for irqchip_in_kernel, but kvm_lapic_set_base() can handle this
case.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:20:03 +03:00
Xiao Guangrong 32cad84f44 KVM: do not release the error page
After commit a2766325cf, the error page is replaced by the
error code, it need not be released anymore

[ The patch has been compiling tested for powerpc ]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:58 +03:00
Xiao Guangrong cb9aaa30b1 KVM: do not release the error pfn
After commit a2766325cf, the error pfn is replaced by the
error code, it need not be released anymore

[ The patch has been compiling tested for powerpc ]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:57 +03:00
Xiao Guangrong e6c1502b3f KVM: introduce KVM_PFN_ERR_HWPOISON
Then, get_hwpoison_pfn and is_hwpoison_pfn can be removed

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:52 +03:00
Xiao Guangrong 6c8ee57be9 KVM: introduce KVM_PFN_ERR_FAULT
After that, the exported and un-inline function, get_fault_pfn,
can be removed

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:50 +03:00
Takuya Yoshikawa d89cc617b9 KVM: Push rmap into kvm_arch_memory_slot
Two reasons:
 - x86 can integrate rmap and rmap_pde and remove heuristics in
   __gfn_to_rmap().
 - Some architectures do not need rmap.

Since rmap is one of the most memory consuming stuff in KVM, ppc'd
better restrict the allocation to Book3S HV.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 12:47:30 +03:00
Takuya Yoshikawa 65fbe37c42 KVM: MMU: Use gfn_to_rmap() instead of directly reading rmap array
This helps to make rmap architecture specific in a later patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 12:47:04 +03:00
Takuya Yoshikawa aab2eb7a38 KVM: Stop checking rmap to see if slot is being created
Instead, check npages consistently.  This helps to make rmap
architecture specific in a later patch.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 12:47:02 +03:00
Gleb Natapov 439793d4b3 KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value
When MSR_KVM_PV_EOI_EN was added to msrs_to_save array
KVM_SAVE_MSRS_BEGIN was not updated accordingly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-05 16:52:38 +03:00
Avi Kivity fe56097b23 Merge remote-tracking branch 'upstream' into next
- bring back critical fixes (esp. aa67f6096c)
 - provide an updated base for development

* upstream: (4334 commits)
  missed mnt_drop_write() in do_dentry_open()
  UBIFS: nuke pdflush from comments
  gfs2: nuke pdflush from comments
  drbd: nuke pdflush from comments
  nilfs2: nuke write_super from comments
  hfs: nuke write_super from comments
  vfs: nuke pdflush from comments
  jbd/jbd2: nuke write_super from comments
  btrfs: nuke pdflush from comments
  btrfs: nuke write_super from comments
  ext4: nuke pdflush from comments
  ext4: nuke write_super from comments
  ext3: nuke write_super from comments
  Documentation: fix the VM knobs descritpion WRT pdflush
  Documentation: get rid of write_super
  vfs: kill write_super and sync_supers
  ACPI processor: Fix tick_broadcast_mask online/offline regression
  ACPI: Only count valid srat memory structures
  ACPI: Untangle a return statement for better readability
  Linux 3.6-rc1
  ...

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-05 13:25:10 +03:00
Gleb Natapov e115676e04 KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value
When MSR_KVM_PV_EOI_EN was added to msrs_to_save array
KVM_SAVE_MSRS_BEGIN was not updated accordingly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-03 15:08:22 -03:00
Avi Kivity aa67f6096c KVM: VMX: Fix ds/es corruption on i386 with preemption
Commit b2da15ac26 ("KVM: VMX: Optimize %ds, %es reload") broke i386
in the following scenario:

  vcpu_load
  ...
  vmx_save_host_state
  vmx_vcpu_run
  (ds.rpl, es.rpl cleared by hardware)

  interrupt
    push ds, es  # pushes bad ds, es
    schedule
      vmx_vcpu_put
        vmx_load_host_state
          reload ds, es (with __USER_DS)
    pop ds, es  # of other thread's stack
    iret
  # other thread runs
  interrupt
    push ds, es
    schedule  # back in vcpu thread
    pop ds, es  # now with rpl=0
    iret
  ...
  vcpu_put
  resume_userspace
  iret  # clears ds, es due to mismatched rpl

(instead of resume_userspace, we might return with SYSEXIT and then
take an exception; when the exception IRETs we end up with cleared
ds, es)

Fix by avoiding the optimization on i386 and reloading ds, es on the
lightweight exit path.

Reported-by: Chris Clayron <chris2553@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 20:23:57 -03:00
Bruce Rogers 4b6486659a KVM: x86: apply kvmclock offset to guest wall clock time
When a guest migrates to a new host, the system time difference from the
previous host is used in the updates to the kvmclock system time visible
to the guest, resulting in a continuation of correct kvmclock based guest
timekeeping.

The wall clock component of the kvmclock provided time is currently not
updated with this same time offset. Since the Linux guest caches the
wall clock based time, this discrepency is not noticed until the guest is
rebooted. After reboot the guest's time calculations are off.

This patch adjusts the wall clock by the kvmclock_offset, resulting in
correct guest time after a reboot.

Cc: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Bruce Rogers <brogers@suse.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 17:23:50 -03:00
Avi Kivity 26ef19242f KVM: fold kvm_pit_timer into kvm_kpit_state
One structure nests inside the other, providing no value at all.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 00:21:07 -03:00
Avi Kivity 9d9d2239bd KVM: Simplify kvm_pit_timer
'timer_mode_mask' is unused
'tscdeadline' is unused
't_ops' only adds needless indirection
'vcpu' is unused

Remove.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 00:21:07 -03:00
Avi Kivity 2a6eac9638 KVM: Simplify kvm_timer
'reinject' is never initialized
't_ops' only serves as indirection to lapic_is_periodic; call that directly
   instead
'kvm' is never used
'vcpu' can be derived via container_of

Remove these fields.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 00:21:06 -03:00
Avi Kivity e9d90d472d KVM: Remove internal timer abstraction
kvm_timer_fn(), the sole inhabitant of timer.c, is only used by lapic.c. Move
it there to make it easier to hack on it.

struct kvm_timer is a thin wrapper around hrtimer, and only adds obfuscation.
Move near its two users (with different names) to prepare for simplification.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 00:21:06 -03:00
Avi Kivity 4a4541a40e KVM: Don't update PPR on any APIC read
The current code will update the PPR on almost any APIC read; however
that's only required if we read the PPR.

kvm_update_ppr() shows up in some profiles, albeit with a low usage (~1%).
This should reduce it further (it will still be called during interrupt
processing).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-30 20:47:32 -03:00
Christoffer Dall 23d43cf998 KVM: Move KVM_IRQ_LINE to arch-generic code
Handle KVM_IRQ_LINE and KVM_IRQ_LINE_STATUS in the generic
kvm_vm_ioctl() function and call into kvm_vm_ioctl_irq_line().

This is even more relevant when KVM/ARM also uses this ioctl.

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 12:23:25 +03:00
Gleb Natapov 99245b507d KVM: x86 emulator: drop unneeded call to get_segment()
setup_syscalls_segments() calls get_segment() and than overwrites all
but one of the structure fields and this one should also be overwritten
anyway, so we can drop call to get_segment() and avoid a couple of vmreads
on vmx. Also drop zeroing ss/cs structures since most of the fields are
set anyway. Just set those that were not set explicitly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 12:21:29 +03:00
Gleb Natapov aea218f3cb KVM: PIC: call ack notifiers for irqs that are dropped form irr
After commit 242ec97c35 PIT interrupts are no longer delivered after
PIC reset. It happens because PIT injects interrupt only if previous one
was acked, but since on PIC reset it is dropped from irr it will never
be delivered and hence acknowledged. Fix that by calling ack notifier on
PIC reset.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 12:19:06 +03:00
Xiao Guangrong f23b070e66 KVM: x86 emulator: simplify read_emulated
No need split mmio read region into 8-bits pieces since we do it in
emulator_read_write_onepage

Changelog:
  Add a WARN_ON to check read-cache overflow

Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 12:10:58 +03:00
Xiao Guangrong 3b2bd2f800 KVM: MMU: use kvm_release_pfn_clean to release pfn
The current code depends on the fact that fault_page is the normal page,
however, we will use the error code instead of these dummy pages in the
later patch, so we use kvm_release_pfn_clean to release pfn which will
release the error code properly

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 11:55:30 +03:00
Avi Kivity e9bda6f6f9 Merge branch 'queue' into next
Merge patches queued during the run-up to the merge window.

* queue: (25 commits)
  KVM: Choose better candidate for directed yield
  KVM: Note down when cpu relax intercepted or pause loop exited
  KVM: Add config to support ple or cpu relax optimzation
  KVM: switch to symbolic name for irq_states size
  KVM: x86: Fix typos in pmu.c
  KVM: x86: Fix typos in lapic.c
  KVM: x86: Fix typos in cpuid.c
  KVM: x86: Fix typos in emulate.c
  KVM: x86: Fix typos in x86.c
  KVM: SVM: Fix typos
  KVM: VMX: Fix typos
  KVM: remove the unused parameter of gfn_to_pfn_memslot
  KVM: remove is_error_hpa
  KVM: make bad_pfn static to kvm_main.c
  KVM: using get_fault_pfn to get the fault pfn
  KVM: MMU: track the refcount when unmap the page
  KVM: x86: remove unnecessary mark_page_dirty
  KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range()
  KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp()
  KVM: MMU: Add memslot parameter to hva handlers
  ...

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-26 11:54:21 +03:00
Linus Torvalds 5fecc9d8f5 KVM updates for the 3.6 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQDRDNAAoJEI7yEDeUysxlkl8P/3C2AHx2webOU8sVzhfU6ONZ
 ZoGevwBjyZIeJEmiWVpFTTEew1l0PXtpyOocXGNUXIddVnhXTQOKr/Scj4uFbmx8
 ROqgK8NSX9+xOGrBPCoN7SlJkmp+m6uYtwYkl2SGnsEVLWMKkc7J7oqmszCcTQvN
 UXMf7G47/Ul2NUSBdv4Yvizhl4kpvWxluiweDw3E/hIQKN0uyP7CY58qcAztw8nG
 csZBAnnuPFwIAWxHXW3eBBv4UP138HbNDqJ/dujjocM6GnOxmXJmcZ6b57gh+Y64
 3+w9IR4qrRWnsErb/I8inKLJ1Jdcf7yV2FmxYqR4pIXay2Yzo1BsvFd6EB+JavUv
 pJpixrFiDDFoQyXlh4tGpsjpqdXNMLqyG4YpqzSZ46C8naVv9gKE7SXqlXnjyDlb
 Llx3hb9Fop8O5ykYEGHi+gIISAK5eETiQl4yw9RUBDpxydH4qJtqGIbLiDy8y9wi
 Xyi8PBlNl+biJFsK805lxURqTp/SJTC3+Zb7A7CzYEQm5xZw3W/CKZx1ZYBfpaa/
 pWaP6tB7JwgLIVXi4HQayLWqMVwH0soZIn9yazpOEFv6qO8d5QH5RAxAW2VXE3n5
 JDlrajar/lGIdiBVWfwTJLb86gv3QDZtIWoR9mZuLKeKWE/6PRLe7HQpG1pJovsm
 2AsN5bS0BWq+aqPpZHa5
 =pECD
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights include
  - full big real mode emulation on pre-Westmere Intel hosts (can be
    disabled with emulate_invalid_guest_state=0)
  - relatively small ppc and s390 updates
  - PCID/INVPCID support in guests
  - EOI avoidance; 3.6 guests should perform better on 3.6 hosts on
    interrupt intensive workloads)
  - Lockless write faults during live migration
  - EPT accessed/dirty bits support for new Intel processors"

Fix up conflicts in:
 - Documentation/virtual/kvm/api.txt:

   Stupid subchapter numbering, added next to each other.

 - arch/powerpc/kvm/booke_interrupts.S:

   PPC asm changes clashing with the KVM fixes

 - arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c:

   Duplicated commits through the kvm tree and the s390 tree, with
   subsequent edits in the KVM tree.

* tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
  KVM: fix race with level interrupts
  x86, hyper: fix build with !CONFIG_KVM_GUEST
  Revert "apic: fix kvm build on UP without IOAPIC"
  KVM guest: switch to apic_set_eoi_write, apic_write
  apic: add apic_set_eoi_write for PV use
  KVM: VMX: Implement PCID/INVPCID for guests with EPT
  KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
  KVM: PPC: Critical interrupt emulation support
  KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
  KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
  KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
  KVM: PPC: bookehv64: Add support for std/ld emulation.
  booke: Added crit/mc exception handler for e500v2
  booke/bookehv: Add host crit-watchdog exception support
  KVM: MMU: document mmu-lock and fast page fault
  KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
  KVM: MMU: trace fast page fault
  KVM: MMU: fast path of handling guest page fault
  KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
  KVM: MMU: fold tlb flush judgement into mmu_spte_update
  ...
2012-07-24 12:01:20 -07:00
Raghavendra K T f2a7434731 KVM: Add config to support ple or cpu relax optimzation
Suggested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-23 13:00:53 +03:00
Michael S. Tsirkin 93b6547e22 KVM: switch to symbolic name for irq_states size
Use PIC_NUM_PINS instead of hard-coded 16 for pic pins.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-20 16:12:16 -03:00
Michael S. Tsirkin 1a577b7247 KVM: fix race with level interrupts
When more than 1 source id is in use for the same GSI, we have the
following race related to handling irq_states race:

CPU 0 clears bit 0. CPU 0 read irq_state as 0. CPU 1 sets level to 1.
CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0).
Now ioapic thinks the level is 0 but irq_state is not 0.

Fix by performing all irq_states bitmap handling under pic/ioapic lock.
This also removes the need for atomics with irq_states handling.

Reported-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-20 16:12:00 -03:00
Guo Chao c7a7062fa0 KVM: x86: Fix typos in pmu.c
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-20 15:27:00 -03:00