linux

Commit Graph

Author	SHA1	Message	Date
Yang Zhang	a547c6db4d	KVM: VMX: Enable acknowledge interupt on vmexit The "acknowledge interrupt on exit" feature controls processor behavior for external interrupt acknowledgement. When this control is set, the processor acknowledges the interrupt controller to acquire the interrupt vector on VM exit. After enabling this feature, an interrupt which arrived when target cpu is running in vmx non-root mode will be handled by vmx handler instead of handler in idt. Currently, vmx handler only fakes an interrupt stack and jump to idt table to let real handler to handle it. Further, we will recognize the interrupt and only delivery the interrupt which not belong to current vcpu through idt table. The interrupt which belonged to current vcpu will be handled inside vmx handler. This will reduce the interrupt handle cost of KVM. Also, interrupt enable logic is changed if this feature is turnning on: Before this patch, hypervior call local_irq_enable() to enable it directly. Now IF bit is set on interrupt stack frame, and will be enabled on a return from interrupt handler if exterrupt interrupt exists. If no external interrupt, still call local_irq_enable() to enable it. Refer to Intel SDM volum 3, chapter 33.2. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-16 16:32:39 -03:00
Yang Zhang	aa2fbe6d44	KVM: Let ioapic know the irq line status Userspace may deliver RTC interrupt without query the status. So we want to track RTC EOI for this case. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-15 23:20:34 -03:00
Yang Zhang	106069193c	KVM: Add reset/restore rtc_status support restore rtc_status from migration or save/restore Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-15 23:20:34 -03:00
Yang Zhang	b4f2225c07	KVM: Return destination vcpu on interrupt injection Add a new parameter to know vcpus who received the interrupt. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-15 23:20:34 -03:00
Yang Zhang	1fcc7890db	KVM: Add vcpu info to ioapic_update_eoi() Add vcpu info to ioapic_update_eoi, so we can know which vcpu issued this EOI. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-04-15 23:20:33 -03:00
Jan Kiszka	c0d1c770c0	KVM: nVMX: Avoid reading VM_EXIT_INTR_ERROR_CODE needlessly on nested exits We only need to update vm_exit_intr_error_code if there is a valid exit interruption information and it comes with a valid error code. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 18:27:10 +03:00
Jan Kiszka	e8457c67a4	KVM: nVMX: Fix conditions for interrupt injection If we are entering guest mode, we do not want L0 to interrupt this vmentry with all its side effects on the vmcs. Therefore, injection shall be disallowed during L1->L2 transitions, as in the previous version. However, this check is conceptually independent of nested_exit_on_intr, so decouple it. If L1 traps external interrupts, we can kick the guest from L2 to L1, also just like the previous code worked. But we no longer need to consider L1's idt_vectoring_info_field. It will always be empty at this point. Instead, if L2 has pending events, those are now found in the architectural queues and will, thus, prevent vmx_interrupt_allowed from being called at all. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 18:27:09 +03:00
Jan Kiszka	5f3d579997	KVM: nVMX: Rework event injection and recovery The basic idea is to always transfer the pending event injection on vmexit into the architectural state of the VCPU and then drop it from there if it turns out that we left L2 to enter L1, i.e. if we enter prepare_vmcs12. vmcs12_save_pending_events takes care to transfer pending L0 events into the queue of L1. That is mandatory as L1 may decide to switch the guest state completely, invalidating or preserving the pending events for later injection (including on a different node, once we support migration). This concept is based on the rule that a pending vmlaunch/vmresume is not canceled. Otherwise, we would risk to lose injected events or leak them into the wrong queues. Encode this rule via a WARN_ON_ONCE at the entry of nested_vmx_vmexit. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 18:27:07 +03:00
Jan Kiszka	3b656cf764	KVM: nVMX: Fix injection of PENDING_INTERRUPT and NMI_WINDOW exits to L1 Check if the interrupt or NMI window exit is for L1 by testing if it has the corresponding controls enabled. This is required when we allow direct injection from L0 to L2 Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 18:27:05 +03:00
Gleb Natapov	188424ba10	KVM: emulator: mark 0xff 0x7d opcode as undefined. Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 09:44:17 +03:00
Gleb Natapov	1146a78b8d	KVM: emulator: Do not fail on emulation of undefined opcode Emulation of undefined opcode should inject #UD instead of causing emulation failure. Do that by moving Undefined flag check to emulation stage and injection #UD there. Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 09:44:17 +03:00
Gleb Natapov	991eebf9f8	KVM: VMX: do not try to reexecute failed instruction while emulating invalid guest state During invalid guest state emulation vcpu cannot enter guest mode to try to reexecute instruction that emulator failed to emulate, so emulation will happen again and again. Prevent that by telling the emulator that instruction reexecution should not be attempted. Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 09:44:17 +03:00
Gleb Natapov	0b789eee2c	KVM: emulator: fix unimplemented instruction detection Unimplemented instruction detection is broken for group instructions since it relies on "flags" field of opcode to be zero, but all instructions in a group inherit flags from a group encoding. Fix that by having a separate flag for unimplemented instructions. Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-14 09:44:16 +03:00
Konrad Rzeszutek Wilk	357d122670	x86, xen, gdt: Remove the pvops variant of store_gdt. The two use-cases where we needed to store the GDT were during ACPI S3 suspend and resume. As the patches: x86/gdt/i386: store/load GDT for ACPI S3 or hibernation/resume path is not needed x86/gdt/64-bit: store/load GDT for ACPI S3 or hibernate/resume path is not needed. have demonstrated - there are other mechanism by which the GDT is saved and reloaded during early resume path. Hence we do not need to worry about the pvops call-chain for saving the GDT and can and can eliminate it. The other areas where the store_gdt is used are never going to be hit when running under the pvops platforms. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1365194544-14648-4-git-send-email-konrad.wilk@oracle.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-04-11 15:40:38 -07:00
Kevin Wolf	f8da94e9e4	KVM: x86 emulator: Fix segment loading in VM86 This fixes a regression introduced in commit `03ebebeb1` ("KVM: x86 emulator: Leave segment limit and attributs alone in real mode"). The mentioned commit changed the segment descriptors for both real mode and VM86 to only update the segment base instead of creating a completely new descriptor with limit 0xffff so that unreal mode keeps working across a segment register reload. This leads to an invalid segment descriptor in the eyes of VMX, which seems to be okay for real mode because KVM will fix it up before the next VM entry or emulate the state, but it doesn't do this if the guest is in VM86, so we end up with: KVM: entry failed, hardware error 0x80000021 Fix this by effectively reverting commit `03ebebeb1` for VM86 and leaving it only in place for real mode, which is where it's really needed. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-11 15:53:06 +03:00
Geoff Levand	e3ba45b804	KVM: Move kvm_spurious_fault to x86.c The routine kvm_spurious_fault() is an x86 specific routine, so move it from virt/kvm/kvm_main.c to arch/x86/kvm/x86.c. Fixes this sparse warning when building on arm64: virt/kvm/kvm_main.c⚠️ symbol 'kvm_spurious_fault' was not declared. Should it be static? Signed-off-by: Geoff Levand <geoff@infradead.org> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-08 13:02:06 +03:00
Jan Kiszka	a63cb56061	KVM: VMX: Add missing braces to avoid redundant error check The code was already properly aligned, now also add the braces to avoid that err is checked even if alloc_apic_access_page didn't run and change it. Found via Coccinelle by Fengguang Wu. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-08 12:46:06 +03:00
Yang Zhang	458f212e36	KVM: x86: fix memory leak in vmx_init Free vmx_msr_bitmap_longmode_x2apic and vmx_msr_bitmap_longmode if kvm_init() fails. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-08 10:56:08 +03:00
Jan Kiszka	b8c07d55d0	KVM: nVMX: Check exit control for VM_EXIT_SAVE_IA32_PAT, not entry controls Obviously a copy&paste mistake: prepare_vmcs12 has to check L1's exit controls for VM_EXIT_SAVE_IA32_PAT. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-07 14:06:42 +03:00
Yang Zhang	44944d4d28	KVM: Call kvm_apic_match_dest() to check destination vcpu For a given vcpu, kvm_apic_match_dest() will tell you whether the vcpu in the destination list quickly. Drop kvm_calculate_eoi_exitmap() and use kvm_apic_match_dest() instead. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-07 13:55:49 +03:00
Takuya Yoshikawa	450e0b411f	Revert "KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page()" With the following commit, shadow pages can be zapped at random during a shadow page talbe walk: KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page() `7ddca7e43c` This patch reverts it and fixes __direct_map() and FNAME(fetch)(). Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-07 13:13:36 +03:00
Andrew Honig	8f964525a1	KVM: Allow cross page reads and writes from cached translations. This patch adds support for kvm_gfn_to_hva_cache_init functions for reads and writes that will cross a page. If the range falls within the same memslot, then this will be a fast operation. If the range is split between two memslots, then the slower kvm_read_guest and kvm_write_guest are used. Tested: Test against kvm_clock unit tests. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-07 13:05:35 +03:00
Borislav Petkov	e6ee94d58d	x86, cpu: Convert AMD Erratum 383 Convert the AMD erratum 383 testing code to the bug infrastructure. This allows keeping the AMD-specific erratum testing machinery private to amd.c and not export symbols to modules needlessly. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1363788448-31325-6-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@zytor.com>	2013-04-02 10:12:54 -07:00
Paolo Bonzini	afd80d85ae	pmu: prepare for migration support In order to migrate the PMU state correctly, we need to restore the values of MSR_CORE_PERF_GLOBAL_STATUS (a read-only register) and MSR_CORE_PERF_GLOBAL_OVF_CTRL (which has side effects when written). We also need to write the full 40-bit value of the performance counter, which would only be possible with a v3 architectural PMU's full-width counter MSRs. To distinguish host-initiated writes from the guest's, pass the full struct msr_data to kvm_pmu_set_msr. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-04-02 17:42:44 +03:00
Takuya Yoshikawa	81f4f76bbc	KVM: MMU: Rename kvm_mmu_free_some_pages() to make_mmu_pages_available() The current name "kvm_mmu_free_some_pages" should be used for something that actually frees some shadow pages, as we expect from the name, but what the function is doing is to make some, KVM_MIN_FREE_MMU_PAGES, shadow pages available: it does nothing when there are enough. This patch changes the name to reflect this meaning better; while doing this renaming, the code in the wrapper function is inlined into the main body since the whole function will be inlined into the only caller now. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-21 19:45:01 -03:00
Takuya Yoshikawa	7ddca7e43c	KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page() What this function is doing is to ensure that the number of shadow pages does not exceed the maximum limit stored in n_max_mmu_pages: so this is placed at every code path that can reach kvm_mmu_alloc_page(). Although it might have some sense to spread this function in each such code path when it could be called before taking mmu_lock, the rule was changed not to do so. Taking this background into account, this patch moves it into kvm_mmu_alloc_page() and simplifies the code. Note: the unlikely hint in kvm_mmu_free_some_pages() guarantees that the overhead of this function is almost zero except when we actually need to allocate some shadow pages, so we do not need to care about calling it multiple times in one path by doing kvm_mmu_get_page() a few times. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-21 19:44:56 -03:00
Marcelo Tosatti	2ae33b3896	Merge remote-tracking branch 'upstream/master' into queue Merge reason: From: Alexander Graf <agraf@suse.de> "Just recently this really important patch got pulled into Linus' tree for 3.9: commit `1674400aae` Author: Anton Blanchard <anton <at> samba.org> Date: Tue Mar 12 01:51:51 2013 +0000 Without that commit, I can not boot my G5, thus I can't run automated tests on it against my queue. Could you please merge kvm/next against linus/master, so that I can base my trees against that?" * upstream/master: (653 commits) PCI: Use ROM images from firmware only if no other ROM source available sparc: remove unused "config BITS" sparc: delete "if !ULTRA_HAS_POPULATION_COUNT" KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798) KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797) KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796) arm64: Kconfig.debug: Remove unused CONFIG_DEBUG_ERRORS arm64: Do not select GENERIC_HARDIRQS_NO_DEPRECATED inet: limit length of fragment queue hash table bucket lists qeth: Fix scatter-gather regression qeth: Fix invalid router settings handling qeth: delay feature trace sgy-cts1000: Remove __dev* attributes KVM: x86: fix deadlock in clock-in-progress request handling KVM: allow host header to be included even for !CONFIG_KVM hwmon: (lm75) Fix tcn75 prefix hwmon: (lm75.h) Update header inclusion MAINTAINERS: Remove Mark M. Hoffman xfs: ensure we capture IO errors correctly xfs: fix xfs_iomap_eof_prealloc_initial_size type ... Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-21 11:11:52 -03:00
Paolo Bonzini	04b66839d3	KVM: x86: correctly initialize the CS base on reset The CS base was initialized to 0 on VMX (wrong, but usually overridden by userspace before starting) or 0xf0000 on SVM. The correct value is 0xffff0000, and VMX is able to emulate it now, so use it. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-20 17:34:55 -03:00
Andy Honig	0b79459b48	KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797) There is a potential use after free issue with the handling of MSR_KVM_SYSTEM_TIME. If the guest specifies a GPA in a movable or removable memory such as frame buffers then KVM might continue to write to that address even after it's removed via KVM_SET_USER_MEMORY_REGION. KVM pins the page in memory so it's unlikely to cause an issue, but if the user space component re-purposes the memory previously used for the guest, then the guest will be able to corrupt that memory. Tested: Tested against kvmclock unit test Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-19 14:17:35 -03:00
Andy Honig	c300aa64dd	KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796) If the guest sets the GPA of the time_page so that the request to update the time straddles a page then KVM will write onto an incorrect page. The write is done byusing kmap atomic to get a pointer to the page for the time structure and then performing a memcpy to that page starting at an offset that the guest controls. Well behaved guests always provide a 32-byte aligned address, however a malicious guest could use this to corrupt host kernel memory. Tested: Tested against kvmclock unit test. Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-19 14:17:31 -03:00
Marcelo Tosatti	c09664bb44	KVM: x86: fix deadlock in clock-in-progress request handling There is a deadlock in pvclock handling: cpu0: cpu1: kvm_gen_update_masterclock() kvm_guest_time_update() spin_lock(pvclock_gtod_sync_lock) local_irq_save(flags) spin_lock(pvclock_gtod_sync_lock) kvm_make_mclock_inprogress_request(kvm) make_all_cpus_request() smp_call_function_many() Now if smp_call_function_many() called by cpu0 tries to call function on cpu1 there will be a deadlock. Fix by moving pvclock_gtod_sync_lock protected section outside irq disabled section. Analyzed by Gleb Natapov <gleb@redhat.com> Acked-by: Gleb Natapov <gleb@redhat.com> Reported-and-Tested-by: Yongjie Ren <yongjie.ren@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-18 18:03:39 -03:00
Jan Kiszka	4918c6ca68	KVM: VMX: Require KVM_SET_TSS_ADDR being called prior to running a VCPU Very old user space (namely qemu-kvm before kvm-49) didn't set the TSS base before running the VCPU. We always warned about this bug, but no reports about users actually seeing this are known. Time to finally remove the workaround that effectively prevented to call vmx_vcpu_reset while already holding the KVM srcu lock. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-18 13:48:15 -03:00
Takuya Yoshikawa	982b3394dd	KVM: x86: Optimize mmio spte zapping when creating/moving memslot When we create or move a memory slot, we need to zap mmio sptes. Currently, zap_all() is used for this and this is causing two problems: - extra page faults after zapping mmu pages - long mmu_lock hold time during zapping mmu pages For the latter, Marcelo reported a disastrous mmu_lock hold time during hot-plug, which made the guest unresponsive for a long time. This patch takes a simple way to fix these problems: do not zap mmu pages unless they are marked mmio cached. On our test box, this took only 50us for the 4GB guest and we did not see ms of mmu_lock hold time any more. Note that we still need to do zap_all() for other cases. So another work is also needed: Xiao's work may be the one. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-14 10:21:21 +02:00
Takuya Yoshikawa	95b0430d1a	KVM: MMU: Mark sp mmio cached when creating mmio spte This will be used not to zap unrelated mmu pages when creating/moving a memory slot later. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-14 10:21:10 +02:00
Jan Kiszka	0238ea913c	KVM: nVMX: Add preemption timer support Provided the host has this feature, it's straightforward to offer it to the guest as well. We just need to load to timer value on L2 entry if the feature was enabled by L1 and watch out for the corresponding exit reason. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-14 10:01:21 +02:00
Jan Kiszka	c18911a23c	KVM: nVMX: Provide EFER.LMA saving support We will need EFER.LMA saving to provide unrestricted guest mode. All what is missing for this is picking up EFER.LMA from VM_ENTRY_CONTROLS on L2->L1 switches. If the host does not support EFER.LMA saving, no change is performed, otherwise we properly emulate for L1 what the hardware does for L0. Advertise the support, depending on the host feature. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-14 10:00:55 +02:00
Jan Kiszka	eabeaaccfc	KVM: nVMX: Clean up and fix pin-based execution controls Only interrupt and NMI exiting are mandatory for KVM to work, thus can be exposed to the guest unconditionally, virtual NMI exiting is optional. So we must not advertise it unless the host supports it. Introduce the symbolic constant PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR at this chance. Reviewed-by:: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-13 16:14:40 +02:00
Jan Kiszka	66450a21f9	KVM: x86: Rework INIT and SIPI handling A VCPU sending INIT or SIPI to some other VCPU races for setting the remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED was overwritten by kvm_emulate_halt and, thus, got lost. This introduces APIC events for those two signals, keeping them in kvm_apic until kvm_apic_accept_events is run over the target vcpu context. kvm_apic_has_events reports to kvm_arch_vcpu_runnable if there are pending events, thus if vcpu blocking should end. The patch comes with the side effect of effectively obsoleting KVM_MP_STATE_SIPI_RECEIVED. We still accept it from user space, but immediately translate it to KVM_MP_STATE_INIT_RECEIVED + KVM_APIC_SIPI. The vcpu itself will no longer enter the KVM_MP_STATE_SIPI_RECEIVED state. That also means we no longer exit to user space after receiving a SIPI event. Furthermore, we already reset the VCPU on INIT, only fixing up the code segment later on when SIPI arrives. Moreover, we fix INIT handling for the BSP: it never enter wait-for-SIPI but directly starts over on INIT. Tested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-13 16:08:10 +02:00
Marcelo Tosatti	5d21881432	KVM: MMU: make kvm_mmu_available_pages robust against n_used_mmu_pages > n_max_mmu_pages As noticed by Ulrich Obergfell <uobergfe@redhat.com>, the mmu counters are for beancounting purposes only - so n_used_mmu_pages and n_max_mmu_pages could be relaxed (example: before `f0f5933a16`), resulting in n_used_mmu_pages > n_max_mmu_pages. Make code robust against n_used_mmu_pages > n_max_mmu_pages. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-13 11:46:09 +02:00
Jan Kiszka	57f252f229	KVM: x86: Drop unused return code from VCPU reset callback Neither vmx nor svm nor the common part may generate an error on kvm_vcpu_reset. So drop the return code. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-12 13:25:56 +02:00
Marcelo Tosatti	03ba32cae6	VMX: x86: handle host TSC calibration failure If the host TSC calibration fails, tsc_khz is zero (see tsc_init.c). Handle such case properly in KVM (instead of dividing by zero). https://bugzilla.redhat.com/show_bug.cgi?id=859282 Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-12 13:18:00 +02:00
Ioan Orghici	0fa24ce3f5	kvm: remove cast for kmalloc return value Signed-off-by: Ioan Orghici<ioan.orghici@gmail.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-03-11 12:03:54 +02:00
Takuya Yoshikawa	5da596078f	KVM: MMU: Introduce a helper function for FIFO zapping Make the code for zapping the oldest mmu page, placed at the tail of the active list, a separate function. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 17:26:27 -03:00
Takuya Yoshikawa	945315b9db	KVM: MMU: Use list_for_each_entry_safe in kvm_mmu_commit_zap_page() We are traversing the linked list, invalid_list, deleting each entry by kvm_mmu_free_page(). _safe version is there for such a case. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 17:26:27 -03:00
Takuya Yoshikawa	1044b03034	KVM: MMU: Fix and clean up for_each_gfn_* macros The expression (sp)->gfn should not be expanded using @gfn. Although no user of these macros passes a string other than gfn now, this should be fixed before anyone sees strange errors. Note: ignored the following checkpatch errors: ERROR: Macros with complex values should be enclosed in parenthesis ERROR: trailing statements should be on next line Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 17:26:27 -03:00
Jan Kiszka	1a0d74e664	KVM: nVMX: Fix setting of CR0 and CR4 in guest mode The logic for calculating the value with which we call kvm_set_cr0/4 was broken (will definitely be visible with nested unrestricted guest mode support). Also, we performed the check regarding CR0_ALWAYSON too early when in guest mode. What really needs to be done on both CR0 and CR4 is to mask out L1-owned bits and merge them in from L1's guest_cr0/4. In contrast, arch.cr0/4 and arch.cr0/4_guest_owned_bits contain the mangled L0+L1 state and, thus, are not suited as input. For both CRs, we can then apply the check against VMXON_CRx_ALWAYSON and refuse the update if it fails. To be fully consistent, we implement this check now also for CR4. For CR4, we move the check into vmx_set_cr4 while we keep it in handle_set_cr0. This is because the CR0 checks for vmxon vs. guest mode will diverge soon when adding unrestricted guest mode support. Finally, we have to set the shadow to the value L2 wanted to write originally. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 15:48:47 -03:00
Jan Kiszka	33fb20c39e	KVM: nVMX: Fix content of MSR_IA32_VMX_ENTRY/EXIT_CTLS Properly set those bits to 1 that the spec demands in case bit 55 of VMX_BASIC is 0 - like in our case. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-07 15:47:11 -03:00
Jan Kiszka	c4627c72e9	KVM: nVMX: Reset RFLAGS on VM-exit Ouch, how could this work so well that far? We need to clear RFLAGS to the reset value as specified by the SDM. Particularly, IF must be off after VM-exit! Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-05 20:49:17 -03:00
Jan Kiszka	503cd0c50a	KVM: nVMX: Fix switching of debug state First of all, do not blindly overwrite GUEST_DR7 on L2 entry. The host may have guest debugging enabled. Then properly reset DR7 and DEBUG_CTL on L2->L1 switch as specified in the SDM. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 21:37:28 -03:00
Takuya Yoshikawa	8482644aea	KVM: set_memory_region: Refactor commit_memory_region() This patch makes the parameter old a const pointer to the old memory slot and adds a new parameter named change to know the change being requested: the former is for removing extra copying and the latter is for cleaning up the code. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 20:21:08 -03:00
Takuya Yoshikawa	7b6195a91d	KVM: set_memory_region: Refactor prepare_memory_region() This patch drops the parameter old, a copy of the old memory slot, and adds a new parameter named change to know the change being requested. This not only cleans up the code but also removes extra copying of the memory slot structure. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 20:21:08 -03:00
Takuya Yoshikawa	47ae31e257	KVM: set_memory_region: Drop user_alloc from set_memory_region() Except ia64's stale code, KVM_SET_MEMORY_REGION support, this is only used for sanity checks in __kvm_set_memory_region() which can easily be changed to use slot id instead. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 20:21:08 -03:00
Takuya Yoshikawa	462fce4606	KVM: set_memory_region: Drop user_alloc from prepare/commit_memory_region() X86 does not use this any more. The remaining user, s390's !user_alloc check, can be simply removed since KVM_SET_MEMORY_REGION ioctl is no longer supported. Note: fixed powerpc's indentations with spaces to suppress checkpatch errors. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-03-04 20:21:08 -03:00
Marcelo Tosatti	ee2c25efdd	Merge branch 'master' into queue * master: (15791 commits) Linux 3.9-rc1 btrfs/raid56: Add missing #include <linux/vmalloc.h> fix compat_sys_rt_sigprocmask() SUNRPC: One line comment fix ext4: enable quotas before orphan cleanup ext4: don't allow quota mount options when quota feature enabled ext4: fix a warning from sparse check for ext4_dir_llseek ext4: convert number of blocks to clusters properly ext4: fix possible memory leak in ext4_remount() jbd2: fix ERR_PTR dereference in jbd2__journal_start metag: Provide dma_get_sgtable() metag: prom.h: remove declaration of metag_dt_memblock_reserve() metag: copy devicetree to non-init memory metag: cleanup metag_ksyms.c includes metag: move mm/init.c exports out of metag_ksyms.c metag: move usercopy.c exports out of metag_ksyms.c metag: move setup.c exports out of metag_ksyms.c metag: move kick.c exports out of metag_ksyms.c metag: move traps.c exports out of metag_ksyms.c metag: move irq enable out of irqflags.h on SMP ... Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Conflicts: arch/x86/kernel/kvmclock.c	2013-03-04 20:10:32 -03:00
Jan Kiszka	3ab66e8a45	KVM: VMX: Pass vcpu to __vmx_complete_interrupts Cleanup: __vmx_complete_interrupts has no use for the vmx structure. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-28 10:29:03 +02:00
Jan Kiszka	44ceb9d665	KVM: nVMX: Avoid one redundant vmcs_read in prepare_vmcs12 IDT_VECTORING_INFO_FIELD was already read right after vmexit. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-28 10:20:06 +02:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Chen Gang	02cdb50fd7	arch/x86/kvm: beautify source code for __u32 irq which is never < 0 irp->irq is __u32 which is never < 0. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 16:07:45 +02:00
Jan Kiszka	957c897e8c	KVM: nVMX: Use cached exit reason No need to re-read what vmx_vcpu_run already picked up for us. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 15:46:07 +02:00
Jan Kiszka	36c3cc422b	KVM: nVMX: Clear segment cache after switching between L1 and L2 Switching the VMCS obviously invalidates what may have been cached about the guest segments. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 15:41:09 +02:00
Jan Kiszka	d6851fbeee	KVM: nVMX: Advertise PAUSE and WBINVD exiting support These exits have no preconditions, and we already process the corresponding reasons in nested_vmx_exit_handled correctly. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 15:33:51 +02:00
Jan Kiszka	733568f9ce	KVM: VMX: Make prepare_vmcs12 and load_vmcs12_host_state static Both are only used locally. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-27 15:31:15 +02:00
Linus Torvalds	89f883372f	Merge tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Marcelo Tosatti: "KVM updates for the 3.9 merge window, including x86 real mode emulation fixes, stronger memory slot interface restrictions, mmu_lock spinlock hold time reduction, improved handling of large page faults on shadow, initial APICv HW acceleration support, s390 channel IO based virtio, amongst others" * tag 'kvm-3.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits) Revert "KVM: MMU: lazily drop large spte" x86: pvclock kvm: align allocation size to page size KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr x86 emulator: fix parity calculation for AAD instruction KVM: PPC: BookE: Handle alignment interrupts booke: Added DBCR4 SPR number KVM: PPC: booke: Allow multiple exception types KVM: PPC: booke: use vcpu reference from thread_struct KVM: Remove user_alloc from struct kvm_memory_slot KVM: VMX: disable apicv by default KVM: s390: Fix handling of iscs. KVM: MMU: cleanup __direct_map KVM: MMU: remove pt_access in mmu_set_spte KVM: MMU: cleanup mapping-level KVM: MMU: lazily drop large spte KVM: VMX: cleanup vmx_set_cr0(). KVM: VMX: add missing exit names to VMX_EXIT_REASONS array KVM: VMX: disable SMEP feature when guest is in non-paging mode KVM: Remove duplicate text in api.txt Revert "KVM: MMU: split kvm_mmu_free_page" ...	2013-02-24 13:07:18 -08:00
Jan Kiszka	bd31a7f557	KVM: nVMX: Trap unconditionally if msr bitmap access fails This avoids basing decisions on uninitialized variables, potentially leaking kernel data to the L1 guest. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-22 00:50:45 -03:00
Jan Kiszka	908a7bdd6a	KVM: nVMX: Improve I/O exit handling This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements I/O bitmap handling. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-22 00:50:42 -03:00
Marcelo Tosatti	6b73a96065	Revert "KVM: MMU: lazily drop large spte" This reverts commit `caf6900f2d`. It is causing migration failures, reference https://bugzilla.kernel.org/show_bug.cgi?id=54061. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-20 18:52:02 -03:00
Borislav Petkov	2e32b71906	x86, kvm: Add MSR_AMD64_BU_CFG2 to the list of ignored MSRs The "x86, AMD: Enable WC+ memory type on family 10 processors" patch currently in -tip added a workaround for AMD F10h CPUs which #GPs my guest when booted in kvm. This is because it accesses MSR_AMD64_BU_CFG2 which is not currently ignored by kvm. Do that because this MSR is only baremetal-relevant anyway. While at it, move the ignored MSRs at the beginning of kvm_set_msr_common so that we exit then and there. Acked-by: Gleb Natapov <gleb@redhat.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Andre Przywara <andre@andrep.de> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1361298793-31834-2-git-send-email-bp@alien8.de Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2013-02-19 10:44:07 -08:00
Jan Kiszka	cbd29cb6e3	KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr We already pass vmcs12 as argument. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-14 10:35:16 +02:00
Gleb Natapov	f583c29b79	x86 emulator: fix parity calculation for AAD instruction Reported-by: Paolo Bonzini <pbonzini@redhat.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-13 18:01:00 +02:00
Takuya Yoshikawa	7a905b1485	KVM: Remove user_alloc from struct kvm_memory_slot This field was needed to differentiate memory slots created by the new API, KVM_SET_USER_MEMORY_REGION, from those by the old equivalent, KVM_SET_MEMORY_REGION, whose support was dropped long before: commit `b74a07beed` KVM: Remove kernel-allocated memory regions Although we also have private memory slots to which KVM allocates memory with vm_mmap(), !user_alloc slots in other words, the slot id should be enough for differentiating them. Note: corresponding function parameters will be removed later. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-11 11:52:00 +02:00
Yang Zhang	257090f702	KVM: VMX: disable apicv by default Without Posted Interrupt, current code is broken. Just disable by default until Posted Interrupt is ready. Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-02-11 10:51:13 +02:00
Xiao Guangrong	24db2734ad	KVM: MMU: cleanup __direct_map Use link_shadow_page to link the sp to the spte in __direct_map Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-06 22:42:09 -02:00
Xiao Guangrong	f761620377	KVM: MMU: remove pt_access in mmu_set_spte It is only used in debug code, so drop it Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-06 22:42:08 -02:00
Xiao Guangrong	55dd98c3a8	KVM: MMU: cleanup mapping-level Use min() to cleanup mapping_level Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-06 22:42:08 -02:00
Xiao Guangrong	caf6900f2d	KVM: MMU: lazily drop large spte Currently, kvm zaps the large spte if write-protected is needed, the later read can fault on that spte. Actually, we can make the large spte readonly instead of making them not present, the page fault caused by read access can be avoided The idea is from Avi: \| As I mentioned before, write-protecting a large spte is a good idea, \| since it moves some work from protect-time to fault-time, so it reduces \| jitter. This removes the need for the return value. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-06 22:28:01 -02:00
Gleb Natapov	5037878e22	KVM: VMX: cleanup vmx_set_cr0(). When calculating hw_cr0 teh current code masks bits that should be always on and re-adds them back immediately after. Cleanup the code by masking only those bits that should be dropped from hw_cr0. This allow us to get rid of some defines. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-06 22:00:02 -02:00
Dongxiao Xu	c08800a56c	KVM: VMX: disable SMEP feature when guest is in non-paging mode SMEP is disabled if CPU is in non-paging mode in hardware. However KVM always uses paging mode to emulate guest non-paging mode with TDP. To emulate this behavior, SMEP needs to be manually disabled when guest switches to non-paging mode. We met an issue that, SMP Linux guest with recent kernel (enable SMEP support, for example, 3.5.3) would crash with triple fault if setting unrestricted_guest=0. This is because KVM uses an identity mapping page table to emulate the non-paging mode, where the page table is set with USER flag. If SMEP is still enabled in this case, guest will meet unhandlable page fault and then crash. Reviewed-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com> Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-05 23:28:07 -02:00
Gleb Natapov	834be0d83f	Revert "KVM: MMU: split kvm_mmu_free_page" This reverts commit `bd4c86eaa6`. There is not user for kvm_mmu_isolate_page() any more. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-05 22:47:39 -02:00
Gleb Natapov	eb3fce87cc	KVM: MMU: drop superfluous is_present_gpte() check. Gust page walker puts only present ptes into ptes[] array. No need to check it again. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-04 23:24:28 -02:00
Gleb Natapov	116eb3d30e	KVM: MMU: drop superfluous min() call. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-04 23:24:28 -02:00
Gleb Natapov	2c9afa52ef	KVM: MMU: set base_role.nxe during mmu initialization. Move base_role.nxe initialisation to where all other roles are initialized. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-04 23:24:28 -02:00
Gleb Natapov	9bb4f6b15e	KVM: MMU: drop unneeded checks. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-04 23:24:28 -02:00
Gleb Natapov	feb3eb704a	KVM: MMU: make spte_is_locklessly_modifiable() more clear spte_is_locklessly_modifiable() checks that both SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE are present on spte. Make it more explicit. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-02-04 23:24:28 -02:00
Yang Zhang	c7c9c56ca2	x86, apicv: add virtual interrupt delivery support Virtual interrupt delivery avoids KVM to inject vAPIC interrupts manually, which is fully taken care of by the hardware. This needs some special awareness into existing interrupr injection path: - for pending interrupt, instead of direct injection, we may need update architecture specific indicators before resuming to guest. - A pending interrupt, which is masked by ISR, should be also considered in above update action, since hardware will decide when to inject it at right time. Current has_interrupt and get_interrupt only returns a valid vector from injection p.o.v. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-29 10:48:19 +02:00
Yang Zhang	8d14695f95	x86, apicv: add virtual x2apic support basically to benefit from apicv, we need to enable virtualized x2apic mode. Currently, we only enable it when guest is really using x2apic. Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic: 0x800 - 0x8ff: no read intercept for apicv register virtualization, except APIC ID and TMCCT which need software's assistance to get right value. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-29 10:48:06 +02:00
Yang Zhang	83d4c28693	x86, apicv: add APICv register virtualization support - APIC read doesn't cause VM-Exit - APIC write becomes trap-like Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Yang Zhang <yang.z.zhang@intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-29 10:47:54 +02:00
Avi Kivity	3f0c3d0bb2	KVM: x86 emulator: fix test_cc() build failure on i386 'pushq' doesn't exist on i386. Replace with 'push', which should work since the operand is a register. Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-27 11:09:38 +02:00
Gleb Natapov	141687869f	KVM: VMX: set vmx->emulation_required only when needed. If emulate_invalid_guest_state=false vmx->emulation_required is never actually used, but it ends up to be always set to true since handle_invalid_guest_state(), the only place it is reset back to false, is never called. This, besides been not very clean, makes vmexit and vmentry path to check emulate_invalid_guest_state needlessly. The patch fixes that by keeping emulation_required coherent with emulate_invalid_guest_state setting. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:31 -02:00
Gleb Natapov	378a8b099f	KVM: x86: fix use of uninitialized memory as segment descriptor in emulator. If VMX reports segment as unusable, zero descriptor passed by the emulator before returning. Such descriptor will be considered not present by the emulator. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:31 -02:00
Gleb Natapov	91b0aa2ca6	KVM: VMX: rename fix_pmode_dataseg to fix_pmode_seg. The function deals with code segment too. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:30 -02:00
Gleb Natapov	25391454e7	KVM: VMX: don't clobber segment AR of unusable segments. Usability is returned in unusable field, so not need to clobber entire AR. Callers have to know how to deal with unusable segments already since if emulate_invalid_guest_state=true AR is not zeroed. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:28 -02:00
Gleb Natapov	218e763f45	KVM: VMX: skip vmx->rmode.vm86_active check on cr0 write if unrestricted guest is enabled vmx->rmode.vm86_active is never true is unrestricted guest is enabled. Make it more explicit that neither enter_pmode() nor enter_rmode() is called in this case. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:28 -02:00
Gleb Natapov	286da4156d	KVM: VMX: remove hack that disables emulation on vcpu reset/init There is no reason for it. If state is suitable for vmentry it will be detected during guest entry and no emulation will happen. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:27 -02:00
Gleb Natapov	c5e97c80b5	KVM: VMX: if unrestricted guest is enabled vcpu state is always valid. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:26 -02:00
Gleb Natapov	2f143240cb	KVM: VMX: reset CPL only on CS register write. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:26 -02:00
Gleb Natapov	1f3141e80b	KVM: VMX: remove special CPL cache access during transition to real mode. Since vmx_get_cpl() always returns 0 when VCPU is in real mode it is no longer needed. Also reset CPL cache to zero during transaction to protected mode since transaction may happen while CS.selectors & 3 != 0, but in reality CPL is 0. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-24 00:40:25 -02:00
Avi Kivity	158de57f90	KVM: x86 emulator: convert a few freestanding emulations to fastop Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-23 22:15:40 -02:00
Avi Kivity	34b77652b9	KVM: x86 emulator: rearrange fastop definitions Make fastop opcodes usable in other emulations. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-23 22:15:39 -02:00
Avi Kivity	4d7583493e	KVM: x86 emulator: convert 2-operand IMUL to fastop Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-23 22:15:38 -02:00
Avi Kivity	11c363ba8f	KVM: x86 emulator: convert BT/BTS/BTR/BTC/BSF/BSR to fastop Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-23 22:15:37 -02:00
Avi Kivity	95413dc413	KVM: x86 emulator: convert INC/DEC to fastop Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-23 22:15:37 -02:00
Avi Kivity	9ae9febae9	KVM: x86 emulator: covert SETCC to fastop This is a bit of a special case since we don't have the usual byte/word/long/quad switch; instead we switch on the condition code embedded in the instruction. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-23 22:15:36 -02:00
Avi Kivity	007a3b5475	KVM: x86 emulator: convert shift/rotate instructions to fastop SHL, SHR, ROL, ROR, RCL, RCR, SAR, SAL Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-23 22:15:35 -02:00
Avi Kivity	0bdea06892	KVM: x86 emulator: Convert SHLD, SHRD to fastop Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-23 22:15:33 -02:00
Xiao Guangrong	93c05d3ef2	KVM: x86: improve reexecute_instruction The current reexecute_instruction can not well detect the failed instruction emulation. It allows guest to retry all the instructions except it accesses on error pfn For example, some cases are nested-write-protect - if the page we want to write is used as PDE but it chains to itself. Under this case, we should stop the emulation and report the case to userspace Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-21 22:58:33 -02:00
Xiao Guangrong	95b3cf69bd	KVM: x86: let reexecute_instruction work for tdp Currently, reexecute_instruction refused to retry all instructions if tdp is enabled. If nested npt is used, the emulation may be caused by shadow page, it can be fixed by dropping the shadow page. And the only condition that tdp can not retry the instruction is the access fault on error pfn Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-21 22:58:32 -02:00
Xiao Guangrong	22368028fe	KVM: x86: clean up reexecute_instruction Little cleanup for reexecute_instruction, also use gpa_to_gfn in retry_instruction Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-21 22:58:31 -02:00
Takuya Yoshikawa	6b81b05e44	KVM: MMU: Conditionally reschedule when kvm_mmu_slot_remove_write_access() takes a long time If the userspace starts dirty logging for a large slot, say 64GB of memory, kvm_mmu_slot_remove_write_access() needs to hold mmu_lock for a long time such as tens of milliseconds. This patch controls the lock hold time by asking the scheduler if we need to reschedule for others. One penalty for this is that we need to flush TLBs before releasing mmu_lock. But since holding mmu_lock for a long time does affect not only the guest, vCPU threads in other words, but also the host as a whole, we should pay for that. In practice, the cost will not be so high because we can protect a fair amount of memory before being rescheduled: on my test environment, cond_resched_lock() was called only once for protecting 12GB of memory even without THP. We can also revisit Avi's "unlocked TLB flush" work later for completely suppressing extra TLB flushes if needed. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-14 11:14:28 +02:00
Takuya Yoshikawa	9d1beefb71	KVM: Make kvm_mmu_slot_remove_write_access() take mmu_lock by itself Better to place mmu_lock handling and TLB flushing code together since this is a self-contained function. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-14 11:14:17 +02:00
Takuya Yoshikawa	b34cb590fb	KVM: Make kvm_mmu_change_mmu_pages() take mmu_lock by itself No reason to make callers take mmu_lock since we do not need to protect kvm_mmu_change_mmu_pages() and kvm_mmu_slot_remove_write_access() together by mmu_lock in kvm_arch_commit_memory_region(): the former calls kvm_mmu_commit_zap_page() and flushes TLBs by itself. Note: we do not need to protect kvm->arch.n_requested_mmu_pages by mmu_lock as can be seen from the fact that it is read locklessly. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-14 11:14:09 +02:00
Takuya Yoshikawa	e12091ce7b	KVM: Remove unused slot_bitmap from kvm_mmu_page Not needed any more. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-14 11:13:58 +02:00
Takuya Yoshikawa	b99db1d352	KVM: MMU: Make kvm_mmu_slot_remove_write_access() rmap based This makes it possible to release mmu_lock and reschedule conditionally in a later patch. Although this may increase the time needed to protect the whole slot when we start dirty logging, the kernel should not allow the userspace to trigger something that will hold a spinlock for such a long time as tens of milliseconds: actually there is no limit since it is roughly proportional to the number of guest pages. Another point to note is that this patch removes the only user of slot_bitmap which will cause some problems when we increase the number of slots further. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-14 11:13:47 +02:00
Takuya Yoshikawa	245c3912ea	KVM: MMU: Remove unused parameter level from __rmap_write_protect() No longer need to care about the mapping level in this function. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-14 11:13:31 +02:00
Takuya Yoshikawa	c972f3b125	KVM: Write protect the updated slot only when dirty logging is enabled Calling kvm_mmu_slot_remove_write_access() for a deleted slot does nothing but search for non-existent mmu pages which have mappings to that deleted memory; this is safe but a waste of time. Since we want to make the function rmap based in a later patch, in a manner which makes it unsafe to be called for a deleted slot, we makes the caller see if the slot is non-zero and being dirty logged. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2013-01-14 11:13:15 +02:00
Xiao Guangrong	7751babd3c	KVM: MMU: fix infinite fault access retry We have two issues in current code: - if target gfn is used as its page table, guest will refault then kvm will use small page size to map it. We need two #PF to fix its shadow page table - sometimes, say a exception is triggered during vm-exit caused by #PF (see handle_exception() in vmx.c), we remove all the shadow pages shadowed by the target gfn before go into page fault path, it will cause infinite loop: delete shadow pages shadowed by the gfn -> try to use large page size to map the gfn -> retry the access ->... To fix these, we can adjust page size early if the target gfn is used as page table Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-10 15:28:30 -02:00
Xiao Guangrong	c22885050e	KVM: MMU: fix Dirty bit missed if CR0.WP = 0 If the write-fault access is from supervisor and CR0.WP is not set on the vcpu, kvm will fix it by adjusting pte access - it sets the W bit on pte and clears U bit. This is the chance that kvm can change pte access from readonly to writable Unfortunately, the pte access is the access of 'direct' shadow page table, means direct sp.role.access = pte_access, then we will create a writable spte entry on the readonly shadow page table. It will cause Dirty bit is not tracked when two guest ptes point to the same large page. Note, it does not have other impact except Dirty bit since cr0.wp is encoded into sp.role It can be fixed by adjusting pte access before establishing shadow page table. Also, after that, no mmu specified code exists in the common function and drop two parameters in set_spte Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-10 15:28:08 -02:00
Avi Kivity	fb864fbc72	KVM: x86 emulator: convert basic ALU ops to fastop Opcodes: TEST CMP ADD ADC SUB SBB XOR OR AND Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-09 17:39:30 -02:00
Avi Kivity	f7857f35db	KVM: x86 emulator: add macros for defining 2-operand fastop emulation Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-09 17:39:28 -02:00
Avi Kivity	45a1467d7e	KVM: x86 emulator: convert NOT, NEG to fastop Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-09 17:39:25 -02:00
Avi Kivity	75f728456f	KVM: x86 emulator: mark CMP, CMPS, SCAS, TEST as NoWrite Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-09 17:39:21 -02:00
Avi Kivity	b6744dc3fb	KVM: x86 emulator: introduce NoWrite flag Instead of disabling writeback via OP_NONE, just specify NoWrite. Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-09 17:39:18 -02:00
Avi Kivity	b7d491e7f0	KVM: x86 emulator: Support for declaring single operand fastops Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-09 17:39:17 -02:00
Avi Kivity	e28bbd44da	KVM: x86 emulator: framework for streamlining arithmetic opcodes We emulate arithmetic opcodes by executing a "similar" (same operation, different operands) on the cpu. This ensures accurate emulation, esp. wrt. eflags. However, the prologue and epilogue around the opcode is fairly long, consisting of a switch (for the operand size) and code to load and save the operands. This is repeated for every opcode. This patch introduces an alternative way to emulate arithmetic opcodes. Instead of the above, we have four (three on i386) functions consisting of just the opcode and a ret; one for each operand size. For example: .align 8 em_notb: not %al ret .align 8 em_notw: not %ax ret .align 8 em_notl: not %eax ret .align 8 em_notq: not %rax ret The prologue and epilogue are shared across all opcodes. Note the functions use a special calling convention; notably eflags is an input/output parameter and is not clobbered. Rather than dispatching the four functions through a jump table, the functions are declared as a constant size (8) so their address can be calculated. Acked-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi.kivity@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-09 17:39:17 -02:00
Marcelo Tosatti	b09408d00f	KVM: VMX: fix incorrect cached cpl value with real/v8086 modes CPL is always 0 when in real mode, and always 3 when virtual 8086 mode. Using values other than those can cause failures on operations that check CPL. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-08 17:25:35 -02:00
Gleb Natapov	b0cfeb5ded	KVM: x86: remove unused variable from walk_addr_generic() Fix compilation warning. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-08 17:23:39 -02:00
Marcelo Tosatti	013f6a5d3d	KVM: x86: use dynamic percpu allocations for shared msrs area Use dynamic percpu allocations for the shared msrs structure, to avoid using the limited reserved percpu space. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-08 12:51:56 -02:00
Gleb Natapov	908e7d7999	KVM: MMU: simplify folding of dirty bit into accessed_dirty MMU code tries to avoid if()s HW is not able to predict reliably by using bitwise operation to streamline code execution, but in case of a dirty bit folding this gives us nothing since write_fault is checked right before the folding code. Lets just piggyback onto the if() to make code more clear. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-07 20:31:35 -02:00
Gleb Natapov	ee04e0cea8	KVM: mmu: remove unused trace event trace_kvm_mmu_delay_free_pages() is no longer used. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-07 19:54:50 -02:00
Gleb Natapov	0ca1b4f4ba	KVM: VMX: handle IO when emulation is due to #GP in real mode. With emulate_invalid_guest_state=0 if a vcpu is in real mode VMX can enter the vcpu with smaller segment limit than guest configured. If the guest tries to access pass this limit it will get #GP at which point instruction will be emulated with correct segment limit applied. If during the emulation IO is detected it is not handled correctly. Vcpu thread should exit to userspace to serve the IO, but it returns to the guest instead. Since emulation is not completed till userspace completes the IO the faulty instruction is re-executed ad infinitum. The patch fixes that by exiting to userspace if IO happens during instruction emulation. Reported-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-02 19:36:31 -02:00
Gleb Natapov	d54d07b2ca	KVM: VMX: Do not fix segment register during vcpu initialization. Segment registers will be fixed according to current emulation policy during switching to real mode for the first time. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-02 19:36:30 -02:00
Gleb Natapov	d99e415275	KVM: VMX: fix emulation of invalid guest state. Currently when emulation of invalid guest state is enable (emulate_invalid_guest_state=1) segment registers are still fixed for entry to vm86 mode some times. Segment register fixing is avoided in enter_rmode(), but vmx_set_segment() still does it unconditionally. The patch fixes it. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-02 19:36:29 -02:00
Gleb Natapov	89efbed02c	KVM: VMX: make rmode_segment_valid() more strict. Currently it allows entering vm86 mode if segment limit is greater than 0xffff and db bit is set. Both of those can cause incorrect execution of instruction by cpu since in vm86 mode limit will be set to 0xffff and db will be forced to 0. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-02 19:36:28 -02:00
Gleb Natapov	045a282ca4	KVM: emulator: implement fninit, fnstsw, fnstcw Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-02 19:36:27 -02:00
Gleb Natapov	3a78a4f463	KVM: emulator: drop RPL check from linearize() function According to Intel SDM Vol3 Section 5.5 "Privilege Levels" and 5.6 "Privilege Level Checking When Accessing Data Segments" RPL checking is done during loading of a segment selector, not during data access. We already do checking during segment selector loading, so drop the check during data access. Checking RPL during data access triggers #GP if after transition from real mode to protected mode RPL bits in a segment selector are set. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2013-01-02 19:36:26 -02:00
Gleb Natapov	f924d66d27	KVM: VMX: remove unneeded temporary variable from vmx_set_segment() Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-23 14:02:00 +02:00
Gleb Natapov	1ecd50a947	KVM: VMX: clean-up vmx_set_segment() Move all vm86_active logic into one place. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-23 14:01:49 +02:00
Gleb Natapov	39dcfb95de	KVM: VMX: remove redundant code from vmx_set_segment() Segment descriptor's base is fixed by call to fix_rmode_seg(). Not need to do it twice. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-23 14:01:37 +02:00
Gleb Natapov	beb853ffec	KVM: VMX: use fix_rmode_seg() to fix all code/data segments The code for SS and CS does the same thing fix_rmode_seg() is doing. Use it instead of hand crafted code. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-23 14:01:18 +02:00
Gleb Natapov	c6ad115348	KVM: VMX: return correct segment limit and flags for CS/SS registers in real mode VMX without unrestricted mode cannot virtualize real mode, so if emulate_invalid_guest_state=0 kvm uses vm86 mode to approximate it. Sometimes, when guest moves from protected mode to real mode, it leaves segment descriptors in a state not suitable for use by vm86 mode virtualization, so we keep shadow copy of segment descriptors for internal use and load fake register to VMCS for guest entry to succeed. Till now we kept shadow for all segments except SS and CS (for SS and CS we returned parameters directly from VMCS), but since commit `a5625189f6` emulator enforces segment limits in real mode. This causes #GP during move from protected mode to real mode when emulator fetches first instruction after moving to real mode since it uses incorrect CS base and limit to linearize the %rip. Fix by keeping shadow for SS and CS too. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-23 14:01:03 +02:00
Gleb Natapov	0647f4aa8c	KVM: VMX: relax check for CS register in rmode_segment_valid() rmode_segment_valid() checks if segment descriptor can be used to enter vm86 mode. VMX spec mandates that in vm86 mode CS register will be of type data, not code. Lets allow guest entry with vm86 mode if the only problem with CS register is incorrect type. Otherwise entire real mode will be emulated. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-23 14:00:47 +02:00
Gleb Natapov	07f42f5f25	KVM: VMX: cleanup rmode_segment_valid() Set segment fields explicitly instead of using binary operations. No behaviour changes. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-23 14:00:36 +02:00
Nickolai Zeldovich	d4b06c2d4c	kvm: fix i8254 counter 0 wraparound The kvm i8254 emulation for counter 0 (but not for counters 1 and 2) has at least two bugs in mode 0: 1. The OUT bit, computed by pit_get_out(), is never set high. 2. The counter value, computed by pit_get_count(), wraps back around to the initial counter value, rather than wrapping back to 0xFFFF (which is the behavior described in the comment in __kpit_elapsed, the behavior implemented by qemu, and the behavior observed on AMD hardware). The bug stems from __kpit_elapsed computing the elapsed time mod the initial counter value (stored as nanoseconds in ps->period). This is both unnecessary (none of the callers of kpit_elapsed expect the value to be at most the initial counter value) and incorrect (it causes pit_get_count to appear to wrap around to the initial counter value rather than 0xFFFF). Removing this mod from __kpit_elapsed fixes both of the above bugs. Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-18 11:12:38 +02:00
Gleb Natapov	e11ae1a102	KVM: remove unused variable. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-14 23:07:44 -02:00
Alex Williamson	f82a8cfe93	KVM: struct kvm_memory_slot.user_alloc -> bool There's no need for this to be an int, it holds a boolean. Move to the end of the struct for alignment. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-13 23:24:38 -02:00
Alex Williamson	bbacc0c111	KVM: Rename KVM_MEMORY_SLOTS -> KVM_USER_MEM_SLOTS It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM. One is the user accessible slots and the other is user + private. Make this more obvious. Reviewed-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-13 23:21:57 -02:00
Gleb Natapov	f3200d00ea	KVM: inject ExtINT interrupt before APIC interrupts According to Intel SDM Volume 3 Section 10.8.1 "Interrupt Handling with the Pentium 4 and Intel Xeon Processors" and Section 10.8.2 "Interrupt Handling with the P6 Family and Pentium Processors" ExtINT interrupts are sent directly to the processor core for handling. Currently KVM checks APIC before it considers ExtINT interrupts for injection which is backwards from the spec. Make code behave according to the SDM. Signed-off-by: Gleb Natapov <gleb@redhat.com> Acked-by: "Zhang, Yang Z" <yang.z.zhang@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-13 23:05:21 -02:00
Nadav Amit	5e2c688351	KVM: x86: fix mov immediate emulation for 64-bit operands MOV immediate instruction (opcodes 0xB8-0xBF) may take 64-bit operand. The previous emulation implementation assumes the operand is no longer than 32. Adding OpImm64 for this matter. Fixes https://bugzilla.redhat.com/show_bug.cgi?id=881579 Signed-off-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-13 22:30:56 -02:00
Gleb Natapov	7f662273e4	KVM: emulator: implement AAD instruction Windows2000 uses it during boot. This fixes https://bugzilla.kernel.org/show_bug.cgi?id=50921 Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-13 22:27:22 -02:00
Linus Torvalds	66cdd0ceaf	Merge tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Marcelo Tosatti: "Considerable KVM/PPC work, x86 kvmclock vsyscall support, IA32_TSC_ADJUST MSR emulation, amongst others." Fix up trivial conflict in kernel/sched/core.c due to cross-cpu migration notifier added next to rq migration call-back. * tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (156 commits) KVM: emulator: fix real mode segment checks in address linearization VMX: remove unneeded enable_unrestricted_guest check KVM: VMX: fix DPL during entry to protected mode x86/kexec: crash_vmclear_local_vmcss needs __rcu kvm: Fix irqfd resampler list walk KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary KVM: MMU: optimize for set_spte KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation KVM: PPC: bookehv: Add guest computation mode for irq delivery KVM: PPC: Make EPCR a valid field for booke64 and bookehv KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation KVM: PPC: e500: Add emulation helper for getting instruction ea KVM: PPC: bookehv64: Add support for interrupt handling KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler KVM: PPC: booke: Fix get_tb() compile error on 64-bit KVM: PPC: e500: Silence bogus GCC warning in tlb code ...	2012-12-13 15:31:08 -08:00
Gleb Natapov	58b7825bc3	KVM: emulator: fix real mode segment checks in address linearization In real mode CS register is writable, so do not #GP on write. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-11 21:00:28 -02:00
Gleb Natapov	0b26b588d9	VMX: remove unneeded enable_unrestricted_guest check If enable_unrestricted_guest is true vmx->rmode.vm86_active will always be false. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-11 21:00:28 -02:00
Gleb Natapov	a4d3326c2d	KVM: VMX: fix DPL during entry to protected mode On CPUs without support for unrestricted guests DPL cannot be smaller than RPL for data segments during guest entry, but this state can occurs if a data segment selector changes while vcpu is in real mode to a value with lowest two bits != 00. Fix that by forcing DPL == RPL on transition to protected mode. This is a regression introduced by `c865c43de6`. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-12-11 21:00:27 -02:00
Zhang Yanfei	8f536b7697	KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump The vmclear function will be assigned to the callback function pointer when loading kvm-intel module. And the bitmap indicates whether we should do VMCLEAR operation in kdump. The bits in the bitmap are set/unset according to different conditions. Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-06 18:26:57 +02:00
Xiao Guangrong	c219346325	KVM: MMU: optimize for set_spte There are two cases we need to adjust page size in set_spte: 1): the one is other vcpu creates new sp in the window between mapping_level() and acquiring mmu-lock. 2): the another case is the new sp is created by itself (page-fault path) when guest uses the target gfn as its page table. In current code, set_spte drop the spte and emulate the access for these case, it works not good: - for the case 1, it may destroy the mapping established by other vcpu, and do expensive instruction emulation. - for the case 2, it may emulate the access even if the guest is accessing the page which not used as page table. There is a example, 0~2M is used as huge page in guest, in this huge page, only page 3 used as page table, then guest read/writes on other pages can cause instruction emulation. Both of these cases can be fixed by allowing guest to retry the access, it will refault, then we can establish the mapping by using small page Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-06 09:11:25 +02:00
Julian Stecklina	66f7b72e11	KVM: x86: Make register state after reset conform to specification VMX behaves now as SVM wrt to FPU initialization. Code has been moved to generic code path. General-purpose registers are now cleared on reset and INIT. SVM code properly initializes EDX. Signed-off-by: Julian Stecklina <jsteckli@os.inf.tu-dresden.de> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-05 18:00:07 +02:00
Zhang Xiantao	2b3c5cbc0d	kvm: don't use bit24 for detecting address-specific invalidation capability Bit24 in VMX_EPT_VPID_CAP_MASI is not used for address-specific invalidation capability reporting, so remove it from KVM to avoid conflicts in future. Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-05 16:35:48 +02:00
Zhang Xiantao	0307b7b8c2	kvm: remove unnecessary bit checking for ept violation Bit 6 in EPT vmexit's exit qualification is not defined in SDM, so remove it. Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-05 16:35:21 +02:00
Jan Kiszka	45e3cc7d9f	KVM: x86: Fix uninitialized return code This is a regression caused by `18595411a7`. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>	2012-12-02 17:37:04 +02:00
Will Auld	ba904635d4	KVM: x86: Emulate IA32_TSC_ADJUST MSR CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported Basic design is to emulate the MSR by allowing reads and writes to a guest vcpu specific location to store the value of the emulated MSR while adding the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This is of course as long as the "use TSC counter offsetting" VM-execution control is enabled as well as the IA32_TSC_ADJUST control. However, because hardware will only return the TSC + IA32_TSC_ADJUST + vmsc tsc_offset for a guest process when it does and rdtsc (with the correct settings) the value of our virtualized IA32_TSC_ADJUST must be stored in one of these three locations. The argument against storing it in the actual MSR is performance. This is likely to be seldom used while the save/restore is required on every transition. IA32_TSC_ADJUST was created as a way to solve some issues with writing TSC itself so that is not an option either. The remaining option, defined above as our solution has the problem of returning incorrect vmcs tsc_offset values (unless we intercept and fix, not done here) as mentioned above. However, more problematic is that storing the data in vmcs tsc_offset will have a different semantic effect on the system than does using the actual MSR. This is illustrated in the following example: The hypervisor set the IA32_TSC_ADJUST, then the guest sets it and a guest process performs a rdtsc. In this case the guest process will get TSC + IA32_TSC_ADJUST_hyperviser + vmsc tsc_offset including IA32_TSC_ADJUST_guest. While the total system semantics changed the semantics as seen by the guest do not and hence this will not cause a problem. Signed-off-by: Will Auld <will.auld@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-30 18:29:30 -02:00
Will Auld	8fe8ab46be	KVM: x86: Add code to track call origin for msr assignment In order to track who initiated the call (host or guest) to modify an msr value I have changed function call parameters along the call path. The specific change is to add a struct pointer parameter that points to (index, data, caller) information rather than having this information passed as individual parameters. The initial use for this capability is for updating the IA32_TSC_ADJUST msr while setting the tsc value. It is anticipated that this capability is useful for other tasks. Signed-off-by: Will Auld <will.auld@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-30 18:26:12 -02:00
Xiao Guangrong	5a560f8b5e	KVM: VMX: fix memory order between loading vmcs and clearing vmcs vmcs->cpu indicates whether it exists on the target cpu, -1 means the vmcs does not exist on any vcpu If vcpu load vmcs with vmcs.cpu = -1, it can be directly added to cpu's percpu list. The list can be corrupted if the cpu prefetch the vmcs's list before reading vmcs->cpu. Meanwhile, we should remove vmcs from the list before making vmcs->vcpu == -1 be visible Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-29 21:14:46 -02:00
Xiao Guangrong	e6c7d32172	KVM: VMX: fix invalid cpu passed to smp_call_function_single In loaded_vmcs_clear, loaded_vmcs->cpu is the fist parameter passed to smp_call_function_single, if the target cpu is downing (doing cpu hot remove), loaded_vmcs->cpu can become -1 then -1 is passed to smp_call_function_single It can be triggered when vcpu is being destroyed, loaded_vmcs_clear is called in the preemptionable context Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-28 22:04:58 -02:00
Marcelo Tosatti	d98d07ca7e	KVM: x86: update pvclock area conditionally, on cpu migration As requested by Glauber, do not update kvmclock area on vcpu->pcpu migration, in case the host has stable TSC. This is to reduce cacheline bouncing. Acked-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:15 -02:00
Marcelo Tosatti	b48aa97e38	KVM: x86: require matched TSC offsets for master clock With master clock, a pvclock clock read calculates: ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ] Where 'rdtsc' is the host TSC. system_timestamp and tsc_timestamp are unique, one tuple per VM: the "master clock". Given a host with synchronized TSCs, its obvious that guest TSC must be matched for the above to guarantee monotonicity. Allow master clock usage only if guest TSCs are synchronized. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:15 -02:00
Marcelo Tosatti	42897d866b	KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization TSC initialization will soon make use of online_vcpus. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:14 -02:00
Marcelo Tosatti	d828199e84	KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag KVM added a global variable to guarantee monotonicity in the guest. One of the reasons for that is that the time between 1. ktime_get_ts(&timespec); 2. rdtscll(tsc); Is variable. That is, given a host with stable TSC, suppose that two VCPUs read the same time via ktime_get_ts() above. The time required to execute 2. is not the same on those two instances executing in different VCPUS (cache misses, interrupts...). If the TSC value that is used by the host to interpolate when calculating the monotonic time is the same value used to calculate the tsc_timestamp value stored in the pvclock data structure, and a single <system_timestamp, tsc_timestamp> tuple is visible to all vcpus simultaneously, this problem disappears. See comment on top of pvclock_update_vm_gtod_copy for details. Monotonicity is then guaranteed by synchronicity of the host TSCs and guest TSCs. Set TSC stable pvclock flag in that case, allowing the guest to read clock from userspace. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:13 -02:00
Marcelo Tosatti	16e8d74d2d	KVM: x86: notifier for clocksource changes Register a notifier for clocksource change event. In case the host switches to clock other than TSC, disable master clock usage. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:12 -02:00
Marcelo Tosatti	886b470cb1	KVM: x86: pass host_tsc to read_l1_tsc Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc(). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:11 -02:00
Marcelo Tosatti	78c0337a38	KVM: x86: retain pvclock guest stopped bit in guest memory Otherwise its possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU migration) to clear the bit. Noticed by Paolo Bonzini. Reviewed-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Glauber Costa <glommer@parallels.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-27 23:29:05 -02:00
H. Peter Anvin	cb7cb2864e	x86, kvm: Remove incorrect redundant assembly constraint In __emulate_1op_rax_rdx, we use "+a" and "+d" which are input/output constraints, and then use "a" and "d" as input constraints. This is incorrect, but happens to work on some versions of gcc. However, it breaks gcc with -O0 and icc, and may break on future versions of gcc. Reported-and-tested-by: Melanie Blower <melanie.blower@intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Link: http://lkml.kernel.org/r/B3584E72CFEBED439A3ECA9BCE67A4EF1B17AF90@FMSMSX107.amr.corp.intel.com Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-26 15:52:48 -08:00
Takashi Iwai	29282fde80	KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update() The commit [ad756a16: KVM: VMX: Implement PCID/INVPCID for guests with EPT] introduced the unconditional access to SECONDARY_VM_EXEC_CONTROL, and this triggers kernel warnings like below on old CPUs: vmwrite error: reg 401e value a0568000 (err 12) Pid: 13649, comm: qemu-kvm Not tainted 3.7.0-rc4-test2+ #154 Call Trace: [<ffffffffa0558d86>] vmwrite_error+0x27/0x29 [kvm_intel] [<ffffffffa054e8cb>] vmcs_writel+0x1b/0x20 [kvm_intel] [<ffffffffa054f114>] vmx_cpuid_update+0x74/0x170 [kvm_intel] [<ffffffffa03629b6>] kvm_vcpu_ioctl_set_cpuid2+0x76/0x90 [kvm] [<ffffffffa0341c67>] kvm_arch_vcpu_ioctl+0xc37/0xed0 [kvm] [<ffffffff81143f7c>] ? __vunmap+0x9c/0x110 [<ffffffffa0551489>] ? vmx_vcpu_load+0x39/0x1a0 [kvm_intel] [<ffffffffa0340ee2>] ? kvm_arch_vcpu_load+0x52/0x1a0 [kvm] [<ffffffffa032dcd4>] ? vcpu_load+0x74/0xd0 [kvm] [<ffffffffa032deb0>] kvm_vcpu_ioctl+0x110/0x5e0 [kvm] [<ffffffffa032e93d>] ? kvm_dev_ioctl+0x4d/0x4a0 [kvm] [<ffffffff8117dc6f>] do_vfs_ioctl+0x8f/0x530 [<ffffffff81139d76>] ? remove_vma+0x56/0x60 [<ffffffff8113b708>] ? do_munmap+0x328/0x400 [<ffffffff81187c8c>] ? fget_light+0x4c/0x100 [<ffffffff8117e1a1>] sys_ioctl+0x91/0xb0 [<ffffffff815a942d>] system_call_fastpath+0x1a/0x1f This patch adds a check for the availability of secondary exec control to avoid these warnings. Cc: <stable@vger.kernel.org> [v3.6+] Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-16 20:25:18 -02:00
Guo Chao	807f12e57c	KVM: remove unnecessary return value check No need to check return value before breaking switch. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-13 22:14:29 -02:00
Guo Chao	951179ce86	KVM: x86: fix return value of kvm_vm_ioctl_set_tss_addr() Return value of this function will be that of ioctl(). #include <stdio.h> #include <linux/kvm.h> int main () { int fd; fd = open ("/dev/kvm", 0); fd = ioctl (fd, KVM_CREATE_VM, 0); ioctl (fd, KVM_SET_TSS_ADDR, 0xfffff000); perror (""); return 0; } Output is "Operation not permitted". That's not what we want. Return -EINVAL in this case. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-13 22:14:29 -02:00
Guo Chao	18595411a7	KVM: do not kfree error pointer We should avoid kfree()ing error pointer in kvm_vcpu_ioctl() and kvm_arch_vcpu_ioctl(). Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-13 22:14:28 -02:00
Petr Matousek	6d1068b3a9	KVM: x86: invalid opcode oops on SET_SREGS with OSXSAVE bit set (CVE-2012-4461) On hosts without the XSAVE support unprivileged local user can trigger oops similar to the one below by setting X86_CR4_OSXSAVE bit in guest cr4 register using KVM_SET_SREGS ioctl and later issuing KVM_RUN ioctl. invalid opcode: 0000 [#2] SMP Modules linked in: tun ip6table_filter ip6_tables ebtable_nat ebtables ... Pid: 24935, comm: zoog_kvm_monito Tainted: G D 3.2.0-3-686-pae EIP: 0060:[<f8b9550c>] EFLAGS: 00210246 CPU: 0 EIP is at kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] EAX: 00000001 EBX: 000f387e ECX: 00000000 EDX: 00000000 ESI: 00000000 EDI: 00000000 EBP: ef5a0060 ESP: d7c63e70 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 Process zoog_kvm_monito (pid: 24935, ti=d7c62000 task=ed84a0c0 task.ti=d7c62000) Stack: 00000001 f70a1200 f8b940a9 ef5a0060 00000000 00200202 f8769009 00000000 ef5a0060 000f387e eda5c020 8722f9c8 00015bae 00000000 ed84a0c0 ed84a0c0 c12bf02d 0000ae80 ef7f8740 fffffffb f359b740 ef5a0060 f8b85dc1 0000ae80 Call Trace: [<f8b940a9>] ? kvm_arch_vcpu_ioctl_set_sregs+0x2fe/0x308 [kvm] ... [<c12bfb44>] ? syscall_call+0x7/0xb Code: 89 e8 e8 14 ee ff ff ba 00 00 04 00 89 e8 e8 98 48 ff ff 85 c0 74 1e 83 7d 48 00 75 18 8b 85 08 07 00 00 31 c9 8b 95 0c 07 00 00 <0f> 01 d1 c7 45 48 01 00 00 00 c7 45 1c 01 00 00 00 0f ae f0 89 EIP: [<f8b9550c>] kvm_arch_vcpu_ioctl_run+0x92a/0xd13 [kvm] SS:ESP 0068:d7c63e70 QEMU first retrieves the supported features via KVM_GET_SUPPORTED_CPUID and then sets them later. So guest's X86_FEATURE_XSAVE should be masked out on hosts without X86_FEATURE_XSAVE, making kvm_set_cr4 with X86_CR4_OSXSAVE fail. Userspaces that allow specifying guest cpuid with X86_FEATURE_XSAVE even on hosts that do not support it, might be susceptible to this attack from inside the guest as well. Allow setting X86_CR4_OSXSAVE bit only if host has XSAVE support. Signed-off-by: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-11-12 21:16:45 -02:00
Xiao Guangrong	87da7e66a4	KVM: x86: fix vcpu->mmio_fragments overflow After commit `b3356bf0db` (KVM: emulator: optimize "rep ins" handling), the pieces of io data can be collected and write them to the guest memory or MMIO together Unfortunately, kvm splits the mmio access into 8 bytes and store them to vcpu->mmio_fragments. If the guest uses "rep ins" to move large data, it will cause vcpu->mmio_fragments overflow The bug can be exposed by isapc (-M isapc): [23154.818733] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC [ ......] [23154.858083] Call Trace: [23154.859874] [<ffffffffa04f0e17>] kvm_get_cr8+0x1d/0x28 [kvm] [23154.861677] [<ffffffffa04fa6d4>] kvm_arch_vcpu_ioctl_run+0xcda/0xe45 [kvm] [23154.863604] [<ffffffffa04f5a1a>] ? kvm_arch_vcpu_load+0x17b/0x180 [kvm] Actually, we can use one mmio_fragment to store a large mmio access then split it when we pass the mmio-exit-info to userspace. After that, we only need two entries to store mmio info for the cross-mmio pages access Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-10-31 20:36:30 -02:00
Xiao Guangrong	81c52c56e2	KVM: do not treat noslot pfn as a error pfn This patch filters noslot pfn out from error pfns based on Marcelo comment: noslot pfn is not a error pfn After this patch, - is_noslot_pfn indicates that the gfn is not in slot - is_error_pfn indicates that the gfn is in slot but the error is occurred when translate the gfn to pfn - is_error_noslot_pfn indicates that the pfn either it is error pfns or it is noslot pfn And is_invalid_pfn can be removed, it makes the code more clean Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-10-29 20:31:04 -02:00
Marcelo Tosatti	19bf7f8ac3	Merge remote-tracking branch 'master' into queue Merge reason: development work has dependency on kvm patches merged upstream. Conflicts: arch/powerpc/include/asm/Kbuild arch/powerpc/include/asm/kvm_para.h Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-10-29 19:15:32 -02:00
Christoffer Dall	8ca40a70a7	KVM: Take kvm instead of vcpu to mmu_notifier_retry The mmu_notifier_retry is not specific to any vcpu (and never will be) so only take struct kvm as a parameter. The motivation is the ARM mmu code that needs to call this from somewhere where we long let go of the vcpu pointer. Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-23 13:35:43 +02:00
Gleb Natapov	7f46ddbd48	KVM: apic: fix LDR calculation in x2apic mode Signed-off-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Chegu Vinod <chegu_vinod@hp.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-22 18:03:27 +02:00
Xiao Guangrong	f3ac1a4b66	KVM: MMU: fix release noslot pfn We can not directly call kvm_release_pfn_clean to release the pfn since we can meet noslot pfn which is used to cache mmio info into spte Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-22 18:03:25 +02:00
Borislav Petkov	1f5b77f51a	KVM: SVM: Cleanup error statements Use __func__ instead of the function name in svm_hardware_enable since those things tend to get out of sync. This also slims down printk line length in conjunction with using pr_err. No functionality change. Cc: Joerg Roedel <joro@8bytes.org> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-22 16:04:59 +02:00
Xiao Guangrong	bf4ca23ef5	KVM: VMX: report internal error for MMIO #PF due to delivery event The #PF with PFEC.RSV = 1 indicates that the guest is accessing MMIO, we can not fix it if it is caused by delivery event. Reporting internal error for this case Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-18 16:30:32 +02:00
Xiao Guangrong	b9bf6882c1	KVM: VMX: report internal error for the unhandleable event VM exits during Event Delivery is really unexpected if it is not caused by Exceptions/EPT-VIOLATION/TASK_SWITCH, we'd better to report an internal and freeze the guest, the VMM has the chance to check the guest Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-18 16:30:29 +02:00
Gleb Natapov	471842ec49	KVM: do not de-cache cr4 bits needlessly Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-18 14:49:28 +02:00
Xiao Guangrong	bd6360cc0a	KVM: MMU: introduce FNAME(prefetch_gpte) The only difference between FNAME(update_pte) and FNAME(pte_prefetch) is that the former is allowed to prefetch gfn from dirty logged slot, so introduce a common function to prefetch spte Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-17 16:39:19 +02:00
Xiao Guangrong	a052b42b0e	KVM: MMU: move prefetch_invalid_gpte out of pagaing_tmp.h The function does not depend on guest mmu mode, move it out from paging_tmpl.h Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-17 16:39:18 +02:00
Xiao Guangrong	d4878f24e3	KVM: MMU: cleanup FNAME(page_fault) Let it return emulate state instead of spte like __direct_map Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-17 16:39:16 +02:00
Xiao Guangrong	bd660776da	KVM: MMU: remove mmu_is_invalid Remove mmu_is_invalid and use is_invalid_pfn instead Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-10-17 16:39:15 +02:00
Jan Kiszka	b6785def83	KVM: x86: Make emulator_fix_hypercall static No users outside of kvm/x86.c. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-10-08 14:03:16 -03:00
Jan Kiszka	8b6e4547e0	KVM: x86: Convert kvm_arch_vcpu_reset into private kvm_vcpu_reset There are no external callers of this function as there is no concept of resetting a vcpu from generic code. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-10-08 14:02:35 -03:00
Linus Torvalds	ecefbd94b8	KVM updates for the 3.7 merge window -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+ R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3 eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj 0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7 lb/nbAsXf9UJLgGir4I1 =kN6V -----END PGP SIGNATURE----- Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Avi Kivity: "Highlights of the changes for this release include support for vfio level triggered interrupts, improved big real mode support on older Intels, a streamlines guest page table walker, guest APIC speedups, PIO optimizations, better overcommit handling, and read-only memory." * tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits) KVM: s390: Fix vcpu_load handling in interrupt code KVM: x86: Fix guest debug across vcpu INIT reset KVM: Add resampling irqfds for level triggered interrupts KVM: optimize apic interrupt delivery KVM: MMU: Eliminate pointless temporary 'ac' KVM: MMU: Avoid access/dirty update loop if all is well KVM: MMU: Eliminate eperm temporary KVM: MMU: Optimize is_last_gpte() KVM: MMU: Simplify walk_addr_generic() loop KVM: MMU: Optimize pte permission checks KVM: MMU: Update accessed and dirty bits after guest pagetable walk KVM: MMU: Move gpte_access() out of paging_tmpl.h KVM: MMU: Optimize gpte_access() slightly KVM: MMU: Push clean gpte write protection out of gpte_access() KVM: clarify kvmclock documentation KVM: make processes waiting on vcpu mutex killable KVM: SVM: Make use of asm.h KVM: VMX: Make use of asm.h KVM: VMX: Make lto-friendly KVM: x86: lapic: Clean up find_highest_vector() and count_vectors() ... Conflicts: arch/s390/include/asm/processor.h arch/x86/kvm/i8259.c	2012-10-04 09:30:33 -07:00
Linus Torvalds	ac07f5c3cb	Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/fpu update from Ingo Molnar: "The biggest change is the addition of the non-lazy (eager) FPU saving support model and enabling it on CPUs with optimized xsaveopt/xrstor FPU state saving instructions. There are also various Sparse fixes" Fix up trivial add-add conflict in arch/x86/kernel/traps.c * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, kvm: fix kvm's usage of kernel_fpu_begin/end() x86, fpu: remove cpu_has_xmm check in the fx_finit() x86, fpu: make eagerfpu= boot param tri-state x86, fpu: enable eagerfpu by default for xsaveopt x86, fpu: decouple non-lazy/eager fpu restore from xsave x86, fpu: use non-lazy fpu restore for processors supporting xsave lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu() x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig() x86, fpu: drop_fpu() before restoring new state from sigframe x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels x86, fpu: Consolidate inline asm routines for saving/restoring fpu state x86, signal: Cleanup ifdefs and is_ia32, is_x32	2012-10-01 11:10:52 -07:00
Linus Torvalds	08815bc267	Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/cleanups from Ingo Molnar: "Smaller cleanups" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: arch/x86: Remove unecessary semicolons x86, boot: Remove obsolete and unused constant RAMDISK	2012-10-01 10:47:45 -07:00
Linus Torvalds	7e92daaefa	Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf update from Ingo Molnar: "Lots of changes in this cycle as well, with hundreds of commits from over 30 contributors. Most of the activity was on the tooling side. Higher level changes: - New 'perf kvm' analysis tool, from Xiao Guangrong. - New 'perf trace' system-wide tracing tool - uprobes fixes + cleanups from Oleg Nesterov. - Lots of patches to make perf build on Android out of box, from Irina Tirdea - Extend ftrace function tracing utility to be more dynamic for its users. It allows for data passing to the callback functions, as well as reading regs as if a breakpoint were to trigger at function entry. The main goal of this patch series was to allow kprobes to use ftrace as an optimized probe point when a probe is placed on an ftrace nop. With lots of help from Masami Hiramatsu, and going through lots of iterations, we finally came up with a good solution. - Add cpumask for uncore pmu, use it in 'stat', from Yan, Zheng. - Various tracing updates from Steve Rostedt - Clean up and improve 'perf sched' performance by elliminating lots of needless calls to libtraceevent. - Event group parsing support, from Jiri Olsa - UI/gtk refactorings and improvements from Namhyung Kim - Add support for non-tracepoint events in perf script python, from Feng Tang - Add --symbols to 'script', similar to the one in 'report', from Feng Tang. Infrastructure enhancements and fixes: - Convert the trace builtins to use the growing evsel/evlist tracepoint infrastructure, removing several open coded constructs like switch like series of strcmp to dispatch events, etc. Basically what had already been showcased in 'perf sched'. - Add evsel constructor for tracepoints, that uses libtraceevent just to parse the /format events file, use it in a new 'perf test' to make sure the libtraceevent format parsing regressions can be more readily caught. - Some strange errors were happening in some builds, but not on the next, reported by several people, problem was some parser related files, generated during the build, didn't had proper make deps, fix from Eric Sandeen. - Introduce struct and cache information about the environment where a perf.data file was captured, from Namhyung Kim. - Fix handling of unresolved samples when --symbols is used in 'report', from Feng Tang. - Add union member access support to 'probe', from Hyeoncheol Lee. - Fixups to die() removal, from Namhyung Kim. - Render fixes for the TUI, from Namhyung Kim. - Don't enable annotation in non symbolic view, from Namhyung Kim. - Fix pipe mode in 'report', from Namhyung Kim. - Move related stats code from stat to util/, will be used by the 'stat' kvm tool, from Xiao Guangrong. - Remove die()/exit() calls from several tools. - Resolve vdso callchains, from Jiri Olsa - Don't pass const char pointers to basename, so that we can unconditionally use libgen.h and thus avoid ifdef BIONIC lines, from David Ahern - Refactor hist formatting so that it can be reused with the GTK browser, From Namhyung Kim - Fix build for another rbtree.c change, from Adrian Hunter. - Make 'perf diff' command work with evsel hists, from Jiri Olsa. - Use the only field_sep var that is set up: symbol_conf.field_sep, fix from Jiri Olsa. - .gitignore compiled python binaries, from Namhyung Kim. - Get rid of die() in more libtraceevent places, from Namhyung Kim. - Rename libtraceevent 'private' struct member to 'priv' so that it works in C++, from Steven Rostedt - Remove lots of exit()/die() calls from tools so that the main perf exit routine can take place, from David Ahern - Fix x86 build on x86-64, from David Ahern. - {int,str,rb}list fixes from Suzuki K Poulose - perf.data header fixes from Namhyung Kim - Allow user to indicate objdump path, needed in cross environments, from Maciek Borzecki - Fix hardware cache event name generation, fix from Jiri Olsa - Add round trip test for sw, hw and cache event names, catching the problem Jiri fixed, after Jiri's patch, the test passes successfully. - Clean target should do clean for lib/traceevent too, fix from David Ahern - Check the right variable for allocation failure, fix from Namhyung Kim - Set up evsel->tp_format regardless of evsel->name being set already, fix from Namhyung Kim - Oprofile fixes from Robert Richter. - Remove perf_event_attr needless version inflation, from Jiri Olsa - Introduce libtraceevent strerror like error reporting facility, from Namhyung Kim - Add pmu mappings to perf.data header and use event names from cmd line, from Robert Richter - Fix include order for bison/flex-generated C files, from Ben Hutchings - Build fixes and documentation corrections from David Ahern - Assorted cleanups from Robert Richter - Let O= makes handle relative paths, from Steven Rostedt - perf script python fixes, from Feng Tang. - Initial bash completion support, from Frederic Weisbecker - Allow building without libelf, from Namhyung Kim. - Support DWARF CFI based unwind to have callchains when %bp based unwinding is not possible, from Jiri Olsa. - Symbol resolution fixes, while fixing support PPC64 files with an .opt ELF section was the end goal, several fixes for code that handles all architectures and cleanups are included, from Cody Schafer. - Assorted fixes for Documentation and build in 32 bit, from Robert Richter - Cache the libtraceevent event_format associated to each evsel early, so that we avoid relookups, i.e. calling pevent_find_event repeatedly when processing tracepoint events. [ This is to reduce the surface contact with libtraceevents and make clear what is that the perf tools needs from that lib: so far parsing the common and per event fields. ] - Don't stop the build if the audit libraries are not installed, fix from Namhyung Kim. - Fix bfd.h/libbfd detection with recent binutils, from Markus Trippelsdorf. - Improve warning message when libunwind devel packages not present, from Jiri Olsa" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits) perf trace: Add aliases for some syscalls perf probe: Print an enum type variable in "enum variable-name" format when showing accessible variables perf tools: Check libaudit availability for perf-trace builtin perf hists: Add missing period_* fields when collapsing a hist entry perf trace: New tool perf evsel: Export the event_format constructor perf evsel: Introduce rawptr() method perf tools: Use perf_evsel__newtp in the event parser perf evsel: The tracepoint constructor should store sys:name perf evlist: Introduce set_filter() method perf evlist: Renane set_filters method to apply_filters perf test: Add test to check we correctly parse and match syscall open parms perf evsel: Handle endianity in intval method perf evsel: Know if byte swap is needed perf tools: Allow handling a NULL cpu_map as meaning "all cpus" perf evsel: Improve tracepoint constructor setup tools lib traceevent: Fix error path on pevent_parse_event perf test: Fix build failure trace: Move trace event enable from fs_initcall to core_initcall tracing: Add an option for disabling markers ...	2012-10-01 10:28:49 -07:00
Jan Kiszka	c863901075	KVM: x86: Fix guest debug across vcpu INIT reset If we reset a vcpu on INIT, we so far overwrote dr7 as provided by KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally. Fix this by saving the dr7 used for guest debugging and calculating the effective register value as well as switch_db_regs on any potential change. This will change to focus of the set_guest_debug vendor op to update_dp_bp_intercept. Found while trying to stop on start_secondary. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-23 15:00:07 +02:00
Alex Williamson	7a84428af7	KVM: Add resampling irqfds for level triggered interrupts To emulate level triggered interrupts, add a resample option to KVM_IRQFD. When specified, a new resamplefd is provided that notifies the user when the irqchip has been resampled by the VM. This may, for instance, indicate an EOI. Also in this mode, posting of an interrupt through an irqfd only asserts the interrupt. On resampling, the interrupt is automatically de-asserted prior to user notification. This enables level triggered interrupts to be posted and re-enabled from vfio with no userspace intervention. All resampling irqfds can make use of a single irq source ID, so we reserve a new one for this interface. Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-23 13:50:15 +02:00
Suresh Siddha	b1a74bf821	x86, kvm: fix kvm's usage of kernel_fpu_begin/end() Preemption is disabled between kernel_fpu_begin/end() and as such it is not a good idea to use these routines in kvm_load/put_guest_fpu() which can be very far apart. kvm_load/put_guest_fpu() routines are already called with preemption disabled and KVM already uses the preempt notifier to save the guest fpu state using kvm_put_guest_fpu(). So introduce __kernel_fpu_begin/end() routines which don't touch preemption and use them instead of kernel_fpu_begin/end() for KVM's use model of saving/restoring guest FPU state. Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit state in the case of VMX. For eagerFPU case, host cr0.TS is always clear. So no need to worry about it. For the traditional lazyFPU restore case, change the cr0.TS bit for the host state during vm-exit to be always clear and cr0.TS bit is set in the __vmx_load_host_state() when the FPU (guest FPU or the host task's FPU) state is not active. This ensures that the host/guest FPU state is properly saved, restored during context-switch and with interrupts (using irq_fpu_usable()) not stomping on the active FPU state. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com Cc: Avi Kivity <avi@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2012-09-21 16:59:04 -07:00
Xiao Guangrong	26bf264e87	KVM: x86: Export svm/vmx exit code and vector code to userspace Exporting KVM exit information to userspace to be consumed by perf. Signed-off-by: Dong Hao <haodong@linux.vnet.ibm.com> [ Dong Hao <haodong@linux.vnet.ibm.com>: rebase it on acme's git tree ] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: kvm@vger.kernel.org Cc: Runzhen Wang <runzhen@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1347870675-31495-2-git-send-email-haodong@linux.vnet.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>	2012-09-21 12:48:09 -03:00
Gleb Natapov	1e08ec4a13	KVM: optimize apic interrupt delivery Most interrupt are delivered to only one vcpu. Use pre-build tables to find interrupt destination instead of looping through all vcpus. In case of logical mode loop only through vcpus in a logical cluster irq is sent to. Signed-off-by: Gleb Natapov <gleb@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 15:05:26 +03:00
Avi Kivity	c5421519f3	KVM: MMU: Eliminate pointless temporary 'ac' 'ac' essentially reconstructs the 'access' variable we already have, except for the PFERR_PRESENT_MASK and PFERR_RSVD_MASK. As these are not used by callees, just use 'access' directly. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:10 +03:00
Avi Kivity	b514c30f77	KVM: MMU: Avoid access/dirty update loop if all is well Keep track of accessed/dirty bits; if they are all set, do not enter the accessed/dirty update loop. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:09 +03:00
Avi Kivity	71331a1da1	KVM: MMU: Eliminate eperm temporary 'eperm' is no longer used in the walker loop, so we can eliminate it. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:09 +03:00
Avi Kivity	6fd01b711b	KVM: MMU: Optimize is_last_gpte() Instead of branchy code depending on level, gpte.ps, and mmu configuration, prepare everything in a bitmap during mode changes and look it up during runtime. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:09 +03:00
Avi Kivity	13d22b6aeb	KVM: MMU: Simplify walk_addr_generic() loop The page table walk is coded as an infinite loop, with a special case on the last pte. Code it as an ordinary loop with a termination condition on the last pte (large page or walk length exhausted), and put the last pte handling code after the loop where it belongs. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:08 +03:00
Avi Kivity	97d64b7881	KVM: MMU: Optimize pte permission checks walk_addr_generic() permission checks are a maze of branchy code, which is performed four times per lookup. It depends on the type of access, efer.nxe, cr0.wp, cr4.smep, and in the near future, cr4.smap. Optimize this away by precalculating all variants and storing them in a bitmap. The bitmap is recalculated when rarely-changing variables change (cr0, cr4) and is indexed by the often-changing variables (page fault error code, pte access permissions). The permission check is moved to the end of the loop, otherwise an SMEP fault could be reported as a false positive, when PDE.U=1 but PTE.U=0. Noted by Xiao Guangrong. The result is short, branch-free code. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:08 +03:00
Avi Kivity	8cbc70696f	KVM: MMU: Update accessed and dirty bits after guest pagetable walk While unspecified, the behaviour of Intel processors is to first perform the page table walk, then, if the walk was successful, to atomically update the accessed and dirty bits of walked paging elements. While we are not required to follow this exactly, doing so will allow us to perform the access permissions check after the walk is complete, rather than after each walk step. (the tricky case is SMEP: a zero in any pte's U bit makes the referenced page a supervisor page, so we can't fault on a one bit during the walk itself). Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:08 +03:00
Avi Kivity	3d34adec70	KVM: MMU: Move gpte_access() out of paging_tmpl.h We no longer rely on paging_tmpl.h defines; so we can move the function to mmu.c. Rely on zero extension to 64 bits to get the correct nx behaviour. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:08 +03:00
Avi Kivity	edc2ae84eb	KVM: MMU: Optimize gpte_access() slightly If nx is disabled, then is gpte[63] is set we will hit a reserved bit set fault before checking permissions; so we can ignore the setting of efer.nxe. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:07 +03:00
Avi Kivity	8ea667f259	KVM: MMU: Push clean gpte write protection out of gpte_access() gpte_access() computes the access permissions of a guest pte and also write-protects clean gptes. This is wrong when we are servicing a write fault (since we'll be setting the dirty bit momentarily) but correct when instantiating a speculative spte, or when servicing a read fault (since we'll want to trap a following write in order to set the dirty bit). It doesn't seem to hurt in practice, but in order to make the code readable, push the write protection out of gpte_access() and into a new protect_clean_gpte() which is called explicitly when needed. Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-20 13:00:07 +03:00
Peter Senna Tschudin	4b8073e467	arch/x86: Remove unecessary semicolons Found by http://coccinelle.lip6.fr/ Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com> Cc: avi@redhat.com Cc: mtosatti@redhat.com Cc: a.p.zijlstra@chello.nl Cc: rusty@rustcorp.com.au Cc: masami.hiramatsu.pt@hitachi.com Cc: suresh.b.siddha@intel.com Cc: joerg.roedel@amd.com Cc: agordeev@redhat.com Cc: yinghai@kernel.org Cc: bhelgaas@google.com Cc: liuj97@gmail.com Link: http://lkml.kernel.org/r/1347986174-30287-7-git-send-email-peter.senna@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-09-19 17:32:48 +02:00
Suresh Siddha	9c1c3fac53	x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu() kvm's guest fpu save/restore should be wrapped around kernel_fpu_begin/end(). This will avoid for example taking a DNA in kvm_load_guest_fpu() when it tries to load the fpu immediately after doing unlazy_fpu() on the host side. More importantly this will prevent the host process fpu from being corrupted. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1345842782-24175-4-git-send-email-suresh.b.siddha@intel.com Cc: Avi Kivity <avi@redhat.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2012-09-18 15:52:07 -07:00
Michael S. Tsirkin	9fc77441e5	KVM: make processes waiting on vcpu mutex killable vcpu mutex can be held for unlimited time so taking it with mutex_lock on an ioctl is wrong: one process could be passed a vcpu fd and call this ioctl on the vcpu used by another process, it will then be unkillable until the owner exits. Call mutex_lock_killable instead and return status. Note: mutex_lock_interruptible would be even nicer, but I am not sure all users are prepared to handle EINTR from these ioctls. They might misinterpret it as an error. Cleanup paths expect a vcpu that can't be used by any userspace so this will always succeed - catch bugs by calling BUG_ON. Catch callers that don't check return state by adding __must_check. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-09-17 13:46:32 -03:00
Avi Kivity	7454766f7b	KVM: SVM: Make use of asm.h Use macros for bitness-insensitive register names, instead of rolling our own. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-09-17 10:38:05 -03:00
Avi Kivity	b188c81f2e	KVM: VMX: Make use of asm.h Use macros for bitness-insensitive register names, instead of rolling our own. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-09-17 10:38:04 -03:00
Avi Kivity	83287ea420	KVM: VMX: Make lto-friendly LTO (link-time optimization) doesn't like local labels to be referred to from a different function, since the two functions may be built in separate compilation units. Use an external variable instead. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-09-17 10:38:03 -03:00
Takuya Yoshikawa	ecba9a52ac	KVM: x86: lapic: Clean up find_highest_vector() and count_vectors() find_highest_vector() and count_vectors(): - Instead of using magic values, define and use proper macros. find_highest_vector(): - Remove likely() which is there only for historical reasons and not doing correct branch predictions anymore. Using such heuristics to optimize this function is not worth it now. Let CPUs predict things instead. - Stop checking word[0] separately. This was only needed for doing likely() optimization. - Use for loop, not while, to iterate over the register array to make the code clearer. Note that we actually confirmed that the likely() did wrong predictions by inserting debug code. Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-09-12 13:38:23 -03:00
Xiao Guangrong	4484141a94	KVM: fix error paths for failed gfn_to_page() calls This bug was triggered: [ 4220.198458] BUG: unable to handle kernel paging request at fffffffffffffffe [ 4220.203907] IP: [<ffffffff81104d85>] put_page+0xf/0x34 ...... [ 4220.237326] Call Trace: [ 4220.237361] [<ffffffffa03830d0>] kvm_arch_destroy_vm+0xf9/0x101 [kvm] [ 4220.237382] [<ffffffffa036fe53>] kvm_put_kvm+0xcc/0x127 [kvm] [ 4220.237401] [<ffffffffa03702bc>] kvm_vcpu_release+0x18/0x1c [kvm] [ 4220.237407] [<ffffffff81145425>] __fput+0x111/0x1ed [ 4220.237411] [<ffffffff8114550f>] ____fput+0xe/0x10 [ 4220.237418] [<ffffffff81063511>] task_work_run+0x5d/0x88 [ 4220.237424] [<ffffffff8104c3f7>] do_exit+0x2bf/0x7ca The test case: printf(fmt, ##args); \ exit(-1);} while (0) static int create_vm(void) { int sys_fd, vm_fd; sys_fd = open("/dev/kvm", O_RDWR); if (sys_fd < 0) die("open /dev/kvm fail.\n"); vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0); if (vm_fd < 0) die("KVM_CREATE_VM fail.\n"); return vm_fd; } static int create_vcpu(int vm_fd) { int vcpu_fd; vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0); if (vcpu_fd < 0) die("KVM_CREATE_VCPU ioctl.\n"); printf("Create vcpu.\n"); return vcpu_fd; } static void vcpu_thread(void arg) { int vm_fd = (int)(long)arg; create_vcpu(vm_fd); return NULL; } int main(int argc, char argv[]) { pthread_t thread; int vm_fd; (void)argc; (void)argv; vm_fd = create_vm(); pthread_create(&thread, NULL, vcpu_thread, (void )(long)vm_fd); printf("Exit.\n"); return 0; } It caused by release kvm->arch.ept_identity_map_addr which is the error page. The parent thread can send KILL signal to the vcpu thread when it was exiting which stops faulting pages and potentially allocating memory. So gfn_to_pfn/gfn_to_page may fail at this time Fixed by checking the page before it is used Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-10 11:34:11 +03:00
Xiao Guangrong	7de5bdc96c	KVM: MMU: remove unnecessary check Checking the return of kvm_mmu_get_page is unnecessary since it is guaranteed by memory cache Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-10 11:26:16 +03:00
Liu, Jinsong	92b5265d38	KVM: Depend on HIGH_RES_TIMERS KVM lapic timer and tsc deadline timer based on hrtimer, setting a leftmost node to rb tree and then do hrtimer reprogram. If hrtimer not configured as high resolution, hrtimer_enqueue_reprogram do nothing and then make kvm lapic timer and tsc deadline timer fail. Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-10 11:10:03 +03:00
Ren, Yongjie	4f97704555	KVM: x86: Check INVPCID feature bit in EBX of leaf 7 Checks and operations on the INVPCID feature bit should use EBX of CPUID leaf 7 instead of ECX. Signed-off-by: Junjie Mao <junjie.mao@intel.com> Signed-off-by: Yongjie Ren <yongjien.ren@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-09 17:34:01 +03:00
Michael S. Tsirkin	a50abc3b2b	KVM: use symbolic constant for nr interrupts interrupt_bitmap is KVM_NR_INTERRUPTS bits in size, so just use that instead of hard-coded constants and math. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-06 18:37:44 +03:00
Gleb Natapov	b3356bf0db	KVM: emulator: optimize "rep ins" handling Optimize "rep ins" by allowing emulator to write back more than one datum at a time. Introduce new operand type OP_MEM_STR which tells writeback() that dst contains pointer to an array that should be written back as opposite to just one data element. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-06 18:07:38 +03:00
Gleb Natapov	f3bd64c68a	KVM: emulator: string_addr_inc() cleanup Remove unneeded segment argument. Address structure already has correct segment which was put there during decode. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-06 18:07:01 +03:00
Gleb Natapov	9d1b39a967	KVM: emulator: make x86 emulation modes enum instead of defines Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-06 18:07:01 +03:00
Gleb Natapov	716d51abff	KVM: Provide userspace IO exit completion callback Current code assumes that IO exit was due to instruction emulation and handles execution back to emulator directly. This patch adds new userspace IO exit completion callback that can be set by any other code that caused IO exit to userspace. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-06 18:06:37 +03:00
Marcelo Tosatti	3b4dc3a031	KVM: move postcommit flush to x86, as mmio sptes are x86 specific Other arches do not need this. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> v2: fix incorrect deletion of mmio sptes on gpa move (noticed by Takuya) Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-06 16:37:30 +03:00
Marcelo Tosatti	2df72e9bc4	KVM: split kvm_arch_flush_shadow Introducing kvm_arch_flush_shadow_memslot, to invalidate the translations of a single memory slot. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-06 16:37:25 +03:00
Mathias Krause	09941fbb71	KVM: SVM: constify lookup tables We never modify direct_access_msrs[], msrpm_ranges[], svm_exit_handlers[] or x86_intercept_map[] at runtime. Mark them r/o. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-05 12:42:14 +03:00
Mathias Krause	772e031899	KVM: VMX: constify lookup tables We use vmcs_field_to_offset_table[], kvm_vmx_segment_fields[] and kvm_vmx_exit_handlers[] as lookup tables only -- make them r/o. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-05 12:42:09 +03:00
Mathias Krause	f1d248315a	KVM: x86: more constification Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-05 12:42:02 +03:00
Mathias Krause	0fbe9b0b19	KVM: x86: constify read_write_emulator_ops We never change those, make them r/o. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-05 12:41:54 +03:00
Mathias Krause	0225fb509d	KVM: x86 emulator: constify emulate_ops We never change emulate_ops[] at runtime so it should be r/o. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-05 12:41:48 +03:00
Mathias Krause	fd0a0d8208	KVM: x86 emulator: mark opcode tables const The opcode tables never change at runtime, therefor mark them const. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-05 12:41:28 +03:00
Mathias Krause	89a87c6779	KVM: x86 emulator: use aligned variants of SSE register ops As the the compiler ensures that the memory operand is always aligned to a 16 byte memory location, use the aligned variant of MOVDQ for read_sse_reg() and write_sse_reg(). Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-05 12:41:11 +03:00
Mathias Krause	326d07cb30	KVM: x86: minor size optimization Some fields can be constified and/or made static to reduce code and data size. Numbers for a 32 bit build: text data bss dec hex filename before: 3351 80 0 3431 d67 cpuid.o after: 3391 0 0 3391 d3f cpuid.o Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-05 12:41:09 +03:00
Jamie Iles	749c59fd15	KVM: PIC: fix use of uninitialised variable. Commit `aea218f3cb` (KVM: PIC: call ack notifiers for irqs that are dropped form irr) used an uninitialised variable to track whether an appropriate apic had been found. This could result in calling the ack notifier incorrectly. Cc: Gleb Natapov <gleb@redhat.com> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: Jamie Iles <jamie@jamieiles.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-04 15:44:42 +03:00
Gleb Natapov	ec798660cf	KVM: cleanup pic reset kvm_pic_reset() is not used anywhere. Move reset logic from pic_ioport_write() there. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-09-04 14:53:51 +03:00
Marcelo Tosatti	9a7819774e	KVM: x86: remove unused variable from kvm_task_switch() Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-30 17:45:54 -03:00
Avi Kivity	a81aba14dc	KVM: VMX: Ignore segment G and D bits when considering whether we can virtualize We will enter the guest with G and D cleared; as real hardware ignores D in real mode, and G is taken care of by the limit test, we allow more code to run in vm86 mode. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:21 -03:00
Avi Kivity	ce56680347	KVM: VMX: Save all segment data in real mode Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:21 -03:00
Avi Kivity	1390a28b27	KVM: VMX: Preserve segment limit and access rights in real mode While this is undocumented, real processors do not reload the segment limit and access rights when loading a segment register in real mode. Real programs rely on it so we need to comply with this behaviour. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:21 -03:00
Avi Kivity	7263642028	KVM: VMX: Return real real-mode segment data even if emulate_invalid_guest_state=1 emulate_invalid_guest_state=1 doesn't mean we don't munge the segments in the vmcs; we do. So we need to return the real ones (maintained by vmx_set_segment). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:21 -03:00
Avi Kivity	0afbe2f878	KVM: x86 emulator: Fix #GP error code during linearization We want the segment selector, nor segment number. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:20 -03:00
Avi Kivity	a5625189f6	KVM: x86 emulator: Check segment limits in real mode too Segment limits are verified in real mode, not just protected mode. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:20 -03:00
Avi Kivity	03ebebeb1f	KVM: x86 emulator: Leave segment limit and attributs alone in real mode When loading a segment in real mode, only the base and selector must be modified. The limit needs to be left alone, otherwise big real mode users will hit a #GP due to limit checking (currently this is suppressed because we don't check limits in real mode). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:20 -03:00
Avi Kivity	e2a610d7fc	KVM: VMX: Allow vm86 virtualization of big real mode Usually, big real mode uses large (4GB) segments. Currently we don't virtualize this; if any segment has a limit other than 0xffff, we emulate. But if we set the vmx-visible limit to 0xffff, we can use vm86 to virtualize real mode; if an access overruns the segment limit, the guest will #GP, which we will trap and forward to the emulator. This results in significantly faster execution, and less risk of hitting an unemulated instruction. If the limit is less than 0xffff, we retain the existing behaviour. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:20 -03:00
Avi Kivity	495e116684	KVM: VMX: Allow real mode emulation using vm86 with dpl=0 Real mode is always entered from protected mode with dpl=0. Since the dpl doesn't affect execution, and we already override it to 3 in the vmcs (as vmx requires), we can allow execution in that state. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:20 -03:00
Avi Kivity	c865c43de6	KVM: VMX: Retain limit and attributes when entering protected mode Real processors don't change segment limits and attributes while in real mode. Mimic that behaviour. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:20 -03:00
Avi Kivity	f5f7b2fe3b	KVM: VMX: Use kvm_segment to save protected-mode segments when entering realmode Instead of using struct kvm_save_segment, use struct kvm_segment, which is what the other APIs use. This leads to some simplification. We replace save_rmode_seg() with a call to vmx_save_segment(). Since this depends on rmode.vm86_active, we move the call to before setting the flag. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:19 -03:00
Avi Kivity	72fbefec26	KVM: VMX: Fix incorrect lookup of segment S flag in fix_pmode_dataseg() fix_pmode_dataseg() looks up S in ->base instead of ->ar_bytes. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:19 -03:00
Avi Kivity	baa7e81e32	KVM: VMX: Separate saving pre-realmode state from setting segments Commit `b246dd5df1` ("KVM: VMX: Fix KVM_SET_SREGS with big real mode segments") moved fix_rmode_seg() to vmx_set_segment(), so that it is applied not just on transitions to real mode, but also on KVM_SET_SREGS (migration). However fix_rmode_seg() not only munges the vmcs segments, it also sets up the save area for us to restore when returning to protected mode or to return in vmx_get_segment(). Move saving the segment into a new function, save_rmode_seg(), and call it just during the transition. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 20:02:19 -03:00
Avi Kivity	dd856efafe	KVM: x86 emulator: access GPRs on demand Instead of populating the entire register file, read in registers as they are accessed, and write back only the modified ones. This saves a VMREAD and VMWRITE on Intel (for rsp, since it is not usually used during emulation), and a two 128-byte copies for the registers. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 18:38:55 -03:00
Michael S. Tsirkin	1d92128fe9	KVM: x86: fix KVM_GET_MSR for PV EOI KVM_GET_MSR was missing support for PV EOI, which is needed for migration. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-27 18:03:05 -03:00
Marcelo Tosatti	c78aa4c4b9	Merge remote-tracking branch 'upstream/master' into queue Merging critical fixes from upstream required for development. * upstream/master: (809 commits) libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry Revert "powerpc: Update g5_defconfig" powerpc/perf: Use pmc_overflow() to detect rolled back events powerpc: Fix VMX in interrupt check in POWER7 copy loops powerpc: POWER7 copy_to_user/copy_from_user patch applied twice powerpc: Fix personality handling in ppc64_personality() powerpc/dma-iommu: Fix IOMMU window check powerpc: Remove unnecessary ifdefs powerpc/kgdb: Restore current_thread_info properly powerpc/kgdb: Bail out of KGDB when we've been triggered powerpc/kgdb: Do not set kgdb_single_step on ppc powerpc/mpic_msgr: Add missing includes powerpc: Fix null pointer deref in perf hardware breakpoints powerpc: Fixup whitespace in xmon powerpc: Fix xmon dl command for new printk implementation xfs: check for possible overflow in xfs_ioc_trim xfs: unlock the AGI buffer when looping in xfs_dialloc xfs: fix uninitialised variable in xfs_rtbuf_get() powerpc/fsl: fix "Failed to mount /dev: No such device" errors powerpc/fsl: update defconfigs ... Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-26 13:58:41 -03:00
Avi Kivity	5ad105e569	KVM: x86 emulator: use stack size attribute to mask rsp in stack ops The sub-register used to access the stack (sp, esp, or rsp) is not determined by the address size attribute like other memory references, but by the stack segment's B bit (if not in x86_64 mode). Fix by using the existing stack_mask() to figure out the correct mask. This long-existing bug was exposed by a combination of `a27685c33a` (emulate invalid guest state by default), which causes many more instructions to be emulated, and a seabios change (possibly a bug) which causes the high 16 bits of esp to become polluted across calls to real mode software interrupts. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-22 18:54:26 -03:00
Takuya Yoshikawa	35f2d16bb9	KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended Although the possible race described in commit `85b7059169` KVM: MMU: fix shrinking page from the empty mmu was correct, the real cause of that issue was a more trivial bug of mmu_shrink() introduced by commit `1952639665` KVM: MMU: do not iterate over all VMs in mmu_shrink() Here is the bug: if (kvm->arch.n_used_mmu_pages > 0) { if (!nr_to_scan--) break; continue; } We skip VMs whose n_used_mmu_pages is not zero and try to shrink others: in other words we try to shrink empty ones by mistake. This patch reverses the logic so that mmu_shrink() can free pages from the first VM whose n_used_mmu_pages is not zero. Note that we also add comments explaining the role of nr_to_scan which is not practically important now, hoping this will be improved in the future. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-22 15:27:13 +03:00
Xiao Guangrong	4d8b81abc4	KVM: introduce readonly memslot In current code, if we map a readonly memory space from host to guest and the page is not currently mapped in the host, we will get a fault pfn and async is not allowed, then the vm will crash We introduce readonly memory region to map ROM/ROMD to the guest, read access is happy for readonly memslot, write access on readonly memslot will cause KVM_EXIT_MMIO exit Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-22 15:09:03 +03:00
Xiao Guangrong	037d92dc5d	KVM: introduce gfn_to_pfn_memslot_atomic It can instead of hva_to_pfn_atomic Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-22 15:08:52 +03:00
Xiao Guangrong	8e3d9d061b	KVM: x86: fix possible infinite loop caused by reexecute_instruction Currently, we reexecute all unhandleable instructions if they do not access on the mmio, however, it can not work if host map the readonly memory to guest. If the instruction try to write this kind of memory, it will fault again when guest retry it, then we will goto a infinite loop: retry instruction -> write #PF -> emulation fail -> retry instruction -> ... Fix it by retrying the instruction only when it faults on the writable memory Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-22 15:08:49 +03:00
Michael S. Tsirkin	28a6fdabb3	KVM: x86: drop parameter validation in ioapic/pic We validate irq pin number when routing is setup, so code handling illegal irq # in pic and ioapic on each injection is never called. Drop it, replace with BUG_ON to catch out of bounds access bugs. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-14 22:35:22 -03:00
Avi Kivity	dbcb4e7980	KVM: VMX: Advertize RDTSC exiting to nested guests All processors that support VMX have that feature, and guests (Xen) depend on it. As we already implement it, advertize it to the guest. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-13 19:08:28 -03:00
Gleb Natapov	2a7921b7a0	KVM: VMX: restore MSR_IA32_DEBUGCTLMSR after VMEXIT MSR_IA32_DEBUGCTLMSR is zeroed on VMEXIT. Restore it to the correct value. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-13 19:07:58 -03:00
Marcelo Tosatti	51d59c6b42	KVM: x86: fix pvclock guest stopped flag reporting kvm_guest_time_update unconditionally clears hv_clock.flags field, so the notification never reaches the guest. Fix it by allowing PVCLOCK_GUEST_STOPPED to passthrough. Reviewed-by: Eric B Munson <emunson@mgebm.net> Reviewed-by: Amit Shah <amit.shah@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-13 16:10:45 -03:00
Gleb Natapov	64eb062029	KVM: correctly detect APIC SW state in kvm_apic_post_state_restore() For apic_set_spiv() to track APIC SW state correctly it needs to see previous and next values of the spurious vector register, but currently memset() overwrite the old value before apic_set_spiv() get a chance to do tracking. Fix it by calling apic_set_spiv() before overwriting old value. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-09 12:44:46 +03:00
Gleb Natapov	c48f14966c	KVM: inline kvm_apic_present() and kvm_lapic_enabled() Those functions are used during interrupt injection. When inlined they become nops on the fast path. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 19:00:45 +03:00
Gleb Natapov	54e9818f39	KVM: use jump label to optimize checking for in kernel local apic presence Usually all vcpus have local apic pointer initialized, so the check may be completely skipped. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 19:00:44 +03:00
Gleb Natapov	f8c1ea1039	KVM: use jump label to optimize checking for SW enabled apic in spurious interrupt register Usually all APICs are SW enabled so the check can be optimized out. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 19:00:43 +03:00
Gleb Natapov	c5cc421ba3	KVM: use jump label to optimize checking for HW enabled APIC in APIC_BASE MSR Usually all APICs are HW enabled so the check can be optimized out. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 19:00:43 +03:00
Gleb Natapov	6aed64a8a4	KVM: mark apic enabled on start up According to SDM apic is enabled on start up. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 16:20:14 +03:00
Gleb Natapov	5dbc8f3fed	KVM: use kvm_lapic_set_base() to change apic_base Do not change apic_base directly. Use kvm_lapic_set_base() instead. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 16:20:05 +03:00
Gleb Natapov	8a5a87d9b7	KVM: clean up kvm_(set\|get)_apic_base kvm_get_apic_base() needlessly checks irqchip_in_kernel although it does the same no matter what result of the check is. kvm_set_apic_base() also checks for irqchip_in_kernel, but kvm_lapic_set_base() can handle this case. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 16:20:03 +03:00
Xiao Guangrong	32cad84f44	KVM: do not release the error page After commit `a2766325cf`, the error page is replaced by the error code, it need not be released anymore [ The patch has been compiling tested for powerpc ] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 16:04:58 +03:00
Xiao Guangrong	cb9aaa30b1	KVM: do not release the error pfn After commit `a2766325cf`, the error pfn is replaced by the error code, it need not be released anymore [ The patch has been compiling tested for powerpc ] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 16:04:57 +03:00
Xiao Guangrong	e6c1502b3f	KVM: introduce KVM_PFN_ERR_HWPOISON Then, get_hwpoison_pfn and is_hwpoison_pfn can be removed Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 16:04:52 +03:00
Xiao Guangrong	6c8ee57be9	KVM: introduce KVM_PFN_ERR_FAULT After that, the exported and un-inline function, get_fault_pfn, can be removed Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 16:04:50 +03:00
Takuya Yoshikawa	d89cc617b9	KVM: Push rmap into kvm_arch_memory_slot Two reasons: - x86 can integrate rmap and rmap_pde and remove heuristics in __gfn_to_rmap(). - Some architectures do not need rmap. Since rmap is one of the most memory consuming stuff in KVM, ppc'd better restrict the allocation to Book3S HV. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 12:47:30 +03:00
Takuya Yoshikawa	65fbe37c42	KVM: MMU: Use gfn_to_rmap() instead of directly reading rmap array This helps to make rmap architecture specific in a later patch. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 12:47:04 +03:00
Takuya Yoshikawa	aab2eb7a38	KVM: Stop checking rmap to see if slot is being created Instead, check npages consistently. This helps to make rmap architecture specific in a later patch. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-06 12:47:02 +03:00
Gleb Natapov	439793d4b3	KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value When MSR_KVM_PV_EOI_EN was added to msrs_to_save array KVM_SAVE_MSRS_BEGIN was not updated accordingly. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-05 16:52:38 +03:00
Avi Kivity	fe56097b23	Merge remote-tracking branch 'upstream' into next - bring back critical fixes (esp. `aa67f6096c`) - provide an updated base for development * upstream: (4334 commits) missed mnt_drop_write() in do_dentry_open() UBIFS: nuke pdflush from comments gfs2: nuke pdflush from comments drbd: nuke pdflush from comments nilfs2: nuke write_super from comments hfs: nuke write_super from comments vfs: nuke pdflush from comments jbd/jbd2: nuke write_super from comments btrfs: nuke pdflush from comments btrfs: nuke write_super from comments ext4: nuke pdflush from comments ext4: nuke write_super from comments ext3: nuke write_super from comments Documentation: fix the VM knobs descritpion WRT pdflush Documentation: get rid of write_super vfs: kill write_super and sync_supers ACPI processor: Fix tick_broadcast_mask online/offline regression ACPI: Only count valid srat memory structures ACPI: Untangle a return statement for better readability Linux 3.6-rc1 ... Signed-off-by: Avi Kivity <avi@redhat.com>	2012-08-05 13:25:10 +03:00
Gleb Natapov	e115676e04	KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value When MSR_KVM_PV_EOI_EN was added to msrs_to_save array KVM_SAVE_MSRS_BEGIN was not updated accordingly. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-03 15:08:22 -03:00
Avi Kivity	aa67f6096c	KVM: VMX: Fix ds/es corruption on i386 with preemption Commit `b2da15ac26` ("KVM: VMX: Optimize %ds, %es reload") broke i386 in the following scenario: vcpu_load ... vmx_save_host_state vmx_vcpu_run (ds.rpl, es.rpl cleared by hardware) interrupt push ds, es # pushes bad ds, es schedule vmx_vcpu_put vmx_load_host_state reload ds, es (with __USER_DS) pop ds, es # of other thread's stack iret # other thread runs interrupt push ds, es schedule # back in vcpu thread pop ds, es # now with rpl=0 iret ... vcpu_put resume_userspace iret # clears ds, es due to mismatched rpl (instead of resume_userspace, we might return with SYSEXIT and then take an exception; when the exception IRETs we end up with cleared ds, es) Fix by avoiding the optimization on i386 and reloading ds, es on the lightweight exit path. Reported-by: Chris Clayron <chris2553@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-01 20:23:57 -03:00
Bruce Rogers	4b6486659a	KVM: x86: apply kvmclock offset to guest wall clock time When a guest migrates to a new host, the system time difference from the previous host is used in the updates to the kvmclock system time visible to the guest, resulting in a continuation of correct kvmclock based guest timekeeping. The wall clock component of the kvmclock provided time is currently not updated with this same time offset. Since the Linux guest caches the wall clock based time, this discrepency is not noticed until the guest is rebooted. After reboot the guest's time calculations are off. This patch adjusts the wall clock by the kvmclock_offset, resulting in correct guest time after a reboot. Cc: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Bruce Rogers <brogers@suse.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-01 17:23:50 -03:00
Avi Kivity	26ef19242f	KVM: fold kvm_pit_timer into kvm_kpit_state One structure nests inside the other, providing no value at all. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-01 00:21:07 -03:00
Avi Kivity	9d9d2239bd	KVM: Simplify kvm_pit_timer 'timer_mode_mask' is unused 'tscdeadline' is unused 't_ops' only adds needless indirection 'vcpu' is unused Remove. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-01 00:21:07 -03:00
Avi Kivity	2a6eac9638	KVM: Simplify kvm_timer 'reinject' is never initialized 't_ops' only serves as indirection to lapic_is_periodic; call that directly instead 'kvm' is never used 'vcpu' can be derived via container_of Remove these fields. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-01 00:21:06 -03:00
Avi Kivity	e9d90d472d	KVM: Remove internal timer abstraction kvm_timer_fn(), the sole inhabitant of timer.c, is only used by lapic.c. Move it there to make it easier to hack on it. struct kvm_timer is a thin wrapper around hrtimer, and only adds obfuscation. Move near its two users (with different names) to prepare for simplification. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-08-01 00:21:06 -03:00
Avi Kivity	4a4541a40e	KVM: Don't update PPR on any APIC read The current code will update the PPR on almost any APIC read; however that's only required if we read the PPR. kvm_update_ppr() shows up in some profiles, albeit with a low usage (~1%). This should reduce it further (it will still be called during interrupt processing). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-30 20:47:32 -03:00
Christoffer Dall	23d43cf998	KVM: Move KVM_IRQ_LINE to arch-generic code Handle KVM_IRQ_LINE and KVM_IRQ_LINE_STATUS in the generic kvm_vm_ioctl() function and call into kvm_vm_ioctl_irq_line(). This is even more relevant when KVM/ARM also uses this ioctl. Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-26 12:23:25 +03:00
Gleb Natapov	99245b507d	KVM: x86 emulator: drop unneeded call to get_segment() setup_syscalls_segments() calls get_segment() and than overwrites all but one of the structure fields and this one should also be overwritten anyway, so we can drop call to get_segment() and avoid a couple of vmreads on vmx. Also drop zeroing ss/cs structures since most of the fields are set anyway. Just set those that were not set explicitly. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-26 12:21:29 +03:00
Gleb Natapov	aea218f3cb	KVM: PIC: call ack notifiers for irqs that are dropped form irr After commit `242ec97c35` PIT interrupts are no longer delivered after PIC reset. It happens because PIT injects interrupt only if previous one was acked, but since on PIC reset it is dropped from irr it will never be delivered and hence acknowledged. Fix that by calling ack notifier on PIC reset. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-26 12:19:06 +03:00
Xiao Guangrong	f23b070e66	KVM: x86 emulator: simplify read_emulated No need split mmio read region into 8-bits pieces since we do it in emulator_read_write_onepage Changelog: Add a WARN_ON to check read-cache overflow Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-26 12:10:58 +03:00
Xiao Guangrong	3b2bd2f800	KVM: MMU: use kvm_release_pfn_clean to release pfn The current code depends on the fact that fault_page is the normal page, however, we will use the error code instead of these dummy pages in the later patch, so we use kvm_release_pfn_clean to release pfn which will release the error code properly Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-26 11:55:30 +03:00
Avi Kivity	e9bda6f6f9	Merge branch 'queue' into next Merge patches queued during the run-up to the merge window. * queue: (25 commits) KVM: Choose better candidate for directed yield KVM: Note down when cpu relax intercepted or pause loop exited KVM: Add config to support ple or cpu relax optimzation KVM: switch to symbolic name for irq_states size KVM: x86: Fix typos in pmu.c KVM: x86: Fix typos in lapic.c KVM: x86: Fix typos in cpuid.c KVM: x86: Fix typos in emulate.c KVM: x86: Fix typos in x86.c KVM: SVM: Fix typos KVM: VMX: Fix typos KVM: remove the unused parameter of gfn_to_pfn_memslot KVM: remove is_error_hpa KVM: make bad_pfn static to kvm_main.c KVM: using get_fault_pfn to get the fault pfn KVM: MMU: track the refcount when unmap the page KVM: x86: remove unnecessary mark_page_dirty KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range() KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp() KVM: MMU: Add memslot parameter to hva handlers ... Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-26 11:54:21 +03:00
Linus Torvalds	5fecc9d8f5	KVM updates for the 3.6 merge window -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.12 (GNU/Linux) iQIcBAABAgAGBQJQDRDNAAoJEI7yEDeUysxlkl8P/3C2AHx2webOU8sVzhfU6ONZ ZoGevwBjyZIeJEmiWVpFTTEew1l0PXtpyOocXGNUXIddVnhXTQOKr/Scj4uFbmx8 ROqgK8NSX9+xOGrBPCoN7SlJkmp+m6uYtwYkl2SGnsEVLWMKkc7J7oqmszCcTQvN UXMf7G47/Ul2NUSBdv4Yvizhl4kpvWxluiweDw3E/hIQKN0uyP7CY58qcAztw8nG csZBAnnuPFwIAWxHXW3eBBv4UP138HbNDqJ/dujjocM6GnOxmXJmcZ6b57gh+Y64 3+w9IR4qrRWnsErb/I8inKLJ1Jdcf7yV2FmxYqR4pIXay2Yzo1BsvFd6EB+JavUv pJpixrFiDDFoQyXlh4tGpsjpqdXNMLqyG4YpqzSZ46C8naVv9gKE7SXqlXnjyDlb Llx3hb9Fop8O5ykYEGHi+gIISAK5eETiQl4yw9RUBDpxydH4qJtqGIbLiDy8y9wi Xyi8PBlNl+biJFsK805lxURqTp/SJTC3+Zb7A7CzYEQm5xZw3W/CKZx1ZYBfpaa/ pWaP6tB7JwgLIVXi4HQayLWqMVwH0soZIn9yazpOEFv6qO8d5QH5RAxAW2VXE3n5 JDlrajar/lGIdiBVWfwTJLb86gv3QDZtIWoR9mZuLKeKWE/6PRLe7HQpG1pJovsm 2AsN5bS0BWq+aqPpZHa5 =pECD -----END PGP SIGNATURE----- Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Avi Kivity: "Highlights include - full big real mode emulation on pre-Westmere Intel hosts (can be disabled with emulate_invalid_guest_state=0) - relatively small ppc and s390 updates - PCID/INVPCID support in guests - EOI avoidance; 3.6 guests should perform better on 3.6 hosts on interrupt intensive workloads) - Lockless write faults during live migration - EPT accessed/dirty bits support for new Intel processors" Fix up conflicts in: - Documentation/virtual/kvm/api.txt: Stupid subchapter numbering, added next to each other. - arch/powerpc/kvm/booke_interrupts.S: PPC asm changes clashing with the KVM fixes - arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c: Duplicated commits through the kvm tree and the s390 tree, with subsequent edits in the KVM tree. * tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits) KVM: fix race with level interrupts x86, hyper: fix build with !CONFIG_KVM_GUEST Revert "apic: fix kvm build on UP without IOAPIC" KVM guest: switch to apic_set_eoi_write, apic_write apic: add apic_set_eoi_write for PV use KVM: VMX: Implement PCID/INVPCID for guests with EPT KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check KVM: PPC: Critical interrupt emulation support KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests KVM: PPC64: booke: Set interrupt computation mode for 64-bit host KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt KVM: PPC: bookehv64: Add support for std/ld emulation. booke: Added crit/mc exception handler for e500v2 booke/bookehv: Add host crit-watchdog exception support KVM: MMU: document mmu-lock and fast page fault KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint KVM: MMU: trace fast page fault KVM: MMU: fast path of handling guest page fault KVM: MMU: introduce SPTE_MMU_WRITEABLE bit KVM: MMU: fold tlb flush judgement into mmu_spte_update ...	2012-07-24 12:01:20 -07:00
Raghavendra K T	f2a7434731	KVM: Add config to support ple or cpu relax optimzation Suggested-by: Avi Kivity <avi@redhat.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> # on s390x Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-23 13:00:53 +03:00
Michael S. Tsirkin	93b6547e22	KVM: switch to symbolic name for irq_states size Use PIC_NUM_PINS instead of hard-coded 16 for pic pins. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 16:12:16 -03:00
Michael S. Tsirkin	1a577b7247	KVM: fix race with level interrupts When more than 1 source id is in use for the same GSI, we have the following race related to handling irq_states race: CPU 0 clears bit 0. CPU 0 read irq_state as 0. CPU 1 sets level to 1. CPU 1 calls kvm_ioapic_set_irq(1). CPU 0 calls kvm_ioapic_set_irq(0). Now ioapic thinks the level is 0 but irq_state is not 0. Fix by performing all irq_states bitmap handling under pic/ioapic lock. This also removes the need for atomics with irq_states handling. Reported-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 16:12:00 -03:00
Guo Chao	c7a7062fa0	KVM: x86: Fix typos in pmu.c Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 15:27:00 -03:00
Guo Chao	d5b0b5b196	KVM: x86: Fix typos in lapic.c Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 15:27:00 -03:00
Guo Chao	bbbda79510	KVM: x86: Fix typos in cpuid.c Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 15:27:00 -03:00
Guo Chao	fc0586807d	KVM: x86: Fix typos in emulate.c Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 15:26:59 -03:00
Guo Chao	4a9699807c	KVM: x86: Fix typos in x86.c Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 15:26:36 -03:00
Guo Chao	c5ec2e56d0	KVM: SVM: Fix typos Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 15:19:48 -03:00
Guo Chao	0fa0607147	KVM: VMX: Fix typos Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-20 15:19:04 -03:00
Xiao Guangrong	d566104853	KVM: remove the unused parameter of gfn_to_pfn_memslot The parameter, 'kvm', is not used in gfn_to_pfn_memslot, we can happily remove it Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-19 21:25:24 -03:00
Xiao Guangrong	903816fa4d	KVM: using get_fault_pfn to get the fault pfn Using get_fault_pfn to cleanup the code Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-19 21:15:25 -03:00
Xiao Guangrong	86fde74cf5	KVM: MMU: track the refcount when unmap the page It will trigger a WARN_ON if the page has been freed but it is still used in mmu, it can help us to detect mm bug early Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-19 21:09:10 -03:00
Xiao Guangrong	9d3c92af47	KVM: x86: remove unnecessary mark_page_dirty fix: [ 132.474633] 3.5.0-rc1+ #50 Not tainted [ 132.474634] ------------------------------- [ 132.474635] include/linux/kvm_host.h:369 suspicious rcu_dereference_check() usage! [ 132.474636] [ 132.474636] other info that might help us debug this: [ 132.474636] [ 132.474638] [ 132.474638] rcu_scheduler_active = 1, debug_locks = 1 [ 132.474640] 1 lock held by qemu-kvm/2832: [ 132.474657] #0: (&vcpu->mutex){+.+.+.}, at: [<ffffffffa01e1636>] vcpu_load+0x1e/0x91 [kvm] [ 132.474658] [ 132.474658] stack backtrace: [ 132.474660] Pid: 2832, comm: qemu-kvm Not tainted 3.5.0-rc1+ #50 [ 132.474661] Call Trace: [ 132.474665] [<ffffffff81092f40>] lockdep_rcu_suspicious+0xfc/0x105 [ 132.474675] [<ffffffffa01e0c85>] kvm_memslots+0x6d/0x75 [kvm] [ 132.474683] [<ffffffffa01e0ca1>] gfn_to_memslot+0x14/0x4c [kvm] [ 132.474693] [<ffffffffa01e3575>] mark_page_dirty+0x17/0x2a [kvm] [ 132.474706] [<ffffffffa01f21ea>] kvm_arch_vcpu_ioctl+0xbcf/0xc07 [kvm] Actually, we do not write vcpu->arch.time at this time, mark_page_dirty should be removed. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-19 21:00:20 -03:00
Takuya Yoshikawa	bcd3ef5828	KVM: MMU: Avoid handling same rmap_pde in kvm_handle_hva_range() When we invalidate a THP page, we call the handler with the same rmap_pde argument 512 times in the following loop: for each guest page in the range for each level unmap using rmap This patch avoids these extra handler calls by changing the loop order like this: for each level for each rmap in the range unmap using rmap With the preceding patches in the patch series, this made THP page invalidation more than 5 times faster on our x86 host: the host became more responsive during swapping the guest's memory as a result. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-18 16:55:04 -03:00
Takuya Yoshikawa	f395302e09	KVM: MMU: Push trace_kvm_age_page() into kvm_age_rmapp() This restricts the tracing to page aging and makes it possible to optimize kvm_handle_hva_range() further in the following patch. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-18 16:55:04 -03:00
Takuya Yoshikawa	048212d0bc	KVM: MMU: Add memslot parameter to hva handlers This is needed to push trace_kvm_age_page() into kvm_age_rmapp() in the following patch. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-18 16:55:04 -03:00
Takuya Yoshikawa	77d11309b3	KVM: Separate rmap_pde from kvm_lpage_info->write_count This makes it possible to loop over rmap_pde arrays in the same way as we do over rmap so that we can optimize kvm_handle_hva_range() easily in the following patch. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-18 16:55:04 -03:00
Takuya Yoshikawa	b3ae209697	KVM: Introduce kvm_unmap_hva_range() for kvm_mmu_notifier_invalidate_range_start() When we tested KVM under memory pressure, with THP enabled on the host, we noticed that MMU notifier took a long time to invalidate huge pages. Since the invalidation was done with mmu_lock held, it not only wasted the CPU but also made the host harder to respond. This patch mitigates this by using kvm_handle_hva_range(). Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Alexander Graf <agraf@suse.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-18 16:55:04 -03:00
Takuya Yoshikawa	84504ef386	KVM: MMU: Make kvm_handle_hva() handle range of addresses When guest's memory is backed by THP pages, MMU notifier needs to call kvm_unmap_hva(), which in turn leads to kvm_handle_hva(), in a loop to invalidate a range of pages which constitute one huge page: for each page for each memslot if page is in memslot unmap using rmap This means although every page in that range is expected to be found in the same memslot, we are forced to check unrelated memslots many times. If the guest has more memslots, the situation will become worse. Furthermore, if the range does not include any pages in the guest's memory, the loop over the pages will just consume extra time. This patch, together with the following patches, solves this problem by introducing kvm_handle_hva_range() which makes the loop look like this: for each memslot for each page in memslot unmap using rmap In this new processing, the actual work is converted to a loop over rmap which is much more cache friendly than before. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Alexander Graf <agraf@suse.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-18 16:55:04 -03:00
Takuya Yoshikawa	d19a748b1c	KVM: Introduce hva_to_gfn_memslot() for kvm_handle_hva() This restricts hva handling in mmu code and makes it easier to extend kvm_handle_hva() so that it can treat a range of addresses later in this patch series. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Alexander Graf <agraf@suse.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-18 16:55:04 -03:00
Takuya Yoshikawa	9594a49861	KVM: MMU: Use __gfn_to_rmap() to clean up kvm_handle_hva() We can treat every level uniformly. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-18 16:55:03 -03:00
Ingo Molnar	a2fe194723	Merge branch 'linus' into perf/core Pick up the latest ring-buffer fixes, before applying a new fix. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-07-18 11:17:17 +02:00
Mao, Junjie	ad756a1603	KVM: VMX: Implement PCID/INVPCID for guests with EPT This patch handles PCID/INVPCID for guests. Process-context identifiers (PCIDs) are a facility by which a logical processor may cache information for multiple linear-address spaces so that the processor may retain cached information when software switches to a different linear address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual Volume 3A for details. For guests with EPT, the PCID feature is enabled and INVPCID behaves as running natively. For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD. Signed-off-by: Junjie Mao <junjie.mao@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-12 13:07:34 +03:00
Xiao Guangrong	6fbc277053	KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint The P bit of page fault error code is missed in this tracepoint, fix it by passing the full error code Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:22 +03:00
Xiao Guangrong	a72faf2504	KVM: MMU: trace fast page fault To see what happen on this path and help us to optimize it Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:21 +03:00
Xiao Guangrong	c7ba5b48cc	KVM: MMU: fast path of handling guest page fault If the the present bit of page fault error code is set, it indicates the shadow page is populated on all levels, it means what we do is only modify the access bit which can be done out of mmu-lock Currently, in order to simplify the code, we only fix the page fault caused by write-protect on the fast path Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:20 +03:00
Xiao Guangrong	49fde3406f	KVM: MMU: introduce SPTE_MMU_WRITEABLE bit This bit indicates whether the spte can be writable on MMU, that means the corresponding gpte is writable and the corresponding gfn is not protected by shadow page protection In the later path, SPTE_MMU_WRITEABLE will indicates whether the spte can be locklessly updated Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:19 +03:00
Xiao Guangrong	6e7d035407	KVM: MMU: fold tlb flush judgement into mmu_spte_update mmu_spte_update() is the common function, we can easily audit the path Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:18 +03:00
Xiao Guangrong	4f5982a56a	KVM: VMX: export PFEC.P bit on ept Export the present bit of page fault error code, the later patch will use it Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:17 +03:00
Xiao Guangrong	8e22f955fb	KVM: MMU: cleanup spte_write_protect Use __drop_large_spte to cleanup this function and comment spte_write_protect Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:16 +03:00
Xiao Guangrong	d13bc5b5a1	KVM: MMU: abstract spte write-protect Introduce a common function to abstract spte write-protect to cleanup the code Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:14 +03:00
Xiao Guangrong	2f84569f97	KVM: MMU: return bool in __rmap_write_protect The reture value of __rmap_write_protect is either 1 or 0, use true/false instead of these Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-11 16:51:13 +03:00
Avi Kivity	a27685c33a	KVM: VMX: Emulate invalid guest state by default Our emulation should be complete enough that we can emulate guests while they are in big real mode, or in a mode transition that is not virtualizable without unrestricted guest support. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:05 +03:00
Avi Kivity	8089000616	KVM: x86 emulator: implement LTR Opcode 0F 00 /3. Encountered during Windows XP secondary processor bringup. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:05 +03:00
Avi Kivity	869be99c75	KVM: x86 emulator: make loading TR set the busy bit Guest software doesn't actually depend on it, but vmx will refuse us entry if we don't. Set the bit in both the cached segment and memory, just to be nice. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:05 +03:00
Avi Kivity	e919464b53	KVM: x86 emulator: make read_segment_descriptor() return the address Some operations want to modify the descriptor later on, so save the address for future use. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:04 +03:00
Avi Kivity	a14e579f22	KVM: x86 emulator: emulate LLDT Opcode 0F 00 /2. Used by isolinux durign the protected mode transition. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:04 +03:00
Avi Kivity	9299836e63	KVM: x86 emulator: emulate BSWAP Opcodes 0F C8 - 0F CF. Used by the SeaBIOS cdrom code (though not in big real mode). Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:04 +03:00
Avi Kivity	de5f70e0c6	KVM: VMX: Improve error reporting during invalid guest state emulation If instruction emulation fails, report it properly to userspace. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:04 +03:00
Avi Kivity	de87dcddc7	KVM: VMX: Stop invalid guest state emulation on pending event Process the event, possibly injecting an interrupt, before continuing. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:04 +03:00
Avi Kivity	612e89f015	KVM: x86 emulator: implement ENTER Opcode C8. Only ENTER with lexical nesting depth 0 is implemented, since others are very rare. We'll fail emulation if nonzero lexical depth is used so data is not corrupted. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:03 +03:00
Avi Kivity	51ddff50cb	KVM: x86 emulator: split push logic from push opcode emulation This allows us to reuse the code without populating ctxt->src and overriding ctxt->op_bytes. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:03 +03:00
Avi Kivity	361cad2b50	KVM: x86 emulator: fix byte-sized MOVZX/MOVSX Commit `2adb5ad9fe` removed ByteOp from MOVZX/MOVSX, replacing them by SrcMem8, but neglected to fix the dependency in the emulation code on ByteOp. This caused the instruction not to have any effect in some circumstances. Fix by replacing the check for ByteOp with the equivalent src.op_bytes == 1. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:03 +03:00
Avi Kivity	2dd7caa092	KVM: x86 emulator: emulate LAHF Opcode 9F. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:03 +03:00
Avi Kivity	7c068e4558	KVM: VMX: Continue emulating after batch exhausted If we return early from an invalid guest state emulation loop, make sure we return to it later if the guest state is still invalid. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:03 +03:00
Avi Kivity	bdea48e305	KVM: VMX: Fix interrupt exit condition during emulation Checking EFLAGS.IF is incorrect as we might be in interrupt shadow. If that is the case, the main loop will notice that and not inject the interrupt, causing an endless loop. Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt instead. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:02 +03:00
Avi Kivity	96051572c8	KVM: x86 emulator: emulate SGDT/SIDT Opcodes 0F 01 /0 and 0F 01 /1 Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:02 +03:00
Avi Kivity	a6e3407bb1	KVM: Fix SS default ESP/EBP based addressing We correctly default to SS when BP is used as a base in 16-bit address mode, but we don't do that for 32-bit mode. Fix by adjusting the default to SS when either ESP or EBP is used as the base register. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:02 +03:00
Avi Kivity	f47cfa3174	KVM: x86 emulator: emulate LEAVE Opcode c9; used by some variants of Windows during boot, in big real mode. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:01 +03:00
Avi Kivity	b8405c184b	KVM: VMX: Limit iterations with emulator_invalid_guest_state Otherwise, if the guest ends up looping, we never exit the srcu critical section, which causes synchronize_srcu() to hang. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:01 +03:00
Avi Kivity	f0495f9b99	KVM: VMX: Relax check on unusable segment Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment descriptors, causing us not to recognize them as unusable segments with emulate_invalid_guest_state=1. Relax the check by testing for segment not present (a non-present segment cannot be usable). Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:01 +03:00
Avi Kivity	510425ff33	KVM: x86 emulator: fix LIDT/LGDT in long mode The operand size for these instructions is 8 bytes in long mode, even without a REX prefix. Set it explicitly. Triggered while booting Linux with emulate_invalid_guest_state=1. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:01 +03:00
Avi Kivity	79d5b4c3cd	KVM: x86 emulator: allow loading null SS in long mode Null SS is valid in long mode; allow loading it. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:01 +03:00
Avi Kivity	6d6eede4a0	KVM: x86 emulator: emulate cpuid Opcode 0F A2. Used by Linux during the mode change trampoline while in a state that is not virtualizable on vmx without unrestricted_guest, so we need to emulate it is emulate_invalid_guest_state=1. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:00 +03:00
Avi Kivity	0017f93a27	KVM: x86 emulator: change ->get_cpuid() accessor to use the x86 semantics Instead of getting an exact leaf, follow the spec and fall back to the last main leaf instead. This lets us easily emulate the cpuid instruction in the emulator. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:00 +03:00
Avi Kivity	62046e5a86	KVM: Split cpuid register access from computation Introduce kvm_cpuid() to perform the leaf limit check and calculate register values, and let kvm_emulate_cpuid() just handle reading and writing the registers from/to the vcpu. This allows us to reuse kvm_cpuid() in a context where directly reading and writing registers is not desired. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:00 +03:00
Avi Kivity	d881e6f6cf	KVM: VMX: Return correct CPL during transition to protected mode In protected mode, the CPL is defined as the lower two bits of CS, as set by the last far jump. But during the transition to protected mode, there is no last far jump, so we need to return zero (the inherited real mode CPL). Fix by reading CPL from the cache during the transition. This isn't 100% correct since we don't set the CPL cache on a far jump, but since protected mode transition will always jump to a segment with RPL=0, it will always work. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:19:00 +03:00
Avi Kivity	e676505ac9	KVM: MMU: Force cr3 reload with two dimensional paging on mov cr3 emulation Currently the MMU's ->new_cr3() callback does nothing when guest paging is disabled or when two-dimentional paging (e.g. EPT on Intel) is active. This means that an emulated write to cr3 can be lost; kvm_set_cr3() will write vcpu-arch.cr3, but the GUEST_CR3 field in the VMCS will retain its old value and this is what the guest sees. This bug did not have any effect until now because: - with unrestricted guest, or with svm, we never emulate a mov cr3 instruction - without unrestricted guest, and with paging enabled, we also never emulate a mov cr3 instruction - without unrestricted guest, but with paging disabled, the guest's cr3 is ignored until the guest enables paging; at this point the value from arch.cr3 is loaded correctly my the mov cr0 instruction which turns on paging However, the patchset that enables big real mode causes us to emulate mov cr3 instructions in protected mode sometimes (when guest state is not virtualizable by vmx); this mov cr3 is effectively ignored and will crash the guest. The fix is to make nonpaging_new_cr3() call mmu_free_roots() to force a cr3 reload. This is awkward because now all the new_cr3 callbacks to the same thing, and because mmu_free_roots() is somewhat of an overkill; but fixing that is more complicated and will be done after this minimal fix. Observed in the Window XP 32-bit installer while bringing up secondary vcpus. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-07-09 14:18:59 +03:00
Ingo Molnar	35c2f48c66	Merge branch 'tip/perf/core' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core Pull tracing updates from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-07-06 11:12:17 +02:00
Robert Richter	15c7ad51ad	perf/x86: Rename Intel specific macros There are macros that are Intel specific and not x86 generic. Rename them into INTEL_*. This patch removes X86_PMC_IDX_GENERIC and does: $ sed -i -e 's/X86_PMC_MAX_/INTEL_PMC_MAX_/g' \ arch/x86/include/asm/kvm_host.h \ arch/x86/include/asm/perf_event.h \ arch/x86/kernel/cpu/perf_event.c \ arch/x86/kernel/cpu/perf_event_p4.c \ arch/x86/kvm/pmu.c $ sed -i -e 's/X86_PMC_IDX_FIXED/INTEL_PMC_IDX_FIXED/g' \ arch/x86/include/asm/perf_event.h \ arch/x86/kernel/cpu/perf_event.c \ arch/x86/kernel/cpu/perf_event_intel.c \ arch/x86/kernel/cpu/perf_event_intel_ds.c \ arch/x86/kvm/pmu.c $ sed -i -e 's/X86_PMC_MSK_/INTEL_PMC_MSK_/g' \ arch/x86/include/asm/perf_event.h \ arch/x86/kernel/cpu/perf_event.c Signed-off-by: Robert Richter <robert.richter@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1340217996-2254-2-git-send-email-robert.richter@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2012-07-05 21:19:39 +02:00
Xiao Guangrong	85b7059169	KVM: MMU: fix shrinking page from the empty mmu Fix: [ 3190.059226] BUG: unable to handle kernel NULL pointer dereference at (null) [ 3190.062224] IP: [<ffffffffa02aac66>] mmu_page_zap_pte+0x10/0xa7 [kvm] [ 3190.063760] PGD 104f50067 PUD 112bea067 PMD 0 [ 3190.065309] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 3190.066860] CPU 1 [ ...... ] [ 3190.109629] Call Trace: [ 3190.111342] [<ffffffffa02aada6>] kvm_mmu_prepare_zap_page+0xa9/0x1fc [kvm] [ 3190.113091] [<ffffffffa02ab2f5>] mmu_shrink+0x11f/0x1f3 [kvm] [ 3190.114844] [<ffffffffa02ab25d>] ? mmu_shrink+0x87/0x1f3 [kvm] [ 3190.116598] [<ffffffff81150c9d>] ? prune_super+0x142/0x154 [ 3190.118333] [<ffffffff8110a4f4>] ? shrink_slab+0x39/0x31e [ 3190.120043] [<ffffffff8110a687>] shrink_slab+0x1cc/0x31e [ 3190.121718] [<ffffffff8110ca1d>] do_try_to_free_pages This is caused by shrinking page from the empty mmu, although we have checked n_used_mmu_pages, it is useless since the check is out of mmu-lock Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-03 17:31:50 -03:00
Guo Chao	2106a54812	KVM: VMX: code clean for vmx_init() Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-07-03 14:55:30 -03:00
Namhyung Kim	b102f1d0f1	tracing/kvm: Use __print_hex() for kvm_emulate_insn tracepoint The kvm_emulate_insn tracepoint used __print_insn() for printing its instructions. However it makes the format of the event hard to parse as it reveals TP internals. Fortunately, kernel provides __print_hex for almost same purpose, we can use it instead of open coding it. The user-space can be changed to parse it later. That means raw kernel tracing will not be affected by this change: # cd /sys/kernel/debug/tracing/ # cat events/kvm/kvm_emulate_insn/format name: kvm_emulate_insn ID: 29 format: ... print fmt: "%x:%llx:%s (%s)%s", REC->csbase, REC->rip, __print_hex(REC->insn, REC->len), \ __print_symbolic(REC->flags, { 0, "real" }, { (1 << 0) \| (1 << 1), "vm16" }, \ { (1 << 0), "prot16" }, { (1 << 0) \| (1 << 2), "prot32" }, { (1 << 0) \| (1 << 3), "prot64" }), \ REC->failed ? " failed" : "" # echo 1 > events/kvm/kvm_emulate_insn/enable # cat trace # tracer: nop # # entries-in-buffer/entries-written: 2183/2183 #P:12 # # _-----=> irqs-off # / _----=> need-resched # \| / _---=> hardirq/softirq # \|\| / _--=> preempt-depth # \|\|\| / delay # TASK-PID CPU# \|\|\|\| TIMESTAMP FUNCTION # \| \| \| \|\|\|\| \| \| qemu-kvm-1782 [002] ...1 140.931636: kvm_emulate_insn: 0:c102fa25:89 10 (prot32) qemu-kvm-1781 [004] ...1 140.931637: kvm_emulate_insn: 0:c102fa25:89 10 (prot32) Link: http://lkml.kernel.org/n/tip-wfw6y3b9ugtey8snaow9nmg5@git.kernel.org Link: http://lkml.kernel.org/r/1340757701-10711-2-git-send-email-namhyung@kernel.org Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: kvm@vger.kernel.org Acked-by: Avi Kivity <avi@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>	2012-06-28 13:52:15 -04:00
Michael S. Tsirkin	ae7a2a3fb6	KVM: host side for eoi optimization Implementation of PV EOI using shared memory. This reduces the number of exits an interrupt causes as much as by half. The idea is simple: there's a bit, per APIC, in guest memory, that tells the guest that it does not need EOI. We set it before injecting an interrupt and clear before injecting a nested one. Guest tests it using a test and clear operation - this is necessary so that host can detect interrupt nesting - and if set, it can skip the EOI MSR. There's a new MSR to set the address of said register in guest memory. Otherwise not much changed: - Guest EOI is not required - Register is tested & ISR is automatically cleared on exit For testing results see description of previous patch 'kvm_para: guest side for eoi avoidance'. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-25 12:40:55 +03:00
Michael S. Tsirkin	d905c06935	KVM: rearrange injection cancelling code Each time we need to cancel injection we invoke same code (cancel_injection callback). Move it towards the end of function using the familiar goto on error pattern. Will make it easier to do more cleanups for PV EOI. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-25 12:40:50 +03:00
Michael S. Tsirkin	5cfb1d5a65	KVM: only sync when attention bits set Commit eb0dc6d0368072236dcd086d7fdc17fd3c4574d4 introduced apic attention bitmask but kvm still syncs lapic unconditionally. As that commit suggested and in anticipation of adding more attention bits, only sync lapic if(apic_attention). Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-25 12:40:40 +03:00
Michael S. Tsirkin	8680b94b0e	KVM: optimize ISR lookups We perform ISR lookups twice: during interrupt injection and on EOI. Typical workloads only have a single bit set there. So we can avoid ISR scans by 1. counting bits as we set/clear them in ISR 2. on set, caching the injected vector number 3. on clear, invalidating the cache The real purpose of this is enabling PV EOI which needs to quickly validate the vector. But non PV guests also benefit: with this patch, and without interrupt nesting, apic_find_highest_isr will always return immediately without scanning ISR. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-25 12:37:21 +03:00
Michael S. Tsirkin	5eadf916df	KVM: document lapic regs field Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-25 12:37:14 +03:00
Takuya Yoshikawa	9e40b67bf2	KVM: Use kvm_kvfree() to free memory allocated by kvm_kvzalloc() The following commit did not care about the error handling path: commit `c1a7b32a14` KVM: Avoid wasting pages for small lpage_info arrays If memory allocation fails, vfree() will be called with the address returned by kzalloc(). This patch fixes this issue. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-19 16:10:25 +03:00
Xudong Hao	00763e4113	KVM: x86: change PT_FIRST_AVAIL_BITS_SHIFT to avoid conflict with EPT Dirty bit EPT Dirty bit use bit 9 as Intel SDM definition, to avoid conflict, change PT_FIRST_AVAIL_BITS_SHIFT to 10. Signed-off-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-06-13 20:28:21 -03:00
Takuya Yoshikawa	80feb89a0a	KVM: MMU: Remove unused parameter from mmu_memory_cache_alloc() Size is not needed to return one from pre-allocated objects. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-06-11 22:46:47 -03:00
Michael S. Tsirkin	79f702a6d1	KVM: disable uninitialized var warning I see this in 3.5-rc1: arch/x86/kvm/mmu.c: In function ‘kvm_test_age_rmapp’: arch/x86/kvm/mmu.c:1271: warning: ‘iter.desc’ may be used uninitialized in this function The line in question was introduced by commit `1e3f42f03c` static int kvm_test_age_rmapp(struct kvm kvm, unsigned long rmapp, unsigned long data) { - u64 spte; + u64 sptep; + struct rmap_iterator iter; <- line 1271 int young = 0; /* The reason I think is that the compiler assumes that the rmap value could be 0, so static u64 rmap_get_first(unsigned long rmap, struct rmap_iterator iter) { if (!rmap) return NULL; if (!(rmap & 1)) { iter->desc = NULL; return (u64 )rmap; } iter->desc = (struct pte_list_desc )(rmap & ~1ul); iter->pos = 0; return iter->desc->sptes[iter->pos]; } will not initialize iter.desc, but the compiler isn't smart enough to see that for (sptep = rmap_get_first(rmapp, &iter); sptep; sptep = rmap_get_next(&iter)) { will immediately exit in this case. I checked by adding if (!rmapp) goto out; on top which is clearly equivalent but disables the warning. This patch uses uninitialized_var to disable the warning without increasing code size. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-06 15:26:12 +03:00
Christoffer Dall	a737f256bf	KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers Introduces a couple of print functions, which are essentially wrappers around standard printk functions, with a KVM: prefix. Functions introduced or modified are: - kvm_err(fmt, ...) - kvm_info(fmt, ...) - kvm_debug(fmt, ...) - kvm_pr_unimpl(fmt, ...) - pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...) Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-06 15:24:00 +03:00
Orit Wasserman	b246dd5df1	KVM: VMX: Fix KVM_SET_SREGS with big real mode segments For example migration between Westmere and Nehelem hosts, caught in big real mode. The code that fixes the segments for real mode guest was moved from enter_rmode to vmx_set_segments. enter_rmode calls vmx_set_segments for each segment. Signed-off-by: Orit Wasserman <owasserm@rehdat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-05 17:51:46 +03:00
Gleb Natapov	1952639665	KVM: MMU: do not iterate over all VMs in mmu_shrink() mmu_shrink() needlessly iterates over all VMs even though it will not attempt to free mmu pages from more than one on them. Fix that and also check used mmu pages count outside of VM lock to skip inactive VMs faster. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-05 17:46:43 +03:00
Xudong Hao	3f6d8c8a47	KVM: VMX: Use EPT Access bit in response to memory notifiers Signed-off-by: Haitao Shan <haitao.shan@intel.com> Signed-off-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-05 16:31:05 +03:00
Xudong Hao	b38f993478	KVM: VMX: Enable EPT A/D bits if supported by turning on relevant bit in EPTP In EPT page structure entry, Enable EPT A/D bits if processor supported. Signed-off-by: Haitao Shan <haitao.shan@intel.com> Signed-off-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-05 16:31:04 +03:00
Xudong Hao	83c3a33122	KVM: VMX: Add parameter to control A/D bits support, default is on Add kernel parameter to control A/D bits support, it's on by default. Signed-off-by: Haitao Shan <haitao.shan@intel.com> Signed-off-by: Xudong Hao <xudong.hao@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-05 16:31:03 +03:00
Takuya Yoshikawa	c1a7b32a14	KVM: Avoid wasting pages for small lpage_info arrays lpage_info is created for each large level even when the memory slot is not for RAM. This means that when we add one slot for a PCI device, we end up allocating at least KVM_NR_PAGE_SIZES - 1 pages by vmalloc(). To make things worse, there is an increasing number of devices which would result in more pages being wasted this way. This patch mitigates this problem by using kvm_kvzalloc(). Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-06-05 16:29:49 +03:00
Xiao Guangrong	c358666783	KVM: MMU: fix huge page adapted on non-PAE host The huge page size is 4M on non-PAE host, but 2M page size is used in transparent_hugepage_adjust(), so the page we get after adjust the mapping level is not the head page, the BUG_ON() will be triggered Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-05-28 17:41:15 +03:00
Linus Torvalds	07acfc2a93	Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM changes from Avi Kivity: "Changes include additional instruction emulation, page-crossing MMIO, faster dirty logging, preventing the watchdog from killing a stopped guest, module autoload, a new MSI ABI, and some minor optimizations and fixes. Outside x86 we have a small s390 and a very large ppc update. Regarding the new (for kvm) rebaseless workflow, some of the patches that were merged before we switch trees had to be rebased, while others are true pulls. In either case the signoffs should be correct now." Fix up trivial conflicts in Documentation/feature-removal-schedule.txt arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h. I suspect the kvm_para.h resolution ends up doing the "do I have cpuid" check effectively twice (it was done differently in two different commits), but better safe than sorry ;) * 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits) KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block KVM: s390: onereg for timer related registers KVM: s390: epoch difference and TOD programmable field KVM: s390: KVM_GET/SET_ONEREG for s390 KVM: s390: add capability indicating COW support KVM: Fix mmu_reload() clash with nested vmx event injection KVM: MMU: Don't use RCU for lockless shadow walking KVM: VMX: Optimize %ds, %es reload KVM: VMX: Fix %ds/%es clobber KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte() KVM: VMX: unlike vmcs on fail path KVM: PPC: Emulator: clean up SPR reads and writes KVM: PPC: Emulator: clean up instruction parsing kvm/powerpc: Add new ioctl to retreive server MMU infos kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler KVM: PPC: Book3S: Enable IRQs during exit handling KVM: PPC: Fix PR KVM on POWER7 bare metal KVM: PPC: Fix stbux emulation KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields ...	2012-05-24 16:17:30 -07:00
Avi Kivity	d8368af8b4	KVM: Fix mmu_reload() clash with nested vmx event injection Currently the inject_pending_event() call during guest entry happens after kvm_mmu_reload(). This is for historical reasons - we used to inject_pending_event() in atomic context, while kvm_mmu_reload() needs task context. A problem is that nested vmx can cause the mmu context to be reset, if event injection is intercepted and causes a #VMEXIT instead (the #VMEXIT resets CR0/CR3/CR4). If this happens, we end up with invalid root_hpa, and since kvm_mmu_reload() has already run, no one will fix it and we end up entering the guest this way. Fix by reordering event injection to be before kvm_mmu_reload(). Use ->cancel_injection() to undo if kvm_mmu_reload() fails. https://bugzilla.kernel.org/show_bug.cgi?id=42980 Reported-by: Luke-Jr <luke-jr+linuxbugs@utopios.org> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-05-16 18:09:26 -03:00
Avi Kivity	c142786c62	KVM: MMU: Don't use RCU for lockless shadow walking Using RCU for lockless shadow walking can increase the amount of memory in use by the system, since RCU grace periods are unpredictable. We also have an unconditional write to a shared variable (reader_counter), which isn't good for scaling. Replace that with a scheme similar to x86's get_user_pages_fast(): disable interrupts during lockless shadow walk to force the freer (kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the processor with interrupts enabled. We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent kvm_flush_remote_tlbs() from avoiding the IPI. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-05-16 16:08:28 -03:00
Avi Kivity	b2da15ac26	KVM: VMX: Optimize %ds, %es reload On x86_64, we can defer %ds and %es reload to the heavyweight context switch, since nothing in the lightweight paths uses the host %ds or %es (they are ignored by the processor). Furthermore we can avoid the load if the segments are null, by letting the hardware load the null segments for us. This is the expected case. On i386, we could avoid the reload entirely, since the entry.S paths take care of reload, except for the SYSEXIT path which leaves %ds and %es set to __USER_DS. So we set them to the same values as well. Saves about 70 cycles out of 1600 (around 4%; noisy measurements). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-05-16 16:03:19 -03:00
Avi Kivity	512d5649e8	KVM: VMX: Fix %ds/%es clobber The vmx exit code unconditionally restores %ds and %es to __USER_DS. This can override the user's values, since %ds and %es are not saved and restored in x86_64 syscalls. In practice, this isn't dangerous since nobody uses segment registers in long mode, least of all programs that use KVM. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-05-16 16:03:19 -03:00
Joerg Roedel	d54e4237bc	KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte() The instruction emulation for bsrw is broken in KVM because the code always uses bsr with 32 or 64 bit operand size for emulation. Fix that by using emulate_2op_SrcV_nobyte() macro to use guest operand size for emulation. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-05-14 11:32:38 +03:00
Xiao Guangrong	5f3fbc342f	KVM: VMX: unlike vmcs on fail path fix: [ 1529.577273] Call Trace: [ 1529.577289] [<ffffffffa060d58f>] kvm_arch_hardware_disable+0x13/0x30 [kvm] [ 1529.577302] [<ffffffffa05fa2d4>] hardware_disable_nolock+0x35/0x39 [kvm] [ 1529.577311] [<ffffffffa05fa29f>] ? cpumask_clear_cpu.constprop.31+0x13/0x13 [kvm] [ 1529.577315] [<ffffffff81096ba8>] on_each_cpu+0x44/0x84 [ 1529.577326] [<ffffffffa05f98b5>] hardware_disable_all_nolock+0x34/0x36 [kvm] [ 1529.577335] [<ffffffffa05f98e2>] hardware_disable_all+0x2b/0x39 [kvm] [ 1529.577349] [<ffffffffa05fafe5>] kvm_put_kvm+0xed/0x10f [kvm] [ 1529.577358] [<ffffffffa05fb3d7>] kvm_vm_release+0x22/0x28 [kvm] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-05-14 11:28:02 +03:00
Takuya Yoshikawa	9f4260e73a	KVM: x86 emulator: Avoid pushing back ModRM byte fetched for group decoding Although ModRM byte is fetched for group decoding, it is soon pushed back to make decode_modrm() fetch it later again. Now that ModRM flag can be found in the top level opcode tables, fetch ModRM byte before group decoding to make the code simpler. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-05-06 16:15:58 +03:00
Takuya Yoshikawa	1c2545be05	KVM: x86 emulator: Move ModRM flags for groups to top level opcode tables Needed for the following patch which simplifies ModRM fetching code. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-05-06 16:15:57 +03:00
Michael S. Tsirkin	57c22e5f35	KVM: fix cpuid eax for KVM leaf cpuid eax should return the max leaf so that guests can find out the valid range. This matches Xen et al. Update documentation to match. Tested with -cpu host. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-05-06 15:51:56 +03:00
Gleb Natapov	a4fa163531	KVM: ensure async PF event wakes up vcpu from halt If vcpu executes hlt instruction while async PF is waiting to be delivered vcpu can block and deliver async PF only after another even wakes it up. This happens because kvm_check_async_pf_completion() will remove completion event from vcpu->async_pf.done before entering kvm_vcpu_block() and this will make kvm_arch_vcpu_runnable() return false. The solution is to make vcpu runnable when processing completion. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-05-06 14:56:54 +03:00
Jan Kiszka	b6ddf05ff6	KVM: x86: Run PIT work in own kthread We can't run PIT IRQ injection work in the interrupt context of the host timer. This would allow the user to influence the handler complexity by asking for a broadcast to a large number of VCPUs. Therefore, this work was pushed into workqueue context in 9d244caf2e. However, this prevents prioritizing the PIT injection over other task as workqueues share kernel threads. This replaces the workqueue with a kthread worker and gives that thread a name in the format "kvm-pit/<owner-process-pid>". That allows to identify and adjust the kthread priority according to the VM process parameters. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-27 19:40:29 -03:00
Avi Kivity	38e8a2ddc9	KVM: x86 emulator: fix asm constraint in flush_pending_x87_faults 'bool' wants 8-bit registers. Reported-by: Takuya Yoshikawa <takuya.yoshikawa@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-24 17:16:17 +03:00
Gleb Natapov	4138377142	KVM: Introduce bitmask for apic attention reasons The patch introduces a bitmap that will hold reasons apic should be checked during vmexit. This is in a preparation for vp eoi patch that will add one more check on vmexit. With the bitmap we can do if(apic_attention) to check everything simultaneously which will add zero overhead on the fast path. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-24 16:36:18 +03:00
Jan Kiszka	07975ad3b3	KVM: Introduce direct MSI message injection for in-kernel irqchips Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated "on the fly" by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-24 15:59:47 +03:00
Al Viro	bfce281c28	kill mm argument of vm_munmap() it's always current->mm Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2012-04-21 01:58:20 -04:00
Linus Torvalds	6be5ceb02e	VM: add "vm_mmap()" helper function This continues the theme started with vm_brk() and vm_munmap(): vm_mmap() does the same thing as do_mmap(), but additionally does the required VM locking. This uninlines (and rewrites it to be clearer) do_mmap(), which sadly duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have to export our internal do_mmap_pgoff() function. Some day we hopefully don't have to export do_mmap() either, if all modular users can become the simpler vm_mmap() instead. We're actually very close to that already, with the notable exception of the (broken) use in i810, and a couple of stragglers in binfmt_elf. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-04-20 17:29:13 -07:00
Linus Torvalds	a46ef99d80	VM: add "vm_munmap()" helper function Like the vm_brk() function, this is the same as "do_munmap()", except it does the VM locking for the caller. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-04-20 17:29:13 -07:00
Avi Kivity	f78146b0f9	KVM: Fix page-crossing MMIO MMIO that are split across a page boundary are currently broken - the code does not expect to be aborted by the exit to userspace for the first MMIO fragment. This patch fixes the problem by generalizing the current code for handling 16-byte MMIOs to handle a number of "fragments", and changes the MMIO code to create those fragments. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-19 20:35:07 -03:00
Marcelo Tosatti	eac0556750	Merge branch 'linus' into queue Merge reason: development work has dependency on kvm patches merged upstream. Conflicts: Documentation/feature-removal-schedule.txt Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-19 17:06:26 -03:00
Avi Kivity	2225fd5604	KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context kvm_set_shared_msr() may not be called in preemptible context, but vmx_set_msr() does so: BUG: using smp_processor_id() in preemptible [00000000] code: qemu-kvm/22713 caller is kvm_set_shared_msr+0x32/0xa0 [kvm] Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39 Call Trace: [<ffffffff8131fa82>] debug_smp_processor_id+0xe2/0x100 [<ffffffffa0328ae2>] kvm_set_shared_msr+0x32/0xa0 [kvm] [<ffffffffa03a103b>] vmx_set_msr+0x28b/0x2d0 [kvm_intel] ... Making kvm_set_shared_msr() work in preemptible is cleaner, but it's used in the fast path. Making two variants is overkill, so this patch just disables preemption around the call. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-18 23:42:27 -03:00
Davidlohr Bueso	f71fa31f9f	KVM: MMU: use page table level macro Its much cleaner to use PT_PAGE_TABLE_LEVEL than its numeric value. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-18 23:35:01 -03:00
Michael S. Tsirkin	a0c9a822bf	KVM: dont clear TMR on EOI Intel spec says that TMR needs to be set/cleared when IRR is set, but kvm also clears it on EOI. I did some tests on a real (AMD based) system, and I see same TMR values both before and after EOI, so I think it's a minor bug in kvm. This patch fixes TMR to be set/cleared on IRR set only as per spec. And now that we don't clear TMR, we can save an atomic read of TMR on EOI that's not propagated to ioapic, by checking whether ioapic needs a specific vector first and calculating the mode afterwards. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-16 20:36:38 -03:00
Avi Kivity	e59717550e	KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f) Needed by some framebuffer drivers. See https://bugzilla.kernel.org/show_bug.cgi?id=42779 Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-16 20:36:16 -03:00
Avi Kivity	cbe2c9d30a	KVM: x86 emulator: MMX support General support for the MMX instruction set. Special care is taken to trap pending x87 exceptions so that they are properly reflected to the guest. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-16 20:36:16 -03:00
Avi Kivity	3e114eb4db	KVM: x86 emulator: implement movntps Used to write to framebuffers (by at least Icaros). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-16 20:36:16 -03:00
Stefan Hajnoczi	49597d8116	KVM: x86: emulate movdqa An Ubuntu 9.10 Karmic Koala guest is unable to boot or install due to missing movdqa emulation: kvm_exit: reason EXCEPTION_NMI rip 0x7fef3e025a7b info 7fef3e799000 80000b0e kvm_page_fault: address 7fef3e799000 error_code f kvm_emulate_insn: 0:7fef3e025a7b: 66 0f 7f 07 (prot64) movdqa %xmm0,(%rdi) [avi: mark it explicitly aligned] Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-16 20:36:15 -03:00
Avi Kivity	1c11b37669	KVM: x86 emulator: add support for vector alignment x86 defines three classes of vector instructions: explicitly aligned (#GP(0) if unaligned, explicitly unaligned, and default (which depends on the encoding: AVX is unaligned, SSE is aligned). Add support for marking an instruction as explicitly aligned or unaligned, and mark MOVDQU as unaligned. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-16 20:36:15 -03:00
Josh Triplett	ae75954457	KVM: SVM: Auto-load on CPUs with SVM Enable x86 feature-based autoloading for the kvm-amd module on CPUs with X86_FEATURE_SVM. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-04-16 20:35:04 -03:00
Gleb Natapov	f19a0c2c2e	KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on reset On reset all MPU counters should be enabled in GLOBAL_CTRL MSR. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-10 15:34:10 +03:00
Takuya Yoshikawa	1e3f42f03c	KVM: MMU: Improve iteration through sptes from rmap Iteration using rmap_next(), the actual body is pte_list_next(), is inefficient: every time we call it we start from checking whether rmap holds a single spte or points to a descriptor which links more sptes. In the case of shadow paging, this quadratic total iteration cost is a problem. Even for two dimensional paging, with EPT/NPT on, in which we almost always have a single mapping, the extra checks at the end of the iteration should be eliminated. This patch fixes this by introducing rmap_iterator which keeps the iteration context for the next search. Furthermore the implementation of rmap_next() is splitted into two functions, rmap_get_first() and rmap_get_next(), to avoid repeatedly checking whether the rmap being iterated on has only one spte. Although there seemed to be only a slight change for EPT/NPT, the actual improvement was significant: we observed that GET_DIRTY_LOG for 1GB dirty memory became 15% faster than before. This is probably because the new code is easy to make branch predictions. Note: we just remove pte_list_next() because we can think of parent_ptes as a reverse mapping. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 16:08:27 +03:00
Takuya Yoshikawa	220f773a00	KVM: MMU: Make pte_list_desc fit cache lines well We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20 bytes do not fit cache lines well. Furthermore, some allocators may use 64/32-byte objects for the pte_list_desc cache. This patch solves this problem by changing PTE_LIST_EXT from 4 to 3. For shadow paging, the new size is still large enough to hold both the kernel and process mappings for usual anonymous pages. For file mappings, there may be a slight change in the cache usage. Note: with EPT/NPT we almost always have a single spte in each reverse mapping and we will not see any change by this. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 16:08:25 +03:00
Davidlohr Bueso	c36fc04ef5	KVM: x86: add paging gcc optimization Since most guests will have paging enabled for memory management, add likely() optimization around CR0.PG checks. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 14:03:13 +03:00
Josh Triplett	e9bda3b3d0	KVM: VMX: Auto-load on CPUs with VMX Enable x86 feature-based autoloading for the kvm-intel module on CPUs with X86_FEATURE_VMX. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-By: Kay Sievers <kay@vrfy.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 14:03:12 +03:00
Takuya Yoshikawa	60c34612b7	KVM: Switch to srcu-less get_dirty_log() We have seen some problems of the current implementation of get_dirty_log() which uses synchronize_srcu_expedited() for updating dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms order of latency when we use VGA displays. Furthermore the recent discussion on the following thread "srcu: Implement call_srcu()" http://lkml.org/lkml/2012/1/31/211 also motivated us to implement get_dirty_log() without SRCU. This patch achieves this goal without sacrificing the performance of both VGA and live migration: in practice the new code is much faster than the old one unless we have too many dirty pages. Implementation: The key part of the implementation is the use of xchg() operation for clearing dirty bits atomically. Since this allows us to update only BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap until every dirty bit is cleared again for the next call. Although some people may worry about the problem of using the atomic memory instruction many times to the concurrently accessible bitmap, it is usually accessed with mmu_lock held and we rarely see concurrent accesses: so what we need to care about is the pure xchg() overheads. Another point to note is that we do not use for_each_set_bit() to check which ones in each BITS_PER_LONG pages are actually dirty. Instead we simply use __ffs() in a loop. This is much faster than repeatedly call find_next_bit(). Performance: The dirty-log-perf unit test showed nice improvements, some times faster than before, except for some extreme cases; for such cases the speed of getting dirty page information is much faster than we process it in the userspace. For real workloads, both VGA and live migration, we have observed pure improvements: when the guest was reading a file during live migration, we originally saw a few ms of latency, but with the new method the latency was less than 200us. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 12:50:00 +03:00
Takuya Yoshikawa	5dc99b2380	KVM: Avoid checking huge page mappings in get_dirty_log() Dropped such mappings when we enabled dirty logging and we will never create new ones until we stop the logging. For this we introduce a new function which can be used to write protect a range of PT level pages: although we do not need to care about a range of pages at this point, the following patch will need this feature to optimize the write protection of many pages. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 12:49:58 +03:00
Takuya Yoshikawa	a0ed46073c	KVM: MMU: Split the main body of rmap_write_protect() off from others We will use this in the following patch to implement another function which needs to write protect pages using the rmap information. Note that there is a small change in debug printing for large pages: we do not differentiate them from others to avoid duplicating code. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 12:49:56 +03:00
Eric B Munson	1c0b28c2a4	KVM: x86: Add ioctl for KVM_KVMCLOCK_CTRL Now that we have a flag that will tell the guest it was suspended, create an interface for that communication using a KVM ioctl. Signed-off-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 12:49:01 +03:00
Christoffer Dall	b6d33834bd	KVM: Factor out kvm_vcpu_kick to arch-generic code The kvm_vcpu_kick function performs roughly the same funcitonality on most all architectures, so we shouldn't have separate copies. PowerPC keeps a pointer to interchanging waitqueues on the vcpu_arch structure and to accomodate this special need a __KVM_HAVE_ARCH_VCPU_GET_WQ define and accompanying function kvm_arch_vcpu_wq have been defined. For all other architectures this is a generic inline that just returns &vcpu->wq; Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 12:47:47 +03:00
Jason Wang	675acb758a	KVM: SVM: count all irq windows exit Also count the exits of fast-path. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 12:47:01 +03:00
Liu, Jinsong	83c529151a	KVM: x86: expose Intel cpu new features (HLE, RTM) to guest Intel recently release 2 new features, HLE and RTM. Refer to http://software.intel.com/file/41417. This patch expose them to guest. Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-08 12:46:32 +03:00
Marcelo Tosatti	7a4f5ad051	KVM: VMX: vmx_set_cr0 expects kvm->srcu locked vmx_set_cr0 is called from vcpu run context, therefore it expects kvm->srcu to be held (for setting up the real-mode TSS). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-05 19:04:09 +03:00
Sasikantha babu	fea5295324	KVM: PMU: Fix integer constant is too large warning in kvm_pmu_set_msr() Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-04-05 19:04:08 +03:00
Linus Torvalds	2e7580b0e7	Merge branch 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Avi Kivity: "Changes include timekeeping improvements, support for assigning host PCI devices that share interrupt lines, s390 user-controlled guests, a large ppc update, and random fixes." This is with the sign-off's fixed, hopefully next merge window we won't have rebased commits. * 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits) KVM: Convert intx_mask_lock to spin lock KVM: x86: fix kvm_write_tsc() TSC matching thinko x86: kvmclock: abstract save/restore sched_clock_state KVM: nVMX: Fix erroneous exception bitmap check KVM: Ignore the writes to MSR_K7_HWCR(3) KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask KVM: PMU: add proper support for fixed counter 2 KVM: PMU: Fix raw event check KVM: PMU: warn when pin control is set in eventsel msr KVM: VMX: Fix delayed load of shared MSRs KVM: use correct tlbs dirty type in cmpxchg KVM: Allow host IRQ sharing for assigned PCI 2.3 devices KVM: Ensure all vcpus are consistent with in-kernel irqchip settings KVM: x86 emulator: Allow PM/VM86 switch during task switch KVM: SVM: Fix CPL updates KVM: x86 emulator: VM86 segments must have DPL 3 KVM: x86 emulator: Fix task switch privilege checks arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation KVM: mmu_notifier: Flush TLBs before releasing mmu_lock ...	2012-03-28 14:35:31 -07:00
Linus Torvalds	35cb8d9e18	Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86/fpu changes from Ingo Molnar. * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: i387: Split up <asm/i387.h> into exported and internal interfaces i387: Uninline the generic FP helpers that we expose to kernel modules	2012-03-22 09:41:22 -07:00
Linus Torvalds	9f3938346a	Merge branch 'kmap_atomic' of git://github.com/congwang/linux Pull kmap_atomic cleanup from Cong Wang. It's been in -next for a long time, and it gets rid of the (no longer used) second argument to k[un]map_atomic(). Fix up a few trivial conflicts in various drivers, and do an "evil merge" to catch some new uses that have come in since Cong's tree. * 'kmap_atomic' of git://github.com/congwang/linux: (59 commits) feature-removal-schedule.txt: schedule the deprecated form of kmap_atomic() for removal highmem: kill all __kmap_atomic() [swarren@nvidia.com: highmem: Fix ARM build break due to __kmap_atomic rename] drbd: remove the second argument of k[un]map_atomic() zcache: remove the second argument of k[un]map_atomic() gma500: remove the second argument of k[un]map_atomic() dm: remove the second argument of k[un]map_atomic() tomoyo: remove the second argument of k[un]map_atomic() sunrpc: remove the second argument of k[un]map_atomic() rds: remove the second argument of k[un]map_atomic() net: remove the second argument of k[un]map_atomic() mm: remove the second argument of k[un]map_atomic() lib: remove the second argument of k[un]map_atomic() power: remove the second argument of k[un]map_atomic() kdb: remove the second argument of k[un]map_atomic() udf: remove the second argument of k[un]map_atomic() ubifs: remove the second argument of k[un]map_atomic() squashfs: remove the second argument of k[un]map_atomic() reiserfs: remove the second argument of k[un]map_atomic() ocfs2: remove the second argument of k[un]map_atomic() ntfs: remove the second argument of k[un]map_atomic() ...	2012-03-21 09:40:26 -07:00
Cong Wang	8fd75e1216	x86: remove the second argument of k[un]map_atomic() Acked-by: Avi Kivity <avi@redhat.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Cong Wang <amwang@redhat.com>	2012-03-20 21:48:15 +08:00
Marcelo Tosatti	02626b6af5	KVM: x86: fix kvm_write_tsc() TSC matching thinko kvm_write_tsc() converts from guest TSC to microseconds, not nanoseconds as intended. The result is that the window for matching is 1000 seconds, not 1 second. Microsecond precision is enough for checking whether the TSC write delta is within the heuristic values, so use it instead of nanoseconds. Noted by Avi Kivity. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-20 12:40:36 +02:00
Nadav Har'El	9587190107	KVM: nVMX: Fix erroneous exception bitmap check The code which checks whether to inject a pagefault to L1 or L2 (in nested VMX) was wrong, incorrect in how it checked the PF_VECTOR bit. Thanks to Dan Carpenter for spotting this. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:14:23 +02:00
Nicolae Mogoreanu	a223c313cb	KVM: Ignore the writes to MSR_K7_HWCR(3) When CPUID Fn8000_0001_EAX reports 0x00100f22 Windows 7 x64 guest tries to set bit 3 in MSRC001_0015 in nt!KiDisableCacheErrataSource and fails. This patch will ignore this step and allow things to move on without having to fake CPUID value. Signed-off-by: Nicolae Mogoreanu <mogoreanu@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:14:14 +02:00
Davidlohr Bueso	4d6931c380	KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask The reset_rsvds_bits_mask() function can use the guest walker's root level number instead of using a separate 'level' variable. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:13:54 +02:00
Gleb Natapov	62079d8a43	KVM: PMU: add proper support for fixed counter 2 Currently pmu emulation emulates fixed counter 2 as bus cycles architectural counter, but since commit `9c1497ea59` perf has pseudo encoding for it. Use it. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:13:34 +02:00
Gleb Natapov	fac3368310	KVM: PMU: Fix raw event check If eventsel has EDGE, INV or CMASK set we should create raw counter for it, but the check is done on a wrong variable. Fix it. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:13:26 +02:00
Gleb Natapov	a7b9d2ccc3	KVM: PMU: warn when pin control is set in eventsel msr Print warning once if pin control bit is set in eventsel msr since emulation does not support it yet. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:13:18 +02:00
Avi Kivity	9ee73970c0	KVM: VMX: Fix delayed load of shared MSRs Shared MSRs (MSR_*STAR and related) are stored in both vmx->guest_msrs and in the CPU registers, but vmx_set_msr() only updated memory. Prior to `46199f33c2`, this didn't matter, since we called vmx_load_host_state(), which scheduled a vmx_save_host_state(), which re-synchronized the CPU state, but now we don't, so the CPU state will not be synchronized until the next exit to host userspace. This mostly affects nested vmx workloads, which play with these MSRs a lot. Fix by loading the MSR eagerly. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:11:55 +02:00
Jan Kiszka	07700a94b0	KVM: Allow host IRQ sharing for assigned PCI 2.3 devices PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share legacy IRQs of such devices with other host devices when passing them to a guest. The new IRQ sharing feature introduced here is optional, user space has to request it explicitly. Moreover, user space can inform us about its view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt and signaling it if the guest masked it via the virtualized PCI config space. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Acked-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:11:36 +02:00
Avi Kivity	3e515705a1	KVM: Ensure all vcpus are consistent with in-kernel irqchip settings If some vcpus are created before KVM_CREATE_IRQCHIP, then irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading to potential NULL pointer dereferences. Fix by: - ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called - ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP This is somewhat long winded because vcpu->arch.apic is created without kvm->lock held. Based on earlier patch by Michael Ellerman. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:30 +02:00
Kevin Wolf	4cee4798a3	KVM: x86 emulator: Allow PM/VM86 switch during task switch Task switches can switch between Protected Mode and VM86. The current mode must be updated during the task switch emulation so that the new segment selectors are interpreted correctly. In order to let privilege checks succeed, rflags needs to be updated in the vcpu struct as this causes a CPL update. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:29 +02:00
Kevin Wolf	ea5e97e8bf	KVM: SVM: Fix CPL updates Keep CPL at 0 in real mode and at 3 in VM86. In protected/long mode, use RPL rather than DPL of the code segment. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:28 +02:00
Kevin Wolf	66b0ab8fac	KVM: x86 emulator: VM86 segments must have DPL 3 Setting the segment DPL to 0 for at least the VM86 code segment makes the VM entry fail on VMX. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:27 +02:00
Kevin Wolf	7f3d35fddd	KVM: x86 emulator: Fix task switch privilege checks Currently, all task switches check privileges against the DPL of the TSS. This is only correct for jmp/call to a TSS. If a task gate is used, the DPL of this take gate is used for the check instead. Exceptions, external interrupts and iret shouldn't perform any check. [avi: kill kvm-kmod remnants] Signed-off-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:26 +02:00
Gleb Natapov	270c6c79f4	KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:24 +02:00
Takuya Yoshikawa	db3fe4eb45	KVM: Introduce kvm_memory_slot::arch and move lpage_info into it Some members of kvm_memory_slot are not used by every architecture. This patch is the first step to make this difference clear by introducing kvm_memory_slot::arch; lpage_info is moved into it. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:22 +02:00
Takuya Yoshikawa	fb03cb6f44	KVM: Introduce gfn_to_index() which returns the index for a given level This patch cleans up the code and removes the "(void)level;" warning suppressor. Note that we can also use this for PT_PAGE_TABLE_LEVEL to treat every level uniformly later. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:19 +02:00
Takuya Yoshikawa	6dbf79e716	KVM: Fix write protection race during dirty logging This patch fixes a race introduced by: commit `95d4c16ce7` KVM: Optimize dirty logging by rmap_write_protect() During protecting pages for dirty logging, other threads may also try to protect a page in mmu_sync_children() or kvm_mmu_get_page(). In such a case, because get_dirty_log releases mmu_lock before flushing TLB's, the following race condition can happen: A (get_dirty_log) B (another thread) lock(mmu_lock) clear pte.w unlock(mmu_lock) lock(mmu_lock) pte.w is already cleared unlock(mmu_lock) skip TLB flush return ... TLB flush Though thread B assumes the page has already been protected when it returns, the remaining TLB entry will break that assumption. This patch fixes this problem by making get_dirty_log hold the mmu_lock until it flushes the TLB's. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:12 +02:00
Raghavendra K T	10166744b8	KVM: VMX: remove yield_on_hlt yield_on_hlt was introduced for CPU bandwidth capping. Now it is redundant with CFS hardlimit. yield_on_hlt also complicates the scenario in paravirtual environment, that needs to trap halt. for e.g. paravirtualized ticket spinlocks. Acked-by: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:11 +02:00
Zachary Amsden	e26101b116	KVM: Track TSC synchronization in generations This allows us to track the original nanosecond and counter values at each phase of TSC writing by the guest. This gets us perfect offset matching for stable TSC systems, and perfect software computed TSC matching for machines with unstable TSC. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:09 +02:00
Zachary Amsden	0dd6a6edb0	KVM: Dont mark TSC unstable due to S4 suspend During a host suspend, TSC may go backwards, which KVM interprets as an unstable TSC. Technically, KVM should not be marking the TSC unstable, which causes the TSC clocksource to go bad, but we need to be adjusting the TSC offsets in such a case. Dealing with this issue is a little tricky as the only place we can reliably do it is before much of the timekeeping infrastructure is up and running. On top of this, we are not in a KVM thread context, so we may not be able to safely access VCPU fields. Instead, we compute our best known hardware offset at power-up and stash it to be applied to all VCPUs when they actually start running. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:08 +02:00
Marcelo Tosatti	f1e2b26003	KVM: Allow adjust_tsc_offset to be in host or guest cycles Redefine the API to take a parameter indicating whether an adjustment is in host or guest cycles. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:07 +02:00
Zachary Amsden	6f526ec538	KVM: Add last_host_tsc tracking back to KVM The variable last_host_tsc was removed from upstream code. I am adding it back for two reasons. First, it is unnecessary to use guest TSC computation to conclude information about the host TSC. The guest may set the TSC backwards (this case handled by the previous patch), but the computation of guest TSC (and fetching an MSR) is significanlty more work and complexity than simply reading the hardware counter. In addition, we don't actually need the guest TSC for any part of the computation, by always recomputing the offset, we can eliminate the need to deal with the current offset and any scaling factors that may apply. The second reason is that later on, we are going to be using the host TSC value to restore TSC offsets after a host S4 suspend, so we need to be reading the host values, not the guest values here. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:06 +02:00
Zachary Amsden	b183aa580a	KVM: Fix last_guest_tsc / tsc_offset semantics The variable last_guest_tsc was being used as an ad-hoc indicator that guest TSC has been initialized and recorded correctly. However, it may not have been, it could be that guest TSC has been set to some large value, the back to a small value (by, say, a software reboot). This defeats the logic and causes KVM to falsely assume that the guest TSC has gone backwards, marking the host TSC unstable, which is undesirable behavior. In addition, rather than try to compute an offset adjustment for the TSC on unstable platforms, just recompute the whole offset. This allows us to get rid of one callsite for adjust_tsc_offset, which is problematic because the units it takes are in guest units, but here, the computation was originally being done in host units. Doing this, and also recording last_guest_tsc when the TSC is written allow us to remove the tricky logic which depended on last_guest_tsc being zero to indicate a reset of uninitialized value. Instead, we now have the guarantee that the guest TSC offset is always at least something which will get us last_guest_tsc. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:05 +02:00
Zachary Amsden	4dd7980b21	KVM: Leave TSC synchronization window open with each new sync Currently, when the TSC is written by the guest, the variable ns is updated to force the current write to appear to have taken place at the time of the first write in this sync phase. This leaves a cliff at the end of the match window where updates will fall of the end. There are two scenarios where this can be a problem in practe - first, on a system with a large number of VCPUs, the sync period may last for an extended period of time. The second way this can happen is if the VM reboots very rapidly and we catch a VCPU TSC synchronization just around the edge. We may be unaware of the reboot, and thus the first VCPU might synchronize with an old set of the timer (at, say 0.97 seconds ago, when first powered on). The second VCPU can come in 0.04 seconds later to try to synchronize, but it misses the window because it is just over the threshold. Instead, stop doing this artificial setback of the ns variable and just update it with every write of the TSC. It may be observed that doing so causes values computed by compute_guest_tsc to diverge slightly across CPUs - note that the last_tsc_ns and last_tsc_write variable are used here, and now they last_tsc_ns will be different for each VCPU, reflecting the actual time of the update. However, compute_guest_tsc is used only for guests which already have TSC stability issues, and further, note that the previous patch has caused last_tsc_write to be incremented by the difference in nanoseconds, converted back into guest cycles. As such, only boundary rounding errors should be visible, which given the resolution in nanoseconds, is going to only be a few cycles and only visible in cross-CPU consistency tests. The problem can be fixed by adding a new set of variables to track the start offset and start write value for the current sync cycle. Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:04 +02:00
Zachary Amsden	5d3cb0f6a8	KVM: Improve TSC offset matching There are a few improvements that can be made to the TSC offset matching code. First, we don't need to call the 128-bit multiply (especially on a constant number), the code works much nicer to do computation in nanosecond units. Second, the way everything is setup with software TSC rate scaling, we currently have per-cpu rates. Obviously this isn't too desirable to use in practice, but if for some reason we do change the rate of all VCPUs at runtime, then reset the TSCs, we will only want to match offsets for VCPUs running at the same rate. Finally, for the case where we have an unstable host TSC, but rate scaling is being done in hardware, we should call the platform code to compute the TSC offset, so the math is reorganized to recompute the base instead, then transform the base into an offset using the existing API. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> KVM: Fix 64-bit division in kvm_write_tsc() Breaks i386 build. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:10:03 +02:00
Zachary Amsden	cc578287e3	KVM: Infrastructure for software and hardware based TSC rate scaling This requires some restructuring; rather than use 'virtual_tsc_khz' to indicate whether hardware rate scaling is in effect, we consider each VCPU to always have a virtual TSC rate. Instead, there is new logic above the vendor-specific hardware scaling that decides whether it is even necessary to use and updates all rate variables used by common code. This means we can simply query the virtual rate at any point, which is needed for software rate scaling. There is also now a threshold added to the TSC rate scaling; minor differences and variations of measured TSC rate can accidentally provoke rate scaling to be used when it is not needed. Instead, we have a tolerance variable called tsc_tolerance_ppm, which is the maximum variation from user requested rate at which scaling will be used. The default is 250ppm, which is the half the threshold for NTP adjustment, allowing for some hardware variation. In the event that hardware rate scaling is not available, we can kludge a bit by forcing TSC catchup to turn on when a faster than hardware speed has been requested, but there is nothing available yet for the reverse case; this requires a trap and emulate software implementation for RDTSC, which is still forthcoming. [avi: fix 64-bit division on i386] Signed-off-by: Zachary Amsden <zamsden@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-08 14:09:35 +02:00
Gleb Natapov	242ec97c35	KVM: x86: reset edge sense circuit of i8259 on init The spec says that during initialization "The edge sense circuit is reset which means that following initialization an interrupt request (IR) input must make a low-to-high transition to generate an interrupt", but currently if edge triggered interrupt is in IRR it is delivered after i8259 initialization. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:57:30 +02:00
Avi Kivity	1a18a69b76	KVM: x86 emulator: reject SYSENTER in compatibility mode on AMD guests If the guest thinks it's an AMD, it will not have prepared the SYSENTER MSRs, and if the guest executes SYSENTER in compatibility mode, it will fails. Detect this condition and #UD instead, like the spec says. Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:57:20 +02:00
Julian Stecklina	a52315e1d5	KVM: Don't mistreat edge-triggered INIT IPI as INIT de-assert. (LAPIC) If the guest programs an IPI with level=0 (de-assert) and trig_mode=0 (edge), it is erroneously treated as INIT de-assert and ignored, but to quote the spec: "For this delivery mode [INIT de-assert], the level flag must be set to 0 and trigger mode flag to 1." Signed-off-by: Julian Stecklina <js@alien8.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:43 +02:00
Davidlohr Bueso	e2358851ef	KVM: SVM: comment nested paging and virtualization module parameters Also use true instead of 1 for enabling by default. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:43 +02:00
Takuya Yoshikawa	e4b35cc960	KVM: MMU: Remove unused kvm parameter from rmap_next() Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:43 +02:00
Takuya Yoshikawa	9373e2c057	KVM: MMU: Remove unused kvm parameter from __gfn_to_rmap() Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:42 +02:00
Avi Kivity	2adb5ad9fe	KVM: x86 emulator: Remove byte-sized MOVSX/MOVZX hack Currently we treat MOVSX/MOVZX with a byte source as a byte instruction, and change the destination operand size with a hack. Change it to be a word instruction, so the destination receives its natural size, and change the source to be SrcMem8. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-03-05 14:52:42 +02:00
Avi Kivity	28867cee75	KVM: x86 emulator: add 8-bit memory operands Useful for MOVSX/MOVZX. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-03-05 14:52:42 +02:00
Boris Ostrovsky	2b036c6b86	KVM: SVM: Add support for AMD's OSVW feature in guests In some cases guests should not provide workarounds for errata even when the physical processor is affected. For example, because of erratum 400 on family 10h processors a Linux guest will read an MSR (resulting in VMEXIT) before going to idle in order to avoid getting stuck in a non-C0 state. This is not necessary: HLT and IO instructions are intercepted and therefore there is no reason for erratum 400 workaround in the guest. This patch allows us to present a guest with certain errata as fixed, regardless of the state of actual hardware. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:21 +02:00
Davidlohr Bueso	4a58ae614a	KVM: MMU: unnecessary NX state assignment We can remove the first ->nx state assignment since it is assigned afterwards anyways. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:21 +02:00
Carsten Otte	5b1c1493af	KVM: s390: ucontrol: export SIE control block to user This patch exports the s390 SIE hardware control block to userspace via the mapping of the vcpu file descriptor. In order to do so, a new arch callback named kvm_arch_vcpu_fault is introduced for all architectures. It allows to map architecture specific pages. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:19 +02:00
Carsten Otte	e08b963716	KVM: s390: add parameter for KVM_CREATE_VM This patch introduces a new config option for user controlled kernel virtual machines. It introduces a parameter to KVM_CREATE_VM that allows to set bits that alter the capabilities of the newly created virtual machine. The parameter is passed to kvm_arch_init_vm for all architectures. The only valid modifier bit for now is KVM_VM_S390_UCONTROL. This requires CAP_SYS_ADMIN privileges and creates a user controlled virtual machine on s390 architectures. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:18 +02:00
Xiao Guangrong	a138fe7535	KVM: MMU: remove the redundant get_written_sptes get_written_sptes is called twice in kvm_mmu_pte_write, one of them can be removed Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:18 +02:00
Takuya Yoshikawa	6addd1aa2c	KVM: MMU: Add missing large page accounting to drop_large_spte() Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:18 +02:00
Takuya Yoshikawa	37178b8bf0	KVM: MMU: Remove for_each_unsync_children() macro There is only one user of it and for_each_set_bit() does the same. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2012-03-05 14:52:17 +02:00
Ingo Molnar	737f24bda7	Merge branch 'perf/urgent' into perf/core Conflicts: tools/perf/builtin-record.c tools/perf/builtin-top.c tools/perf/perf.h tools/perf/util/top.h Merge reason: resolve these cherry-picking conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-05 09:20:08 +01:00
Joerg Roedel	1018faa6cf	perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM disabled It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Avi Kivity <avi@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Robert Richter <robert.richter@amd.com> Cc: stable@vger.kernel.org # v3.2 Link: http://lkml.kernel.org/r/1330523852-19566-1-git-send-email-joerg.roedel@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-03-02 12:16:39 +01:00
Ingo Molnar	c5905afb0e	static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc\|dec]() So here's a boot tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels. Typical usage scenarios: #include <linux/static_key.h> struct static_key key = STATIC_KEY_INIT_TRUE; if (static_key_false(&key)) do unlikely code else do likely code Or: if (static_key_true(&key)) do likely code else do unlikely code The static key is modified via: static_key_slow_inc(&key); ... static_key_slow_dec(&key); The 'slow' prefix makes it abundantly clear that this is an expensive operation. I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed through the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit. On non-jump-label enabled architectures static keys default to likely()/unlikely() branches. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: a.p.zijlstra@chello.nl Cc: mathieu.desnoyers@efficios.com Cc: davem@davemloft.net Cc: ddaney.cavm@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu Signed-off-by: Ingo Molnar <mingo@elte.hu>	2012-02-24 10:05:59 +01:00
Linus Torvalds	1361b83a13	i387: Split up <asm/i387.h> into exported and internal interfaces While various modules include <asm/i387.h> to get access to things we actually intend for them to use, most of that header file was really pretty low-level internal stuff that we really don't want to expose to others. So split the header file into two: the small exported interfaces remain in <asm/i387.h>, while the internal definitions that are only used by core architecture code are now in <asm/fpu-internal.h>. The guiding principle for this was to expose functions that we export to modules, and leave them in <asm/i387.h>, while stuff that is used by task switching or was marked GPL-only is in <asm/fpu-internal.h>. The fpu-internal.h file could be further split up too, especially since arch/x86/kvm/ uses some of the remaining stuff for its module. But that kvm usage should probably be abstracted out a bit, and at least now the internal FPU accessor functions are much more contained. Even if it isn't perhaps as contained as it _could_ be. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>	2012-02-21 14:12:54 -08:00
Linus Torvalds	f94edacf99	i387: move TS_USEDFPU flag from thread_info to task_struct This moves the bit that indicates whether a thread has ownership of the FPU from the TS_USEDFPU bit in thread_info->status to a word of its own (called 'has_fpu') in task_struct->thread.has_fpu. This fixes two independent bugs at the same time: - changing 'thread_info->status' from the scheduler causes nasty problems for the other users of that variable, since it is defined to be thread-synchronous (that's what the "TS_" part of the naming was supposed to indicate). So perfectly valid code could (and did) do ti->status \|= TS_RESTORE_SIGMASK; and the compiler was free to do that as separate load, or and store instructions. Which can cause problems with preemption, since a task switch could happen in between, and change the TS_USEDFPU bit. The change to TS_USEDFPU would be overwritten by the final store. In practice, this seldom happened, though, because the 'status' field was seldom used more than once, so gcc would generally tend to generate code that used a read-modify-write instruction and thus happened to avoid this problem - RMW instructions are naturally low fat and preemption-safe. - On x86-32, the current_thread_info() pointer would, during interrupts and softirqs, point to a copy of the real thread_info, because x86-32 uses %esp to calculate the thread_info address, and thus the separate irq (and softirq) stacks would cause these kinds of odd thread_info copy aliases. This is normally not a problem, since interrupts aren't supposed to look at thread information anyway (what thread is running at interrupt time really isn't very well-defined), but it confused the heck out of irq_fpu_usable() and the code that tried to squirrel away the FPU state. (It also caused untold confusion for us poor kernel developers). It also turns out that using 'task_struct' is actually much more natural for most of the call sites that care about the FPU state, since they tend to work with the task struct for other reasons anyway (ie scheduling). And the FPU data that we are going to save/restore is found there too. Thanks to Arjan Van De Ven <arjan@linux.intel.com> for pointing us to the %esp issue. Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-and-tested-by: Raphael Prevost <raphael@buro.asia> Acked-and-tested-by: Suresh Siddha <suresh.b.siddha@intel.com> Tested-by: Peter Anvin <hpa@zytor.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-02-18 10:19:41 -08:00
Linus Torvalds	6d59d7a9f5	i387: don't ever touch TS_USEDFPU directly, use helper functions This creates three helper functions that do the TS_USEDFPU accesses, and makes everybody that used to do it by hand use those helpers instead. In addition, there's a couple of helper functions for the "change both CR0.TS and TS_USEDFPU at the same time" case, and the places that do that together have been changed to use those. That means that we have fewer random places that open-code this situation. The intent is partly to clarify the code without actually changing any semantics yet (since we clearly still have some hard to reproduce bug in this area), but also to make it much easier to use another approach entirely to caching the CR0.TS bit for software accesses. Right now we use a bit in the thread-info 'status' variable (this patch does not change that), but we might want to make it a full field of its own or even make it a per-cpu variable. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2012-02-16 13:33:12 -08:00
Gleb Natapov	5753785fa9	KVM: do not #GP on perf MSR writes when vPMU is disabled Return to behaviour perf MSR had before introducing vPMU in case vPMU is disabled. Some guests access those registers unconditionally and do not expect it to fail. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-02-01 11:44:46 +02:00
Stephan Bärwolf	c2226fc9e8	KVM: x86: fix missing checks in syscall emulation On hosts without this patch, 32bit guests will crash (and 64bit guests may behave in a wrong way) for example by simply executing following nasm-demo-application: [bits 32] global _start SECTION .text _start: syscall (I tested it with winxp and linux - both always crashed) Disassembly of section .text: 00000000 <_start>: 0: 0f 05 syscall The reason seems a missing "invalid opcode"-trap (int6) for the syscall opcode "0f05", which is not available on Intel CPUs within non-longmodes, as also on some AMD CPUs within legacy-mode. (depending on CPU vendor, MSR_EFER and cpuid) Because previous mentioned OSs may not engage corresponding syscall target-registers (STAR, LSTAR, CSTAR), they remain NULL and (non trapping) syscalls are leading to multiple faults and finally crashs. Depending on the architecture (AMD or Intel) pretended by guests, various checks according to vendor's documentation are implemented to overcome the current issue and behave like the CPUs physical counterparts. [mtosatti: cleanup/beautify code] Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-02-01 11:43:40 +02:00
Stephan Bärwolf	bdb42f5afe	KVM: x86: extend "struct x86_emulate_ops" with "get_cpuid" In order to be able to proceed checks on CPU-specific properties within the emulator, function "get_cpuid" is introduced. With "get_cpuid" it is possible to virtually call the guests "cpuid"-opcode without changing the VM's context. [mtosatti: cleanup/beautify code] Signed-off-by: Stephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2012-02-01 11:43:33 +02:00
Rusty Russell	476bc0015b	module_param: make bool parameters really bool (arch) module_param(bool) used to counter-intuitively take an int. In `fddd5201` (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2012-01-13 09:32:18 +10:30
Avi Kivity	222d21aa07	KVM: x86 emulator: implement RDPMC (0F 33) Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:43 +02:00
Avi Kivity	80bdec64c0	KVM: x86 emulator: fix RDPMC privilege check RDPMC is only privileged if CR4.PCE=0. check_rdpmc() already implements this, so all we need to do is drop the Priv flag. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:41 +02:00
Gleb Natapov	a6c06ed1a6	KVM: Expose the architectural performance monitoring CPUID leaf Provide a CPUID leaf that describes the emulated PMU. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:40 +02:00
Avi Kivity	fee84b079d	KVM: VMX: Intercept RDPMC Intercept RDPMC and forward it to the PMU emulation code. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:38 +02:00
Avi Kivity	332b56e484	KVM: SVM: Intercept RDPMC Intercept RDPMC and forward it to the PMU emulation code. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:37 +02:00
Avi Kivity	022cd0e840	KVM: Add generic RDPMC support Add a helper function that emulates the RDPMC instruction operation. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:35 +02:00
Gleb Natapov	f5132b0138	KVM: Expose a version 2 architectural PMU to a guests Use perf_events to emulate an architectural PMU, version 2. Based on PMU version 1 emulation by Avi Kivity. [avi: adjust for cpuid.c] [jan: fix anonymous field initialization for older gcc] Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:24:29 +02:00
Avi Kivity	8934208221	KVM: Expose kvm_lapic_local_deliver() Needed to deliver performance monitoring interrupts. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:23:39 +02:00
Takuya Yoshikawa	e0dac408d0	KVM: x86 emulator: Use opcode::execute for Group 9 instruction Group 9: 0F C7 Rename em_grp9() to em_cmpxchg8b() and register it. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:23:38 +02:00
Takuya Yoshikawa	c04ec8393f	KVM: x86 emulator: Use opcode::execute for Group 4/5 instructions Group 4: FE Group 5: FF Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:23:36 +02:00
Takuya Yoshikawa	c15af35f54	KVM: x86 emulator: Use opcode::execute for Group 1A instruction Group 1A: 8F Register em_pop() directly and remove em_grp1a(). Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:23:35 +02:00
Gleb Natapov	d546cb406e	KVM: drop bsp_vcpu pointer from kvm struct Drop bsp_vcpu pointer from kvm struct since its only use is incorrect anyway. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:22:32 +02:00
Jan Kiszka	a647795efb	KVM: x86: Consolidate PIT legacy test Move the test for KVM_PIT_FLAGS_HPET_LEGACY into create_pit_timer instead of replicating it on the caller site. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:22:30 +02:00
Jan Kiszka	bb5a798ad5	KVM: x86: Do not rely on implicit inclusions Works so far by change, but it is not guaranteed to stay like this. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:22:29 +02:00
Avi Kivity	43771ebfc9	KVM: Make KVM_INTEL depend on CPU_SUP_INTEL PMU virtualization needs to talk to Intel-specific bits of perf; these are only available when CPU_SUP_INTEL=y. Fixes arch/x86/built-in.o: In function `atomic_switch_perf_msrs': vmx.c:(.text+0x6b1d4): undefined reference to `perf_guest_get_msrs' Reported-by: Ingo Molnar <mingo@elte.hu> Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-27 11:22:27 +02:00
Sasha Levin	ff5c2c0316	KVM: Use memdup_user instead of kmalloc/copy_from_user Switch to using memdup_user when possible. This makes code more smaller and compact, and prevents errors. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:21 +02:00
Sasha Levin	cdfca7b346	KVM: Use kmemdup() instead of kmalloc/memcpy Switch to kmemdup() in two places to shorten the code and avoid possible bugs. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:20 +02:00
Jan Kiszka	234b639206	KVM: x86 emulator: Remove set-but-unused cr4 from check_cr_write This was probably copy&pasted from the cr0 case, but it's unneeded here. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:16 +02:00
Jan Kiszka	3d56cbdf35	KVM: MMU: Drop unused return value of kvm_mmu_remove_some_alloc_mmu_pages freed_pages is never evaluated, so remove it as well as the return code kvm_mmu_remove_some_alloc_mmu_pages so far delivered to its only user. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:15 +02:00
Alex,Shi	086c985501	KVM: use this_cpu_xxx replace percpu_xxx funcs percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them for further code clean up. And in preempt safe scenario, __this_cpu_xxx funcs has a bit better performance since __this_cpu_xxx has no redundant preempt_disable() Signed-off-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:13 +02:00
Xiao Guangrong	e37fa7853c	KVM: MMU: audit: inline audit function inline audit function and little cleanup Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:12 +02:00
Xiao Guangrong	d750ea2886	KVM: MMU: remove oos_shadow parameter The unsync code should be stable now, maybe it is the time to remove this parameter to cleanup the code a little bit Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:10 +02:00
Xiao Guangrong	e459e3228d	KVM: MMU: move the relevant mmu code to mmu.c Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create() Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:09 +02:00
Xiao Guangrong	9edb17d55f	KVM: x86: remove the dead code of KVM_EXIT_HYPERCALL KVM_EXIT_HYPERCALL is not used anymore, so remove the code Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:22:07 +02:00

... 8 9 10 11 12 ...

3009 Commits