linux

Commit Graph

Author	SHA1	Message	Date
Xiao Guangrong	889e5cbced	KVM: MMU: split kvm_mmu_pte_write function kvm_mmu_pte_write is too long, we split it for better readable Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:59 +02:00
Xiao Guangrong	f8734352c6	KVM: MMU: remove unnecessary kvm_mmu_free_some_pages In kvm_mmu_pte_write, we do not need to alloc shadow page, so calling kvm_mmu_free_some_pages is really unnecessary Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:58 +02:00
Xiao Guangrong	f57f2ef58f	KVM: MMU: fast prefetch spte on invlpg path Fast prefetch spte for the unsync shadow page on invlpg path Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:56 +02:00
Xiao Guangrong	505aef8f30	KVM: MMU: cleanup FNAME(invlpg) Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the same code between FNAME(invlpg) and FNAME(sync_page) Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:54 +02:00
Xiao Guangrong	d01f8d5e02	KVM: MMU: do not mark accessed bit on pte write path In current code, the accessed bit is always set when page fault occurred, do not need to set it on pte write path Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:53 +02:00
Xiao Guangrong	6f6fbe98c3	KVM: x86: cleanup port-in/port-out emulated Remove the same code between emulator_pio_in_emulated and emulator_pio_out_emulated Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:51 +02:00
Xiao Guangrong	1cb3f3ae5a	KVM: x86: retry non-page-table writing instructions If the emulation is caused by #PF and it is non-page_table writing instruction, it means the VM-EXIT is caused by shadow page protected, we can zap the shadow page and retry this instruction directly The idea is from Avi Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:50 +02:00
Xiao Guangrong	d5ae7ce835	KVM: x86: tag the instructions which are used to write page table The idea is from Avi: \| tag instructions that are typically used to modify the page tables, and \| drop shadow if any other instruction is used. \| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg, \| and cmpxchg8b. This patch is used to tag the instructions and in the later path, shadow page is dropped if it is written by other instructions Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:48 +02:00
Xiao Guangrong	f759e2b4c7	KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the function when spte is prefetched, unfortunately, we can not know how many spte need to be prefetched on this path, that means we can use out of the free pte_list_desc object in the cache, and BUG_ON() is triggered, also some path does not fill the cache, such as INS instruction emulated that does not trigger page fault Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:47 +02:00
Nadav Har'El	51cfe38ea5	KVM: nVMX: Fix warning-causing idt-vectoring-info behavior When L0 wishes to inject an interrupt while L2 is running, it emulates an exit to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original nVMX patch 23, titled "Correct handling of interrupt injection". Unfortunately, it is possible (though rare) that at this point there is valid idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2, and when L2 tried to run this interrupt's handler, it got a page fault - so it returns the original interrupt vector in idt_vectoring_info. The problem is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT like we wished to, because the VMX spec guarantees that idt_vectoring_info and exit_reason_external_interrupt can never happen together. This is not just specified in the spec - a KVM L1 actually prints a kernel warning "unexpected, valid vectoring info" if we violate this guarantee, and some users noticed these warnings in L1's logs. In order to better emulate a processor, which would never return the external interrupt and the idt-vectoring-info together, we need to separate the two injection steps: First, complete L1's injection into L2 (i.e., enter L2, injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT. Most of this is already in the code - the only change we need is to remain in L2 (and not exit to L1) in this case. Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that although we do enter L2 first, it will exit immediately after processing its injection, allowing us to promptly inject to L1. Note how we test vmcs12->idt_vectoring_info_field; This isn't really the vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated), but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value of this field. This was explained in patch 25 ("Correct handling of idt vectoring info") of the original nVMX patch series. Thanks to Dave Allan and to Federico Simoncelli for reporting this bug, to Abel Gordon for helping me figure out the solution, and to Avi Kivity for helping to improve it. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:45 +02:00
Nadav Har'El	d6185f20a0	KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT. This bit requests that when next entering the guest, we should run it only for as little as possible, and exit again. We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1 to continue running so it can inject an event to it, we unfortunately cannot just pretend to have run L2 for a little while - We must really launch L2, otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2) will be lost. So the existing code runs L2 in this case. But L2 could potentially run for a long time until it exits, and the injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us to request that L2 will be entered, as necessary, but will exit as soon as possible after entry. Our implementation of this request uses smp_send_reschedule() to send a self-IPI, with interrupts disabled. The interrupts remain disabled until the guest is entered, and then, after the entry is complete (often including processing an injection and jumping to the relevant handler), the physical interrupt is noticed and causes an exit. On recent Intel processors, we could have achieved the same goal by using MTF instead of a self-IPI. Another technique worth considering in the future is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to slightly improve performance by avoiding the useless interrupt handler which ends up being called when smp_send_reschedule() is used. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-27 11:16:43 +02:00
Jan Kiszka	4d25a066b6	KVM: Don't automatically expose the TSC deadline timer in cpuid Unlike all of the other cpuid bits, the TSC deadline timer bit is set unconditionally, regardless of what userspace wants. This is broken in several ways: - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC deadline timer feature, a guest that uses the feature will break - live migration to older host kernels that don't support the TSC deadline timer will cause the feature to be pulled from under the guest's feet; breaking it - guests that are broken wrt the feature will fail. Fix by not enabling the feature automatically; instead report it to userspace. Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not KVM_GET_SUPPORTED_CPUID. Fixes the Illumos guest kernel, which uses the TSC deadline timer feature. [avi: add the KVM_CAP + documentation] Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-12-26 13:27:44 +02:00
Jan Kiszka	0924ab2cfa	KVM: x86: Prevent starting PIT timers in the absence of irqchip support User space may create the PIT and forgets about setting up the irqchips. In that case, firing PIT IRQs will crash the host: BUG: unable to handle kernel NULL pointer dereference at 0000000000000128 IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm] ... Call Trace: [<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm] [<ffffffff81071431>] process_one_work+0x111/0x4d0 [<ffffffff81071bb2>] worker_thread+0x152/0x340 [<ffffffff81075c8e>] kthread+0x7e/0x90 [<ffffffff815a4474>] kernel_thread_helper+0x4/0x10 Prevent this by checking the irqchip mode before starting a timer. We can't deny creating the PIT if the irqchips aren't set up yet as current user land expects this order to work. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-12-25 17:13:18 +02:00
Gleb Natapov	e7fc6f93b4	KVM: VMX: Check for automatic switch msr table overflow Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-11-17 16:28:09 +02:00
Gleb Natapov	d7cd97964b	KVM: VMX: Add support for guest/host-only profiling Support guest/host-only profiling by switch perf msrs on a guest entry if needed. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-11-17 16:28:00 +02:00
Gleb Natapov	8bf00a5299	KVM: VMX: add support for switching of PERF_GLOBAL_CTRL Some cpus have special support for switching PERF_GLOBAL_CTRL msr. Add logic to detect if such support exists and works properly and extend msr switching code to use it if available. Also extend number of generic msr switching entries to 8. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-11-17 16:27:54 +02:00
Linus Torvalds	0cfdc72439	Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits) iommu/core: Remove global iommu_ops and register_iommu iommu/msm: Use bus_set_iommu instead of register_iommu iommu/omap: Use bus_set_iommu instead of register_iommu iommu/vt-d: Use bus_set_iommu instead of register_iommu iommu/amd: Use bus_set_iommu instead of register_iommu iommu/core: Use bus->iommu_ops in the iommu-api iommu/core: Convert iommu_found to iommu_present iommu/core: Add bus_type parameter to iommu_domain_alloc Driver core: Add iommu_ops to bus_type iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API iommu/amd: Fix wrong shift direction iommu/omap: always provide iommu debug code iommu/core: let drivers know if an iommu fault handler isn't installed iommu/core: export iommu_set_fault_handler() iommu/omap: Fix build error with !IOMMU_SUPPORT iommu/omap: Migrate to the generic fault report mechanism iommu/core: Add fault reporting mechanism iommu/core: Use PAGE_SIZE instead of hard-coded value iommu/core: use the existing IS_ALIGNED macro iommu/msm: ->unmap() should return order of unmapped page ... Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config options" just happened to touch lines next to each other.	2011-10-30 15:46:19 -07:00
Jan Kiszka	f1c1da2bde	KVM: SVM: Keep intercepting task switching with NPT enabled AMD processors apparently have a bug in the hardware task switching support when NPT is enabled. If the task switch triggers a NPF, we can get wrong EXITINTINFO along with that fault. On resume, spurious exceptions may then be injected into the guest. We were able to reproduce this bug when our guest triggered #SS and the handler were supposed to run over a separate task with not yet touched stack pages. Work around the issue by continuing to emulate task switches even in NPT mode. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-10-30 12:24:10 +02:00
Joerg Roedel	1abb4ba596	Merge branches 'amd/fixes', 'debug/dma-api', 'arm/omap', 'arm/msm', 'core', 'iommu/fault-reporting' and 'api/iommu-ops-per-bus' into next Conflicts: drivers/iommu/amd_iommu.c drivers/iommu/iommu.c	2011-10-21 14:38:55 +02:00
Joerg Roedel	a1b60c1cd9	iommu/core: Convert iommu_found to iommu_present With per-bus iommu_ops the iommu_found function needs to work on a bus_type too. This patch adds a bus_type parameter to that function and converts all call-places. The function is also renamed to iommu_present because the function now checks if an iommu is present for a given bus and does not check for a global iommu anymore. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>	2011-10-21 14:37:20 +02:00
Liu, Jinsong	a3e06bbe84	KVM: emulate lapic tsc deadline timer for guest This patch emulate lapic tsc deadline timer for guest: Enumerate tsc deadline timer capability by CPUID; Enable tsc deadline timer mode by lapic MMIO; Start tsc deadline timer by WRMSR; [jan: use do_div()] [avi: fix for !irqchip_in_kernel()] [marcelo: another fix for !irqchip_in_kernel()] Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-10-05 15:34:56 +02:00
Avi Kivity	7460fb4a34	KVM: Fix simultaneous NMIs If simultaneous NMIs happen, we're supposed to queue the second and next (collapsing them), but currently we sometimes collapse the second into the first. Fix by using a counter for pending NMIs instead of a bool; since the counter limit depends on whether the processor is currently in an NMI handler, which can only be checked in vcpu context (via the NMI mask), we add a new KVM_REQ_NMI to request recalculation of the counter. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:59 +03:00
Avi Kivity	1cd196ea42	KVM: x86 emulator: convert push %sreg/pop %sreg to direct decode Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:58 +03:00
Avi Kivity	d4b4325fdb	KVM: x86 emulator: switch lds/les/lss/lfs/lgs to direct decode Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:57 +03:00
Avi Kivity	c191a7a0f4	KVM: x86 emulator: streamline decode of segment registers The opcodes push %seg pop %seg l%seg, %mem, %reg (e.g. lds/les/lss/lfs/lgs) all have an segment register encoded in the instruction. To allow reuse, decode the segment number into src2 during the decode stage instead of the execution stage. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:56 +03:00
Avi Kivity	41ddf9784c	KVM: x86 emulator: simplify OpMem64 decode Use the same technique as the other OpMem variants, and goto mem_common. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:55 +03:00
Avi Kivity	0fe5912884	KVM: x86 emulator: switch src decode to decode_operand() Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:54 +03:00
Avi Kivity	5217973ef8	KVM: x86 emulator: qualify OpReg inhibit_byte_regs hack OpReg decoding has a hack that inhibits byte registers for movsx and movzx instructions. It should be replaced by something better, but meanwhile, qualify that the hack is only active for the destination operand. Note these instructions only use OpReg for the destination, but better to be explicit about it. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:53 +03:00
Avi Kivity	608aabe316	KVM: x86 emulator: switch OpImmUByte decode to decode_imm() Similar to SrcImmUByte. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:52 +03:00
Avi Kivity	20c29ff205	KVM: x86 emulator: free up some flag bits near src, dst Op fields are going to grow by a bit, we need two free bits. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:51 +03:00
Avi Kivity	4dd6a57df7	KVM: x86 emulator: switch src2 to generic decode_operand() Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:50 +03:00
Avi Kivity	b1ea50b2b6	KVM: x86 emulator: expand decode flags to 64 bits Unifiying the operands means not taking advantage of the fact that some operand types can only go into certain operands (for example, DI can only be used by the destination), so we need more bits to hold the operand type. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:49 +03:00
Avi Kivity	a99455499a	KVM: x86 emulator: split dst decode to a generic decode_operand() Instead of decoding each operand using its own code, use a generic function. Start with the destination operand. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:48 +03:00
Avi Kivity	f09ed83e21	KVM: x86 emulator: move memop, memopp into emulation context Simplifies further generalization of decode. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:47 +03:00
Avi Kivity	3329ece161	KVM: x86 emulator: convert group 3 instructions to direct decode Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:46 +03:00
Jan Kiszka	9bc5791d4a	KVM: x86: Add module parameter for lapic periodic timer limit Certain guests, specifically RTOSes, request faster periodic timers than what we allow by default. Add a module parameter to adjust the limit for non-standard setups. Also add a rate-limited warning in case the guest requested more. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:44 +03:00
Jan Kiszka	bd80158aff	KVM: Clean up and extend rate-limited output The use of printk_ratelimit is discouraged, replace it with pr*_ratelimited or __ratelimit. While at it, convert remaining guest-triggerable printks to rate-limited variants. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:43 +03:00
Jan Kiszka	7712de872c	KVM: x86: Avoid guest-triggerable printks in APIC model Convert remaining printks that the guest can trigger to apic_printk. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:42 +03:00
Jan Kiszka	1e2b1dd797	KVM: x86: Move kvm_trace_exit into atomic vmexit section This avoids that events causing the vmexit are recorded before the actual exit reason. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:41 +03:00
Avi Kivity	caa8a168e3	KVM: x86 emulator: disable writeback for TEST The TEST instruction doesn't write its destination operand. This could cause problems if an MMIO register was accessed using the TEST instruction. Recently Windows XP was observed to use TEST against the APIC ICR; this can cause spurious IPIs. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:40 +03:00
Avi Kivity	e8f2b1d621	KVM: x86 emulator: simplify emulate_1op_rax_rdx() emulate_1op_rax_rdx() is always called with the same parameters. Simplify by passing just the emulation context. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:37 +03:00
Avi Kivity	9fef72ce10	KVM: x86 emulator: merge the two emulate_1op_rax_rdx implementations We have two emulate-with-extended-accumulator implementations: once which expect traps (_ex) and one which doesn't (plain). Drop the plain implementation and always use the one which expects traps; it will simply return 0 in the _ex argument and we can happily ignore it. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:36 +03:00
Avi Kivity	d1eef45d59	KVM: x86 emulator: simplify emulate_1op() emulate_1op() is always called with the same parameters. Simplify by passing just the emulation context. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:35 +03:00
Avi Kivity	29053a60d7	KVM: x86 emulator: simplify emulate_2op_cl() emulate_2op_cl() is always called with the same parameters. Simplify by passing just the emulation context. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:34 +03:00
Avi Kivity	761441b9f4	KVM: x86 emulator: simplify emulate_2op_cl() emulate_2op_cl() is always called with the same parameters. Simplify by passing just the emulation context. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:33 +03:00
Avi Kivity	a31b9ceadb	KVM: x86 emulator: simplify emulate_2op_SrcV() emulate_2op_SrcV(), and its siblings, emulate_2op_SrcV_nobyte() and emulate_2op_SrcB(), all use the same calling conventions and all get passed exactly the same parameters. Simplify them by passing just the emulation context. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:52:32 +03:00
Kevin Tian	58fbbf26eb	KVM: APIC: avoid instruction emulation for EOI writes Instruction emulation for EOI writes can be skipped, since sane guest simply uses MOV instead of string operations. This is a nice improvement when guest doesn't support x2apic or hyper-V EOI support. a single VM bandwidth is observed with ~8% bandwidth improvement (7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation. Signed-off-by: Kevin Tian <kevin.tian@intel.com> <Based on earlier work from>: Signed-off-by: Eddie Dong <eddie.dong@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:52:17 +03:00
Nadav Har'El	45133ecaae	KVM: SVM: Fix TSC MSR read in nested SVM When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be read without exit), we need to return L2's notion of the TSC, not L1's. The current code incorrectly returned L1 TSC, because svm_get_msr() was also used in x86.c where this was assumed, but now that these places call the new svm_read_l1_tsc(), the MSR read can be fixed. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Tested-by: Joerg Roedel <joerg.roedel@amd.com> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:03 +03:00
Nadav Har'El	27fc51b21c	KVM: nVMX: Fix nested VMX TSC emulation This patch fixes two corner cases in nested (L2) handling of TSC-related issues: 1. Somewhat suprisingly, according to the Intel spec, if L1 allows WRMSR to the TSC MSR without an exit, then this should set L1's TSC value itself - not offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code). 2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore the vmcs12.TSC_OFFSET. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:02 +03:00
Nadav Har'El	d5c1785d2f	KVM: L1 TSC handling KVM assumed in several places that reading the TSC MSR returns the value for L1. This is incorrect, because when L2 is running, the correct TSC read exit emulation is to return L2's value. We therefore add a new x86_ops function, read_l1_tsc, to use in places that specifically need to read the L1 TSC, NOT the TSC of the current level of guest. Note that one change, of one line in kvm_arch_vcpu_load, is made redundant by a different patch sent by Zachary Amsden (and not yet applied): kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of course we didn't have to change the call of kvm_get_msr() to read_l1_tsc(). [avi: moved callback to kvm_x86_ops tsc block] Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Acked-by: Zachary Amsdem <zamsden@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:02 +03:00
Yang, Wei Y	cd46868c7f	KVM: MMU: Fix SMEP failure during fetch This patch fix kvm-unit-tests hanging and incorrect PT_ACCESSED_MASK bit set in the case of SMEP fault. The code updated 'eperm' after the variable was checked. Signed-off-by: Yang, Wei <wei.y.yang@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:02 +03:00
Avi Kivity	e4e517b4be	KVM: MMU: Do not unconditionally read PDPTE from guest memory Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded. On SVM, it is not possible to implement this, but on VMX this is possible and was indeed implemented until nested SVM changed this to unconditionally read PDPTEs dynamically. This has noticable impact when running PAE guests. Fix by changing the MMU to read PDPTRs from the cache, falling back to reading from memory for the nested MMU. Signed-off-by: Avi Kivity <avi@redhat.com> Tested-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:18:01 +03:00
Julia Lawall	cf3ace79c0	KVM: VMX: trivial: use BUG_ON Use BUG_ON(x) rather than if(x) BUG(); The semantic patch that fixes this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ identifier x; @@ -if (x) BUG(); +BUG_ON(x); @@ identifier x; @@ -if (!x) BUG(); +BUG_ON(!x); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:18:01 +03:00
Marcelo Tosatti	742bc67042	KVM: x86: report valid microcode update ID Windows Server 2008 SP2 checked build with smp > 1 BSOD's during boot due to lack of microcode update: * Assertion failed: The system BIOS on this machine does not properly support the processor. The system BIOS did not load any microcode update. A BIOS containing the latest microcode update is needed for system reliability. (CurrentUpdateRevision != 0) * Source File: d:\longhorn\base\hals\update\intelupd\update.c, line 440 Report a non-zero microcode update signature to make it happy. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:01 +03:00
Takuya Yoshikawa	1d2887e2d8	KVM: x86 emulator: Make x86_decode_insn() return proper macros Return EMULATION_OK/FAILED consistently. Also treat instruction fetch errors, not restricted to X86EMUL_UNHANDLEABLE, as EMULATION_FAILED; although this cannot happen in practice, the current logic will continue the emulation even if the decoder fails to fetch the instruction. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:01 +03:00
Takuya Yoshikawa	7d88bb4803	KVM: x86 emulator: Let compiler know insn_fetch() rarely fails Fetching the instruction which was to be executed by the guest cannot fail normally. So compiler should always predict that it will succeed. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:18:00 +03:00
Takuya Yoshikawa	e85a10852c	KVM: x86 emulator: Drop _size argument from insn_fetch() _type is enough to know the size. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:59 +03:00
Takuya Yoshikawa	807941b121	KVM: x86 emulator: Use ctxt->_eip directly in do_insn_fetch_byte() Instead of passing ctxt->_eip from insn_fetch() call sites, get it from ctxt in do_insn_fetch_byte(). This is done by replacing the argument _eip of insn_fetch() with _ctxt, which should be better than letting the macro use ctxt silently in its body. Though this changes the place where ctxt->_eip is incremented from insn_fetch() to do_insn_fetch_byte(), this does not have any real effect. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:59 +03:00
Sasha Levin	743eeb0b01	KVM: Intelligent device lookup on I/O bus Currently the method of dealing with an IO operation on a bus (PIO/MMIO) is to call the read or write callback for each device registered on the bus until we find a device which handles it. Since the number of devices on a bus can be significant due to ioeventfds and coalesced MMIO zones, this leads to a lot of overhead on each IO operation. Instead of registering devices, we now register ranges which points to a device. Lookup is done using an efficient bsearch instead of a linear search. Performance test was conducted by comparing exit count per second with 200 ioeventfds created on one byte and the guest is trying to access a different byte continuously (triggering usermode exits). Before the patch the guest has achieved 259k exits per second, after the patch the guest does 274k exits per second. Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:59 +03:00
Stefan Hajnoczi	0d460ffc09	KVM: Use __print_symbolic() for vmexit tracepoints The vmexit tracepoints format the exit_reason to make it human-readable. Since the exit_reason depends on the instruction set (vmx or svm), formatting is handled with ftrace_print_symbols_seq() by referring to the appropriate exit reason table. However, the ftrace_print_symbols_seq() function is not meant to be used directly in tracepoints since it does not export the formatting table which userspace tools like trace-cmd and perf use to format traces. In practice perf dies when formatting vmexit-related events and trace-cmd falls back to printing the numeric value (with extra formatting code in the kvm plugin to paper over this limitation). Other userspace consumers of vmexit-related tracepoints would be in similar trouble. To avoid significant changes to the kvm_exit tracepoint, this patch moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and selects the right table with __print_symbolic() depending on the instruction set. Note that __print_symbolic() is designed for exporting the formatting table to userspace and allows trace-cmd and perf to work. Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:59 +03:00
Stefan Hajnoczi	e097e5ffd6	KVM: Record instruction set in all vmexit tracepoints The kvm_exit tracepoint recently added the isa argument to aid decoding exit_reason. The semantics of exit_reason depend on the instruction set (vmx or svm) and the isa argument allows traces to be analyzed on other machines. Add the isa argument to kvm_nested_vmexit and kvm_nested_vmexit_inject so these tracepoints can also be self-describing. Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:58 +03:00
Mike Waychison	d1613ad5d0	KVM: Really fix HV_X64_MSR_APIC_ASSIST_PAGE Commit 0945d4b228 tried to fix the get_msr path for the HV_X64_MSR_APIC_ASSIST_PAGE msr, but was poorly tested. We should be returning 0 if the read succeeded, and passing the value back to the caller via the pdata out argument, not returning the value directly. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:58 +03:00
Mike Waychison	14fa67ee95	KVM: x86: get_msr support for HV_X64_MSR_APIC_ASSIST_PAGE "get" support for the HV_X64_MSR_APIC_ASSIST_PAGE msr was missing, even though it is explicitly enumerated as something the vmm should save in msrs_to_save and reported to userland via the KVM_GET_MSR_INDEX_LIST ioctl. Add "get" support for HV_X64_MSR_APIC_ASSIST_PAGE. We simply return the guest visible value of this register, which seems to be correct as a set on the register is validated for us already. Signed-off-by: Mike Waychison <mikew@google.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:17:58 +03:00
Sasha Levin	8c3ba334f8	KVM: x86: Raise the hard VCPU count limit The patch raises the hard limit of VCPU count to 254. This will allow developers to easily work on scalability and will allow users to test high VCPU setups easily without patching the kernel. To prevent possible issues with current setups, KVM_CAP_NR_VCPUS now returns the recommended VCPU limit (which is still 64) - this should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS returns the hard limit which is now 254. Cc: Avi Kivity <avi@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Suggested-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:17:57 +03:00
Xiao Guangrong	22388a3c8c	KVM: x86: cleanup the code of read/write emulation Using the read/write operation to remove the same code Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:17 +03:00
Xiao Guangrong	77d197b2ca	KVM: x86: abstract the operation for read/write emulation The operations of read emulation and write emulation are very similar, so we can abstract the operation of them, in larter patch, it is used to cleanup the same code Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:17 +03:00
Xiao Guangrong	ca7d58f375	KVM: x86: fix broken read emulation spans a page boundary If the range spans a page boundary, the mmio access can be broke, fix it as write emulation. And we already get the guest physical address, so use it to read guest data directly to avoid walking guest page table again Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-09-25 19:17:17 +03:00
Avi Kivity	9be3be1f15	KVM: x86 emulator: fix Src2CL decode Src2CL decode (used for double width shifts) erronously decodes only bit 3 of %rcx, instead of bits 7:0. Fix by decoding %cl in its entirety. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:14:58 +03:00
Zhao Jin	41bc3186b3	KVM: MMU: fix incorrect return of spte __update_clear_spte_slow should return original spte while the current code returns low half of original spte combined with high half of new spte. Signed-off-by: Zhao Jin <cronozhj@gmail.com> Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-09-25 19:13:25 +03:00
Linus Torvalds	ab7e2dbf9b	Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: uses TASKSTATS, depends on NET KVM: fix TASK_DELAY_ACCT kconfig warning	2011-08-16 11:14:44 -07:00
Randy Dunlap	df3d8ae1f8	KVM: uses TASKSTATS, depends on NET CONFIG_TASKSTATS just had a change to use netlink, including a change to "depends on NET". Since "select" does not follow dependencies, KVM also needs to depend on NET to prevent build errors when CONFIG_NET is not enabled. Sample of the reported "undefined reference" build errors: taskstats.c:(.text+0x8f686): undefined reference to `nla_put' taskstats.c:(.text+0x8f721): undefined reference to `nla_reserve' taskstats.c:(.text+0x8f8fb): undefined reference to `init_net' taskstats.c:(.text+0x8f905): undefined reference to `netlink_unicast' taskstats.c:(.text+0x8f934): undefined reference to `kfree_skb' taskstats.c:(.text+0x8f9e9): undefined reference to `skb_clone' taskstats.c:(.text+0x90060): undefined reference to `__alloc_skb' taskstats.c:(.text+0x901e9): undefined reference to `skb_put' taskstats.c:(.init.text+0x4665): undefined reference to `genl_register_family' taskstats.c:(.init.text+0x4699): undefined reference to `genl_register_ops' taskstats.c:(.init.text+0x4710): undefined reference to `genl_unregister_ops' taskstats.c:(.init.text+0x471c): undefined reference to `genl_unregister_family' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-08-16 19:00:41 +03:00
Randy Dunlap	fd079facb3	KVM: fix TASK_DELAY_ACCT kconfig warning Fix kconfig dependency warning: warning: (KVM) selects TASK_DELAY_ACCT which has unmet direct dependencies (TASKSTATS) Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-27 14:22:06 +03:00
Arun Sharma	60063497a9	atomic: use <linux/atomic.h> This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2011-07-26 16:49:47 -07:00
Linus Torvalds	d3ec4844d4	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) fs: Merge split strings treewide: fix potentially dangerous trailing ';' in #defined values/expressions uwb: Fix misspelling of neighbourhood in comment net, netfilter: Remove redundant goto in ebt_ulog_packet trivial: don't touch files that are removed in the staging tree lib/vsprintf: replace link to Draft by final RFC number doc: Kconfig: `to be' -> `be' doc: Kconfig: Typo: square -> squared doc: Konfig: Documentation/power/{pm => apm-acpi}.txt drivers/net: static should be at beginning of declaration drivers/media: static should be at beginning of declaration drivers/i2c: static should be at beginning of declaration XTENSA: static should be at beginning of declaration SH: static should be at beginning of declaration MIPS: static should be at beginning of declaration ARM: static should be at beginning of declaration rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check Update my e-mail address PCIe ASPM: forcedly -> forcibly gma500: push through device driver tree ... Fix up trivial conflicts: - arch/arm/mach-ep93xx/dma-m2p.c (deleted) - drivers/gpio/gpio-ep93xx.c (renamed and context nearby) - drivers/net/r8169.c (just context changes)	2011-07-25 13:56:39 -07:00
Linus Torvalds	5fabc487c9	Merge branch 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm * 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits) KVM: IOMMU: Disable device assignment without interrupt remapping KVM: MMU: trace mmio page fault KVM: MMU: mmio page fault support KVM: MMU: reorganize struct kvm_shadow_walk_iterator KVM: MMU: lockless walking shadow page table KVM: MMU: do not need atomicly to set/clear spte KVM: MMU: introduce the rules to modify shadow page table KVM: MMU: abstract some functions to handle fault pfn KVM: MMU: filter out the mmio pfn from the fault pfn KVM: MMU: remove bypass_guest_pf KVM: MMU: split kvm_mmu_free_page KVM: MMU: count used shadow pages on prepareing path KVM: MMU: rename 'pt_write' to 'emulate' KVM: MMU: cleanup for FNAME(fetch) KVM: MMU: optimize to handle dirty bit KVM: MMU: cache mmio info on page fault path KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code KVM: MMU: do not update slot bitmap if spte is nonpresent KVM: MMU: fix walking shadow page table KVM guest: KVM Steal time registration ...	2011-07-24 09:07:03 -07:00
Xiao Guangrong	4f0226482d	KVM: MMU: trace mmio page fault Add tracepoints to trace mmio page fault Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:41 +03:00
Xiao Guangrong	ce88decffd	KVM: MMU: mmio page fault support The idea is from Avi: \| We could cache the result of a miss in an spte by using a reserved bit, and \| checking the page fault error code (or seeing if we get an ept violation or \| ept misconfiguration), so if we get repeated mmio on a page, we don't need to \| search the slot list/tree. \| (https://lkml.org/lkml/2011/2/22/221) When the page fault is caused by mmio, we cache the info in the shadow page table, and also set the reserved bits in the shadow page table, so if the mmio is caused again, we can quickly identify it and emulate it directly Searching mmio gfn in memslots is heavy since we need to walk all memeslots, it can be reduced by this feature, and also avoid walking guest page table for soft mmu. [jan: fix operator precedence issue] Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:40 +03:00
Xiao Guangrong	dd3bfd59db	KVM: MMU: reorganize struct kvm_shadow_walk_iterator Reorganize it for good using the cache Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:39 +03:00
Xiao Guangrong	c2a2ac2b56	KVM: MMU: lockless walking shadow page table Use rcu to protect shadow pages table to be freed, so we can safely walk it, it should run fastly and is needed by mmio page fault Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:38 +03:00
Xiao Guangrong	603e0651cf	KVM: MMU: do not need atomicly to set/clear spte Now, the spte is just from nonprsent to present or present to nonprsent, so we can use some trick to set/clear spte non-atomicly as linux kernel does Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:37 +03:00
Xiao Guangrong	1df9f2dc39	KVM: MMU: introduce the rules to modify shadow page table Introduce some interfaces to modify spte as linux kernel does: - mmu_spte_clear_track_bits, it set the spte from present to nonpresent, and track the stat bits(accessed/dirty) of spte - mmu_spte_clear_no_track, the same as mmu_spte_clear_track_bits except tracking the stat bits - mmu_spte_set, set spte from nonpresent to present - mmu_spte_update, only update the stat bits Now, it does not allowed to set spte from present to present, later, we can drop the atomicly opration for X86_32 host, and it is the preparing work to get spte on X86_32 host out of the mmu lock Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:36 +03:00
Xiao Guangrong	d7c55201e6	KVM: MMU: abstract some functions to handle fault pfn Introduce handle_abnormal_pfn to handle fault pfn on page fault path, introduce mmu_invalid_pfn to handle fault pfn on prefetch path It is the preparing work for mmio page fault support Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:35 +03:00
Xiao Guangrong	fce92dce79	KVM: MMU: filter out the mmio pfn from the fault pfn If the page fault is caused by mmio, the gfn can not be found in memslots, and 'bad_pfn' is returned on gfn_to_hva path, so we can use 'bad_pfn' to identify the mmio page fault. And, to clarify the meaning of mmio pfn, we return fault page instead of bad page when the gfn is not allowd to prefetch Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:34 +03:00
Xiao Guangrong	c37079586f	KVM: MMU: remove bypass_guest_pf The idea is from Avi: \| Maybe it's time to kill off bypass_guest_pf=1. It's not as effective as \| it used to be, since unsync pages always use shadow_trap_nonpresent_pte, \| and since we convert between the two nonpresent_ptes during sync and unsync. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:33 +03:00
Xiao Guangrong	bd4c86eaa6	KVM: MMU: split kvm_mmu_free_page Split kvm_mmu_free_page to kvm_mmu_isolate_page and kvm_mmu_free_page One is used to remove the page from cache under mmu lock and the other is used to free page table out of mmu lock Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:32 +03:00
Xiao Guangrong	aa6bd187af	KVM: MMU: count used shadow pages on prepareing path Move counting used shadow pages from commiting path to preparing path to reduce tlb flush on some paths Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:31 +03:00
Xiao Guangrong	b90a0e6c81	KVM: MMU: rename 'pt_write' to 'emulate' If 'pt_write' is true, we need to emulate the fault. And in later patch, we need to emulate the fault even though it is not a pt_write event, so rename it to better fit the meaning Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:30 +03:00
Xiao Guangrong	b36c7a7c10	KVM: MMU: cleanup for FNAME(fetch) gw->pte_access is the final access permission, since it is unified with gw->pt_access when we walked guest page table: FNAME(walk_addr_generic): pte_access = pt_access & FNAME(gpte_access)(vcpu, pte, true); Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:29 +03:00
Xiao Guangrong	640d9b0dbe	KVM: MMU: optimize to handle dirty bit If dirty bit is not set, we can make the pte access read-only to avoid handing dirty bit everywhere Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:27 +03:00
Xiao Guangrong	bebb106a5a	KVM: MMU: cache mmio info on page fault path If the page fault is caused by mmio, we can cache the mmio info, later, we do not need to walk guest page table and quickly know it is a mmio fault while we emulate the mmio instruction Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:26 +03:00
Xiao Guangrong	af7cc7d1ee	KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code Introduce vcpu_mmio_gva_to_gpa to translate the gva to gpa, we can use it to cleanup the code between read emulation and write emulation Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:25 +03:00
Xiao Guangrong	ffb61bb3bc	KVM: MMU: do not update slot bitmap if spte is nonpresent Set slot bitmap only if the spte is present Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:24 +03:00
Xiao Guangrong	052331bea3	KVM: MMU: fix walking shadow page table Properly check the last mapping, and do not walk to the next level if last spte is met Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-24 11:50:23 +03:00
Ohad Ben-Cohen	e72542191c	virtio: expose for non-virtualization users too virtio has been so far used only in the context of virtualization, and the virtio Kconfig was sourced directly by the relevant arch Kconfigs when VIRTUALIZATION was selected. Now that we start using virtio for inter-processor communications, we need to source the virtio Kconfig outside of the virtualization scope too. Moreover, some architectures might use virtio for both virtualization and inter-processor communications, so directly sourcing virtio might yield unexpected results due to conflicting selections. The simple solution offered by this patch is to always source virtio's Kconfig in drivers/Kconfig, and remove it from the appropriate arch Kconfigs. Additionally, a virtio menu entry has been added so virtio drivers don't show up in the general drivers menu. This way anyone can use virtio, though it's arguably less accessible (and neat!) for virtualization users now. Note: some architectures (mips and sh) seem to have a VIRTUALIZATION menu merely for sourcing virtio's Kconfig, so that menu is removed too. Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>	2011-07-23 16:20:30 +09:30
Glauber Costa	c9aaa8957f	KVM: Steal time implementation To implement steal time, we need the hypervisor to pass the guest information about how much time was spent running other processes outside the VM, while the vcpu had meaningful work to do - halt time does not count. This information is acquired through the run_delay field of delayacct/schedstats infrastructure, that counts time spent in a runqueue but not running. Steal time is a per-cpu information, so the traditional MSR-based infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the memory area address containing information about steal time This patch contains the hypervisor part of the steal time infrasructure, and can be backported independently of the guest portion. [avi, yongjie: export delayacct_on, to avoid build failures in some configs] Signed-off-by: Glauber Costa <glommer@redhat.com> Tested-by: Eric B Munson <emunson@mgebm.net> CC: Rik van Riel <riel@redhat.com> CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> CC: Peter Zijlstra <peterz@infradead.org> CC: Anthony Liguori <aliguori@us.ibm.com> Signed-off-by: Yongjie Ren <yongjie.ren@intel.com> Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-14 12:59:14 +03:00
Takuya Yoshikawa	3c8c652ae4	KVM: MMU: Introduce is_last_gpte() to clean up walk_addr_generic() Suggested by Ingo and Avi. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:44 +03:00
Takuya Yoshikawa	92c1c1e85b	KVM: MMU: Rename the walk label in walk_addr_generic() The current name does not explain the meaning well. So give it a better name "retry_walk" to show that we are trying the walk again. This was suggested by Ingo Molnar. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:43 +03:00
Takuya Yoshikawa	134291bf3c	KVM: MMU: Clean up the error handling of walk_addr_generic() Avoid two step jump to the error handling part. This eliminates the use of the variables present and rsvd_fault. We also use the const type qualifier to show that write/user/fetch_fault do not change in the function. Both of these were suggested by Ingo Molnar. Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:42 +03:00
Marcelo Tosatti	f8f7e5ee10	Revert "KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB" This reverts commit bee931d31e588b8eb86b7edee32fac2d16930cd7. TLB flush should be done lazily during guest entry, in kvm_mmu_load(). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>	2011-07-12 13:16:41 +03:00
Avi Kivity	45bd07b9d5	KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB kvm_set_cr0() and kvm_set_cr4(), and possible other functions, assume that kvm_mmu_reset_context() flushes the guest TLB. However, it does not. Fix by flushing the tlb (and syncing the new root as well). Signed-off-by: Avi Kivity <avi@redhat.com>	2011-07-12 13:16:27 +03:00

1 2 3 4 5 ...

2084 Commits