The name "pid_sync_vcpu_all" isn't appropriate since it just affect
a single vpid, so rename it to vpid_sync_vcpu_single().
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Using kvm_mmu_prepare_zap_page() and kvm_mmu_zap_page() instead of
kvm_mmu_zap_page() that can reduce remote tlb flush IPI
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In the later patch, we will modify sp's zapping way like below:
kvm_mmu_prepare_zap_page A
kvm_mmu_prepare_zap_page B
kvm_mmu_prepare_zap_page C
....
kvm_mmu_commit_zap_page
[ zaped multiple sps only need to call kvm_mmu_commit_zap_page once ]
In __kvm_mmu_free_some_pages() function, the free page number is
getted form 'vcpu->kvm->arch.n_free_mmu_pages' in loop, it will
hinders us to apply kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page()
since kvm_mmu_prepare_zap_page() not free sp.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Using kvm_mmu_prepare_zap_page() and kvm_mmu_commit_zap_page() to
split kvm_mmu_zap_page() function, then we can:
- traverse hlist safely
- easily to gather remote tlb flush which occurs during page zapped
Those feature can be used in the later patches
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In kvm_mmu_unprotect_page(), the invalid sp can be skipped
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
According to SDM, we need check whether single-context INVVPID type is supported
before issuing invvpid instruction.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Reviewed-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Should use linux/uaccess.h instead of asm/uaccess.h
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The vmexit handler returns KVM_EXIT_UNKNOWN since there is no handler
for vmentry failures. This intercepts vmentry failures and returns
KVM_FAIL_ENTRY to userspace instead.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
There's no need to calculate quadrant if tdp is enabled.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Memory allocation may fail. Propagate such errors.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Must set VCPU register to be the guest notion of EFER even if that
setting is not valid on hardware. This was masked by the set in
set_efer until 7657fd5ace88e8092f5f3a84117e093d7b893f26 broke that.
Fix is simply to set the VCPU register before stripping bits.
Signed-off-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In Documentation/kvm/mmu.txt:
gfn:
Either the guest page table containing the translations shadowed by this
page, or the base page frame for linear translations. See role.direct.
But in function FNAME(fetch)(), sp->gfn is incorrect when one of following
situations occurred:
1) guest is 32bit paging and the guest PDE maps a 4-MByte page
(backed by 4k host pages), FNAME(fetch)() miss handling the quadrant.
And if guest use pse-36, "table_gfn = gpte_to_gfn(gw->ptes[level - delta]);"
is incorrect.
2) guest is long mode paging and the guest PDPTE maps a 1-GByte page
(backed by 4k or 2M host pages).
So we fix it to suit to the document and suit to the code which
requires sp->gfn correct when sp->role.direct=1.
We use the goal mapping gfn(gw->gfn) to calculate the base page frame
for linear translations, it is simple and easy to be understood.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Reported-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In Document/kvm/mmu.txt:
gfn:
Either the guest page table containing the translations shadowed by this
page, or the base page frame for linear translations. See role.direct.
But in __direct_map(), the base gfn calculation is incorrect,
it does not calculate correctly when level=3 or 4.
Fix by using PT64_LVL_ADDR_MASK() which accounts for all levels correctly.
Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When sp->role.direct is set, sp->gfns does not contain any essential
information, leaf sptes reachable from this sp are for a continuous
guest physical memory range (a linear range).
So sp->gfns[i] (if it was set) equals to sp->gfn + i. (PT_PAGE_TABLE_LEVEL)
Obviously, it is not essential information, we can calculate it when need.
It means we don't need sp->gfns when sp->role.direct=1,
Thus we can save one page usage for every kvm_mmu_page.
Note:
Access to sp->gfns must be wrapped by kvm_mmu_page_get_gfn()
or kvm_mmu_page_set_gfn().
It is only exposed in FNAME(sync_page).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
For the sake of completeness, this patch adds a symbolic
constant for VMX exit reason 0x21 (invalid guest state).
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Allow more page become asynchronous at getting sp time, if need create new
shadow page for gfn but it not allow unsync(level > 1), we should unsync all
gfn's unsync page
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
In current code, shadow page can become asynchronous only if one
shadow page for a gfn, this rule is too strict, in fact, we can
let all last mapping page(i.e, it's the pte page) become unsync,
and sync them at invlpg or flush tlb time.
This patch allow more page become asynchronous at gfn mapping time
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
On SVM interrupts are injected by svm_set_irq() not svm_inject_irq().
The later is used only to wait for irq window.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Only unsync pages need updated at invlpg time since other shadow
pages are write-protected
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Two cases maybe happen in kvm_mmu_get_page() function:
- one case is, the goal sp is already in cache, if the sp is unsync,
we only need update it to assure this mapping is valid, but not
mark it sync and not write-protect sp->gfn since it not broke unsync
rule(one shadow page for a gfn)
- another case is, the goal sp not existed, we need create a new sp
for gfn, i.e, gfn (may)has another shadow page, to keep unsync rule,
we should sync(mark sync and write-protect) gfn's unsync shadow page.
After enabling multiple unsync shadows, we sync those shadow pages
only when the new sp not allow to become unsync(also for the unsyc
rule, the new rule is: allow all pte page become unsync)
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Split kvm_sync_page() into kvm_sync_page() and kvm_sync_page_transient()
to clarify the code address Avi's suggestion
kvm_sync_page_transient() function only update shadow page but not mark
it sync and not write protect sp->gfn. it will be used by later patch
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We can avoid unnecessary fpu load when userspace process
didn't use FPU frequently.
Derived from Avi's idea.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Now that all arch specific ioctls have centralized locking, it is easy to
move it to the central dispatcher.
Signed-off-by: Avi Kivity <avi@redhat.com>
All vcpu ioctls need to be locked, so instead of locking each one specifically
we lock at the generic dispatcher.
This patch only updates generic ioctls and leaves arch specific ioctls alone.
Signed-off-by: Avi Kivity <avi@redhat.com>
Remove rmap before clear spte otherwise it will trigger BUG_ON() in
some functions such as rmap_write_protect().
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Only modifying some bits of CR0/CR4 needs paging mode switch.
Modify EFER.NXE bit would result in reserved bit updates.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
mmu.free() already set root_hpa to INVALID_PAGE, no need to do it again in the
destory_kvm_mmu().
kvm_x86_ops->set_cr4() and set_efer() already assign cr4/efer to
vcpu->arch.cr4/efer, no need to do it again later.
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This adds missing decoder flags for sub instructions (opcodes 0x2c - 0x2d)
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This adds test acc, imm instruction to the x86 emulator
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
SDM suggests VMXON should be called before VMPTRLD, and VMXOFF
should be called after doing VMCLEAR.
Therefore in vmm coexistence case, we should firstly call VMXON
before any VMCS operation, and then call VMXOFF after the
operation is done.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Originally VMCLEAR/VMPTRLD is called on vcpu migration. To
support hosted VMM coexistance, VMCLEAR is executed on vcpu
schedule out, and VMPTRLD is executed on vcpu schedule in.
This could also eliminate the IPI when doing VMCLEAR.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Do some preparations for vmm coexistence support.
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Define vmcs_load() and kvm_cpu_vmxon() to avoid direct call of asm
code. Also move VMXE bit operation out of kvm_cpu_vmxoff().
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We drop the mmu lock between freeing memory and allocating the roots; this
allows some other vcpu to sneak in and allocate memory.
While the race is benign (resulting only in temporary overallocation, not oom)
it is simple and easy to fix by moving the freeing close to the allocation.
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Do not kill VM when instruction emulation fails. Inject #UD and report
failure to userspace instead. Userspace may choose to reenter guest if
vcpu is in userspace (cpl == 3) in which case guest OS will kill
offending process and continue running.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently, kvm_mmu_zap_page() returning the number of freed children sp.
This might confuse the caller, because caller don't know the actual freed
number. Let's make kvm_mmu_zap_page() return the number of pages it actually
freed.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Sometime cmpxchg_gpte doesn't modify gpte, in such case, don't mark
page table page as dirty.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch enables setting of efer bit 13 which is allowed
in all SVM capable processors. This is necessary for the
SLES11 version of Xen 4.0 to boot with nested svm.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds a function to dump the vmcb into the kernel
log and calls it after a failed vmrun to ease debugging.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM_REQ_KICK poisons vcpu->requests by having a bit set during normal
operation. This causes the fast path check for a clear vcpu->requests
to fail all the time, triggering tons of atomic operations.
Fix by replacing KVM_REQ_KICK with a vcpu->guest_mode atomic.
Signed-off-by: Avi Kivity <avi@redhat.com>
Return exception as a result of instruction emulation and handle
injection in KVM code.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
emulate_sysexit() should use shadowed registers copy instead of
looking into vcpu state directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Return new RIP as part of instruction emulation result instead of
updating KVM's RIP from x86 emulator code.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Return error to x86 emulator instead of injection exception behind its back.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It is not called directly outside of the file it's defined in anymore.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently emulator returns -1 when emulation failed or IO is needed.
Caller tries to guess whether emulation failed by looking at other
variables. Make it easier for caller to recognise error condition by
always returning -1 in case of failure. For this new emulator
internal return value X86EMUL_IO_NEEDED is introduced. It is used to
distinguish between error condition (which returns X86EMUL_UNHANDLEABLE)
and condition that requires IO exit to userspace to continue emulation.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Fill in run->mmio details in (read|write)_emulated function just like
pio does. There is no point in filling only vcpu fields there just to
copy them into vcpu->run a little bit later.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Currently X86EMUL_PROPAGATE_FAULT, X86EMUL_RETRY_INSTR and
X86EMUL_CMPXCHG_FAILED have the same value so caller cannot
distinguish why function such as emulator_cmpxchg_emulated()
(which can return both X86EMUL_PROPAGATE_FAULT and
X86EMUL_CMPXCHG_FAILED) failed.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Make (get|set)_dr() callback return error if it fails instead of
injecting exception behind emulator's back.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Make set_cr() callback return error if it fails instead of injecting #GP
behind emulator's back.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use callbacks from x86_emulate_ops to access segments instead of calling
into kvm directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
On VMX it is expensive to call get_cached_descriptor() just to get segment
base since multiple vmcs_reads are done instead of only one. Introduce
new call back get_cached_segment_base() for efficiency.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Add (set|get)_msr callbacks to x86_emulate_ops instead of calling
them directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Add (set|get)_dr callbacks to x86_emulate_ops instead of calling
them directly.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
ljmp/lcall instruction operand contains address and segment.
It can be 10 bytes long. Currently we decode it as two different
operands. Fix it by introducing new kind of operand that can hold
entire far address.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Dst operand is already initialized during decoding stage. No need to
reinitialize.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This instruction does not need generic decoding for its dst operand.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce read cache which is needed for instruction that require more
then one exit to userspace. After returning from userspace the instruction
will be re-executed with cached read value.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
cr0.ts may change between entries, so we copy cr0 to HOST_CR0 before each
entry. That is slow, so instead, set HOST_CR0 to have TS set unconditionally
(which is a safe value), and issue a clts() just before exiting vcpu context
if the task indeed owns the fpu.
Saves ~50 cycles/exit.
Signed-off-by: Avi Kivity <avi@redhat.com>
Although we always allocate a new dirty bitmap in x86's get_dirty_log(),
it is only used as a zero-source of copy_to_user() and freed right after
that when memslot is clean. This patch uses clear_user() instead of doing
this unnecessary zero-source allocation.
Performance improvement: as we can expect easily, the time needed to
allocate a bitmap is completely reduced. In my test, the improved ioctl
was about 4 to 10 times faster than the original one for clean slots.
Furthermore, reducing memory allocations and copies will produce good
effects to caches too.
Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
In common cases, guest SRAO MCE will cause corresponding poisoned page
be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
the MCE to guest OS.
But it is reported that if the poisoned page is accessed in guest
after unmapping and before MCE is relayed to guest OS, userspace will
be killed.
The reason is as follows. Because poisoned page has been un-mapped,
guest access will cause guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault can not get the poisoned page for fault
address, so kernel and user space MMIO processing is tried in turn. In
user MMIO processing, poisoned page is accessed again, then userspace
is killed by force_sig_info.
To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
and do not try kernel and user space MMIO processing for poisoned
page.
[xiao: fix warning introduced by avi]
Reported-by: Max Asbock <masbock@linux.vnet.ibm.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
HW breakpoints events stopped working correctly with kgdb
as a result of commit: 018cbffe68
(Merge commit 'v2.6.33' into perf/core).
The regression occurred because the behavior changed for setting
NOTIFY_STOP as the return value to the die notifier if the breakpoint
was known to the HW breakpoint API. Because kgdb is using the HW
breakpoint API to register HW breakpoints slots, it must also now
implement the overflow_handler call back else kgdb does not get to see
the events from the die notifier.
The kgdb_ll_trap function will be changed to be general purpose code
which can allow an easy way to implement the hw_breakpoint API
overflow call back.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Acked-by: Dongdong Deng <dongdong.deng@windriver.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Do not try to disable hpet if it hasn't been initialized before
x86, i8259: Only register sysdev if we have a real 8259 PIC
The Pstate transition latency check was added for broken F10h BIOSen
which wrongly contain a value of 0 for transition and bus master
latency. Fam11h and later, however, (will) have similar transition
latency so extend that behavior for them too.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Dave Jones <davej@redhat.com>
The PCC cpufreq driver unmaps the mailbox address range if any CPUs fail to
initialise, but doesn't do anything to remove the registered CPUs from the
cpufreq core resulting in failures further down the line. We're better off
simply returning a failure - the cpufreq core will unregister us cleanly if
we end up with no successfully registered CPUs. Tidy up the failure path
and also add a sanity check to ensure that the firmware gives us a realistic
frequency - the core deals badly with that being set to 0.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Signed-off-by: Dave Jones <davej@redhat.com>
The pcc specification documents an _OSC method that's incompatible with the
one defined as part of the ACPI spec. This shouldn't be a problem as both
are supposed to be guarded with a UUID. Unfortunately approximately nobody
(including HP, who wrote this spec) properly check the UUID on entry to the
_OSC call. Right now this could result in surprising behaviour if the pcc
driver performs an _OSC call on a machine that doesn't implement the pcc
specification. Check whether the PCCH method exists first in order to reduce
this probability.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Signed-off-by: Dave Jones <davej@redhat.com>
* 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: Use kmalloc() instead of vmalloc() for KVM_[GS]ET_MSR
KVM: MMU: fix conflict access permissions in direct sp
Commit 2a6b69765a
(ACPI: Store NVS state even when entering suspend to RAM) caused the
ACPI suspend code save the NVS area during suspend and restore it
during resume unconditionally, although it is known that some systems
need to use acpi_sleep=s4_nonvs for hibernation to work. To allow
the affected systems to avoid saving and restoring the NVS area
during suspend to RAM and resume, introduce kernel command line
option acpi_sleep=nonvs and make acpi_sleep=s4_nonvs work as its
alias temporarily (add acpi_sleep=s4_nonvs to the feature removal
file).
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16396 .
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reported-and-tested-by: tomas m <tmezzadra@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
hpet_disable is called unconditionally on machine reboot if hpet support
is compiled in the kernel.
hpet_disable only checks if the machine is hpet capable but doesn't make
sure that hpet has been initialized.
[ tglx: Made it a one liner and removed the redundant hpet_address check ]
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Venkatesh Pallipadi <venki@google.com>
LKML-Reference: <alpine.DEB.2.00.1007211726240.22235@kaball-desktop>
Cc: stable@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In no-direct mapping, we mark sp is 'direct' when we mapping the
guest's larger page, but its access is encoded form upper page-struct
entire not include the last mapping, it will cause access conflict.
For example, have this mapping:
[W]
/ PDE1 -> |---|
P[W] | | LPA
\ PDE2 -> |---|
[R]
P have two children, PDE1 and PDE2, both PDE1 and PDE2 mapping the
same lage page(LPA). The P's access is WR, PDE1's access is WR,
PDE2's access is RO(just consider read-write permissions here)
When guest access PDE1, we will create a direct sp for LPA, the sp's
access is from P, is W, then we will mark the ptes is W in this sp.
Then, guest access PDE2, we will find LPA's shadow page, is the same as
PDE's, and mark the ptes is RO.
So, if guest access PDE1, the incorrect #PF is occured.
Fixed by encode the last mapping access into direct shadow page
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It turns out that there is a bit in the _CST for Intel FFH C3
that tells the OS if we should be checking BM_STS or not.
Linux has been unconditionally checking BM_STS.
If the chip-set is configured to enable BM_STS,
it can retard or completely prevent entry into
deep C-states -- as illustrated by turbostat:
http://userweb.kernel.org/~lenb/acpi/utils/pmtools/turbostat/
ref: Intel Processor Vendor-Specific ACPI Interface Specification
table 4 "_CST FFH GAS Field Encoding"
Bit 1: Set to 1 if OSPM should use Bus Master avoidance for this C-state
https://bugzilla.kernel.org/show_bug.cgi?id=15886
Signed-off-by: Len Brown <len.brown@intel.com>
In the CONFIG_AUDITSYSCALL fast-path for x86 64-bit system calls,
we can pass a bad return value and/or error indication for the
system call to audit_syscall_exit(). This happens when
TIF_NEED_RESCHED was set as the system call returned, so we went
out to schedule() and came back to the exit-audit fast-path. The
fix is to reload the user return value register from the pt_regs
before using it for audit_syscall_exit().
Both the 32-bit kernel's fast path and the 64-bit kernel's 32-bit
system call fast paths work slightly differently, so that they
always leave the fast path entirely to reschedule and don't return
there, so they don't have the analogous bugs.
Reported-by: Alexander Viro <aviro@redhat.com>
Signed-off-by: Roland McGrath <roland@redhat.com>
Pointed out by Lucas who found the new one in a comment in
setup_percpu.c. And then I fixed the others that I grepped
for.
Reported-by: Lucas <canolucas@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e534c7c5f8 ("numa: x86_64: use generic percpu var
numa_node_id() implementation") broke numa systems that don't have ram
on node0 when MEMORY_HOTPLUG is enabled, because cpu_up() will call
cpu_to_node() before per_cpu(numa_node) is setup for APs.
When Node0 doesn't have RAM, on x86, cpus already round it to nearest
node with RAM in x86_cpu_to_node_map. and per_cpu(numa_node) is not set
up until in c_init for APs.
When later cpu_up() calling cpu_to_node() will get 0 again, and make it
online even there is no RAM on node0. so later all APs can not booted up,
and later will have panic.
[ 1.611101] On node 0 totalpages: 0
.........
[ 2.608558] On node 0 totalpages: 0
[ 2.612065] Brought up 1 CPUs
[ 2.615199] Total of 1 processors activated (3990.31 BogoMIPS).
...
93.225341] calling loop_init+0x0/0x1a4 @ 1
[ 93.229314] PERCPU: allocation failed, size=80 align=8, failed to populate
[ 93.246539] Pid: 1, comm: swapper Tainted: G W 2.6.35-rc4-tip-yh-04371-gd64e6c4-dirty #354
[ 93.264621] Call Trace:
[ 93.266533] [<ffffffff81125e43>] pcpu_alloc+0x83a/0x8e7
[ 93.270710] [<ffffffff81125f15>] __alloc_percpu+0x10/0x12
[ 93.285849] [<ffffffff8140786c>] alloc_disk_node+0x94/0x16d
[ 93.291811] [<ffffffff81407956>] alloc_disk+0x11/0x13
[ 93.306157] [<ffffffff81503e51>] loop_alloc+0xa7/0x180
[ 93.310538] [<ffffffff8277ef48>] loop_init+0x9b/0x1a4
[ 93.324909] [<ffffffff8277eead>] ? loop_init+0x0/0x1a4
[ 93.329650] [<ffffffff810001f2>] do_one_initcall+0x57/0x136
[ 93.345197] [<ffffffff827486d0>] kernel_init+0x184/0x20e
[ 93.348146] [<ffffffff81034954>] kernel_thread_helper+0x4/0x10
[ 93.365194] [<ffffffff81c7cc3c>] ? restore_args+0x0/0x30
[ 93.369305] [<ffffffff8274854c>] ? kernel_init+0x0/0x20e
[ 93.386011] [<ffffffff81034950>] ? kernel_thread_helper+0x0/0x10
[ 93.392047] loop: out of memory
...
Try to assign per_cpu(numa_node) early
[akpm@linux-foundation.org: tidy up code comment]
Signed-off-by: Yinghai <yinghai@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My platform makes use of the null_legacy_pic choice and oopses when doing
a shutdown as the shutdown code goes through all the registered sysdevs
and calls their shutdown method which in my case poke on a non-existing
i8259. Imho the i8259 specific sysdev should only be registered if the
i8259 is actually there.
Do not register the sysdev function when the null_legacy_pic is used so
that the i8259 resume, suspend and shutdown functions are not called.
Signed-off-by: Adam Lackorzynski <adam@os.inf.tu-dresden.de>
LKML-Reference: <201007202218.o6KMIJ3m020955@imap1.linux-foundation.org>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
Cc: <stable@kernel.org> 2.6.34
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
pavel@suse.cz no longer works, replace it with working address.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, pci, mrst: Add extra sanity check in walking the PCI extended cap chain
x86: Fix x2apic preenabled system with kexec
x86: Force HPET readback_cmp for all ATI chipsets
In commit f007ea26, the order of the %es and %ds segment registers
got accidentally swapped, so synthesized 'struct pt_regs' frames
have the two values inverted. It's almost sure that these values
never matter, and that they also never differ. But wrong is wrong.
Signed-off-by: Roland McGrath <roland@redhat.com>
The fixed bar capability structure is searched in PCI extended
configuration space. We need to make sure there is a valid capability
ID to begin with otherwise, the search code may stuck in a infinite
loop which results in boot hang. This patch adds additional check for
cap ID 0, which is also invalid, and indicates end of chain.
End of chain is supposed to have all fields zero, but that doesn't
seem to always be the case in the field.
Suggested-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
LKML-Reference: <1279306706-27087-1-git-send-email-jacob.jun.pan@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Found one x2apic system kexec loop test failed
when CONFIG_NMI_WATCHDOG=y (old) or CONFIG_LOCKUP_DETECTOR=y (current tip)
first kernel can kexec second kernel, but second kernel can not kexec third one.
it can be duplicated on another system with BIOS preenabled x2apic.
First kernel can not kexec second kernel.
It turns out, when kernel boot with pre-enabled x2apic, it will not execute
disable_local_APIC on shutdown path.
when init_apic_mappings() is called in setup_arch, it will skip setting of
apic_phys when x2apic_mode is set. ( x2apic_mode is much early check_x2apic())
Then later, disable_local_APIC() will bail out early because !apic_phys.
So check !x2apic_mode in x2apic_mode in disable_local_APIC with !apic_phys.
another solution could be updating init_apic_mappings() to set apic_phys even
for preenabled x2apic system. Actually even for x2apic system, that lapic
address is mapped already in early stage.
BTW: is there any x2apic preenabled system with apicid of boot cpu > 255?
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4C3EB22B.3000701@kernel.org>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: stable@kernel.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If we fail to assign resources to a PCI BAR, this patch makes us try the
original address from BIOS rather than leaving it disabled.
Linux tries to make sure all PCI device BARs are inside the upstream
PCI host bridge or P2P bridge apertures, reassigning BARs if necessary.
Windows does similar reassignment.
Before this patch, if we could not move a BAR into an aperture, we left
the resource unassigned, i.e., at address zero. Windows leaves such BARs
at the original BIOS addresses, and this patch makes Linux do the same.
This is a bit ugly because we disable the resource long before we try to
reassign it, so we have to keep track of the BIOS BAR address somewhere.
For lack of a better place, I put it in the struct pci_dev.
I think it would be cleaner to attempt the assignment immediately when the
claim fails, so we could easily remember the original address. But we
currently claim motherboard resources in the middle, after attempting to
claim PCI resources and before assigning new PCI resources, and changing
that is a fairly big job.
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=16263
Reported-by: Andrew <nitr0@seti.kr.ua>
Tested-by: Andrew <nitr0@seti.kr.ua>
Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
commit 30a564be (x86, hpet: Restrict read back to affected ATI
chipset) restricted the workaround for the HPET bug to SMX00
chipsets. This was reasonable as those were the only ones against
which we ever got a bug report.
Stephan Wolf reported now that this patch breaks his IXP400 based
machine. Though it's confirmed to work on other IXP400 based systems.
To error out on the safe side, we force the HPET readback workaround
for all ATI SMbus class chipsets.
Reported-by: Stephan Wolf <stephan@letzte-bankreihe.de>
LKML-Reference: <alpine.LFD.2.00.1007142134140.3321@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Stephan Wolf <stephan@letzte-bankreihe.de>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
input: i8042 - add runtime check in x86's i8042_platform_init
Revert "Input: fixup X86_MRST selects"
Revert "Input: do not force selecting i8042 on Moorestown"
x86, mrst: Add i8042_detect API for Moorestwon platform
x86: Add i8042 pre-detection hook to x86_platform_ops
x86, platform: Export x86_platform to modules
After remove a rmap, we should flush all vcpu's tlb
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It will just return 0 as there is no i8042 controller
Signed-off-by: Feng Tang <feng.tang@intel.com>
LKML-Reference: <1278342202-10973-3-git-send-email-feng.tang@intel.com>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Some x86 platforms like Intel MID platforms don't have i8042 controllers,
and i8042 driver's probe to some legacy IO ports may hang the MID
processor. With this hook, i8042 driver can runtime check and skip the
probe when the pretection fail which also saves some probe time
[ hpa note: this is currently a compile-time check, which breaks the
i386 allyesconfig build. This patch series thus does fix a regression. ]
Signed-off-by: Feng Tang <feng.tang@intel.com>
LKML-Reference: <1278342202-10973-2-git-send-email-feng.tang@intel.com>
Acked-by: Dmitry Torokhov <dtor@mail.ru>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Export x86_platform to modules in preparation of using it for i8042
discovery control.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
LKML-Reference: <1278342202-10973-1-git-send-email-feng.tang@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Dmitry Torokhov <dtor@mail.ru>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
rbtree: Undo augmented trees performance damage and regression
x86, Calgary: Limit the max PHB number to 256
enter_lmode() and exit_lmode() modify the guest's EFER.LMA before calling
vmx_set_efer(). However, the latter function depends on the value of EFER.LMA
to determine whether MSR_KERNEL_GS_BASE needs reloading, via
vmx_load_host_state(). With EFER.LMA changing under its feet, it took the
wrong choice and corrupted userspace's %gs.
This causes 32-on-64 host userspace to fault.
Fix not touching EFER.LMA; instead ask vmx_set_efer() to change it.
Signed-off-by: Avi Kivity <avi@redhat.com>
Reimplement augmented RB-trees without sprinkling extra branches
all over the RB-tree code (which lives in the scheduler hot path).
This approach is 'borrowed' from Fabio's BFQ implementation and
relies on traversing the rebalance path after the RB-tree-op to
correct the heap property for insertion/removal and make up for
the damage done by the tree rotations.
For insertion the rebalance path is trivially that from the new
node upwards to the root, for removal it is that from the deepest
node in the path from the to be removed node that will still
be around after the removal.
[ This patch also fixes a video driver regression reported by
Ali Gholami Rudi - the memtype->subtree_max_end was updated
incorrectly. ]
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Ali Gholami Rudi <ali@rudi.ir>
Cc: Fabio Checconi <fabio@gandalf.sssup.it>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1275414172.27810.27961.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf, x86: Fix incorrect branches event on AMD CPUs
perf tools: Fix find tids routine by excluding "." and ".."
x86: Send a SIGTRAP for user icebp traps
While doing some performance counter validation tests on some
assembly language programs I noticed that the "branches:u"
count was very wrong on AMD machines.
It looks like the wrong event was selected.
Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: <stable@kernel.org>
LKML-Reference: <alpine.DEB.2.00.1007011526010.23160@cl320.eecs.utk.edu>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch removes the CONFIG_MCORE2 check from around NET_IP_ALIGN. It is
based on a suggestion from Andi Kleen. The assumption is that there are
not any x86 cores where unaligned access is really slow, and this change
would allow for a performance improvement to still exist on configurations
that are not necessarily optimized for Core 2.
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The x3950 family can have as many as 256 PCI buses in a single system, so
change the limits to the maximum. Since there can only be 256 PCI buses in one
domain, we no longer need the BUG_ON check.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
LKML-Reference: <20100701004519.GQ15515@tux1.beaverton.ibm.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
x86 architectures can handle unaligned accesses in hardware, and it has
been shown that unaligned DMA accesses can be expensive on Nehalem
architectures. As such we should overwrite NET_IP_ALIGN to resolve
this issue.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Before we had a generic breakpoint layer, x86 used to send a
sigtrap for any debug event that happened in userspace,
except if it was caused by lazy dr7 switches.
Currently we only send such signal for single step or breakpoint
events.
However, there are three other kind of debug exceptions:
- debug register access detected: trigger an exception if the
next instruction touches the debug registers. We don't use
it.
- task switch, but we don't use tss.
- icebp/int01 trap. This instruction (0xf1) is undocumented and
generates an int 1 exception. Unlike single step through TF
flag, it doesn't set the single step origin of the exception
in dr6.
icebp then used to be reported in userspace using trap signals
but this have been incidentally broken with the new breakpoint
code. Reenable this. Since this is the only debug event that
doesn't set anything in dr6, this is all we have to check.
This fixes a regression in Wine where World Of Warcraft got broken
as it uses this for software protection checks purposes. And
probably other apps do.
Reported-and-tested-by: Alexandre Julliard <julliard@winehq.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: 2.6.33.x 2.6.34.x <stable@kernel.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, Calgary: Increase max PHB number
x86: Fix rebooting on Dell Precision WorkStation T7400
x86: Fix vsyscall on gcc 4.5 with -Os
x86, pat: Proper init of memtype subtree_max_end
um, hweight: Fix UML boot crash due to x86 optimized hweight
x86, setup: Set ax register in boot vga query
percpu, x86: Avoid warnings of unused variables in per cpu
x86, irq: Rename gsi_end gsi_top, and fix off by one errors
x86: use __ASSEMBLY__ rather than __ASSEMBLER__
Newer systems (x3950M2) can have 48 PHBs per chassis and 8
chassis, so bump the limits up and provide an explanation
of the requirements for each class.
Signed-off-by: Darrick J. Wong <djwong@us.ibm.com>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
Cc: Corinna Schultz <cschultz@linux.vnet.ibm.com>
Cc: <stable@kernel.org>
LKML-Reference: <20100624212647.GI15515@tux1.beaverton.ibm.com>
[ v2: Fixed build bug, added back PHBS_PER_CALGARY == 4 ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This fixes the -Os breaks with gcc 4.5 bug. rdtsc_barrier needs to be
force inlined, otherwise user space will jump into kernel space and
kill init.
This also addresses http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44129
I believe.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <20100618210859.GA10913@basil.fritz.box>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@kernel.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6:
PCI: clear bridge resource range if BIOS assigned bad one
PCI: hotplug/cpqphp, fix NULL dereference
Revert "PCI: create function symlinks in /sys/bus/pci/slots/N/"
PCI: change resource collision messages from KERN_ERR to KERN_INFO
subtree_max_end that was recently added to struct memtype was not getting
properly initialized resulting in
WARNING: kmemcheck: Caught 64-bit read from uninitialized memory
in memtype_rb_augment_cb()
reported here
https://bugzilla.kernel.org/show_bug.cgi?id=16092
This change fixes the problem.
Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Tested-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Venkatesh Pallipadi <venki@google.com>
LKML-Reference: <1276217101-11515-1-git-send-email-venki@google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Yannick found that video does not work with 2.6.34. The cause of this
bug was that the BIOS had assigned the wrong range to the PCI bridge
above the video device. Before 2.6.34 the kernel would have shrunk
the size of the bridge window, but since
d65245c PCI: don't shrink bridge resources
the kernel will avoid shrinking BIOS ranges.
So zero out the old range if we fail to claim it at boot time; this will
cause us to allocate a new range at startup, restoring the 2.6.34
behavior.
Fixes regression https://bugzilla.kernel.org/show_bug.cgi?id=16009.
Reported-by: Yannick <yannick.roehlly@free.fr>
Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Catch missing conversion to the register structure "glove box" scheme.
Found by gcc 4.6's new warnings.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <20100610111040.F1781B1A2B@basil.firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Saving platform non-volatile state may be required for suspend to RAM as
well as hibernation. Move it to more generic code.
Signed-off-by: Matthew Garrett <mjg@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: Maxim Levitsky <maximlevitsky@gmail.com>
Signed-off-by: Len Brown <len.brown@intel.com>
When I introduced the global variable gsi_end I thought gsi_end on
io_apics was one past the end of the gsi range for the io_apic. After
it was pointed out the the range on io_apics was inclusive I changed
my global variable to match. That was a big mistake. Inclusive
semantics without a range start cannot describe the case when no gsi's
are allocated. Describing the case where no gsi's are allocated is
important in sfi.c and mpparse.c so that we can assign gsi numbers
instead of blindly copying the gsi assignments the BIOS has done as we
do in the acpi case.
To keep from getting the global variable confused with the gsi range
end rename it gsi_top.
To allow describing the case where no gsi's are allocated have gsi_top
be one place the highest gsi number seen in the system.
This fixes an off by one bug in sfi.c:
Reported-by: jacob pan <jacob.jun.pan@linux.intel.com>
This fixes the same off by one bug in mpparse.c:
This fixes an off unreachable by one bug in acpi/boot.c:irq_to_gsi
Reported-by: Yinghai <yinghai.lu@oracle.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
LKML-Reference: <m17hm9jre7.fsf_-_@fess.ebiederm.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If cr0.wp=0, we have to allow the guest kernel access to a page with pte.w=0.
We do that by setting spte.w=1, since the host cr0.wp must remain set so the
host can write protect pages. Once we allow write access, we must remove
user access otherwise we mistakenly allow the user to write the page.
Reviewed-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Always invalidate spte and flush TLBs when changing page size, to make
sure different sized translations for the same address are never cached
in a CPU's TLB.
Currently the only case where this occurs is when a non-leaf spte pointer is
overwritten by a leaf, large spte entry. This can happen after dirty
logging is disabled on a memslot, for example.
Noticed by Andrea.
KVM-Stable-Tag
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch implements a workaround for AMD erratum 383 into
KVM. Without this erratum fix it is possible for a guest to
kill the host machine. This patch implements the suggested
workaround for hypervisors which will be published by the
next revision guide update.
[jan: fix overflow warning on i386]
[xiao: fix unused variable warning]
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch moves handling of the MC vmexits to an earlier
point in the vmexit. The handle_exit function is too late
because the vcpu might alreadry have changed its physical
cpu.
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>