Commit Graph

10735 Commits

Author SHA1 Message Date
Ingo Molnar dfacc4d6c9 Merge branch 'perf/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core 2010-05-20 14:38:55 +02:00
Huang Ying 482908b49e ACPI, APEI, Use ERST for persistent storage of MCE
Traditionally, fatal MCE will cause Linux print error log to console
then reboot. Because MCE registers will preserve their content after
warm reboot, the hardware error can be logged to disk or network after
reboot. But system may fail to warm reboot, then you may lose the
hardware error log. ERST can help here. Through saving the hardware
error log into flash via ERST before go panic, the hardware error log
can be gotten from the flash after system boot successful again.

The fatal MCE processing procedure with ERST involved is as follow:

- Hardware detect error, MCE raised
- MCE read MCE registers, check error severity (fatal), prepare error record
- Write MCE error record into flash via ERST
- Go panic, then trigger system reboot
- System reboot, /sbin/mcelog run, it reads /dev/mcelog to check flash
  for error record of previous boot via ERST, and output and clear
  them if available
- /sbin/mcelog logs error records into disk or network

ERST only accepts CPER record format, but there is no pre-defined CPER
section can accommodate all information in struct mce, so a customized
section type is defined to hold struct mce inside a CPER record as an
error section.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2010-05-19 22:41:40 -04:00
Huang Ying d334a49113 ACPI, APEI, Generic Hardware Error Source memory error support
Generic Hardware Error Source provides a way to report platform
hardware errors (such as that from chipset). It works in so called
"Firmware First" mode, that is, hardware errors are reported to
firmware firstly, then reported to Linux by firmware. This way, some
non-standard hardware error registers or non-standard hardware link
can be checked by firmware to produce more valuable hardware error
information for Linux.

Now, only SCI notification type and memory errors are supported. More
notification type and hardware error type will be added later. These
memory errors are reported to user space through /dev/mcelog via
faking a corrected Machine Check, so that the error memory page can be
offlined by /sbin/mcelog if the error count for one page is beyond the
threshold.

On some machines, Machine Check can not report physical address for
some corrected memory errors, but GHES can do that. So this simplified
GHES is implemented firstly.

Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
2010-05-19 22:41:16 -04:00
Linus Torvalds 6fa0fddd5f Merge branch 'timers-for-linus-hpet' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus-hpet' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, hpet: Add reference to chipset erratum documentation for disable-hpet-msi-quirk
  x86, hpet: Restrict read back to affected ATI chipsets
2010-05-19 17:10:24 -07:00
Linus Torvalds 7d02093e29 Merge branch 'timers-for-linus-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'timers-for-linus-cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  avr32: Fix typo in read_persistent_clock()
  sparc: Convert sparc to use read/update_persistent_clock
  cris: Convert cris to use read/update_persistent_clock
  m68k: Convert m68k to use read/update_persistent_clock
  m32r: Convert m32r to use read/update_peristent_clock
  blackfin: Convert blackfin to use read/update_persistent_clock
  ia64: Convert ia64 to use read/update_persistent_clock
  avr32: Convert avr32 to use read/update_persistent_clock
  h8300: Convert h8300 to use read/update_persistent_clock
  frv: Convert frv to use read/update_persistent_clock
  mn10300: Convert mn10300 to use read/update_persistent_clock
  alpha: Convert alpha to use read/update_persistent_clock
  xtensa: Fix unnecessary setting of xtime
  time: Clean up direct xtime usage in xen
2010-05-19 17:10:06 -07:00
Avi Kivity 8fbf065d62 KVM: x86: Add missing locking to arch specific vcpu ioctls
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:41:11 +03:00
Avi Kivity 3dbe141595 KVM: MMU: Segregate shadow pages with different cr0.wp
When cr0.wp=0, we may shadow a gpte having u/s=1 and r/w=0 with an spte
having u/s=0 and r/w=1.  This allows excessive access if the guest sets
cr0.wp=1 and accesses through this spte.

Fix by making cr0.wp part of the base role; we'll have different sptes for
the two cases and the problem disappears.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:09 +03:00
Sheng Yang a3d204e285 KVM: x86: Check LMA bit before set_efer
kvm_x86_ops->set_efer() would execute vcpu->arch.efer = efer, so the
checking of LMA bit didn't work.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:09 +03:00
Avi Kivity f78e917688 KVM: Don't allow lmsw to clear cr0.pe
The current lmsw implementation allows the guest to clear cr0.pe, contrary
to the manual, which breaks EMM386.EXE.

Fix by ORing the old cr0.pe with lmsw's operand.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:08 +03:00
Glauber Costa 371bcf646d KVM: x86: Tell the guest we'll warn it about tsc stability
This patch puts up the flag that tells the guest that we'll warn it
about the tsc being trustworthy or not. By now, we also say
it is not.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:06 +03:00
Glauber Costa 3a0d7256a6 x86, paravirt: don't compute pvclock adjustments if we trust the tsc
If the HV told us we can fully trust the TSC, skip any
correction

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:05 +03:00
Glauber Costa 838815a787 x86: KVM guest: Try using new kvm clock msrs
We now added a new set of clock-related msrs in replacement of the old
ones. In theory, we could just try to use them and get a return value
indicating they do not exist, due to our use of kvm_write_msr_save.

However, kvm clock registration happens very early, and if we ever
try to write to a non-existant MSR, we raise a lethal #GP, since our
idt handlers are not in place yet.

So this patch tests for a cpuid feature exported by the host to
decide which set of msrs are supported.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:04 +03:00
Glauber Costa 84478c829d KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
Right now, we were using individual KVM_CAP entities to communicate
userspace about which cpuids we support. This is suboptimal, since it
generates a delay between the feature arriving in the host, and
being available at the guest.

A much better mechanism is to list para features in KVM_GET_SUPPORTED_CPUID.
This makes userspace automatically aware of what we provide. And if we
ever add a new cpuid bit in the future, we have to do that again,
which create some complexity and delay in feature adoption.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:03 +03:00
Glauber Costa 0e6ac58acb KVM: x86: add new KVMCLOCK cpuid feature
This cpuid, KVM_CPUID_CLOCKSOURCE2, will indicate to the guest
that kvmclock is available through a new set of MSRs. The old ones
are deprecated.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:02 +03:00
Glauber Costa 11c6bffa42 KVM: x86: change msr numbers for kvmclock
Avi pointed out a while ago that those MSRs falls into the pentium
PMU range. So the idea here is to add new ones, and after a while,
deprecate the old ones.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:01 +03:00
Glauber Costa 489fb490db x86, paravirt: Add a global synchronization point for pvclock
In recent stress tests, it was found that pvclock-based systems
could seriously warp in smp systems. Using ingo's time-warp-test.c,
I could trigger a scenario as bad as 1.5mi warps a minute in some systems.
(to be fair, it wasn't that bad in most of them). Investigating further, I
found out that such warps were caused by the very offset-based calculation
pvclock is based on.

This happens even on some machines that report constant_tsc in its tsc flags,
specially on multi-socket ones.

Two reads of the same kernel timestamp at approx the same time, will likely
have tsc timestamped in different occasions too. This means the delta we
calculate is unpredictable at best, and can probably be smaller in a cpu
that is legitimately reading clock in a forward ocasion.

Some adjustments on the host could make this window less likely to happen,
but still, it pretty much poses as an intrinsic problem of the mechanism.

A while ago, I though about using a shared variable anyway, to hold clock
last state, but gave up due to the high contention locking was likely
to introduce, possibly rendering the thing useless on big machines. I argue,
however, that locking is not necessary.

We do a read-and-return sequence in pvclock, and between read and return,
the global value can have changed. However, it can only have changed
by means of an addition of a positive value. So if we detected that our
clock timestamp is less than the current global, we know that we need to
return a higher one, even though it is not exactly the one we compared to.

OTOH, if we detect we're greater than the current time source, we atomically
replace the value with our new readings. This do causes contention on big
boxes (but big here means *BIG*), but it seems like a good trade off, since
it provide us with a time source guaranteed to be stable wrt time warps.

After this patch is applied, I don't see a single warp in time during 5 days
of execution, in any of the machines I saw them before.

Signed-off-by: Glauber Costa <glommer@redhat.com>
Acked-by: Zachary Amsden <zamsden@redhat.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
CC: Avi Kivity <avi@redhat.com>
CC: Marcelo Tosatti <mtosatti@redhat.com>
CC: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:41:00 +03:00
Glauber Costa 424c32f1aa x86, paravirt: Enable pvclock flags in vcpu_time_info structure
This patch removes one padding byte and transform it into a flags
field. New versions of guests using pvclock will query these flags
upon each read.

Flags, however, will only be interpreted when the guest decides to.
It uses the pvclock_valid_flags function to signal that a specific
set of flags should be taken into consideration. Which flags are valid
are usually devised via HV negotiation.

Signed-off-by: Glauber Costa <glommer@redhat.com>
CC: Jeremy Fitzhardinge <jeremy@goop.org>
Acked-by: Zachary Amsden <zamsden@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:40:59 +03:00
Roedel, Joerg b69e8caef5 KVM: x86: Inject #GP with the right rip on efer writes
This patch fixes a bug in the KVM efer-msr write path. If a
guest writes to a reserved efer bit the set_efer function
injects the #GP directly. The architecture dependent wrmsr
function does not see this, assumes success and advances the
rip. This results in a #GP in the guest with the wrong rip.
This patch fixes this by reporting efer write errors back to
the architectural wrmsr function.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:39 +03:00
Joerg Roedel 0d945bd935 KVM: SVM: Don't allow nested guest to VMMCALL into host
This patch disables the possibility for a l2-guest to do a
VMMCALL directly into the host. This would happen if the
l1-hypervisor doesn't intercept VMMCALL and the l2-guest
executes this instruction.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:38 +03:00
Joerg Roedel 3f0fd2927b KVM: x86: Fix exception reinjection forced to true
The patch merged recently which allowed to mark an exception
as reinjected has a bug as it always marks the exception as
reinjected. This breaks nested-svm shadow-on-shadow
implementation.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:37 +03:00
Avi Kivity 9ed3c444ab KVM: Fix wallclock version writing race
Wallclock writing uses an unprotected global variable to hold the version;
this can cause one guest to interfere with another if both write their
wallclock at the same time.

Acked-by: Glauber Costa <glommer@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:36 +03:00
Avi Kivity 8facbbff07 KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots
On svm, kvm_read_pdptr() may require reading guest memory, which can sleep.

Push the spinlock into mmu_alloc_roots(), and only take it after we've read
the pdptr.

Tested-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:35 +03:00
Shane Wang cafd66595d KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
Per document, for feature control MSR:

  Bit 1 enables VMXON in SMX operation. If the bit is clear, execution
        of VMXON in SMX operation causes a general-protection exception.
  Bit 2 enables VMXON outside SMX operation. If the bit is clear, execution
        of VMXON outside SMX operation causes a general-protection exception.

This patch is to enable this kind of check with SMX for VMXON in KVM.

Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:34 +03:00
Marcelo Tosatti f1d86e469b KVM: x86: properly update ready_for_interrupt_injection
The recent changes to emulate string instructions without entering guest
mode exposed a bug where pending interrupts are not properly reflected
in ready_for_interrupt_injection.

The result is that userspace overwrites a previously queued interrupt,
when irqchip's are emulated in userspace.

Fix by always updating state before returning to userspace.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:33 +03:00
Avi Kivity 84ad33ef5d KVM: VMX: Atomically switch efer if EPT && !EFER.NX
When EPT is enabled, we cannot emulate EFER.NX=0 through the shadow page
tables.  This causes accesses through ptes with bit 63 set to succeed instead
of failing a reserved bit check.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:32 +03:00
Avi Kivity 61d2ef2ce3 KVM: VMX: Add facility to atomically switch MSRs on guest entry/exit
Some guest msr values cannot be used on the host (for example. EFER.NX=0),
so we need to switch them atomically during guest entry or exit.

Add a facility to program the vmx msr autoload registers accordingly.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:31 +03:00
Avi Kivity 5dfa3d170e KVM: VMX: Add definitions for guest and host EFER autoswitch vmcs entries
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:31 +03:00
Avi Kivity 19b95dba03 KVM: VMX: Add definition for msr autoload entry
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:30 +03:00
Avi Kivity 0ee75bead8 KVM: Let vcpu structure alignment be determined at runtime
vmx and svm vcpus have different contents and therefore may have different
alignmment requirements.  Let each specify its required alignment.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-19 11:36:29 +03:00
Xiao Guangrong 884a0ff0b6 KVM: MMU: cleanup invlpg code
Using is_last_spte() to cleanup invlpg code

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:36:28 +03:00
Xiao Guangrong 5e1b3ddbf2 KVM: MMU: move unsync/sync tracpoints to proper place
Move unsync/sync tracepoints to the proper place, it's good
for us to obtain unsync page live time

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:36:27 +03:00
Xiao Guangrong 85f2067c31 KVM: MMU: convert mmu tracepoints
Convert mmu tracepoints by using DECLARE_EVENT_CLASS

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:36:26 +03:00
Xiao Guangrong 22c9b2d166 KVM: MMU: fix for calculating gpa in invlpg code
If the guest is 32-bit, we should use 'quadrant' to adjust gpa
offset

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:36:25 +03:00
Gui Jianfeng d35b8dd935 KVM: Fix mmu shrinker error
kvm_mmu_remove_one_alloc_mmu_page() assumes kvm_mmu_zap_page() only reclaims
only one sp, but that's not the case. This will cause mmu shrinker returns
a wrong number. This patch fix the counting error.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:36:23 +03:00
Eric Northup 5a7388c2d2 KVM: MMU: fix hashing for TDP and non-paging modes
For TDP mode, avoid creating multiple page table roots for the single
guest-to-host physical address map by fixing the inputs used for the
shadow page table hash in mmu_alloc_roots().

Signed-off-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-19 11:36:22 +03:00
Cyrill Gorcunov ce7f15452c perf, x86: P4 PMU -- add missing bit in CCCR mask
Should be there for the sake of RAW events.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Lin Ming <ming.m.lin@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100518212439.354345151@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-19 09:41:06 +02:00
Cyrill Gorcunov 9d36dfcf21 perf, x86: P4_pmu_schedule_events -- use smp_processor_id instead of raw_
This snippet somehow escaped the commit:

 | commit 137351e0fe
 | Author: Cyrill Gorcunov <gorcunov@openvz.org>
 | Date:   Sat May 8 15:25:52 2010 +0400
 |
 |    x86, perf: P4 PMU -- protect sensible procedures from preemption

so bring it eventually back. It helps to catch
preemption issue (if there will be, rule of thumb --
don't use raw_ if you can).

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Lin Ming <ming.m.lin@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100518212439.167259349@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-19 09:41:05 +02:00
Cyrill Gorcunov 623aab896e perf, x86: P4 PMU -- do a real check for ESCR address being in hash
To prevent from clashes in future code modifications
do a real check for ESCR address being in hash. At
moment the callers are known to pass sane values but
better to be on a safe side.

And comment fix.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Lin Ming <ming.m.lin@intel.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <20100518212439.004503600@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-19 09:41:05 +02:00
Feng Tang a02ce953a1 x86/PCI: make ACPI MCFG reserved error messages ACPI specific
Both ACPI and SFI firmwares will have MCFG space, but the error message
isn't valid on SFI, so don't print the message in that case.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2010-05-18 15:03:27 -07:00
Linus Torvalds 537b60d178 Merge branch 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, UV: uv_irq.c: Fix all sparse warnings
  x86, UV: Improve BAU performance and error recovery
  x86, UV: Delete unneeded boot messages
  x86, UV: Clean up UV headers for MMR definitions
2010-05-18 09:46:35 -07:00
Linus Torvalds 3ae684e1c4 Merge branch 'x86-txt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-txt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, tboot: Add support for S3 memory integrity protection
2010-05-18 09:28:24 -07:00
Linus Torvalds c4fd308ed6 Merge branch 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, pat: Update the page flags for memtype atomically instead of using memtype_lock
  x86, pat: In rbt_memtype_check_insert(), update new->type only if valid
  x86, pat: Migrate to rbtree only backend for pat memtype management
  x86, pat: Preparatory changes in pat.c for bigger rbtree change
  rbtree: Add support for augmented rbtrees
2010-05-18 09:28:04 -07:00
Linus Torvalds 96fbeb973a Merge branch 'x86-mrst-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-mrst-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, mrst: add nop functions to x86_init mpparse functions
  x86, mrst, pci: return 0 for non-present pci bars
  x86: Avoid check hlt for newer cpus
2010-05-18 09:27:49 -07:00
Linus Torvalds 1f8caa986a Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86-64: Combine SRAT regions when possible
2010-05-18 09:17:50 -07:00
Linus Torvalds d6f3875252 Merge branch 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86/microcode: Use nonseekable_open()
  x86: Improve Intel microcode loader performance
2010-05-18 09:17:17 -07:00
Linus Torvalds cb41838bbc Merge branch 'core-hweight-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-hweight-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, hweight: Use a 32-bit popcnt for __arch_hweight32()
  arch, hweight: Fix compilation errors
  x86: Add optimized popcnt variants
  bitops: Optimize hweight() by making use of compile-time evaluation
2010-05-18 09:17:01 -07:00
Linus Torvalds 98f01720cb Merge branch 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-irq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, acpi/irq: Define gsi_end when X86_IO_APIC is undefined
  x86, irq: Kill io_apic_renumber_irq
  x86, acpi/irq: Handle isa irqs that are not identity mapped to gsi's.
  x86, ioapic: Simplify probe_nr_irqs_gsi.
  x86, ioapic: Optimize pin_2_irq
  x86, ioapic: Move nr_ioapic_registers calculation to mp_register_ioapic.
  x86, ioapic: In mpparse use mp_register_ioapic
  x86, ioapic: Teach mp_register_ioapic to compute a global gsi_end
  x86, ioapic: Fix the types of gsi values
  x86, ioapic: Fix io_apic_redir_entries to return the number of entries.
  x86, ioapic: Only export mp_find_ioapic and mp_find_ioapic_pin in io_apic.h
  x86, acpi/irq: Generalize mp_config_acpi_legacy_irqs
  x86, acpi/irq: Fix acpi_sci_ioapic_setup so it has both bus_irq and gsi
  x86, acpi/irq: pci device dev->irq is an isa irq not a gsi
  x86, acpi/irq: Teach acpi_get_override_irq to take a gsi not an isa_irq
  x86, acpi/irq: Introduce apci_isa_irq_to_gsi
2010-05-18 09:15:57 -07:00
Linus Torvalds 41d59102e1 Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, fpu: Use static_cpu_has() to implement use_xsave()
  x86: Add new static_cpu_has() function using alternatives
  x86, fpu: Use the proper asm constraint in use_xsave()
  x86, fpu: Unbreak FPU emulation
  x86: Introduce 'struct fpu' and related API
  x86: Eliminate TS_XSAVE
  x86-32: Don't set ignore_fpu_irq in simd exception
  x86: Merge kernel_math_error() into math_error()
  x86: Merge simd_math_error() into math_error()
  x86-32: Rework cache flush denied handler

Fix trivial conflict in arch/x86/kernel/process.c
2010-05-18 08:58:16 -07:00
Linus Torvalds 07d77759c9 Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, hypervisor: add missing <linux/module.h>
  Modify the VMware balloon driver for the new x86_hyper API
  x86, hypervisor: Export the x86_hyper* symbols
  x86: Clean up the hypervisor layer
  x86, HyperV: fix up the license to mshyperv.c
  x86: Detect running on a Microsoft HyperV system
  x86, cpu: Make APERF/MPERF a normal table-driven flag
  x86, k8: Fix build error when K8_NB is disabled
  x86, cacheinfo: Disable index in all four subcaches
  x86, cacheinfo: Make L3 cache info per node
  x86, cacheinfo: Reorganize AMD L3 cache structure
  x86, cacheinfo: Turn off L3 cache index disable feature in virtualized environments
  x86, cacheinfo: Unify AMD L3 cache index disable checking
  cpufreq: Unify sysfs attribute definition macros
  powernow-k8: Fix frequency reporting
  x86, cpufreq: Add APERF/MPERF support for AMD processors
  x86: Unify APERF/MPERF support
  powernow-k8: Add core performance boost support
  x86, cpu: Add AMD core boosting feature flag to /proc/cpuinfo

Fix up trivial conflicts in arch/x86/kernel/cpu/intel_cacheinfo.c and
drivers/cpufreq/cpufreq_ondemand.c
2010-05-18 08:49:13 -07:00
Linus Torvalds b7723f9d21 Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Clean up arch/x86/Kconfig*
  x86-64: Don't export init_level4_pgt
2010-05-18 08:40:21 -07:00
Linus Torvalds 93c9d7f60c Merge branch 'x86-atomic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-atomic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Fix LOCK_PREFIX_HERE for uniprocessor build
  x86, atomic64: In selftest, distinguish x86-64 from 586+
  x86-32: Fix atomic64_inc_not_zero return value convention
  lib: Fix atomic64_inc_not_zero test
  lib: Fix atomic64_add_unless return value convention
  x86-32: Fix atomic64_add_unless return value convention
  lib: Fix atomic64_add_unless test
  x86: Implement atomic[64]_dec_if_positive()
  lib: Only test atomic64_dec_if_positive on archs having it
  x86-32: Rewrite 32-bit atomic64 functions in assembly
  lib: Add self-test for atomic64_t
  x86-32: Allow UP/SMP lock replacement in cmpxchg64
  x86: Add support for lock prefix in alternatives
2010-05-18 08:40:05 -07:00
Linus Torvalds 7421a10de7 Merge branch 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86: Use .cfi_sections for assembly code
  x86-64: Reduce SMP locks table size
  x86, asm: Introduce and use percpu_inc()
2010-05-18 08:35:37 -07:00
Linus Torvalds 4d7b4ac22f Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (311 commits)
  perf tools: Add mode to build without newt support
  perf symbols: symbol inconsistency message should be done only at verbose=1
  perf tui: Add explicit -lslang option
  perf options: Type check all the remaining OPT_ variants
  perf options: Type check OPT_BOOLEAN and fix the offenders
  perf options: Check v type in OPT_U?INTEGER
  perf options: Introduce OPT_UINTEGER
  perf tui: Add workaround for slang < 2.1.4
  perf record: Fix bug mismatch with -c option definition
  perf options: Introduce OPT_U64
  perf tui: Add help window to show key associations
  perf tui: Make <- exit menus too
  perf newt: Add single key shortcuts for zoom into DSO and threads
  perf newt: Exit browser unconditionally when CTRL+C, q or Q is pressed
  perf newt: Fix the 'A'/'a' shortcut for annotate
  perf newt: Make <- exit the ui_browser
  x86, perf: P4 PMU - fix counters management logic
  perf newt: Make <- zoom out filters
  perf report: Report number of events, not samples
  perf hist: Clarify events_stats fields usage
  ...

Fix up trivial conflicts in kernel/fork.c and tools/perf/builtin-record.c
2010-05-18 08:19:03 -07:00
Linus Torvalds 3aaf51ace5 Merge branch 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'oprofile-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (24 commits)
  oprofile/x86: make AMD IBS hotplug capable
  oprofile/x86: notify cpus only when daemon is running
  oprofile/x86: reordering some functions
  oprofile/x86: stop disabled counters in nmi handler
  oprofile/x86: protect cpu hotplug sections
  oprofile/x86: remove CONFIG_SMP macros
  oprofile/x86: fix uninitialized counter usage during cpu hotplug
  oprofile/x86: remove duplicate IBS capability check
  oprofile/x86: move IBS code
  oprofile/x86: return -EBUSY if counters are already reserved
  oprofile/x86: moving shutdown functions
  oprofile/x86: reserve counter msrs pairwise
  oprofile/x86: rework error handler in nmi_setup()
  oprofile: update file list in MAINTAINERS file
  oprofile: protect from not being in an IRQ context
  oprofile: remove double ring buffering
  ring-buffer: Add lost event count to end of sub buffer
  tracing: Show the lost events in the trace_pipe output
  ring-buffer: Add place holder recording of dropped events
  tracing: Fix compile error in module tracepoints when MODULE_UNLOAD not set
  ...
2010-05-18 08:18:07 -07:00
Linus Torvalds 1014cfe2fb Merge branch 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  lockdep: Reduce stack_trace usage
  lockdep: No need to disable preemption in debug atomic ops
  lockdep: Actually _dec_ in debug_atomic_dec
  lockdep: Provide off case for redundant_hardirqs_on increment
  lockdep: Simplify debug atomic ops
  lockdep: Fix redundant_hardirqs_on incremented with irqs enabled
  lockstat: Make lockstat counting per cpu
  i8253: Convert i8253_lock to raw_spinlock
2010-05-18 08:17:35 -07:00
Linus Torvalds 8123d8f17d Merge branch 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86/amd-iommu: Add amd_iommu=off command line option
  iommu-api: Remove iommu_{un}map_range functions
  x86/amd-iommu: Implement ->{un}map callbacks for iommu-api
  x86/amd-iommu: Make amd_iommu_iova_to_phys aware of multiple page sizes
  x86/amd-iommu: Make iommu_unmap_page and fetch_pte aware of page sizes
  x86/amd-iommu: Make iommu_map_page and alloc_pte aware of page sizes
  kvm: Change kvm_iommu_map_pages to map large pages
  VT-d: Change {un}map_range functions to implement {un}map interface
  iommu-api: Add ->{un}map callbacks to iommu_ops
  iommu-api: Add iommu_map and iommu_unmap functions
  iommu-api: Rename ->{un}map function pointers to ->{un}map_range
2010-05-18 07:22:37 -07:00
Cyrill Gorcunov ef4f30f54e perf, x86: P4 PMU -- fix typo in unflagged NMI handling
Tested-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
LKML-Reference: <1274174954.22793.17.camel@minggr.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 12:05:20 +02:00
Cyrill Gorcunov 0db1a7bc00 perf, x86: P4 PMU -- handle unflagged events
It might happen that an event can overflow without
the proper overflow flag set. Check the sign bit in
the raw counter value to solve this problem.

Tested-by: Lin Ming <ming.m.lin@intel.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: fweisbec@gmail.com
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <1274083984.6540.15.camel@minggr.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-05-18 08:25:34 +02:00
H. Peter Anvin c59bd56882 x86, hweight: Use a 32-bit popcnt for __arch_hweight32()
Use a 32-bit popcnt instruction for __arch_hweight32(), even on
x86-64.  Even though the input register will *usually* be
zero-extended due to the standard operation of the hardware, it isn't
necessarily so if the input value was the result of truncating a
64-bit operation.

Note: the POPCNT32 variant used on x86-64 has a technically
unnecessary REX prefix to make it five bytes long, the same as a CALL
instruction, therefore avoiding an unnecessary NOP.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <alpine.LFD.2.00.1005171443060.4195@i5.linux-foundation.org>
2010-05-17 15:17:16 -07:00
Andreas Herrmann fec84e3307 x86, hpet: Add reference to chipset erratum documentation for disable-hpet-msi-quirk
(At the moment the "SB700 Family Product Errata" document is available
at http://support.amd.com/us/Embedded_TechDocs/46837.pdf)

Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
LKML-Reference: <20100517164324.GB10254@alberich.amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2010-05-17 10:04:43 -07:00
Sreedhara DS 9a58a33339 IPC driver for Intel Mobile Internet Device (MID) platforms
The IPC (inter processor communications) is used to provide the
communications between kernel and system control units on some embedded
Intel x86 platforms.

(Various bits of clean up and restructuring by Alan Cox)

Signed-off-by: Sreedhara DS <sreedhara.ds@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
2010-05-17 12:06:07 -04:00
Anton Blanchard f3d46f9d31 atomic_t: Cast to volatile when accessing atomic variables
In preparation for removing volatile from the atomic_t definition, this
patch adds a volatile cast to all the atomic read functions.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-17 07:57:27 -07:00
Gui Jianfeng df2fb6e710 KVM: MMU: fix sp->unsync type error in trace event definition
sp->unsync is bool now, so update trace event declaration.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:19:29 +03:00
Joerg Roedel ff47a49b23 KVM: SVM: Handle MCE intercepts always on host level
This patch prevents MCE intercepts from being propagated
into the L1 guest if they happened in an L2 guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:27 +03:00
Joerg Roedel ce7ddec4bb KVM: x86: Allow marking an exception as reinjected
This patch adds logic to kvm/x86 which allows to mark an
injected exception as reinjected. This allows to remove an
ugly hack from svm_complete_interrupts that prevented
exceptions from being reinjected at all in the nested case.
The hack was necessary because an reinjected exception into
the nested guest could cause a nested vmexit emulation. But
reinjected exceptions must not intercept. The downside of
the hack is that a exception that in injected could get
lost.
This patch fixes the problem and puts the code for it into
generic x86 files because. Nested-VMX will likely have the
same problem and could reuse the code.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:26 +03:00
Joerg Roedel c2c63a4939 KVM: SVM: Report emulated SVM features to userspace
This patch implements the reporting of the emulated SVM
features to userspace instead of the real hardware
capabilities. Every real hardware capability needs emulation
in nested svm so the old behavior was broken.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:24 +03:00
Joerg Roedel d4330ef2fb KVM: x86: Add callback to let modules decide over some supported cpuid bits
This patch adds the get_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decission about some supported cpuid bits to the
architecture modules.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:23 +03:00
Joerg Roedel 228070b1b3 KVM: SVM: Propagate nested entry failure into guest hypervisor
This patch implements propagation of a failes guest vmrun
back into the guest instead of killing the whole guest.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:21 +03:00
Joerg Roedel 2be4fc7a02 KVM: SVM: Sync cr0 and cr3 to kvm state before nested handling
This patch syncs cr0 and cr3 from the vmcb to the kvm state
before nested intercept handling is done. This allows to
simplify the vmexit path.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:20 +03:00
Joerg Roedel 2041a06a50 KVM: SVM: Make sure rip is synced to vmcb before nested vmexit
This patch fixes a bug where a nested guest always went over
the same instruction because the rip was not advanced on a
nested vmexit.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:18 +03:00
Joerg Roedel 924584ccb0 KVM: SVM: Fix nested nmi handling
The patch introducing nested nmi handling had a bug. The
check does not belong to enable_nmi_window but must be in
nmi_allowed. This patch fixes this.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:17 +03:00
Avi Kivity 8d3b932309 Merge remote branch 'tip/perf/core'
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:15 +03:00
Lai Jiangshan cdbecfc398 KVM: VMX: free vpid when fail to create vcpu
Fix bug of the exception path, free allocated vpid when fail
to create vcpu.

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:19:10 +03:00
Wei Yongjun 77a1a71570 KVM: MMU: cleanup for function unaccount_shadowed()
Since gfn is not changed in the for loop, we do not need to call
gfn_to_memslot_unaliased() under the loop, and it is safe to move
it out.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:12 +03:00
Gui Jianfeng 2a059bf444 KVM: Get rid of dead function gva_to_page()
Nobody use gva_to_page() anymore, get rid of it.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:10 +03:00
Gui Jianfeng b2fc15a5ef KVM: MMU: Remove unused varialbe in rmap_next()
Remove unused varialbe in rmap_next()

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:09 +03:00
Gui Jianfeng 814a59d207 KVM: MMU: Make use of is_large_pte() in walker
Make use of is_large_pte() instead of checking PT_PAGE_SIZE_MASK
bit directly.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:07 +03:00
Gui Jianfeng 51fb60d81b KVM: MMU: Move sync_page() first pte address calculation out of loop
Move first pte address calculation out of loop to save some cycles.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:06 +03:00
Avi Kivity 87bc3bf972 KVM: MMU: Drop cr4.pge from shadow page role
Since commit bf47a760f6, we no longer handle ptes with the global bit
set specially, so there is no reason to distinguish between shadow pages
created with cr4.gpe set and clear.

Such tracking is expensive when the guest toggles cr4.pge, so drop it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:03 +03:00
Lai Jiangshan 90d83dc3d4 KVM: use the correct RCU API for PROVE_RCU=y
The RCU/SRCU API have already changed for proving RCU usage.

I got the following dmesg when PROVE_RCU=y because we used incorrect API.
This patch coverts rcu_deference() to srcu_dereference() or family API.

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
arch/x86/kvm/mmu.c:3020 invoked rcu_dereference_check() without protection!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
2 locks held by qemu-system-x86/8550:
 #0:  (&kvm->slots_lock){+.+.+.}, at: [<ffffffffa011a6ac>] kvm_set_memory_region+0x29/0x50 [kvm]
 #1:  (&(&kvm->mmu_lock)->rlock){+.+...}, at: [<ffffffffa012262d>] kvm_arch_commit_memory_region+0xa6/0xe2 [kvm]

stack backtrace:
Pid: 8550, comm: qemu-system-x86 Not tainted 2.6.34-rc4-tip-01028-g939eab1 #27
Call Trace:
 [<ffffffff8106c59e>] lockdep_rcu_dereference+0xaa/0xb3
 [<ffffffffa012f6c1>] kvm_mmu_calculate_mmu_pages+0x44/0x7d [kvm]
 [<ffffffffa012263e>] kvm_arch_commit_memory_region+0xb7/0xe2 [kvm]
 [<ffffffffa011a5d7>] __kvm_set_memory_region+0x636/0x6e2 [kvm]
 [<ffffffffa011a6ba>] kvm_set_memory_region+0x37/0x50 [kvm]
 [<ffffffffa015e956>] vmx_set_tss_addr+0x46/0x5a [kvm_intel]
 [<ffffffffa0126592>] kvm_arch_vm_ioctl+0x17a/0xcf8 [kvm]
 [<ffffffff810a8692>] ? unlock_page+0x27/0x2c
 [<ffffffff810bf879>] ? __do_fault+0x3a9/0x3e1
 [<ffffffffa011b12f>] kvm_vm_ioctl+0x364/0x38d [kvm]
 [<ffffffff81060cfa>] ? up_read+0x23/0x3d
 [<ffffffff810f3587>] vfs_ioctl+0x32/0xa6
 [<ffffffff810f3b19>] do_vfs_ioctl+0x495/0x4db
 [<ffffffff810e6b2f>] ? fget_light+0xc2/0x241
 [<ffffffff810e416c>] ? do_sys_open+0x104/0x116
 [<ffffffff81382d6d>] ? retint_swapgs+0xe/0x13
 [<ffffffff810f3ba6>] sys_ioctl+0x47/0x6a
 [<ffffffff810021db>] system_call_fastpath+0x16/0x1b

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:18:01 +03:00
Avi Kivity 9beeaa2d68 Merge branch 'perf'
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:17:58 +03:00
Xiao Guangrong 3246af0ece KVM: MMU: cleanup for hlist walk restart
Quote from Avi:

|Just change the assignment to a 'goto restart;' please,
|I don't like playing with list_for_each internals.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:56 +03:00
Gleb Natapov acb5451789 KVM: prevent spurious exit to userspace during task switch emulation.
If kvm_task_switch() fails code exits to userspace without specifying
exit reason, so the previous exit reason is reused by userspace. Fix
this by specifying exit reason correctly.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:55 +03:00
Xiao Guangrong 6b18493d60 KVM: MMU: remove unused parameter in mmu_parent_walk()
'vcpu' is unused, remove it

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:53 +03:00
Xiao Guangrong 0571d366e0 KVM: MMU: reduce 'struct kvm_mmu_page' size
Define 'multimapped' as 'bool'.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:52 +03:00
Xiao Guangrong 1b8c7934a4 KVM: MMU: remove unused struct kvm_unsync_walk
Remove 'struct kvm_unsync_walk' since it's not used.

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:50 +03:00
Gleb Natapov 19d0443726 KVM: fix emulator_task_switch() return value.
emulator_task_switch() should return -1 for failure and 0 for success to
the caller, just like x86_emulate_insn() does.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:49 +03:00
Avi Kivity 5b7e0102ae KVM: MMU: Replace role.glevels with role.cr4_pae
There is no real distinction between glevels=3 and glevels=4; both have
exactly the same format and the code is treated exactly the same way.  Drop
role.glevels and replace is with role.cr4_pae (which is meaningful).  This
simplifies the code a bit.

As a side effect, it allows sharing shadow page tables between pae and
longmode guest page tables at the same guest page.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:47 +03:00
Jan Kiszka e269fb2189 KVM: x86: Push potential exception error code on task switches
When a fault triggers a task switch, the error code, if existent, has to
be pushed on the new task's stack. Implement the missing bits.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:46 +03:00
Jan Kiszka 0760d44868 KVM: x86: Terminate early if task_switch_16/32 failed
Stop the switch immediately if task_switch_16/32 returned an error. Only
if that step succeeded, the switch should actually take place and update
any register states.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:44 +03:00
Gleb Natapov 8f6abd06f5 KVM: x86: get rid of mmu_only parameter in emulator_write_emulated()
We can call kvm_mmu_pte_write() directly from
emulator_cmpxchg_emulated() instead of passing mmu_only down to
emulator_write_emulated_onepage() and call it there.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:43 +03:00
Gleb Natapov 020df0794f KVM: move DR register access handling into generic code
Currently both SVM and VMX have their own DR handling code. Move it to
x86.c.

Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:39 +03:00
Andre Przywara 6bc31bdc55 KVM: SVM: implement NEXTRIPsave SVM feature
On SVM we set the instruction length of skipped instructions
to hard-coded, well known values, which could be wrong when (bogus,
but valid) prefixes (REX, segment override) are used.
Newer AMD processors (Fam10h 45nm and better, aka. PhenomII or
AthlonII) have an explicit NEXTRIP field in the VMCB containing the
desired information.
Since it is cheap to do so, we use this field to override the guessed
value on newer processors.
A fix for older CPUs would be rather expensive, as it would require
to fetch and partially decode the instruction. As the problem is not
a security issue and needs special, handcrafted code to trigger
(no compiler will ever generate such code), I omit a fix for older
CPUs.
If someone is interested, I have both a patch for these CPUs as well as
demo code triggering this issue: It segfaults under KVM, but runs
perfectly on native Linux.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:38 +03:00
Avi Kivity f7a711971e KVM: Fix MAXPHYADDR calculation when cpuid does not support it
MAXPHYADDR is derived from cpuid 0x80000008, but when that isn't present, we
get some random value.

Fix by checking first that cpuid 0x80000008 is supported.

Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:36 +03:00
Avi Kivity e46479f852 KVM: Trace emulated instructions
Log emulated instructions in ftrace, especially if they failed.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:35 +03:00
Avi Kivity 2fb53ad811 KVM: x86 emulator: Don't overwrite decode cache
Currently if we an instruction spans a page boundary, when we fetch the
second half we overwrite the first half.  This prevents us from tracing
the full instruction opcodes.

Fix by appending the second half to the first.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:33 +03:00
Xiao Guangrong 24222c2fec KVM: MMU: remove unnecessary NX check in walk_addr
After is_rsvd_bits_set() checks, EFER.NXE must be enabled if NX bit is seted

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:30 +03:00
Xiao Guangrong f84cbb0561 KVM: MMU: remove unused field
kvm_mmu_page.oos_link is not used, so remove it

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:29 +03:00
Xiao Guangrong 805d32dea4 KVM: MMU: cleanup/fix mmu audit code
This patch does:
- 'sp' parameter in inspect_spte_fn() is not used, so remove it
- fix 'kvm' and 'slots' is not defined in count_rmaps()
- fix a bug in inspect_spte_has_rmap()

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:17:27 +03:00
Avi Kivity 84b0c8c6a6 KVM: MMU: Disassociate direct maps from guest levels
Direct maps are linear translations for a section of memory, used for
real mode or with large pages.  As such, they are independent of the guest
levels.

Teach the mmu about this by making page->role.glevels = 0 for direct maps.
This allows direct maps to be shared among real mode and the various paging
modes.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:44 +03:00
Xiao Guangrong f815bce894 KVM: MMU: check reserved bits only if CR4.PSE=1 or CR4.PAE=1
- Check reserved bits only if CR4.PAE=1 or CR4.PSE=1 when guest #PF occurs
- Fix a typo in reset_rsvds_bits_mask()

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:42 +03:00
Marcelo Tosatti 041b1359a2 KVM: x86: document KVM_REQ_PENDING_TIMER usage
Document that KVM_REQ_PENDING_TIMER is implicitly used during guest
entry.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:40 +03:00
Gleb Natapov de3e6480f7 KVM: x86 emulator: fix unlocked CMPXCHG8B emulation
When CMPXCHG8B is executed without LOCK prefix it is racy. Preserve this
behaviour in emulator too.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:38 +03:00
Gleb Natapov 6550e1f165 KVM: x86 emulator: add decoding of CMPXCHG8B dst operand
Decode CMPXCHG8B destination operand in decoding stage. Fixes regression
introduced by "If LOCK prefix is used dest arg should be memory" commit.
This commit relies on dst operand be decoded at the beginning of an
instruction emulation.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:37 +03:00
Gleb Natapov 482ac18ae2 KVM: x86 emulator: commit rflags as part of registers commit
Make sure that rflags is committed only after successful instruction
emulation.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:35 +03:00
Jan Kiszka 9749a6c0f0 KVM: x86: Fix 32-bit build breakage due to typo
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:16:34 +03:00
Gleb Natapov 92bf9748b5 KVM: small kvm_arch_vcpu_ioctl_run() cleanup.
Unify all conditions that get us back into emulator after returning from
userspace.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:32 +03:00
Gleb Natapov 7b262e90fc KVM: x86 emulator: introduce pio in string read ahead.
To optimize "rep ins" instruction do IO in big chunks ahead of time
instead of doing it only when required during instruction emulation.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:31 +03:00
Gleb Natapov 5cd21917da KVM: x86 emulator: restart string instruction without going back to a guest.
Currently when string instruction is only partially complete we go back
to a guest mode, guest tries to reexecute instruction and exits again
and at this point emulation continues. Avoid all of this by restarting
instruction without going back to a guest mode, but return to a guest
mode each 1024 iterations to allow interrupt injection. Pending
exception causes immediate guest entry too.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:29 +03:00
Gleb Natapov cb404fe089 KVM: x86 emulator: remove saved_eip
c->eip is never written back in case of emulation failure, so no need to
set it to old value.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:28 +03:00
Gleb Natapov 7972995b0c KVM: x86 emulator: Move string pio emulation into emulator.c
Currently emulation is done outside of emulator so things like doing
ins/outs to/from mmio are broken it also makes it hard (if not impossible)
to implement single stepping in the future. The implementation in this
patch is not efficient since it exits to userspace for each IO while
previous implementation did 'ins' in batches. Further patch that
implements pio in string read ahead address this problem.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:26 +03:00
Gleb Natapov cf8f70bfe3 KVM: x86 emulator: fix in/out emulation.
in/out emulation is broken now. The breakage is different depending
on where IO device resides. If it is in userspace emulator reports
emulation failure since it incorrectly interprets kvm_emulate_pio()
return value. If IO device is in the kernel emulation of 'in' will do
nothing since kvm_emulate_pio() stores result directly into vcpu
registers, so emulator will overwrite result of emulation during
commit of shadowed register.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:25 +03:00
Gleb Natapov d9271123a4 KVM: x86 emulator: during rep emulation decrement ECX only if emulation succeeded
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:23 +03:00
Gleb Natapov a682e35449 KVM: x86 emulator: add decoding of X,Y parameters from Intel SDM
Add decoding of X,Y parameters from Intel SDM which are used by string
instruction to specify source and destination. Use this new decoding
to implement movs, cmps, stos, lods in a generic way.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:21 +03:00
Gleb Natapov 69f55cb11e KVM: x86 emulator: populate OP_MEM operand during decoding.
All struct operand fields are initialized during decoding for all
operand types except OP_MEM, but there is no reason for that. Move
OP_MEM operand initialization into decoding stage for consistency.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:20 +03:00
Gleb Natapov ceffb45972 KVM: Use task switch from emulator.c
Remove old task switch code from x86.c

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:18 +03:00
Gleb Natapov 2e873022f5 KVM: x86 emulator: Use load_segment_descriptor() instead of kvm_load_segment_descriptor()
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:17 +03:00
Gleb Natapov 38ba30ba51 KVM: x86 emulator: Emulate task switch in emulator.c
Implement emulation of 16/32 bit task switch in emulator.c

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:15 +03:00
Gleb Natapov 2dafc6c234 KVM: x86 emulator: Provide more callbacks for x86 emulator.
Provide get_cached_descriptor(), set_cached_descriptor(),
get_segment_selector(), set_segment_selector(), get_gdt(),
write_std() callbacks.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:14 +03:00
Gleb Natapov aca06a8307 KVM: x86 emulator: cleanup grp3 return value
When x86_emulate_insn() does not know how to emulate instruction it
exits via cannot_emulate label in all cases except when emulating
grp3. Fix that.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:12 +03:00
Gleb Natapov a41ffb7540 KVM: x86 emulator: If LOCK prefix is used dest arg should be memory.
If LOCK prefix is used dest arg should be memory, otherwise instruction
should generate #UD.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:11 +03:00
Gleb Natapov fd5253658b KVM: x86 emulator: do not call writeback if msr access fails.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:09 +03:00
Gleb Natapov 2e901c4cf4 KVM: x86 emulator: fix return values of syscall/sysenter/sysexit emulations
Return X86EMUL_PROPAGATE_FAULT is fault was injected. Also inject #UD
for those instruction when appropriate.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:08 +03:00
Gleb Natapov 1e470be5a1 KVM: x86 emulator: fix mov dr to inject #UD when needed.
If CR4.DE=1 access to registers DR4/DR5 cause #UD.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:06 +03:00
Gleb Natapov 6aebfa6ea7 KVM: x86 emulator: inject #UD on access to non-existing CR
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:05 +03:00
Gleb Natapov ab8557b2b3 KVM: x86 emulator: 0f (20|21|22|23) ignore mod bits.
Resent spec says that for 0f (20|21|22|23) the 2 bits in the mod field
are ignored. Interestingly enough older spec says that 11 is only valid
encoding.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:03 +03:00
Gleb Natapov 6e1e5ffee8 KVM: x86 emulator: fix 0f 01 /5 emulation
It is undefined and should generate #UD.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:02 +03:00
Gleb Natapov 5e3ae6c540 KVM: x86 emulator: fix mov r/m, sreg emulation.
mov r/m, sreg generates #UD ins sreg is incorrect.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:16:00 +03:00
Gleb Natapov 063db061b9 KVM: Provide current eip as part of emulator context.
Eliminate the need to call back into KVM to get it from emulator.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:59 +03:00
Gleb Natapov 9c5372445c KVM: Provide x86_emulate_ctxt callback to get current cpl
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:57 +03:00
Gleb Natapov 93a152be5a KVM: remove realmode_lmsw function.
Use (get|set)_cr callback to emulate lmsw inside emulator.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:56 +03:00
Gleb Natapov 52a4661737 KVM: Provide callback to get/set control registers in emulator ops.
Use this callback instead of directly call kvm function. Also rename
realmode_(set|get)_cr to emulator_(set|get)_cr since function has nothing
to do with real mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:54 +03:00
Gui Jianfeng 3129994458 KVM: VMX: change to use bool return values
Make use of bool as return values, and remove some useless
bool value converting. Thanks Avi to point this out.

Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:51 +03:00
Gleb Natapov 49c6799a2c KVM: Remove pointer to rflags from realmode_set_cr parameters.
Mov reg, cr instruction doesn't change flags in any meaningful way, so
no need to update rflags after instruction execution.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:50 +03:00
Gleb Natapov af5b4f7ff7 KVM: x86 emulator: check return value against correct define
Check return value against correct define instead of open code
the value.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:48 +03:00
Gleb Natapov c73e197bc5 KVM: x86 emulator: fix RCX access during rep emulation
During rep emulation access length to RCX depends on current address
mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:47 +03:00
Gleb Natapov d6d367d678 KVM: x86 emulator: Fix DstAcc decoding.
Set correct operation length. Add RAX (64bit) handling.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:45 +03:00
Avi Kivity 08e850c653 KVM: MMU: Reinstate pte prefetch on invlpg
Commit fb341f57 removed the pte prefetch on guest invlpg, citing guest races.
However, the SDM is adamant that prefetch is allowed:

  "The processor may create entries in paging-structure caches for
   translations required for prefetches and for accesses that are a
   result of speculative execution that would never actually occur
   in the executed code path."

And, in fact, there was a race in the prefetch code: we picked up the pte
without the mmu lock held, so an older invlpg could install the pte over
a newer invlpg.

Reinstate the prefetch logic, but this time note whether another invlpg has
executed using a counter.  If a race occured, do not install the pte.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:43 +03:00
Avi Kivity fbc5d139bb KVM: MMU: Do not instantiate nontrapping spte on unsync page
The update_pte() path currently uses a nontrapping spte when a nonpresent
(or nonaccessed) gpte is written.  This is fine since at present it is only
used on sync pages.  However, on an unsync page this will cause an endless
fault loop as the guest is under no obligation to invlpg a gpte that
transitions from nonpresent to present.

Needed for the next patch which reinstates update_pte() on invlpg.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:42 +03:00
Avi Kivity 4a5f48f666 KVM: Don't follow an atomic operation by a non-atomic one
Currently emulated atomic operations are immediately followed by a non-atomic
operation, so that kvm_mmu_pte_write() can be invoked.  This updates the mmu
but undoes the whole point of doing things atomically.

Fix by only performing the atomic operation and the mmu update, and avoiding
the non-atomic write.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:40 +03:00
Avi Kivity daea3e73cb KVM: Make locked operations truly atomic
Once upon a time, locked operations were emulated while holding the mmu mutex.
Since mmu pages were write protected, it was safe to emulate the writes in
a non-atomic manner, since there could be no other writer, either in the
guest or in the kernel.

These days emulation takes place without holding the mmu spinlock, so the
write could be preempted by an unshadowing event, which exposes the page
to writes by the guest.  This may cause corruption of guest page tables.

Fix by using an atomic cmpxchg for these operations.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:39 +03:00
Avi Kivity 72016f3a42 KVM: MMU: Consolidate two guest pte reads in kvm_mmu_pte_write()
kvm_mmu_pte_write() reads guest ptes in two different occasions, both to
allow a 32-bit pae guest to update a pte with 4-byte writes.  Consolidate
these into a single read, which also allows us to consolidate another read
from an invlpg speculating a gpte into the shadow page table.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:37 +03:00
Wei Yongjun 160d2f6c0c KVM: x86: fix the error of ioctl KVM_IRQ_LINE if no irq chip
If no irq chip in kernel, ioctl KVM_IRQ_LINE will return -EFAULT.
But I see in other place such as KVM_[GET|SET]IRQCHIP, -ENXIO is
return. So this patch used -ENXIO instead of -EFAULT.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:31 +03:00
Wei Yongjun ec68798c8f KVM: x86: Use native_store_idt() instead of kvm_get_idt()
This patch use generic linux function native_store_idt()
instead of kvm_get_idt(), and also removed the useless
function kvm_get_idt().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:28 +03:00
Avi Kivity 5c1c85d08d KVM: Trace exception injection
Often an exception can help point out where things start to go wrong.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:27 +03:00
Avi Kivity 5bfd8b5455 KVM: Move kvm_exit tracepoint rip reading inside tracepoint
Reading rip is expensive on vmx, so move it inside the tracepoint so we only
incur the cost if tracing is enabled.

Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:25 +03:00
Minchan Kim d4f64b6cad KVM: remove redundant initialization of page->private
The prep_new_page() in page allocator calls set_page_private(page, 0).
So we don't need to reinitialize private of page.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Avi Kivity<avi@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:24 +03:00
Xiao Guangrong 2ed152afc7 KVM: cleanup kvm trace
This patch does:

 - no need call tracepoint_synchronize_unregister() when kvm module
   is unloaded since ftrace can handle it

 - cleanup ftrace's macro

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2010-05-17 12:15:22 +03:00
Gleb Natapov 835e6b8047 KVM: x86 emulator mark VMMCALL and LMSW as privileged
LMSW is present in both group tables. It was marked privileged only in
one of them. Intel analog of VMMCALL is already marked privileged.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:18 +03:00
Joerg Roedel f71385383f KVM: SVM: Ignore lower 12 bit of nested msrpm_pa
These bits are ignored by the hardware too. Implement this
for nested svm too.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2010-05-17 12:15:16 +03:00