If the emulation is caused by #PF and it is non-page_table writing instruction,
it means the VM-EXIT is caused by shadow page protected, we can zap the shadow
page and retry this instruction directly
The idea is from Avi
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The idea is from Avi:
| tag instructions that are typically used to modify the page tables, and
| drop shadow if any other instruction is used.
| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
| and cmpxchg8b.
This patch is used to tag the instructions and in the later path, shadow page
is dropped if it is written by other instructions
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
function when spte is prefetched, unfortunately, we can not know how many
spte need to be prefetched on this path, that means we can use out of the
free pte_list_desc object in the cache, and BUG_ON() is triggered, also some
path does not fill the cache, such as INS instruction emulated that does not
trigger page fault
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".
Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that idt_vectoring_info
and exit_reason_external_interrupt can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.
In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.
Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.
Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.
Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.
We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.
Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.
On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.
This is broken in several ways:
- if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
deadline timer feature, a guest that uses the feature will break
- live migration to older host kernels that don't support the TSC deadline
timer will cause the feature to be pulled from under the guest's feet;
breaking it
- guests that are broken wrt the feature will fail.
Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.
Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.
[avi: add the KVM_CAP + documentation]
Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
User space may create the PIT and forgets about setting up the irqchips.
In that case, firing PIT IRQs will crash the host:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm]
...
Call Trace:
[<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm]
[<ffffffff81071431>] process_one_work+0x111/0x4d0
[<ffffffff81071bb2>] worker_thread+0x152/0x340
[<ffffffff81075c8e>] kthread+0x7e/0x90
[<ffffffff815a4474>] kernel_thread_helper+0x4/0x10
Prevent this by checking the irqchip mode before starting a timer. We
can't deny creating the PIT if the irqchips aren't set up yet as
current user land expects this order to work.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
net: Add a flow_cache_flush_deferred function
ipv4: reintroduce route cache garbage collector
net: have ipconfig not wait if no dev is available
sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd
asix: new device id
davinci-cpdma: fix locking issue in cpdma_chan_stop
sctp: fix incorrect overflow check on autoclose
r8169: fix Config2 MSIEnable bit setting.
llc: llc_cmsg_rcv was getting called after sk_eat_skb.
net: bpf_jit: fix an off-one bug in x86_64 cond jump target
iwlwifi: update SCD BC table for all SCD queues
Revert "Bluetooth: Revert: Fix L2CAP connection establishment"
Bluetooth: Clear RFCOMM session timer when disconnecting last channel
Bluetooth: Prevent uninitialized data access in L2CAP configuration
iwlwifi: allow to switch to HT40 if not associated
iwlwifi: tx_sync only on PAN context
mwifiex: avoid double list_del in command cancel path
ath9k: fix max phy rate at rate control init
nfc: signedness bug in __nci_request()
iwlwifi: do not set the sequence control bit is not needed
When printing the code bytes in show_registers(), the markers around the
byte at the fault address could make the printk() format string look
like a valid log level and facility code. This would prevent this byte
from being printed and result in a spurious newline:
[ 7555.765589] Code: 8b 32 e9 94 00 00 00 81 7d 00 ff 00 00 00 0f 87 96 00 00 00 48 8b 83 c0 00 00 00 44 89 e2 44 89 e6 48 89 df 48 8b 80 d8 02 00 00
[ 7555.765683] 8b 48 28 48 89 d0 81 e2 ff 0f 00 00 48 c1 e8 0c 48 c1 e0 04
Add KERN_CONT where needed, and elsewhere in show_registers() for
consistency.
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Link: http://lkml.kernel.org/r/4EEFA7AE.9020407@ladisch.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
x86 jump instruction size is 2 or 5 bytes (near/long jump), not 2 or 6
bytes.
In case a conditional jump is followed by a long jump, conditional jump
target is one byte past the start of target instruction.
Signed-off-by: Markus Kötter <nepenthesdev@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen/swiotlb: Use page alignment for early buffer allocation.
xen: only limit memory map to maximum reservation for domain 0.
d312ae878b "xen: use maximum reservation to limit amount of usable RAM"
clamped the total amount of RAM to the current maximum reservation. This is
correct for dom0 but is not correct for guest domains. In order to boot a guest
"pre-ballooned" (e.g. with memory=1G but maxmem=2G) in order to allow for
future memory expansion the guest must derive max_pfn from the e820 provided by
the toolstack and not the current maximum reservation (which can reflect only
the current maximum, not the guest lifetime max). The existing algorithm
already behaves this correctly if we do not artificially limit the maximum
number of pages for the guest case.
For a guest booted with maxmem=512, memory=128 this results in:
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] Xen: 0000000000000000 - 00000000000a0000 (usable)
[ 0.000000] Xen: 00000000000a0000 - 0000000000100000 (reserved)
-[ 0.000000] Xen: 0000000000100000 - 0000000008100000 (usable)
-[ 0.000000] Xen: 0000000008100000 - 0000000020800000 (unusable)
+[ 0.000000] Xen: 0000000000100000 - 0000000020800000 (usable)
...
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI not present or invalid.
[ 0.000000] e820 update range: 0000000000000000 - 0000000000010000 (usable) ==> (reserved)
[ 0.000000] e820 remove range: 00000000000a0000 - 0000000000100000 (usable)
-[ 0.000000] last_pfn = 0x8100 max_arch_pfn = 0x1000000
+[ 0.000000] last_pfn = 0x20800 max_arch_pfn = 0x1000000
[ 0.000000] initial memory mapped : 0 - 027ff000
[ 0.000000] Base memory trampoline at [c009f000] 9f000 size 4096
-[ 0.000000] init_memory_mapping: 0000000000000000-0000000008100000
-[ 0.000000] 0000000000 - 0008100000 page 4k
-[ 0.000000] kernel direct mapping tables up to 8100000 @ 27bb000-27ff000
+[ 0.000000] init_memory_mapping: 0000000000000000-0000000020800000
+[ 0.000000] 0000000000 - 0020800000 page 4k
+[ 0.000000] kernel direct mapping tables up to 20800000 @ 26f8000-27ff000
[ 0.000000] xen: setting RW the range 27e8000 - 27ff000
[ 0.000000] 0MB HIGHMEM available.
-[ 0.000000] 129MB LOWMEM available.
-[ 0.000000] mapped low ram: 0 - 08100000
-[ 0.000000] low ram: 0 - 08100000
+[ 0.000000] 520MB LOWMEM available.
+[ 0.000000] mapped low ram: 0 - 20800000
+[ 0.000000] low ram: 0 - 20800000
With this change "xl mem-set <domain> 512M" will successfully increase the
guest RAM (by reducing the balloon).
There is no change for dom0.
Reported-and-Tested-by: George Shuklin <george.shuklin@gmail.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: stable@kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
This hangs my MacBook Air at boot time; I get no console
messages at all. I reverted this on top of -rc5 and my machine
boots again.
This reverts commit e8c7106280.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Keith Packard <keithp@keithp.com>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console
Signed-off-by: Ingo Molnar <mingo@elte.hu>
efi_call_phys_prelog() sets up a 1:1 mapping of the physical address
range in swapper_pg_dir. Instead of replacing then restoring entries
in swapper_pg_dir we should be using initial_page_table which already
contains the 1:1 mapping.
It's safe to blindly switch back to swapper_pg_dir in the epilog
because the physical EFI routines are only called before
efi_enter_virtual_mode(), e.g. before any user processes have been
forked. Therefore, we don't need to track which pgd was in %cr3 when
we entered the prelog.
The previous code actually contained a bug because it assumed that the
kernel was loaded at a physical address within the first 8MB of ram,
usually at 0x100000. However, this isn't the case with a
CONFIG_RELOCATABLE=y kernel which could have been loaded anywhere in
the physical address space.
Also delete the ancient (and bogus) comments about the page table
being restored after the lock is released. There is no locking.
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Darrent Hart <dvhart@linux.intel.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/1323346250.3894.74.camel@mfleming-mobl1.ger.corp.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, efi: Calling __pa() with an ioremap()ed address is invalid
x86, hpet: Immediately disable HPET timer 1 if rtc irq is masked
x86/intel_mid: Kconfig select fix
x86/intel_mid: Fix the Kconfig for MID selection
With the 3.2-rc kernel, IOMMU 2M pages in KVM works. But when I tried
to use IOMMU 1GB pages in KVM, I encountered an oops and the 1GB page
failed to be used.
The root cause is that 1GB page allocation calls gup_huge_pud() while 2M
page calls gup_huge_pmd. If compound pages are used and the page is a
tail page, gup_huge_pmd() increases _mapcount to record tail page are
mapped while gup_huge_pud does not do that.
So when the mapped page is relesed, it will result in kernel oops
because the page is not marked mapped.
This patch add tail process for compound page in 1GB huge page which
keeps the same process as 2M page.
Reproduce like:
1. Add grub boot option: hugepagesz=1G hugepages=8
2. mount -t hugetlbfs -o pagesize=1G hugetlbfs /dev/hugepages
3. qemu-kvm -m 2048 -hda os-kvm.img -cpu kvm64 -smp 4 -mem-path /dev/hugepages
-net none -device pci-assign,host=07:00.1
kernel BUG at mm/swap.c:114!
invalid opcode: 0000 [#1] SMP
Call Trace:
put_page+0x15/0x37
kvm_release_pfn_clean+0x31/0x36
kvm_iommu_put_pages+0x94/0xb1
kvm_iommu_unmap_memslots+0x80/0xb6
kvm_assign_device+0xba/0x117
kvm_vm_ioctl_assigned_device+0x301/0xa47
kvm_vm_ioctl+0x36c/0x3a2
do_vfs_ioctl+0x49e/0x4e4
sys_ioctl+0x5a/0x7c
system_call_fastpath+0x16/0x1b
RIP put_compound_page+0xd4/0x168
Signed-off-by: Youquan Song <youquan.song@intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If we encounter an efi_memory_desc_t without EFI_MEMORY_WB set
in ->attribute we currently call set_memory_uc(), which in turn
calls __pa() on a potentially ioremap'd address.
On CONFIG_X86_32 this is invalid, resulting in the following
oops on some machines:
BUG: unable to handle kernel paging request at f7f22280
IP: [<c10257b9>] reserve_ram_pages_type+0x89/0x210
[...]
Call Trace:
[<c104f8ca>] ? page_is_ram+0x1a/0x40
[<c1025aff>] reserve_memtype+0xdf/0x2f0
[<c1024dc9>] set_memory_uc+0x49/0xa0
[<c19334d0>] efi_enter_virtual_mode+0x1c2/0x3aa
[<c19216d4>] start_kernel+0x291/0x2f2
[<c19211c7>] ? loglevel+0x1b/0x1b
[<c19210bf>] i386_start_kernel+0xbf/0xc8
A better approach to this problem is to map the memory region
with the correct attributes from the start, instead of modifying
it after the fact. The uncached case can be handled by
ioremap_nocache() and the cached by ioremap_cache().
Despite first impressions, it's not possible to use
ioremap_cache() to map all cached memory regions on
CONFIG_X86_64 because EFI_RUNTIME_SERVICES_DATA regions really
don't like being mapped into the vmalloc space, as detailed in
the following bug report,
https://bugzilla.redhat.com/show_bug.cgi?id=748516
Therefore, we need to ensure that any EFI_RUNTIME_SERVICES_DATA
regions are covered by the direct kernel mapping table on
CONFIG_X86_64. To accomplish this we now map E820_RESERVED_EFI
regions via the direct kernel mapping with the initial call to
init_memory_mapping() in setup_arch(), whereas previously these
regions wouldn't be mapped if they were after the last E820_RAM
region until efi_ioremap() was called. Doing it this way allows
us to delete efi_ioremap() completely.
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Matthew Garrett <mjg@redhat.com>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Huang Ying <huang.ying.caritas@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1321621751-3650-1-git-send-email-matt@console-pimps.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When HPET is operating in RTC mode, the TN_ENABLE bit on timer1
controls whether the HPET or the RTC delivers interrupts to irq8. When
the system goes into suspend, the RTC driver sends a signal to the
HPET driver so that the HPET releases control of irq8, allowing the
RTC to wake the system from suspend. The switchover is accomplished by
a write to the HPET configuration registers which currently only
occurs while servicing the HPET interrupt.
On some systems, I have seen the system suspend before an HPET
interrupt occurs, preventing the write to the HPET configuration
register and leaving the HPET in control of the irq8. As the HPET is
not active during suspend, it does not generate a wake signal and RTC
alarms do not work.
This patch forces the HPET driver to immediately transfer control of
the irq8 channel to the RTC instead of waiting until the next
interrupt event.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Link: http://lkml.kernel.org/r/20111118153306.GB16319@alberich.amd.com
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
KVM needs to know perf capability to decide which PMU it can expose to a
guest.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-8-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Implement the disabling of arch events as a quirk so that we can print
a message along with it. This creates some visibility into the problem
space and could allow us to work on adding more work-around like the
AAJ80 one.
Requested-by: Ingo Molnar <mingo@elte.hu>
Cc: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-wcja2z48wklzu1b0nkz0a5y7@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel CPUs report non-available architectural events in cpuid leaf
0AH.EBX. Use it to disable events that are not available according
to CPU.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1320929850-10480-7-git-send-email-gleb@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
WARNING: arch/x86/kernel/built-in.o(.text+0x4c71): Section mismatch in
reference from the function arch_jump_label_transform_static() to the
function .init.text:text_poke_early()
The function arch_jump_label_transform_static() references
the function __init text_poke_early().
This is often because arch_jump_label_transform_static lacks a __init
annotation or the annotation of text_poke_early is wrong.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/n/tip-9lefe89mrvurrwpqw5h8xm8z@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If we select a symbol it should have a type declared first
otherwise in some situations the config tools get upset. They
are currently perhaps a bit too resilient which is why this
wasn't noticed initially.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/20111206132811.4041.32549.stgit@bob.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We currently fail to build on CONFIG_X86_INTEL_MID=y and
CONFIG_X86_MRST unset.
We could build all the bits to make generic MID work if you
picked MID platform alone but that's really silly. Instead use
select and two variables.
This looks a bit daft right now but once we add a Medfield
selection it'll start to look a good deal more sensible.
Reported-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/20111205231433.28811.51297.stgit@bob.linux.org.uk
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This avoids a scheduling failure for cases like:
cycles, cycles, instructions, instructions (on Core2)
Which would end up being programmed like:
PMC0, PMC1, FP-instructions, fail
Because all events will have the same weight.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-8tnwb92asqj7xajqqoty4gel@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The current x86 event scheduler fails to resolve scheduling problems
of certain combinations of events and constraints. This happens if the
counter mask of such an event is not a subset of any other counter
mask of a constraint with an equal or higher weight, e.g. constraints
of the AMD family 15h pmu:
counter mask weight
amd_f15_PMC30 0x09 2 <--- overlapping counters
amd_f15_PMC20 0x07 3
amd_f15_PMC53 0x38 3
The scheduler does not find then an existing solution. Here is an
example:
event code counter failure possible solution
0x02E PMC[3,0] 0 3
0x043 PMC[2:0] 1 0
0x045 PMC[2:0] 2 1
0x046 PMC[2:0] FAIL 2
The event scheduler may not select the correct counter in the first
cycle because it needs to know which subsequent events will be
scheduled. It may fail to schedule the events then.
To solve this, we now save the scheduler state of events with
overlapping counter counstraints. If we fail to schedule the events
we rollback to those states and try to use another free counter.
Constraints with overlapping counters are marked with a new introduced
overlap flag. We set the overlap flag for such constraints to give the
scheduler a hint which events to select for counter rescheduling. The
EVENT_CONSTRAINT_OVERLAP() macro can be used for this.
Care must be taken as the rescheduling algorithm is O(n!) which will
increase scheduling cycles for an over-commited system dramatically.
The number of such EVENT_CONSTRAINT_OVERLAP() macros and its counter
masks must be kept at a minimum. Thus, the current stack is limited to
2 states to limit the number of loops the algorithm takes in the worst
case.
On systems with no overlapping-counter constraints, this
implementation does not increase the loop count compared to the
previous algorithm.
V2:
* Renamed redo -> overlap.
* Reimplementation using perf scheduling helper functions.
V3:
* Added WARN_ON_ONCE() if out of save states.
* Changed function interface of perf_sched_restore_state() to use bool
as return value.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch introduces x86 perf scheduler code helper functions. We
need this to later add more complex functionality to support
overlapping counter constraints (next patch).
The algorithm is modified so that the range of weight values is now
generated from the constraints. There shouldn't be other functional
changes.
With the helper functions the scheduler is controlled. There are
functions to initialize, traverse the event list, find unused counters
etc. The scheduler keeps its own state.
V3:
* Added macro for_each_set_bit_cont().
* Changed functions interfaces of perf_sched_find_counter() and
perf_sched_next_event() to use bool as return value.
* Added some comments to make code better understandable.
V4:
* Fix broken event assignment if weight of the first event is not
wmin (perf_sched_init()).
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1321616122-1533-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
intr_remapping: Fix section mismatch in ir_dev_scope_init()
intel-iommu: Fix section mismatch in dmar_parse_rmrr_atsr_dev()
x86, amd: Fix up numa_node information for AMD CPU family 15h model 0-0fh northbridge functions
x86, AMD: Correct align_va_addr documentation
x86/rtc, mrst: Don't register a platform RTC device for for Intel MID platforms
x86/mrst: Battery fixes
x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode
x86: Fix "Acer Aspire 1" reboot hang
x86/mtrr: Resolve inconsistency with Intel processor manual
x86: Document rdmsr_safe restrictions
x86, microcode: Fix the failure path of microcode update driver init code
Add TAINT_FIRMWARE_WORKAROUND on MTRR fixup
x86/mpparse: Account for bus types other than ISA and PCI
x86, mrst: Change the pmic_gpio device type to IPC
mrst: Added some platform data for the SFI translations
x86,mrst: Power control commands update
x86/reboot: Blacklist Dell OptiPlex 990 known to require PCI reboot
x86, UV: Fix UV2 hub part number
x86: Add user_mode_vm check in stack_overflow_check
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Fix loss of notification with multi-event
perf, x86: Force IBS LVT offset assignment for family 10h
perf, x86: Disable PEBS on SandyBridge chips
trace_events_filter: Use rcu_assign_pointer() when setting ftrace_event_call->filter
perf session: Fix crash with invalid CPU list
perf python: Fix undefined symbol problem
perf/x86: Enable raw event access to Intel offcore events
perf: Don't use -ENOSPC for out of PMU resources
perf: Do not set task_ctx pointer in cpuctx if there are no events in the context
perf/x86: Fix PEBS instruction unwind
oprofile, x86: Fix crash when unloading module (nmi timer mode)
oprofile: Fix crash when unloading module (hr timer mode)
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched, x86: Avoid unnecessary overflow in sched_clock
sched: Fix buglet in return_cfs_rq_runtime()
sched: Avoid SMT siblings in select_idle_sibling() if possible
sched: Set the command name of the idle tasks in SMP kernels
sched, rt: Provide means of disabling cross-cpu bandwidth sharing
sched: Document wait_for_completion_*() return values
sched_fair: Fix a typo in the comment describing update_sd_lb_stats
sched: Add a comment to effective_load() since it's a pain
I've received complaints that the numa_node attribute for family
15h model 00-0fh (e.g. Interlagos) northbridge functions shows
-1 instead of the proper node ID.
Correct this with attached quirks (similar to quirks for other
AMD CPU families used in multi-socket systems).
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/20111202072143.GA31916@alberich.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel MID x86 platforms have a memory mapped virtual RTC
instead. No MID platform have the default ports (and
accessing them may do weird stuff).
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Cc: feng.tang@intel.com
Cc: Feng Tang <feng.tang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix an outstanding issue that has been reported since 2.6.37.
Under a heavy loaded machine processing "fork()" calls could
crash with:
BUG: unable to handle kernel paging request at f573fc8c
IP: [<c01abc54>] swap_count_continued+0x104/0x180
*pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000 Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
EIP is at swap_count_continued+0x104/0x180
.. snip..
Call Trace:
[<c01ac222>] ? __swap_duplicate+0xc2/0x160
[<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
[<c01ac2e4>] ? swap_duplicate+0x14/0x40
[<c01a0a6b>] ? copy_pte_range+0x45b/0x500
[<c01a0ca5>] ? copy_page_range+0x195/0x200
[<c01328c6>] ? dup_mmap+0x1c6/0x2c0
[<c0132cf8>] ? dup_mm+0xa8/0x130
[<c013376a>] ? copy_process+0x98a/0xb30
[<c013395f>] ? do_fork+0x4f/0x280
[<c01573b3>] ? getnstimeofday+0x43/0x100
[<c010f770>] ? sys_clone+0x30/0x40
[<c06c048d>] ? ptregs_clone+0x15/0x48
[<c06bfb71>] ? syscall_call+0x7/0xb
The problem is that in copy_page_range() we turn lazy mode on,
and then in swap_entry_free() we call swap_count_continued()
which ends up in:
map = kmap_atomic(page, KM_USER0) + offset;
and then later we touch *map.
Since we are running in batched mode (lazy) we don't actually
set up the PTE mappings and the kmap_atomic is not done
synchronously and ends up trying to dereference a page that has
not been set.
Looking at kmap_atomic_prot_pfn(), it uses
'arch_flush_lazy_mmu_mode' and doing the same in
kmap_atomic_prot() and __kunmap_atomic() makes the problem go
away.
Interestingly, commit b8bcfe997e ("x86/paravirt: remove lazy
mode in interrupts") removed part of this to fix an interrupt
issue - but it went to far and did not consider this scenario.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Looks like on some Acer Aspire 1s with older bioses, reboot via bios
fails. It works on my machine, (with BIOS version 0.3310) but
not on some others (BIOS version 0.3309).
There's a log of problems at:
https://bbs.archlinux.org/viewtopic.php?id=124136
This patch adds a different callback to the reboot quirk table,
to allow rebooting via keybaord controller.
Reported-by: Uroš Vampl <mobile.leecher@gmail.com>
Tested-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Peter Chubb <peter.chubb@nicta.com.au>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stable@kernel.org
Link: http://lkml.kernel.org/r/1323093233-9481-1-git-send-email-anarsoul@gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Following is from Notes of section 11.5.3 of Intel processor
manual available at:
http://www.intel.com/Assets/PDF/manual/325384.pdf
For the Pentium 4 and Intel Xeon processors, after the sequence of
steps given above has been executed, the cache lines containing the
code between the end of the WBINVD instruction and before the
MTRRS have actually been disabled may be retained in the cache
hierarchy. Here, to remove code from the cache completely, a
second WBINVD instruction must be executed after the MTRRs have
been disabled.
This patch provides resolution for that.
Ideally, I will like to make changes only for Pentium 4 and Xeon
processors. But, I am not finding easier way to do it.
And, extra wbinvd() instruction does not hurt much for other
processors.
Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@profusion.mobi>
Link: http://lkml.kernel.org/r/4EBD1CC5.3030008@oracle.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since new Intel software developers manual introduces
new format for AVX instruction set (including AVX2),
it is important to update x86-opcode-map.txt to fit
those changes.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120557.15475.13236.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix instruction decoder test (insn_sanity), so that
it doesn't show both info and error messages twice on
same instruction. (In that case, show only error message)
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120545.15475.7928.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For reducing memory usage of attribute table, x86 instruction
decoder puts "Group" attribute only on "no-last-prefix"
attribute table (same as vex_p == 0 case).
Thus, the decoder should look no-last-prefix table first, and
then only if it is not a group, move on to "with-last-prefix"
table (vex_p != 0).
However, current implementation, inat_get_avx_attribute()
looks with-last-prefix directly. So, when decoding
a grouped AVX instruction, the decoder fails to find correct
group because there is no "Group" attribute on the table.
This ends up with the mis-decoding of instructions, as Ingo
reported in http://thread.gmane.org/gmane.linux.kernel/1214103
This patch fixes it to check no-last-prefix table first
even if that is an AVX instruction, and get an attribute from
"with last-prefix" table only if that is not a group.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: yrl.pp-manager.tt@hitachi.com
Link: http://lkml.kernel.org/r/20111205120539.15475.91428.stgit@cloud
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Recently, I got bitten by using rdmsr_safe too early in the boot
process. Document its shortcomings for future reference.
Link: http://lkml.kernel.org/r/4ED5B70F.606@lwfinger.net
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
In commit f8924e770e ("x86: unify mp_bus_info"), the 32-bit
and 64-bit versions of MP_bus_info were rearranged to match each
other better. Unfortunately it introduced a regression: prior
to that change we used to always set the mp_bus_not_pci bit,
then clear it if we found a PCI bus. After it, we set
mp_bus_not_pci for ISA buses, clear it for PCI buses, and leave
it alone otherwise.
In the cases of ISA and PCI, there's not much difference. But
ISA is not the only non-PCI bus, so it's better to always set
mp_bus_not_pci and clear it only for PCI.
Without this change, Dan's Dell PowerEdge 4200 panics on boot
with a log indicating interrupt routing trouble unless the
"noapic" option is supplied. With this change, the machine
boots reliably without "noapic".
Fixes http://bugs.debian.org/586494
Reported-bisected-and-tested-by: Dan McGrath <troubledaemon@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org # 2.6.26+
Cc: Dan McGrath <troubledaemon@gmail.com>
Cc: Alexey Starikovskiy <aystarik@gmail.com>
[jrnieder@gmail.com: clarified commit message]
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
Link: http://lkml.kernel.org/r/20111122215000.GA9151@elie.hsd1.il.comcast.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In latest firmware's SFI tables, pmic_gpio has been set to
IPC type of device, so we need handle it too.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add SFI glue for the following devices:
tca6416: a gpio expander compatible with max7315
mpu3050: gyro sensor
Both of these actual drivers are already upstream
Signed-off-by: Jekyll Lai <jekyll_lai@wistron.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On the Intel MID devices SCU commands are issued to manage power
off and the like. We need to issue different ones for
non-Lincroft based devices.
Signed-off-by: Alek Du <alek.du@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Dell OptiPlex 990 is known to require PCI reboot, so add it to
the reboot blacklist in pci_reboot_dmi_table[].
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/201111160019.51303.rjw@sisk.pl
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There was a mixup when the SGI UV2 hub chip was sent to be
fabricated, and it ended up with the wrong part number in the
HRP_NODE_ID mmr. Future versions of the chip will (may) have the
correct part number. Change the UV infrastructure to recognize
both part numbers as valid IDs of a UV2 hub chip.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/20111129210058.GA20452@sgi.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The kernel stack overflow is checked in stack_overflow_check(),
which may wrongly detect the overflow if the stack pointer in
user space points to the kernel stack intentionally or
accidentally. So, the actual overflow is never detected after
this misdetection because WARN_ONCE() is used on the detection
of it.
This patch adds user-mode-vm checking before it to avoid this
problem and bails out early if the user stack is used.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Randy Dunlap <rdunlap@xenotime.net>
Link: http://lkml.kernel.org/r/20111129060821.11076.55315.stgit@ltc219.sdl.hitachi.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
On AMD family 10h we see firmware bug messages like the following:
[Firmware Bug]: cpu 6, try to use APIC500 (LVT offset 0) for vector 0x10400, but the register is already in use for vector 0xf9 on another cpu
[Firmware Bug]: cpu 6, IBS interrupt offset 0 not available (MSRC001103A=0x0000000000000100)
[Firmware Bug]: using offset 1 for IBS interrupts
[Firmware Bug]: workaround enabled for IBS LVT offset
perf: AMD IBS detected (0x00000007)
We always see this, since the offsets are not assigned by the BIOS for
this family. Force LVT offset assignment in this case. If the OS
assignment fails, fallback to BIOS settings and try to setup this.
The fallback to BIOS settings weakens the family check since
force_ibs_eilvt_setup() may fail e.g. in case of virtual machines.
But setup may still succeed if BIOS offsets are correct.
Other families don't have a workaround implemented that assigns LVT
offsets. It's ok, to drop calling force_ibs_eilvt_setup() for that
families.
With the patch the [Firmware Bug] messages vanish. We see now:
IBS: LVT offset 1 assigned
perf: AMD IBS detected (0x00000007)
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111109162225.GO12451@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
People with old AMD chips are getting hung boots, because commit
bcb80e5387 ("x86, microcode, AMD: Add microcode revision to
/proc/cpuinfo") moved the microcode detection too early into
"early_init_amd()".
At that point we are *so* early in the booth that the exception tables
haven't even been set up yet, so the whole
rdmsr_safe(MSR_AMD64_PATCH_LEVEL, &c->microcode, &dummy);
doesn't actually work: if the rdmsr does a GP fault (due to non-existant
MSR register on older CPU's), we can't fix it up yet, and the boot fails.
Fix it by simply moving the code to a slightly later point in the boot
(init_amd() instead of early_init_amd()), since the kernel itself
doesn't even really care about the microcode patchlevel at this point
(or really ever: it's made available to user space in /proc/cpuinfo, and
updated if you do a microcode load).
Reported-tested-and-bisected-by: Larry Finger <Larry.Finger@lwfinger.net>
Tested-by: Bob Tracy <rct@gherkin.frus.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The idea behind commit d91ee5863b ("cpuidle: replace xen access to x86
pm_idle and default_idle") was to have one call - disable_cpuidle()
which would make pm_idle not be molested by other code. It disallows
cpuidle_idle_call to be set to pm_idle (which is excellent).
But in the select_idle_routine() and idle_setup(), the pm_idle can still
be set to either: amd_e400_idle, mwait_idle or default_idle. This
depends on some CPU flags (MWAIT) and in AMD case on the type of CPU.
In case of mwait_idle we can hit some instances where the hypervisor
(Amazon EC2 specifically) sets the MWAIT and we get:
Brought up 2 CPUs
invalid opcode: 0000 [#1] SMP
Pid: 0, comm: swapper Not tainted 3.1.0-0.rc6.git0.3.fc16.x86_64 #1
RIP: e030:[<ffffffff81015d1d>] [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
...
Call Trace:
[<ffffffff8100e2ed>] cpu_idle+0xae/0xe8
[<ffffffff8149ee78>] cpu_bringup_and_idle+0xe/0x10
RIP [<ffffffff81015d1d>] mwait_idle+0x6f/0xb4
RSP <ffff8801d28ddf10>
In the case of amd_e400_idle we don't get so spectacular crashes, but we
do end up making an MSR which is trapped in the hypervisor, and then
follow it up with a yield hypercall. Meaning we end up going to
hypervisor twice instead of just once.
The previous behavior before v3.0 was that pm_idle was set to
default_idle regardless of select_idle_routine/idle_setup.
We want to do that, but only for one specific case: Xen. This patch
does that.
Fixes RH BZ #739499 and Ubuntu #881076
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Prevent tracing of preempt_disable() in get_cpu_var() in
kvm_clock_read(). When CONFIG_DEBUG_PREEMPT is enabled,
preempt_disable/enable() are traced and this causes the function_graph
tracer to go into an infinite recursion. By open coding the
preempt_disable() around the get_cpu_var(), we can use the notrace
version which prevents preempt_disable/enable() from being traced and
prevents the recursion.
Based on a similar patch for Xen from Jeremy Fitzhardinge.
Tested-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
random: Fix handing of arch_get_random_long in get_random_bytes()
x86: Call stop_machine_text_poke() on all CPUs
x86, ioapic: Only print ioapic debug information for IRQs belonging to an ioapic chip
x86/mrst: Avoid reporting wrong nmi status
x86/mrst: Add support for Penwell clock calibration
x86/apic: Allow use of lapic timer early calibration result
x86/apic: Do not clear nr_irqs_gsi if no legacy irqs
x86/platform: Add a wallclock_init func to x86_platforms ops
x86/mce: Make mce_chrdev_ops 'static const'
* 'stable/for-linus-fixes-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen-gntalloc: signedness bug in add_grefs()
xen-gntalloc: integer overflow in gntalloc_ioctl_alloc()
xen-gntdev: integer overflow in gntdev_alloc_map()
xen:pvhvm: enable PVHVM VCPU placement when using more than 32 CPUs.
xen/balloon: Avoid OOM when requesting highmem
xen: Remove hanging references to CONFIG_XEN_PLATFORM_PCI
xen: map foreign pages for shared rings by updating the PTEs directly
Support guest/host-only profiling by switch perf msrs on
a guest entry if needed.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Some cpus have special support for switching PERF_GLOBAL_CTRL msr.
Add logic to detect if such support exists and works properly and extend
msr switching code to use it if available. Also extend number of generic
msr switching entries to 8.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
(Added the missing signed-off-by line)
In hundreds of days, the __cycles_2_ns calculation in sched_clock
has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
the final value to become zero. We can solve this without losing
any precision.
We can decompose TSC into quotient and remainder of division by the
scale factor, and then use this to convert TSC into nanoseconds.
Signed-off-by: Salman Qazi <sqazi@google.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Reviewed-by: Paul Turner <pjt@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111115221121.7262.88871.stgit@dungbeetle.mtv.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
PVHVM running with more than 32 vcpus and pv_irq/pv_time enabled
need VCPU placement to work, or else it will softlockup.
CC: stable@kernel.org
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When mapping a foreign page with xenbus_map_ring_valloc() with the
GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
pass a pointer to the PTE (in init_mm).
After the page is mapped, the usual fault mechanism can be used to
update additional MMs. This allows the vmalloc_sync_all() to be
removed from alloc_vm_area().
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
[v1: Squashed fix by Michal for no-mmu case]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>
It appears that stop_machine_text_poke() wants to be called on all CPUs,
like it's done from text_poke_smp(). Fix text_poke_smp_batch() to do
this.
Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Jason Baron <jbaron@redhat.com>
Link: http://lkml.kernel.org/r/1319702072-32676-1-git-send-email-rabin@rab.in
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that the core offcore support is fixed up (thanks Stephane) and we
have sane generic events utilizing them, re-enable the raw access to
the feature as well.
Note that it doesn't matter if you use event 0x1b7 or 0x1bb to specify
an offcore event, either one works and neither guarantees you'll end
up on a particular offcore MSR.
Based on original patch from: Vince Weaver <vweaver1@eecs.utk.edu>.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>.
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1108031200390.703@cl320.eecs.utk.edu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
People (Linus) objected to using -ENOSPC to signal not having enough
resources on the PMU to satisfy the request. Use -EINVAL.
Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-xv8geaz2zpbjhlx0svmpp28n@git.kernel.org
[ merged to newer kernel, fixed up MIPS impact ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Masami spotted that we always try to decode the instruction stream as
64bit instructions when running a 64bit kernel, this doesn't work for
ia32-compat proglets.
Use TIF_IA32 to detect if we need to use the 32bit instruction
decoder.
Reported-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This needed the sfi IRQ 0xFF fix to go in first. It simply plumbs in the
bma023 driver with the firmware naming of it.
Signed-off-by: William Douglas <william.douglas@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Real world year equals the value in vrtc YEAR register plus an offset.
We used 1960 as the offset to make leap year consistent, but for a
device's first use, its YEAR register is 0 and the system year will
be parsed as 1960 which is not a valid UNIX time and will cause many
applications to fail mysteriously. So we use 1972 instead to fix this
issue.
Updated patch which adds a sanity check suggested by Mathias
This isn't a change in behaviour for systems, because 1972 is the one we
actually use. It's the old version in upstream which is out of sync with
all devices.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix a build error. CE4100 with no serial errors because the alternate
function is only a prototype not a null function as intended.
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
with "apic=verbose" the print_IO_APIC() function tries to print
IRQ to pin mappings for every active irq. It assumes chip_data
is of type irq_cfg and may cause an oops if not.
As the print_IO_APIC() is called from a late_initcall other
chained irq chips may already be registered with custom
chip_data information, causing an oops. This is the case with
intel MID SoC devices with gpio demuxers registered as irq_chips.
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
[ -v2: fixed build failure ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Moorestown/Medfield platform does not have port 0x61 to report
NMI status, nor does it have external NMI sources. The only NMI
sources are from lapic, as results of perf counter overflow or
IPI, e.g. NMI watchdog or spin lock debug.
Reading port 0x61 on Moorestown will return 0xff which misled
NMI handlers to false critical errors such memory parity error.
The subsequent ioport access for NMI handling can also cause
undefined behavior on Moorestown.
This patch allows kernel process NMI due to watchdog or backrace
dump without unnecessary hangs.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[hand applied]
Signed-off-by: Alan Cox <alan@linux.intel.com>
lapic timer calibration can be combined with tsc in platform
specific calibration functions. if such calibration result is
obtained early, we can skip the redundant calibration loops.
Signed-off-by: Jacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
nr_legacy_irqs is set in probe_nr_irqs_gsi, we should not clear
it after that. Otherwise, the result is that MSI irqs will be
allocated from the wrong range for the systems without legacy
PIC.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some wall clock devices use MMIO based HW register, this new
function will give them a chance to do some initialization work
before their get/set_time service get called.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Signed-off-by: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a sanity test of x86 insn decoder against a stream
of randomly generated input, at build time.
This test is also able to reproduce any bug that might
trigger by allowing the passing of random-seed and
iteration-number to the test, or by passing input
which has invalid byte code.
Changes in V2:
- Code cleanup.
- Show how to reproduce the error by insn_sanity test.
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: acme@redhat.com
Cc: ming.m.lin@intel.com
Cc: robert.richter@amd.com
Cc: ravitillo@lbl.gov
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20111020140109.20938.92572.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Arjan would like to make struct file_operations const, but
mce-inject directly writes to the mce_chrdev_ops to install its
write handler. In an ideal world mce-inject would have its own
character device, but we have a sizable legacy of test scripts
that hardwire "/dev/mcelog", so it would be painful to switch to
a separate device now. Instead, this patch switches to a stub
function in the mce code, with a registration helper that
mce-inject can call when it is loaded.
Note that this would also allow for a sane process to allow
mce-inject to be unloaded again (with an unregister function,
and appropriate module_{get,put}() calls), but that is left for
potential future patches.
Reported-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4eb2e1971326651a3b@agluck-desktop.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
cpuidle: Single/Global registration of idle states
cpuidle: Split cpuidle_state structure and move per-cpu statistics fields
cpuidle: Remove CPUIDLE_FLAG_IGNORE and dev->prepare()
cpuidle: Move dev->last_residency update to driver enter routine; remove dev->last_state
ACPI: Fix CONFIG_ACPI_DOCK=n compiler warning
ACPI: Export FADT pm_profile integer value to userspace
thermal: Prevent polling from happening during system suspend
ACPI: Drop ACPI_NO_HARDWARE_INIT
ACPI atomicio: Convert width in bits to bytes in __acpi_ioremap_fast()
PNPACPI: Simplify disabled resource registration
ACPI: Fix possible recursive locking in hwregs.c
ACPI: use kstrdup()
mrst pmu: update comment
tools/power turbostat: less verbose debugging
* 'upstream/jump-label-noearly' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
jump-label: initialize jump-label subsystem much earlier
x86/jump_label: add arch_jump_label_transform_static()
s390/jump-label: add arch_jump_label_transform_static()
jump_label: add arch_jump_label_transform_static() to optimise non-live code updates
sparc/jump_label: drop arch_jump_label_text_poke_early()
x86/jump_label: drop arch_jump_label_text_poke_early()
jump_label: if a key has already been initialized, don't nop it out
stop_machine: make stop_machine safe and efficient to call early
jump_label: use proper atomic_t initializer
Conflicts:
- arch/x86/kernel/jump_label.c
Added __init_or_module to arch_jump_label_text_poke_early vs
removal of that function entirely
- kernel/stop_machine.c
same patch ("stop_machine: make stop_machine safe and efficient
to call early") merged twice, with whitespace fix in one version
* 'upstream/xen-settime' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xen/dom0: set wallclock time in Xen
xen: add dom0_op hypercall
xen/acpi: Domain0 acpi parser related platform hypercall
* 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
Revert "tracing: Include module.h in define_trace.h"
irq: don't put module.h into irq.h for tracking irqgen modules.
bluetooth: macroize two small inlines to avoid module.h
ip_vs.h: fix implicit use of module_get/module_put from module.h
nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
include: replace linux/module.h with "struct module" wherever possible
include: convert various register fcns to macros to avoid include chaining
crypto.h: remove unused crypto_tfm_alg_modname() inline
uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
pm_runtime.h: explicitly requires notifier.h
linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
miscdevice.h: fix up implicit use of lists and types
stop_machine.h: fix implicit use of smp.h for smp_processor_id
of: fix implicit use of errno.h in include/linux/of.h
of_platform.h: delete needless include <linux/module.h>
acpi: remove module.h include from platform/aclinux.h
miscdevice.h: delete unnecessary inclusion of module.h
device_cgroup.h: delete needless include <linux/module.h>
net: sch_generic remove redundant use of <linux/module.h>
net: inet_timewait_sock doesnt need <linux/module.h>
...
Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
- drivers/media/dvb/frontends/dibx000_common.c
- drivers/media/video/{mt9m111.c,ov6650.c}
- drivers/mfd/ab3550-core.c
- include/linux/dmaengine.h
* 'trivial' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild:
scsi: drop unused Kconfig symbol
pci: drop unused Kconfig symbol
stmmac: drop unused Kconfig symbol
x86: drop unused Kconfig symbol
powerpc: drop unused Kconfig symbols
powerpc: 40x: drop unused Kconfig symbol
mips: drop unused Kconfig symbols
openrisc: drop unused Kconfig symbols
arm: at91: drop unused Kconfig symbol
samples: drop unused Kconfig symbol
m32r: drop unused Kconfig symbol
score: drop unused Kconfig symbols
sh: drop unused Kconfig symbol
um: drop unused Kconfig symbol
sparc: drop unused Kconfig symbol
alpha: drop unused Kconfig symbol
Fix up trivial conflict in drivers/net/ethernet/stmicro/stmmac/Kconfig
as per Michal: the STMMAC_DUAL_MAC config variable is still unused and
should be deleted.
* 'stable/vmalloc-3.2' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
net: xen-netback: use API provided by xenbus module to map rings
block: xen-blkback: use API provided by xenbus module to map rings
xen: use generic functions instead of xen_{alloc, free}_vm_area()
The legacy x86 nmi watchdog code was removed with the implementation
of the perf based nmi watchdog. This broke Oprofile's nmi timer
mode. To run nmi timer mode we relied on a continuous ticking nmi
source which the nmi watchdog provided. The nmi tick was no longer
available and current watchdog can not be used anymore since it runs
with very long periods in the range of seconds. This patch
reimplements the nmi timer mode using a perf counter nmi source.
V2:
* removing pr_info()
* fix undefined reference to `__udivdi3' for 32 bit build
* fix section mismatch of .cpuinit.data:nmi_timer_cpu_nb
* removed nmi timer setup in arch/x86
* implemented function stubs for op_nmi_init/exit()
* made code more readable in oprofile_init()
V3:
* fix architectural initialization in oprofile_init()
* fix CONFIG_OPROFILE_NMI_TIMER dependencies
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Robert Richter <robert.richter@amd.com>
We need this to better test x86 NMI timer mode. Otherwise it is very
hard to setup systems with NMI timer enabled, but we have this e.g. in
virtual machine environments.
Signed-off-by: Robert Richter <robert.richter@amd.com>
* 'for-next' of git://git.infradead.org/users/sameo/mfd-2.6: (80 commits)
mfd: Fix missing abx500 header file updates
mfd: Add missing <linux/io.h> include to intel_msic
x86, mrst: add platform support for MSIC MFD driver
mfd: Expose TurnOnStatus in ab8500 sysfs
mfd: Remove support for early drop ab8500 chip
mfd: Add support for ab8500 v3.3
mfd: Add ab8500 interrupt disable hook
mfd: Convert db8500-prcmu panic() into pr_crit()
mfd: Refactor db8500-prcmu request_clock() function
mfd: Rename db8500-prcmu init function
mfd: Fix db5500-prcmu defines
mfd: db8500-prcmu voltage domain consumers additions
mfd: db8500-prcmu reset code retrieval
mfd: db8500-prcmu tweak for modem wakeup
mfd: Add db8500-pcmu watchdog accessor functions for watchdog
mfd: hwacc power state db8500-prcmu accessor
mfd: Add db8500-prcmu accessors for PLL and SGA clock
mfd: Move to the new db500 PRCMU API
mfd: Create a common interface for dbx500 PRCMU drivers
mfd: Initialize DB8500 PRCMU regs
...
Fix up trivial conflicts in
arch/arm/mach-imx/mach-mx31moboard.c
arch/arm/mach-omap2/board-omap3beagle.c
arch/arm/mach-u300/include/mach/irqs.h
drivers/mfd/wm831x-spi.c
* 'linux_next' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-edac: (21 commits)
MAINTAINERS: add an entry for Edac Sandy Bridge driver
edac: tag sb_edac as EXPERIMENTAL, as it requires more testing
EDAC: Fix incorrect edac mode reporting in sb_edac
edac: sb_edac: Add it to the building system
edac: Add an experimental new driver to support Sandy Bridge CPU's
i7300_edac: Fix error cleanup logic
i7core_edac: Initialize memory name with cpu, channel, bank
i7core_edac: Fix compilation on 32 bits arch
i7core_edac: scrubbing fixups
EDAC: Correct Kconfig dependencies
i7core_edac: return -ENODEV if no MC is found
i7core_edac: use edac's own way to print errors
MAINTAINERS: remove dropped edac_mce.* from the file
i7core_edac: Drop the edac_mce facility
x86, MCE: Use notifier chain only for MCE decoding
EDAC i7core: Use mce socketid for better compatibility
i7core_edac: Don't enable memory scrubbing for Xeon 35xx
i7core_edac: Add scrubbing support
edac: Move edac main structs to include/linux/edac.h
i7core_edac: Fix oops when trying to inject errors
...
Says Andrew:
"60 patches. That's good enough for -rc1 I guess. I have quite a lot
of detritus to be rechecked, work through maintainers, etc.
- most of the remains of MM
- rtc
- various misc
- cgroups
- memcg
- cpusets
- procfs
- ipc
- rapidio
- sysctl
- pps
- w1
- drivers/misc
- aio"
* akpm: (60 commits)
memcg: replace ss->id_lock with a rwlock
aio: allocate kiocbs in batches
drivers/misc/vmw_balloon.c: fix typo in code comment
drivers/misc/vmw_balloon.c: determine page allocation flag can_sleep outside loop
w1: disable irqs in critical section
drivers/w1/w1_int.c: multiple masters used same init_name
drivers/power/ds2780_battery.c: fix deadlock upon insertion and removal
drivers/power/ds2780_battery.c: add a nolock function to w1 interface
drivers/power/ds2780_battery.c: create central point for calling w1 interface
w1: ds2760 and ds2780, use ida for id and ida_simple_get() to get it
pps gpio client: add missing dependency
pps: new client driver using GPIO
pps: default echo function
include/linux/dma-mapping.h: add dma_zalloc_coherent()
sysctl: make CONFIG_SYSCTL_SYSCALL default to n
sysctl: add support for poll()
RapidIO: documentation update
drivers/net/rionet.c: fix ethernet address macros for LE platforms
RapidIO: fix potential null deref in rio_setup_device()
RapidIO: add mport driver for Tsi721 bridge
...
This avoids duplicating the function in every arch gup_fast.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>