Since the guest and host ptes can have wildly different format, adjust
the pte accessor names to indicate on which type of pte they operate on.
No functional changes.
Signed-off-by: Avi Kivity <avi@redhat.com>
is_dirty_pte() is used on guest ptes, not shadow ptes, so it needs to avoid
shadow_dirty_mask and use PT_DIRTY_MASK instead.
Misdetecting dirty pages could lead to unnecessarily setting the dirty bit
under EPT.
Signed-off-by: Avi Kivity <avi@redhat.com>
"Unrestricted Guest" feature is added in the VMX specification.
Intel Westmere and onwards processors will support this feature.
It allows kvm guests to run real mode and unpaged mode
code natively in the VMX mode when EPT is turned on. With the
unrestricted guest there is no need to emulate the guest real mode code
in the vm86 container or in the emulator. Also the guest big real mode
code works like native.
The attached patch enhances KVM to use the unrestricted guest feature
if available on the processor. It also adds a new kernel/module
parameter to disable the unrestricted guest feature at the boot time.
Signed-off-by: Nitin A Kamble <nitin.a.kamble@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Protect irq injection/acking data structures with a separate irq_lock
mutex. This fixes the following deadlock:
CPU A CPU B
kvm_vm_ioctl_deassign_dev_irq()
mutex_lock(&kvm->lock); worker_thread()
-> kvm_deassign_irq() -> kvm_assigned_dev_interrupt_work_handler()
-> deassign_host_irq() mutex_lock(&kvm->lock);
-> cancel_work_sync() [blocked]
[gleb: fix ia64 path]
Reported-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
None of the interface services the LAPIC emulation provides need to be
exported to modules, and kvm_lapic_get_base is even totally unused
today.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of reloading the pdptrs on every entry and exit (vmcs writes on vmx,
guest memory access on svm) extract them on demand.
Signed-off-by: Avi Kivity <avi@redhat.com>
Instead of reading the PDPTRs from memory after every exit (which is slow
and wrong, as the PDPTRs are stored on the cpu), sync the PDPTRs from
memory to the VMCS before entry, and from the VMCS to memory after exit.
Do the same for cr3.
Signed-off-by: Avi Kivity <avi@redhat.com>
vmx_set_cr3() will call vmx_tlb_flush(), which will flush the ept context.
So there is no need to call ept_sync_context() explicitly.
Signed-off-by: Avi Kivity <avi@redhat.com>
We currently publish the i8254 resources to the pio_bus before the devices
are fully initialized. Since we hold the pit_lock, its probably not
a real issue. But lets clean this up anyway.
Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
We modernize the io_device code so that we use container_of() instead of
dev->private, and move the vtable to a separate ops structure
(theoretically allows better caching for multiple instances of the same
ops structure)
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
Since AMD does not support sysenter in 64bit mode, the VMCB fields storing
the MSRs are truncated to 32bit upon VMRUN/#VMEXIT. So store the values
in a separate 64bit storage to avoid truncation.
[andre: fix amd->amd migration]
Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The in-kernel speaker emulation is only a dummy and also unneeded from
the performance point of view. Rather, it takes user space support to
generate sound output on the host, e.g. console beeps.
To allow this, introduce KVM_CREATE_PIT2 which controls in-kernel
speaker port emulation via a flag passed along the new IOCTL. It also
leaves room for future extensions of the PIT configuration interface.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM provides a complete virtual system environment for guests, including
support for injecting interrupts modeled after the real exception/interrupt
facilities present on the native platform (such as the IDT on x86).
Virtual interrupts can come from a variety of sources (emulated devices,
pass-through devices, etc) but all must be injected to the guest via
the KVM infrastructure. This patch adds a new mechanism to inject a specific
interrupt to a guest using a decoupled eventfd mechnanism: Any legal signal
on the irqfd (using eventfd semantics from either userspace or kernel) will
translate into an injected interrupt in the guest at the next available
interrupt window.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The problem exists only on VMX. Also currently we skip this step if
there is pending exception. The patch fixes this too.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use proper foo-y style list additions to cleanup all the conditionals,
move module selection after compound object selection and remove the
superflous comment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
If we run out of cpuid entries for extended request types
we should return -E2BIG, just like we do for the standard
request types.
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Replace 0xc0010010 with MSR_K8_SYSCFG and 0xc0010015 with MSR_K7_HWCR.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The related MSRs are emulated. MCE capability is exported via
extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new
vcpu ioctl command KVM_X86_SETUP_MCE is used to setup MCE emulation
such as the mcg_cap. MCE is injected via vcpu ioctl command
KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are
not implemented.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Use standard msr-index.h's MSR declaration.
MSR_IA32_TSC is better than MSR_IA32_TIME_STAMP_COUNTER as it also solves
80 column issue.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When reinjecting a software interrupt or exception, use the correct
instruction length provided by the hardware instead of a hardcoded 1.
Fixes problems running the suse 9.1 livecd boot loader.
Problem introduced by commit f0a3602c20 ("KVM: Move interrupt injection
logic to x86.c").
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
We need to have a stronger barrier between releasing the lock and
checking for any waiting spinners. A compiler barrier is not sufficient
because the CPU's ordering rules do not prevent the read xl->spinners
from happening before the unlock assignment, as they are different
memory locations.
We need to have an explicit barrier to enforce the write-read ordering
to different memory locations.
Because of it, I can't bring up > 4 HVM guests on one SMP machine.
[ Code and commit comments expanded -J ]
[ Impact: avoid deadlock when using Xen PV spinlocks ]
Signed-off-by: Yang Xiaowei <xiaowei.yang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Where possible we enable interrupts while waiting for a spinlock to
become free, in order to reduce big latency spikes in interrupt handling.
However, at present if we manage to pick up the spinlock just before
blocking, we'll end up holding the lock with interrupts enabled for a
while. This will cause a deadlock if we recieve an interrupt in that
window, and the interrupt handler tries to take the lock too.
Solve this by shrinking the interrupt-enabled region to just around the
blocking call.
[ Impact: avoid race/deadlock when using Xen PV spinlocks ]
Reported-by: "Yang, Xiaowei" <xiaowei.yang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
-fstack-protector uses a special per-cpu "stack canary" value.
gcc generates special code in each function to test the canary to make
sure that the function's stack hasn't been overrun.
On x86-64, this is simply an offset of %gs, which is the usual per-cpu
base segment register, so setting it up simply requires loading %gs's
base as normal.
On i386, the stack protector segment is %gs (rather than the usual kernel
percpu %fs segment register). This requires setting up the full kernel
GDT and then loading %gs accordingly. We also need to make sure %gs is
initialized when bringing up secondary cpus too.
To keep things consistent, we do the full GDT/segment register setup on
both architectures.
Because we need to avoid -fstack-protected code before setting up the GDT
and because there's no way to disable it on a per-function basis, several
files need to have stack-protector inhibited.
[ Impact: allow Xen booting with stack-protector enabled ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Stanse found a pci reference leak in quirk_amd_nb_node.
Instead of putting nb_ht, there is a put of dev passed as
an argument.
http://stanse.fi.muni.cz/
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Fix address passed to cpa_flush_range() when changing page
attributes from WB to UC. The address (*addr) is
modified by __change_page_attr_set_clr(). The result is that
the pages being flushed start at the _end_ of the changed range
instead of the beginning.
This should be considered for 2.6.30-stable and 2.6.31-stable.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Stable team <stable@kernel.org>
This avoids a "Malformed early option 'iommu'" on boot when trying
to use pass-through mode.
Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
The current mp_bus_to_node array is initialized only by AMD specific
code, since AMD platforms have registers that can be used for
determining mode numbers. On new Intel platforms it's necessary to
initialize this array as well though, otherwise all PCI node numbers
will be 0, when in fact they should be -1 (indicating that I/O isn't
tied to any particular node).
So move the mp_bus_to_node code into the common PCI code, and
initialize it early with a default value of -1. This may be overridden
later by arch code (e.g. the AMD code).
With this change, PCI consistent memory and other node specific
allocations (e.g. skbuff allocs) should occur on the "current" node.
If, for performance reasons, applications want to be bound to specific
nodes, they should open their devices only after being pinned to the
CPU where they'll run, for maximum locality.
Acked-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Jesse Brandeburg <jesse.brandeburg@gmail.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
This was #define'd as 0 on all platforms, so let's get rid of it.
This change makes pci_scan_slot() slightly easier to read.
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Tony Luck <tony.luck@intel.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Russell King <linux@arm.linux.org.uk>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Chiang <achiang@hp.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
There are cases where full date information is required instead of
just the year. Add month and day parsing to dmi_get_year() and rename
it to dmi_get_date().
As the original function only required '/' followed by any number of
parseable characters at the end of the string, keep that behavior to
avoid upsetting existing users.
The new function takes dates of format [mm[/dd]]/yy[yy]. Year, month
and date are checked to be in the ranges of [1-9999], [1-12] and
[1-31] respectively and any invalid or out-of-range component is
returned as zero.
The dummy implementation is updated accordingly but the return value
is updated to indicate field not found which is consistent with how
other dummy functions behave.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
Now that SD_WAKE_IDLE doesn't make pipe-test suck anymore,
enable it by default for MC, CPU and NUMA domains.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some NUMA messages in srat_32.c are confusing to users,
because they seem to indicate errors, while in fact they
reflect normal behaviour.
Decrease the level of these messages to KERN_DEBUG so that
they don't show up unnecessarily.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
LKML-Reference: <200909050107.45175.rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change msr-reg.o to obj-y (it will be included in virtually every
kernel since it is used by the initialization code for AMD processors)
and add a separate C file to export its symbols to modules, so that
msr.ko can use them; on uniprocessors we bypass the helper functions
in msr.o and use the accessor functions directly via inlines.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <20090904140834.GA15789@elte.hu>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Ingo Molnar reported the following kmemcheck warning when running both
kmemleak and kmemcheck enabled:
PM: Adding info for No Bus:vcsa7
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory
(f6f6e1a4)
d873f9f600000000c42ae4c1005c87f70000000070665f666978656400000000
i i i i u u u u i i i i i i i i i i i i i i i i i i i i i u u u
^
Pid: 3091, comm: kmemleak Not tainted (2.6.31-rc7-tip #1303) P4DC6
EIP: 0060:[<c110301f>] EFLAGS: 00010006 CPU: 0
EIP is at scan_block+0x3f/0xe0
EAX: f40bd700 EBX: f40bd780 ECX: f16b46c0 EDX: 00000001
ESI: f6f6e1a4 EDI: 00000000 EBP: f10f3f4c ESP: c2605fcc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: e89a4844 CR3: 30ff1000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c110313c>] scan_object+0x7c/0xf0
[<c1103389>] kmemleak_scan+0x1d9/0x400
[<c1103a3c>] kmemleak_scan_thread+0x4c/0xb0
[<c10819d4>] kthread+0x74/0x80
[<c10257db>] kernel_thread_helper+0x7/0x3c
[<ffffffff>] 0xffffffff
kmemleak: 515 new suspected memory leaks (see
/sys/kernel/debug/kmemleak)
kmemleak: 42 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
The problem here is that kmemleak will scan partially initialized
objects that makes kmemcheck complain. Fix that up by skipping
uninitialized memory regions when kmemcheck is enabled.
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Re-organize the flag settings so that it's visible at a glance
which sched-domains flags are set and which not.
With the new balancer code we'll need to re-tune these details
anyway, so make it cleaner to make fewer mistakes down the
road ;-)
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Otherwise, system with apci id lifting will have wrong apicid in
/proc/cpuinfo.
and use that in srat_detect_node().
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <4A998CCA.1040407@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kernel BTS tracing generates too much data too fast for us to
handle, causing the kernel to hang.
Fail for BTS requests for kernel code.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Acked-by: Peter Zijlstra <a.p.zjilstra@chello.nl>
LKML-Reference: <20090902140616.901253000@intel.com>
[ This is really a workaround - but we want BTS tracing in .32
so make sure we dont regress. The lockup should be fixed
ASAP. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On 32bit, pointers in the DS AREA configuration are cast to
u64. The current (long) cast to avoid compiler warnings results
in a signed 64bit address.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090902140615.305889000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reserve PERF_COUNT_HW_BRANCH_INSTRUCTIONS with sample_period ==
1 for BTS tracing and fail, if BTS is not available.
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090902140612.943801000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Pack aligned things together into a special section to minimize
padding holes.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Tejun Heo <tj@kernel.org>
LKML-Reference: <4AA035C0.9070202@goop.org>
[ queued up in tip:x86/asm because it depends on this commit:
x86/i386: Make sure stack-protector segment base is cache aligned ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Current sched domain creation code can't handle multi-node processors.
When switching to power_savings scheduling errors show up and
system might hang later on (due to broken sched domain hierarchy):
# echo 0 >> /sys/devices/system/cpu/sched_mc_power_savings
CPU0 attaching sched-domain:
domain 0: span 0-5 level MC
groups: 0 1 2 3 4 5
domain 1: span 0-23 level NODE
groups: 0-5 6-11 18-23 12-17
...
# echo 1 >> /sys/devices/system/cpu/sched_mc_power_savings
CPU0 attaching sched-domain:
domain 0: span 0-11 level MC
groups: 0 1 2 3 4 5 6 7 8 9 10 11
ERROR: parent span is not a superset of domain->span
domain 1: span 0-5 level CPU
ERROR: domain->groups does not contain CPU0
groups: 6-11 (__cpu_power = 12288)
ERROR: groups don't span domain->span
domain 2: span 0-23 level NODE
groups:
ERROR: domain->cpu_power not set
ERROR: groups don't span domain->span
...
Fixing all aspects of power-savings scheduling for Magny-Cours needs
some larger changes in the sched domain creation code.
As a short-term and temporary workaround avoid the problems by
extending "the worst possible hack" ;-(
and always use llc_shared_map on AMD Magny-Cours when MC domain span
is calculated.
With this I get:
# echo 1 >> /sys/devices/system/cpu/sched_mc_power_savings
CPU0 attaching sched-domain:
domain 0: span 0-5 level MC
groups: 0 1 2 3 4 5
domain 1: span 0-5 level CPU
groups: 0-5 (__cpu_power = 6144)
domain 2: span 0-23 level NODE
groups: 0-5 (__cpu_power = 6144) 6-11 (__cpu_power = 6144) 18-23 (__cpu_power = 6144) 12-17 (__cpu_power = 6144)
...
I.e. no errors during sched domain creation, no system hangs, and also
mc_power_savings scheduling works to a certain extend.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This fixes threshold_bank4 support on multi-node processors.
The correct mask to use is llc_shared_map, representing an internal
node on Magny-Cours.
We need to create 2 sets of symlinks for sibling shared banks -- one
set for each internal node, symlinks of each set should target the
first core on same internal node.
Currently only one set is created where all symlinks are targeting
the first core of the entire socket.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
L3 cache size, associativity and shared_cpu information need to be
adapted to show information for an internal node instead of the
entire physical package.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Construct entire NodeID and use it as cpu_llc_id. Thus internal node
siblings are stored in llc_shared_map.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The Intel Optimization Reference Guide says:
In Intel Atom microarchitecture, the address generation unit
assumes that the segment base will be 0 by default. Non-zero
segment base will cause load and store operations to experience
a delay.
- If the segment base isn't aligned to a cache line
boundary, the max throughput of memory operations is
reduced to one [e]very 9 cycles.
[...]
Assembly/Compiler Coding Rule 15. (H impact, ML generality)
For Intel Atom processors, use segments with base set to 0
whenever possible; avoid non-zero segment base address that is
not aligned to cache line boundary at all cost.
We can't avoid having a non-zero base for the stack-protector
segment, but we can make it cache-aligned.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
LKML-Reference: <4AA01893.6000507@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The macro was defined in the 32-bit path as well - breaking the
build on 32-bit platforms:
arch/x86/lib/msr-reg.S: Assembler messages:
arch/x86/lib/msr-reg.S:53: Error: Bad macro parameter list
arch/x86/lib/msr-reg.S💯 Error: invalid character '_' in mnemonic
arch/x86/lib/msr-reg.S:101: Error: invalid character '_' in mnemonic
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <tip-f6909f394c2d4a0a71320797df72d54c49c5927e@git.kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There is no dependency from the gart code to the agp code.
And since a lot of systems today do not have agp anymore
remove this dependency from the kernel configuration.
Signed-off-by: Pavel Vasilyev <pavel@pavlinux.ru>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch enables the passthrough mode for AMD IOMMU by
running the initialization function when iommu=pt is passed
on the kernel command line.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch makes sure a device is not detached from the
passthrough domain when the device driver is unloaded or
does otherwise release the device.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When the IOMMU driver runs in passthrough mode it has to
make sure that every device not assigned to an IOMMU-API
domain must be put into the passthrough domain instead of
keeping it unassigned.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch makes the locking behavior between the functions
attach_device and __attach_device consistent with the
locking behavior between detach_device and __detach_device.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The V bit of the device table entry has to be set after the
rest of the entry is written to not confuse the hardware.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
When iommu=pt is passed on kernel command line the devices
should run untranslated. This requires the allocation of a
special domain for that purpose. This patch implements the
allocation and initialization path for iommu=pt.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch factors some code of protection domain allocation
into seperate functions. This way the logic can be used to
allocate the passthrough domain later. As a side effect this
patch fixes an unlikely domain id leakage bug.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This variable is read most of the time. This patch marks it
as such. It also documents the meaning the this variable
while at it.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch adds a map_size parameter to the iommu_map_page
function which makes it generic enough to handle multiple
page sizes. This also requires a change to alloc_pte which
is also done in this patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The driver now supports a dynamic number of levels for IO
page tables. This allows to reduce the number of levels for
dma_ops domains by one because a dma_ops domain has usually
an address space size between 128MB and 4G.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch changes the alloc_pte function to be able to map
pages into the whole 64 bit address space supported by AMD
IOMMU hardware from the old limit of 2**39 bytes.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Thist patch introduces the update_domain function which
propagates the larger address space of a protection domain
to the device table and flushes all relevant DTEs and the
domain TLB.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This function factors out some logic of attach_device to a
seperate function. This new function will be used to update
device table entries when necessary.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch adds a generic variant of
amd_iommu_flush_all_devices function which flushes only the
DTEs for a given protection domain.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch changes the fetch_pte function in the AMD IOMMU
driver to support dynamic mapping levels.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Instead of a panic on an comletion wait loop failure, try to
recover from that event from resetting the command buffer.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
To prevent the driver from doing recursive command buffer
resets, just panic when that recursion happens.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
On an ILLEGAL_COMMAND_ERROR the IOMMU stops executing
further commands. This patch changes the code to handle this
case better by resetting the command buffer in the IOMMU.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch factors parts of the command buffer
initialization code into a seperate function which can be
used to reset the command buffer later.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This function flushes all DTE entries on one IOMMU for all
devices behind this IOMMU. This is required for command
buffer resetting later.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
The amd_iommu_pd_table is indexed by protection domain
number and not by device id. So this check is broken and
must be removed.
Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch replaces the "AMD IOMMU" printk strings with the
official name for the hardware: "AMD-Vi".
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch removes some left-overs which where put into the code to
simplify merging code which also depends on changes in other trees.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch introduces a function to flush all domain tlbs
for on one given IOMMU. This is required later to reset the
command buffer on one IOMMU.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch adds code to dump the command which caused an
ILLEGAL_COMMAND_ERROR raised by the IOMMU hardware.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
This patch adds code to dump the content of the device table
entry which caused an ILLEGAL_DEV_TABLE_ENTRY error from the
IOMMU hardware.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Add a keyctl to install a process's session keyring onto its parent. This
replaces the parent's session keyring. Because the COW credential code does
not permit one process to change another process's credentials directly, the
change is deferred until userspace next starts executing again. Normally this
will be after a wait*() syscall.
To support this, three new security hooks have been provided:
cred_alloc_blank() to allocate unset security creds, cred_transfer() to fill in
the blank security creds and key_session_to_parent() - which asks the LSM if
the process may replace its parent's session keyring.
The replacement may only happen if the process has the same ownership details
as its parent, and the process has LINK permission on the session keyring, and
the session keyring is owned by the process, and the LSM permits it.
Note that this requires alteration to each architecture's notify_resume path.
This has been done for all arches barring blackfin, m68k* and xtensa, all of
which need assembly alteration to support TIF_NOTIFY_RESUME. This allows the
replacement to be performed at the point the parent process resumes userspace
execution.
This allows the userspace AFS pioctl emulation to fully emulate newpag() and
the VIOCSETTOK and VIOCSETTOK2 pioctls, all of which require the ability to
alter the parent process's PAG membership. However, since kAFS doesn't use
PAGs per se, but rather dumps the keys into the session keyring, the session
keyring of the parent must be replaced if, for example, VIOCSETTOK is passed
the newpag flag.
This can be tested with the following program:
#include <stdio.h>
#include <stdlib.h>
#include <keyutils.h>
#define KEYCTL_SESSION_TO_PARENT 18
#define OSERROR(X, S) do { if ((long)(X) == -1) { perror(S); exit(1); } } while(0)
int main(int argc, char **argv)
{
key_serial_t keyring, key;
long ret;
keyring = keyctl_join_session_keyring(argv[1]);
OSERROR(keyring, "keyctl_join_session_keyring");
key = add_key("user", "a", "b", 1, keyring);
OSERROR(key, "add_key");
ret = keyctl(KEYCTL_SESSION_TO_PARENT);
OSERROR(ret, "KEYCTL_SESSION_TO_PARENT");
return 0;
}
Compiled and linked with -lkeyutils, you should see something like:
[dhowells@andromeda ~]$ keyctl show
Session Keyring
-3 --alswrv 4043 4043 keyring: _ses
355907932 --alswrv 4043 -1 \_ keyring: _uid.4043
[dhowells@andromeda ~]$ /tmp/newpag
[dhowells@andromeda ~]$ keyctl show
Session Keyring
-3 --alswrv 4043 4043 keyring: _ses
1055658746 --alswrv 4043 4043 \_ user: a
[dhowells@andromeda ~]$ /tmp/newpag hello
[dhowells@andromeda ~]$ keyctl show
Session Keyring
-3 --alswrv 4043 4043 keyring: hello
340417692 --alswrv 4043 4043 \_ user: a
Where the test program creates a new session keyring, sticks a user key named
'a' into it and then installs it on its parent.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Conflicts:
arch/x86/kernel/reboot.c
security/Kconfig
Merge reason: resolve the conflicts, bump up from rc3 to rc8.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This function measures whether the FPU/SSE state can be touched in
interrupt context. If the interrupted code is in user space or has no
valid FPU/SSE context (CR0.TS == 1), FPU/SSE state can be used in IRQ
or soft_irq context too.
This is used by AES-NI accelerated AES implementation and PCLMULQDQ
accelerated GHASH implementation.
v3:
- Renamed to irq_fpu_usable to reflect the purpose of the function.
v2:
- Renamed to irq_is_fpu_using to reflect the real situation.
Signed-off-by: Huang Ying <ying.huang@intel.com>
CC: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Move tboot.h from asm to linux to fix the build errors of intel_txt
patch on non-X86 platforms. Remove the tboot code from generic code
init/main.c and kernel/cpu.c.
Signed-off-by: Shane Wang <shane.wang@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
msr-reg.S used the :req option on a macro argument, which wasn't
supported by gas 2.16.1 (but apparently by some earlier versions of
gas, just to be confusing.) It isn't necessary, so just remove it.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Create a blacklist for processors that should not load the acpi-cpufreq module.
The initial entry in the blacklist function is the Intel 0f68 processor. It's
specification update mentions errata AL30 which implies that cpufreq should not
run on this processor.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Dave Jones <davej@redhat.com>
Remove an obsolete check that used to prevent there being more
than 2 low P-states. Now that low-to-low P-states changes are
enabled, it prevents otherwise workable configurations with
multiple low P-states.
Signed-off-by: Mark Langsdorf <mark.langsdorf@amd.com>
Tested-by: Krists Krilovs <pow@pow.za.net>
Signed-off-by: Dave Jones <davej@redhat.com>
This block is allocated with alloc_bootmem() and scanned by kmemleak but
the kernel direct mapping may no longer exist. This patch tells kmemleak
to ignore this memory hole. The dma32_bootmem_ptr in
dma32_reserve_bootmem() is also ignored.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Make it possible to access the all-register-setting/getting MSR
functions via the MSR driver. This is implemented as an ioctl() on
the standard MSR device node.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <petkovbb@gmail.com>
Create _on_cpu helpers for {rw,wr}msr_safe_regs() analogously with the
other MSR functions. This will be necessary to add support for these
to the MSR driver.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <petkovbb@gmail.com>
For some reason, the _safe MSR functions returned -EFAULT, not -EIO.
However, the only user which cares about the return code as anything
other than a boolean is the MSR driver, which wants -EIO. Change it
to -EIO across the board.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Alok Kataria <akataria@vmware.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Add CFI annotations for native_{rd,wr}msr_safe_regs().
Simplify the 64-bit implementation: we don't allow the upper half
registers to be set, and so we can use them to carry state across the
operation.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <petkovbb@gmail.com>
LKML-Reference: <1251705011-18636-1-git-send-email-petkovbb@gmail.com>
We have had this convenient macro _ASM_EXTABLE() to generate exception
table entry in inline assembly. Make it also usable for pure
assembly.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add 32-bit versions of the combined CFI macros, equivalent to the
64-bit ones except, obviously, operating on 32-bit stack words.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
fbd8b1819e turns off the bit for
/proc/cpuinfo. However, a proper/full fix would be to additionally
turn off the bit in the CPUID output so that future callers get
correct CPU features info.
Do that by basically reversing what the BIOS wrongfully does at boot.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <1251705011-18636-3-git-send-email-petkovbb@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Switch them to native_{rd,wr}msr_safe_regs and remove
pv_cpu_ops.read_msr_amd.
Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
LKML-Reference: <1251705011-18636-2-git-send-email-petkovbb@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
native_{rdmsr,wrmsr}_safe_regs are two new interfaces which allow
presetting of a subset of eight x86 GPRs before executing the rd/wrmsr
instructions. This is needed at least on AMD K8 for accessing an erratum
workaround MSR.
Originally based on an idea by H. Peter Anvin.
Signed-off-by: Borislav Petkov <petkovbb@gmail.com>
LKML-Reference: <1251705011-18636-1-git-send-email-petkovbb@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
boot_cpu_physical_apicid is a global variable and used as function
argument as well. Rename the function arguments to avoid confusion.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The proposed Moorestown support patches use an extra feature flag
mechanism to make the ioapic work w/o an i8259. There is a much
simpler solution.
Most i8259 specific functions are already called dependend on the irq
number less than NR_IRQS_LEGACY. Replacing that constant by a
read_mostly variable which can be set to 0 by the platform setup code
allows us to achieve the same without any special feature flags.
That trivial change allows us to proceed with MRST w/o doing a full
blown overhaul of the ioapic code which would delay MRST unduly.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Moorestown MID devices need to be detected early in the boot process
to setup and do not call x86_default_early_setup as there is no EBDA
region to reserve.
[ Copied the minimal code from Jacobs latest MRST series ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jacob Pan <jacob.jun.pan@intel.com>
x86 bootprotocol 2.07 has introduced hardware_subarch ID in the boot
parameters provided by FW. We use it to identify Moorestown platforms.
[ tglx: Cleanup and paravirt fix ]
Signed-off-by: Jacob Pan <jacob.jun.pan@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Platforms like Moorestown require early setup and want to avoid the
call to reserve_ebda_region. The x86_init override is too late when
the MRST detection happens in setup_arch. Move the default i386
x86_init overrides and the call to reserve_ebda_region into a separate
function which is called as the default of a switch case depending on
the hardware_subarch id in boot params. This allows us to add a case
for MRST and let MRST have its own early setup function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We do not need the TSC before late_time_init. Move the tsc_init to the
late time init code so we can also utilize HPET for calibration (which
we claimed to do but never did except in some older kernel
version). This also helps Moorestown to calibrate the TSC with the
AHBT timer which needs to be initialized in late_time_init like HPET.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
TSC calibration is modified by the vmware hypervisor and paravirt by
separate means. Moorestown wants to add its own calibration routine as
well. So make calibrate_tsc a proper x86_init_ops function and
override it by paravirt or by the early setup of the vmware
hypervisor.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the code where it's only user is. Also we need to look whether
this hardwired hackery might interfere with perfcounters.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The timer and timer irq setup code is identical in 32 and 64 bit. Make
it the same formatting as well. Also add the global variables under
the necessary ifdefs to both files.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
MCA_bus is constant 0 when CONFIG_MCA=n. So the compiler removes that
code w/o needing an extra #ifdef
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Let the compiler optimize the timer_ack magic away in the 32bit timer
interrupt and put the same code into time_64.c. It's optimized out for
CONFIG_X86_IO_APIC on 32bit and for 64bit because timer_ack is const 0
in both cases.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This is a left over of the old x86 sub arch support. Remove it and
open code it like we do in time_64.c
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The timer init code is convoluted with several quirks and the paravirt
timer chooser. Figuring out which code path is actually taken is not
for the faint hearted.
Move the numaq TSC quirk to tsc_pre_init x86_init_ops function and
replace the paravirt time chooser and the remaining x86 quirk with a
simple x86_init_ops function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
paravirt overrides the setup of the default apic timers as per cpu
timers. Moorestown needs to override that as well.
Move it to x86_init_ops setup and create a separate x86_cpuinit struct
which holds the function for the secondary evtl. hotplugabble CPUs.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We really do not need two paravirt/x86_init_ops functions which are
called in two consecutive source lines. Move the only user of
post_allocator_init into the already existing pagetable_setup_done
function.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Replace another obscure paravirt magic and move it to
x86_init_ops. Such a hook is also useful for embedded and special
hardware.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ARCH_SETUP is a horrible leftover from the old arch/i386 mach support
code. It still has a lonely user in xen. Move it to x86_init_ops.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
irq_init is overridden by x86_quirks and by paravirts. Unify the whole
mess and make it an unconditional x86_init_ops function which defaults
to the standard function and can be overridden by the early platform
code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Replace the quirk machinery by a x86_init_ops function which
defaults to the standard implementation. This is also a preparatory
patch for Moorestown support which needs to replace the default
init_ISA_irqs as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Replace the quirk machinery by a x86_init_ops function which defaults
to the standard implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Stack protector support was not detected when building with
ARCH=i386 on x86_64 systems:
arch/x86/Makefile:80: stack protector enabled but no compiler support
The "-m32" argument needs to be passed to the detection script.
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Arjan van de Ven <arjan@infradead.org>
LKML-Reference: <20090829182718.10f566b1@leela>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
--
Since parse_early_param() may (e.g. for earlyprintk=dbgp)
involve calls to page table manipulation functions (here
set_fixmap_nocache()), NX hardware support must be determined
before calling that function (so that __supported_pte_mask gets
properly set up).
But the call after parse_early_param() can also not go away, as
that will honor eventual command line specified disabling of
the NX functionality.
( This will then just result in whatever mappings got
established during parse_early_param() having the NX bit set
despite it being disabled on the command line, but I think
that's tolerable).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
LKML-Reference: <4A97F3BD02000078000121B9@vpn.id2.novell.com>
[ merged to x86/pat to resolve a conflict. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Merge reason: the SFI (Simple Firmware Interface) feature in the ACPI
tree needs this cleanup, pull it into the APIC branch as
well so that there's no interactions.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some IO-APIC routines are ACPI specific now, but need to
be exposed when CONFIG_ACPI=n for the benefit of SFI.
Remove #ifdef ACPI around these routines:
io_apic_get_unique_id(int ioapic, int apic_id);
io_apic_get_version(int ioapic);
io_apic_get_redir_entries(int ioapic);
Move these routines from ACPI-specific boot.c to io_apic.c:
uniq_ioapic_id(u8 id)
mp_find_ioapic()
mp_find_ioapic_pin()
mp_register_ioapic()
Also, since uniq_ioapic_id() is now no longer static,
re-name it to io_apic_unique_id() for consistency
with the other public io_apic routines.
For simplicity, do not #ifdef the resulting code ACPI || SFI,
thought that could be done in the future if it is important
to optimize the !ACPI !SFI IO-APIC x86 kernel for size.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Martin Schwidefsky analyzed it:
To register a clocksource the clocksource_mutex is acquired and if
necessary timekeeping_notify is called to install the clocksource as
the timekeeper clock. timekeeping_notify uses stop_machine which needs
to take cpu_add_remove_lock mutex.
Starting a new cpu is done with the cpu_add_remove_lock mutex held.
native_cpu_up checks the tsc of the new cpu and if the tsc is no good
clocksource_change_rating is called. Which needs the clocksource_mutex
and the deadlock is complete.
The solution is to replace the TSC via the clocksource watchdog
mechanism. Mark the TSC as unstable and schedule the watchdog work so
it gets removed in the watchdog thread context.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: John Stultz <johnstul@us.ibm.com>
The mpc_apic_id setup is handled by a x86_quirk. Make it a
x86_init_ops function with a default implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
32bit and also the numaq code have special requirements on the
ioapic_id setup. Convert it to a x86_init_ops function and get rid
of the quirks and #ifdefs
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The x86 quirkification introduced an extra ugly hackery with a
variable pointer in the mpparse code. If the pointer is initialized
then it is dereferenced and the variable set to 0 or incremented.
Create a x86_init_ops function and let the affected numaq code
hold the function. Default init is a setup noop.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
memory_setup is overridden by x86_quirks and by paravirts with weak
functions and quirks. Unify the whole mess and make it an
unconditional x86_init_ops function which defaults to the standard
function and can be overridden by the early platform code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
reserve_ebda_region needs to be called befor start_kernel. Moorestown
needs to override it. Make it a x86_init_ops function and initialize
it with the default reserve_ebda_region.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The 32bit and the 64bit code are slighty different in the reservation
of standard resources. Also the upcoming Moorestown support needs its
own version of that.
Add it to x86_init_ops and initialize it with the 64bit default. 32bit
overrides it in early boot. Now moorestown can add it's own override
w/o sprinkling the code with more #ifdefs
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
probe_roms is only used on 32bit. Add it to the x86_init ops and
remove the #ifdefs.
Default initializer is x86_init_noop() which is overridden in
the 32bit boot code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The upcoming Moorestown support brings the embedded world to x86. The
setup code of x86 has already a couple of hooks which are either
x86_quirks or paravirt ops. Some of those setup hooks are pretty
convoluted like the timer setup and the tsc calibration code. But
there are other places which could do with a cleanup.
Instead of having inline functions/macros which are modified at
compile time I decided to introduce x86_init ops which are
unconditional in the code and make it clear that they can be changed
either during compile time or in the early boot process. The function
pointers are initialized by default functions which can be noops so
that the pointer can be called unconditionally in the most cases. This
also allows us to remove 32bit/64bit, paravirt and other #ifdeffery.
paravirt guests are just a hardware platform in the setup code, so we
should treat them as such and not hide all behind multiple layers of
indirection and compile time dependencies.
It's more obvious that x86_init.timers.timer_init() is a function
pointer than the late_time_init = choose_time_init() obscurity. It's
also way simpler to grep for x86_init.timers.timer_init and find all
the places which modify that function pointer instead of analyzing
weak functions, macros and paravirt indirections.
Note. This is not a general paravirt_ops replacement. It just will
move setup related hooks which are potentially useful for other
platform setup purposes as well out of the paravirt domain.
Add the base infrastructure without any functionality.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reason: The setup cleanups conflict with the paravirt cleanups. Avoid
a rather large merge conflict
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reason: Change to is_new_memtype_allowed() in x86/urgent
Resolved semantic conflicts in:
arch/x86/mm/pat.c
arch/x86/mm/ioremap.c
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add sanity check for remap_pfn_range of RAM regions using
lookup_memtype(). Previously, we did not have anyway to get the type of
RAM memory regions as they were tracked using a single bit in
page_struct (WB, nonWB). Now we can get the actual type from page struct
(WB, WC, UC_MINUS) and make sure the requester gets that type.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Lookup the reserved memtype during vm_insert_pfn and use that memtype
for the new mapping. This takes care or handling of vm_insert_pfn()
interface in track_pfn_vma*/untrack_pfn_vma.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add a new routine lookup_memtype() to get the current memtype based on
the PAT reserves and frees.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Change reserve_ram_pages_type and free_ram_pages_type to use 2 page
flags to track UC_MINUS, WC, WB and default types. Previous RAM tracking
just tracked WB or NonWB, which was not complete and did not allow
tracking of RAM fully and there was no way to get the actual type
reserved by looking at the page flags.
We use the memtype_lock spinlock for atomicity in dealing with
memtype tracking in struct page.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Only IA64 was using PG_uncached as of now. We now intend to use this bit
in x86 as well, to keep track of memory type of those addresses that
have page struct for them. So, generalize the use of that bit across
ia64 and x86.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
PAT memtype tracking uses a linear link list to keep track of IO
(non-RAM) regions and their memtypes. The code used a last_accessed
pointer as a cache to speedup the lookup. As per discussions with
H. Peter Anvin a while back, having a rbtree here will avoid bad
performances in pathological cases where we may end up with huge
linked list. This may not add any noticable performance speedup
in normal case as the number of entires in PAT memtype list tend
to be ~20-30 range. The patch removes the "cached_entry" logic
as with rbtree we have more generic way of speeding up the lookup.
With this patch, we use rbtree to do the quick lookup. We still use
linked list as the memtype range tracked can be of different sizes
and can overlap in different ways. We also keep track of usage counts
with linked list.
Example:
Multiple ioremaps with different sizes
uncached-minus @ 0xfffff00000-0xfffff04000
uncached-minus @ 0xfffff02000-0xfffff03000
And one userlevel mmap and the thread forks a new process
uncached-minus @ 0xbf453000-0xbf454000
uncached-minus @ 0xbf453000-0xbf454000
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
io_mapping_* interfaces were added, mainly for graphics drivers.
Make this interface go through the PAT reserve/free, instead of
hardcoding WC mapping. This makes sure that there are no
aliases due to unconditional WC setting.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add new routines to request memtype for IO regions. This will currently
be a backend for io_mapping_* routines. But, it can also be made available
to drivers directly in future, in case it is needed.
reserve interface reserves the memory, makes sure we have a compatible
memory type available and keeps the identity map in sync when needed.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
ioremap has this hard-coded check for new type and requested type. That
check differs from other PAT users like /dev/mem mmap, remap_pfn_range
in only one condition where requested type is UC_MINUS and new type
is WC. Under that condition, ioremap fails. But other PAT interfaces succeed
with a WC mapping.
Change to make ioremap be in sync with other PAT APIs and use the same
macro as others. Also changes the error print to KERN_ERR instead of
pr_debug.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make reserve_memtype internally take care of pat disabled case and fallback
to default return values.
Remove the specific pat_disabled checks in track_* routines.
Change kernel_map_sync_memtype to sync identity map even when
pat_disabled.
This change ensures that, even for pat_disabled case, we take care of
keeping identity map in sync. Before this patch, in pat disabled case,
ioremap() keeps the identity maps in sync and other APIs like pci and
/dev/mem mmap don't, which is not a very consistent behavior.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Remove the FTRACE_SYSCALL_MAX definitions now that we have converted the
syscall event tracing code to use NR_syscalls.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Josh Stone <jistone@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anwin <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <f2240cdc8f0b1ca7617390c8f5ec90ba2bd348cf.1251146513.git.jbaron@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Convert the syscalls event tracing code to use NR_syscalls, instead of
FTRACE_SYSCALL_MAX. NR_syscalls is standard accross most arches, and
reduces code confusion/complexity.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Josh Stone <jistone@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anwin <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <9b4f1a84ecae57cc6599412772efa36f0d2b815b.1251146513.git.jbaron@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Express the available number of syscalls in a standard way by defining
NR_syscalls.
The common way to define it is to place its definition in asm/unistd.h
However, the number of syscalls is defined using __NR_syscall_max in
x86-64 after building a dynamic header file "asm-offsets.h"
The source file that generates this header, asm-offsets-64.c includes
unistd.h, then if we want to express NR_syscalls from __NR_syscall_max
in unistd.h only after generating the dynamic header file, we need a
watchguard.
If unistd.h is included from asm-offsets-64.c, then we are generating
asm-offset.h which defines __NR_syscall_max. At this time, we don't
want to (we can't) define NR_syscalls, then we do nothing.
Otherwise we define NR_syscalls because we know asm-offsets.h has
been generated.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Josh Stone <jistone@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anwin <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <20090826160910.GB2658@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Add a NR_syscalls #define for x86. This is used in the syscall events
tracing code. Todo: make it dynamic like x86_64.
NR_syscalls is the usual name used to determine the number of syscalls
supported by the current arch. We want to unify the use of this number
across archs that support the syscall tracing. This also prepare to move
some of the arch code to core code in the syscall tracing area.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Josh Stone <jistone@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anwin <hpa@zytor.com>
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <0f33c0f96d198fccc3ddd9ff7f5334ff5cb42706.1251146513.git.jbaron@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
As far as I see there is no external poking of mp_lapic_addr in
this procedure which could lead to unpredited changes and
require local storage unit for it. Lets use it plain forward.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20090826171324.GC4548@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If MCE handler is called but none of mces_seen have machine
check event which might signal the MCE (i.e. event higher than
MCE_KEEP_SEVERITY), panic with "Machine check from unknown
source" will be taken since the MCE is assumed to be signaled
from external agent or so.
Usually mces_seen never point MCE_KEEP_SEVERITY event such as
CE. But it can happen because initial value of mces_seen is
accidentally modified by mce_no_way_out() - in case if
mce_no_way_out() run through all banks and the last bank has
the CE, mces_seen points the CE and the "panic by unknown" will
not be taken.
This patch fixes this undesired behavior, and clarifies the logic.
Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com>
LKML-Reference: <4A94E244.3020301@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
2.6.31-rc7 does not boot on vSMP systems:
[ 8.501108] CPU31: Thermal monitoring enabled (TM1)
[ 8.501127] CPU 31 MCA banks SHD:2 SHD:3 SHD:5 SHD:6 SHD:8
[ 8.650254] CPU31: Intel(R) Xeon(R) CPU E5540 @ 2.53GHz stepping 04
[ 8.710324] Brought up 32 CPUs
[ 8.713916] Total of 32 processors activated (162314.96 BogoMIPS).
[ 8.721489] ERROR: parent span is not a superset of domain->span
[ 8.727686] ERROR: domain->groups does not contain CPU0
[ 8.733091] ERROR: groups don't span domain->span
[ 8.737975] ERROR: domain->cpu_power not set
[ 8.742416]
Ravikiran Thirumalai bisected it to:
| commit 2759c3287d
| x86: don't call read_apic_id if !cpu_has_apic
The problem is that on vSMP systems the CPUID derived
initial-APICIDs are overlapping - so we need to fall
back on hard_smp_processor_id() which reads the local
APIC.
Both come from the hardware (influenced by firmware
though) so it's a tough call which one to trust.
Doing the quirk expresses the vSMP property properly
and also does not affect other systems, so we go for
this solution instead of a revert.
Reported-and-Tested-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Shai Fultheim <shai@scalex86.org>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <4A944D3C.5030100@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
alloc_bootmem() already panics on allocation failure. There is
no need to check the result.
Also there is a way to unbind global variable from its body and
use it as a parameter which allow us to simplify
ioapic_init_mappings as well -- "for" cycle already uses
nr_ioapics as a conditional variable and there is no need to
check if ioapic_setup_resources was returning NULL again.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20090824175551.493629148@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We already have APIC_DEFAULT_PHYS_BASE so just to be
consistent.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
LKML-Reference: <20090824175550.927946757@openvz.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Initialize cx before calling xen_cpuid(), in order to suppress the
"may be used uninitialized in this function" warning.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Xen always runs on CPUs which properly support WP enforcement in
privileged mode, so there's no need to test for it.
This also works around a crash reported by Arnd Hannemann, though I
think its just a band-aid for that case.
Reported-by: Arnd Hannemann <hannemann@nets.rwth-aachen.de>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This is a partial revert of f1f029c7bf.
"=rm" is allowed in this context, because "pop" is explicitly defined
to adjust the stack pointer *before* it evaluates its effective
address, if it has one. Thus, we do end up writing to the correct
address even if we use an on-stack memory argument.
The original reporter for f1f029c7bf was
apparently using a broken x86 simulator.
[ Impact: performance ]
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Gabe Black <spamforgabe@umich.edu>
This converts the syscall_enter/exit tracepoints into TRACE_EVENTs, so
you can have generic ftrace events that capture all system calls with
arguments and return values. These generic events are also renamed to
sys_enter/exit, so they're more closely aligned to the specific
sys_enter_foo events.
Signed-off-by: Josh Stone <jistone@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <1251150194-1713-5-git-send-email-jistone@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
It's not strictly correct for the tracepoint reg/unreg callbacks to
occur when a client is hooking up, because the actual tracepoint may not
be present yet. This happens to be fine for syscall, since that's in
the core kernel, but it would cause problems for tracepoints defined in
a module that hasn't been loaded yet. It also means the reg/unreg has
to be EXPORTed for any modules to use the tracepoint (as in SystemTap).
This patch removes DECLARE_TRACE_WITH_CALLBACK, and instead introduces
DEFINE_TRACE_FN which stores the callbacks in struct tracepoint. The
callbacks are used now when the active state of the tracepoint changes
in set_tracepoint & disable_tracepoint.
This also introduces TRACE_EVENT_FN, so ftrace events can also provide
registration callbacks if needed.
Signed-off-by: Josh Stone <jistone@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <1251150194-1713-4-git-send-email-jistone@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
s/HAVE_FTRACE_SYSCALLS/HAVE_SYSCALL_TRACEPOINTS/g
s/TIF_SYSCALL_FTRACE/TIF_SYSCALL_TRACEPOINT/g
The syscall enter/exit tracing is no longer specific to just ftrace, so
they now have names that reflect their tie to tracepoints instead.
Signed-off-by: Josh Stone <jistone@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <1251150194-1713-2-git-send-email-jistone@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
clockevent: Prevent dead lock on clockevents_lock
timers: Drop write permission on /proc/timer_list
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix build with older binutils and consolidate linker script
x86: Fix an incorrect argument of reserve_bootmem()
x86: add vmlinux.lds to targets in arch/x86/boot/compressed/Makefile
xen: rearrange things to fix stackprotector
x86: make sure load_percpu_segment has no stackprotector
i386: Fix section mismatches for init code with !HOTPLUG_CPU
x86, pat: Allow ISA memory range uncacheable mapping requests
For the x86_model to be greater than 6 or less than 12 is
logically always true.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
binutils prior to 2.17 can't deal with the currently possible
situation of a new segment following the per-CPU segment, but
that new segment being empty - objcopy misplaces the .bss (and
perhaps also the .brk) sections outside of any segment.
However, the current ordering of sections really just appears
to be the effect of cumulative unrelated changes; re-ordering
things allows to easily guarantee that the segment following
the per-CPU one is non-empty, and at once eliminates the need
for the bogus data.init2 segment.
Once touching this code, also use the various data section
helper macros from include/asm-generic/vmlinux.lds.h.
-v2: fix !SMP builds.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: <sam@ravnborg.org>
LKML-Reference: <4A94085D02000078000119A5@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This line looks suspicious, because if this is true, then the
'flags' parameter of function reserve_bootmem_generic() will be
unused when !CONFIG_NUMA. I don't think this is what we want.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: akpm@linux-foundation.org
LKML-Reference: <20090821083709.5098.52505.sendpatchset@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add another option when selecting CPU family so the kernel can be
optimized for Intel Atom CPUs. If GCC supports tuning options for
Intel Atom they will be used.
Signed-off-by: Tobias Doerffel <tobias.doerffel@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
LKML-Reference: <1251018457-19157-1-git-send-email-tobias.doerffel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mtr_aps_delayed_init was declared u32 and made global, but it only
ever takes boolean values and is only ever used in
arch/x86/kernel/cpu/mtrr/main.c. Declare it "static bool" and remove
external references.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
SDM Vol 3a section titled "MTRR considerations in MP systems" specifies
the need for synchronizing the logical cpu's while initializing/updating
MTRR.
Currently Linux kernel does the synchronization of all cpu's only when
a single MTRR register is programmed/updated. During an AP online
(during boot/cpu-online/resume) where we initialize all the MTRR/PAT registers,
we don't follow this synchronization algorithm.
This can lead to scenarios where during a dynamic cpu online, that logical cpu
is initializing MTRR/PAT with cache disabled (cr0.cd=1) etc while other logical
HT sibling continue to run (also with cache disabled because of cr0.cd=1
on its sibling).
Starting from Westmere, VMX transitions with cr0.cd=1 don't work properly
(because of some VMX performance optimizations) and the above scenario
(with one logical cpu doing VMX activity and another logical cpu coming online)
can result in system crash.
Fix the MTRR initialization by doing rendezvous of all the cpus. During
boot and resume, we delay the MTRR/PAT init for APs till all the
logical cpu's come online and the rendezvous process at the end of AP's bringup,
will initialize the MTRR/PAT for all AP's.
For dynamic single cpu online, we synchronize all the logical cpus and
do the MTRR/PAT init on the AP that is coming online.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Some of the NOPs tables aren't used on 64-bits, quite some code and
data is needed post-init for module loading only, and a couple of
functions aren't used outside that file (i.e. can be static, and don't
need to be exported).
The change to __INITDATA/__INITRODATA is needed to avoid an assembler
warning.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4A8BC8A00200007800010823@vpn.id2.novell.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
After talking with some application writers who want very fast, but not
fine-grained timestamps, I decided to try to implement new clock_ids
to clock_gettime(): CLOCK_REALTIME_COARSE and CLOCK_MONOTONIC_COARSE
which returns the time at the last tick. This is very fast as we don't
have to access any hardware (which can be very painful if you're using
something like the acpi_pm clocksource), and we can even use the vdso
clock_gettime() method to avoid the syscall. The only trade off is you
only get low-res tick grained time resolution.
This isn't a new idea, I know Ingo has a patch in the -rt tree that made
the vsyscall gettimeofday() return coarse grained time when the
vsyscall64 sysctrl was set to 2. However this affects all applications
on a system.
With this method, applications can choose the proper speed/granularity
trade-off for themselves.
Signed-off-by: John Stultz <johnstul@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: nikolag@ca.ibm.com
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: arjan@infradead.org
Cc: jonathan@jonmasters.org
LKML-Reference: <1250734414.6897.5.camel@localhost.localdomain>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This basically reverts commit 1a0c009ac (x86: unregister PIT
clocksource when PIT is disabled) because the problem which was tried
to address with that patch has been solved by commit 3f68535ada
(clocksource: sanity check sysfs clocksource changes).
The problem addressed by the original patch is that PIT could be
selected as clocksource after the system switched the PIT off or set
the PIT into one shot mode which would result in complete timekeeping
wreckage.
Now with the sysfs sanity check in place PIT cannot be selected again
when the system is in oneshot mode. The system will not switch to one
shot mode as long as PIT is installed because PIT is not suitable for
one shot.
The shutdown case which happens when the lapic timer is installed is
covered by the fact that init_pit_clocksource() is called after the
lapic timer take over and then does not install the PIT clocksource
at all.
We should have done the sanity checks back then, but ...
This also solves the locking problem which was reported vs. the
clocksource rework.
LKML-Reference: <new-submission>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
As noted in 83d349f35e ("x86: don't send
an IPI to the empty set of CPU's"), some APIC's will be very unhappy
with an empty destination mask. That commit added a WARN_ON() for that
case, and avoided the resulting problem, but didn't fix the underlying
reason for why those empty mask cases happened.
This fixes that, by checking the result of 'cpumask_andnot()' of the
current CPU actually has any other CPU's left in the set of CPU's to be
sent a TLB flush, and not calling down to the IPI code if the mask is
empty.
The reason this started happening at all is that we started passing just
the CPU mask pointers around in commit 4595f9620 ("x86: change
flush_tlb_others to take a const struct cpumask"), and when we did that,
the cpumask was no longer thread-local.
Before that commit, flush_tlb_mm() used to create it's own copy of
'mm->cpu_vm_mask' and pass that copy down to the low-level flush
routines after having tested that it was not empty. But after changing
it to just pass down the CPU mask pointer, the lower level TLB flush
routines would now get a pointer to that 'mm->cpu_vm_mask', and that
could still change - and become empty - after the test due to other
CPU's having flushed their own TLB's.
See
http://bugzilla.kernel.org/show_bug.cgi?id=13933
for details.
Tested-by: Thomas Björnell <thomas.bjornell@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The default_send_IPI_mask_logical() function uses the "flat" APIC mode
to send an IPI to a set of CPU's at once, but if that set happens to be
empty, some older local APIC's will apparently be rather unhappy. So
just warn if a caller gives us an empty mask, and ignore it.
This fixes a regression in 2.6.30.x, due to commit 4595f9620 ("x86:
change flush_tlb_others to take a const struct cpumask"), documented
here:
http://bugzilla.kernel.org/show_bug.cgi?id=13933
which causes a silent lock-up. It only seems to happen on PPro, P2, P3
and Athlon XP cores. Most developers sadly (or not so sadly, if you're
a developer..) have more modern CPU's. Also, on x86-64 we don't use the
flat APIC mode, so it would never trigger there even if the APIC didn't
like sending an empty IPI mask.
Reported-by: Pavel Vilim <wylda@volny.cz>
Reported-and-tested-by: Thomas Björnell <thomas.bjornell@gmail.com>
Reported-and-tested-by: Martin Rogge <marogge@onlinehome.de>
Cc: Mike Travis <travis@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This line looks suspicious, because if this is true, then the
'flags' parameter of function reserve_bootmem_generic() will be
unused when !CONFIG_NUMA. I don't think this is what we want.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: akpm@linux-foundation.org
LKML-Reference: <20090821083709.5098.52505.sendpatchset@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The system will die if the kernel is booted with "reservetop"
parameter, in present code, parse "reservetop" parameter after
early_ioremap_init(), and some function still use
early_ioremap() after it.
The problem is, "reservetop" parameter can modify
'FIXADDR_TOP', then the virtual address got by early_ioremap()
is base on old 'FIXADDR_TOP', but the page mapping is base on
new 'FIXADDR_TOP', it will occur page fault, and the IDT is not
prepare yet, so, the system is dead.
So, put parse_early_param() in the front of
early_ioremap_init() in this patch.
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Cc: yinghai@kernel.org
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A8D402F.4080805@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The absence of vmlinux.lds here keeps .vmlinux.lds.cmd from being
included, which in turn leads to it and all its dependents always
getting rebuilt independent of whether they are already up-to-date.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4A8D84670200007800010D31@vpn.id2.novell.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make sure the stack-protector segment registers are properly set up
before calling any functions which may have stack-protection compiled
into them.
[ Impact: prevent Xen early-boot crash when stack-protector is enabled ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
load_percpu_segment() is used to set up the per-cpu segment registers,
which are also used for -fstack-protector. Make sure that the
load_percpu_segment() function doesn't have stackprotector enabled.
[ Impact: allow percpu setup before calling stack-protected functions ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Currently clockevents_notify() is called with interrupts enabled at
some places and interrupts disabled at some other places.
This results in a deadlock in this scenario.
cpu A holds clockevents_lock in clockevents_notify() with irqs enabled
cpu B waits for clockevents_lock in clockevents_notify() with irqs disabled
cpu C doing set_mtrr() which will try to rendezvous of all the cpus.
This will result in C and A come to the rendezvous point and waiting
for B. B is stuck forever waiting for the spinlock and thus not
reaching the rendezvous point.
Fix the clockevents code so that clockevents_lock is taken with
interrupts disabled and thus avoid the above deadlock.
Also call lapic_timer_propagate_broadcast() on the destination cpu so
that we avoid calling smp_call_function() in the clockevents notifier
chain.
This issue left us wondering if we need to change the MTRR rendezvous
logic to use stop machine logic (instead of smp_call_function) or add
a check in spinlock debug code to see if there are other spinlocks
which gets taken under both interrupts enabled/disabled conditions.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: "Brown Len" <len.brown@intel.com>
LKML-Reference: <1250544899.2709.210.camel@sbs-t61.sc.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: use the right flag for get_vm_area()
percpu, sparc64: fix sparse possible cpu map handling
init: set nr_cpu_ids before setup_per_cpu_areas()
Commit 0e83815be7 changed the
section the initial_code variable gets allocated in, in an
attempt to address a section conflict warning. This, however
created a new section conflict when building without
HOTPLUG_CPU. The apparently only (reasonable) way to address
this is to always use __REFDATA.
Once at it, also fix a second section mismatch when not using
HOTPLUG_CPU.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Robert Richter <robert.richter@amd.com>
LKML-Reference: <4A8AE7CD020000780001054B@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Max Vozeler reported:
> Bug 13877 - bogl-term broken with CONFIG_X86_PAT=y, works with =n
>
> strace of bogl-term:
> 814 mmap2(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0)
> = -1 EAGAIN (Resource temporarily unavailable)
> 814 write(2, "bogl: mmaping /dev/fb0: Resource temporarily unavailable\n",
> 57) = 57
PAT code maps the ISA memory range as WB in the PAT attribute, so that
fixed range MTRR registers define the actual memory type (UC/WC/WT etc).
But the upper level is_new_memtype_allowed() API checks are failing,
as the request here is for UC and the return tracked type is WB (Tracked type is
WB as MTRR type for this legacy range potentially will be different for each
4k page).
Fix is_new_memtype_allowed() by always succeeding the ISA address range
checks, as the null PAT (WB) and def MTRR fixed range register settings
satisfy the memory type needs of the applications that map the ISA address
range.
Reported-and-Tested-by: Max Vozeler <xam@debian.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
On 32 bits, we can have CONFIG_ACPI_SLEEP set without implying
CONFIG_X86_TRAMPOLINE. In that case, we simply do not need to mark
the trampoline as a MAC region.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Shane Wang <shane.wang@intel.com>
Cc: Joseph Cihula <joseph.cihula@intel.com>
An older test-box started hanging at the following point during
bootup:
[ 0.022996] Mount-cache hash table entries: 512
[ 0.024996] Initializing cgroup subsys debug
[ 0.025996] Initializing cgroup subsys cpuacct
[ 0.026995] Initializing cgroup subsys devices
[ 0.027995] Initializing cgroup subsys freezer
[ 0.028995] mce: CPU supports 5 MCE banks
I've bisected it down to commit 4efc0670 ("x86, mce: use 64bit
machine check code on 32bit"), which utilizes the MCE code on
32-bit systems too.
The problem is caused by this detail in my config:
# CONFIG_CPU_SUP_INTEL is not set
This disables the quirks in mce_cpu_quirks() but still enables
MCE support - which then hangs due to the missing quirk
workaround needed on this CPU:
if (c->x86 == 6 && c->x86_model < 0x1A && banks > 0)
mce_banks[0].init = 0;
The safe solution is to not initialize MCEs if we dont know on
what CPU we are running (or if that CPU's support code got
disabled in the config).
Also be a bit more defensive on 32-bit systems: dont do a
boot-time dump of pending MCEs not just on the specific system
that we found a problem with (Pentium-M), but earlier ones as
well.
Now this problem is probably not common and disabling CPU
support is rare - but still being more defensive in something
we turned on for a wide range of CPUs is prudent.
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
LKML-Reference: Message-ID: <4A88E3E4.40506@jp.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
On my legacy Pentium M laptop (Acer Extensa 2900) I get bogus MCE on a cold
boot with CONFIG_X86_NEW_MCE enabled, i.e. (after decoding it with mcelog):
MCE 0
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 1 MCG status:
MCi status:
Error overflow
Uncorrected error
Error enabled
Processor context corrupt
MCA: Data CACHE Level-1 UNKNOWN Error
STATUS f200000000000195 MCGSTATUS 0
[ The other STATUS values observed: f2000000000001b5 (... UNKNOWN error)
and f200000000000115 (... READ Error).
To verify that this is not a CONFIG_X86_NEW_MCE bug I also modified
the CONFIG_X86_OLD_MCE code (which doesn't log any MCEs) to dump
content of STATUS MSR before it is cleared during initialization. ]
Since the bogus MCE results in a kernel taint (which in turn disables
lockdep support) don't log boot MCEs on Pentium M (model == 13) CPUs
by default ("mce=bootlog" boot parameter can be be used to get the old
behavior).
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The function uv_acpi_madt_oem_check() has been marked __init,
the struct apic_x2apic_uv_x has been marked __refdata.
The aim is to address the following section mismatch messages:
WARNING: arch/x86/kernel/apic/built-in.o(.data+0x1368): Section mismatch in reference from the variable apic_x2apic_uv_x to the function .cpuinit.text:uv_wakeup_secondary()
The variable apic_x2apic_uv_x references
the function __cpuinit uv_wakeup_secondary()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
WARNING: arch/x86/kernel/built-in.o(.data+0x68e8): Section mismatch in reference from the variable apic_x2apic_uv_x to the function .cpuinit.text:uv_wakeup_secondary()
The variable apic_x2apic_uv_x references
the function __cpuinit uv_wakeup_secondary()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
WARNING: arch/x86/built-in.o(.text+0x7b36f): Section mismatch in reference from the function uv_acpi_madt_oem_check() to the function .init.text:early_ioremap()
The function uv_acpi_madt_oem_check() references
the function __init early_ioremap().
This is often because uv_acpi_madt_oem_check lacks a __init
annotation or the annotation of early_ioremap is wrong.
WARNING: arch/x86/built-in.o(.text+0x7b38d): Section mismatch in reference from the function uv_acpi_madt_oem_check() to the function .init.text:early_iounmap()
The function uv_acpi_madt_oem_check() references
the function __init early_iounmap().
This is often because uv_acpi_madt_oem_check lacks a __init
annotation or the annotation of early_iounmap is wrong.
WARNING: arch/x86/built-in.o(.data+0x8668): Section mismatch in reference from the variable apic_x2apic_uv_x to the function .cpuinit.text:uv_wakeup_secondary()
The variable apic_x2apic_uv_x references
the function __cpuinit uv_wakeup_secondary()
If the reference is valid then annotate the
variable with __init* or __refdata (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
Signed-off-by: Leonardo Potenza <lpotenza@inwind.it>
LKML-Reference: <200908161855.48302.lpotenza@inwind.it>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
kernel/perf_counter.c
Merge reason: update to latest upstream (-rc6) and resolve
the conflict with urgent fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The SGI UV Broadcast Assist Unit is used to send TLB shootdown
messages to remote nodes of the system. The header of the
message must contain the subnode id of the block in the
receiving hub that handles such messages. It should always be
0x10, the id of the "LB" block.
It had previously been documented as a "must be zero" field.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Acked-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <E1Mc1x7-0005Ce-6t@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The persistent clock of some architectures (e.g. s390) have a
better granularity than seconds. To reduce the delta between the
host clock and the guest clock in a virtualized system change the
read_persistent_clock function to return a struct timespec.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134811.013873340@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
change_clocksource resets the cycle_last value to zero then sets it to
a value read from the clocksource. The reset to zero is required only
for the TSC clocksource to make the read_tsc function work after a
resume. The reason is that the TSC read function uses cycle_last to
detect backwards going TSCs. In the resume case cycle_last contains
the TSC value from the last update before the suspend. On resume the
TSC starts counting from 0 again and would trip over the cycle_last
comparison.
This is subtle and surprising. Move the reset to a resume function in
the tsc code.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <johnstul@us.ibm.com>
Cc: Daniel Walker <dwalker@fifo99.com>
LKML-Reference: <20090814134808.142191175@de.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
S3 sleep requires special setup in tboot. However, the data
structures needed to do such setup are only available if
CONFIG_ACPI_SLEEP is enabled. Abstract them out as much as possible,
so we can have a single tboot_setup_sleep() which either is a proper
implementation or a stub which simply calls BUG().
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Shane Wang <shane.wang@intel.com>
Cc: Joseph Cihula <joseph.cihula@intel.com>
With x86 converted to embedding allocator, lpage doesn't have any user
left. Kill it along with cpa handling code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Embedding percpu first chunk allocator can now handle very sparse unit
mapping. Use embedding allocator instead of lpage for 64bit NUMA.
This removes extra TLB pressure and the need to do complex and fragile
dancing when changing page attributes.
For 32bit, using very sparse unit mapping isn't a good idea because
the vmalloc space is very constrained. 32bit NUMA machines aren't
exactly the focus of optimization and it isn't very clear whether
lpage performs better than page. Use page first chunk allocator for
32bit NUMAs.
As this leaves setup_pcpu_*() functions pretty much empty, fold them
into setup_per_cpu_areas().
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Andi Kleen <andi@firstfloor.org>
Now that percpu core can handle very sparse units, given that vmalloc
space is large enough, embedding first chunk allocator can use any
memory to build the first chunk. This patch teaches
pcpu_embed_first_chunk() about distances between cpus and to use
alloc/free callbacks to allocate node specific areas for each group
and use them for the first chunk.
This brings the benefits of embedding allocator to NUMA configurations
- no extra TLB pressure with the flexibility of unified dynamic
allocator and no need to restructure arch code to build memory layout
suitable for percpu. With units put into atom_size aligned groups
according to cpu distances, using large page for dynamic chunks is
also easily possible with falling back to reuglar pages if large
allocation fails.
Embedding allocator users are converted to specify NULL
cpu_distance_fn, so this patch doesn't cause any visible behavior
difference. Following patches will convert them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Currently units are mapped sequentially into address space. This
patch adds pcpu_unit_offsets[] which allows units to be mapped to
arbitrary offsets from the chunk base address. This is necessary to
allow sparse embedding which might would need to allocate address
ranges and memory areas which aren't aligned to unit size but
allocation atom size (page or large page size). This also simplifies
things a bit by removing the need to calculate offset from unit
number.
With this change, there's no need for the arch code to know
pcpu_unit_size. Update pcpu_setup_first_chunk() and first chunk
allocators to return regular 0 or -errno return code instead of unit
size or -errno.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Till now, non-linear cpu->unit map was expressed using an integer
array which maps each cpu to a unit and used only by lpage allocator.
Although how many units have been placed in a single contiguos area
(group) is known while building unit_map, the information is lost when
the result is recorded into the unit_map array. For lpage allocator,
as all allocations are done by lpages and whether two adjacent lpages
are in the same group or not is irrelevant, this didn't cause any
problem. Non-linear cpu->unit mapping will be used for sparse
embedding and this grouping information is necessary for that.
This patch introduces pcpu_alloc_info which contains all the
information necessary for initializing percpu allocator.
pcpu_alloc_info contains array of pcpu_group_info which describes how
units are grouped and mapped to cpus. pcpu_group_info also has
base_offset field to specify its offset from the chunk's base address.
pcpu_build_alloc_info() initializes this field as if all groups are
allocated back-to-back as is currently done but this will be used to
sparsely place groups.
pcpu_alloc_info is a rather complex data structure which contains a
flexible array which in turn points to nested cpu_map arrays.
* pcpu_alloc_alloc_info() and pcpu_free_alloc_info() are provided to
help dealing with pcpu_alloc_info.
* pcpu_lpage_build_unit_map() is updated to build pcpu_alloc_info,
generalized and renamed to pcpu_build_alloc_info().
@cpu_distance_fn may be NULL indicating that all cpus are of
LOCAL_DISTANCE.
* pcpul_lpage_dump_cfg() is updated to process pcpu_alloc_info,
generalized and renamed to pcpu_dump_alloc_info(). It now also
prints which group each alloc unit belongs to.
* pcpu_setup_first_chunk() now takes pcpu_alloc_info instead of the
separate parameters. All first chunk allocators are updated to use
pcpu_build_alloc_info() to build alloc_info and call
pcpu_setup_first_chunk() with it. This has the side effect of
packing units for sparse possible cpus. ie. if cpus 0, 2 and 4 are
possible, they'll be assigned unit 0, 1 and 2 instead of 0, 2 and 4.
* x86 setup_pcpu_lpage() is updated to deal with alloc_info.
* sparc64 setup_per_cpu_areas() is updated to build alloc_info.
Although the changes made by this patch are pretty pervasive, it
doesn't cause any behavior difference other than packing of sparse
cpus. It mostly changes how information is passed among
initialization functions and makes room for more flexibility.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
First chunk allocators assume percpu areas have been linked using one
of PERCPU_*() macros and depend on __per_cpu_load symbol defined by
those macros, so there isn't much point in passing in static area size
explicitly when it can be easily calculated from __per_cpu_start and
__per_cpu_end. Drop @static_size from all percpu first chunk
allocators and helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
Now that all first chunk allocators are in mm/percpu.c, it makes sense
to make generalize percpu_alloc kernel parameter. Define PCPU_FC_*
and set pcpu_chosen_fc using early_param() in mm/percpu.c. Arch code
can use the set value to determine which first chunk allocator to use.
Signed-off-by: Tejun Heo <tj@kernel.org>
There's no need to build unused first chunk allocators in. Define
CONFIG_NEED_PER_CPU_*_FIRST_CHUNK and let archs enable them
selectively.
Signed-off-by: Tejun Heo <tj@kernel.org>
Page size isn't always 4k depending on arch and configuration. Rename
4k first chunk allocator to page.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Conflicts:
arch/sparc/kernel/smp_64.c
arch/x86/kernel/cpu/perf_counter.c
arch/x86/kernel/setup_percpu.c
drivers/cpufreq/cpufreq_ondemand.c
mm/percpu.c
Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
percpu code has been assuming num_possible_cpus() == nr_cpu_ids which
is incorrect if cpu_possible_map contains holes. This causes percpu
code to access beyond allocated memories and vmalloc areas. On a
sparc64 machine with cpus 0 and 2 (u60), this triggers the following
warning or fails boot.
WARNING: at /devel/tj/os/work/mm/vmalloc.c:106 vmap_page_range_noflush+0x1f0/0x240()
Modules linked in:
Call Trace:
[00000000004b17d0] vmap_page_range_noflush+0x1f0/0x240
[00000000004b1840] map_vm_area+0x20/0x60
[00000000004b1950] __vmalloc_area_node+0xd0/0x160
[0000000000593434] deflate_init+0x14/0xe0
[0000000000583b94] __crypto_alloc_tfm+0xd4/0x1e0
[00000000005844f0] crypto_alloc_base+0x50/0xa0
[000000000058b898] alg_test_comp+0x18/0x80
[000000000058dad4] alg_test+0x54/0x180
[000000000058af00] cryptomgr_test+0x40/0x60
[0000000000473098] kthread+0x58/0x80
[000000000042b590] kernel_thread+0x30/0x60
[0000000000472fd0] kthreadd+0xf0/0x160
---[ end trace 429b268a213317ba ]---
This patch fixes generic percpu functions and sparc64
setup_per_cpu_areas() so that they handle sparse cpu_possible_map
properly.
Please note that on x86, cpu_possible_map() doesn't contain holes and
thus num_possible_cpus() == nr_cpu_ids and this patch doesn't cause
any behavior difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Ingo Molnar <mingo@elte.hu>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf_counter: Report the cloning task as parent on perf_counter_fork()
perf_counter: Fix an ipi-deadlock
perf: Rework/fix the whole read vs group stuff
perf_counter: Fix swcounter context invariance
perf report: Don't show unresolved DSOs and symbols when -S/-d is used
perf tools: Add a general option to enable raw sample records
perf tools: Add a per tracepoint counter attribute to get raw sample
perf_counter: Provide hw_perf_counter_setup_online() APIs
perf list: Fix large list output by using the pager
perf_counter, x86: Fix/improve apic fallback
perf record: Add missing -C option support for specifying profile cpu
perf tools: Fix dso__new handle() to handle deleted DSOs
perf tools: Fix fallback to cplus_demangle() when bfd_demangle() is not available
perf report: Show the tid too in -D
perf record: Fix .tid and .pid fill-in when synthesizing events
perf_counter, x86: Fix generic cache events on P6-mobile CPUs
perf_counter, x86: Fix lapic printk message
arch/x86/kernel/tboot.c needs <asm/fixmap.h>. In most configurations
that ends up getting implicitly included, but not in all, causing
build failures in some configurations.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Joseph Cihula <joseph.cihula@intel.com>
Cc: Shane Wang <shane.wang@intel.com>
Johannes Stezenbach reported that his Pentium-M based
laptop does not have the local APIC enabled by default,
and hence perfcounters do not get initialized.
Add a fallback for this case: allow non-sampled counters
and return with an error on sampled counters. This allows
'perf stat' to work out of box - and allows 'perf top'
and 'perf record' to fall back on a hrtimer based sampling
method.
( Passing 'lapic' on the boot line will allow hardware
sampling to occur - but if the APIC is disabled
permanently by the hardware then this fallback still
allows more systems to use perfcounters. )
Also decouple perfcounter support from X86_LOCAL_APIC.
-v2: fix typo breaking counters on all other systems ...
Reported-by: Johannes Stezenbach <js@sig21.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Kernel is broken for x86 CPUs without CPUID since 2.6.28. It
crashes with NULL pointer dereference in identify_cpu():
766 generic_identify(c);
767
768--> if (this_cpu->c_identify)
769 this_cpu->c_identify(c);
this_cpu is NULL. This is because it's only initialized in
get_cpu_vendor() function, which is not called if the CPU has
no CPUID instruction.
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
LKML-Reference: <200908112000.15993.linux@rainbow-software.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A number of syscalls are not using 'DEFINE_SYSCALL'. I'm not sure why.
Convert x86_64 uname and mmap to use DEFINE_SYSCALL.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
The current state of syscalls tracepoints generates only one event id
for every syscall events.
This patch associates an id with each syscall trace event, so that we
can identify each syscall trace event using the 'perf' tool.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
update FTRACE_SYSCALL_MAX to the current number of syscalls
FTRACE_SYSCALL_MAX is a temporary solution to get the number of
syscalls supported by the arch until we find a more dynamic way
to get this number.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Call arch_init_ftrace_syscalls at boot, so we can determine early the
set of syscalls for the syscall trace events.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Add a new function to support translating a syscall name to number at
runtime.
This allows the syscall event tracer to map syscall names to number.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Due to an erratum with certain AMD Athlon 64 processors, the
BIOS may need to force enable the LAHF_LM capability.
Unfortunately, in at least one case, the BIOS does this even
for processors that do not support the functionality.
Add a specific check that will clear the feature bit for
processors known not to support the LAHF/SAHF instructions.
Signed-off-by: Kevin Winchester <kjwinchester@gmail.com>
Acked-by: Borislav Petkov <petkovbb@googlemail.com>
LKML-Reference: <4A80A5AD.2000209@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Johannes Stezenbach reported that 'perf stat' does not count
cache-miss and cache-references events on his Pentium-M based
laptop.
This is because we left them blank in p6_perfmon_event_map[],
fill them in.
Reported-by: Johannes Stezenbach <js@sig21.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>