To make this work, we teach the page fault handler how to send
signals on failed uaccess. This only works for user addresses
(kernel addresses will never hit the page fault handler in the
first place), so we need to generate signals for those
separately.
This gets the tricky case right: if the user buffer spans
multiple pages and only the second page is invalid, we set
cr2 and si_addr correctly. UML relies on this behavior to
"fault in" pages as needed.
We steal a bit from thread_info.uaccess_err to enable this.
Before this change, uaccess_err was a 32-bit boolean value.
This fixes issues with UML when vsyscall=emulate.
Reported-by: Adrian Bunk <bunk@stusta.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: richard -rw- weinberger <richard.weinberger@gmail.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4c8f91de7ec5cd2ef0f59521a04e1015f11e42b4.1320712291.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, doc: Remove int 0xcc from entry_64.S documentation
x86, vsyscall: Add missing <asm/fixmap.h> to arch/x86/mm/fault.c
Fix up trivial conflicts in arch/x86/mm/fault.c (asm/fixmap.h vs
asm/vsyscall.h: both work, which to use? Whatever..)
Erratum 93 applies to AMD K8 CPUs only, and its workaround
(forcing the upper 32 bits of %rip to all get set under certain
conditions) is actually getting in the way of analyzing page
faults occurring during EFI physical mode runtime calls (in
particular the page table walk shown is completely unrelated to
the actual fault). This is because typically EFI runtime code
lives in the space between 2G and 4G, which - modulo the above
manipulation - is likely to overlap with the kernel or modules
area.
While even for the other errata workarounds their taking effect
could be limited to just the affected CPUs, none of them appears
to be destructive, and they're generally getting called only
outside of performance critical paths, so they're being left
untouched.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4E835FE30200007800058464@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/mm/fault.c now depend on having the symbol VSYSCALL_START
defined, which is best handled by including <asm/fixmap.h> (it isn't
unreasonable we may want other fixed addresses in this file in the
future, and so it is cleaner than including <asm/vsyscall.h>
directly.)
This addresses an x86-64 allnoconfig build failure. On other
configurations it was masked by an indirect path:
<asm/smp.h> -> <asm/apic.h> -> <asm/fixmap.h> -> <asm/vsyscall.h>
... however, the first such include is conditional on CONFIG_X86_LOCAL_APIC.
Originally-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/CA%2B55aFxsOMc9=p02r8-QhJ=h=Mqwckk4_Pnx9LQt5%2BfqMp_exQ@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
arch/x86/mm/fault.c needs to include asm/vsyscall.h to fix a
build error:
arch/x86/mm/fault.c: In function '__bad_area_nosemaphore':
arch/x86/mm/fault.c:728: error: 'VSYSCALL_START' undeclared (first use in this function)
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-tip:
x86-64: Rework vsyscall emulation and add vsyscall= parameter
x86-64: Wire up getcpu syscall
x86: Remove unnecessary compile flag tweaks for vsyscall code
x86-64: Add vsyscall:emulate_vsyscall trace event
x86-64: Add user_64bit_mode paravirt op
x86-64, xen: Enable the vvar mapping
x86-64: Work around gold bug 13023
x86-64: Move the "user" vsyscall segment out of the data segment.
x86-64: Pad vDSO to a page boundary
There are three choices:
vsyscall=native: Vsyscalls are native code that issues the
corresponding syscalls.
vsyscall=emulate (default): Vsyscalls are emulated by instruction
fault traps, tested in the bad_area path. The actual contents of
the vsyscall page is the same as the vsyscall=native case except
that it's marked NX. This way programs that make assumptions about
what the code in the page does will not be confused when they read
that code.
vsyscall=none: Trying to execute a vsyscall will segfault.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/8449fb3abf89851fd6b2260972666a6f82542284.1312988155.git.luto@mit.edu
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Three places in the kernel assume that the only long mode CPL 3
selector is __USER_CS. This is not true on Xen -- Xen's sysretq
changes cs to the magic value 0xe033.
Two of the places are corner cases, but as of "x86-64: Improve
vsyscall emulation CS and RIP handling"
(c9712944b2), vsyscalls will segfault
if called with Xen's extra CS selector. This causes a panic when
older init builds die.
It seems impossible to make Xen use __USER_CS reliably without
taking a performance hit on every system call, so this fixes the
tests instead with a new paravirt op. It's a little ugly because
ptrace.h can't include paravirt.h.
Signed-off-by: Andy Lutomirski <luto@mit.edu>
Link: http://lkml.kernel.org/r/f4fcb3947340d9e96ce1054a432f183f9da9db83.1312378163.git.luto@mit.edu
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The nmi parameter indicated if we could do wakeups from the current
context, if not, we would set some state and self-IPI and let the
resulting interrupt do the wakeup.
For the various event classes:
- hardware: nmi=0; PMI is in fact an NMI or we run irq_work_run from
the PMI-tail (ARM etc.)
- tracepoint: nmi=0; since tracepoint could be from NMI context.
- software: nmi=[0,1]; some, like the schedule thing cannot
perform wakeups, and hence need 0.
As one can see, there is very little nmi=1 usage, and the down-side of
not using it is that on some platforms some software events can have a
jiffy delay in wakeup (when arch_irq_work_raise isn't implemented).
The up-side however is that we can remove the nmi parameter and save a
bunch of conditionals in fast paths.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Michael Cree <mcree@orcon.net.nz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Eric B Munson <emunson@mgebm.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/n/tip-agjev8eu666tvknpb3iaj0fg@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo suggested SIGKILL check should be moved into slowpath
function. This will reduce the page fault fastpath impact
of this recent commit:
37b23e0525d3: x86,mm: make pagefault killable
Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: kamezawa.hiroyu@jp.fujitsu.com
Cc: minchan.kim@gmail.com
Cc: willy@linux.intel.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/4DDE0B5C.9050907@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When an oom killing occurs, almost all processes are getting stuck at the
following two points.
1) __alloc_pages_nodemask
2) __lock_page_or_retry
1) is not very problematic because TIF_MEMDIE leads to an allocation
failure and getting out from page allocator.
2) is more problematic. In an OOM situation, zones typically don't have
page cache at all and memory starvation might lead to greatly reduced IO
performance. When a fork bomb occurs, TIF_MEMDIE tasks don't die quickly,
meaning that a fork bomb may create new process quickly rather than the
oom-killer killing it. Then, the system may become livelocked.
This patch makes the pagefault interruptible by SIGKILL.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e66eed651f ("list: remove prefetching from regular list
iterators") removed the include of prefetch.h from list.h, which
uncovered several cases that had apparently relied on that rather
obscure header file dependency.
So this fixes things up a bit, using
grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
to guide us in finding files that either need <linux/prefetch.h>
inclusion, or have it despite not needing it.
There are more of them around (mostly network drivers), but this gets
many core ones.
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's forbidden to take the page_table_lock with the irq disabled
or if there's contention the IPIs (for tlb flushes) sent with
the page_table_lock held will never run leading to a deadlock.
Nobody takes the pgd_lock from irq context so the _irqsave can be
removed.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <201102162345.p1GNjMjm021738@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
mm_fault_error() should not execute oom-killer, if page fault
occurs in kernel space. E.g. in copy_from_user()/copy_to_user().
This would happen if we find ourselves in OOM on a
copy_to_user(), or a copy_from_user() which faults.
Without this patch, the kernels hangs up in copy_from_user(),
because OOM killer sends SIG_KILL to current process, but it
can't handle a signal while in syscall, then the kernel returns
to copy_from_user(), reexcute current command and provokes
page_fault again.
With this patch the kernel return -EFAULT from copy_from_user().
The code, which checks that page fault occurred in kernel space,
has been copied from do_sigbus().
This situation is handled by the same way on powerpc, xtensa,
tile, ...
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <201103092322.p29NMNPH001682@imap1.linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
access_error() already takes error_code as an argument, so there is
no need for an additional write flag.
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This change reduces mmap_sem hold times that are caused by waiting for
disk transfers when accessing file mapped VMAs.
It introduces the VM_FAULT_ALLOW_RETRY flag, which indicates that the call
site wants mmap_sem to be released if blocking on a pending disk transfer.
In that case, filemap_fault() returns the VM_FAULT_RETRY status bit and
do_page_fault() will then re-acquire mmap_sem and retry the page fault.
It is expected that the retry will hit the same page which will now be
cached, and thus it will complete with a low mmap_sem hold time.
Tests:
- microbenchmark: thread A mmaps a large file and does random read accesses
to the mmaped area - achieves about 55 iterations/s. Thread B does
mmap/munmap in a loop at a separate location - achieves 55 iterations/s
before, 15000 iterations/s after.
- We are seeing related effects in some applications in house, which show
significant performance regressions when running without this change.
[akpm@linux-foundation.org: fix warning & crash]
Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Ying Han <yinghan@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-32, percpu: Correct the ordering of the percpu readmostly section
x86, mm: Enable ARCH_DMA_ADDR_T_64BIT with X86_64 || HIGHMEM64G
x86: Spread tlb flush vector between nodes
percpu: Introduce a read-mostly percpu API
x86, mm: Fix incorrect data type in vmalloc_sync_all()
x86, mm: Hold mm->page_table_lock while doing vmalloc_sync
x86, mm: Fix bogus whitespace in sync_global_pgds()
x86-32: Fix sparse warning for the __PHYSICAL_MASK calculation
x86, mm: Add RESERVE_BRK_ARRAY() helper
mm, x86: Saving vmcore with non-lazy freeing of vmas
x86, kdump: Change copy_oldmem_page() to use cached addressing
x86, mm: fix uninitialized addr in kernel_physical_mapping_init()
x86, kmemcheck: Remove double test
x86, mm: Make spurious_fault check explicitly check the PRESENT bit
x86-64, mem: Update all PGDs for direct mapping and vmemmap mapping changes
x86, mm: Separate x86_64 vmalloc_sync_all() into separate functions
x86, mm: Avoid unnecessary TLB flush
arch/x86/mm/fault.c: In function 'vmalloc_sync_all':
arch/x86/mm/fault.c:238: warning: assignment makes integer from pointer without a cast
introduced by 617d34d9e5.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
LKML-Reference: <20101020103642.GA3135@kryptos.osrc.amd.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Take mm->page_table_lock while syncing the vmalloc region. This prevents
a race with the Xen pagetable pin/unpin code, which expects that the
page_table_lock is already held. If this race occurs, then Xen can see
an inconsistent page type (a page can either be read/write or a pagetable
page, and pin/unpin converts it between them), which will cause either
the pin or the set_p[gm]d to fail; either will crash the kernel.
vmalloc_sync_all() should be called rarely, so this extra use of
page_table_lock should not interfere with its normal users.
The mm pointer is stashed in the pgd page's index field, as that won't
be otherwise used for pgds.
Reported-by: Ian Campbell <ian.cambell@eu.citrix.com>
Originally-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4CB88A4C.1080305@goop.org>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
In x86, faults exit by executing the iret instruction, which then
reenables NMIs if we faulted in NMI context. Then if a fault
happens in NMI, another NMI can nest after the fault exits.
But we don't yet support nested NMIs because we have only one NMI
stack. To prevent from that, check that vmalloc and kmemcheck
faults don't happen in this context. Most of the other kernel faults
in NMIs can be more easily spotted by finding explicit
copy_from,to_user() calls on review.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
An earlier patch fixed the hwpoison fault handling to encode the
huge page size in the fault code of the page fault handler.
This is needed to report this information in SIGBUS to user space.
This is a straight forward patch to pass this information
through to the signal handling in the x86 specific fault.c
Cc: x86@kernel.org
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: fengguang.wu@intel.com
Signed-off-by: Andi Kleen <ak@linux.intel.com>
pte_present() returns true even present bit isn't set but _PAGE_PROTNONE
(global bit) bit is set. While with CONFIG_DEBUG_PAGEALLOC, free pages have
global bit set but present bit clear. This patch makes we could catch
free pages access with CONFIG_DEBUG_PAGEALLOC enabled.
[ hpa: added a comment in the code as a warning to janitors ]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <1280217988.32400.75.camel@sli10-desk.sh.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
No behavior change.
Move some of vmalloc_sync_all() code into a new function
sync_global_pgds() that will be useful for memory hotplug.
Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
LKML-Reference: <4C6E4ECD.1090607@linux.intel.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
It's wrong for several reasons, but the most direct one is that the
fault may be for the stack accesses to set up a previous SIGBUS. When
we have a kernel exception, the kernel exception handler does all the
fixups, not some user-level signal handler.
Even apart from the nested SIGBUS issue, it's also wrong to give out
kernel fault addresses in the signal handler info block, or to send a
SIGBUS when a system call already returns EFAULT.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Limit number of per cpu TSC sync messages
x86: dumpstack, 64-bit: Disable preemption when walking the IRQ/exception stacks
x86: dumpstack: Clean up the x86_stack_ids[][] initalization and other details
x86, cpu: mv display_cacheinfo -> cpu_detect_cache_sizes
x86: Suppress stack overrun message for init_task
x86: Fix cpu_devs[] initialization in early_cpu_init()
x86: Remove CPU cache size output for non-Intel too
x86: Minimise printk spew from per-vendor init code
x86: Remove the CPU cache size printk's
cpumask: Avoid cpumask_t in arch/x86/kernel/apic/nmi.c
x86: Make sure we also print a Code: line for show_regs()
init_task doesn't get its stack end location set to
STACK_END_MAGIC, and hence the message is confusing
rather than helpful in this case.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4B06AEFE02000078000211F4@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_OOM, the only difference is that a different
si_code is passed to user space and the new addr_lsb field is initialized.
v2: Make the printk more verbose/unique
Cc: x86@kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, highmem_32.c: Clean up comment
x86, pgtable.h: Clean up types
x86: Clean up dump_pagetable()
Add __kprobes to the functions which handle in-kernel fixable page
faults. Since kprobes can cause those in-kernel page faults by accessing
kprobe data structures, probing those fault functions will cause
fault-int3-loop (do_page_fault has already been marked as __kprobes).
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20090827172311.8246.92725.stgit@localhost.localdomain>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Since commit 5fd29d6c ("printk: clean up handling of log-levels
and newlines"), the kernel logs segfaults like:
<6>gnome-power-man[24509]: segfault at 20 ip 00007f9d4950465a sp 00007fffbb50fc70 error 4 in libgobject-2.0.so.0.2103.0[7f9d494f7000+45000]
with the extra "<6>" being KERN_INFO. This happens because the
printk in show_signal_msg() started with KERN_CONT and then
used "%s" to pass in the real level; and KERN_CONT is no longer
an empty string, and printk only pays attention to the level at
the very beginning of the format string.
Therefore, remove the KERN_CONT from this printk, since it is
now actively causing problems (and never really made any
sense).
Signed-off-by: Roland Dreier <roland@digitalvampire.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <874otjitkj.fsf@shaolin.home.digitalvampire.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.
<level> is now included in the output on each additional use.
Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use pgtable access helpers for 32-bit version dump_pagetable()
and get rid of __typeof__() operators. This needs to make
pmd_pfn() available for 2-level pgtable.
Also, remove some casts for 64-bit version dump_pagetable().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
LKML-Reference: <20090627063514.GA2834@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This allows the callers to now pass down the full set of FAULT_FLAG_xyz
flags to handle_mm_fault(). All callers have been (mechanically)
converted to the new calling convention, there's almost certainly room
for architectures to clean up their code and then add FAULT_FLAG_RETRY
when that support is added.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (45 commits)
x86, mce: fix error path in mce_create_device()
x86: use zalloc_cpumask_var for mce_dev_initialized
x86: fix duplicated sysfs attribute
x86: de-assembler-ize asm/desc.h
i386: fix/simplify espfix stack switching, move it into assembly
i386: fix return to 16-bit stack from NMI handler
x86, ioapic: Don't call disconnect_bsp_APIC if no APIC present
x86: Remove duplicated #include's
x86: msr.h linux/types.h is only required for __KERNEL__
x86: nmi: Add Intel processor 0x6f4 to NMI perfctr1 workaround
x86, mce: mce_intel.c needs <asm/apic.h>
x86: apic/io_apic.c: dmar_msi_type should be static
x86, io_apic.c: Work around compiler warning
x86: mce: Don't touch THERMAL_APIC_VECTOR if no active APIC present
x86: mce: Handle banks == 0 case in K7 quirk
x86, boot: use .code16gcc instead of .code16
x86: correct the conversion of EFI memory types
x86: cap iomem_resource to addressable physical memory
x86, mce: rename _64.c files which are no longer 64-bit-specific
x86, mce: mce.h cleanup
...
Manually fix up trivial conflict in arch/x86/mm/fault.c
Prefetch instructions can generate spurious faults on certain
models of older CPUs. The faults themselves cannot be stopped
and they can occur pretty much anywhere - so the way we solve
them is that we detect certain patterns and ignore the fault.
There is one small path of code where we must not take faults
though: the #PF handler execution leading up to the reading
of the CR2 (the faulting address). If we take a fault there
then we destroy the CR2 value (with that of the prefetching
instruction's) and possibly mishandle user-space or
kernel-space pagefaults.
It turns out that in current upstream we do exactly that:
prefetchw(&mm->mmap_sem);
/* Get the faulting address: */
address = read_cr2();
This is not good.
So turn around the order: first read the cr2 then prefetch
the lock address. Reading cr2 is plenty fast (2 cycles) so
delaying the prefetch by this amount shouldnt be a big issue
performance-wise.
[ And this might explain a mystery fault.c warning that sometimes
occurs on one an old AMD/Semptron based test-system i have -
which does have such prefetch problems. ]
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
LKML-Reference: <20090616030522.GA22162@Krystal>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The hooks that we modify are:
- Page fault handler (to handle kmemcheck faults)
- Debug exception handler (to hide pages after single-stepping
the instruction that caused the page fault)
Also redefine memset() to use the optimized version if kmemcheck is
enabled.
(Thanks to Pekka Enberg for minimizing the impact on the page fault
handler.)
As kmemcheck doesn't handle MMX/SSE instructions (yet), we also disable
the optimized xor code, and rely instead on the generic C implementation
in order to avoid false-positive warnings.
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
[whitespace fixlet]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
Pure renames only, to PERF_COUNT_HW_* and PERF_COUNT_SW_*.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-xen-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (42 commits)
xen: cache cr0 value to avoid trap'n'emulate for read_cr0
xen/x86-64: clean up warnings about IST-using traps
xen/x86-64: fix breakpoints and hardware watchpoints
xen: reserve Xen start_info rather than e820 reserving
xen: add FIX_TEXT_POKE to fixmap
lguest: update lazy mmu changes to match lguest's use of kvm hypercalls
xen: honour VCPU availability on boot
xen: add "capabilities" file
xen: drop kexec bits from /sys/hypervisor since kexec isn't implemented yet
xen/sys/hypervisor: change writable_pt to features
xen: add /sys/hypervisor support
xen/xenbus: export xenbus_dev_changed
xen: use device model for suspending xenbus devices
xen: remove suspend_cancel hook
xen/dev-evtchn: clean up locking in evtchn
xen: export ioctl headers to userspace
xen: add /dev/xen/evtchn driver
xen: add irq_from_evtchn
xen: clean up gate trap/interrupt constants
xen: set _PAGE_NX in __supported_pte_mask before pagetable construction
...
Andrew pointed out that the 'once' variable has a needlessly
function-global scope. We can in fact eliminate it completely,
via the use of printk_once().
[ Impact: cleanup ]
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul suggested we allow for data addresses to be recorded along with
the traditional IPs as power can provide these.
For now, only the software pagefault events provide data addresses,
but in the future power might as well for some events.
x86 doesn't seem capable of providing this atm.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130409.394816925@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Provide separate sw counters for major and minor page faults.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We use the generic software counter infrastructure to provide
page fault events.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Eliminate more than 20 unnecessary #include lines in fault.c
Also fix include file dependency bug in asm/traps.h. (this was
masked before, by implicit inclusion)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <tip-56aea8468746e673a4bf50b6a13d97b2d1cbe1e8@git.kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: simplification, robustness
Make paravirt_lazy_mode() always return PARAVIRT_LAZY_NONE
when in an interrupt. This prevents interrupt code from
accidentally inheriting an outer lazy state, and instead
does everything synchronously. Outer batched operations
are left deferred.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Impact: extend prefetch handling on 64-bit
Currently there's an extra is_prefetch() check done in do_sigbus(),
which we only do on 32 bits.
This is a last-ditch check before we terminate a task, so it's worth
giving prefetch instructions another chance - should none of our
existing quirks have caught a prefetch instruction related spurious
fault.
The only risk is if a prefetch causes a real sigbus, in that case
we'll not OOM but try another fault. But this code has been on
32-bit for a long time, so it should be fine in practice.
So do this on 64-bit too - and thus remove one more #ifdef.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Removal of an #ifdef in fault_in_kernel_space(), by making
use of the new TASK_SIZE_MAX symbol which is now available
on 32-bit too.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Rename TASK_SIZE64 to TASK_SIZE_MAX, and provide the
define on 32-bit too. (mapped to TASK_SIZE)
This allows 32-bit code to make use of the (former-) TASK_SIZE64
symbol as well, in a clean way.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
do_page_fault() has this ugly #ifdef in its prototype:
#ifdef CONFIG_X86_64
asmlinkage
#endif
void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
Replace it with 'dotraplinkage' which maps to exactly the above
construct: nothing on 32-bit and asmlinkage on 64-bit.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add oops-recursion check to 32-bit
Unify the oops state-machine, to the 64-bit version. It is
slightly more careful in that it does a recursion check
in oops_begin(), and is thus more likely to show the relevant
oops.
It also means that 32-bit will print one more line at the
end of pagefault triggered oopses:
printk(KERN_EMERG "CR2: %016lx\n", address);
Which is generally good information to be seen in partial-dump
digital-camera jpegs ;-)
The downside is the somewhat more complex critical path. Both
variants have been tested well meanwhile by kernel developers
crashing their boxes so i dont think this is a practical worry.
This removes 3 ugly #ifdefs from no_context() and makes the
function a lot nicer read.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: refine/extend page fault related oops printing on 64-bit
- honor the pause_on_oops logic on 64-bit too
- print out NX fault warnings on 64-bit as well
- factor out the NX fault message to make it git-greppable and readable
Note that this means that we do the PF_INSTR check on 32-bit non-PAE
as well where it should not occur ... normally. Cannot hurt.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove an #ifdef from notify_page_fault(). The function still
compiles to nothing in the !CONFIG_KPROBES case.
Introduce kprobes_built_in() and kprobe_fault_handler() helpers
to allow this - they returns 0 if !CONFIG_KPROBES.
No code changed:
text data bss dec hex filename
4618 32 24 4674 1242 fault.o.before
4618 32 24 4674 1242 fault.o.after
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove an #ifdef from kmmio_fault() - we can do this by
providing default implementations for is_kmmio_active()
and kmmio_handler(). The compiler optimizes it all away
in the !CONFIG_MMIOTRACE case.
Also, while at it, clean up mmiotrace.h a bit:
- standard header guards
- standard vertical spaces for structure definitions
No code changed (both with mmiotrace on and off in the config):
text data bss dec hex filename
2947 12 12 2971 b9b fault.o.before
2947 12 12 2971 b9b fault.o.after
Cc: Pekka Paalanen <pq@iki.fi>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve page fault handling robustness
The 'PF_RSVD' flag (bit 3) of the page-fault error_code is a
relatively recent addition to x86 CPUs, so the 32-bit do_fault()
implementation never had it. This flag gets set when the CPU
detects nonzero values in any reserved bits of the page directory
entries.
Extend the existing 64-bit check for PF_RSVD in do_page_fault()
to 32-bit too. If we detect such a fault then we print a more
informative oops and the pagetables.
This unifies the code some more, removes an ugly #ifdef and improves
the 32-bit page fault code robustness a bit. It slightly increases
the 32-bit kernel text size.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Instead of an ugly, open-coded, #ifdef-ed vm86 related legacy check
in do_page_fault(), put it into the check_v8086_mode() helper
function and merge it with an existing #ifdef.
Also, simplify the code flow a tiny bit in the helper.
No code changed:
arch/x86/mm/fault.o:
text data bss dec hex filename
2711 12 12 2735 aaf fault.o.before
2711 12 12 2735 aaf fault.o.after
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: no functionality changed
Factor out the opcode checker into a helper inline.
The code got a tiny bit smaller:
text data bss dec hex filename
4632 32 24 4688 1250 fault.o.before
4618 32 24 4674 1242 fault.o.after
And it got cleaner / easier to review as well.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, no code changed
Clean up various small details, which can be correctness checked
automatically:
- tidy up the include file section
- eliminate unnecessary includes
- introduce show_signal_msg() to clean up code flow
- standardize the code flow
- standardize comments and other style details
- more cleanups, pointed out by checkpatch
No code changed on either 32-bit nor 64-bit:
arch/x86/mm/fault.o:
text data bss dec hex filename
4632 32 24 4688 1250 fault.o.before
4632 32 24 4688 1250 fault.o.after
the md5 changed due to a change in a single instruction:
2e8a8241e7f0d69706776a5a26c90bc0 fault.o.before.asm
c5c3d36e725586eb74f0e10692f0193e fault.o.after.asm
Because a __LINE__ reference in a WARN_ONCE() has changed.
On 32-bit a few stack offsets changed - no code size difference
nor any functionality difference.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix to prevent hard lockup on bad PMD permissions
If the PMD does not have the correct permissions for a page access,
but the PTE does, the spurious fault handler will mistake the fault
as a lazy TLB transaction. This will result in an infinite loop of:
fault -> spurious_fault check (pass) -> return to code -> fault
This patch adds a check and a warn on if the PTE passes the permissions
but the PMD does not.
[ Updated: Ingo Molnar suggested using WARN_ONCE with some text ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
* commit 'remotes/tip/x86/paravirt': (175 commits)
xen: use direct ops on 64-bit
xen: make direct versions of irq_enable/disable/save/restore to common code
xen: setup percpu data pointers
xen: fix 32-bit build resulting from mmu move
x86/paravirt: return full 64-bit result
x86, percpu: fix kexec with vmlinux
x86/vmi: fix interrupt enable/disable/save/restore calling convention.
x86/paravirt: don't restore second return reg
xen: setup percpu data pointers
x86: split loading percpu segments from loading gdt
x86: pass in cpu number to switch_to_new_gdt()
x86: UV fix uv_flush_send_and_wait()
x86/paravirt: fix missing callee-save call on pud_val
x86/paravirt: use callee-saved convention for pte_val/make_pte/etc
x86/paravirt: implement PVOP_CALL macros for callee-save functions
x86/paravirt: add register-saving thunks to reduce caller register pressure
x86/paravirt: selectively save/restore regs around pvops calls
x86: fix paravirt clobber in entry_64.S
x86/pvops: add a paravirt_ident functions to allow special patching
xen: move remaining mmu-related stuff into mmu.c
...
Conflicts:
arch/x86/mach-voyager/voyager_smp.c
arch/x86/mm/fault.c
Prevent kprobes from catching spurious faults which will cause infinite
recursive page-fault and memory corruption by stack overflow.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup
Introduce helper function fault_in_kernel_address() to make editors happy.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: widen debug checks
VirtualBox calls do_page_fault() from an atomic context but runs into a
might_sleep() way pas this point, cure that.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tsk is already assigned to current, drop the redundant second
assignment.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, restructure code to improve assembly
gcc isn't _all_ that smart about spilling registers to stack or reusing
stack slots, even with branch annotations. do_page_fault contained a lot
of functionality, so split unlikely paths into their own functions, and
mark them as noinline just to be sure. I consider this actually to be
somewhat of a cleanup too: the main function now contains about half
the number of lines so the normal path is easier to read, while the error
cases are also nicely split away.
Also, ensure the order of arguments to functions is always the same: regs,
addr, error_code. This can reduce code size a tiny bit, and just looks neater
too.
And add a couple of branch annotations.
Before:
do_page_fault:
subq $360, %rsp #,
After:
do_page_fault:
subq $56, %rsp #,
bloat-o-meter:
add/remove: 8/0 grow/shrink: 0/1 up/down: 2222/-1680 (542)
function old new delta
__bad_area_nosemaphore - 506 +506
no_context - 474 +474
vmalloc_fault - 424 +424
spurious_fault - 358 +358
mm_fault_error - 272 +272
bad_area_access_error - 89 +89
bad_area - 89 +89
bad_area_nosemaphore - 10 +10
do_page_fault 2464 784 -1680
Yes, the total size increases by 542 bytes, due to the extra function calls.
But these will very rarely be called (except for vmalloc_fault) in a normal
workload. Importantly, do_page_fault is less than 1/3rd it's original size,
and touches far less stack.
Existing gotos and branch hints did move a lot of the infrequently used text
out of the fastpath, but that's even further improved after this patch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ajith Kumar noticed:
I was going through the vmalloc fault handling for x86_64 and am unclear
about the following lines in the vmalloc_fault() function.
pgd = pgd_offset(current->mm ?: &init_mm, address);
pgd_ref = pgd_offset_k(address);
Here the intention is to get the pgd corresponding to the current process
and sync it up with the pgd in init_mm(obtained from pgd_offset_k).
However, for kernel threads current->mm is NULL and hence pgd =
pgd_offset(init_mm, address) = pgd_ref which means the fault handler
returns without setting the pgd entry in the MM structure in the context
of which the kernel thread has faulted. This could lead to never-ending
faults and busy looping of kernel threads like pdflush. So, shouldn't the
pgd = pgd_offset(current->mm ?: &init_mm, address); be pgd =
pgd_offset(current->active_mm ?: &init_mm, address);
We can use active_mm unconditionally because it should be always set.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rather than have the pagefault handler kill a process directly if it gets
a VM_FAULT_OOM, have it call into the OOM killer.
With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
oom killing throttling, oom priority adjustment or selective disabling,
panic on oom, etc), it's silly to unconditionally kill the faulting
process at page fault time. Create a hook for pagefault oom path to call
into instead.
Only converted x86 and uml so far.
[akpm@linux-foundation.org: make __out_of_memory() static]
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Dike <jdike@addtoit.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
sched, trace: update trace_sched_wakeup()
tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
Revert "x86: disable X86_PTRACE_BTS"
ring-buffer: prevent false positive warning
ring-buffer: fix dangling commit race
ftrace: enable format arguments checking
x86, bts: memory accounting
x86, bts: add fork and exit handling
ftrace: introduce tracing_reset_online_cpus() helper
tracing: fix warnings in kernel/trace/trace_sched_switch.c
tracing: fix warning in kernel/trace/trace.c
tracing/ring-buffer: remove unused ring_buffer size
trace: fix task state printout
ftrace: add not to regex on filtering functions
trace: better use of stack_trace_enabled for boot up code
trace: add a way to enable or disable the stack tracer
x86: entry_64 - introduce FTRACE_ frame macro v2
tracing/ftrace: add the printk-msg-only option
tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
x86, bts: correctly report invalid bts records
...
Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
being already partly merged by the SH merge.
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (246 commits)
x86: traps.c replace #if CONFIG_X86_32 with #ifdef CONFIG_X86_32
x86: PAT: fix address types in track_pfn_vma_new()
x86: prioritize the FPU traps for the error code
x86: PAT: pfnmap documentation update changes
x86: PAT: move track untrack pfnmap stubs to asm-generic
x86: PAT: remove follow_pfnmap_pte in favor of follow_phys
x86: PAT: modify follow_phys to return phys_addr prot and return value
x86: PAT: clarify is_linear_pfn_mapping() interface
x86: ia32_signal: remove unnecessary declaration
x86: common.c boot_cpu_stack and boot_exception_stacks should be static
x86: fix intel x86_64 llc_shared_map/cpu_llc_id anomolies
x86: fix warning in arch/x86/kernel/microcode_amd.c
x86: ia32.h: remove unused struct sigfram32 and rt_sigframe32
x86: asm-offset_64: use rt_sigframe_ia32
x86: sigframe.h: include headers for dependency
x86: traps.c declare functions before they get used
x86: PAT: update documentation to cover pgprot and remap_pfn related changes - v3
x86: PAT: add pgprot_writecombine() interface for drivers - v3
x86: PAT: change pgprot_noncached to uc_minus instead of strong uc - v3
x86: PAT: implement track/untrack of pfnmap regions for x86 - v3
...
Merge x86/dumpstack into tracing/ftrace because upcoming ftrace changes
depend on cleanups already in x86/dumpstack.
Also merge to latest upstream -rc.
Wrap access to task credentials so that they can be separated more easily from
the task_struct during the introduction of COW creds.
Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
Change some task->e?[ug]id to task_e?[ug]id(). In some places it makes more
sense to use RCU directly rather than a convenient wrapper; these will be
addressed by later patches.
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: James Morris <jmorris@namei.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: James Morris <jmorris@namei.org>
Impact: cleanup
We can remove MMIOTRACE_HOOKS and replace it with just MMIOTRACE.
MMIOTRACE_HOOKS is a remnant from the time when I thought that
something else could also use the kmmio facilities.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change oops_end such that signr=0 signals that do_exit
is not to be called.
Currently, each use of __die is soon followed by a call
to oops_end and 'regs' is set to NULL if oops_end is expected
not to call do_exit. Change all such pairs to set signr=0
instead. On x86_64 oops_end is used 'bare' in die_nmi; use
signr=0 instead of regs=NULL there, too.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix compat-vdso
x86/mm: unify init task OOM handling
x86/mm: do not trigger a kernel warning if user-space disables interrupts and generates a page fault
Linus noticed that the "again:" versus "survive:" OOM logic for
the init task was arbitrarily different.
The 64-bit codepath is the better one, because it correctly re-lookups
the vma after having dropped the ->mmap_sem.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Arjan reported a spike in the following bug pattern in v2.6.27:
http://www.kerneloops.org/searchweek.php?search=lock_page
which happens because hwclock started triggering warnings due to
a (correct) might_sleep() check in the MM code.
The warning occurs because hwclock uses this dubious sequence of
code to run "atomic" code:
static unsigned long
atomic(const char *name, unsigned long (*op)(unsigned long),
unsigned long arg)
{
unsigned long v;
__asm__ volatile ("cli");
v = (*op)(arg);
__asm__ volatile ("sti");
return v;
}
Then it pagefaults in that "atomic" section, triggering the warning.
There is no way the kernel could provide "atomicity" in this path,
a page fault is a cannot-continue machine event so the kernel has to
wait for the page to be filled in.
Even if it was just a minor fault we'd have to take locks and might have
to spend quite a bit of time with interrupts disabled - not nice to irq
latencies in general.
So instead just enable interrupts in the pagefault path unconditionally
if we come from user-space, and handle the fault.
Also, while touching this code, unify some trivial parts of the x86
VM paths at the same time.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Arjan van de Ven <arjan@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The last use of trace_hardirqs_fixup is unnecessary, because the
trap is taken with interrupt off on i386 as well as x86_64, and
the irq-tracer is notified of this from the assembly code.
trace_hardirqs_fixup and trace_hardirqs_fixup_flags are removed
from include/asm-x86/irqflags.h as they are no longer used.
Signed-off-by: Alexander van Heukelum <heukelum@fastmail.fm>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Since the fourth PDPT entry cannot be shared under Xen,
vmalloc_sync_all() must iterate over pmd-s rather than pgd-s here.
Luckily, the code isn't used for native PAE (SHARED_KERNEL_PMD is 1)
and the change is benign to non-PAE.
Also do a little more cleanup in that function.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
declared do_page_fault() in asm-x86/trap.h for both X86_32 and X86_64
removed do_invalid_op declaration from mm/fault.c as it is already declared in asm-x86/trap.h
Signed-off-by: Jaswinder Singh <jaswinder@infradead.org>
vmalloc_sync_all() is only called from register_die_notifier and
alloc_vm_area. Neither is on any performance-critical paths, so
vmalloc_sync_all() itself is not on any hot paths.
Given that the optimisations in vmalloc_sync_all add a fair amount of
code and complexity, and are fairly hard to evaluate for correctness,
it's better to just remove them to simplify the code rather than worry
about its absolute performance.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: xen-devel <xen-devel@lists.xensource.com>
Cc: Stephen Tweedie <sct@redhat.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>