In the first 2MB, kernel text is co-located with kernel static
page tables setup by head_64.S. CONFIG_DEBUG_RODATA chops this
2MB large page mapping to small 4KB pages as we mark the kernel text as RO,
leaving the static page tables as RW.
With CONFIG_DEBUG_RODATA disabled, OLTP run on NHM-EP shows 1% improvement
with 2% reduction in system time and 1% improvement in iowait idle time.
To recover this, move the kernel static page tables to .data section, so that
we don't have to break the first 2MB of kernel text to small pages with
CONFIG_DEBUG_RODATA.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20091014220254.063193621@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Conflicts:
kernel/Makefile
kernel/trace/Makefile
kernel/trace/trace.h
samples/Makefile
Merge reason: We need to be uptodate with the perf events development
branch because we plan to rewrite the breakpoints API on top of
perf events.
Add interleaved NUMA emulation support
This patch interleaves emulated nodes over the system's physical
nodes. This is required for interleave optimizations since
mempolicies, for example, operate by iterating over a nodemask and
act without knowledge of node distances. It can also be used for
testing memory latencies and NUMA bugs in the kernel.
There're a couple of ways to do this:
- divide the number of emulated nodes by the number of physical
nodes and allocate the result on each physical node, or
- allocate each successive emulated node on a different physical
node until all memory is exhausted.
The disadvantage of the first option is, depending on the asymmetry
in node capacities of each physical node, emulated nodes may
substantially differ in size on a particular physical node compared
to another.
The disadvantage of the second option is, also depending on the
asymmetry in node capacities of each physical node, there may be
more emulated nodes allocated on a single physical node as another.
This patch implements the second option; we sacrifice the
possibility that we may have slightly more emulated nodes on a
particular physical node compared to another in lieu of node size
asymmetry.
[ Note that "node capacity" of a physical node is not only a
function of its addressable range, but also is affected by
subtracting out the amount of reserved memory over that range.
NUMA emulation only deals with available, non-reserved memory
quantities. ]
We ensure there is at least a minimal amount of available memory
allocated to each node. We also make sure that at least this
amount of available memory is available in ZONE_DMA32 for any node
that includes both ZONE_DMA32 and ZONE_NORMAL.
This patch also cleans the emulation code up by no longer passing
the statically allocated struct bootnode array among the various
functions. This init.data array is not allocated on the stack since
it may be very large and thus it may be accessed at file scope.
The WARN_ON() for nodes_cover_memory() when faking proximity
domains is removed since it relies on successive nodes always
having greater start addresses than previous nodes; with
interleaving this is no longer always true.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ankita Garg <ankita@in.ibm.com>
Cc: Len Brown <len.brown@intel.com>
LKML-Reference: <alpine.DEB.1.00.0909251519150.14754@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This is the counterpart to "x86: export k8 physical topology" for
SRAT. It is not as invasive because the acpi code already seperates
node setup into detection and registration steps, with the
exception of registering e820 active regions in
acpi_numa_memory_affinity_init(). This is now moved to
acpi_scan_nodes() if NUMA emulation is disabled or deferred.
acpi_numa_init() now returns a value which specifies whether an
underlying SRAT was located. If so, that topology can be used by
the emulation code to interleave emulated nodes over physical nodes
or to register the nodes for ACPI.
acpi_get_nodes() may now be used to export the srat physical
topology of the machine for NUMA emulation.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ankita Garg <ankita@in.ibm.com>
Cc: Len Brown <len.brown@intel.com>
LKML-Reference: <alpine.DEB.1.00.0909251518580.14754@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To eventually interleave emulated nodes over physical nodes, we
need to know the physical topology of the machine without actually
registering it. This does the k8 node setup in two parts:
detection and registration. NUMA emulation can then used the
physical topology detected to setup the address ranges of emulated
nodes accordingly. If emulation isn't used, the k8 nodes are
registered as normal.
Two formals are added to the x86 NUMA setup functions: `acpi' and
`k8'. These represent whether ACPI or K8 NUMA has been detected;
both cannot be true at the same time. This specifies to the NUMA
emulation code whether an underlying physical NUMA topology exists
and which interface to use.
This patch deals solely with separating the k8 setup path into
Northbridge detection and registration steps and leaves the ACPI
changes for a subsequent patch. The `acpi' formal is added here,
however, to avoid touching all the header files again in the next
patch.
This approach also ensures emulated nodes will not span physical
nodes so the true memory latency is not misrepresented.
k8_get_nodes() may now be used to export the k8 physical topology
of the machine for NUMA emulation.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ankita Garg <ankita@in.ibm.com>
Cc: Len Brown <len.brown@intel.com>
LKML-Reference: <alpine.DEB.1.00.0909251518400.14754@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Convert all printk's in arch/x86/mm/k8topology_64.c to use
pr_info() or pr_err() appropriately.
Adds log levels for messages currently lacking them.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Ankita Garg <ankita@in.ibm.com>
Cc: Len Brown <len.brown@intel.com>
LKML-Reference: <alpine.DEB.1.00.0909251517440.14754@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Move the handling of truncated %rip from an iret fault to the fault
entry path.
This allows x86-64 to use the standard search_extable() function.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <1255357103-5418-1-git-send-email-brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Remove redundant non-NUMA topology functions
x86: early_printk: Protect against using the same device twice
x86: Reduce verbosity of "PAT enabled" kernel message
x86: Reduce verbosity of "TSC is reliable" message
x86: mce: Use safer ways to access MCE registers
x86: mce, inject: Use real inject-msg in raise_local
x86: mce: Fix thermal throttling message storm
x86: mce: Clean up thermal throttling state tracking code
x86: split NX setup into separate file to limit unstack-protected code
xen: check EFER for NX before setting up GDT mapping
x86: Cleanup linker script using new linker script macros.
x86: Use section .data.page_aligned for the idt_table.
x86: convert to use __HEAD and HEAD_TEXT macros.
x86: convert compressed loader to use __HEAD and HEAD_TEXT macros.
x86: fix fragile computation of vsyscall address
* 'drm-intel-next' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel: (57 commits)
drm/i915: Handle ERESTARTSYS during page fault
drm/i915: Warn before mmaping a purgeable buffer.
drm/i915: Track purged state.
drm/i915: Remove eviction debug spam
drm/i915: Immediately discard any backing storage for uneeded objects
drm/i915: Do not mis-classify clean objects as purgeable
drm/i915: Whitespace correction for madv
drm/i915: BUG_ON page refleak during unbind
drm/i915: Search harder for a reusable object
drm/i915: Clean up evict from list.
drm/i915: Add tracepoints
drm/i915: framebuffer compression for GM45+
drm/i915: split display functions by chip type
drm/i915: Skip the sanity checks if the current relocation is valid
drm/i915: Check that the relocation points to within the target
drm/i915: correct FBC update when pipe base update occurs
drm/i915: blacklist Acer AspireOne lid status
ACPI: make ACPI button funcs no-ops if not built in
drm/i915: prevent FIFO calculation overflows on 32 bits with high dotclocks
drm/i915: intel_display.c handle latency variable efficiently
...
Fix up trivial conflicts in drivers/gpu/drm/i915/{i915_dma.c|i915_drv.h}
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
On modern systems, the kernel prints the message
x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
once for every CPU.
This gets kind of ridiculous on huge systems; for example, on a
64-thread system I was lucky enough to get:
dmesg| grep 'PAT enabled' | wc
64 704 5174
There is already a BUG() if non-boot CPUs have PAT capabilities
that don't match the boot CPU, so just print the message on the
boot CPU. (I kept the print after the wrmsrl() that enables PAT,
so that the log output continues to mean that the system survived
enabling PAT on the boot CPU)
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
LKML-Reference: <adavdj92sso.fsf@cisco.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Makes code futureproof against the impending change to mm->cpu_vm_mask (to be a pointer).
It's also a chance to use the new cpumask_ ops which take a pointer
(the older ones are deprecated, but there's no hurry for arch code).
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Some archs define MODULED_VADDR/MODULES_END which is not in VMALLOC area.
This is handled only in x86-64. This patch make it more generic. And we
can use vread/vwrite to access the area. Fix it.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For /proc/kcore, each arch registers its memory range by kclist_add().
In usual,
- range of physical memory
- range of vmalloc area
- text, etc...
are registered but "range of physical memory" has some troubles. It
doesn't updated at memory hotplug and it tend to include unnecessary
memory holes. Now, /proc/iomem (kernel/resource.c) includes required
physical memory range information and it's properly updated at memory
hotplug. Then, it's good to avoid using its own code(duplicating
information) and to rebuild kclist for physical memory based on
/proc/iomem.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some 64bit arch has special segment for mapping kernel text. It should be
entried to /proc/kcore in addtion to direct-linear-map, vmalloc area.
This patch unifies KCORE_TEXT entry scattered under x86 and ia64.
I'm not familiar with other archs (mips has its own even after this patch)
but range of [_stext ..._end) is a valid area of text and it's not in
direct-map area, defining CONFIG_ARCH_PROC_KCORE_TEXT is only a necessary
thing to do.
Note: I left mips as it is now.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For /proc/kcore, vmalloc areas are registered per arch. But, all of them
registers same range of [VMALLOC_START...VMALLOC_END) This patch unifies
them. By this. archs which have no kclist_add() hooks can see vmalloc
area correctly.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently, kclist_add() only eats start address and size as its arguments.
Considering to make kclist dynamically reconfigulable, it's necessary to
know which kclists are for System RAM and which are not.
This patch add kclist types as
KCORE_RAM
KCORE_VMALLOC
KCORE_TEXT
KCORE_OTHER
This "type" is used in a patch following this for detecting KCORE_RAM.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since alloc_bootmem() will never return inaccessible (via virtual
addressing) memory anyway, using the ..._low() variant only makes sense
when the physical address range of the allocated memory must fulfill
further constraints, espacially since on 64-bits (or more generally in all
cases where the pools the two variants allocate from are than the full
available range.
Probably the use in alloc_tce_table() could also be eliminated (based on
code inspection of pci-calgary_64.c), but that seems too risky given I
know nothing about that hardware and have no way to test it.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 9617729941 ("Drop free_pages()")
modified nr_free_pages() to return 'unsigned long' instead of 'unsigned
int'. This made the casts to 'unsigned long' in most callers superfluous,
so remove them.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Kyle McMartin <kyle@mcmartin.ca>
Acked-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Haavard Skinnemoen <hskinnemoen@atmel.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Zankel <zankel@tensilica.com>
Cc: Michal Simek <monstr@monstr.eu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move the NX setup into a separate file so that it can be compiled
without stack-protection while leaving the rest of the mm/init code
protected.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
x86-64 assumes NX is available by default, so we need to
explicitly check for it before using NX. Some first-generation
Intel x86-64 processors didn't support NX, and even recent systems
allow it to be disabled in BIOS.
[ Impact: prevent Xen crash on NX-less 64-bit machines ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Stable Kernel <stable@kernel.org>
Bye-bye Performance Counters, welcome Performance Events!
In the past few months the perfcounters subsystem has grown out its
initial role of counting hardware events, and has become (and is
becoming) a much broader generic event enumeration, reporting, logging,
monitoring, analysis facility.
Naming its core object 'perf_counter' and naming the subsystem
'perfcounters' has become more and more of a misnomer. With pending
code like hw-breakpoints support the 'counter' name is less and
less appropriate.
All in one, we've decided to rename the subsystem to 'performance
events' and to propagate this rename through all fields, variables
and API names. (in an ABI compatible fashion)
The word 'event' is also a bit shorter than 'counter' - which makes
it slightly more convenient to write/handle as well.
Thanks goes to Stephane Eranian who first observed this misnomer and
suggested a rename.
User-space tooling and ABI compatibility is not affected - this patch
should be function-invariant. (Also, defconfigs were not touched to
keep the size down.)
This patch has been generated via the following script:
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/PERF_EVENT_/PERF_RECORD_/g' \
-e 's/PERF_COUNTER/PERF_EVENT/g' \
-e 's/perf_counter/perf_event/g' \
-e 's/nb_counters/nb_events/g' \
-e 's/swcounter/swevent/g' \
-e 's/tpcounter_event/tp_event/g' \
$FILES
for N in $(find . -name perf_counter.[ch]); do
M=$(echo $N | sed 's/perf_counter/perf_event/g')
mv $N $M
done
FILES=$(find . -name perf_event.*)
sed -i \
-e 's/COUNTER_MASK/REG_MASK/g' \
-e 's/COUNTER/EVENT/g' \
-e 's/\<event\>/event_id/g' \
-e 's/counter/event/g' \
-e 's/Counter/Event/g' \
$FILES
... to keep it as correct as possible. This script can also be
used by anyone who has pending perfcounters patches - it converts
a Linux kernel tree over to the new naming. We tried to time this
change to the point in time where the amount of pending patches
is the smallest: the end of the merge window.
Namespace clashes were fixed up in a preparatory patch - and some
stylistic fallout will be fixed up in a subsequent patch.
( NOTE: 'counters' are still the proper terminology when we deal
with hardware registers - and these sed scripts are a bit
over-eager in renaming them. I've undone some of that, but
in case there's something left where 'counter' would be
better than 'event' we can undo that on an individual basis
instead of touching an otherwise nicely automated patch. )
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix the following 'make includecheck' warning:
arch/x86/mm/kmemcheck/shadow.c: linux/module.h is included more than once.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Sam Ravnborg <sam@ravnborg.org>
LKML-Reference: <1247065179.4382.51.camel@ht.satnam>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, pat: don't use rb-tree based lookup in reserve_memtype()
x86: Increase MIN_GAP to include randomized stack
Merge reason:
Suresh Siddha (1):
x86, pat: don't use rb-tree based lookup in reserve_memtype()
... requires previous x86/pat commits already pushed to Linus.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Recent enhancement of rb-tree based lookup exposed a bug with the lookup
mechanism in the reserve_memtype() which ensures that there are no conflicting
memtype requests for the memory range.
memtype_rb_search() returns an entry which has a start address <= new start
address. And from here we traverse the linear linked list to check if there
any conflicts with the existing mappings. As the rbtree is based on the
start address of the memory range, it is quite possible that we have several
overlapped mappings whose start address is much less than new requested start
but the end is >= new requested end. This results in conflicting memtype
mappings.
Same bug exists with the old code which uses cached_entry from where
we traverse the linear linked list. But the new rb-tree code exposes this
bug fairly easily.
For now, don't use the memtype_rb_search() and always start the search from
the head of linear linked list in reserve_memtype(). Linear linked list
for most of the systems grow's to few 10's of entries(as we track memory type
of RAM pages using struct page). So we should be ok for now.
We still retain the rbtree and use it to speed up free_memtype() which
doesn't have the same bug(as we know what exactly we are searching for
in free_memtype).
Also use list_for_each_entry_from() in free_memtype() so that we start
the search from rb-tree lookup result.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
LKML-Reference: <1253136483.4119.12.camel@sbs-t61.sc.intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add VM_FAULT_HWPOISON handling to the x86 page fault handler. This is
very similar to VM_FAULT_OOM, the only difference is that a different
si_code is passed to user space and the new addr_lsb field is initialized.
v2: Make the printk more verbose/unique
Cc: x86@kernel.org
Signed-off-by: Andi Kleen <ak@linux.intel.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
powerpc64: convert to dynamic percpu allocator
sparc64: use embedding percpu first chunk allocator
percpu: kill lpage first chunk allocator
x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
percpu: update embedding first chunk allocator to handle sparse units
percpu: use group information to allocate vmap areas sparsely
vmalloc: implement pcpu_get_vm_areas()
vmalloc: separate out insert_vmalloc_vm()
percpu: add chunk->base_addr
percpu: add pcpu_unit_offsets[]
percpu: introduce pcpu_alloc_info and pcpu_group_info
percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
percpu: add @align to pcpu_fc_alloc_fn_t
percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
percpu: drop @static_size from first chunk allocators
percpu: generalize first chunk allocator selection
percpu: build first chunk allocators selectively
percpu: rename 4k first chunk allocator to page
percpu: improve boot messages
percpu: fix pcpu_reclaim() locking
...
Fix trivial conflict as by Tejun Heo in kernel/sched.c
* 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, pat: Fix cacheflush address in change_page_attr_set_clr()
mm: remove !NUMA condition from PAGEFLAGS_EXTENDED condition set
x86: Fix earlyprintk=dbgp for machines without NX
x86, pat: Sanity check remap_pfn_range for RAM region
x86, pat: Lookup the protection from memtype list on vm_insert_pfn()
x86, pat: Add lookup_memtype to get the current memtype of a paddr
x86, pat: Use page flags to track memtypes of RAM pages
x86, pat: Generalize the use of page flag PG_uncached
x86, pat: Add rbtree to do quick lookup in memtype tracking
x86, pat: Add PAT reserve free to io_mapping* APIs
x86, pat: New i/f for driver to request memtype for IO regions
x86, pat: ioremap to follow same PAT restrictions as other PAT users
x86, pat: Keep identity maps consistent with mmaps even when pat_disabled
x86, mtrr: make mtrr_aps_delayed_init static bool
x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init
generic-ipi: Allow cpus not yet online to call smp_call_function with irqs disabled
x86: Fix an incorrect argument of reserve_bootmem()
x86: Fix system crash when loading with "reservetop" parameter
* 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (202 commits)
MAINTAINERS: update KVM entry
KVM: correct error-handling code
KVM: fix compile warnings on s390
KVM: VMX: Check cpl before emulating debug register access
KVM: fix misreporting of coalesced interrupts by kvm tracer
KVM: x86: drop duplicate kvm_flush_remote_tlb calls
KVM: VMX: call vmx_load_host_state() only if msr is cached
KVM: VMX: Conditionally reload debug register 6
KVM: Use thread debug register storage instead of kvm specific data
KVM guest: do not batch pte updates from interrupt context
KVM: Fix coalesced interrupt reporting in IOAPIC
KVM guest: fix bogus wallclock physical address calculation
KVM: VMX: Fix cr8 exiting control clobbering by EPT
KVM: Optimize kvm_mmu_unprotect_page_virt() for tdp
KVM: Document KVM_CAP_IRQCHIP
KVM: Protect update_cr8_intercept() when running without an apic
KVM: VMX: Fix EPT with WP bit change during paging
KVM: Use kvm_{read,write}_guest_virt() to read and write segment descriptors
KVM: x86 emulator: Add adc and sbb missing decoder flags
KVM: Add missing #include
...
* 'x86-xen-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: split __phys_addr out into separate file
xen: use stronger barrier after unlocking lock
xen: only enable interrupts while actually blocking for spinlock
xen: make -fstack-protector work under Xen
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, highmem_32.c: Clean up comment
x86, pgtable.h: Clean up types
x86: Clean up dump_pagetable()
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Decrease the level of some NUMA messages to KERN_DEBUG
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Make memtype_seq_ops const
x86: uv: Clean up uv_ptc_init(), use proc_create()
x86: Use printk_once()
x86/cpu: Clean up various files a bit
x86: Remove duplicated #include
x86, ipi: Clean up safe_smp_processor_id() by using the cpu_has_apic() macro helper
x86: Clean up idt_descr and idt_tableby using NR_VECTORS instead of hardcoded number
x86: Further clean up of mtrr/generic.c
x86: Clean up mtrr/main.c
x86: Clean up mtrr/state.c
x86: Clean up mtrr/mtrr.h
x86: Clean up mtrr/if.c
x86: Clean up mtrr/generic.c
x86: Clean up mtrr/cyrix.c
x86: Clean up mtrr/cleanup.c
x86: Clean up mtrr/centaur.c
x86: Clean up mtrr/amd.c:
x86: ds.c fix invalid assignment
Ever since we enabled GEM, the pre-9xx chipsets (particularly 865) have had
serious stability issues. Back in May a wbinvd was added to the DRM to
work around much of the problem. Some failure remained -- easily visible
by dragging a window around on an X -retro desktop, or by looking at bugzilla.
The chipset flush was on the right track -- hitting the right amount of
memory, and it appears to be the only way to flush on these chipsets, but the
flush page was mapped uncached. As a result, the writes trying to clear the
writeback cache ended up bypassing the cache, and not flushing anything! The
wbinvd would flush out other writeback data and often cause the data we wanted
to get flushed, but not always. By removing the setting of the page to UC
and instead just clflushing the data we write to try to flush it, we get the
desired behavior with no wbinvd.
This exports clflush_cache_range(), which was laying around and happened to
basically match the code I was otherwise going to copy from the DRM.
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: Brice Goglin <Brice.Goglin@ens-lyon.org>
Cc: stable@kernel.org
Currently we are not including randomized stack size when calculating
mmap_base address in arch_pick_mmap_layout for topdown case. This might
cause that mmap_base starts in the stack reserved area because stack is
randomized by 1GB for 64b (8MB for 32b) and the minimum gap is 128MB.
If the stack really grows down to mmap_base then we can get silent mmap
region overwrite by the stack values.
Let's include maximum stack randomization size into MIN_GAP which is
used as the low bound for the gap in mmap.
Signed-off-by: Michal Hocko <mhocko@suse.cz>
LKML-Reference: <1252400515-6866-1-git-send-email-mhocko@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Stable Team <stable@kernel.org>
Conflicts:
kernel/trace/trace_export.c
kernel/trace/trace_kprobe.c
Merge reason: This topic branch lacks an important
build fix in tracing/core:
0dd7b74787eaf7858c6c573353a83c3e2766e674:
tracing: Fix double CPP substitution in TRACE_EVENT_FN
that prevents from multiple tracepoint headers inclusion crashes.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Split __phys_addr out into its own file so we can disable
-fstack-protector in a fine-grained fashion. Also it doesn't
have terribly much to do with the rest of ioremap.c.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
-fstack-protector uses a special per-cpu "stack canary" value.
gcc generates special code in each function to test the canary to make
sure that the function's stack hasn't been overrun.
On x86-64, this is simply an offset of %gs, which is the usual per-cpu
base segment register, so setting it up simply requires loading %gs's
base as normal.
On i386, the stack protector segment is %gs (rather than the usual kernel
percpu %fs segment register). This requires setting up the full kernel
GDT and then loading %gs accordingly. We also need to make sure %gs is
initialized when bringing up secondary cpus too.
To keep things consistent, we do the full GDT/segment register setup on
both architectures.
Because we need to avoid -fstack-protected code before setting up the GDT
and because there's no way to disable it on a per-function basis, several
files need to have stack-protector inhibited.
[ Impact: allow Xen booting with stack-protector enabled ]
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Fix address passed to cpa_flush_range() when changing page
attributes from WB to UC. The address (*addr) is
modified by __change_page_attr_set_clr(). The result is that
the pages being flushed start at the _end_ of the changed range
instead of the beginning.
This should be considered for 2.6.30-stable and 2.6.31-stable.
Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Stable team <stable@kernel.org>
Conflicts:
arch/Kconfig
kernel/trace/trace.h
Merge reason: resolve the conflicts, plus adopt to the new
ring-buffer APIs.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Some NUMA messages in srat_32.c are confusing to users,
because they seem to indicate errors, while in fact they
reflect normal behaviour.
Decrease the level of these messages to KERN_DEBUG so that
they don't show up unnecessarily.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
LKML-Reference: <200909050107.45175.rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo Molnar reported the following kmemcheck warning when running both
kmemleak and kmemcheck enabled:
PM: Adding info for No Bus:vcsa7
WARNING: kmemcheck: Caught 32-bit read from uninitialized memory
(f6f6e1a4)
d873f9f600000000c42ae4c1005c87f70000000070665f666978656400000000
i i i i u u u u i i i i i i i i i i i i i i i i i i i i i u u u
^
Pid: 3091, comm: kmemleak Not tainted (2.6.31-rc7-tip #1303) P4DC6
EIP: 0060:[<c110301f>] EFLAGS: 00010006 CPU: 0
EIP is at scan_block+0x3f/0xe0
EAX: f40bd700 EBX: f40bd780 ECX: f16b46c0 EDX: 00000001
ESI: f6f6e1a4 EDI: 00000000 EBP: f10f3f4c ESP: c2605fcc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
CR0: 8005003b CR2: e89a4844 CR3: 30ff1000 CR4: 000006f0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff4ff0 DR7: 00000400
[<c110313c>] scan_object+0x7c/0xf0
[<c1103389>] kmemleak_scan+0x1d9/0x400
[<c1103a3c>] kmemleak_scan_thread+0x4c/0xb0
[<c10819d4>] kthread+0x74/0x80
[<c10257db>] kernel_thread_helper+0x7/0x3c
[<ffffffff>] 0xffffffff
kmemleak: 515 new suspected memory leaks (see
/sys/kernel/debug/kmemleak)
kmemleak: 42 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
The problem here is that kmemleak will scan partially initialized
objects that makes kmemcheck complain. Fix that up by skipping
uninitialized memory regions when kmemcheck is enabled.
Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Add __kprobes to the functions which handle in-kernel fixable page
faults. Since kprobes can cause those in-kernel page faults by accessing
kprobe data structures, probing those fault functions will cause
fault-int3-loop (do_page_fault has already been marked as __kprobes).
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <20090827172311.8246.92725.stgit@localhost.localdomain>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reason: Change to is_new_memtype_allowed() in x86/urgent
Resolved semantic conflicts in:
arch/x86/mm/pat.c
arch/x86/mm/ioremap.c
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add sanity check for remap_pfn_range of RAM regions using
lookup_memtype(). Previously, we did not have anyway to get the type of
RAM memory regions as they were tracked using a single bit in
page_struct (WB, nonWB). Now we can get the actual type from page struct
(WB, WC, UC_MINUS) and make sure the requester gets that type.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Lookup the reserved memtype during vm_insert_pfn and use that memtype
for the new mapping. This takes care or handling of vm_insert_pfn()
interface in track_pfn_vma*/untrack_pfn_vma.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add a new routine lookup_memtype() to get the current memtype based on
the PAT reserves and frees.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Change reserve_ram_pages_type and free_ram_pages_type to use 2 page
flags to track UC_MINUS, WC, WB and default types. Previous RAM tracking
just tracked WB or NonWB, which was not complete and did not allow
tracking of RAM fully and there was no way to get the actual type
reserved by looking at the page flags.
We use the memtype_lock spinlock for atomicity in dealing with
memtype tracking in struct page.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
PAT memtype tracking uses a linear link list to keep track of IO
(non-RAM) regions and their memtypes. The code used a last_accessed
pointer as a cache to speedup the lookup. As per discussions with
H. Peter Anvin a while back, having a rbtree here will avoid bad
performances in pathological cases where we may end up with huge
linked list. This may not add any noticable performance speedup
in normal case as the number of entires in PAT memtype list tend
to be ~20-30 range. The patch removes the "cached_entry" logic
as with rbtree we have more generic way of speeding up the lookup.
With this patch, we use rbtree to do the quick lookup. We still use
linked list as the memtype range tracked can be of different sizes
and can overlap in different ways. We also keep track of usage counts
with linked list.
Example:
Multiple ioremaps with different sizes
uncached-minus @ 0xfffff00000-0xfffff04000
uncached-minus @ 0xfffff02000-0xfffff03000
And one userlevel mmap and the thread forks a new process
uncached-minus @ 0xbf453000-0xbf454000
uncached-minus @ 0xbf453000-0xbf454000
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
io_mapping_* interfaces were added, mainly for graphics drivers.
Make this interface go through the PAT reserve/free, instead of
hardcoding WC mapping. This makes sure that there are no
aliases due to unconditional WC setting.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Add new routines to request memtype for IO regions. This will currently
be a backend for io_mapping_* routines. But, it can also be made available
to drivers directly in future, in case it is needed.
reserve interface reserves the memory, makes sure we have a compatible
memory type available and keeps the identity map in sync when needed.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
ioremap has this hard-coded check for new type and requested type. That
check differs from other PAT users like /dev/mem mmap, remap_pfn_range
in only one condition where requested type is UC_MINUS and new type
is WC. Under that condition, ioremap fails. But other PAT interfaces succeed
with a WC mapping.
Change to make ioremap be in sync with other PAT APIs and use the same
macro as others. Also changes the error print to KERN_ERR instead of
pr_debug.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make reserve_memtype internally take care of pat disabled case and fallback
to default return values.
Remove the specific pat_disabled checks in track_* routines.
Change kernel_map_sync_memtype to sync identity map even when
pat_disabled.
This change ensures that, even for pat_disabled case, we take care of
keeping identity map in sync. Before this patch, in pat disabled case,
ioremap() keeps the identity maps in sync and other APIs like pci and
/dev/mem mmap don't, which is not a very consistent behavior.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix build with older binutils and consolidate linker script
x86: Fix an incorrect argument of reserve_bootmem()
x86: add vmlinux.lds to targets in arch/x86/boot/compressed/Makefile
xen: rearrange things to fix stackprotector
x86: make sure load_percpu_segment has no stackprotector
i386: Fix section mismatches for init code with !HOTPLUG_CPU
x86, pat: Allow ISA memory range uncacheable mapping requests
This line looks suspicious, because if this is true, then the
'flags' parameter of function reserve_bootmem_generic() will be
unused when !CONFIG_NUMA. I don't think this is what we want.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: akpm@linux-foundation.org
LKML-Reference: <20090821083709.5098.52505.sendpatchset@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As noted in 83d349f35e ("x86: don't send
an IPI to the empty set of CPU's"), some APIC's will be very unhappy
with an empty destination mask. That commit added a WARN_ON() for that
case, and avoided the resulting problem, but didn't fix the underlying
reason for why those empty mask cases happened.
This fixes that, by checking the result of 'cpumask_andnot()' of the
current CPU actually has any other CPU's left in the set of CPU's to be
sent a TLB flush, and not calling down to the IPI code if the mask is
empty.
The reason this started happening at all is that we started passing just
the CPU mask pointers around in commit 4595f9620 ("x86: change
flush_tlb_others to take a const struct cpumask"), and when we did that,
the cpumask was no longer thread-local.
Before that commit, flush_tlb_mm() used to create it's own copy of
'mm->cpu_vm_mask' and pass that copy down to the low-level flush
routines after having tested that it was not empty. But after changing
it to just pass down the CPU mask pointer, the lower level TLB flush
routines would now get a pointer to that 'mm->cpu_vm_mask', and that
could still change - and become empty - after the test due to other
CPU's having flushed their own TLB's.
See
http://bugzilla.kernel.org/show_bug.cgi?id=13933
for details.
Tested-by: Thomas Björnell <thomas.bjornell@gmail.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This line looks suspicious, because if this is true, then the
'flags' parameter of function reserve_bootmem_generic() will be
unused when !CONFIG_NUMA. I don't think this is what we want.
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: akpm@linux-foundation.org
LKML-Reference: <20090821083709.5098.52505.sendpatchset@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Max Vozeler reported:
> Bug 13877 - bogl-term broken with CONFIG_X86_PAT=y, works with =n
>
> strace of bogl-term:
> 814 mmap2(NULL, 65536, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0)
> = -1 EAGAIN (Resource temporarily unavailable)
> 814 write(2, "bogl: mmaping /dev/fb0: Resource temporarily unavailable\n",
> 57) = 57
PAT code maps the ISA memory range as WB in the PAT attribute, so that
fixed range MTRR registers define the actual memory type (UC/WC/WT etc).
But the upper level is_new_memtype_allowed() API checks are failing,
as the request here is for UC and the return tracked type is WB (Tracked type is
WB as MTRR type for this legacy range potentially will be different for each
4k page).
Fix is_new_memtype_allowed() by always succeeding the ISA address range
checks, as the null PAT (WB) and def MTRR fixed range register settings
satisfy the memory type needs of the applications that map the ISA address
range.
Reported-and-Tested-by: Max Vozeler <xam@debian.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
With x86 converted to embedding allocator, lpage doesn't have any user
left. Kill it along with cpa handling code.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Conflicts:
arch/sparc/kernel/smp_64.c
arch/x86/kernel/cpu/perf_counter.c
arch/x86/kernel/setup_percpu.c
drivers/cpufreq/cpufreq_ondemand.c
mm/percpu.c
Conflicts in core and arch percpu codes are mostly from commit
ed78e1e078dd44249f88b1dd8c76dafb39567161 which substituted many
num_possible_cpus() with nr_cpu_ids. As for-next branch has moved all
the first chunk allocators into mm/percpu.c, the changes are moved
from arch code to mm/percpu.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Work around compilation warning in arch/x86/kernel/apm_32.c
x86, UV: Complete IRQ interrupt migration in arch_enable_uv_irq()
x86, 32-bit: Fix double accounting in reserve_top_address()
x86: Don't use current_cpu_data in x2apic phys_pkg_id
x86, UV: Fix UV apic mode
x86, UV: Fix macros for accessing large node numbers
x86, UV: Delete mapping of MMR rangs mapped by BIOS
x86, UV: Handle missing blade-local memory correctly
x86: fix assembly constraints in native_save_fl()
x86, msr: execute on the correct CPU subset
x86: Fix assert syntax in vmlinux.lds.S
x86: Make 64-bit efi_ioremap use ioremap on MMIO regions
x86: Add quirk to make Apple MacBook5,2 use reboot=pci
x86: Fix CPA memtype reserving in the set_pages_array*() cases
x86, pat: Fix set_memory_wc related corruption
x86: fix section mismatch for i386 init code
With VMALLOC_END included in the calculation of MAXMEM (as of
2.6.28) it is no longer correct to also bump __VMALLOC_RESERVE
in reserve_top_address(). Doing so results in needlessly small
lowmem.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <4A71DD2A020000780000D482@vpn.id2.novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The code was incorrectly reserving memtypes using the page
virtual address instead of the physical address. Furthermore,
the code was not ignoring highmem pages as it ought to.
( upstream does not pass in highmem pages yet - but upcoming
graphics code will do it and there's no reason to not handle
this properly in the CPA APIs.)
Fixes: http://bugzilla.kernel.org/show_bug.cgi?id=13884
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: <stable@kernel.org>
Cc: dri-devel@lists.sourceforge.net
Cc: venkatesh.pallipadi@intel.com
LKML-Reference: <1249284345-7654-1-git-send-email-thellstrom@vmware.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Changeset 3869c4aa18
that went in after 2.6.30-rc1 was a seemingly small change to _set_memory_wc()
to make it complaint with SDM requirements. But, introduced a nasty bug, which
can result in crash and/or strange corruptions when set_memory_wc is used.
One such crash reported here
http://lkml.org/lkml/2009/7/30/94
Actually, that changeset introduced two bugs.
* change_page_attr_set() takes &addr as first argument and can the addr value
might have changed on return, even for single page change_page_attr_set()
call. That will make the second change_page_attr_set() in this routine
operate on unrelated addr, that can eventually cause strange corruptions
and bad page state crash.
* The second change_page_attr_set() call, before setting _PAGE_CACHE_WC, should
clear the earlier _PAGE_CACHE_UC_MINUS, as otherwise cache attribute will not
be WC (will be UC instead).
The patch below fixes both these problems. Sending a single patch to fix both
the problems, as the change is to the same line of code. The change to have a
addr_copy is not very clean. But, it is simpler than making more changes
through various routines in pageattr.c.
A huge thanks to Jerome for reporting this problem and providing a simple test
case that helped us root cause the problem.
Reported-by: Jerome Glisse <glisse@freedesktop.org>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20090730214319.GA1889@linux-os.sc.intel.com>
Acked-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
* 'drm-radeon-kms' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (35 commits)
drm/radeon: set fb aperture sizes for framebuffer handoff.
drm/ttm: fix highuser vs dma32 confusion.
drm/radeon: Fix size used for benchmarking BO copies.
drm/radeon: Add radeon.test parameter for running BO GPU copy tests.
drm/radeon/kms: allow interruptible waits for objects.
drm/ttm: powerpc: Fix Highmem cache flushing.
x86: Export kmap_atomic_prot() needed for TTM.
drm/ttm: Fix ttm in-kernel copying of pages with non-standard caching attributes.
drm/ttm: Fix an oops and sync object leak.
drm/radeon/kms: vram sizing on certain r100 chips needs workaround.
drm/radeon: Pay more attention to object placement requested by userspace.
drm/radeon: Fall back to evicting BOs with memcpy if necessary.
drm/radeon: Don't unreserve twice on failure to validate.
drm/radeon/kms: fix bandwidth computation on avivo hardware
drm/radeon/kms: add initial colortiling support.
drm/radeon/kms: fix hotspot handling on pre-avivo chips
drm/radeon/kms: enable frac fb divs on rs600/rs690/rs740
drm/radeon/kms: add PLL flag to prefer frequencies <= the target freq
drm/radeon/kms: block RN50 from using 3D engine.
drm/radeon/kms: fix VRAM sizing like DDX does it.
...
This functionality is needed to kmap_atomic() highmem pages that may
potentially have or are about to set up other mappings with
non-standard caching attributes.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: geode: Mark mfgpt irq IRQF_TIMER to prevent resume failure
x86, amd: Don't probe for extended APIC ID if APICs are disabled
x86, mce: Rename incorrect macro name "CONFIG_X86_THRESHOLD"
x86-64: Fix bad_srat() to clear all state
x86, mce: Fix set_trigger() accessor
x86: Fix movq immediate operand constraints in uaccess.h
x86: Fix movq immediate operand constraints in uaccess_64.h
x86: Add reboot fixup for SBC-fitPC2
x86: Include all of .data.* sections in _edata on 64-bit
x86: Add quirk for Intel DG45ID board to avoid low memory corruption
mm: Pass virtual address to [__]p{te,ud,md}_free_tlb()
Upcoming paches to support the new 64-bit "BookE" powerpc architecture
will need to have the virtual address corresponding to PTE page when
freeing it, due to the way the HW table walker works.
Basically, the TLB can be loaded with "large" pages that cover the whole
virtual space (well, sort-of, half of it actually) represented by a PTE
page, and which contain an "indirect" bit indicating that this TLB entry
RPN points to an array of PTEs from which the TLB can then create direct
entries. Thus, in order to invalidate those when PTE pages are deleted,
we need the virtual address to pass to tlbilx or tlbivax instructions.
The old trick of sticking it somewhere in the PTE page struct page sucks
too much, the address is almost readily available in all call sites and
almost everybody implemets these as macros, so we may as well add the
argument everywhere. I added it to the pmd and pud variants for consistency.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: David Howells <dhowells@redhat.com> [MN10300 & FRV]
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> [s390]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Need to clear both nodes and nodes_add state for start/end.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
LKML-Reference: <20090718065657.GA2898@basil.fritz.box>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
Since commit 5fd29d6c ("printk: clean up handling of log-levels
and newlines"), the kernel logs segfaults like:
<6>gnome-power-man[24509]: segfault at 20 ip 00007f9d4950465a sp 00007fffbb50fc70 error 4 in libgobject-2.0.so.0.2103.0[7f9d494f7000+45000]
with the extra "<6>" being KERN_INFO. This happens because the
printk in show_signal_msg() started with KERN_CONT and then
used "%s" to pass in the real level; and KERN_CONT is no longer
an empty string, and printk only pays attention to the level at
the very beginning of the format string.
Therefore, remove the KERN_CONT from this printk, since it is
now actively causing problems (and never really made any
sense).
Signed-off-by: Roland Dreier <roland@digitalvampire.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <874otjitkj.fsf@shaolin.home.digitalvampire.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Alex found that specjbb2005 still can not run with hugepages on an
x86-64 machine. This only happens when numa is not compiled in.
The root cause: node_set_state will not set it back for us in that case,
so don't clear that when numa is not select in config
[ v2: use node_clear_state instead ]
Reported-and-Tested-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 5fd29d6ccb ("printk: clean up
handling of log-levels and newlines") changed printk semantics. printk
lines with multiple KERN_<level> prefixes are no longer emitted as
before the patch.
<level> is now included in the output on each additional use.
Remove all uses of multiple KERN_<level>s in formats.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix usage of bios intcall()
x86: Remove unused function lapic_watchdog_ok()
x86: Remove unused variable disable_x2apic
x86, kvm: Fix section mismatches in kvm.c
x86: Add missing annotation to arch/x86/lib/copy_user_64.S::copy_to_user
x86: Fix fixmap page order for FIX_TEXT_POKE0,1
amd-iommu: set evt_buf_size correctly
amd-iommu: handle alias entries correctly in init code
x86: Fix printk call in print_local_apic()
x86: Declare check_efer() before it gets used
x86: Mark device_nb as static and fix NULL noise
x86: Remove double declaration of MSR_P6_EVNTSEL0 and MSR_P6_EVNTSEL1
xen: Use kcalloc() in xen_init_IRQ()
x86: Fix fixmap ordering
x86: Fix symbol annotation for arch/x86/lib/clear_page_64.S::clear_page_c
Generalize and move x86 setup_pcpu_lpage() into
pcpu_lpage_first_chunk(). setup_pcpu_lpage() now is a simple wrapper
around the generalized version. Other than taking size parameters and
using arch supplied callbacks to allocate/free/map memory,
pcpu_lpage_first_chunk() is identical to the original implementation.
This simplifies arch code and will help converting more archs to
dynamic percpu allocator.
While at it, factor out pcpu_calc_fc_sizes() which is common to
pcpu_embed_first_chunk() and pcpu_lpage_first_chunk().
[ Impact: code reorganization and generalization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
This check is a left-over from ancient times. We now have the equivalent
check much earlier in both the page fault handler and the debug trap
handler (the calls to kmemcheck_active()).
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
This sparse warning:
arch/x86/mm/init.c:83:16: warning: symbol 'check_efer' was not declared. Should it be static?
triggers because check_efer() is not decalared before using it.
asm/proto.h includes the declaration of check_efer(), so
including asm/proto.h to fix that - this also addresses the
sparse warning.
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <1246458263.6940.22.camel@hpdv5.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Nathan reported that
| commit 73d60b7f74
| Author: Yinghai Lu <yinghai@kernel.org>
| Date: Tue Jun 16 15:33:00 2009 -0700
|
| page-allocator: clear N_HIGH_MEMORY map before we set it again
|
| SRAT tables may contains nodes of very small size. The arch code may
| decide to not activate such a node. However, currently the early boot
| code sets N_HIGH_MEMORY for such nodes. These nodes therefore seem to be
| active although these nodes have no present pages.
|
| For 64bit N_HIGH_MEMORY == N_NORMAL_MEMORY, so that works for 64 bit too
unintentionally and incorrectly clears the cpuset.mems cgroup attribute on
an i386 kvm guest, meaning that cpuset.mems can not be used.
Fix this by only clearing node_states[N_NORMAL_MEMORY] for 64bit only.
and need to do save/restore for that in find_zone_movable_pfn
Reported-by: Nathan Lynch <ntl@pobox.com>
Tested-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@elte.hu>,
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use pgtable access helpers for 32-bit version dump_pagetable()
and get rid of __typeof__() operators. This needs to make
pmd_pfn() available for 2-level pgtable.
Also, remove some casts for 64-bit version dump_pagetable().
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
LKML-Reference: <20090627063514.GA2834@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The init_gbpages() function is conditionally called from
init_memory_mapping() function. There are two call-sites where
this 'after_bootmem' condition can be true: setup_arch() and
mem_init() via pci_iommu_alloc().
Therefore, it's safe to move the call to init_gbpages() to
setup_arch() as it's always called before mem_init().
This removes an after_bootmem use - paving the way to remove
all uses of that state variable.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <Pine.LNX.4.64.0906221731210.19474@melkki.cs.Helsinki.FI>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
lpage allocator aliases a PMD page for each cpu and returns whatever
is unused to the page allocator. When the pageattr of the recycled
pages are changed, this makes the two aliases point to the overlapping
regions with different attributes which isn't allowed and known to
cause subtle data corruption in certain cases.
This can be handled in simliar manner to the x86_64 highmap alias.
pageattr code should detect if the target pages have PMD alias and
split the PMD alias and synchronize the attributes.
pcpur allocator is updated to keep the allocated PMD pages map sorted
in ascending address order and provide pcpu_lpage_remapped() function
which binary searches the array to determine whether the given address
is aliased and if so to which address. pageattr is updated to use
pcpu_lpage_remapped() to detect the PMD alias and split it up as
necessary from cpa_process_alias().
Jan Beulich spotted the original problem and incorrect usage of vaddr
instead of laddr for lookup.
With this, lpage percpu allocator should work correctly. Re-enable
it.
[ Impact: fix subtle lpage pageattr bug and re-enable lpage ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
Reorganize cpa_process_alias() so that new alias condition can be
added easily.
Jan Beulich spotted problem in the original cleanup thread which
incorrectly assumed the two existing conditions were mutially
exclusive.
[ Impact: code reorganization ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jan Beulich <JBeulich@novell.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Ingo Molnar <mingo@elte.hu>
This allows the callers to now pass down the full set of FAULT_FLAG_xyz
flags to handle_mm_fault(). All callers have been (mechanically)
converted to the new calling convention, there's almost certainly room
for architectures to clean up their code and then add FAULT_FLAG_RETRY
when that support is added.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (49 commits)
perfcounter: Handle some IO return values
perf_counter: Push perf_sample_data through the swcounter code
perf_counter tools: Define and use our own u64, s64 etc. definitions
perf_counter: Close race in perf_lock_task_context()
perf_counter, x86: Improve interactions with fast-gup
perf_counter: Simplify and fix task migration counting
perf_counter tools: Add a data file header
perf_counter: Update userspace callchain sampling uses
perf_counter: Make callchain samples extensible
perf report: Filter to parent set by default
perf_counter tools: Handle lost events
perf_counter: Add event overlow handling
fs: Provide empty .set_page_dirty() aop for anon inodes
perf_counter: tools: Makefile tweaks for 64-bit powerpc
perf_counter: powerpc: Add processor back-end for MPC7450 family
perf_counter: powerpc: Make powerpc perf_counter code safe for 32-bit kernels
perf_counter: powerpc: Change how processor-specific back-ends get selected
perf_counter: powerpc: Use unsigned long for register and constraint values
perf_counter: powerpc: Enable use of software counters on 32-bit powerpc
perf_counter tools: Add and use isprint()
...
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (45 commits)
x86, mce: fix error path in mce_create_device()
x86: use zalloc_cpumask_var for mce_dev_initialized
x86: fix duplicated sysfs attribute
x86: de-assembler-ize asm/desc.h
i386: fix/simplify espfix stack switching, move it into assembly
i386: fix return to 16-bit stack from NMI handler
x86, ioapic: Don't call disconnect_bsp_APIC if no APIC present
x86: Remove duplicated #include's
x86: msr.h linux/types.h is only required for __KERNEL__
x86: nmi: Add Intel processor 0x6f4 to NMI perfctr1 workaround
x86, mce: mce_intel.c needs <asm/apic.h>
x86: apic/io_apic.c: dmar_msi_type should be static
x86, io_apic.c: Work around compiler warning
x86: mce: Don't touch THERMAL_APIC_VECTOR if no active APIC present
x86: mce: Handle banks == 0 case in K7 quirk
x86, boot: use .code16gcc instead of .code16
x86: correct the conversion of EFI memory types
x86: cap iomem_resource to addressable physical memory
x86, mce: rename _64.c files which are no longer 64-bit-specific
x86, mce: mce.h cleanup
...
Manually fix up trivial conflict in arch/x86/mm/fault.c
It's really not right to use 'access_ok()', since that is meant for the
normal "get_user()" and "copy_from/to_user()" accesses, which are done
through the TLB, rather than through the page tables.
Why? access_ok() does both too few, and too many checks. Too many,
because it is meant for regular kernel accesses that will not honor the
'user' bit in the page tables, and because it honors the USER_DS vs
KERNEL_DS distinction that we shouldn't care about in GUP. And too few,
because it doesn't do the 'canonical' check on the address on x86-64,
since the TLB will do that for us.
So instead of using a function that isn't meant for this, and does
something else and much more complicated, just do the real rules: we
don't want the range to overflow, and on x86-64, we want it to be a
canonical low address (on 32-bit, all addresses are canonical).
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Improve a few details in perfcounter call-chain recording that
makes use of fast-GUP:
- Use ACCESS_ONCE() to observe the pte value. ptes are fundamentally
racy and can be changed on another CPU, so we have to be careful
about how we access them. The PAE branch is already careful with
read-barriers - but the non-PAE and 64-bit side needs an
ACCESS_ONCE() to make sure the pte value is observed only once.
- make the checks a bit stricter so that we can feed it any kind of
cra^H^H^H user-space input ;-)
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prefetch instructions can generate spurious faults on certain
models of older CPUs. The faults themselves cannot be stopped
and they can occur pretty much anywhere - so the way we solve
them is that we detect certain patterns and ignore the fault.
There is one small path of code where we must not take faults
though: the #PF handler execution leading up to the reading
of the CR2 (the faulting address). If we take a fault there
then we destroy the CR2 value (with that of the prefetching
instruction's) and possibly mishandle user-space or
kernel-space pagefaults.
It turns out that in current upstream we do exactly that:
prefetchw(&mm->mmap_sem);
/* Get the faulting address: */
address = read_cr2();
This is not good.
So turn around the order: first read the cr2 then prefetch
the lock address. Reading cr2 is plenty fast (2 cycles) so
delaying the prefetch by this amount shouldnt be a big issue
performance-wise.
[ And this might explain a mystery fault.c warning that sometimes
occurs on one an old AMD/Semptron based test-system i have -
which does have such prefetch problems. ]
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
LKML-Reference: <20090616030522.GA22162@Krystal>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Introduce a gup_fast() variant which is usable from IRQ/NMI context.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Nick Piggin <npiggin@suse.de>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We've had some troubles in the past with weird instructions. This
patch adds a self-test framework which can be used to verify that
a certain set of opcodes are decoded correctly. Of course, the
opcodes which are not tested can still give the wrong results.
In short, this is just a safeguard to catch unintentional changes
in the opcode decoder. It does not mean that errors can't still
occur!
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
This adds support for tracking the initializedness of memory that
was allocated with the page allocator. Highmem requests are not
tracked.
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
[build fix for !CONFIG_KMEMCHECK]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
As these are allocated using the page allocator, we need to pass
__GFP_NOTRACK before we add page allocator support to kmemcheck.
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
The hooks that we modify are:
- Page fault handler (to handle kmemcheck faults)
- Debug exception handler (to hide pages after single-stepping
the instruction that caused the page fault)
Also redefine memset() to use the optimized version if kmemcheck is
enabled.
(Thanks to Pekka Enberg for minimizing the impact on the page fault
handler.)
As kmemcheck doesn't handle MMX/SSE instructions (yet), we also disable
the optimized xor code, and rely instead on the generic C implementation
in order to avoid false-positive warnings.
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
[whitespace fixlet]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
Lets use kmemcheck_pte_lookup() in kmemcheck_fault() instead of
open-coding it there.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
This patch moves the CONFIG_X86_64 ifdef out of kmemcheck_opcode_decode() by
introducing a version of the function that always returns false for
CONFIG_X86_32.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Multiple ifdef'd definitions of the same global variable is ugly and
error-prone. Fix that up.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
The "Bugs, beware!" printout during is cute but confuses users that something
bad happened so change the text to the more boring "Initialized" message.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
This patch reorders code in error.c so that we can get rid of the forward
declarations.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
kmemcheck/shadow.c needs to include <linux/module.h> to prevent
the following warnings:
linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning : data definition has no type or storage class
linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning : type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
linux-next-20080724/arch/x86/mm/kmemcheck/shadow.c:64: warning : parameter names (without types) in function declaration
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: vegardno@ifi.uio.no
Cc: penberg@cs.helsinki.fi
Cc: akpm <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
General description: kmemcheck is a patch to the linux kernel that
detects use of uninitialized memory. It does this by trapping every
read and write to memory that was allocated dynamically (e.g. using
kmalloc()). If a memory address is read that has not previously been
written to, a message is printed to the kernel log.
Thanks to Andi Kleen for the set_memory_4k() solution.
Andrew Morton suggested documenting the shadow member of struct page.
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
[export kmemcheck_mark_initialized]
[build fix for setup_max_cpus]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
[rebased for mainline inclusion]
Signed-off-by: Vegard Nossum <vegardno@ifi.uio.no>
kernel_physical_mapping_init() could be called in memory hotplug path.
[ Impact: fix potential crash with memory hotplug ]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <20090612045752.GA827@sli10-desk.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
perf_counter: Turn off by default
perf_counter: Add counter->id to the throttle event
perf_counter: Better align code
perf_counter: Rename L2 to LL cache
perf_counter: Standardize event names
perf_counter: Rename enums
perf_counter tools: Clean up u64 usage
perf_counter: Rename perf_counter_limit sysctl
perf_counter: More paranoia settings
perf_counter: powerpc: Implement generalized cache events for POWER processors
perf_counters: powerpc: Add support for POWER7 processors
perf_counter: Accurate period data
perf_counter: Introduce struct for sample data
perf_counter tools: Normalize data using per sample period data
perf_counter: Annotate exit ctx recursion
perf_counter tools: Propagate signals properly
perf_counter tools: Small frequency related fixes
perf_counter: More aggressive frequency adjustment
perf_counter/x86: Fix the model number of Intel Core2 processors
perf_counter, x86: Correct some event and umask values for Intel processors
...
Pure renames only, to PERF_COUNT_HW_* and PERF_COUNT_SW_*.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit c9690998ef (x86: memtest: remove
64-bit division) introduced following compile warning:
arch/x86/mm/memtest.c: In function 'memtest':
arch/x86/mm/memtest.c:56: warning: comparison of distinct pointer types lacks a cast
arch/x86/mm/memtest.c:58: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (244 commits)
Revert "x86, bts: reenable ptrace branch trace support"
tracing: do not translate event helper macros in print format
ftrace/documentation: fix typo in function grapher name
tracing/events: convert block trace points to TRACE_EVENT(), fix !CONFIG_BLOCK
tracing: add protection around module events unload
tracing: add trace_seq_vprint interface
tracing: fix the block trace points print size
tracing/events: convert block trace points to TRACE_EVENT()
ring-buffer: fix ret in rb_add_time_stamp
ring-buffer: pass in lockdep class key for reader_lock
tracing: add annotation to what type of stack trace is recorded
tracing: fix multiple use of __print_flags and __print_symbolic
tracing/events: fix output format of user stack
tracing/events: fix output format of kernel stack
tracing/trace_stack: fix the number of entries in the header
ring-buffer: discard timestamps that are at the start of the buffer
ring-buffer: try to discard unneeded timestamps
ring-buffer: fix bug in ring_buffer_discard_commit
ftrace: do not profile functions when disabled
tracing: make trace pipe recognize latency format flag
...
* 'x86-xen-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (42 commits)
xen: cache cr0 value to avoid trap'n'emulate for read_cr0
xen/x86-64: clean up warnings about IST-using traps
xen/x86-64: fix breakpoints and hardware watchpoints
xen: reserve Xen start_info rather than e820 reserving
xen: add FIX_TEXT_POKE to fixmap
lguest: update lazy mmu changes to match lguest's use of kvm hypercalls
xen: honour VCPU availability on boot
xen: add "capabilities" file
xen: drop kexec bits from /sys/hypervisor since kexec isn't implemented yet
xen/sys/hypervisor: change writable_pt to features
xen: add /sys/hypervisor support
xen/xenbus: export xenbus_dev_changed
xen: use device model for suspending xenbus devices
xen: remove suspend_cancel hook
xen/dev-evtchn: clean up locking in evtchn
xen: export ioctl headers to userspace
xen: add /dev/xen/evtchn driver
xen: add irq_from_evtchn
xen: clean up gate trap/interrupt constants
xen: set _PAGE_NX in __supported_pte_mask before pagetable construction
...
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Clear TS in irq_ts_save() when in an atomic section
x86: Detect use of extended APIC ID for AMD CPUs
x86: memtest: remove 64-bit division
x86, UV: Fix macros for multiple coherency domains
x86: Fix non-lazy GS handling in sys_vm86()
x86: Add quirk for reboot stalls on a Dell Optiplex 360
x86: Fix UV BAU activation descriptor init
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (22 commits)
x86: fix system without memory on node0
x86, mm: Fix node_possible_map logic
mm, x86: remove MEMORY_HOTPLUG_RESERVE related code
x86: make sparse mem work in non-NUMA mode
x86: process.c, remove useless headers
x86: merge process.c a bit
x86: use sparse_memory_present_with_active_regions() on UMA
x86: unify 64-bit UMA and NUMA paging_init()
x86: Allow 1MB of slack between the e820 map and SRAT, not 4GB
x86: Sanity check the e820 against the SRAT table using e820 map only
x86: clean up and and print out initial max_pfn_mapped
x86/pci: remove rounding quirk from e820_setup_gap()
x86, e820, pci: reserve extra free space near end of RAM
x86: fix typo in address space documentation
x86: 46 bit physical address support on 64 bits
x86, mm: fault.c, use printk_once() in is_errata93()
x86: move per-cpu mmu_gathers to mm/init.c
x86: move max_pfn_mapped and max_low_pfn_mapped to setup.c
x86: unify noexec handling
x86: remove (null) in /sys kernel_page_tables
...
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, nmi: Use predefined numbers instead of hardcoded one
x86: asm/processor.h: remove double declaration
x86, mtrr: replace MTRRdefType_MSR with msr-index's MSR_MTRRdefType
x86, mtrr: replace MTRRfix4K_C0000_MSR with msr-index's MSR_MTRRfix4K_C0000
x86, mtrr: remove mtrr MSRs double declaration
x86, mtrr: replace MTRRfix16K_80000_MSR with msr-index's MSR_MTRRfix16K_80000
x86, mtrr: replace MTRRfix64K_00000_MSR with msr-index's MSR_MTRRfix64K_00000
x86, mtrr: replace MTRRcap_MSR with msr-index's MSR_MTRRcap
x86: mce: remove duplicated #include
x86: msr-index.h remove duplicate MSR C001_0015 declaration
x86: clean up arch/x86/kernel/tsc_sync.c a bit
x86: use symbolic name for VM86_SIGNAL when used as vm86 default return
x86: added 'ifndef _ASM_X86_IOMAP_H' to iomap.h
x86: avoid multiple declaration of kstack_depth_to_print
x86: vdso/vma.c declare vdso_enabled and arch_setup_additional_pages before they get used
x86: clean up declarations and variables
x86: apic/x2apic_cluster.c x86_cpu_to_logical_apicid should be static
x86 early quirks: eliminate unused function
Using gcc 3.3.5 a "make allmodconfig" + "CONFIG_KVM=n"
triggers a build error:
arch/x86/mm/built-in.o(.init.text+0x43f7): In function `__change_page_attr':
arch/x86/mm/pageattr.c:114: undefined reference to `__udivdi3'
make: *** [.tmp_vmlinux1] Error 1
The culprit turned out to be a division in arch/x86/mm/memtest.c
For more info see this thread:
http://marc.info/?l=linux-kernel&m=124416232620683
The patch entirely removes the division that caused the build
error.
[ Impact: build fix with certain GCC versions ]
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: xiyou.wangcong@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
LKML-Reference: <20090608170939.GB12431@alberich.amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch resets the bit in dr6 after the corresponding exception is
handled in code, so that we keep a clean track of the current virtual debug
status register.
[ Impact: keep track of breakpoints triggering completion ]
Signed-off-by: K.Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Merge reason: merge almost-rc8 into perfcounters/core, which was -rc6
based - to pick up the latest upstream fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302
On x86 and x86-64, it is possible that page tables are shared beween
shared mappings backed by hugetlbfs. As part of this,
page_table_shareable() checks a pair of vma->vm_flags and they must match
if they are to be shared. All VMA flags are taken into account, including
VM_LOCKED.
The problem is that VM_LOCKED is cleared on fork(). When a process with a
shared memory segment forks() to exec() a helper, there will be shared
VMAs with different flags. The impact is that the shared segment is
sometimes considered shareable and other times not, depending on what
process is checking.
What happens is that the segment page tables are being shared but the
count is inaccurate depending on the ordering of events. As the page
tables are freed with put_page(), bad pmd's are found when some of the
children exit. The hugepage counters also get corrupted and the Total and
Free count will no longer match even when all the hugepage-backed regions
are freed. This requires a reboot of the machine to "fix".
This patch addresses the problem by comparing all flags except VM_LOCKED
when deciding if pagetables should be shared or not for hugetlbfs-backed
mapping.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: <stable@kernel.org>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <starlight@binnacle.cx>
Cc: Eric B Munson <ebmunson@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cleanup cpa_flush_array() to avoid back to back on_each_cpu() calls.
[ Impact: optimizes fix 0af48f42df ]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
cpa_flush_array seems to prefer wbinvd() over clflush at 4M threshold.
clflush needs to be done on only one CPU as per instruction definition.
wbinvd() however, should be done on all CPUs.
[ Impact: fix missing flush which could cause data corruption ]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
wbinvd is supported on all CPUs 486 or later. But,
pageattr.c is checking x86_model >= 4 before wbinvd(), which looks like
an oversight bug. It was first introduced at one place by changeset
d7c8f21a8c and got copied over to second
place in the same file later.
[ Impact: fix missing cache flush on early-model CPUs, potential data corruption ]
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Recently there were some changes to the meaning of node_possible_map,
and it is quite strange:
- the node without memory would be set in node_possible_map
- but some node with less NODE_MIN_SIZE will be kicked out of node_possible_map.
fix it by adding strict_setup_node_bootmem().
Also, remove unparse_node().
so result will be:
1. cpu_to_node() will return online node only (nearest one)
2. apicid_to_node() still returns the node that could be not online but is set
in node_possible_map.
3. node_possible_map will include nodes that mem on it are less NODE_MIN_SIZE
v2: after move_cpus_to_node change.
[ Impact: get node_possible_map right ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <4A0C49BE.6080800@kernel.org>
[ v3: various small cleanups and comment clarifications ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
after:
| commit b263295dbf
| Author: Christoph Lameter <clameter@sgi.com>
| Date: Wed Jan 30 13:30:47 2008 +0100
|
| x86: 64-bit, make sparsemem vmemmap the only memory model
we don't have MEMORY_HOTPLUG_RESERVE anymore.
Historically, x86-64 had an architecture-specific method for memory hotplug
whereby it scanned the SRAT for physical memory ranges that could be
potentially used for memory hot-add later. By reserving those ranges
without physical memory, the memmap would be allocated and left dormant
until needed. This depended on the DISCONTIG memory model which has been
removed so the code implementing HOTPLUG_RESERVE is now dead.
This patch removes the dead code used by MEMORY_HOTPLUG_RESERVE.
(Changelog authored by Mel.)
v2: updated changelog, and remove hotadd= in doc
[ Impact: remove dead code ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Mel Gorman <mel@csn.ul.ie>
Workflow-found-OK-by: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A0C4910.7090508@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With sparse memory, holes should not be marked present for memmap.
This patch makes sure sparsemem really works on SMP mode (!NUMA).
[ Impact: use less memory to map fragmented RAM, avoid boot-OOM/crash ]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
LKML-Reference: <1242117600.22431.0.camel@sli10-desk.sh.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There's no need to use call memory_present() manually on UMA because
initmem_init() sets up early_node_map by calling
e820_register_active_regions().
[ Impact: cleanup ]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1241699742.17846.31.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
64-bit UMA and NUMA versions of paging_init() are almost identical.
Therefore, merge the copy in mm/numa_64.c to mm/init_64.c to remove
duplicate code.
[ Impact: cleanup ]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1241699741.17846.30.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It is expected that there might be slight differences between the e820
map and the SRAT table and the intention was that 1MB of slack be allowed.
The calculation comparing e820ram and pxmram assumes the units are bytes,
when they are in fact pages. This means 4GB of slack is being allowed,
not 1MB. This patch makes the correct comparison.
comment is from Mel.
[ Impact: don't accept buggy SRATs that could dump up to 4G of RAM ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A03E13E.6050107@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
node_cover_memory() sanity checks the SRAT table by ensuring that all
PXMs cover the memory reported in the e820.
However, when calculating the size of the holes in the e820, it uses
the early_node_map[] which contains information taken from both SRAT
and e820. If the SRAT is missing an entry, then it is not detected
that the SRAT table is incorrect and missing entries.
This patch uses the e820 map to calculate the holes instead of
early_node_map[].
comment is from Mel.
[ Impact: reject incorrect SRAT tables ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <4A03E10C.60906@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Do this so we can check the range that is mapped before
init_memory_mapping().
To be able to print out meaningful info, we first have to fix
64-bit to have max_pfn_mapped assigned before that call. This
also unifies the code-path a bit.
[ Impact: print more debug info, cleanup ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49BF0978.40605@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/frv/include/asm/pgtable.h
arch/x86/include/asm/required-features.h
arch/x86/xen/mmu.c
Merge reason: x86/xen was on a .29 base still, move it to a fresher
branch and pick up Xen fixes as well, plus resolve
conflicts
Signed-off-by: Ingo Molnar <mingo@elte.hu>
With the introduction of the .brk section, special care must be taken
that no unused page table entries remain if _brk_end and _end are
separated by a 2M page boundary. cleanup_highmap() runs very early and
hence cannot take care of that, hence potential entries needing to be
removed past _brk_end must be cleared once the brk allocator has done
its job.
[ Impact: avoids undesirable TLB aliases ]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Merge reason: tracing/core was on a .30-rc1 base and was missing out on
on a handful of tracing fixes present in .30-rc5-almost.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The mem= option will truncate the memory map at a specified address so
it's not possible to register nodes with memory beyond the e820 upper
bound.
unparse_node() is only called when then node had memory associated with
it, although with the mem= option it is no longer addressable.
[ Impact: fix boot hang on certain (large) systems ]
Reported-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <alpine.DEB.2.00.0905051248150.20021@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: x86, mmiotrace: fix range test
tracing: fix ref count in splice pages
Andrew pointed out that the 'once' variable has a needlessly
function-global scope. We can in fact eliminate it completely,
via the use of printk_once().
[ Impact: cleanup ]
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch moves the max_pfn_mapped and max_low_pfn_mapped global
variables to kernel/setup.c where they're initialized.
[ Impact: cleanup ]
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1240923649.1982.21.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Follow up to afcfe024ae in Linus' tree
("x86: mmiotrace: quieten spurious warning message")
Signed-off-by: Stuart Bennett <stuart@freedesktop.org>
Acked-by: Pekka Paalanen <pq@iki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1240946271-7083-5-git-send-email-stuart@freedesktop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* change function names to clear_* from set_*: in reality we only clear
and restore page presence, and never unconditionally set present.
Using clear_*({true, false}, ...) is therefore more honest than
set_*({false, true}, ...)
* upgrade presence storage to pteval_t: doing user-space tracing will
require saving and manipulation of the _PAGE_PROTNONE bit, in addition
to the existing _PAGE_PRESENT changes, and having multiple bools stored
and passed around does not seem optimal
[ Impact: refactor, clean up mmiotrace code ]
Signed-off-by: Stuart Bennett <stuart@freedesktop.org>
Acked-by: Pekka Paalanen <pq@iki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1240946271-7083-4-git-send-email-stuart@freedesktop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kmmio_probe being *p and kmmio_fault_page being sometimes *f and
sometimes *p is not helpful.
[ Impact: cleanup ]
Signed-off-by: Stuart Bennett <stuart@freedesktop.org>
Acked-by: Pekka Paalanen <pq@iki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1240946271-7083-3-git-send-email-stuart@freedesktop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Commit dc09855 ("x86/uv: fix init of memory-less nodes") causes a
two sockets system (where node-1 doesn't have RAM installed) to crash.
That commit makes node_possible include cpu nodes that do not have memory.
So check boundary in setup_node_bootmem().
[ Impact: fix boot crash on RAM-less NUMA node system ]
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Jack Steiner <steiner@sgi.com>
LKML-Reference: <49EF89DF.9090404@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
calculate_numa_remap_pages() is called only by __init initmem_init()
further calculate_numa_remap_pages is calling:
__init find_e820_area() and __init reserve_early()
So calculate_numa_remap_pages() should be __init calculate_numa_remap_pages().
WARNING: arch/x86/built-in.o(.text+0x82ea3): Section mismatch in reference from the function calculate_numa_remap_pages() to the function .init.text:find_e820_area()
The function calculate_numa_remap_pages() references
the function __init find_e820_area().
This is often because calculate_numa_remap_pages lacks a __init
annotation or the annotation of find_e820_area is wrong.
WARNING: arch/x86/built-in.o(.text+0x82f5f): Section mismatch in reference from the function calculate_numa_remap_pages() to the function .init.text:reserve_early()
The function calculate_numa_remap_pages() references
the function __init reserve_early().
This is often because calculate_numa_remap_pages lacks a __init
annotation or the annotation of reserve_early is wrong.
[ Impact: save memory, address Section mismatch warning ]
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
LKML-Reference: <1239991281.3153.4.camel@ht.satnam>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add support for nodes that have cpus but no memory.
The current code was failing to add these nodes
to the nodes_present_map.
v2: Fixes case caught by David Rientjes - missed support
for the x2apic SRAT table.
[ Impact: fix potential boot crash on memory-less UV nodes. ]
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>
LKML-Reference: <20090417142242.GA23743@sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: fix microcode driver newly spewing warnings
x86, PAT: Remove page granularity tracking for vm_insert_pfn maps
x86: disable X86_PTRACE_BTS for now
x86, documentation: kernel-parameters replace X86-32,X86-64 with X86
x86: pci-swiotlb.c swiotlb_dma_ops should be static
x86, PAT: Remove duplicate memtype reserve in devmem mmap
x86, PAT: Consolidate code in pat_x_mtrr_type() and reserve_memtype()
x86, PAT: Changing memtype to WC ensuring no WB alias
x86, PAT: Handle faults cleanly in set_memory_ APIs
x86, PAT: Change order of cpa and free in set_memory_wb
x86, CPA: Change idmap attribute before ioremap attribute setup
This change resolves the problem of too many single page entries
in pat_memtype_list and "freeing invalid memtype" errors with i915,
reported here:
http://marc.info/?l=linux-kernel&m=123845244713183&w=2
Remove page level granularity track and untrack of vm_insert_pfn.
memtype tracking at page granularity does not scale and cleaner
approach would be for the driver to request a type for a bigger
IO address range or PCI io memory range for that device, either at
mmap time or driver init time and just use that type during
vm_insert_pfn.
This patch just removes the track/untrack of vm_insert_pfn. That
means we will be in same state as 2.6.28, with respect to these APIs.
Newer APIs for the drivers to request a memtype for a bigger region
is coming soon.
[ Impact: fix Xorg startup warnings and hangs ]
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Tested-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
LKML-Reference: <20090408223716.GC3493@linux-os.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
%p prints out 0x000000000000000 as (null)
so use %lx instead.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49E43282.1090607@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, no code changed
- syscalls.h update declarations due to unifications
- irq.c declare smp_generic_interrupt() before it gets used
- process.c declare sys_fork() and sys_vfork() before they get used
- tsc.c rename tsc_khz shadowed variable
- apic/probe_32.c declare apic_default before it gets used
- apic/nmi.c prev_nmi_count should be unsigned
- apic/io_apic.c declare smp_irq_move_cleanup_interrupt() before it gets used
- mm/init.c declare direct_gbpages and free_initrd_mem before they get used
Signed-off-by: Jaswinder Singh Rajput <jaswinder@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
pat_disable cannot be __cpuinit anymore because it's called from pat_init
and the callchain looks like this:
pat_disable [cpuinit] <- pat_init <- generic_set_all <-
ipi_handler <- set_mtrr <- (other non init/cpuinit functions)
WARNING: arch/x86/mm/built-in.o(.text+0x449e): Section mismatch in reference
from the function pat_init() to the function .cpuinit.text:pat_disable()
The function pat_init() references
the function __cpuinit pat_disable().
This is often because pat_init lacks a __cpuinit
annotation or the annotation of pat_disable is wrong.
Non CONFIG_X86_PAT version of pat_disable is static inline, so this version
can be static too (and there are no callers outside of this file).
Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
LKML-Reference: <49DFB055.6070405@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix kprobes crash on 32-bit with RAM above 4G
Use phys_addr_t for receiving a physical address argument
instead of unsigned long. This allows fixmap to handle
pages higher than 4GB on x86-32.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: systemtap-ml <systemtap@sources.redhat.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <49DE3695.6040800@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
/dev/mem mmap code was doing memtype reserve/free for a while now.
Recently we added memtype tracking in remap_pfn_range, and /dev/mem mmap
uses it indirectly. So, we don't need seperate tracking in /dev/mem code
any more. That means another ~100 lines of code removed :-).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
LKML-Reference: <20090409212709.085210000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix pat_x_mtrr_type() to use UC_MINUS when the mtrr type return UC. This
is to be consistent with ioremap() and ioremap_nocache() which uses
UC_MINUS.
Consolidate the code such that reserve_memtype() also uses
pat_x_mtrr_type() when the caller doesn't specify any special attribute
(non WB attribute).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
LKML-Reference: <20090409212708.939936000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As per SDM, there should not be any aliasing of a WC with any cacheable
type across CPUs. That is if one CPU is changing the identity map
memtype to _WC, no other CPU at the time of this change should not have a
TLB for this page that carries a WB attribute. SDM suggests to make the
page not present. But for that we will have to handle any page faults
that can potentially happen due to these pages being not present.
Other way to deal with this without having any WB mapping is to change
the page first to UC and then to WC. This ensures that we meet the SDM
requirement of no cacheable alais to WC page. This also has same or
lower overhead than marking the page not present and making it present
later.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20090409212708.797481000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Handle faults and do proper cleanups in set_memory_*() functions. In
some cases, these functions were not doing proper free on failure paths.
With the changes to tracking memtype of RAM pages in struct page instead
of pat list, we do not need the changes in commits c5e147. This patch
reverts that change.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20090409212708.653222000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
To be free of aliasing due to races, set_memory_* interfaces should
follow ordering of reserving, changing memtype to UC/WC, changing
memtype back to WB followed by free.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20090409212708.512280000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change the identity mapping with the requested attribute first, before
we setup the virtual memory mapping with the new requested attribute.
This makes sure that there is no window when identity map'ed attribute
may disagree with ioremap range on the attribute type.
This also avoids doing cpa on the ioremap'ed address twice (first in
ioremap_page_range and then in ioremap_change_attr using vaddr), and
should improve ioremap performance a bit.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
LKML-Reference: <20090409212708.373330000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While better than get_user_pages(), the usage of gupf(),
especially the return values and the fact that it can
potentially only partially pin the range, warranted some
documentation.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Cc: npiggin@suse.de
Cc: akpm@linux-foundation.org
LKML-Reference: <1239320729-3262-1-git-send-email-andy.grover@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Paul suggested we allow for data addresses to be recorded along with
the traditional IPs as power can provide these.
For now, only the software pagefault events provide data addresses,
but in the future power might as well for some events.
x86 doesn't seem capable of providing this atm.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
LKML-Reference: <20090408130409.394816925@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* commit 'origin/master': (4825 commits)
Fix build errors due to CONFIG_BRANCH_TRACER=y
parport: Use the PCI IRQ if offered
tty: jsm cleanups
Adjust path to gpio headers
KGDB_SERIAL_CONSOLE check for module
Change KCONFIG name
tty: Blackin CTS/RTS
Change hardware flow control from poll to interrupt driven
Add support for the MAX3100 SPI UART.
lanana: assign a device name and numbering for MAX3100
serqt: initial clean up pass for tty side
tty: Use the generic RS485 ioctl on CRIS
tty: Correct inline types for tty_driver_kref_get()
splice: fix deadlock in splicing to file
nilfs2: support nanosecond timestamp
nilfs2: introduce secondary super block
nilfs2: simplify handling of active state of segments
nilfs2: mark minor flag for checkpoint created by internal operation
nilfs2: clean up sketch file
nilfs2: super block operations fix endian bug
...
Conflicts:
arch/x86/include/asm/thread_info.h
arch/x86/lguest/boot.c
drivers/xen/manage.c
Provide separate sw counters for major and minor page faults.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We use the generic software counter infrastructure to provide
page fault events.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'tracing-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (413 commits)
tracing, net: fix net tree and tracing tree merge interaction
tracing, powerpc: fix powerpc tree and tracing tree interaction
ring-buffer: do not remove reader page from list on ring buffer free
function-graph: allow unregistering twice
trace: make argument 'mem' of trace_seq_putmem() const
tracing: add missing 'extern' keywords to trace_output.h
tracing: provide trace_seq_reserve()
blktrace: print out BLK_TN_MESSAGE properly
blktrace: extract duplidate code
blktrace: fix memory leak when freeing struct blk_io_trace
blktrace: fix blk_probes_ref chaos
blktrace: make classic output more classic
blktrace: fix off-by-one bug
blktrace: fix the original blktrace
blktrace: fix a race when creating blk_tree_root in debugfs
blktrace: fix timestamp in binary output
tracing, Text Edit Lock: cleanup
tracing: filter fix for TRACE_EVENT_FORMAT events
ftrace: Using FTRACE_WARN_ON() to check "freed record" in ftrace_release()
x86: kretprobe-booster interrupt emulation code fix
...
Fix up trivial conflicts in
arch/parisc/include/asm/ftrace.h
include/linux/memory.h
kernel/extable.c
kernel/module.c
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits)
cpumask: remove cpumask allocation from idle_balance, fix
numa, cpumask: move numa_node_id default implementation to topology.h, fix
cpumask: remove cpumask allocation from idle_balance
x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus
x86: cpumask: update 32-bit APM not to mug current->cpus_allowed
x86: microcode: cleanup
x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c
cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash
numa, cpumask: move numa_node_id default implementation to topology.h
cpumask: convert node_to_cpumask_map[] to cpumask_var_t
cpumask: remove x86 cpumask_t uses.
cpumask: use cpumask_var_t in uv_flush_tlb_others.
cpumask: remove cpumask_t assignment from vector_allocation_domain()
cpumask: make Xen use the new operators.
cpumask: clean up summit's send_IPI functions
cpumask: use new cpumask functions throughout x86
x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask
cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t
cpumask: convert node_to_cpumask_map[] to cpumask_var_t
x86: unify 32 and 64-bit node_to_cpumask_map
...
All logical processors with APIC ID values of 255 and greater will have their
APIC reported through Processor X2APIC structure (type-9 entry type) and all
logical processors with APIC ID less than 255 will have their APIC reported
through legacy Processor Local APIC (type-0 entry type) only. This is the
same case even for NMI structure reporting.
The Processor X2APIC Affinity structure provides the association between the
X2APIC ID of a logical processor and the proximity domain to which the logical
processor belongs.
For OSPM, Procssor IDs outside the 0-254 range are to be declared as Device()
objects in the ACPI namespace.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
i386 allnoconfig:
arch/x86/mm/iomap_32.c: In function 'is_io_mapping_possible':
arch/x86/mm/iomap_32.c:27: warning: comparison is always false due to limited range of data type
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Use debug_kmap_atomic in kmap_atomic, kmap_atomic_pfn, and
iomap_atomic_prot_pfn.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
x86 has debug_kmap_atomic_prot() which is error checking function for
kmap_atomic. It is usefull for the other architectures, although it needs
CONFIG_TRACE_IRQFLAGS_SUPPORT.
This patch exposes it to the other architectures.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup
Eliminate more than 20 unnecessary #include lines in fault.c
Also fix include file dependency bug in asm/traps.h. (this was
masked before, by implicit inclusion)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <tip-56aea8468746e673a4bf50b6a13d97b2d1cbe1e8@git.kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: simplification, robustness
Make paravirt_lazy_mode() always return PARAVIRT_LAZY_NONE
when in an interrupt. This prevents interrupt code from
accidentally inheriting an outer lazy state, and instead
does everything synchronously. Outer batched operations
are left deferred.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Tetsuo Handa reported this link bug:
| arch/x86/mm/built-in.o(.init.text+0x1831): In function `early_ioremap_init':
| : undefined reference to `__this_fixmap_does_not_exist'
| make: *** [.tmp_vmlinux1] Error 1
Commit:8827247ffcc9e880cbe4705655065cf011265157 used a variable (which
would be optimized to constant) as fix_to_virt()'s parameter.
It's depended on gcc's optimization and fails on old gcc. (Tetsuo used gcc 3.3)
We can use __fix_to_vir() instead, because we know it's safe and
don't need link time error reporting.
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
Cc: sfr@canb.auug.org.au
LKML-Reference: <49C9FFEA.7060908@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 698609bdcd.
69860 breaks Xen booting, as it relies on head*.S to set up the fixmap
pagetables (as a side-effect of initializing the USB debug port).
Xen, however, does not boot via head*.S, and so the fixmap area is
not initialized.
The specific symptom of the crash is a fault in dmi_scan(), because
the pointer that early_ioremap returns is not actually present.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49C43A8E.5090203@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add new interfaces:
set_pages_array_uc()
set_pages_array_wb()
that can be used change the page attribute for a bunch of pages with
flush etc done once at the end of all the changes. These interfaces
are similar to existing set_memory_array_uc() and set_memory_array_wc().
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: arjan@infradead.org
Cc: eric@anholt.net
Cc: airlied@redhat.com
LKML-Reference: <20090319215358.901545000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add struct page array pointer to cpa struct and CPA_PAGES_ARRAY.
With that we can add change_page_attr_set_clr() a parameter to pass
struct page array pointer and that can be handled by the underlying
cpa code.
cpa_flush_array() is also changed to support both addr array or
struct page pointer array, depending on the flag.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: arjan@infradead.org
Cc: eric@anholt.net
Cc: airlied@redhat.com
LKML-Reference: <20090319215358.758513000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change change_page_attr_set_clr() array parameter to a flag. This helps
following patches which adds an interface to change attr to uc/wb over a
set of pages referred by struct page.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: arjan@infradead.org
Cc: eric@anholt.net
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: airlied@redhat.com
LKML-Reference: <20090319215358.611346000@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, remove last user of set_pte_present
set_pte_vaddr() is only used to install ptes in fixmaps, and
should never be used to overwrite a present mapping.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Xen-devel <xen-devel@lists.xensource.com>
LKML-Reference: <1237406613-2929-1-git-send-email-jeremy@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, reduce memory usage for CONFIG_CPUMASK_OFFSTACK=y
Part of the "getting rid of obsolete cpumask_t" patch:
1) Use cpumask_var_t: this is a pointer if CONFIG_CPUMASK_OFFSTACK=y
2) Call alloc_cpumask_var() on first entry into enter_uniprocessor()
3) Use modern cpumask_* functions.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Pekka Paalanen <pq@iki.fi>
LKML-Reference: <200903111633.55952.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: optimize APIC IPI related barriers
Uncached MMIO accesses for xapic are inherently serializing and hence
we don't need explicit barriers for xapic IPI paths.
x2apic MSR writes/reads don't have serializing semantics and hence need
a serializing instruction or mfence, to make all the previous memory
stores globally visisble before the x2apic msr write for IPI.
Add x2apic_wrmsr_fence() in flush tlb path to x2apic specific paths.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "steiner@sgi.com" <steiner@sgi.com>
Cc: Nick Piggin <npiggin@suse.de>
LKML-Reference: <1237313814.27006.203.camel@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
asm/highmem.h inclusion is added to use kmap_atomic_prot_pfn()
by commit bb6d59ca92
Now kmap_atomic_prot_pfn is moved to iomap_32.c
by commit dd63fdcc63
So the asm/highmem.h inclusion in iomap_32.c is unnecessary now.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
LKML-Reference: <20090315151517.GA29074@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new interface
Add a brk()-like allocator which effectively extends the bss in order
to allow very early code to do dynamic allocations. This is better than
using statically allocated arrays for data in subsystems which may never
get used.
The space for brk allocations is in the bss ELF segment, so that the
space is mapped properly by the code which maps the kernel, and so
that bootloaders keep the space free rather than putting a ramdisk or
something into it.
The bss itself, delimited by __bss_stop, ends before the brk area
(__brk_base to __brk_limit). The kernel text, data and bss is reserved
up to __bss_stop.
Any brk-allocated data is reserved separately just before the kernel
pagetable is built, as that code allocates from unreserved spaces
in the e820 map, potentially allocating from any unused brk memory.
Ultimately any unused memory in the brk area is used in the general
kernel memory pool.
Initially the brk space is set to 1MB, which is probably much larger
than any user needs (the largest current user is i386 head_32.S's code
to build the pagetables to map the kernel, which can get fairly large
with a big kernel image and no PSE support). So long as the system
has sufficient memory for the bootloader to reserve the kernel+1MB brk,
there are no bad effects resulting from an over-large brk.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: fix (CONFIG_MAXSMP=y only) boot crash
c032ef60d1 "cpumask: convert
node_to_cpumask_map[] to cpumask_var_t" didn't get this one
conversion. There was a compile warning, but I missed it.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Mike Travis <travis@sgi.com>
LKML-Reference: <200903132342.42813.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
We are removing cpumask_t in favour of struct cpumask: mainly as a
marker of what code is now CONFIG_CPUMASK_OFFSTACK-safe.
The only non-trivial change here is vector_allocation_domain():
explicitly clear the mask and set the first word, rather than using
assignment.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: reduce kernel memory usage when CONFIG_CPUMASK_OFFSTACK=y
Straightforward conversion: done for 32 and 64 bit kernels.
node_to_cpumask_map is now a cpumask_var_t array.
64-bit used to be a dynamic cpumask_t array, and 32-bit used to be a
static cpumask_t array.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: cleanup
We take the 64-bit code and use it on 32-bit as well. The new file
is called mm/numa.c.
In a minor cleanup, we use cpu_none_mask instead of declaring a local
cpu_mask_none.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Impact: fix false positive PAT warnings - also fix VirtalBox hang
Use of vma->vm_pgoff to identify the pfnmaps that are fully
mapped at mmap time is broken. vm_pgoff is set by generic mmap
code even for cases where drivers are setting up the mappings
at the fault time.
The problem was originally reported here:
http://marc.info/?l=linux-kernel&m=123383810628583&w=2
Change is_linear_pfn_mapping logic to overload VM_INSERTPAGE
flag along with VM_PFNMAP to mean full PFNMAP setup at mmap
time.
Problem also tracked at:
http://bugzilla.kernel.org/show_bug.cgi?id=12800
Reported-by: Thomas Hellstrom <thellstrom@vmware.com>
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha>@intel.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "ebiederm@xmission.com" <ebiederm@xmission.com>
Cc: <stable@kernel.org> # only for 2.6.29.1, not .28
LKML-Reference: <20090313004527.GA7176@linux-os.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: kernel image size reduction
Since in most configurations the pmd page needed maps the same range of
virtual addresses which is also mapped by the earlier inserted one for
covering FIX_DBGP_BASE, that page (and its insertion in the page
tables) can be avoided altogether by detecting the condition at compile
time.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B91826.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, save memory
The function is only being called from boot or memory hotplug paths.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B910B6.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: 32/64-bit consolidation
In a first step, this allows fixing phys_addr_valid() for PAE (which
until now reported all addresses to be valid). Subsequently, this will
also allow simplifying some MTRR handling code.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
LKML-Reference: <49B9101E.76E4.0078.0@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: build fix
Move kmap_atomic_prot_pfn() to iomap_32.c. It is used on all 32-bit
kernels, while highmem_32.c is only built on highmem kernels.
( Note: the debug_kmap_atomic_prot() check is removed for now, that
problem is handled via another patch. )
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
LKML-Reference: <20090311143317.GA22244@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This message was being incorrectly emitted when using gdb,
so compile it out by default for now; there will be a
better fix in v2.6.30.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Stuart Bennett <stuart@freedesktop.org>
Acked-by: Pekka Paalanen <pq@iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: work around boot crash
Work around Intel Atom erratum AAH41 (probabilistically) - it's triggering
in the field.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Kyle McMartin <kyle@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It may be useful for kmap_atomic_pfn() and iomap_atomic_prot_pfn()
to check invalid kmap usage as well as kmap_atomic.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
LKML-Reference: <20090311143449.GB22244@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kmap_atomic_pfn() and iomap_atomic_prot_pfn() are almost same
except pgprot. This patch removes the code duplication for these
two functions.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
LKML-Reference: <20090311143317.GA22244@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I found that virt_addr_valid() was returning true for fixmap addresses.
I'm not sure whether pfn_valid() is supposed to include this test,
but there's no harm in being explicit.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B166D6.2080505@goop.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix race+crash in mmiotrace
The list manipulation in remove_kmmio_fault_pages() was broken. If more
than one consecutive kmmio_fault_page was re-added during the grace
period between unregister_kmmio_probe() and remove_kmmio_fault_pages(),
the list manipulation failed to remove pages from the release list.
After a second grace period the pages get into rcu_free_kmmio_fault_pages()
and raise a BUG_ON() kernel crash.
The list manipulation is fixed to properly remove pages from the release
list.
This bug has been present from the very beginning of mmiotrace in the
mainline kernel. It was introduced in 0fd0e3da ("x86: mmiotrace full
patch, preview 1");
An urgent fix for Linus. Tested by Stuart (on 32-bit) and Pekka
(on amd and intel 64-bit systems, nouveau and nvidia proprietary).
Signed-off-by: Stuart Bennett <stuart@freedesktop.org>
Signed-off-by: Pekka Paalanen <pq@iki.fi>
LKML-Reference: <20090308202135.34933feb@daedalus.pq.iki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Ingo found there warning about nodeid with some configs.
try to use for_each_online_node for non numa too. in that case
nodeid will be 0.
also move out boundary checking from setup_node_bootmem(), so
non-numa config will not check it.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49B03069.80001@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve out-of-range fixmap index debugging
Commit "1b42f51630c7eebce6fb780b480731eb81afd325"
defined the __this_fixmap_does_not_exist() function
with a WARN_ON(1) in it.
This causes the linker to not report an error when
__this_fixmap_does_not_exist() is called with a
non-constant parameter.
Ingo defined __this_fixmap_does_not_exist() because he
wanted to get virt addresses of fix memory of nest level
by non-constant index.
But we can fix this and still keep the link-time check:
We can get the four slot virt addresses on link time and
store them to array slot_virt[].
Then we can then refer the slot_virt with non-constant index,
in the ioremap-leak detection code.
Signed-off-by: Wang Chen <wangchen@cn.fujitsu.com>
LKML-Reference: <49B2075B.4070509@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
As a preparational step for unifying noexec handling on 32-bit and 64-bit,
rename the do_not_nx variable to disable_nx on 64-bit.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236265497.31324.11.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: make boot command line "memtest" do one loop by default
So don't need to guess many patterns in one loop.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <49B10532.3020105@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, micro-optimization
Pre-initialize boot_cpu_data.x86_phys_bits to a reasonable default
to remove the use of system_state tests in __virt_addr_valid()
and __phys_addr().
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rather than relying on the ever-unreliable system_state,
add a specific __vmalloc_start_set flag to indicate whether
the vmalloc area has meaningful boundaries yet, and use that
in x86-32's __phys_addr and __virt_addr_valid.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
In preparation for moving the function declaration to a header file,
unify 32-bit and 64-bit signatures.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-16-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The table_start, table_end, and table_top are too generic for global
namespace so rename them to be more specific.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-15-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch moves the init_memory_mapping() function to common mm/init.c.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-14-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch adds an empty static inline init_gbpages() for the 32-bit
version of init_memory_mapping() making both versions identical.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-13-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
As a trivial preparation for moving common code to arc/x86/mm/init.c,
ifdef the 32-bit and 64-bit versions of NR_RANGE_MR.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-12-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
To reduce the diff between the 32-bit and 64-bit versions of
init_memory_mapping(), ifdef configuration specific pfn setup
code in the function.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-11-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
To reduce the diff between the 32-bit and 64-bit versions of
init_memory_mapping(), ifdef configuration specific setup code
in the function.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-10-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch adds a sanity check to the 32-bit version of
init_memory_mapping() to reduce the diff to the 64-bit version.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-9-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The 64-bit version of init_memory_mapping() uses the last mapped
address returned from kernel_physical_mapping_init() whereas the
32-bit version doesn't. This patch adds relevant ifdefs to both
versions of the function to reduce the diff between them.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-8-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch renames after_init_bootmem to after_bootmem in
mm/init_32.c to reduce the diff to the 64-bit version of of
init_memory_mapping().
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-7-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The save_mr() function already checks that start_pfn is less than
end_pfn so we can remove the unnecessary check which reduces the
diff between the 32-bit and the 64-bit versions of init_memory_mapping().
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-6-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Enabling NX, PSE, and PGE are only required on 32-bit so ifdef them
in both versions of the function.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-5-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch moves pgd_base out of init_memory_mapping() to reduce
the diff between the 32-bit version and the 64-bit version of the
function.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-4-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
There are some minor differences between the 32-bit and 64-bit
find_early_table_space() functions. This patch wraps those
differences under CONFIG_X86_32 to make the function identical
on both configurations.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-3-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
To reduce the diff between the 32-bit and 64-bit versions of
init_memory_mapping(), add gbpages support to the 32-bit version.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-2-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
To reduce the diff between the 32-bit and 64-bit versions of
init_memory_mapping(), fix up all trivial issues.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1236257708-27269-1-git-send-email-penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: clean up
Simplify the code, reuse some lines.
Remove min_low_pfn reference, it is always 0
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49AEE2C4.2030602@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The function is identical on 32-bit and 64-bit configurations so move it to the
common mm/init.c file.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236158020.29024.28.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
make code more readable and more like 64-bit
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49AE48B4.8010907@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix panic on system 2g x4 sockets
Found one system with 4 sockets and every sockets has 2g can not boot
with numa32 because boot mem is crossing nodes.
So try to have numa version of setup_bootmem_allocator().
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
LKML-Reference: <49AE485B.8000902@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The zones are set up at this stage so there's a highmem zone
available even for the UMA case.
The only difference there is that for machines that have
CONFIG_HIGHMEM enabled but don't have any highmem available,
->zone_start_pfn is zero whereas highstart_pfn is non-zero).
The field is left zeroed because of the !size test in
free_area_init_core() but shouldn't be a problem because
add_highpages_with_active_regions() handles empty ranges just
fine.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Mel Gorman <mel@csn.ul.ie>
LKML-Reference: <1236154567.29024.23.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The function is identical on 32-bit and 64-bit configurations so move
it to the common mm/init.c file.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236160001.29024.29.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
They are used by __set_fixmap->set_pte_vaddr_pud, which can
be used by arch_setup_additional_pages(), and so is used
after init.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: oprofile: don't set counter width from cpuid on Core2
x86: fix init_memory_mapping() to handle small ranges
* 'tracing/mmiotrace' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86 mmiotrace: fix race with release_kmmio_fault_page()
x86 mmiotrace: improve handling of secondary faults
x86 mmiotrace: split set_page_presence()
x86 mmiotrace: fix save/restore page table state
x86 mmiotrace: WARN_ONCE if dis/arming a page fails
x86: add far read test to testmmiotrace
x86: count errors in testmmiotrace.ko
Impact: cleanup
This patch moves set_highmem_pages_init() to arch/x86/mm/highmem_32.c.
The declaration of the function is kept in asm/numa_32.h because
asm/highmem.h is included only if CONFIG_HIGHMEM is enabled so we
can't put the empty static inline function there.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236082212.2675.24.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: unification
This patch introduces a common arch/x86/mm/init.c and moves the identical
free_init_pages() and free_initmem() functions to the file.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236078906.2675.18.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: unification
This patch ports commit 3c1df68b84 ("x86: make
sure initmem is writable") to the 64-bit version to unify implementations of
free_init_pages().
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Arjan van de Ven <arjan@linux.intel.com>
LKML-Reference: <1236078904.2675.17.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: unification
This patch adds sanity checks that are already in init_64.c to init_32.c.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236078902.2675.16.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
This patch changes find_early_table_space() to use roundup() for rounding up
tables to page size to unify the common parts of the 32-bit and 64-bit
implementations.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236077705.2675.6.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
The __VMALLOC_RESERVE global variable is not used in init_32.c. Move that to
pgtable_32.c to reduce the diff between init_32.c and init_64.c.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <1236077704.2675.4.camel@penberg-laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix failed EFI bootup in certain circumstances
Ying Huang found init_memory_mapping() has problem with small ranges
less than 2M when he tried to direct map the EFI runtime code out of
max_low_pfn_mapped.
It turns out we never considered that case and didn't check the range...
Reported-by: Ying Huang <ying.huang@intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: Brian Maly <bmaly@redhat.com>
LKML-Reference: <49ACDDED.1060508@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There was a theoretical possibility to a race between arming a page in
post_kmmio_handler() and disarming the page in
release_kmmio_fault_page():
cpu0 cpu1
------------------------------------------------------------------
mmiotrace shutdown
enter release_kmmio_fault_page
fault on the page
disarm the page
disarm the page
handle the MMIO access
re-arm the page
put the page on release list
remove_kmmio_fault_pages()
fault on the page
page not known to mmiotrace
fall back to do_page_fault()
*KABOOM*
(This scenario also shows the double disarm case which is allowed.)
Fixed by acquiring kmmio_lock in post_kmmio_handler() and checking
if the page is being released from mmiotrace.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Stuart Bennett <stuart@freedesktop.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Upgrade some kmmio.c debug messages to warnings.
Allow secondary faults on probed pages to fall through, and only log
secondary faults that are not due to non-present pages.
Patch edited by Pekka Paalanen.
Signed-off-by: Stuart Bennett <stuart@freedesktop.org>
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From 36772dcb6ffbbb68254cbfc379a103acd2fbfefc Mon Sep 17 00:00:00 2001
From: Pekka Paalanen <pq@iki.fi>
Date: Sat, 28 Feb 2009 21:34:59 +0200
Split set_page_presence() in kmmio.c into two more functions set_pmd_presence()
and set_pte_presence(). Purely code reorganization, no functional changes.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Stuart Bennett <stuart@freedesktop.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From baa99e2b32449ec7bf147c234adfa444caecac8a Mon Sep 17 00:00:00 2001
From: Pekka Paalanen <pq@iki.fi>
Date: Sun, 22 Feb 2009 20:02:43 +0200
Blindly setting _PAGE_PRESENT in disarm_kmmio_fault_page() overlooks the
possibility, that the page was not present when it was armed.
Make arm_kmmio_fault_page() store the previous page presence in struct
kmmio_fault_page and use it on disarm.
This patch was originally written by Stuart Bennett, but Pekka Paalanen
rewrote it a little different.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Stuart Bennett <stuart@freedesktop.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Print a full warning once, if arming or disarming a page fails.
Also, if initial arming fails, do not handle the page further. This
avoids the possibility of a page failing to arm and then later claiming
to have handled any fault on that page.
WARN_ONCE added by Pekka Paalanen.
Signed-off-by: Stuart Bennett <stuart@freedesktop.org>
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Apparently pages far into an ioremapped region might not actually be
mapped during ioremap(). Add an optional read test to try to trigger a
multiply faulting MMIO access. Also add more messages to the kernel log
to help debugging.
This patch is based on a patch suggested by
Stuart Bennett <stuart@freedesktop.org>
who discovered bugs in mmiotrace related to normal kernel space faults.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Stuart Bennett <stuart@freedesktop.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Check the read values against the written values in the MMIO read/write
test. This test shows if the given MMIO test area really works as
memory, which is a prerequisite for a successful mmiotrace test.
Signed-off-by: Pekka Paalanen <pq@iki.fi>
Cc: Stuart Bennett <stuart@freedesktop.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 17581ad812.
Sitsofe Wheeler reported that /dev/dri/card0 is MIA on his EeePC 900
and bisected it to this commit.
Graphics card is an i915 in an EeePC 900:
00:02.0 VGA compatible controller [0300]:
Intel Corporation Mobile 915GM/GMS/910GML
Express Graphics Controller [8086:2592] (rev 04)
( Most likely the ioremap() of the driver failed and hence the card
did not initialize. )
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Bisected-by: Sitsofe Wheeler <sitsofe@yahoo.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: new interface (not yet use)
Define reserve_top_address for x86_64; only for later x86 integration.
Signed-off-by: Gustavo F. Padovan <gustavo@las.ic.unicamp.br>
Acked-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Make io_mapping_create_wc and io_mapping_free go through PAT to make sure
that there are no memory type aliases.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
io_mapping_create_wc should take a resource_size_t parameter in place of
unsigned long. With unsigned long, there will be no way to map greater than 4GB
address in i386/32 bit.
On x86, greater than 4GB addresses cannot be mapped on i386 without PAE. Return
error for such a case.
Patch also adds a structure for io_mapping, that saves the base, size and
type on HAVE_ATOMIC_IOMAP archs, that can be used to verify the offset on
io_mapping_map calls.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add a function to check and keep identity maps in sync, when changing
any memory type. One of the follow on patches will also use this
routine.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: Eric Anholt <eric@anholt.net>
Cc: Keith Packard <keithp@keithp.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- print test pattern instead of pattern number,
- show pattern as stored in memory,
- use proper priority flags,
- consistent use of u64 throughout the code
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix unexpected behaviour when pattern number is out of range
Current implementation provides 4 patterns for memtest. The code doesn't
check whether the memtest parameter value exceeds the maximum pattern number.
Instead the memtest code pretends to test with non-existing patterns, e.g.
when booting with memtest=10 I've observed the following
...
early_memtest: pattern num 10
0000001000 - 0000006000 pattern 0
...
0000001000 - 0000006000 pattern 1
...
0000001000 - 0000006000 pattern 2
...
0000001000 - 0000006000 pattern 3
...
0000001000 - 0000006000 pattern 4
...
0000001000 - 0000006000 pattern 5
...
0000001000 - 0000006000 pattern 6
...
0000001000 - 0000006000 pattern 7
...
0000001000 - 0000006000 pattern 8
...
0000001000 - 0000006000 pattern 9
...
But in fact Linux didn't test anything for patterns > 4 as the default
case in memtest() is to leave the function.
I suggest to use the memtest parameter as the number of tests to be
performed and to re-iterate over all existing patterns.
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: minor change to populate_extra_pte() and addition of pmd flavor
Update populate_extra_pte() to return pointer to the pte_t for the
specified address and add populate_extra_pmd() which only populates
till the pmd and returns pointer to the pmd entry for the address.
For 64bit, pud/pmd/pte fill functions are separated out from
set_pte_vaddr[_pud]() and used for set_pte_vaddr[_pud]() and
populate_extra_{pte|pmd}().
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix this sparse warning:
arch/x86/mm/numa_32.c:197:24: warning: Using plain integer as NULL pointer
Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Cc: trivial@kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: extend prefetch handling on 64-bit
Currently there's an extra is_prefetch() check done in do_sigbus(),
which we only do on 32 bits.
This is a last-ditch check before we terminate a task, so it's worth
giving prefetch instructions another chance - should none of our
existing quirks have caught a prefetch instruction related spurious
fault.
The only risk is if a prefetch causes a real sigbus, in that case
we'll not OOM but try another fault. But this code has been on
32-bit for a long time, so it should be fine in practice.
So do this on 64-bit too - and thus remove one more #ifdef.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Removal of an #ifdef in fault_in_kernel_space(), by making
use of the new TASK_SIZE_MAX symbol which is now available
on 32-bit too.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Rename TASK_SIZE64 to TASK_SIZE_MAX, and provide the
define on 32-bit too. (mapped to TASK_SIZE)
This allows 32-bit code to make use of the (former-) TASK_SIZE64
symbol as well, in a clean way.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
do_page_fault() has this ugly #ifdef in its prototype:
#ifdef CONFIG_X86_64
asmlinkage
#endif
void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
Replace it with 'dotraplinkage' which maps to exactly the above
construct: nothing on 32-bit and asmlinkage on 64-bit.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: add oops-recursion check to 32-bit
Unify the oops state-machine, to the 64-bit version. It is
slightly more careful in that it does a recursion check
in oops_begin(), and is thus more likely to show the relevant
oops.
It also means that 32-bit will print one more line at the
end of pagefault triggered oopses:
printk(KERN_EMERG "CR2: %016lx\n", address);
Which is generally good information to be seen in partial-dump
digital-camera jpegs ;-)
The downside is the somewhat more complex critical path. Both
variants have been tested well meanwhile by kernel developers
crashing their boxes so i dont think this is a practical worry.
This removes 3 ugly #ifdefs from no_context() and makes the
function a lot nicer read.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: refine/extend page fault related oops printing on 64-bit
- honor the pause_on_oops logic on 64-bit too
- print out NX fault warnings on 64-bit as well
- factor out the NX fault message to make it git-greppable and readable
Note that this means that we do the PF_INSTR check on 32-bit non-PAE
as well where it should not occur ... normally. Cannot hurt.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove an #ifdef from notify_page_fault(). The function still
compiles to nothing in the !CONFIG_KPROBES case.
Introduce kprobes_built_in() and kprobe_fault_handler() helpers
to allow this - they returns 0 if !CONFIG_KPROBES.
No code changed:
text data bss dec hex filename
4618 32 24 4674 1242 fault.o.before
4618 32 24 4674 1242 fault.o.after
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Remove an #ifdef from kmmio_fault() - we can do this by
providing default implementations for is_kmmio_active()
and kmmio_handler(). The compiler optimizes it all away
in the !CONFIG_MMIOTRACE case.
Also, while at it, clean up mmiotrace.h a bit:
- standard header guards
- standard vertical spaces for structure definitions
No code changed (both with mmiotrace on and off in the config):
text data bss dec hex filename
2947 12 12 2971 b9b fault.o.before
2947 12 12 2971 b9b fault.o.after
Cc: Pekka Paalanen <pq@iki.fi>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: improve page fault handling robustness
The 'PF_RSVD' flag (bit 3) of the page-fault error_code is a
relatively recent addition to x86 CPUs, so the 32-bit do_fault()
implementation never had it. This flag gets set when the CPU
detects nonzero values in any reserved bits of the page directory
entries.
Extend the existing 64-bit check for PF_RSVD in do_page_fault()
to 32-bit too. If we detect such a fault then we print a more
informative oops and the pagetables.
This unifies the code some more, removes an ugly #ifdef and improves
the 32-bit page fault code robustness a bit. It slightly increases
the 32-bit kernel text size.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Instead of an ugly, open-coded, #ifdef-ed vm86 related legacy check
in do_page_fault(), put it into the check_v8086_mode() helper
function and merge it with an existing #ifdef.
Also, simplify the code flow a tiny bit in the helper.
No code changed:
arch/x86/mm/fault.o:
text data bss dec hex filename
2711 12 12 2735 aaf fault.o.before
2711 12 12 2735 aaf fault.o.after
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: no functionality changed
Factor out the opcode checker into a helper inline.
The code got a tiny bit smaller:
text data bss dec hex filename
4632 32 24 4688 1250 fault.o.before
4618 32 24 4674 1242 fault.o.after
And it got cleaner / easier to review as well.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, no code changed
Clean up various small details, which can be correctness checked
automatically:
- tidy up the include file section
- eliminate unnecessary includes
- introduce show_signal_msg() to clean up code flow
- standardize the code flow
- standardize comments and other style details
- more cleanups, pointed out by checkpatch
No code changed on either 32-bit nor 64-bit:
arch/x86/mm/fault.o:
text data bss dec hex filename
4632 32 24 4688 1250 fault.o.before
4632 32 24 4688 1250 fault.o.after
the md5 changed due to a change in a single instruction:
2e8a8241e7f0d69706776a5a26c90bc0 fault.o.before.asm
c5c3d36e725586eb74f0e10692f0193e fault.o.after.asm
Because a __LINE__ reference in a WARN_ONCE() has changed.
On 32-bit a few stack offsets changed - no code size difference
nor any functionality difference.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: keep kernel text read only
Because dynamic ftrace converts the calls to mcount into and out of
nops at run time, we needed to always keep the kernel text writable.
But this defeats the point of CONFIG_DEBUG_RODATA. This patch converts
the kernel code to writable before ftrace modifies the text, and converts
it back to read only afterward.
The kernel text is converted to read/write, stop_machine is called to
modify the code, then the kernel text is converted back to read only.
The original version used SYSTEM_STATE to determine when it was OK
or not to change the code to rw or ro. Andrew Morton pointed out that
using SYSTEM_STATE is a bad idea since there is no guarantee to what
its state will actually be.
Instead, I moved the check into the set_kernel_text_* functions
themselves, and use a local variable to determine when it is
OK to change the kernel text RW permissions.
[ Update: Ingo Molnar suggested moving the prototypes to cacheflush.h ]
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Impact: future-proof the split_large_page() function
Linus noticed that split_large_page() is not safe wrt. the
PAT bit: it is bit 12 on the 1GB and 2MB page table level
(_PAGE_BIT_PAT_LARGE), and it is bit 7 on the 4K page
table level (_PAGE_BIT_PAT).
Currently it is not a problem because we never set
_PAGE_BIT_PAT_LARGE on any of the large-page mappings - but
should this happen in the future the split_large_page() would
silently lift bit 12 into the lowlevel 4K pte and would start
corrupting the physical page frame offset. Not fun.
So add a debug warning, to make sure if something ever sets
the PAT bit then this function gets updated too.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix to prevent hard lockup on bad PMD permissions
If the PMD does not have the correct permissions for a page access,
but the PTE does, the spurious fault handler will mistake the fault
as a lazy TLB transaction. This will result in an infinite loop of:
fault -> spurious_fault check (pass) -> return to code -> fault
This patch adds a check and a warn on if the PTE passes the permissions
but the PMD does not.
[ Updated: Ingo Molnar suggested using WARN_ONCE with some text ]
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Steven Rostedt found a bug in where in his modified kernel
ftrace was unable to modify the kernel text, due to the PMD
itself having been marked read-only as well in
split_large_page().
The fix, suggested by Linus, is to not try to 'clone' the
reference protection of a huge-page, but to use the standard
(and permissive) page protection bits of KERNPG_TABLE.
The 'cloning' makes sense for the ptes but it's a confused and
incorrect concept at the page table level - because the
pagetable entry is a set of all ptes and hence cannot
'clone' any single protection attribute - the ptes can be any
mixture of protections.
With the permissive KERNPG_TABLE, even if the pte protections
get changed after this point (due to ftrace doing code-patching
or other similar activities like kprobes), the resulting combined
protections will still be correct and the pte's restrictive
(or permissive) protections will control it.
Also update the comment.
This bug was there for a long time but has not caused visible
problems before as it needs a rather large read-only area to
trigger. Steve possibly hacked his kernel with some really
large arrays or so. Anyway, the bug is definitely worth fixing.
[ Huang Ying also experienced problems in this area when writing
the EFI code, but the real bug in split_large_page() was not
realized back then. ]
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Huang Ying <ying.huang@intel.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: use new dynamic allocator, unified access to static/dynamic
percpu memory
Convert to the new dynamic percpu allocator.
* implement populate_extra_pte() for both 32 and 64
* update setup_per_cpu_areas() to use pcpu_setup_static()
* define __addr_to_pcpu_ptr() and __pcpu_ptr_to_addr()
* define config HAVE_DYNAMIC_PER_CPU_AREA
Signed-off-by: Tejun Heo <tj@kernel.org>
What's happening is that the assertion in mm/page_alloc.c:move_freepages()
is triggering:
BUG_ON(page_zone(start_page) != page_zone(end_page));
Once I knew this is what was happening, I added some annotations:
if (unlikely(page_zone(start_page) != page_zone(end_page))) {
printk(KERN_ERR "move_freepages: Bogus zones: "
"start_page[%p] end_page[%p] zone[%p]\n",
start_page, end_page, zone);
printk(KERN_ERR "move_freepages: "
"start_zone[%p] end_zone[%p]\n",
page_zone(start_page), page_zone(end_page));
printk(KERN_ERR "move_freepages: "
"start_pfn[0x%lx] end_pfn[0x%lx]\n",
page_to_pfn(start_page), page_to_pfn(end_page));
printk(KERN_ERR "move_freepages: "
"start_nid[%d] end_nid[%d]\n",
page_to_nid(start_page), page_to_nid(end_page));
...
And here's what I got:
move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
move_freepages: start_nid[1] end_nid[0]
My memory layout on this box is:
[ 0.000000] Zone PFN ranges:
[ 0.000000] Normal 0x00000000 -> 0x0081ff5d
[ 0.000000] Movable zone start PFN for each node
[ 0.000000] early_node_map[8] active PFN ranges
[ 0.000000] 0: 0x00000000 -> 0x00020000
[ 0.000000] 1: 0x00800000 -> 0x0081f7ff
[ 0.000000] 1: 0x0081f800 -> 0x0081fe50
[ 0.000000] 1: 0x0081fed1 -> 0x0081fed8
[ 0.000000] 1: 0x0081feda -> 0x0081fedb
[ 0.000000] 1: 0x0081fedd -> 0x0081fee5
[ 0.000000] 1: 0x0081fee7 -> 0x0081ff51
[ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d
So it's a block move in that 0x81f600-->0x81f7ff region which triggers
the problem.
This patch:
Declaration of early_pfn_to_nid() is scattered over per-arch include
files, and it seems it's complicated to know when the declaration is used.
I think it makes fix-for-memmap-init not easy.
This patch moves all declaration to include/linux/mm.h
After this,
if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use static definition in include/linux/mm.h
else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
-> Use generic definition in mm/page_alloc.c
else
-> per-arch back end function will be called.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, vm86: fix preemption bug
x86, olpc: fix model detection without OFW
x86, hpet: fix for LS21 + HPET = boot hang
x86: CPA avoid repeated lazy mmu flush
x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
x86/cpa: make sure cpa is safe to call in lazy mmu mode
x86, ptrace, mm: fix double-free on race
Impact: Flush the lazy MMU only once
Pending mmu updates only need to be flushed once to bring the
in-memory pagetable state up to date.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Impact: cleanup
Make the max_low_pfn logic a bit more standard between
lowmem_pfn_init() and highmem_pfn_init().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Split find_low_pfn_range() into two functions:
- lowmem_pfn_init()
- highmem_pfn_init()
The former gets called if all of RAM fits into lowmem,
otherwise we call highmem_pfn_init().
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeff Mahoney reported:
> With Suse's hwinfo tool, on -tip:
> WARNING: at arch/x86/mm/pat.c:637 reserve_pfn_range+0x5b/0x26d()
reserve_pfn_range() is not tracking the memory range below 1MB
as non-RAM and as such is inconsistent with similar checks in
reserve_memtype() and free_memtype()
Rename the pagerange_is_ram() to pat_pagerange_is_ram() and add the
"track legacy 1MB region as non RAM" condition.
And also, fix reserve_pfn_range() to return -EINVAL, when the pfn
range is RAM. This is to be consistent with this API design.
Reported-and-tested-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix race leading to crash under KVM and Xen
The CPA code may be called while we're in lazy mmu update mode - for
example, when using DEBUG_PAGE_ALLOC and doing a slab allocation
in an interrupt handler which interrupted a lazy mmu update. In this
case, the in-memory pagetable state may be out of date due to pending
queued updates. We need to flush any pending updates before inspecting
the page table. Similarly, we must explicitly flush any modifications
CPA may have made (which comes down to flushing queued operations when
flushing the TLB).
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Stable Kernel <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
arch/x86/mm/init_32.c: In function ‘find_low_pfn_range’:
arch/x86/mm/init_32.c:696: warning: format ‘%u’ expects type ‘unsigned int’, but
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* commit 'remotes/tip/x86/paravirt': (175 commits)
xen: use direct ops on 64-bit
xen: make direct versions of irq_enable/disable/save/restore to common code
xen: setup percpu data pointers
xen: fix 32-bit build resulting from mmu move
x86/paravirt: return full 64-bit result
x86, percpu: fix kexec with vmlinux
x86/vmi: fix interrupt enable/disable/save/restore calling convention.
x86/paravirt: don't restore second return reg
xen: setup percpu data pointers
x86: split loading percpu segments from loading gdt
x86: pass in cpu number to switch_to_new_gdt()
x86: UV fix uv_flush_send_and_wait()
x86/paravirt: fix missing callee-save call on pud_val
x86/paravirt: use callee-saved convention for pte_val/make_pte/etc
x86/paravirt: implement PVOP_CALL macros for callee-save functions
x86/paravirt: add register-saving thunks to reduce caller register pressure
x86/paravirt: selectively save/restore regs around pvops calls
x86: fix paravirt clobber in entry_64.S
x86/pvops: add a paravirt_ident functions to allow special patching
xen: move remaining mmu-related stuff into mmu.c
...
Conflicts:
arch/x86/mach-voyager/voyager_smp.c
arch/x86/mm/fault.c
Impact: bug fix
Don't use per_cpu_offset() to determine if it valid to access a
per-cpu variable for a given cpu number. It is not a valid assumption
on x86-64 anymore. Use cpu_possible() instead.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Prevent kprobes from catching spurious faults which will cause infinite
recursive page-fault and memory corruption by stack overflow.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: <stable@kernel.org> [2.6.28.x]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup
Introduce helper function fault_in_kernel_address() to make editors happy.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: widen debug checks
VirtualBox calls do_page_fault() from an atomic context but runs into a
might_sleep() way pas this point, cure that.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The x86/Voyager subarch used to have this distinction between
'x86 SMP support' and 'Voyager SMP support':
config X86_SMP
bool
depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
This is a pointless distinction - Voyager can (and already does) use
smp_ops to implement various SMP quirks it has - and it can be extended
more to cover all the specialities of Voyager.
So remove this complication in the Kconfig space.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Our send_IPI_*() methods and definitions are a twisted mess: the same
symbol is defined to different things depending on .config details,
in a non-transparent way.
- spread out the quirks into separately named per apic driver methods
- prefix the standard PC methods with default_
- get rid of wrapper macro obfuscation
- clean up various details
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: Code movement, no functional change.
Move the 64-bit NUMA code from setup_percpu.c to numa_64.c
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
xen: unitialised return value in xenbus_write_transaction
x86: fix section mismatch warning
x86: unmask CPUID levels on Intel CPUs, fix
x86: work around PAGE_KERNEL_WC not getting WC in iomap_atomic_prot_pfn.
x86: use standard PIT frequency
xen: handle highmem pages correctly when shrinking a domain
x86, mm: fix pte_free()
xen: actually release memory when shrinking domain
x86: unmask CPUID levels on Intel CPUs
x86: add MSR_IA32_MISC_ENABLE bits to <asm/msr-index.h>
x86: fix PTE corruption issue while mapping RAM using /dev/mem
x86: mtrr fix debug boot parameter
x86: fix page attribute corruption with cpa()
Revert "x86: signal: change type of paramter for sys_rt_sigreturn()"
x86: use early clobbers in usercopy*.c
x86: remove kernel_physical_mapping_init() from init section
fix: crash: IP: __bitmap_intersects+0x48/0x73
cpufreq: use work_on_cpu in acpi-cpufreq.c for drv_read and drv_write
work_on_cpu: Use our own workqueue.
work_on_cpu: don't try to get_online_cpus() in work_on_cpu.
...
In the absence of PAT, PAGE_KERNEL_WC ends up mapping to a memory type that
gets UC behavior even in the presence of a WC MTRR covering the area in
question. By swapping to PAGE_KERNEL_UC_MINUS, we can get the actual
behavior the caller wanted (WC if you can manage it, UC otherwise).
This recovers the 40% performance improvement of using WC in the DRM
to upload vertex data.
Signed-off-by: Eric Anholt <eric@anholt.net>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Impact: Cleanup
When PAT was originally introduced, it was handled specially for a few
reasons:
- PAT bugs are hard to track down, so we wanted to maintain a
whitelist of CPUs.
- The i386 and x86-64 CPUID code was not yet unified.
Both of these are now obsolete, so handle PAT like any other features,
including ordinary feature blacklisting due to known bugs.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: introduce new uaccess exception handling framework
Introduce {get|put}_user_try and {get|put}_user_catch as new uaccess exception
handling framework.
{get|put}_user_try begins exception block and {get|put}_user_catch(err) ends
the block and gets err if an exception occured in {get|put}_user_ex() in the
block. The exception is stored thread_info->uaccess_err.
The example usage of this framework is below;
int func()
{
int err = 0;
get_user_try {
get_user_ex(...);
get_user_ex(...);
:
} get_user_catch(err);
return err;
}
Note: get_user_ex() is not clear the value when an exception occurs, it's
different from the behavior of __get_user(), but I think it doesn't matter.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Impact: fix/extend ioremap_wc() beyond 4GB aperture on 32-bit
ioremap_wc() was taking in unsigned long parameter, where as it should take
64-bit resource_size_t parameter like other ioremap variants.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
tsk is already assigned to current, drop the redundant second
assignment.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Beschorner Daniel reported:
> hwinfo problem since 2.6.28, showing this in the oops:
> Corrupted page table at address 7fd04de3ec00
Also, PaX Team reported a regression with this commit:
> commit 9542ada803
> Author: Suresh Siddha <suresh.b.siddha@intel.com>
> Date: Wed Sep 24 08:53:33 2008 -0700
>
> x86: track memtype for RAM in page struct
This commit breaks mapping any RAM page through /dev/mem, as the
reserve_memtype() was not initializing the return attribute type and as such
corrupting the PTE entry that was setup with the return attribute type.
Because of this bug, application mapping this RAM page through /dev/mem
will die with "Corrupted page table at address xxxx" message in the kernel
log and also the kernel identity mapping which maps the underlying RAM
page gets converted to UC.
Fix this by initializing the return attribute type before calling
reserve_ram_pages_type()
Reported-by: PaX Team <pageexec@freemail.hu>
Reported-and-tested-by: Beschorner Daniel <Daniel.Beschorner@facton.com>
Tested-and-Acked-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix sporadic slowdowns and warning messages
This patch fixes a performance issue reported by Linus on his
Nehalem system. While Linus reverted the PAT patch (commit
58dab916df) which exposed the issue,
existing cpa() code can potentially still cause wrong(page attribute
corruption) behavior.
This patch also fixes the "WARNING: at arch/x86/mm/pageattr.c:560" that
various people reported.
In 64bit kernel, kernel identity mapping might have holes depending
on the available memory and how e820 reports the address range
covering the RAM, ACPI, PCI reserved regions. If there is a 2MB/1GB hole
in the address range that is not listed by e820 entries, kernel identity
mapping will have a corresponding hole in its 1-1 identity mapping.
If cpa() happens on the kernel identity mapping which falls into these holes,
existing code fails like this:
__change_page_attr_set_clr()
__change_page_attr()
returns 0 because of if (!kpte). But doesn't
set cpa->numpages and cpa->pfn.
cpa_process_alias()
uses uninitialized cpa->pfn (random value)
which can potentially lead to changing the page
attribute of kernel text/data, kernel identity
mapping of RAM pages etc. oops!
This bug was easily exposed by another PAT patch which was doing
cpa() more often on kernel identity mapping holes (physical range between
max_low_pfn_mapped and 4GB), where in here it was setting the
cache disable attribute(PCD) for kernel identity mappings aswell.
Fix cpa() to handle the kernel identity mapping holes. Retain
the WARN() for cpa() calls to other not present address ranges
(kernel-text/data, ioremap() addresses)
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix:
arch/x86/mm/srat_64.c: In function ‘acpi_numa_processor_affinity_init’:
arch/x86/mm/srat_64.c:141: error: implicit declaration of function ‘get_uv_system_type’
arch/x86/mm/srat_64.c:141: error: ‘UV_X2APIC’ undeclared (first use in this function)
arch/x86/mm/srat_64.c:141: error: (Each undeclared identifier is reported only once
arch/x86/mm/srat_64.c:141: error: for each function it appears in.)
A couple of UV definitions were moved to asm/uv/uv.h, but srat_64.c did
not include that header. Add it.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Now that it's unified, move the (SMP) TLB flushing code from arch/x86/kernel/
to arch/x86/mm/, where it belongs logically.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup, restructure code to improve assembly
gcc isn't _all_ that smart about spilling registers to stack or reusing
stack slots, even with branch annotations. do_page_fault contained a lot
of functionality, so split unlikely paths into their own functions, and
mark them as noinline just to be sure. I consider this actually to be
somewhat of a cleanup too: the main function now contains about half
the number of lines so the normal path is easier to read, while the error
cases are also nicely split away.
Also, ensure the order of arguments to functions is always the same: regs,
addr, error_code. This can reduce code size a tiny bit, and just looks neater
too.
And add a couple of branch annotations.
Before:
do_page_fault:
subq $360, %rsp #,
After:
do_page_fault:
subq $56, %rsp #,
bloat-o-meter:
add/remove: 8/0 grow/shrink: 0/1 up/down: 2222/-1680 (542)
function old new delta
__bad_area_nosemaphore - 506 +506
no_context - 474 +474
vmalloc_fault - 424 +424
spurious_fault - 358 +358
mm_fault_error - 272 +272
bad_area_access_error - 89 +89
bad_area - 89 +89
bad_area_nosemaphore - 10 +10
do_page_fault 2464 784 -1680
Yes, the total size increases by 542 bytes, due to the extra function calls.
But these will very rarely be called (except for vmalloc_fault) in a normal
workload. Importantly, do_page_fault is less than 1/3rd it's original size,
and touches far less stack.
Existing gotos and branch hints did move a lot of the infrequently used text
out of the fastpath, but that's even further improved after this patch.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: fix crash with memory hotplug enabled
kernel_physical_mapping_init() is called during memory hotplug
so it does not belong in the init section.
If the kernel is built with CONFIG_DEBUG_SECTION_MISMATCH=y on
the make command line, arch/x86/mm/init_64.c is compiled with
the -fno-inline-functions-called-once gcc option defeating
inlining of kernel_physical_mapping_init() within init_memory_mapping().
When kernel_physical_mapping_init() is not inlined it is placed
in the .init.text section according to the __init in it's current
declaration. A later call to kernel_physical_mapping_init() during
a memory hotplug operation encounters an int3 trap because the
.init.text section memory has been freed.
This patch eliminates the crash caused by the int3 trap by moving the
non-inlined kernel_physical_mapping_init() from .init.text to .meminit.text.
Signed-off-by: Gary Hade <garyhade@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Debugging and original patch from Nick Piggin <npiggin@suse.de>
The early fixmap pmd entry inserted at the very top of the KVA is causing the
subsequent fixmap mapping code to not provide physically linear pte pages over
the kmap atomic portion of the fixmap (which relies on said property to
calculate pte addresses).
This has caused weird boot failures in kmap_atomic much later in the boot
process (initial userspace faults) on a 32-bit PAE system with a larger number
of CPUs (smaller CPU counts tend not to run over into the next page so don't
show up the problem).
Solve this by attempting to clear out the page table, and copy any of its
entries to the new one. Also, add a bug if a nonlinear condition is encountered
and can't be resolved, which might save some hours of debugging if this fragile
scheme ever breaks again...
Once we have such logic, we can also use it to eliminate the early ioremap
trickery around the page table setup for the fixmap area. This also fixes
potential issues with FIX_* entries sharing the leaf page table with the early
ioremap ones getting discarded by early_ioremap_clear() and not restored by
early_ioremap_reset(). It at once eliminates the temporary (and configuration,
namely NR_CPUS, dependent) unavailability of early fixed mappings during the
time the fixmap area page tables get constructed.
Finally, also replace the hard coded calculation of the initial table space
needed for the fixmap area with a proper one, allowing kernels configured for
large CPU counts to actually boot.
Based-on: Nick Piggin <npiggin@suse.de>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This reverts commit 58dab916df, which
makes my Nehalem come to a nasty crawling almost-halt. It looks like it
turns off caching of regular kernel RAM, with the understandable
slowdown of a few orders of magnitude as a result.
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thierry Vignaud reported:
> http://bugzilla.kernel.org/show_bug.cgi?id=12372
>
> On P4 with an SiS motherboard (video card is a SiS 651)
> X server fails to start with error:
> xf86MapVidMem: Could not mmap framebuffer (0x00000000,0x2000) (Invalid
> argument)
Here X is trying to map first 8KB of memory using /dev/mem. Existing
code treats first 0-4KB of memory as non-RAM and 4KB-8KB as RAM. Recent
code changes don't allow to map memory with different attributes
at the same time.
Fix this by treating the first 1MB legacy region as special and always
track the attribute requests with in this region using linear linked
list (and don't bother if the range is RAM or non-RAM or mixed)
Reported-and-tested-by: Thierry Vignaud <tvignaud@mandriva.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: reduce scope of debug check - avoid warnings
The logic to find whether identity map exists or not using
high_memory or max_low_pfn_mapped/max_pfn_mapped are not complete
as the memory withing the range may not be mapped if there is a
unusable hole in e820.
Specifically, on my test system I started seeing these warnings with
tools like hwinfo, acpidump trying to map ACPI region.
[ 27.400018] ------------[ cut here ]------------
[ 27.400344] WARNING: at /home/venkip/src/linus/linux-2.6/arch/x86/mm/pageattr.c:560 __change_page_attr_set_clr+0xf3/0x8b8()
[ 27.400821] Hardware name: X7DB8
[ 27.401070] CPA: called for zero pte. vaddr = ffff8800cff6a000 cpa->vaddr = ffff8800cff6a000
[ 27.401569] Modules linked in:
[ 27.401882] Pid: 4913, comm: dmidecode Not tainted 2.6.28-05716-gfe0bdec #586
[ 27.402141] Call Trace:
[ 27.402488] [<ffffffff80237c21>] warn_slowpath+0xd3/0x10f
[ 27.402749] [<ffffffff80274ade>] ? find_get_page+0xb3/0xc9
[ 27.403028] [<ffffffff80274a2b>] ? find_get_page+0x0/0xc9
[ 27.403333] [<ffffffff80226425>] __change_page_attr_set_clr+0xf3/0x8b8
[ 27.403628] [<ffffffff8028ec99>] ? __purge_vmap_area_lazy+0x192/0x1a1
[ 27.403883] [<ffffffff8028eb52>] ? __purge_vmap_area_lazy+0x4b/0x1a1
[ 27.404172] [<ffffffff80290268>] ? vm_unmap_aliases+0x1ab/0x1bb
[ 27.404512] [<ffffffff80290105>] ? vm_unmap_aliases+0x48/0x1bb
[ 27.404766] [<ffffffff80226d28>] change_page_attr_set_clr+0x13e/0x2e6
[ 27.405026] [<ffffffff80698fa7>] ? _spin_unlock+0x26/0x2a
[ 27.405292] [<ffffffff80227e6a>] ? reserve_memtype+0x19b/0x4e3
[ 27.405590] [<ffffffff80226ffd>] _set_memory_wb+0x22/0x24
[ 27.405844] [<ffffffff80225d28>] ioremap_change_attr+0x26/0x28
[ 27.406097] [<ffffffff80228355>] reserve_pfn_range+0x1a3/0x235
[ 27.406427] [<ffffffff80228430>] track_pfn_vma_new+0x49/0xb3
[ 27.406686] [<ffffffff80286c46>] remap_pfn_range+0x94/0x32c
[ 27.406940] [<ffffffff8022878d>] ? phys_mem_access_prot_allowed+0xb5/0x1a8
[ 27.407209] [<ffffffff803e9bf4>] mmap_mem+0x75/0x9d
[ 27.407523] [<ffffffff8028b3b4>] mmap_region+0x2cf/0x53e
[ 27.407776] [<ffffffff8028b8cc>] do_mmap_pgoff+0x2a9/0x30d
[ 27.408034] [<ffffffff8020f4a4>] sys_mmap+0x92/0xce
[ 27.408339] [<ffffffff8020b65b>] system_call_fastpath+0x16/0x1b
[ 27.408614] ---[ end trace 4b16ad70c09a602d ]---
[ 27.408871] dmidecode:4913 reserve_pfn_range ioremap_change_attr failed write-back for cff6a000-cff6b000
This is wih track_pfn_vma_new trying to keep identity map in sync.
The address cff6a000 is the ACPI region according to e820.
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: 0000000000000000 - 000000000009c000 (usable)
[ 0.000000] BIOS-e820: 000000000009c000 - 00000000000a0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000cc000 - 00000000000d0000 (reserved)
[ 0.000000] BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
[ 0.000000] BIOS-e820: 0000000000100000 - 00000000cff60000 (usable)
[ 0.000000] BIOS-e820: 00000000cff60000 - 00000000cff69000 (ACPI data)
[ 0.000000] BIOS-e820: 00000000cff69000 - 00000000cff80000 (ACPI NVS)
[ 0.000000] BIOS-e820: 00000000cff80000 - 00000000d0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[ 0.000000] BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
[ 0.000000] BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
[ 0.000000] BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
[ 0.000000] BIOS-e820: 0000000100000000 - 0000000230000000 (usable)
And is not mapped as per init_memory_mapping.
[ 0.000000] init_memory_mapping: 0000000000000000-00000000cff60000
[ 0.000000] init_memory_mapping: 0000000100000000-0000000230000000
We can add logic to check for this. But, there can also be other holes in
identity map when we have 1GB of aligned reserved space in e820.
This patch handles it by removing the WARN_ON and returning a specific
error value (EFAULT) to indicate that the address does not have any
identity mapping.
The code that tries to keep identity map in sync can ignore
this error, with other callers of cpa still getting error here.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: avoid warning message, potentially solve 3D performance regression
Change x86 PAT code to return compatible memtype if the exact memtype that
was requested in remap_pfn_rage and friends is not available due to some
conflict.
This is done by returning the compatible type in pgprot parameter of
track_pfn_vma_new(), and the caller uses that memtype for page table.
Note that track_pfn_vma_copy() which is basically called during fork gets the
prot from existing page table and should not have any conflict. Hence we use
strict memtype check there and do not allow compatible memtypes.
This patch fixes the bug reported here:
http://marc.info/?l=linux-kernel&m=123108883716357&w=2
Specifically the error message:
X:5010 map pfn expected mapping type write-back for d0000000-d0101000,
got write-combining
Should go away.
Reported-and-bisected-by: Kevin Winchester <kjwinchester@gmail.com>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impact: cleanup
Change the protection parameter for track_pfn_vma_new() into a pgprot_t pointer.
Subsequent patch changes the x86 PAT handling to return a compatible
memtype in pgprot_t, if what was requested cannot be allowed due to conflicts.
No fuctionality change in this patch.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ajith Kumar noticed:
I was going through the vmalloc fault handling for x86_64 and am unclear
about the following lines in the vmalloc_fault() function.
pgd = pgd_offset(current->mm ?: &init_mm, address);
pgd_ref = pgd_offset_k(address);
Here the intention is to get the pgd corresponding to the current process
and sync it up with the pgd in init_mm(obtained from pgd_offset_k).
However, for kernel threads current->mm is NULL and hence pgd =
pgd_offset(init_mm, address) = pgd_ref which means the fault handler
returns without setting the pgd entry in the MM structure in the context
of which the kernel thread has faulted. This could lead to never-ending
faults and busy looping of kernel threads like pdflush. So, shouldn't the
pgd = pgd_offset(current->mm ?: &init_mm, address); be pgd =
pgd_offset(current->active_mm ?: &init_mm, address);
We can use active_mm unconditionally because it should be always set.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (36 commits)
x86: fix section mismatch warnings in mcheck/mce_amd_64.c
x86: offer frame pointers in all build modes
x86: remove duplicated #include's
x86: k8 numa register active regions later
x86: update Alan Cox's email addresses
x86: rename all fields of mpc_table mpc_X to X
x86: rename all fields of mpc_oemtable oem_X to X
x86: rename all fields of mpc_bus mpc_X to X
x86: rename all fields of mpc_cpu mpc_X to X
x86: rename all fields of mpc_intsrc mpc_X to X
x86: rename all fields of mpc_lintsrc mpc_X to X
x86: rename all fields of mpc_iopic mpc_X to X
x86: irqinit_64.c init_ISA_irqs should be static
Documentation/x86/boot.txt: payload length was changed to payload_length
x86: setup_percpu.c fix style problems
x86: irqinit_64.c fix style problems
x86: irqinit_32.c fix style problems
x86: i8259.c fix style problems
x86: irq_32.c fix style problems
x86: ioport.c fix style problems
...
Device drivers that use pci_request_regions() (and similar APIs) have a
reasonable expectation that they are the only ones accessing their device.
As part of the e1000e hunt, we were afraid that some userland (X or some
bootsplash stuff) was mapping the MMIO region that the driver thought it
had exclusively via /dev/mem or via various sysfs resource mappings.
This patch adds the option for device drivers to cause their reserved
regions to the "banned from /dev/mem use" list, so now both kernel memory
and device-exclusive MMIO regions are banned.
NOTE: This is only active when CONFIG_STRICT_DEVMEM is set.
In addition to the config option, a kernel parameter iomem=relaxed is
provided for the cases where developers want to diagnose, in the field,
drivers issues from userspace.
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Show node to memory section relationship with symlinks in sysfs
Add /sys/devices/system/node/nodeX/memoryY symlinks for all
the memory sections located on nodeX. For example:
/sys/devices/system/node/node1/memory135 -> ../../memory/memory135
indicates that memory section 135 resides on node1.
Also revises documentation to cover this change as well as updating
Documentation/ABI/testing/sysfs-devices-memory to include descriptions
of memory hotremove files 'phys_device', 'phys_index', and 'state'
that were previously not described there.
In addition to it always being a good policy to provide users with
the maximum possible amount of physical location information for
resources that can be hot-added and/or hot-removed, the following
are some (but likely not all) of the user benefits provided by
this change.
Immediate:
- Provides information needed to determine the specific node
on which a defective DIMM is located. This will reduce system
downtime when the node or defective DIMM is swapped out.
- Prevents unintended onlining of a memory section that was
previously offlined due to a defective DIMM. This could happen
during node hot-add when the user or node hot-add assist script
onlines _all_ offlined sections due to user or script inability
to identify the specific memory sections located on the hot-added
node. The consequences of reintroducing the defective memory
could be ugly.
- Provides information needed to vary the amount and distribution
of memory on specific nodes for testing or debugging purposes.
Future:
- Will provide information needed to identify the memory
sections that need to be offlined prior to physical removal
of a specific node.
Symlink creation during boot was tested on 2-node x86_64, 2-node
ppc64, and 2-node ia64 systems. Symlink creation during physical
memory hot-add tested on a 2-node x86_64 system.
Signed-off-by: Gary Hade <garyhade@us.ibm.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rather than have the pagefault handler kill a process directly if it gets
a VM_FAULT_OOM, have it call into the OOM killer.
With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
oom killing throttling, oom priority adjustment or selective disabling,
panic on oom, etc), it's silly to unconditionally kill the faulting
process at page fault time. Create a hook for pagefault oom path to call
into instead.
Only converted x86 and uml so far.
[akpm@linux-foundation.org: make __out_of_memory() static]
[akpm@linux-foundation.org: fix comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeff Dike <jdike@addtoit.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Impact: cleanup
don't register early, so we don't need to clear actived regions if it fail
to get node hash shift or wild set in nb config.
also remove nodeids array that is not needed
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
x86: export vector_used_by_percpu_irq
x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
sched: nominate preferred wakeup cpu, fix
x86: fix lguest used_vectors breakage, -v2
x86: fix warning in arch/x86/kernel/io_apic.c
sched: fix warning in kernel/sched.c
sched: move test_sd_parent() to an SMP section of sched.h
sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
sched: activate active load balancing in new idle cpus
sched: bias task wakeups to preferred semi-idle packages
sched: nominate preferred wakeup cpu
sched: favour lower logical cpu number for sched_mc balance
sched: framework for sched_mc/smt_power_savings=N
sched: convert BALANCE_FOR_xx_POWER to inline functions
x86: use possible_cpus=NUM to extend the possible cpus allowed
x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
x86: update io_apic.c to the new cpumask code
x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
x86: xen: use smp_call_function_many()
x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
...
Fixed up trivial conflict in kernel/time/tick-sched.c manually
* 'core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (63 commits)
stacktrace: provide save_stack_trace_tsk() weak alias
rcu: provide RCU options on non-preempt architectures too
printk: fix discarding message when recursion_bug
futex: clean up futex_(un)lock_pi fault handling
"Tree RCU": scalable classic RCU implementation
futex: rename field in futex_q to clarify single waiter semantics
x86/swiotlb: add default swiotlb_arch_range_needs_mapping
x86/swiotlb: add default phys<->bus conversion
x86: unify pci iommu setup and allow swiotlb to compile for 32 bit
x86: add swiotlb allocation functions
swiotlb: consolidate swiotlb info message printing
swiotlb: support bouncing of HighMem pages
swiotlb: factor out copy to/from device
swiotlb: add arch hook to force mapping
swiotlb: allow architectures to override phys<->bus<->phys conversions
swiotlb: add comment where we handle the overflow of a dma mask on 32 bit
rcu: fix rcutorture behavior during reboot
resources: skip sanity check of busy resources
swiotlb: move some definitions to header
swiotlb: allow architectures to override swiotlb pool allocation
...
Fix up trivial conflicts in
arch/x86/kernel/Makefile
arch/x86/mm/init_32.c
include/linux/hardirq.h
as per Ingo's suggestions.
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (241 commits)
sched, trace: update trace_sched_wakeup()
tracing/ftrace: don't trace on early stage of a secondary cpu boot, v3
Revert "x86: disable X86_PTRACE_BTS"
ring-buffer: prevent false positive warning
ring-buffer: fix dangling commit race
ftrace: enable format arguments checking
x86, bts: memory accounting
x86, bts: add fork and exit handling
ftrace: introduce tracing_reset_online_cpus() helper
tracing: fix warnings in kernel/trace/trace_sched_switch.c
tracing: fix warning in kernel/trace/trace.c
tracing/ring-buffer: remove unused ring_buffer size
trace: fix task state printout
ftrace: add not to regex on filtering functions
trace: better use of stack_trace_enabled for boot up code
trace: add a way to enable or disable the stack tracer
x86: entry_64 - introduce FTRACE_ frame macro v2
tracing/ftrace: add the printk-msg-only option
tracing/ftrace: use preempt_enable_no_resched_notrace in ring_buffer_time_stamp()
x86, bts: correctly report invalid bts records
...
Fixed up trivial conflict in scripts/recordmcount.pl due to SH bits
being already partly merged by the SH merge.