Richard Yao reported a month ago that his system have a trouble with
vmap_area_lock contention during performance analysis by /proc/meminfo.
Andrew asked why his analysis checks /proc/meminfo stressfully, but he
didn't answer it.
https://lkml.org/lkml/2014/4/10/416
Although I'm not sure that this is right usage or not, there is a
solution reducing vmap_area_lock contention with no side-effect. That
is just to use rcu list iterator in get_vmalloc_info().
rcu can be used in this function because all RCU protocol is already
respected by writers, since Nick Piggin commit db64fe0225 ("mm:
rewrite vmap layer") back in linux-2.6.28
Specifically :
insertions use list_add_rcu(),
deletions use list_del_rcu() and kfree_rcu().
Note the rb tree is not used from rcu reader (it would not be safe),
only the vmap_area_list has full RCU protection.
Note that __purge_vmap_area_lazy() already uses this rcu protection.
rcu_read_lock();
list_for_each_entry_rcu(va, &vmap_area_list, list) {
if (va->flags & VM_LAZY_FREE) {
if (va->va_start < *start)
*start = va->va_start;
if (va->va_end > *end)
*end = va->va_end;
nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
list_add_tail(&va->purge_list, &valist);
va->flags |= VM_LAZY_FREEING;
va->flags &= ~VM_LAZY_FREE;
}
}
rcu_read_unlock();
Peter:
: While rcu list traversal over the vmap_area_list is safe, this may
: arrive at different results than the spinlocked version. The rcu list
: traversal version will not be a 'snapshot' of a single, valid instant
: of the entire vmap_area_list, but rather a potential amalgam of
: different list states.
Joonsoo:
: Yes, you are right, but I don't think that we should be strict here.
: Meminfo is already not a 'snapshot' at specific time. While we try to get
: certain stats, the other stats can change. And, although we may arrive at
: different results than the spinlocked version, the difference would not be
: large and would not make serious side-effect.
[edumazet@google.com: add more commit description]
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reported-by: Richard Yao <ryao@gentoo.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: Zhang Yanfei <zhangyanfei.yes@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zsmalloc needs exported unmap_kernel_range for building as a module. See
https://lkml.org/lkml/2013/1/18/487
I didn't send a patch to make unmap_kernel_range exportable at that time
because zram was staging stuff and I thought VM function exporting for
staging stuff makes no sense.
Now zsmalloc was promoted. If we can't build zsmalloc as module, it means
we can't build zram as module, either. Additionally, buddy map_vm_area is
already exported so let's export unmap_kernel_range to help his buddy.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace seq_printf where possible
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Replace places where __get_cpu_var() is used for an address calculation
with this_cpu_ptr().
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vm_map_ram() has a fragmentation problem when it cannot purge a
chunk(ie, 4M address space) if there is a pinning object in that
addresss space. So it could consume all VMALLOC address space easily.
We can fix the fragmentation problem by using vmap instead of
vm_map_ram() but vmap() is known to be slow compared to vm_map_ram().
Minchan said vm_map_ram is 5 times faster than vmap in his tests. So I
thought we should fix fragment problem of vm_map_ram because our
proprietary GPU driver has used it heavily.
On second thought, it's not an easy because we should reuse freed space
for solving the problem and it could make more IPI and bitmap operation
for searching hole. It could mitigate API's goal which is very fast
mapping. And even fragmentation problem wouldn't show in 64 bit
machine.
Another option is that the user should separate long-life and short-life
object and use vmap for long-life but vm_map_ram for short-life. If we
inform the user about the characteristic of vm_map_ram the user can
choose one according to the page lifetime.
Let's add some notice messages to user.
[akpm@linux-foundation.org: tweak comment text]
Signed-off-by: Gioh Kim <gioh.kim@lge.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
To increase compiler portability there is <linux/compiler.h> which
provides convenience macros for various gcc constructs. Eg: __weak for
__attribute__((weak)). I've replaced all instances of gcc attributes with
the right macro in the memory management (/mm) subsystem.
[akpm@linux-foundation.org: while-we're-there consistency tweaks]
Signed-off-by: Gideon Israel Dsouza <gidisrael@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commit ece86e222d, which was intended as a small performance
improvement.
Despite the claim that the patch doesn't introduce any functional
changes in fact it does.
The "no page" path behaves different now. Originally, vmalloc_to_page
might return NULL under some conditions, with new implementation it
returns pfn_to_page(0) which is not the same as NULL.
Simple test shows the difference.
test.c
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/vmalloc.h>
#include <linux/mm.h>
int __init myi(void)
{
struct page *p;
void *v;
v = vmalloc(PAGE_SIZE);
/* trigger the "no page" path in vmalloc_to_page*/
vfree(v);
p = vmalloc_to_page(v);
pr_err("expected val = NULL, returned val = %p", p);
return -EBUSY;
}
void __exit mye(void)
{
}
module_init(myi)
module_exit(mye)
Before interchange:
expected val = NULL, returned val = (null)
After interchange:
expected val = NULL, returned val = c7ebe000
Signed-off-by: Vladimir Murzin <murzin.v@gmail.com>
Cc: Jianyu Zhan <nasa4836@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we are implementing vmalloc_to_pfn() as a wrapper around
vmalloc_to_page(), which is implemented as follow:
1. walks the page talbes to generates the corresponding pfn,
2. then converts the pfn to struct page,
3. returns it.
And vmalloc_to_pfn() re-wraps vmalloc_to_page() to get the pfn.
This seems too circuitous, so this patch reverses the way: implement
vmalloc_to_page() as a wrapper around vmalloc_to_pfn(). This makes
vmalloc_to_pfn() and vmalloc_to_page() slightly more efficient.
No functional change.
Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
Cc: Vladimir Murzin <murzin.v@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 248ac0e194 ("mm/vmalloc: remove guard page from between vmap
blocks") had the side effect of making vmap_area.va_end member point to
the next vmap_area.va_start. This was creating an artificial reference
to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.
This patch marks the vmap_area containing pointers explicitly and
reduces the min ref_count to 2 as vm_struct still contains a reference
to the vmalloc'ed object. The kmemleak add_scan_area() function has
been improved to allow a SIZE_MAX argument covering the rest of the
object (for simpler calling sites).
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e009d ("mm:
avoid null pointer access in vm_struct via /proc/vmallocinfo") is used
to avoid accessing the pages field with unallocated page when
show_numa_info() is called.
This patch moves the check just before show_numa_info in order that some
messages still can be dumped via /proc/vmallocinfo. This patch reverts
commit d157a55815 ("mm/vmalloc.c: check VM_UNINITIALIZED flag in
s_show instead of show_numa_info");
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a race window between vmap_area tear down and show vmap_area
information.
A B
remove_vm_area
spin_lock(&vmap_area_lock);
va->vm = NULL;
va->flags &= ~VM_VM_AREA;
spin_unlock(&vmap_area_lock);
spin_lock(&vmap_area_lock);
if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEZING))
return 0;
if (!(va->flags & VM_VM_AREA)) {
seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
(void *)va->va_start, (void *)va->va_end,
va->va_end - va->va_start);
return 0;
}
free_unmap_vmap_area(va);
flush_cache_vunmap
free_unmap_vmap_area_noflush
unmap_vmap_area
free_vmap_area_noflush
va->flags |= VM_LAZY_FREE
The assumption !VM_VM_AREA represents vm_map_ram allocation is
introduced by d4033afdf8 ("mm, vmalloc: iterate vmap_area_list,
instead of vmlist, in vmallocinfo()").
However, !VM_VM_AREA also represents vmap_area is being tear down in
race window mentioned above. This patch fix it by don't dump any
information for !VM_VM_AREA case and also remove (VM_LAZY_FREE |
VM_LAZY_FREEING) check since they are not possible for !VM_VM_AREA case.
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The caller address has already been set in set_vmalloc_vm(), there's no
need to set it again in __vmalloc_area_node.
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Our intention in here is to find last_bit within the region to flush.
There is well-defined function, find_last_bit() for this purpose and its
performance may be slightly better than current implementation. So change
it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vbq in vmap_block isn't used. So remove it.
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When searching a vmap area in the vmalloc space, we use (addr + size -
1) to check if the value is less than addr, which is an overflow. But
we assign (addr + size) to vmap_area->va_end.
So if we come across the below case:
(addr + size - 1) : not overflow
(addr + size) : overflow
we will assign an overflow value (e.g 0) to vmap_area->va_end, And this
will trigger BUG in __insert_vmap_area, causing system panic.
So using (addr + size) to check the overflow should be the correct
behaviour, not (addr + size - 1).
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Reported-by: Ghennadi Procopciuc <unix140@gmail.com>
Tested-by: Daniel Baluta <dbaluta@ixiacom.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vfree() only needs schedule_work(&p->wq) if p->list was empty, otherwise
vfree_deferred->wq is already pending or it is running and didn't do
llist_del_all() yet.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We should check the VM_UNITIALIZED flag in s_show(). If this flag is
set, that said, the vm_struct is not fully initialized. So it is
unnecessary to try to show the information contained in vm_struct.
We checked this flag in show_numa_info(), but I think it's better to
check it earlier.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
VM_UNLIST was used to indicate that the vm_struct is not listed in
vmlist.
But after commit 4341fa4547 ("mm, vmalloc: remove list management of
vmlist after initializing vmalloc"), the meaning of this flag changed.
It now means the vm_struct is not fully initialized. So renaming it to
VM_UNINITIALIZED seems more reasonable.
Also change clear_vm_unlist to clear_vm_uninitialized_flag.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use goto to jump to the fail label to give a failure message before
returning NULL. This makes the failure handling in this function
consistent.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As we have removed the dead code in the vb_alloc, it seems there is no
place to use the alloc_map. So there is no reason to maintain the
alloc_map in vmap_block.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function is nowhere used now, so remove it.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Space in a vmap block that was once allocated is considered dirty and
not made available for allocation again before the whole block is
recycled. The result is that free space within a vmap block is always
contiguous.
So if a vmap block has enough free space for allocation, the allocation
is impossible to fail. Thus, the fragmented block purging was never
invoked from vb_alloc(). So remove this dead code.
[ Same patches also sent by:
Chanho Min <chanho.min@lge.com>
Johannes Weiner <hannes@cmpxchg.org>
but git doesn't do "multiple authors" ]
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is an extra semi-colon so the function always returns.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now this function is nowhere used, we can remove it directly.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Here we pass flags with only VM_ALLOC bit set, it is unnecessary to call
clear_vm_unlist to clear VM_UNLIST bit. So use setup_vmalloc_vm instead
of insert_vmalloc_vm.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now for insert_vmalloc_vm, it only calls the two functions:
- setup_vmalloc_vm: fill vm_struct and vmap_area instances
- clear_vm_unlist: clear VM_UNLIST bit in vm_struct->flags
So in __get_vm_area_node(), if VM_UNLIST bit unset in flags, that is the
else branch here, we don't need to clear VM_UNLIST bit for vm->flags since
this bit is obviously not set. That is to say, we could only call
setup_vmalloc_vm instead of insert_vmalloc_vm here. And then we could
even remove the if test here.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want to allocate ELF note segment buffer on the 2nd kernel in vmalloc
space and remap it to user-space in order to reduce the risk that memory
allocation fails on system with huge number of CPUs and so with huge ELF
note segment that exceeds 11-order block size.
Although there's already remap_vmalloc_range for the purpose of
remapping vmalloc memory to user-space, we need to specify user-space
range via vma.
Mmap on /proc/vmcore needs to remap range across multiple objects, so
the interface that requires vma to cover full range is problematic.
This patch introduces remap_vmalloc_range_partial that receives user-space
range as a pair of base address and size and can be used for mmap on
/proc/vmcore case.
remap_vmalloc_range is rewritten using remap_vmalloc_range_partial.
[akpm@linux-foundation.org: use PAGE_ALIGNED()]
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, __find_vmap_area searches for the kernel VM area starting at
a given address. This patch changes this behavior so that it searches
for the kernel VM area to which the address belongs. This change is
needed by remap_vmalloc_range_partial to be introduced in later patch
that receives any position of kernel VM area as target address.
This patch changes the condition (addr > va->va_start) to the equivalent
(addr >= va->va_end) by taking advantage of the fact that each kernel VM
area is non-overlapping.
Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Lisa Mitchell <lisa.mitchell@hp.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull VFS updates from Al Viro,
Misc cleanups all over the place, mainly wrt /proc interfaces (switch
create_proc_entry to proc_create(), get rid of the deprecated
create_proc_read_entry() in favor of using proc_create_data() and
seq_file etc).
7kloc removed.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
don't bother with deferred freeing of fdtables
proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
proc: Make the PROC_I() and PDE() macros internal to procfs
proc: Supply a function to remove a proc entry by PDE
take cgroup_open() and cpuset_open() to fs/proc/base.c
ppc: Clean up scanlog
ppc: Clean up rtas_flash driver somewhat
hostap: proc: Use remove_proc_subtree()
drm: proc: Use remove_proc_subtree()
drm: proc: Use minor->index to label things, not PDE->name
drm: Constify drm_proc_list[]
zoran: Don't print proc_dir_entry data in debug
reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
proc: Supply an accessor for getting the data from a PDE's parent
airo: Use remove_proc_subtree()
rtl8192u: Don't need to save device proc dir PDE
rtl8187se: Use a dir under /proc/net/r8180/
proc: Add proc_mkdir_data()
proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
proc: Move PDE_NET() to fs/proc/proc_net.c
...
Now, vmap_area_list is exported as VMCOREINFO for makedumpfile to get
the start address of vmalloc region (vmalloc_start). The address which
contains vmalloc_start value is represented as below:
vmap_area_list.next - OFFSET(vmap_area.list) + OFFSET(vmap_area.va_start)
However, both OFFSET(vmap_area.va_start) and OFFSET(vmap_area.list)
aren't exported as VMCOREINFO.
So this patch exports them externally with small cleanup.
[akpm@linux-foundation.org: vmalloc.h should include list.h for list_head]
Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, there is no need to maintain vmlist after initializing vmalloc. So
remove related code and data structure.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Although our intention is to unexport internal structure entirely, but
there is one exception for kexec. kexec dumps address of vmlist and
makedumpfile uses this information.
We are about to remove vmlist, then another way to retrieve information
of vmalloc layer is needed for makedumpfile. For this purpose, we
export vmap_area_list, instead of vmlist.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a preparatory step for removing vmlist entirely. For
above purpose, we change iterating a vmap_list codes to iterating a
vmap_area_list. It is somewhat trivial change, but just one thing
should be noticed.
Using vmap_area_list in vmallocinfo() introduce ordering problem in SMP
system. In s_show(), we retrieve some values from vm_struct.
vm_struct's values is not fully setup when va->vm is assigned. Full
setup is notified by removing VM_UNLIST flag without holding a lock.
When we see that VM_UNLIST is removed, it is not ensured that vm_struct
has proper values in view of other CPUs. So we need smp_[rw]mb for
ensuring that proper values is assigned when we see that VM_UNLIST is
removed.
Therefore, this patch not only change a iteration list, but also add a
appropriate smp_[rw]mb to right places.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch is a preparatory step for removing vmlist entirely. For
above purpose, we change iterating a vmap_list codes to iterating a
vmap_area_list. It is somewhat trivial change, but just one thing
should be noticed.
vmlist is lack of information about some areas in vmalloc address space.
For example, vm_map_ram() allocate area in vmalloc address space, but it
doesn't make a link with vmlist. To provide full information about
vmalloc address space is better idea, so we don't use va->vm and use
vmap_area directly. This makes get_vmalloc_info() more precise.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, when we hold a vmap_area_lock, va->vm can't be discarded. So we can
safely access to va->vm when iterating a vmap_area_list with holding a
vmap_area_lock. With this property, change iterating vmlist codes in
vread/vwrite() to iterating vmap_area_list.
There is a little difference relate to lock, because vmlist_lock is mutex,
but, vmap_area_lock is spin_lock. It may introduce a spinning overhead
during vread/vwrite() is executing. But, these are debug-oriented
functions, so this overhead is not real problem for common case.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Inserting and removing an entry to vmlist is linear time complexity, so
it is inefficient. Following patches will try to remove vmlist
entirely. This patch is preparing step for it.
For removing vmlist, iterating vmlist codes should be changed to
iterating a vmap_area_list. Before implementing that, we should make
sure that when we iterate a vmap_area_list, accessing to va->vm doesn't
cause a race condition. This patch ensure that when iterating a
vmap_area_list, there is no race condition for accessing to vm_struct.
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now get_vmalloc_info() is in fs/proc/mmu.c. There is no reason that this
code must be here and it's implementation needs vmlist_lock and it iterate
a vmlist which may be internal data structure for vmalloc.
It is preferable that vmlist_lock and vmlist is only used in vmalloc.c
for maintainability. So move the code to vmalloc.c
Signed-off-by: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A bunch of RCU callbacks want to be able to do vfree() and end up with
rather kludgy schemes. Just let vfree() do the right thing - put the
victim on llist and schedule actual __vunmap() via schedule_work(), so
that it runs from non-interrupt context.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make a sweep through mm/ and convert code that uses -1 directly to using
the more appropriate NUMA_NO_NODE.
Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We don't need custom NUMA_BUILD anymore, since we have handy
IS_ENABLED().
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the paranoid case of sysctl kernel.kptr_restrict=2, mask the kernel
virtual addresses in /proc/vmallocinfo too.
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Brad Spengler <spender@grsecurity.net>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
currently it lost original meaning but still has some effects:
| effect | alternative flags
-+------------------------+---------------------------------------------
1| account as reserved_vm | VM_IO
2| skip in core dump | VM_IO, VM_DONTDUMP
3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
This patch removes reserved_vm counter from mm_struct. Seems like nobody
cares about it, it does not exported into userspace directly, it only
reduces total_vm showed in proc.
Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
[akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Carsten Otte <cotte@de.ibm.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Morris <james.l.morris@oracle.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If someone calls vb_alloc() (or vm_map_ram() for that matter) to allocate
0 bytes (0 pages), get_order() returns BITS_PER_LONG - PAGE_CACHE_SHIFT
and interesting stuff happens. So make debugging such problems easier and
warn about 0-size allocation.
[akpm@linux-foundation.org: use WARN_ON-return-value feature]
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There's a walk by repeating rb_next to find a suitable hole. Could be
simply replaced by walk on the sorted vmap_area_list. More simpler and
efficient.
Mutation of the list and tree only happens in pair within
__insert_vmap_area and __free_vmap_area, under protection of
vmap_area_lock. The patch code is also under vmap_area_lock, so the list
walk is safe, and consistent with the tree walk.
Tested on SMP by repeating batch of vmalloc anf vfree for random sizes and
rounds for hours.
Signed-off-by: Hong Zhiguo <honkiko@gmail.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull DMA-mapping updates from Marek Szyprowski:
"Those patches are continuation of my earlier work.
They contains extensions to DMA-mapping framework to remove limitation
of the current ARM implementation (like limited total size of DMA
coherent/write combine buffers), improve performance of buffer sharing
between devices (attributes to skip cpu cache operations or creation
of additional kernel mapping for some specific use cases) as well as
some unification of the common code for dma_mmap_attrs() and
dma_mmap_coherent() functions. All extensions have been implemented
and tested for ARM architecture."
* 'for-linus-for-3.6-rc1' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
ARM: dma-mapping: add support for DMA_ATTR_SKIP_CPU_SYNC attribute
common: DMA-mapping: add DMA_ATTR_SKIP_CPU_SYNC attribute
ARM: dma-mapping: add support for dma_get_sgtable()
common: dma-mapping: introduce dma_get_sgtable() function
ARM: dma-mapping: add support for DMA_ATTR_NO_KERNEL_MAPPING attribute
common: DMA-mapping: add DMA_ATTR_NO_KERNEL_MAPPING attribute
common: dma-mapping: add support for generic dma_mmap_* calls
ARM: dma-mapping: fix error path for memory allocation failure
ARM: dma-mapping: add more sanity checks in arm_dma_mmap()
ARM: dma-mapping: remove custom consistent dma region
mm: vmalloc: use const void * for caller argument
scatterlist: add sg_alloc_table_from_pages function
This patch changes dma-mapping subsystem to use generic vmalloc areas
for all consistent dma allocations. This increases the total size limit
of the consistent allocations and removes platform hacks and a lot of
duplicated code.
Atomic allocations are served from special pool preallocated on boot,
because vmalloc areas cannot be reliably created in atomic context.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
'const void *' is a safer type for caller function type. This patch
updates all references to caller function type.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Reviewed-by: Kyungmin Park <kyungmin.park@samsung.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>
The transfer of ->flags causes some of the static mapping virtual
addresses to be prematurely freed (before the mapping is removed) because
VM_LAZY_FREE gets "set" if tmp->flags has VM_IOREMAP set. This might
cause subsequent vmalloc/ioremap calls to fail because it might allocate
one of the freed virtual address ranges that aren't unmapped.
va->flags has different types of flags from tmp->flags. If a region with
VM_IOREMAP set is registered with vm_area_add_early(), it will be removed
by __purge_vmap_area_lazy().
Fix vmalloc_init() to correctly initialize vmap_area for the given
vm_struct.
Also initialise va->vm. If it is not set, find_vm_area() for the early
vm regions will always fail.
Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
Cc: "Olav Haugan" <ohaugan@codeaurora.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The advantage of kcalloc is, that will prevent integer overflows which
could result from the multiplication of number of elements and size and
it is also a bit nicer to read.
The semantic patch that makes this change is available in
https://lkml.org/lkml/2011/11/25/107
Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If either of the vas or vms arrays are not properly kzalloced, then the
code jumps to the err_free label.
The err_free label runs a loop to check and free each of the array members
of the vas and vms arrays which is not required for this situation as none
of the array members have been allocated till this point.
Eliminate the extra loop we have to go through by introducing a new label
err_free2 and then jumping to it.
[akpm@linux-foundation.org: remove now-unneeded tests]
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vmap_area->private is void* but we don't use the field for various purpose
but use only for vm_struct. So change it to a vm_struct* with naming to
improve for readability and type checking.
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Static storage is not required for the struct vmap_area in
__get_vm_area_node.
Removing "static" to store this variable on the stack instead.
Signed-off-by: Kautuk Consul <consul.kautuk@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f5252e00 ("mm: avoid null pointer access in vm_struct via
/proc/vmallocinfo") adds newly allocated vm_structs to the vmlist after
it is fully initialised. Unfortunately, it did not check that
__vmalloc_area_node() successfully populated the area. In the event of
allocation failure, the vmalloc area is freed but the pointer to freed
memory is inserted into the vmlist leading to a a crash later in
get_vmalloc_info().
This patch adds a check for ____vmalloc_area_node() failure within
__vmalloc_node_range. It does not use "goto fail" as in the previous
error path as a warning was already displayed by __vmalloc_area_node()
before it called vfree in its failure path.
Credit goes to Luciano Chavez for doing all the real work of identifying
exactly where the problem was.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reported-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Tested-by: Luciano Chavez <lnx1138@linux.vnet.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [3.1.x+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The existing vm_area_register_early() allows for early vmalloc space
allocation. However upcoming cleanups in the ARM architecture require
that some fixed locations in the vmalloc area be reserved also very early.
The name "vm_area_register_early" would have been a good name for the
reservation part without the allocation. Since it is already in use with
different semantics, let's create vm_area_add_early() instead.
Both vm_area_register_early() and vm_area_add_early() can be used together
meaning that the former is now implemented using the later where it is
ensured that no conflicting areas are added, but no attempt is made to
make the allocation scheme in vm_area_register_early() more sophisticated.
After all, you must know what you're doing when using those functions.
Signed-off-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
When mapping a foreign page with xenbus_map_ring_valloc() with the
GNTTABOP_map_grant_ref hypercall, set the GNTMAP_contains_pte flag and
pass a pointer to the PTE (in init_mm).
After the page is mapped, the usual fault mechanism can be used to
update additional MMs. This allows the vmalloc_sync_all() to be
removed from alloc_vm_area().
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
[v1: Squashed fix by Michal for no-mmu case]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Michal Simek <monstr@monstr.eu>
Some vmalloc failure paths do not report OOM conditions.
Add warn_alloc_failed, which also does a dump_stack, to those failure
paths.
This allows more site specific vmalloc failure logging message printks to
be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add __attribute__((format (printf...) to the function to validate format
and arguments. Use vsprintf extension %pV to avoid any possible message
interleaving. Coalesce format string. Convert printks/pr_warning to
pr_warn.
[akpm@linux-foundation.org: use the __printf() macro]
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct. It, however, may access pages
field of vm_struct where a page was not allocated. This results in a null
pointer access and leads to a kernel panic.
Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node(). In other words, it is added to vmlist before it is
fully initialized. At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info(). Thus, a null pointer access happens.
The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized. So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.
Signed-off-by: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: <stable@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xen backend drivers (e.g., blkback and netback) would sometimes fail to
map grant pages into the vmalloc address space allocated with
alloc_vm_area(). The GNTTABOP_map_grant_ref would fail because Xen could
not find the page (in the L2 table) containing the PTEs it needed to
update.
(XEN) mm.c:3846:d0 Could not find L1 PTE for address fbb42000
netback and blkback were making the hypercall from a kernel thread where
task->active_mm != &init_mm and alloc_vm_area() was only updating the page
tables for init_mm. The usual method of deferring the update to the page
tables of other processes (i.e., after taking a fault) doesn't work as a
fault cannot occur during the hypercall.
This would work on some systems depending on what else was using vmalloc.
Fix this by reverting ef691947d8 ("vmalloc: remove vmalloc_sync_all()
from alloc_vm_area()") and add a comment to explain why it's needed.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Keir Fraser <keir.xen@gmail.com>
Cc: <stable@kernel.org> [3.0.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit db64fe0225 ("mm: rewrite vmap layer") introduced code that does
address calculations under the assumption that VMAP_BLOCK_SIZE is a
power of two. However, this might not be true if CONFIG_NR_CPUS is not
set to a power of two.
Wrong vmap_block index/offset values could lead to memory corruption.
However, this has never been observed in practice (or never been
diagnosed correctly); what caught this was the BUG_ON in vb_alloc() that
checks for inconsistent vmap_block indices.
To fix this, ensure that VMAP_BLOCK_SIZE always is a power of two.
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=31572
Reported-by: Pavel Kysilka <goldenfish@linuxsoft.cz>
Reported-by: Matias A. Fonzo <selk@dragora.org>
Signed-off-by: Clemens Ladisch <clemens@ladisch.de>
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: 2.6.28+ <stable@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This allows us to move duplicated code in <asm/atomic.h>
(atomic_inc_not_zero() for now) to <linux/atomic.h>
Signed-off-by: Arun Sharma <asharma@fb.com>
Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The rcu callback rcu_free_vb() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(rcu_free_vb).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
The rcu callback rcu_free_va() just calls a kfree(),
so we use kfree_rcu() instead of the call_rcu(rcu_free_va).
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Namhyung Kim <namhyung@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
* 'upstream/tidy-xen-mmu-2.6.39' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xen: fix compile without CONFIG_XEN_DEBUG_FS
Use arbitrary_virt_to_machine() to deal with ioremapped pud updates.
Use arbitrary_virt_to_machine() to deal with ioremapped pmd updates.
xen/mmu: remove all ad-hoc stats stuff
xen: use normal virt_to_machine for ptes
xen: make a pile of mmu pvop functions static
vmalloc: remove vmalloc_sync_all() from alloc_vm_area()
xen: condense everything onto xen_set_pte
xen: use mmu_update for xen_set_pte_at()
xen: drop all the special iomap pte paths.
I was tracking down a page allocation failure that ended up in vmalloc().
Since vmalloc() uses 0-order pages, if somebody asks for an insane amount
of memory, we'll still get a warning with "order:0" in it. That's not
very useful.
During recovery, vmalloc() also nicely frees all of the memory that it got
up to the point of the failure. That is wonderful, but it also quickly
hides any issues. We have a much different sitation if vmalloc()
repeatedly fails 10GB in to:
vmalloc(100 * 1<<30);
versus repeatedly failing 4096 bytes in to a:
vmalloc(8192);
This patch will print out messages that look like this:
[ 68.123503] vmalloc: allocation failure, allocated 6680576 of 13426688 bytes
[ 68.124218] bash: page allocation failure: order:0, mode:0xd2
[ 68.124811] Pid: 3770, comm: bash Not tainted 2.6.39-rc3-00082-g85f2e68-dirty #333
[ 68.125579] Call Trace:
[ 68.125853] [<ffffffff810f6da6>] warn_alloc_failed+0x146/0x170
[ 68.126464] [<ffffffff8107e05c>] ? printk+0x6c/0x70
[ 68.126791] [<ffffffff8112b5d4>] ? alloc_pages_current+0x94/0xe0
[ 68.127661] [<ffffffff8111ed37>] __vmalloc_node_range+0x237/0x290
...
The 'order' variable is added for clarity when calling warn_alloc_failed()
to avoid having an unexplained '0' as an argument.
The 'tmp_mask' is because adding an open-coded '| __GFP_NOWARN' would take
us over 80 columns for the alloc_pages_node() call. If we are going to
add a line, it might as well be one that makes the sucker easier to read.
As a side issue, I also noticed that ctl_ioctl() does vmalloc() based
solely on an unverified value passed in from userspace. Granted, it's
under CAP_SYS_ADMIN, but it still frightens me a bit.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vmap allocator is used to, among other things, allocate per-cpu vmap
blocks, where each vmap block is naturally aligned to its own size.
Obviously, leaving a guard page after each vmap area forbids packing vmap
blocks efficiently and can make the kernel run out of possible vmap blocks
long before overall vmap space is exhausted.
The new interface to map a user-supplied page array into linear vmalloc
space (vm_map_ram) insists on allocating from a vmap block (instead of
falling back to a custom area) when the area size is below a certain
threshold. With heavy users of this interface (e.g. XFS) and limited
vmalloc space on 32-bit, vmap block exhaustion is a real problem.
Remove the guard page from the core vmap allocator. vmalloc and the old
vmap interface enforce a guard page on their own at a higher level.
Note that without this patch, we had accidental guard pages after those
vm_map_ram areas that happened to be at the end of a vmap block, but not
between every area. This patch removes this accidental guard page only.
If we want guard pages after every vm_map_ram area, this should be done
separately. And just like with vmalloc and the old interface on a
different level, not in the core allocator.
Mel pointed out: "If necessary, the guard page could be reintroduced as a
debugging-only option (CONFIG_DEBUG_PAGEALLOC?). Otherwise it seems
reasonable."
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Dave Chinner <david@fromorbit.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KM_USER1 is never used for vwrite() path so the caller doesn't need to
guarantee it is not used. Only the caller should guarantee is KM_USER0
and it is commented already.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide a free area cache for the vmalloc virtual address allocator, based
on the algorithm used by the user virtual memory allocator.
This reduces the number of rbtree operations and linear traversals over
the vmap extents in order to find a free area, by starting off at the last
point that a free area was found.
The free area cache is reset if areas are freed behind it, or if we are
searching for a smaller area or alignment than last time. So allocation
patterns are not changed (verified by corner-case and random test cases in
userspace testing).
This solves a regression caused by lazy vunmap TLB purging introduced in
db64fe02 (mm: rewrite vmap layer). That patch will leave extents in the
vmap allocator after they are vunmapped, and until a significant number
accumulate that can be flushed in a single batch. So in a workload that
vmalloc/vfree frequently, a chain of extents will build up from
VMALLOC_START address, which have to be iterated over each time (giving an
O(n) type of behaviour).
After this patch, the search will start from where it left off, giving
closer to an amortized O(1).
This is verified to solve regressions reported Steven in GFS2, and Avi in
KVM.
Hugh's update:
: I tried out the recent mmotm, and on one machine was fortunate to hit
: the BUG_ON(first->va_start < addr) which seems to have been stalling
: your vmap area cache patch ever since May.
: I can get you addresses etc, I did dump a few out; but once I stared
: at them, it was easier just to look at the code: and I cannot see how
: you would be so sure that first->va_start < addr, once you've done
: that addr = ALIGN(max(...), align) above, if align is over 0x1000
: (align was 0x8000 or 0x4000 in the cases I hit: ioremaps like Steve).
: I originally got around it by just changing the
: if (first->va_start < addr) {
: to
: while (first->va_start < addr) {
: without thinking about it any further; but that seemed unsatisfactory,
: why would we want to loop here when we've got another very similar
: loop just below it?
: I am never going to admit how long I've spent trying to grasp your
: "while (n)" rbtree loop just above this, the one with the peculiar
: if (!first && tmp->va_start < addr + size)
: in. That's unfamiliar to me, I'm guessing it's designed to save a
: subsequent rb_next() in a few circumstances (at risk of then setting
: a wrong cached_hole_size?); but they did appear few to me, and I didn't
: feel I could sign off something with that in when I don't grasp it,
: and it seems responsible for extra code and mistaken BUG_ON below it.
: I've reverted to the familiar rbtree loop that find_vma() does (but
: with va_end >= addr as you had, to respect the additional guard page):
: and then (given that cached_hole_size starts out 0) I don't see the
: need for any complications below it. If you do want to keep that loop
: as you had it, please add a comment to explain what it's trying to do,
: and where addr is relative to first when you emerge from it.
: Aren't your tests "size <= cached_hole_size" and
: "addr + size > first->va_start" forgetting the guard page we want
: before the next area? I've changed those.
: I have not changed your many "addr + size - 1 < addr" overflow tests,
: but have since come to wonder, shouldn't they be "addr + size < addr"
: tests - won't the vend checks go wrong if addr + size is 0?
: I have added a few comments - Wolfgang Wander's 2.6.13 description of
: 1363c3cd86 Avoiding mmap fragmentation
: helped me a lot, perhaps a pointer to that would be good too. And I found
: it easier to understand when I renamed cached_start slightly and moved the
: overflow label down.
: This patch would go after your mm-vmap-area-cache.patch in mmotm.
: Trivially, nobody is going to get that BUG_ON with this patch, and it
: appears to work fine on my machines; but I have not given it anything like
: the testing you did on your original, and may have broken all the
: performance you were aiming for. Please take a look and test it out
: integrate with yours if you're satisfied - thanks.
[akpm@linux-foundation.org: add locking comment]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reported-and-tested-by: Steven Whitehouse <swhiteho@redhat.com>
Reported-and-tested-by: Avi Kivity <avi@redhat.com>
Tested-by: "Barry J. Marson" <bmarson@redhat.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (59 commits)
ACPI / PM: Fix build problems for !CONFIG_ACPI related to NVS rework
ACPI: fix resource check message
ACPI / Battery: Update information on info notification and resume
ACPI: Drop device flag wake_capable
ACPI: Always check if _PRW is present before trying to evaluate it
ACPI / PM: Check status of power resources under mutexes
ACPI / PM: Rename acpi_power_off_device()
ACPI / PM: Drop acpi_power_nocheck
ACPI / PM: Drop acpi_bus_get_power()
Platform / x86: Make fujitsu_laptop use acpi_bus_update_power()
ACPI / Fan: Rework the handling of power resources
ACPI / PM: Register power resource devices as soon as they are needed
ACPI / PM: Register acpi_power_driver early
ACPI / PM: Add function for updating device power state consistently
ACPI / PM: Add function for device power state initialization
ACPI / PM: Introduce __acpi_bus_get_power()
ACPI / PM: Introduce function for refcounting device power resources
ACPI / PM: Add functions for manipulating lists of power resources
ACPI / PM: Prevent acpi_power_get_inferred_state() from making changes
ACPICA: Update version to 20101209
...
IS_ERR() already implies unlikely(), so it can be omitted here.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
module_init(). Much of the code is duplicated and can be generalized in a
globally accessible function, __vmalloc_node_range().
__vmalloc_node() now calls into __vmalloc_node_range() with a range of
[VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.
Each architecture may then use __vmalloc_node_range() directly to remove
the duplication of code.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
formal and use the mask internally.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_vm_area_node() is unused in the kernel and can thus be removed.
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Generic Hardware Error Source provides a way to report platform
hardware errors (such as that from chipset). It works in so called
"Firmware First" mode, that is, hardware errors are reported to
firmware firstly, then reported to Linux by firmware. This way, some
non-standard hardware error registers or non-standard hardware link
can be checked by firmware to produce more valuable hardware error
information for Linux.
This patch adds POLL/IRQ/NMI notification types support.
Because the memory area used to transfer hardware error information
from BIOS to Linux can be determined only in NMI, IRQ or timer
handler, but general ioremap can not be used in atomic context, so a
special version of atomic ioremap is implemented for that.
Known issue:
- Error information can not be printed for recoverable errors notified
via NMI, because printk is not NMI-safe. Will fix this via delay
printing to IRQ context via irq_work or make printk NMI-safe.
v2:
- adjust printk format per comments.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
On stock 2.6.37-rc4, running:
# mount lilith:/export /mnt/lilith
# find /mnt/lilith/ -type f -print0 | xargs -0 file
crashes the machine fairly quickly under Xen. Often it results in oops
messages, but the couple of times I tried just now, it just hung quietly
and made Xen print some rude messages:
(XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
3000000000000000) for mfn 1d7058 (pfn 18fa7)
(XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
(XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
(XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04
Which means the domain tried to map a pagetable page RW, which would
allow it to map arbitrary memory, so Xen stopped it. This is because
vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
finished with them, and those pages got recycled as pagetable pages
while still having these RW aliases.
Removing those mappings immediately removes the Xen-visible aliases, and
so it has no problem with those pages being reused as pagetable pages.
Deferring the TLB flush doesn't upset Xen because it can flush the TLB
itself as needed to maintain its invariants.
When unmapping a region in the vmalloc space, clear the ptes
immediately. There's no point in deferring this because there's no
amortization benefit.
The TLBs are left dirty, and they are flushed lazily to amortize the
cost of the IPIs.
This specific motivation for this patch is an oops-causing regression
since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
of vm_map_ram() introduced in 56e4ebf877 ("NFS: readdir with vmapped
pages") . XFS also uses vm_map_ram() and could cause similar problems.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add vzalloc() and vzalloc_node() to encapsulate the
vmalloc-then-memset-zero operation.
Use __GFP_ZERO to zero fill the allocated memory.
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Greg Ungerer <gerg@snapgear.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
s_start() and s_stop() grab/release vmlist_lock but were missing proper
annotations. Add them.
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename redundant 'tmp' to fix following sparse warnings:
mm/vmalloc.c:296:34: warning: symbol 'tmp' shadows an earlier one
mm/vmalloc.c:293:24: originally declared here
Signed-off-by: Namhyung Kim <namhyung@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: update comments to reflect that percpu allocations are always zero-filled
percpu: Optimize __get_cpu_var()
x86, percpu: Optimize this_cpu_ptr
percpu: clear memory allocated with the km allocator
percpu: fix build breakage on s390 and cleanup build configuration tests
percpu: use percpu allocator on UP too
percpu: reduce PCPU_MIN_UNIT_SIZE to 32k
vmalloc: pcpu_get/free_vm_areas() aren't needed on UP
Fixed up trivial conflicts in include/linux/percpu.h
During the reading of /proc/vmcore the kernel is doing
ioremap()/iounmap() repeatedly. And the buildup of un-flushed
vm_area_struct's is causing a great deal of overhead. (rb_next()
is chewing up most of that time).
This solution is to provide function set_iounmap_nonlazy(). It
causes a subsequent call to iounmap() to immediately purge the
vma area (with try_purge_vmap_area_lazy()).
With this patch we have seen the time for writing a 250MB
compressed dump drop from 71 seconds to 44 seconds.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: kexec@lists.infradead.org
Cc: <stable@kernel.org>
LKML-Reference: <E1OwHZ4-0005WK-Tw@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
These functions are used only by percpu memory allocator on SMP.
Don't build them on UP.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Reviewed-by: Chrsitoph Lameter <cl@linux.com>
* 'stable/xen-swiotlb-0.8.6' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
x86: Detect whether we should use Xen SWIOTLB.
pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions.
swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough.
xen/mmu: inhibit vmap aliases rather than trying to clear them out
vmap: add flag to allow lazy unmap to be disabled at runtime
xen: Add xen_create_contiguous_region
xen: Rename the balloon lock
xen: Allow unprivileged Xen domains to create iomap pages
xen: use _PAGE_IOMAP in ioremap to do machine mappings
Fix up trivial conflicts (adding both xen swiotlb and xen pci platform
driver setup close to each other) in drivers/xen/{Kconfig,Makefile} and
include/xen/xen-ops.h
kmalloc() may fail, if so return -ENOMEM.
Signed-off-by: Kulikov Vasiliy <segooon@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a flag to force lazy_max_pages() to zero to prevent any outstanding
mapped pages. We'll need this for Xen.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Current x86 ioremap() doesn't handle physical address higher than
32-bit properly in X86_32 PAE mode. When physical address higher than
32-bit is passed to ioremap(), higher 32-bits in physical address is
cleared wrongly. Due to this bug, ioremap() can map wrong address to
linear address space.
In my case, 64-bit MMIO region was assigned to a PCI device (ioat
device) on my system. Because of the ioremap()'s bug, wrong physical
address (instead of MMIO region) was mapped to linear address space.
Because of this, loading ioatdma driver caused unexpected behavior
(kernel panic, kernel hangup, ...).
Signed-off-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com>
LKML-Reference: <4C1AE680.7090408@jp.fujitsu.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Improve handling of fragmented per-CPU vmaps. We previously don't free
up per-CPU maps until all its addresses have been used and freed. So
fragmented blocks could fill up vmalloc space even if they actually had
no active vmap regions within them.
Add some logic to allow all CPUs to have these blocks purged in the case
of failure to allocate a new vm area, and also put some logic to trim
such blocks of a current CPU if we hit them in the allocation path (so
as to avoid a large build up of them).
Christoph reported some vmap allocation failures when using the per CPU
vmap APIs in XFS, which cannot be reproduced after this patch and the
previous bug fix.
Cc: linux-mm@kvack.org
Cc: stable@kernel.org
Tested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
--
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
RCU list walking of the per-cpu vmap cache was broken. It did not use
RCU primitives, and also the union of free_list and rcu_head is
obviously wrong (because free_list is indeed the list we are RCU
walking).
While we are there, remove a couple of unused fields from an earlier
iteration.
These APIs aren't actually used anywhere, because of problems with the
XFS conversion. Christoph has now verified that the problems are solved
with these patches. Also it is an exported interface, so I think it
will be good to be merged now (and Christoph wants to get the XFS
changes into their local tree).
Cc: stable@kernel.org
Cc: linux-mm@kvack.org
Tested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Nick Piggin <npiggin@suse.de>
--
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In free_unmap_area_noflush(), va->flags is marked as VM_LAZY_FREE first, and
then vmap_lazy_nr is increased atomically.
But, in __purge_vmap_area_lazy(), while traversing of vmap_are_list, nr
is counted by checking VM_LAZY_FREE is set to va->flags. After counting
the variable nr, kernel reads vmap_lazy_nr atomically and checks a
BUG_ON condition whether nr is greater than vmap_lazy_nr to prevent
vmap_lazy_nr from being negative.
The problem is that, if interrupted right after marking VM_LAZY_FREE,
increment of vmap_lazy_nr can be delayed. Consequently, BUG_ON
condition can be met because nr is counted more than vmap_lazy_nr.
It is highly probable when vmalloc/vfree are called frequently. This
scenario have been verified by adding delay between marking VM_LAZY_FREE
and increasing vmap_lazy_nr in free_unmap_area_noflush().
Even the vmap_lazy_nr is for checking high watermark, it never be the
strict watermark. Although the BUG_ON condition is to prevent
vmap_lazy_nr from being negative, vmap_lazy_nr is signed variable. So,
it could go down to negative value temporarily.
Consequently, removing the BUG_ON condition is proper.
A possible BUG_ON message is like the below.
kernel BUG at mm/vmalloc.c:517!
invalid opcode: 0000 [#1] SMP
EIP: 0060:[<c04824a4>] EFLAGS: 00010297 CPU: 3
EIP is at __purge_vmap_area_lazy+0x144/0x150
EAX: ee8a8818 EBX: c08e77d4 ECX: e7c7ae40 EDX: c08e77ec
ESI: 000081fe EDI: e7c7ae60 EBP: e7c7ae64 ESP: e7c7ae3c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Call Trace:
[<c0482ad9>] free_unmap_vmap_area_noflush+0x69/0x70
[<c0482b02>] remove_vm_area+0x22/0x70
[<c0482c15>] __vunmap+0x45/0xe0
[<c04831ec>] vmalloc+0x2c/0x30
Code: 8d 59 e0 eb 04 66 90 89 cb 89 d0 e8 87 fe ff ff 8b 43 20 89 da 8d 48 e0 8d 43 20 3b 04 24 75 e7 fe 05 a8 a5 a3 c0 e9 78 ff ff ff <0f> 0b eb fe 90 8d b4 26 00 00 00 00 56 89 c6 b8 ac a5 a3 c0 31
EIP: [<c04824a4>] __purge_vmap_area_lazy+0x144/0x150 SS:ESP 0068:e7c7ae3c
[ See also http://marc.info/?l=linux-kernel&m=126335856228090&w=2 ]
Signed-off-by: Yongseok Koh <yongseok.koh@samsung.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>