2005-10-30 09:16:54 +08:00
|
|
|
/*
|
|
|
|
* linux/mm/memory_hotplug.c
|
|
|
|
*
|
|
|
|
* Copyright (C)
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/stddef.h>
|
|
|
|
#include <linux/mm.h>
|
|
|
|
#include <linux/swap.h>
|
|
|
|
#include <linux/interrupt.h>
|
|
|
|
#include <linux/pagemap.h>
|
|
|
|
#include <linux/compiler.h>
|
2011-10-16 14:01:52 +08:00
|
|
|
#include <linux/export.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
#include <linux/pagevec.h>
|
2006-09-29 17:01:25 +08:00
|
|
|
#include <linux/writeback.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/sysctl.h>
|
|
|
|
#include <linux/cpu.h>
|
|
|
|
#include <linux/memory.h>
|
2016-01-16 08:56:22 +08:00
|
|
|
#include <linux/memremap.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
#include <linux/memory_hotplug.h>
|
|
|
|
#include <linux/highmem.h>
|
|
|
|
#include <linux/vmalloc.h>
|
2006-06-27 17:53:35 +08:00
|
|
|
#include <linux/ioport.h>
|
2007-10-16 16:26:12 +08:00
|
|
|
#include <linux/delay.h>
|
|
|
|
#include <linux/migrate.h>
|
|
|
|
#include <linux/page-isolation.h>
|
2008-10-19 11:25:58 +08:00
|
|
|
#include <linux/pfn.h>
|
2009-11-18 06:06:22 +08:00
|
|
|
#include <linux/suspend.h>
|
2009-12-15 09:58:11 +08:00
|
|
|
#include <linux/mm_inline.h>
|
2010-03-06 05:41:58 +08:00
|
|
|
#include <linux/firmware-map.h>
|
2013-02-23 08:33:14 +08:00
|
|
|
#include <linux/stop_machine.h>
|
2013-09-12 05:22:09 +08:00
|
|
|
#include <linux/hugetlb.h>
|
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:08:10 +08:00
|
|
|
#include <linux/memblock.h>
|
2014-11-14 07:19:39 +08:00
|
|
|
#include <linux/bootmem.h>
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
#include <linux/compaction.h>
|
2005-10-30 09:16:54 +08:00
|
|
|
|
|
|
|
#include <asm/tlbflush.h>
|
|
|
|
|
2008-04-29 01:40:08 +08:00
|
|
|
#include "internal.h"
|
|
|
|
|
2011-07-26 08:12:05 +08:00
|
|
|
/*
|
|
|
|
* online_page_callback contains pointer to current page onlining function.
|
|
|
|
* Initially it is generic_online_page(). If it is required it could be
|
|
|
|
* changed by calling set_online_page_callback() for callback registration
|
|
|
|
* and restore_online_page_callback() for generic callback restore.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static void generic_online_page(struct page *page);
|
|
|
|
|
|
|
|
static online_page_callback_t online_page_callback = generic_online_page;
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
static DEFINE_MUTEX(online_page_callback_lock);
|
2011-07-26 08:12:05 +08:00
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
/* The same as the cpu_hotplug lock, but for memory hotplug. */
|
|
|
|
static struct {
|
|
|
|
struct task_struct *active_writer;
|
|
|
|
struct mutex lock; /* Synchronizes accesses to refcount, */
|
|
|
|
/*
|
|
|
|
* Also blocks the new readers during
|
|
|
|
* an ongoing mem hotplug operation.
|
|
|
|
*/
|
|
|
|
int refcount;
|
|
|
|
|
|
|
|
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
|
|
|
struct lockdep_map dep_map;
|
|
|
|
#endif
|
|
|
|
} mem_hotplug = {
|
|
|
|
.active_writer = NULL,
|
|
|
|
.lock = __MUTEX_INITIALIZER(mem_hotplug.lock),
|
|
|
|
.refcount = 0,
|
|
|
|
#ifdef CONFIG_DEBUG_LOCK_ALLOC
|
|
|
|
.dep_map = {.name = "mem_hotplug.lock" },
|
|
|
|
#endif
|
|
|
|
};
|
|
|
|
|
|
|
|
/* Lockdep annotations for get/put_online_mems() and mem_hotplug_begin/end() */
|
|
|
|
#define memhp_lock_acquire_read() lock_map_acquire_read(&mem_hotplug.dep_map)
|
|
|
|
#define memhp_lock_acquire() lock_map_acquire(&mem_hotplug.dep_map)
|
|
|
|
#define memhp_lock_release() lock_map_release(&mem_hotplug.dep_map)
|
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
bool memhp_auto_online;
|
|
|
|
EXPORT_SYMBOL_GPL(memhp_auto_online);
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
void get_online_mems(void)
|
|
|
|
{
|
|
|
|
might_sleep();
|
|
|
|
if (mem_hotplug.active_writer == current)
|
|
|
|
return;
|
|
|
|
memhp_lock_acquire_read();
|
|
|
|
mutex_lock(&mem_hotplug.lock);
|
|
|
|
mem_hotplug.refcount++;
|
|
|
|
mutex_unlock(&mem_hotplug.lock);
|
|
|
|
|
|
|
|
}
|
2010-12-03 06:31:19 +08:00
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
void put_online_mems(void)
|
2010-12-03 06:31:19 +08:00
|
|
|
{
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
if (mem_hotplug.active_writer == current)
|
|
|
|
return;
|
|
|
|
mutex_lock(&mem_hotplug.lock);
|
|
|
|
|
|
|
|
if (WARN_ON(!mem_hotplug.refcount))
|
|
|
|
mem_hotplug.refcount++; /* try to fix things up */
|
|
|
|
|
|
|
|
if (!--mem_hotplug.refcount && unlikely(mem_hotplug.active_writer))
|
|
|
|
wake_up_process(mem_hotplug.active_writer);
|
|
|
|
mutex_unlock(&mem_hotplug.lock);
|
|
|
|
memhp_lock_release();
|
|
|
|
|
2010-12-03 06:31:19 +08:00
|
|
|
}
|
|
|
|
|
2015-04-15 06:45:11 +08:00
|
|
|
void mem_hotplug_begin(void)
|
2010-12-03 06:31:19 +08:00
|
|
|
{
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug.active_writer = current;
|
|
|
|
|
|
|
|
memhp_lock_acquire();
|
|
|
|
for (;;) {
|
|
|
|
mutex_lock(&mem_hotplug.lock);
|
|
|
|
if (likely(!mem_hotplug.refcount))
|
|
|
|
break;
|
|
|
|
__set_current_state(TASK_UNINTERRUPTIBLE);
|
|
|
|
mutex_unlock(&mem_hotplug.lock);
|
|
|
|
schedule();
|
|
|
|
}
|
2010-12-03 06:31:19 +08:00
|
|
|
}
|
|
|
|
|
2015-04-15 06:45:11 +08:00
|
|
|
void mem_hotplug_done(void)
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
{
|
|
|
|
mem_hotplug.active_writer = NULL;
|
|
|
|
mutex_unlock(&mem_hotplug.lock);
|
|
|
|
memhp_lock_release();
|
|
|
|
}
|
2010-12-03 06:31:19 +08:00
|
|
|
|
2006-10-01 14:27:09 +08:00
|
|
|
/* add this memory to iomem resource */
|
|
|
|
static struct resource *register_memory_resource(u64 start, u64 size)
|
|
|
|
{
|
|
|
|
struct resource *res;
|
|
|
|
res = kzalloc(sizeof(struct resource), GFP_KERNEL);
|
2016-01-15 07:21:55 +08:00
|
|
|
if (!res)
|
|
|
|
return ERR_PTR(-ENOMEM);
|
2006-10-01 14:27:09 +08:00
|
|
|
|
|
|
|
res->name = "System RAM";
|
|
|
|
res->start = start;
|
|
|
|
res->end = start + size - 1;
|
2016-01-27 04:57:24 +08:00
|
|
|
res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
|
2006-10-01 14:27:09 +08:00
|
|
|
if (request_resource(&iomem_resource, res) < 0) {
|
2013-07-04 06:02:39 +08:00
|
|
|
pr_debug("System RAM resource %pR cannot be added\n", res);
|
2006-10-01 14:27:09 +08:00
|
|
|
kfree(res);
|
2016-01-15 07:21:55 +08:00
|
|
|
return ERR_PTR(-EEXIST);
|
2006-10-01 14:27:09 +08:00
|
|
|
}
|
|
|
|
return res;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void release_memory_resource(struct resource *res)
|
|
|
|
{
|
|
|
|
if (!res)
|
|
|
|
return;
|
|
|
|
release_resource(res);
|
|
|
|
kfree(res);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2006-10-01 14:27:08 +08:00
|
|
|
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
|
2013-02-23 08:33:00 +08:00
|
|
|
void get_page_bootmem(unsigned long info, struct page *page,
|
|
|
|
unsigned long type)
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
{
|
2011-01-14 07:47:00 +08:00
|
|
|
page->lru.next = (struct list_head *) type;
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
SetPagePrivate(page);
|
|
|
|
set_page_private(page, info);
|
|
|
|
atomic_inc(&page->_count);
|
|
|
|
}
|
|
|
|
|
2013-07-04 06:03:17 +08:00
|
|
|
void put_page_bootmem(struct page *page)
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
{
|
2011-01-14 07:47:00 +08:00
|
|
|
unsigned long type;
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
2011-01-14 07:47:00 +08:00
|
|
|
type = (unsigned long) page->lru.next;
|
|
|
|
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
|
|
|
|
type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
|
|
|
if (atomic_dec_return(&page->_count) == 1) {
|
|
|
|
ClearPagePrivate(page);
|
|
|
|
set_page_private(page, 0);
|
2011-01-14 07:47:00 +08:00
|
|
|
INIT_LIST_HEAD(&page->lru);
|
2013-07-04 06:03:17 +08:00
|
|
|
free_reserved_page(page);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-02-23 08:33:00 +08:00
|
|
|
#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
|
|
|
|
#ifndef CONFIG_SPARSEMEM_VMEMMAP
|
2008-07-24 12:28:12 +08:00
|
|
|
static void register_page_bootmem_info_section(unsigned long start_pfn)
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
{
|
|
|
|
unsigned long *usemap, mapsize, section_nr, i;
|
|
|
|
struct mem_section *ms;
|
|
|
|
struct page *page, *memmap;
|
|
|
|
|
|
|
|
section_nr = pfn_to_section_nr(start_pfn);
|
|
|
|
ms = __nr_to_section(section_nr);
|
|
|
|
|
|
|
|
/* Get section's memmap address */
|
|
|
|
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get page for the memmap's phys address
|
|
|
|
* XXX: need more consideration for sparse_vmemmap...
|
|
|
|
*/
|
|
|
|
page = virt_to_page(memmap);
|
|
|
|
mapsize = sizeof(struct page) * PAGES_PER_SECTION;
|
|
|
|
mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
/* remember memmap's page */
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
|
|
|
get_page_bootmem(section_nr, page, SECTION_INFO);
|
|
|
|
|
|
|
|
usemap = __nr_to_section(section_nr)->pageblock_flags;
|
|
|
|
page = virt_to_page(usemap);
|
|
|
|
|
|
|
|
mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
2008-07-24 12:28:17 +08:00
|
|
|
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
|
|
|
}
|
2013-02-23 08:33:00 +08:00
|
|
|
#else /* CONFIG_SPARSEMEM_VMEMMAP */
|
|
|
|
static void register_page_bootmem_info_section(unsigned long start_pfn)
|
|
|
|
{
|
|
|
|
unsigned long *usemap, mapsize, section_nr, i;
|
|
|
|
struct mem_section *ms;
|
|
|
|
struct page *page, *memmap;
|
|
|
|
|
|
|
|
if (!pfn_valid(start_pfn))
|
|
|
|
return;
|
|
|
|
|
|
|
|
section_nr = pfn_to_section_nr(start_pfn);
|
|
|
|
ms = __nr_to_section(section_nr);
|
|
|
|
|
|
|
|
memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
|
|
|
|
|
|
|
|
register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
|
|
|
|
|
|
|
|
usemap = __nr_to_section(section_nr)->pageblock_flags;
|
|
|
|
page = virt_to_page(usemap);
|
|
|
|
|
|
|
|
mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
for (i = 0; i < mapsize; i++, page++)
|
|
|
|
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
|
|
|
|
}
|
|
|
|
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
|
|
|
void register_page_bootmem_info_node(struct pglist_data *pgdat)
|
|
|
|
{
|
|
|
|
unsigned long i, pfn, end_pfn, nr_pages;
|
|
|
|
int node = pgdat->node_id;
|
|
|
|
struct page *page;
|
|
|
|
struct zone *zone;
|
|
|
|
|
|
|
|
nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
|
|
|
|
page = virt_to_page(pgdat);
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++, page++)
|
|
|
|
get_page_bootmem(node, page, NODE_INFO);
|
|
|
|
|
|
|
|
zone = &pgdat->node_zones[0];
|
|
|
|
for (; zone < pgdat->node_zones + MAX_NR_ZONES - 1; zone++) {
|
2013-09-12 05:21:46 +08:00
|
|
|
if (zone_is_initialized(zone)) {
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
nr_pages = zone->wait_table_hash_nr_entries
|
|
|
|
* sizeof(wait_queue_head_t);
|
|
|
|
nr_pages = PAGE_ALIGN(nr_pages) >> PAGE_SHIFT;
|
|
|
|
page = virt_to_page(zone->wait_table);
|
|
|
|
|
|
|
|
for (i = 0; i < nr_pages; i++, page++)
|
|
|
|
get_page_bootmem(node, page, NODE_INFO);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pfn = pgdat->node_start_pfn;
|
2013-02-23 08:35:32 +08:00
|
|
|
end_pfn = pgdat_end_pfn(pgdat);
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
2013-07-09 07:00:23 +08:00
|
|
|
/* register section info */
|
memory hotplug: fix section info double registration bug
There may be a bug when registering section info. For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so
other nodes' section info will be double registered, and memmap's page
count will equal to 3.
node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000
free_all_bootmem_node()
register_page_bootmem_info_node()
register_page_bootmem_info_section()
When hot remove memory, we can't free the memmap's page because
page_count() is 2 after put_page_bootmem().
sparse_remove_one_section()
free_section_usemap()
free_map_bootmem()
put_page_bootmem()
[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-18 05:09:24 +08:00
|
|
|
for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
/*
|
|
|
|
* Some platforms can assign the same pfn to multiple nodes - on
|
|
|
|
* node0 as well as nodeN. To avoid registering a pfn against
|
|
|
|
* multiple nodes we check that this pfn does not already
|
2013-07-09 07:00:23 +08:00
|
|
|
* reside in some other nodes.
|
memory hotplug: fix section info double registration bug
There may be a bug when registering section info. For example, on my
Itanium platform, the pfn range of node0 includes the other nodes, so
other nodes' section info will be double registered, and memmap's page
count will equal to 3.
node0: start_pfn=0x100, spanned_pfn=0x20fb00, present_pfn=0x7f8a3, => 0x000100-0x20fc00
node1: start_pfn=0x80000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x080000-0x100000
node2: start_pfn=0x100000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x100000-0x180000
node3: start_pfn=0x180000, spanned_pfn=0x80000, present_pfn=0x80000, => 0x180000-0x200000
free_all_bootmem_node()
register_page_bootmem_info_node()
register_page_bootmem_info_section()
When hot remove memory, we can't free the memmap's page because
page_count() is 2 after put_page_bootmem().
sparse_remove_one_section()
free_section_usemap()
free_map_bootmem()
put_page_bootmem()
[akpm@linux-foundation.org: add code comment]
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-09-18 05:09:24 +08:00
|
|
|
*/
|
|
|
|
if (pfn_valid(pfn) && (pfn_to_nid(pfn) == node))
|
|
|
|
register_page_bootmem_info_section(pfn);
|
|
|
|
}
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
}
|
2013-02-23 08:33:00 +08:00
|
|
|
#endif /* CONFIG_HAVE_BOOTMEM_INFO_NODE */
|
memory hotplug: register section/node id to free
This patch set is to free pages which is allocated by bootmem for
memory-hotremove. Some structures of memory management are allocated by
bootmem. ex) memmap, etc.
To remove memory physically, some of them must be freed according to
circumstance. This patch set makes basis to free those pages, and free
memmaps.
Basic my idea is using remain members of struct page to remember information
of users of bootmem (section number or node id). When the section is
removing, kernel can confirm it. By this information, some issues can be
solved.
1) When the memmap of removing section is allocated on other
section by bootmem, it should/can be free.
2) When the memmap of removing section is allocated on the
same section, it shouldn't be freed. Because the section has to be
logical memory offlined already and all pages must be isolated against
page allocater. If it is freed, page allocator may use it which will
be removed physically soon.
3) When removing section has other section's memmap,
kernel will be able to show easily which section should be removed
before it for user. (Not implemented yet)
4) When the above case 2), the page isolation will be able to check and skip
memmap's page when logical memory offline (offline_pages()).
Current page isolation code fails in this case because this page is
just reserved page and it can't distinguish this pages can be
removed or not. But, it will be able to do by this patch.
(Not implemented yet.)
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Fortunately, current bootmem allocator just keeps PageReserved flags,
and doesn't use any other members of page struct. The users of
bootmem doesn't use them too.
This patch:
This is to register information which is node or section's id. Kernel can
distinguish which node/section uses the pages allcated by bootmem. This is
basis for hot-remove sections or nodes.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28 17:13:31 +08:00
|
|
|
|
2014-08-07 07:04:57 +08:00
|
|
|
static void __meminit grow_zone_span(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
2008-05-15 07:05:52 +08:00
|
|
|
{
|
|
|
|
unsigned long old_zone_end_pfn;
|
|
|
|
|
|
|
|
zone_span_writelock(zone);
|
|
|
|
|
2013-09-12 05:21:44 +08:00
|
|
|
old_zone_end_pfn = zone_end_pfn(zone);
|
2013-09-12 05:21:45 +08:00
|
|
|
if (zone_is_empty(zone) || start_pfn < zone->zone_start_pfn)
|
2008-05-15 07:05:52 +08:00
|
|
|
zone->zone_start_pfn = start_pfn;
|
|
|
|
|
|
|
|
zone->spanned_pages = max(old_zone_end_pfn, end_pfn) -
|
|
|
|
zone->zone_start_pfn;
|
|
|
|
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
}
|
|
|
|
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
static void resize_zone(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
zone_span_writelock(zone);
|
|
|
|
|
2012-12-12 08:03:20 +08:00
|
|
|
if (end_pfn - start_pfn) {
|
|
|
|
zone->zone_start_pfn = start_pfn;
|
|
|
|
zone->spanned_pages = end_pfn - start_pfn;
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* make it consist as free_area_init_core(),
|
|
|
|
* if spanned_pages = 0, then keep start_pfn = 0
|
|
|
|
*/
|
|
|
|
zone->zone_start_pfn = 0;
|
|
|
|
zone->spanned_pages = 0;
|
|
|
|
}
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void fix_zone_id(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
enum zone_type zid = zone_idx(zone);
|
|
|
|
int nid = zone->zone_pgdat->node_id;
|
|
|
|
unsigned long pfn;
|
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn; pfn++)
|
|
|
|
set_page_links(pfn_to_page(pfn), zid, nid, pfn);
|
|
|
|
}
|
|
|
|
|
2013-02-23 08:35:30 +08:00
|
|
|
/* Can fail with -ENOMEM from allocating a wait table with vmalloc() or
|
2014-01-22 07:50:43 +08:00
|
|
|
* alloc_bootmem_node_nopanic()/memblock_virt_alloc_node_nopanic() */
|
2013-02-23 08:35:30 +08:00
|
|
|
static int __ref ensure_zone_is_initialized(struct zone *zone,
|
|
|
|
unsigned long start_pfn, unsigned long num_pages)
|
|
|
|
{
|
|
|
|
if (!zone_is_initialized(zone))
|
2015-11-06 10:47:06 +08:00
|
|
|
return init_currently_empty_zone(zone, start_pfn, num_pages);
|
|
|
|
|
2013-02-23 08:35:30 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-12-12 08:03:20 +08:00
|
|
|
static int __meminit move_pfn_range_left(struct zone *z1, struct zone *z2,
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
2012-12-12 08:03:20 +08:00
|
|
|
int ret;
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
unsigned long flags;
|
2012-12-12 08:03:20 +08:00
|
|
|
unsigned long z1_start_pfn;
|
|
|
|
|
2013-02-23 08:35:31 +08:00
|
|
|
ret = ensure_zone_is_initialized(z1, start_pfn, end_pfn - start_pfn);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(z1->zone_pgdat, &flags);
|
|
|
|
|
|
|
|
/* can't move pfns which are higher than @z2 */
|
2013-02-23 08:35:23 +08:00
|
|
|
if (end_pfn > zone_end_pfn(z2))
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
goto out_fail;
|
2013-07-04 06:03:04 +08:00
|
|
|
/* the move out part must be at the left most of @z2 */
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
if (start_pfn > z2->zone_start_pfn)
|
|
|
|
goto out_fail;
|
|
|
|
/* must included/overlap */
|
|
|
|
if (end_pfn <= z2->zone_start_pfn)
|
|
|
|
goto out_fail;
|
|
|
|
|
2012-12-12 08:03:20 +08:00
|
|
|
/* use start_pfn for z1's start_pfn if z1 is empty */
|
2013-09-12 05:21:45 +08:00
|
|
|
if (!zone_is_empty(z1))
|
2012-12-12 08:03:20 +08:00
|
|
|
z1_start_pfn = z1->zone_start_pfn;
|
|
|
|
else
|
|
|
|
z1_start_pfn = start_pfn;
|
|
|
|
|
|
|
|
resize_zone(z1, z1_start_pfn, end_pfn);
|
2013-02-23 08:35:23 +08:00
|
|
|
resize_zone(z2, end_pfn, zone_end_pfn(z2));
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
|
|
|
|
pgdat_resize_unlock(z1->zone_pgdat, &flags);
|
|
|
|
|
|
|
|
fix_zone_id(z1, start_pfn, end_pfn);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
out_fail:
|
|
|
|
pgdat_resize_unlock(z1->zone_pgdat, &flags);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2012-12-12 08:03:20 +08:00
|
|
|
static int __meminit move_pfn_range_right(struct zone *z1, struct zone *z2,
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
2012-12-12 08:03:20 +08:00
|
|
|
int ret;
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
unsigned long flags;
|
2012-12-12 08:03:20 +08:00
|
|
|
unsigned long z2_end_pfn;
|
|
|
|
|
2013-02-23 08:35:31 +08:00
|
|
|
ret = ensure_zone_is_initialized(z2, start_pfn, end_pfn - start_pfn);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(z1->zone_pgdat, &flags);
|
|
|
|
|
|
|
|
/* can't move pfns which are lower than @z1 */
|
|
|
|
if (z1->zone_start_pfn > start_pfn)
|
|
|
|
goto out_fail;
|
|
|
|
/* the move out part mast at the right most of @z1 */
|
2013-02-23 08:35:23 +08:00
|
|
|
if (zone_end_pfn(z1) > end_pfn)
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
goto out_fail;
|
|
|
|
/* must included/overlap */
|
2013-02-23 08:35:23 +08:00
|
|
|
if (start_pfn >= zone_end_pfn(z1))
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
goto out_fail;
|
|
|
|
|
2012-12-12 08:03:20 +08:00
|
|
|
/* use end_pfn for z2's end_pfn if z2 is empty */
|
2013-09-12 05:21:45 +08:00
|
|
|
if (!zone_is_empty(z2))
|
2013-02-23 08:35:23 +08:00
|
|
|
z2_end_pfn = zone_end_pfn(z2);
|
2012-12-12 08:03:20 +08:00
|
|
|
else
|
|
|
|
z2_end_pfn = end_pfn;
|
|
|
|
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
resize_zone(z1, z1->zone_start_pfn, start_pfn);
|
2012-12-12 08:03:20 +08:00
|
|
|
resize_zone(z2, start_pfn, z2_end_pfn);
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
|
|
|
|
pgdat_resize_unlock(z1->zone_pgdat, &flags);
|
|
|
|
|
|
|
|
fix_zone_id(z2, start_pfn, end_pfn);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
out_fail:
|
|
|
|
pgdat_resize_unlock(z1->zone_pgdat, &flags);
|
|
|
|
return -1;
|
|
|
|
}
|
|
|
|
|
2014-08-07 07:04:57 +08:00
|
|
|
static void __meminit grow_pgdat_span(struct pglist_data *pgdat, unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
2008-05-15 07:05:52 +08:00
|
|
|
{
|
2013-11-13 07:07:19 +08:00
|
|
|
unsigned long old_pgdat_end_pfn = pgdat_end_pfn(pgdat);
|
2008-05-15 07:05:52 +08:00
|
|
|
|
2012-12-12 08:01:07 +08:00
|
|
|
if (!pgdat->node_spanned_pages || start_pfn < pgdat->node_start_pfn)
|
2008-05-15 07:05:52 +08:00
|
|
|
pgdat->node_start_pfn = start_pfn;
|
|
|
|
|
|
|
|
pgdat->node_spanned_pages = max(old_pgdat_end_pfn, end_pfn) -
|
|
|
|
pgdat->node_start_pfn;
|
|
|
|
}
|
|
|
|
|
2008-11-23 01:33:24 +08:00
|
|
|
static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn)
|
2005-10-30 09:16:54 +08:00
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
int nr_pages = PAGES_PER_SECTION;
|
|
|
|
int nid = pgdat->node_id;
|
|
|
|
int zone_type;
|
2015-08-07 06:46:51 +08:00
|
|
|
unsigned long flags, pfn;
|
2013-02-23 08:35:31 +08:00
|
|
|
int ret;
|
2005-10-30 09:16:54 +08:00
|
|
|
|
|
|
|
zone_type = zone - pgdat->node_zones;
|
2013-02-23 08:35:31 +08:00
|
|
|
ret = ensure_zone_is_initialized(zone, phys_start_pfn, nr_pages);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
2008-05-15 07:05:52 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
|
|
|
grow_zone_span(zone, phys_start_pfn, phys_start_pfn + nr_pages);
|
|
|
|
grow_pgdat_span(zone->zone_pgdat, phys_start_pfn,
|
|
|
|
phys_start_pfn + nr_pages);
|
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
2007-01-11 15:15:30 +08:00
|
|
|
memmap_init_zone(nr_pages, nid, zone_type,
|
|
|
|
phys_start_pfn, MEMMAP_HOTPLUG);
|
2015-08-07 06:46:51 +08:00
|
|
|
|
|
|
|
/* online_page_range is called later and expects pages reserved */
|
|
|
|
for (pfn = phys_start_pfn; pfn < phys_start_pfn + nr_pages; pfn++) {
|
|
|
|
if (!pfn_valid(pfn))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
SetPageReserved(pfn_to_page(pfn));
|
|
|
|
}
|
2006-06-23 17:03:10 +08:00
|
|
|
return 0;
|
2005-10-30 09:16:54 +08:00
|
|
|
}
|
|
|
|
|
2009-01-07 06:39:14 +08:00
|
|
|
static int __meminit __add_section(int nid, struct zone *zone,
|
|
|
|
unsigned long phys_start_pfn)
|
2005-10-30 09:16:54 +08:00
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
|
2006-08-06 03:15:06 +08:00
|
|
|
if (pfn_valid(phys_start_pfn))
|
|
|
|
return -EEXIST;
|
|
|
|
|
2013-11-13 07:07:42 +08:00
|
|
|
ret = sparse_add_one_section(zone, phys_start_pfn);
|
2005-10-30 09:16:54 +08:00
|
|
|
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2006-06-23 17:03:10 +08:00
|
|
|
ret = __add_zone(zone, phys_start_pfn);
|
|
|
|
|
|
|
|
if (ret < 0)
|
|
|
|
return ret;
|
|
|
|
|
2009-01-07 06:39:14 +08:00
|
|
|
return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
|
2005-10-30 09:16:54 +08:00
|
|
|
}
|
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
/*
|
|
|
|
* Reasonably generic function for adding memory. It is
|
|
|
|
* expected that archs that support memory hotplug will
|
|
|
|
* call this function after deciding the zone to which to
|
|
|
|
* add the new pages.
|
|
|
|
*/
|
|
|
|
int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn,
|
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
unsigned long i;
|
|
|
|
int err = 0;
|
|
|
|
int start_sec, end_sec;
|
2016-01-16 08:56:22 +08:00
|
|
|
struct vmem_altmap *altmap;
|
|
|
|
|
2016-03-16 05:57:51 +08:00
|
|
|
clear_zone_contiguous(zone);
|
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
/* during initialize mem_map, align hot-added range to section */
|
|
|
|
start_sec = pfn_to_section_nr(phys_start_pfn);
|
|
|
|
end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1);
|
|
|
|
|
2016-01-16 08:56:22 +08:00
|
|
|
altmap = to_vmem_altmap((unsigned long) pfn_to_page(phys_start_pfn));
|
|
|
|
if (altmap) {
|
|
|
|
/*
|
|
|
|
* Validate altmap is within bounds of the total request
|
|
|
|
*/
|
|
|
|
if (altmap->base_pfn != phys_start_pfn
|
|
|
|
|| vmem_altmap_offset(altmap) > nr_pages) {
|
|
|
|
pr_warn_once("memory add fail, invalid altmap\n");
|
2016-03-16 05:57:51 +08:00
|
|
|
err = -EINVAL;
|
|
|
|
goto out;
|
2016-01-16 08:56:22 +08:00
|
|
|
}
|
|
|
|
altmap->alloc = 0;
|
|
|
|
}
|
|
|
|
|
2013-04-30 06:08:22 +08:00
|
|
|
for (i = start_sec; i <= end_sec; i++) {
|
2015-04-15 06:44:54 +08:00
|
|
|
err = __add_section(nid, zone, section_nr_to_pfn(i));
|
2013-04-30 06:08:22 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* EEXIST is finally dealt with by ioresource collision
|
|
|
|
* check. see add_memory() => register_memory_resource()
|
|
|
|
* Warning will be printed if there is collision.
|
|
|
|
*/
|
|
|
|
if (err && (err != -EEXIST))
|
|
|
|
break;
|
|
|
|
err = 0;
|
|
|
|
}
|
2015-06-25 07:58:42 +08:00
|
|
|
vmemmap_populate_print_last();
|
2016-03-16 05:57:51 +08:00
|
|
|
out:
|
|
|
|
set_zone_contiguous(zone);
|
2013-04-30 06:08:22 +08:00
|
|
|
return err;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__add_pages);
|
|
|
|
|
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2013-02-23 08:33:12 +08:00
|
|
|
/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
|
|
|
|
static int find_smallest_section_pfn(int nid, struct zone *zone,
|
|
|
|
unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
struct mem_section *ms;
|
|
|
|
|
|
|
|
for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(start_pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (unlikely(pfn_to_nid(start_pfn) != nid))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (zone && zone != page_zone(pfn_to_page(start_pfn)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
return start_pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
|
|
|
|
static int find_biggest_section_pfn(int nid, struct zone *zone,
|
|
|
|
unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
struct mem_section *ms;
|
|
|
|
unsigned long pfn;
|
|
|
|
|
|
|
|
/* pfn is the end pfn of a memory section. */
|
|
|
|
pfn = end_pfn - 1;
|
|
|
|
for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (unlikely(pfn_to_nid(pfn) != nid))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (zone && zone != page_zone(pfn_to_page(pfn)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
return pfn;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
|
|
|
|
unsigned long end_pfn)
|
|
|
|
{
|
2013-09-12 05:21:44 +08:00
|
|
|
unsigned long zone_start_pfn = zone->zone_start_pfn;
|
|
|
|
unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */
|
|
|
|
unsigned long zone_end_pfn = z;
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long pfn;
|
|
|
|
struct mem_section *ms;
|
|
|
|
int nid = zone_to_nid(zone);
|
|
|
|
|
|
|
|
zone_span_writelock(zone);
|
|
|
|
if (zone_start_pfn == start_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is smallest section in the zone, it need
|
|
|
|
* shrink zone->zone_start_pfn and zone->zone_spanned_pages.
|
|
|
|
* In this case, we find second smallest valid mem_section
|
|
|
|
* for shrinking zone.
|
|
|
|
*/
|
|
|
|
pfn = find_smallest_section_pfn(nid, zone, end_pfn,
|
|
|
|
zone_end_pfn);
|
|
|
|
if (pfn) {
|
|
|
|
zone->zone_start_pfn = pfn;
|
|
|
|
zone->spanned_pages = zone_end_pfn - pfn;
|
|
|
|
}
|
|
|
|
} else if (zone_end_pfn == end_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is biggest section in the zone, it need
|
|
|
|
* shrink zone->spanned_pages.
|
|
|
|
* In this case, we find second biggest valid mem_section for
|
|
|
|
* shrinking zone.
|
|
|
|
*/
|
|
|
|
pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
|
|
|
|
start_pfn);
|
|
|
|
if (pfn)
|
|
|
|
zone->spanned_pages = pfn - zone_start_pfn + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The section is not biggest or smallest mem_section in the zone, it
|
|
|
|
* only creates a hole in the zone. So in this case, we need not
|
|
|
|
* change the zone. But perhaps, the zone has only hole data. Thus
|
|
|
|
* it check the zone has only hole or not.
|
|
|
|
*/
|
|
|
|
pfn = zone_start_pfn;
|
|
|
|
for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (page_zone(pfn_to_page(pfn)) != zone)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If the section is current section, it continues the loop */
|
|
|
|
if (start_pfn == pfn)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If we find valid section, we have nothing to do */
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The zone has no valid section */
|
|
|
|
zone->zone_start_pfn = 0;
|
|
|
|
zone->spanned_pages = 0;
|
|
|
|
zone_span_writeunlock(zone);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void shrink_pgdat_span(struct pglist_data *pgdat,
|
|
|
|
unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
2013-11-13 07:07:19 +08:00
|
|
|
unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
|
|
|
|
unsigned long p = pgdat_end_pfn(pgdat); /* pgdat_end_pfn namespace clash */
|
|
|
|
unsigned long pgdat_end_pfn = p;
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long pfn;
|
|
|
|
struct mem_section *ms;
|
|
|
|
int nid = pgdat->node_id;
|
|
|
|
|
|
|
|
if (pgdat_start_pfn == start_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is smallest section in the pgdat, it need
|
|
|
|
* shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
|
|
|
|
* In this case, we find second smallest valid mem_section
|
|
|
|
* for shrinking zone.
|
|
|
|
*/
|
|
|
|
pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
|
|
|
|
pgdat_end_pfn);
|
|
|
|
if (pfn) {
|
|
|
|
pgdat->node_start_pfn = pfn;
|
|
|
|
pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
|
|
|
|
}
|
|
|
|
} else if (pgdat_end_pfn == end_pfn) {
|
|
|
|
/*
|
|
|
|
* If the section is biggest section in the pgdat, it need
|
|
|
|
* shrink pgdat->node_spanned_pages.
|
|
|
|
* In this case, we find second biggest valid mem_section for
|
|
|
|
* shrinking zone.
|
|
|
|
*/
|
|
|
|
pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
|
|
|
|
start_pfn);
|
|
|
|
if (pfn)
|
|
|
|
pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the section is not biggest or smallest mem_section in the pgdat,
|
|
|
|
* it only creates a hole in the pgdat. So in this case, we need not
|
|
|
|
* change the pgdat.
|
|
|
|
* But perhaps, the pgdat has only hole data. Thus it check the pgdat
|
|
|
|
* has only hole or not.
|
|
|
|
*/
|
|
|
|
pfn = pgdat_start_pfn;
|
|
|
|
for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
ms = __pfn_to_section(pfn);
|
|
|
|
|
|
|
|
if (unlikely(!valid_section(ms)))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (pfn_to_nid(pfn) != nid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If the section is current section, it continues the loop */
|
|
|
|
if (start_pfn == pfn)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/* If we find valid section, we have nothing to do */
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* The pgdat has no valid section */
|
|
|
|
pgdat->node_start_pfn = 0;
|
|
|
|
pgdat->node_spanned_pages = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void __remove_zone(struct zone *zone, unsigned long start_pfn)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
int nr_pages = PAGES_PER_SECTION;
|
|
|
|
int zone_type;
|
|
|
|
unsigned long flags;
|
|
|
|
|
|
|
|
zone_type = zone - pgdat->node_zones;
|
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
|
|
|
shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
|
|
|
|
shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
|
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
|
|
|
}
|
|
|
|
|
2016-01-16 08:56:22 +08:00
|
|
|
static int __remove_section(struct zone *zone, struct mem_section *ms,
|
|
|
|
unsigned long map_offset)
|
2008-04-28 17:12:01 +08:00
|
|
|
{
|
2013-02-23 08:33:12 +08:00
|
|
|
unsigned long start_pfn;
|
|
|
|
int scn_nr;
|
2008-04-28 17:12:01 +08:00
|
|
|
int ret = -EINVAL;
|
|
|
|
|
|
|
|
if (!valid_section(ms))
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
ret = unregister_memory_section(ms);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2013-02-23 08:33:12 +08:00
|
|
|
scn_nr = __section_nr(ms);
|
|
|
|
start_pfn = section_nr_to_pfn(scn_nr);
|
|
|
|
__remove_zone(zone, start_pfn);
|
|
|
|
|
2016-01-16 08:56:22 +08:00
|
|
|
sparse_remove_one_section(zone, ms, map_offset);
|
2008-04-28 17:12:01 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/**
|
|
|
|
* __remove_pages() - remove sections of pages from a zone
|
|
|
|
* @zone: zone from which pages need to be removed
|
|
|
|
* @phys_start_pfn: starting pageframe (must be aligned to start of a section)
|
|
|
|
* @nr_pages: number of pages to remove (must be multiple of section size)
|
|
|
|
*
|
|
|
|
* Generic helper function to remove section mappings and sysfs entries
|
|
|
|
* for the section of the memory we are removing. Caller needs to make
|
|
|
|
* sure that pages are marked reserved and zones are adjust properly by
|
|
|
|
* calling offline_pages().
|
|
|
|
*/
|
|
|
|
int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
|
|
|
|
unsigned long nr_pages)
|
|
|
|
{
|
2013-04-30 06:08:20 +08:00
|
|
|
unsigned long i;
|
2016-01-16 08:56:22 +08:00
|
|
|
unsigned long map_offset = 0;
|
|
|
|
int sections_to_remove, ret = 0;
|
|
|
|
|
|
|
|
/* In the ZONE_DEVICE case device driver owns the memory region */
|
|
|
|
if (is_dev_zone(zone)) {
|
|
|
|
struct page *page = pfn_to_page(phys_start_pfn);
|
|
|
|
struct vmem_altmap *altmap;
|
|
|
|
|
|
|
|
altmap = to_vmem_altmap((unsigned long) page);
|
|
|
|
if (altmap)
|
|
|
|
map_offset = vmem_altmap_offset(altmap);
|
|
|
|
} else {
|
|
|
|
resource_size_t start, size;
|
|
|
|
|
|
|
|
start = phys_start_pfn << PAGE_SHIFT;
|
|
|
|
size = nr_pages * PAGE_SIZE;
|
|
|
|
|
|
|
|
ret = release_mem_region_adjustable(&iomem_resource, start,
|
|
|
|
size);
|
|
|
|
if (ret) {
|
|
|
|
resource_size_t endres = start + size - 1;
|
|
|
|
|
|
|
|
pr_warn("Unable to release resource <%pa-%pa> (%d)\n",
|
|
|
|
&start, &endres, ret);
|
|
|
|
}
|
|
|
|
}
|
2008-04-28 17:12:01 +08:00
|
|
|
|
2016-03-16 05:57:51 +08:00
|
|
|
clear_zone_contiguous(zone);
|
|
|
|
|
2008-04-28 17:12:01 +08:00
|
|
|
/*
|
|
|
|
* We can only remove entire sections
|
|
|
|
*/
|
|
|
|
BUG_ON(phys_start_pfn & ~PAGE_SECTION_MASK);
|
|
|
|
BUG_ON(nr_pages % PAGES_PER_SECTION);
|
|
|
|
|
|
|
|
sections_to_remove = nr_pages / PAGES_PER_SECTION;
|
|
|
|
for (i = 0; i < sections_to_remove; i++) {
|
|
|
|
unsigned long pfn = phys_start_pfn + i*PAGES_PER_SECTION;
|
2016-01-16 08:56:22 +08:00
|
|
|
|
|
|
|
ret = __remove_section(zone, __pfn_to_section(pfn), map_offset);
|
|
|
|
map_offset = 0;
|
2008-04-28 17:12:01 +08:00
|
|
|
if (ret)
|
|
|
|
break;
|
|
|
|
}
|
2016-03-16 05:57:51 +08:00
|
|
|
|
|
|
|
set_zone_contiguous(zone);
|
|
|
|
|
2008-04-28 17:12:01 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__remove_pages);
|
2013-04-30 06:08:22 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
|
2008-04-28 17:12:01 +08:00
|
|
|
|
2011-07-26 08:12:05 +08:00
|
|
|
int set_online_page_callback(online_page_callback_t callback)
|
|
|
|
{
|
|
|
|
int rc = -EINVAL;
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
get_online_mems();
|
|
|
|
mutex_lock(&online_page_callback_lock);
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
if (online_page_callback == generic_online_page) {
|
|
|
|
online_page_callback = callback;
|
|
|
|
rc = 0;
|
|
|
|
}
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mutex_unlock(&online_page_callback_lock);
|
|
|
|
put_online_mems();
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(set_online_page_callback);
|
|
|
|
|
|
|
|
int restore_online_page_callback(online_page_callback_t callback)
|
|
|
|
{
|
|
|
|
int rc = -EINVAL;
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
get_online_mems();
|
|
|
|
mutex_lock(&online_page_callback_lock);
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
if (online_page_callback == callback) {
|
|
|
|
online_page_callback = generic_online_page;
|
|
|
|
rc = 0;
|
|
|
|
}
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mutex_unlock(&online_page_callback_lock);
|
|
|
|
put_online_mems();
|
2011-07-26 08:12:05 +08:00
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(restore_online_page_callback);
|
|
|
|
|
|
|
|
void __online_page_set_limits(struct page *page)
|
2008-04-28 17:12:03 +08:00
|
|
|
{
|
2011-07-26 08:12:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__online_page_set_limits);
|
|
|
|
|
|
|
|
void __online_page_increment_counters(struct page *page)
|
|
|
|
{
|
2013-07-04 06:03:21 +08:00
|
|
|
adjust_managed_page_count(page, 1);
|
2011-07-26 08:12:05 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(__online_page_increment_counters);
|
2008-04-28 17:12:03 +08:00
|
|
|
|
2011-07-26 08:12:05 +08:00
|
|
|
void __online_page_free(struct page *page)
|
|
|
|
{
|
2013-07-04 06:03:21 +08:00
|
|
|
__free_reserved_page(page);
|
2008-04-28 17:12:03 +08:00
|
|
|
}
|
2011-07-26 08:12:05 +08:00
|
|
|
EXPORT_SYMBOL_GPL(__online_page_free);
|
|
|
|
|
|
|
|
static void generic_online_page(struct page *page)
|
|
|
|
{
|
|
|
|
__online_page_set_limits(page);
|
|
|
|
__online_page_increment_counters(page);
|
|
|
|
__online_page_free(page);
|
|
|
|
}
|
2008-04-28 17:12:03 +08:00
|
|
|
|
2007-10-16 16:26:10 +08:00
|
|
|
static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
|
|
|
|
void *arg)
|
2005-10-30 09:16:54 +08:00
|
|
|
{
|
|
|
|
unsigned long i;
|
2007-10-16 16:26:10 +08:00
|
|
|
unsigned long onlined_pages = *(unsigned long *)arg;
|
|
|
|
struct page *page;
|
|
|
|
if (PageReserved(pfn_to_page(start_pfn)))
|
|
|
|
for (i = 0; i < nr_pages; i++) {
|
|
|
|
page = pfn_to_page(start_pfn + i);
|
2011-07-26 08:12:05 +08:00
|
|
|
(*online_page_callback)(page);
|
2007-10-16 16:26:10 +08:00
|
|
|
onlined_pages++;
|
|
|
|
}
|
|
|
|
*(unsigned long *)arg = onlined_pages;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2012-12-13 05:52:04 +08:00
|
|
|
#ifdef CONFIG_MOVABLE_NODE
|
2012-12-19 06:23:24 +08:00
|
|
|
/*
|
|
|
|
* When CONFIG_MOVABLE_NODE, we permit onlining of a node which doesn't have
|
|
|
|
* normal memory.
|
|
|
|
*/
|
2012-12-13 05:52:04 +08:00
|
|
|
static bool can_online_high_movable(struct zone *zone)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
2012-12-19 06:23:24 +08:00
|
|
|
#else /* CONFIG_MOVABLE_NODE */
|
2012-12-12 08:03:23 +08:00
|
|
|
/* ensure every online node has NORMAL memory */
|
|
|
|
static bool can_online_high_movable(struct zone *zone)
|
|
|
|
{
|
|
|
|
return node_state(zone_to_nid(zone), N_NORMAL_MEMORY);
|
|
|
|
}
|
2012-12-19 06:23:24 +08:00
|
|
|
#endif /* CONFIG_MOVABLE_NODE */
|
2012-12-12 08:03:23 +08:00
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/* check which state of node_states will be changed when online memory */
|
|
|
|
static void node_states_check_changes_online(unsigned long nr_pages,
|
|
|
|
struct zone *zone, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
int nid = zone_to_nid(zone);
|
|
|
|
enum zone_type zone_last = ZONE_NORMAL;
|
|
|
|
|
|
|
|
/*
|
2012-12-13 05:51:49 +08:00
|
|
|
* If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_NORMAL,
|
|
|
|
* set zone_last to ZONE_NORMAL.
|
2012-12-12 08:01:03 +08:00
|
|
|
*
|
2012-12-13 05:51:49 +08:00
|
|
|
* If we don't have HIGHMEM nor movable node,
|
|
|
|
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
|
|
|
|
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
|
2012-12-12 08:01:03 +08:00
|
|
|
*/
|
2012-12-13 05:51:49 +08:00
|
|
|
if (N_MEMORY == N_NORMAL_MEMORY)
|
2012-12-12 08:01:03 +08:00
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* if the memory to be online is in a zone of 0...zone_last, and
|
|
|
|
* the zones of 0...zone_last don't have memory before online, we will
|
|
|
|
* need to set the node to node_states[N_NORMAL_MEMORY] after
|
|
|
|
* the memory is online.
|
|
|
|
*/
|
|
|
|
if (zone_idx(zone) <= zone_last && !node_state(nid, N_NORMAL_MEMORY))
|
|
|
|
arg->status_change_nid_normal = nid;
|
|
|
|
else
|
|
|
|
arg->status_change_nid_normal = -1;
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
|
|
|
/*
|
|
|
|
* If we have movable node, node_states[N_HIGH_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_HIGHMEM,
|
|
|
|
* set zone_last to ZONE_HIGHMEM.
|
|
|
|
*
|
|
|
|
* If we don't have movable node, node_states[N_NORMAL_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_MOVABLE,
|
|
|
|
* set zone_last to ZONE_MOVABLE.
|
|
|
|
*/
|
|
|
|
zone_last = ZONE_HIGHMEM;
|
|
|
|
if (N_MEMORY == N_HIGH_MEMORY)
|
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
if (zone_idx(zone) <= zone_last && !node_state(nid, N_HIGH_MEMORY))
|
|
|
|
arg->status_change_nid_high = nid;
|
|
|
|
else
|
|
|
|
arg->status_change_nid_high = -1;
|
|
|
|
#else
|
|
|
|
arg->status_change_nid_high = arg->status_change_nid_normal;
|
|
|
|
#endif
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/*
|
|
|
|
* if the node don't have memory befor online, we will need to
|
2012-12-13 05:51:49 +08:00
|
|
|
* set the node to node_states[N_MEMORY] after the memory
|
2012-12-12 08:01:03 +08:00
|
|
|
* is online.
|
|
|
|
*/
|
2012-12-13 05:51:49 +08:00
|
|
|
if (!node_state(nid, N_MEMORY))
|
2012-12-12 08:01:03 +08:00
|
|
|
arg->status_change_nid = nid;
|
|
|
|
else
|
|
|
|
arg->status_change_nid = -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void node_states_set_node(int node, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
if (arg->status_change_nid_normal >= 0)
|
|
|
|
node_set_state(node, N_NORMAL_MEMORY);
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
if (arg->status_change_nid_high >= 0)
|
|
|
|
node_set_state(node, N_HIGH_MEMORY);
|
|
|
|
|
|
|
|
node_set_state(node, N_MEMORY);
|
2012-12-12 08:01:03 +08:00
|
|
|
}
|
|
|
|
|
2007-10-16 16:26:10 +08:00
|
|
|
|
2015-04-15 06:45:11 +08:00
|
|
|
/* Must be protected by mem_hotplug_begin() */
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
int __ref online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
|
2007-10-16 16:26:10 +08:00
|
|
|
{
|
2013-07-04 06:02:10 +08:00
|
|
|
unsigned long flags;
|
2005-10-30 09:16:54 +08:00
|
|
|
unsigned long onlined_pages = 0;
|
|
|
|
struct zone *zone;
|
2006-06-23 17:03:11 +08:00
|
|
|
int need_zonelists_rebuild = 0;
|
2007-10-22 07:41:36 +08:00
|
|
|
int nid;
|
|
|
|
int ret;
|
|
|
|
struct memory_notify arg;
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/*
|
|
|
|
* This doesn't need a lock to do pfn_to_page().
|
|
|
|
* The section can't be removed here because of the
|
|
|
|
* memory_block->state_mutex.
|
|
|
|
*/
|
|
|
|
zone = page_zone(pfn_to_page(pfn));
|
|
|
|
|
2014-08-07 07:05:13 +08:00
|
|
|
if ((zone_idx(zone) > ZONE_NORMAL ||
|
|
|
|
online_type == MMOP_ONLINE_MOVABLE) &&
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
!can_online_high_movable(zone))
|
2015-04-15 06:45:11 +08:00
|
|
|
return -EINVAL;
|
2012-12-12 08:03:23 +08:00
|
|
|
|
2014-08-07 07:05:13 +08:00
|
|
|
if (online_type == MMOP_ONLINE_KERNEL &&
|
|
|
|
zone_idx(zone) == ZONE_MOVABLE) {
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
if (move_pfn_range_left(zone - 1, zone, pfn, pfn + nr_pages))
|
2015-04-15 06:45:11 +08:00
|
|
|
return -EINVAL;
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
}
|
2014-08-07 07:05:13 +08:00
|
|
|
if (online_type == MMOP_ONLINE_MOVABLE &&
|
|
|
|
zone_idx(zone) == ZONE_MOVABLE - 1) {
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
if (move_pfn_range_right(zone, zone + 1, pfn, pfn + nr_pages))
|
2015-04-15 06:45:11 +08:00
|
|
|
return -EINVAL;
|
mm, memory-hotplug: dynamic configure movable memory and portion memory
Add online_movable and online_kernel for logic memory hotplug. This is
the dynamic version of "movablecore" & "kernelcore".
We have the same reason to introduce it as to introduce "movablecore" &
"kernelcore". It has the same motive as "movablecore" & "kernelcore", but
it is dynamic/running-time:
o We can configure memory as kernelcore or movablecore after boot.
Userspace workload is increased, we need more hugepage, we can't use
"online_movable" to add memory and allow the system use more
THP(transparent-huge-page), vice-verse when kernel workload is increase.
Also help for virtualization to dynamic configure host/guest's memory,
to save/(reduce waste) memory.
Memory capacity on Demand
o When a new node is physically online after boot, we need to use
"online_movable" or "online_kernel" to configure/portion it as we
expected when we logic-online it.
This configuration also helps for physically-memory-migrate.
o all benefit as the same as existed "movablecore" & "kernelcore".
o Preparing for movable-node, which is very important for power-saving,
hardware partitioning and high-available-system(hardware fault
management).
(Note, we don't introduce movable-node here.)
Action behavior:
When a memoryblock/memorysection is onlined by "online_movable", the kernel
will not have directly reference to the page of the memoryblock,
thus we can remove that memory any time when needed.
When it is online by "online_kernel", the kernel can use it.
When it is online by "online", the zone type doesn't changed.
Current constraints:
Only the memoryblock which is adjacent to the ZONE_MOVABLE
can be online from ZONE_NORMAL to ZONE_MOVABLE.
[akpm@linux-foundation.org: use min_t, cleanups]
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-12 08:03:16 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Previous code may changed the zone of the pfn range */
|
|
|
|
zone = page_zone(pfn_to_page(pfn));
|
|
|
|
|
2007-10-22 07:41:36 +08:00
|
|
|
arg.start_pfn = pfn;
|
|
|
|
arg.nr_pages = nr_pages;
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_check_changes_online(nr_pages, zone, &arg);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2013-11-13 07:07:21 +08:00
|
|
|
nid = pfn_to_nid(pfn);
|
2005-10-30 09:16:54 +08:00
|
|
|
|
2007-10-22 07:41:36 +08:00
|
|
|
ret = memory_notify(MEM_GOING_ONLINE, &arg);
|
|
|
|
ret = notifier_to_errno(ret);
|
|
|
|
if (ret) {
|
|
|
|
memory_notify(MEM_CANCEL_ONLINE, &arg);
|
2015-04-15 06:45:11 +08:00
|
|
|
return ret;
|
2007-10-22 07:41:36 +08:00
|
|
|
}
|
2006-06-23 17:03:11 +08:00
|
|
|
/*
|
|
|
|
* If this zone is not populated, then it is not in zonelist.
|
|
|
|
* This means the page allocator ignores this zone.
|
|
|
|
* So, zonelist must be updated after online.
|
|
|
|
*/
|
2010-05-25 05:32:52 +08:00
|
|
|
mutex_lock(&zonelists_mutex);
|
2012-12-12 08:01:01 +08:00
|
|
|
if (!populated_zone(zone)) {
|
2006-06-23 17:03:11 +08:00
|
|
|
need_zonelists_rebuild = 1;
|
2012-12-12 08:01:01 +08:00
|
|
|
build_all_zonelists(NULL, zone);
|
|
|
|
}
|
2006-06-23 17:03:11 +08:00
|
|
|
|
2009-09-23 07:45:46 +08:00
|
|
|
ret = walk_system_ram_range(pfn, nr_pages, &onlined_pages,
|
2007-10-16 16:26:10 +08:00
|
|
|
online_pages_range);
|
2008-05-15 07:05:50 +08:00
|
|
|
if (ret) {
|
2012-12-12 08:01:01 +08:00
|
|
|
if (need_zonelists_rebuild)
|
|
|
|
zone_pcp_reset(zone);
|
2010-05-25 05:32:52 +08:00
|
|
|
mutex_unlock(&zonelists_mutex);
|
2012-05-30 06:06:30 +08:00
|
|
|
printk(KERN_DEBUG "online_pages [mem %#010llx-%#010llx] failed\n",
|
|
|
|
(unsigned long long) pfn << PAGE_SHIFT,
|
|
|
|
(((unsigned long long) pfn + nr_pages)
|
|
|
|
<< PAGE_SHIFT) - 1);
|
2008-05-15 07:05:50 +08:00
|
|
|
memory_notify(MEM_CANCEL_ONLINE, &arg);
|
2015-04-15 06:45:11 +08:00
|
|
|
return ret;
|
2008-05-15 07:05:50 +08:00
|
|
|
}
|
|
|
|
|
2005-10-30 09:16:54 +08:00
|
|
|
zone->present_pages += onlined_pages;
|
2013-07-04 06:02:10 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
2006-03-10 09:33:51 +08:00
|
|
|
zone->zone_pgdat->node_present_pages += onlined_pages;
|
2013-07-04 06:02:10 +08:00
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
|
|
|
|
2012-08-01 07:43:30 +08:00
|
|
|
if (onlined_pages) {
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_set_node(zone_to_nid(zone), &arg);
|
2012-08-01 07:43:30 +08:00
|
|
|
if (need_zonelists_rebuild)
|
2012-12-12 08:01:01 +08:00
|
|
|
build_all_zonelists(NULL, NULL);
|
2012-08-01 07:43:30 +08:00
|
|
|
else
|
|
|
|
zone_pcp_update(zone);
|
|
|
|
}
|
2005-10-30 09:16:54 +08:00
|
|
|
|
2010-05-25 05:32:52 +08:00
|
|
|
mutex_unlock(&zonelists_mutex);
|
2011-05-25 08:11:32 +08:00
|
|
|
|
|
|
|
init_per_zone_wmark_min();
|
|
|
|
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
if (onlined_pages) {
|
2007-10-16 16:25:29 +08:00
|
|
|
kswapd_run(zone_to_nid(zone));
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
kcompactd_run(nid);
|
|
|
|
}
|
2005-10-30 09:16:56 +08:00
|
|
|
|
2010-05-25 05:32:51 +08:00
|
|
|
vm_total_pages = nr_free_pagecache_pages();
|
2008-07-24 12:28:18 +08:00
|
|
|
|
2006-09-29 17:01:25 +08:00
|
|
|
writeback_set_ratelimit();
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
if (onlined_pages)
|
|
|
|
memory_notify(MEM_ONLINE, &arg);
|
2015-04-15 06:45:11 +08:00
|
|
|
return 0;
|
2005-10-30 09:16:54 +08:00
|
|
|
}
|
2006-10-01 14:27:08 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
|
2006-06-27 17:53:30 +08:00
|
|
|
|
2014-11-14 07:19:41 +08:00
|
|
|
static void reset_node_present_pages(pg_data_t *pgdat)
|
|
|
|
{
|
|
|
|
struct zone *z;
|
|
|
|
|
|
|
|
for (z = pgdat->node_zones; z < pgdat->node_zones + MAX_NR_ZONES; z++)
|
|
|
|
z->present_pages = 0;
|
|
|
|
|
|
|
|
pgdat->node_present_pages = 0;
|
|
|
|
}
|
|
|
|
|
2009-11-18 06:06:18 +08:00
|
|
|
/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
|
|
|
|
static pg_data_t __ref *hotadd_new_pgdat(int nid, u64 start)
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
{
|
|
|
|
struct pglist_data *pgdat;
|
|
|
|
unsigned long zones_size[MAX_NR_ZONES] = {0};
|
|
|
|
unsigned long zholes_size[MAX_NR_ZONES] = {0};
|
2014-06-05 07:07:51 +08:00
|
|
|
unsigned long start_pfn = PFN_DOWN(start);
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
pgdat = NODE_DATA(nid);
|
|
|
|
if (!pgdat) {
|
|
|
|
pgdat = arch_alloc_nodedata(nid);
|
|
|
|
if (!pgdat)
|
|
|
|
return NULL;
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
arch_refresh_nodedata(nid, pgdat);
|
mm/memory hotplug: postpone the reset of obsolete pgdat
Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under
stress condition:
BUG: unable to handle kernel paging request at 0000000000025f60
IP: next_online_pgdat+0x1/0x50
PGD 0
Oops: 0000 [#1] SMP
ACPI: Device does not support D3cold
Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
Workqueue: events vmstat_update
task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
RIP: 0010: next_online_pgdat+0x1/0x50
RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286
RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
refresh_cpu_vm_stats+0xd0/0x140
vmstat_update+0x11/0x50
process_one_work+0x194/0x3d0
worker_thread+0x12b/0x410
kthread+0xc6/0xd0
ret_from_fork+0x7c/0xb0
The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node, which will reset all the content of pgdat to 0, as the
pgdat is accessed lock-free, so that the users still using the pgdat
will panic, such as the vmstat_update routine.
process A: offline node XX:
vmstat_updat()
refresh_cpu_vm_stats()
for_each_populated_zone()
find online node XX
cond_resched()
offline cpu and memory, then try_offline_node()
node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
zone = next_zone(zone)
pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now
next_online_pgdat(pgdat)
next_online_node(pgdat->node_id); // NULL pointer access
So the solution here is postponing the reset of obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and just resetting
pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset
0 to avoid breaking pointer information in pgdat.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Xishi Qiu <qiuxishi@huawei.com>
Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-03-26 06:55:20 +08:00
|
|
|
} else {
|
|
|
|
/* Reset the nr_zones and classzone_idx to 0 before reuse */
|
|
|
|
pgdat->nr_zones = 0;
|
|
|
|
pgdat->classzone_idx = 0;
|
2013-02-23 08:33:18 +08:00
|
|
|
}
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
|
|
|
/* we can use NODE_DATA(nid) from here */
|
|
|
|
|
|
|
|
/* init node's zones as empty zones, we don't have any present pages.*/
|
2008-07-24 12:27:20 +08:00
|
|
|
free_area_init_node(nid, zones_size, start_pfn, zholes_size);
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
2011-06-16 06:08:38 +08:00
|
|
|
/*
|
|
|
|
* The node we allocated has no zone fallback lists. For avoiding
|
|
|
|
* to access not-initialized zonelist, build here.
|
|
|
|
*/
|
2011-06-23 09:13:04 +08:00
|
|
|
mutex_lock(&zonelists_mutex);
|
2012-08-01 07:43:28 +08:00
|
|
|
build_all_zonelists(pgdat, NULL);
|
2011-06-23 09:13:04 +08:00
|
|
|
mutex_unlock(&zonelists_mutex);
|
2011-06-16 06:08:38 +08:00
|
|
|
|
2014-11-14 07:19:39 +08:00
|
|
|
/*
|
|
|
|
* zone->managed_pages is set to an approximate value in
|
|
|
|
* free_area_init_core(), which will cause
|
|
|
|
* /sys/device/system/node/nodeX/meminfo has wrong data.
|
|
|
|
* So reset it to 0 before any memory is onlined.
|
|
|
|
*/
|
|
|
|
reset_node_managed_pages(pgdat);
|
|
|
|
|
2014-11-14 07:19:41 +08:00
|
|
|
/*
|
|
|
|
* When memory is hot-added, all the memory is in offline state. So
|
|
|
|
* clear all zones' present_pages because they will be updated in
|
|
|
|
* online_pages() and offline_pages().
|
|
|
|
*/
|
|
|
|
reset_node_present_pages(pgdat);
|
|
|
|
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
return pgdat;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void rollback_node_hotadd(int nid, pg_data_t *pgdat)
|
|
|
|
{
|
|
|
|
arch_refresh_nodedata(nid, NULL);
|
|
|
|
arch_free_nodedata(pgdat);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2006-06-27 17:53:35 +08:00
|
|
|
|
2013-11-13 07:07:25 +08:00
|
|
|
/**
|
|
|
|
* try_online_node - online a node if offlined
|
|
|
|
*
|
2010-05-25 05:32:41 +08:00
|
|
|
* called by cpu_up() to online a node without onlined memory.
|
|
|
|
*/
|
2013-11-13 07:07:25 +08:00
|
|
|
int try_online_node(int nid)
|
2010-05-25 05:32:41 +08:00
|
|
|
{
|
|
|
|
pg_data_t *pgdat;
|
|
|
|
int ret;
|
|
|
|
|
2013-11-13 07:07:25 +08:00
|
|
|
if (node_online(nid))
|
|
|
|
return 0;
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_begin();
|
2010-05-25 05:32:41 +08:00
|
|
|
pgdat = hotadd_new_pgdat(nid, 0);
|
2011-06-23 09:13:01 +08:00
|
|
|
if (!pgdat) {
|
2013-11-13 07:07:25 +08:00
|
|
|
pr_err("Cannot online node %d due to NULL pgdat\n", nid);
|
2010-05-25 05:32:41 +08:00
|
|
|
ret = -ENOMEM;
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
node_set_online(nid);
|
|
|
|
ret = register_one_node(nid);
|
|
|
|
BUG_ON(ret);
|
|
|
|
|
2013-11-13 07:07:25 +08:00
|
|
|
if (pgdat->node_zonelists->_zonerefs->zone == NULL) {
|
|
|
|
mutex_lock(&zonelists_mutex);
|
|
|
|
build_all_zonelists(NULL, NULL);
|
|
|
|
mutex_unlock(&zonelists_mutex);
|
|
|
|
}
|
|
|
|
|
2010-05-25 05:32:41 +08:00
|
|
|
out:
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2010-05-25 05:32:41 +08:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
static int check_hotplug_memory_range(u64 start, u64 size)
|
|
|
|
{
|
2014-06-05 07:07:51 +08:00
|
|
|
u64 start_pfn = PFN_DOWN(start);
|
2013-09-12 05:21:49 +08:00
|
|
|
u64 nr_pages = size >> PAGE_SHIFT;
|
|
|
|
|
|
|
|
/* Memory range must be aligned with section */
|
|
|
|
if ((start_pfn & ~PAGE_SECTION_MASK) ||
|
|
|
|
(nr_pages % PAGES_PER_SECTION) || (!nr_pages)) {
|
|
|
|
pr_err("Section-unaligned hotplug range: start 0x%llx, size 0x%llx\n",
|
|
|
|
(unsigned long long)start,
|
|
|
|
(unsigned long long)size);
|
|
|
|
return -EINVAL;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
memory-hotplug: add zone_for_memory() for selecting zone for new memory
This series of patches fixes a problem when adding memory in bad manner.
For example: for a x86_64 machine booted with "mem=400M" and with 2GiB
memory installed, following commands cause problem:
# echo 0x40000000 > /sys/devices/system/memory/probe
[ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
# echo 0x48000000 > /sys/devices/system/memory/probe
[ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
# echo online_movable > /sys/devices/system/memory/memory9/state
# echo 0x50000000 > /sys/devices/system/memory/probe
[ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
# echo 0x58000000 > /sys/devices/system/memory/probe
[ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
# echo online_movable > /sys/devices/system/memory/memory11/state
# echo online> /sys/devices/system/memory/memory8/state
# echo online> /sys/devices/system/memory/memory10/state
# echo offline> /sys/devices/system/memory/memory9/state
[ 30.558819] Offlined Pages 32768
# free
total used free shared buffers cached
Mem: 780588 18014398509432020 830552 0 0 51180
-/+ buffers/cache: 18014398509380840 881732
Swap: 0 0 0
This is because the above commands probe higher memory after online a
section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE.
After the second online_movable, the problem can be observed from
zoneinfo:
# cat /proc/zoneinfo
...
Node 0, zone Movable
pages free 65491
min 250
low 312
high 375
scanned 0
spanned 18446744073709518848
present 65536
managed 65536
...
This series of patches solve the problem by checking ZONE_MOVABLE when
choosing zone for new memory. If new memory is inside or higher than
ZONE_MOVABLE, makes it go there instead.
After applying this series of patches, following are free and zoneinfo
result (after offlining memory9):
bash-4.2# free
total used free shared buffers cached
Mem: 780956 80112 700844 0 0 51180
-/+ buffers/cache: 28932 752024
Swap: 0 0 0
bash-4.2# cat /proc/zoneinfo
Node 0, zone DMA
pages free 3389
min 14
low 17
high 21
scanned 0
spanned 4095
present 3998
managed 3977
nr_free_pages 3389
...
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 73724
min 341
low 426
high 511
scanned 0
spanned 98304
present 98304
managed 92958
nr_free_pages 73724
...
start_pfn: 4096
inactive_ratio: 1
Node 0, zone Normal
pages free 32630
min 120
low 150
high 180
scanned 0
spanned 32768
present 32768
managed 32768
nr_free_pages 32630
...
start_pfn: 262144
inactive_ratio: 1
Node 0, zone Movable
pages free 65476
min 241
low 301
high 361
scanned 0
spanned 98304
present 65536
managed 65536
nr_free_pages 65476
...
start_pfn: 294912
inactive_ratio: 1
This patch (of 7):
Introduce zone_for_memory() in arch independent code for
arch_add_memory() use.
Many arch_add_memory() function simply selects ZONE_HIGHMEM or
ZONE_NORMAL and add new memory into it. However, with the existance of
ZONE_MOVABLE, the selection method should be carefully considered: if
new, higher memory is added after ZONE_MOVABLE is setup, the default
zone and ZONE_MOVABLE may overlap each other.
should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
has already contain memory, compare the address of new memory and
movable memory. If new memory is higher than movable, it should be
added into ZONE_MOVABLE instead of default zone.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "Mel Gorman" <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-07 07:07:36 +08:00
|
|
|
/*
|
|
|
|
* If movable zone has already been setup, newly added memory should be check.
|
|
|
|
* If its address is higher than movable zone, it should be added as movable.
|
|
|
|
* Without this check, movable zone may overlap with other zone.
|
|
|
|
*/
|
|
|
|
static int should_add_memory_movable(int nid, u64 start, u64 size)
|
|
|
|
{
|
|
|
|
unsigned long start_pfn = start >> PAGE_SHIFT;
|
|
|
|
pg_data_t *pgdat = NODE_DATA(nid);
|
|
|
|
struct zone *movable_zone = pgdat->node_zones + ZONE_MOVABLE;
|
|
|
|
|
|
|
|
if (zone_is_empty(movable_zone))
|
|
|
|
return 0;
|
|
|
|
|
|
|
|
if (movable_zone->zone_start_pfn <= start_pfn)
|
|
|
|
return 1;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2015-08-10 03:29:06 +08:00
|
|
|
int zone_for_memory(int nid, u64 start, u64 size, int zone_default,
|
|
|
|
bool for_device)
|
memory-hotplug: add zone_for_memory() for selecting zone for new memory
This series of patches fixes a problem when adding memory in bad manner.
For example: for a x86_64 machine booted with "mem=400M" and with 2GiB
memory installed, following commands cause problem:
# echo 0x40000000 > /sys/devices/system/memory/probe
[ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
# echo 0x48000000 > /sys/devices/system/memory/probe
[ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
# echo online_movable > /sys/devices/system/memory/memory9/state
# echo 0x50000000 > /sys/devices/system/memory/probe
[ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
# echo 0x58000000 > /sys/devices/system/memory/probe
[ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
# echo online_movable > /sys/devices/system/memory/memory11/state
# echo online> /sys/devices/system/memory/memory8/state
# echo online> /sys/devices/system/memory/memory10/state
# echo offline> /sys/devices/system/memory/memory9/state
[ 30.558819] Offlined Pages 32768
# free
total used free shared buffers cached
Mem: 780588 18014398509432020 830552 0 0 51180
-/+ buffers/cache: 18014398509380840 881732
Swap: 0 0 0
This is because the above commands probe higher memory after online a
section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE.
After the second online_movable, the problem can be observed from
zoneinfo:
# cat /proc/zoneinfo
...
Node 0, zone Movable
pages free 65491
min 250
low 312
high 375
scanned 0
spanned 18446744073709518848
present 65536
managed 65536
...
This series of patches solve the problem by checking ZONE_MOVABLE when
choosing zone for new memory. If new memory is inside or higher than
ZONE_MOVABLE, makes it go there instead.
After applying this series of patches, following are free and zoneinfo
result (after offlining memory9):
bash-4.2# free
total used free shared buffers cached
Mem: 780956 80112 700844 0 0 51180
-/+ buffers/cache: 28932 752024
Swap: 0 0 0
bash-4.2# cat /proc/zoneinfo
Node 0, zone DMA
pages free 3389
min 14
low 17
high 21
scanned 0
spanned 4095
present 3998
managed 3977
nr_free_pages 3389
...
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 73724
min 341
low 426
high 511
scanned 0
spanned 98304
present 98304
managed 92958
nr_free_pages 73724
...
start_pfn: 4096
inactive_ratio: 1
Node 0, zone Normal
pages free 32630
min 120
low 150
high 180
scanned 0
spanned 32768
present 32768
managed 32768
nr_free_pages 32630
...
start_pfn: 262144
inactive_ratio: 1
Node 0, zone Movable
pages free 65476
min 241
low 301
high 361
scanned 0
spanned 98304
present 65536
managed 65536
nr_free_pages 65476
...
start_pfn: 294912
inactive_ratio: 1
This patch (of 7):
Introduce zone_for_memory() in arch independent code for
arch_add_memory() use.
Many arch_add_memory() function simply selects ZONE_HIGHMEM or
ZONE_NORMAL and add new memory into it. However, with the existance of
ZONE_MOVABLE, the selection method should be carefully considered: if
new, higher memory is added after ZONE_MOVABLE is setup, the default
zone and ZONE_MOVABLE may overlap each other.
should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
has already contain memory, compare the address of new memory and
movable memory. If new memory is higher than movable, it should be
added into ZONE_MOVABLE instead of default zone.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "Mel Gorman" <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-07 07:07:36 +08:00
|
|
|
{
|
2015-08-10 03:29:06 +08:00
|
|
|
#ifdef CONFIG_ZONE_DEVICE
|
|
|
|
if (for_device)
|
|
|
|
return ZONE_DEVICE;
|
|
|
|
#endif
|
memory-hotplug: add zone_for_memory() for selecting zone for new memory
This series of patches fixes a problem when adding memory in bad manner.
For example: for a x86_64 machine booted with "mem=400M" and with 2GiB
memory installed, following commands cause problem:
# echo 0x40000000 > /sys/devices/system/memory/probe
[ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
# echo 0x48000000 > /sys/devices/system/memory/probe
[ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
# echo online_movable > /sys/devices/system/memory/memory9/state
# echo 0x50000000 > /sys/devices/system/memory/probe
[ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
# echo 0x58000000 > /sys/devices/system/memory/probe
[ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
# echo online_movable > /sys/devices/system/memory/memory11/state
# echo online> /sys/devices/system/memory/memory8/state
# echo online> /sys/devices/system/memory/memory10/state
# echo offline> /sys/devices/system/memory/memory9/state
[ 30.558819] Offlined Pages 32768
# free
total used free shared buffers cached
Mem: 780588 18014398509432020 830552 0 0 51180
-/+ buffers/cache: 18014398509380840 881732
Swap: 0 0 0
This is because the above commands probe higher memory after online a
section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
for systems without ZONE_HIGHMEM) overlaps ZONE_MOVABLE.
After the second online_movable, the problem can be observed from
zoneinfo:
# cat /proc/zoneinfo
...
Node 0, zone Movable
pages free 65491
min 250
low 312
high 375
scanned 0
spanned 18446744073709518848
present 65536
managed 65536
...
This series of patches solve the problem by checking ZONE_MOVABLE when
choosing zone for new memory. If new memory is inside or higher than
ZONE_MOVABLE, makes it go there instead.
After applying this series of patches, following are free and zoneinfo
result (after offlining memory9):
bash-4.2# free
total used free shared buffers cached
Mem: 780956 80112 700844 0 0 51180
-/+ buffers/cache: 28932 752024
Swap: 0 0 0
bash-4.2# cat /proc/zoneinfo
Node 0, zone DMA
pages free 3389
min 14
low 17
high 21
scanned 0
spanned 4095
present 3998
managed 3977
nr_free_pages 3389
...
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 73724
min 341
low 426
high 511
scanned 0
spanned 98304
present 98304
managed 92958
nr_free_pages 73724
...
start_pfn: 4096
inactive_ratio: 1
Node 0, zone Normal
pages free 32630
min 120
low 150
high 180
scanned 0
spanned 32768
present 32768
managed 32768
nr_free_pages 32630
...
start_pfn: 262144
inactive_ratio: 1
Node 0, zone Movable
pages free 65476
min 241
low 301
high 361
scanned 0
spanned 98304
present 65536
managed 65536
nr_free_pages 65476
...
start_pfn: 294912
inactive_ratio: 1
This patch (of 7):
Introduce zone_for_memory() in arch independent code for
arch_add_memory() use.
Many arch_add_memory() function simply selects ZONE_HIGHMEM or
ZONE_NORMAL and add new memory into it. However, with the existance of
ZONE_MOVABLE, the selection method should be carefully considered: if
new, higher memory is added after ZONE_MOVABLE is setup, the default
zone and ZONE_MOVABLE may overlap each other.
should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
has already contain memory, compare the address of new memory and
movable memory. If new memory is higher than movable, it should be
added into ZONE_MOVABLE instead of default zone.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: "Mel Gorman" <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-07 07:07:36 +08:00
|
|
|
if (should_add_memory_movable(nid, start, size))
|
|
|
|
return ZONE_MOVABLE;
|
|
|
|
|
|
|
|
return zone_default;
|
|
|
|
}
|
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
static int online_memory_block(struct memory_block *mem, void *arg)
|
|
|
|
{
|
|
|
|
return memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
|
|
|
|
}
|
|
|
|
|
2008-11-23 01:33:24 +08:00
|
|
|
/* we are OK calling __meminit stuff here - we have CONFIG_MEMORY_HOTPLUG */
|
2016-03-16 05:56:48 +08:00
|
|
|
int __ref add_memory_resource(int nid, struct resource *res, bool online)
|
2006-06-27 17:53:30 +08:00
|
|
|
{
|
2015-06-25 23:35:49 +08:00
|
|
|
u64 start, size;
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
pg_data_t *pgdat = NULL;
|
2013-02-23 08:33:18 +08:00
|
|
|
bool new_pgdat;
|
|
|
|
bool new_node;
|
2006-06-27 17:53:30 +08:00
|
|
|
int ret;
|
|
|
|
|
2015-06-25 23:35:49 +08:00
|
|
|
start = res->start;
|
|
|
|
size = resource_size(res);
|
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
ret = check_hotplug_memory_range(start, size);
|
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
{ /* Stupid hack to suppress address-never-null warning */
|
|
|
|
void *p = NODE_DATA(nid);
|
|
|
|
new_pgdat = !p;
|
|
|
|
}
|
2014-01-24 07:53:26 +08:00
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_begin();
|
2014-01-24 07:53:26 +08:00
|
|
|
|
2015-09-05 06:42:32 +08:00
|
|
|
/*
|
|
|
|
* Add new range to memblock so that when hotadd_new_pgdat() is called
|
|
|
|
* to allocate new pgdat, get_pfn_range_for_nid() will be able to find
|
|
|
|
* this new range and calculate total pages correctly. The range will
|
|
|
|
* be removed at hot-remove time.
|
|
|
|
*/
|
|
|
|
memblock_add_node(start, size, nid);
|
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
new_node = !node_online(nid);
|
|
|
|
if (new_node) {
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
pgdat = hotadd_new_pgdat(nid, start);
|
2009-11-18 06:06:22 +08:00
|
|
|
ret = -ENOMEM;
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
if (!pgdat)
|
2012-07-12 05:02:31 +08:00
|
|
|
goto error;
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
}
|
|
|
|
|
2006-06-27 17:53:30 +08:00
|
|
|
/* call arch's memory hotadd */
|
2015-08-10 03:29:06 +08:00
|
|
|
ret = arch_add_memory(nid, start, size, false);
|
2006-06-27 17:53:30 +08:00
|
|
|
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
if (ret < 0)
|
|
|
|
goto error;
|
|
|
|
|
2006-06-27 17:53:38 +08:00
|
|
|
/* we online node here. we can't roll back from here. */
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
node_set_online(nid);
|
|
|
|
|
2013-02-23 08:33:18 +08:00
|
|
|
if (new_node) {
|
2006-06-27 17:53:38 +08:00
|
|
|
ret = register_one_node(nid);
|
|
|
|
/*
|
|
|
|
* If sysfs file of new node can't create, cpu on the node
|
|
|
|
* can't be hot-added. There is no rollback way now.
|
|
|
|
* So, check by BUG_ON() to catch it reluctantly..
|
|
|
|
*/
|
|
|
|
BUG_ON(ret);
|
|
|
|
}
|
|
|
|
|
2010-03-06 05:41:58 +08:00
|
|
|
/* create new memmap entry */
|
|
|
|
firmware_map_add_hotplug(start, start + size, "System RAM");
|
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
/* online pages if requested */
|
|
|
|
if (online)
|
|
|
|
walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
|
|
|
|
NULL, online_memory_block);
|
|
|
|
|
2009-11-18 06:06:22 +08:00
|
|
|
goto out;
|
|
|
|
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
error:
|
|
|
|
/* rollback pgdat allocation and others */
|
|
|
|
if (new_pgdat)
|
|
|
|
rollback_node_hotadd(nid, pgdat);
|
2015-09-05 06:42:32 +08:00
|
|
|
memblock_remove(start, size);
|
[PATCH] pgdat allocation for new node add (call pgdat allocation)
Add node-hot-add support to add_memory().
node hotadd uses this sequence.
1. allocate pgdat.
2. refresh NODE_DATA()
3. call free_area_init_node() to initialize
4. create sysfs entry
5. add memory (old add_memory())
6. set node online
7. run kswapd for new node.
(8). update zonelist after pages are onlined. (This is already merged in -mm
due to update phase is difference.)
Note:
To make common function as much as possible,
there is 2 changes from v2.
- The old add_memory(), which is defiend by each archs,
is renamed to arch_add_memory(). New add_memory becomes
caller of arch dependent function as a common code.
- This patch changes add_memory()'s interface
From: add_memory(start, end)
TO : add_memory(nid, start, end).
It was cause of similar code that finding node id from
physical address is inside of old add_memory() on each arch.
In addition, acpi memory hotplug driver can find node id easier.
In v2, it must walk DSDT'S _CRS by matching physical address to
get the handle of its memory device, then get _PXM and node id.
Because input is just physical address.
However, in v3, the acpi driver can use handle to get _PXM and node id
for the new memory device. It can pass just node id to add_memory().
Fix interface of arch_add_memory() is in next patche.
Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: "Brown, Len" <len.brown@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27 17:53:34 +08:00
|
|
|
|
2009-11-18 06:06:22 +08:00
|
|
|
out:
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2006-06-27 17:53:30 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2015-06-25 23:35:49 +08:00
|
|
|
EXPORT_SYMBOL_GPL(add_memory_resource);
|
|
|
|
|
|
|
|
int __ref add_memory(int nid, u64 start, u64 size)
|
|
|
|
{
|
|
|
|
struct resource *res;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
res = register_memory_resource(start, size);
|
2016-01-15 07:21:55 +08:00
|
|
|
if (IS_ERR(res))
|
|
|
|
return PTR_ERR(res);
|
2015-06-25 23:35:49 +08:00
|
|
|
|
2016-03-16 05:56:48 +08:00
|
|
|
ret = add_memory_resource(nid, res, memhp_auto_online);
|
2015-06-25 23:35:49 +08:00
|
|
|
if (ret < 0)
|
|
|
|
release_memory_resource(res);
|
|
|
|
return ret;
|
|
|
|
}
|
2006-06-27 17:53:30 +08:00
|
|
|
EXPORT_SYMBOL_GPL(add_memory);
|
2007-10-16 16:26:12 +08:00
|
|
|
|
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2008-07-24 12:28:19 +08:00
|
|
|
/*
|
|
|
|
* A free page on the buddy free lists (not the per-cpu lists) has PageBuddy
|
|
|
|
* set and the size of the free page is given by page_order(). Using this,
|
|
|
|
* the function determines if the pageblock contains only free pages.
|
|
|
|
* Due to buddy contraints, a free page at least the size of a pageblock will
|
|
|
|
* be located at the start of the pageblock
|
|
|
|
*/
|
|
|
|
static inline int pageblock_free(struct page *page)
|
|
|
|
{
|
|
|
|
return PageBuddy(page) && page_order(page) >= pageblock_order;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Return the start of the next active pageblock after a given page */
|
|
|
|
static struct page *next_active_pageblock(struct page *page)
|
|
|
|
{
|
|
|
|
/* Ensure the starting page is pageblock-aligned */
|
|
|
|
BUG_ON(page_to_pfn(page) & (pageblock_nr_pages - 1));
|
|
|
|
|
|
|
|
/* If the entire pageblock is free, move to the end of free page */
|
2010-09-10 07:38:01 +08:00
|
|
|
if (pageblock_free(page)) {
|
|
|
|
int order;
|
|
|
|
/* be careful. we don't have locks, page_order can be changed.*/
|
|
|
|
order = page_order(page);
|
|
|
|
if ((order < MAX_ORDER) && (order >= pageblock_order))
|
|
|
|
return page + (1 << order);
|
|
|
|
}
|
2008-07-24 12:28:19 +08:00
|
|
|
|
2010-09-10 07:38:01 +08:00
|
|
|
return page + pageblock_nr_pages;
|
2008-07-24 12:28:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* Checks if this range of memory is likely to be hot-removable. */
|
|
|
|
int is_mem_section_removable(unsigned long start_pfn, unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
struct page *page = pfn_to_page(start_pfn);
|
|
|
|
struct page *end_page = page + nr_pages;
|
|
|
|
|
|
|
|
/* Check the starting page of each pageblock within the range */
|
|
|
|
for (; page < end_page; page = next_active_pageblock(page)) {
|
2010-10-27 05:21:30 +08:00
|
|
|
if (!is_pageblock_removable_nolock(page))
|
2008-07-24 12:28:19 +08:00
|
|
|
return 0;
|
2010-10-27 05:21:30 +08:00
|
|
|
cond_resched();
|
2008-07-24 12:28:19 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
/* All pageblocks in the memory block are likely to be hot-removable */
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
/*
|
|
|
|
* Confirm all pages in a range [start, end) is belongs to the same zone.
|
|
|
|
*/
|
2014-10-10 06:26:31 +08:00
|
|
|
int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn)
|
2007-10-16 16:26:12 +08:00
|
|
|
{
|
2015-12-30 06:54:25 +08:00
|
|
|
unsigned long pfn, sec_end_pfn;
|
2007-10-16 16:26:12 +08:00
|
|
|
struct zone *zone = NULL;
|
|
|
|
struct page *page;
|
|
|
|
int i;
|
2015-12-30 06:54:25 +08:00
|
|
|
for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn);
|
2007-10-16 16:26:12 +08:00
|
|
|
pfn < end_pfn;
|
2015-12-30 06:54:25 +08:00
|
|
|
pfn = sec_end_pfn + 1, sec_end_pfn += PAGES_PER_SECTION) {
|
|
|
|
/* Make sure the memory section is present first */
|
|
|
|
if (!present_section_nr(pfn_to_section_nr(pfn)))
|
2007-10-16 16:26:12 +08:00
|
|
|
continue;
|
2015-12-30 06:54:25 +08:00
|
|
|
for (; pfn < sec_end_pfn && pfn < end_pfn;
|
|
|
|
pfn += MAX_ORDER_NR_PAGES) {
|
|
|
|
i = 0;
|
|
|
|
/* This is just a CONFIG_HOLES_IN_ZONE check.*/
|
|
|
|
while ((i < MAX_ORDER_NR_PAGES) &&
|
|
|
|
!pfn_valid_within(pfn + i))
|
|
|
|
i++;
|
|
|
|
if (i == MAX_ORDER_NR_PAGES)
|
|
|
|
continue;
|
|
|
|
page = pfn_to_page(pfn + i);
|
|
|
|
if (zone && page_zone(page) != zone)
|
|
|
|
return 0;
|
|
|
|
zone = page_zone(page);
|
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-09-12 05:22:09 +08:00
|
|
|
* Scan pfn range [start,end) to find movable/migratable pages (LRU pages
|
|
|
|
* and hugepages). We scan pfn because it's much easier than scanning over
|
|
|
|
* linked list. This function returns the pfn of the first found movable
|
|
|
|
* page if it's found, otherwise 0.
|
2007-10-16 16:26:12 +08:00
|
|
|
*/
|
2013-09-12 05:22:09 +08:00
|
|
|
static unsigned long scan_movable_pages(unsigned long start, unsigned long end)
|
2007-10-16 16:26:12 +08:00
|
|
|
{
|
|
|
|
unsigned long pfn;
|
|
|
|
struct page *page;
|
|
|
|
for (pfn = start; pfn < end; pfn++) {
|
|
|
|
if (pfn_valid(pfn)) {
|
|
|
|
page = pfn_to_page(pfn);
|
|
|
|
if (PageLRU(page))
|
|
|
|
return pfn;
|
2013-09-12 05:22:09 +08:00
|
|
|
if (PageHuge(page)) {
|
2015-04-16 07:14:41 +08:00
|
|
|
if (page_huge_active(page))
|
2013-09-12 05:22:09 +08:00
|
|
|
return pfn;
|
|
|
|
else
|
|
|
|
pfn = round_up(pfn + 1,
|
|
|
|
1 << compound_order(page)) - 1;
|
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
#define NR_OFFLINE_AT_ONCE_PAGES (256)
|
|
|
|
static int
|
|
|
|
do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
unsigned long pfn;
|
|
|
|
struct page *page;
|
|
|
|
int move_pages = NR_OFFLINE_AT_ONCE_PAGES;
|
|
|
|
int not_managed = 0;
|
|
|
|
int ret = 0;
|
|
|
|
LIST_HEAD(source);
|
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn && move_pages > 0; pfn++) {
|
|
|
|
if (!pfn_valid(pfn))
|
|
|
|
continue;
|
|
|
|
page = pfn_to_page(pfn);
|
2013-09-12 05:22:09 +08:00
|
|
|
|
|
|
|
if (PageHuge(page)) {
|
|
|
|
struct page *head = compound_head(page);
|
|
|
|
pfn = page_to_pfn(head) + (1<<compound_order(head)) - 1;
|
|
|
|
if (compound_order(head) > PFN_SECTION_SHIFT) {
|
|
|
|
ret = -EBUSY;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (isolate_huge_page(page, &source))
|
|
|
|
move_pages -= 1 << compound_order(head);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
2011-05-25 08:12:19 +08:00
|
|
|
if (!get_page_unless_zero(page))
|
2007-10-16 16:26:12 +08:00
|
|
|
continue;
|
|
|
|
/*
|
|
|
|
* We can skip free pages. And we can only deal with pages on
|
|
|
|
* LRU.
|
|
|
|
*/
|
vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
VM does not waste CPU time scanning them. ramfs, ramdisk,
SHM_LOCKED shared memory segments and mlock()ed VMA pages
are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:09 +08:00
|
|
|
ret = isolate_lru_page(page);
|
2007-10-16 16:26:12 +08:00
|
|
|
if (!ret) { /* Success */
|
2011-05-25 08:12:19 +08:00
|
|
|
put_page(page);
|
vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory presure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
VM does not waste CPU time scanning them. ramfs, ramdisk,
SHM_LOCKED shared memory segments and mlock()ed VMA pages
are keept on the unevictable list.
This patch:
isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
It is tough, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:09 +08:00
|
|
|
list_add_tail(&page->lru, &source);
|
2007-10-16 16:26:12 +08:00
|
|
|
move_pages--;
|
2009-12-15 09:58:11 +08:00
|
|
|
inc_zone_page_state(page, NR_ISOLATED_ANON +
|
|
|
|
page_is_file_cache(page));
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
} else {
|
|
|
|
#ifdef CONFIG_DEBUG_VM
|
2010-03-11 07:20:43 +08:00
|
|
|
printk(KERN_ALERT "removing pfn %lx from LRU failed\n",
|
|
|
|
pfn);
|
2014-01-24 07:52:49 +08:00
|
|
|
dump_page(page, "failed to remove from LRU");
|
2007-10-16 16:26:12 +08:00
|
|
|
#endif
|
2011-05-25 08:12:19 +08:00
|
|
|
put_page(page);
|
2011-03-31 09:57:33 +08:00
|
|
|
/* Because we don't have big zone->lock. we should
|
2010-10-27 05:22:10 +08:00
|
|
|
check this again here. */
|
|
|
|
if (page_count(page)) {
|
|
|
|
not_managed++;
|
2010-10-27 05:22:10 +08:00
|
|
|
ret = -EBUSY;
|
2010-10-27 05:22:10 +08:00
|
|
|
break;
|
|
|
|
}
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
}
|
2010-10-27 05:22:10 +08:00
|
|
|
if (!list_empty(&source)) {
|
|
|
|
if (not_managed) {
|
2013-09-12 05:22:09 +08:00
|
|
|
putback_movable_pages(&source);
|
2010-10-27 05:22:10 +08:00
|
|
|
goto out;
|
|
|
|
}
|
2012-10-09 07:32:54 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* alloc_migrate_target should be improooooved!!
|
|
|
|
* migrate_pages returns # of failed pages.
|
|
|
|
*/
|
2014-06-05 07:08:25 +08:00
|
|
|
ret = migrate_pages(&source, alloc_migrate_target, NULL, 0,
|
2013-02-23 08:35:14 +08:00
|
|
|
MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
|
2010-10-27 05:22:10 +08:00
|
|
|
if (ret)
|
2013-09-12 05:22:09 +08:00
|
|
|
putback_movable_pages(&source);
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
out:
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* remove from free_area[] and mark all as Reserved.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
offline_isolated_pages_cb(unsigned long start, unsigned long nr_pages,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
__offline_isolated_pages(start, start + nr_pages);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
2009-09-23 07:45:46 +08:00
|
|
|
walk_system_ram_range(start_pfn, end_pfn - start_pfn, NULL,
|
2007-10-16 16:26:12 +08:00
|
|
|
offline_isolated_pages_cb);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check all pages in range, recoreded as memory resource, are isolated.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
check_pages_isolated_cb(unsigned long start_pfn, unsigned long nr_pages,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
int ret;
|
|
|
|
long offlined = *(long *)data;
|
2012-12-12 08:00:45 +08:00
|
|
|
ret = test_pages_isolated(start_pfn, start_pfn + nr_pages, true);
|
2007-10-16 16:26:12 +08:00
|
|
|
offlined = nr_pages;
|
|
|
|
if (!ret)
|
|
|
|
*(long *)data += offlined;
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
static long
|
|
|
|
check_pages_isolated(unsigned long start_pfn, unsigned long end_pfn)
|
|
|
|
{
|
|
|
|
long offlined = 0;
|
|
|
|
int ret;
|
|
|
|
|
2009-09-23 07:45:46 +08:00
|
|
|
ret = walk_system_ram_range(start_pfn, end_pfn - start_pfn, &offlined,
|
2007-10-16 16:26:12 +08:00
|
|
|
check_pages_isolated_cb);
|
|
|
|
if (ret < 0)
|
|
|
|
offlined = (long)ret;
|
|
|
|
return offlined;
|
|
|
|
}
|
|
|
|
|
2012-12-13 05:52:04 +08:00
|
|
|
#ifdef CONFIG_MOVABLE_NODE
|
2012-12-19 06:23:24 +08:00
|
|
|
/*
|
|
|
|
* When CONFIG_MOVABLE_NODE, we permit offlining of a node which doesn't have
|
|
|
|
* normal memory.
|
|
|
|
*/
|
2012-12-13 05:52:04 +08:00
|
|
|
static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
return true;
|
|
|
|
}
|
2012-12-19 06:23:24 +08:00
|
|
|
#else /* CONFIG_MOVABLE_NODE */
|
2012-12-12 08:03:23 +08:00
|
|
|
/* ensure the node has NORMAL memory if it is still online */
|
|
|
|
static bool can_offline_normal(struct zone *zone, unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
unsigned long present_pages = 0;
|
|
|
|
enum zone_type zt;
|
|
|
|
|
|
|
|
for (zt = 0; zt <= ZONE_NORMAL; zt++)
|
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
|
|
|
|
|
|
|
if (present_pages > nr_pages)
|
|
|
|
return true;
|
|
|
|
|
|
|
|
present_pages = 0;
|
|
|
|
for (; zt <= ZONE_MOVABLE; zt++)
|
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* we can't offline the last normal memory until all
|
|
|
|
* higher memory is offlined.
|
|
|
|
*/
|
|
|
|
return present_pages == 0;
|
|
|
|
}
|
2012-12-19 06:23:24 +08:00
|
|
|
#endif /* CONFIG_MOVABLE_NODE */
|
2012-12-12 08:03:23 +08:00
|
|
|
|
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:08:10 +08:00
|
|
|
static int __init cmdline_parse_movable_node(char *p)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_MOVABLE_NODE
|
|
|
|
/*
|
|
|
|
* Memory used by the kernel cannot be hot-removed because Linux
|
|
|
|
* cannot migrate the kernel pages. When memory hotplug is
|
|
|
|
* enabled, we should prevent memblock from allocating memory
|
|
|
|
* for the kernel.
|
|
|
|
*
|
|
|
|
* ACPI SRAT records all hotpluggable memory ranges. But before
|
|
|
|
* SRAT is parsed, we don't know about it.
|
|
|
|
*
|
|
|
|
* The kernel image is loaded into memory at very early time. We
|
|
|
|
* cannot prevent this anyway. So on NUMA system, we set any
|
|
|
|
* node the kernel resides in as un-hotpluggable.
|
|
|
|
*
|
|
|
|
* Since on modern servers, one node could have double-digit
|
|
|
|
* gigabytes memory, we can assume the memory around the kernel
|
|
|
|
* image is also un-hotpluggable. So before SRAT is parsed, just
|
|
|
|
* allocate memory near the kernel image to try the best to keep
|
|
|
|
* the kernel away from hotpluggable memory.
|
|
|
|
*/
|
|
|
|
memblock_set_bottom_up(true);
|
2014-01-22 07:49:35 +08:00
|
|
|
movable_node_enabled = true;
|
mem-hotplug: introduce movable_node boot option
The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce movable_node boot option to allow users to
choose to not to consume hotpluggable memory at early boot time and later
we can set it as ZONE_MOVABLE.
To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:
1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
top down.
Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Suggested-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Toshi Kani <toshi.kani@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13 07:08:10 +08:00
|
|
|
#else
|
|
|
|
pr_warn("movable_node option not supported\n");
|
|
|
|
#endif
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
early_param("movable_node", cmdline_parse_movable_node);
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/* check which state of node_states will be changed when offline memory */
|
|
|
|
static void node_states_check_changes_offline(unsigned long nr_pages,
|
|
|
|
struct zone *zone, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
struct pglist_data *pgdat = zone->zone_pgdat;
|
|
|
|
unsigned long present_pages = 0;
|
|
|
|
enum zone_type zt, zone_last = ZONE_NORMAL;
|
|
|
|
|
|
|
|
/*
|
2012-12-13 05:51:49 +08:00
|
|
|
* If we have HIGHMEM or movable node, node_states[N_NORMAL_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_NORMAL,
|
|
|
|
* set zone_last to ZONE_NORMAL.
|
2012-12-12 08:01:03 +08:00
|
|
|
*
|
2012-12-13 05:51:49 +08:00
|
|
|
* If we don't have HIGHMEM nor movable node,
|
|
|
|
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
|
|
|
|
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
|
2012-12-12 08:01:03 +08:00
|
|
|
*/
|
2012-12-13 05:51:49 +08:00
|
|
|
if (N_MEMORY == N_NORMAL_MEMORY)
|
2012-12-12 08:01:03 +08:00
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* check whether node_states[N_NORMAL_MEMORY] will be changed.
|
|
|
|
* If the memory to be offline is in a zone of 0...zone_last,
|
|
|
|
* and it is the last present memory, 0...zone_last will
|
|
|
|
* become empty after offline , thus we can determind we will
|
|
|
|
* need to clear the node from node_states[N_NORMAL_MEMORY].
|
|
|
|
*/
|
|
|
|
for (zt = 0; zt <= zone_last; zt++)
|
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
|
|
|
if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
|
|
|
|
arg->status_change_nid_normal = zone_to_nid(zone);
|
|
|
|
else
|
|
|
|
arg->status_change_nid_normal = -1;
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
#ifdef CONFIG_HIGHMEM
|
|
|
|
/*
|
|
|
|
* If we have movable node, node_states[N_HIGH_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_HIGHMEM,
|
|
|
|
* set zone_last to ZONE_HIGHMEM.
|
|
|
|
*
|
|
|
|
* If we don't have movable node, node_states[N_NORMAL_MEMORY]
|
|
|
|
* contains nodes which have zones of 0...ZONE_MOVABLE,
|
|
|
|
* set zone_last to ZONE_MOVABLE.
|
|
|
|
*/
|
|
|
|
zone_last = ZONE_HIGHMEM;
|
|
|
|
if (N_MEMORY == N_HIGH_MEMORY)
|
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
for (; zt <= zone_last; zt++)
|
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
|
|
|
if (zone_idx(zone) <= zone_last && nr_pages >= present_pages)
|
|
|
|
arg->status_change_nid_high = zone_to_nid(zone);
|
|
|
|
else
|
|
|
|
arg->status_change_nid_high = -1;
|
|
|
|
#else
|
|
|
|
arg->status_change_nid_high = arg->status_change_nid_normal;
|
|
|
|
#endif
|
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
/*
|
|
|
|
* node_states[N_HIGH_MEMORY] contains nodes which have 0...ZONE_MOVABLE
|
|
|
|
*/
|
|
|
|
zone_last = ZONE_MOVABLE;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* check whether node_states[N_HIGH_MEMORY] will be changed
|
|
|
|
* If we try to offline the last present @nr_pages from the node,
|
|
|
|
* we can determind we will need to clear the node from
|
|
|
|
* node_states[N_HIGH_MEMORY].
|
|
|
|
*/
|
|
|
|
for (; zt <= zone_last; zt++)
|
|
|
|
present_pages += pgdat->node_zones[zt].present_pages;
|
|
|
|
if (nr_pages >= present_pages)
|
|
|
|
arg->status_change_nid = zone_to_nid(zone);
|
|
|
|
else
|
|
|
|
arg->status_change_nid = -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void node_states_clear_node(int node, struct memory_notify *arg)
|
|
|
|
{
|
|
|
|
if (arg->status_change_nid_normal >= 0)
|
|
|
|
node_clear_state(node, N_NORMAL_MEMORY);
|
|
|
|
|
2012-12-13 05:51:49 +08:00
|
|
|
if ((N_MEMORY != N_NORMAL_MEMORY) &&
|
|
|
|
(arg->status_change_nid_high >= 0))
|
2012-12-12 08:01:03 +08:00
|
|
|
node_clear_state(node, N_HIGH_MEMORY);
|
2012-12-13 05:51:49 +08:00
|
|
|
|
|
|
|
if ((N_MEMORY != N_HIGH_MEMORY) &&
|
|
|
|
(arg->status_change_nid >= 0))
|
|
|
|
node_clear_state(node, N_MEMORY);
|
2012-12-12 08:01:03 +08:00
|
|
|
}
|
|
|
|
|
2012-10-09 07:33:58 +08:00
|
|
|
static int __ref __offline_pages(unsigned long start_pfn,
|
2007-10-16 16:26:12 +08:00
|
|
|
unsigned long end_pfn, unsigned long timeout)
|
|
|
|
{
|
|
|
|
unsigned long pfn, nr_pages, expire;
|
|
|
|
long offlined_pages;
|
2007-10-22 07:41:36 +08:00
|
|
|
int ret, drain, retry_max, node;
|
2013-07-04 06:02:11 +08:00
|
|
|
unsigned long flags;
|
2007-10-16 16:26:12 +08:00
|
|
|
struct zone *zone;
|
2007-10-22 07:41:36 +08:00
|
|
|
struct memory_notify arg;
|
2007-10-16 16:26:12 +08:00
|
|
|
|
|
|
|
/* at least, alignment against pageblock is necessary */
|
|
|
|
if (!IS_ALIGNED(start_pfn, pageblock_nr_pages))
|
|
|
|
return -EINVAL;
|
|
|
|
if (!IS_ALIGNED(end_pfn, pageblock_nr_pages))
|
|
|
|
return -EINVAL;
|
|
|
|
/* This makes hotplug much easier...and readable.
|
|
|
|
we assume this for now. .*/
|
|
|
|
if (!test_pages_in_a_zone(start_pfn, end_pfn))
|
|
|
|
return -EINVAL;
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
zone = page_zone(pfn_to_page(start_pfn));
|
|
|
|
node = zone_to_nid(zone);
|
|
|
|
nr_pages = end_pfn - start_pfn;
|
|
|
|
|
2012-12-12 08:03:23 +08:00
|
|
|
if (zone_idx(zone) <= ZONE_NORMAL && !can_offline_normal(zone, nr_pages))
|
2015-04-15 06:45:11 +08:00
|
|
|
return -EINVAL;
|
2012-12-12 08:03:23 +08:00
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
/* set above range as isolated */
|
2012-12-12 08:00:45 +08:00
|
|
|
ret = start_isolate_page_range(start_pfn, end_pfn,
|
|
|
|
MIGRATE_MOVABLE, true);
|
2007-10-16 16:26:12 +08:00
|
|
|
if (ret)
|
2015-04-15 06:45:11 +08:00
|
|
|
return ret;
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
arg.start_pfn = start_pfn;
|
|
|
|
arg.nr_pages = nr_pages;
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_check_changes_offline(nr_pages, zone, &arg);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
ret = memory_notify(MEM_GOING_OFFLINE, &arg);
|
|
|
|
ret = notifier_to_errno(ret);
|
|
|
|
if (ret)
|
|
|
|
goto failed_removal;
|
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
pfn = start_pfn;
|
|
|
|
expire = jiffies + timeout;
|
|
|
|
drain = 0;
|
|
|
|
retry_max = 5;
|
|
|
|
repeat:
|
|
|
|
/* start memory hot removal */
|
|
|
|
ret = -EAGAIN;
|
|
|
|
if (time_after(jiffies, expire))
|
|
|
|
goto failed_removal;
|
|
|
|
ret = -EINTR;
|
|
|
|
if (signal_pending(current))
|
|
|
|
goto failed_removal;
|
|
|
|
ret = 0;
|
|
|
|
if (drain) {
|
|
|
|
lru_add_drain_all();
|
|
|
|
cond_resched();
|
2014-12-11 07:43:10 +08:00
|
|
|
drain_all_pages(zone);
|
2007-10-16 16:26:12 +08:00
|
|
|
}
|
|
|
|
|
2013-09-12 05:22:09 +08:00
|
|
|
pfn = scan_movable_pages(start_pfn, end_pfn);
|
|
|
|
if (pfn) { /* We have movable pages */
|
2007-10-16 16:26:12 +08:00
|
|
|
ret = do_migrate_range(pfn, end_pfn);
|
|
|
|
if (!ret) {
|
|
|
|
drain = 1;
|
|
|
|
goto repeat;
|
|
|
|
} else {
|
|
|
|
if (ret < 0)
|
|
|
|
if (--retry_max == 0)
|
|
|
|
goto failed_removal;
|
|
|
|
yield();
|
|
|
|
drain = 1;
|
|
|
|
goto repeat;
|
|
|
|
}
|
|
|
|
}
|
2012-09-20 09:48:02 +08:00
|
|
|
/* drain all zone's lru pagevec, this is asynchronous... */
|
2007-10-16 16:26:12 +08:00
|
|
|
lru_add_drain_all();
|
|
|
|
yield();
|
2012-09-20 09:48:02 +08:00
|
|
|
/* drain pcp pages, this is synchronous. */
|
2014-12-11 07:43:10 +08:00
|
|
|
drain_all_pages(zone);
|
2013-09-12 05:22:09 +08:00
|
|
|
/*
|
|
|
|
* dissolve free hugepages in the memory block before doing offlining
|
|
|
|
* actually in order to make hugetlbfs's object counting consistent.
|
|
|
|
*/
|
|
|
|
dissolve_free_huge_pages(start_pfn, end_pfn);
|
2007-10-16 16:26:12 +08:00
|
|
|
/* check again */
|
|
|
|
offlined_pages = check_pages_isolated(start_pfn, end_pfn);
|
|
|
|
if (offlined_pages < 0) {
|
|
|
|
ret = -EBUSY;
|
|
|
|
goto failed_removal;
|
|
|
|
}
|
|
|
|
printk(KERN_INFO "Offlined Pages %ld\n", offlined_pages);
|
2012-09-20 09:48:02 +08:00
|
|
|
/* Ok, all of our target is isolated.
|
2007-10-16 16:26:12 +08:00
|
|
|
We cannot do rollback at this point. */
|
|
|
|
offline_isolated_pages(start_pfn, end_pfn);
|
2007-11-15 08:59:12 +08:00
|
|
|
/* reset pagetype flags and makes migrate type to be MOVABLE */
|
2012-04-03 21:06:15 +08:00
|
|
|
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
|
2007-10-16 16:26:12 +08:00
|
|
|
/* removal success */
|
2013-07-04 06:03:21 +08:00
|
|
|
adjust_managed_page_count(pfn_to_page(start_pfn), -offlined_pages);
|
2007-10-16 16:26:12 +08:00
|
|
|
zone->present_pages -= offlined_pages;
|
2013-07-04 06:02:11 +08:00
|
|
|
|
|
|
|
pgdat_resize_lock(zone->zone_pgdat, &flags);
|
2007-10-16 16:26:12 +08:00
|
|
|
zone->zone_pgdat->node_present_pages -= offlined_pages;
|
2013-07-04 06:02:11 +08:00
|
|
|
pgdat_resize_unlock(zone->zone_pgdat, &flags);
|
2007-10-22 07:41:36 +08:00
|
|
|
|
2011-05-25 08:11:32 +08:00
|
|
|
init_per_zone_wmark_min();
|
|
|
|
|
2012-10-09 07:31:51 +08:00
|
|
|
if (!populated_zone(zone)) {
|
2012-08-01 07:43:32 +08:00
|
|
|
zone_pcp_reset(zone);
|
2012-10-09 07:31:51 +08:00
|
|
|
mutex_lock(&zonelists_mutex);
|
|
|
|
build_all_zonelists(NULL, NULL);
|
|
|
|
mutex_unlock(&zonelists_mutex);
|
|
|
|
} else
|
|
|
|
zone_pcp_update(zone);
|
2012-08-01 07:43:32 +08:00
|
|
|
|
2012-12-12 08:01:03 +08:00
|
|
|
node_states_clear_node(node, &arg);
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
if (arg.status_change_nid >= 0) {
|
2009-12-15 09:58:33 +08:00
|
|
|
kswapd_stop(node);
|
mm, compaction: introduce kcompactd
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attemps
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overal memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
ompact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Future possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-18 05:18:08 +08:00
|
|
|
kcompactd_stop(node);
|
|
|
|
}
|
2009-06-17 06:32:50 +08:00
|
|
|
|
2007-10-16 16:26:12 +08:00
|
|
|
vm_total_pages = nr_free_pagecache_pages();
|
|
|
|
writeback_set_ratelimit();
|
2007-10-22 07:41:36 +08:00
|
|
|
|
|
|
|
memory_notify(MEM_OFFLINE, &arg);
|
2007-10-16 16:26:12 +08:00
|
|
|
return 0;
|
|
|
|
|
|
|
|
failed_removal:
|
2012-05-30 06:06:30 +08:00
|
|
|
printk(KERN_INFO "memory offlining [mem %#010llx-%#010llx] failed\n",
|
|
|
|
(unsigned long long) start_pfn << PAGE_SHIFT,
|
|
|
|
((unsigned long long) end_pfn << PAGE_SHIFT) - 1);
|
2007-10-22 07:41:36 +08:00
|
|
|
memory_notify(MEM_CANCEL_OFFLINE, &arg);
|
2007-10-16 16:26:12 +08:00
|
|
|
/* pushback to free area */
|
2012-04-03 21:06:15 +08:00
|
|
|
undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
|
2007-10-16 16:26:12 +08:00
|
|
|
return ret;
|
|
|
|
}
|
2008-10-19 11:25:58 +08:00
|
|
|
|
2015-04-15 06:45:11 +08:00
|
|
|
/* Must be protected by mem_hotplug_begin() */
|
2012-10-09 07:33:58 +08:00
|
|
|
int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
|
|
|
|
{
|
|
|
|
return __offline_pages(start_pfn, start_pfn + nr_pages, 120 * HZ);
|
|
|
|
}
|
2013-05-08 06:29:49 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
|
2012-10-09 07:33:58 +08:00
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
/**
|
|
|
|
* walk_memory_range - walks through all mem sections in [start_pfn, end_pfn)
|
|
|
|
* @start_pfn: start pfn of the memory range
|
2013-04-30 06:06:16 +08:00
|
|
|
* @end_pfn: end pfn of the memory range
|
2013-02-23 08:32:54 +08:00
|
|
|
* @arg: argument passed to func
|
|
|
|
* @func: callback for each memory section walked
|
|
|
|
*
|
|
|
|
* This function walks through all present mem sections in range
|
|
|
|
* [start_pfn, end_pfn) and call func on each mem section.
|
|
|
|
*
|
|
|
|
* Returns the return value of func.
|
|
|
|
*/
|
2013-05-08 06:29:49 +08:00
|
|
|
int walk_memory_range(unsigned long start_pfn, unsigned long end_pfn,
|
2013-02-23 08:32:54 +08:00
|
|
|
void *arg, int (*func)(struct memory_block *, void *))
|
2008-10-19 11:25:58 +08:00
|
|
|
{
|
2012-10-09 07:34:01 +08:00
|
|
|
struct memory_block *mem = NULL;
|
|
|
|
struct mem_section *section;
|
|
|
|
unsigned long pfn, section_nr;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
section_nr = pfn_to_section_nr(pfn);
|
|
|
|
if (!present_section_nr(section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
section = __nr_to_section(section_nr);
|
|
|
|
/* same memblock? */
|
|
|
|
if (mem)
|
|
|
|
if ((section_nr >= mem->start_section_nr) &&
|
|
|
|
(section_nr <= mem->end_section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
mem = find_memory_block_hinted(section, mem);
|
|
|
|
if (!mem)
|
|
|
|
continue;
|
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
ret = func(mem, arg);
|
2012-10-09 07:34:01 +08:00
|
|
|
if (ret) {
|
2013-02-23 08:32:54 +08:00
|
|
|
kobject_put(&mem->dev.kobj);
|
|
|
|
return ret;
|
2012-10-09 07:34:01 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (mem)
|
|
|
|
kobject_put(&mem->dev.kobj);
|
|
|
|
|
2013-02-23 08:32:54 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-05-08 06:29:49 +08:00
|
|
|
#ifdef CONFIG_MEMORY_HOTREMOVE
|
2013-11-13 07:07:20 +08:00
|
|
|
static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
|
2013-02-23 08:32:54 +08:00
|
|
|
{
|
|
|
|
int ret = !is_memblock_offlined(mem);
|
|
|
|
|
2013-04-30 06:08:49 +08:00
|
|
|
if (unlikely(ret)) {
|
|
|
|
phys_addr_t beginpa, endpa;
|
|
|
|
|
|
|
|
beginpa = PFN_PHYS(section_nr_to_pfn(mem->start_section_nr));
|
|
|
|
endpa = PFN_PHYS(section_nr_to_pfn(mem->end_section_nr + 1))-1;
|
2013-02-23 08:32:54 +08:00
|
|
|
pr_warn("removing memory fails, because memory "
|
2013-04-30 06:08:49 +08:00
|
|
|
"[%pa-%pa] is onlined\n",
|
|
|
|
&beginpa, &endpa);
|
|
|
|
}
|
2013-02-23 08:32:54 +08:00
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
static int check_cpu_on_node(pg_data_t *pgdat)
|
2013-02-23 08:33:14 +08:00
|
|
|
{
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
for_each_present_cpu(cpu) {
|
|
|
|
if (cpu_to_node(cpu) == pgdat->node_id)
|
|
|
|
/*
|
|
|
|
* the cpu on this node isn't removed, and we can't
|
|
|
|
* offline this node.
|
|
|
|
*/
|
|
|
|
return -EBUSY;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
static void unmap_cpu_on_node(pg_data_t *pgdat)
|
2013-02-23 08:33:31 +08:00
|
|
|
{
|
|
|
|
#ifdef CONFIG_ACPI_NUMA
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
for_each_possible_cpu(cpu)
|
|
|
|
if (cpu_to_node(cpu) == pgdat->node_id)
|
|
|
|
numa_clear_node(cpu);
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
static int check_and_unmap_cpu_on_node(pg_data_t *pgdat)
|
2013-02-23 08:33:31 +08:00
|
|
|
{
|
2013-09-12 05:21:50 +08:00
|
|
|
int ret;
|
2013-02-23 08:33:31 +08:00
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
ret = check_cpu_on_node(pgdat);
|
2013-02-23 08:33:31 +08:00
|
|
|
if (ret)
|
|
|
|
return ret;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* the node will be offlined when we come here, so we can clear
|
|
|
|
* the cpu_to_node() now.
|
|
|
|
*/
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
unmap_cpu_on_node(pgdat);
|
2013-02-23 08:33:31 +08:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
/**
|
|
|
|
* try_offline_node
|
|
|
|
*
|
|
|
|
* Offline a node if all memory sections and cpus of the node are removed.
|
|
|
|
*
|
|
|
|
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
|
|
|
|
* and online/offline operations before this call.
|
|
|
|
*/
|
2013-02-23 08:33:27 +08:00
|
|
|
void try_offline_node(int nid)
|
2013-02-23 08:33:14 +08:00
|
|
|
{
|
2013-02-23 08:33:16 +08:00
|
|
|
pg_data_t *pgdat = NODE_DATA(nid);
|
|
|
|
unsigned long start_pfn = pgdat->node_start_pfn;
|
|
|
|
unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
|
2013-02-23 08:33:14 +08:00
|
|
|
unsigned long pfn;
|
2013-02-23 08:33:16 +08:00
|
|
|
int i;
|
2013-02-23 08:33:14 +08:00
|
|
|
|
|
|
|
for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
|
|
|
|
unsigned long section_nr = pfn_to_section_nr(pfn);
|
|
|
|
|
|
|
|
if (!present_section_nr(section_nr))
|
|
|
|
continue;
|
|
|
|
|
|
|
|
if (pfn_to_nid(pfn) != nid)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* some memory sections of this node are not removed, and we
|
|
|
|
* can't offline node now.
|
|
|
|
*/
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
if (check_and_unmap_cpu_on_node(pgdat))
|
2013-02-23 08:33:14 +08:00
|
|
|
return;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* all memory/cpu of this node are removed, we can offline this
|
|
|
|
* node now.
|
|
|
|
*/
|
|
|
|
node_set_offline(nid);
|
|
|
|
unregister_one_node(nid);
|
2013-02-23 08:33:16 +08:00
|
|
|
|
|
|
|
/* free waittable in each zone */
|
|
|
|
for (i = 0; i < MAX_NR_ZONES; i++) {
|
|
|
|
struct zone *zone = pgdat->node_zones + i;
|
|
|
|
|
2013-03-23 06:04:50 +08:00
|
|
|
/*
|
|
|
|
* wait_table may be allocated from boot memory,
|
|
|
|
* here only free if it's allocated by vmalloc.
|
|
|
|
*/
|
mm/memory_hotplug.c: set zone->wait_table to null after freeing it
Izumi found the following oops when hot re-adding a node:
BUG: unable to handle kernel paging request at ffffc90008963690
IP: __wake_up_bit+0x20/0x70
Oops: 0000 [#1] SMP
CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
RIP: 0010:[<ffffffff810dff80>] [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
Call Trace:
unlock_page+0x6d/0x70
generic_write_end+0x53/0xb0
xfs_vm_write_end+0x29/0x80 [xfs]
generic_perform_write+0x10a/0x1e0
xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
xfs_file_write_iter+0x79/0x120 [xfs]
__vfs_write+0xd4/0x110
vfs_write+0xac/0x1c0
SyS_write+0x58/0xd0
system_call_fastpath+0x12/0x76
Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
RIP [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
RSP <ffff880017b97be8>
CR2: ffffc90008963690
Reproduce method (re-add a node)::
Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)
This seems an use-after-free problem, and the root cause is
zone->wait_table was not set to *NULL* after free it in
try_offline_node.
When hot re-add a node, we will reuse the pgdat of it, so does the zone
struct, and when add pages to the target zone, it will init the zone
first (including the wait_table) if the zone is not initialized. The
judgement of zone initialized is based on zone->wait_table:
static inline bool zone_is_initialized(struct zone *zone)
{
return !!zone->wait_table;
}
so if we do not set the zone->wait_table to *NULL* after free it, the
memory hotplug routine will skip the init of new zone when hot re-add
the node, and the wait_table still points to the freed memory, then we
will access the invalid address when trying to wake up the waiting
people after the i/o operation with the page is done, such as mentioned
above.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Reviewed by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-11 02:14:43 +08:00
|
|
|
if (is_vmalloc_addr(zone->wait_table)) {
|
2013-02-23 08:33:16 +08:00
|
|
|
vfree(zone->wait_table);
|
mm/memory_hotplug.c: set zone->wait_table to null after freeing it
Izumi found the following oops when hot re-adding a node:
BUG: unable to handle kernel paging request at ffffc90008963690
IP: __wake_up_bit+0x20/0x70
Oops: 0000 [#1] SMP
CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
RIP: 0010:[<ffffffff810dff80>] [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
Call Trace:
unlock_page+0x6d/0x70
generic_write_end+0x53/0xb0
xfs_vm_write_end+0x29/0x80 [xfs]
generic_perform_write+0x10a/0x1e0
xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
xfs_file_write_iter+0x79/0x120 [xfs]
__vfs_write+0xd4/0x110
vfs_write+0xac/0x1c0
SyS_write+0x58/0xd0
system_call_fastpath+0x12/0x76
Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 <48> 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
RIP [<ffffffff810dff80>] __wake_up_bit+0x20/0x70
RSP <ffff880017b97be8>
CR2: ffffc90008963690
Reproduce method (re-add a node)::
Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)
This seems an use-after-free problem, and the root cause is
zone->wait_table was not set to *NULL* after free it in
try_offline_node.
When hot re-add a node, we will reuse the pgdat of it, so does the zone
struct, and when add pages to the target zone, it will init the zone
first (including the wait_table) if the zone is not initialized. The
judgement of zone initialized is based on zone->wait_table:
static inline bool zone_is_initialized(struct zone *zone)
{
return !!zone->wait_table;
}
so if we do not set the zone->wait_table to *NULL* after free it, the
memory hotplug routine will skip the init of new zone when hot re-add
the node, and the wait_table still points to the freed memory, then we
will access the invalid address when trying to wake up the waiting
people after the i/o operation with the page is done, such as mentioned
above.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Reported-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Reviewed by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-06-11 02:14:43 +08:00
|
|
|
zone->wait_table = NULL;
|
|
|
|
}
|
2013-02-23 08:33:16 +08:00
|
|
|
}
|
2013-02-23 08:33:14 +08:00
|
|
|
}
|
2013-02-23 08:33:27 +08:00
|
|
|
EXPORT_SYMBOL(try_offline_node);
|
2013-02-23 08:33:14 +08:00
|
|
|
|
2013-09-12 05:21:50 +08:00
|
|
|
/**
|
|
|
|
* remove_memory
|
|
|
|
*
|
|
|
|
* NOTE: The caller must call lock_device_hotplug() to serialize hotplug
|
|
|
|
* and online/offline operations before this call, as required by
|
|
|
|
* try_offline_node().
|
|
|
|
*/
|
2013-05-27 18:58:46 +08:00
|
|
|
void __ref remove_memory(int nid, u64 start, u64 size)
|
2013-02-23 08:32:54 +08:00
|
|
|
{
|
2013-05-27 18:58:46 +08:00
|
|
|
int ret;
|
memory-hotplug: try to offline the memory twice to avoid dependence
memory can't be offlined when CONFIG_MEMCG is selected. For example:
there is a memory device on node 1. The address range is [1G, 1.5G).
You will find 4 new directories memory8, memory9, memory10, and memory11
under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we will allocate memory to store page
cgroup when we online pages. When we online memory8, the memory stored
page cgroup is not provided by this memory device. But when we online
memory9, the memory stored page cgroup may be provided by memory8. So
we can't offline memory8 now. We should offline the memory in the
reversed order.
When the memory device is hotremoved, we will auto offline memory
provided by this memory device. But we don't know which memory is
onlined first, so offlining memory may fail. In such case, iterate
twice to offline the memory. 1st iterate: offline every non primary
memory block. 2nd iterate: offline primary (i.e. first added) memory
block.
This idea is suggested by KOSAKI Motohiro.
Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Jianguo Wu <wujianguo@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Wu Jianguo <wujianguo@huawei.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-23 08:32:50 +08:00
|
|
|
|
2013-09-12 05:21:49 +08:00
|
|
|
BUG_ON(check_hotplug_memory_range(start, size));
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_begin();
|
2013-02-23 08:32:52 +08:00
|
|
|
|
|
|
|
/*
|
2013-05-27 18:58:46 +08:00
|
|
|
* All memory blocks must be offlined before removing memory. Check
|
|
|
|
* whether all memory blocks in question are offline and trigger a BUG()
|
|
|
|
* if this is not the case.
|
2013-02-23 08:32:52 +08:00
|
|
|
*/
|
2013-05-27 18:58:46 +08:00
|
|
|
ret = walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1), NULL,
|
2013-11-13 07:07:20 +08:00
|
|
|
check_memblock_offlined_cb);
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
if (ret)
|
2013-05-27 18:58:46 +08:00
|
|
|
BUG();
|
2013-02-23 08:32:52 +08:00
|
|
|
|
2013-02-23 08:32:56 +08:00
|
|
|
/* remove memmap entry */
|
|
|
|
firmware_map_remove(start, start + size, "System RAM");
|
2015-08-15 06:35:16 +08:00
|
|
|
memblock_free(start, size);
|
|
|
|
memblock_remove(start, size);
|
2013-02-23 08:32:56 +08:00
|
|
|
|
2013-02-23 08:32:58 +08:00
|
|
|
arch_remove_memory(start, size);
|
|
|
|
|
2013-02-23 08:33:14 +08:00
|
|
|
try_offline_node(nid);
|
|
|
|
|
mem-hotplug: implement get/put_online_mems
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.
What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves for the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations to locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.
[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it by
myself, because it used an rw semaphore for get/put_online_mems,
making them dead lock prune. ]
This patch (of 2):
{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.
This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing a code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.
lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-05 07:07:18 +08:00
|
|
|
mem_hotplug_done();
|
2008-10-19 11:25:58 +08:00
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(remove_memory);
|
2013-06-02 04:24:07 +08:00
|
|
|
#endif /* CONFIG_MEMORY_HOTREMOVE */
|