2006-03-22 16:09:12 +08:00
/*
 * Memory Migration functionality - linux/mm/migration.c
 *
 * Copyright (C) 2006 Silicon Graphics, Inc., Christoph Lameter
 *
 * Page migration was first developed in the context of the memory hotplug
 * project. The main authors of the migration code are:
 *
 * IWAMOTO Toshihiro <iwamoto@valinux.co.jp>
 * Hirokazu Takahashi <taka@valinux.co.jp>
 * Dave Hansen <haveblue@us.ibm.com>
2008-07-05 00:59:22 +08:00
 * Christoph Lameter
2006-03-22 16:09:12 +08:00
 */

#include <linux/migrate.h>
#include <linux/module.h>
#include <linux/swap.h>
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to a page.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
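
The write-protection bug described in the message above can be illustrated with a small user-space sketch. It is not the kernel code: `sketch_entry`, `sketch_make_migration_entry()` and the two copy helpers are hypothetical names, and `SWP_MIGRATION_WRITE = 31` is an assumption (the message only states that `SWP_MIGRATION_READ` is 30). The point is that passing a non-zero type constant where a boolean "write" flag is expected silently produces a write entry.

```c
#include <assert.h>
#include <stdbool.h>

/* SWP_MIGRATION_READ = 30 comes from the commit message; the WRITE value
 * is an assumption made for this sketch. */
enum { SWP_MIGRATION_READ = 30, SWP_MIGRATION_WRITE = 31 };

/* Hypothetical, simplified stand-in for a migration swap entry. */
struct sketch_entry {
	int type;
	unsigned long pfn;
};

/* Second argument is a boolean "write" flag, not a swap type. */
static struct sketch_entry sketch_make_migration_entry(unsigned long pfn,
							bool write)
{
	struct sketch_entry e = {
		.type = write ? SWP_MIGRATION_WRITE : SWP_MIGRATION_READ,
		.pfn = pfn,
	};
	return e;
}

static bool sketch_is_write_migration_entry(struct sketch_entry e)
{
	return e.type == SWP_MIGRATION_WRITE;
}

static void sketch_make_migration_entry_read(struct sketch_entry *e)
{
	e->type = SWP_MIGRATION_READ;
}

/* Buggy pattern: 30 converts to "true", so the child gets a WRITE entry. */
static struct sketch_entry sketch_buggy_fork_copy(struct sketch_entry parent)
{
	return sketch_make_migration_entry(parent.pfn, SWP_MIGRATION_READ);
}

/* Pattern the message recommends: test the entry, then downgrade it. */
static struct sketch_entry sketch_fixed_fork_copy(struct sketch_entry parent)
{
	struct sketch_entry child = parent;

	if (sketch_is_write_migration_entry(child))
		sketch_make_migration_entry_read(&child);
	assert(!sketch_is_write_migration_entry(child));
	return child;
}
```

Copying a write entry with the buggy helper leaves it writable, while the fixed helper downgrades it to a read entry and keeps the page frame number unchanged.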
2006-06-23 17:03:35 +08:00
#include <linux/swapops.h>
2006-03-22 16:09:12 +08:00
#include <linux/pagemap.h>
2006-04-11 13:52:57 +08:00
#include <linux/buffer_head.h>
2006-03-22 16:09:12 +08:00
#include <linux/mm_inline.h>
2007-10-19 14:40:14 +08:00
#include <linux/nsproxy.h>
2006-03-22 16:09:12 +08:00
#include <linux/pagevec.h>
ksm: rmap_walk to remove_migation_ptes
A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.
Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).
rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.
For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 09:59:31 +08:00
#include <linux/ksm.h>
2006-03-22 16:09:12 +08:00
#include <linux/rmap.h>
#include <linux/topology.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
2006-06-23 17:03:38 +08:00
#include <linux/writeback.h>
2006-06-23 17:03:55 +08:00
#include <linux/mempolicy.h>
#include <linux/vmalloc.h>
2006-06-23 17:04:02 +08:00
#include <linux/security.h>
2008-02-07 16:13:53 +08:00
#include <linux/memcontrol.h>
2008-07-24 12:27:02 +08:00
#include <linux/syscalls.h>
2010-09-08 09:19:35 +08:00
#include <linux/hugetlb.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 16:04:11 +08:00
#include <linux/gfp.h>
2006-03-22 16:09:12 +08:00

2010-12-22 09:24:26 +08:00
#include <asm/tlbflush.h>

2006-03-22 16:09:12 +08:00
#include "internal.h"

#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
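
The `lru_to_page()` macro above takes the `prev` pointer of a list head, i.e. the tail of the list, and converts the embedded `lru` link back into its containing page. A minimal user-space sketch of the same idea follows. The `struct list_head` and `list_entry()` here are simplified re-implementations, and `struct sketch_page`, `sketch_lru_to_page()` and the helper functions are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-ins for the kernel's list primitives. */
struct list_head {
	struct list_head *next, *prev;
};

#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical page type; the real struct page is far larger. */
struct sketch_page {
	int id;
	struct list_head lru;
};

/* Same shape as lru_to_page(): follow ->prev, then recover the container. */
#define sketch_lru_to_page(_head) \
	(list_entry((_head)->prev, struct sketch_page, lru))

static void sketch_list_init(struct list_head *head)
{
	head->next = head->prev = head;
}

/* Insert right after the head, so older entries drift towards the tail. */
static void sketch_list_add(struct list_head *node, struct list_head *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Return the id of the page at the tail of a non-empty list. */
static int sketch_tail_id(struct list_head *head)
{
	assert(head->prev != head);	/* the list must not be empty */
	return sketch_lru_to_page(head)->id;
}
```

With head insertion, the first page added ends up at the tail, so `sketch_tail_id()` returns the id of the oldest entry.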

/*
2006-06-23 17:03:55 +08:00
 * migrate_prep() needs to be called before we start compiling a list of pages
2010-05-25 05:32:27 +08:00
 * to be migrated using isolate_lru_page(). If scheduling work on other CPUs is
 * undesirable, use migrate_prep_local()
2006-03-22 16:09:12 +08:00
 */
int migrate_prep(void)
{
	/*
	 * Clear the LRU lists so pages can be isolated.
	 * Note that pages may be moved off the LRU after we have
	 * drained them. Those pages will fail to migrate like other
	 * pages that may be busy.
	 */
	lru_add_drain_all();

	return 0;
}

2010-05-25 05:32:27 +08:00
/* Do the necessary work of migrate_prep but not if it involves other CPUs */
int migrate_prep_local(void)
{
	lru_add_drain();

	return 0;
}
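
The difference between the two preparation functions can be pictured with a toy model. The sketch below is an assumption-laden illustration, not kernel code: it models each CPU as holding a small batch of pages that are not yet visible on the shared LRU, with `sketch_drain_local()` standing in for `lru_add_drain()` and `sketch_drain_all()` standing in for `lru_add_drain_all()`. All names and the fixed CPU count are hypothetical.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 4

/* Hypothetical model: pages wait in per-CPU batches before reaching the LRU. */
struct sketch_state {
	int pending[SKETCH_NR_CPUS];	/* pages still held in per-CPU batches */
	int on_lru;			/* pages visible on the shared LRU */
};

/* Analogue of lru_add_drain(): flush only the given CPU's batch. */
static void sketch_drain_local(struct sketch_state *s, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	s->on_lru += s->pending[cpu];
	s->pending[cpu] = 0;
}

/* Analogue of lru_add_drain_all(): flush every CPU's batch. */
static void sketch_drain_all(struct sketch_state *s)
{
	for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		sketch_drain_local(s, cpu);
}
```

In this model a local drain leaves pages batched on other CPUs invisible to isolation, which is the trade-off accepted when work on other CPUs is undesirable.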

2006-03-22 16:09:12 +08:00
/*
Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:39 +08:00
 * Add isolated pages on the list back to the LRU under page lock
 * to avoid leaking evictable pages back onto unevictable list.
2006-03-22 16:09:12 +08:00
 */
2010-05-25 05:31:59 +08:00
void putback_lru_pages(struct list_head *l)
2006-03-22 16:09:12 +08:00
{
	struct page *page;
	struct page *page2;

	list_for_each_entry_safe(page, page2, l, lru) {
2006-06-23 17:03:51 +08:00
		list_del(&page->lru);
2009-09-22 08:01:37 +08:00
		dec_zone_page_state(page, NR_ISOLATED_ANON +
2009-09-22 08:02:59 +08:00
				page_is_file_cache(page));
2008-10-19 11:26:39 +08:00
		putback_lru_page(page);
2006-03-22 16:09:12 +08:00
	}
}
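
Two details of `putback_lru_pages()` are easy to miss: the loop unlinks entries while iterating, which is why the `_safe` iterator with a second cursor is used, and the counter index is computed by adding a 0-or-1 value to `NR_ISOLATED_ANON`. The user-space sketch below mirrors both. It assumes the file counter directly follows the anon counter, and every name in it (`sketch_node`, `sketch_putback_all()`, the enum values) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list node standing in for a page on an isolation list. */
struct sketch_node {
	struct sketch_node *next, *prev;
	bool file_backed;
};

/* Assumption: the file counter immediately follows the anon counter. */
enum { SKETCH_ISOLATED_ANON, SKETCH_ISOLATED_FILE, SKETCH_NR_COUNTERS };

static void sketch_init(struct sketch_node *head)
{
	head->next = head->prev = head;
}

static void sketch_add_tail(struct sketch_node *n, struct sketch_node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void sketch_del(struct sketch_node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->next = n->prev = NULL;	/* the node's own links are now unusable */
}

/* Unlink every node and decrement the counter for its type.
 * The successor is saved in tmp before n is unlinked, which is the
 * essence of a deletion-safe list walk. Returns the number of nodes. */
static int sketch_putback_all(struct sketch_node *head,
			      long counters[SKETCH_NR_COUNTERS])
{
	struct sketch_node *n, *tmp;
	int count = 0;

	for (n = head->next, tmp = n->next; n != head; n = tmp, tmp = n->next) {
		int idx = SKETCH_ISOLATED_ANON + n->file_backed;

		sketch_del(n);
		assert(counters[idx] > 0);
		counters[idx]--;
		count++;
	}
	return count;
}
```

Walking the list with a plain iterator would read `n->next` after `sketch_del()` has cleared it; saving the successor first avoids that.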
2006-06-23 17:03:35 +08:00
/*
 * Restore a potential migration pte to a working pte entry
 */
2009-12-15 09:59:31 +08:00
static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
				 unsigned long addr, void *old)
2006-06-23 17:03:35 +08:00
{
	struct mm_struct *mm = vma->vm_mm;
	swp_entry_t entry;
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *ptep, pte;
	spinlock_t *ptl;

2010-09-08 09:19:35 +08:00
	if (unlikely(PageHuge(new))) {
		ptep = huge_pte_offset(mm, addr);
		if (!ptep)
			goto out;
		ptl = &mm->page_table_lock;
	} else {
		pgd = pgd_offset(mm, addr);
		if (!pgd_present(*pgd))
			goto out;
2006-06-23 17:03:35 +08:00

2010-09-08 09:19:35 +08:00
		pud = pud_offset(pgd, addr);
		if (!pud_present(*pud))
			goto out;
2006-06-23 17:03:35 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
pmd = pmd_offset(pud, addr);
|
2011-01-14 07:46:55 +08:00
|
|
|
if (pmd_trans_huge(*pmd))
|
|
|
|
goto out;
|
2010-09-08 09:19:35 +08:00
|
|
|
if (!pmd_present(*pmd))
|
|
|
|
goto out;
|
2006-06-23 17:03:35 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
ptep = pte_offset_map(pmd, addr);
|
2006-06-23 17:03:35 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
if (!is_swap_pte(*ptep)) {
|
|
|
|
pte_unmap(ptep);
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
|
|
|
ptl = pte_lockptr(mm, pmd);
|
|
|
|
}
|
2006-06-23 17:03:35 +08:00
|
|
|
|
|
|
|
spin_lock(ptl);
|
|
|
|
pte = *ptep;
|
|
|
|
if (!is_swap_pte(pte))
|
ksm: rmap_walk to remove_migration_ptes
A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.
Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).
rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.
For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
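The callback-driven walk that this commit message describes can be sketched in userspace as follows. This is only an illustration of the pattern: `toy_vma`, `toy_rmap_walk`, `toy_remove_migration_pte` and `WALK_CONTINUE` are hypothetical names, the list of mappings is a plain singly linked list, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

#define WALK_CONTINUE 0	/* hypothetical "keep walking" return code */

/* Toy stand-in for one mapping of the page being migrated. */
struct toy_vma {
	unsigned long start;		/* address at which the page is mapped */
	int has_migration_entry;	/* 1 while a migration entry is in place */
	struct toy_vma *next;
};

typedef int (*rmap_one_fn)(struct toy_vma *vma, unsigned long addr, void *arg);

/* Visit every mapping in list order; stop early if the callback asks to. */
static int toy_rmap_walk(struct toy_vma *head, rmap_one_fn rmap_one, void *arg)
{
	int ret = WALK_CONTINUE;

	assert(rmap_one != NULL);
	for (struct toy_vma *vma = head; vma; vma = vma->next) {
		ret = rmap_one(vma, vma->start, arg);
		if (ret != WALK_CONTINUE)
			break;
	}
	return ret;
}

/* Callback: replace a migration entry with a normal mapping and count it. */
static int toy_remove_migration_pte(struct toy_vma *vma, unsigned long addr,
				    void *arg)
{
	int *restored = arg;

	(void)addr;	/* a real walker would use this to locate the pte */
	if (vma->has_migration_entry) {
		vma->has_migration_entry = 0;
		(*restored)++;
	}
	return WALK_CONTINUE;
}
```

The point of the design is that the walker knows how to find every mapping while the callback knows what to do with each one, so the same walk could serve anon, file and ksm pages with different list-building code behind it.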
2009-12-15 09:59:31 +08:00
|
|
|
goto unlock;
|
2006-06-23 17:03:35 +08:00
|
|
|
|
|
|
|
entry = pte_to_swp_entry(pte);
|
|
|
|
|
2009-12-15 09:59:31 +08:00
|
|
|
if (!is_migration_entry(entry) ||
|
|
|
|
migration_entry_to_page(entry) != old)
|
|
|
|
goto unlock;
|
2006-06-23 17:03:35 +08:00
|
|
|
|
|
|
|
get_page(new);
|
|
|
|
pte = pte_mkold(mk_pte(new, vma->vm_page_prot));
|
|
|
|
if (is_write_migration_entry(entry))
|
|
|
|
pte = pte_mkwrite(pte);
|
2010-10-11 22:03:21 +08:00
|
|
|
#ifdef CONFIG_HUGETLB_PAGE
|
2010-09-08 09:19:35 +08:00
|
|
|
if (PageHuge(new))
|
|
|
|
pte = pte_mkhuge(pte);
|
2010-10-11 22:03:21 +08:00
|
|
|
#endif
|
2007-10-16 16:25:43 +08:00
|
|
|
flush_cache_page(vma, addr, pte_pfn(pte));
|
2006-06-23 17:03:35 +08:00
|
|
|
set_pte_at(mm, addr, ptep, pte);
|
2006-06-23 17:03:38 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
if (PageHuge(new)) {
|
|
|
|
if (PageAnon(new))
|
|
|
|
hugepage_add_anon_rmap(new, vma, addr);
|
|
|
|
else
|
|
|
|
page_dup_rmap(new);
|
|
|
|
} else if (PageAnon(new))
|
2006-06-23 17:03:38 +08:00
|
|
|
page_add_anon_rmap(new, vma, addr);
|
|
|
|
else
|
|
|
|
page_add_file_rmap(new);
|
|
|
|
|
|
|
|
/* No need to invalidate - it was non-present before */
|
MM: Pass a PTE pointer to update_mmu_cache() rather than the PTE itself
On VIVT ARM, when we have multiple shared mappings of the same file
in the same MM, we need to ensure that we have coherency across all
copies. We do this via make_coherent() by making the pages
uncacheable.
This used to work fine, until we allowed highmem with highpte - we
now have a page table which is mapped as required, and is not available
for modification via update_mmu_cache().
Ralf Baechle suggested getting rid of the PTE value passed to
update_mmu_cache():
On MIPS update_mmu_cache() calls __update_tlb() which walks pagetables
to construct a pointer to the pte again. Passing a pte_t * is much
more elegant. Maybe we might even replace the pte argument with the
pte_t?
Ben Herrenschmidt would also like the pte pointer for PowerPC:
Passing the ptep in there is exactly what I want. I want that
-instead- of the PTE value, because I have issue on some ppc cases,
for I$/D$ coherency, where set_pte_at() may decide to mask out the
_PAGE_EXEC.
So, pass in the mapped page table pointer into update_mmu_cache(), and
remove the PTE value, updating all implementations and call sites to
suit.
Includes a fix from Stephen Rothwell:
sparc: fix fallout from update_mmu_cache API change
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
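The PowerPC concern quoted above (set_pte_at() may mask out an execute bit) suggests why a pointer is more useful than a value. The following userspace sketch assumes a toy 32-bit pte and a single hypothetical `TOY_PAGE_EXEC` flag; none of the `toy_` functions are kernel APIs, and each hook simply returns the pte value it was able to observe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t toy_pte_t;

#define TOY_PAGE_EXEC 0x1u	/* hypothetical execute-permission bit */

/* Stand-in for set_pte_at(): may store something other than what was asked. */
static void toy_set_pte_at(toy_pte_t *ptep, toy_pte_t pte)
{
	*ptep = pte & ~TOY_PAGE_EXEC;
}

/* Old-style hook: receives the caller's copy of the value. */
static toy_pte_t toy_update_by_value(toy_pte_t pte)
{
	return pte;
}

/* New-style hook: receives the slot and reads what was really stored. */
static toy_pte_t toy_update_by_pointer(const toy_pte_t *ptep)
{
	assert(ptep != NULL);
	return *ptep;
}
```

After `toy_set_pte_at()` strips the flag, the by-value hook still sees the flag set in the caller's copy, whereas the by-pointer hook sees the stored entry without it.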
2009-12-19 00:40:18 +08:00
|
|
|
update_mmu_cache(vma, addr, ptep);
|
2009-12-15 09:59:31 +08:00
|
|
|
unlock:
|
[PATCH] Swapless page migration: add R/W migration entries
Implement read/write migration ptes
We take the upper two swapfiles for the two types of migration ptes and define
a series of macros in swapops.h.
The VM is modified to handle the migration entries. migration entries can
only be encountered when the page they are pointing to is locked. This limits
the number of places one has to fix. We also check in copy_pte_range and in
mprotect_pte_range() for migration ptes.
We check for migration ptes in do_swap_cache and call a function that will
then wait on the page lock. This allows us to effectively stop all accesses
to a page.
Migration entries are created by try_to_unmap if called for migration and
removed by local functions in migrate.c
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration (I've no NUMA, just
hacking it up to migrate recklessly while running load), I've hit the
BUG_ON(!PageLocked(p)) in migration_entry_to_page.
This comes from an orphaned migration entry, unrelated to the current
correctly locked migration, but hit by remove_anon_migration_ptes as it
checks an address in each vma of the anon_vma list.
Such an orphan may be left behind if an earlier migration raced with fork:
copy_one_pte can duplicate a migration entry from parent to child, after
remove_anon_migration_ptes has checked the child vma, but before it has
removed it from the parent vma. (If the process were later to fault on this
orphaned entry, it would hit the same BUG from migration_entry_wait.)
This could be fixed by locking anon_vma in copy_one_pte, but we'd rather
not. There's no such problem with file pages, because vma_prio_tree_add
adds child vma after parent vma, and the page table locking at each end is
enough to serialize. Follow that example with anon_vma: add new vmas to the
tail instead of the head.
(There's no corresponding problem when inserting migration entries,
because a missed pte will leave the page count and mapcount high, which is
allowed for. And there's no corresponding problem when migrating via swap,
because a leftover swap entry will be correctly faulted. But the swapless
method has no refcounting of its entries.)
From: Ingo Molnar <mingo@elte.hu>
pte_unmap_unlock() takes the pte pointer as an argument.
From: Hugh Dickins <hugh@veritas.com>
Several times while testing swapless page migration, gcc has tried to exec
a pointer instead of a string: smells like COW mappings are not being
properly write-protected on fork.
The protection in copy_one_pte looks very convincing, until at last you
realize that the second arg to make_migration_entry is a boolean "write",
and SWP_MIGRATION_READ is 30.
Anyway, it's better done like in change_pte_range, using
is_write_migration_entry and make_migration_entry_read.
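The pitfall described above can be reproduced in a small user-space sketch. This is not the kernel code: `struct toy_entry` and every `toy_*` name are hypothetical stand-ins, and the value 31 for the write type is an assumption (only 30 for SWP_MIGRATION_READ is stated in the message). It shows how passing the constant 30 where a boolean "write" is expected silently yields a write entry, and how the test-then-downgrade pattern avoids that.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy swap-type numbers: 30 is taken from the commit message, 31 is assumed. */
enum {
	TOY_SWP_MIGRATION_READ  = 30,
	TOY_SWP_MIGRATION_WRITE = 31,
};

/* Hypothetical, simplified entry; the real swp_entry_t packs both fields. */
struct toy_entry {
	unsigned int type;
	unsigned long offset;
};

/* The second argument is a boolean "write" flag, not a swap type. */
static struct toy_entry toy_make_migration_entry(unsigned long offset, bool write)
{
	struct toy_entry e = {
		.type = write ? TOY_SWP_MIGRATION_WRITE : TOY_SWP_MIGRATION_READ,
		.offset = offset,
	};
	return e;
}

static bool toy_is_write_migration_entry(struct toy_entry e)
{
	return e.type == TOY_SWP_MIGRATION_WRITE;
}

static void toy_make_migration_entry_read(struct toy_entry *e)
{
	assert(e);
	e->type = TOY_SWP_MIGRATION_READ;
}

/* Buggy pattern: the non-zero constant 30 converts to true, i.e. "write". */
static struct toy_entry toy_protect_buggy(struct toy_entry e)
{
	return toy_make_migration_entry(e.offset, TOY_SWP_MIGRATION_READ);
}

/* Fixed pattern: test for a write entry, then downgrade it in place. */
static struct toy_entry toy_protect_fixed(struct toy_entry e)
{
	if (toy_is_write_migration_entry(e))
		toy_make_migration_entry_read(&e);
	return e;
}
```

Calling `toy_protect_buggy()` on a write entry leaves it a write entry, whereas `toy_protect_fixed()` returns a read entry with the same offset.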
From: Hugh Dickins <hugh@veritas.com>
Remove unnecessary obfuscation from sys_swapon's range check on swap type,
which blew up causing memory corruption once swapless migration made
MAX_SWAPFILES no longer 2 ^ MAX_SWAPFILES_SHIFT.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Christoph Lameter <clameter@engr.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
From: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23 17:03:35 +08:00
	pte_unmap_unlock(ptep, ptl);
ksm: rmap_walk to remove_migation_ptes
A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.
Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).
rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.
For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 09:59:31 +08:00
out:
	return SWAP_AGAIN;
}
2006-06-23 17:03:38 +08:00
/*
 * Get rid of all migration entries and replace them by
 * references to the indicated page.
 */
static void remove_migration_ptes(struct page *old, struct page *new)
{
	rmap_walk(new, remove_migration_pte, old);
}
2006-06-23 17:03:35 +08:00
/*
 * Something used the pte of a page under migration. We need to
 * get to the page and wait until migration is finished.
 * When we return from this function the fault will be retried.
 *
 * This function is called from do_swap_page().
 */
void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
				unsigned long address)
{
	pte_t *ptep, pte;
	spinlock_t *ptl;
	swp_entry_t entry;
	struct page *page;

	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
	pte = *ptep;
	if (!is_swap_pte(pte))
		goto out;

	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry))
		goto out;

	page = migration_entry_to_page(entry);

	/*
	 * Once the radix-tree replacement step of page migration has
	 * started, page_count *must* be zero. And we don't want to call
	 * wait_on_page_locked() against a page without get_page().
	 * So we use get_page_unless_zero() here. Even if it fails, the
	 * page fault will simply occur again.
	 */
	if (!get_page_unless_zero(page))
		goto out;
	pte_unmap_unlock(ptep, ptl);
	wait_on_page_locked(page);
	put_page(page);
	return;
out:
	pte_unmap_unlock(ptep, ptl);
}
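The "take a reference unless the count is zero" step used by migration_entry_wait() can be illustrated outside the kernel. The following is a minimal user-space sketch with C11 atomics, not the kernel's implementation: `struct toy_page`, `toy_get_unless_zero()` and `toy_put()` are hypothetical names, and the page is reduced to its reference count.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count. */
struct toy_page {
	atomic_int refcount;
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true on success, false if the count was zero (for example
 * because a migration has frozen the page).
 */
static bool toy_get_unless_zero(struct toy_page *p)
{
	int old = atomic_load(&p->refcount);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference previously taken with toy_get_unless_zero(). */
static void toy_put(struct toy_page *p)
{
	int old = atomic_fetch_sub(&p->refcount, 1);

	assert(old > 0);
	(void)old;
}
```

A caller that gets `false` back must not touch the page further; in the function above that corresponds to unlocking the pte and returning so that the fault is retried.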
2006-03-22 16:09:12 +08:00
/*
 * Replace the page in the mapping.
 *
 * The number of remaining references must be:
 * 1 for anonymous pages without a mapping
 * 2 for pages with a mapping
 * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
 */
static int migrate_page_move_mapping(struct address_space *mapping,
		struct page *newpage, struct page *page)
{
	int expected_count;
	void **pslot;

	if (!mapping) {
		/* Anonymous page without mapping */
		if (page_count(page) != 1)
			return -EAGAIN;
		return 0;
	}

	spin_lock_irq(&mapping->tree_lock);

	pslot = radix_tree_lookup_slot(&mapping->page_tree,
					page_index(page));

	expected_count = 2 + page_has_private(page);
	if (page_count(page) != expected_count ||
		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	if (!page_freeze_refs(page, expected_count)) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	/*
	 * Now we know that no one else is looking at the page.
	 */
	get_page(newpage);	/* add cache reference */
	if (PageSwapCache(page)) {
		SetPageSwapCache(newpage);
		set_page_private(newpage, page_private(page));
	}

	radix_tree_replace_slot(pslot, newpage);

	page_unfreeze_refs(page, expected_count);
	/*
	 * Drop cache reference from old page.
	 * We know this isn't the last reference.
	 */
	__put_page(page);

	/*
	 * If moved to a different zone then also account
	 * the page for that zone. Other VM counters will be
	 * taken care of when we establish references to the
	 * new page and drop references to the old page.
	 *
	 * Note that anonymous pages are accounted for
	 * via NR_FILE_PAGES and NR_ANON_PAGES if they
	 * are mapped to swap space.
	 */
	__dec_zone_page_state(page, NR_FILE_PAGES);
	__inc_zone_page_state(newpage, NR_FILE_PAGES);
	if (PageSwapBacked(page)) {
		__dec_zone_page_state(page, NR_SHMEM);
		__inc_zone_page_state(newpage, NR_SHMEM);
	}
	spin_unlock_irq(&mapping->tree_lock);

	return 0;
}
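The freeze, replace, unfreeze sequence in migrate_page_move_mapping() can be modelled in user space. The sketch below is an illustration under simplifying assumptions, not the kernel code: all `demo_*` names are hypothetical, the radix-tree slot is reduced to a plain pointer, and the tree_lock and zone accounting are left out. The freeze is modelled as a compare-and-swap from the expected count to zero, so it fails if anyone else holds a reference.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count. */
struct demo_page {
	atomic_int refcount;
};

/* Succeeds only if the count is exactly 'expected'; leaves it at zero. */
static bool demo_freeze_refs(struct demo_page *p, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&p->refcount, &old, 0);
}

/* Restores the count of a previously frozen page. */
static void demo_unfreeze_refs(struct demo_page *p, int count)
{
	assert(atomic_load(&p->refcount) == 0);
	atomic_store(&p->refcount, count);
}

/*
 * Point *slot at newpage instead of page. Returns 0 on success and
 * -EAGAIN if the slot or the reference count is not as expected.
 * No locking is modelled here; the kernel holds mapping->tree_lock.
 */
static int demo_move_mapping(struct demo_page **slot, struct demo_page *newpage,
			     struct demo_page *page, bool has_private)
{
	int expected = 2 + (has_private ? 1 : 0);

	if (*slot != page || !demo_freeze_refs(page, expected))
		return -EAGAIN;

	atomic_fetch_add(&newpage->refcount, 1);	/* cache reference */
	*slot = newpage;
	demo_unfreeze_refs(page, expected);
	atomic_fetch_sub(&page->refcount, 1);		/* drop old cache reference */
	return 0;
}
```

With an old page holding exactly two references the call succeeds and the slot points at the new page; with an extra reference it returns -EAGAIN and changes nothing, which mirrors the retry behaviour of the function above.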
2010-09-08 09:19:35 +08:00
/*
 * The expected number of remaining references is the same as that
 * of migrate_page_move_mapping().
 */
int migrate_huge_page_move_mapping(struct address_space *mapping,
				   struct page *newpage, struct page *page)
{
	int expected_count;
	void **pslot;

	if (!mapping) {
		if (page_count(page) != 1)
			return -EAGAIN;
		return 0;
	}

	spin_lock_irq(&mapping->tree_lock);

	pslot = radix_tree_lookup_slot(&mapping->page_tree,
					page_index(page));

	expected_count = 2 + page_has_private(page);
	if (page_count(page) != expected_count ||
		radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	if (!page_freeze_refs(page, expected_count)) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	get_page(newpage);

	radix_tree_replace_slot(pslot, newpage);

	page_unfreeze_refs(page, expected_count);

	__put_page(page);

	spin_unlock_irq(&mapping->tree_lock);
	return 0;
}
2006-03-22 16:09:12 +08:00
/*
 * Copy the page to its new location
 */
void migrate_page_copy(struct page *newpage, struct page *page)
{
	if (PageHuge(page))
		copy_huge_page(newpage, page);
	else
		copy_highpage(newpage, page);

	if (PageError(page))
		SetPageError(newpage);
	if (PageReferenced(page))
		SetPageReferenced(newpage);
	if (PageUptodate(page))
		SetPageUptodate(newpage);
Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:39 +08:00
	if (TestClearPageActive(page)) {
		VM_BUG_ON(PageUnevictable(page));
		SetPageActive(newpage);
	} else if (TestClearPageUnevictable(page))
		SetPageUnevictable(newpage);
	if (PageChecked(page))
		SetPageChecked(newpage);
	if (PageMappedToDisk(page))
		SetPageMappedToDisk(newpage);

	if (PageDirty(page)) {
		clear_page_dirty_for_io(page);
		/*
		 * Want to mark the page and the radix tree as dirty, and
		 * redo the accounting that clear_page_dirty_for_io undid,
		 * but we can't use set_page_dirty because that function
		 * is actually a signal that all of the page has become dirty.
		 * Whereas only part of our page may be dirty.
		 */
		__set_page_dirty_nobuffers(newpage);
	}
mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism lets pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
New munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
because the current get_user_pages() can't grab PROT_NONE pages, and
therefore PROT_NONE pages can't be munlocked.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:44 +08:00
	mlock_migrate_page(newpage, page);
2009-12-15 09:59:31 +08:00
|
|
|
ksm_migrate_page(newpage, page);
|
mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism lets pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
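The rule in item 4 can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_page`, `toy_try_to_unlock()` and `TOY_VM_LOCKED` are hypothetical names, and a plain array of flag words stands in for the rmap walk over the vmas that map the page.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: toy flag bit, not the kernel's real VM_LOCKED value. */
#define TOY_VM_LOCKED 0x1UL

/* Toy page: only the "mlocked" flag is modelled. */
struct toy_page {
	int mlocked;
};

/*
 * Walk the flags of every vma that maps the page (an array here, standing
 * in for the rmap walk) and clear the page's mlocked flag only if no vma
 * still has the page locked.
 * Returns 1 if the flag was cleared, 0 if some vma keeps it locked.
 */
int toy_try_to_unlock(struct toy_page *page,
		      const unsigned long *vm_flags, size_t nr_vmas)
{
	size_t i;

	assert(page != NULL);
	for (i = 0; i < nr_vmas; i++) {
		if (vm_flags[i] & TOY_VM_LOCKED)
			return 0;	/* another vma still holds it mlocked */
	}
	page->mlocked = 0;
	return 1;
}
```

The model shows why no separate mlock count is needed: the flag is recomputed from the mappings each time one of them is unlocked.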
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
New munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS, because the
current get_user_pages() can't grab PROT_NONE pages and therefore
PROT_NONE pages can't be munlocked.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:44 +08:00
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
ClearPageSwapCache(page);
|
|
|
|
ClearPagePrivate(page);
|
|
|
|
set_page_private(page, 0);
|
|
|
|
page->mapping = NULL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If any waiters have accumulated on the new page then
|
|
|
|
* wake them up.
|
|
|
|
*/
|
|
|
|
if (PageWriteback(newpage))
|
|
|
|
end_page_writeback(newpage);
|
|
|
|
}
|
|
|
|
|
2006-06-23 17:03:28 +08:00
|
|
|
/************************************************************
|
|
|
|
* Migration functions
|
|
|
|
***********************************************************/
|
|
|
|
|
|
|
|
/* Always fail migration. Used for mappings that are not movable */
|
2006-06-23 17:03:33 +08:00
|
|
|
int fail_migrate_page(struct address_space *mapping,
|
|
|
|
struct page *newpage, struct page *page)
|
2006-06-23 17:03:28 +08:00
|
|
|
{
|
|
|
|
return -EIO;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(fail_migrate_page);
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
/*
|
|
|
|
* Common logic to directly migrate a single page suitable for
|
2009-04-03 23:42:36 +08:00
|
|
|
* pages that do not use PagePrivate/PagePrivate2.
|
2006-03-22 16:09:12 +08:00
|
|
|
*
|
|
|
|
* Pages are locked upon entry and exit.
|
|
|
|
*/
|
2006-06-23 17:03:33 +08:00
|
|
|
int migrate_page(struct address_space *mapping,
|
|
|
|
struct page *newpage, struct page *page)
|
2006-03-22 16:09:12 +08:00
|
|
|
{
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
BUG_ON(PageWriteback(page)); /* Writeback must be complete */
|
|
|
|
|
2006-06-23 17:03:33 +08:00
|
|
|
rc = migrate_page_move_mapping(mapping, newpage, page);
|
2006-03-22 16:09:12 +08:00
|
|
|
|
|
|
|
if (rc)
|
|
|
|
return rc;
|
|
|
|
|
|
|
|
migrate_page_copy(newpage, page);
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(migrate_page);
|
|
|
|
|
[PATCH] BLOCK: Make it possible to disable the block layer [try #6]
Make it possible to disable the block layer. Not all embedded devices require
it, some can make do with just JFFS2, NFS, ramfs, etc - none of which require
the block layer to be present.
This patch does the following:
 (*) Introduces CONFIG_BLOCK to disable the block layer, buffering and blockdev
     support.
 (*) Adds dependencies on CONFIG_BLOCK to any configuration item that controls
     an item that uses the block layer. This includes:
     (*) Block I/O tracing.
     (*) Disk partition code.
     (*) All filesystems that are block based, eg: Ext3, ReiserFS, ISOFS.
     (*) The SCSI layer. As far as I can tell, even SCSI chardevs use the
         block layer to do scheduling. Some drivers that use SCSI facilities -
         such as USB storage - end up disabled indirectly from this.
     (*) Various block-based device drivers, such as IDE and the old CDROM
         drivers.
     (*) MTD blockdev handling and FTL.
     (*) JFFS - which uses set_bdev_super(), something it could avoid doing by
         taking a leaf out of JFFS2's book.
 (*) Makes most of the contents of linux/blkdev.h, linux/buffer_head.h and
     linux/elevator.h contingent on CONFIG_BLOCK being set. sector_div() is,
     however, still used in places, and so is still available.
 (*) Also made contingent are the contents of linux/mpage.h, linux/genhd.h and
     parts of linux/fs.h.
 (*) Makes a number of files in fs/ contingent on CONFIG_BLOCK.
 (*) Makes mm/bounce.c (bounce buffering) contingent on CONFIG_BLOCK.
 (*) set_page_dirty() doesn't call __set_page_dirty_buffers() if CONFIG_BLOCK
     is not enabled.
 (*) fs/no-block.c is created to hold out-of-line stubs and things that are
     required when CONFIG_BLOCK is not set:
     (*) Default blockdev file operations (to give error ENODEV on opening).
 (*) Makes some /proc changes:
     (*) /proc/devices does not list any blockdevs.
     (*) /proc/diskstats and /proc/partitions are contingent on CONFIG_BLOCK.
 (*) Makes some compat ioctl handling contingent on CONFIG_BLOCK.
 (*) If CONFIG_BLOCK is not defined, makes sys_quotactl() return -ENODEV if
     given command other than Q_SYNC or if a special device is specified.
 (*) In init/do_mounts.c, no reference is made to the blockdev routines if
     CONFIG_BLOCK is not defined. This does not prohibit NFS roots or JFFS2.
 (*) The bdflush, ioprio_set and ioprio_get syscalls can now be absent (return
     error ENOSYS by way of cond_syscall if so).
 (*) The seclvl_bd_claim() and seclvl_bd_release() security calls do nothing if
     CONFIG_BLOCK is not set, since they can't then happen.
Signed-Off-By: David Howells <dhowells@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2006-10-01 02:45:40 +08:00
|
|
|
#ifdef CONFIG_BLOCK
|
2006-06-23 17:03:28 +08:00
|
|
|
/*
|
|
|
|
* Migration function for pages with buffers. This function can only be used
|
|
|
|
* if the underlying filesystem guarantees that no other references to "page"
|
|
|
|
* exist.
|
|
|
|
*/
|
2006-06-23 17:03:33 +08:00
|
|
|
int buffer_migrate_page(struct address_space *mapping,
|
|
|
|
struct page *newpage, struct page *page)
|
2006-06-23 17:03:28 +08:00
|
|
|
{
|
|
|
|
struct buffer_head *bh, *head;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (!page_has_buffers(page))
|
2006-06-23 17:03:33 +08:00
|
|
|
return migrate_page(mapping, newpage, page);
|
2006-06-23 17:03:28 +08:00
|
|
|
|
|
|
|
head = page_buffers(page);
|
|
|
|
|
2006-06-23 17:03:33 +08:00
|
|
|
rc = migrate_page_move_mapping(mapping, newpage, page);
|
2006-06-23 17:03:28 +08:00
|
|
|
|
|
|
|
if (rc)
|
|
|
|
return rc;
|
|
|
|
|
|
|
|
bh = head;
|
|
|
|
do {
|
|
|
|
get_bh(bh);
|
|
|
|
lock_buffer(bh);
|
|
|
|
bh = bh->b_this_page;
|
|
|
|
|
|
|
|
} while (bh != head);
|
|
|
|
|
|
|
|
ClearPagePrivate(page);
|
|
|
|
set_page_private(newpage, page_private(page));
|
|
|
|
set_page_private(page, 0);
|
|
|
|
put_page(page);
|
|
|
|
get_page(newpage);
|
|
|
|
|
|
|
|
bh = head;
|
|
|
|
do {
|
|
|
|
set_bh_page(bh, newpage, bh_offset(bh));
|
|
|
|
bh = bh->b_this_page;
|
|
|
|
|
|
|
|
} while (bh != head);
|
|
|
|
|
|
|
|
SetPagePrivate(newpage);
|
|
|
|
|
|
|
|
migrate_page_copy(newpage, page);
|
|
|
|
|
|
|
|
bh = head;
|
|
|
|
do {
|
|
|
|
unlock_buffer(bh);
|
|
|
|
put_bh(bh);
|
|
|
|
bh = bh->b_this_page;
|
|
|
|
|
|
|
|
} while (bh != head);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL(buffer_migrate_page);
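The three do/while loops in buffer_migrate_page() all walk the same circular list of buffer heads. The sketch below models that pattern in userspace, assuming a hypothetical `toy_bh` structure whose `next` pointer plays the role of `b_this_page`; it is an illustration of the traversal, not the kernel's data structure.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct buffer_head: only the fields this sketch needs. */
struct toy_bh {
	struct toy_bh *next;	/* plays the role of b_this_page (circular) */
	int page_id;		/* which page the buffer currently belongs to */
	int locked;		/* 1 while the buffer is "locked" */
	int refs;		/* extra references held on the buffer */
};

/*
 * Retarget every buffer on a circular ring to new_page_id using the same
 * three passes as the function above: pin and lock all buffers, repoint
 * them, then unlock and unpin.
 * Returns the number of buffers on the ring.
 */
int toy_retarget_ring(struct toy_bh *head, int new_page_id)
{
	struct toy_bh *bh;
	int count = 0;

	assert(head != NULL);

	bh = head;
	do {			/* pass 1: pin and lock */
		bh->refs++;
		bh->locked = 1;
		bh = bh->next;
	} while (bh != head);

	bh = head;
	do {			/* pass 2: repoint at the new page */
		bh->page_id = new_page_id;
		count++;
		bh = bh->next;
	} while (bh != head);

	bh = head;
	do {			/* pass 3: unlock and unpin */
		bh->locked = 0;
		bh->refs--;
		bh = bh->next;
	} while (bh != head);

	return count;
}
```

A do/while loop is used because the ring always contains at least the head; a plain while loop testing `bh != head` would never run its body.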
|
2006-10-01 02:45:40 +08:00
|
|
|
#endif
|
2006-06-23 17:03:28 +08:00
|
|
|
|
2006-06-23 17:03:38 +08:00
|
|
|
/*
|
|
|
|
* Writeback a page to clean the dirty state
|
|
|
|
*/
|
|
|
|
static int writeout(struct address_space *mapping, struct page *page)
|
2006-06-23 17:03:33 +08:00
|
|
|
{
|
2006-06-23 17:03:38 +08:00
|
|
|
struct writeback_control wbc = {
|
|
|
|
.sync_mode = WB_SYNC_NONE,
|
|
|
|
.nr_to_write = 1,
|
|
|
|
.range_start = 0,
|
|
|
|
.range_end = LLONG_MAX,
|
|
|
|
.for_reclaim = 1
|
|
|
|
};
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (!mapping->a_ops->writepage)
|
|
|
|
/* No write method for the address space */
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if (!clear_page_dirty_for_io(page))
|
|
|
|
/* Someone else already triggered a write */
|
|
|
|
return -EAGAIN;
|
|
|
|
|
2006-06-23 17:03:33 +08:00
|
|
|
/*
|
2006-06-23 17:03:38 +08:00
|
|
|
* A dirty page may imply that the underlying filesystem has
|
|
|
|
* the page on some queue. So the page must be clean for
|
|
|
|
* migration. Writeout may mean we lose the lock and the
|
|
|
|
* page state is no longer what we checked for earlier.
|
|
|
|
* At this point we know that the migration attempt cannot
|
|
|
|
* be successful.
|
2006-06-23 17:03:33 +08:00
|
|
|
*/
|
2006-06-23 17:03:38 +08:00
|
|
|
remove_migration_ptes(page, page);
|
2006-06-23 17:03:33 +08:00
|
|
|
|
2006-06-23 17:03:38 +08:00
|
|
|
rc = mapping->a_ops->writepage(page, &wbc);
|
2006-06-23 17:03:33 +08:00
|
|
|
|
2006-06-23 17:03:38 +08:00
|
|
|
if (rc != AOP_WRITEPAGE_ACTIVATE)
|
|
|
|
/* unlocked. Relock */
|
|
|
|
lock_page(page);
|
|
|
|
|
2008-11-20 07:36:36 +08:00
|
|
|
return (rc < 0) ? -EIO : -EAGAIN;
|
2006-06-23 17:03:38 +08:00
|
|
|
}
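Note that writeout() never reports success: once ->writepage() has run, the page state may have changed, so even a clean write is turned into -EAGAIN and the caller has to retry. The sketch below isolates the two decisions made after the ->writepage() call. The helper names are hypothetical, and `TOY_WRITEPAGE_ACTIVATE` is an assumed stand-in for the kernel's AOP_WRITEPAGE_ACTIVATE.

```c
#include <assert.h>
#include <errno.h>

/* Assumption: toy stand-in for the kernel's AOP_WRITEPAGE_ACTIVATE. */
#define TOY_WRITEPAGE_ACTIVATE 0x80000

/* Returns 1 when the page came back unlocked and must be relocked. */
int toy_needs_relock(int rc)
{
	return rc != TOY_WRITEPAGE_ACTIVATE;
}

/* Maps a ->writepage() result to the value writeout() hands back. */
int toy_writeout_result(int rc)
{
	return (rc < 0) ? -EIO : -EAGAIN;
}

/* Usage example: a successful write still asks the caller to retry. */
void toy_writeout_demo(void)
{
	assert(toy_writeout_result(0) == -EAGAIN);
	assert(toy_writeout_result(-ENOSPC) == -EIO);
	assert(toy_needs_relock(0));
	assert(!toy_needs_relock(TOY_WRITEPAGE_ACTIVATE));
}
```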
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Default handling if a filesystem does not provide a migration function.
|
|
|
|
*/
|
|
|
|
static int fallback_migrate_page(struct address_space *mapping,
|
|
|
|
struct page *newpage, struct page *page)
|
|
|
|
{
|
|
|
|
if (PageDirty(page))
|
|
|
|
return writeout(mapping, page);
|
2006-06-23 17:03:33 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Buffers may be managed in a filesystem specific way.
|
|
|
|
* We must have no buffers or drop them.
|
|
|
|
*/
|
2009-04-03 23:42:36 +08:00
|
|
|
if (page_has_private(page) &&
|
2006-06-23 17:03:33 +08:00
|
|
|
!try_to_release_page(page, GFP_KERNEL))
|
|
|
|
return -EAGAIN;
|
|
|
|
|
|
|
|
return migrate_page(mapping, newpage, page);
|
|
|
|
}
|
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
/*
|
|
|
|
* Move a page to a newly allocated page
|
|
|
|
* The page is locked and all ptes have been successfully removed.
|
|
|
|
*
|
|
|
|
* The new page will have replaced the old page if this function
|
|
|
|
* is successful.
|
Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:39 +08:00
|
|
|
*
|
|
|
|
* Return value:
|
|
|
|
* < 0 - error code
|
|
|
|
* == 0 - success
|
2006-06-23 17:03:51 +08:00
|
|
|
*/
|
2010-05-25 05:32:20 +08:00
|
|
|
static int move_to_new_page(struct page *newpage, struct page *page,
|
|
|
|
int remap_swapcache)
|
2006-06-23 17:03:51 +08:00
|
|
|
{
|
|
|
|
struct address_space *mapping;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Block others from accessing the page when we get around to
|
|
|
|
* establishing additional references. We are the only one
|
|
|
|
* holding a reference to the new page at this point.
|
|
|
|
*/
|
2008-08-02 18:01:03 +08:00
|
|
|
if (!trylock_page(newpage))
|
2006-06-23 17:03:51 +08:00
|
|
|
BUG();
|
|
|
|
|
|
|
|
/* Prepare mapping for the new page.*/
|
|
|
|
newpage->index = page->index;
|
|
|
|
newpage->mapping = page->mapping;
|
2008-10-19 11:26:30 +08:00
|
|
|
if (PageSwapBacked(page))
|
|
|
|
SetPageSwapBacked(newpage);
|
2006-06-23 17:03:51 +08:00
|
|
|
|
|
|
|
mapping = page_mapping(page);
|
|
|
|
if (!mapping)
|
|
|
|
rc = migrate_page(mapping, newpage, page);
|
|
|
|
else if (mapping->a_ops->migratepage)
|
|
|
|
/*
|
|
|
|
* Most pages have a mapping and most filesystems
|
|
|
|
* should provide a migration function. Anonymous
|
|
|
|
* pages are part of swap space which also has its
|
|
|
|
* own migration function. This is the most common
|
|
|
|
* path for page migration.
|
|
|
|
*/
|
|
|
|
rc = mapping->a_ops->migratepage(mapping,
|
|
|
|
newpage, page);
|
|
|
|
else
|
|
|
|
rc = fallback_migrate_page(mapping, newpage, page);
|
|
|
|
|
2010-05-25 05:32:20 +08:00
|
|
|
if (rc) {
|
2006-06-23 17:03:51 +08:00
|
|
|
newpage->mapping = NULL;
|
2010-05-25 05:32:20 +08:00
|
|
|
} else {
|
|
|
|
if (remap_swapcache)
|
|
|
|
remove_migration_ptes(page, newpage);
|
|
|
|
}
|
2006-06-23 17:03:51 +08:00
|
|
|
|
|
|
|
unlock_page(newpage);
|
|
|
|
|
|
|
|
return rc;
|
|
|
|
}
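move_to_new_page() picks one of three migration paths depending on whether the page has a mapping and whether that mapping supplies its own hook. A minimal userspace sketch of that choice follows; `toy_mapping`, `toy_pick_path()` and the enum values are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum toy_path { TOY_PATH_PLAIN, TOY_PATH_FS_HOOK, TOY_PATH_FALLBACK };

/* Toy address space: only the optional migration hook is modelled. */
struct toy_mapping {
	int (*migratepage)(void);	/* NULL when the filesystem has no hook */
};

/*
 * Mirrors the three-way choice in move_to_new_page():
 *   no mapping           -> plain migrate_page()
 *   mapping with a hook  -> the filesystem's own ->migratepage()
 *   mapping, no hook     -> fallback_migrate_page()
 */
enum toy_path toy_pick_path(const struct toy_mapping *mapping)
{
	if (!mapping)
		return TOY_PATH_PLAIN;
	if (mapping->migratepage)
		return TOY_PATH_FS_HOOK;
	return TOY_PATH_FALLBACK;
}

/* Hypothetical hook, used only to populate a mapping in the demo. */
int toy_hook(void)
{
	return 0;
}

/* Usage example covering all three paths. */
void toy_pick_path_demo(void)
{
	struct toy_mapping with_hook = { toy_hook };
	struct toy_mapping without_hook = { NULL };

	assert(toy_pick_path(NULL) == TOY_PATH_PLAIN);
	assert(toy_pick_path(&with_hook) == TOY_PATH_FS_HOOK);
	assert(toy_pick_path(&without_hook) == TOY_PATH_FALLBACK);
}
```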
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Obtain the lock on page, remove all ptes and migrate the page
|
|
|
|
* to the newly allocated page in newpage.
|
|
|
|
*/
|
2006-06-23 17:03:53 +08:00
|
|
|
static int unmap_and_move(new_page_t get_new_page, unsigned long private,
|
2011-01-14 07:45:58 +08:00
|
|
|
struct page *page, int force, bool offlining, bool sync)
|
2006-06-23 17:03:51 +08:00
|
|
|
{
|
|
|
|
int rc = 0;
|
2006-06-23 17:03:55 +08:00
|
|
|
int *result = NULL;
|
|
|
|
struct page *newpage = get_new_page(page, private, &result);
|
2010-05-25 05:32:20 +08:00
|
|
|
int remap_swapcache = 1;
|
2008-02-07 16:14:10 +08:00
|
|
|
int charge = 0;
|
2009-11-12 06:26:26 +08:00
|
|
|
struct mem_cgroup *mem = NULL;
|
mm: migration: take a reference to the anon_vma before migrating
This patchset is a memory compaction mechanism that reduces external
memory fragmentation by moving GFP_MOVABLE pages into fewer
pageblocks. The term "compaction" was chosen as there are a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was
slub "defragmentation" (really a form of targeted reclaim). Hence, this
is called "compaction" to distinguish it from other forms of
defragmentation.
In this implementation, a full compaction run involves two scanners
operating within a zone - a migration and a free scanner. The migration
scanner starts at the beginning of a zone and finds all movable pages
within one pageblock_nr_pages-sized area and isolates them on a
migratepages list. The free scanner begins at the end of the zone and
searches on a per-area basis for enough free pages to migrate all the
pages on the migratepages list. As each area is respectively migrated or
exhausted of free pages, the scanners are advanced one area. A compaction
run completes within a zone when the two scanners meet.
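The two-scanner scheme described above can be sketched in userspace. This is a deliberately simplified model, not the kernel implementation: it works on single pages rather than pageblock_nr_pages-sized areas, represents the zone as a string, and `toy_compact()` is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/*
 * zone is a NUL-terminated string, one character per page:
 *   'M' movable page, 'U' unmovable page, '.' free page.
 * The migration scanner (lo) walks up from the start of the zone and the
 * free scanner (hi) walks down from the end; each movable page found by
 * lo is moved into the free slot found by hi. The run ends when the two
 * scanners meet. Returns the number of pages moved.
 */
int toy_compact(char *zone)
{
	size_t lo, hi, len;
	int moved = 0;

	assert(zone != NULL);
	len = strlen(zone);
	if (len < 2)
		return 0;

	lo = 0;
	hi = len - 1;
	while (lo < hi) {
		if (zone[lo] != 'M') {	/* migration scanner skips this page */
			lo++;
			continue;
		}
		while (hi > lo && zone[hi] != '.')	/* free scanner */
			hi--;
		if (hi <= lo)
			break;		/* scanners met: run complete */
		zone[hi] = 'M';		/* "migrate" the page */
		zone[lo] = '.';
		moved++;
		lo++;
		hi--;
	}
	return moved;
}
```

For example, the zone `"M.M.U.M."` becomes `"....UMMM"` after two moves: the free pages collect at the start of the zone while the unmovable page stays where it was.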
This method is a bit primitive but is easy to understand and greater
sophistication would require maintenance of counters on a per-pageblock
basis. This would have a big impact on allocator fast-paths to improve
compaction which is a poor trade-off.
It also does not try to relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.
Memory compaction can be triggered in one of three ways. It may be
triggered explicitly by writing any value to /proc/sys/vm/compact_memory
and compacting all of memory. It can be triggered on a per-node basis by
writing any value to /sys/devices/system/node/nodeN/compact where N is the
node ID to be compacted. When a process fails to allocate a high-order
page, it may compact memory in an attempt to satisfy the allocation
instead of entering direct reclaim. Explicit compaction does not finish
until the two scanners meet and direct compaction ends if a suitable page
becomes available that would meet watermarks.
The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.
Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
patch, it's possible to use anon_vma after free if the caller is
not holding a VMA or mmap_sem for the pages in question. While
there should be no existing user that causes this problem,
it's a requirement for memory compaction to be stable. The patch
is at the start of the series for bisection reasons.
Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
but would be slightly harder to review.
Patch 3 skips over unmapped anon pages during migration as there are no
guarantees about the anon_vma existing. There is a window between
when a page was isolated and migration started during which anon_vma
could disappear.
Patch 4 notes that PageSwapCache pages can still be migrated even if they
are unmapped.
Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 6 exports a "unusable free space index" via debugfs. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so can be dropped if requested
Patch 7 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 8 moves the definition for LRU isolation modes for use by compaction
Patch 9 is the compaction mechanism although it's unreachable at this point
Patch 10 adds a means of compacting all of memory with a proc trigger
Patch 11 adds a means of compacting a specific node with a sysfs trigger
Patch 12 adds "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 13 adds a sysctl that allows tuning of the threshold at which the
kernel will compact or direct reclaim
Patch 14 temporarily disables compaction if an allocation failure occurs
after compaction.
Testing of compaction was in three stages. For the test, debugging,
preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
popped out. min_free_kbytes was tuned as recommended by hugeadm to help
fragmentation avoidance and high-order allocations. It was tested on X86,
X86-64 and PPC64.
The first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.
1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax
The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.
2. Load up memory
a) Start updatedb
b) Create in parallel X files of pagesize*128 in size. Wait
until files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective being to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running
At this point, the system is quiet, memory is full but it's full with
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded so you'd expect lumpy reclaim
to have no significant advantage over compaction but this is at
the POC stage.
3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how many compactions. Note
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size
X86                       vanilla   compaction
Final page count:             913          916   (attempted 1002)
Total pages reclaimed:      68296         9791
X86-64                    vanilla   compaction
Final page count:             901          902   (attempted 1002)
Total pages reclaimed:     112599        53234
PPC64                     vanilla   compaction
Final page count:              93           94   (attempted 110)
Total pages reclaimed:     103216        61838
There was not a dramatic improvement in success rates but it wouldn't be
expected in this case either. What was important is that fewer pages were
reclaimed in all cases reducing the amount of IO required to satisfy a
huge page allocation.
The second tests were all performance related - kernbench, netperf, iozone
and sysbench. None showed anything too remarkable.
The last test was a high-order allocation stress test. Many kernel
compiles are started to fill memory with a pressured mix of unmovable and
movable allocations. During this, an attempt is made to allocate 90% of
memory as huge pages - one at a time with small delays between attempts to
avoid flooding the IO queue.
                                          vanilla   compaction
Percentage of request allocated X86            98           99
Percentage of request allocated X86-64         95           98
Percentage of request allocated PPC64          55           70
This patch:
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25 05:32:17 +08:00
|
|
|
struct anon_vma *anon_vma = NULL;
|
2006-06-23 17:03:53 +08:00
|
|
|
|
|
|
|
if (!newpage)
|
|
|
|
return -ENOMEM;
|
2006-06-23 17:03:51 +08:00
|
|
|
|
2008-10-19 11:26:39 +08:00
|
|
|
if (page_count(page) == 1) {
|
2006-06-23 17:03:51 +08:00
|
|
|
/* page was freed from under us. So we are done. */
|
2006-06-23 17:03:53 +08:00
|
|
|
goto move_newpage;
|
Unevictable LRU Infrastructure
When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages. Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.
Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.
Kosaki Motohiro added the support for the memory controller unevictable
lru list.
Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.
A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable. Subsequent patches will add the various
!evictable tests. We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.
To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference. If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list. This way, we avoid "stranding" evictable pages on the
unevictable list.
[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:39 +08:00
|
|
|
}
|
2011-01-14 07:46:55 +08:00
|
|
|
if (unlikely(PageTransHuge(page)))
|
|
|
|
if (unlikely(split_huge_page(page)))
|
|
|
|
goto move_newpage;
|
2006-06-23 17:03:51 +08:00
|
|
|
|
2008-07-25 16:47:10 +08:00
|
|
|
/* prepare cgroup just returns 0 or -ENOMEM */
|
2006-06-23 17:03:51 +08:00
|
|
|
rc = -EAGAIN;
|
2009-01-08 10:07:50 +08:00
|
|
|
|
2008-08-02 18:01:03 +08:00
|
|
|
if (!trylock_page(page)) {
|
2006-06-23 17:03:51 +08:00
|
|
|
if (!force)
|
2006-06-23 17:03:53 +08:00
|
|
|
goto move_newpage;
|
2011-01-14 07:45:56 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* It's not safe for direct compaction to call lock_page.
|
|
|
|
* For example, during page readahead pages are added locked
|
|
|
|
* to the LRU. Later, when the IO completes the pages are
|
|
|
|
* marked uptodate and unlocked. However, the queueing
|
|
|
|
* could be merging multiple pages for one bio (e.g.
|
|
|
|
* mpage_readpages). If an allocation happens for the
|
|
|
|
* second or third page, the process can end up locking
|
|
|
|
* the same page twice and deadlocking. Rather than
|
|
|
|
* trying to be clever about what pages can be locked,
|
|
|
|
* avoid the use of lock_page for direct compaction
|
|
|
|
* altogether.
|
|
|
|
*/
|
|
|
|
if (current->flags & PF_MEMALLOC)
|
|
|
|
goto move_newpage;
|
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
lock_page(page);
|
|
|
|
}
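The locking policy described in the comment above can be summarized as a small decision function. This is a userspace sketch only, not kernel code: the enum, the function name `page_lock_policy_sketch` and its boolean parameters are hypothetical stand-ins for the result of trylock_page(), the `force` argument and the `current->flags & PF_MEMALLOC` test.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of the page-lock step (names are illustrative). */
enum lock_outcome {
    LOCK_ACQUIRED,  /* trylock succeeded, migration continues */
    LOCK_GIVE_UP,   /* rc stays -EAGAIN, the page is skipped for now */
    LOCK_MAY_SLEEP  /* caller may fall back to a blocking lock_page() */
};

/*
 * trylock_ok  models the result of trylock_page(page)
 * force       models the 'force' argument of unmap_and_move()
 * in_memalloc models (current->flags & PF_MEMALLOC), i.e. direct compaction
 */
enum lock_outcome page_lock_policy_sketch(bool trylock_ok, bool force,
                                          bool in_memalloc)
{
    if (trylock_ok)
        return LOCK_ACQUIRED;
    if (!force)
        return LOCK_GIVE_UP;    /* first passes never block on the lock */
    if (in_memalloc)
        return LOCK_GIVE_UP;    /* blocking here could self-deadlock */
    return LOCK_MAY_SLEEP;
}
```

Only a forced attempt made outside the allocator path is allowed to sleep on the page lock; every other contended case leaves the page for a later pass.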
|
|
|
|
|
ksm: memory hotremove migration only
The previous patch enables page migration of ksm pages, but that soon gets
into trouble: not surprising, since we're using the ksm page lock to lock
operations on its stable_node, but page migration switches the page whose
lock is to be used for that. Another layer of locking would fix it, but
do we need that yet?
Do we actually need page migration of ksm pages? Yes, memory hotremove
needs to offline sections of memory: and since we stopped allocating ksm
pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
candidates for migration.
But KSM is currently unconscious of NUMA issues, happily merging pages
from different NUMA nodes: at present the rule must be, not to use
MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
ksm pages does not make sense yet.
So, to complete support for ksm swapping we need to make hotremove safe.
ksm_memory_callback() takes ksm_thread_mutex when MEM_GOING_OFFLINE and
releases it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
are freed before migration reaches them, stable_nodes may be left still
pointing to struct pages which have been removed from the system: the
stable_node needs to identify a page by pfn rather than page pointer, then
it can safely prune them when MEM_OFFLINE.
And make NUMA migration skip PageKsm pages where it skips PageReserved.
But it's only when we reach unmap_and_move() that the page lock is taken
and we can be sure that raised pagecount has prevented a PageAnon from
being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
page when offlining (has sufficient locking) but reject it otherwise.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 09:59:33 +08:00
|
|
|
/*
|
|
|
|
* Only memory hotplug's offline_pages() caller has locked out KSM,
|
|
|
|
* and can safely migrate a KSM page. The other cases have skipped
|
|
|
|
* PageKsm along with PageReserved - but it is only now when we have
|
|
|
|
* the page lock that we can be certain it will not go KSM beneath us
|
|
|
|
* (KSM will not upgrade a page from PageAnon to PageKsm when it sees
|
|
|
|
* its pagecount raised, but only here do we take the page lock which
|
|
|
|
* serializes that).
|
|
|
|
*/
|
|
|
|
if (PageKsm(page) && !offlining) {
|
|
|
|
rc = -EBUSY;
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
|
2009-01-08 10:07:50 +08:00
|
|
|
/* charge against new page */
|
memcg: fix mis-accounting of file mapped racy with migration
FILE_MAPPED per memcg of migrated file cache is not properly updated,
because our hook in page_add_file_rmap() can't know to which memcg
FILE_MAPPED should be counted.
Basically, this patch is for fixing the bug but includes some big changes
to fix up other messes.
When a mapped file page is migrated, events currently happen in the following sequence.
1. allocate a new page.
2. get memcg of an old page.
3. charge against a new page before migration. But at this point,
no changes to new page's page_cgroup, no commit for the charge.
(IOW, PCG_USED bit is not set.)
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. Here, the new page is unlocked.
7. memcg commits the charge for newpage, Mark the new page's page_cgroup
as PCG_USED.
Because "commit" happens after page-remap, we cannot count FILE_MAPPED
at "5": we should avoid trusting page_cgroup->mem_cgroup
if the PCG_USED bit is unset.
(Note: memcg's LRU removal code does that but LRU-isolation logic is used
for helping it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is
not on LRU or page_cgroup->mem_cgroup is NULL.)
We can lose file_mapped accounting information at 5 because FILE_MAPPED
is updated only when mapcount changes 0->1. So we should catch it.
BTW, historically, the above implementation comes from migration-failure
of anonymous page. Because we charge both of old page and new page
with mapcount=0, we can't catch
- the page is really freed before remap.
- migration fails but it's freed before remap
or .....corner cases.
New migration sequence with memcg is:
1. allocate a new page.
2. mark PageCgroupMigration to the old page.
3. charge against a new page onto the old page's memcg. (here, new page's pc
is marked as PageCgroupUsed.)
4. page migration replaces radix-tree, page table, etc...
5. At remapping, new page's page_cgroup is now marked as "USED"
We can catch 0->1 event and FILE_MAPPED will be properly updated.
And we can catch a SWAPOUT event after unlocking the page, and freeing
of the page by unmap() can be caught too.
7. Clear PageCgroupMigration of the old page.
So, FILE_MAPPED will be correctly updated.
Then, what is the MIGRATION flag for?
Without it, at migration failure, we may have to charge the old page again
because it may be fully unmapped. "charge" means that we have to dive into
memory reclaim or something complicated. So, it's better to avoid
charging it again. Before this patch, __commit_charge() was working for
both the old and new pages and fixed up everything. But this technique has
some race conditions around FILE_MAPPED and SWAPOUT etc...
Now, the kernel uses the MIGRATION flag and doesn't uncharge the old page
until the end of migration.
I hope this change will make memcg's page migration much simpler. This
page migration has caused several troubles. It is worth adding a flag for
simplification.
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
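The new charge sequence described in this commit message can be modelled as a small state machine. This is a hedged userspace sketch under simplifying assumptions, not the memcg implementation: `struct page_cgroup_sketch`, `prepare_migration_sketch()` and `end_migration_sketch()` are hypothetical names, the `charge_ok` parameter stands in for whether the charge against the memcg succeeds, and reference counting on the memcg is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-page accounting state (models the flags in page_cgroup). */
struct page_cgroup_sketch {
    bool used;       /* models PCG_USED: the page is charged to a memcg */
    bool migration;  /* models the MIGRATION flag on the old page */
    int  memcg_id;   /* which memcg the page is charged to */
};

/* Steps 2 and 3: flag the old page, then charge the new page up front. */
int prepare_migration_sketch(struct page_cgroup_sketch *oldpc,
                             struct page_cgroup_sketch *newpc,
                             bool charge_ok)
{
    if (!oldpc->used)
        return 0;               /* old page not charged: nothing to do */
    if (!charge_ok)
        return -ENOMEM;         /* the only failure the caller expects */
    oldpc->migration = true;    /* old page keeps its charge for now */
    newpc->memcg_id = oldpc->memcg_id;
    newpc->used = true;         /* committed before remap, so 0->1 is seen */
    return 0;
}

/* Final step: clear the flag and uncharge whichever page is left unused. */
void end_migration_sketch(struct page_cgroup_sketch *oldpc,
                          struct page_cgroup_sketch *newpc,
                          bool migrated)
{
    if (!oldpc->migration)
        return;                 /* prepare did not charge anything */
    oldpc->migration = false;
    if (migrated)
        oldpc->used = false;    /* success: old page gives up its charge */
    else
        newpc->used = false;    /* failure: old page was never uncharged */
}
```

Because the old page stays charged until the end, a failed migration needs no second charge, which is the simplification the MIGRATION flag is meant to buy.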
2010-05-27 05:42:46 +08:00
|
|
|
charge = mem_cgroup_prepare_migration(page, newpage, &mem);
|
2009-01-08 10:07:50 +08:00
|
|
|
if (charge == -ENOMEM) {
|
|
|
|
rc = -ENOMEM;
|
|
|
|
goto unlock;
|
|
|
|
}
|
|
|
|
BUG_ON(charge);
|
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
if (PageWriteback(page)) {
|
2011-01-14 07:45:57 +08:00
|
|
|
if (!force || !sync)
|
2009-01-08 10:07:50 +08:00
|
|
|
goto uncharge;
|
2006-06-23 17:03:51 +08:00
|
|
|
wait_on_page_writeback(page);
|
|
|
|
}
|
|
|
|
/*
|
2007-07-27 01:41:07 +08:00
|
|
|
* By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
|
|
|
|
* we cannot notice that anon_vma is freed while we migrate a page.
|
2011-01-14 07:47:30 +08:00
|
|
|
* This get_anon_vma() delays freeing anon_vma pointer until the end
|
2007-07-27 01:41:07 +08:00
|
|
|
* of migration. File cache pages are no problem because of page_lock()
|
2007-08-31 14:56:21 +08:00
|
|
|
* File caches may use write_page() or lock_page() during migration, so
|
|
|
|
* we only need to take care of anon pages here.
|
2007-07-27 01:41:07 +08:00
|
|
|
*/
|
2007-08-31 14:56:21 +08:00
|
|
|
if (PageAnon(page)) {
|
2011-01-14 07:47:30 +08:00
|
|
|
/*
|
|
|
|
* Only page_lock_anon_vma() understands the subtleties of
|
|
|
|
* getting a hold on an anon_vma from outside one of its mms.
|
|
|
|
*/
|
|
|
|
anon_vma = page_lock_anon_vma(page);
|
|
|
|
if (anon_vma) {
|
|
|
|
/*
|
|
|
|
* Take a reference count on the anon_vma if the
|
|
|
|
* page is mapped so that it is guaranteed to
|
|
|
|
* exist when the page is remapped later
|
|
|
|
*/
|
|
|
|
get_anon_vma(anon_vma);
|
|
|
|
page_unlock_anon_vma(anon_vma);
|
|
|
|
} else if (PageSwapCache(page)) {
|
2010-05-25 05:32:20 +08:00
|
|
|
/*
|
|
|
|
* We cannot be sure that the anon_vma of an unmapped
|
|
|
|
* swapcache page is safe to use because we don't
|
|
|
|
* know in advance if the VMA that this page belonged
|
|
|
|
* to still exists. If the VMA and others sharing the
|
|
|
|
* data have been freed, then the anon_vma could
|
|
|
|
* already be invalid.
|
|
|
|
*
|
|
|
|
* To avoid this possibility, swapcache pages get
|
|
|
|
* migrated but are not remapped when migration
|
|
|
|
* completes
|
|
|
|
*/
|
|
|
|
remap_swapcache = 0;
|
|
|
|
} else {
|
2011-01-14 07:47:30 +08:00
|
|
|
goto uncharge;
|
2010-05-25 05:32:20 +08:00
|
|
|
}
|
2007-08-31 14:56:21 +08:00
|
|
|
}
|
2008-02-05 14:29:33 +08:00
|
|
|
|
2007-07-27 01:41:07 +08:00
|
|
|
/*
|
2008-02-05 14:29:33 +08:00
|
|
|
* Corner case handling:
|
|
|
|
* 1. When a new swap-cache page is read in, it is added to the LRU
|
|
|
|
* and treated as swapcache but it has no rmap yet.
|
|
|
|
* Calling try_to_unmap() against a page->mapping==NULL page will
|
|
|
|
* trigger a BUG. So handle it here.
|
|
|
|
* 2. An orphaned page (see truncate_complete_page) might have
|
|
|
|
* fs-private metadata. The page can be picked up due to memory
|
|
|
|
* offlining. Everywhere else except page reclaim, the page is
|
|
|
|
* invisible to the vm, so the page can not be migrated. So try to
|
|
|
|
* free the metadata, so the page can be freed.
|
2006-06-23 17:03:51 +08:00
|
|
|
*/
|
2008-02-05 14:29:33 +08:00
|
|
|
if (!page->mapping) {
|
2011-01-14 07:47:30 +08:00
|
|
|
VM_BUG_ON(PageAnon(page));
|
|
|
|
if (page_has_private(page)) {
|
2008-02-05 14:29:33 +08:00
|
|
|
try_to_free_buffers(page);
|
2011-01-14 07:47:30 +08:00
|
|
|
goto uncharge;
|
2008-02-05 14:29:33 +08:00
|
|
|
}
|
2009-09-22 08:01:19 +08:00
|
|
|
goto skip_unmap;
|
2008-02-05 14:29:33 +08:00
|
|
|
}
|
|
|
|
|
2007-07-27 01:41:07 +08:00
|
|
|
/* Establish migration ptes or remove ptes */
|
2009-09-16 17:50:10 +08:00
|
|
|
try_to_unmap(page, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
|
2007-07-27 01:41:07 +08:00
|
|
|
|
2009-09-22 08:01:19 +08:00
|
|
|
skip_unmap:
|
2006-06-25 20:46:49 +08:00
|
|
|
if (!page_mapped(page))
|
2010-05-25 05:32:20 +08:00
|
|
|
rc = move_to_new_page(newpage, page, remap_swapcache);
|
2006-06-23 17:03:51 +08:00
|
|
|
|
2010-05-25 05:32:20 +08:00
|
|
|
if (rc && remap_swapcache)
|
2006-06-23 17:03:51 +08:00
|
|
|
remove_migration_ptes(page, page);
|
mm: migration: take a reference to the anon_vma before migrating
This patchset is a memory compaction mechanism that reduces external
memory fragmentation by moving GFP_MOVABLE pages into a smaller number of
pageblocks. The term "compaction" was chosen as there are a number of
mechanisms that are not mutually exclusive that can be used to defragment
memory. For example, lumpy reclaim is a form of defragmentation as was
slub "defragmentation" (really a form of targeted reclaim). Hence, this
is called "compaction" to distinguish it from other forms of
defragmentation.
In this implementation, a full compaction run involves two scanners
operating within a zone - a migration and a free scanner. The migration
scanner starts at the beginning of a zone and finds all movable pages
within one pageblock_nr_pages-sized area and isolates them on a
migratepages list. The free scanner begins at the end of the zone and
searches on a per-area basis for enough free pages to migrate all the
pages on the migratepages list. As each area is respectively migrated or
exhausted of free pages, the scanners are advanced one area. A compaction
run completes within a zone when the two scanners meet.
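The two-scanner scheme described above can be illustrated with a small userspace sketch. It is a simplification and not the kernel implementation: it works on single page slots rather than pageblock_nr_pages-sized areas, it moves pages immediately instead of building a migratepages list, and the names `enum slot` and `compact_zone_sketch()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state of one page slot in a zone. */
enum slot { SLOT_FREE, SLOT_MOVABLE, SLOT_UNMOVABLE };

/*
 * Move movable pages from the start of the zone into free slots found
 * from the end, until the two scanners meet.
 * Returns the number of pages migrated.
 */
size_t compact_zone_sketch(enum slot *zone, size_t n)
{
    size_t migrate_scan = 0;  /* migration scanner, walks upwards */
    size_t free_scan = n;     /* free scanner, walks downwards (exclusive) */
    size_t moved = 0;

    while (migrate_scan < free_scan) {
        if (zone[migrate_scan] != SLOT_MOVABLE) {
            migrate_scan++;
            continue;
        }
        /* Look for a free slot strictly above the migration scanner. */
        while (free_scan - 1 > migrate_scan &&
               zone[free_scan - 1] != SLOT_FREE)
            free_scan--;
        if (free_scan - 1 == migrate_scan)
            break;            /* scanners met: the run is complete */

        zone[free_scan - 1] = SLOT_MOVABLE;  /* "migrate" the page */
        zone[migrate_scan] = SLOT_FREE;
        free_scan--;
        migrate_scan++;
        moved++;
    }
    return moved;
}
```

In this model, a zone laid out as movable, free, unmovable, movable, free, free ends up with both movable pages packed at the end and the unmovable page left where it was.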
This method is a bit primitive but is easy to understand and greater
sophistication would require maintenance of counters on a per-pageblock
basis. This would have a big impact on allocator fast-paths to improve
compaction which is a poor trade-off.
It also does not try to relocate virtually contiguous pages to be physically
contiguous. However, assuming transparent hugepages were in use, a
hypothetical khugepaged might reuse compaction code to isolate free pages,
split them and relocate userspace pages for promotion.
Memory compaction can be triggered in one of three ways. It may be
triggered explicitly by writing any value to /proc/sys/vm/compact_memory
and compacting all of memory. It can be triggered on a per-node basis by
writing any value to /sys/devices/system/node/nodeN/compact where N is the
node ID to be compacted. When a process fails to allocate a high-order
page, it may compact memory in an attempt to satisfy the allocation
instead of entering direct reclaim. Explicit compaction does not finish
until the two scanners meet and direct compaction ends if a suitable page
becomes available that would meet watermarks.
The series is in 14 patches. The first three are not "core" to the series
but are important pre-requisites.
Patch 1 reference counts anon_vma for rmap_walk_anon(). Without this
patch, it's possible to use anon_vma after free if the caller is
not holding a VMA or mmap_sem for the pages in question. While
there should be no existing user that causes this problem,
it's a requirement for memory compaction to be stable. The patch
is at the start of the series for bisection reasons.
Patch 2 merges the KSM and migrate counts. It could be merged with patch 1
but would be slightly harder to review.
Patch 3 skips over unmapped anon pages during migration as there are no
guarantees about the anon_vma existing. There is a window between
when a page was isolated and migration started during which anon_vma
could disappear.
Patch 4 notes that PageSwapCache pages can still be migrated even if they
are unmapped.
Patch 5 allows CONFIG_MIGRATION to be set without CONFIG_NUMA
Patch 6 exports a "unusable free space index" via debugfs. It's
a measure of external fragmentation that takes the size of the
allocation request into account. It can also be calculated from
userspace so can be dropped if requested
Patch 7 exports a "fragmentation index" which only has meaning when an
allocation request fails. It determines if an allocation failure
would be due to a lack of memory or external fragmentation.
Patch 8 moves the definition for LRU isolation modes for use by compaction
Patch 9 is the compaction mechanism although it's unreachable at this point
Patch 10 adds a means of compacting all of memory with a proc trigger
Patch 11 adds a means of compacting a specific node with a sysfs trigger
Patch 12 adds "direct compaction" before "direct reclaim" if it is
determined there is a good chance of success.
Patch 13 adds a sysctl that allows tuning of the threshold at which the
kernel will compact or direct reclaim
Patch 14 temporarily disables compaction if an allocation failure occurs
after compaction.
Testing of compaction was in three stages. For the test, debugging,
preempt, the sleep watchdog and lockdep were all enabled but nothing nasty
popped out. min_free_kbytes was tuned as recommended by hugeadm to help
fragmentation avoidance and high-order allocations. It was tested on X86,
X86-64 and PPC64.
The first test represents one of the easiest cases that can be faced for
lumpy reclaim or memory compaction.
1. Machine freshly booted and configured for hugepage usage with
a) hugeadm --create-global-mounts
b) hugeadm --pool-pages-max DEFAULT:8G
c) hugeadm --set-recommended-min_free_kbytes
d) hugeadm --set-recommended-shmmax
The min_free_kbytes here is important. Anti-fragmentation works best
when pageblocks don't mix. hugeadm knows how to calculate a value that
will significantly reduce the worst of external-fragmentation-related
events as reported by the mm_page_alloc_extfrag tracepoint.
2. Load up memory
a) Start updatedb
b) Create in parallel X files of pagesize*128 in size. Wait
until files are created. By parallel, I mean that 4096 instances
of dd were launched, one after the other using &. The crude
objective being to mix filesystem metadata allocations with
the buffer cache.
c) Delete every second file so that pageblocks are likely to
have holes
d) kill updatedb if it's still running
At this point, the system is quiet, memory is full but it's full with
clean filesystem metadata and clean buffer cache that is unmapped.
This is readily migrated or discarded so you'd expect lumpy reclaim
to have no significant advantage over compaction but this is at
the POC stage.
3. In increments, attempt to allocate 5% of memory as hugepages.
Measure how long it took, how successful it was, how many
direct reclaims took place and how many compactions. Note
the compaction figures might not fully add up as compactions
can take place for orders other than the hugepage size
                         vanilla  compaction
X86
  Final page count           913         916  (attempted 1002)
  pages reclaimed          68296        9791
X86-64
  Final page count:          901         902  (attempted 1002)
  Total pages reclaimed:  112599       53234
PPC64
  Final page count:           93          94  (attempted 110)
  Total pages reclaimed:  103216       61838
There was not a dramatic improvement in success rates but it wouldn't be
expected in this case either. What was important is that fewer pages were
reclaimed in all cases reducing the amount of IO required to satisfy a
huge page allocation.
The second tests were all performance related - kernbench, netperf, iozone
and sysbench. None showed anything too remarkable.
The last test was a high-order allocation stress test. Many kernel
compiles are started to fill memory with a pressured mix of unmovable and
movable allocations. During this, an attempt is made to allocate 90% of
memory as huge pages - one at a time with small delays between attempts to
avoid flooding the IO queue.
                                         vanilla  compaction
Percentage of request allocated X86           98          99
Percentage of request allocated X86-64        95          98
Percentage of request allocated PPC64         55          70
This patch:
rmap_walk_anon() does not use page_lock_anon_vma() for looking up and
locking an anon_vma and it does not appear to have sufficient locking to
ensure the anon_vma does not disappear from under it.
This patch copies an approach used by KSM to take a reference on the
anon_vma while pages are being migrated. This should prevent rmap_walk()
running into nasty surprises later because anon_vma has been freed.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
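The reference-taking approach in the "This patch" part of the message can be sketched in userspace as follows. This is an illustrative model under stated assumptions, not the kernel's anon_vma code: `struct anon_vma_sketch` and the three helper functions are hypothetical, locking is omitted, and "may be freed" is reported as a return value instead of actually freeing memory.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an anon_vma with an external reference count. */
struct anon_vma_sketch {
    int  external_refcount;  /* references held by users such as migration */
    bool has_vmas;           /* true while at least one VMA is attached */
};

/* Taken before the page is unmapped for migration. */
void get_anon_vma_sketch(struct anon_vma_sketch *av)
{
    av->external_refcount++;
}

/* The last VMA goes away. Returns true if the object may be freed now. */
bool detach_last_vma_sketch(struct anon_vma_sketch *av)
{
    av->has_vmas = false;
    return av->external_refcount == 0;
}

/* Dropped at the end of migration. Returns true if the caller must free. */
bool drop_anon_vma_sketch(struct anon_vma_sketch *av)
{
    assert(av->external_refcount > 0);
    av->external_refcount--;
    return av->external_refcount == 0 && !av->has_vmas;
}
```

While migration holds its reference, the last VMA going away does not release the object; the release is deferred to the matching drop, so the rmap walk that restores the ptes never sees freed memory.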
2010-05-25 05:32:17 +08:00
|
|
|
|
|
|
|
/* Drop an anon_vma reference if we took one */
|
2010-08-10 08:18:41 +08:00
|
|
|
if (anon_vma)
|
|
|
|
drop_anon_vma(anon_vma);
|
2010-05-25 05:32:17 +08:00
|
|
|
|
2009-01-08 10:07:50 +08:00
|
|
|
uncharge:
|
|
|
|
if (!charge)
|
2011-01-14 07:47:43 +08:00
|
|
|
mem_cgroup_end_migration(mem, page, newpage, rc == 0);
|
2006-06-23 17:03:51 +08:00
|
|
|
unlock:
|
|
|
|
unlock_page(page);
|
2006-06-23 17:03:53 +08:00
|
|
|
|
2011-02-02 07:52:32 +08:00
|
|
|
move_newpage:
|
2006-06-23 17:03:51 +08:00
|
|
|
if (rc != -EAGAIN) {
|
2006-06-23 17:03:52 +08:00
|
|
|
/*
|
|
|
|
* A page that has been migrated has all references
|
|
|
|
* removed and will be freed. A page that has not been
|
|
|
|
* migrated will have kept its references and be
|
|
|
|
* restored.
|
|
|
|
*/
|
|
|
|
list_del(&page->lru);
|
2009-09-22 08:01:37 +08:00
|
|
|
dec_zone_page_state(page, NR_ISOLATED_ANON +
|
2009-09-22 08:02:59 +08:00
|
|
|
page_is_file_cache(page));
|
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:39 +08:00
|
|
|
putback_lru_page(page);
|
2006-06-23 17:03:51 +08:00
|
|
|
}
|
2006-06-23 17:03:53 +08:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Move the new page to the LRU. If migration was not successful
|
|
|
|
* then this will free the page.
|
|
|
|
*/
|
2008-10-19 11:26:39 +08:00
|
|
|
putback_lru_page(newpage);
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
if (result) {
|
|
|
|
if (rc)
|
|
|
|
*result = rc;
|
|
|
|
else
|
|
|
|
*result = page_to_nid(newpage);
|
|
|
|
}
|
2006-06-23 17:03:51 +08:00
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
/*
|
|
|
|
* Counterpart of unmap_and_move() for hugepage migration.
|
|
|
|
*
|
|
|
|
* This function doesn't wait for the completion of hugepage I/O
|
|
|
|
* because there is no race between I/O and migration for hugepage.
|
|
|
|
* Note that currently hugepage I/O occurs only in direct I/O
|
|
|
|
* where no lock is held and PG_writeback is irrelevant,
|
|
|
|
* and the writeback status of all subpages is counted in the reference
|
|
|
|
* count of the head page (i.e. if all subpages of a 2MB hugepage are
|
|
|
|
* under direct I/O, the reference of the head page is 512 and a bit more.)
|
|
|
|
* This means that when we try to migrate a hugepage whose subpages are
|
|
|
|
* doing direct I/O, some references remain after try_to_unmap() and
|
|
|
|
* hugepage migration fails without data corruption.
|
|
|
|
*
|
|
|
|
* There is also no race when direct I/O is issued on the page under migration,
|
|
|
|
* because then the pte is replaced with a migration swap entry and direct I/O code
|
|
|
|
* will wait in the page fault for migration to complete.
|
|
|
|
*/
|
|
|
|
static int unmap_and_move_huge_page(new_page_t get_new_page,
|
|
|
|
unsigned long private, struct page *hpage,
|
2011-01-14 07:45:58 +08:00
|
|
|
int force, bool offlining, bool sync)
|
2010-09-08 09:19:35 +08:00
|
|
|
{
|
|
|
|
int rc = 0;
|
|
|
|
int *result = NULL;
|
|
|
|
struct page *new_hpage = get_new_page(hpage, private, &result);
|
|
|
|
struct anon_vma *anon_vma = NULL;
|
|
|
|
|
|
|
|
if (!new_hpage)
|
|
|
|
return -ENOMEM;
|
|
|
|
|
|
|
|
rc = -EAGAIN;
|
|
|
|
|
|
|
|
if (!trylock_page(hpage)) {
|
2011-01-14 07:45:57 +08:00
|
|
|
if (!force || !sync)
|
2010-09-08 09:19:35 +08:00
|
|
|
goto out;
|
|
|
|
lock_page(hpage);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (PageAnon(hpage)) {
|
2011-01-14 07:47:31 +08:00
|
|
|
anon_vma = page_lock_anon_vma(hpage);
|
|
|
|
if (anon_vma) {
|
|
|
|
get_anon_vma(anon_vma);
|
|
|
|
page_unlock_anon_vma(anon_vma);
|
2010-09-08 09:19:35 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
|
|
|
|
|
|
|
|
if (!page_mapped(hpage))
|
|
|
|
rc = move_to_new_page(new_hpage, hpage, 1);
|
|
|
|
|
|
|
|
if (rc)
|
|
|
|
remove_migration_ptes(hpage, hpage);
|
|
|
|
|
2011-01-14 07:47:31 +08:00
|
|
|
if (anon_vma)
|
|
|
|
drop_anon_vma(anon_vma);
|
2010-09-08 09:19:35 +08:00
|
|
|
out:
|
|
|
|
unlock_page(hpage);
|
|
|
|
|
|
|
|
if (rc != -EAGAIN) {
|
|
|
|
list_del(&hpage->lru);
|
|
|
|
put_page(hpage);
|
|
|
|
}
|
|
|
|
|
|
|
|
put_page(new_hpage);
|
|
|
|
|
|
|
|
if (result) {
|
|
|
|
if (rc)
|
|
|
|
*result = rc;
|
|
|
|
else
|
|
|
|
*result = page_to_nid(new_hpage);
|
|
|
|
}
|
|
|
|
return rc;
|
|
|
|
}
|
|
|
|
|
2006-03-22 16:09:12 +08:00
|
|
|
/*
|
|
|
|
* migrate_pages
|
|
|
|
*
|
2006-06-23 17:03:53 +08:00
|
|
|
* The function takes one list of pages to migrate and a function
|
|
|
|
* that determines from the page to be migrated and the private data
|
|
|
|
* the target of the move and allocates the page.
|
2006-03-22 16:09:12 +08:00
|
|
|
*
|
|
|
|
* The function returns after 10 attempts or if no pages
|
|
|
|
* are movable anymore because the list has become empty
|
2010-10-27 05:21:29 +08:00
|
|
|
* or no retryable pages exist anymore.
|
|
|
|
* Caller should call putback_lru_pages to return pages to the LRU
|
2011-01-26 07:07:26 +08:00
|
|
|
* or free list only if ret != 0.
|
2006-03-22 16:09:12 +08:00
|
|
|
*
|
2006-06-23 17:03:53 +08:00
|
|
|
* Return: Number of pages not migrated or error code.
|
2006-03-22 16:09:12 +08:00
|
|
|
*/
|
2006-06-23 17:03:53 +08:00
|
|
|
int migrate_pages(struct list_head *from,
|
2011-01-14 07:45:58 +08:00
|
|
|
new_page_t get_new_page, unsigned long private, bool offlining,
|
2011-01-14 07:45:57 +08:00
|
|
|
bool sync)
|
2006-03-22 16:09:12 +08:00
|
|
|
{
|
2006-06-23 17:03:51 +08:00
|
|
|
int retry = 1;
|
2006-03-22 16:09:12 +08:00
|
|
|
int nr_failed = 0;
|
|
|
|
int pass = 0;
|
|
|
|
struct page *page;
|
|
|
|
struct page *page2;
|
|
|
|
int swapwrite = current->flags & PF_SWAPWRITE;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
if (!swapwrite)
|
|
|
|
current->flags |= PF_SWAPWRITE;
|
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
for (pass = 0; pass < 10 && retry; pass++) {
|
|
|
|
retry = 0;
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
list_for_each_entry_safe(page, page2, from, lru) {
|
|
|
|
cond_resched();
|
2006-06-23 17:03:33 +08:00
|
|
|
|
2006-06-23 17:03:53 +08:00
|
|
|
rc = unmap_and_move(get_new_page, private,
|
2011-01-14 07:45:57 +08:00
|
|
|
page, pass > 2, offlining,
|
|
|
|
sync);
|
2006-06-23 17:03:33 +08:00
|
|
|
|
2006-06-23 17:03:51 +08:00
|
|
|
switch (rc) {
|
2006-06-23 17:03:53 +08:00
|
|
|
case -ENOMEM:
|
|
|
|
goto out;
|
2006-06-23 17:03:51 +08:00
|
|
|
case -EAGAIN:
|
2006-06-23 17:03:33 +08:00
|
|
|
retry++;
|
2006-06-23 17:03:51 +08:00
|
|
|
break;
|
|
|
|
case 0:
|
|
|
|
break;
|
|
|
|
default:
|
2006-06-23 17:03:33 +08:00
|
|
|
/* Permanent failure */
|
|
|
|
nr_failed++;
|
2006-06-23 17:03:51 +08:00
|
|
|
break;
|
2006-06-23 17:03:33 +08:00
|
|
|
}
|
2006-03-22 16:09:12 +08:00
|
|
|
}
|
|
|
|
}
|
2006-06-23 17:03:53 +08:00
|
|
|
rc = 0;
|
|
|
|
out:
|
2006-03-22 16:09:12 +08:00
|
|
|
if (!swapwrite)
|
|
|
|
current->flags &= ~PF_SWAPWRITE;
|
|
|
|
|
2006-06-23 17:03:53 +08:00
|
|
|
if (rc)
|
|
|
|
return rc;
|
2006-03-22 16:09:12 +08:00
|
|
|
|
2006-06-23 17:03:53 +08:00
|
|
|
return nr_failed + retry;
|
2006-03-22 16:09:12 +08:00
|
|
|
}
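The retry policy of migrate_pages() above can be modelled outside the kernel. The following is a minimal user-space sketch, not the kernel implementation: `migrate_items`, `try_move_fn`, `demo_try_move`, `demo_always_busy` and the `settled` array are hypothetical names introduced for illustration, and a flag array stands in for removing a page from the list. It keeps the same rules: at most 10 passes, `force` set once `pass > 2`, an immediate stop on -ENOMEM, and a return value of permanent failures plus entries still asking for a retry.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: 0 on success, -EAGAIN to retry, other negative = permanent. */
typedef int (*try_move_fn)(int item, int force);

/*
 * Model of the migrate_pages() loop. settled[i] != 0 means item i has
 * left the work list (moved or failed for good).
 * Returns the number of items not moved, or -ENOMEM.
 */
int migrate_items(const int *items, unsigned char *settled, size_t n,
		  try_move_fn try_move)
{
	int retry = 1;
	int nr_failed = 0;
	int pass;
	size_t i;

	for (pass = 0; pass < 10 && retry; pass++) {
		retry = 0;
		for (i = 0; i < n; i++) {
			int rc;

			if (settled[i])
				continue;
			rc = try_move(items[i], pass > 2);
			switch (rc) {
			case -ENOMEM:
				return rc;	/* no point in continuing */
			case -EAGAIN:
				retry++;	/* stays on the list */
				break;
			case 0:
				settled[i] = 1;
				break;
			default:
				/* Permanent failure */
				settled[i] = 1;
				nr_failed++;
				break;
			}
		}
	}
	return nr_failed + retry;
}

/* Demo: negative items fail permanently, odd items succeed only when forced. */
int demo_try_move(int item, int force)
{
	if (item < 0)
		return -EINVAL;
	if ((item & 1) && !force)
		return -EAGAIN;
	return 0;
}

/* Demo: every item is always busy. */
int demo_always_busy(int item, int force)
{
	(void)item;
	(void)force;
	return -EAGAIN;
}
```

With `demo_try_move`, an odd item is retried on passes 0 to 2 and succeeds on pass 3, the first pass where `force` is set. With `demo_always_busy`, the loop gives up after 10 passes and reports every item as not moved.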
|
2006-06-23 17:03:53 +08:00
|
|
|
|
2010-09-08 09:19:35 +08:00
|
|
|
int migrate_huge_pages(struct list_head *from,
|
2011-01-14 07:45:58 +08:00
|
|
|
new_page_t get_new_page, unsigned long private, bool offlining,
|
2011-01-14 07:45:57 +08:00
|
|
|
bool sync)
|
2010-09-08 09:19:35 +08:00
|
|
|
{
|
|
|
|
int retry = 1;
|
|
|
|
int nr_failed = 0;
|
|
|
|
int pass = 0;
|
|
|
|
struct page *page;
|
|
|
|
struct page *page2;
|
|
|
|
int rc;
|
|
|
|
|
|
|
|
for (pass = 0; pass < 10 && retry; pass++) {
|
|
|
|
retry = 0;
|
|
|
|
|
|
|
|
list_for_each_entry_safe(page, page2, from, lru) {
|
|
|
|
cond_resched();
|
|
|
|
|
|
|
|
rc = unmap_and_move_huge_page(get_new_page,
|
2011-01-14 07:45:57 +08:00
|
|
|
private, page, pass > 2, offlining,
|
|
|
|
sync);
|
2010-09-08 09:19:35 +08:00
|
|
|
|
|
|
|
switch (rc) {
|
|
|
|
case -ENOMEM:
|
|
|
|
goto out;
|
|
|
|
case -EAGAIN:
|
|
|
|
retry++;
|
|
|
|
break;
|
|
|
|
case 0:
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
/* Permanent failure */
|
|
|
|
nr_failed++;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
rc = 0;
|
|
|
|
out:
|
|
|
|
if (rc)
|
|
|
|
return rc;
|
|
|
|
|
|
|
|
return nr_failed + retry;
|
|
|
|
}
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
#ifdef CONFIG_NUMA
|
|
|
|
/*
|
|
|
|
* Move a list of individual pages
|
|
|
|
*/
|
|
|
|
struct page_to_node {
|
|
|
|
unsigned long addr;
|
|
|
|
struct page *page;
|
|
|
|
int node;
|
|
|
|
int status;
|
|
|
|
};
|
|
|
|
|
|
|
|
static struct page *new_page_node(struct page *p, unsigned long private,
|
|
|
|
int **result)
|
|
|
|
{
|
|
|
|
struct page_to_node *pm = (struct page_to_node *)private;
|
|
|
|
|
|
|
|
while (pm->node != MAX_NUMNODES && pm->page != p)
|
|
|
|
pm++;
|
|
|
|
|
|
|
|
if (pm->node == MAX_NUMNODES)
|
|
|
|
return NULL;
|
|
|
|
|
|
|
|
*result = &pm->status;
|
|
|
|
|
2009-06-17 06:31:54 +08:00
|
|
|
return alloc_pages_exact_node(pm->node,
|
2007-07-17 19:03:05 +08:00
|
|
|
GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
|
2006-06-23 17:03:55 +08:00
|
|
|
}
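new_page_node() above relies on the pm array being terminated by an entry whose node equals MAX_NUMNODES. A minimal user-space sketch of that sentinel-terminated lookup follows. It is an illustration only: `struct entry`, `find_entry` and `END_NODE` are hypothetical names, and `END_NODE` merely plays the role of MAX_NUMNODES.

```c
#include <stddef.h>

#define END_NODE 1024	/* hypothetical sentinel, stands in for MAX_NUMNODES */

struct entry {
	const void *key;	/* stands in for the struct page pointer */
	int node;		/* target node, or END_NODE for the end marker */
	int status;
};

/*
 * Scan until either the key matches or the end marker is reached.
 * Returns the matching entry, or NULL if the key is not in the table.
 */
struct entry *find_entry(struct entry *e, const void *key)
{
	while (e->node != END_NODE && e->key != key)
		e++;

	if (e->node == END_NODE)
		return NULL;

	return e;
}
```

The caller has to guarantee that the end marker exists, since the scan has no separate length check.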
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Move a set of pages as indicated in the pm array. The addr
|
|
|
|
* field must be set to the virtual address of the page to be moved
|
|
|
|
* and the node number must contain a valid target node.
|
2008-10-19 11:27:17 +08:00
|
|
|
* The pm array ends with node = MAX_NUMNODES.
|
2006-06-23 17:03:55 +08:00
|
|
|
*/
|
2008-10-19 11:27:17 +08:00
|
|
|
static int do_move_page_to_node_array(struct mm_struct *mm,
|
|
|
|
struct page_to_node *pm,
|
|
|
|
int migrate_all)
|
2006-06-23 17:03:55 +08:00
|
|
|
{
|
|
|
|
int err;
|
|
|
|
struct page_to_node *pp;
|
|
|
|
LIST_HEAD(pagelist);
|
|
|
|
|
|
|
|
down_read(&mm->mmap_sem);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Build a list of pages to migrate
|
|
|
|
*/
|
|
|
|
for (pp = pm; pp->node != MAX_NUMNODES; pp++) {
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
struct page *page;
|
|
|
|
|
|
|
|
err = -EFAULT;
|
|
|
|
vma = find_vma(mm, pp->addr);
|
2010-10-27 05:22:07 +08:00
|
|
|
if (!vma || pp->addr < vma->vm_start || !vma_migratable(vma))
|
2006-06-23 17:03:55 +08:00
|
|
|
goto set_status;
|
|
|
|
|
2011-01-14 07:46:55 +08:00
|
|
|
page = follow_page(vma, pp->addr, FOLL_GET|FOLL_SPLIT);
|
2008-06-21 02:18:25 +08:00
|
|
|
|
|
|
|
err = PTR_ERR(page);
|
|
|
|
if (IS_ERR(page))
|
|
|
|
goto set_status;
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
err = -ENOENT;
|
|
|
|
if (!page)
|
|
|
|
goto set_status;
|
|
|
|
|
ksm: memory hotremove migration only
The previous patch enables page migration of ksm pages, but that soon gets
into trouble: not surprising, since we're using the ksm page lock to lock
operations on its stable_node, but page migration switches the page whose
lock is to be used for that. Another layer of locking would fix it, but
do we need that yet?
Do we actually need page migration of ksm pages? Yes, memory hotremove
needs to offline sections of memory: and since we stopped allocating ksm
pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
candidates for migration.
But KSM is currently unconscious of NUMA issues, happily merging pages
from different NUMA nodes: at present the rule must be, not to use
MADV_MERGEABLE where you care about NUMA. So no, NUMA page migration of
ksm pages does not make sense yet.
So, to complete support for ksm swapping we need to make hotremove safe.
ksm_memory_callback() takes ksm_thread_mutex when MEM_GOING_OFFLINE and
releases it when MEM_OFFLINE or MEM_CANCEL_OFFLINE. But if mapped pages
are freed before migration reaches them, stable_nodes may be left still
pointing to struct pages which have been removed from the system: the
stable_node needs to identify a page by pfn rather than page pointer, then
it can safely prune them when MEM_OFFLINE.
And make NUMA migration skip PageKsm pages where it skips PageReserved.
But it's only when we reach unmap_and_move() that the page lock is taken
and we can be sure that raised pagecount has prevented a PageAnon from
being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
page when offlining (has sufficient locking) but reject it otherwise.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 09:59:33 +08:00
|
|
|
/* Use PageReserved to check for zero page */
|
|
|
|
if (PageReserved(page) || PageKsm(page))
|
2006-06-23 17:03:55 +08:00
|
|
|
goto put_and_set;
|
|
|
|
|
|
|
|
pp->page = page;
|
|
|
|
err = page_to_nid(page);
|
|
|
|
|
|
|
|
if (err == pp->node)
|
|
|
|
/*
|
|
|
|
* Page is already on the target node
|
|
|
|
*/
|
|
|
|
goto put_and_set;
|
|
|
|
|
|
|
|
err = -EACCES;
|
|
|
|
if (page_mapcount(page) > 1 &&
|
|
|
|
!migrate_all)
|
|
|
|
goto put_and_set;
|
|
|
|
|
vmscan: move isolate_lru_page() to vmscan.c
On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory. Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory pressure in a catatonic state.
This patch series improves VM scalability by:
1) putting filesystem backed, swap backed and unevictable pages
onto their own LRUs, so the system only scans the pages that it
can/should evict from memory
2) switching to two handed clock replacement for the anonymous LRUs,
so the number of pages that need to be scanned when the system
starts swapping is bound to a reasonable number
3) keeping unevictable pages off the LRU completely, so the
VM does not waste CPU time scanning them. ramfs, ramdisk,
SHM_LOCKED shared memory segments and mlock()ed VMA pages
are kept on the unevictable list.
This patch:
isolate_lru_page logically belongs in vmscan.c rather than migrate.c.
It is a tough call, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c. However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.
Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list. Callers can do that.
Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller. Methinks we need to
rationalize these names/purposes. --lts
[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 11:26:09 +08:00
|
|
|
err = isolate_lru_page(page);
|
2009-12-15 09:58:11 +08:00
|
|
|
if (!err) {
|
2008-10-19 11:26:09 +08:00
|
|
|
list_add_tail(&page->lru, &pagelist);
|
2009-12-15 09:58:11 +08:00
|
|
|
inc_zone_page_state(page, NR_ISOLATED_ANON +
|
|
|
|
page_is_file_cache(page));
|
|
|
|
}
|
2006-06-23 17:03:55 +08:00
|
|
|
put_and_set:
|
|
|
|
/*
|
|
|
|
* Either remove the duplicate refcount from
|
|
|
|
* isolate_lru_page() or drop the page ref if it was
|
|
|
|
* not isolated.
|
|
|
|
*/
|
|
|
|
put_page(page);
|
|
|
|
set_status:
|
|
|
|
pp->status = err;
|
|
|
|
}
|
|
|
|
|
2008-10-19 11:27:15 +08:00
|
|
|
err = 0;
|
2010-10-27 05:21:29 +08:00
|
|
|
if (!list_empty(&pagelist)) {
|
2006-06-23 17:03:55 +08:00
|
|
|
err = migrate_pages(&pagelist, new_page_node,
|
2011-01-14 07:45:57 +08:00
|
|
|
(unsigned long)pm, 0, true);
|
2010-10-27 05:21:29 +08:00
|
|
|
if (err)
|
|
|
|
putback_lru_pages(&pagelist);
|
|
|
|
}
|
2006-06-23 17:03:55 +08:00
|
|
|
|
|
|
|
up_read(&mm->mmap_sem);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2008-10-19 11:27:17 +08:00
|
|
|
/*
|
|
|
|
* Migrate an array of page addresses onto an array of nodes and fill
|
|
|
|
* the corresponding array of status.
|
|
|
|
*/
|
|
|
|
static int do_pages_move(struct mm_struct *mm, struct task_struct *task,
|
|
|
|
unsigned long nr_pages,
|
|
|
|
const void __user * __user *pages,
|
|
|
|
const int __user *nodes,
|
|
|
|
int __user *status, int flags)
|
|
|
|
{
|
2009-01-07 06:38:57 +08:00
|
|
|
struct page_to_node *pm;
|
2008-10-19 11:27:17 +08:00
|
|
|
nodemask_t task_nodes;
|
2009-01-07 06:38:57 +08:00
|
|
|
unsigned long chunk_nr_pages;
|
|
|
|
unsigned long chunk_start;
|
|
|
|
int err;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
|
|
|
task_nodes = cpuset_mems_allowed(task);
|
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
err = -ENOMEM;
|
|
|
|
pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
|
|
|
|
if (!pm)
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out;
|
2009-06-17 06:32:43 +08:00
|
|
|
|
|
|
|
migrate_prep();
|
|
|
|
|
2008-10-19 11:27:17 +08:00
|
|
|
/*
|
2009-01-07 06:38:57 +08:00
|
|
|
* Store a chunk of page_to_node array in a page,
|
|
|
|
* but keep the last one as a marker
|
2008-10-19 11:27:17 +08:00
|
|
|
*/
|
2009-01-07 06:38:57 +08:00
|
|
|
chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
for (chunk_start = 0;
|
|
|
|
chunk_start < nr_pages;
|
|
|
|
chunk_start += chunk_nr_pages) {
|
|
|
|
int j;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
if (chunk_start + chunk_nr_pages > nr_pages)
|
|
|
|
chunk_nr_pages = nr_pages - chunk_start;
|
|
|
|
|
|
|
|
/* fill the chunk pm with addrs and nodes from user-space */
|
|
|
|
for (j = 0; j < chunk_nr_pages; j++) {
|
|
|
|
const void __user *p;
|
2008-10-19 11:27:17 +08:00
|
|
|
int node;
|
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
err = -EFAULT;
|
|
|
|
if (get_user(p, pages + j + chunk_start))
|
|
|
|
goto out_pm;
|
|
|
|
pm[j].addr = (unsigned long) p;
|
|
|
|
|
|
|
|
if (get_user(node, nodes + j + chunk_start))
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out_pm;
|
|
|
|
|
|
|
|
err = -ENODEV;
|
2010-02-06 08:16:50 +08:00
|
|
|
if (node < 0 || node >= MAX_NUMNODES)
|
|
|
|
goto out_pm;
|
|
|
|
|
2008-10-19 11:27:17 +08:00
|
|
|
if (!node_state(node, N_HIGH_MEMORY))
|
|
|
|
goto out_pm;
|
|
|
|
|
|
|
|
err = -EACCES;
|
|
|
|
if (!node_isset(node, task_nodes))
|
|
|
|
goto out_pm;
|
|
|
|
|
2009-01-07 06:38:57 +08:00
|
|
|
pm[j].node = node;
|
|
|
|
}
|
|
|
|
|
|
|
|
/* End marker for this chunk */
|
|
|
|
pm[chunk_nr_pages].node = MAX_NUMNODES;
|
|
|
|
|
|
|
|
/* Migrate this chunk */
|
|
|
|
err = do_move_page_to_node_array(mm, pm,
|
|
|
|
flags & MPOL_MF_MOVE_ALL);
|
|
|
|
if (err < 0)
|
|
|
|
goto out_pm;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
|
|
|
/* Return status information */
|
2009-01-07 06:38:57 +08:00
|
|
|
for (j = 0; j < chunk_nr_pages; j++)
|
|
|
|
if (put_user(pm[j].status, status + j + chunk_start)) {
|
2008-10-19 11:27:17 +08:00
|
|
|
err = -EFAULT;
|
2009-01-07 06:38:57 +08:00
|
|
|
goto out_pm;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
err = 0;
|
2008-10-19 11:27:17 +08:00
|
|
|
|
|
|
|
out_pm:
|
2009-01-07 06:38:57 +08:00
|
|
|
free_page((unsigned long)pm);
|
2008-10-19 11:27:17 +08:00
|
|
|
out:
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
/*
|
2008-10-19 11:27:16 +08:00
|
|
|
* Determine the nodes of an array of pages and store them in an array of status.
|
2006-06-23 17:03:55 +08:00
|
|
|
*/
|
2008-12-10 05:14:23 +08:00
|
|
|
static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
|
|
|
|
const void __user **pages, int *status)
|
2006-06-23 17:03:55 +08:00
|
|
|
{
|
2008-10-19 11:27:16 +08:00
|
|
|
unsigned long i;
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
down_read(&mm->mmap_sem);
|
|
|
|
|
2008-10-19 11:27:16 +08:00
|
|
|
for (i = 0; i < nr_pages; i++) {
|
2008-12-10 05:14:23 +08:00
|
|
|
unsigned long addr = (unsigned long)(*pages);
|
2006-06-23 17:03:55 +08:00
|
|
|
struct vm_area_struct *vma;
|
|
|
|
struct page *page;
|
2008-12-16 15:06:43 +08:00
|
|
|
int err = -EFAULT;
|
2008-10-19 11:27:16 +08:00
|
|
|
|
|
|
|
vma = find_vma(mm, addr);
|
2010-10-27 05:22:07 +08:00
|
|
|
if (!vma || addr < vma->vm_start)
|
2006-06-23 17:03:55 +08:00
|
|
|
goto set_status;
|
|
|
|
|
2008-10-19 11:27:16 +08:00
|
|
|
page = follow_page(vma, addr, 0);
|
2008-06-21 02:18:25 +08:00
|
|
|
|
|
|
|
err = PTR_ERR(page);
|
|
|
|
if (IS_ERR(page))
|
|
|
|
goto set_status;
|
|
|
|
|
2006-06-23 17:03:55 +08:00
|
|
|
err = -ENOENT;
|
|
|
|
/* Use PageReserved to check for zero page */
|
2009-12-15 09:59:33 +08:00
|
|
|
if (!page || PageReserved(page) || PageKsm(page))
|
2006-06-23 17:03:55 +08:00
|
|
|
goto set_status;
|
|
|
|
|
|
|
|
err = page_to_nid(page);
|
|
|
|
set_status:
|
2008-12-10 05:14:23 +08:00
|
|
|
*status = err;
|
|
|
|
|
|
|
|
pages++;
|
|
|
|
status++;
|
|
|
|
}
|
|
|
|
|
|
|
|
up_read(&mm->mmap_sem);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine the nodes of a user array of pages and store them in
|
|
|
|
* a user array of status.
|
|
|
|
*/
|
|
|
|
static int do_pages_stat(struct mm_struct *mm, unsigned long nr_pages,
|
|
|
|
const void __user * __user *pages,
|
|
|
|
int __user *status)
|
|
|
|
{
|
|
|
|
#define DO_PAGES_STAT_CHUNK_NR 16
|
|
|
|
const void __user *chunk_pages[DO_PAGES_STAT_CHUNK_NR];
|
|
|
|
int chunk_status[DO_PAGES_STAT_CHUNK_NR];
|
|
|
|
|
2010-02-19 08:13:40 +08:00
|
|
|
while (nr_pages) {
|
|
|
|
unsigned long chunk_nr;
|
2008-12-10 05:14:23 +08:00
|
|
|
|
2010-02-19 08:13:40 +08:00
|
|
|
chunk_nr = nr_pages;
|
|
|
|
if (chunk_nr > DO_PAGES_STAT_CHUNK_NR)
|
|
|
|
chunk_nr = DO_PAGES_STAT_CHUNK_NR;
|
|
|
|
|
|
|
|
if (copy_from_user(chunk_pages, pages, chunk_nr * sizeof(*chunk_pages)))
|
|
|
|
break;
|
2008-12-10 05:14:23 +08:00
|
|
|
|
|
|
|
do_pages_stat_array(mm, chunk_nr, chunk_pages, chunk_status);
|
|
|
|
|
2010-02-19 08:13:40 +08:00
|
|
|
if (copy_to_user(status, chunk_status, chunk_nr * sizeof(*status)))
|
|
|
|
break;
|
2006-06-23 17:03:55 +08:00
|
|
|
|
2010-02-19 08:13:40 +08:00
|
|
|
pages += chunk_nr;
|
|
|
|
status += chunk_nr;
|
|
|
|
nr_pages -= chunk_nr;
|
|
|
|
}
|
|
|
|
return nr_pages ? -EFAULT : 0;
|
2006-06-23 17:03:55 +08:00
|
|
|
}
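do_pages_stat() above walks an arbitrarily long user array through two small fixed-size buffers. The same chunking pattern can be sketched in user space as follows. This is an illustration under stated assumptions: `process_in_chunks`, `process_chunk` and `CHUNK_NR` are hypothetical names, memcpy() stands in for copy_from_user() and copy_to_user() (so the fault handling of the original is not modelled), and the per-chunk work is a placeholder.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_NR 16	/* same chunk size as DO_PAGES_STAT_CHUNK_NR */

/* Hypothetical per-chunk worker: writes twice each input value to out. */
static void process_chunk(size_t nr, const long *in, int *out)
{
	size_t i;

	for (i = 0; i < nr; i++)
		out[i] = (int)(in[i] * 2);
}

/*
 * Process nr entries of src into dst, at most CHUNK_NR at a time,
 * going through bounded local buffers. Returns the number of chunks.
 */
size_t process_in_chunks(const long *src, int *dst, size_t nr)
{
	long chunk_in[CHUNK_NR];
	int chunk_out[CHUNK_NR];
	size_t chunks = 0;

	while (nr) {
		size_t chunk_nr = nr;

		if (chunk_nr > CHUNK_NR)
			chunk_nr = CHUNK_NR;

		memcpy(chunk_in, src, chunk_nr * sizeof(*chunk_in));
		process_chunk(chunk_nr, chunk_in, chunk_out);
		memcpy(dst, chunk_out, chunk_nr * sizeof(*dst));

		src += chunk_nr;
		dst += chunk_nr;
		nr -= chunk_nr;
		chunks++;
	}
	return chunks;
}
```

Keeping the buffers at a fixed 16 entries bounds the stack usage regardless of how many entries the caller passes in.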
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Move a list of pages in the address space of the currently executing
|
|
|
|
* process.
|
|
|
|
*/
|
2009-01-14 21:14:30 +08:00
|
|
|
SYSCALL_DEFINE6(move_pages, pid_t, pid, unsigned long, nr_pages,
|
|
|
|
const void __user * __user *, pages,
|
|
|
|
const int __user *, nodes,
|
|
|
|
int __user *, status, int, flags)
|
2006-06-23 17:03:55 +08:00
|
|
|
{
|
2008-11-14 07:39:19 +08:00
|
|
|
const struct cred *cred = current_cred(), *tcred;
|
2006-06-23 17:03:55 +08:00
|
|
|
struct task_struct *task;
|
|
|
|
struct mm_struct *mm;
|
2008-10-19 11:27:17 +08:00
|
|
|
int err;
|
2006-06-23 17:03:55 +08:00
|
|
|
|
|
|
|
/* Check flags */
|
|
|
|
if (flags & ~(MPOL_MF_MOVE|MPOL_MF_MOVE_ALL))
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
|
|
|
|
return -EPERM;
|
|
|
|
|
|
|
|
/* Find the mm_struct */
|
2011-02-26 06:44:13 +08:00
|
|
|
rcu_read_lock();
|
2007-10-19 14:40:16 +08:00
|
|
|
task = pid ? find_task_by_vpid(pid) : current;
|
2006-06-23 17:03:55 +08:00
|
|
|
if (!task) {
|
2011-02-26 06:44:13 +08:00
|
|
|
rcu_read_unlock();
|
2006-06-23 17:03:55 +08:00
|
|
|
return -ESRCH;
|
|
|
|
}
|
|
|
|
mm = get_task_mm(task);
|
2011-02-26 06:44:13 +08:00
|
|
|
rcu_read_unlock();
|
2006-06-23 17:03:55 +08:00
|
|
|
|
|
|
|
if (!mm)
|
|
|
|
return -EINVAL;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check if this process has the right to modify the specified
|
|
|
|
* process. The right exists if the process has administrative
|
|
|
|
* capabilities, superuser privileges or the same
|
|
|
|
* userid as the target process.
|
|
|
|
*/
|
2008-11-14 07:39:19 +08:00
|
|
|
rcu_read_lock();
|
|
|
|
tcred = __task_cred(task);
|
2008-11-14 07:39:16 +08:00
|
|
|
if (cred->euid != tcred->suid && cred->euid != tcred->uid &&
|
|
|
|
cred->uid != tcred->suid && cred->uid != tcred->uid &&
|
2006-06-23 17:03:55 +08:00
|
|
|
!capable(CAP_SYS_NICE)) {
|
2008-11-14 07:39:19 +08:00
|
|
|
rcu_read_unlock();
|
2006-06-23 17:03:55 +08:00
|
|
|
err = -EPERM;
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out;
|
2006-06-23 17:03:55 +08:00
|
|
|
}
|
2008-11-14 07:39:19 +08:00
|
|
|
rcu_read_unlock();
|
2006-06-23 17:03:55 +08:00
|
|
|
|
2006-06-23 17:04:02 +08:00
|
|
|
err = security_task_movememory(task);
|
|
|
|
if (err)
|
2008-10-19 11:27:17 +08:00
|
|
|
goto out;
|
2006-06-23 17:04:02 +08:00
|
|
|
|
2008-10-19 11:27:17 +08:00
|
|
|
if (nodes) {
|
|
|
|
err = do_pages_move(mm, task, nr_pages, pages, nodes, status,
|
|
|
|
flags);
|
|
|
|
} else {
|
2008-10-19 11:27:16 +08:00
|
|
|
err = do_pages_stat(mm, nr_pages, pages, status);
|
2006-06-23 17:03:55 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
out:
|
|
|
|
mmput(mm);
|
|
|
|
return err;
|
|
|
|
}
|
|
|
|
|
2006-06-25 20:46:48 +08:00
|
|
|
/*
|
|
|
|
* Call migration functions in the vma_ops that may prepare
|
|
|
|
* memory in a vma for migration. Migration functions may perform
|
|
|
|
* the migration for vmas that do not have an underlying page struct.
|
|
|
|
*/
|
|
|
|
int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
|
|
|
|
const nodemask_t *from, unsigned long flags)
|
|
|
|
{
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
int err = 0;
|
|
|
|
|
2009-02-12 05:04:18 +08:00
|
|
|
for (vma = mm->mmap; vma && !err; vma = vma->vm_next) {
|
2006-06-25 20:46:48 +08:00
|
|
|
if (vma->vm_ops && vma->vm_ops->migrate) {
|
|
|
|
err = vma->vm_ops->migrate(vma, to, from, flags);
|
|
|
|
if (err)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return err;
|
|
|
|
}
|
2008-07-24 12:28:22 +08:00
|
|
|
#endif
|