linux/kernel
Lee Schermerhorn 06808b0827 hugetlb: derive huge pages nodes allowed from task mempolicy
This patch derives a "nodes_allowed" node mask from the numa mempolicy of
the task modifying the number of persistent huge pages to control the
allocation, freeing and adjusting of surplus huge pages when the pool page
count is modified via the new sysctl or sysfs attribute
"nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:

* For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
  is produced.  This will cause the hugetlb subsystem to use
  node_online_map as the "nodes_allowed".  This preserves the
  behavior before this patch.
* For "preferred" mempolicy, including explicit local allocation,
  a nodemask with the single preferred node will be produced.
  "local" policy will NOT track any internode migrations of the
  task adjusting nr_hugepages.
* For "bind" and "interleave" policy, the mempolicy's nodemask
  will be used.
* Other than to inform the construction of the nodes_allowed node
  mask, the actual mempolicy mode is ignored.  That is, all modes
  behave like interleave over the resulting nodes_allowed mask
  with no "fallback".

See the updated documentation [next patch] for more information
about the implications of this patch.

Examples:

Starting with:

	Node 0 HugePages_Total:     0
	Node 1 HugePages_Total:     0
	Node 2 HugePages_Total:     0
	Node 3 HugePages_Total:     0

Default behavior [with or without this patch] balances persistent
hugepage allocation across nodes [with sufficient contiguous memory]:

	sysctl vm.nr_hugepages[_mempolicy]=32

yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:     8
	Node 3 HugePages_Total:     8

Of course, we only have nr_hugepages_mempolicy with the patch,
but with default mempolicy, nr_hugepages_mempolicy behaves the
same as nr_hugepages.

Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
'--membind' because it allows multiple nodes to be specified
and it's easy to type]--we can allocate huge pages on
individual nodes or sets of nodes.  So, starting from the
condition above, with 8 huge pages per node, add 8 more to
node 2 using:

	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40

This yields:

	Node 0 HugePages_Total:     8
	Node 1 HugePages_Total:     8
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The incremental 8 huge pages were restricted to node 2 by the
specified mempolicy.

Similarly, we can use mempolicy to free persistent huge pages
from specified nodes:

	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32

yields:

	Node 0 HugePages_Total:     4
	Node 1 HugePages_Total:     4
	Node 2 HugePages_Total:    16
	Node 3 HugePages_Total:     8

The 8 huge pages freed were balanced over nodes 0 and 1.

[rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Andi Kleen <andi@firstfloor.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 08:53:12 -08:00
..
gcov microblaze: Enable GCOV_PROFILE_ALL 2009-09-21 14:29:21 +02:00
irq Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-12-09 19:43:33 -08:00
power PM / Runtime: Export the PM runtime workqueue 2009-12-06 16:17:56 +01:00
time Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
trace Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
.gitignore
Kconfig.freezer
Kconfig.hz
Kconfig.locks mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
Kconfig.preempt
Makefile Merge branch 'kvm-updates/2.6.33' of git://git.kernel.org/pub/scm/virt/kvm/kvm 2009-12-08 08:02:38 -08:00
acct.c bsdacct: fix uid/gid misreporting 2009-12-15 08:53:10 -08:00
async.c async: Fix lack of boot-time console due to insufficient synchronization 2009-06-08 12:31:53 -07:00
audit.c Audit: send signal info if selinux is disabled 2009-09-24 03:50:26 -04:00
audit.h Fix rule eviction order for AUDIT_DIR 2009-06-24 00:02:38 -04:00
audit_tree.c Fix rule eviction order for AUDIT_DIR 2009-06-24 00:02:38 -04:00
audit_watch.c Audit: reorganize struct audit_watch to save 8 bytes 2009-09-24 03:50:25 -04:00
auditfilter.c Audit: clean up all op= output to include string quoting 2009-06-24 00:00:52 -04:00
auditsc.c Audit: rearrange audit_context to save 16 bytes per struct 2009-09-24 03:50:26 -04:00
backtracetest.c
bounds.c
capability.c remove CONFIG_SECURITY_FILE_CAPABILITIES compile option 2009-11-24 15:06:47 +11:00
cgroup.c cgroup: fix strstrip() misuse 2009-10-29 07:39:25 -07:00
cgroup_freezer.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
compat.c signals: implement sys_rt_tgsigqueueinfo 2009-04-30 19:24:24 +02:00
configs.c
cpu.c Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-12 11:34:10 -08:00
cpuset.c sched: Fix balance vs hotplug race 2009-12-06 21:10:56 +01:00
cred-internals.h
cred.c creds_are_invalid() needs to be exported for use by modules: 2009-09-23 11:02:26 -07:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
exec_domain.c
exit.c tty: Move the leader test in disassociate 2009-12-11 15:18:08 -08:00
extable.c
fork.c Merge branch 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block 2009-12-08 08:19:16 -08:00
freezer.c sched: fix nr_uninterruptible accounting of frozen tasks really 2009-07-18 14:19:53 +02:00
futex.c futex: Take mmap_sem for get_user_pages in fault_in_user_writeable 2009-12-08 14:59:36 +01:00
futex_compat.c futex: Fix compat_futex to be same as futex for REQUEUE_PI 2009-08-10 15:41:12 +02:00
groups.c groups: move code to kernel/groups.c 2009-06-16 19:47:48 -07:00
hrtimer.c hrtimer: move timer stats helper functions to hrtimer.c 2009-12-10 13:08:11 +01:00
hung_task.c softlockup: Fix hung_task_check_count sysctl 2009-11-27 06:21:57 +01:00
hw_breakpoint.c hw-breakpoints: Modify breakpoints without unregistering them 2009-12-09 09:48:20 +01:00
itimer.c itimers: Fix racy writes to cpu_itimer fields 2009-11-18 16:32:12 +01:00
kallsyms.c hw-breakpoints: Fix broken hw-breakpoint sample module 2009-11-10 11:23:29 +01:00
kexec.c kexec: fix omitting offset in extended crashkernel syntax 2009-07-29 19:10:34 -07:00
kfifo.c kfifo: Use "const" definitions 2009-09-19 13:13:17 -07:00
kgdb.c kgdb: Always process the whole breakpoint list on activate or deactivate 2009-12-11 08:43:20 -06:00
kmod.c security: report the module name to security_module_request 2009-11-10 09:33:46 +11:00
kprobes.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-05 15:30:21 -08:00
ksysfs.c
kthread.c sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c 2009-11-03 07:25:00 +01:00
latencytop.c
lockdep.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
lockdep_internals.h lockdep: BFS cleanup 2009-07-24 10:53:29 +02:00
lockdep_proc.c seq_file: constify seq_operations 2009-09-23 07:39:29 -07:00
lockdep_states.h
module.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h
mutex.c mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
mutex.h
notifier.c kprobes: Fix to add __kprobes to notify_die 2009-08-30 03:08:26 +02:00
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c nsproxy: extract create_nsproxy() 2009-06-18 13:03:56 -07:00
panic.c Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-10-08 12:16:35 -07:00
params.c param: fix setting arrays of bool 2009-10-29 08:56:20 +10:30
perf_event.c Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-11 20:47:30 -08:00
pid.c mm: also use alloc_large_system_hash() for the PID hash table 2009-09-22 07:17:38 -07:00
pid_namespace.c pidns: deny CLONE_PARENT|CLONE_NEWPID combination 2009-09-24 07:21:04 -07:00
pm_qos_params.c pm_qos: clean up racy global "name" variable 2009-10-14 15:31:10 +02:00
posix-cpu-timers.c posix-cpu-timers: optimize and document timer_create callback 2009-11-18 12:36:05 +01:00
posix-timers.c time: Introduce CLOCK_REALTIME_COARSE 2009-08-21 21:43:46 +02:00
printk.c Merge branch 'core-printk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-05 09:50:22 -08:00
profile.c kernel/profile.c: Switch /proc/irq/prof_cpu_mask to seq_file 2009-09-20 20:15:40 +02:00
ptrace.c ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee 2009-09-24 07:20:59 -07:00
rcupdate.c rcu: Re-arrange code to reduce #ifdef pain 2009-11-22 18:58:16 +01:00
rcutiny.c rcu: Eliminate unneeded function wrapping 2009-11-22 18:58:16 +01:00
rcutorture.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
rcutree.c rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree.h rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree_plugin.h rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
rcutree_trace.c rcu: Add expedited grace-period support for preemptible RCU 2009-12-03 11:35:25 +01:00
relay.c const: mark struct vm_struct_operations 2009-09-27 11:39:25 -07:00
res_counter.c memcg: some modification to softlimit under hierarchical memory reclaim. 2009-10-01 16:11:13 -07:00
resource.c resources: when allocate_resource() fails, leave resource untouched 2009-11-04 13:06:46 -08:00
rtmutex-debug.c
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock() 2009-08-06 05:50:21 +02:00
rtmutex.h
rtmutex_common.h
rwsem.c
sched.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
sched_clock.c sched_clock: Fix atomicity/continuity bug by using cmpxchg64() 2009-09-30 22:56:10 +02:00
sched_cpupri.c sched: Add new prio to cpupri before removing old prio 2009-08-02 14:26:09 +02:00
sched_cpupri.h
sched_debug.c sched: Remove forced2_migrations stats 2009-12-10 20:32:39 +01:00
sched_fair.c sched: Update normalized values on user updates via proc 2009-12-09 10:04:02 +01:00
sched_features.h sched: Discard some old bits 2009-12-09 10:03:07 +01:00
sched_idletask.c sched: Protect sched_rr_get_param() access to task->sched_class 2009-12-09 10:01:07 +01:00
sched_rt.c sched: Protect sched_rr_get_param() access to task->sched_class 2009-12-09 10:01:07 +01:00
sched_stats.h
seccomp.c
semaphore.c
signal.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-05 15:30:21 -08:00
slow-work-debugfs.c SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
slow-work.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6 2009-12-08 07:38:50 -08:00
slow-work.h SLOW_WORK: Move slow_work's proc file to debugfs 2009-12-01 08:20:31 -08:00
smp.c generic-ipi: Add smp_call_function_any() 2009-11-18 14:52:25 +01:00
softirq.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2009-12-14 09:58:24 -08:00
softlockup.c percpu: make percpu symbols under kernel/ and mm/ unique 2009-10-29 22:34:13 +09:00
spinlock.c locking: Reduce ifdefs in kernel/spinlock.c 2009-11-13 20:53:28 +01:00
srcu.c rcu: Add synchronize_srcu_expedited() 2009-10-26 09:40:30 +01:00
stacktrace.c
stop_machine.c
sys.c Merge branch 'bkl-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-09 08:07:17 -08:00
sys_ni.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6 2009-12-08 07:55:01 -08:00
sysctl.c hugetlb: derive huge pages nodes allowed from task mempolicy 2009-12-15 08:53:12 -08:00
sysctl_binary.c sysctl binary: Reorder the tests to process wild card entries first. 2009-11-12 01:42:31 -08:00
sysctl_check.c ipv4 05/05: add sysctl to accept packets with local source addresses 2009-12-03 12:14:38 -08:00
taskstats.c genetlink: make netns aware 2009-07-12 14:03:27 -07:00
test_kprobes.c
time.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-12-08 19:27:08 -08:00
timeconst.pl
timer.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-09-23 09:46:15 -07:00
tracepoint.c trivial: fix typo "to to" in multiple files 2009-09-21 15:14:55 +02:00
tsacct.c
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user-return-notifier.c core: Clean up user return notifers use of per_cpu 2009-12-02 10:22:59 +01:00
user.c uids: Prevent tear down race 2009-11-02 16:02:39 +01:00
user_namespace.c
utsname.c utsns: extract creeate_uts_ns() 2009-06-18 13:03:55 -07:00
utsname_sysctl.c sysctl kernel: Remove binary sysctl logic 2009-11-12 02:04:55 -08:00
wait.c locking, sched: Give waitqueue spinlocks their own lockdep classes 2009-08-10 14:43:09 +02:00
workqueue.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2009-12-10 09:35:44 -08:00