linux/mm
Adam Litke 6af2acb661 hugetlb: Move update_and_free_page
Dynamic huge page pool resizing.

In most real-world scenarios, configuring the size of the hugetlb pool
correctly is a difficult task.  If too few pages are allocated to the pool,
applications using MAP_SHARED may fail to mmap() a hugepage region and
applications using MAP_PRIVATE may receive SIGBUS.  Isolating too much memory
in the hugetlb pool means it is not available for other uses, especially those
programs not using huge pages.

The obvious answer is to let the hugetlb pool grow and shrink in response to
the runtime demand for huge pages.  The work Mel Gorman has been doing to
establish a memory zone for movable memory allocations makes dynamically
resizing the hugetlb pool reliable within the limits of that zone.  This patch
series implements dynamic pool resizing for private and shared mappings while
being careful to maintain existing semantics.  Please reply with your comments
and feedback; even just to say whether it would be a useful feature to you.
Thanks.

How it works
============

Upon depletion of the hugetlb pool, rather than reporting an error immediately,
first try and allocate the needed huge pages directly from the buddy allocator.
Care must be taken to avoid unbounded growth of the hugetlb pool, so the
hugetlb filesystem quota is used to limit overall pool size.

The real work begins when we decide there is a shortage of huge pages.  What
happens next depends on whether the pages are for a private or shared mapping.
Private mappings are straightforward.  At fault time, if alloc_huge_page()
fails, we allocate a page from the buddy allocator and increment the source
node's surplus_huge_pages counter.  When free_huge_page() is called for a page
on a node with a surplus, the page is freed directly to the buddy allocator
instead of the hugetlb pool.

Because shared mappings require all of the pages to be reserved up front, some
additional work must be done at mmap() to support them.  We determine the
reservation shortage and allocate the required number of pages all at once.
These pages are then added to the hugetlb pool and marked reserved.  Where that
is not possible the mmap() will fail.  As with private mappings, the
appropriate surplus counters are updated.  Since reserved huge pages won't
necessarily be used by the process, we can't be sure that free_huge_page() will
always be called to return surplus pages to the buddy allocator.  To prevent
the huge page pool from bloating, we must free unused surplus pages when their
reservation has ended.

Controlling it
==============

With the entire patch series applied, pool resizing is off by default so unless
specific action is taken, the semantics are unchanged.

To take advantage of the flexibility afforded by this patch series one must
tolerate a change in semantics.  To control hugetlb pool growth, the following
techniques can be employed:

 * A sysctl tunable to enable/disable the feature entirely
 * The size= mount option for hugetlbfs filesystems to limit pool size

Performance
===========

When contiguous memory is readily available, it is expected that the cost of
dynamicly resizing the pool will be small.  This series has been performance
tested with 'stream' to measure this cost.

Stream (http://www.cs.virginia.edu/stream/) was linked with libhugetlbfs to
enable remapping of the text and data/bss segments into huge pages.

Stream with small array
-----------------------
Baseline: 	nr_hugepages = 0, No libhugetlbfs segment remapping
Preallocated:	nr_hugepages = 5, Text and data/bss remapping
Dynamic:	nr_hugepages = 0, Text and data/bss remapping

				Rate (MB/s)
Function	Baseline	Preallocated	Dynamic
Copy:		4695.6266	5942.8371	5982.2287
Scale:		4451.5776	5017.1419	5658.7843
Add:		5815.8849	7927.7827	8119.3552
Triad:		5949.4144	8527.6492	8110.6903

Stream with large array
-----------------------
Baseline: 	nr_hugepages =  0, No libhugetlbfs segment remapping
Preallocated:	nr_hugepages = 67, Text and data/bss remapping
Dynamic:	nr_hugepages =  0, Text and data/bss remapping

				Rate (MB/s)
Function	Baseline	Preallocated	Dynamic
Copy:		2227.8281	2544.2732	2546.4947
Scale:		2136.3208	2430.7294	2421.2074
Add:		2773.1449	4004.0021	3999.4331
Triad:		2748.4502	3777.0109	3773.4970

* All numbers are averages taken from 10 consecutive runs with a maximum
  standard deviation of 1.3 percent noted.

This patch:

Simply move update_and_free_page() so that it can be reused later in this
patch series.  The implementation is not changed.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Dave McCracken <dave.mccracken@oracle.com>
Acked-by: William Irwin <bill.irwin@oracle.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Ken Chen <kenchen@google.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16 09:43:02 -07:00
..
Kconfig memory unplug: page offline 2007-10-16 09:43:02 -07:00
Makefile memory unplug: page isolation 2007-10-16 09:43:02 -07:00
allocpercpu.c Slab allocators: Replace explicit zeroing with __GFP_ZERO 2007-07-17 10:23:02 -07:00
backing-dev.c remove mm/backing-dev.c:congestion_wait_interruptible() 2007-07-16 09:05:52 -07:00
bootmem.c [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols 2006-12-07 08:39:44 -08:00
bounce.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
fadvise.c [PATCH] mm: change uses of f_{dentry,vfsmnt} to use f_path 2006-12-08 08:28:43 -08:00
filemap.c fs: remove some AOP_TRUNCATED_PAGE 2007-10-16 09:42:58 -07:00
filemap_xip.c mm: write iovec cleanup 2007-10-16 09:42:54 -07:00
fremap.c fix VM_CAN_NONLINEAR check in sys_remap_file_pages 2007-10-08 12:58:14 -07:00
highmem.c Create the ZONE_MOVABLE zone 2007-07-17 10:22:59 -07:00
hugetlb.c hugetlb: Move update_and_free_page 2007-10-16 09:43:02 -07:00
internal.h Breakout page_order() to internal.h to avoid special knowledge of the buddy allocator 2007-10-16 09:43:01 -07:00
madvise.c speed up madvise_need_mmap_write() usage 2007-07-16 09:05:36 -07:00
memory.c flush icache before set_pte() on ia64: flush icache at set_pte 2007-10-16 09:42:59 -07:00
memory_hotplug.c fix memory hot remove not configured case. 2007-10-16 09:43:02 -07:00
mempolicy.c memoryless nodes: fixup uses of node_online_map in generic code 2007-10-16 09:42:59 -07:00
mempool.c Slab allocators: Replace explicit zeroing with __GFP_ZERO 2007-07-17 10:23:02 -07:00
migrate.c flush icache before set_pte() on ia64: flush icache at set_pte 2007-10-16 09:42:59 -07:00
mincore.c [PATCH] mincore: vma crossing fix 2007-02-15 09:57:03 -08:00
mlock.c do not limit locked memory when RLIMIT_MEMLOCK is RLIM_INFINITY 2007-07-16 09:05:37 -07:00
mmap.c fix NULL pointer dereference in __vm_enough_memory() 2007-08-22 19:52:45 -07:00
mmzone.c [PATCH] remove EXPORT_UNUSED_SYMBOL'ed symbols 2006-12-07 08:39:44 -08:00
mprotect.c flush icache before set_pte() on ia64: flush icache at set_pte 2007-10-16 09:42:59 -07:00
mremap.c mm: variable length argument support 2007-07-19 10:04:45 -07:00
msync.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
nommu.c fix NULL pointer dereference in __vm_enough_memory() 2007-08-22 19:52:45 -07:00
oom_kill.c Memoryless nodes: OOM: use N_HIGH_MEMORY map instead of constructing one on the fly 2007-10-16 09:42:58 -07:00
page-writeback.c memoryless nodes: fixup uses of node_online_map in generic code 2007-10-16 09:42:59 -07:00
page_alloc.c memory unplug: page offline 2007-10-16 09:43:02 -07:00
page_io.c Drop 'size' argument from bio_endio and bi_end_io 2007-10-10 09:25:57 +02:00
page_isolation.c memory unplug: page isolation 2007-10-16 09:43:02 -07:00
pdflush.c Freezer: make kernel threads nonfreezable by default 2007-07-17 10:23:02 -07:00
prio_tree.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
quicklist.c Quicklists for page table pages 2007-05-07 12:12:54 -07:00
readahead.c mm: buffered write cleanup 2007-10-16 09:42:54 -07:00
rmap.c flush icache before set_pte() on ia64: flush icache at set_pte 2007-10-16 09:42:59 -07:00
shmem.c Group short-lived and reclaimable kernel allocations 2007-10-16 09:43:00 -07:00
shmem_acl.c [PATCH] Fix typos in mm/shmem_acl.c 2006-10-11 11:14:23 -07:00
slab.c Group short-lived and reclaimable kernel allocations 2007-10-16 09:43:00 -07:00
slob.c Slab allocators: fail if ksize is called with a NULL parameter 2007-10-16 09:42:53 -07:00
slub.c slub: list_locations() can use GFP_TEMPORARY 2007-10-16 09:43:01 -07:00
sparse-vmemmap.c memory hotplug: Hot-add with sparsemem-vmemmap 2007-10-16 09:43:02 -07:00
sparse.c memory hotplug: Hot-add with sparsemem-vmemmap 2007-10-16 09:43:02 -07:00
swap.c mm: use pagevec to rotate reclaimable page 2007-10-16 09:42:54 -07:00
swap_state.c mm: clarify __add_to_swap_cache locking 2007-10-16 09:42:53 -07:00
swapfile.c Replace CONFIG_SOFTWARE_SUSPEND with CONFIG_HIBERNATION 2007-07-29 16:45:38 -07:00
thrash.c Bug in mm/thrash.c function grab_swap_token() 2007-05-11 08:29:32 -07:00
tiny-shmem.c [PATCH] mm/{,tiny-}shmem.c cleanups 2007-03-01 14:53:35 -08:00
truncate.c mm: merge populate and nopage into fault (fixes nonlinear) 2007-07-19 10:04:41 -07:00
util.c Slab allocators: fail if ksize is called with a NULL parameter 2007-10-16 09:42:53 -07:00
vmalloc.c Categorize GFP flags 2007-10-16 09:42:59 -07:00
vmscan.c make swappiness safer to use 2007-10-16 09:42:59 -07:00
vmstat.c Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo 2007-10-16 09:43:00 -07:00