linux

Michal Hocko 73e64c51af mm, compaction: allow compaction for GFP_NOFS requests

Compaction has been disabled for GFP_NOFS and GFP_NOIO requests since direct compaction was introduced by commit `56de7263fc` ("mm: compaction: direct compact when a high-order allocation fails"). The main reason is that the migration of page cache pages might recurse back to the fs/io layer and we could potentially deadlock. This is overly conservative because all anonymous memory is migratable in the GFP_NOFS context just fine. This might be a large portion of the memory in many/most workloads.

Remove the GFP_NOFS restriction and make sure that we skip all fs pages (those with a mapping) while isolating pages to be migrated. We cannot consider clean fs pages because they might need a metadata update, so only isolate pages without any mapping for nofs requests.

The effect of this patch will probably be very limited in many/most workloads because higher order GFP_NOFS requests are quite rare, although different configurations might lead to very different results. David Chinner has mentioned a heavy metadata workload with 64kB block which, to quote him:

: Unfortunately, there was an era of cargo cult configuration tweaks in the
: Ceph community that has resulted in a large number of production machines
: with XFS filesystems configured this way. And a lot of them store large
: numbers of small files and run under significant sustained memory
: pressure.
:
: I slowly working towards getting rid of these high order allocations and
: replacing them with the equivalent number of single page allocations, but
: I haven't got that (complex) change working yet.

We can do the following to simulate that workload:

    $ mkfs.xfs -f -n size=64k <dev>
    $ mount <dev> /mnt/scratch
    $ time ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 \
            -d /mnt/scratch/0 -d /mnt/scratch/1 \
            -d /mnt/scratch/2 -d /mnt/scratch/3 \
            -d /mnt/scratch/4 -d /mnt/scratch/5 \
            -d /mnt/scratch/6 -d /mnt/scratch/7 \
            -d /mnt/scratch/8 -d /mnt/scratch/9 \
            -d /mnt/scratch/10 -d /mnt/scratch/11 \
            -d /mnt/scratch/12 -d /mnt/scratch/13 \
            -d /mnt/scratch/14 -d /mnt/scratch/15

and indeed it hammers the system with many high order GFP_NOFS requests, as per a simple tracepoint during the load:

    $ echo '!(gfp_flags & 0x80) && (gfp_flags & 0x400000)' > $TRACE_MNT/events/kmem/mm_page_alloc/filter

I am getting

    5287609 order=0
         37 order=1
    1594905 order=2
    3048439 order=3
    6699207 order=4
      66645 order=5

My testing was done in a kvm guest so performance numbers should be taken with a grain of salt, but there seems to be a difference when the patch is applied:

* Original kernel

    FSUse%        Count         Size    Files/sec     App Overhead
         1      1600000            0       4300.1         20745838
         3      3200000            0       4239.9         23849857
         5      4800000            0       4243.4         25939543
         6      6400000            0       4248.4         19514050
         8      8000000            0       4262.1         20796169
         9      9600000            0       4257.6         21288675
        11     11200000            0       4259.7         19375120
        13     12800000            0       4220.7         22734141
        14     14400000            0       4238.5         31936458
        16     16000000            0       4231.5         23409901
        18     17600000            0       4045.3         23577700
        19     19200000            0       2783.4         58299526
        21     20800000            0       2678.2         40616302
        23     22400000            0       2693.5         83973996

and xfs complaining about memory allocation not making progress

    [ 2304.372647] XFS: fs_mark(3289) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)
    [ 2304.443323] XFS: fs_mark(3285) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)
    [ 4796.772477] XFS: fs_mark(3424) possible memory allocation deadlock size 46936 in kmem_alloc (mode:0x2408240)
    [ 4796.775329] XFS: fs_mark(3423) possible memory allocation deadlock size 51416 in kmem_alloc (mode:0x2408240)
    [ 4797.388808] XFS: fs_mark(3424) possible memory allocation deadlock size 65728 in kmem_alloc (mode:0x2408240)

* Patched kernel

    FSUse%        Count         Size    Files/sec     App Overhead
         1      1600000            0       4289.1         19243934
         3      3200000            0       4241.6         32828865
         5      4800000            0       4248.7         32884693
         6      6400000            0       4314.4         19608921
         8      8000000            0       4269.9         24953292
         9      9600000            0       4270.7         33235572
        11     11200000            0       4346.4         40817101
        13     12800000            0       4285.3         29972397
        14     14400000            0       4297.2         20539765
        16     16000000            0       4219.6         18596767
        18     17600000            0       4273.8         49611187
        19     19200000            0       4300.4         27944451
        21     20800000            0       4270.6         22324585
        22     22400000            0       4317.6         22650382
        24     24000000            0       4065.2         22297964

So the dropdown at Count 19200000 didn't happen and there was only a single warning about allocation not making progress

    [ 3063.815003] XFS: fs_mark(3272) possible memory allocation deadlock size 65624 in kmem_alloc (mode:0x2408240)

This suggests that the patch has helped even though there is not all that much anonymous memory, as the workload mostly generates fs metadata. I assume the success rate would be higher with more anonymous memory, which should be the case in many workloads.

[akpm@linux-foundation.org: fix comment]
Link: http://lkml.kernel.org/r/20161012114721.31853-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2016-12-14 16:04:07 -08:00
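
The tracepoint filter in the commit message selects allocations where bit 0x80 is clear and bit 0x400000 is set. The following is a minimal Python sketch of that predicate, not kernel code. The two mask values come straight from the filter expression; the symbolic names attached to them are an assumption based on the gfp.h definitions of that kernel era and may not hold for other versions. The function name `matches_filter` is hypothetical.

```python
# Mask values copied from the filter expression in the commit message.
# ASSUMPTION: the names below reflect how these bits were defined in the
# gfp.h of that kernel era; other kernel versions may number them differently.
FS_BIT = 0x80                  # assumed: "may call into the filesystem"
DIRECT_RECLAIM_BIT = 0x400000  # assumed: "caller may enter direct reclaim"


def matches_filter(gfp_flags: int) -> bool:
    """Mirror of: !(gfp_flags & 0x80) && (gfp_flags & 0x400000)."""
    return not (gfp_flags & FS_BIT) and bool(gfp_flags & DIRECT_RECLAIM_BIT)


# The mode value printed in the XFS warnings quoted in the commit message.
xfs_mode = 0x2408240
print(matches_filter(xfs_mode))  # prints True
```

Under these assumptions, the `mode:0x2408240` value from the XFS warnings passes the filter, which is consistent with the commit's description of those allocations as GFP_NOFS requests that are allowed to reclaim.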
..
kasan	Power management material for v4.10-rc1	2016-12-13 10:41:53 -08:00
Kconfig	mm: THP page cache support for ppc64	2016-12-12 18:55:08 -08:00
Kconfig.debug	PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO	2016-09-13 02:35:27 +02:00
Makefile	Disable the __builtin_return_address() warning globally after all	2016-10-12 10:23:41 -07:00
backing-dev.c	writeback: track if we're sleeping on progress in balance_dirty_pages()	2016-11-08 08:28:55 -07:00
balloon_compaction.c	mm: balloon: use general non-lru movable page feature	2016-07-26 16:19:19 -07:00
bootmem.c	mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping	2016-10-11 15:06:33 -07:00
cleancache.c	cleancache: constify cleancache_ops structure	2016-01-27 09:09:57 -05:00
cma.c	mm/cma.c: check the max limit for cma allocation	2016-11-11 08:12:37 -08:00
cma.h	mm: cma: mark cma_bitmap_maxno() inline in header	2015-08-14 15:56:32 -07:00
cma_debug.c	mm/cma_debug: correct size input to bitmap function	2015-07-17 16:39:54 -07:00
compaction.c	mm, compaction: allow compaction for GFP_NOFS requests	2016-12-14 16:04:07 -08:00
debug.c	mm, debug: print raw struct page data in __dump_page()	2016-12-12 18:55:08 -08:00
debug_page_ref.c	mm/page_ref: add tracepoint to track down page reference manipulation	2016-03-17 15:09:34 -07:00
dmapool.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
early_ioremap.c	mm/early_ioremap: use offset_in_page macro	2015-11-05 19:34:48 -08:00
fadvise.c	mm/fadvise.c: do not discard partial pages with POSIX_FADV_DONTNEED	2016-06-09 14:23:11 -07:00
failslab.c	mm: fault-inject take over bootstrap kmem_cache check	2016-03-15 16:55:16 -07:00
filemap.c	This merge request includes the dax-4.0-iomap-pmd branch which is	2016-12-14 09:17:42 -08:00
frame_vector.c	mm: replace get_vaddr_frames() write/force parameters with gup_flags	2016-10-19 08:11:24 -07:00
frontswap.c	mm, frontswap: convert frontswap_enabled to static key	2016-07-26 16:19:19 -07:00
gup.c	mm: fix up get_user_pages* comments	2016-12-12 18:55:07 -08:00
highmem.c	mm/highmem: make nr_free_highpages() handles all highmem zones by itself	2016-05-19 19:12:14 -07:00
huge_memory.c	mm: make transparent hugepage size public	2016-12-12 18:55:09 -08:00
hugetlb.c	mm: add tlb_remove_check_page_size_change to track page size change	2016-12-12 18:55:07 -08:00
hugetlb_cgroup.c	mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size	2016-05-20 17:58:30 -07:00
hwpoison-inject.c	hwpoison: use page_cgroup_ino for filtering by memcg	2015-09-10 13:29:01 -07:00
init-mm.c	…
internal.h	mm, compaction: make full priority ignore pageblock suitability	2016-10-07 18:46:29 -07:00
interval_tree.c	mm: replace vma->sharead.linear with vma->shared	2015-02-10 14:30:31 -08:00
khugepaged.c	mm: THP page cache support for ppc64	2016-12-12 18:55:08 -08:00
kmemcheck.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
kmemleak-test.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
kmemleak.c	kmemleak: fix reference to Documentation	2016-12-12 18:55:07 -08:00
ksm.c	mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()	2016-10-07 18:46:29 -07:00
list_lru.c	mm/list_lru.c: avoid error-path NULL pointer deref	2016-10-27 18:43:42 -07:00
maccess.c	x86: remove more uaccess_32.h complexity	2016-05-22 17:21:27 -07:00
madvise.c	mm: add tlb_remove_check_page_size_change to track page size change	2016-12-12 18:55:07 -08:00
memblock.c	mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping	2016-10-11 15:06:33 -07:00
memcontrol.c	Merge branch 'akpm' (patches from Andrew)	2016-12-12 20:50:02 -08:00
memory-failure.c	mm: hwpoison: fix thp split handling in memory_failure()	2016-11-11 08:12:37 -08:00
memory.c	Char/Misc driver patches for 4.10-rc1	2016-12-13 12:11:01 -08:00
memory_hotplug.c	mm: remove x86-only restriction of movable_node	2016-12-12 18:55:07 -08:00
mempolicy.c	mm/mempolicy.c: forbid static or relative flags for local NUMA mode	2016-12-12 18:55:07 -08:00
mempool.c	Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"	2016-07-28 16:07:41 -07:00
memtest.c	memtest: remove unused header files	2015-09-08 15:35:28 -07:00
migrate.c	lib: radix-tree: check accounting of existing slot replacement users	2016-12-12 18:55:08 -08:00
mincore.c	mm, swap: use offset of swap entry as key of swap cache	2016-10-07 18:46:28 -07:00
mlock.c	thp: fix corner case of munlock() of PTE-mapped THPs	2016-11-30 16:32:52 -08:00
mm_init.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
mmap.c	mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb	2016-10-07 18:46:29 -07:00
mmu_context.c	mm/mmu_context, sched/core: Fix mmu_context.h assumption	2016-04-28 11:44:19 +02:00
mmu_notifier.c	fix Christoph's email addresses	2016-03-17 15:09:34 -07:00
mmzone.c	mm, page_alloc: inline the fast path of the zonelist iterator	2016-05-19 19:12:14 -07:00
mprotect.c	mm/pkeys: generate pkey system call code only if ARCH_HAS_PKEYS is selected	2016-12-12 18:55:07 -08:00
mremap.c	mremap: move_ptes: check pte dirty after its removal	2016-11-29 08:20:24 -08:00
msync.c	mm/msync: use offset_in_page macro	2015-11-05 19:34:48 -08:00
nobootmem.c	mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping	2016-10-11 15:06:33 -07:00
nommu.c	lkdtm: Do not use flush_icache_range() on user addresses	2016-11-10 15:34:56 +01:00
oom_kill.c	oom: print nodemask in the oom report	2016-10-07 18:46:29 -07:00
page-writeback.c	writeback: track if we're sleeping on progress in balance_dirty_pages()	2016-11-08 08:28:55 -07:00
page_alloc.c	Merge branch 'akpm' (patches from Andrew)	2016-12-12 20:50:02 -08:00
page_counter.c	mm: page_counter: let page_counter_try_charge() return bool	2015-11-05 19:34:48 -08:00
page_ext.c	mm/page_ext: support extra space allocation by page_ext user	2016-10-07 18:46:27 -07:00
page_idle.c	mm, vmscan: move lru_lock to the node	2016-07-28 16:07:41 -07:00
page_io.c	writeback: add wbc_to_write_flags()	2016-11-02 10:24:03 -06:00
page_isolation.c	mm/page_isolation: fix typo: "paes" -> "pages"	2016-10-07 18:46:29 -07:00
page_owner.c	mm/page_owner: don't define fields on struct page_ext by hard-coding	2016-10-07 18:46:27 -07:00
page_poison.c	mm: check the return value of lookup_page_ext for all call sites	2016-06-03 15:06:22 -07:00
pagewalk.c	thp: rename split_huge_page_pmd() to split_huge_pmd()	2016-01-15 17:56:32 -08:00
percpu-km.c	mm: percpu: use pr_fmt to prefix output	2016-03-17 15:09:34 -07:00
percpu-vm.c	…
percpu.c	Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu	2016-12-13 12:34:47 -08:00
pgtable-generic.c	mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range	2016-03-17 15:09:34 -07:00
process_vm_access.c	mm: remove write/force parameters from __get_user_pages_unlocked()	2016-10-18 14:13:37 -07:00
quicklist.c	fix Christoph's email addresses	2016-03-17 15:09:34 -07:00
readahead.c	mm: don't cap request size based on read-ahead setting	2016-12-12 18:55:08 -08:00
rmap.c	mm, rmap: handle anon_vma_prepare() common case inline	2016-12-12 18:55:08 -08:00
shmem.c	lib: radix-tree: update callback for changing leaf nodes	2016-12-12 18:55:08 -08:00
slab.c	Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	2016-12-13 12:59:57 -08:00
slab.h	mm, slab: maintain total slab count instead of active count	2016-12-12 18:55:07 -08:00
slab_common.c	mm/slab_common.c: check kmem_create_cache flags are common	2016-12-12 18:55:06 -08:00
slob.c	slub: move synchronize_sched out of slab_mutex on shrink	2016-12-12 18:55:06 -08:00
slub.c	slub: avoid false-postive warning	2016-12-12 18:55:06 -08:00
sparse-vmemmap.c	treewide: replace obsolete _refok by __ref	2016-08-02 17:31:41 -04:00
sparse.c	treewide: replace obsolete _refok by __ref	2016-08-02 17:31:41 -04:00
swap.c	thp: reduce usage of huge zero page's atomic counter	2016-10-07 18:46:28 -07:00
swap_cgroup.c	mm: convert printk(KERN_<LEVEL> to pr_<level>	2016-03-17 15:09:34 -07:00
swap_state.c	mm, swap: use offset of swap entry as key of swap cache	2016-10-07 18:46:28 -07:00
swapfile.c	mm: add three more cond_resched() in swapoff	2016-12-12 18:55:08 -08:00
truncate.c	mm: workingset: move shadow entry tracking to radix tree exceptional tracking	2016-12-12 18:55:08 -08:00
usercopy.c	mm: usercopy: Check for module addresses	2016-09-20 16:07:39 -07:00
userfaultfd.c	mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros	2016-04-04 10:41:08 -07:00
util.c	Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2016-10-22 09:39:10 -07:00
vmacache.c	mm: unrig VMA cache hit ratio	2016-10-07 18:46:27 -07:00
vmalloc.c	mm: add preempt points into __purge_vmap_area_lazy()	2016-12-12 18:55:08 -08:00
vmpressure.c	mm/vmpressure.c: fix subtree pressure detection	2016-02-03 08:28:43 -08:00
vmscan.c	Merge branch 'akpm' (patches from Andrew)	2016-12-12 20:50:02 -08:00
vmstat.c	mm/vmstat: Convert to hotplug state machine	2016-12-02 00:52:35 +01:00
workingset.c	mm: workingset: update shadow limit to reflect bigger active list	2016-12-12 18:55:08 -08:00
z3fold.c	mm/z3fold.c: avoid modifying HEADLESS page and minor cleanup	2016-06-03 16:02:55 -07:00
zbud.c	mm/zbud.c: use list_last_entry() instead of list_tail_entry()	2016-01-15 11:40:52 -08:00
zpool.c	mm: zsmalloc: constify struct zs_pool name	2015-11-06 17:50:42 -08:00
zsmalloc.c	mm/zsmalloc: Convert to hotplug state machine	2016-12-02 00:52:36 +01:00
zswap.c	mm/zswap: Convert pool to hotplug state machine	2016-12-02 00:52:36 +01:00