mirror of https://gitee.com/openkylin/linux.git
10650 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Mike Christie | 4e49ea4a3d |
block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw instead of passing it in. This makes that use the same as generic_make_request and how we set the other bio fields. Signed-off-by: Mike Christie <mchristi@redhat.com> Fixed up fs/ext4/crypto.c Signed-off-by: Jens Axboe <axboe@fb.com> |
|
Mel Gorman | e46e7b77c9 |
mm, page_alloc: recalculate the preferred zoneref if the context can ignore memory policies
The optimistic fast path may use cpuset_current_mems_allowed instead of
of a NULL nodemask supplied by the caller for cpuset allocations. The
preferred zone is calculated on this basis for statistic purposes and as
a starting point in the zonelist iterator.
However, if the context can ignore memory policies due to being atomic
or being able to ignore watermarks then the starting point in the
zonelist iterator is no longer correct. This patch resets the zonelist
iterator in the allocator slowpath if the context can ignore memory
policies. This will alter the zone used for statistics but only after
it is known that it makes sense for that context. Resetting it before
entering the slowpath would potentially allow an ALLOC_CPUSET allocation
to be accounted for against the wrong zone. Note that while nodemask is
not explicitly set to the original nodemask, it would only have been
overwritten if cpuset_enabled() and it was reset before the slowpath was
entered.
Link: http://lkml.kernel.org/r/20160602103936.GU2527@techsingularity.net
Fixes:
|
|
Mel Gorman | 0d0bd89435 |
mm, page_alloc: reset zonelist iterator after resetting fair zone allocation policy
Geert Uytterhoeven reported the following problem that bisected to commit |
|
Michal Hocko | cbdcf7f789 |
mm, oom_reaper: do not use siglock in try_oom_reaper()
Oleg has noted that siglock usage in try_oom_reaper is both pointless
and dangerous. signal_group_exit can be checked lockless. The problem
is that sighand becomes NULL in __exit_signal so we can crash.
Fixes:
|
|
Vlastimil Babka | 83b9355bf6 |
mm, page_alloc: prevent infinite loop in buffered_rmqueue()
In DEBUG_VM kernel, we can hit infinite loop for order == 0 in
buffered_rmqueue() when check_new_pcp() returns 1, because the bad page
is never removed from the pcp list. Fix this by removing the page
before retrying. Also we don't need to check if page is non-NULL,
because we simply grab it from the list which was just tested for being
non-empty.
Fixes:
|
|
Vitaly Wool | 43afc19417 |
mm/z3fold.c: avoid modifying HEADLESS page and minor cleanup
Fix erroneous z3fold header access in a HEADLESS page in reclaim function, and change one remaining direct handle-to-buddy conversion to use the appropriate helper. Link: http://lkml.kernel.org/r/5748706F.9020208@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tejun Heo | 3a06bb78ce |
memcg: add RCU locking around css_for_each_descendant_pre() in memcg_offline_kmem()
memcg_offline_kmem() may be called from memcg_free_kmem() after a css init failure. memcg_free_kmem() is a ->css_free callback which is called without cgroup_mutex and memcg_offline_kmem() ends up using css_for_each_descendant_pre() without any locking. Fix it by adding rcu read locking around it. mkdir: cannot create directory `65530': No space left on device =============================== [ INFO: suspicious RCU usage. ] 4.6.0-work+ #321 Not tainted ------------------------------- kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required! [ 527.243970] other info that might help us debug this: [ 527.244715] rcu_scheduler_active = 1, debug_locks = 0 2 locks held by kworker/0:5/1664: #0: ("cgroup_destroy"){.+.+..}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0 #1: ((&css->destroy_work)#3){+.+...}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0 [ 527.248098] stack backtrace: CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014 Workqueue: cgroup_destroy css_free_work_fn Call Trace: dump_stack+0x68/0xa1 lockdep_rcu_suspicious+0xd7/0x110 css_next_descendant_pre+0x7d/0xb0 memcg_offline_kmem.part.44+0x4a/0xc0 mem_cgroup_css_free+0x1ec/0x200 css_free_work_fn+0x49/0x5e0 process_one_work+0x1c5/0x4a0 worker_thread+0x49/0x490 kthread+0xea/0x100 ret_from_fork+0x1f/0x40 Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> [4.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | f86e427197 |
mm: check the return value of lookup_page_ext for all call sites
Per the discussion with Joonsoo Kim [1], we need check the return value of lookup_page_ext() for all call sites since it might return NULL in some cases, although it is unlikely, i.e. memory hotplug. Tested with ltp with "page_owner=0". [1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE [akpm@linux-foundation.org: fix build-breaking typos] [arnd@arndb.de: fix build problems from lookup_page_ext] Link: http://lkml.kernel.org/r/6285269.2CksypHdYp@wuerfel [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1464023768-31025-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Guillermo Julián Moreno | 65ee03c4b9 |
mm: fix overflow in vm_map_ram()
When remapping pages accounting for 4G or more memory space, the operation 'count << PAGE_SHIFT' overflows as it is performed on an integer. Solution: cast before doing the bitshift. [akpm@linux-foundation.org: fix vm_unmap_ram() also] [akpm@linux-foundation.org: fix vmap() as well, per Guillermo] Link: http://lkml.kernel.org/r/etPan.57175fb3.7a271c6b.2bd@naudit.es Signed-off-by: Guillermo Julián Moreno <guillermo.julian@naudit.es> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tejun Heo | 62a584fe05 |
writeback: use higher precision calculation in domain_dirty_limits()
As vm.dirty_[background_]bytes can't be applied verbatim to multiple
cgroup writeback domains, they get converted to percentages in
domain_dirty_limits() and applied the same way as
vm.dirty_[background]ratio. However, if the specified bytes is lower
than 1% of available memory, the calculated ratios become zero and the
writeback domain gets throttled constantly.
Fix it by using per-PAGE_SIZE instead of percentage for ratio
calculations. Also, the updated DIV_ROUND_UP() usages now should
yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
bytes are above zero.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Miao Xie <miaoxie@huawei.com>
Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
Cc: stable@vger.kernel.org # v4.2+
Fixes:
|
|
Al Viro | 93c76a3d43 |
file_inode(f)->i_mapping is f->f_mapping
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Linus Torvalds | d102a56edb |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro: "Followups to the parallel lookup work: - update docs - restore killability of the places that used to take ->i_mutex killably now that we have down_write_killable() merged - Additionally, it turns out that I missed a prerequisite for security_d_instantiate() stuff - ->getxattr() wasn't the only thing that could be called before dentry is attached to inode; with smack we needed the same treatment applied to ->setxattr() as well" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch ->setxattr() to passing dentry and inode separately switch xattr_handler->set() to passing dentry and inode separately restore killability of old mutex_lock_killable(&inode->i_mutex) users add down_write_killable_nested() update D/f/directory-locking |
|
Linus Torvalds | 5d22fc25d4 |
mm: remove more IS_ERR_VALUE abuses
The do_brk() and vm_brk() return value was "unsigned long" and returned the starting address on success, and an error value on failure. The reasons are entirely historical, and go back to it basically behaving like the mmap() interface does. However, nobody actually wanted that interface, and it causes totally pointless IS_ERR_VALUE() confusion. What every single caller actually wants is just the simpler integer return of zero for success and negative error number on failure. So just convert to that much clearer and more common calling convention, and get rid of all the IS_ERR_VALUE() uses wrt vm_brk(). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 7ded384a12 |
mm: fix section mismatch warning
The register_page_bootmem_info_node() function needs to be marked __init
in order to avoid a new warning introduced by commit
|
|
Gavin Shan | 11e685672a |
mm: disable DEFERRED_STRUCT_PAGE_INIT on !NO_BOOTMEM
When we have !NO_BOOTMEM, the deferred page struct initialization doesn't work well because the pages reserved in bootmem are released to the page allocator uncoditionally. It causes memory corruption and system crash eventually. As Mel suggested, the bootmem is retiring slowly. We fix the issue by simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled. Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Li RongQing | 7cf7806ce1 |
mm/memcontrol.c: move comments for get_mctgt_type() to proper position
Move the comments for get_mctgt_type() to be before get_mctgt_type() implementation. Link: http://lkml.kernel.org/r/1463644638-7446-1-git-send-email-roy.qing.li@gmail.com Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Li RongQing | cbedbac3e6 |
mm/memcontrol.c: fix the margin computation in mem_cgroup_margin()
mem_cgroup_margin() might return (memory.limit - memory_count) when the memsw.limit is in excess. This doesn't happen usually because we do not allow excess on hard limits and (memory.limit <= memsw.limit), but __GFP_NOFAIL charges can force the charge and cause the excess when no memory is really swappable (swap is full or no anonymous memory is left). [mhocko@suse.com: rewrote changelog] Link: http://lkml.kernel.org/r/20160525155122.GK20132@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1464068266-27736-1-git-send-email-roy.qing.li@gmail.com Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Stephen Rothwell | badbda53e5 |
mm/cma: silence warnings due to max() usage
pageblock_order can be (at least) an unsigned int or an unsigned long depending on the kernel config and architecture, so use max_t(unsigned long, ...) when comparing it. fixes these warnings: In file included from include/asm-generic/bug.h:13:0, from arch/powerpc/include/asm/bug.h:127, from include/linux/bug.h:4, from include/linux/mmdebug.h:4, from include/linux/mm.h:8, from include/linux/memblock.h:18, from mm/cma.c:28: mm/cma.c: In function 'cma_init_reserved_mem': include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast (void) (&_max1 == &_max2); ^ mm/cma.c:186:27: note: in expansion of macro 'max' alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order); ^ mm/cma.c: In function 'cma_declare_contiguous': include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast (void) (&_max1 == &_max2); ^ include/linux/kernel.h:747:9: note: in definition of macro 'max' typeof(y) _max2 = (y); ^ mm/cma.c:270:29: note: in expansion of macro 'max' (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order)); ^ include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast (void) (&_max1 == &_max2); ^ include/linux/kernel.h:747:21: note: in definition of macro 'max' typeof(y) _max2 = (y); ^ mm/cma.c:270:29: note: in expansion of macro 'max' (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order)); ^ [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | 0798d3c022 |
mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()
If page_move_anon_rmap() is refiling a pmd-splitted THP mapped in a tail
page from a pte, the "address" must be THP aligned in order for the
page->index bugcheck to pass in the CONFIG_DEBUG_VM=y builds.
Link: http://lkml.kernel.org/r/1464253620-106404-1-git-send-email-kirill.shutemov@linux.intel.com
Fixes:
|
|
Michal Hocko | e2fe14564d |
oom_reaper: close race with exiting task
Tetsuo has reported:
Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
sh cpuset=/ mems_allowed=0
CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
Call Trace:
dump_stack+0x85/0xc8
dump_header+0x5b/0x394
oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
In other words:
__oom_reap_task exit_mm
atomic_inc_not_zero
tsk->mm = NULL
mmput
atomic_dec_and_test # > 0
exit_oom_victim # New victim will be
# selected
<OOM killer invoked>
# no TIF_MEMDIE task so we can select a new one
unmap_page_range # to release the memory
The race exists even without the oom_reaper because anybody who pins the
address space and gets preempted might race with exit_mm but oom_reaper
made this race more probable.
We can address the oom_reaper part by using oom_lock for __oom_reap_task
because this would guarantee that a new oom victim will not be selected
if the oom reaper might race with the exit path. This doesn't solve the
original issue, though, because somebody else still might be pinning
mm_users and so __mmput won't be called to release the memory but that
is not really realiably solvable because the task will get away from the
oom sight as soon as it is unhashed from the task_list and so we cannot
guarantee a new victim won't be selected.
[akpm@linux-foundation.org: fix use of unused `mm', Per Stephen]
[akpm@linux-foundation.org: coding-style fixes]
Fixes:
|
|
Yang Shi | f65e91df25 |
mm: use early_pfn_to_nid in register_page_bootmem_info_node
register_page_bootmem_info_node() is invoked in mem_init(), so it will be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is enabled. But, pfn_to_nid() depends on memmap which won't be fully setup until page_alloc_init_late() is done, so replace pfn_to_nid() by early_pfn_to_nid(). Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | fe53ca5427 |
mm: use early_pfn_to_nid in page_ext_init
page_ext_init() checks suitable pages with pfn_to_nid(), but
pfn_to_nid() depends on memmap which will not be setup fully until
page_alloc_init_late() is done. Use early_pfn_to_nid() instead of
pfn_to_nid() so that page extension could be still used early even
though CONFIG_ DEFERRED_STRUCT_PAGE_INIT is enabled and catch early page
allocation call sites.
Suggested by Joonsoo Kim [1], this fix basically undoes the change
introduced by commit
|
|
Vladimir Davydov | edd9f7230f |
mm: oom: do not reap task if there are live threads in threadgroup
If the current process is exiting, we don't invoke oom killer, instead
we give it access to memory reserves and try to reap its mm in case
nobody is going to use it. There's a mistake in the code performing
this check - we just ignore any process of the same thread group no
matter if it is exiting or not - see try_oom_reaper. Fix it.
Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
Fixes:
|
|
Al Viro | 5930122683 |
switch xattr_handler->set() to passing dentry and inode separately
preparation for similar switch in ->setxattr() (see the next commit for rationale). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Linus Torvalds | e12fab28df |
Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton: "10 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4 update "mm/zsmalloc: don't fail if can't create debugfs info" dma-debug: avoid spinlock recursion when disabling dma-debug mm: oom_reaper: remove some bloat memcg: fix mem_cgroup_out_of_memory() return value. ocfs2: fix improper handling of return errno mm: slub: remove unused virt_to_obj() mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly seqlock: fix raw_read_seqcount_latch() |
|
Linus Torvalds | 478a1469a7 |
Filesystem DAX locking for 4.7
- We use a bit in an exceptional radix tree entry as a lock bit and use it similarly to how page lock is used for normal faults. This fixes races between hole instantiation and read faults of the same index. - Filesystem DAX PMD faults are disabled, and will be re-enabled when PMD locking is implemented. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJXRKwLAAoJEJ/BjXdf9fLB+BkP/3HBm05KlAKDklvnBIPFDMUK hA7g2K6vuvaEDZXZQ1ioc1Ajf1sCpVip7shXJsojZqwWmRz0/4nneF7ytluW9AjS dBX+0qCgKGH1fnwyGFF+MN7fuj7kGrSDz34lG0OObRN6/oKiVNb2svXiYKkT6J6C AgsWlWRUpMy9jrn1u/FduMjDhk92Z3ojarexuicr0i8NUlBClCIrdCEmUMi4orSB DuiIjestLOc7+mERBUwrXkzoh9v8Z0FpIgnDLWwpeEkAvJwWkGe5eXrBJwF+hEbi RYfTrOYc7bBQLo22LRb8pdighjrx3OW9EpNCfEmLDOjM3cYBbMK/d2i/ww52H6IK Mw6iS5rXdGgJtQIGL8N96HLFk+cDyZ8J8xNUCwbYYBJqgpMzxzVkL3vTm72tyFnl InWhih+miCMbBPytQSRd6+1wZG2piJTv6SsFTd5K1OaiRmJhBJZG47t2QTBRBu7Y 5A4FGPtlraV+iDJvD6VLO1Tp8twxdLluOJ2BwdGeiKXiGh6LP+FGGFF3aFa5N4Ro xSslCTX7Q1G66zXQwD4+IMWLwS1FDNymPkUSsF6RQo6qfAnl9SrmYTc4xJ4QXy92 sUdrWEz2OBTfxKNqbGyc/KrXKZT3RnEkJNft8snB2h6WTCdOPaNYs/yETUwiwkSc CXpuQFrxm69QYwNsqVu1 =Pkd0 -----END PGP SIGNATURE----- Merge tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull DAX locking updates from Ross Zwisler: "Filesystem DAX locking for 4.7 - We use a bit in an exceptional radix tree entry as a lock bit and use it similarly to how page lock is used for normal faults. This fixes races between hole instantiation and read faults of the same index. - Filesystem DAX PMD faults are disabled, and will be re-enabled when PMD locking is implemented" * tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Remove i_mmap_lock protection dax: Use radix tree entry lock to protect cow faults dax: New fault locking dax: Allow DAX code to replace exceptional entries dax: Define DAX lock bit for radix tree exceptional entry dax: Make huge page handling depend of CONFIG_BROKEN dax: Fix condition for filling of PMD holes |
|
Dan Streetman | 4abaac9b73 |
update "mm/zsmalloc: don't fail if can't create debugfs info"
Some updates to commit
|
|
Tetsuo Handa | 1ebab2db06 |
memcg: fix mem_cgroup_out_of_memory() return value.
mem_cgroup_out_of_memory() is returning "true" if it finds a TIF_MEMDIE task after an eligible task was found, "false" if it found a TIF_MEMDIE task before an eligible task is found. This difference confuses memory_max_write() which checks the return value of mem_cgroup_out_of_memory(). Since memory_max_write() wants to continue looping, mem_cgroup_out_of_memory() should return "true" in this case. This patch sets a dummy pointer in order to return "true". Link: http://lkml.kernel.org/r/1463753327-5170-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Ryabinin | 9725759a96 |
mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta
Commit
|
|
Yang Shi | 957949243b |
mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
Per the suggestion from Michal Hocko [1], DEFERRED_STRUCT_PAGE_INIT requires some ordering wrt other initialization operations, e.g. page_ext_init has to happen after the whole memmap is initialized properly. For SPARSEMEM this requires to wait for page_alloc_init_late. Other memory models (e.g. flatmem) might have different initialization layouts (page_ext_init_flatmem). Currently DEFERRED_STRUCT_PAGE_INIT depends on MEMORY_HOTPLUG which in turn depends on SPARSEMEM || X86_64_ACPI_NUMA depends on ARCH_ENABLE_MEMORY_HOTPLUG and X86_64_ACPI_NUMA depends on NUMA which in turn disable FLATMEM memory model: config ARCH_FLATMEM_ENABLE def_bool y depends on X86_32 && !NUMA so FLATMEM is ruled out via dependency maze. Be explicit and disable FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not reintroduce subtle initialization bugs [1] http://lkml.kernel.org/r/20160523073157.GD2278@dhcp22.suse.cz Link: http://lkml.kernel.org/r/1464027356-32282-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tejun Heo | 6710e594f7 |
percpu: fix synchronization between synchronous map extension and chunk destruction
For non-atomic allocations, pcpu_alloc() can try to extend the area
map synchronously after dropping pcpu_lock; however, the extension
wasn't synchronized against chunk destruction and the chunk might get
freed while extension is in progress.
This patch fixes the bug by putting most of non-atomic allocations
under pcpu_alloc_mutex to synchronize against pcpu_balance_work which
is responsible for async chunk management including destruction.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # v3.18+
Fixes:
|
|
Tejun Heo | 4f996e234d |
percpu: fix synchronization between chunk->map_extend_work and chunk destruction
Atomic allocations can trigger async map extensions which is serviced
by chunk->map_extend_work. pcpu_balance_work which is responsible for
destroying idle chunks wasn't synchronizing properly against
chunk->map_extend_work and may end up freeing the chunk while the work
item is still in flight.
This patch fixes the bug by rolling async map extension operations
into pcpu_balance_work.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # v3.18+
Fixes:
|
|
Linus Torvalds | 84787c572d |
Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton: - Oleg's "wait/ptrace: assume __WALL if the child is traced". It's a kernel-based workaround for existing userspace issues. - A few hotfixes - befs cleanups - nilfs2 updates - sys_wait() changes - kexec updates - kdump - scripts/gdb updates - the last of the MM queue - a few other misc things * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (84 commits) kgdb: depends on VT drm/amdgpu: make amdgpu_mn_get wait for mmap_sem killable drm/radeon: make radeon_mn_get wait for mmap_sem killable drm/i915: make i915_gem_mmap_ioctl wait for mmap_sem killable uprobes: wait for mmap_sem for write killable prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable exec: make exec path waiting for mmap_sem killable aio: make aio_setup_ring killable coredump: make coredump_wait wait for mmap_sem for write killable vdso: make arch_setup_additional_pages wait for mmap_sem for write killable ipc, shm: make shmem attach/detach wait for mmap_sem killable mm, fork: make dup_mmap wait for mmap_sem for write killable mm, proc: make clear_refs killable mm: make vm_brk killable mm, elf: handle vm_brk error mm, aout: handle vm_brk failures mm: make vm_munmap killable mm: make vm_mmap killable mm: make mmap_sem for write waits killable for mm syscalls MAINTAINERS: add co-maintainer for scripts/gdb ... |
|
Michal Hocko | 2d6c928241 |
mm: make vm_brk killable
Now that all the callers handle vm_brk failure we can change it wait for mmap_sem killable to help oom_reaper to not get blocked just because vm_brk gets blocked behind mmap_sem readers. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | ae79878356 |
mm: make vm_munmap killable
Almost all current users of vm_munmap are ignoring the return value and so they do not handle potential error. This means that some VMAs might stay behind. This patch doesn't try to solve those potential problems. Quite contrary it adds a new failure mode by using down_write_killable in vm_munmap. This should be safer than other failure modes, though, because the process is guaranteed to die as soon as it leaves the kernel and exit_mmap will clean the whole address space. This will help in the OOM conditions when the oom victim might be stuck waiting for the mmap_sem for write which in turn can block oom_reaper which relies on the mmap_sem for read to make a forward progress and reclaim the address space of the victim. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 9fbeb5ab59 |
mm: make vm_mmap killable
All the callers of vm_mmap seem to check for the failure already and bail out in one way or another on the error which means that we can change it to use killable version of vm_mmap_pgoff and return -EINTR if the current task gets killed while waiting for mmap_sem. This also means that vm_mmap_pgoff can be killable by default and drop the additional parameter. This will help in the OOM conditions when the oom victim might be stuck waiting for the mmap_sem for write which in turn can block oom_reaper which relies on the mmap_sem for read to make a forward progress and reclaim the address space of the victim. Please note that load_elf_binary is ignoring vm_mmap error for current->personality & MMAP_PAGE_ZERO case but that shouldn't be a problem because the address is not used anywhere and we never return to the userspace if we got killed. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | dc0ef0df7b |
mm: make mmap_sem for write waits killable for mm syscalls
This is a follow up work for oom_reaper [1]. As the async OOM killing depends on oom_sem for read we would really appreciate if a holder for write didn't stood in the way. This patchset is changing many of down_write calls to be killable to help those cases when the writer is blocked and waiting for readers to release the lock and so help __oom_reap_task to process the oom victim. Most of the patches are really trivial because the lock is help from a shallow syscall paths where we can return EINTR trivially and allow the current task to die (note that EINTR will never get to the userspace as the task has fatal signal pending). Others seem to be easy as well as the callers are already handling fatal errors and bail and return to userspace which should be sufficient to handle the failure gracefully. I am not familiar with all those code paths so a deeper review is really appreciated. As this work is touching more areas which are not directly connected I have tried to keep the CC list as small as possible and people who I believed would be familiar are CCed only to the specific patches (all should have received the cover though). This patchset is based on linux-next and it depends on down_write_killable for rw_semaphores which got merged into tip locking/rwsem branch and it is merged into this next tree. I guess it would be easiest to route these patches via mmotm because of the dependency on the tip tree but if respective maintainers prefer other way I have no objections. I haven't covered all the mmap_write(mm->mmap_sem) instances here $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l 98 $ git grep "down_write(.*\<mmap_sem\>)" | wc -l 62 I have tried to cover those which should be relatively easy to review in this series because this alone should be a nice improvement. Other places can be changed on top. [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org This patch (of 18): This is the first step in making mmap_sem write waiters killable. It focuses on the trivial ones which are taking the lock early after entering the syscall and they are not changing state before. Therefore it is very easy to change them to use down_write_killable and immediately return with -EINTR. This will allow the waiter to pass away without blocking the mmap_sem which might be required to make a forward progress. E.g. the oom reaper will need the lock for reading to dismantle the OOM victim address space. The only tricky function in this patch is vm_mmap_pgoff which has many call sites via vm_mmap. To reduce the risk keep vm_mmap with the original non-killable semantic for now. vm_munmap callers do not bother checking the return value so open code it into the munmap syscall path for now for simplicity. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | 1383399d7b |
mm: memcontrol: fix possible css ref leak on oom
mem_cgroup_oom may be invoked multiple times while a process is handling a page fault, in which case current->memcg_in_oom will be overwritten leaking the previously taken css reference. Link: http://lkml.kernel.org/r/1464019330-7579-1-git-send-email-vdavydov@virtuozzo.com Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 1d6da87a32 |
Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie: "Here's the main drm pull request for 4.7, it's been a busy one, and I've been a bit more distracted in real life this merge window. Lots more ARM drivers, not sure if it'll ever end. I think I've at least one more coming the next merge window. But changes are all over the place, support for AMD Polaris GPUs is in here, some missing GM108 support for nouveau (found in some Lenovos), a bunch of MST and skylake fixes. I've also noticed a few fixes from Arnd in my inbox, that I'll try and get in asap, but I didn't think they should hold this up. New drivers: - Hisilicon kirin display driver - Mediatek MT8173 display driver - ARC PGU - bitstreamer on Synopsys ARC SDP boards - Allwinner A13 initial RGB output driver - Analogix driver for DisplayPort IP found in exynos and rockchip DRM Core: - UAPI headers fixes and C++ safety - DRM connector reference counting - DisplayID mode parsing for Dell 5K monitors - Removal of struct_mutex from drivers - Connector registration cleanups - MST robustness fixes - MAINTAINERS updates - Lockless GEM object freeing - Generic fbdev deferred IO support panel: - Support for a bunch of new panels i915: - VBT refactoring - PLL computation cleanups - DSI support for BXT - Color manager support - More atomic patches - GEM improvements - GuC fw loading fixes - DP detection fixes - SKL GPU hang fixes - Lots of BXT fixes radeon/amdgpu: - Initial Polaris support - GPUVM/Scheduler/Clock/Power improvements - ASYNC pageflip support - New mesa feature support nouveau: - GM108 support - Power sensor support improvements - GR init + ucode fixes. - Use GPU provided topology information vmwgfx: - Add host messaging support gma500: - Some cleanups and fixes atmel: - Bridge support - Async atomic commit support fsl-dcu: - Timing controller for LCD support - Pixel clock polarity support rcar-du: - Misc fixes exynos: - Pipeline clock support - Exynoss4533 SoC support - HW trigger mode support - export HDMI_PHY clock - DECON5433 fixes - Use generic prime functions - use DMA mapping APIs rockchip: - Lots of little fixes vc4: - Render node support - Gamma ramp support - DPI output support msm: - Mostly cleanups and fixes - Conversion to generic struct fence etnaviv: - Fix for prime buffer handling - Allow hangcheck to be coalesced with other wakeups tegra: - Gamme table size fix" * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1050 commits) drm/edid: add displayid detailed 1 timings to the modelist. (v1.1) drm/edid: move displayid validation to it's own function. drm/displayid: Iterate over all DisplayID blocks drm/edid: move displayid tiled block parsing into separate function. drm: Nuke ->vblank_disable_allowed drm/vmwgfx: Report vmwgfx version to vmware.log drm/vmwgfx: Add VMWare host messaging capability drm/vmwgfx: Kill some lockdep warnings drm/nouveau/gr/gf100-: fix race condition in fecs/gpccs ucode drm/nouveau/core: recognise GM108 chipsets drm/nouveau/gr/gm107-: fix touching non-existent ppcs in attrib cb setup drm/nouveau/gr/gk104-: share implementation of ppc exception init drm/nouveau/gr/gk104-: move rop_active_fbps init to nonctx drm/nouveau/bios/pll: check BIT table version before trying to parse it drm/nouveau/bios/pll: prevent oops when limits table can't be parsed drm/nouveau/volt/gk104: round up in gk104_volt_set drm/nouveau/fb/gm200: setup mmu debug buffer registers at init() drm/nouveau/fb/gk20a,gm20b: setup mmu debug buffer registers at init() drm/nouveau/fb/gf100-: allocate mmu debug buffers drm/nouveau/fb: allow chipset-specific actions for oneinit() ... |
|
Linus Torvalds | 1f40c49570 |
libnvdimm for 4.7
1/ Device DAX for persistent memory: Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped without need of an intervening file system. Device DAX is strict, precise and predictable. Specifically this interface: a) Guarantees fault granularity with respect to a given page size (pte, pmd, or pud) set at configuration time. b) Enforces deterministic behavior by being strict about what fault scenarios are supported. Persistent memory is the first target, but the mechanism is also targeted for exclusive allocations of performance/feature differentiated memory ranges. 2/ Support for the HPE DSM (device specific method) command formats. This enables management of these first generation devices until a unified DSM specification materializes. 3/ Further ACPI 6.1 compliance with support for the common dimm identifier format. 4/ Various fixes and cleanups across the subsystem. -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJXQhdeAAoJEB7SkWpmfYgCYP8P/RAgHkroL5lUKKU45TQUBKcY diC9POeNSccme4tIRIQCGQUZ7+7mKM5ECv2ulF4xYOHvFBCcd/8OF6xKAXs48r3v oguYhvX1YvIkBc9FUfBQbR1IsCOJ7uWp/UYiYCIQEXS5tS9Jv545j3ASqDt9xWoV TWlceZn3yWSbASiV9qZ2eXhEkk75pg4yara++rsm2/7rs/TTXn5EIjBs+57BtAo+ 6utI4fTy0CQvBYwVzam3m7y9dt2Z2jWXL4hgmT7pkvJ7HDoctVly0P9+bknJPUAo g+NugKgTGeiqH5GYp5CTZ9KvL91sDF4q00pfinITVdFl0E3VE293cIHlAzSQBm5/ w58xxaRV958ZvpH7EaBmYQG82QDi/eFNqeHqVGn0xAM6MlaqO7avUMQp2lRPYMCJ u1z/NloR5yo+sffHxsn5Luiq9KqOf6zk33PuxEkKbN74OayCSPn/SeVCO7rQR0B6 yPMJTTcTiCLnId1kOWAPaEmuK2U3BW/+ogg7hKgeCQSysuy5n6Ok5a2vEx/gJRAm v9yF68RmIWumpHr+QB0TmB8mVbD5SY+xWTm3CqJb9MipuFIOF7AVsPyTgucBvE7s v+i5F6MDO6tcVfiDT4AiZEt6D2TM5RbtckkUEX3ZTD6j7CGuR5D8bH0HNRrghrYk KT1lAk6tjWBOGAHc5Ji7 =Y3Xv -----END PGP SIGNATURE----- Merge tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull libnvdimm updates from Dan Williams: "The bulk of this update was stabilized before the merge window and appeared in -next. The "device dax" implementation was revised this week in response to review feedback, and to address failures detected by the recently expanded ndctl unit test suite. Not included in this pull request are two dax topic branches (dax error handling, and dax radix-tree locking). These topics were deferred to get a few more days of -next integration testing, and to coordinate a branch baseline with Ted and the ext4 tree. Vishal and Ross will send the error handling and locking topics respectively in the next few days. This branch has received a positive build result from the kbuild robot across 226 configs. Summary: - Device DAX for persistent memory: Device DAX is the device-centric analogue of Filesystem DAX (CONFIG_FS_DAX). It allows memory ranges to be allocated and mapped without need of an intervening file system. Device DAX is strict, precise and predictable. Specifically this interface: a) Guarantees fault granularity with respect to a given page size (pte, pmd, or pud) set at configuration time. b) Enforces deterministic behavior by being strict about what fault scenarios are supported. Persistent memory is the first target, but the mechanism is also targeted for exclusive allocations of performance/feature differentiated memory ranges. - Support for the HPE DSM (device specific method) command formats. This enables management of these first generation devices until a unified DSM specification materializes. - Further ACPI 6.1 compliance with support for the common dimm identifier format. - Various fixes and cleanups across the subsystem" * tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits) libnvdimm, dax: fix deletion libnvdimm, dax: fix alignment validation libnvdimm, dax: autodetect support libnvdimm: release ida resources Revert "block: enable dax for raw block devices" /dev/dax, core: file operations and dax-mmap /dev/dax, pmem: direct access to persistent memory libnvdimm: stop requiring a driver ->remove() method libnvdimm, dax: record the specified alignment of a dax-device instance libnvdimm, dax: reserve space to store labels for device-dax libnvdimm, dax: introduce device-dax infrastructure nfit: add sysfs dimm 'family' and 'dsm_mask' attributes tools/testing/nvdimm: ND_CMD_CALL support nfit: disable vendor specific commands nfit: export subsystem ids as attributes nfit: fix format interface code byte order per ACPI6.1 nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism nfit, libnvdimm: clarify "commands" vs "_DSMs" libnvdimm: increase max envelope size for ioctl acpi/nfit: Add sysfs "id" for NVDIMM ID ... |
|
Linus Torvalds | bd28b14591 |
x86: remove more uaccess_32.h complexity
I'm looking at trying to possibly merge the 32-bit and 64-bit versions of the x86 uaccess.h implementation, but first this needs to be cleaned up. For example, the 32-bit version of "__copy_from_user_inatomic()" is mostly the special cases for the constant size, and it's actually almost never relevant. Most users aren't actually using a constant size anyway, and the few cases that do small constant copies are better off just using __get_user() instead. So get rid of the unnecessary complexity. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Dan Williams | dee4107924 |
/dev/dax, core: file operations and dax-mmap
The "Device DAX" core enables dax mappings of performance / feature differentiated memory. An open mapping or file handle keeps the backing struct device live, but new mappings are only possible while the device is enabled. Faults are handled under rcu_read_lock to synchronize with the enabled state of the device. Similar to the filesystem-dax case the backing memory may optionally have struct page entries. However, unlike fs-dax there is no support for private mappings, or mappings that are not backed by media (see use of zero-page in fs-dax). Mappings are always guaranteed to match the alignment of the dax_region. If the dax_region is configured to have a 2MB alignment, all mappings are guaranteed to be backed by a pmd entry. Contrast this determinism with the fs-dax case where pmd mappings are opportunistic. If userspace attempts to force a misaligned mapping, the driver will fail the mmap attempt. See dax_dev_check_vma() for other scenarios that are rejected, like MAP_PRIVATE mappings. Cc: Hannes Reinecke <hare@suse.de> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> |
|
Matthew Wilcox | d604c32452 |
radix-tree: introduce radix_tree_replace_clear_tags()
In addition to replacing the entry, we also clear all associated tags. This is really a one-off special for page_cache_tree_delete() which had far too much detailed knowledge about how the radix tree works. For efficiency, factor node_tag_clear() out of radix_tree_tag_clear() It can be used by radix_tree_delete_item() as well as radix_tree_replace_clear_tags(). Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Matthew Wilcox | 57578c2ea2 |
raxix-tree: introduce CONFIG_RADIX_TREE_MULTIORDER
I've been receiving increasingly concerned notes from 0day about how much my recent changes have been bloating the radix tree. Make it happier by only including multiorder support if CONFIG_TRANSPARENT_HUGEPAGES is set. This is an independent Kconfig option, so other radix tree users can also set it if they have a need. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Jan Kara <jack@suse.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Dan Streetman | d34f615720 |
mm/zsmalloc: don't fail if can't create debugfs info
Change the return type of zs_pool_stat_create() to void, and remove the logic to abort pool creation if the stat debugfs dir/file could not be created. The debugfs stat file is for debugging/information only, and doesn't affect operation of zsmalloc; there is no reason to abort creating the pool if the stat file can't be created. This was seen with zswap, which used the same name for all pool creations, which caused zsmalloc to fail to create a second pool for zswap if CONFIG_ZSMALLOC_STAT was enabled. Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Dan Streetman | 200867af4d |
mm/zswap: use workqueue to destroy pool
Add a work_struct to struct zswap_pool, and change __zswap_pool_empty to
use the workqueue instead of using call_rcu().
When zswap destroys a pool no longer in use, it uses call_rcu() to
perform the destruction/freeing. Since that executes in softirq
context, it must not sleep. However, actually destroying the pool
involves freeing the per-cpu compressors (which requires locking the
cpu_add_remove_lock mutex) and freeing the zpool, for which the
implementation may sleep (e.g. zsmalloc calls kmem_cache_destroy, which
locks the slab_mutex). So if either mutex is currently taken, or any
other part of the compressor or zpool implementation sleeps, it will
result in a BUG().
It's not easy to reproduce this when changing zswap's params normally.
In testing with a loaded system, this does not fail:
$ cd /sys/module/zswap/parameters
$ echo lz4 > compressor ; echo zsmalloc > zpool
nor does this:
$ while true ; do
> echo lzo > compressor ; echo zbud > zpool
> sleep 1
> echo lz4 > compressor ; echo zsmalloc > zpool
> sleep 1
> done
although it's still possible either of those might fail, depending on
whether anything else besides zswap has locked the mutexes.
However, changing a parameter with no delay immediately causes the
schedule while atomic BUG:
$ while true ; do
> echo lzo > compressor ; echo lz4 > compressor
> done
This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
but moved to zswap, to cover compressor and zpool freeing.
Fixes:
|
|
Sergey Senozhatsky | d0d8da2dc4 |
zsmalloc: require GFP in zs_malloc()
Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to zs_create_pool(), so we can be more flexible, but, more importantly, we need this to switch zram to per-cpu compression streams -- zram will try to allocate handle with preemption disabled in a fast path and switch to a slow path (using different gfp mask) if the fast one has failed. Apart from that, this also align zs_malloc() interface with zspool/zbud. [sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask] Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Minchan Kim | 1ee4716585 |
zsmalloc: remove unused pool param in obj_free
Let's remove unused pool param in obj_free Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Minchan Kim | 251cbb951b |
zsmalloc: reorder function parameters
Clean up function parameter ordering to order higher data structure first. Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Minchan Kim | 830e4bc5ba |
zsmalloc: clean up many BUG_ON
There are many BUG_ON in zsmalloc.c which is not recommened so change them as alternatives. Normal rule is as follows: 1. avoid BUG_ON if possible. Instead, use VM_BUG_ON or VM_BUG_ON_PAGE 2. use VM_BUG_ON_PAGE if we need to see struct page's fields 3. use those assertion in primitive functions so higher functions can rely on the assertion in the primitive function. 4. Don't use assertion if following instruction can trigger Oops Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Minchan Kim | a42094676f |
zsmalloc: use first_page rather than page
Clean up function parameter "struct page". Many functions of zsmalloc expect that page paramter is "first_page" so use "first_page" rather than "page" for code readability. Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Ryabinin | 64f8ebaf11 |
mm/kasan: add API to check memory regions
Memory access coded in an assembly won't be seen by KASAN as a compiler can instrument only C code. Add kasan_check_[read,write]() API which is going to be used to check a certain memory range. Link: http://lkml.kernel.org/r/1462538722-1574-3-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Ryabinin | 936bb4bbbb |
mm/kasan: print name of mem[set,cpy,move]() caller in report
When bogus memory access happens in mem[set,cpy,move]() it's usually caller's fault. So don't blame mem[set,cpy,move]() in bug report, blame the caller instead. Before: BUG: KASAN: out-of-bounds access in memset+0x23/0x40 at <address> After: BUG: KASAN: out-of-bounds access in <memset_caller> at <address> Link: http://lkml.kernel.org/r/1462538722-1574-2-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexander Potapenko | 4ebb31a42f |
mm, kasan: don't call kasan_krealloc() from ksize().
Instead of calling kasan_krealloc(), which replaces the memory allocation stack ID (if stack depot is used), just unpoison the whole memory chunk. Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Konstantin Serebryany <kcc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexander Potapenko | 55834c5909 |
mm: kasan: initial memory quarantine implementation
Quarantine isolates freed objects in a separate queue. The objects are returned to the allocator later, which helps to detect use-after-free errors. When the object is freed, its state changes from KASAN_STATE_ALLOC to KASAN_STATE_QUARANTINE. The object is poisoned and put into quarantine instead of being returned to the allocator, therefore every subsequent access to that object triggers a KASAN error, and the error handler is able to say where the object has been allocated and deallocated. When it's time for the object to leave quarantine, its state becomes KASAN_STATE_FREE and it's returned to the allocator. From now on the allocator may reuse it for another allocation. Before that happens, it's still possible to detect a use-after free on that object (it retains the allocation/deallocation stacks). When the allocator reuses this object, the shadow is unpoisoned and old allocation/deallocation stacks are wiped. Therefore a use of this object, even an incorrect one, won't trigger ASan warning. Without the quarantine, it's not guaranteed that the objects aren't reused immediately, that's why the probability of catching a use-after-free is lower than with quarantine in place. Quarantine isolates freed objects in a separate queue. The objects are returned to the allocator later, which helps to detect use-after-free errors. Freed objects are first added to per-cpu quarantine queues. When a cache is destroyed or memory shrinking is requested, the objects are moved into the global quarantine queue. Whenever a kmalloc call allows memory reclaiming, the oldest objects are popped out of the global queue until the total size of objects in quarantine is less than 3/4 of the maximum quarantine size (which is a fraction of installed physical memory). As long as an object remains in the quarantine, KASAN is able to report accesses to it, so the chance of reporting a use-after-free is increased. Once the object leaves quarantine, the allocator may reuse it, in which case the object is unpoisoned and KASAN can't detect incorrect accesses to it. Right now quarantine support is only enabled in SLAB allocator. Unification of KASAN features in SLAB and SLUB will be done later. This patch is based on the "mm: kasan: quarantine" patch originally prepared by Dmitry Chernenkov. A number of improvements have been suggested by Andrey Ryabinin. [glider@google.com: v9] Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
David Rientjes | dfef2ef402 |
mm, migrate: increment fail count on ENOMEM
If page migration fails due to -ENOMEM, nr_failed should still be incremented for proper statistics. This was encountered recently when all page migration vmstats showed 0, and inferred that migrate_pages() was never called, although in reality the first page migration failed because compaction_alloc() failed to find a migration target. This patch increments nr_failed so the vmstat is properly accounted on ENOMEM. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Chen Feng | 6cd9dc3e75 |
mm/compaction.c: fix zoneindex in kcompactd()
While testing the kcompactd in my platform 3G MEM only DMA ZONE. I
found the kcompactd never wakeup. It seems the zoneindex has already
minus 1 before. So the traverse here should be <=.
It fixes a regression where kswapd could previously compact, but
kcompactd not. Not a crash fix though.
[akpm@linux-foundation.org: fix kcompactd_do_work() as well, per Hugh]
Link: http://lkml.kernel.org/r/1463659121-84124-1-git-send-email-puck.chen@hisilicon.com
Fixes:
|
|
David Rientjes | f050897778 |
mm, thp: khugepaged should scan when sleep value is written
If a large value is written to scan_sleep_millisecs, for example, that period must lapse before khugepaged will wake up for periodic collapsing. If this value is tuned to 1 day, for example, and then re-tuned to its default 10s, khugepaged will still wait for a day before scanning again. This patch causes khugepaged to wakeup immediately when the value is changed and then sleep until that value is rewritten or the new value lapses. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
NeilBrown | a53eaff8c1 |
MM: increase safety margin provided by PF_LESS_THROTTLE
When nfsd is exporting a filesystem over NFS which is then NFS-mounted on the local machine there is a risk of deadlock. This happens when there are lots of dirty pages in the NFS filesystem and they cause NFSD to be throttled, either in throttle_vm_writeout() or in balance_dirty_pages(). To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads and it provides a 25% increase to the limits that affect NFSD. Any process writing to an NFS filesystem will be throttled well before the number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD will not deadlock on pages that it needs to write out. At least it shouldn't. All processes are allowed a small excess margin to avoid performing too many calculations: ratelimit_pages. ratelimit_pages is set so that if a thread on every CPU uses the entire margin, the total will only go 3% over the limit, and this is much less than the 25% bonus that PF_LESS_THROTTLE provides, so this margin shouldn't be a problem. But it is. The "total memory" that these 3% and 25% are calculated against are not really total memory but are "global_dirtyable_memory()" which doesn't include anonymous memory, just free memory and page-cache memory. The "ratelimit_pages" number is based on whatever the global_dirtyable_memory was on the last CPU hot-plug, which might not be what you expect, but is probably close to the total freeable memory. The throttle threshold uses the global_dirtable_memory at the moment when the throttling happens, which could be much less than at the last CPU hotplug. So if lots of anonymous memory has been allocated, thus pushing out lots of page-cache pages, then NFSD might end up being throttled due to dirty NFS pages because the "25%" bonus it gets is calculated against a rather small amount of dirtyable memory, while the "3%" margin that other processes are allowed to dirty without penalty is calculated against a much larger number. To remove this possibility of deadlock we need to make sure that the margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin. Simply adding ratelimit_pages isn't enough as that should be multiplied by the number of cpus. So add "global_wb_domain.dirty_limit / 32" as that more accurately reflects the current total over-shoot margin. This ensures that the number of dirty NFS pages never gets so high that nfsd will be throttled waiting for them to be written. Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Naoya Horiguchi | e570f56ccc |
mm: check_new_page_bad() directly returns in __PG_HWPOISON case
Currently we check page->flags twice for "HWPoisoned" case of check_new_page_bad(), which can cause a race with unpoisoning. This race unnecessarily taints kernel with "BUG: Bad page state". check_new_page_bad() is the only caller of bad_page() which is interested in __PG_HWPOISON, so let's move the hwpoison related code in bad_page() to it. Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
seokhoon.yoon | 29b52de182 |
mm, kasan: fix to call kasan_free_pages() after poisoning page
When CONFIG_PAGE_POISONING and CONFIG_KASAN is enabled, free_pages_prepare()'s codeflow is below. 1)kmemcheck_free_shadow() 2)kasan_free_pages() - set shadow byte of page is freed 3)kernel_poison_pages() 3.1) check access to page is valid or not using kasan ---> error occur, kasan think it is invalid access 3.2) poison page 4)kernel_map_pages() So kasan_free_pages() should be called after poisoning the page. Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com Signed-off-by: seokhoon.yoon <iamyooon@gmail.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Minchan Kim | d0834a6c2c |
mm: disable fault around on emulated access bit architecture
fault_around aims to reduce minor faults of file-backed pages via speculative ahead pte mapping and relying on readahead logic. However, on non-HW access bit architecture the benefit is highly limited because they should emulate the young bit with minor faults for reclaim's page aging algorithm. IOW, we cannot reduce minor faults on those architectures. I did quick a test on my ARM machine. 512M file mmap sequential every word read on eSATA drive 4 times. stddev is stable. = fault_around 4096 = elapsed time(usec): 6747645 = fault_around 65536 = elapsed time(usec): 6709263 0.5% gain. Even when I tested it with eMMC there is no gain because I guess with slow storage the major fault is the dominant factor. Also, fault_around has the side effect of shrinking slab more aggressively and causes higher vmpressure, so if such speculation fails, it can evict slab more which can result in page I/O (e.g., inode cache). In the end, it would make void any benefit of fault_around. So let's make the default "disabled" on those architectures. Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | 5c0a85fad9 |
mm: make faultaround produce old ptes
Currently, faultaround code produces young pte. This can screw up vmscan behaviour[1], as it makes vmscan think that these pages are hot and not push them out on first round. During sparse file access faultaround gets more pages mapped and all of them are young. Under memory pressure, this makes vmscan swap out anon pages instead, or to drop other page cache pages which otherwise stay resident. Modify faultaround to produce old ptes, so they can easily be reclaimed under memory pressure. This can to some extend defeat the purpose of faultaround on machines without hardware accessed bit as it will not help us with reducing the number of minor page faults. We may want to disable faultaround on such machines altogether, but that's subject for separate patchset. Minchan: "I tested 512M mmap sequential word read test on non-HW access bit system (i.e., ARM) and confirmed it doesn't increase minor fault any more. old: 4096 fault_around minor fault: 131291 elapsed time: 6747645 usec new: 65536 fault_around minor fault: 131291 elapsed time: 6709263 usec 0.56% benefit" [1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Minchan Kim <minchan@kernel.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Stefan Bader | 4b50bcc7ed |
mm: use phys_addr_t for reserve_bootmem_region() arguments
Since commit |
|
Richard Leitner | cd33a76b0f |
mm/memblock.c: remove unnecessary always-true comparison
Comparing an u64 variable to >= 0 returns always true and can therefore be removed. This issue was detected using the -Wtype-limits gcc flag. This patch fixes following type-limits warning: mm/memblock.c: In function `__next_reserved_mem_region': mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits] if (*idx >= 0 && *idx < type->cnt) { Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net Signed-off-by: Richard Leitner <dev@g0hl1n.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vitaly Wool | 9a001fc19c |
z3fold: the 3-fold allocator for compressed pages
This patch introduces z3fold, a special purpose allocator for storing compressed pages. It is designed to store up to three compressed pages per physical page. It is a ZBUD derivative which allows for higher compression ratio keeping the simplicity and determinism of its predecessor. This patch comes as a follow-up to the discussions at the Embedded Linux Conference in San-Diego related to the talk [1]. The outcome of these discussions was that it would be good to have a compressed page allocator as stable and deterministic as zbud with with higher compression ratio. To keep the determinism and simplicity, z3fold, just like zbud, always stores an integral number of compressed pages per page, but it can store up to 3 pages unlike zbud which can store at most 2. Therefore the compression ratio goes to around 2.6x while zbud's one is around 1.7x. The patch is based on the latest linux.git tree. This version has been updated after testing on various simulators (e.g. ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on comments from Dan Streetman [3]. [1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou [2] https://lkml.org/lkml/2016/4/21/799 [3] https://lkml.org/lkml/2016/5/4/852 Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrea Arcangeli | d5ee7c3bcc |
mm: thp: split_huge_pmd_address() comment improvement
Comment is partly wrong, this improves it by including the case of split_huge_pmd_address() called by try_to_unmap_one if TTU_SPLIT_HUGE_PMD is set. Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Christoph Lameter | 7b8da4c7f0 |
vmstat: get rid of the ugly cpu_stat_off variable
The cpu_stat_off variable is unecessary since we can check if a workqueue request is pending otherwise. Removal of cpu_stat_off makes it pretty easy for the vmstat shepherd to ensure that the proper things happen. Removing the state also removes all races related to it. Should a workqueue not be scheduled as needed for vmstat_update then the shepherd will notice and schedule it as needed. Should a workqueue be unecessarily scheduled then the vmstat updater will disable it. [akpm@linux-foundation.org: fix indentation, per Michal] Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org Signed-off-by: Christoph Lameter <cl@linux.com> Cc: Tejun Heo <htejun@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Greg Thelen | 51038171b7 |
memcg: fix stale mem_cgroup_force_empty() comment
Commit
|
|
Minfei Huang | 2a138dc7e5 |
mm: use existing helper to convert "on"/"off" to boolean
It's more convenient to use existing function helper to convert string "on/off" to boolean. Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com Signed-off-by: Minfei Huang <mnghuan@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Ming Li | a4a921aa5c |
mm/swap.c: put activate_page_pvecs and other pagevecs together
Put the activate_page_pvecs definition next to those of the other pagevecs, for clarity. Signed-off-by: Ming Li <mingli199x@qq.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
David Rientjes | 297880f4af |
mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size
The page_counter rounds limits down to page size values. This makes sense, except in the case of hugetlb_cgroup where it's not possible to charge partial hugepages. If the hugetlb_cgroup margin is less than the hugepage size being charged, it will fail as expected. Round the hugetlb_cgroup limit down to hugepage size, since it is the effective limit of the cgroup. For consistency, round down PAGE_COUNTER_MAX as well when a hugetlb_cgroup is created: this prevents error reports when a user cannot restore the value to the kernel default. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nikolay Borisov <kernel@kyup.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Konstantin Khlebnikov | f4fcd55841 |
mm: enable RLIMIT_DATA by default with workaround for valgrind
Since commit
|
|
Yongji Xie | d5957d2fc2 |
mm: fix incorrect pfn passed to untrack_pfn() in remap_pfn_range()
We use generic hooks in remap_pfn_range() to help archs to track pfnmap regions. The code is something like: int remap_pfn_range() { ... track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size)); ... pfn -= addr >> PAGE_SHIFT; ... untrack_pfn(vma, pfn, PAGE_ALIGN(size)); ... } Here we can easily find the pfn is changed but not recovered before untrack_pfn() is called. That's incorrect. There are no known runtime effects - this is from inspection. Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Chris Wilson | 80c4bd7a5e |
mm/vmalloc: keep a separate lazy-free list
When mixing lots of vmallocs and set_memory_*() (which calls vm_unmap_aliases()) I encountered situations where the performance degraded severely due to the walking of the entire vmap_area list each invocation. One simple improvement is to add the lazily freed vmap_area to a separate lockless free list, such that we then avoid having to walk the full list on each purge. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Roman Pen <r.peniaev@gmail.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Roman Pen <r.peniaev@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Shawn Lin <shawn.lin@rock-chips.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexander Kuleshov | f705ac4b39 |
mm/memblock.c: move memblock_{add,reserve}_region into memblock_{add,reserve}
memblock_add_region() and memblock_reserve_region() do nothing specific before the call of memblock_add_range(), only print debug output. We can do the same in memblock_add() and memblock_reserve() since both memblock_add_region() and memblock_reserve_region() are not used by anybody outside of memblock.c and memblock_{add,reserve}() have the same set of flags and nids. Since memblock_add_region() and memblock_reserve_region() will be inlined, there will not be functional changes, but will improve code readability a little. Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com> Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Chen Yucong | 495367c051 |
mm/memory-failure.c: replace "MCE" with "Memory failure"
HWPoison was specific to some particular x86 platforms. And it is often
seen as high level machine check handler. And therefore, 'MCE' is used
for the format prefix of printk(). However, 'PowerNV' has also used
HWPoison for handling memory errors[1], so 'MCE' is no longer suitable
to memory_failure.c.
Additionally, 'MCE' and 'Memory failure' have different context. The
former belongs to exception context and the latter belongs to process
context. Furthermore, HWPoison can also be used for off-lining those
sub-health pages that do not trigger any machine check exception.
This patch aims to replace 'MCE' with a more appropriate prefix.
[1] commit
|
|
Yang Shi | 340a43bed6 |
mm: thp: simplify the implementation of mk_huge_pmd()
The implementation of mk_huge_pmd looks verbose, it could be just simplified to one line code. Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tetsuo Handa | f44666b046 |
mm,oom: speed up select_bad_process() loop
Since commit
|
|
Michal Hocko | ec8d7c14ea |
mm, oom_reaper: do not mmput synchronously from the oom reaper context
Tetsuo has properly noted that mmput slow path might get blocked waiting for another party (e.g. exit_aio waits for an IO). If that happens the oom_reaper would be put out of the way and will not be able to process next oom victim. We should strive for making this context as reliable and independent on other subsystems as much as possible. Introduce mmput_async which will perform the slow path from an async (WQ) context. This will delay the operation but that shouldn't be a problem because the oom_reaper has reclaimed the victim's address space for most cases as much as possible and the remaining context shouldn't bind too much memory anymore. The only exception is when mmap_sem trylock has failed which shouldn't happen too often. The issue is only theoretical but not impossible. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | bb8a4b7fd1 |
mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully
Commit
|
|
Michal Hocko | 31e49bfda1 |
mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION
Joonsoo has reported that he is able to trigger OOM for !costly high order requests (heavy fork() workload close the OOM) with the new oom detection rework. This is because we rely only on should_reclaim_retry when the compaction is disabled and it only checks watermarks for the requested order and so we might trigger OOM when there is a lot of free memory. It is not very clear what are the usual workloads when the compaction is disabled. Relying on high order allocations heavily without any mechanism to create those orders except for unbound amount of reclaim is certainly not a good idea. To prevent from potential regressions let's help this configuration some. We have to sacrifice the determinsm though because there simply is none here possible. should_compact_retry implementation for !CONFIG_COMPACTION, which was empty so far, will do watermark check for order-0 on all eligible zones. This will cause retrying until either the reclaim cannot make any further progress or all the zones are depleted even for order-0 pages. This means that the number of retries is basically unbounded for !costly orders but that was the case before the rework as well so this shouldn't regress. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.org Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 86a294a81f |
mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
"mm: consider compaction feedback also for costly allocation" has
removed the upper bound for the reclaim/compaction retries based on the
number of reclaimed pages for costly orders. While this is desirable
the patch did miss a mis interaction between reclaim, compaction and the
retry logic. The direct reclaim tries to get zones over min watermark
while compaction backs off and returns COMPACT_SKIPPED when all zones
are below low watermark + 1<<order gap. If we are getting really close
to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
high order request (e.g. hugetlb order-9) while the reclaim is not able
to release enough pages to get us over low watermark. The reclaim is
still able to make some progress (usually trashing over few remaining
pages) so we are not able to break out from the loop.
I have seen this happening with the same test described in "mm: consider
compaction feedback also for costly allocation" on a swapless system.
The original problem got resolved by "vmscan: consider classzone_idx in
compaction_ready" but it shows how things might go wrong when we
approach the oom event horizont.
The reason why compaction requires being over low rather than min
watermark is not clear to me. This check was there essentially since
|
|
Michal Hocko | 7854ea6c28 |
mm: consider compaction feedback also for costly allocation
PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
should_reclaim_retry currently where we decide to not retry after at
least order worth of pages were reclaimed or the watermark check for at
least one zone would succeed after reclaiming all pages if the reclaim
hasn't made any progress. Compaction feedback is mostly ignored and we
just try to make sure that the compaction did at least something before
giving up.
The first condition was added by
|
|
Michal Hocko | 33c2d21438 |
mm, oom: protect !costly allocations some more
should_reclaim_retry will give up retries for higher order allocations if none of the eligible zones has any requested or higher order pages available even if we pass the watermak check for order-0. This is done because there is no guarantee that the reclaimable and currently free pages will form the required order. This can, however, lead to situations where the high-order request (e.g. order-2 required for the stack allocation during fork) will trigger OOM too early - e.g. after the first reclaim/compaction round. Such a system would have to be highly fragmented and there is no guarantee further reclaim/compaction attempts would help but at least make sure that the compaction was active before we go OOM and keep retrying even if should_reclaim_retry tells us to oom if - the last compaction round backed off or - we haven't completed at least MAX_COMPACT_RETRIES active compaction rounds. The first rule ensures that the very last attempt for compaction was not ignored while the second guarantees that the compaction has done some work. Multiple retries might be needed to prevent occasional pigggy backing of other contexts to steal the compacted pages before the current context manages to retry to allocate them. compaction_failed() is taken as a final word from the compaction that the retry doesn't make much sense. We have to be careful though because the first compaction round is MIGRATE_ASYNC which is rather weak as it ignores pages under writeback and gives up too easily in other situations. We therefore have to make sure that MIGRATE_SYNC_LIGHT mode has been used before we give up. With this logic in place we do not have to increase the migration mode unconditionally and rather do it only if the compaction failed for the weaker mode. A nice side effect is that the stronger migration mode is used only when really needed so this has a potential of smaller latencies in some cases. Please note that the compaction doesn't tell us much about how successful it was when returning compaction_made_progress so we just have to blindly trust that another retry is worthwhile and cap the number to something reasonable to guarantee a convergence. If the given number of successful retries is not sufficient for a reasonable workloads we should focus on the collected compaction tracepoints data and try to address the issue in the compaction code. If this is not feasible we can increase the retries limit. [mhocko@suse.com: fix warning] Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | ede3771373 |
mm: throttle on IO only when there are too many dirty and writeback pages
wait_iff_congested has been used to throttle allocator before it retried another round of direct reclaim to allow the writeback to make some progress and prevent reclaim from looping over dirty/writeback pages without making any progress. We used to do congestion_wait before commit |
|
Michal Hocko | 0a0337e0d1 |
mm, oom: rework oom detection
__alloc_pages_slowpath has traditionally relied on the direct reclaim and did_some_progress as an indicator that it makes sense to retry allocation rather than declaring OOM. shrink_zones had to rely on zone_reclaimable if shrink_zone didn't make any progress to prevent from a premature OOM killer invocation - the LRU might be full of dirty or writeback pages and direct reclaim cannot clean those up. zone_reclaimable allows to rescan the reclaimable lists several times and restart if a page is freed. This is really subtle behavior and it might lead to a livelock when a single freed page keeps allocator looping but the current task will not be able to allocate that single page. OOM killer would be more appropriate than looping without any progress for unbounded amount of time. This patch changes OOM detection logic and pulls it out from shrink_zone which is too low to be appropriate for any high level decisions such as OOM which is per zonelist property. It is __alloc_pages_slowpath which knows how many attempts have been done and what was the progress so far therefore it is more appropriate to implement this logic. The new heuristic is implemented in should_reclaim_retry helper called from __alloc_pages_slowpath. It tries to be more deterministic and easier to follow. It builds on an assumption that retrying makes sense only if the currently reclaimable memory + free pages would allow the current allocation request to succeed (as per __zone_watermark_ok) at least for one zone in the usable zonelist. This alone wouldn't be sufficient, though, because the writeback might get stuck and reclaimable pages might be pinned for a really long time or even depend on the current allocation context. Therefore there is a backoff mechanism implemented which reduces the reclaim target after each reclaim round without any progress. This means that we should eventually converge to only NR_FREE_PAGES as the target and fail on the wmark check and proceed to OOM. The backoff is simple and linear with 1/16 of the reclaimable pages for each round without any progress. We are optimistic and reset counter for successful reclaim rounds. Costly high order pages mostly preserve their semantic and those without __GFP_REPEAT fail right away while those which have the flag set will back off after the amount of reclaimable pages reaches equivalent of the requested order. The only difference is that if there was no progress during the reclaim we rely on zone watermark check. This is more logical thing to do than previous 1<<order attempts which were a result of zone_reclaimable faking the progress. [vdavydov@virtuozzo.com: check classzone_idx for shrink_zone] [hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry] [rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES] [rientjes@google.com: shrink_zones doesn't need to return anything] Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | c5d01d0d18 |
mm, compaction: simplify __alloc_pages_direct_compact feedback interface
__alloc_pages_direct_compact communicates potential back off by two variables: - deferred_compaction tells that the compaction returned COMPACT_DEFERRED - contended_compaction is set when there is a contention on zone->lock resp. zone->lru_lock locks __alloc_pages_slowpath then backs of for THP allocation requests to prevent from long stalls. This is rather messy and it would be much cleaner to return a single compact result value and hide all the nasty details into __alloc_pages_direct_compact. This patch shouldn't introduce any functional changes. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | c8f7de0bfa |
mm, compaction: distinguish between full and partial COMPACT_COMPLETE
COMPACT_COMPLETE now means that compaction and free scanner met. This is not very useful information if somebody just wants to use this feedback and make any decisions based on that. The current caller might be a poor guy who just happened to scan tiny portion of the zone and that could be the reason no suitable pages were compacted. Make sure we distinguish the full and partial zone walks. Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success and be optimistic in retrying. The existing users of COMPACT_COMPLETE are conservatively changed to use COMPACT_PARTIAL_SKIPPED as well but some of them should be probably reconsidered and only defer the compaction only for COMPACT_COMPLETE with the new semantic. This patch shouldn't introduce any functional changes. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 1d4746d395 |
mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED
try_to_compact_pages() can currently return COMPACT_SKIPPED even when the compaction is defered for some zone just because zone DMA is skipped in 99% of cases due to watermark checks. This makes COMPACT_DEFERRED basically unusable for the page allocator as a feedback mechanism. Make sure we distinguish those two states properly and switch their ordering in the enum. This would mean that the COMPACT_SKIPPED will be returned only when all eligible zones are skipped. As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath will be more precise and we would bail out rather than reclaim. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | c46649deae |
mm, compaction: cover all compaction mode in compact_zone
The compiler is complaining after "mm, compaction: change COMPACT_ constants into enum" mm/compaction.c: In function `compact_zone': mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch] switch (ret) { ^ mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch] mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch] compaction_suitable is allowed to return only COMPACT_PARTIAL, COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply impossible. Put a VM_BUG_ON to catch an impossible return value. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | ea7ab982b6 |
mm, compaction: change COMPACT_ constants into enum
Compaction code is doing weird dances between COMPACT_FOO -> int -> unsigned long But there doesn't seem to be any reason for that. All functions which return/use one of those constants are not expecting any other value so it really makes sense to define an enum for them and make it clear that no other values are expected. This is a pure cleanup and shouldn't introduce any functional changes. Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | b6459cc154 |
vmscan: consider classzone_idx in compaction_ready
Motivation: As pointed out by Linus [2][3] relying on zone_reclaimable as a way to communicate the reclaim progress is rater dubious. I tend to agree, not only it is really obscure, it is not hard to imagine cases where a single page freed in the loop keeps all the reclaimers looping without getting any progress because their gfp_mask wouldn't allow to get that page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather rare so it doesn't happen in the practice but the current logic which we have is rather obscure and hard to follow a also non-deterministic. This is an attempt to make the OOM detection more deterministic and easier to follow because each reclaimer basically tracks its own progress which is implemented at the page allocator layer rather spread out between the allocator and the reclaim. The more on the implementation is described in the first patch. I have tested several different scenarios but it should be clear that testing OOM killer is quite hard to be representative. There is usually a tiny gap between almost OOM and full blown OOM which is often time sensitive. Anyway, I have tested the following 2 scenarios and I would appreciate if there are more to test. Testing environment: a virtual machine with 2G of RAM and 2CPUs without any swap to make the OOM more deterministic. 1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G file size, removes the files and starts over again) running in parallel for 10s to build up a lot of dirty pages when 100 parallel mem_eaters (anon private populated mmap which waits until it gets signal) with 80M each. This causes an OOM flood of course and I have compared both patched and unpatched kernels. The test is considered finished after there are no OOM conditions detected. This should tell us whether there are any excessive kills or some of them premature (e.g. due to dirty pages): I have performed two runs this time each after a fresh boot. * base kernel $ grep "Out of memory:" base-oom-run1.log | wc -l 78 $ grep "Out of memory:" base-oom-run2.log | wc -l 78 $ grep "Kill process" base-oom-run1.log | tail -n1 [ 91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child $ grep "Kill process" base-oom-run2.log | tail -n1 [ 82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child $ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61 $ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52 $ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l 1 $ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l 3 * patched kernel $ grep "Out of memory:" patched-oom-run1.log | wc -l 78 miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l 77 e grep "Kill process" patched-oom-run1.log | tail -n1 [ 497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child $ grep "Kill process" patched-oom-run2.log | tail -n1 [ 316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child $ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78 $ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77 e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l 2 $ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l 3 The patched kernel run noticeably longer while invoking OOM killer same number of times. This means that the original implementation is much more aggressive and triggers the OOM killer sooner. free pages stats show that neither kernels went OOM too early most of the time, though. I guess the difference is in the backoff when retries without any progress do sleep for a while if there is memory under writeback or dirty which is highly likely considering the parallel IO. Both kernels have seen races where zone wasn't marked unreclaimable and we still hit the OOM killer. This is most likely a race where a task managed to exit between the last allocation attempt and the oom killer invocation. 2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much memory as possible without triggering the OOM killer. This required a lot of tuning but I've considered 3 consecutive runs in three different boots without OOM as a success. * base kernel size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo) * patched kernel size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo) That means 40M more memory was usable without triggering OOM killer. The base kernel sometimes managed to handle the same as patched but it wasn't consistent and failed in at least on of the 3 runs. This seems like a minor improvement. I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented memory and under memory pressure. The results are in patch 11 where the logic is implemented. In short I can see huge improvement there. I am certainly interested in other usecases as well as well as any feedback. Especially those which require higher order requests. This patch (of 14): While playing with the oom detection rework [1] I have noticed that my heavy order-9 (hugetlb) load close to OOM ended up in an endless loop where the reclaim hasn't made any progress but did_some_progress didn't reflect that and compaction_suitable was backing off because no zone is above low wmark + 1 << order. It turned out that this is in fact an old standing bug in compaction_ready which ignores the requested_highidx and did the watermark check for 0 classzone_idx. This succeeds for zone DMA most of the time as the zone is mostly unused because of lowmem protection. As a result costly high order allocatios always report a successfull progress even when there was none. This wasn't a problem so far because these allocations usually fail quite early or retry only few times with __GFP_REPEAT but this will change after later patch in this series so make sure to not lie about the progress and propagate requested_highidx down to compaction_ready and use it for both the watermak check and compaction_suitable to fix this issue. [1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org [2] https://lkml.org/lkml/2015/10/12/808 [3] https://lkml.org/lkml/2015/10/13/597 Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Rik van Riel | 59dc76b0d4 |
mm: vmscan: reduce size of inactive file list
The inactive file list should still be large enough to contain readahead windows and freshly written file data, but it no longer is the only source for detecting multiple accesses to file pages. The workingset refault measurement code causes recently evicted file pages that get accessed again after a shorter interval to be promoted directly to the active list. With that mechanism in place, we can afford to (on a larger system) dedicate more memory to the active file list, so we can actually cache more of the frequently used file pages in memory, and not have them pushed out by streaming writes, once-used streaming file reads, etc. This can help things like database workloads, where only half the page cache can currently be used to cache the database working set. This patch automatically increases that fraction on larger systems, using the same ratio that has already been used for anonymous memory. [hannes@cmpxchg.org: cgroup-awareness] Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Johannes Weiner | bbddabe2e4 |
mm: filemap: only do access activations on reads
Andres observed that his database workload is struggling with the transaction journal creating pressure on frequently read pages. Access patterns like transaction journals frequently write the same pages over and over, but in the majority of cases those pages are never read back. There are no caching benefits to be had for those pages, so activating them and having them put pressure on pages that do benefit from caching is a bad choice. Leave page activations to read accesses and don't promote pages based on writes alone. It could be said that partially written pages do contain cache-worthy data, because even if *userspace* does not access the unwritten part, the kernel still has to read it from the filesystem for correctness. However, a counter argument is that these pages enjoy at least *some* protection over other inactive file pages through the writeback cache, in the sense that dirty pages are written back with a delay and cache reclaim leaves them alone until they have been written back to disk. Should that turn out to be insufficient and we see increased read IO from partial writes under memory pressure, we can always go back and update grab_cache_page_write_begin() to take (pos, len) so that it can tell partial writes from pages that don't need partial reads. But for now, keep it simple. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Andres Freund <andres@anarazel.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Rik van Riel | f0281a00fe |
mm: workingset: only do workingset activations on reads
This is a follow-up to http://www.spinics.net/lists/linux-mm/msg101739.html where Andres reported his database workingset being pushed out by the minimum size enforcement of the inactive file list - currently 50% of cache - as well as repeatedly written file pages that are never actually read. Two changes fell out of the discussions. The first change observes that pages that are only ever written don't benefit from caching beyond what the writeback cache does for partial page writes, and so we shouldn't promote them to the active file list where they compete with pages whose cached data is actually accessed repeatedly. This change comes in two patches - one for in-cache write accesses and one for refaults triggered by writes, neither of which should promote a cache page. Second, with the refault detection we don't need to set 50% of the cache aside for used-once cache anymore since we can detect frequently used pages even when they are evicted between accesses. We can allow the active list to be bigger and thus protect a bigger workingset that isn't challenged by streamers. Depending on the access patterns, this can increase major faults during workingset transitions for better performance during stable phases. This patch (of 3): When rewriting a page, the data in that page is replaced with new data. This means that evicting something else from the active file list, in order to cache data that will be replaced by something else, is likely to be a waste of memory. It is better to save the active list for frequently read pages, because reads actually use the data that is in the page. This patch ignores partial writes, because it is unclear whether the complexity of identifying those is worth any potential performance gain obtained from better caching pages that see repeated partial writes at large enough intervals to not get caught by the use-twice promotion code used for the inactive file list. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 4741526b83 |
mm, page_alloc: restore the original nodemask if the fast path allocation failed
The page allocator fast path uses either the requested nodemask or cpuset_current_mems_allowed if cpusets are enabled. If the allocation context allows watermarks to be ignored then it can also ignore memory policies. However, on entering the allocator slowpath the nodemask may still be cpuset_current_mems_allowed and the policies are enforced. This patch resets the nodemask appropriately before entering the slowpath. Link: http://lkml.kernel.org/r/20160504143628.GU2858@techsingularity.net Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | 4e6118016e |
mm, page_alloc: uninline the bad page part of check_new_page()
Bad pages should be rare so the code handling them doesn't need to be inline for performance reasons. Put it to separate function which returns void. This also assumes that the initial page_expected_state() result will match the result of the thorough check, i.e. the page doesn't become "good" in the meanwhile. This matches the same expectations already in place in free_pages_check(). !DEBUG_VM bloat-o-meter: add/remove: 1/0 grow/shrink: 0/1 up/down: 134/-274 (-140) function old new delta check_new_page_bad - 134 +134 get_page_from_freelist 3468 3194 -274 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | e2769dbdc5 |
mm, page_alloc: don't duplicate code in free_pcp_prepare
The new free_pcp_prepare() function shares a lot of code with free_pages_prepare(), which makes this a maintenance risk when some future patch modifies only one of them. We should be able to achieve the same effect (skipping free_pages_check() from !DEBUG_VM configs) by adding a parameter to free_pages_prepare() and making it inline, so the checks (and the order != 0 parts) are eliminated from the call from free_pcp_prepare(). !DEBUG_VM: bloat-o-meter reports no difference, as my gcc was already inlining free_pages_prepare() and the elimination seems to work as expected DEBUG_VM bloat-o-meter: add/remove: 0/1 grow/shrink: 2/0 up/down: 1035/-778 (257) function old new delta __free_pages_ok 297 1060 +763 free_hot_cold_page 480 752 +272 free_pages_prepare 778 - -778 Here inlining didn't occur before, and added some code, but it's ok for a debug option. [akpm@linux-foundation.org: fix build] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 479f854a20 |
mm, page_alloc: defer debugging checks of pages allocated from the PCP
Every page allocated checks a number of page fields for validity. This catches corruption bugs of pages that are already freed but it is expensive. This patch weakens the debugging check by checking PCP pages only when the PCP lists are being refilled. All compound pages are checked. This potentially avoids debugging checks entirely if the PCP lists are never emptied and refilled so some corruption issues may be missed. Full checking requires DEBUG_VM. With the two deferred debugging patches applied, the impact to a page allocator microbenchmark is 4.6.0-rc3 4.6.0-rc3 inline-v3r6 deferalloc-v3r7 Min alloc-odr0-1 344.00 ( 0.00%) 317.00 ( 7.85%) Min alloc-odr0-2 248.00 ( 0.00%) 231.00 ( 6.85%) Min alloc-odr0-4 209.00 ( 0.00%) 192.00 ( 8.13%) Min alloc-odr0-8 181.00 ( 0.00%) 166.00 ( 8.29%) Min alloc-odr0-16 168.00 ( 0.00%) 154.00 ( 8.33%) Min alloc-odr0-32 161.00 ( 0.00%) 148.00 ( 8.07%) Min alloc-odr0-64 158.00 ( 0.00%) 145.00 ( 8.23%) Min alloc-odr0-128 156.00 ( 0.00%) 143.00 ( 8.33%) Min alloc-odr0-256 168.00 ( 0.00%) 154.00 ( 8.33%) Min alloc-odr0-512 178.00 ( 0.00%) 167.00 ( 6.18%) Min alloc-odr0-1024 186.00 ( 0.00%) 174.00 ( 6.45%) Min alloc-odr0-2048 192.00 ( 0.00%) 180.00 ( 6.25%) Min alloc-odr0-4096 198.00 ( 0.00%) 184.00 ( 7.07%) Min alloc-odr0-8192 200.00 ( 0.00%) 188.00 ( 6.00%) Min alloc-odr0-16384 201.00 ( 0.00%) 188.00 ( 6.47%) Min free-odr0-1 189.00 ( 0.00%) 180.00 ( 4.76%) Min free-odr0-2 132.00 ( 0.00%) 126.00 ( 4.55%) Min free-odr0-4 104.00 ( 0.00%) 99.00 ( 4.81%) Min free-odr0-8 90.00 ( 0.00%) 85.00 ( 5.56%) Min free-odr0-16 84.00 ( 0.00%) 80.00 ( 4.76%) Min free-odr0-32 80.00 ( 0.00%) 76.00 ( 5.00%) Min free-odr0-64 78.00 ( 0.00%) 74.00 ( 5.13%) Min free-odr0-128 77.00 ( 0.00%) 73.00 ( 5.19%) Min free-odr0-256 94.00 ( 0.00%) 91.00 ( 3.19%) Min free-odr0-512 108.00 ( 0.00%) 112.00 ( -3.70%) Min free-odr0-1024 115.00 ( 0.00%) 118.00 ( -2.61%) Min free-odr0-2048 120.00 ( 0.00%) 125.00 ( -4.17%) Min free-odr0-4096 123.00 ( 0.00%) 129.00 ( -4.88%) Min free-odr0-8192 126.00 ( 0.00%) 130.00 ( -3.17%) Min free-odr0-16384 126.00 ( 0.00%) 131.00 ( -3.97%) Note that the free paths for large numbers of pages is impacted as the debugging cost gets shifted into that path when the page data is no longer necessarily cache-hot. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 4db7548ccb |
mm, page_alloc: defer debugging checks of freed pages until a PCP drain
Every page free checks a number of page fields for validity. This catches premature frees and corruptions but it is also expensive. This patch weakens the debugging check by checking PCP pages at the time they are drained from the PCP list. This will trigger the bug but the site that freed the corrupt page will be lost. To get the full context, a kernel rebuild with DEBUG_VM is necessary. [akpm@linux-foundation.org: fix build] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | 002f290627 |
cpuset: use static key better and convert to new API
An important function for cpusets is cpuset_node_allowed(), which optimizes on the fact if there's a single root CPU set, it must be trivially allowed. But the check "nr_cpusets() <= 1" doesn't use the cpusets_enabled_key static key the right way where static keys eliminate branching overhead with jump labels. This patch converts it so that static key is used properly. It's also switched to the new static key API and the checking functions are converted to return bool instead of int. We also provide a new variant __cpuset_zone_allowed() which expects that the static key check was already done and they key was enabled. This is needed for get_page_from_freelist() where we want to also avoid the relatively slower check when ALLOC_CPUSET is not set in alloc_flags. The impact on the page allocator microbenchmark is less than expected but the cleanup in itself is worthwhile. 4.6.0-rc2 4.6.0-rc2 multcheck-v1r20 cpuset-v1r20 Min alloc-odr0-1 348.00 ( 0.00%) 348.00 ( 0.00%) Min alloc-odr0-2 254.00 ( 0.00%) 254.00 ( 0.00%) Min alloc-odr0-4 213.00 ( 0.00%) 213.00 ( 0.00%) Min alloc-odr0-8 186.00 ( 0.00%) 183.00 ( 1.61%) Min alloc-odr0-16 173.00 ( 0.00%) 171.00 ( 1.16%) Min alloc-odr0-32 166.00 ( 0.00%) 163.00 ( 1.81%) Min alloc-odr0-64 162.00 ( 0.00%) 159.00 ( 1.85%) Min alloc-odr0-128 160.00 ( 0.00%) 157.00 ( 1.88%) Min alloc-odr0-256 169.00 ( 0.00%) 166.00 ( 1.78%) Min alloc-odr0-512 180.00 ( 0.00%) 180.00 ( 0.00%) Min alloc-odr0-1024 188.00 ( 0.00%) 187.00 ( 0.53%) Min alloc-odr0-2048 194.00 ( 0.00%) 193.00 ( 0.52%) Min alloc-odr0-4096 199.00 ( 0.00%) 198.00 ( 0.50%) Min alloc-odr0-8192 202.00 ( 0.00%) 201.00 ( 0.50%) Min alloc-odr0-16384 203.00 ( 0.00%) 202.00 ( 0.49%) Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Zefan Li <lizefan@huawei.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 0b423ca22f |
mm, page_alloc: inline pageblock lookup in page free fast paths
The function call overhead of get_pfnblock_flags_mask() is measurable in the page free paths. This patch uses an inlined version that is faster. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | e5b31ac2ca |
mm, page_alloc: remove unnecessary variable from free_pcppages_bulk
The original count is never reused so it can be removed. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | da838d4fcb |
mm, page_alloc: pull out side effects from free_pages_check
Check without side-effects should be easier to maintain. It also removes the duplicated cpupid and flags reset done in !DEBUG_VM variant of both free_pcp_prepare() and then bulkfree_pcp_prepare(). Finally, it enables the next patch. It shouldn't result in new branches, thanks to inlining of the check. !DEBUG_VM bloat-o-meter: add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-27 (-27) function old new delta __free_pages_ok 748 739 -9 free_pcppages_bulk 1403 1385 -18 DEBUG_VM: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-28 (-28) function old new delta free_pages_prepare 806 778 -28 This is also slightly faster because cpupid information is not set on tail pages so we can avoid resets there. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | bb552ac6c6 |
mm, page_alloc: un-inline the bad part of free_pages_check
From: Vlastimil Babka <vbabka@suse.cz> !DEBUG_VM size and bloat-o-meter: add/remove: 1/0 grow/shrink: 0/2 up/down: 124/-370 (-246) function old new delta free_pages_check_bad - 124 +124 free_pcppages_bulk 1288 1171 -117 __free_pages_ok 948 695 -253 DEBUG_VM: add/remove: 1/0 grow/shrink: 0/1 up/down: 124/-214 (-90) function old new delta free_pages_check_bad - 124 +124 free_pages_prepare 1112 898 -214 [akpm@linux-foundation.org: fix whitespace] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 7bfec6f47b |
mm, page_alloc: check multiple page fields with a single branch
Every page allocated or freed is checked for sanity to avoid corruptions that are difficult to detect later. A bad page could be due to a number of fields. Instead of using multiple branches, this patch combines multiple fields into a single branch. A detailed check is only necessary if that check fails. 4.6.0-rc2 4.6.0-rc2 initonce-v1r20 multcheck-v1r20 Min alloc-odr0-1 359.00 ( 0.00%) 348.00 ( 3.06%) Min alloc-odr0-2 260.00 ( 0.00%) 254.00 ( 2.31%) Min alloc-odr0-4 214.00 ( 0.00%) 213.00 ( 0.47%) Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%) Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%) Min alloc-odr0-32 165.00 ( 0.00%) 166.00 ( -0.61%) Min alloc-odr0-64 162.00 ( 0.00%) 162.00 ( 0.00%) Min alloc-odr0-128 161.00 ( 0.00%) 160.00 ( 0.62%) Min alloc-odr0-256 170.00 ( 0.00%) 169.00 ( 0.59%) Min alloc-odr0-512 181.00 ( 0.00%) 180.00 ( 0.55%) Min alloc-odr0-1024 190.00 ( 0.00%) 188.00 ( 1.05%) Min alloc-odr0-2048 196.00 ( 0.00%) 194.00 ( 1.02%) Min alloc-odr0-4096 202.00 ( 0.00%) 199.00 ( 1.49%) Min alloc-odr0-8192 205.00 ( 0.00%) 202.00 ( 1.46%) Min alloc-odr0-16384 205.00 ( 0.00%) 203.00 ( 0.98%) Again, the benefit is marginal but avoiding excessive branches is important. Ideally the paths would not have to check these conditions at all but regrettably abandoning the tests would make use-after-free bugs much harder to detect. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 93ea9964d1 |
mm, page_alloc: remove field from alloc_context
The classzone_idx can be inferred from preferred_zoneref so remove the unnecessary field and save stack space. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | c33d6c06f6 |
mm, page_alloc: avoid looking up the first zone in a zonelist twice
The allocator fast path looks up the first usable zone in a zonelist and then get_page_from_freelist does the same job in the zonelist iterator. This patch preserves the necessary information. 4.6.0-rc2 4.6.0-rc2 fastmark-v1r20 initonce-v1r20 Min alloc-odr0-1 364.00 ( 0.00%) 359.00 ( 1.37%) Min alloc-odr0-2 262.00 ( 0.00%) 260.00 ( 0.76%) Min alloc-odr0-4 214.00 ( 0.00%) 214.00 ( 0.00%) Min alloc-odr0-8 186.00 ( 0.00%) 186.00 ( 0.00%) Min alloc-odr0-16 173.00 ( 0.00%) 173.00 ( 0.00%) Min alloc-odr0-32 165.00 ( 0.00%) 165.00 ( 0.00%) Min alloc-odr0-64 161.00 ( 0.00%) 162.00 ( -0.62%) Min alloc-odr0-128 159.00 ( 0.00%) 161.00 ( -1.26%) Min alloc-odr0-256 168.00 ( 0.00%) 170.00 ( -1.19%) Min alloc-odr0-512 180.00 ( 0.00%) 181.00 ( -0.56%) Min alloc-odr0-1024 190.00 ( 0.00%) 190.00 ( 0.00%) Min alloc-odr0-2048 196.00 ( 0.00%) 196.00 ( 0.00%) Min alloc-odr0-4096 202.00 ( 0.00%) 202.00 ( 0.00%) Min alloc-odr0-8192 206.00 ( 0.00%) 205.00 ( 0.49%) Min alloc-odr0-16384 206.00 ( 0.00%) 205.00 ( 0.49%) The benefit is negligible and the results are within the noise but each cycle counts. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 48ee5f3696 |
mm, page_alloc: shortcut watermark checks for order-0 pages
Watermarks have to be checked on every allocation including the number of pages being allocated and whether reserves can be accessed. The reserves only matter if memory is limited and the free_pages adjustment only applies to high-order pages. This patch adds a shortcut for order-0 pages that avoids numerous calculations if there is plenty of free memory yielding the following performance difference in a page allocator microbenchmark; 4.6.0-rc2 4.6.0-rc2 optfair-v1r20 fastmark-v1r20 Min alloc-odr0-1 380.00 ( 0.00%) 364.00 ( 4.21%) Min alloc-odr0-2 273.00 ( 0.00%) 262.00 ( 4.03%) Min alloc-odr0-4 227.00 ( 0.00%) 214.00 ( 5.73%) Min alloc-odr0-8 196.00 ( 0.00%) 186.00 ( 5.10%) Min alloc-odr0-16 183.00 ( 0.00%) 173.00 ( 5.46%) Min alloc-odr0-32 173.00 ( 0.00%) 165.00 ( 4.62%) Min alloc-odr0-64 169.00 ( 0.00%) 161.00 ( 4.73%) Min alloc-odr0-128 169.00 ( 0.00%) 159.00 ( 5.92%) Min alloc-odr0-256 180.00 ( 0.00%) 168.00 ( 6.67%) Min alloc-odr0-512 190.00 ( 0.00%) 180.00 ( 5.26%) Min alloc-odr0-1024 198.00 ( 0.00%) 190.00 ( 4.04%) Min alloc-odr0-2048 204.00 ( 0.00%) 196.00 ( 3.92%) Min alloc-odr0-4096 209.00 ( 0.00%) 202.00 ( 3.35%) Min alloc-odr0-8192 213.00 ( 0.00%) 206.00 ( 3.29%) Min alloc-odr0-16384 214.00 ( 0.00%) 206.00 ( 3.74%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 305347550b |
mm, page_alloc: reduce cost of fair zone allocation policy retry
The fair zone allocation policy is not without cost but it can be reduced slightly. This patch removes an unnecessary local variable, checks the likely conditions of the fair zone policy first, uses a bool instead of a flags check and falls through when a remote node is encountered instead of doing a full restart. The benefit is marginal but it's there 4.6.0-rc2 4.6.0-rc2 decstat-v1r20 optfair-v1r20 Min alloc-odr0-1 377.00 ( 0.00%) 380.00 ( -0.80%) Min alloc-odr0-2 273.00 ( 0.00%) 273.00 ( 0.00%) Min alloc-odr0-4 226.00 ( 0.00%) 227.00 ( -0.44%) Min alloc-odr0-8 196.00 ( 0.00%) 196.00 ( 0.00%) Min alloc-odr0-16 183.00 ( 0.00%) 183.00 ( 0.00%) Min alloc-odr0-32 175.00 ( 0.00%) 173.00 ( 1.14%) Min alloc-odr0-64 172.00 ( 0.00%) 169.00 ( 1.74%) Min alloc-odr0-128 170.00 ( 0.00%) 169.00 ( 0.59%) Min alloc-odr0-256 183.00 ( 0.00%) 180.00 ( 1.64%) Min alloc-odr0-512 191.00 ( 0.00%) 190.00 ( 0.52%) Min alloc-odr0-1024 199.00 ( 0.00%) 198.00 ( 0.50%) Min alloc-odr0-2048 204.00 ( 0.00%) 204.00 ( 0.00%) Min alloc-odr0-4096 210.00 ( 0.00%) 209.00 ( 0.48%) Min alloc-odr0-8192 213.00 ( 0.00%) 213.00 ( 0.00%) Min alloc-odr0-16384 214.00 ( 0.00%) 214.00 ( 0.00%) The benefit is marginal at best but one of the most important benefits, avoiding a second search when falling back to another node is not triggered by this particular test so the benefit for some corner cases is understated. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 4fcb097117 |
mm, page_alloc: shorten the page allocator fast path
The page allocator fast path checks page multiple times unnecessarily. This patch avoids all the slowpath checks if the first allocation attempt succeeds. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 3777999dd4 |
mm, page_alloc: check once if a zone has isolated pageblocks
When bulk freeing pages from the per-cpu lists the zone is checked for isolated pageblocks on every release. This patch checks it once per drain. [mgorman@techsingularity.net: fix locking radce, per Vlastimil] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 83d4ca8148 |
mm, page_alloc: move __GFP_HARDWALL modifications out of the fastpath
__GFP_HARDWALL only has meaning in the context of cpusets but the fast path always applies the flag on the first attempt. Move the manipulations into the cpuset paths where they will be masked by a static branch in the common case. With the other micro-optimisations in this series combined, the impact on a page allocator microbenchmark is 4.6.0-rc2 4.6.0-rc2 decstat-v1r20 micro-v1r20 Min alloc-odr0-1 381.00 ( 0.00%) 377.00 ( 1.05%) Min alloc-odr0-2 275.00 ( 0.00%) 273.00 ( 0.73%) Min alloc-odr0-4 229.00 ( 0.00%) 226.00 ( 1.31%) Min alloc-odr0-8 199.00 ( 0.00%) 196.00 ( 1.51%) Min alloc-odr0-16 186.00 ( 0.00%) 183.00 ( 1.61%) Min alloc-odr0-32 179.00 ( 0.00%) 175.00 ( 2.23%) Min alloc-odr0-64 174.00 ( 0.00%) 172.00 ( 1.15%) Min alloc-odr0-128 172.00 ( 0.00%) 170.00 ( 1.16%) Min alloc-odr0-256 181.00 ( 0.00%) 183.00 ( -1.10%) Min alloc-odr0-512 193.00 ( 0.00%) 191.00 ( 1.04%) Min alloc-odr0-1024 201.00 ( 0.00%) 199.00 ( 1.00%) Min alloc-odr0-2048 206.00 ( 0.00%) 204.00 ( 0.97%) Min alloc-odr0-4096 212.00 ( 0.00%) 210.00 ( 0.94%) Min alloc-odr0-8192 215.00 ( 0.00%) 213.00 ( 0.93%) Min alloc-odr0-16384 216.00 ( 0.00%) 214.00 ( 0.93%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 5bb1b16975 |
mm, page_alloc: remove unnecessary initialisation from __alloc_pages_nodemask()
page is guaranteed to be set before it is read with or without the initialisation. [akpm@linux-foundation.org: fix warning] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | be06af002f |
mm, page_alloc: remove unnecessary initialisation in get_page_from_freelist
Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 4dfa6cd8fd |
mm, page_alloc: remove unnecessary local variable in get_page_from_freelist
zonelist here is a copy of a struct field that is used once. Ditch it. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | fa379b9586 |
mm, page_alloc: convert nr_fair_skipped to bool
The number of zones skipped to a zone expiring its fair zone allocation quota is irrelevant. Convert to bool. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | c603844bdc |
mm, page_alloc: convert alloc_flags to unsigned
alloc_flags is a bitmask of flags but it is signed which does not necessarily generate the best code depending on the compiler. Even without an impact, it makes more sense that this be unsigned. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | f75fb889d1 |
mm, page_alloc: avoid unnecessary zone lookups during pageblock operations
Pageblocks have an associated bitmap to store migrate types and whether the pageblock should be skipped during compaction. The bitmap may be associated with a memory section or a zone but the zone is looked up unconditionally. The compiler should optimise this away automatically so this is a cosmetic patch only in many cases. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 754078eb45 |
mm, page_alloc: use __dec_zone_state for order-0 page allocation
__dec_zone_state is cheaper to use for removing an order-0 page as it has fewer conditions to check. The performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 optiter-v1r20 decstat-v1r20 Min alloc-odr0-1 382.00 ( 0.00%) 381.00 ( 0.26%) Min alloc-odr0-2 282.00 ( 0.00%) 275.00 ( 2.48%) Min alloc-odr0-4 233.00 ( 0.00%) 229.00 ( 1.72%) Min alloc-odr0-8 203.00 ( 0.00%) 199.00 ( 1.97%) Min alloc-odr0-16 188.00 ( 0.00%) 186.00 ( 1.06%) Min alloc-odr0-32 182.00 ( 0.00%) 179.00 ( 1.65%) Min alloc-odr0-64 177.00 ( 0.00%) 174.00 ( 1.69%) Min alloc-odr0-128 175.00 ( 0.00%) 172.00 ( 1.71%) Min alloc-odr0-256 184.00 ( 0.00%) 181.00 ( 1.63%) Min alloc-odr0-512 197.00 ( 0.00%) 193.00 ( 2.03%) Min alloc-odr0-1024 203.00 ( 0.00%) 201.00 ( 0.99%) Min alloc-odr0-2048 209.00 ( 0.00%) 206.00 ( 1.44%) Min alloc-odr0-4096 214.00 ( 0.00%) 212.00 ( 0.93%) Min alloc-odr0-8192 218.00 ( 0.00%) 215.00 ( 1.38%) Min alloc-odr0-16384 219.00 ( 0.00%) 216.00 ( 1.37%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 682a3385e7 |
mm, page_alloc: inline the fast path of the zonelist iterator
The page allocator iterates through a zonelist for zones that match the addressing limitations and nodemask of the caller but many allocations will not be restricted. Despite this, there is always functional call overhead which builds up. This patch inlines the optimistic basic case and only calls the iterator function for the complex case. A hindrance was the fact that cpuset_current_mems_allowed is used in the fastpath as the allowed nodemask even though all nodes are allowed on most systems. The patch handles this by only considering cpuset_current_mems_allowed if a cpuset exists. As well as being faster in the fast-path, this removes some junk in the slowpath. The performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 statinline-v1r20 optiter-v1r20 Min alloc-odr0-1 412.00 ( 0.00%) 382.00 ( 7.28%) Min alloc-odr0-2 301.00 ( 0.00%) 282.00 ( 6.31%) Min alloc-odr0-4 247.00 ( 0.00%) 233.00 ( 5.67%) Min alloc-odr0-8 215.00 ( 0.00%) 203.00 ( 5.58%) Min alloc-odr0-16 199.00 ( 0.00%) 188.00 ( 5.53%) Min alloc-odr0-32 191.00 ( 0.00%) 182.00 ( 4.71%) Min alloc-odr0-64 187.00 ( 0.00%) 177.00 ( 5.35%) Min alloc-odr0-128 185.00 ( 0.00%) 175.00 ( 5.41%) Min alloc-odr0-256 193.00 ( 0.00%) 184.00 ( 4.66%) Min alloc-odr0-512 207.00 ( 0.00%) 197.00 ( 4.83%) Min alloc-odr0-1024 213.00 ( 0.00%) 203.00 ( 4.69%) Min alloc-odr0-2048 220.00 ( 0.00%) 209.00 ( 5.00%) Min alloc-odr0-4096 226.00 ( 0.00%) 214.00 ( 5.31%) Min alloc-odr0-8192 229.00 ( 0.00%) 218.00 ( 4.80%) Min alloc-odr0-16384 229.00 ( 0.00%) 219.00 ( 4.37%) perf indicated that next_zones_zonelist disappeared in the profile and __next_zones_zonelist did not appear. This is expected as the micro-benchmark would hit the inlined fast-path every time. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 060e74173f |
mm, page_alloc: inline zone_statistics
zone_statistics has one call-site but it's a public function. Make it static and inline. The performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 statbranch-v1r20 statinline-v1r20 Min alloc-odr0-1 419.00 ( 0.00%) 412.00 ( 1.67%) Min alloc-odr0-2 305.00 ( 0.00%) 301.00 ( 1.31%) Min alloc-odr0-4 250.00 ( 0.00%) 247.00 ( 1.20%) Min alloc-odr0-8 219.00 ( 0.00%) 215.00 ( 1.83%) Min alloc-odr0-16 203.00 ( 0.00%) 199.00 ( 1.97%) Min alloc-odr0-32 195.00 ( 0.00%) 191.00 ( 2.05%) Min alloc-odr0-64 191.00 ( 0.00%) 187.00 ( 2.09%) Min alloc-odr0-128 189.00 ( 0.00%) 185.00 ( 2.12%) Min alloc-odr0-256 198.00 ( 0.00%) 193.00 ( 2.53%) Min alloc-odr0-512 210.00 ( 0.00%) 207.00 ( 1.43%) Min alloc-odr0-1024 216.00 ( 0.00%) 213.00 ( 1.39%) Min alloc-odr0-2048 221.00 ( 0.00%) 220.00 ( 0.45%) Min alloc-odr0-4096 227.00 ( 0.00%) 226.00 ( 0.44%) Min alloc-odr0-8192 232.00 ( 0.00%) 229.00 ( 1.29%) Min alloc-odr0-16384 232.00 ( 0.00%) 229.00 ( 1.29%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | b9f00e147f |
mm, page_alloc: reduce branches in zone_statistics
zone_statistics has more branches than it really needs to take an unlikely GFP flag into account. Reduce the number and annotate the unlikely flag. The performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 nocompound-v1r10 statbranch-v1r10 Min alloc-odr0-1 417.00 ( 0.00%) 419.00 ( -0.48%) Min alloc-odr0-2 308.00 ( 0.00%) 305.00 ( 0.97%) Min alloc-odr0-4 253.00 ( 0.00%) 250.00 ( 1.19%) Min alloc-odr0-8 221.00 ( 0.00%) 219.00 ( 0.90%) Min alloc-odr0-16 205.00 ( 0.00%) 203.00 ( 0.98%) Min alloc-odr0-32 199.00 ( 0.00%) 195.00 ( 2.01%) Min alloc-odr0-64 193.00 ( 0.00%) 191.00 ( 1.04%) Min alloc-odr0-128 191.00 ( 0.00%) 189.00 ( 1.05%) Min alloc-odr0-256 200.00 ( 0.00%) 198.00 ( 1.00%) Min alloc-odr0-512 212.00 ( 0.00%) 210.00 ( 0.94%) Min alloc-odr0-1024 219.00 ( 0.00%) 216.00 ( 1.37%) Min alloc-odr0-2048 225.00 ( 0.00%) 221.00 ( 1.78%) Min alloc-odr0-4096 231.00 ( 0.00%) 227.00 ( 1.73%) Min alloc-odr0-8192 234.00 ( 0.00%) 232.00 ( 0.85%) Min alloc-odr0-16384 234.00 ( 0.00%) 232.00 ( 0.85%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 175145748d |
mm, page_alloc: use new PageAnonHead helper in the free page fast path
The PageAnon check always checks for compound_head but this is a relatively expensive check if the caller already knows the page is a head page. This patch creates a helper and uses it in the page free path which only operates on head pages. With this patch and "Only check PageCompound for high-order pages", the performance difference on a page allocator microbenchmark is; 4.6.0-rc2 4.6.0-rc2 vanilla nocompound-v1r20 Min alloc-odr0-1 425.00 ( 0.00%) 417.00 ( 1.88%) Min alloc-odr0-2 313.00 ( 0.00%) 308.00 ( 1.60%) Min alloc-odr0-4 257.00 ( 0.00%) 253.00 ( 1.56%) Min alloc-odr0-8 224.00 ( 0.00%) 221.00 ( 1.34%) Min alloc-odr0-16 208.00 ( 0.00%) 205.00 ( 1.44%) Min alloc-odr0-32 199.00 ( 0.00%) 199.00 ( 0.00%) Min alloc-odr0-64 195.00 ( 0.00%) 193.00 ( 1.03%) Min alloc-odr0-128 192.00 ( 0.00%) 191.00 ( 0.52%) Min alloc-odr0-256 204.00 ( 0.00%) 200.00 ( 1.96%) Min alloc-odr0-512 213.00 ( 0.00%) 212.00 ( 0.47%) Min alloc-odr0-1024 219.00 ( 0.00%) 219.00 ( 0.00%) Min alloc-odr0-2048 225.00 ( 0.00%) 225.00 ( 0.00%) Min alloc-odr0-4096 230.00 ( 0.00%) 231.00 ( -0.43%) Min alloc-odr0-8192 235.00 ( 0.00%) 234.00 ( 0.43%) Min alloc-odr0-16384 235.00 ( 0.00%) 234.00 ( 0.43%) Min free-odr0-1 215.00 ( 0.00%) 191.00 ( 11.16%) Min free-odr0-2 152.00 ( 0.00%) 136.00 ( 10.53%) Min free-odr0-4 119.00 ( 0.00%) 107.00 ( 10.08%) Min free-odr0-8 106.00 ( 0.00%) 96.00 ( 9.43%) Min free-odr0-16 97.00 ( 0.00%) 87.00 ( 10.31%) Min free-odr0-32 91.00 ( 0.00%) 83.00 ( 8.79%) Min free-odr0-64 89.00 ( 0.00%) 81.00 ( 8.99%) Min free-odr0-128 88.00 ( 0.00%) 80.00 ( 9.09%) Min free-odr0-256 106.00 ( 0.00%) 95.00 ( 10.38%) Min free-odr0-512 116.00 ( 0.00%) 111.00 ( 4.31%) Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%) Min free-odr0-2048 133.00 ( 0.00%) 126.00 ( 5.26%) Min free-odr0-4096 136.00 ( 0.00%) 130.00 ( 4.41%) Min free-odr0-8192 138.00 ( 0.00%) 130.00 ( 5.80%) Min free-odr0-16384 137.00 ( 0.00%) 130.00 ( 5.11%) There is a sizable boost to the free allocator performance. While there is an apparent boost on the allocation side, it's likely a co-incidence or due to the patches slightly reducing cache footprint. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | d61f859039 |
mm, page_alloc: only check PageCompound for high-order pages
Another year, another round of page allocator optimisations focusing this time on the alloc and free fast paths. This should be of help to workloads that are allocator-intensive from kernel space where the cost of zeroing is not nceessraily incurred. The series is motivated by the observation that page alloc microbenchmarks on multiple machines regressed between 3.12.44 and 4.4. Second, there is discussions before LSF/MM considering the possibility of adding another page allocator which is potentially hazardous but a patch series improving performance is better than whining. After the series is applied, there are still hazards. In the free paths, the debugging checking and page zone/pageblock lookups dominate but there was not an obvious solution to that. In the alloc path, the major contributers are dealing with zonelists, new page preperation, the fair zone allocation and numerous statistic updates. The fair zone allocator is removed by the per-node LRU series if that gets merged so it's nor a major concern at the moment. On normal userspace benchmarks, there is little impact as the zeroing cost is significant but it's visible aim9 4.6.0-rc3 4.6.0-rc3 vanilla deferalloc-v3 Min page_test 828693.33 ( 0.00%) 887060.00 ( 7.04%) Min brk_test 4847266.67 ( 0.00%) 4966266.67 ( 2.45%) Min exec_test 1271.00 ( 0.00%) 1275.67 ( 0.37%) Min fork_test 12371.75 ( 0.00%) 12380.00 ( 0.07%) The overall impact on a page allocator microbenchmark for a range of orders and number of pages allocated in a batch is 4.6.0-rc3 4.6.0-rc3 vanilla deferalloc-v3r7 Min alloc-odr0-1 428.00 ( 0.00%) 316.00 ( 26.17%) Min alloc-odr0-2 314.00 ( 0.00%) 231.00 ( 26.43%) Min alloc-odr0-4 256.00 ( 0.00%) 192.00 ( 25.00%) Min alloc-odr0-8 222.00 ( 0.00%) 166.00 ( 25.23%) Min alloc-odr0-16 207.00 ( 0.00%) 154.00 ( 25.60%) Min alloc-odr0-32 197.00 ( 0.00%) 148.00 ( 24.87%) Min alloc-odr0-64 193.00 ( 0.00%) 144.00 ( 25.39%) Min alloc-odr0-128 191.00 ( 0.00%) 143.00 ( 25.13%) Min alloc-odr0-256 203.00 ( 0.00%) 153.00 ( 24.63%) Min alloc-odr0-512 212.00 ( 0.00%) 165.00 ( 22.17%) Min alloc-odr0-1024 221.00 ( 0.00%) 172.00 ( 22.17%) Min alloc-odr0-2048 225.00 ( 0.00%) 179.00 ( 20.44%) Min alloc-odr0-4096 232.00 ( 0.00%) 185.00 ( 20.26%) Min alloc-odr0-8192 235.00 ( 0.00%) 187.00 ( 20.43%) Min alloc-odr0-16384 236.00 ( 0.00%) 188.00 ( 20.34%) Min alloc-odr1-1 519.00 ( 0.00%) 450.00 ( 13.29%) Min alloc-odr1-2 391.00 ( 0.00%) 336.00 ( 14.07%) Min alloc-odr1-4 313.00 ( 0.00%) 268.00 ( 14.38%) Min alloc-odr1-8 277.00 ( 0.00%) 235.00 ( 15.16%) Min alloc-odr1-16 256.00 ( 0.00%) 218.00 ( 14.84%) Min alloc-odr1-32 252.00 ( 0.00%) 212.00 ( 15.87%) Min alloc-odr1-64 244.00 ( 0.00%) 206.00 ( 15.57%) Min alloc-odr1-128 244.00 ( 0.00%) 207.00 ( 15.16%) Min alloc-odr1-256 243.00 ( 0.00%) 207.00 ( 14.81%) Min alloc-odr1-512 245.00 ( 0.00%) 209.00 ( 14.69%) Min alloc-odr1-1024 248.00 ( 0.00%) 214.00 ( 13.71%) Min alloc-odr1-2048 253.00 ( 0.00%) 220.00 ( 13.04%) Min alloc-odr1-4096 258.00 ( 0.00%) 224.00 ( 13.18%) Min alloc-odr1-8192 261.00 ( 0.00%) 229.00 ( 12.26%) Min alloc-odr2-1 560.00 ( 0.00%) 753.00 (-34.46%) Min alloc-odr2-2 424.00 ( 0.00%) 351.00 ( 17.22%) Min alloc-odr2-4 339.00 ( 0.00%) 393.00 (-15.93%) Min alloc-odr2-8 298.00 ( 0.00%) 246.00 ( 17.45%) Min alloc-odr2-16 276.00 ( 0.00%) 227.00 ( 17.75%) Min alloc-odr2-32 271.00 ( 0.00%) 221.00 ( 18.45%) Min alloc-odr2-64 264.00 ( 0.00%) 217.00 ( 17.80%) Min alloc-odr2-128 264.00 ( 0.00%) 217.00 ( 17.80%) Min alloc-odr2-256 264.00 ( 0.00%) 218.00 ( 17.42%) Min alloc-odr2-512 269.00 ( 0.00%) 223.00 ( 17.10%) Min alloc-odr2-1024 279.00 ( 0.00%) 230.00 ( 17.56%) Min alloc-odr2-2048 283.00 ( 0.00%) 235.00 ( 16.96%) Min alloc-odr2-4096 285.00 ( 0.00%) 239.00 ( 16.14%) Min alloc-odr3-1 629.00 ( 0.00%) 505.00 ( 19.71%) Min alloc-odr3-2 472.00 ( 0.00%) 374.00 ( 20.76%) Min alloc-odr3-4 383.00 ( 0.00%) 301.00 ( 21.41%) Min alloc-odr3-8 341.00 ( 0.00%) 266.00 ( 21.99%) Min alloc-odr3-16 316.00 ( 0.00%) 248.00 ( 21.52%) Min alloc-odr3-32 308.00 ( 0.00%) 241.00 ( 21.75%) Min alloc-odr3-64 305.00 ( 0.00%) 241.00 ( 20.98%) Min alloc-odr3-128 308.00 ( 0.00%) 244.00 ( 20.78%) Min alloc-odr3-256 317.00 ( 0.00%) 249.00 ( 21.45%) Min alloc-odr3-512 327.00 ( 0.00%) 256.00 ( 21.71%) Min alloc-odr3-1024 331.00 ( 0.00%) 261.00 ( 21.15%) Min alloc-odr3-2048 333.00 ( 0.00%) 266.00 ( 20.12%) Min alloc-odr4-1 767.00 ( 0.00%) 572.00 ( 25.42%) Min alloc-odr4-2 578.00 ( 0.00%) 429.00 ( 25.78%) Min alloc-odr4-4 474.00 ( 0.00%) 346.00 ( 27.00%) Min alloc-odr4-8 422.00 ( 0.00%) 310.00 ( 26.54%) Min alloc-odr4-16 399.00 ( 0.00%) 295.00 ( 26.07%) Min alloc-odr4-32 392.00 ( 0.00%) 293.00 ( 25.26%) Min alloc-odr4-64 394.00 ( 0.00%) 293.00 ( 25.63%) Min alloc-odr4-128 405.00 ( 0.00%) 305.00 ( 24.69%) Min alloc-odr4-256 417.00 ( 0.00%) 319.00 ( 23.50%) Min alloc-odr4-512 425.00 ( 0.00%) 326.00 ( 23.29%) Min alloc-odr4-1024 426.00 ( 0.00%) 329.00 ( 22.77%) Min free-odr0-1 216.00 ( 0.00%) 178.00 ( 17.59%) Min free-odr0-2 152.00 ( 0.00%) 125.00 ( 17.76%) Min free-odr0-4 120.00 ( 0.00%) 99.00 ( 17.50%) Min free-odr0-8 106.00 ( 0.00%) 85.00 ( 19.81%) Min free-odr0-16 97.00 ( 0.00%) 80.00 ( 17.53%) Min free-odr0-32 92.00 ( 0.00%) 76.00 ( 17.39%) Min free-odr0-64 89.00 ( 0.00%) 74.00 ( 16.85%) Min free-odr0-128 89.00 ( 0.00%) 73.00 ( 17.98%) Min free-odr0-256 107.00 ( 0.00%) 90.00 ( 15.89%) Min free-odr0-512 117.00 ( 0.00%) 108.00 ( 7.69%) Min free-odr0-1024 125.00 ( 0.00%) 118.00 ( 5.60%) Min free-odr0-2048 132.00 ( 0.00%) 125.00 ( 5.30%) Min free-odr0-4096 135.00 ( 0.00%) 130.00 ( 3.70%) Min free-odr0-8192 137.00 ( 0.00%) 130.00 ( 5.11%) Min free-odr0-16384 137.00 ( 0.00%) 131.00 ( 4.38%) Min free-odr1-1 318.00 ( 0.00%) 289.00 ( 9.12%) Min free-odr1-2 228.00 ( 0.00%) 207.00 ( 9.21%) Min free-odr1-4 182.00 ( 0.00%) 165.00 ( 9.34%) Min free-odr1-8 163.00 ( 0.00%) 146.00 ( 10.43%) Min free-odr1-16 151.00 ( 0.00%) 135.00 ( 10.60%) Min free-odr1-32 146.00 ( 0.00%) 129.00 ( 11.64%) Min free-odr1-64 145.00 ( 0.00%) 130.00 ( 10.34%) Min free-odr1-128 148.00 ( 0.00%) 134.00 ( 9.46%) Min free-odr1-256 148.00 ( 0.00%) 137.00 ( 7.43%) Min free-odr1-512 151.00 ( 0.00%) 140.00 ( 7.28%) Min free-odr1-1024 154.00 ( 0.00%) 143.00 ( 7.14%) Min free-odr1-2048 156.00 ( 0.00%) 144.00 ( 7.69%) Min free-odr1-4096 156.00 ( 0.00%) 142.00 ( 8.97%) Min free-odr1-8192 156.00 ( 0.00%) 140.00 ( 10.26%) Min free-odr2-1 361.00 ( 0.00%) 457.00 (-26.59%) Min free-odr2-2 258.00 ( 0.00%) 224.00 ( 13.18%) Min free-odr2-4 208.00 ( 0.00%) 223.00 ( -7.21%) Min free-odr2-8 185.00 ( 0.00%) 160.00 ( 13.51%) Min free-odr2-16 173.00 ( 0.00%) 149.00 ( 13.87%) Min free-odr2-32 166.00 ( 0.00%) 145.00 ( 12.65%) Min free-odr2-64 166.00 ( 0.00%) 146.00 ( 12.05%) Min free-odr2-128 169.00 ( 0.00%) 148.00 ( 12.43%) Min free-odr2-256 170.00 ( 0.00%) 152.00 ( 10.59%) Min free-odr2-512 177.00 ( 0.00%) 156.00 ( 11.86%) Min free-odr2-1024 182.00 ( 0.00%) 162.00 ( 10.99%) Min free-odr2-2048 181.00 ( 0.00%) 160.00 ( 11.60%) Min free-odr2-4096 180.00 ( 0.00%) 159.00 ( 11.67%) Min free-odr3-1 431.00 ( 0.00%) 367.00 ( 14.85%) Min free-odr3-2 306.00 ( 0.00%) 259.00 ( 15.36%) Min free-odr3-4 249.00 ( 0.00%) 208.00 ( 16.47%) Min free-odr3-8 224.00 ( 0.00%) 186.00 ( 16.96%) Min free-odr3-16 208.00 ( 0.00%) 176.00 ( 15.38%) Min free-odr3-32 206.00 ( 0.00%) 174.00 ( 15.53%) Min free-odr3-64 210.00 ( 0.00%) 178.00 ( 15.24%) Min free-odr3-128 215.00 ( 0.00%) 182.00 ( 15.35%) Min free-odr3-256 224.00 ( 0.00%) 189.00 ( 15.62%) Min free-odr3-512 232.00 ( 0.00%) 195.00 ( 15.95%) Min free-odr3-1024 230.00 ( 0.00%) 195.00 ( 15.22%) Min free-odr3-2048 229.00 ( 0.00%) 193.00 ( 15.72%) Min free-odr4-1 561.00 ( 0.00%) 439.00 ( 21.75%) Min free-odr4-2 418.00 ( 0.00%) 318.00 ( 23.92%) Min free-odr4-4 339.00 ( 0.00%) 269.00 ( 20.65%) Min free-odr4-8 299.00 ( 0.00%) 239.00 ( 20.07%) Min free-odr4-16 289.00 ( 0.00%) 234.00 ( 19.03%) Min free-odr4-32 291.00 ( 0.00%) 235.00 ( 19.24%) Min free-odr4-64 298.00 ( 0.00%) 238.00 ( 20.13%) Min free-odr4-128 308.00 ( 0.00%) 251.00 ( 18.51%) Min free-odr4-256 321.00 ( 0.00%) 267.00 ( 16.82%) Min free-odr4-512 327.00 ( 0.00%) 269.00 ( 17.74%) Min free-odr4-1024 326.00 ( 0.00%) 271.00 ( 16.87%) Min total-odr0-1 644.00 ( 0.00%) 494.00 ( 23.29%) Min total-odr0-2 466.00 ( 0.00%) 356.00 ( 23.61%) Min total-odr0-4 376.00 ( 0.00%) 291.00 ( 22.61%) Min total-odr0-8 328.00 ( 0.00%) 251.00 ( 23.48%) Min total-odr0-16 304.00 ( 0.00%) 234.00 ( 23.03%) Min total-odr0-32 289.00 ( 0.00%) 224.00 ( 22.49%) Min total-odr0-64 282.00 ( 0.00%) 218.00 ( 22.70%) Min total-odr0-128 280.00 ( 0.00%) 216.00 ( 22.86%) Min total-odr0-256 310.00 ( 0.00%) 243.00 ( 21.61%) Min total-odr0-512 329.00 ( 0.00%) 273.00 ( 17.02%) Min total-odr0-1024 346.00 ( 0.00%) 290.00 ( 16.18%) Min total-odr0-2048 357.00 ( 0.00%) 304.00 ( 14.85%) Min total-odr0-4096 367.00 ( 0.00%) 315.00 ( 14.17%) Min total-odr0-8192 372.00 ( 0.00%) 317.00 ( 14.78%) Min total-odr0-16384 373.00 ( 0.00%) 319.00 ( 14.48%) Min total-odr1-1 838.00 ( 0.00%) 739.00 ( 11.81%) Min total-odr1-2 619.00 ( 0.00%) 543.00 ( 12.28%) Min total-odr1-4 495.00 ( 0.00%) 433.00 ( 12.53%) Min total-odr1-8 440.00 ( 0.00%) 382.00 ( 13.18%) Min total-odr1-16 407.00 ( 0.00%) 353.00 ( 13.27%) Min total-odr1-32 398.00 ( 0.00%) 341.00 ( 14.32%) Min total-odr1-64 389.00 ( 0.00%) 336.00 ( 13.62%) Min total-odr1-128 392.00 ( 0.00%) 341.00 ( 13.01%) Min total-odr1-256 391.00 ( 0.00%) 344.00 ( 12.02%) Min total-odr1-512 396.00 ( 0.00%) 349.00 ( 11.87%) Min total-odr1-1024 402.00 ( 0.00%) 357.00 ( 11.19%) Min total-odr1-2048 409.00 ( 0.00%) 364.00 ( 11.00%) Min total-odr1-4096 414.00 ( 0.00%) 366.00 ( 11.59%) Min total-odr1-8192 417.00 ( 0.00%) 369.00 ( 11.51%) Min total-odr2-1 921.00 ( 0.00%) 1210.00 (-31.38%) Min total-odr2-2 682.00 ( 0.00%) 576.00 ( 15.54%) Min total-odr2-4 547.00 ( 0.00%) 616.00 (-12.61%) Min total-odr2-8 483.00 ( 0.00%) 406.00 ( 15.94%) Min total-odr2-16 449.00 ( 0.00%) 376.00 ( 16.26%) Min total-odr2-32 437.00 ( 0.00%) 366.00 ( 16.25%) Min total-odr2-64 431.00 ( 0.00%) 363.00 ( 15.78%) Min total-odr2-128 433.00 ( 0.00%) 365.00 ( 15.70%) Min total-odr2-256 434.00 ( 0.00%) 371.00 ( 14.52%) Min total-odr2-512 446.00 ( 0.00%) 379.00 ( 15.02%) Min total-odr2-1024 461.00 ( 0.00%) 392.00 ( 14.97%) Min total-odr2-2048 464.00 ( 0.00%) 395.00 ( 14.87%) Min total-odr2-4096 465.00 ( 0.00%) 398.00 ( 14.41%) Min total-odr3-1 1060.00 ( 0.00%) 872.00 ( 17.74%) Min total-odr3-2 778.00 ( 0.00%) 633.00 ( 18.64%) Min total-odr3-4 632.00 ( 0.00%) 510.00 ( 19.30%) Min total-odr3-8 565.00 ( 0.00%) 452.00 ( 20.00%) Min total-odr3-16 524.00 ( 0.00%) 424.00 ( 19.08%) Min total-odr3-32 514.00 ( 0.00%) 415.00 ( 19.26%) Min total-odr3-64 515.00 ( 0.00%) 419.00 ( 18.64%) Min total-odr3-128 523.00 ( 0.00%) 426.00 ( 18.55%) Min total-odr3-256 541.00 ( 0.00%) 438.00 ( 19.04%) Min total-odr3-512 559.00 ( 0.00%) 451.00 ( 19.32%) Min total-odr3-1024 561.00 ( 0.00%) 456.00 ( 18.72%) Min total-odr3-2048 562.00 ( 0.00%) 459.00 ( 18.33%) Min total-odr4-1 1328.00 ( 0.00%) 1011.00 ( 23.87%) Min total-odr4-2 997.00 ( 0.00%) 747.00 ( 25.08%) Min total-odr4-4 813.00 ( 0.00%) 615.00 ( 24.35%) Min total-odr4-8 721.00 ( 0.00%) 550.00 ( 23.72%) Min total-odr4-16 689.00 ( 0.00%) 529.00 ( 23.22%) Min total-odr4-32 683.00 ( 0.00%) 528.00 ( 22.69%) Min total-odr4-64 692.00 ( 0.00%) 531.00 ( 23.27%) Min total-odr4-128 713.00 ( 0.00%) 556.00 ( 22.02%) Min total-odr4-256 738.00 ( 0.00%) 586.00 ( 20.60%) Min total-odr4-512 753.00 ( 0.00%) 595.00 ( 20.98%) Min total-odr4-1024 752.00 ( 0.00%) 600.00 ( 20.21%) This patch (of 27): order-0 pages by definition cannot be compound so avoid the check in the fast path for those pages. [akpm@linux-foundation.org: use unlikely(order) in free_pages_prepare(), per Vlastimil] Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 449d777d7a |
mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper
Right now the oom reaper will clear TIF_MEMDIE only for tasks which were successfully reaped. This is the safest option because we know that such an oom victim would only block forward progress of the oom killer without a good reason because it is highly unlikely it would release much more memory. Basically most of its memory has been already torn down. We can relax this assumption to catch more corner cases though. The first obvious one is when the oom victim clears its mm and gets stuck later on. oom_reaper would back of on find_lock_task_mm returning NULL. We can safely try to clear TIF_MEMDIE in this case because such a task would be ignored by the oom killer anyway. The flag would be cleared by that time already most of the time anyway. The less obvious one is when the oom reaper fails due to mmap_sem contention. Even if we clear TIF_MEMDIE for this task then it is not very likely that we would select another task too easily because we haven't reaped the last victim and so it would be still the #1 candidate. There is a rare race condition possible when the current victim terminates before the next select_bad_process but considering that oom_reap_task had retried several times before giving up then this sounds like a borderline thing. After this patch we should have a guarantee that the OOM killer will not be block for unbounded amount of time for most cases. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 3ef22dfff2 |
oom, oom_reaper: try to reap tasks which skip regular OOM killer path
If either the current task is already killed or PF_EXITING or a selected task is PF_EXITING then the oom killer is suppressed and so is the oom reaper. This patch adds try_oom_reaper which checks the given task and queues it for the oom reaper if that is safe to be done meaning that the task doesn't share the mm with an alive process. This might help to release the memory pressure while the task tries to exit. [akpm@linux-foundation.org: fix nommu build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 3da88fb3ba |
mm, oom: move GFP_NOFS check to out_of_memory
__alloc_pages_may_oom is the central place to decide when the out_of_memory should be invoked. This is a good approach for most checks there because they are page allocator specific and the allocation fails right after for all of them. The notable exception is GFP_NOFS context which is faking did_some_progress and keep the page allocator looping even though there couldn't have been any progress from the OOM killer. This patch doesn't change this behavior because we are not ready to allow those allocation requests to fail yet (and maybe we will face the reality that we will never manage to safely fail these request). Instead __GFP_FS check is moved down to out_of_memory and prevent from OOM victim selection there. There are two reasons for that - OOM notifiers might release some memory even from this context as none of the registered notifier seems to be FS related - this might help a dying thread to get an access to memory reserves and move on which will make the behavior more consistent with the case when the task gets killed from a different context. Keep a comment in __alloc_pages_may_oom to make sure we do not forget how GFP_NOFS is special and that we really want to do something about it. Note to the current oom_notifier users: The observable difference for you is that oom notifiers cannot depend on any fs locks because we could deadlock. Not that this would be allowed today because that would just lockup machine in most of the cases and ruling out the OOM killer along the way. Another difference is that callbacks might be invoked sooner now because GFP_NOFS is a weaker reclaim context and so there could be reclaimable memory which is just not reachable now. That would require GFP_NOFS only loads which are really rare and more importantly the observable result would be dropping of reconstructible object and potential performance drop which is not such a big deal when we are struggling to fulfill other important allocation requests. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Raushaniya Maksudova <rmaksudova@parallels.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Daniel Vetter <daniel.vetter@intel.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vitaly Kuznetsov | 86dd995d63 |
memory_hotplug: introduce memhp_default_state= command line parameter
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the memory hotplug onlining policy. Add a command line parameter to make it possible to override the default. It may come handy for debug and testing purposes. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Vrabel <david.vrabel@citrix.com> Cc: David Rientjes <rientjes@google.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Lennart Poettering <lennart@poettering.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vitaly Kuznetsov | 8604d9e534 |
memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
This patchset continues the work I started with commit
|
|
Hugh Dickins | 1d069b7dd5 |
huge pagecache: extend mremap pmd rmap lockout to files
Whatever huge pagecache implementation we go with, file rmap locking must be added to anon rmap locking, when mremap's move_page_tables() finds a pmd_trans_huge pmd entry: a simple change, let's do it now. Factor out take_rmap_locks() and drop_rmap_locks() to handle the locking for make move_ptes() and move_page_tables(), and delete the VM_BUG_ON_VMA which rejected vm_file and required anon_vma. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | bf8616d5fa |
huge mm: move_huge_pmd does not need new_vma
Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as it was in the old vma, alignment and size permitting. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 52b6f46bc1 |
mm: /proc/sys/vm/stat_refresh to force vmstat update
Provide /proc/sys/vm/stat_refresh to force an immediate update of per-cpu into global vmstats: useful to avoid a sleep(2) or whatever before checking counts when testing. Originally added to work around a bug which left counts stranded indefinitely on a cpu going idle (an inaccuracy magnified when small below-batch numbers represent "huge" amounts of memory), but I believe that bug is now fixed: nonetheless, this is still a useful knob. Its schedule_on_each_cpu() is probably too expensive just to fold into reading /proc/meminfo itself: give this mode 0600 to prevent abuse. Allow a write or a read to do the same: nothing to read, but "grep -h Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient. Oh, and since global_page_state() itself is careful to disguise any underflow as 0, hack in an "Invalid argument" and pr_warn() if a counter is negative after the refresh - this helped to fix a misaccounting of NR_ISOLATED_FILE in my migration code. But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED often go negative some of the time. I have not yet worked out why, but have no evidence that it's actually harmful. Punt for the moment by just ignoring the anomaly on those. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andres Lagar-Cavilla | 9e18eb2935 |
tmpfs: mem_cgroup charge fault to vm_mm not current mm
Although shmem_fault() has been careful to count a major fault to vm_mm, shmem_getpage_gfp() has been careless in charging a remote access fault to current->mm owner's memcg instead of to vma->vm_mm owner's memcg: that is inconsistent with all the mem_cgroup charging on remote access faults in mm/memory.c. Fix it by passing fault_mm along with fault_type to shmem_get_page_gfp(); but in that case, now knowing the right mm, it's better for it to handle the PGMAJFAULT updates itself. And let's keep this clutter out of most callers' way: change the common shmem_getpage() wrapper to hide fault_mm and fault_type as well as gfp. Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | 75edd345e8 |
tmpfs: preliminary minor tidyups
Make a few cleanups in mm/shmem.c, before going on to complicate it. shmem_alloc_page() will become more complicated: we can't afford to to have that complication duplicated between a CONFIG_NUMA version and a !CONFIG_NUMA version, so rearrange the #ifdef'ery there to yield a single shmem_swapin() and a single shmem_alloc_page(). Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA configurations, but eliminating it is a larger cleanup: I have an alloc_pages_mpol() patchset not yet ready - mpol handling is subtle and bug-prone, and changed yet again since my last version. Move __SetPageLocked, __SetPageSwapBacked from shmem_getpage_gfp() to shmem_alloc_page(): that SwapBacked flag will be useful in future, to help to distinguish different cases appropriately. And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of little use (IIRC it dates back to when shmem_getpage() returned the page unlocked): kill it and do the necessary in shmem_file_read_iter(). But an arm64 build then complained that info may be uninitialized (where shmem_getpage_gfp() deletes a freshly alloced page beyond eof), and advancing to an "sgp <= SGP_CACHE" test jogged it back to reality. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | fa9949da59 |
mm: use __SetPageSwapBacked and dont ClearPageSwapBacked
v3.16 commit
|
|
Hugh Dickins | 9d5e6a9f22 |
mm: update_lru_size do the __mod_zone_page_state
Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy reclaim was removed) that lru_size can be updated by -nr_taken once per call to isolate_lru_pages(), instead of page by page. Update it inside isolate_lru_pages(), or at its two callsites? I chose to update it at the callsites, rearranging and grouping the updates by nr_taken and nr_scanned together in both. With one exception, mem_cgroup_update_lru_size(,lru,) is then used where __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding some more calls in a future commit. Make the code a little smaller and simpler by incorporating stat update in lru_size update. The exception was move_active_pages_to_lru(), which aggregated the pgmoved stat update separately from the individual lru_size updates; but I still think this a simplification worth making. However, the __mod_zone_page_state is not peculiar to mem_cgroups: so better use the name update_lru_size, calls mem_cgroup_update_lru_size when CONFIG_MEMCG. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hugh Dickins | ca707239e8 |
mm: update_lru_size warn and reset bad lru_size
Though debug kernels have a VM_BUG_ON to help protect from misaccounting lru_size, non-debug kernels are liable to wrap it around: and then the vast unsigned long size draws page reclaim into a loop of repeatedly doing nothing on an empty list, without even a cond_resched(). That soft lockup looks confusingly like an over-busy reclaim scenario, with lots of contention on the lru_lock in shrink_inactive_list(): yet has a totally different origin. Help differentiate with a custom warning in mem_cgroup_update_lru_size(), even in non-debug kernels; and reset the size to avoid the lockup. But the particular bug which suggested this change was mine alone, and since fixed. Make it a WARN_ONCE: the first occurrence is the most informative, a flurry may follow, yet even when rate-limited little more is learnt. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Yang Shi <yang.shi@linaro.org> Cc: Ning Qu <quning@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Andres Lagar-Cavilla <andreslc@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Konstantin Khlebnikov | 1269019e69 |
mm/mmap: kill hook arch_rebalance_pgtables()
Nobody uses it. Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | e87d59f7a2 |
mm/vmstat: make node_page_state() handles all zones by itself
node_page_state() manually adds statistics per each zone and returns total value for all zones. Whenever we add a new zone, we need to consider this function and it's really troublesome. Make it handle all zones by itself. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 33499bfe50 |
mm/highmem: make nr_free_highpages() handles all highmem zones by itself
nr_free_highpages() manually adds statistics per each highmem zone and returns a total value for them. Whenever we add a new highmem zone, we need to consider this function and it's really troublesome. Make it handle all highmem zones by itself. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | fc2bd799c7 |
mm/page_alloc: correct highmem memory statistics
ZONE_MOVABLE could be treated as highmem so we need to consider it for accurate statistics. And, in following patches, ZONE_CMA will be introduced and it can be treated as highmem, too. So, instead of manually adding stat of ZONE_MOVABLE, looping all zones and check whether the zone is highmem or not and add stat of the zone which can be treated as highmem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 09b4ab3c43 |
mm/writeback: correct dirty page calculation for highmem
ZONE_MOVABLE could be treated as highmem so we need to consider it for accurate calculation of dirty pages. And, in following patches, ZONE_CMA will be introduced and it can be treated as highmem, too. So, instead of manually adding stat of ZONE_MOVABLE, looping all zones and check whether the zone is highmem or not and add stat of the zone which can be treated as highmem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | ba6b0979e3 |
power: add zone range overlapping check
There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to care this overlapping when iterating pfn range. mark_free_pages() iterates requested zone's pfn range and unset all range's bitmap first. And then it marks freepages in a zone to the bitmap. If there is an overlapping zone, above unset could clear previous marked bit and reference to this bitmap in the future will cause the problem. To prevent it, this patch adds a zone check in mark_free_pages(). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 9d43f5aec9 |
mm/page_owner: add zone range overlapping check
There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to care this overlapping when iterating pfn range. There are one place in page_owner.c that iterates pfn range and it doesn't consider this overlapping. Add it. Without this patch, above system could over count early allocated page number before page_owner is activated. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | a91c43c731 |
mm/vmstat: add zone range overlapping check
There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to care this overlapping when iterating pfn range. There are two places in vmstat.c that iterates pfn range and they don't consider this overlapping. Add it. Without this patch, above system could over count pageblock number on a zone. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | b9eb63191a |
mm/memory_hotplug: add comment to some functions related to memory hotplug
__offline_isolated_pages() and test_pages_isolated() are used by memory hotplug. These functions require that range is in a single zone but there is no code to do this because memory hotplug checks it before calling these functions. To avoid confusing future user of these functions, this patch adds comments to them. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | f44b2dda8b |
mm/hugetlb: add same zone check in pfn_range_valid_gigantic()
This patchset deals with some problematic sites that iterate pfn ranges. There is a system thats node's pfns are overlapped as follows: -----pfn--------> N0 N1 N2 N0 N1 N2 Therefore, we need to take care of this overlapping when iterating pfn range. I audit many iterating sites that uses pfn_valid(), pfn_valid_within(), zone_start_pfn and etc. and others looks safe to me. This is a preparation step for a new CMA implementation, ZONE_CMA (https://lkml.org/lkml/2015/2/12/95), because it would be easily overlapped with other zones. But, zone overlap check is also needed for the general case so I send it separately. This patch (of 5): alloc_gigantic_page() uses alloc_contig_range() and this requires that the requested range is in a single zone. To satisfy this requirement, add this check to pfn_range_valid_gigantic(). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | 1aa8aea535 |
mm: uninline page_mapped()
It's huge. Uninlining it saves 206 bytes per callsite. Shaves 4924 bytes from the x86_64 allmodconfig vmlinux. [akpm@linux-foundation.org: coding-style fixes] Cc: Steve Capper <steve.capper@arm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | fdd048e12c |
mm, compaction: skip blocks where isolation fails in async direct compaction
The goal of direct compaction is to quickly make a high-order page available for the pending allocation. Within an aligned block of pages of desired order, a single allocated page that cannot be isolated for migration means that the block cannot fully merge to a buddy page that would satisfy the allocation request. Therefore we can reduce the allocation stall by skipping the rest of the block immediately on isolation failure. For async compaction, this also means a higher chance of succeeding until it detects contention. We however shouldn't completely sacrifice the second objective of compaction, which is to reduce overal long-term memory fragmentation. As a compromise, perform the eager skipping only in direct async compaction, while sync compaction (including kcompactd) remains thorough. Testing was done using stress-highalloc from mmtests, configured for order-4 GFP_KERNEL allocations: 4.6-rc1 4.6-rc1 before after Success 1 Min 24.00 ( 0.00%) 27.00 (-12.50%) Success 1 Mean 30.20 ( 0.00%) 31.60 ( -4.64%) Success 1 Max 37.00 ( 0.00%) 35.00 ( 5.41%) Success 2 Min 42.00 ( 0.00%) 32.00 ( 23.81%) Success 2 Mean 44.00 ( 0.00%) 44.80 ( -1.82%) Success 2 Max 48.00 ( 0.00%) 52.00 ( -8.33%) Success 3 Min 91.00 ( 0.00%) 92.00 ( -1.10%) Success 3 Mean 92.20 ( 0.00%) 92.80 ( -0.65%) Success 3 Max 94.00 ( 0.00%) 93.00 ( 1.06%) We can see that success rates are unaffected by the skipping. 4.6-rc1 4.6-rc1 before after User 2587.42 2566.53 System 482.89 471.20 Elapsed 1395.68 1382.00 Times are not so useful metric for this benchmark as main portion is the interfering kernel builds, but results do hint at reduced system times. 4.6-rc1 4.6-rc1 before after Direct pages scanned 163614 159608 Kswapd pages scanned 2070139 2078790 Kswapd pages reclaimed 2061707 2069757 Direct pages reclaimed 163354 159505 Reduced direct reclaim was unintended, but could be explained by more successful first attempt at (async) direct compaction, which is attempted before the first reclaim attempt in __alloc_pages_slowpath(). Compaction stalls 33052 39853 Compaction success 12121 19773 Compaction failures 20931 20079 Compaction is indeed more successful, and thus less likely to get deferred, so there are also more direct compaction stalls. Page migrate success 3781876 3326819 Page migrate failure 45817 41774 Compaction pages isolated 7868232 6941457 Compaction migrate scanned 168160492 127269354 Compaction migrate prescanned 0 0 Compaction free scanned 2522142582 2326342620 Compaction free direct alloc 0 0 Compaction free dir. all. miss 0 0 Compaction cost 5252 4476 The patch reduces migration scanned pages by 25% thanks to the eager skipping. [hughd@google.com: prevent nr_isolated_* from going negative] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | a34753d275 |
mm, compaction: reduce spurious pcplist drains
Compaction drains the local pcplists each time migration scanner moves away from a cc->order aligned block where it isolated pages for migration, so that the pages freed by migrations can merge into higher orders. The detection is currently coarser than it could be. The cc->last_migrated_pfn variable should track the lowest pfn that was isolated for migration. But it is set to the pfn where isolate_migratepages_block() starts scanning, which is typically the first pfn of the pageblock. There, the scanner might fail to isolate several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX in another block. This would cause the pcplists drain to be performed, although the scanner didn't yet finish the block where it isolated from. This patch thus makes cc->last_migrated_pfn handling more accurate by setting it to the pfn of an actually isolated page in isolate_migratepages_block(). Although practical effects of this patch are likely low, it arguably makes the intent of the code more obvious. Also the next patch will make async direct compaction skip blocks more aggressively, and draining pcplists due to skipped blocks is wasteful. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | 06b6640a39 |
mm, compaction: wrap calculating first and last pfn of pageblock
Compaction code has accumulated numerous instances of manual calculations of the first (inclusive) and last (exclusive) pfn of a pageblock (or a smaller block of given order), given a pfn within the pageblock. Wrap these calculations by introducing pageblock_start_pfn(pfn) and pageblock_end_pfn(pfn) macros. [vbabka@suse.cz: fix crash in get_pfnblock_flags_mask() from isolate_freepages():] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Konstantin Khlebnikov | e4c5800a39 |
mm/rmap: replace BUG_ON(anon_vma->degree) with VM_WARN_ON
This check effectively catches anon vma hierarchy inconsistence and some vma corruptions. It was effective for catching corner cases in anon vma reusing logic. For now this code seems stable so check could be hidden under CONFIG_DEBUG_VM and replaced with WARN because it's not so fatal. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Suggested-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | fee83b3aba |
mm/mempolicy.c:offset_il_node() document and clarify
This code was pretty obscure and was relying upon obscure side-effects of next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to -1. Clean that all up and document the function's intent. Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Laura Abbott <lauraa@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | 54f18d3526 |
mm/hugetlb.c: use first_memory_node
Instead of open-coding it. Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Li Zhang | 949698a31a |
mm/page_alloc: Remove useless parameter of __free_pages_boot_core
__free_pages_boot_core has parameter pfn which is not used at all. Remove it. Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com> Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | fda3d69be9 |
mm/memcontrol.c:mem_cgroup_select_victim_node(): clarify comment
> The comment seems to have not much to do with the code? I guess the comment tries to say that the code path is triggered when we charge the page which happens _before_ it is added to the LRU list and so last_scanned_node might contain the stale data. Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yaowei Bai | c98940f6fa |
mm/memory_hotplug: is_mem_section_removable() can return bool
Make is_mem_section_removable() return bool to improve readability due to this particular function only using either one or zero as its return value. Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vaishali Thakkar | 9fee021d15 |
mm/hugetlb: introduce hugetlb_bad_size()
When any unsupported hugepage size is specified, 'hugepagesz=' and 'hugepages=' should be ignored during command line parsing until any supported hugepage size is found. But currently incorrect number of hugepages are allocated when unsupported size is specified as it fails to ignore the 'hugepages=' command. Test case: Note that this is specific to x86 architecture. Boot the kernel with command line option 'hugepagesz=256M hugepages=X'. After boot, dmesg output shows that X number of hugepages of the size 2M is pre-allocated instead of 0. So, to handle such command line options, introduce new routine hugetlb_bad_size. The routine hugetlb_bad_size sets the global variable parsed_valid_hugepagesz. We are using parsed_valid_hugepagesz to save the state when unsupported hugepagesize is found so that we can ignore the 'hugepages=' parameters after that and then reset the variable when supported hugepage size is found. The routine hugetlb_bad_size can be called while setting 'hugepagesz=' parameter in an architecture specific code. Signed-off-by: Vaishali Thakkar <vaishali.thakkar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Dominik Dingel <dingel@linux.vnet.ibm.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: James Hogan <james.hogan@imgtec.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mike Kravetz | 09a95e29cb |
mm/hugetlb: optimize minimum size (min_size) accounting
It was observed that minimum size accounting associated with the hugetlbfs min_size mount option may not perform optimally and as expected. As huge pages/reservations are released from the filesystem and given back to the global pools, they are reserved for subsequent filesystem use as long as the subpool reserved count is less than subpool minimum size. It does not take into account used pages within the filesystem. The filesystem size limits are not exceeded and this is technically not a bug. However, better behavior would be to wait for the number of used pages/reservations associated with the filesystem to drop below the minimum size before taking reservations to satisfy minimum size. An optimization is also made to the hugepage_subpool_get_pages() routine which is called when pages/reservations are allocated. This does not change behavior, but simply avoids the accounting if all reservations have already been taken (subpool reserved count == 0). Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | 0edaf86cf1 |
include/linux/nodemask.h: create next_node_in() helper
Lots of code does node = next_node(node, XXX); if (node == MAX_NUMNODES) node = first_node(XXX); so create next_node_in() to do this and use it in various places. [mhocko@suse.com: use next_node_in() helper] Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Joonsoo Kim <js1304@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Hui Zhu <zhuhui@xiaomi.com> Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 0139aa7b7f |
mm: rename _count, field of the struct page, to _refcount
Many developers already know that field for reference count of the struct page is _count and atomic type. They would try to handle it directly and this could break the purpose of page reference count tracepoint. To prevent direct _count modification, this patch rename it to _refcount and add warning message on the code. After that, developer who need to handle reference count will find that field should not be accessed directly. [akpm@linux-foundation.org: fix comments, per Vlastimil] [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too] [sfr@canb.auug.org.au: sync ethernet driver changes] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Manish Chopra <manish.chopra@qlogic.com> Cc: Yuval Mintz <yuval.mintz@qlogic.com> Cc: Tariq Toukan <tariqt@mellanox.com> Cc: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 6d061f9f61 |
mm/page_ref: use page_ref helper instead of direct modification of _count
page_reference manipulation functions are introduced to track down reference count change of the page. Use it instead of direct modification of _count. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: "David S. Miller" <davem@davemloft.net> Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Li Peng | 43efd3ea64 |
mm/slub.c: fix sysfs filename in comment
/sys/kernel/slab/xx/defrag_ratio should be remote_node_defrag_ratio. Link: http://lkml.kernel.org/r/1463449242-5366-1-git-send-email-lip@dtdream.com Signed-off-by: Li Peng <lip@dtdream.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | a3187e438b |
mm: slab: remove ZONE_DMA_FLAG
Now we have IS_ENABLED helper to check if a Kconfig option is enabled or not, so ZONE_DMA_FLAG sounds no longer useful. And, the use of ZONE_DMA_FLAG in slab looks pointless according to the comment [1] from Johannes Weiner, so remove them and ORing passed in flags with the cache gfp flags has been done in kmem_getpages(). [1] https://lkml.org/lkml/2014/9/25/553 Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Thomas Garnier | c7ce4f60ac |
mm: SLAB freelist randomization
Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize the SLAB freelist. The list is randomized during initialization of a new set of pages. The order on different freelist sizes is pre-computed at boot for performance. Each kmem_cache has its own randomized freelist. Before pre-computed lists are available freelists are generated dynamically. This security feature reduces the predictability of the kernel SLAB allocator against heap overflows rendering attacks much less stable. For example this attack against SLUB (also applicable against SLAB) would be affected: https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/ Also, since v4.6 the freelist was moved at the end of the SLAB. It means a controllable heap is opened to new attacks not yet publicly discussed. A kernel heap overflow can be transformed to multiple use-after-free. This feature makes this type of attack harder too. To generate entropy, we use get_random_bytes_arch because 0 bits of entropy is available in the boot stage. In the worse case this function will fallback to the get_random_bytes sub API. We also generate a shift random number to shift pre-computed freelist for each new set of pages. The config option name is not specific to the SLAB as this approach will be extended to other allocators like SLUB. Performance results highlighted no major changes: Hackbench (running 90 10 times): Before average: 0.0698 After average: 0.0663 (-5.01%) slab_test 1 run on boot. Difference only seen on the 2048 size test being the worse case scenario covered by freelist randomization. New slab pages are constantly being created on the 10000 allocations. Variance should be mainly due to getting new pages every few allocations. Before: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles 10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles 10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles 10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles 10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles 10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles 10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles 10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles 10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles 10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles 10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles 10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 121 cycles 10000 times kmalloc(16)/kfree -> 121 cycles 10000 times kmalloc(32)/kfree -> 121 cycles 10000 times kmalloc(64)/kfree -> 121 cycles 10000 times kmalloc(128)/kfree -> 121 cycles 10000 times kmalloc(256)/kfree -> 119 cycles 10000 times kmalloc(512)/kfree -> 119 cycles 10000 times kmalloc(1024)/kfree -> 119 cycles 10000 times kmalloc(2048)/kfree -> 119 cycles 10000 times kmalloc(4096)/kfree -> 121 cycles 10000 times kmalloc(8192)/kfree -> 119 cycles 10000 times kmalloc(16384)/kfree -> 119 cycles After: Single thread testing ===================== 1. Kmalloc: Repeatedly allocate then free test 10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles 10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles 10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles 10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles 10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles 10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles 10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles 10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles 10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles 10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles 10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles 10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles 2. Kmalloc: alloc/free test 10000 times kmalloc(8)/kfree -> 121 cycles 10000 times kmalloc(16)/kfree -> 121 cycles 10000 times kmalloc(32)/kfree -> 123 cycles 10000 times kmalloc(64)/kfree -> 142 cycles 10000 times kmalloc(128)/kfree -> 121 cycles 10000 times kmalloc(256)/kfree -> 119 cycles 10000 times kmalloc(512)/kfree -> 119 cycles 10000 times kmalloc(1024)/kfree -> 119 cycles 10000 times kmalloc(2048)/kfree -> 119 cycles 10000 times kmalloc(4096)/kfree -> 119 cycles 10000 times kmalloc(8192)/kfree -> 119 cycles 10000 times kmalloc(16384)/kfree -> 119 cycles [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()] Signed-off-by: Thomas Garnier <thgarnie@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Greg Thelen <gthelen@google.com> Cc: Laura Abbott <labbott@fedoraproject.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | 81ae6d0395 |
mm/slub.c: replace kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink()
When we call __kmem_cache_shrink on memory cgroup removal, we need to synchronize kmem_cache->cpu_partial update with put_cpu_partial that might be running on other cpus. Currently, we achieve that by using kick_all_cpus_sync, which works as a system wide memory barrier. Though fast it is, this method has a flaw - it issues a lot of IPIs, which might hurt high performance or real-time workloads. To fix this, let's replace kick_all_cpus_sync with synchronize_sched. Although the latter one may take much longer to finish, it shouldn't be a problem in this particular case, because memory cgroups are destroyed asynchronously from a workqueue so that no user visible effects should be introduced. OTOH, it will save us from excessive IPIs when someone removes a cgroup. Anyway, even if using synchronize_sched turns out to take too long, we can always introduce a kind of __kmem_cache_shrink batching so that this method would only be called once per one cgroup destruction (not per each per memcg kmem cache as it is now). Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 801faf0db8 |
mm/slab: lockless decision to grow cache
To check whether free objects exist or not precisely, we need to grab a lock. But, accuracy isn't that important because race window would be even small and if there is too much free object, cache reaper would reap it. So, this patch makes the check for free object exisistence not to hold a lock. This will reduce lock contention in heavily allocation case. Note that until now, n->shared can be freed during the processing by writing slabinfo, but, with some trick in this patch, we can access it freely within interrupt disabled period. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that allocation performance decreases for the object size up to 128 and it may be due to extra checks in cache_alloc_refill(). But, with considering improvement of free performance, net result looks the same. Result for other size class looks very promising, roughly, 50% performance improvement. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 213b46958c |
mm/slab: refill cpu cache through a new slab without holding a node lock
Until now, cache growing makes a free slab on node's slab list and then we can allocate free objects from it. This necessarily requires to hold a node lock which is very contended. If we refill cpu cache before attaching it to node's slab list, we can avoid holding a node lock as much as possible because this newly allocated slab is only visible to the current task. This will reduce lock contention. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 * After Kmalloc N*alloc N*free(32): Average=248/966 Kmalloc N*alloc N*free(64): Average=261/949 Kmalloc N*alloc N*free(128): Average=314/1016 Kmalloc N*alloc N*free(256): Average=741/1061 Kmalloc N*alloc N*free(512): Average=1246/1152 Kmalloc N*alloc N*free(1024): Average=2437/1259 Kmalloc N*alloc N*free(2048): Average=4980/1800 Kmalloc N*alloc N*free(4096): Average=9000/2078 It shows that contention is reduced for all the object sizes and performance increases by 30 ~ 40%. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 76b342bdc7 |
mm/slab: separate cache_grow() to two parts
This is a preparation step to implement lockless allocation path when there is no free objects in kmem_cache. What we'd like to do here is to refill cpu cache without holding a node lock. To accomplish this purpose, refill should be done after new slab allocation but before attaching the slab to the management list. So, this patch separates cache_grow() to two parts, allocation and attaching to the list in order to add some code inbetween them in the following patch. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 511e3a0588 |
mm/slab: make cache_grow() handle the page allocated on arbitrary node
Currently, cache_grow() assumes that allocated page's nodeid would be same with parameter nodeid which is used for allocation request. If we discard this assumption, we can handle fallback_alloc() case gracefully. So, this patch makes cache_grow() handle the page allocated on arbitrary node and clean-up relevant code. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 03d1d43a12 |
mm/slab: racy access/modify the slab color
Slab color isn't needed to be changed strictly. Because locking for changing slab color could cause more lock contention so this patch implements racy access/modify the slab color. This is a preparation step to implement lockless allocation path when there is no free objects in the kmem_cache. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=355/750 Kmalloc N*alloc N*free(64): Average=452/812 Kmalloc N*alloc N*free(128): Average=559/1070 Kmalloc N*alloc N*free(256): Average=1176/980 Kmalloc N*alloc N*free(512): Average=1939/1189 Kmalloc N*alloc N*free(1024): Average=3521/1278 Kmalloc N*alloc N*free(2048): Average=7152/1838 Kmalloc N*alloc N*free(4096): Average=13438/2013 It shows that contention is reduced for object size >= 1024 and performance increases by roughly 15%. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 6052b7880a |
mm/slab: don't keep free slabs if free_objects exceeds free_limit
Currently, determination to free a slab is done whenever each freed object is put into the slab. This has a following problem. Assume free_limit = 10 and nr_free = 9. Free happens as following sequence and nr_free changes as following. free(become a free slab) free(not become a free slab) nr_free: 9 -> 10 (at first free) -> 11 (at second free) If we try to check if we can free current slab or not on each object free, we can't free any slab in this situation because current slab isn't a free slab when nr_free exceed free_limit (at second free) even if there is a free slab. However, if we check it lastly, we can free 1 free slab. This problem would cause to keep too much memory in the slab subsystem. This patch try to fix it by checking number of free object after all free work is done. If there is free slab at that time, we can free slab as much as possible so we keep free slab as minimal. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | c3d332b6b2 |
mm/slab: clean-up kmem_cache_node setup
There are mostly same code for setting up kmem_cache_node either in cpuup_prepare() or alloc_kmem_cache_node(). Factor out and clean-up them. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Tested-by: Nishanth Menon <nm@ti.com> Tested-by: Jon Hunter <jonathanh@nvidia.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | ded0ecf611 |
mm/slab: factor out kmem_cache_node initialization code
It can be reused on other place, so factor out it. Following patch will use it. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | a5aa63a5f7 |
mm/slab: drain the free slab as much as possible
slabs_tofree() implies freeing all free slab. We can do it with just providing INT_MAX. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 8888177ea1 |
mm/slab: remove BAD_ALIEN_MAGIC again
Initial attemp to remove BAD_ALIEN_MAGIC is once reverted by 'commit
|
|
Joonsoo Kim | 18726ca8b3 |
mm/slab: fix the theoretical race by holding proper lock
While processing concurrent allocation, SLAB could be contended a lot because it did a lots of work with holding a lock. This patchset try to reduce the number of critical section to reduce lock contention. Major changes are lockless decision to allocate more slab and lockless cpu cache refill from the newly allocated slab. Below is the result of concurrent allocation/free in slab allocation benchmark made by Christoph a long time ago. I make the output simpler. The number shows cycle count during alloc/free respectively so less is better. * Before Kmalloc N*alloc N*free(32): Average=365/806 Kmalloc N*alloc N*free(64): Average=452/690 Kmalloc N*alloc N*free(128): Average=736/886 Kmalloc N*alloc N*free(256): Average=1167/985 Kmalloc N*alloc N*free(512): Average=2088/1125 Kmalloc N*alloc N*free(1024): Average=4115/1184 Kmalloc N*alloc N*free(2048): Average=8451/1748 Kmalloc N*alloc N*free(4096): Average=16024/2048 * After Kmalloc N*alloc N*free(32): Average=344/792 Kmalloc N*alloc N*free(64): Average=347/882 Kmalloc N*alloc N*free(128): Average=390/959 Kmalloc N*alloc N*free(256): Average=393/1067 Kmalloc N*alloc N*free(512): Average=683/1229 Kmalloc N*alloc N*free(1024): Average=1295/1325 Kmalloc N*alloc N*free(2048): Average=2513/1664 Kmalloc N*alloc N*free(4096): Average=4742/2172 It shows that performance improves greatly (roughly more than 50%) for the object class whose size is more than 128 bytes. This patch (of 11): If we don't hold neither the slab_mutex nor the node lock, node's shared array cache could be freed and re-populated. If __kmem_cache_shrink() is called at the same time, it will call drain_array() with n->shared without holding node lock so problem can happen. This patch fix the situation by holding the node lock before trying to drain the shared array. In addition, add a debug check to confirm that n->shared access race doesn't exist. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Jan Kara | 4d9a2c8746 |
dax: Remove i_mmap_lock protection
Currently faults are protected against truncate by filesystem specific i_mmap_sem and page lock in case of hole page. Cow faults are protected DAX radix tree entry locking. So there's no need for i_mmap_lock in DAX code. Remove it. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> |
|
Jan Kara | bc2466e425 |
dax: Use radix tree entry lock to protect cow faults
When doing cow faults, we cannot directly fill in PTE as we do for other faults as we rely on generic code to do proper accounting of the cowed page. We also have no page to lock to protect against races with truncate as other faults have and we need the protection to extend until the moment generic code inserts cowed page into PTE thus at that point we have no protection of fs-specific i_mmap_sem. So far we relied on using i_mmap_lock for the protection however that is completely special to cow faults. To make fault locking more uniform use DAX entry lock instead. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> |
|
Jan Kara | ac401cc782 |
dax: New fault locking
Currently DAX page fault locking is racy. CPU0 (write fault) CPU1 (read fault) __dax_fault() __dax_fault() get_block(inode, block, &bh, 0) -> not mapped get_block(inode, block, &bh, 0) -> not mapped if (!buffer_mapped(&bh)) if (vmf->flags & FAULT_FLAG_WRITE) get_block(inode, block, &bh, 1) -> allocates blocks if (page) -> no if (!buffer_mapped(&bh)) if (vmf->flags & FAULT_FLAG_WRITE) { } else { dax_load_hole(); } dax_insert_mapping() And we are in a situation where we fail in dax_radix_entry() with -EIO. Another problem with the current DAX page fault locking is that there is no race-free way to clear dirty tag in the radix tree. We can always end up with clean radix tree and dirty data in CPU cache. We fix the first problem by introducing locking of exceptional radix tree entries in DAX mappings acting very similarly to page lock and thus synchronizing properly faults against the same mapping index. The same lock can later be used to avoid races when clearing radix tree dirty tag. Reviewed-by: NeilBrown <neilb@suse.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> |
|
Jan Kara | 4f622938a5 |
dax: Allow DAX code to replace exceptional entries
Currently we forbid page_cache_tree_insert() to replace exceptional radix tree entries for DAX inodes. However to make DAX faults race free we will lock radix tree entries and when hole is created, we need to replace such locked radix tree entry with a hole page. So modify page_cache_tree_insert() to allow that. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> |
|
Linus Torvalds | c2e7b20705 |
Merge branch 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs cleanups from Al Viro: "More cleanups from Christoph" * 'work.preadv2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nfsd: use RWF_SYNC fs: add RWF_DSYNC aand RWF_SYNC ceph: use generic_write_sync fs: simplify the generic_write_sync prototype fs: add IOCB_SYNC and IOCB_DSYNC direct-io: remove the offset argument to dio_complete direct-io: eliminate the offset argument to ->direct_IO xfs: eliminate the pos variable in xfs_file_dio_aio_write filemap: remove the pos argument to generic_file_direct_write filemap: remove pos variables in generic_file_read_iter |
|
Linus Torvalds | 7f427d3a60 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull parallel filesystem directory handling update from Al Viro. This is the main parallel directory work by Al that makes the vfs layer able to do lookup and readdir in parallel within a single directory. That's a big change, since this used to be all protected by the directory inode mutex. The inode mutex is replaced by an rwsem, and serialization of lookups of a single name is done by a "in-progress" dentry marker. The series begins with xattr cleanups, and then ends with switching filesystems over to actually doing the readdir in parallel (switching to the "iterate_shared()" that only takes the read lock). A more detailed explanation of the process from Al Viro: "The xattr work starts with some acl fixes, then switches ->getxattr to passing inode and dentry separately. This is the point where the things start to get tricky - that got merged into the very beginning of the -rc3-based #work.lookups, to allow untangling the security_d_instantiate() mess. The xattr work itself proceeds to switch a lot of filesystems to generic_...xattr(); no complications there. After that initial xattr work, the series then does the following: - untangle security_d_instantiate() - convert a bunch of open-coded lookup_one_len_unlocked() to calls of that thing; one such place (in overlayfs) actually yields a trivial conflict with overlayfs fixes later in the cycle - overlayfs ended up switching to a variant of lookup_one_len_unlocked() sans the permission checks. I would've dropped that commit (it gets overridden on merge from #ovl-fixes in #for-next; proper resolution is to use the variant in mainline fs/overlayfs/super.c), but I didn't want to rebase the damn thing - it was fairly late in the cycle... - some filesystems had managed to depend on lookup/lookup exclusion for *fs-internal* data structures in a way that would break if we relaxed the VFS exclusion. Fixing hadn't been hard, fortunately. - core of that series - parallel lookup machinery, replacing ->i_mutex with rwsem, making lookup_slow() take it only shared. At that point lookups happen in parallel; lookups on the same name wait for the in-progress one to be done with that dentry. Surprisingly little code, at that - almost all of it is in fs/dcache.c, with fs/namei.c changes limited to lookup_slow() - making it use the new primitive and actually switching to locking shared. - parallel readdir stuff - first of all, we provide the exclusion on per-struct file basis, same as we do for read() vs lseek() for regular files. That takes care of most of the needed exclusion in readdir/readdir; however, these guys are trickier than lookups, so I went for switching them one-by-one. To do that, a new method '->iterate_shared()' is added and filesystems are switched to it as they are either confirmed to be OK with shared lock on directory or fixed to be OK with that. I hope to kill the original method come next cycle (almost all in-tree filesystems are switched already), but it's still not quite finished. - several filesystems get switched to parallel readdir. The interesting part here is dealing with dcache preseeding by readdir; that needs minor adjustment to be safe with directory locked only shared. Most of the filesystems doing that got switched to in those commits. Important exception: NFS. Turns out that NFS folks, with their, er, insistence on VFS getting the fuck out of the way of the Smart Filesystem Code That Knows How And What To Lock(tm) have grown the locking of their own. They had their own homegrown rwsem, with lookup/readdir/atomic_open being *writers* (sillyunlink is the reader there). Of course, with VFS getting the fuck out of the way, as requested, the actual smarts of the smart filesystem code etc. had become exposed... - do_last/lookup_open/atomic_open cleanups. As the result, open() without O_CREAT locks the directory only shared. Including the ->atomic_open() case. Backmerge from #for-linus in the middle of that - atomic_open() fix got brought in. - then comes NFS switch to saner (VFS-based ;-) locking, killing the homegrown "lookup and readdir are writers" kinda-sorta rwsem. All exclusion for sillyunlink/lookup is done by the parallel lookups mechanism. Exclusion between sillyunlink and rmdir is a real rwsem now - rmdir being the writer. Result: NFS lookups/readdirs/O_CREAT-less opens happen in parallel now. - the rest of the series consists of switching a lot of filesystems to parallel readdir; in a lot of cases ->llseek() gets simplified as well. One backmerge in there (again, #for-linus - rockridge fix)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (74 commits) ext4: switch to ->iterate_shared() hfs: switch to ->iterate_shared() hfsplus: switch to ->iterate_shared() hostfs: switch to ->iterate_shared() hpfs: switch to ->iterate_shared() hpfs: handle allocation failures in hpfs_add_pos() gfs2: switch to ->iterate_shared() f2fs: switch to ->iterate_shared() afs: switch to ->iterate_shared() befs: switch to ->iterate_shared() befs: constify stuff a bit isofs: switch to ->iterate_shared() get_acorn_filename(): deobfuscate a bit btrfs: switch to ->iterate_shared() logfs: no need to lock directory in lseek switch ecryptfs to ->iterate_shared 9p: switch to ->iterate_shared() fat: switch to ->iterate_shared() romfs, squashfs: switch to ->iterate_shared() more trivial ->iterate_shared conversions ... |
|
Al Viro | 0e0162bb8c |
Merge branch 'ovl-fixes' into for-linus
Backmerge to resolve a conflict in ovl_lookup_real(); "ovl_lookup_real(): use lookup_one_len_unlocked()" instead, but it was too late in the cycle to rebase. |
|
Linus Torvalds | 825a3b2605 |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar: - massive CPU hotplug rework (Thomas Gleixner) - improve migration fairness (Peter Zijlstra) - CPU load calculation updates/cleanups (Yuyang Du) - cpufreq updates (Steve Muckle) - nohz optimizations (Frederic Weisbecker) - switch_mm() micro-optimization on x86 (Andy Lutomirski) - ... lots of other enhancements, fixes and cleanups. * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (66 commits) ARM: Hide finish_arch_post_lock_switch() from modules sched/core: Provide a tsk_nr_cpus_allowed() helper sched/core: Use tsk_cpus_allowed() instead of accessing ->cpus_allowed sched/loadavg: Fix loadavg artifacts on fully idle and on fully loaded systems sched/fair: Correct unit of load_above_capacity sched/fair: Clean up scale confusion sched/nohz: Fix affine unpinned timers mess sched/fair: Fix fairness issue on migration sched/core: Kill sched_class::task_waking to clean up the migration logic sched/fair: Prepare to fix fairness problems on migration sched/fair: Move record_wakee() sched/core: Fix comment typo in wake_q_add() sched/core: Remove unused variable sched: Make hrtick_notifier an explicit call sched/fair: Make ilb_notifier an explicit call sched/hotplug: Make activate() the last hotplug step sched/hotplug: Move migration CPU_DYING to sched_cpu_dying() sched/migration: Move CPU_ONLINE into scheduler state sched/migration: Move calc_load_migrate() into CPU_DYING sched/migration: Move prepare transition to SCHED_STARTING state ... |
|
Andrea Arcangeli | 6d0a07edd1 |
mm: thp: calculate the mapcount correctly for THP pages during WP faults
This will provide fully accuracy to the mapcount calculation in the
write protect faults, so page pinning will not get broken by false
positive copy-on-writes.
total_mapcount() isn't the right calculation needed in
reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
that is effectively the full accurate return value for page_mapcount()
if dealing with Transparent Hugepages, however we only use the
page_trans_huge_mapcount() during COW faults where it strictly needed,
due to its higher runtime cost.
This also provide at practical zero cost the total_mapcount
information which is needed to know if we can still relocate the page
anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
can reuse the page no matter if it's a pte or a pmd_trans_huge
triggering the fault, but we can only relocate the page anon_vma to
the local vma->anon_vma if we're sure it's only this "vma" mapping the
whole THP physical range.
Kirill A. Shutemov discovered the problem with moving the page
anon_vma to the local vma->anon_vma in a previous version of this
patch and another problem in the way page_move_anon_rmap() was called.
Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
previous version, because reuse_swap_page must be a macro to call
page_trans_huge_mapcount from swap.h, so this uses a macro again
instead of an inline function. With this change at least it's a less
dangerous usage than it was before, because "page" is used only once
now, while with the previous code reuse_swap_page(page++) would have
called page_mapcount on page+1 and it would have increased page twice
instead of just once.
Dean Luick noticed an uninitialized variable that could result in a
rmap inefficiency for the non-THP case in a previous version.
Mike Marciniszyn said:
: Our RDMA tests are seeing an issue with memory locking that bisects to
: commit
|
|
Zhou Chengming | 7496fea9a6 |
ksm: fix conflict between mmput and scan_get_next_rmap_item
A concurrency issue about KSM in the function scan_get_next_rmap_item. task A (ksmd): |task B (the mm's task): | mm = slot->mm; | down_read(&mm->mmap_sem); | | ... | | spin_lock(&ksm_mmlist_lock); | | ksm_scan.mm_slot go to the next slot; | | spin_unlock(&ksm_mmlist_lock); | |mmput() -> | ksm_exit(): | |spin_lock(&ksm_mmlist_lock); |if (mm_slot && ksm_scan.mm_slot != mm_slot) { | if (!mm_slot->rmap_list) { | easy_to_free = 1; | ... | |if (easy_to_free) { | mmdrop(mm); | ... | |So this mm_struct may be freed in the mmput(). | up_read(&mm->mmap_sem); | As we can see above, the ksmd thread may access a mm_struct that already been freed to the kmem_cache. Suppose a fork will get this mm_struct from the kmem_cache, the ksmd thread then call up_read(&mm->mmap_sem), will cause mmap_sem.count to become -1. As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items has the same SMP race condition, so fix it too. My prev fix in function scan_get_next_rmap_item will introduce a different SMP race condition, so just invert the up_read/spin_unlock order as Andrea Arcangeli said. Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com> Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Geliang Tang <geliangtang@163.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Hanjun Guo <guohanjun@huawei.com> Cc: Ding Tianhong <dingtianhong@huawei.com> Cc: Li Bin <huawei.libin@huawei.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Ingo Molnar | eb60b3e5e8 |
Merge branch 'sched/urgent' into sched/core to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Sergey Senozhatsky | 44f43e99fe |
zsmalloc: fix zs_can_compact() integer overflow
zs_can_compact() has two race conditions in its core calculation: unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) - zs_stat_get(class, OBJ_USED); 1) classes are not locked, so the numbers of allocated and used objects can change by the concurrent ops happening on other CPUs 2) shrinker invokes it from preemptible context Depending on the circumstances, thus, OBJ_ALLOCATED can become less than OBJ_USED, which can result in either very high or negative `total_scan' value calculated later in do_shrink_slab(). do_shrink_slab() has some logic to prevent those cases: vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62 However, due to the way `total_scan' is calculated, not every shrinker->count_objects() overflow can be spotted and handled. To demonstrate the latter, I added some debugging code to do_shrink_slab() (x86_64) and the results were: vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615] vmscan: but total_scan > 0: 92679974445502 vmscan: resulting total_scan: 92679974445502 [..] vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615] vmscan: but total_scan > 0: 22634041808232578 vmscan: resulting total_scan: 22634041808232578 Even though shrinker->count_objects() has returned an overflowed value, the resulting `total_scan' is positive, and, what is more worrisome, it is insanely huge. This value is getting used later on in shrinker->scan_objects() loop: while (total_scan >= batch_size || total_scan >= freeable) { unsigned long ret; unsigned long nr_to_scan = min(batch_size, total_scan); shrinkctl->nr_to_scan = nr_to_scan; ret = shrinker->scan_objects(shrinker, shrinkctl); if (ret == SHRINK_STOP) break; freed += ret; count_vm_events(SLABS_SCANNED, nr_to_scan); total_scan -= nr_to_scan; cond_resched(); } `total_scan >= batch_size' is true for a very-very long time and 'total_scan >= freeable' is also true for quite some time, because `freeable < 0' and `total_scan' is large enough, for example, 22634041808232578. The only break condition, in the given scheme of things, is shrinker->scan_objects() == SHRINK_STOP test, which is a bit too weak to rely on, especially in heavy zsmalloc-usage scenarios. To fix the issue, take a pool stat snapshot and use it instead of racy zs_stat_get() calls. Link: http://lkml.kernel.org/r/20160509140052.3389-1-sergey.senozhatsky@gmail.com Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Dave Airlie | bafb86f5bc |
Linux 4.6-rc7
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJXL7HfAAoJEHm+PkMAQRiGYe8IAJBGaPUq38EJh2YOV+AQf9v6 t/alhwB3DUE1E0zjLy7I7JJ+xDXtKjZh9fS6OFuIS8Q3RIrBteIJ/oH8TPpt7yZ/ SnP6rYPvYD6CImTyrh7+ORL/udEwJX8+YqFYAgUAq167gvpDjYj8r26VzdIaIN4/ oBbL8NrQNWfODieywYyhUoitVhwMz09zmBfLtGVks4vd2jUJk2Fdd9cOtGV5tRfk DPndPgyQtbr8W0mKovV8sT9WkQeV5TsUr4MLgf7hjnAGYQ8+0KamkzzVVLBeBiiw uazyrOCFkddZp+N7KbmbOmazV/yULRuLGgDjVKazoCsOaKOvoGCzrCk7daOPy6Q= =CegX -----END PGP SIGNATURE----- Merge tag 'v4.6-rc7' into drm-next Merge this back as we've built up a fair few conflicts, and I have some newer trees to pull in. |
|
Linus Torvalds | 0783783104 |
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull writeback fix from Jens Axboe: "Just a single fix for domain aware writeback, fixing a regression that can cause balance_dirty_pages() to keep looping while not getting any work done" * 'for-linus' of git://git.kernel.dk/linux-block: writeback: Fix performance regression in wb_over_bg_thresh() |
|
Vlastimil Babka | 172400c69c |
mm: fix kcompactd hang during memory offlining
Assume memory47 is the last online block left in node1. This will hang:
# echo offline > /sys/devices/system/node/node1/memory47/state
After a couple of minutes, the following pops up in dmesg:
INFO: task bash:957 blocked for more than 120 seconds.
Not tainted 4.6.0-rc6+ #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
bash D ffff8800b7adbaf8 0 957 951 0x00000000
Call Trace:
schedule+0x35/0x80
schedule_timeout+0x1ac/0x270
wait_for_completion+0xe1/0x120
kthread_stop+0x4f/0x110
kcompactd_stop+0x26/0x40
__offline_pages.constprop.28+0x7e6/0x840
offline_pages+0x11/0x20
memory_block_action+0x73/0x1d0
memory_subsys_offline+0x47/0x60
device_offline+0x86/0xb0
store_mem_state+0xda/0xf0
dev_attr_store+0x18/0x30
sysfs_kf_write+0x37/0x40
kernfs_fop_write+0x11d/0x170
__vfs_write+0x37/0x120
vfs_write+0xa9/0x1a0
SyS_write+0x55/0xc0
entry_SYSCALL_64_fastpath+0x1a/0xa4
kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
actually exit. Check kthread_should_stop() to break out of the wait.
Fixes:
|
|
Dan Streetman | 32a4e16903 |
mm/zswap: provide unique zpool name
Instead of using "zswap" as the name for all zpools created, add an
atomic counter and use "zswap%x" with the counter number for each zpool
created, to provide a unique name for each new zpool.
As zsmalloc, one of the zpool implementations, requires/expects a unique
name for each pool created, zswap should provide a unique name. The
zsmalloc pool creation does not fail if a new pool with a conflicting
name is created, unless CONFIG_ZSMALLOC_STAT is enabled; in that case,
zsmalloc pool creation fails with -ENOMEM. Then zswap will be unable to
change its compressor parameter if its zpool is zsmalloc; it also will
be unable to change its zpool parameter back to zsmalloc, if it has any
existing old zpool using zsmalloc with page(s) in it. Attempts to
change the parameters will result in failure to create the zpool. This
changes zswap to provide a unique name for each zpool creation.
Fixes:
|
|
Hugh Dickins | 14af4a5e9b |
mm, cma: prevent nr_isolated_* counters from going negative
/proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file go
increasingly negative under compaction: which would add delay when
should be none, or no delay when should delay. The bug in compaction
was due to a recent mmotm patch, but much older instance of the bug was
also noticed in isolate_migratepages_range() which is used for CMA and
gigantic hugepage allocations.
The bug is caused by putback_movable_pages() in an error path
decrementing the isolated counters without them being previously
incremented by acct_isolated(). Fix isolate_migratepages_range() by
removing the error-path putback, thus reaching acct_isolated() with
migratepages still isolated, and leaving putback to caller like most
other places do.
Fixes:
|
|
Jason Baron | bc22af74f2 |
mm: update min_free_kbytes from khugepaged after core initialization
Khugepaged attempts to raise min_free_kbytes if its set too low.
However, on boot khugepaged sets min_free_kbytes first from
subsys_initcall(), and then the mm 'core' over-rides min_free_kbytes
after from init_per_zone_wmark_min(), via a module_init() call.
Khugepaged used to use a late_initcall() to set min_free_kbytes (such
that it occurred after the core initialization), however this was
removed when the initialization of min_free_kbytes was integrated into
the starting of the khugepaged thread.
The fix here is simply to invoke the core initialization using a
core_initcall() instead of module_init(), such that the previous
initialization ordering is restored. I didn't restore the
late_initcall() since start_stop_khugepaged() already sets
min_free_kbytes via set_recommended_min_free_kbytes().
This was noticed when we had a number of page allocation failures when
moving a workload to a kernel with this new initialization ordering. On
an 8GB system this restores min_free_kbytes back to 67584 from 11365
when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.
Fixes:
|
|
Hugh Dickins | 684283988f |
huge pagecache: mmap_sem is unlocked when truncation splits pmd
zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG() will
be invalid with huge pagecache, in whatever way it is implemented:
truncation of a hugely-mapped file to an unhugely-aligned size would
easily hit it.
(Although anon THP could in principle apply khugepaged to private file
mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
practice there's a vm_ops check which excludes them, so it never hits
this BUG() - there's no interface to "truncate" an anonymous mapping.)
We could complicate the test, to check i_mmap_rwsem also when there's a
vm_file; but my inclination was to make zap_pmd_range() more readable by
simply deleting this check. A search has shown no report of the issue
in the years since commit
|
|
Yang Shi | 145bdaa150 |
mm: thp: correct split_huge_pages file permission
split_huge_pages doesn't support get method at all, so the read permission sounds confusing, change the permission to write only. And, add "\n" to the output of set method to make it more readable. Signed-off-by: Yang Shi <yang.shi@linaro.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Howard Cochran | 74d3694433 |
writeback: Fix performance regression in wb_over_bg_thresh()
Commit |
|
Al Viro | 84e710da2a |
parallel lookups machinery, part 2
We'll need to verify that there's neither a hashed nor in-lookup dentry with desired parent/name before adding to in-lookup set. One possible solution would be to hold the parent's ->d_lock through both checks, but while the in-lookup set is relatively small at any time, dcache is not. And holding the parent's ->d_lock through something like __d_lookup_rcu() would suck too badly. So we leave the parent's ->d_lock alone, which means that we watch out for the following scenario: * we verify that there's no hashed match * existing in-lookup match gets hashed by another process * we verify that there's no in-lookup matches and decide that everything's fine. Solution: per-directory kinda-sorta seqlock, bumped around the times we hash something that used to be in-lookup or move (and hash) something in place of in-lookup. Then the above would turn into * read the counter * do dcache lookup * if no matches found, check for in-lookup matches * if there had been none of those either, check if the counter has changed; repeat if it has. The "kinda-sorta" part is due to the fact that we don't have much spare space in inode. There is a spare word (shared with i_bdev/i_cdev/i_pipe), so the counter part is not a problem, but spinlock is a different story. We could use the parent's ->d_lock, and it would be less painful in terms of contention, for __d_add() it would be rather inconvenient to grab; we could do that (using lock_parent()), but... Fortunately, we can get serialization on the counter itself, and it might be a good idea in general; we can use cmpxchg() in a loop to get from even to odd and smp_store_release() from odd to even. This commit adds the counter and updating logics; the readers will be added in the next commit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Al Viro | 84695ffee7 |
Merge getxattr prototype change into work.lookups
The rest of work.xattr stuff isn't needed for this branch |
|
Christoph Hellwig | e259221763 |
fs: simplify the generic_write_sync prototype
The kiocb already has the new position, so use that. The only interesting case is AIO, where we currently don't bother updating ki_pos. We're about to free the kiocb after we're done, so we might as well update it to make everyone's life simpler. While we're at it also return the bytes written argument passed in if we were successful so that the boilerplate error switch code in the callers can go away. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Christoph Hellwig | dde0c2e798 |
fs: add IOCB_SYNC and IOCB_DSYNC
This will allow us to do per-I/O sync file writes, as required by a lot of fileservers or storage targets. XXX: Will need a few additional audits for O_DSYNC Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Christoph Hellwig | c8b8e32d70 |
direct-io: eliminate the offset argument to ->direct_IO
Including blkdev_direct_IO and dax_do_io. It has to be ki_pos to actually work, so eliminate the superflous argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Christoph Hellwig | 1af5bb491f |
filemap: remove the pos argument to generic_file_direct_write
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Christoph Hellwig | c64fb5c744 |
filemap: remove pos variables in generic_file_read_iter
Just use ki_pos directly to make everyones life easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Konstantin Khlebnikov | c2e7e00b71 |
mm/memory-failure: fix race with compound page split/merge
get_hwpoison_page() must recheck relation between head and tail pages. n-horiguchi said: without this recheck, the race causes kernel to pin an irrelevant page, and finally makes kernel crash for refcount mismatch. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | fd901c9538 |
mm: wake kcompactd before kswapd's short sleep
When kswapd goes to sleep it checks if the node is balanced and at first
it sleeps only for HZ/10 time, then rechecks if the node is still
balanced and nobody has woken it during the initial sleep. Only then it
goes fully sleep until an allocation slowpath wakes it up again.
For higher-order allocations, waking up kcompactd is done only before
the full sleep. This turns out to be an issue in case another
high-order allocation fails during the initial sleep. It will wake
kswapd up, however kswapd considers the zone balanced from the order-0
perspective, and will just quickly try to sleep again. So if there's a
longer stream of high-order allocations hitting the slowpath and waking
up kswapd, it might never actually wake up kcompactd, which may be
considered a regression from kswapd-based compaction. In the worst
case, it might be that a single allocation that cannot direct
reclaim/compact itself is waking kswapd in the retry loop and preventing
kcompactd from being woken up and unblocking it.
This patch makes sure kcompactd is woken up in such situations by simply
moving the wakeup before the short initial sleep. More efficient
solution would be to wake kcompactd immediately instead of kswapd if the
node is already order-0 balanced, but in that case we should also move
reset_isolation_suitable() call to kcompactd so it's not adding to the
allocator's latency. Since it's late in the 4.6 cycle, let's go with
the simpler change for now.
Fixes:
|
|
Minchan Kim | d7e69488bd |
mm/hwpoison: fix wrong num_poisoned_pages accounting
Currently, migration code increses num_poisoned_pages on *failed* migration page as well as successfully migrated one at the trial of memory-failure. It will make the stat wrong. As well, it marks the page as PG_HWPoison even if the migration trial failed. It would mean we cannot recover the corrupted page using memory-failure facility. This patches fixes it. Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Minchan Kim | b06bad17c7 |
mm: call swap_slot_free_notify() with page lock held
Kyeongdon reported below error which is BUG_ON(!PageSwapCache(page)) in
page_swap_info. The reason is that page_endio in rw_page unlocks the
page if read I/O is completed so we need to hold a PG_lock again to
check PageSwapCache. Otherwise, the page can be removed from swapcache.
Kernel BUG at c00f9040 [verbose debug info unavailable]
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 4 PID: 13446 Comm: RenderThread Tainted: G W 3.10.84-g9f14aec-dirty #73
task: c3b73200 ti: dd192000 task.ti: dd192000
PC is at page_swap_info+0x10/0x2c
LR is at swap_slot_free_notify+0x18/0x6c
pc : [<c00f9040>] lr : [<c00f5560>] psr: 400f0113
sp : dd193d78 ip : c2deb1e4 fp : da015180
r10: 00000000 r9 : 000200da r8 : c120fe08
r7 : 00000000 r6 : 00000000 r5 : c249a6c0 r4 : = c249a6c0
r3 : 00000000 r2 : 40080009 r1 : 200f0113 r0 : = c249a6c0
..<snip> ..
Call Trace:
page_swap_info+0x10/0x2c
swap_slot_free_notify+0x18/0x6c
swap_readpage+0x90/0x11c
read_swap_cache_async+0x134/0x1ac
swapin_readahead+0x70/0xb0
handle_pte_fault+0x320/0x6fc
handle_mm_fault+0xc0/0xf0
do_page_fault+0x11c/0x36c
do_DataAbort+0x34/0x118
Fixes:
|
|
Minchan Kim | 7bf52fb891 |
mm: vmscan: reclaim highmem zone if buffer_heads is over limit
We have been reclaimed highmem zone if buffer_heads is over limit but commit |
|
Gerald Schaefer | 28093f9f34 |
numa: fix /proc/<pid>/numa_maps for THP
In gather_pte_stats() a THP pmd is cast into a pte, which is wrong because the layouts may differ depending on the architecture. On s390 this will lead to inaccurate numa_maps accounting in /proc because of misguided pte_present() and pte_dirty() checks on the fake pte. On other architectures pte_present() and pte_dirty() may work by chance, but there may be an issue with direct-access (dax) mappings w/o underlying struct pages when HAVE_PTE_SPECIAL is set and THP is available. In vm_normal_page() the fake pte will be checked with pte_special() and because there is no "special" bit in a pmd, this will always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be skipped. On dax mappings w/o struct pages, an invalid struct page pointer would then be returned that can crash the kernel. This patch fixes the numa_maps THP handling by introducing new "_pmd" variants of the can_gather_numa_stats() and vm_normal_page() functions. Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Konstantin Khlebnikov | 3486b85a29 |
mm/huge_memory: replace VM_NO_THP VM_BUG_ON with actual VMA check
Khugepaged detects own VMAs by checking vm_file and vm_ops but this way
it cannot distinguish private /dev/zero mappings from other special
mappings like /dev/hpet which has no vm_ops and popultes PTEs in mmap.
This fixes false-positive VM_BUG_ON and prevents installing THP where
they are not expected.
Link: http://lkml.kernel.org/r/CACT4Y+ZmuZMV5CjSFOeXviwQdABAgT7T+StKfTqan9YDtgEi5g@mail.gmail.com
Fixes:
|
|
Kirill A. Shutemov | aa88b68c3b |
thp: keep huge zero page pinned until tlb flush
Andrea has found[1] a race condition on MMU-gather based TLB flush vs split_huge_page() or shrinker which frees huge zero under us (patch 1/2 and 2/2 respectively). With new THP refcounting, we don't need patch 1/2: mmu_gather keeps the page pinned until flush is complete and the pin prevents the page from being split under us. We still need patch 2/2. This is simplified version of Andrea's patch. We don't need fancy encoding. [1] http://lkml.kernel.org/r/1447938052-22165-1-git-send-email-aarcange@redhat.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Ingo Molnar | 8efd755ac2 |
mm/mmu_context, sched/core: Fix mmu_context.h assumption
Some architectures (such as Alpha) rely on include/linux/sched.h definitions in their mmu_context.h files. So include sched.h before mmu_context.h. Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
Tejun Heo | 264a0ae164 |
memcg: relocate charge moving from ->attach to ->post_attach
Hello,
So, this ended up a lot simpler than I originally expected. I tested
it lightly and it seems to work fine. Petr, can you please test these
two patches w/o the lru drain drop patch and see whether the problem
is gone?
Thanks.
------ 8< ------
If charge moving is used, memcg performs relabeling of the affected
pages from its ->attach callback which is called under both
cgroup_threadgroup_rwsem and thus can't create new kthreads. This is
fragile as various operations may depend on workqueues making forward
progress which relies on the ability to create new kthreads.
There's no reason to perform charge moving from ->attach which is deep
in the task migration path. Move it to ->post_attach which is called
after the actual migration is finished and cgroup_threadgroup_rwsem is
dropped.
* move_charge_struct->mm is added and ->can_attach is now responsible
for pinning and recording the target mm. mem_cgroup_clear_mc() is
updated accordingly. This also simplifies mem_cgroup_move_task().
* mem_cgroup_move_task() is now called from ->post_attach instead of
->attach.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Debugged-and-tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: Cyril Hrubis <chrubis@suse.cz>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Fixes:
|
|
Linus Torvalds | 2e57259913 |
Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe: "A few fixes for the current series. This contains: - Two fixes for NVMe: One fixes a reset race that can be triggered by repeated insert/removal of the module. The other fixes an issue on some platforms, where we get probe timeouts since legacy interrupts isn't working. This used not to be a problem since we had the worker thread poll for completions, but since that was killed off, it means those poor souls can't successfully probe their NVMe device. Use a proper IRQ check and probe (msi-x -> msi ->legacy), like most other drivers to work around this. Both from Keith. - A loop corruption issue with offset in iters, from Ming Lei. - A fix for not having the partition stat per cpu ref count initialized before sending out the KOBJ_ADD, which could cause user space to access the counter prior to initialization. Also from Ming Lei. - A fix for using the wrong congestion state, from Kaixu Xia" * 'for-linus' of git://git.kernel.dk/linux-block: block: loop: fix filesystem corruption in case of aio/dio NVMe: Always use MSI/MSI-x interrupts NVMe: Fix reset/remove race writeback: fix the wrong congested state variable definition block: partition: initialize percpuref before sending out KOBJ_ADD |
|
Linus Torvalds | a1f983174d |
Merge branch 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull mm gup cleanup from Ingo Molnar: "This removes the ugly get-user-pages API hack, now that all upstream code has been migrated to it" ("ugly" is putting it mildly. But it worked.. - Linus) * 'mm-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs |
|
Daniel Vetter | 3970285319 |
Linux 4.6-rc3
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJXCva8AAoJEHm+PkMAQRiGXBoIAIkrjxdbuT2nS9A3tHwkiFXa
6/Th1UjbNaoLuZ+MckQHayAD9NcWY9lVjOUmFsSiSWMCQK/rTWDl8x5ITputrY2V
VuhrJCwI7huEtu6GpRaJaUgwtdOjhIHz1Ue2MCdNIbKX3l+LjVyyJ9Vo8rruvZcR
fC7kiivH04fYX58oQ+SHymCg54ny3qJEPT8i4+g26686m11hvZLI3UAs2PAn6ut+
atCjxdQ4yLN3DWsbjuA7wYGWhTgFloxL4TIoisuOUc3FXnSi/ivIbXZvu4lUfisz
LA2JBhfII3AEMBWG9xfGbXPijJTT4q7yNlTD0oYcnMtAt/Roh2F04asqB1LetEY=
=bri6
-----END PGP SIGNATURE-----
Merge tag 'v4.6-rc3' into drm-intel-next-queued
Linux 4.6-rc3
Backmerge requested by Chris Wilson to make his patches apply cleanly.
Tiny conflict in vmalloc.c with the (properly acked and all) patch in
drm-intel-next:
commit
|
|
Al Viro | b296821a7c |
xattr_handler: pass dentry and inode as separate arguments of ->get()
... and do not assume they are already attached to each other Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
Ingo Molnar | c12d2da56d |
mm/gup: Remove the macro overload API migration helpers from the get_user*() APIs
The pkeys changes brought about a truly hideous set of macros in: |
|
Chris Wilson | 4da56b99d9 |
mm/vmap: Add a notifier for when we run out of vmap address space
vmaps are temporary kernel mappings that may be of long duration. Reusing a vmap on an object is preferrable for a driver as the cost of setting up the vmap can otherwise dominate the operation on the object. However, the vmap address space is rather limited on 32bit systems and so we add a notification for vmap pressure in order for the driver to release any cached vmappings. The interface is styled after the oom-notifier where the callees are passed a pointer to an unsigned long counter for them to indicate if they have freed any space. v2: Guard the blocking notifier call with gfpflags_allow_blocking() v3: Correct typo in forward declaration and move to head of file Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Roman Peniaev <r.peniaev@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Acked-by: Andrew Morton <akpm@linux-foundation.org> # for inclusion via DRM Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1459777603-23618-3-git-send-email-chris@chris-wilson.co.uk Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> |
|
Linus Torvalds | 4a2d057e4f |
Merge branch 'PAGE_CACHE_SIZE-removal'
Merge PAGE_CACHE_SIZE removal patches from Kirill Shutemov: "PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. Let's stop pretending that pages in page cache are special. They are not. The first patch with most changes has been done with coccinelle. The second is manual fixups on top. The third patch removes macros definition" [ I was planning to apply this just before rc2, but then I spaced out, so here it is right _after_ rc2 instead. As Kirill suggested as a possibility, I could have decided to only merge the first two patches, and leave the old interfaces for compatibility, but I'd rather get it all done and any out-of-tree modules and patches can trivially do the converstion while still also working with older kernels, so there is little reason to try to maintain the redundant legacy model. - Linus ] * PAGE_CACHE_SIZE-removal: mm: drop PAGE_CACHE_* and page_cache_{get,release} definition mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros |
|
Kirill A. Shutemov | ea1754a084 |
mm, fs: remove remaining PAGE_CACHE_* and page_cache_{get,release} usage
Mostly direct substitution with occasional adjustment or removing outdated comments. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | 09cbfeaf1a |
mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Neil Zhang | ec3b688250 |
mm/page_isolation.c: fix the function comments
Commit
|
|
Michal Hocko | af8e15cc85 |
oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head
Commit |
|
Nadav Amit | 858eaaa711 |
mm/rmap: batched invalidations should use existing api
The recently introduced batched invalidations mechanism uses its own
mechanism for shootdown. However, it does wrong accounting of
interrupts (e.g., inc_irq_stat is called for local invalidations),
trace-points (e.g., TLB_REMOTE_SHOOTDOWN for local invalidations) and
may break some platforms as it bypasses the invalidation mechanisms of
Xen and SGI UV.
This patch reuses the existing TLB flushing mechnaisms instead. We use
NULL as mm to indicate a global invalidation is required.
Fixes
|
|
Xishi Qiu | 6f25a14a70 |
mm: fix invalid node in alloc_migrate_target()
It is incorrect to use next_node to find a target node, it will return
MAX_NUMNODES or invalid node. This will lead to crash in buddy system
allocation.
Fixes:
|
|
Alexander Potapenko | 0b355eaaaa |
mm, kasan: fix compilation for CONFIG_SLAB
Add the missing argument to set_track().
Fixes:
|
|
Kaixu Xia | c877ef8ae7 |
writeback: fix the wrong congested state variable definition
The right variable definition should be wb_congested_state that include WB_async_congested and WB_sync_congested. So fix it. Signed-off-by: Kaixu Xia <xiakaixu@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com> |
|
Kirill A. Shutemov | 0fda2788b0 |
thp: fix typo in khugepaged_scan_pmd()
!PageLRU should lead to SCAN_PAGE_LRU, not SCAN_SCAN_ABORT result. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Nicolai Stange | e7080a439a |
mm/filemap: generic_file_read_iter(): check for zero reads unconditionally
If - generic_file_read_iter() gets called with a zero read length, - the read offset is at a page boundary, - IOCB_DIRECT is not set - and the page in question hasn't made it into the page cache yet, then do_generic_file_read() will trigger a readahead with a req_size hint of zero. Since roundup_pow_of_two(0) is undefined, UBSAN reports UBSAN: Undefined behaviour in include/linux/log2.h:63:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 3 PID: 1017 Comm: sa1 Tainted: G L 4.5.0-next-20160318+ #14 [...] Call Trace: [...] [<ffffffff813ef61a>] ondemand_readahead+0x3aa/0x3d0 [<ffffffff813ef61a>] ? ondemand_readahead+0x3aa/0x3d0 [<ffffffff813c73bd>] ? find_get_entry+0x2d/0x210 [<ffffffff813ef9c3>] page_cache_sync_readahead+0x63/0xa0 [<ffffffff813cc04d>] do_generic_file_read+0x80d/0xf90 [<ffffffff813cc955>] generic_file_read_iter+0x185/0x420 [...] [<ffffffff81510b06>] __vfs_read+0x256/0x3d0 [...] when get_init_ra_size() gets called from ondemand_readahead(). The net effect is that the initial readahead size is arch dependent for requested read lengths of zero: for example, since 1UL << (sizeof(unsigned long) * 8) evaluates to 1 on x86 while its result is 0 on ARMv7, the initial readahead size becomes 4 on the former and 0 on the latter. What's more, whether or not the file access timestamp is updated for zero length reads is decided differently for the two cases of IOCB_DIRECT being set or cleared: in the first case, generic_file_read_iter() explicitly skips updating that timestamp while in the latter case, it is always updated through the call to do_generic_file_read(). According to POSIX, zero length reads "do not modify the last data access timestamp" and thus, the IOCB_DIRECT behaviour is POSIXly correct. Let generic_file_read_iter() unconditionally check the requested read length at its entry and return immediately with success if it is zero. Signed-off-by: Nicolai Stange <nicstange@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexander Potapenko | cd11016e5f |
mm, kasan: stackdepot implementation. Enable stackdepot for SLAB
Implement the stack depot and provide CONFIG_STACKDEPOT. Stack depot will allow KASAN store allocation/deallocation stack traces for memory chunks. The stack traces are stored in a hash table and referenced by handles which reside in the kasan_alloc_meta and kasan_free_meta structures in the allocated memory chunks. IRQ stack traces are cut below the IRQ entry point to avoid unnecessary duplication. Right now stackdepot support is only enabled in SLAB allocator. Once KASAN features in SLAB are on par with those in SLUB we can switch SLUB to stackdepot as well, thus removing the dependency on SLUB stack bookkeeping, which wastes a lot of memory. This patch is based on the "mm: kasan: stack depots" patch originally prepared by Dmitry Chernenkov. Joonsoo has said that he plans to reuse the stackdepot code for the mm/page_owner.c debugging facility. [akpm@linux-foundation.org: s/depot_stack_handle/depot_stack_handle_t] [aryabinin@virtuozzo.com: comment style fixes] Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexander Potapenko | 505f5dcb1c |
mm, kasan: add GFP flags to KASAN API
Add GFP flags to KASAN hooks for future patches to use. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexander Potapenko | 7ed2f9e663 |
mm, kasan: SLAB support
Add KASAN hooks to SLAB allocator. This patch is based on the "mm: kasan: unified support for SLUB and SLAB allocators" patch originally prepared by Dmitry Chernenkov. Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Konovalov <adech.fo@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | d9dddbf556 |
mm/page_alloc: prevent merging between isolated and other pageblocks
Hanjun Guo has reported that a CMA stress test causes broken accounting of CMA and free pages: > Before the test, I got: > -bash-4.3# cat /proc/meminfo | grep Cma > CmaTotal: 204800 kB > CmaFree: 195044 kB > > > After running the test: > -bash-4.3# cat /proc/meminfo | grep Cma > CmaTotal: 204800 kB > CmaFree: 6602584 kB > > So the freed CMA memory is more than total.. > > Also the the MemFree is more than mem total: > > -bash-4.3# cat /proc/meminfo > MemTotal: 16342016 kB > MemFree: 22367268 kB > MemAvailable: 22370528 kB Laura Abbott has confirmed the issue and suspected the freepage accounting rewrite around 3.18/4.0 by Joonsoo Kim. Joonsoo had a theory that this is caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA pageblocks: > CMA isolates MAX_ORDER aligned blocks, but, during the process, > partialy isolated block exists. If MAX_ORDER is 11 and > pageblock_order is 9, two pageblocks make up MAX_ORDER > aligned block and I can think following scenario because pageblock > (un)isolation would be done one by one. > > (each character means one pageblock. 'C', 'I' means MIGRATE_CMA, > MIGRATE_ISOLATE, respectively. > > CC -> IC -> II (Isolation) > II -> CI -> CC (Un-isolation) > > If some pages are freed at this intermediate state such as IC or CI, > that page could be merged to the other page that is resident on > different type of pageblock and it will cause wrong freepage count. This was supposed to be prevented by CMA operating on MAX_ORDER blocks, but since it doesn't hold the zone->lock between pageblocks, a race window does exist. It's also likely that unexpected merging can occur between MIGRATE_ISOLATE and non-CMA pageblocks. This should be prevented in __free_one_page() since commit |
|
Tetsuo Handa | bb29902a75 |
oom, oom_reaper: protect oom_reaper_list using simpler way
"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it by simply checking tsk->oom_reaper_list != NULL. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | e26796066f |
oom: make oom_reaper freezable
After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space" oom_reaper will call exit_oom_victim on the target task after it is done. This might however race with the PM freezer: CPU0 CPU1 CPU2 freeze_processes try_to_freeze_tasks # Allocation request out_of_memory oom_killer_disable wake_oom_reaper(P1) __oom_reap_task exit_oom_victim(P1) wait_event(oom_victims==0) [...] do_exit(P1) perform IO/interfere with the freezer which breaks the oom_killer_disable semantic. We no longer have a guarantee that the oom victim won't interfere with the freezer because it might be anywhere on the way to do_exit while the freezer thinks the task has already terminated. It might trigger IO or touch devices which are frozen already. In order to close this race, make the oom_reaper thread freezable. This will work because a) already running oom_reaper will block freezer to enter the quiescent state b) wake_oom_reaper will not wake up the reaper after it has been frozen c) the only way to call exit_oom_victim after try_to_freeze_tasks is from the oom victim's context when we know the further interference shouldn't be possible Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | 29c696e1c6 |
oom: make oom_reaper_list single linked
Entries are only added/removed from oom_reaper_list at head so we can use a single linked list and hence save a word in task_struct. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 855b018325 |
oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
Tetsuo has reported that oom_kill_allocating_task=1 will cause oom_reaper_list corruption because oom_kill_process doesn't follow standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue the same task multiple times - e.g. by sacrificing the same child multiple times. This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag which is set in oom_kill_process atomically and oom reaper is disabled if the flag was already set. Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 03049269de |
mm, oom_reaper: implement OOM victims queuing
wake_oom_reaper has allowed only 1 oom victim to be queued. The main reason for that was the simplicity as other solutions would require some way of queuing. The current approach is racy and that was deemed sufficient as the oom_reaper is considered a best effort approach to help with oom handling when the OOM victim cannot terminate in a reasonable time. The race could lead to missing an oom victim which can get stuck out_of_memory wake_oom_reaper cmpxchg // OK oom_reaper oom_reap_task __oom_reap_task oom_victim terminates atomic_inc_not_zero // fail out_of_memory wake_oom_reaper cmpxchg // fails task_to_reap = NULL This race requires 2 OOM invocations in a short time period which is not very likely but certainly not impossible. E.g. the original victim might have not released a lot of memory for some reason. The situation would improve considerably if wake_oom_reaper used a more robust queuing. This is what this patch implements. This means adding oom_reaper_list list_head into task_struct (eat a hole before embeded thread_struct for that purpose) and a oom_reaper_lock spinlock for queuing synchronization. wake_oom_reaper will then add the task on the queue and oom_reaper will dequeue it. Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | bc448e897b |
mm, oom_reaper: report success/failure
Inform about the successful/failed oom_reaper attempts and dump all the held locks to tell us more who is blocking the progress. [akpm@linux-foundation.org: fix CONFIG_MMU=n build] Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 36324a990c |
oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
When oom_reaper manages to unmap all the eligible vmas there shouldn't be much of the freable memory held by the oom victim left anymore so it makes sense to clear the TIF_MEMDIE flag for the victim and allow the OOM killer to select another task. The lack of TIF_MEMDIE also means that the victim cannot access memory reserves anymore but that shouldn't be a problem because it would get the access again if it needs to allocate and hits the OOM killer again due to the fatal_signal_pending resp. PF_EXITING check. We can safely hide the task from the OOM killer because it is clearly not a good candidate anymore as everyhing reclaimable has been torn down already. This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE and thus hold off further global OOM killer actions granted the oom reaper is able to take mmap_sem for the associated mm struct. This is not guaranteed now but further steps should make sure that mmap_sem for write should be blocked killable which will help to reduce such a lock contention. This is not done by this patch. Note that exit_oom_victim might be called on a remote task from __oom_reap_task now so we have to check and clear the flag atomically otherwise we might race and underflow oom_victims or wake up waiters too early. Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andrea Argangeli <andrea@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | aac4536355 |
mm, oom: introduce oom reaper
This patch (of 5): This is based on the idea from Mel Gorman discussed during LSFMM 2015 and independently brought up by Oleg Nesterov. The OOM killer currently allows to kill only a single task in a good hope that the task will terminate in a reasonable time and frees up its memory. Such a task (oom victim) will get an access to memory reserves via mark_oom_victim to allow a forward progress should there be a need for additional memory during exit path. It has been shown (e.g. by Tetsuo Handa) that it is not that hard to construct workloads which break the core assumption mentioned above and the OOM victim might take unbounded amount of time to exit because it might be blocked in the uninterruptible state waiting for an event (e.g. lock) which is blocked by another task looping in the page allocator. This patch reduces the probability of such a lockup by introducing a specialized kernel thread (oom_reaper) which tries to reclaim additional memory by preemptively reaping the anonymous or swapped out memory owned by the oom victim under an assumption that such a memory won't be needed when its owner is killed and kicked from the userspace anyway. There is one notable exception to this, though, if the OOM victim was in the process of coredumping the result would be incomplete. This is considered a reasonable constrain because the overall system health is more important than debugability of a particular application. A kernel thread has been chosen because we need a reliable way of invocation so workqueue context is not appropriate because all the workers might be busy (e.g. allocating memory). Kswapd which sounds like another good fit is not appropriate as well because it might get blocked on locks during reclaim as well. oom_reaper has to take mmap_sem on the target task for reading so the solution is not 100% because the semaphore might be held or blocked for write but the probability is reduced considerably wrt. basically any lock blocking forward progress as described above. In order to prevent from blocking on the lock without any forward progress we are using only a trylock and retry 10 times with a short sleep in between. Users of mmap_sem which need it for write should be carefully reviewed to use _killable waiting as much as possible and reduce allocations requests done with the lock held to absolute minimum to reduce the risk even further. The API between oom killer and oom reaper is quite trivial. wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition and oom_reaper clear this atomically once it is done with the work. This means that only a single mm_struct can be reaped at the time. As the operation is potentially disruptive we are trying to limit it to the ncessary minimum and the reaper blocks any updates while it operates on an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a race is detected by atomic_inc_not_zero(mm_users). Signed-off-by: Michal Hocko <mhocko@suse.com> Suggested-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Argangeli <andrea@kernel.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Piotr Kwapulinski | f138556daf |
mm/mprotect.c: don't imply PROT_EXEC on non-exec fs
The mprotect(PROT_READ) fails when called by the READ_IMPLIES_EXEC binary on a memory mapped file located on non-exec fs. The mprotect does not check whether fs is _executable_ or not. The PROT_EXEC flag is set automatically even if a memory mapped file is located on non-exec fs. Fix it by checking whether a memory mapped file is located on a non-exec fs. If so the PROT_EXEC is not implied by the PROT_READ. The implementation uses the VM_MAYEXEC flag set properly in mmap. Now it is consistent with mmap. I did the isolated tests (PT_GNU_STACK X/NX, multiple VMAs, X/NX fs). I also patched the official 3.19.0-47-generic Ubuntu 14.04 kernel and it seems to work. Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Dmitry Vyukov | 5c9a8750a6 |
kernel: add kcov code coverage
kcov provides code coverage collection for coverage-guided fuzzing (randomized testing). Coverage-guided fuzzing is a testing technique that uses coverage feedback to determine new interesting inputs to a system. A notable user-space example is AFL (http://lcamtuf.coredump.cx/afl/). However, this technique is not widely used for kernel testing due to missing compiler and kernel support. kcov does not aim to collect as much coverage as possible. It aims to collect more or less stable coverage that is function of syscall inputs. To achieve this goal it does not collect coverage in soft/hard interrupts and instrumentation of some inherently non-deterministic or non-interesting parts of kernel is disbled (e.g. scheduler, locking). Currently there is a single coverage collection mode (tracing), but the API anticipates additional collection modes. Initially I also implemented a second mode which exposes coverage in a fixed-size hash table of counters (what Quentin used in his original patch). I've dropped the second mode for simplicity. This patch adds the necessary support on kernel side. The complimentary compiler support was added in gcc revision 231296. We've used this support to build syzkaller system call fuzzer, which has found 90 kernel bugs in just 2 months: https://github.com/google/syzkaller/wiki/Found-Bugs We've also found 30+ bugs in our internal systems with syzkaller. Another (yet unexplored) direction where kcov coverage would greatly help is more traditional "blob mutation". For example, mounting a random blob as a filesystem, or receiving a random blob over wire. Why not gcov. Typical fuzzing loop looks as follows: (1) reset coverage, (2) execute a bit of code, (3) collect coverage, repeat. A typical coverage can be just a dozen of basic blocks (e.g. an invalid input). In such context gcov becomes prohibitively expensive as reset/collect coverage steps depend on total number of basic blocks/edges in program (in case of kernel it is about 2M). Cost of kcov depends only on number of executed basic blocks/edges. On top of that, kernel requires per-thread coverage because there are always background threads and unrelated processes that also produce coverage. With inlined gcov instrumentation per-thread coverage is not possible. kcov exposes kernel PCs and control flow to user-space which is insecure. But debugfs should not be mapped as user accessible. Based on a patch by Quentin Casasnovas. [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode'] [akpm@linux-foundation.org: unbreak allmodconfig] [akpm@linux-foundation.org: follow x86 Makefile layout standards] Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: syzkaller <syzkaller@googlegroups.com> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Tavis Ormandy <taviso@google.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Kees Cook <keescook@google.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: David Drysdale <drysdale@google.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Minchan Kim | 3f2b1a04f4 |
zram: revive swap_slot_free_notify
Commit |
|
Linus Torvalds | 266c73b777 |
Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie: "This is the main drm pull request for 4.6 kernel. Overall the coolest thing here for me is the nouveau maxwell signed firmware support from NVidia, it's taken a long while to extract this from them. I also wish the ARM vendors just designed one set of display IP, ARM display block proliferation is definitely increasing. Core: - drm_event cleanups - Internal API cleanup making mode_fixup optional. - Apple GMUX vga switcheroo support. - DP AUX testing interface Panel: - Refactoring of DSI core for use over more transports. New driver: - ARM hdlcd driver i915: - FBC/PSR (framebuffer compression, panel self refresh) enabled by default. - Ongoing atomic display support work - Ongoing runtime PM work - Pixel clock limit checks - VBT DSI description support - GEM fixes - GuC firmware scheduler enhancements amdkfd: - Deferred probing fixes to avoid make file or link ordering. amdgpu/radeon: - ACP support for i2s audio support. - Command Submission/GPU scheduler/GPUVM optimisations - Initial GPU reset support for amdgpu vmwgfx: - Support for DX10 gen mipmaps - Pageflipping and other fixes. exynos: - Exynos5420 SoC support for FIMD - Exynos5422 SoC support for MIPI-DSI nouveau: - GM20x secure boot support - adds acceleration for Maxwell GPUs. - GM200 support - GM20B clock driver support - Power sensors work etnaviv: - Correctness fixes for GPU cache flushing - Better support for i.MX6 systems. imx-drm: - VBlank IRQ support - Fence support - OF endpoint support msm: - HDMI support for 8996 (snapdragon 820) - Adreno 430 support - Timestamp queries support virtio-gpu: - Fixes for Android support. rockchip: - Add support for Innosilicion HDMI rcar-du: - Support for 4 crtcs - R8A7795 support - RCar Gen 3 support omapdrm: - HDMI interlace output support - dma-buf import support - Refactoring to remove a lot of legacy code. tilcdc: - Rewrite of pageflipping code - dma-buf support - pinctrl support vc4: - HDMI modesetting bug fixes - Significant 3D performance improvement. fsl-dcu (FreeScale): - Lots of fixes tegra: - Two small fixes sti: - Atomic support for planes - Improved HDMI support" * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1063 commits) drm/amdgpu: release_pages requires linux/pagemap.h drm/sti: restore mode_fixup callback drm/amdgpu/gfx7: add MTYPE definition drm/amdgpu: removing BO_VAs shouldn't be interruptible drm/amd/powerplay: show uvd/vce power gate enablement for tonga. drm/amd/powerplay: show uvd/vce power gate info for fiji drm/amdgpu: use sched fence if possible drm/amdgpu: move ib.fence to job.fence drm/amdgpu: give a fence param to ib_free drm/amdgpu: include the right version of gmc header files for iceland drm/radeon: fix indentation. drm/amd/powerplay: add uvd/vce dpm enabling flag to fix the performance issue for CZ drm/amdgpu: switch back to 32bit hw fences v2 drm/amdgpu: remove amdgpu_fence_is_signaled drm/amdgpu: drop the extra fence range check v2 drm/amdgpu: signal fences directly in amdgpu_fence_process drm/amdgpu: cleanup amdgpu_fence_wait_empty v2 drm/amdgpu: keep all fences in an RCU protected array v2 drm/amdgpu: add number of hardware submissions to amdgpu_fence_driver_init_ring drm/amdgpu: RCU protected amd_sched_fence_release ... |
|
Linus Torvalds | 643ad15d47 |
Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 protection key support from Ingo Molnar: "This tree adds support for a new memory protection hardware feature that is available in upcoming Intel CPUs: 'protection keys' (pkeys). There's a background article at LWN.net: https://lwn.net/Articles/643797/ The gist is that protection keys allow the encoding of user-controllable permission masks in the pte. So instead of having a fixed protection mask in the pte (which needs a system call to change and works on a per page basis), the user can map a (handful of) protection mask variants and can change the masks runtime relatively cheaply, without having to change every single page in the affected virtual memory range. This allows the dynamic switching of the protection bits of large amounts of virtual memory, via user-space instructions. It also allows more precise control of MMU permission bits: for example the executable bit is separate from the read bit (see more about that below). This tree adds the MM infrastructure and low level x86 glue needed for that, plus it adds a high level API to make use of protection keys - if a user-space application calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice this special case, and will set a special protection key on this memory range. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. So using protection keys the kernel is able to implement 'true' PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies PROT_READ as well. Unreadable executable mappings have security advantages: they cannot be read via information leaks to figure out ASLR details, nor can they be scanned for ROP gadgets - and they cannot be used by exploits for data purposes either. We know about no user-space code that relies on pure PROT_EXEC mappings today, but binary loaders could start making use of this new feature to map binaries and libraries in a more secure fashion. There is other pending pkeys work that offers more high level system call APIs to manage protection keys - but those are not part of this pull request. Right now there's a Kconfig that controls this feature (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled (like most x86 CPU feature enablement code that has no runtime overhead), but it's not user-configurable at the moment. If there's any serious problem with this then we can make it configurable and/or flip the default" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/mm/pkeys: Fix mismerge of protection keys CPUID bits mm/pkeys: Fix siginfo ABI breakage caused by new u64 field x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA mm/core, x86/mm/pkeys: Add execute-only protection keys support x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags x86/mm/pkeys: Allow kernel to modify user pkey rights register x86/fpu: Allow setting of XSAVE state x86/mm: Factor out LDT init from context init mm/core, x86/mm/pkeys: Add arch_validate_pkey() mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU x86/mm/pkeys: Add Kconfig prompt to existing config option x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps x86/mm/pkeys: Dump PKRU with other kernel registers mm/core, x86/mm/pkeys: Differentiate instruction fetches x86/mm/pkeys: Optimize fault handling in access_error() mm/core: Do not enforce PKEY permissions on remote mm access um, pkeys: Add UML arch_*_access_permitted() methods mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys x86/mm/gup: Simplify get_user_pages() PTE bit handling ... |
|
Linus Torvalds | d5e2d00898 |
powerpc updates for 4.6
Highlights: - Restructure Linux PTE on Book3S/64 to Radix format from Paul Mackerras - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh Kumar K.V - Add POWER9 cputable entry from Michael Neuling - FPU/Altivec/VSX save/restore optimisations from Cyril Bur - Add support for new ftrace ABI on ppc64le from Torsten Duwe Various cleanups & minor fixes from: - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy, Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh. General: - atomics: Allow architectures to define their own __atomic_op_* helpers from Boqun Feng - Implement atomic{, 64}_*_return_* variants and acquire/release/relaxed variants for (cmp)xchg from Boqun Feng - Add powernv_defconfig from Jeremy Kerr - Fix BUG_ON() reporting in real mode from Balbir Singh - Add xmon command to dump OPAL msglog from Andrew Donnellan - Add xmon command to dump process/task similar to ps(1) from Douglas Miller - Clean up memory hotplug failure paths from David Gibson pci/eeh: - Redesign SR-IOV on PowerNV to give absolute isolation between VFs from Wei Yang. - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan. - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang - PCI: Add pcibios_bus_add_device() weak function from Wei Yang - MAINTAINERS: Update EEH details and maintainership from Russell Currey cxl: - Support added to the CXL driver for running on both bare-metal and hypervisor systems, from Christophe Lombard and Frederic Barrat. - Ignore probes for virtual afu pci devices from Vaibhav Jain perf: - Export Power8 generic and cache events to sysfs from Sukadev Bhattiprolu - hv-24x7: Fix usage with chip events, display change in counter values, display domain indices in sysfs, eliminate domain suffix in event names, from Sukadev Bhattiprolu Freescale: - Updates from Scott: "Highlights include 8xx optimizations, 32-bit checksum optimizations, 86xx consolidation, e5500/e6500 cpu hotplug, more fman and other dt bits, and minor fixes/cleanup." -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQIcBAABAgAGBQJW69OrAAoJEFHr6jzI4aWAe5EQAJw/hE6WBQc6a7Tj70AnXOqR qk/m5pZjuTwQxfBteIvHR1pE5eXdlvtAjcD254LVkFkAbIn19W/h2k0VX/nlee7P n/VRHRifjtGmukqHrPYJJ7ua9mNlY7pxh3leGSixBFASnSWqMxNNNziNQtSTcuCs TjHiw6NkZ/kzeunA4bAfE4yHVUZjmL74oiS9JbLyaVHqoW4fqWLlh26AKo2yYMZI qPicBBG4HBi3FGvoexnKxlJNdcV4HO7LzDjJmCSfUKYCJi+Pw19T5qmhso0q0qVz vHg/A8HNeG4Hn83pNVmLeQSAIQRZ3DvTtcLgbjPo+TVwm/hzrRRBWipTeOVbkLW8 2bcOXT4t7LWUq15EAJ1LYgYZGzcLrfRfUeOcuQ1TWd3+PcfY9pE7FmizsxAAfaVe E9j9mpz4XnIqBtWkFHneTIHkQ5OWptyKuZJEaYH0nut4VsP0k8NarkseafGqBPu7 5eG83gbiQbCVixfOgblV9eocJ29JcwpjPAY4CZSGJimShg909FV7WRgZgJkKWrbK dBRco8Jcp4VglGfo2qymv7Uj4KwQoypBREOhiKUvrAsVlDxPfx+bcskhjGu9xGDC xs/+nme0/lKa/wg5K4C3mQ1GAlkMWHI0ojhJjsyODbetup5UbkEu03wjAaTdO9dT Y6ptGm0rYAJluPNlziFj =qkAt -----END PGP SIGNATURE----- Merge tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: "This was delayed a day or two by some build-breakage on old toolchains which we've now fixed. There's two PCI commits both acked by Bjorn. There's one commit to mm/hugepage.c which is (co)authored by Kirill. Highlights: - Restructure Linux PTE on Book3S/64 to Radix format from Paul Mackerras - Book3s 64 MMU cleanup in preparation for Radix MMU from Aneesh Kumar K.V - Add POWER9 cputable entry from Michael Neuling - FPU/Altivec/VSX save/restore optimisations from Cyril Bur - Add support for new ftrace ABI on ppc64le from Torsten Duwe Various cleanups & minor fixes from: - Adam Buchbinder, Andrew Donnellan, Balbir Singh, Christophe Leroy, Cyril Bur, Luis Henriques, Madhavan Srinivasan, Pan Xinhui, Russell Currey, Sukadev Bhattiprolu, Suraj Jitindar Singh. General: - atomics: Allow architectures to define their own __atomic_op_* helpers from Boqun Feng - Implement atomic{, 64}_*_return_* variants and acquire/release/ relaxed variants for (cmp)xchg from Boqun Feng - Add powernv_defconfig from Jeremy Kerr - Fix BUG_ON() reporting in real mode from Balbir Singh - Add xmon command to dump OPAL msglog from Andrew Donnellan - Add xmon command to dump process/task similar to ps(1) from Douglas Miller - Clean up memory hotplug failure paths from David Gibson pci/eeh: - Redesign SR-IOV on PowerNV to give absolute isolation between VFs from Wei Yang. - EEH Support for SRIOV VFs from Wei Yang and Gavin Shan. - PCI/IOV: Rename and export virtfn_{add, remove} from Wei Yang - PCI: Add pcibios_bus_add_device() weak function from Wei Yang - MAINTAINERS: Update EEH details and maintainership from Russell Currey cxl: - Support added to the CXL driver for running on both bare-metal and hypervisor systems, from Christophe Lombard and Frederic Barrat. - Ignore probes for virtual afu pci devices from Vaibhav Jain perf: - Export Power8 generic and cache events to sysfs from Sukadev Bhattiprolu - hv-24x7: Fix usage with chip events, display change in counter values, display domain indices in sysfs, eliminate domain suffix in event names, from Sukadev Bhattiprolu Freescale: - Updates from Scott: "Highlights include 8xx optimizations, 32-bit checksum optimizations, 86xx consolidation, e5500/e6500 cpu hotplug, more fman and other dt bits, and minor fixes/cleanup" * tag 'powerpc-4.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (179 commits) powerpc: Fix unrecoverable SLB miss during restore_math() powerpc/8xx: Fix do_mtspr_cpu6() build on older compilers powerpc/rcpm: Fix build break when SMP=n powerpc/book3e-64: Use hardcoded mttmr opcode powerpc/fsl/dts: Add "jedec,spi-nor" flash compatible powerpc/T104xRDB: add tdm riser card node to device tree powerpc32: PAGE_EXEC required for inittext powerpc/mpc85xx: Add pcsphy nodes to FManV3 device tree powerpc/mpc85xx: Add MDIO bus muxing support to the board device tree(s) powerpc/86xx: Introduce and use common dtsi powerpc/86xx: Update device tree powerpc/86xx: Move dts files to fsl directory powerpc/86xx: Switch to kconfig fragments approach powerpc/86xx: Update defconfigs powerpc/86xx: Consolidate common platform code powerpc32: Remove one insn in mulhdu powerpc32: small optimisation in flush_icache_range() powerpc: Simplify test in __dma_sync() powerpc32: move xxxxx_dcache_range() functions inline powerpc32: Remove clear_pages() and define clear_page() inline ... |
|
Linus Torvalds | 814a2bf957 |
Merge branch 'akpm' (patches from Andrew)
Merge second patch-bomb from Andrew Morton: - a couple of hotfixes - the rest of MM - a new timer slack control in procfs - a couple of procfs fixes - a few misc things - some printk tweaks - lib/ updates, notably to radix-tree. - add my and Nick Piggin's old userspace radix-tree test harness to tools/testing/radix-tree/. Matthew said it was a godsend during the radix-tree work he did. - a few code-size improvements, switching to __always_inline where gcc screwed up. - partially implement character sets in sscanf * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits) sscanf: implement basic character sets lib/bug.c: use common WARN helper param: convert some "on"/"off" users to strtobool lib: add "on"/"off" support to kstrtobool lib: update single-char callers of strtobool() lib: move strtobool() to kstrtobool() include/linux/unaligned: force inlining of byteswap operations include/uapi/linux/byteorder, swab: force inlining of some byteswap operations include/asm-generic/atomic-long.h: force inlining of some atomic_long operations usb: common: convert to use match_string() helper ide: hpt366: convert to use match_string() helper ata: hpt366: convert to use match_string() helper power: ab8500: convert to use match_string() helper power: charger_manager: convert to use match_string() helper drm/edid: convert to use match_string() helper pinctrl: convert to use match_string() helper device property: convert to use match_string() helper lib/string: introduce match_string() helper radix-tree tests: add test for radix_tree_iter_next radix-tree tests: add regression3 test ... |
|
Linus Torvalds | 49dc2b7173 |
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: drivers/rtc: broken link fix drm/i915 Fix typos in i915_gem_fence.c Docs: fix missing word in REPORTING-BUGS lib+mm: fix few spelling mistakes MAINTAINERS: add git URL for APM driver treewide: Fix typo in printk |
|
Matthew Wilcox | 7165092fe5 |
radix-tree,shmem: introduce radix_tree_iter_next()
shmem likes to occasionally drop the lock, schedule, then reacqire the lock and continue with the iteration from the last place it left off. This is currently done with a pretty ugly goto. Introduce radix_tree_iter_next() and use it throughout shmem.c. [koct9i@gmail.com: fix bug in radix_tree_iter_next() for tagged iteration] Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Matthew Wilcox | 2cf938aae1 |
mm: use radix_tree_iter_retry()
Instead of a 'goto restart', we can now use radix_tree_iter_retry() to restart from our current position. This will make a difference when there are more ways to happen across an indirect pointer. And it eliminates some confusing gotos. [vbabka@suse.cz: remove now-obsolete-and-misleading comment] Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Matthew Wilcox | e614523653 |
radix_tree: add support for multi-order entries
With huge pages, it is convenient to have the radix tree be able to return an entry that covers multiple indices. Previous attempts to deal with the problem have involved inserting N duplicate entries, which is a waste of memory and leads to problems trying to handle aliased tags, or probing the tree multiple times to find alternative entries which might cover the requested index. This approach inserts one canonical entry into the tree for a given range of indices, and may also insert other entries in order to ensure that lookups find the canonical entry. This solution only tolerates inserting powers of two that are greater than the fanout of the tree. If we wish to expand the radix tree's abilities to support large-ish pages that is less than the fanout at the penultimate level of the tree, then we would need to add one more step in lookup to ensure that any sibling nodes in the final level of the tree are dereferenced and we return the canonical entry that they reference. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Christoph Lameter | 93e205a728 |
fix Christoph's email addresses
There are various email addresses for me throughout the kernel. Use the one that will always be valid. Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tetsuo Handa | 0a687aace3 |
mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled
After the OOM killer is disabled during suspend operation, any !__GFP_NOFAIL && __GFP_FS allocations are forced to fail. Thus, any !__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tetsuo Handa | 6afcf2895e |
mm,oom: make oom_killer_disable() killable
While oom_killer_disable() is called by freeze_processes() after all user threads except the current thread are frozen, it is possible that kernel threads invoke the OOM killer and sends SIGKILL to the current thread due to sharing the thawed victim's memory. Therefore, checking for SIGKILL is preferable than TIF_MEMDIE. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Sergey Senozhatsky | 1120ed5483 |
mm/zsmalloc: add `freeable' column to pool stat
Add a new column to pool stats, which will tell how many pages ideally can be freed by class compaction, so it will be easier to analyze zsmalloc fragmentation. At the moment, we have only numbers of FULL and ALMOST_EMPTY classes, but they don't tell us how badly the class is fragmented internally. The new /sys/kernel/debug/zsmalloc/zramX/classes output look as follows: class size almost_full almost_empty obj_allocated obj_used pages_used pages_per_zspage freeable [..] 12 224 0 2 146 5 8 4 4 13 240 0 0 0 0 0 1 0 14 256 1 13 1840 1672 115 1 10 15 272 0 0 0 0 0 1 0 [..] 49 816 0 3 745 735 149 1 2 51 848 3 4 361 306 76 4 8 52 864 12 14 378 268 81 3 21 54 896 1 12 117 57 26 2 12 57 944 0 0 0 0 0 3 0 [..] Total 26 131 12709 10994 1071 134 For example, from this particular output we can easily conclude that class-896 is heavily fragmented -- it occupies 26 pages, 12 can be freed by compaction. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
YiPing Xu | a82cbf0713 |
zsmalloc: drop unused member 'mapping_area->huge'
When unmapping a huge class page in zs_unmap_object, the page will be unmapped by kmap_atomic. the "!area->huge" branch in __zs_unmap_object is alway true, and no code set "area->huge" now, so we can drop it. Signed-off-by: YiPing Xu <xuyiping@huawei.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Shawn Lin | a1c0b1a074 |
mm/vmalloc: use PAGE_ALIGNED() to check PAGE_SIZE alignment
We have PAGE_ALIGNED() in mm.h, so let's use it instead of IS_ALIGNED() for checking PAGE_SIZE aligned case. Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | e0775d10f1 |
mm: memcontrol: zap oom_info_lock
mem_cgroup_print_oom_info is always called under oom_lock, so oom_info_lock is redundant. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Johannes Weiner | 8b5926560f |
mm: memcontrol: clarify the uncharge_list() loop
uncharge_list() does an unusual list walk because the function can take regular lists with dedicated list_heads as well as singleton lists where a single page is passed via the page->lru list node. This can sometimes lead to confusion as well as suggestions to replace the loop with a list_for_each_entry(), which wouldn't work. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Johannes Weiner | b6e6edcfa4 |
mm: memcontrol: reclaim and OOM kill when shrinking memory.max below usage
Setting the original memory.limit_in_bytes hardlimit is subject to a race condition when the desired value is below the current usage. The code tries a few times to first reclaim and then see if the usage has dropped to where we would like it to be, but there is no locking, and the workload is free to continue making new charges up to the old limit. Thus, attempting to shrink a workload relies on pure luck and hope that the workload happens to cooperate. To fix this in the cgroup2 memory.max knob, do it the other way round: set the limit first, then try enforcement. And if reclaim is not able to succeed, trigger OOM kills in the group. Keep going until the new limit is met, we run out of OOM victims and there's only unreclaimable memory left, or the task writing to memory.max is killed. This allows users to shrink groups reliably, and the behavior is consistent with what happens when new charges are attempted in excess of memory.max. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Johannes Weiner | 588083bb37 |
mm: memcontrol: reclaim when shrinking memory.high below usage
When setting memory.high below usage, nothing happens until the next charge comes along, and then it will only reclaim its own charge and not the now potentially huge excess of the new memory.high. This can cause groups to stay in excess of their memory.high indefinitely. To fix that, when shrinking memory.high, kick off a reclaim cycle that goes after the delta. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov@virtuozzo.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Li Zhang | 987b3095c2 |
mm: meminit: initialise more memory for inode/dentry hash tables in early boot
Upstream has supported page parallel initialisation for X86 and the boot time is improved greately. Some tests have been done for Power. Here is the result I have done with different memory size. * 4GB memory: boot time is as the following: with patch vs without patch: 10.4s vs 24.5s boot time is improved 57% * 200GB memory: boot time looks the same with and without patches. boot time is about 38s * 32TB memory: boot time looks the same with and without patches boot time is about 160s. The boot time is much shorter than X86 with 24TB memory. From community discussion, it costs about 694s for X86 24T system. Parallel initialisation improves the performance by deferring memory initilisation to kswap with N kthreads, it should improve the performance therotically. In testing on X86, performance is improved greatly with huge memory. But on Power platform, it is improved greatly with less than 100GB memory. For huge memory, it is not improved greatly. But it saves the time with several threads at least, as the following information shows(32TB system log): [ 22.648169] node 9 initialised, 16607461 pages in 280ms [ 22.783772] node 3 initialised, 23937243 pages in 410ms [ 22.858877] node 6 initialised, 29179347 pages in 490ms [ 22.863252] node 2 initialised, 29179347 pages in 490ms [ 22.907545] node 0 initialised, 32049614 pages in 540ms [ 22.920891] node 15 initialised, 32212280 pages in 550ms [ 22.923236] node 4 initialised, 32306127 pages in 550ms [ 22.923384] node 12 initialised, 32314319 pages in 550ms [ 22.924754] node 8 initialised, 32314319 pages in 550ms [ 22.940780] node 13 initialised, 33353677 pages in 570ms [ 22.940796] node 11 initialised, 33353677 pages in 570ms [ 22.941700] node 5 initialised, 33353677 pages in 570ms [ 22.941721] node 10 initialised, 33353677 pages in 570ms [ 22.941876] node 7 initialised, 33353677 pages in 570ms [ 22.944946] node 14 initialised, 33353677 pages in 570ms [ 22.946063] node 1 initialised, 33345485 pages in 580ms It saves the time about 550*16 ms at least, although it can be ignore to compare the boot time about 160 seconds. What's more, the boot time is much shorter on Power even without patches than x86 for huge memory machine. So this patchset is still necessary to be enabled for Power. This patch (of 2): This patch is based on Mel Gorman's old patch in the mailing list, https://lkml.org/lkml/2015/5/5/280 which is discussed but it is fixed with a completion to wait for all memory initialised in page_alloc_init_late(). It is to fix the OOM problem on X86 with 24TB memory which allocates memory in late initialisation. But for Power platform with 32TB memory, it causes a call trace in vfs_caches_init->inode_init() and inode hash table needs more memory. So this patch allocates 1GB for 0.25TB/node for large system as it is mentioned in https://lkml.org/lkml/2015/5/1/627 This call trace is found on Power with 32TB memory, 1024CPUs, 16nodes. Currently, it only allocates 2GB*16=32GB for early initialisation. But Dentry cache hash table needes 16GB and Inode cache hash table needs 16GB. So the system have no enough memory for it. The log from dmesg as the following: Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes) vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes swapper/0: page allocation failure: order:0,mode:0x2080020 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64 Call Trace: .dump_stack+0xb4/0xb664 (unreliable) .warn_alloc_failed+0x114/0x160 .__vmalloc_area_node+0x1a4/0x2b0 .__vmalloc_node_range+0xe4/0x110 .__vmalloc_node+0x40/0x50 .alloc_large_system_hash+0x134/0x2a4 .inode_init+0xa4/0xf0 .vfs_caches_init+0x80/0x144 .start_kernel+0x40c/0x4e0 start_here_common+0x20/0x4a4 Signed-off-by: Li Zhang <zhlcindy@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | 5f7377147c |
thp: fix deadlock in split_huge_pmd()
split_huge_pmd() tries to munlock page with munlock_vma_page(). That
requires the page to locked.
If the is locked by caller, we would get a deadlock:
Unable to find swap-space signature
INFO: task trinity-c85:1907 blocked for more than 120 seconds.
Not tainted 4.4.0-00032-gf19d0bdced41-dirty #1606
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
trinity-c85 D ffff88084d997608 0 1907 309 0x00000000
Call Trace:
schedule+0x9f/0x1c0
schedule_timeout+0x48e/0x600
io_schedule_timeout+0x1c3/0x390
bit_wait_io+0x29/0xd0
__wait_on_bit_lock+0x94/0x140
__lock_page+0x1d4/0x280
__split_huge_pmd+0x5a8/0x10f0
split_huge_pmd_address+0x1d9/0x230
try_to_unmap_one+0x540/0xc70
rmap_walk_anon+0x284/0x810
rmap_walk_locked+0x11e/0x190
try_to_unmap+0x1b1/0x4b0
split_huge_page_to_list+0x49d/0x18a0
follow_page_mask+0xa36/0xea0
SyS_move_pages+0xaf3/0x1570
entry_SYSCALL_64_fastpath+0x12/0x6b
2 locks held by trinity-c85/1907:
#0: (&mm->mmap_sem){++++++}, at: SyS_move_pages+0x933/0x1570
#1: (&anon_vma->rwsem){++++..}, at: split_huge_page_to_list+0x402/0x18a0
I don't think the deadlock is triggerable without split_huge_page()
simplifilcation patchset.
But munlock_vma_page() here is wrong: we want to munlock the page
unconditionally, no need in rmap lookup, that munlock_vma_page() does.
Let's use clear_page_mlock() instead. It can be called under ptl.
Fixes:
|
|
Kirill A. Shutemov | fec89c109f |
thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers
freeze_page() and unfreeze_page() helpers evolved in rather complex beasts. It would be nice to cut complexity of this code. This patch rewrites freeze_page() using standard try_to_unmap(). unfreeze_page() is rewritten with remove_migration_ptes(). The result is much simpler. But the new variant is somewhat slower for PTE-mapped THPs. Current helpers iterates over VMAs the compound page is mapped to, and then over ptes within this VMA. New helpers iterates over small page, then over VMA the small page mapped to, and only then find relevant pte. We have short cut for PMD-mapped THP: we directly install migration entries on PMD split. I don't think the slowdown is critical, considering how much simpler result is and that split_huge_page() is quite rare nowadays. It only happens due memory pressure or migration. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | e388466de4 |
mm: make remove_migration_ptes() beyond mm/migration.c
Make remove_migration_ptes() available to be used in split_huge_page(). New parameter 'locked' added: as with try_to_umap() we need a way to indicate that caller holds rmap lock. We also shouldn't try to mlock() pte-mapped huge pages: pte-mapeed THP pages are never mlocked. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | 2a52bcbcc6 |
rmap: extend try_to_unmap() to be usable by split_huge_page()
Add support for two ttu_flags: - TTU_SPLIT_HUGE_PMD would split PMD if it's there, before trying to unmap page; - TTU_RMAP_LOCKED indicates that caller holds relevant rmap lock; Also, change rwc->done to !page_mapcount() instead of !page_mapped(). try_to_unmap() works on pte level, so we are really interested in the mappedness of this small page rather than of the compound page it's a part of. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | b97731992d |
rmap: introduce rmap_walk_locked()
This patchset rewrites freeze_page() and unfreeze_page() using try_to_unmap() and remove_migration_ptes(). Result is much simpler, but somewhat slower. Migration 8GiB worth of PMD-mapped THP: Baseline 20.21 +/- 0.393 Patched 20.73 +/- 0.082 Slowdown 1.03x It's 3% slower, comparing to 14% in v1. I don't it should be a stopper. Splitting of PTE-mapped pages slowed more. But this is not a common case. Migration 8GiB worth of PMD-mapped THP: Baseline 20.39 +/- 0.225 Patched 22.43 +/- 0.496 Slowdown 1.10x rmap_walk_locked() is the same as rmap_walk(), but the caller takes care of the relevant rmap lock. This is preparation for switching THP splitting from custom rmap walk in freeze_page()/unfreeze_page() to the generic one. There is no support for KSM pages for now: not clear which lock is implied. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Dan Williams | 99490f16f8 |
mm: ZONE_DEVICE depends on SPARSEMEM_VMEMMAP
The primary use case for devm_memremap_pages() is to allocate an memmap array from persistent memory. That capabilty requires vmem_altmap which requires SPARSEMEM_VMEMMAP. Also, without SPARSEMEM_VMEMMAP the addition of ZONE_DEVICE expands ZONES_WIDTH and triggers the: "Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid." ...warning in mm/memory.c. SPARSEMEM_VMEMMAP=n && ZONE_DEVICE=y is not a configuration we should worry about supporting. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joe Perches | 870d4b12ad |
mm: percpu: use pr_fmt to prefix output
Use the normal mechanism to make the logging output consistently "percpu:" instead of a mix of "PERCPU:" and "percpu:" Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joe Perches | 1170532bb4 |
mm: convert printk(KERN_<LEVEL> to pr_<level>
Most of the mm subsystem uses pr_<level> so make it consistent. Miscellanea: - Realign arguments - Add missing newline to format - kmemleak-test.c has a "kmemleak: " prefix added to the "Kmemleak testing" logging message via pr_fmt Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joe Perches | 756a025f00 |
mm: coalesce split strings
Kernel style prefers a single string over split strings when the string is 'user-visible'. Miscellanea: - Add a missing newline - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joe Perches | 598d80914e |
mm: convert pr_warning to pr_warn
There are a mixture of pr_warning and pr_warn uses in mm. Use pr_warn consistently. Miscellanea: - Coalesce formats - Realign arguments Signed-off-by: Joe Perches <joe@perches.com> Acked-by: Tejun Heo <tj@kernel.org> [percpu] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Dan Williams | b11a7b9410 |
mm: exclude ZONE_DEVICE from GFP_ZONE_TABLE
ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
mm zones that are bumping up against the current maximum limit of 4
zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE.
The GFP_ZONE_TABLE poses an interesting constraint since
include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
build. We need to be careful to only build the table for zones that
have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this
purpose. This patch does not attempt to solve the problem of adding a
new zone that also has a corresponding GFP_ flag.
Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
SPARSEMEM_VMEMMAP implies that SECTIONS_WIDTH is zero. In other words
even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
consume another bit in page->flags (expand ZONES_WIDTH) with room to
spare.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
Fixes:
|
|
Vladimir Davydov | d334c9bcb4 |
mm: memcontrol: cleanup css_reset callback
- Do not take memcg_limit_mutex for resetting limits - the cgroup cannot be altered from userspace anymore, so no need to protect them. - Use plain page_counter_limit() for resetting ->memory and ->memsw limits instead of mem_cgrouop_resize_* helpers - we enlarge the limits, so no need in special handling. - Reset ->swap and ->tcpmem limits as well. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Chen Yucong | e33e33b4d1 |
mm, memory hotplug: print debug message in the proper way for online_pages
online_pages() simply returns an error value if memory_notify(MEM_GOING_ONLINE, &arg) return a value that is not what we want for successfully onlining target pages. This patch arms to print more failure information like offline_pages() in online_pages. This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and moves __offline_pages() to not print failure information with KERN_INFO according to David Rientjes's suggestion[1]. [1] https://lkml.org/lkml/2016/2/24/1094 Signed-off-by: Chen Yucong <slaoub@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 0f352e5392 |
mm: remove __GFP_NOFAIL is deprecated comment
Commit
|
|
Joonsoo Kim | 95813b8faa |
mm/page_ref: add tracepoint to track down page reference manipulation
CMA allocation should be guaranteed to succeed by definition, but, unfortunately, it would be failed sometimes. It is hard to track down the problem, because it is related to page reference manipulation and we don't have any facility to analyze it. This patch adds tracepoints to track down page reference manipulation. With it, we can find exact reason of failure and can fix the problem. Following is an example of tracepoint output. (note: this example is stale version that printing flags as the number. Recent version will print it as human readable string.) <...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1 <...>-9018 [004] 92.678378: kernel_stack: => get_page_from_freelist (ffffffff81176659) => __alloc_pages_nodemask (ffffffff81176d22) => alloc_pages_vma (ffffffff811bf675) => handle_mm_fault (ffffffff8119e693) => __do_page_fault (ffffffff810631ea) => trace_do_page_fault (ffffffff81063543) => do_async_page_fault (ffffffff8105c40a) => async_page_fault (ffffffff817581d8) [snip] <...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1 [snip] ... ... <...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail [snip] <...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1 => release_pages (ffffffff8117c9e4) => free_pages_and_swap_cache (ffffffff811b0697) => tlb_flush_mmu_free (ffffffff81199616) => tlb_finish_mmu (ffffffff8119a62c) => exit_mmap (ffffffff811a53f7) => mmput (ffffffff81073f47) => do_exit (ffffffff810794e9) => do_group_exit (ffffffff81079def) => SyS_exit_group (ffffffff81079e74) => entry_SYSCALL_64_fastpath (ffffffff817560b6) This output shows that problem comes from exit path. In exit path, to improve performance, pages are not freed immediately. They are gathered and processed by batch. During this process, migration cannot be possible and CMA allocation is failed. This problem is hard to find without this page reference tracepoint facility. Enabling this feature bloat kernel text 30 KB in my configuration. text data bss dec hex filename 12127327 2243616 1507328 15878271 f2487f vmlinux_disabled 12157208 2258880 1507328 15923416 f2f8d8 vmlinux_enabled Note that, due to header file dependency problem between mm.h and tracepoint.h, this feature has to open code the static key functions for tracepoints. Proposed by Steven Rostedt in following link. https://lkml.org/lkml/2015/12/9/699 [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()] [iamjoonsoo.kim@lge.com: fix build failure for xtensa] [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | fe896d1878 |
mm: introduce page reference manipulation functions
The success of CMA allocation largely depends on the success of migration and key factor of it is page reference count. Until now, page reference is manipulated by direct calling atomic functions so we cannot follow up who and where manipulate it. Then, it is hard to find actual reason of CMA allocation failure. CMA allocation should be guaranteed to succeed so finding offending place is really important. In this patch, call sites where page reference is manipulated are converted to introduced wrapper function. This is preparation step to add tracepoint to each page reference manipulation function. With this facility, we can easily find reason of CMA allocation failure. There is no functional change in this patch. In addition, this patch also converts reference read sites. It will help a second step that renames page._count to something else and prevents later attempt to direct access to it (Suggested by Andrew). Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 444eb2a449 |
mm: thp: set THP defrag by default to madvise and add a stall-free defrag option
THP defrag is enabled by default to direct reclaim/compact but not wake kswapd in the event of a THP allocation failure. The problem is that THP allocation requests potentially enter reclaim/compaction. This potentially incurs a severe stall that is not guaranteed to be offset by reduced TLB misses. While there has been considerable effort to reduce the impact of reclaim/compaction, it is still a high cost and workloads that should fit in memory fail to do so. Specifically, a simple anon/file streaming workload will enter direct reclaim on NUMA at least even though the working set size is 80% of RAM. It's been years and it's time to throw in the towel. First, this patch defines THP defrag as follows; madvise: A failed allocation will direct reclaim/compact if the application requests it never: Neither reclaim/compact nor wake kswapd defer: A failed allocation will wake kswapd/kcompactd always: A failed allocation will direct reclaim/compact (historical behaviour) khugepaged defrag will enter direct/reclaim but not wake kswapd. Next it sets the default defrag option to be "madvise" to only enter direct reclaim/compaction for applications that specifically requested it. Lastly, it removes a check from the page allocator slowpath that is related to __GFP_THISNODE to allow "defer" to work. The callers that really cares are slub/slab and they are updated accordingly. The slab one may be surprising because it also corrects a comment as kswapd was never woken up by that path. This means that a THP fault will no longer stall for most applications by default and the ideal for most users that get THP if they are immediately available. There are still options for users that prefer a stall at startup of a new application by either restoring historical behaviour with "always" or pick a half-way point with "defer" where kswapd does some of the work in the background and wakes kcompactd if necessary. THP defrag for khugepaged remains enabled and will enter direct/reclaim but no wakeup kswapd or kcompactd. After this patch a THP allocation failure will quickly fallback and rely on khugepaged to recover the situation at some time in the future. In some cases, this will reduce THP usage but the benefit of THP is hard to measure and not a universal win where as a stall to reclaim/compaction is definitely measurable and can be painful. The first test for this is using "usemem" to read a large file and write a large anonymous mapping (to avoid the zero page) multiple times. The total size of the mappings is 80% of RAM and the benchmark simply measures how long it takes to complete. It uses multiple threads to see if that is a factor. On UMA, the performance is almost identical so is not reported but on NUMA, we see this usemem 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%) Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%) Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%) Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%) Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%) Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%) Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%) Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%) Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%) Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%) Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%) Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%) Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%) Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%) For a single thread, the benchmark completes 43.23% faster with this patch applied with smaller benefits as the thread increases. Similar, notice the large reduction in most cases in system CPU usage. The overall CPU time is 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 User 10357.65 10438.33 System 3988.88 3543.94 Elapsed 2203.01 1634.41 Which is substantial. Now, the reclaim figures 4.4.0 4.4.0 kcompactd-v1r1nodefrag-v1r3 Minor Faults 128458477 278352931 Major Faults 2174976 225 Swap Ins 16904701 0 Swap Outs 17359627 0 Allocation stalls 43611 0 DMA allocs 0 0 DMA32 allocs 19832646 19448017 Normal allocs 614488453 580941839 Movable allocs 0 0 Direct pages scanned 24163800 0 Kswapd pages scanned 0 0 Kswapd pages reclaimed 0 0 Direct pages reclaimed 20691346 0 Compaction stalls 42263 0 Compaction success 938 0 Compaction failures 41325 0 This patch eliminates almost all swapping and direct reclaim activity. There is still overhead but it's from NUMA balancing which does not identify that it's pointless trying to do anything with this workload. I also tried the thpscale benchmark which forces a corner case where compaction can be used heavily and measures the latency of whether base or huge pages were used thpscale Fault Latencies 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%) Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%) Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%) Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%) Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%) Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%) Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%) Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%) Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%) Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%) Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%) Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%) Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%) Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%) Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%) Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%) Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%) Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%) The average time to fault pages is substantially reduced in the majority of caseds but with the obvious caveat that fewer THPs are actually used in this adverse workload 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%) Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%) Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%) Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%) Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%) Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%) Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%) Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%) Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%) 4.4.0 4.4.0 kcompactd-v1r1nodefrag-v1r3 Minor Faults 37429143 47564000 Major Faults 1916 1558 Swap Ins 1466 1079 Swap Outs 2936863 149626 Allocation stalls 62510 3 DMA allocs 0 0 DMA32 allocs 6566458 6401314 Normal allocs 216361697 216538171 Movable allocs 0 0 Direct pages scanned 25977580 17998 Kswapd pages scanned 0 3638931 Kswapd pages reclaimed 0 207236 Direct pages reclaimed 8833714 88 Compaction stalls 103349 5 Compaction success 270 4 Compaction failures 103079 1 Note again that while this does swap as it's an aggressive workload, the direct relcim activity and allocation stalls is substantially reduced. There is some kswapd activity but ftrace showed that the kswapd activity was due to normal wakeups from 4K pages being allocated. Compaction-related stalls and activity are almost eliminated. I also tried the stutter benchmark. For this, I do not have figures for NUMA but it's something that does impact UMA so I'll report what is available stutter 4.4.0 4.4.0 kcompactd-v1r1 nodefrag-v1r3 Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%) 1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%) 2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%) 3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%) Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%) Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%) Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%) Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%) Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%) Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%) This benchmark is trying to fault an anonymous mapping while there is a heavy IO load -- a scenario that desktop users used to complain about frequently. This shows a mix because the ideal case of mapping with THP is not hit as often. However, note that 99% of the mappings complete 13.79% faster. The CPU usage here is particularly interesting 4.4.0 4.4.0 kcompactd-v1r1nodefrag-v1r3 User 67.50 0.99 System 1327.88 91.30 Elapsed 2079.00 2128.98 And once again we look at the reclaim figures 4.4.0 4.4.0 kcompactd-v1r1nodefrag-v1r3 Minor Faults 335241922 1314582827 Major Faults 715 819 Swap Ins 0 0 Swap Outs 0 0 Allocation stalls 532723 0 DMA allocs 0 0 DMA32 allocs 1822364341 1177950222 Normal allocs 1815640808 1517844854 Movable allocs 0 0 Direct pages scanned 21892772 0 Kswapd pages scanned 20015890 41879484 Kswapd pages reclaimed 19961986 41822072 Direct pages reclaimed 21892741 0 Compaction stalls 1065755 0 Compaction success 514 0 Compaction failures 1065241 0 Allocation stalls and all direct reclaim activity is eliminated as well as compaction-related stalls. THP gives impressive gains in some cases but only if they are quickly available. We're not going to reach the point where they are completely free so lets take the costs out of the fast paths finally and defer the cost to kswapd, kcompactd and khugepaged where it belongs. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
David Rientjes | f9054c70d2 |
mm, mempool: only set __GFP_NOMEMALLOC if there are free elements
If an oom killed thread calls mempool_alloc(), it is possible that it'll loop forever if there are no elements on the freelist since __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in oom conditions. Only set __GFP_NOMEMALLOC if there are elements on the freelist. If there are no free elements, allow allocations without the bit set so that memory reserves can be accessed if needed. Additionally, using mempool_alloc() with __GFP_NOMEMALLOC is not supported since the implementation can loop forever without accessing memory reserves when needed. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Johannes Weiner | 795ae7a0de |
mm: scale kswapd watermarks in proportion to memory
In machines with 140G of memory and enterprise flash storage, we have seen read and write bursts routinely exceed the kswapd watermarks and cause thundering herds in direct reclaim. Unfortunately, the only way to tune kswapd aggressiveness is through adjusting min_free_kbytes - the system's emergency reserves - which is entirely unrelated to the system's latency requirements. In order to get kswapd to maintain a 250M buffer of free memory, the emergency reserves need to be set to 1G. That is a lot of memory wasted for no good reason. On the other hand, it's reasonable to assume that allocation bursts and overall allocation concurrency scale with memory capacity, so it makes sense to make kswapd aggressiveness a function of that as well. Change the kswapd watermark scale factor from the currently fixed 25% of the tunable emergency reserve to a tunable 0.1% of memory. Beyond 1G of memory, this will produce bigger watermark steps than the current formula in default settings. Ensure that the new formula never chooses steps smaller than that, i.e. 25% of the emergency reserve. On a 140G machine, this raises the default watermark steps - the distance between min and low, and low and high - from 16M to 143M. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | 3ed3a4f0dd |
mm: cleanup *pte_alloc* interfaces
There are few things about *pte_alloc*() helpers worth cleaning up: - 'vma' argument is unused, let's drop it; - most __pte_alloc() callers do speculative check for pmd_none(), before taking ptl: let's introduce pte_alloc() macro which does the check. The only direct user of __pte_alloc left is userfaultfd, which has different expectation about atomicity wrt pmd. - pte_alloc_map() and pte_alloc_map_lock() are redefined using pte_alloc(). [sudeep.holla@arm.com: fix build for arm64 hugetlbpage] [sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Sudeep Holla <sudeep.holla@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Igor Redko | d02bd27bd3 |
mm/page_alloc.c: calculate 'available' memory in a separate function
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. This metric would be very useful in VM orchestration software to improve memory management of different VMs under overcommit. This patch (of 2): Factor out calculation of the available memory counter into a separate exportable function, in order to be able to use it in other parts of the kernel. In particular, it appears a relevant metric to report to the hypervisor via virtio-balloon statistics interface (in a followup patch). Signed-off-by: Igor Redko <redkoi@virtuozzo.com> Signed-off-by: Denis V. Lunev <den@openvz.org> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | 7eb50292d7 |
mm/Kconfig: remove redundant arch depend for memory hotplug
MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG which is selected by the supported architectures, so the following arch depend is unnecessary. Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Aneesh Kumar K.V | 458aa76d13 |
mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range
We remove one instace of flush_tlb_range here. That was added by commit
|
|
Andrey Ryabinin | 39a1aa8e19 |
mm: deduplicate memory overcommitment code
Currently we have two copies of the same code which implements memory overcommitment logic. Let's move it into mm/util.c and hence avoid duplication. No functional changes here. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Ryabinin | ea606cf5d8 |
mm: move max_map_count bits into mm.h
max_map_count sysctl unrelated to scheduler. Move its bits from include/linux/sched/sysctl.h to include/linux/mm.h. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill A. Shutemov | f9719a03de |
thp, vmstats: count deferred split events
Count how many times we put a THP in split queue. Currently, it happens on partial unmap of a THP. Rapidly growing value can indicate that an application behaves unfriendly wrt THP: often fault in huge page and then unmap part of it. This leads to unnecessary memory fragmentation and the application may require tuning. The event also can help with debugging kernel [mis-]behaviour. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | 0a6b76dd23 |
mm: workingset: make shadow node shrinker memcg aware
Workingset code was recently made memcg aware, but shadow node shrinker is still global. As a result, one small cgroup can consume all memory available for shadow nodes, possibly hurting other cgroups by reclaiming their shadow nodes, even though reclaim distances stored in its shadow nodes have no effect. To avoid this, we need to make shadow node shrinker memcg aware. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | cdcbb72ebf |
mm: workingset: size shadow nodes lru basing on file cache size
A page is activated on refault if the refault distance stored in the corresponding shadow entry is less than the number of active file pages. Since active file pages can't occupy more than half memory, we assume that the maximal effective refault distance can't be greater than half the number of present pages and size the shadow nodes lru list appropriately. Generally speaking, this assumption is correct, but it can result in wasting a considerable chunk of memory on stale shadow nodes in case the portion of file pages is small, e.g. if a workload mostly uses anonymous memory. To sort this out, we need to compute the size of shadow nodes lru basing not on the maximal possible, but the current size of file cache. We could take the size of active file lru for the maximal refault distance, but active lru is pretty unstable - it can shrink dramatically at runtime possibly disrupting workingset detection logic. Instead we assume that the maximal refault distance equals half the total number of file cache pages. This will protect us against active file lru size fluctuations while still being correct, because size of active lru is normally maintained lower than size of inactive lru. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | b6ecd2dea4 |
mm: memcontrol: zap memcg_kmem_online helper
As kmem accounting is now either enabled for all cgroups or disabled system-wide, there's no point in having memcg_kmem_online() helper - instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as shrink_slab() now does. There are only two places left where this helper is used - __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can only be called if memcg_kmem_enabled() returned true. Since the cgroup it operates on is online, mem_cgroup_is_root() check will be enough. memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead of memcg_kmem_online(), because it relies on the fact that in memcg_offline_kmem() memcg->kmem_state is changed before memcg_deactivate_kmem_caches() is called, but there we can just open-code the check. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | 0fc9f58a90 |
mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker
It's just convenient to implement a memcg aware shrinker when you know that shrink_control->memcg != NULL unless memcg_kmem_enabled() returns false. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vladimir Davydov | b313aeee25 |
mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy
Workingset code was recently made memcg aware, but shadow node shrinker is still global. As a result, one small cgroup can consume all memory available for shadow nodes, possibly hurting other cgroups by reclaiming their shadow nodes, even though reclaim distances stored in its shadow nodes have no effect. To avoid this, we need to make shadow node shrinker memcg aware. The actual work is done in patch 6 of the series. Patches 1 and 2 prepare memcg/shrinker infrastructure for the change. Patch 3 is just a collateral cleanup. Patch 4 makes radix_tree_node accounted, which is necessary for making shadow node shrinker memcg aware. Patch 5 reduces shadow nodes overhead in case workload mostly uses anonymous pages. This patch: Currently, in the legacy hierarchy kmem accounting is off for all cgroups by default and must be enabled explicitly by writing something to memory.kmem.limit_in_bytes. Since we don't support reclaim on hitting kmem limit, nor do we have any plans to implement it, this is likely to be -1, just to enable kmem accounting and limit kernel memory consumption by the memory.limit_in_bytes along with user memory. This user API was introduced when the implementation of kmem accounting lacked slab shrinker support and hence was useless in practice. Things have changed since then - slab shrinkers were made memcg aware, the accounting overhead seems to be negligible, and a failure to charge a kmem allocation should not have critical consequences, because we only account those kernel objects that should be safe to fail. That's why kmem accounting is enabled by default for all cgroups in the default hierarchy, which will eventually replace the legacy one. The ability to enable kmem accounting for some cgroups while keeping it disabled for others is getting difficult to maintain. E.g. to make shadow node shrinker memcg aware (see mm/workingset.c), we need to know the relationship between the number of shadow nodes allocated for a cgroup and the size of its lru list. If kmem accounting is enabled for all cgroups there is no problem, but what should we do if kmem accounting is enabled only for half of cgroups? We've no other choice but use global lru stats while scanning root cgroup's shadow nodes, but that would be wrong if kmem accounting was enabled for all cgroups (which is the case if the unified hierarchy is used), in which case we should use lru stats of the root cgroup's lruvec. That being said, let's enable kmem accounting for all memory cgroups by default. If one finds it unstable or too costly, it can always be disabled system-wide by passing cgroup.memory=nokmem to the kernel at boot time. Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | accf62422b |
mm, kswapd: replace kswapd compaction with waking up kcompactd
Similarly to direct reclaim/compaction, kswapd attempts to combine
reclaim and compaction to attempt making memory allocation of given
order available.
The details differ from direct reclaim e.g. in having high watermark as
a goal. The code involved in kswapd's reclaim/compaction decisions has
evolved to be quite complex.
Testing reveals that it doesn't actually work in at least one scenario,
and closer inspection suggests that it could be greatly simplified
without compromising on the goal (make high-order page available) or
efficiency (don't reclaim too much). The simplification relieas of
doing all compaction in kcompactd, which is simply woken up when high
watermarks are reached by kswapd's reclaim.
The scenario where kswapd compaction doesn't work was found with mmtests
test stress-highalloc configured to attempt order-9 allocations without
direct reclaim, just waking up kswapd. There was no compaction attempt
from kswapd during the whole test. Some added instrumentation shows
what happens:
- balance_pgdat() sets end_zone to Normal, as it's not balanced
- reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
it cannot reclaim anything, so sc.nr_reclaimed is 0
- for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
it merely checks if high watermarks were reached for base pages.
This is true, so no reclaim is attempted. For DMA, testorder=0
wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
- even though the pgdat_needs_compaction flag wasn't set to false, no
compaction happens due to the condition sc.nr_reclaimed >
nr_attempted being false (as 0 < 99)
- priority-- due to nr_reclaimed being 0, repeat until priority reaches
0 pgdat_balanced() is false as only the small zone DMA appears
balanced (curiously in that check, watermark appears OK and
compaction_suitable() returns COMPACT_PARTIAL, because a lower
classzone_idx is used there)
Now, even if it was decided that reclaim shouldn't be attempted on the
DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
nr_attempted=0) is also false. The condition really should use >= as
the comment suggests. Then there is a mismatch in the check for setting
pgdat_needs_compaction to false using low watermark, while the rest uses
high watermark, and who knows what other subtlety. Hopefully this
demonstrates that this is unsustainable.
Luckily we can simplify this a lot. The reclaim/compaction decisions
make sense for direct reclaim scenario, but in kswapd, our primary goal
is to reach high watermark in order-0 pages. Afterwards we can attempt
compaction just once. Unlike direct reclaim, we don't reclaim extra
pages (over the high watermark), the current code already disallows it
for good reasons.
After this patch, we simply wake up kcompactd to process the pgdat,
after we have either succeeded or failed to reach the high watermarks in
kswapd, which goes to sleep. We pass kswapd's order and classzone_idx,
so kcompactd can apply the same criteria to determine which zones are
worth compacting. Note that we use the classzone_idx from
wakeup_kswapd(), not balanced_classzone_idx which can include higher
zones that kswapd tried to balance too, but didn't consider them in
pgdat_balanced().
Since kswapd now cannot create high-order pages itself, we need to
adjust how it determines the zones to be balanced. The key element here
is adding a "highorder" parameter to zone_balanced, which, when set to
false, makes it consider only order-0 watermark instead of the desired
higher order (this was done previously by kswapd_shrink_zone(), but not
elsewhere). This false is passed for example in pgdat_balanced().
Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
kcompactd are woken up for a high-order allocation failure.
The last thing is to decide what to do with pageblock_skip bitmap
handling. Compaction maintains a pageblock_skip bitmap to record
pageblocks where isolation recently failed. This bitmap can be reset by
three ways:
1) direct compaction is restarting after going through the full deferred cycle
2) kswapd goes to sleep, and some other direct compaction has previously
finished scanning the whole zone and set zone->compact_blockskip_flush.
Note that a successful direct compaction clears this flag.
3) compaction was invoked manually via trigger in /proc
The case 2) is somewhat fuzzy to begin with, but after introducing
kcompactd we should update it. The check for direct compaction in 1),
and to set the flush flag in 2) use current_is_kswapd(), which doesn't
work for kcompactd. Thus, this patch adds bool direct_compaction to
compact_control to use in 2). For the case 1) we remove the check
completely - unlike the former kswapd compaction, kcompactd does use the
deferred compaction functionality, so flushing tied to restarting from
deferred compaction makes sense here.
Note that when kswapd goes to sleep, kcompactd is woken up, so it will
see the flushed pageblock_skip bits. This is different from when the
former kswapd compaction observed the bits and I believe it makes more
sense. Kcompactd can afford to be more thorough than a direct
compaction trying to limit allocation latency, or kswapd whose primary
goal is to reclaim.
For testing, I used stress-highalloc configured to do order-9
allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
phases 1 and 2 work as usual):
stress-highalloc
4.5-rc1+before 4.5-rc1+after
-nodirect -nodirect
Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%)
Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%)
Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%)
Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%)
Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%)
Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%)
Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%)
Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%)
Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%)
User 3166.67 3181.09
System 1153.37 1158.25
Elapsed 1768.53 1799.37
4.5-rc1+before 4.5-rc1+after
-nodirect -nodirect
Direct pages scanned 32938 32797
Kswapd pages scanned 2183166 2202613
Kswapd pages reclaimed 2152359 2143524
Direct pages reclaimed 32735 32545
Percentage direct scans 1% 1%
THP fault alloc 579 612
THP collapse alloc 304 316
THP splits 0 0
THP fault fallback 793 778
THP collapse fail 11 16
Compaction stalls 1013 1007
Compaction success 92 67
Compaction failures 920 939
Page migrate success 238457 721374
Page migrate failure 23021 23469
Compaction pages isolated 504695 1479924
Compaction migrate scanned 661390 8812554
Compaction free scanned 13476658 84327916
Compaction cost 262 838
After this patch we see improvements in allocation success rate
(especially for phase 3) along with increased compaction activity. The
compaction stalls (direct compaction) in the interfering kernel builds
(probably THP's) also decreased somewhat thanks to kcompactd activity,
yet THP alloc successes improved a bit.
Note that elapsed and user time isn't so useful for this benchmark,
because of the background interference being unpredictable. It's just
to quickly spot some major unexpected differences. System time is
somewhat more useful and that didn't increase.
Also (after adjusting mmtests' ftrace monitor):
Time kswapd awake 2547781 2269241
Time kcompactd awake 0 119253
Time direct compacting 939937 557649
Time kswapd compacting 0 0
Time kcompactd compacting 0 119099
The decrease of overal time spent compacting appears to not match the
increased compaction stats. I suspect the tasks get rescheduled and
since the ftrace monitor doesn't see that, the reported time is wall
time, not CPU time. But arguably direct compactors care about overall
latency anyway, whether busy compacting or waiting for CPU doesn't
matter. And that latency seems to almost halved.
It's also interesting how much time kswapd spent awake just going
through all the priorities and failing to even try compacting, over and
over.
We can also configure stress-highalloc to perform both direct
reclaim/compaction and wakeup kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
stress-highalloc
4.5-rc1+before 4.5-rc1+after
-direct -direct
Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%)
Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%)
Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%)
Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%)
Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%)
Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%)
Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%)
Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%)
Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%)
User 3344.73 3246.04
System 1194.24 1172.29
Elapsed 1838.04 1836.76
4.5-rc1+before 4.5-rc1+after
-direct -direct
Direct pages scanned 125146 120966
Kswapd pages scanned 2119757 2135012
Kswapd pages reclaimed 2073183 2108388
Direct pages reclaimed 124909 120577
Percentage direct scans 5% 5%
THP fault alloc 599 652
THP collapse alloc 323 354
THP splits 0 0
THP fault fallback 806 793
THP collapse fail 17 16
Compaction stalls 2457 2025
Compaction success 906 518
Compaction failures 1551 1507
Page migrate success 2031423 2360608
Page migrate failure 32845 40852
Compaction pages isolated
|