Commit Graph

11252 Commits

Author SHA1 Message Date
Michal Hocko 7877cdcc38 mm: consolidate warn_alloc_failed users
warn_alloc_failed is currently used from the page and vmalloc
allocators.  This is a good reuse of the code except that vmalloc would
appreciate a slightly different warning message.  This is already
handled by the fmt parameter except that

  "%s: page allocation failure: order:%u, mode:%#x(%pGg)"

is printed anyway.  This might be quite misleading because it might be a
vmalloc failure which leads to the warning while the page allocator is
not the culprit here.  Fix this by always using the fmt string and only
print the context that makes sense for the particular context (e.g.
order makes only very little sense for the vmalloc context).

Rename the function to not miss any user and also because a later patch
will reuse it also for !failure cases.

Link: http://lkml.kernel.org/r/20160929084407.7004-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Wei Fang c2a9737f45 vfs,mm: fix a dead loop in truncate_inode_pages_range()
We triggered a deadloop in truncate_inode_pages_range() on 32 bits
architecture with the test case bellow:

	...
	fd = open();
	write(fd, buf, 4096);
	preadv64(fd, &iovec, 1, 0xffffffff000);
	ftruncate(fd, 0);
	...

Then ftruncate() will not return forever.

The filesystem used in this case is ubifs, but it can be triggered on
many other filesystems.

When preadv64() is called with offset=0xffffffff000, a page with
index=0xffffffff will be added to the radix tree of ->mapping.  Then
this page can be found in ->mapping with pagevec_lookup().  After that,
truncate_inode_pages_range(), which is called in ftruncate(), will fall
into an infinite loop:

 - find a page with index=0xffffffff, since index>=end, this page won't
   be truncated

 - index++, and index become 0

 - the page with index=0xffffffff will be found again

The data type of index is unsigned long, so index won't overflow to 0 on
64 bits architecture in this case, and the dead loop won't happen.

Since truncate_inode_pages_range() is executed with holding lock of
inode->i_rwsem, any operation related with this lock will be blocked,
and a hung task will happen, e.g.:

  INFO: task truncate_test:3364 blocked for more than 120 seconds.
  ...
     call_rwsem_down_write_failed+0x17/0x30
     generic_file_write_iter+0x32/0x1c0
     ubifs_write_iter+0xcc/0x170
     __vfs_write+0xc4/0x120
     vfs_write+0xb2/0x1b0
     SyS_write+0x46/0xa0

The page with index=0xffffffff added to ->mapping is useless.  Fix this
by checking the read position before allocating pages.

Link: http://lkml.kernel.org/r/1475151010-40166-1-git-send-email-fangwei1@huawei.com
Signed-off-by: Wei Fang <fangwei1@huawei.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Yisheng Xie 461a718432 mm/hugetlb: introduce ARCH_HAS_GIGANTIC_PAGE
Avoid making ifdef get pretty unwieldy if many ARCHs support gigantic
page.  No functional change with this patch.

Link: http://lkml.kernel.org/r/1475227569-63446-2-git-send-email-xieyisheng1@huawei.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Michal Hocko 82e7d3abec oom: print nodemask in the oom report
We have received a hard to explain oom report from a customer.  The oom
triggered regardless there is a lot of free memory:

  PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
  PoolThread cpuset=/ mems_allowed=0-7
  Pid: 30055, comm: PoolThread Tainted: G           E X 3.0.101-80-default #1
  Call Trace:
    dump_trace+0x75/0x300
    dump_stack+0x69/0x6f
    dump_header+0x8e/0x110
    oom_kill_process+0xa6/0x350
    out_of_memory+0x2b7/0x310
    __alloc_pages_slowpath+0x7dd/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_anonymous_page+0x13e/0x300
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30
  [...]
  active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
   active_file:13093 inactive_file:222506 isolated_file:0
   unevictable:262144 dirty:2 writeback:0 unstable:0
   free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
   mapped:261139 shmem:166297 pagetables:2228282 bounce:0
  [...]
  Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 2892 775542 775542
  Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 772650 772650
  Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
  lowmem_reserve[]: 0 0 0 0
  Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0
  Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
  lowmem_reserve[]: 0 0 0 0

So all nodes but Node 0 have a lot of free memory which should suggest
that there is an available memory especially when mems_allowed=0-7.  One
could speculate that a massive process has managed to terminate and free
up a lot of memory while racing with the above allocation request.
Although this is highly unlikely it cannot be ruled out.

A further debugging, however shown that the faulting process had
mempolicy (not cpuset) to bind to Node 0.  We cannot see that
information from the report though.  mems_allowed turned out to be more
confusing than really helpful.

Fix this by always priting the nodemask.  It is either mempolicy mask
(and non-null) or the one defined by the cpusets.  The new output for
the above oom report would be

  PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0

This patch doesn't touch show_mem and the node filtering based on the
cpuset node mask because mempolicy is always a subset of cpusets and
seeing the full cpuset oom context might be helpful for tunning more
specific mempolicies inside cpusets (e.g.  when they turn out to be too
restrictive).  To prevent from ugly ifdefs the mask is printed even for
!NUMA configurations but this should be OK (a single node will be
printed).

Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Sellami Abdelkader <abdelkader.sellami@sap.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Sellami Abdelkader <abdelkader.sellami@sap.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Kirill A. Shutemov 9996f05eac mm: clarify why we avoid page_mapcount() for slab pages in dump_page()
Let's add comment on why we skip page_mapcount() for sl[aou]b pages.

Link: http://lkml.kernel.org/r/20160922105532.GB24593@node
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Andrea Arcangeli 8f26e0b176 mm: vma_merge: correct false positive from __vma_unlink->validate_mm_rb
The old code was always doing:

   vma->vm_end = next->vm_end
   vma_rb_erase(next) // in __vma_unlink
   vma->vm_next = next->vm_next // in __vma_unlink
   next = vma->vm_next
   vma_gap_update(next)

The new code still does the above for remove_next == 1 and 2, but for
remove_next == 3 it has been changed and it does:

   next->vm_start = vma->vm_start
   vma_rb_erase(vma) // in __vma_unlink
   vma_gap_update(next)

In the latter case, while unlinking "vma", validate_mm_rb() is told to
ignore "vma" that is being removed, but next->vm_start was reduced
instead. So for the new case, to avoid the false positive from
validate_mm_rb, it should be "next" that is ignored when "vma" is
being unlinked.

"vma" and "next" in the above comment, considered pre-swap().

Link: http://lkml.kernel.org/r/1474492522-2261-4-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Tested-by: Shaun Tancheff <shaun.tancheff@seagate.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Andrea Arcangeli 86d12e471d mm: vma_adjust: minor comment correction
The cases are three not two.

Link: http://lkml.kernel.org/r/1474492522-2261-3-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Andrea Arcangeli 97a42cd439 mm: vma_adjust: remove superfluous check for next not NULL
If next would be NULL we couldn't reach such code path.

Link: http://lkml.kernel.org/r/1474309513-20313-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Andrea Arcangeli e86f15ee64 mm: vma_merge: fix vm_page_prot SMP race condition against rmap_walk
The rmap_walk can access vm_page_prot (and potentially vm_flags in the
pte/pmd manipulations).  So it's not safe to wait the caller to update
the vm_page_prot/vm_flags after vma_merge returned potentially removing
the "next" vma and extending the "current" vma over the
next->vm_start,vm_end range, but still with the "current" vma
vm_page_prot, after releasing the rmap locks.

The vm_page_prot/vm_flags must be transferred from the "next" vma to the
current vma while vma_merge still holds the rmap locks.

The side effect of this race condition is pte corruption during migrate
as remove_migration_ptes when run on a address of the "next" vma that
got removed, used the vm_page_prot of the current vma.

  migrate   	      	        mprotect
  ------------			-------------
  migrating in "next" vma
				vma_merge() # removes "next" vma and
			        	    # extends "current" vma
					    # current vma is not with
					    # vm_page_prot updated
  remove_migration_ptes
  read vm_page_prot of current "vma"
  establish pte with wrong permissions
				vm_set_page_prot(vma) # too late!
				change_protection in the old vma range
				only, next range is not updated

This caused segmentation faults and potentially memory corruption in
heavy mprotect loads with some light page migration caused by compaction
in the background.

Hugh Dickins pointed out the comment about the Odd case 8 in vma_merge
which confirms the case 8 is only buggy one where the race can trigger,
in all other vma_merge cases the above cannot happen.

This fix removes the oddness factor from case 8 and it converts it from:

      AAAA
  PPPPNNNNXXXX -> PPPPNNNNNNNN

to:

      AAAA
  PPPPNNNNXXXX -> PPPPXXXXXXXX

XXXX has the right vma properties for the whole merged vma returned by
vma_adjust, so it solves the problem fully.  It has the added benefits
that the callers could stop updating vma properties when vma_merge
succeeds however the callers are not updated by this patch (there are
bits like VM_SOFTDIRTY that still need special care for the whole range,
as the vma merging ignores them, but as long as they're not processed by
rmap walks and instead they're accessed with the mmap_sem at least for
reading, they are fine not to be updated within vma_adjust before
releasing the rmap_locks).

Link: http://lkml.kernel.org/r/1474309513-20313-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Aditya Mandaleeka <adityam@microsoft.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Andrea Arcangeli fb8c41e9ad mm: vma_adjust: remove superfluous confusing update in remove_next == 1 case
mm->highest_vm_end doesn't need any update.

After finally removing the oddness from vma_merge case 8 that was
causing:

1) constant risk of trouble whenever anybody would check vma fields
   from rmap_walks, like it happened when page migration was
   introduced and it read the vma->vm_page_prot from a rmap_walk

2) the callers of vma_merge to re-initialize any value different from
   the current vma, instead of vma_merge() more reliably returning a
   vma that already matches all fields passed as parameter

.. it is also worth to take the opportunity of cleaning up superfluous
code in vma_adjust(), that if not removed adds up to the hard
readability of the function.

Link: http://lkml.kernel.org/r/1474492522-2261-5-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Andrea Arcangeli 6d2329f887 mm: vm_page_prot: update with WRITE_ONCE/READ_ONCE
vma->vm_page_prot is read lockless from the rmap_walk, it may be updated
concurrently and this prevents the risk of reading intermediate values.

Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jan Vorlicek <janvorli@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
zhong jiang 6213055f2c mm,ksm: add __GFP_HIGH to the allocation in alloc_stable_node()
According to Hugh's suggestion, alloc_stable_node() with GFP_KERNEL can
in rare cases cause a hung task warning.

At present, if alloc_stable_node() allocation fails, two break_cows may
want to allocate a couple of pages, and the issue will come up when free
memory is under pressure.

We fix it by adding __GFP_HIGH to GFP, to grant access to memory
reserves, increasing the likelihood of allocation success.

[akpm@linux-foundation.org: tweak comment]
Link: http://lkml.kernel.org/r/1474354484-58233-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Yisheng Xie ac34dcd263 mm/page_isolation: fix typo: "paes" -> "pages"
Fix typo in comment.

Link: http://lkml.kernel.org/r/1474788764-5774-1-git-send-email-ysxie@foxmail.com
Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Gerald Schaefer eb03aa0085 mm/hugetlb: improve locking in dissolve_free_huge_pages()
For every pfn aligned to minimum_order, dissolve_free_huge_pages() will
call dissolve_free_huge_page() which takes the hugetlb spinlock, even if
the page is not huge at all or a hugepage that is in-use.

Improve this by doing the PageHuge() and page_count() checks already in
dissolve_free_huge_pages() before calling dissolve_free_huge_page().  In
dissolve_free_huge_page(), when holding the spinlock, those checks need
to be revalidated.

Link: http://lkml.kernel.org/r/20160926172811.94033-4-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Gerald Schaefer 082d5b6b60 mm/hugetlb: check for reserved hugepages during memory offline
In dissolve_free_huge_pages(), free hugepages will be dissolved without
making sure that there are enough of them left to satisfy hugepage
reservations.

Fix this by adding a return value to dissolve_free_huge_pages() and
checking h->free_huge_pages vs.  h->resv_huge_pages.  Note that this may
lead to the situation where dissolve_free_huge_page() returns an error
and all free hugepages that were dissolved before that error are lost,
while the memory block still cannot be set offline.

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Gerald Schaefer 2247bb335a mm/hugetlb: fix memory offline with hugepage size > memory block size
Patch series "mm/hugetlb: memory offline issues with hugepages", v4.

This addresses several issues with hugepages and memory offline.  While
the first patch fixes a panic, and is therefore rather important, the
last patch is just a performance optimization.

The second patch fixes a theoretical issue with reserved hugepages,
while still leaving some ugly usability issue, see description.

This patch (of 3):

dissolve_free_huge_pages() will either run into the VM_BUG_ON() or a
list corruption and addressing exception when trying to set a memory
block offline that is part (but not the first part) of a "gigantic"
hugetlb page with a size > memory block size.

When no other smaller hugetlb page sizes are present, the VM_BUG_ON()
will trigger directly.  In the other case we will run into an addressing
exception later, because dissolve_free_huge_page() will not work on the
head page of the compound hugetlb page which will result in a NULL
hstate from page_hstate().

To fix this, first remove the VM_BUG_ON() because it is wrong, and then
use the compound head page in dissolve_free_huge_page().  This means
that an unused pre-allocated gigantic page that has any part of itself
inside the memory block that is going offline will be dissolved
completely.  Losing an unused gigantic hugepage is preferable to failing
the memory offline, for example in the situation where a (possibly
faulty) memory DIMM needs to go offline.

Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/20160926172811.94033-2-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Rui Teng <rui.teng@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Wanlong Gao 914a051654 mm: nobootmem: move the comment of free_all_bootmem
Commit b4def3509d ("mm, nobootmem: clean-up of free_low_memory_core_early()")
removed the unnecessary nodeid argument, after that, this comment
becomes more confused.  We should move it to the right place.

Fixes: b4def3509d ("mm, nobootmem: clean-up of free_low_memory_core_early()")
Link: http://lkml.kernel.org/r/1473996082-14603-1-git-send-email-wanlong.gao@gmail.com
Signed-off-by: Wanlong Gao <wanlong.gao@gmail.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Rasmus Villemoes 19938e350a mm/shmem.c: constify anon_ops
Every other dentry_operations instance is const, and this one might as
well be.

Link: http://lkml.kernel.org/r/1473890528-7009-1-git-send-email-linux@rasmusvillemoes.dk
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Johannes Weiner 2d75807383 mm: memcontrol: consolidate cgroup socket tracking
The cgroup core and the memory controller need to track socket ownership
for different purposes, but the tracking sites being entirely different
is kind of ugly.

Be a better citizen and rename the memory controller callbacks to match
the cgroup core callbacks, then move them to the same place.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160914194846.11153-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Andrew Morton cc30c5d646 mm/page_io.c: replace some BUG_ON()s with VM_BUG_ON_PAGE()
So they are CONFIG_DEBUG_VM-only and more informative.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David S. Miller <davem@davemloft.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Joe Perches <joe@perches.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rik van Riel <riel@redhat.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Tetsuo Handa a104808e21 mm: don't emit warning from pagefault_out_of_memory()
Commit c32b3cbe0d ("oom, PM: make OOM detection in the freezer path
raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
to warn when we raced with disabling the OOM killer.

Now, patch "oom, suspend: fix oom_killer_disable vs.  pm suspend
properly" introduced a timeout for oom_killer_disable().  Even if we
raced with disabling the OOM killer and the system is OOM livelocked,
the OOM killer will be enabled eventually (in 20 seconds by default) and
the OOM livelock will be solved.  Therefore, we no longer need to warn
when we raced with disabling the OOM killer.

Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Vlastimil Babka 2031142028 mm, compaction: restrict fragindex to costly orders
Fragmentation index and the vm.extfrag_threshold sysctl is meant as a
heuristic to prevent excessive compaction for costly orders (i.e.  THP).
It's unlikely to make any difference for non-costly orders, especially
with the default threshold.  But we cannot afford any uncertainty for
the non-costly orders where the only alternative to successful
reclaim/compaction is OOM.  After the recent patches we are guaranteed
maximum effort without heuristics from compaction before deciding OOM,
and fragindex is the last remaining heuristic.  Therefore skip fragindex
altogether for non-costly orders.

Suggested-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Vlastimil Babka cc5c9f098f mm, compaction: ignore fragindex from compaction_zonelist_suitable()
The compaction_zonelist_suitable() function tries to determine if
compaction will be able to proceed after sufficient reclaim, i.e.
whether there are enough reclaimable pages to provide enough order-0
freepages for compaction.

This addition of reclaimable pages to the free pages works well for the
order-0 watermark check, but in the fragmentation index check we only
consider truly free pages.  Thus we can get fragindex value close to 0
which indicates failure do to lack of memory, and wrongly decide that
compaction won't be suitable even after reclaim.

Instead of trying to somehow adjust fragindex for reclaimable pages,
let's just skip it from compaction_zonelist_suitable().

Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Vlastimil Babka 423b452e15 mm, page_alloc: pull no_progress_loops update to should_reclaim_retry()
The should_reclaim_retry() makes decisions based on no_progress_loops,
so it makes sense to also update the counter there.  It will be also
consistent with should_compact_retry() and compaction_retries.  No
functional change.

[hillf.zj@alibaba-inc.com: fix missing pointer dereferences]
Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Vlastimil Babka 9f7e338793 mm, compaction: make full priority ignore pageblock suitability
Several people have reported premature OOMs for order-2 allocations
(stack) due to OOM rework in 4.7.  In the scenario (parallel kernel
build and dd writing to two drives) many pageblocks get marked as
Unmovable and compaction free scanner struggles to isolate free pages.
Joonsoo Kim pointed out that the free scanner skips pageblocks that are
not movable to prevent filling them and forcing non-movable allocations
to fallback to other pageblocks.  Such heuristic makes sense to help
prevent long-term fragmentation, but premature OOMs are relatively more
urgent problem.  As a compromise, this patch disables the heuristic only
for the ultimate compaction priority.

Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
Reported-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
Reported-by: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Reported-by: Olaf Hering <olaf@aepfle.de>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Vlastimil Babka c2033b00db mm, compaction: restrict full priority to non-costly orders
The new ultimate compaction priority disables some heuristics, which may
result in excessive cost.  This is fine for non-costly orders where we
want to try hard before resulting for OOM, but might be disruptive for
costly orders which do not trigger OOM and should generally have some
fallback.  Thus, we disable the full priority for costly orders.

Suggested-by: Michal Hocko <mhocko@kernel.org>
Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:29 -07:00
Vlastimil Babka d943649831 mm, compaction: more reliably increase direct compaction priority
During reclaim/compaction loop, compaction priority can be increased by
the should_compact_retry() function, but the current code is not
optimal.  Priority is only increased when compaction_failed() is true,
which means that compaction has scanned the whole zone.  This may not
happen even after multiple attempts with a lower priority due to
parallel activity, so we might needlessly struggle on the lower
priorities and possibly run out of compaction retry attempts in the
process.

After this patch we are guaranteed at least one attempt at the highest
compaction priority even if we exhaust all retries at the lower
priorities.

Link: http://lkml.kernel.org/r/20160906135258.18335-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Vlastimil Babka 3250845d05 Revert "mm, oom: prevent premature OOM killer invocation for high order request"
Patch series "reintroduce compaction feedback for OOM decisions".

After several people reported OOM's for order-2 allocations in 4.7 due
to Michal Hocko's OOM rework, he reverted the part that considered
compaction feedback [1] in the decisions to retry reclaim/compaction.
This was to provide a fix quickly for 4.8 rc and 4.7 stable series,
while mmotm had an almost complete solution that instead improved
compaction reliability.

This series completes the mmotm solution and reintroduces the compaction
feedback into OOM decisions.  The first two patches restore the state of
mmotm before the temporary solution was merged, the last patch should be
the missing piece for reliability.  The third patch restricts the
hardened compaction to non-costly orders, since costly orders don't
result in OOMs in the first place.

[1] http://marc.info/?i=20160822093249.GA14916%40dhcp22.suse.cz%3E

This patch (of 4):

Commit 6b4e3181d7 ("mm, oom: prevent premature OOM killer invocation
for high order request") was intended as a quick fix of OOM regressions
for 4.8 and stable 4.7.x kernels.  For a better long-term solution, we
still want to consider compaction feedback, which should be possible
after some more improvements in the following patches.

This reverts commit 6b4e3181d7.

Link: http://lkml.kernel.org/r/20160906135258.18335-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Huang Ying f6ab1f7f6b mm, swap: use offset of swap entry as key of swap cache
This patch is to improve the performance of swap cache operations when
the type of the swap device is not 0.  Originally, the whole swap entry
value is used as the key of the swap cache, even though there is one
radix tree for each swap device.  If the type of the swap device is not
0, the height of the radix tree of the swap cache will be increased
unnecessary, especially on 64bit architecture.  For example, for a 1GB
swap device on the x86_64 architecture, the height of the radix tree of
the swap cache is 11.  But if the offset of the swap entry is used as
the key of the swap cache, the height of the radix tree of the swap
cache is 4.  The increased height causes unnecessary radix tree
descending and increased cache footprint.

This patch reduces the height of the radix tree of the swap cache via
using the offset of the swap entry instead of the whole swap entry value
as the key of the swap cache.  In 32 processes sequential swap out test
case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
when the type of the swap device is 1.

Use the whole swap entry as key,

  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

Use the swap offset as key,

  perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
  perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,

Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Dan Williams 87744ab383 mm: fix cache mode tracking in vm_insert_mixed()
vm_insert_mixed() unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
fails to check the pgprot_t it uses for the mapping against the one
recorded in the memtype tracking tree.  Add the missing call to
track_pfn_insert() to preclude cases where incompatible aliased mappings
are established for a given physical address range.

Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
James Morse 0247f3f4d7 mm/memcontrol.c: make the walk_page_range() limit obvious
mem_cgroup_count_precharge() and mem_cgroup_move_charge() both call
walk_page_range() on the range 0 to ~0UL, neither provide a pte_hole
callback, which causes the current implementation to skip non-vma
regions.  This is all fine but follow up changes would like to make
walk_page_range more generic so it is better to be explicit about which
range to traverse so let's use highest_vm_end to explicitly traverse
only user mmaped memory.

[mhocko@kernel.org: rewrote changelog]
Link: http://lkml.kernel.org/r/1472655897-22532-1-git-send-email-james.morse@arm.com
Signed-off-by: James Morse <james.morse@arm.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Aaron Lu 6fcb52a56f thp: reduce usage of huge zero page's atomic counter
The global zero page is used to satisfy an anonymous read fault.  If
THP(Transparent HugePage) is enabled then the global huge zero page is
used.  The global huge zero page uses an atomic counter for reference
counting and is allocated/freed dynamically according to its counter
value.

CPU time spent on that counter will greatly increase if there are a lot
of processes doing anonymous read faults.  This patch proposes a way to
reduce the access to the global counter so that the CPU load can be
reduced accordingly.

To do this, a new flag of the mm_struct is introduced:
MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only need to touch
the global counter in two cases:

 1 The first time it uses the global huge zero page;
 2 The time when mm_user of its mm_struct reaches zero.

Note that right now, the huge zero page is eligible to be freed as soon
as its last use goes away.  With this patch, the page will not be
eligible to be freed until the exit of the last process from which it
was ever used.

And with the use of mm_user, the kthread is not eligible to use huge
zero page either.  Since no kthread is using huge zero page today, there
is no difference after applying this patch.  But if that is not desired,
I can change it to when mm_count reaches zero.

Case used for test on Haswell EP:

  usemem -n 72 --readonly -j 0x200000 100G

Which spawns 72 processes and each will mmap 100G anonymous space and
then do read only access to that space sequentially with a step of 2MB.

  CPU cycles from perf report for base commit:
      54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
  CPU cycles from perf report for this commit:
       0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page

Performance(throughput) of the workload for base commit: 1784430792
Performance(throughput) of the workload for this commit: 4726928591
164% increase.

Runtime of the workload for base commit: 707592 us
Runtime of the workload for this commit: 303970 us
50% drop.

Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Tim Chen 03e86dba5b cpu: fix node state for whether it contains CPU
In current kernel code, we only call node_set_state(cpu_to_node(cpu),
N_CPU) when a cpu is hot plugged.  But we do not set the node state for
N_CPU when the cpus are brought online during boot.

So this could lead to failure when we check to see if a node contains
cpu with node_state(node_id, N_CPU).

One use case is in the node_reclaime function:

        /*
         * Only run node reclaim on the local node or on nodes that do
         * not
         * have associated processors. This will favor the local
         * processor
         * over remote processors and spread off node memory allocations
         * as wide as possible.
         */
        if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id !=
                numa_node_id())
                return NODE_RECLAIM_NOSCAN;

I instrumented the kernel to call this function after boot and it always
returns 0 on a x86 desktop machine until I apply the attached patch.

   int num_cpu_node(void)
   {
       int i, nr_cpu_nodes = 0;

       for_each_node(i) {
               if (node_state(i, N_CPU))
                       ++ nr_cpu_nodes;
       }

       return nr_cpu_nodes;
   }

Fix this by checking each node for online CPU when we initialize
vmstat that's responsible for maintaining node state.

Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: <Huang@linux.intel.com>
Cc: Ying <ying.huang@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Toshi Kani 74d2fad133 thp, dax: add thp_get_unmapped_area for pmd mappings
When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
This feature relies on both mmap virtual address and FS block (i.e.
physical address) to be aligned by the pmd page size.  Users can use
mkfs options to specify FS to align block allocations.  However,
aligning mmap address requires code changes to existing applications for
providing a pmd-aligned address to mmap().

For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
It calls mmap() with a NULL address, which needs to be changed to
provide a pmd-aligned address for testing with DAX pmd mappings.
Changing all applications that call mmap() with NULL is undesirable.

Add thp_get_unmapped_area(), which can be called by filesystem's
get_unmapped_area to align an mmap address by the pmd size for a DAX
file.  It calls the default handler, mm->get_unmapped_area(), to find a
range and then aligns it for a DAX file.

The patch is based on Matthew Wilcox's change that allows adding support
of the pud page size easily.

[1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.com
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Simon Guo b155b4fde5 mm: mlock: avoid increase mm->locked_vm on mlock() when already mlock2(,MLOCK_ONFAULT)
When one vma was with flag VM_LOCKED|VM_LOCKONFAULT (by invoking
mlock2(,MLOCK_ONFAULT)), it can again be populated with mlock() with
VM_LOCKED flag only.

There is a hole in mlock_fixup() which increase mm->locked_vm twice even
the two operations are on the same vma and both with VM_LOCKED flags.

The issue can be reproduced by following code:

  mlock2(p, 1024 * 64, MLOCK_ONFAULT); //VM_LOCKED|VM_LOCKONFAULT
  mlock(p, 1024 * 64);  //VM_LOCKED

Then check the increase VmLck field in /proc/pid/status(to 128k).

When vma is set with different vm_flags, and the new vm_flags is with
VM_LOCKED, it is not necessarily be a "new locked" vma.  This patch
corrects this bug by prevent mm->locked_vm from increment when old
vm_flags is already VM_LOCKED.

Link: http://lkml.kernel.org/r/1472554781-9835-3-git-send-email-wei.guo.simon@gmail.com
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alexey Klimov <klimov.linux@gmail.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Simon Guo <wei.guo.simon@gmail.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Simon Guo 0cf2f6f6dc mm: mlock: check against vma for actual mlock() size
In do_mlock(), the check against locked memory limitation has a hole
which will fail following cases at step 3):

 1) User has a memory chunk from addressA with 50k, and user mem lock
    rlimit is 64k.
 2) mlock(addressA, 30k)
 3) mlock(addressA, 40k)

The 3rd step should have been allowed since the 40k request is
intersected with the previous 30k at step 2), and the 3rd step is
actually for mlock on the extra 10k memory.

This patch checks vma to caculate the actual "new" mlock size, if
necessary, and ajust the logic to fix this issue.

[akpm@linux-foundation.org: clean up comment layout]
[wei.guo.simon@gmail.com: correct a typo in count_mm_mlocked_page_nr()]
 Link: http://lkml.kernel.org/r/1473325970-11393-2-git-send-email-wei.guo.simon@gmail.com
Link: http://lkml.kernel.org/r/1472554781-9835-2-git-send-email-wei.guo.simon@gmail.com
Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Cc: Alexey Klimov <klimov.linux@gmail.com>
Cc: Eric B Munson <emunson@akamai.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Simon Guo <wei.guo.simon@gmail.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Michal Hocko 9254990fb9 oom: warn if we go OOM for higher order and compaction is disabled
Since the lumpy reclaim is gone there is no source of higher order pages
if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is
unreliable for that purpose to say the least.  Hitting an OOM for
!costly higher order requests is therefore all not that hard to imagine.
We are trying hard to not invoke OOM killer as much as possible but
there is simply no reliable way to detect whether more reclaim retries
make sense.

Disabling COMPACTION is not widespread but it seems that some users
might have disable the feature without realizing full consequences
(mostly along with disabling THP because compaction used to be THP
mainly thing).  This patch just adds a note if the OOM killer was
triggered by higher order request with compaction disabled.  This will
help us identifying possible misconfiguration right from the oom report
which is easier than to always keep in mind that somebody might have
disabled COMPACTION without a good reason.

Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Huang Ying 371a096edf mm: don't use radix tree writeback tags for pages in swap cache
File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
etc.) to accelerate finding the pages with a specific tag in the radix
tree during inode writeback.  But for anonymous pages in the swap cache,
there is no inode writeback.  So there is no need to find the pages with
some writeback tags in the radix tree.  It is not necessary to touch
radix tree writeback tags for pages in the swap cache.

Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
introduced for address spaces which don't need to update the writeback
tags.  The flag is set for swap caches.  It may be used for DAX file
systems, etc.

With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
~1.48GBps) in the vm-scalability swap-w-seq test case with 8 processes.
The test is done on a Xeon E5 v3 system.  The swap device used is a RAM
simulated PMEM (persistent memory) device.  The improvement comes from
the reduced contention on the swap cache radix tree lock.  To test
sequential swapping out, the test case uses 8 processes, which
sequentially allocate and write to the anonymous pages until RAM and
part of the swap device is used up.

Details of comparison is as follow,

base             base+patch
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
   1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
     10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
     10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
     10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
     10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page

Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tejun Heo <tj@kernel.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
zijun_hu 1d8bf926f8 mm/bootmem.c: replace kzalloc() by kzalloc_node()
In ___alloc_bootmem_node_nopanic(), replace kzalloc() by kzalloc_node()
in order to allocate memory within given node preferentially when slab
is available

Link: http://lkml.kernel.org/r/1f487f12-6af4-5e4f-a28c-1de2361cdcd8@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
zijun_hu 2382705f22 mm/nobootmem.c: remove duplicate macro ARCH_LOW_ADDRESS_LIMIT statements
Fix the following bugs:

 - the same ARCH_LOW_ADDRESS_LIMIT statements are duplicated between
   header and relevant source

 - don't ensure ARCH_LOW_ADDRESS_LIMIT perhaps defined by ARCH in
   asm/processor.h is preferred over default in linux/bootmem.h
   completely since the former header isn't included by the latter

Link: http://lkml.kernel.org/r/e046aeaa-e160-6d9e-dc1b-e084c2fd999f@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Srikar Dronamraju 8907de5dc6 mm/memblock.c: expose total reserved memory
The total reserved memory in a system is accounted but not available for
use use outside mm/memblock.c.  By exposing the total reserved memory,
systems can better calculate the size of large hashes.

Link: http://lkml.kernel.org/r/1472476010-4709-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Srikar Dronamraju f6f34b4387 mm: introduce arch_reserved_kernel_pages()
Currently arch specific code can reserve memory blocks but
alloc_large_system_hash() may not take it into consideration when sizing
the hashes.  This can lead to bigger hash than required and lead to no
available memory for other purposes.  This is specifically true for
systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.

One approach to solve this problem would be to walk through the memblock
regions and calculate the available memory and base the size of hash
system on the available memory.

The other approach would be to depend on the architecture to provide the
number of pages that are reserved.  This change provides hooks to allow
the architecture to provide the required info.

Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Hari Bathini <hbathini@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Aneesh Kumar K.V c9634cf012 mm: use zonelist name instead of using hardcoded index
Use the existing enums instead of hardcoded index when looking at the
zonelist.  This makes it more readable.  No functionality change by this
patch.

Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Michal Hocko 1b51e65eab oom, oom_reaper: allow to reap mm shared by the kthreads
oom reaper was skipped for an mm which is shared with the kernel thread
(aka use_mm()).  The primary concern was that such a kthread might want
to read from the userspace memory and see zero page as a result of the
oom reaper action.  This is no longer a problem after "mm: make sure
that kthreads will not refault oom reaped memory" because any attempt to
fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the
target user should see an error.  This means that we can finally allow
oom reaper also to tasks which share their mm with kthreads.

Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Michal Hocko 3f70dc38ce mm: make sure that kthreads will not refault oom reaped memory
There are only few use_mm() users in the kernel right now.  Most of them
write to the target memory but vhost driver relies on
copy_from_user/get_user from a kernel thread context.  This makes it
impossible to reap the memory of an oom victim which shares the mm with
the vhost kernel thread because it could see a zero page unexpectedly
and theoretically make an incorrect decision visible outside of the
killed task context.

To quote Michael S. Tsirkin:
: Getting an error from __get_user and friends is handled gracefully.
: Getting zero instead of a real value will cause userspace
: memory corruption.

The vhost kernel thread is bound to an open fd of the vhost device which
is not tight to the mm owner life cycle in general.  The device fd can
be inherited or passed over to another process which means that we
really have to be careful about unexpected memory corruption because
unlike for normal oom victims the result will be visible outside of the
oom victim context.

Make sure that no kthread context (users of use_mm) can ever see
corrupted data because of the oom reaper and hook into the page fault
path by checking MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set the
flag before it starts unmapping the address space while the flag is
checked after the page fault has been handled.  If the flag is set then
SIGBUS is triggered so any g-u-p user will get a error code.

Regular tasks do not need this protection because all which share the mm
are killed when the mm is reaped and so the corruption will not outlive
them.

This patch shouldn't have any visible effect at this moment because the
OOM killer doesn't invoke oom reaper for tasks with mm shared with
kthreads yet.

Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Tetsuo Handa 38531201c1 mm, oom: enforce exit_oom_victim on current task
There are no users of exit_oom_victim on !current task anymore so enforce
the API to always work on the current.

Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:28 -07:00
Michal Hocko 7d2e7a22cf oom, suspend: fix oom_killer_disable vs. pm suspend properly
Commit 7407054209 ("oom, suspend: fix oom_reaper vs.
oom_killer_disable race") has workaround an existing race between
oom_killer_disable and oom_reaper by adding another round of
try_to_freeze_tasks after the oom killer was disabled.  This was the
easiest thing to do for a late 4.7 fix.  Let's fix it properly now.

After "oom: keep mm of the killed task available" we no longer have to
call exit_oom_victim from the oom reaper because we have stable mm
available and hide the oom_reaped mm by MMF_OOM_SKIP flag.  So let's
remove exit_oom_victim and the race described in the above commit
doesn't exist anymore if.

Unfortunately this alone is not sufficient for the oom_killer_disable
usecase because now we do not have any reliable way to reach
exit_oom_victim (the victim might get stuck on a way to exit for an
unbounded amount of time).  OOM killer can cope with that by checking mm
flags and move on to another victim but we cannot do the same for
oom_killer_disable as we would lose the guarantee of no further
interference of the victim with the rest of the system.  What we can do
instead is to cap the maximum time the oom_killer_disable waits for
victims.  The only current user of this function (pm suspend) already
has a concept of timeout for back off so we can reuse the same value
there.

Let's drop set_freezable for the oom_reaper kthread because it is no
longer needed as the reaper doesn't wake or thaw any processes.

Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko 862e3073b3 mm, oom: get rid of signal_struct::oom_victims
After "oom: keep mm of the killed task available" we can safely detect
an oom victim by checking task->signal->oom_mm so we do not need the
signal_struct counter anymore so let's get rid of it.

This alone wouldn't be sufficient for nommu archs because
exit_oom_victim doesn't hide the process from the oom killer anymore.
We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko 26db62f179 oom: keep mm of the killed task available
oom_reap_task has to call exit_oom_victim in order to make sure that the
oom vicim will not block the oom killer for ever.  This is, however,
opening new problems (e.g oom_killer_disable exclusion - see commit
7407054209 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
race")).  exit_oom_victim should be only called from the victim's
context ideally.

One way to achieve this would be to rely on per mm_struct flags.  We
already have MMF_OOM_REAPED to hide a task from the oom killer since
"mm, oom: hide mm which is shared with kthread or global init". The
problem is that the exit path:

  do_exit
    exit_mm
      tsk->mm = NULL;
      mmput
        __mmput
      exit_oom_victim

doesn't guarantee that exit_oom_victim will get called in a bounded
amount of time.  At least exit_aio depends on IO which might get blocked
due to lack of memory and who knows what else is lurking there.

This patch takes a different approach.  We remember tsk->mm into the
signal_struct and bind it to the signal struct life time for all oom
victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
have to rely on find_lock_task_mm anymore and they will have a reliable
reference to the mm struct.  As a result all the oom specific
communication inside the OOM killer can be done via tsk->signal->oom_mm.

Increasing the signal_struct for something as unlikely as the oom killer
is far from ideal but this approach will make the code much more
reasonable and long term we even might want to move task->mm into the
signal_struct anyway.  In the next step we might want to make the oom
killer exclusion and access to memory reserves completely independent
which would be also nice.

Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Tetsuo Handa 8496afaba9 mm,oom_reaper: do not attempt to reap a task twice
"mm, oom_reaper: do not attempt to reap a task twice" tried to give the
OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag.
But the usefulness of the flag is rather limited and actually never
shown in practice.  If the flag is set, it means that the holder of
mm->mmap_sem cannot call up_write() due to presumably being blocked at
unkillable wait waiting for other thread's memory allocation.  But since
one of threads sharing that mm will queue that mm immediately via
task_will_free_mem() shortcut (otherwise, oom_badness() will select the
same mm again due to oom_score_adj value unchanged), retrying
MMF_OOM_NOT_REAPABLE mm is unlikely helpful.

Let's always set MMF_OOM_REAPED.

Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Tetsuo Handa 7ebffa4555 mm,oom_reaper: reduce find_lock_task_mm() usage
Patch series "fortify oom killer even more", v2.

This patch (of 9):

__oom_reap_task() can be simplified a bit if it receives a valid mm from
oom_reap_task() which also uses that mm when __oom_reap_task() failed.
We can drop one find_lock_task_mm() call and also make the
__oom_reap_task() code flow easier to follow.  Moreover, this will make
later patch in the series easier to review.  Pinning mm's mm_count for
longer time is not really harmful because this will not pin much memory.

This patch doesn't introduce any functional change.

Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Huang Ying 6b53491598 mm, swap: add swap_cluster_list
This is a code clean up patch without functionality changes.  The
swap_cluster_list data structure and its operations are introduced to
provide some better encapsulation for the free cluster and discard
cluster list operations.  This avoid some code duplication, improved the
code readability, and reduced the total line number.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1472067356-16004-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Alexey Dobriyan 131ddc5c7d mm: unrig VMA cache hit ratio
Current code doesn't count first FIND operation after VMA cache flush
(which happen surprisingly often) artificially increasing cache hit ratio.

On my regular setup the difference is:

		Before				After
	==========================================================

	* boot, login into KDE

	vmacache_find_calls 446216	vmacache_find_calls 492741
	vmacache_find_hits 277596	vmacache_find_hits 276096

		~62.2%				~56.0%

	* rebuild kernel (no changes to code, usual config)

	vmacache_find_calls 1943007	vmacache_find_calls 2083718
	vmacache_find_hits 1246123	vmacache_find_hits 1244146

		~64.1%				~59.7%

	* rebuild kernel (full rebuild, usual config)

	vmacache_find_calls 32163155	vmacache_find_calls 33677183
	vmacache_find_hits 27889956	vmacache_find_hits 27877591

		~88.2%				~84.3%

Total: ~4% cache hit ratio.

If someone is counting _relative_ cache _miss_ ratio, misreporting is much
higher.

Link: http://lkml.kernel.org/r/20160822225009.GA3934@p183.telecom.by
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Bart Van Assche c4b209a426 do_generic_file_read(): fail immediately if killed
If a fatal signal has been received, fail immediately instead of trying
to read more data.

If wait_on_page_locked_killable() was interrupted then this page is most
likely is not PageUptodate() and in this case do_generic_file_read()
will fail after lock_page_killable().

See also commit ebded02788 ("mm: filemap: avoid unnecessary calls to
lock_page when waiting for IO to complete during a read")

[oleg@redhat.com: changelog addition]
Link: http://lkml.kernel.org/r/63068e8e-8bee-b208-8441-a3c39a9d9eb6@sandisk.com
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Joonsoo Kim 9300d8dfd2 mm/page_owner: don't define fields on struct page_ext by hard-coding
There is a memory waste problem if we define field on struct page_ext by
hard-coding.  Entry size of struct page_ext includes the size of those
fields even if it is disabled at runtime.  Now, extra memory request at
runtime is possible so page_owner don't need to define it's own fields
by hard-coding.

This patch removes hard-coded define and uses extra memory for storing
page_owner information in page_owner.  Most of code are just mechanical
changes.

Link: http://lkml.kernel.org/r/1471315879-32294-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Joonsoo Kim 980ac1672e mm/page_ext: support extra space allocation by page_ext user
Until now, if some page_ext users want to use it's own field on
page_ext, it should be defined in struct page_ext by hard-coding.  It
has a problem that wastes memory in following situation.

  struct page_ext {
   #ifdef CONFIG_A
  	int a;
   #endif
   #ifdef CONFIG_B
  	int b;
   #endif
  };

Assume that kernel is built with both CONFIG_A and CONFIG_B.  Even if we
enable feature A and doesn't enable feature B at runtime, each entry of
struct page_ext takes two int rather than one int.  It's undesirable
result so this patch tries to fix it.

To solve above problem, this patch implements to support extra space
allocation at runtime.  When need() callback returns true, it's extra
memory requirement is summed to entry size of page_ext.  Also, offset
for each user's extra memory space is returned.  With this offset, user
can use this extra space and there is no need to define needed field on
page_ext by hard-coding.

This patch only implements an infrastructure.  Following patch will use
it for page_owner which is only user having it's own fields on page_ext.

Link: http://lkml.kernel.org/r/1471315879-32294-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Joonsoo Kim 0b06bb3f60 mm/page_ext: rename offset to index
Here, 'offset' means entry index in page_ext array.  Following patch
will use 'offset' for field offset in each entry so rename current
'offset' to prevent confusion.

Link: http://lkml.kernel.org/r/1471315879-32294-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Joonsoo Kim e2f612e673 mm/page_owner: move page_owner specific function to page_owner.c
There is no reason that page_owner specific function resides on
vmstat.c.

Link: http://lkml.kernel.org/r/1471315879-32294-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Joonsoo Kim f1c1e9f7b5 mm/debug_pagealloc.c: don't allocate page_ext if we don't use guard page
What debug_pagealloc does is just mapping/unmapping page table.
Basically, it doesn't need additional memory space to memorize
something.  But, with guard page feature, it requires additional memory
to distinguish if the page is for guard or not.  Guard page is only used
when debug_guardpage_minorder is non-zero so this patch removes
additional memory allocation (page_ext) if debug_guardpage_minorder is
zero.

It saves memory if we just use debug_pagealloc and not guard page.

Link: http://lkml.kernel.org/r/1471315879-32294-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Joonsoo Kim acbc15a4b3 mm/debug_pagealloc.c: clean-up guard page handling code
Patch series "Reduce memory waste by page extension user".

This patchset tries to reduce memory waste by page extension user.

First case is architecture supported debug_pagealloc.  It doesn't
requires additional memory if guard page isn't used.  8 bytes per page
will be saved in this case.

Second case is related to page owner feature.  Until now, if page_ext
users want to use it's own fields on page_ext, fields should be defined
in struct page_ext by hard-coding.  It has a following problem.

  struct page_ext {
   #ifdef CONFIG_A
  	int a;
   #endif
   #ifdef CONFIG_B
	int b;
   #endif
  };

Assume that kernel is built with both CONFIG_A and CONFIG_B.  Even if we
enable feature A and doesn't enable feature B at runtime, each entry of
struct page_ext takes two int rather than one int.  It's undesirable
waste so this patch tries to reduce it.  By this patchset, we can save
20 bytes per page dedicated for page owner feature in some
configurations.

This patch (of 6):

We can make code clean by moving decision condition for set_page_guard()
into set_page_guard() itself.  It will help code readability.  There is
no functional change.

Link: http://lkml.kernel.org/r/1471315879-32294-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko bf48438354 mm, vmscan: get rid of throttle_vm_writeout
throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
excessive pageout activity during the reclaim.  Too many pages could be
put under writeback therefore LRUs would be full of unreclaimable pages
until the IO completes and in turn the OOM killer could be invoked.

There have been some important changes introduced since then in the
reclaim path though.  Writers are throttled by balance_dirty_pages when
initiating the buffered IO and later during the memory pressure, the
direct reclaim is throttled by wait_iff_congested if the node is
considered congested by dirty pages on LRUs and the underlying bdi is
congested by the queued IO.  The kswapd is throttled as well if it
encounters pages marked for immediate reclaim or under writeback which
signals that that there are too many pages under writeback already.
Finally should_reclaim_retry does congestion_wait if the reclaim cannot
make any progress and there are too many dirty/writeback pages.

Another important aspect is that we do not issue any IO from the direct
reclaim context anymore.  In a heavy parallel load this could queue a
lot of IO which would be very scattered and thus unefficient which would
just make the problem worse.

This three mechanisms should throttle and keep the amount of IO in a
steady state even under heavy IO and memory pressure so yet another
throttling point doesn't really seem helpful.  Quite contrary, Mikulas
Patocka has reported that swap backed by dm-crypt doesn't work properly
because the swapout IO cannot make sufficient progress as the writeout
path depends on dm_crypt worker which has to allocate memory to perform
the encryption.  In order to guarantee a forward progress it relies on
the mempool allocator.  mempool_alloc(), however, prefers to use the
underlying (usually page) allocator before it grabs objects from the
pool.  Such an allocation can dive into the memory reclaim and
consequently to throttle_vm_writeout.  If there are too many dirty or
pages under writeback it will get throttled even though it is in fact a
flusher to clear pending pages.

  kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
  Workqueue: kcryptd kcryptd_crypt [dm_crypt]
  Call Trace:
    schedule+0x3c/0x90
    schedule_timeout+0x1d8/0x360
    io_schedule_timeout+0xa4/0x110
    congestion_wait+0x86/0x1f0
    throttle_vm_writeout+0x44/0xd0
    shrink_zone_memcg+0x613/0x720
    shrink_zone+0xe0/0x300
    do_try_to_free_pages+0x1ad/0x450
    try_to_free_pages+0xef/0x300
    __alloc_pages_nodemask+0x879/0x1210
    alloc_pages_current+0xa1/0x1f0
    new_slab+0x2d7/0x6a0
    ___slab_alloc+0x3fb/0x5c0
    __slab_alloc+0x51/0x90
    kmem_cache_alloc+0x27b/0x310
    mempool_alloc_slab+0x1d/0x30
    mempool_alloc+0x91/0x230
    bio_alloc_bioset+0xbd/0x260
    kcryptd_crypt+0x114/0x3b0 [dm_crypt]

Let's just drop throttle_vm_writeout altogether.  It is not very much
helpful anymore.

I have tried to test a potential writeback IO runaway similar to the one
described in the original patch which has introduced that [1].  Small
virtual machine (512MB RAM, 4 CPUs, 2G of swap space and disk image on a
rather slow NFS in a sync mode on the host) with 8 parallel writers each
writing 1G worth of data.  As soon as the pagecache fills up and the
direct reclaim hits then I start anon memory consumer in a loop
(allocating 300M and exiting after populating it) in the background to
make the memory pressure even stronger as well as to disrupt the steady
state for the IO.  The direct reclaim is throttled because of the
congestion as well as kswapd hitting congestion_wait due to nr_immediate
but throttle_vm_writeout doesn't ever trigger the sleep throughout the
test.  Dirty+writeback are close to nr_dirty_threshold with some
fluctuations caused by the anon consumer.

[1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Xishi Qiu e780149bcd mm: fix set pageblock migratetype in deferred struct page init
On x86_64 MAX_ORDER_NR_PAGES is usually 4M, and a pageblock is usually
2M, so we only set one pageblock's migratetype in deferred_free_range()
if pfn is aligned to MAX_ORDER_NR_PAGES.  That means it causes
uninitialized migratetype blocks, you can see from "cat
/proc/pagetypeinfo", almost half blocks are Unmovable.

Also we missed freeing the last block in deferred_init_memmap(), it
causes memory leak.

Fixes: ac5d2539b2 ("mm: meminit: reduce number of times pageblocks are set during struct page init")
Link: http://lkml.kernel.org/r/57A3260F.4050709@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Xishi Qiu e506b99696 mem-hotplug: fix node spanned pages when we have a movable node
Commit 342332e6a9 ("mm/page_alloc.c: introduce kernelcore=mirror
option") rewrote the calculation of node spanned pages.  But when we
have a movable node, the size of node spanned pages is double added.
That's because we have an empty normal zone, the present pages is zero,
but its spanned pages is not zero.

e.g.
    Zone ranges:
      DMA      [mem 0x0000000000001000-0x0000000000ffffff]
      DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
      Normal   [mem 0x0000000100000000-0x0000007c7fffffff]
    Movable zone start for each node
      Node 1: 0x0000001080000000
      Node 2: 0x0000002080000000
      Node 3: 0x0000003080000000
      Node 4: 0x0000003c80000000
      Node 5: 0x0000004c80000000
      Node 6: 0x0000005c80000000
    Early memory node ranges
      node   0: [mem 0x0000000000001000-0x000000000009ffff]
      node   0: [mem 0x0000000000100000-0x000000007552afff]
      node   0: [mem 0x000000007bd46000-0x000000007bd46fff]
      node   0: [mem 0x000000007bdcd000-0x000000007bffffff]
      node   0: [mem 0x0000000100000000-0x000000107fffffff]
      node   1: [mem 0x0000001080000000-0x000000207fffffff]
      node   2: [mem 0x0000002080000000-0x000000307fffffff]
      node   3: [mem 0x0000003080000000-0x0000003c7fffffff]
      node   4: [mem 0x0000003c80000000-0x0000004c7fffffff]
      node   5: [mem 0x0000004c80000000-0x0000005c7fffffff]
      node   6: [mem 0x0000005c80000000-0x0000006c7fffffff]
      node   7: [mem 0x0000006c80000000-0x0000007c7fffffff]

  node1:
    Normal, start=0x1080000, present=0x0, spanned=0x1000000
    Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
    pgdat, start=0x1080000, present=0x1000000, spanned=0x2000000

After this patch, the problem is fixed.

  node1:
    Normal, start=0x0, present=0x0, spanned=0x0
    Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
    pgdat, start=0x1080000, present=0x1000000, spanned=0x1000000

Link: http://lkml.kernel.org/r/57A325E8.6070100@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka fdd4c6149a mm, vmscan: make compaction_ready() more accurate and readable
The compaction_ready() is used during direct reclaim for costly order
allocations to skip reclaim for zones where compaction should be
attempted instead.  It's combining the standard compaction_suitable()
check with its own watermark check based on high watermark with extra
gap, and the result is confusing at best.

This patch attempts to better structure and document the checks
involved.  First, compaction_suitable() can determine that the
allocation should either succeed already, or that compaction doesn't
have enough free pages to proceed.  The third possibility is that
compaction has enough free pages, but we still decide to reclaim first -
unless we are already above the high watermark with gap.  This does not
mean that the reclaim will actually reach this watermark during single
attempt, this is rather an over-reclaim protection.  So document the
code as such.  The check for compaction_deferred() is removed
completely, as it in fact had no proper role here.

The result after this patch is mainly a less confusing code.  We also
skip some over-reclaim in cases where the allocation should already
succed.

Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka 8348faf91f mm, compaction: require only min watermarks for non-costly orders
The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction.  Then __isolate_free_page uses low watermark check to decide
if particular free page can be isolated.  In the latter case, using low
watermark is needlessly pessimistic, as the free page isolations are
only temporary.  For __compaction_suitable() the higher watermark makes
sense for high-order allocations where more freepages increase the
chance of success, and we can typically fail with some order-0 fallback
when the system is struggling to reach that watermark.  But for
low-order allocation, forming the page should not be that hard.  So
using low watermark here might just prevent compaction from even trying,
and eventually lead to OOM killer even if we are above min watermarks.

So after this patch, we use min watermark for non-costly orders in
__compaction_suitable(), and for all orders in __isolate_free_page().

[vbabka@suse.cz: clarify __isolate_free_page() comment]
 Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka 984fdba6a3 mm, compaction: use proper alloc_flags in __compaction_suitable()
The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction.  This check uses direct compactor's alloc_flags, but that's
wrong, since these flags are not applicable for freepage isolation.

For example, alloc_flags may indicate access to memory reserves, making
compaction proceed, and then fail watermark check during the isolation.

A similar problem exists for ALLOC_CMA, which may be part of
alloc_flags, but not during freepage isolation.  In this case however it
makes sense to use ALLOC_CMA both in __compaction_suitable() and
__isolate_free_page(), since there's actually nothing preventing the
freepage scanner to isolate from CMA pageblocks, with the assumption
that a page that could be migrated once by compaction can be migrated
also later by CMA allocation.  Thus we should count pages in CMA
pageblocks when considering compaction suitability and when isolating
freepages.

To sum up, this patch should remove some false positives from
__compaction_suitable(), and allow compaction to proceed when free pages
required for compaction reside in the CMA pageblocks.

Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka 9861a62c33 mm, compaction: create compact_gap wrapper
Compaction uses a watermark gap of (2UL << order) pages at various
places and it's not immediately obvious why.  Abstract it through a
compact_gap() wrapper to create a single place with a thorough
explanation.

[vbabka@suse.cz: clarify the comment of compact_gap()]
 Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka f2b8228c5f mm, compaction: use correct watermark when checking compaction success
The __compact_finished() function uses low watermark in a check that has
to pass if the direct compaction is to finish and allocation should
succeed.  This is too pessimistic, as the allocation will typically use
min watermark.  It may happen that during compaction, we drop below the
low watermark (due to parallel activity), but still form the target
high-order page.  By checking against low watermark, we might needlessly
continue compaction.

Similarly, __compaction_suitable() uses low watermark in a check whether
allocation can succeed without compaction.  Again, this is unnecessarily
pessimistic.

After this patch, these check will use direct compactor's alloc_flags to
determine the watermark, which is effectively the min watermark.

Link: http://lkml.kernel.org/r/20160810091226.6709-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka a8e025e55b mm, compaction: add the ultimate direct compaction priority
During reclaim/compaction loop, it's desirable to get a final answer
from unsuccessful compaction so we can either fail the allocation or
invoke the OOM killer.  However, heuristics such as deferred compaction
or pageblock skip bits can cause compaction to skip parts or whole zones
and lead to premature OOM's, failures or excessive reclaim/compaction
retries.

To remedy this, we introduce a new direct compaction priority called
COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:

 - ignore deferred compaction status for a zone
 - ignore pageblock skip hints
 - ignore cached scanner positions and scan the whole zone

The new priority should get eventually picked up by
should_compact_retry() and this should improve success rates for costly
allocations using __GFP_REPEAT, such as hugetlbfs allocations, and
reduce some corner-case OOM's for non-costly allocations.

Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz
[vbabka@suse.cz: use the MIN_COMPACT_PRIORITY alias]
  Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka 7ceb009a22 mm, compaction: don't recheck watermarks after COMPACT_SUCCESS
Joonsoo has reminded me that in a later patch changing watermark checks
throughout compaction I forgot to update checks in
try_to_compact_pages() and compactd_do_work().  Closer inspection
however shows that they are redundant now in the success case, because
compact_zone() now reliably reports this with COMPACT_SUCCESS.  So
effectively the checks just repeat (a subset) of checks that have just
passed.  So instead of checking watermarks again, just test the return
value.

Note it's also possible that compaction would declare failure e.g.
because its find_suitable_fallback() is more strict than simple
watermark check, and then the watermark check we are removing would then
still succeed.  After this patch this is not possible and it's arguably
better, because for long-term fragmentation avoidance we should rather
try a different zone than allocate with the unsuitable fallback.  If
compaction of all zones fail and the allocation is important enough, it
will retry and succeed anyway.

Also remove the stray "bool success" variable from kcompactd_do_work().

Link: http://lkml.kernel.org/r/20160810091226.6709-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka cf378319d3 mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS
COMPACT_PARTIAL has historically meant that compaction returned after
doing some work without fully compacting a zone.  It however didn't
distinguish if compaction terminated because it succeeded in creating
the requested high-order page.  This has changed recently and now we
only return COMPACT_PARTIAL when compaction thinks it succeeded, or the
high-order watermark check in compaction_suitable() passes and no
compaction needs to be done.

So at this point we can make the return value clearer by renaming it to
COMPACT_SUCCESS.  The next patch will remove some redundant tests for
success where compaction just returned COMPACT_SUCCESS.

Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka 791cae9620 mm, compaction: cleanup unused functions
Since kswapd compaction moved to kcompactd, compact_pgdat() is not
called anymore, so we remove it.  The only caller of __compact_pgdat()
is compact_node(), so we merge them and remove code that was only
reachable from kswapd.

Link: http://lkml.kernel.org/r/20160810091226.6709-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Vlastimil Babka 06ed29989f mm, compaction: make whole_zone flag ignore cached scanner positions
Patch series "make direct compaction more deterministic")

This is mostly a followup to Michal's oom detection rework, which
highlighted the need for direct compaction to provide better feedback in
reclaim/compaction loop, so that it can reliably recognize when
compaction cannot make further progress, and allocation should invoke
OOM killer or fail.  We've discussed this at LSF/MM [1] where I proposed
expanding the async/sync migration mode used in compaction to more
general "priorities".  This patchset adds one new priority that just
overrides all the heuristics and makes compaction fully scan all zones.
I don't currently think that we need more fine-grained priorities, but
we'll see.  Other than that there's some smaller fixes and cleanups,
mainly related to the THP-specific hacks.

I've tested this with stress-highalloc in GFP_KERNEL order-4 and
THP-like order-9 scenarios.  There's some improvement for compaction
stats for the order-4, which is likely due to the better watermarks
handling.  In the previous version I reported mostly noise wrt
compaction stats, and decreased direct reclaim - now the reclaim is
without difference.  I believe this is due to the less aggressive
compaction priority increase in patch 6.

"before" is a mmotm tree prior to 4.7 release plus the first part of the
series that was sent and merged separately

                                    before        after
order-4:

Compaction stalls                    27216       30759
Compaction success                   19598       25475
Compaction failures                   7617        5283
Page migrate success                370510      464919
Page migrate failure                 25712       27987
Compaction pages isolated           849601     1041581
Compaction migrate scanned       143146541   101084990
Compaction free scanned          208355124   144863510
Compaction cost                       1403        1210

order-9:

Compaction stalls                     7311        7401
Compaction success                    1634        1683
Compaction failures                   5677        5718
Page migrate success                194657      183988
Page migrate failure                  4753        4170
Compaction pages isolated           498790      456130
Compaction migrate scanned          565371      524174
Compaction free scanned            4230296     4250744
Compaction cost                        215         203

[1] https://lwn.net/Articles/684611/

This patch (of 11):

A recent patch has added whole_zone flag that compaction sets when
scanning starts from the zone boundary, in order to report that zone has
been fully scanned in one attempt.  For allocations that want to try
really hard or cannot fail, we will want to introduce a mode where
scanning whole zone is guaranteed regardless of the cached positions.

This patch reuses the whole_zone flag in a way that if it's already
passed true to compaction, the cached scanner positions are ignored.
Employing this flag during reclaim/compaction loop will be done in the
next patch.  This patch however converts compaction invoked from
userspace via procfs to use this flag.  Before this patch, the cached
positions were first reset to zone boundaries and then read back from
struct zone, so there was a window where a parallel compaction could
replace the reset values, making the manual compaction less effective.
Using the flag instead of performing reset is more robust.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:27 -07:00
Michal Hocko 5870c2e1d7 mm/oom_kill.c: fix task_will_free_mem() comment
Attempt to demystify the task_will_free_mem() loop.

Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Vladimir Davydov 58fa2a5512 mm: memcontrol: add sanity checks for memcg->id.ref on get/put
Link: http://lkml.kernel.org/r/1c5ddb1c171dbdfc3262252769d6138a29b35b70.1470219853.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
zijun_hu 252e5c6e2e mm/vmalloc.c: fix align value calculation error
It causes double align requirement for __get_vm_area_node() if parameter
size is power of 2 and VM_IOREMAP is set in parameter flags, for example
size=0x10000 -> fls_long(0x10000)=17 -> align=0x20000

get_count_order_long() is implemented and can be used instead of
fls_long() for fixing the bug, for example size=0x10000 ->
get_count_order_long(0x10000)=16 -> align=0x10000

[akpm@linux-foundation.org: s/get_order_long()/get_count_order_long()/]
[zijun_hu@zoho.com: fixes]
 Link: http://lkml.kernel.org/r/57AABC8B.1040409@zoho.com
[akpm@linux-foundation.org: locate get_count_order_long() next to get_count_order()]
[akpm@linux-foundation.org: move get_count_order[_long] definitions to pick up fls_long()]
[zijun_hu@htc.com: move out get_count_order[_long]() from __KERNEL__ scope]
 Link: http://lkml.kernel.org/r/57B2C4CE.80303@zoho.com
Link: http://lkml.kernel.org/r/fc045ecf-20fa-0722-b3ac-9a6140488fad@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Vladimir Davydov 7c5f64f844 mm: oom: deduplicate victim selection code for memcg and global oom
When selecting an oom victim, we use the same heuristic for both memory
cgroup and global oom.  The only difference is the scope of tasks to
select the victim from.  So we could just export an iterator over all
memcg tasks and keep all oom related logic in oom_kill.c, but instead we
duplicate pieces of it in memcontrol.c reusing some initially private
functions of oom_kill.c in order to not duplicate all of it.  That looks
ugly and error prone, because any modification of select_bad_process
should also be propagated to mem_cgroup_out_of_memory.

Let's rework this as follows: keep all oom heuristic related code private
to oom_kill.c and make oom_kill.c use exported memcg functions when it's
really necessary (like in case of iterating over memcg tasks).

Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 18:46:26 -07:00
Linus Torvalds d1f5323370 Merge branch 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull VFS splice updates from Al Viro:
 "There's a bunch of branches this cycle, both mine and from other folks
  and I'd rather send pull requests separately.

  This one is the conversion of ->splice_read() to ITER_PIPE iov_iter
  (and introduction of such). Gets rid of a lot of code in fs/splice.c
  and elsewhere; there will be followups, but these are for the next
  cycle...  Some pipe/splice-related cleanups from Miklos in the same
  branch as well"

* 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  pipe: fix comment in pipe_buf_operations
  pipe: add pipe_buf_steal() helper
  pipe: add pipe_buf_confirm() helper
  pipe: add pipe_buf_release() helper
  pipe: add pipe_buf_get() helper
  relay: simplify relay_file_read()
  switch default_file_splice_read() to use of pipe-backed iov_iter
  switch generic_file_splice_read() to use of ->read_iter()
  new iov_iter flavour: pipe-backed
  fuse_dev_splice_read(): switch to add_to_pipe()
  skb_splice_bits(): get rid of callback
  new helper: add_to_pipe()
  splice: lift pipe_lock out of splice_to_pipe()
  splice: switch get_iovec_page_array() to iov_iter
  splice_to_pipe(): don't open-code wakeup_pipe_readers()
  consistent treatment of EFAULT on O_DIRECT read/write
2016-10-07 15:36:58 -07:00
Linus Torvalds 8d37059581 xfs: updates for 4.9-rc1
Included in this update:
 - change of XFS mailing list to linux-xfs@vger.kernel.org
 - iomap-based DAX infrastructure w/ XFS and ext2 support
 - small iomap fixes and additions
 - more efficient XFS delayed allocation infrastructure based on iomap
 - a rework of log recovery writeback scheduling to ensure we don't
   fail recovery when trying to replay items that are already on disk
 - some preparation patches for upcoming reflink support
 - configurable error handling fixes and documentation
 - aio access time update race fixes for XFS and generic_file_read_iter
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJX9WvjAAoJEK3oKUf0dfodrl8P/R1cS8tEHnrmNlKeENNWFTlN
 q8HEfP3tX43QLHXpeHd9F9qXs5/esrOFfWYFjeoAaB1cWiRXDJsUNOEH3PuQf0Go
 NKHgrL8GiU6XY9keZI6KJYphr2a5//qWJywxOeBuJh3446MDSYwOmI3eEIY8ac3/
 k0e8bMnLhfryWOvyZE6v2w75lMi+SL1LH/W6OSJqGFKS3N+GqdqRKkMfYGQToHkM
 ZgIX1vDSq4xgJzkR1Q+AACCaSTGE2wEG/bnqZ1R3l19/bERB17LaOyEegBDXbrTT
 vI31EQnrN92O/Q2eYJlap8nFIm4lVaCFTU1R7KEVEXvUBRXXfxllu1sOSBpn1PSQ
 OrC5bbcCodcG8b1SlwRrcstqc42weojqwyl65eJxOa17valghaYEcLkqEZrrrssv
 Y+C0okfL3UB2JAxG4O1nFQ3py1cYlkYURf6CuhxNQfktXZxSpAMTLy9wYCRylBiO
 Eu6Say4zfnfKiVaSg0xlMhIaAyugVH+uVro62hZYxCU2mJ/biZHeQAUC6Krl6NsY
 NsAk0T7eUgMd7lLW+C9/rL2AQaXYwR72cl/1jAWBE2piBM2Gu1lcGHGwWHvOcYjO
 K2Yg4RMnR9TDbUX2jl1r4bZoQD3IZ3HpUjgVInmbTPtKY4q89kfC40haSpBQykm7
 QzGLPvFz2sMrkmKPLbV2
 =R9uL
 -----END PGP SIGNATURE-----

Merge tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs

Pull xfs and iomap updates from Dave Chinner:
 "The main things in this update are the iomap-based DAX infrastructure,
  an XFS delalloc rework, and a chunk of fixes to how log recovery
  schedules writeback to prevent spurious corruption detections when
  recovery of certain items was not required.

  The other main chunk of code is some preparation for the upcoming
  reflink functionality. Most of it is generic and cleanups that stand
  alone, but they were ready and reviewed so are in this pull request.

  Speaking of reflink, I'm currently planning to send you another pull
  request next week containing all the new reflink functionality. I'm
  working through a similar process to the last cycle, where I sent the
  reverse mapping code in a separate request because of how large it
  was. The reflink code merge is even bigger than reverse mapping, so
  I'll be doing the same thing again....

  Summary for this update:

   - change of XFS mailing list to linux-xfs@vger.kernel.org

   - iomap-based DAX infrastructure w/ XFS and ext2 support

   - small iomap fixes and additions

   - more efficient XFS delayed allocation infrastructure based on iomap

   - a rework of log recovery writeback scheduling to ensure we don't
     fail recovery when trying to replay items that are already on disk

   - some preparation patches for upcoming reflink support

   - configurable error handling fixes and documentation

   - aio access time update race fixes for XFS and
     generic_file_read_iter"

* tag 'xfs-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (40 commits)
  fs: update atime before I/O in generic_file_read_iter
  xfs: update atime before I/O in xfs_file_dio_aio_read
  ext2: fix possible integer truncation in ext2_iomap_begin
  xfs: log recovery tracepoints to track current lsn and buffer submission
  xfs: update metadata LSN in buffers during log recovery
  xfs: don't warn on buffers not being recovered due to LSN
  xfs: pass current lsn to log recovery buffer validation
  xfs: rework log recovery to submit buffers on LSN boundaries
  xfs: quiesce the filesystem after recovery on readonly mount
  xfs: remote attribute blocks aren't really userdata
  ext2: use iomap to implement DAX
  ext2: stop passing buffer_head to ext2_get_blocks
  xfs: use iomap to implement DAX
  xfs: refactor xfs_setfilesize
  xfs: take the ilock shared if possible in xfs_file_iomap_begin
  xfs: fix locking for DAX writes
  dax: provide an iomap based fault handler
  dax: provide an iomap based dax read/write path
  dax: don't pass buffer_head to copy_user_dax
  dax: don't pass buffer_head to dax_insert_mapping
  ...
2016-10-06 08:18:10 -07:00
Al Viro 82c156f853 switch generic_file_splice_read() to use of ->read_iter()
... and kill the ->splice_read() instances that can be switched to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-05 18:23:56 -04:00
Johannes Weiner 3ddf40e8c3 mm: filemap: fix mapping->nrpages double accounting in fuse
Commit 22f2ac51b6 ("mm: workingset: fix crash in shadow node shrinker
caused by replace_page_cache_page()") switched replace_page_cache() from
raw radix tree operations to page_cache_tree_insert() but didn't take
into account that the latter function, unlike the raw radix tree op,
handles mapping->nrpages.  As a result, that counter is bumped for each
page replacement rather than balanced out even.

The mapping->nrpages counter is used to skip needless radix tree walks
when invalidating, truncating, syncing inodes without pages, as well as
statistics for userspace.  Since the error is positive, we'll do more
page cache tree walks than necessary; we won't miss a necessary one.
And we'll report more buffer pages to userspace than there are.  The
error is limited to fuse inodes.

Fixes: 22f2ac51b6 ("mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-05 09:17:56 -07:00
Johannes Weiner d3798ae8c6 mm: filemap: don't plant shadow entries without radix tree node
When the underflow checks were added to workingset_node_shadow_dec(),
they triggered immediately:

  kernel BUG at ./include/linux/swap.h:276!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: isofs usb_storage fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_REJECT nf_reject_ipv6
   soundcore wmi acpi_als pinctrl_sunrisepoint kfifo_buf tpm_tis industrialio acpi_pad pinctrl_intel tpm_tis_core tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt
  CPU: 0 PID: 20929 Comm: blkid Not tainted 4.8.0-rc8-00087-gbe67d60ba944 #1
  Hardware name: System manufacturer System Product Name/Z170-K, BIOS 1803 05/06/2016
  task: ffff8faa93ecd940 task.stack: ffff8faa7f478000
  RIP: page_cache_tree_insert+0xf1/0x100
  Call Trace:
    __add_to_page_cache_locked+0x12e/0x270
    add_to_page_cache_lru+0x4e/0xe0
    mpage_readpages+0x112/0x1d0
    blkdev_readpages+0x1d/0x20
    __do_page_cache_readahead+0x1ad/0x290
    force_page_cache_readahead+0xaa/0x100
    page_cache_sync_readahead+0x3f/0x50
    generic_file_read_iter+0x5af/0x740
    blkdev_read_iter+0x35/0x40
    __vfs_read+0xe1/0x130
    vfs_read+0x96/0x130
    SyS_read+0x55/0xc0
    entry_SYSCALL_64_fastpath+0x13/0x8f
  Code: 03 00 48 8b 5d d8 65 48 33 1c 25 28 00 00 00 44 89 e8 75 19 48 83 c4 18 5b 41 5c 41 5d 41 5e 5d c3 0f 0b 41 bd ef ff ff ff eb d7 <0f> 0b e8 88 68 ef ff 0f 1f 84 00
  RIP  page_cache_tree_insert+0xf1/0x100

This is a long-standing bug in the way shadow entries are accounted in
the radix tree nodes. The shrinker needs to know when radix tree nodes
contain only shadow entries, no pages, so node->count is split in half
to count shadows in the upper bits and pages in the lower bits.

Unfortunately, the radix tree implementation doesn't know of this and
assumes all entries are in node->count. When there is a shadow entry
directly in root->rnode and the tree is later extended, the radix tree
implementation will copy that entry into the new node and and bump its
node->count, i.e. increases the page count bits. Once the shadow gets
removed and we subtract from the upper counter, node->count underflows
and triggers the warning. Afterwards, without node->count reaching 0
again, the radix tree node is leaked.

Limit shadow entries to when we have actual radix tree nodes and can
count them properly. That means we lose the ability to detect refaults
from files that had only the first page faulted in at eviction time.

Fixes: 449dd6984d ("mm: keep page cache radix tree nodes in check")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-05 09:17:56 -07:00
zijun_hu 9b7396624a mm/percpu.c: fix potential memory leakage for pcpu_embed_first_chunk()
in order to ensure the percpu group areas within a chunk aren't
distributed too sparsely, pcpu_embed_first_chunk() goes to error handling
path when a chunk spans over 3/4 VMALLOC area, however, during the error
handling, it forget to free the memory allocated for all percpu groups by
going to label @out_free other than @out_free_areas.

it will cause memory leakage issue if the rare scene really happens, in
order to fix the issue, we check chunk spanned area immediately after
completing memory allocation for all percpu groups, we go to label
@out_free_areas to free the memory then return if the checking is failed.

in order to verify the approach, we dump all memory allocated then
enforce the jump then dump all memory freed, the result is okay after
checking whether we free all memory we allocate in this function.

BTW, The approach is chosen after thinking over the below scenes
 - we don't go to label @out_free directly to fix this issue since we
   maybe free several allocated memory blocks twice
 - the aim of jumping after pcpu_setup_first_chunk() is bypassing free
   usable memory other than handling error, moreover, the function does
   not return error code in any case, it either panics due to BUG_ON()
   or return 0.

Signed-off-by: zijun_hu <zijun_hu@htc.com>
Tested-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-05 11:52:55 -04:00
zijun_hu 93c76b6b2f mm/percpu.c: correct max_distance calculation for pcpu_embed_first_chunk()
pcpu_embed_first_chunk() calculates the range a percpu chunk spans into
@max_distance and uses it to ensure that a chunk is not too big compared
to the total vmalloc area. However, during calculation, it used incorrect
top address by adding a unit size to the highest group's base address.

This can make the calculated max_distance slightly smaller than the actual
distance although given the scale of values involved the error is very
unlikely to have an actual impact.

Fix this issue by adding the group's size instead of a unit size.

BTW, The type of variable max_distance is changed from size_t to unsigned
long too based on below consideration:
 - type unsigned long usually have same width with IP core registers and
   can be applied at here very well
 - make @max_distance type consistent with the operand calculated against
   it such as @ai->groups[i].base_offset and macro VMALLOC_TOTAL
 - type unsigned long is more universal then size_t, size_t is type defined
   to unsigned int or unsigned long among various ARCHs usually

Signed-off-by: zijun_hu <zijun_hu@htc.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2016-10-05 11:52:54 -04:00
Linus Torvalds 597f03f9d1 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull CPU hotplug updates from Thomas Gleixner:
 "Yet another batch of cpu hotplug core updates and conversions:

   - Provide core infrastructure for multi instance drivers so the
     drivers do not have to keep custom lists.

   - Convert custom lists to the new infrastructure. The block-mq custom
     list conversion comes through the block tree and makes the diffstat
     tip over to more lines removed than added.

   - Handle unbalanced hotplug enable/disable calls more gracefully.

   - Remove the obsolete CPU_STARTING/DYING notifier support.

   - Convert another batch of notifier users.

   The relayfs changes which conflicted with the conversion have been
   shipped to me by Andrew.

   The remaining lot is targeted for 4.10 so that we finally can remove
   the rest of the notifiers"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
  cpufreq: Fix up conversion to hotplug state machine
  blk/mq: Reserve hotplug states for block multiqueue
  x86/apic/uv: Convert to hotplug state machine
  s390/mm/pfault: Convert to hotplug state machine
  mips/loongson/smp: Convert to hotplug state machine
  mips/octeon/smp: Convert to hotplug state machine
  fault-injection/cpu: Convert to hotplug state machine
  padata: Convert to hotplug state machine
  cpufreq: Convert to hotplug state machine
  ACPI/processor: Convert to hotplug state machine
  virtio scsi: Convert to hotplug state machine
  oprofile/timer: Convert to hotplug state machine
  block/softirq: Convert to hotplug state machine
  lib/irq_poll: Convert to hotplug state machine
  x86/microcode: Convert to hotplug state machine
  sh/SH-X3 SMP: Convert to hotplug state machine
  ia64/mca: Convert to hotplug state machine
  ARM/OMAP/wakeupgen: Convert to hotplug state machine
  ARM/shmobile: Convert to hotplug state machine
  arm64/FP/SIMD: Convert to hotplug state machine
  ...
2016-10-03 19:43:08 -07:00
Linus Torvalds 8e4ef63867 Merge branch 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 vdso updates from Ingo Molnar:
 "The main changes in this cycle centered around adding support for
  32-bit compatible C/R of the vDSO on 64-bit kernels, by Dmitry
  Safonov"

* 'x86-vdso-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/vdso: Use CONFIG_X86_X32_ABI to enable vdso prctl
  x86/vdso: Only define map_vdso_randomized() if CONFIG_X86_64
  x86/vdso: Only define prctl_map_vdso() if CONFIG_CHECKPOINT_RESTORE
  x86/signal: Add SA_{X32,IA32}_ABI sa_flags
  x86/ptrace: Down with test_thread_flag(TIF_IA32)
  x86/coredump: Use pr_reg size, rather that TIF_IA32 flag
  x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
  x86/vdso: Replace calculate_addr in map_vdso() with addr
  x86/vdso: Unmap vdso blob on vvar mapping failure
2016-10-03 17:29:01 -07:00
Linus Torvalds af79ad2b1f Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler changes from Ingo Molnar:
 "The main changes are:

   - irqtime accounting cleanups and enhancements. (Frederic Weisbecker)

   - schedstat debugging enhancements, make it more broadly runtime
     available. (Josh Poimboeuf)

   - More work on asymmetric topology/capacity scheduling. (Morten
     Rasmussen)

   - sched/wait fixes and cleanups. (Oleg Nesterov)

   - PELT (per entity load tracking) improvements. (Peter Zijlstra)

   - Rewrite and enhance select_idle_siblings(). (Peter Zijlstra)

   - sched/numa enhancements/fixes (Rik van Riel)

   - sched/cputime scalability improvements (Stanislaw Gruszka)

   - Load calculation arithmetics fixes. (Dietmar Eggemann)

   - sched/deadline enhancements (Tommaso Cucinotta)

   - Fix utilization accounting when switching to the SCHED_NORMAL
     policy. (Vincent Guittot)

   - ... plus misc cleanups and enhancements"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits)
  sched/irqtime: Consolidate irqtime flushing code
  sched/irqtime: Consolidate accounting synchronization with u64_stats API
  u64_stats: Introduce IRQs disabled helpers
  sched/irqtime: Remove needless IRQs disablement on kcpustat update
  sched/irqtime: No need for preempt-safe accessors
  sched/fair: Fix min_vruntime tracking
  sched/debug: Add SCHED_WARN_ON()
  sched/core: Fix set_user_nice()
  sched/fair: Introduce set_curr_task() helper
  sched/core, ia64: Rename set_curr_task()
  sched/core: Fix incorrect utilization accounting when switching to fair class
  sched/core: Optimize SCHED_SMT
  sched/core: Rewrite and improve select_idle_siblings()
  sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared
  sched/core: Introduce 'struct sched_domain_shared'
  sched/core: Restructure destroy_sched_domain()
  sched/core: Remove unused @cpu argument from destroy_sched_domain*()
  sched/wait: Introduce init_wait_entry()
  sched/wait: Avoid abort_exclusive_wait() in __wait_on_bit_lock()
  sched/wait: Avoid abort_exclusive_wait() in ___wait_event()
  ...
2016-10-03 13:39:00 -07:00
Linus Torvalds 72ec94560d Power management material for v4.9-rc1
- Add a mechanism for passing hints from the scheduler to cpufreq governors
    via their utilization update callbacks and use it to introduce "IOwait
    boosting" into the schedutil governor and intel_pstate that will make them
    boost performance if the enqueued task was previously waiting on I/O
    (Rafael Wysocki).
 
  - Fix a schedutil governor problem that causes it to overestimate utilization
    if SMT is in use (Steve Muckle).
 
  - Update defconfigs trying to use the schedutil governor as a module which is
    not possible any more (Javier Martinez Canillas).
 
  - Update the intel_pstate's pstate_sample tracepoint to take "IOwait boosting"
    into account (Srinivas Pandruvada).
 
  - Fix a problem in the cpufreq core causing it to mishandle the initialization
    of CPUs registered after the cpufreq driver (Viresh Kumar, Rafael Wysocki).
 
  - Make the cpufreq-dt driver support per-policy governor tunables, clean it
    up and update its Kconfig description (Viresh Kumar).
 
  - Add support for more ARM platforms to the cpufreq-dt driver (Chanwoo Choi,
    Dave Gerlach, Geert Uytterhoeven).
 
  - Make the cpufreq CPPC driver report frequencies in KHz to avoid user space
    compatiblility issues (Al Stone, Hoan Tran).
 
  - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin Ian King,
    Markus Elfring).
 
  - Constify some local structures in the intel_pstate driver (Julia Lawall).
 
  - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).
 
  - Add support for PM domain removal to the generic power domains (genpd)
    framework, add new DT helper functions to it and make it always enable
    debugfs support if available (Jon Hunter, Tomeu Vizoso).
 
  - Clean up the generic power domains (genpd) framework and make it avoid
    measuring power-on and power-off latencies during system-wide PM transitions
    (Ulf Hansson).
 
  - Add support for the RockChip DFI controller and the rk3399 DMC to the
    devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).
 
  - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski, Stephen
    Rothwell).
 
  - Fix a minor issue in the exynos-ppmu devfreq driver and fix up devfreq
    Kconfig indentation style (Wei Yongjun, Jisheng Zhang).
 
  - Fix the system suspend interface to make suspend-to-idle work if platform
    suspend operations have not been registered (Sudeep Holla).
 
  - Make it possible to use hibernation with PAGE_POISONING_ZERO enabled
    (Anisse Astier).
 
  - Increas the default timeout of the system suspend/resume watchdog and make it
    depend on EXPERT (Chen Yu).
 
  - Make the operating performance points (OPP) framework avoid using OPPs that
    aren't supported by the platform and fix a build warning in it (Dave Gerlach,
    Arnd Bergmann).
 
  - Fix the ARM cpuidle driver's return value (Christophe Jaillet).
 
  - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more common
    logging style (Joe Perches).
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJX8Y32AAoJEILEb/54YlRx8e0P/27zu8Lb6Aks1S2Zx9GEW0qr
 DvrO4kklCHqi3DgHlyFOYetf9cxMrUluojVJofnoSDvgAayWyg7VAd4gtOrMGCXG
 pJVJM73itcOUK+DsAVvoWJY3hk15nX77n2aiXPN2GqaMqennlQusdfzTmjCasqpm
 M84j+JwFYlJcfyMCcF5kGWqS7QBjzxhA0CjytUX1i3pL3NqRALZUEpaHwBD1W+4r
 tcF/jYTy3RsghCbuC6HoPxEF9NMOFGxeAXogmu6NvGu8gy0GqtywRSRrs5wA1a0z
 ZDAJ8krrFbzuFPMdjNIE8wtTeziofS5i9piQx3JlIMH3HpNGN86BRXVfzuHzJj11
 6ZMUI/FJy+fYukIXOEeVLtsLHUnMcMux8Jq1UF6N0InahaR9nbsjmGOmXh72+Scx
 7VJ+29l0oVwX6wkw/DjPP3rb1Swd1i3yY0/3uRoJ174mYTjhRGbrbDkIjPiDeuM5
 2Cx7QunscOjFmaNtPyr8niQ+7YhMEpn8VIbGNaX5ABz0fGftfi8nDHqliSNa391Z
 nK6YoKD0O6R0JHE6GavvJTcuMS9qE+HHHOwymWKxEdE9KYk0JUqen3gj1sSTaAZT
 BIPBsn6XlorqNy3dnqtWTHV7Nf0al9huolWvrL90s6g4Bh2BzTzDVydSgNWTMDUi
 G64nP0q1sJTqdoe30uvk
 =NYkv
 -----END PGP SIGNATURE-----

Merge tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management updates from Rafael Wysocki:
 "Traditionally, cpufreq is the area with the greatest number of
  changes, but there are fewer of them than last time. There also is
  some activity in the generic power domains and the devfreq frameworks,
  a couple of system suspend and hibernation fixes and some assorted
  changes in other places.

  One new feature is the cpufreq change to allow the scheduler to pass
  hints to the governors' utilization update callbacks and some code
  rework based on that. Another one is the support for domain removal in
  the generic power domains framework. Also it is now possible to use
  hibernation with PAGE_POISONING_ZERO enabled and devfreq supports the
  RockChip DFI controller and the rk3399 DMC.

  The rest of the changes is mostly fixes and cleanups in a number of
  places.

  Specifics:

   - Add a mechanism for passing hints from the scheduler to cpufreq
     governors via their utilization update callbacks and use it to
     introduce "IOwait boosting" into the schedutil governor and
     intel_pstate that will make them boost performance if the enqueued
     task was previously waiting on I/O (Rafael Wysocki).

   - Fix a schedutil governor problem that causes it to overestimate
     utilization if SMT is in use (Steve Muckle).

   - Update defconfigs trying to use the schedutil governor as a module
     which is not possible any more (Javier Martinez Canillas).

   - Update the intel_pstate's pstate_sample tracepoint to take "IOwait
     boosting" into account (Srinivas Pandruvada).

   - Fix a problem in the cpufreq core causing it to mishandle the
     initialization of CPUs registered after the cpufreq driver (Viresh
     Kumar, Rafael Wysocki).

   - Make the cpufreq-dt driver support per-policy governor tunables,
     clean it up and update its Kconfig description (Viresh Kumar).

   - Add support for more ARM platforms to the cpufreq-dt driver
     (Chanwoo Choi, Dave Gerlach, Geert Uytterhoeven).

   - Make the cpufreq CPPC driver report frequencies in KHz to avoid
     user space compatiblility issues (Al Stone, Hoan Tran).

   - Clean up a few cpufreq drivers (st, kirkwood, SCPI) a bit (Colin
     Ian King, Markus Elfring).

   - Constify some local structures in the intel_pstate driver (Julia
     Lawall).

   - Add a Documentation/cpu-freq/ entry to MAINTAINERS (Jean Delvare).

   - Add support for PM domain removal to the generic power domains
     (genpd) framework, add new DT helper functions to it and make it
     always enable debugfs support if available (Jon Hunter, Tomeu
     Vizoso).

   - Clean up the generic power domains (genpd) framework and make it
     avoid measuring power-on and power-off latencies during system-wide
     PM transitions (Ulf Hansson).

   - Add support for the RockChip DFI controller and the rk3399 DMC to
     the devfreq framework (Lin Huang, Axel Lin, Arnd Bergmann).

   - Add COMPILE_TEST to the devfreq framework (Krzysztof Kozlowski,
     Stephen Rothwell).

   - Fix a minor issue in the exynos-ppmu devfreq driver and fix up
     devfreq Kconfig indentation style (Wei Yongjun, Jisheng Zhang).

   - Fix the system suspend interface to make suspend-to-idle work if
     platform suspend operations have not been registered (Sudeep
     Holla).

   - Make it possible to use hibernation with PAGE_POISONING_ZERO
     enabled (Anisse Astier).

   - Increas the default timeout of the system suspend/resume watchdog
     and make it depend on EXPERT (Chen Yu).

   - Make the operating performance points (OPP) framework avoid using
     OPPs that aren't supported by the platform and fix a build warning
     in it (Dave Gerlach, Arnd Bergmann).

   - Fix the ARM cpuidle driver's return value (Christophe Jaillet).

   - Make the SmartReflex AVS (Adaptive Voltage Scaling) driver use more
     common logging style (Joe Perches)"

* tag 'pm-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (58 commits)
  PM / OPP: Don't support OPP if it provides supported-hw but platform does not
  cpufreq: st: add missing \n to end of dev_err message
  cpufreq: kirkwood: add missing \n to end of dev_err messages
  PM / Domains: Rename pm_genpd_sync_poweron|poweroff()
  PM / Domains: Don't measure latency of ->power_on|off() during system PM
  PM / Domains: Remove redundant system PM callbacks
  PM / Domains: Simplify detaching a device from its genpd
  PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
  PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
  PM / OPP: avoid maybe-uninitialized warning
  PM / Domains: Allow holes in genpd_data.domains array
  cpufreq: CPPC: Avoid overflow when calculating desired_perf
  cpufreq: ti: Use generic platdev driver
  cpufreq: intel_pstate: Add io_boost trace
  partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
  cpufreq: intel_pstate: Use IOWAIT flag in Atom algorithm
  cpufreq: schedutil: Add iowait boosting
  cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition
  PM / Domains: Add support for removing nested PM domains by provider
  PM / Domains: Add support for removing PM domains
  ...
2016-10-03 09:33:40 -07:00
Linus Torvalds 7af8a0f808 arm64 updates for 4.9:
- Support for execute-only page permissions
 - Support for hibernate and DEBUG_PAGEALLOC
 - Support for heterogeneous systems with mismatches cache line sizes
 - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
 - arm64 PMU perf updates, including cpumasks for heterogeneous systems
 - Set UTS_MACHINE for building rpm packages
 - Yet another head.S tidy-up
 - Some cleanups and refactoring, particularly in the NUMA code
 - Lots of random, non-critical fixes across the board
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCgAGBQJX7k31AAoJELescNyEwWM0XX0H/iOaWCfKlWOhvBsStGUCsLrK
 XryTzQT2KjdnLKf3jwP+1ateCuBR5ROurYxoDCX5/7mD63c5KiI338Vbv61a1lE1
 AAwjt1stmQVUg/j+kqnuQwB/0DYg+2C8se3D3q5Iyn7zc19cDZJEGcBHNrvLMufc
 XgHrgHgl/rzBDDlHJXleknDFge/MfhU5/Q1vJMRRb4JYrpAtmIokzCO75CYMRcCT
 ND2QbmppKtsyuFPGUTVbAFzJlP6dGKb3eruYta7/ct5d0pJQxav3u98D2yWGfjdM
 YaYq1EmX5Pol7rWumqLtk0+mA9yCFcKLLc+PrJu20Vx0UkvOq8G8Xt70sHNvZU8=
 =gdPM
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Will Deacon:
 "It's a bit all over the place this time with no "killer feature" to
  speak of.  Support for mismatched cache line sizes should help people
  seeing whacky JIT failures on some SoCs, and the big.LITTLE perf
  updates have been a long time coming, but a lot of the changes here
  are cleanups.

  We stray outside arch/arm64 in a few areas: the arch/arm/ arch_timer
  workaround is acked by Russell, the DT/OF bits are acked by Rob, the
  arch_timer clocksource changes acked by Marc, CPU hotplug by tglx and
  jump_label by Peter (all CC'd).

  Summary:

   - Support for execute-only page permissions
   - Support for hibernate and DEBUG_PAGEALLOC
   - Support for heterogeneous systems with mismatches cache line sizes
   - Errata workarounds (A53 843419 update and QorIQ A-008585 timer bug)
   - arm64 PMU perf updates, including cpumasks for heterogeneous systems
   - Set UTS_MACHINE for building rpm packages
   - Yet another head.S tidy-up
   - Some cleanups and refactoring, particularly in the NUMA code
   - Lots of random, non-critical fixes across the board"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (100 commits)
  arm64: tlbflush.h: add __tlbi() macro
  arm64: Kconfig: remove SMP dependence for NUMA
  arm64: Kconfig: select OF/ACPI_NUMA under NUMA config
  arm64: fix dump_backtrace/unwind_frame with NULL tsk
  arm/arm64: arch_timer: Use archdata to indicate vdso suitability
  arm64: arch_timer: Work around QorIQ Erratum A-008585
  arm64: arch_timer: Add device tree binding for A-008585 erratum
  arm64: Correctly bounds check virt_addr_valid
  arm64: migrate exception table users off module.h and onto extable.h
  arm64: pmu: Hoist pmu platform device name
  arm64: pmu: Probe default hw/cache counters
  arm64: pmu: add fallback probe table
  MAINTAINERS: Update ARM PMU PROFILING AND DEBUGGING entry
  arm64: Improve kprobes test for atomic sequence
  arm64/kvm: use alternative auto-nop
  arm64: use alternative auto-nop
  arm64: alternative: add auto-nop infrastructure
  arm64: lse: convert lse alternatives NOP padding to use __nops
  arm64: barriers: introduce nops and __nops macros for NOP sequences
  arm64: sysreg: replace open-coded mrs_s/msr_s with {read,write}_sysreg_s
  ...
2016-10-03 08:58:35 -07:00
Christoph Hellwig 0d5b0cf246 fs: update atime before I/O in generic_file_read_iter
After the call to ->direct_IO the final reference to the file might have
been dropped by aio_complete already, and the call to file_accessed might
cause a use after free.

Instead update the access time before the I/O, similar to how we
update the time stamps before writes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-03 09:48:08 +11:00
Rafael J. Wysocki 993eb0aeae Merge branches 'pm-devfreq' and 'pm-sleep'
* pm-devfreq:
  PM / devfreq: rk3399_dmc: Remove explictly regulator_put call in .remove
  PM / devfreq: rockchip: add PM_DEVFREQ_EVENT dependency
  partial revert of "PM / devfreq: Add COMPILE_TEST for build coverage"
  PM / devfreq: rockchip: add devfreq driver for rk3399 dmc
  Documentation: bindings: add dt documentation for rk3399 dmc
  PM / devfreq: event: support rockchip dfi controller
  Documentation: bindings: add dt documentation for dfi controller
  PM / devfreq: event: remove duplicate devfreq_event_get_drvdata()
  PM / devfreq: fix Kconfig indent style
  PM / devfreq: Add COMPILE_TEST for build coverage
  PM / devfreq: exynos-ppmu: remove unneeded of_node_put()

* pm-sleep:
  PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO
  PM / sleep: enable suspend-to-idle even without registered suspend_ops
  PM / sleep: Increase default DPM watchdog timeout to 120
2016-10-02 01:43:45 +02:00
Johannes Weiner 22f2ac51b6 mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()
Antonio reports the following crash when using fuse under memory pressure:

  kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: all of them
  CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
  Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
  task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
  RIP: shadow_lru_isolate+0x181/0x190
  Call Trace:
    __list_lru_walk_one.isra.3+0x8f/0x130
    list_lru_walk_one+0x23/0x30
    scan_shadow_nodes+0x34/0x50
    shrink_slab.part.40+0x1ed/0x3d0
    shrink_zone+0x2ca/0x2e0
    kswapd+0x51e/0x990
    kthread+0xd8/0xf0
    ret_from_fork+0x3f/0x70

which corresponds to the following sanity check in the shadow node
tracking:

  BUG_ON(node->count & RADIX_TREE_COUNT_MASK);

The workingset code tracks radix tree nodes that exclusively contain
shadow entries of evicted pages in them, and this (somewhat obscure)
line checks whether there are real pages left that would interfere with
reclaim of the radix tree node under memory pressure.

While discussing ways how fuse might sneak pages into the radix tree
past the workingset code, Miklos pointed to replace_page_cache_page(),
and indeed there is a problem there: it properly accounts for the old
page being removed - __delete_from_page_cache() does that - but then
does a raw raw radix_tree_insert(), not accounting for the replacement
page.  Eventually the page count bits in node->count underflow while
leaving the node incorrectly linked to the shadow node LRU.

To address this, make sure replace_page_cache_page() uses the tracked
page insertion code, page_cache_tree_insert().  This fixes the page
accounting and makes sure page-containing nodes are properly unlinked
from the shadow node LRU again.

Also, make the sanity checks a bit less obscure by using the helpers for
checking the number of pages and shadows in a radix tree node.

Fixes: 449dd6984d ("mm: keep page cache radix tree nodes in check")
Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Antonio SJ Musumeci <trapexit@spawn.link>
Debugged-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <stable@vger.kernel.org>	[3.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-30 15:26:52 -07:00
Ingo Molnar 536e0e81e0 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-30 10:44:27 +02:00
Li Zhong 231e97e2b8 mem-hotplug: use nodes that contain memory as mask in new_node_page()
9bb627be47 ("mem-hotplug: don't clear the only node in new_node_page()")
prevents allocating from an empty nodemask, but as David points out, it is
still wrong.  As node_online_map may include memoryless nodes, only
allocating from these nodes is meaningless.

This patch uses node_states[N_MEMORY] mask to prevent the above case.

Fixes: 9bb627be47 ("mem-hotplug: don't clear the only node in new_node_page()")
Fixes: 394e31d2ce ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Suggested-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: John Allen <jallen@linux.vnet.ibm.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-28 16:19:02 -07:00
zhong jiang 5b398e416e mm,ksm: fix endless looping in allocating memory when ksm enable
I hit the following hung task when runing a OOM LTP test case with 4.1
kernel.

Call trace:
[<ffffffc000086a88>] __switch_to+0x74/0x8c
[<ffffffc000a1bae0>] __schedule+0x23c/0x7bc
[<ffffffc000a1c09c>] schedule+0x3c/0x94
[<ffffffc000a1eb84>] rwsem_down_write_failed+0x214/0x350
[<ffffffc000a1e32c>] down_write+0x64/0x80
[<ffffffc00021f794>] __ksm_exit+0x90/0x19c
[<ffffffc0000be650>] mmput+0x118/0x11c
[<ffffffc0000c3ec4>] do_exit+0x2dc/0xa74
[<ffffffc0000c46f8>] do_group_exit+0x4c/0xe4
[<ffffffc0000d0f34>] get_signal+0x444/0x5e0
[<ffffffc000089fcc>] do_signal+0x1d8/0x450
[<ffffffc00008a35c>] do_notify_resume+0x70/0x78

The oom victim cannot terminate because it needs to take mmap_sem for
write while the lock is held by ksmd for read which loops in the page
allocator

ksm_do_scan
	scan_get_next_rmap_item
		down_read
		get_next_rmap_item
			alloc_rmap_item   #ksmd will loop permanently.

There is no way forward because the oom victim cannot release any memory
in 4.1 based kernel.  Since 4.6 we have the oom reaper which would solve
this problem because it would release the memory asynchronously.
Nevertheless we can relax alloc_rmap_item requirements and use
__GFP_NORETRY because the allocation failure is acceptable as ksm_do_scan
would just retry later after the lock got dropped.

Such a patch would be also easy to backport to older stable kernels which
do not have oom_reaper.

While we are at it add GFP_NOWARN so the admin doesn't have to be alarmed
by the allocation failure.

Link: http://lkml.kernel.org/r/1474165570-44398-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Suggested-by: Hugh Dickins <hughd@google.com>
Suggested-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-28 16:19:01 -07:00
Deepa Dinamani 078cd8279e fs: Replace CURRENT_TIME with current_time() for inode timestamps
CURRENT_TIME macro is not appropriate for filesystems as it
doesn't use the right granularity for filesystem timestamps.
Use current_time() instead.

CURRENT_TIME is also not y2038 safe.

This is also in preparation for the patch that transitions
vfs timestamps to use 64 bit time and hence make them
y2038 safe. As part of the effort current_time() will be
extended to do range checks. Hence, it is necessary for all
file system timestamps to use current_time(). Also,
current_time() will be transitioned along with vfs to be
y2038 safe.

Note that whenever a single call to current_time() is used
to change timestamps in different inodes, it is because they
share the same time granularity.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Felipe Balbi <balbi@kernel.org>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: David Sterba <dsterba@suse.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-09-27 21:06:21 -04:00
Miklos Szeredi 2773bf00ae fs: rename "rename2" i_op to "rename"
Generated patch:

sed -i "s/\.rename2\t/\.rename\t\t/" `git grep -wl rename2`
sed -i "s/\brename2\b/rename/g" `git grep -wl rename2`

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-09-27 11:03:58 +02:00
Lorenzo Stoakes 38e0885465 mm: check VMA flags to avoid invalid PROT_NONE NUMA balancing
The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
PMDs respectively as requiring balancing upon a subsequent page fault.
User-defined PROT_NONE memory regions which also have this flag set will
not normally invoke the NUMA balancing code as do_page_fault() will send
a segfault to the process before handle_mm_fault() is even called.

However if access_remote_vm() is invoked to access a PROT_NONE region of
memory, handle_mm_fault() is called via faultin_page() and
__get_user_pages() without any access checks being performed, meaning
the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
region.

A simple means of triggering this problem is to access PROT_NONE mmap'd
memory using /proc/self/mem which reliably results in the NUMA handling
functions being invoked when CONFIG_NUMA_BALANCING is set.

This issue was reported in bugzilla (issue 99101) which includes some
simple repro code.

There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
added at commit c0e7cad to avoid accidentally provoking strange
behaviour by attempting to apply NUMA balancing to pages that are in
fact PROT_NONE.  The BUG_ON()'s are consistently triggered by the repro.

This patch moves the PROT_NONE check into mm/memory.c rather than
invoking BUG_ON() as faulting in these pages via faultin_page() is a
valid reason for reaching the NUMA check with the PROT_NONE page table
flag set and is therefore not always a bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
Reported-by: Trevor Saunders <tbsaunde@tbsaunde.org>
Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-25 15:43:42 -07:00
Linus Torvalds 0f26574178 Merge branch 'hughd-fixes' (patches from Hugh Dickins)
Merge VM fixes from High Dickins:
 "I get the impression that Andrew is away or busy at the moment, so I'm
  going to send you three independent uncontroversial little mm fixes
  directly - though none is strictly a 4.8 regression fix.

   - shmem: fix tmpfs to handle the huge= option properly from Toshi
     Kani is a one-liner to fix a major embarrassment in 4.8's hugepages
     on tmpfs feature: although Hillf pointed it out in June, somehow
     both Kirill and I repeatedly dropped the ball on this one.  You
     might wonder if the feature got tested at all with that bug in:
     yes, it did, but for wider testing coverage, Kirill and I had each
     relied too much on an override which bypasses that condition.

   - huge tmpfs: fix Committed_AS leak just a run-of-the-mill accounting
     fix in the same feature.

   - mm: delete unnecessary and unsafe init_tlb_ubc() is an unrelated
     fix to 4.3's TLB flush batching in reclaim: the bug would be rare,
     and none of us will be shamed if this one misses 4.8; but it got
     such a quick ack from Mel today that I'm inclined to offer it along
     with the first two"

* emailed patches from Hugh Dickins <hughd@google.com>:
  mm: delete unnecessary and unsafe init_tlb_ubc()
  huge tmpfs: fix Committed_AS leak
  shmem: fix tmpfs to handle the huge= option properly
2016-09-24 11:31:45 -07:00
Hugh Dickins b385d21f27 mm: delete unnecessary and unsafe init_tlb_ubc()
init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
initialized with zeroes in the init_task, and copied from parent to
child while it is quiescent in arch_dup_task_struct(); so I went to
delete it.

But inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
check that it was always empty at that point, and found them firing:
because memcg reclaim can recurse into global reclaim (when allocating
biosets for swapout in my case), and arrive back at the init_tlb_ubc()
in shrink_node_memcg().

Resetting tlb_ubc.flush_required at that point is wrong: if the upper
level needs a deferred TLB flush, but the lower level turns out not to,
we miss a TLB flush.  But fortunately, that's the only part of the
protocol that does not nest: with the initialization removed, cpumask
collects bits from upper and lower levels, and flushes TLB when needed.

Fixes: 72b252aed5 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-24 11:20:01 -07:00
Hugh Dickins 71664665c3 huge tmpfs: fix Committed_AS leak
Under swapping load on huge tmpfs, /proc/meminfo's Committed_AS grows
bigger and bigger: just a cosmetic issue for most users, but disabling
for those who run without overcommit (/proc/sys/vm/overcommit_memory 2).

shmem_uncharge() was forgetting to unaccount __vm_enough_memory's
charge, and shmem_charge() was forgetting it on the filesystem-full
error path.

Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-24 11:20:01 -07:00
Toshi Kani 3089bf614c shmem: fix tmpfs to handle the huge= option properly
shmem_get_unmapped_area() checks SHMEM_SB(sb)->huge incorrectly, which
leads to a reversed effect of "huge=" mount option.

Fix the check in shmem_get_unmapped_area().

Note, the default value of SHMEM_SB(sb)->huge remains as
SHMEM_HUGE_NEVER.  User will need to specify "huge=" option to enable
huge page mappings.

Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-24 11:20:01 -07:00
Ingo Molnar 50797851b4 Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-22 14:49:40 +02:00
Jan Kara 31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Laura Abbott aa4f060111 mm: usercopy: Check for module addresses
While running a compile on arm64, I hit a memory exposure

usercopy: kernel memory exposure attempt detected from fffffc0000f3b1a8 (buffer_head) (1 bytes)
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:75!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in: ip6t_rpfilter ip6t_REJECT
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp
llc ebtable_nat ip6table_security ip6table_raw ip6table_nat
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle
iptable_security iptable_raw iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
ebtable_filter ebtables ip6table_filter ip6_tables vfat fat xgene_edac
xgene_enet edac_core i2c_xgene_slimpro i2c_core at803x realtek xgene_dma
mdio_xgene gpio_dwapb gpio_xgene_sb xgene_rng mailbox_xgene_slimpro nfsd
auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c sdhci_of_arasan
sdhci_pltfm sdhci mmc_core xhci_plat_hcd gpio_keys
CPU: 0 PID: 19744 Comm: updatedb Tainted: G        W 4.8.0-rc3-threadinfo+ #1
Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang Board, BIOS 3.06.12 Aug 12 2016
task: fffffe03df944c00 task.stack: fffffe00d128c000
PC is at __check_object_size+0x70/0x3f0
LR is at __check_object_size+0x70/0x3f0
...
[<fffffc00082b4280>] __check_object_size+0x70/0x3f0
[<fffffc00082cdc30>] filldir64+0x158/0x1a0
[<fffffc0000f327e8>] __fat_readdir+0x4a0/0x558 [fat]
[<fffffc0000f328d4>] fat_readdir+0x34/0x40 [fat]
[<fffffc00082cd8f8>] iterate_dir+0x190/0x1e0
[<fffffc00082cde58>] SyS_getdents64+0x88/0x120
[<fffffc0008082c70>] el0_svc_naked+0x24/0x28

fffffc0000f3b1a8 is a module address. Modules may have compiled in
strings which could get copied to userspace. In this instance, it
looks like "." which matches with a size of 1 byte. Extend the
is_vmalloc_addr check to be is_vmalloc_or_module_addr to cover
all possible cases.

Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-20 16:07:39 -07:00
Johannes Weiner db2ba40c27 mm: memcontrol: make per-cpu charge cache IRQ-safe for socket accounting
During cgroup2 rollout into production, we started encountering css
refcount underflows and css access crashes in the memory controller.
Splitting the heavily shared css reference counter into logical users
narrowed the imbalance down to the cgroup2 socket memory accounting.

The problem turns out to be the per-cpu charge cache.  Cgroup1 had a
separate socket counter, but the new cgroup2 socket accounting goes
through the common charge path that uses a shared per-cpu cache for all
memory that is being tracked.  Those caches are safe against scheduling
preemption, but not against interrupts - such as the newly added packet
receive path.  When cache draining is interrupted by network RX taking
pages out of the cache, the resuming drain operation will put references
of in-use pages, thus causing the imbalance.

Disable IRQs during all per-cpu charge cache operations.

Fixes: f7e1cb6ec5 ("mm: memcontrol: account socket memory in unified hierarchy memory controller")
Link: http://lkml.kernel.org/r/20160914194846.11153-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Santosh Shilimkar c8de641b1e mm: fix the page_swap_info() BUG_ON check
Commit 62c230bc17 ("mm: add support for a filesystem to activate
swap files and use direct_IO for writing swap pages") replaced the
swap_aops dirty hook from __set_page_dirty_no_writeback() with
swap_set_page_dirty().

For normal cases without these special SWP flags code path falls back to
__set_page_dirty_no_writeback() so the behaviour is expected to be the
same as before.

But swap_set_page_dirty() makes use of the page_swap_info() helper to
get the swap_info_struct to check for the flags like SWP_FILE,
SWP_BLKDEV etc as desired for those features.  This helper has
BUG_ON(!PageSwapCache(page)) which is racy and safe only for the
set_page_dirty_lock() path.

For the set_page_dirty() path which is often needed for cases to be
called from irq context, kswapd() can toggle the flag behind the back
while the call is getting executed when system is low on memory and
heavy swapping is ongoing.

This ends up with undesired kernel panic.

This patch just moves the check outside the helper to its users
appropriately to fix kernel panic for the described path.  Couple of
users of helpers already take care of SwapCache condition so I skipped
them.

Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joe Perches <joe@perches.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Rik van Riel <riel@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jens Axboe <axboe@fb.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>	[4.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:17 -07:00
Kirill A. Shutemov 4d35427ad7 mm: avoid endless recursion in dump_page()
dump_page() uses page_mapcount() to get mapcount of the page.
page_mapcount() has VM_BUG_ON_PAGE(PageSlab(page)) as mapcount doesn't
make sense for slab pages and the field in struct page used for other
information.

It leads to recursion if dump_page() called for slub page and DEBUG_VM
is enabled:

dump_page() -> page_mapcount() -> VM_BUG_ON_PAGE() -> dump_page -> ...

Let's avoid calling page_mapcount() for slab pages in dump_page().

Link: http://lkml.kernel.org/r/20160908082137.131076-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:16 -07:00
Ebru Akagunduz 982785c6b0 mm, thp: fix leaking mapped pte in __collapse_huge_page_swapin()
Currently, khugepaged does not permit swapin if there are enough young
pages in a THP.  The problem is when a THP does not have enough young
pages, khugepaged leaks mapped ptes.

This patch prohibits leaking mapped ptes.

Link: http://lkml.kernel.org/r/1472820276-7831-1-git-send-email-ebru.akagunduz@gmail.com
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:16 -07:00
Kirill A. Shutemov c131f751ab khugepaged: fix use-after-free in collapse_huge_page()
hugepage_vma_revalidate() tries to re-check if we still should try to
collapse small pages into huge one after the re-acquiring mmap_sem.

The problem Dmitry Vyukov reported[1] is that the vma found by
hugepage_vma_revalidate() can be suitable for huge pages, but not the
same vma we had before dropping mmap_sem.  And dereferencing original
vma can lead to fun results..

Let's use vma hugepage_vma_revalidate() found instead of assuming it's the
same as what we had before the lock was dropped.

[1] http://lkml.kernel.org/r/CACT4Y+Z3gigBvhca9kRJFcjX0G70V_nRhbwKBU+yGoESBDKi9Q@mail.gmail.com

Link: http://lkml.kernel.org/r/20160907122559.GA6542@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: syzkaller <syzkaller@googlegroups.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:16 -07:00
Li Zhong 9bb627be47 mem-hotplug: don't clear the only node in new_node_page()
Commit 394e31d2ce ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") introduced new_node_page() for memory
hotplug.

In new_node_page(), the nid is cleared before calling
__alloc_pages_nodemask().  But if it is the only node of the system, and
the first round allocation fails, it will not be able to get memory from
an empty nodemask, and will trigger oom.

The patch checks whether it is the last node on the system, and if it
is, then don't clear the nid in the nodemask.

Fixes: 394e31d2ce ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Reported-by: John Allen <jallen@linux.vnet.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-19 15:36:16 -07:00
Tejun Heo eac0337af1 slab, workqueue: remove keventd_up() usage
Now that workqueue can handle work item queueing from very early
during boot, there is no need to gate schedule_delayed_work_on() while
!keventd_up().  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
2016-09-17 13:18:21 -04:00
Dmitry Safonov 2eefd87896 x86/arch_prctl/vdso: Add ARCH_MAP_VDSO_*
Add API to change vdso blob type with arch_prctl.
As this is usefull only by needs of CRIU, expose
this interface under CONFIG_CHECKPOINT_RESTORE.

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: 0x7f454c46@gmail.com
Cc: oleg@redhat.com
Cc: linux-mm@kvack.org
Cc: gorcunov@openvz.org
Cc: xemul@virtuozzo.com
Link: http://lkml.kernel.org/r/20160905133308.28234-4-dsafonov@virtuozzo.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-14 21:28:09 +02:00
Rik van Riel d59dc7bcfa sched/numa, mm: Revert to checking pmd/pte_write instead of VMA flags
Commit:

  4d94246699 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly
found to introduce a regression with NUMA grouping.

It was followed up by these commits:

 53da3bc2ba ("mm: fix up numa read-only thread grouping logic")
 bea66fbd11 ("mm: numa: group related processes based on VMA flags instead of page table flags")
 b191f9b106 ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

The first of those two commits try alternate approaches to NUMA
grouping, which apparently do not work as well as looking at the PTE
write permissions.

The latter patch preserves the PTE write permissions across a NUMA
protection fault. However, it forgets to revert the condition for
whether or not to group tasks together back to what it was before
v3.19, even though the information is now preserved in the page tables
once again.

This patch brings the NUMA grouping heuristic back to what it was
before commit 4d94246699, which the changelogs of subsequent
commits suggest worked best.

We have all the information again. We should probably use it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: aarcange@redhat.com
Cc: linux-mm@kvack.org
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-13 20:31:33 +02:00
Anisse Astier 1ad1410f63 PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO
PAGE_POISONING_ZERO disables zeroing new pages on alloc, they are
poisoned (zeroed) as they become available.
In the hibernate use case, free pages will appear in the system without
being cleared, left there by the loading kernel.

This patch will make sure free pages are cleared on resume when
PAGE_POISONING_ZERO is enabled. We free the pages just after resume
because we can't do it later: going through any device resume code might
allocate some memory and invalidate the free pages bitmap.

Thus we don't need to disable hibernation when PAGE_POISONING_ZERO is
enabled.

Signed-off-by: Anisse Astier <anisse@astier.eu>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2016-09-13 02:35:27 +02:00
Linus Torvalds 98ac9a608d Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
 "nvdimm fixes for v4.8, two of them are tagged for -stable:

   - Fix devm_memremap_pages() to use track_pfn_insert().  Otherwise,
     DAX pmd mappings end up with an uncached pgprot, and unusable
     performance for the device-dax interface.  The device-dax interface
     appeared in 4.7 so this is tagged for -stable.

   - Fix a couple VM_BUG_ON() checks in the show_smaps() path to
     understand DAX pmd entries.  This fix is tagged for -stable.

   - Fix a mis-merge of the nfit machine-check handler to flip the
     polarity of an if() to match the final version of the patch that
     Vishal sent for 4.8-rc1.  Without this the nfit machine check
     handler never detects / inserts new 'badblocks' entries which
     applications use to identify lost portions of files.

   - For test purposes, fix the nvdimm_clear_poison() path to operate on
     legacy / simulated nvdimm memory ranges.  Without this fix a test
     can set badblocks, but never clear them on these ranges.

   - Fix the range checking done by dax_dev_pmd_fault().  This is not
     tagged for -stable since this problem is mitigated by specifying
     aligned resources at device-dax setup time.

  These patches have appeared in a next release over the past week.  The
  recent rebase you can see in the timestamps was to drop an invalid fix
  as identified by the updated device-dax unit tests [1].  The -mm
  touches have an ack from Andrew"

[1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
   https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  libnvdimm: allow legacy (e820) pmem region to clear bad blocks
  nfit, mce: Fix SPA matching logic in MCE handler
  mm: fix cache mode of dax pmd mappings
  mm: fix show_smap() for zone_device-pmd ranges
  dax: fix mapping size check
2016-09-10 09:58:52 -07:00
Dan Williams ca120cf688 mm: fix show_smap() for zone_device-pmd ranges
Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
currently results in the following VM_BUG_ONs:

 kernel BUG at mm/huge_memory.c:1105!
 task: ffff88045f16b140 task.stack: ffff88045be14000
 RIP: 0010:[<ffffffff81268f9b>]  [<ffffffff81268f9b>] follow_trans_huge_pmd+0x2cb/0x340
 [..]
 Call Trace:
  [<ffffffff81306030>] smaps_pte_range+0xa0/0x4b0
  [<ffffffff814c2755>] ? vsnprintf+0x255/0x4c0
  [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
  [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
  [<ffffffff81307656>] show_smap+0xa6/0x2b0

 kernel BUG at fs/proc/task_mmu.c:585!
 RIP: 0010:[<ffffffff81306469>]  [<ffffffff81306469>] smaps_pte_range+0x499/0x4b0
 Call Trace:
  [<ffffffff814c2795>] ? vsnprintf+0x255/0x4c0
  [<ffffffff8123c46e>] __walk_page_range+0x1fe/0x4d0
  [<ffffffff8123c8a2>] walk_page_vma+0x62/0x80
  [<ffffffff81307696>] show_smap+0xa6/0x2b0

These locations are sanity checking page flags that must be set for an
anonymous transparent huge page, but are not set for the zone_device
pages associated with dax mappings.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-09-09 17:34:45 -07:00
Dave Hansen e8c24d3a23 x86/pkeys: Allocation/free syscalls
This patch adds two new system calls:

	int pkey_alloc(unsigned long flags, unsigned long init_access_rights)
	int pkey_free(int pkey);

These implement an "allocator" for the protection keys
themselves, which can be thought of as analogous to the allocator
that the kernel has for file descriptors.  The kernel tracks
which numbers are in use, and only allows operations on keys that
are valid.  A key which was not obtained by pkey_alloc() may not,
for instance, be passed to pkey_mprotect().

These system calls are also very important given the kernel's use
of pkeys to implement execute-only support.  These help ensure
that userspace can never assume that it has control of a key
unless it first asks the kernel.  The kernel does not promise to
preserve PKRU (right register) contents except for allocated
pkeys.

The 'init_access_rights' argument to pkey_alloc() specifies the
rights that will be established for the returned pkey.  For
instance:

	pkey = pkey_alloc(flags, PKEY_DENY_WRITE);

will allocate 'pkey', but also sets the bits in PKRU[1] such that
writing to 'pkey' is already denied.

The kernel does not prevent pkey_free() from successfully freeing
in-use pkeys (those still assigned to a memory range by
pkey_mprotect()).  It would be expensive to implement the checks
for this, so we instead say, "Just don't do it" since sane
software will never do it anyway.

Any piece of userspace calling pkey_alloc() needs to be prepared
for it to fail.  Why?  pkey_alloc() returns the same error code
(ENOSPC) when there are no pkeys and when pkeys are unsupported.
They can be unsupported for a whole host of reasons, so apps must
be prepared for this.  Also, libraries or LD_PRELOADs might steal
keys before an application gets access to them.

This allocation mechanism could be implemented in userspace.
Even if we did it in userspace, we would still need additional
user/kernel interfaces to tell userspace which keys are being
used by the kernel internally (such as for execute-only
mappings).  Having the kernel provide this facility completely
removes the need for these additional interfaces, or having an
implementation of this in userspace at all.

Note that we have to make changes to all of the architectures
that do not use mman-common.h because we use the new
PKEY_DENY_ACCESS/WRITE macros in arch-independent code.

1. PKRU is the Protection Key Rights User register.  It is a
   usermode-accessible register that controls whether writes
   and/or access to each individual pkey is allowed or denied.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163015.444FE75F@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:27 +02:00
Dave Hansen a8502b67d7 x86/pkeys: Make mprotect_key() mask off additional vm_flags
Today, mprotect() takes 4 bits of data: PROT_READ/WRITE/EXEC/NONE.
Three of those bits: READ/WRITE/EXEC get translated directly in to
vma->vm_flags by calc_vm_prot_bits().  If a bit is unset in
mprotect()'s 'prot' argument then it must be cleared in vma->vm_flags
during the mprotect() call.

We do this clearing today by first calculating the VMA flags we
want set, then clearing the ones we do not want to inherit from
the original VMA:

	vm_flags = calc_vm_prot_bits(prot, key);
	...
	newflags = vm_flags;
	newflags |= (vma->vm_flags & ~(VM_READ | VM_WRITE | VM_EXEC));

However, we *also* want to mask off the original VMA's vm_flags in
which we store the protection key.

To do that, this patch adds a new macro:

	ARCH_VM_PKEY_FLAGS

which allows the architecture to specify additional bits that it would
like cleared.  We use that to ensure that the VM_PKEY_BIT* bits get
cleared.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163013.E48D6981@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:26 +02:00
Dave Hansen 7d06d9c9bd mm: Implement new pkey_mprotect() system call
pkey_mprotect() is just like mprotect, except it also takes a
protection key as an argument.  On systems that do not support
protection keys, it still works, but requires that key=0.
Otherwise it does exactly what mprotect does.

I expect it to get used like this, if you want to guarantee that
any mapping you create can *never* be accessed without the right
protection keys set up.

	int real_prot = PROT_READ|PROT_WRITE;
	pkey = pkey_alloc(0, PKEY_DENY_ACCESS);
	ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey);

This way, there is *no* window where the mapping is accessible
since it was always either PROT_NONE or had a protection key set
that denied all access.

We settled on 'unsigned long' for the type of the key here.  We
only need 4 bits on x86 today, but I figured that other
architectures might need some more space.

Semantically, we have a bit of a problem if we combine this
syscall with our previously-introduced execute-only support:
What do we do when we mix execute-only pkey use with
pkey_mprotect() use?  For instance:

	pkey_mprotect(ptr, PAGE_SIZE, PROT_WRITE, 6); // set pkey=6
	mprotect(ptr, PAGE_SIZE, PROT_EXEC);  // set pkey=X_ONLY_PKEY?
	mprotect(ptr, PAGE_SIZE, PROT_WRITE); // is pkey=6 again?

To solve that, we make the plain-mprotect()-initiated execute-only
support only apply to VMAs that have the default protection key (0)
set on them.

Proposed semantics:
1. protection key 0 is special and represents the default,
   "unassigned" protection key.  It is always allocated.
2. mprotect() never affects a mapping's pkey_mprotect()-assigned
   protection key. A protection key of 0 (even if set explicitly)
   represents an unassigned protection key.
   2a. mprotect(PROT_EXEC) on a mapping with an assigned protection
       key may or may not result in a mapping with execute-only
       properties.  pkey_mprotect() plus pkey_set() on all threads
       should be used to _guarantee_ execute-only semantics if this
       is not a strong enough semantic.
3. mprotect(PROT_EXEC) may result in an "execute-only" mapping. The
   kernel will internally attempt to allocate and dedicate a
   protection key for the purpose of execute-only mappings.  This
   may not be possible in cases where there are no free protection
   keys available.  It can also happen, of course, in situations
   where there is no hardware support for protection keys.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-arch@vger.kernel.org
Cc: Dave Hansen <dave@sr71.net>
Cc: arnd@arndb.de
Cc: linux-api@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: akpm@linux-foundation.org
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20160729163012.3DDD36C4@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-09 13:02:26 +02:00
Kees Cook 8e1f74ea02 usercopy: remove page-spanning test for now
A custom allocator without __GFP_COMP that copies to userspace has been
found in vmw_execbuf_process[1], so this disables the page-span checker
by placing it behind a CONFIG for future work where such things can be
tracked down later.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1373326

Reported-by: Vinson Lee <vlee@freedesktop.org>
Fixes: f5509cc18d ("mm: Hardened usercopy")
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-09-07 11:33:26 -07:00
Sebastian Andrzej Siewior 1d7ac6aec9 mm/writeback: Convert to hotplug state machine
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20160818125731.27256-6-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:20 +02:00
Sebastian Andrzej Siewior a96a87bf94 slub: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Link: http://lkml.kernel.org/r/20160818125731.27256-5-bigeasy@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:20 +02:00
Sebastian Andrzej Siewior 6731d4f123 slab: Convert to hotplug state machine
Install the callbacks via the state machine.

Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: linux-mm@kvack.org
Cc: rt@linutronix.de
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Link: http://lkml.kernel.org/r/20160823125319.abeapfjapf2kfezp@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-06 18:30:20 +02:00
David Rientjes c11600e4fe mm, mempolicy: task->mempolicy must be NULL before dropping final reference
KASAN allocates memory from the page allocator as part of
kmem_cache_free(), and that can reference current->mempolicy through any
number of allocation functions.  It needs to be NULL'd out before the
final reference is dropped to prevent a use-after-free bug:

	BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
	CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
	...
	Call Trace:
		dump_stack
		kasan_object_err
		kasan_report_error
		__asan_report_load2_noabort
		alloc_pages_current	<-- use after free
		depot_save_stack
		save_stack
		kasan_slab_free
		kmem_cache_free
		__mpol_put		<-- free
		do_exit

This patch sets current->mempolicy to NULL before dropping the final
reference.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
Fixes: cd11016e5f ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org>	[4.6+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:01 -07:00
Mel Gorman 6aa303defb mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator
Firmware Assisted Dump (FA_DUMP) on ppc64 reserves substantial amounts
of memory when booting a secondary kernel.  Srikar Dronamraju reported
that multiple nodes may have no memory managed by the buddy allocator
but still return true for populated_zone().

Commit 1d82de618d ("mm, vmscan: make kswapd reclaim in terms of
nodes") was reported to cause kswapd to spin at 100% CPU usage when
fadump was enabled.  The old code happened to deal with the situation of
a populated node with zero free pages by co-incidence but the current
code tries to reclaim populated zones without realising that is
impossible.

We cannot just convert populated_zone() as many existing users really
need to check for present_pages.  This patch introduces a managed_zone()
helper and uses it in the few cases where it is critical that the check
is made for managed pages -- zonelist construction and page reclaim.

Link: http://lkml.kernel.org/r/20160831195104.GB8119@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:01 -07:00
Michal Hocko 6b4e3181d7 mm, oom: prevent premature OOM killer invocation for high order request
There have been several reports about pre-mature OOM killer invocation
in 4.7 kernel when order-2 allocation request (for the kernel stack)
invoked OOM killer even during basic workloads (light IO or even kernel
compile on some filesystems).  In all reported cases the memory is
fragmented and there are no order-2+ pages available.  There is usually
a large amount of slab memory (usually dentries/inodes) and further
debugging has shown that there are way too many unmovable blocks which
are skipped during the compaction.  Multiple reporters have confirmed
that the current linux-next which includes [1] and [2] helped and OOMs
are not reproducible anymore.

A simpler fix for the late rc and stable is to simply ignore the
compaction feedback and retry as long as there is a reclaim progress and
we are not getting OOM for order-0 pages.  We already do that for
CONFING_COMPACTION=n so let's reuse the same code when compaction is
enabled as well.

[1] http://lkml.kernel.org/r/20160810091226.6709-1-vbabka@suse.cz
[2] http://lkml.kernel.org/r/f7a9ea9d-bb88-bfd6-e340-3a933559305a@suse.cz

Fixes: 0a0337e0d1 ("mm, oom: rework oom detection")
Link: http://lkml.kernel.org/r/20160823074339.GB23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Tested-by: Olaf Hering <olaf@aepfle.de>
Tested-by: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com>
Cc: Ralf-Peter Rohbeck <Ralf-Peter.Rohbeck@quantum.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[4.7.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-09-01 17:52:01 -07:00
Ross Zwisler 11bd969fde mm: silently skip readahead for DAX inodes
For DAX inodes we need to be careful to never have page cache pages in
the mapping->page_tree.  This radix tree should be composed only of DAX
exceptional entries and zero pages.

ltp's readahead02 test was triggering a warning because we were trying
to insert a DAX exceptional entry but found that a page cache page had
already been inserted into the tree.  This page was being inserted into
the radix tree in response to a readahead(2) call.

Readahead doesn't make sense for DAX inodes, but we don't want it to
report a failure either.  Instead, we just return success and don't do
any work.

Link: http://lkml.kernel.org/r/20160824221429.21158-1-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-26 17:39:35 -07:00
Arnd Bergmann 358c07fcc3 mm: memcontrol: avoid unused function warning
A bugfix in v4.8-rc2 introduced a harmless warning when
CONFIG_MEMCG_SWAP is disabled but CONFIG_MEMCG is enabled:

  mm/memcontrol.c:4085:27: error: 'mem_cgroup_id_get_online' defined but not used [-Werror=unused-function]
   static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)

This moves the function inside of the #ifdef block that hides the
calling function, to avoid the warning.

Fixes: 1f47b61fb4 ("mm: memcontrol: fix swap counter leak on swapout from offline cgroup")
Link: http://lkml.kernel.org/r/20160824113733.2776701-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-26 17:39:35 -07:00
Michal Hocko b32eaf71db mm: clarify COMPACTION Kconfig text
The current wording of the COMPACTION Kconfig help text doesn't
emphasise that disabling COMPACTION might cripple the page allocator
which relies on the compaction quite heavily for high order requests and
an unexpected OOM can happen with the lack of compaction.  Make sure we
are vocal about that.

Link: http://lkml.kernel.org/r/20160823091726.GK23577@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-26 17:39:35 -07:00
Andrea Arcangeli 804dd15046 soft_dirty: fix soft_dirty during THP split
While adding proper userfaultfd_wp support with bits in pagetable and
swap entry to avoid false positives WP userfaults through swap/fork/
KSM/etc, I've been adding a framework that mostly mirrors soft dirty.

So I noticed in one place I had to add uffd_wp support to the pagetables
that wasn't covered by soft_dirty and I think it should have.

Example: in the THP migration code migrate_misplaced_transhuge_page()
pmd_mkdirty is called unconditionally after mk_huge_pmd.

	entry = mk_huge_pmd(new_page, vma->vm_page_prot);
	entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

That sets soft dirty too (it's a false positive for soft dirty, the soft
dirty bit could be more finegrained and transfer the bit like uffd_wp
will do..  pmd/pte_uffd_wp() enforces the invariant that when it's set
pmd/pte_write is not set).

However in the THP split there's no unconditional pmd_mkdirty after
mk_huge_pmd and pte_swp_mksoft_dirty isn't called after the migration
entry is created.  The code sets the dirty bit in the struct page
instead of setting it in the pagetable (which is fully equivalent as far
as the real dirty bit is concerned, as the whole point of pagetable bits
is to be eventually flushed out of to the page, but that is not
equivalent for the soft-dirty bit that gets lost in translation).

This was found by code review only and totally untested as I'm working
to actually replace soft dirty and I don't have time to test potential
soft dirty bugfixes as well :).

Transfer the soft_dirty from pmd to pte during THP splits.

This fix avoids losing the soft_dirty bit and avoids userland memory
corruption in the checkpoint.

Fixes: eef1b3ba05 ("thp: implement split_huge_pmd()")
Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-26 17:39:35 -07:00
Catalin Marinas cab15ce604 arm64: Introduce execute-only page access permissions
The ARMv8 architecture allows execute-only user permissions by clearing
the PTE_UXN and PTE_USER bits. However, the kernel running on a CPU
implementation without User Access Override (ARMv8.2 onwards) can still
access such page, so execute-only page permission does not protect
against read(2)/write(2) etc. accesses. Systems requiring such
protection must enable features like SECCOMP.

This patch changes the arm64 __P100 and __S100 protection_map[] macros
to the new __PAGE_EXECONLY attributes. A side effect is that
pte_user() no longer triggers for __PAGE_EXECONLY since PTE_USER isn't
set. To work around this, the check is done on the PTE_NG bit via the
pte_ng() macro. VM_READ is also checked now for page faults.

Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2016-08-25 18:00:29 +01:00
Josh Poimboeuf 94cd97af69 usercopy: fix overlap check for kernel text
When running with a local patch which moves the '_stext' symbol to the
very beginning of the kernel text area, I got the following panic with
CONFIG_HARDENED_USERCOPY:

  usercopy: kernel memory exposure attempt detected from ffff88103dfff000 (<linear kernel text>) (4096 bytes)
  ------------[ cut here ]------------
  kernel BUG at mm/usercopy.c:79!
  invalid opcode: 0000 [#1] SMP
  ...
  CPU: 0 PID: 4800 Comm: cp Not tainted 4.8.0-rc3.after+ #1
  Hardware name: Dell Inc. PowerEdge R720/0X3D66, BIOS 2.5.4 01/22/2016
  task: ffff880817444140 task.stack: ffff880816274000
  RIP: 0010:[<ffffffff8121c796>] __check_object_size+0x76/0x413
  RSP: 0018:ffff880816277c40 EFLAGS: 00010246
  RAX: 000000000000006b RBX: ffff88103dfff000 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: ffff88081f80dfa8 RDI: ffff88081f80dfa8
  RBP: ffff880816277c90 R08: 000000000000054c R09: 0000000000000000
  R10: 0000000000000005 R11: 0000000000000006 R12: 0000000000001000
  R13: ffff88103e000000 R14: ffff88103dffffff R15: 0000000000000001
  FS:  00007fb9d1750800(0000) GS:ffff88081f800000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000021d2000 CR3: 000000081a08f000 CR4: 00000000001406f0
  Stack:
   ffff880816277cc8 0000000000010000 000000043de07000 0000000000000000
   0000000000001000 ffff880816277e60 0000000000001000 ffff880816277e28
   000000000000c000 0000000000001000 ffff880816277ce8 ffffffff8136c3a6
  Call Trace:
   [<ffffffff8136c3a6>] copy_page_to_iter_iovec+0xa6/0x1c0
   [<ffffffff8136e766>] copy_page_to_iter+0x16/0x90
   [<ffffffff811970e3>] generic_file_read_iter+0x3e3/0x7c0
   [<ffffffffa06a738d>] ? xfs_file_buffered_aio_write+0xad/0x260 [xfs]
   [<ffffffff816e6262>] ? down_read+0x12/0x40
   [<ffffffffa06a61b1>] xfs_file_buffered_aio_read+0x51/0xc0 [xfs]
   [<ffffffffa06a6692>] xfs_file_read_iter+0x62/0xb0 [xfs]
   [<ffffffff812224cf>] __vfs_read+0xdf/0x130
   [<ffffffff81222c9e>] vfs_read+0x8e/0x140
   [<ffffffff81224195>] SyS_read+0x55/0xc0
   [<ffffffff81003a47>] do_syscall_64+0x67/0x160
   [<ffffffff816e8421>] entry_SYSCALL64_slow_path+0x25/0x25
  RIP: 0033:[<00007fb9d0c33c00>] 0x7fb9d0c33c00
  RSP: 002b:00007ffc9c262f28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
  RAX: ffffffffffffffda RBX: fffffffffff8ffff RCX: 00007fb9d0c33c00
  RDX: 0000000000010000 RSI: 00000000021c3000 RDI: 0000000000000004
  RBP: 00000000021c3000 R08: 0000000000000000 R09: 00007ffc9c264d6c
  R10: 00007ffc9c262c50 R11: 0000000000000246 R12: 0000000000010000
  R13: 00007ffc9c2630b0 R14: 0000000000000004 R15: 0000000000010000
  Code: 81 48 0f 44 d0 48 c7 c6 90 4d a3 81 48 c7 c0 bb b3 a2 81 48 0f 44 f0 4d 89 e1 48 89 d9 48 c7 c7 68 16 a3 81 31 c0 e8 f4 57 f7 ff <0f> 0b 48 8d 90 00 40 00 00 48 39 d3 0f 83 22 01 00 00 48 39 c3
  RIP  [<ffffffff8121c796>] __check_object_size+0x76/0x413
   RSP <ffff880816277c40>

The checked object's range [ffff88103dfff000, ffff88103e000000) is
valid, so there shouldn't have been a BUG.  The hardened usercopy code
got confused because the range's ending address is the same as the
kernel's text starting address at 0xffff88103e000000.  The overlap check
is slightly off.

Fixes: f5509cc18d ("mm: Hardened usercopy")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-08-22 19:10:51 -07:00
Eric Biggers 7329a65587 usercopy: avoid potentially undefined behavior in pointer math
check_bogus_address() checked for pointer overflow using this expression,
where 'ptr' has type 'const void *':

	ptr + n < ptr

Since pointer wraparound is undefined behavior, gcc at -O2 by default
treats it like the following, which would not behave as intended:

	(long)n < 0

Fortunately, this doesn't currently happen for kernel code because kernel
code is compiled with -fno-strict-overflow.  But the expression should be
fixed anyway to use well-defined integer arithmetic, since it could be
treated differently by different compilers in the future or could be
reported by tools checking for undefined behavior.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2016-08-22 19:07:55 -07:00
Reza Arbab 5830169f47 mm/memory_hotplug.c: initialize per_cpu_nodestats for hotadded pgdats
The following oops occurs after a pgdat is hotadded:

  Unable to handle kernel paging request for data at address 0x00c30001
  Faulting instruction address: 0xc00000000022f8f4
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
  CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W 4.8.0-rc1-device #110
  task: c000000000ef3080 task.stack: c000000000f6c000
  NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
  REGS: c000000000f6fa50 TRAP: 0300   Tainted: G        W (4.8.0-rc1-device)
  MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 84002028  XER: 20000000
  CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
  NIP refresh_cpu_vm_stats+0x1a4/0x2f0
  LR refresh_cpu_vm_stats+0x1f8/0x2f0
  Call Trace:
    refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)

Add per_cpu_nodestats initialization to the hotplug codepath.

Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 16:58:14 -07:00
Geert Uytterhoeven f33e6f0671 mm, oom: fix uninitialized ret in task_will_free_mem()
mm/oom_kill.c: In function `task_will_free_mem':
    mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function

If __task_will_free_mem() is never called inside the for_each_process()
loop, ret will not be initialized.

Fixes: 1af8bb4326 ("mm, oom: fortify task_will_free_mem()")
Link: http://lkml.kernel.org/r/1470255599-24841-1-git-send-email-geert@linux-m68k.org
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 16:58:14 -07:00
Alexander Potapenko bcbf0d566b kasan: remove the unnecessary WARN_ONCE from quarantine.c
It's quite unlikely that the user will so little memory that the per-CPU
quarantines won't fit into the given fraction of the available memory.
Even in that case he won't be able to do anything with the information
given in the warning.

Link: http://lkml.kernel.org/r/1470929182-101413-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 16:58:14 -07:00
Vladimir Davydov 615d66c37c mm: memcontrol: fix memcg id ref counter on swap charge move
Since commit 73f576c04b ("mm: memcontrol: fix cgroup creation failure
after many small jobs") swap entries do not pin memcg->css.refcnt
directly.  Instead, they pin memcg->id.ref.  So we should adjust the
reference counters accordingly when moving swap charges between cgroups.

Fixes: 73f576c04b ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Link: http://lkml.kernel.org/r/9ce297c64954a42dc90b543bc76106c4a94f07e8.1470219853.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>	[3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 16:58:13 -07:00
Vladimir Davydov 1f47b61fb4 mm: memcontrol: fix swap counter leak on swapout from offline cgroup
An offline memory cgroup might have anonymous memory or shmem left
charged to it and no swap.  Since only swap entries pin the id of an
offline cgroup, such a cgroup will have no id and so an attempt to
swapout its anon/shmem will not store memory cgroup info in the swap
cgroup map.  As a result, memcg->swap or memcg->memsw will never get
uncharged from it and any of its ascendants.

Fix this by always charging swapout to the first ancestor cgroup that
hasn't released its id yet.

[hannes@cmpxchg.org: add comment to mem_cgroup_swapout]
[vdavydov@virtuozzo.com: use WARN_ON_ONCE() in mem_cgroup_id_get_online()]
  Link: http://lkml.kernel.org/r/20160803123445.GJ13263@esperanza
Fixes: 73f576c04b ("mm: memcontrol: fix cgroup creation failure after many small jobs")
Link: http://lkml.kernel.org/r/5336daa5c9a32e776067773d9da655d2dc126491.1470219853.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>	[3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 16:58:13 -07:00
Mel Gorman 2f95ff90b9 proc, meminfo: use correct helpers for calculating LRU sizes in meminfo
meminfo_proc_show() and si_mem_available() are using the wrong helpers
for calculating the size of the LRUs.  The user-visible impact is that
there appears to be an abnormally high number of unevictable pages.

Link: http://lkml.kernel.org/r/20160805105805.GR2799@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 16:58:13 -07:00
zhong jiang c1470b33bb mm/hugetlb: fix incorrect hugepages count during mem hotplug
When memory hotplug operates, free hugepages will be freed if the
movable node is offline.  Therefore, /proc/sys/vm/nr_hugepages will be
incorrect.

Fix it by reducing max_huge_pages when the node is offlined.

n-horiguchi@ah.jp.nec.com said:

: dissolve_free_huge_page intends to break a hugepage into buddy, and the
: destination hugepage is supposed to be allocated from the pool of the
: destination node, so the system-wide pool size is reduced.  So adding
: h->max_huge_pages-- makes sense to me.

Link: http://lkml.kernel.org/r/1470624546-902-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-11 16:58:13 -07:00
Chris Wilson 6039892396 mm/slub.c: run free_partial() outside of the kmem_cache_node->list_lock
With debugobjects enabled and using SLAB_DESTROY_BY_RCU, when a
kmem_cache_node is destroyed the call_rcu() may trigger a slab
allocation to fill the debug object pool (__debug_object_init:fill_pool).

Everywhere but during kmem_cache_destroy(), discard_slab() is performed
outside of the kmem_cache_node->list_lock and avoids a lockdep warning
about potential recursion:

  =============================================
  [ INFO: possible recursive locking detected ]
  4.8.0-rc1-gfxbench+ #1 Tainted: G     U
  ---------------------------------------------
  rmmod/8895 is trying to acquire lock:
   (&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811c80d7>] get_partial_node.isra.63+0x47/0x430

  but task is already holding lock:
   (&(&n->list_lock)->rlock){-.-...}, at: [<ffffffff811cbda4>] __kmem_cache_shutdown+0x54/0x320

  other info that might help us debug this:
  Possible unsafe locking scenario:
        CPU0
        ----
   lock(&(&n->list_lock)->rlock);
   lock(&(&n->list_lock)->rlock);

   *** DEADLOCK ***
   May be due to missing lock nesting notation
   5 locks held by rmmod/8895:
   #0:  (&dev->mutex){......}, at: driver_detach+0x42/0xc0
   #1:  (&dev->mutex){......}, at: driver_detach+0x50/0xc0
   #2:  (cpu_hotplug.dep_map){++++++}, at: get_online_cpus+0x2d/0x80
   #3:  (slab_mutex){+.+.+.}, at: kmem_cache_destroy+0x3c/0x220
   #4:  (&(&n->list_lock)->rlock){-.-...}, at: __kmem_cache_shutdown+0x54/0x320

  stack backtrace:
  CPU: 6 PID: 8895 Comm: rmmod Tainted: G     U          4.8.0-rc1-gfxbench+ #1
  Hardware name: Gigabyte Technology Co., Ltd. H87M-D3H/H87M-D3H, BIOS F11 08/18/2015
  Call Trace:
    __lock_acquire+0x1646/0x1ad0
    lock_acquire+0xb2/0x200
    _raw_spin_lock+0x36/0x50
    get_partial_node.isra.63+0x47/0x430
    ___slab_alloc.constprop.67+0x1a7/0x3b0
    __slab_alloc.isra.64.constprop.66+0x43/0x80
    kmem_cache_alloc+0x236/0x2d0
    __debug_object_init+0x2de/0x400
    debug_object_activate+0x109/0x1e0
    __call_rcu.constprop.63+0x32/0x2f0
    call_rcu+0x12/0x20
    discard_slab+0x3d/0x40
    __kmem_cache_shutdown+0xdb/0x320
    shutdown_cache+0x19/0x60
    kmem_cache_destroy+0x1ae/0x220
    i915_gem_load_cleanup+0x14/0x40 [i915]
    i915_driver_unload+0x151/0x180 [i915]
    i915_pci_remove+0x14/0x20 [i915]
    pci_device_remove+0x34/0xb0
    __device_release_driver+0x95/0x140
    driver_detach+0xb6/0xc0
    bus_remove_driver+0x53/0xd0
    driver_unregister+0x27/0x50
    pci_unregister_driver+0x25/0x70
    i915_exit+0x1a/0x1e2 [i915]
    SyS_delete_module+0x193/0x1f0
    entry_SYSCALL_64_fastpath+0x1c/0xac

Fixes: 52b4b950b5 ("mm: slab: free kmem_cache_node after destroy sysfs file")
Link: http://lkml.kernel.org/r/1470759070-18743-1-git-send-email-chris@chris-wilson.co.uk
Reported-by: Dave Gordon <david.s.gordon@intel.com>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Dmitry Safonov <dsafonov@virtuozzo.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Gordon <david.s.gordon@intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10 16:40:56 -07:00
Steve Capper 57dea93ac4 rmap: fix compound check logic in page_remove_file_rmap
In page_remove_file_rmap(.) we have the following check:

  VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);

This is meant to check for either HugeTLB pages or THP when a compound
page is passed in.

Unfortunately, if one disables CONFIG_TRANSPARENT_HUGEPAGE, then
PageTransHuge(.) will always return false, provoking BUGs when one runs
the libhugetlbfs test suite.

This patch replaces PageTransHuge(), with PageHead() which will work for
both HugeTLB and THP.

Fixes: dd78fedde4 ("rmap: support file thp")
Link: http://lkml.kernel.org/r/1470838217-5889-1-git-send-email-steve.capper@arm.com
Signed-off-by: Steve Capper <steve.capper@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Huang Shijie <shijie.huang@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10 16:40:56 -07:00
Kirill A. Shutemov c8efc390c1 mm, rmap: fix false positive VM_BUG() in page_add_file_rmap()
PageTransCompound() doesn't distinguish THP from from any other type of
compound pages.  This can lead to false-positive VM_BUG_ON() in
page_add_file_rmap() if called on compound page from a driver[1].

I think we can exclude such cases by checking if the page belong to a
mapping.

The VM_BUG_ON_PAGE() is downgraded to VM_WARN_ON_ONCE().  This path
should not cause any harm to non-THP page, but good to know if we step
on anything else.

[1] http://lkml.kernel.org/r/c711e067-0bff-a6cb-3c37-04dfe77d2db1@redhat.com

Link: http://lkml.kernel.org/r/20160810161345.GA67522@black.fi.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Laura Abbott <labbott@redhat.com>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10 16:40:56 -07:00
Joonsoo Kim 6423aa8192 mm/page_alloc.c: recalculate some of node threshold when on/offline memory
Some of node threshold depends on number of managed pages in the node.
When memory is going on/offline, it can be changed and we need to adjust
them.

Add recalculation to appropriate places and clean-up related functions
for better maintenance.

Link: http://lkml.kernel.org/r/1470724248-26780-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10 16:40:56 -07:00
Joonsoo Kim 81cbcbc2d8 mm/page_alloc.c: fix wrong initialization when sysctl_min_unmapped_ratio changes
Before resetting min_unmapped_pages, we need to initialize
min_unmapped_pages rather than min_slab_pages.

Fixes: a5f5f91da6 (mm: convert zone_reclaim to node_reclaim)
Link: http://lkml.kernel.org/r/1470724248-26780-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10 16:40:56 -07:00
Arnd Bergmann 3b33719c9b thp: move shmem_huge_enabled() outside of SYSFS ifdef
The newly introduced shmem_huge_enabled() function has two definitions,
but neither of them is visible if CONFIG_SYSFS is disabled, leading to a
build error:

  mm/khugepaged.o: In function `khugepaged':
  khugepaged.c:(.text.khugepaged+0x3ca): undefined reference to `shmem_huge_enabled'

This changes the #ifdef guards around the definition to match those that
are used in the header file.

Fixes: e496cf3d78 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
Link: http://lkml.kernel.org/r/20160809123638.1357593-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-10 16:40:56 -07:00
Vladimir Davydov c4159a75b6 mm: memcontrol: only mark charged pages with PageKmemcg
To distinguish non-slab pages charged to kmemcg we mark them PageKmemcg,
which sets page->_mapcount to -512.  Currently, we set/clear PageKmemcg
in __alloc_pages_nodemask()/free_pages_prepare() for any page allocated
with __GFP_ACCOUNT, including those that aren't actually charged to any
cgroup, i.e. allocated from the root cgroup context.  To avoid overhead
in case cgroups are not used, we only do that if memcg_kmem_enabled() is
true.  The latter is set iff there are kmem-enabled memory cgroups
(online or offline).  The root cgroup is not considered kmem-enabled.

As a result, if a page is allocated with __GFP_ACCOUNT for the root
cgroup when there are kmem-enabled memory cgroups and is freed after all
kmem-enabled memory cgroups were removed, e.g.

  # no memory cgroups has been created yet, create one
  mkdir /sys/fs/cgroup/memory/test
  # run something allocating pages with __GFP_ACCOUNT, e.g.
  # a program using pipe
  dmesg | tail
  # remove the memory cgroup
  rmdir /sys/fs/cgroup/memory/test

we'll get bad page state bug complaining about page->_mapcount != -1:

  BUG: Bad page state in process swapper/0  pfn:1fd945c
  page:ffffea007f651700 count:0 mapcount:-511 mapping:          (null) index:0x0
  flags: 0x1000000000000000()

To avoid that, let's mark with PageKmemcg only those pages that are
actually charged to and hence pin a non-root memory cgroup.

Fixes: 4949148ad4 ("mm: charge/uncharge kmemcg from generic page allocator paths")
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-09 10:14:10 -07:00
Linus Torvalds 1eccfa090e Implements HARDENED_USERCOPY verification of copy_to_user/copy_from_user
bounds checking for most architectures on SLAB and SLUB.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJXl9tlAAoJEIly9N/cbcAm5BoP/ikTtDp2bFw1sn92yHTnIWzl
 O+dcKVAeRgjfnSvPfb1JITpaM58exQSaDsPBeR0DbVzU1zDdhLcwHHiQupFh98Ka
 vBZthbrlL/u4NB26enEEW0iyA32BsxYBMnIu0z5ux9RbZflmQwGQ0c0rvy3dJ7/b
 FzB5ayVST5y/a0m6/sImeeExh78GU9rsMb1XmJRMwlJAy6miDz/F9TP0LnuW6PhG
 J5XC99ygNJS1pQBLACRsrZw6ImgBxXnWCok6tWPMxFfD+rJBU2//wqS+HozyMWHL
 iYP7+ytVo/ZVok4114X/V4Oof3a6wqgpBuYrivJ228QO+UsLYbYLo6sZ8kRK7VFm
 9GgHo/8rWB1T9lBbSaa7UL5r0dVNNLjFGS42vwV+YlgUMQ1A35VRojO0jUnJSIQU
 Ug1IxKmylLd0nEcwD8/l3DXeQABsfL8GsoKW0OtdTZtW4RND4gzq34LK6t7hvayF
 kUkLg1OLNdUJwOi16M/rhugwYFZIMfoxQtjkRXKWN4RZ2QgSHnx2lhqNmRGPAXBG
 uy21wlzUTfLTqTpoeOyHzJwyF2qf2y4nsziBMhvmlrUvIzW1LIrYUKCNT4HR8Sh5
 lC2WMGYuIqaiu+NOF3v6CgvKd9UW+mxMRyPEybH8mEgfm+FLZlWABiBjIUpSEZuB
 JFfuMv1zlljj/okIQRg8
 =USIR
 -----END PGP SIGNATURE-----

Merge tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull usercopy protection from Kees Cook:
 "Tbhis implements HARDENED_USERCOPY verification of copy_to_user and
  copy_from_user bounds checking for most architectures on SLAB and
  SLUB"

* tag 'usercopy-v4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  mm: SLUB hardened usercopy support
  mm: SLAB hardened usercopy support
  s390/uaccess: Enable hardened usercopy
  sparc/uaccess: Enable hardened usercopy
  powerpc/uaccess: Enable hardened usercopy
  ia64/uaccess: Enable hardened usercopy
  arm64/uaccess: Enable hardened usercopy
  ARM: uaccess: Enable hardened usercopy
  x86/uaccess: Enable hardened usercopy
  mm: Hardened usercopy
  mm: Implement stack frame object validation
  mm: Add is_migrate_cma_page
2016-08-08 14:48:14 -07:00
Jens Axboe ba13e83ec3 mm: make __swap_writepage() use bio_set_op_attrs()
Cleaner than manipulating bio->bi_rw flags directly.

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Jens Axboe c11f0c0b5b block/mm: make bdev_ops->rw_page() take a bool for read/write
Commit abf545484d changed it from an 'rw' flags type to the
newer ops based interface, but now we're effectively leaking
some bdev internals to the rest of the kernel. Since we only
care about whether it's a read or a write at that level, just
pass in a bool 'is_write' parameter instead.

Then we can also move op_is_write() and friends back under
CONFIG_BLOCK protection.

Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-07 14:41:02 -06:00
Linus Torvalds fff648da96 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Here's the second round of block updates for this merge window.

  It's a mix of fixes for changes that went in previously in this round,
  and fixes in general.  This pull request contains:

   - Fixes for loop from Christoph

   - A bdi vs gendisk lifetime fix from Dan, worth two cookies.

   - A blk-mq timeout fix, when on frozen queues.  From Gabriel.

   - Writeback fix from Jan, ensuring that __writeback_single_inode()
     does the right thing.

   - Fix for bio->bi_rw usage in f2fs from me.

   - Error path deadlock fix in blk-mq sysfs registration from me.

   - Floppy O_ACCMODE fix from Jiri.

   - Fix to the new bio op methods from Mike.

     One more followup will be coming here, ensuring that we don't
     propagate the block types outside of block.  That, and a rename of
     bio->bi_rw is coming right after -rc1 is cut.

   - Various little fixes"

* 'for-linus' of git://git.kernel.dk/linux-block:
  mm/block: convert rw_page users to bio op use
  loop: make do_req_filebacked more robust
  loop: don't try to use AIO for discards
  blk-mq: fix deadlock in blk_mq_register_disk() error path
  Include: blkdev: Removed duplicate 'struct request;' declaration.
  Fixup direct bi_rw modifiers
  block: fix bdi vs gendisk lifetime mismatch
  blk-mq: Allow timeouts to run while queue is freezing
  nbd: fix race in ioctl
  block: fix use-after-free in seq file
  f2fs: drop bio->bi_rw manual assignment
  block: add missing group association in bio-cloning functions
  blkcg: kill unused field nr_undestroyed_grps
  writeback: Write dirty times for WB_SYNC_ALL writeback
  floppy: fix open(O_ACCMODE) for ioctl-only open
2016-08-05 23:31:51 -04:00
Linus Torvalds 2cfd716d27 powerpc updates for 4.8 #2
Fixes:
  - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
  - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
  - Move register_process_table() out of ppc_md from Michael Ellerman
 
 Use jump_label for [cpu|mmu]_has_feature() from Aneesh Kumar K.V, Kevin Hao and Michael Ellerman:
  - Add mmu_early_init_devtree() from Michael Ellerman
  - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
  - Do hash device tree scanning earlier from Michael Ellerman
  - Do radix device tree scanning earlier from Michael Ellerman
  - Do feature patching before MMU init from Michael Ellerman
  - Check features don't change after patching from Michael Ellerman
  - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
  - Convert mmu_has_feature() to returning bool from Michael Ellerman
  - Convert cpu_has_feature() to returning bool from Michael Ellerman
  - Define radix_enabled() in one place & use static inline from Michael Ellerman
  - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
  - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
  - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
  - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
  - Remove mfvtb() from Kevin Hao
  - Move cpu_has_feature() to a separate file from Kevin Hao
  - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
  - Add option to use jump label for cpu_has_feature() from Kevin Hao
  - Add option to use jump label for mmu_has_feature() from Kevin Hao
  - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
  - Annotate jump label assembly from Michael Ellerman
 
 TLB flush enhancements from Aneesh Kumar K.V:
  - radix: Implement tlb mmu gather flush efficiently
  - Add helper for finding SLBE LLP encoding
  - Use hugetlb flush functions
  - Drop multiple definition of mm_is_core_local
  - radix: Add tlb flush of THP ptes
  - radix: Rename function and drop unused arg
  - radix/hugetlb: Add helper for finding page size
  - hugetlb: Add flush_hugetlb_tlb_range
  - remove flush_tlb_page_nohash
 
 Add new ptrace regsets from Anshuman Khandual and Simon Guo:
  - elf: Add powerpc specific core note sections
  - Add the function flush_tmregs_to_thread
  - Enable in transaction NT_PRFPREG ptrace requests
  - Enable in transaction NT_PPC_VMX ptrace requests
  - Enable in transaction NT_PPC_VSX ptrace requests
  - Adapt gpr32_get, gpr32_set functions for transaction
  - Enable support for NT_PPC_CGPR
  - Enable support for NT_PPC_CFPR
  - Enable support for NT_PPC_CVMX
  - Enable support for NT_PPC_CVSX
  - Enable support for TM SPR state
  - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
  - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
  - Enable support for EBB registers
  - Enable support for Performance Monitor registers
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXpGaLAAoJEFHr6jzI4aWA9aYP/1AqmRPJ9D0XVUJWT+FVABUK
 LESESoVFF4Hug1j1F8Synhg5o4SzD2t45iGKbclYaFthOIyovMg7Wr1KSu4hQ0go
 rPuQfpXDNQ8jKdDX8hbPXKUxrNRBNfqJGFo5E7mO6wN9AJ9d1LVwQ+jKAva29Tqs
 LaAlMbQNbeObPNzOl73B73iew3aozr+mXjBqv82lqvgYknBD2CLf24xGG3eNIbq5
 ZZk4LPC8pdkaxnajnzRFzqwiyPWzao0yfpVRKh52TKHBQF/prR/KACb6zUuja/61
 krOfegUKob14OYrehjs6X8XNRLnILRI0u1H5bmj7eVEiY/usyNzE93SMHZM3Wdau
 sQF/Au4OLNXj0ZQdNBtzRsZRyp1d560Gsj+lQGBoPd4hfIWkFYHvxzxsUSdqv4uA
 MWDMwN0Vvfk0cpprsabsWNevkaotYYBU00px5hF/e5ZUc9/x/xYUVMgPEDr0QZLr
 cHJo9/Pjk4u/0g4lj+2y1LLl/0tNEZZg69O6bvffPAPVSS4/P4y/bKKYd4I0zL99
 Ykp91mSmkl70F3edgOSFqyda2gN2l2Ekb/i081YGXheFy1rbD29Vxv82BOVog4KY
 ibvOqp38WDzCVk5OXuCRvBl0VudLKGJYdppU1nXg4KgrTZXHeCAC0E+NzUsgOF4k
 OMvQ+5drVxrno+Hw8FVJ
 =0Q8E
 -----END PGP SIGNATURE-----

Merge tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux

Pull more powerpc updates from Michael Ellerman:
 "These were delayed for various reasons, so I let them sit in next a
  bit longer, rather than including them in my first pull request.

  Fixes:
   - Fix early access to cpu_spec relocation from Benjamin Herrenschmidt
   - Fix incorrect event codes in power9-event-list from Madhavan Srinivasan
   - Move register_process_table() out of ppc_md from Michael Ellerman

  Use jump_label use for [cpu|mmu]_has_feature():
   - Add mmu_early_init_devtree() from Michael Ellerman
   - Move disable_radix handling into mmu_early_init_devtree() from Michael Ellerman
   - Do hash device tree scanning earlier from Michael Ellerman
   - Do radix device tree scanning earlier from Michael Ellerman
   - Do feature patching before MMU init from Michael Ellerman
   - Check features don't change after patching from Michael Ellerman
   - Make MMU_FTR_RADIX a MMU family feature from Aneesh Kumar K.V
   - Convert mmu_has_feature() to returning bool from Michael Ellerman
   - Convert cpu_has_feature() to returning bool from Michael Ellerman
   - Define radix_enabled() in one place & use static inline from Michael Ellerman
   - Add early_[cpu|mmu]_has_feature() from Michael Ellerman
   - Convert early cpu/mmu feature check to use the new helpers from Aneesh Kumar K.V
   - jump_label: Make it possible for arches to invoke jump_label_init() earlier from Kevin Hao
   - Call jump_label_init() in apply_feature_fixups() from Aneesh Kumar K.V
   - Remove mfvtb() from Kevin Hao
   - Move cpu_has_feature() to a separate file from Kevin Hao
   - Add kconfig option to use jump labels for cpu/mmu_has_feature() from Michael Ellerman
   - Add option to use jump label for cpu_has_feature() from Kevin Hao
   - Add option to use jump label for mmu_has_feature() from Kevin Hao
   - Catch usage of cpu/mmu_has_feature() before jump label init from Aneesh Kumar K.V
   - Annotate jump label assembly from Michael Ellerman

  TLB flush enhancements from Aneesh Kumar K.V:
   - radix: Implement tlb mmu gather flush efficiently
   - Add helper for finding SLBE LLP encoding
   - Use hugetlb flush functions
   - Drop multiple definition of mm_is_core_local
   - radix: Add tlb flush of THP ptes
   - radix: Rename function and drop unused arg
   - radix/hugetlb: Add helper for finding page size
   - hugetlb: Add flush_hugetlb_tlb_range
   - remove flush_tlb_page_nohash

  Add new ptrace regsets from Anshuman Khandual and Simon Guo:
   - elf: Add powerpc specific core note sections
   - Add the function flush_tmregs_to_thread
   - Enable in transaction NT_PRFPREG ptrace requests
   - Enable in transaction NT_PPC_VMX ptrace requests
   - Enable in transaction NT_PPC_VSX ptrace requests
   - Adapt gpr32_get, gpr32_set functions for transaction
   - Enable support for NT_PPC_CGPR
   - Enable support for NT_PPC_CFPR
   - Enable support for NT_PPC_CVMX
   - Enable support for NT_PPC_CVSX
   - Enable support for TM SPR state
   - Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
   - Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
   - Enable support for EBB registers
   - Enable support for Performance Monitor registers"

* tag 'powerpc-4.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (48 commits)
  powerpc/mm: Move register_process_table() out of ppc_md
  powerpc/perf: Fix incorrect event codes in power9-event-list
  powerpc/32: Fix early access to cpu_spec relocation
  powerpc/ptrace: Enable support for Performance Monitor registers
  powerpc/ptrace: Enable support for EBB registers
  powerpc/ptrace: Enable support for NT_PPPC_TAR, NT_PPC_PPR, NT_PPC_DSCR
  powerpc/ptrace: Enable NT_PPC_TM_CTAR, NT_PPC_TM_CPPR, NT_PPC_TM_CDSCR
  powerpc/ptrace: Enable support for TM SPR state
  powerpc/ptrace: Enable support for NT_PPC_CVSX
  powerpc/ptrace: Enable support for NT_PPC_CVMX
  powerpc/ptrace: Enable support for NT_PPC_CFPR
  powerpc/ptrace: Enable support for NT_PPC_CGPR
  powerpc/ptrace: Adapt gpr32_get, gpr32_set functions for transaction
  powerpc/ptrace: Enable in transaction NT_PPC_VSX ptrace requests
  powerpc/ptrace: Enable in transaction NT_PPC_VMX ptrace requests
  powerpc/ptrace: Enable in transaction NT_PRFPREG ptrace requests
  powerpc/process: Add the function flush_tmregs_to_thread
  elf: Add powerpc specific core note sections
  powerpc/mm: remove flush_tlb_page_nohash
  powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
  ...
2016-08-05 09:00:54 -04:00
zijun_hu e47608ab6d mm/memblock.c: fix NULL dereference error
It causes NULL dereference error and failure to get type_a->regions[0]
info if parameter type_b of __next_mem_range_rev() == NULL

Fix this by checking before dereferring and initializing idx_b to 0

The approach is tested by dumping all types of region via
__memblock_dump_all() and __next_mem_range_rev() fixed to UART
separately the result is okay after checking the logs.

Link: http://lkml.kernel.org/r/57A0320D.6070102@zoho.com
Signed-off-by: zijun_hu <zijun_hu@htc.com>
Tested-by: zijun_hu <zijun_hu@htc.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 20:02:09 -04:00
Geert Uytterhoeven 117d54df7a slub: drop bogus inline for fixup_red_left()
With m68k-linux-gnu-gcc-4.1:

    include/linux/slub_def.h:126: warning: `fixup_red_left' declared inline after being called
    include/linux/slub_def.h:126: warning: previous declaration of `fixup_red_left' was here

Commit c146a2b98e ("mm, kasan: account for object redzone in SLUB's
nearest_obj()") made fixup_red_left() global, but forgot to remove the
inline keyword.

Fixes: c146a2b98e ("mm, kasan: account for object redzone in SLUB's nearest_obj()")
Link: http://lkml.kernel.org/r/1470256262-1586-1-git-send-email-geert@linux-m68k.org
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Alexander Potapenko <glider@google.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 20:02:09 -04:00
Mel Gorman b4911ea2bc mm: initialise per_cpu_nodestats for all online pgdats at boot
Paul Mackerras and Reza Arbab reported that machines with memoryless
nodes fail when vmstats are refreshed.  Paul reported an oops as follows

  Unable to handle kernel paging request for data at address 0xff7a10000
  Faulting instruction address: 0xc000000000270cd0
  Oops: Kernel access of bad area, sig: 11 [#1]
  SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.7.0-kvm+ #118
  task: c000000ff0680010 task.stack: c000000ff0704000
  NIP: c000000000270cd0 LR: c000000000270ce8 CTR: 0000000000000000
  REGS: c000000ff0707900 TRAP: 0300   Not tainted  (4.7.0-kvm+)
  MSR: 9000000102009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE,TM[E]>  CR: 846b6824  XER: 20000000
  CFAR: c000000000008768 DAR: 0000000ff7a10000 DSISR: 42000000 SOFTE: 1
  NIP refresh_zone_stat_thresholds+0x80/0x240
  LR refresh_zone_stat_thresholds+0x98/0x240
  Call Trace:
    refresh_zone_stat_thresholds+0xb8/0x240 (unreliable)

Both supplied potential fixes but one potentially misses checks and
another had redundant initialisations.  This version initialises
per_cpu_nodestats on a per-pgdat basis instead of on a per-zone basis.

Link: http://lkml.kernel.org/r/20160804092404.GI2799@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Paul Mackerras <paulus@ozlabs.org>
Reported-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 20:02:09 -04:00
Alexander Kuleshov 412d0008d6 mm/memblock: fix a typo in a comment
s/accomodate/accommodate/

Link: http://lkml.kernel.org/r/20160804121824.18100-1-kuleshovmail@gmail.com
Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 20:02:09 -04:00
zhong jiang 1e185736d2 mm: disable CONFIG_MEMORY_HOTPLUG when KASAN is enabled
At present it is obvious that memory online and offline will fail when
KASAN is enabled.  So add the condition to limit the memory_hotplug when
KASAN is enabled.

Link: http://lkml.kernel.org/r/1470063651-29519-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-04 20:02:09 -04:00
Mike Christie abf545484d mm/block: convert rw_page users to bio op use
The rw_page users were not converted to use bio/req ops. As a result
bdev_write_page is not passing down REQ_OP_WRITE and the IOs will
be sent down as reads.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Fixes: 4e1b2d52a8 ("block, fs, drivers: remove REQ_OP compat defs and related code")

Modified by me to:

1) Drop op_flags passing into ->rw_page(), as we don't use it.
2) Make op_is_write() and friends safe to use for !CONFIG_BLOCK

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:25:33 -06:00
Dan Williams df08c32ce3 block: fix bdi vs gendisk lifetime mismatch
The name for a bdi of a gendisk is derived from the gendisk's devt.
However, since the gendisk is destroyed before the bdi it leaves a
window where a new gendisk could dynamically reuse the same devt while a
bdi with the same name is still live.  Arrange for the bdi to hold a
reference against its "owner" disk device while it is registered.
Otherwise we can hit sysfs duplicate name collisions like the following:

 WARNING: CPU: 10 PID: 2078 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x64/0x80
 sysfs: cannot create duplicate filename '/devices/virtual/bdi/259:1'

 Hardware name: HP ProLiant DL580 Gen8, BIOS P79 05/06/2015
  0000000000000286 0000000002c04ad5 ffff88006f24f970 ffffffff8134caec
  ffff88006f24f9c0 0000000000000000 ffff88006f24f9b0 ffffffff8108c351
  0000001f0000000c ffff88105d236000 ffff88105d1031e0 ffff8800357427f8
 Call Trace:
  [<ffffffff8134caec>] dump_stack+0x63/0x87
  [<ffffffff8108c351>] __warn+0xd1/0xf0
  [<ffffffff8108c3cf>] warn_slowpath_fmt+0x5f/0x80
  [<ffffffff812a0d34>] sysfs_warn_dup+0x64/0x80
  [<ffffffff812a0e1e>] sysfs_create_dir_ns+0x7e/0x90
  [<ffffffff8134faaa>] kobject_add_internal+0xaa/0x320
  [<ffffffff81358d4e>] ? vsnprintf+0x34e/0x4d0
  [<ffffffff8134ff55>] kobject_add+0x75/0xd0
  [<ffffffff816e66b2>] ? mutex_lock+0x12/0x2f
  [<ffffffff8148b0a5>] device_add+0x125/0x610
  [<ffffffff8148b788>] device_create_groups_vargs+0xd8/0x100
  [<ffffffff8148b7cc>] device_create_vargs+0x1c/0x20
  [<ffffffff811b775c>] bdi_register+0x8c/0x180
  [<ffffffff811b7877>] bdi_register_dev+0x27/0x30
  [<ffffffff813317f5>] add_disk+0x175/0x4a0

Cc: <stable@vger.kernel.org>
Reported-by: Yi Zhang <yizhan@redhat.com>
Tested-by: Yi Zhang <yizhan@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Fixed up missing 0 return in bdi_register_owner().

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-08-04 14:19:16 -06:00
Geert Uytterhoeven 4620a06e4b shmem: Fix link error if huge pages support is disabled
If CONFIG_TRANSPARENT_HUGE_PAGECACHE=n, HPAGE_PMD_NR evaluates to
BUILD_BUG_ON(), and may cause (e.g. with gcc 4.12):

    mm/built-in.o: In function `shmem_alloc_hugepage':
    shmem.c:(.text+0x17570): undefined reference to `__compiletime_assert_1365'

To fix this, move the assignment to hindex after the check for huge
pages support.

Fixes: 800d8c63b2 ("shmem: add huge pages support")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-03 18:20:12 -04:00
Linus Torvalds d52bd54db8 Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of ocfs2

 - various hotfixes, mainly MM

 - quite a bit of misc stuff - drivers, fork, exec, signals, etc.

 - printk updates

 - firmware

 - checkpatch

 - nilfs2

 - more kexec stuff than usual

 - rapidio updates

 - w1 things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (111 commits)
  ipc: delete "nr_ipc_ns"
  kcov: allow more fine-grained coverage instrumentation
  init/Kconfig: add clarification for out-of-tree modules
  config: add android config fragments
  init/Kconfig: ban CONFIG_LOCALVERSION_AUTO with allmodconfig
  relay: add global mode support for buffer-only channels
  init: allow blacklisting of module_init functions
  w1:omap_hdq: fix regression
  w1: add helper macro module_w1_family
  w1: remove need for ida and use PLATFORM_DEVID_AUTO
  rapidio/switches: add driver for IDT gen3 switches
  powerpc/fsl_rio: apply changes for RIO spec rev 3
  rapidio: modify for rev.3 specification changes
  rapidio: change inbound window size type to u64
  rapidio/idt_gen2: fix locking warning
  rapidio: fix error handling in mbox request/release functions
  rapidio/tsi721_dma: advance queue processing from transfer submit call
  rapidio/tsi721: add messaging mbox selector parameter
  rapidio/tsi721: add PCIe MRRS override parameter
  rapidio/tsi721_dma: add channel mask and queue size parameters
  ...
2016-08-02 21:08:07 -04:00
Kees Cook ba093a6d93 mm: refuse wrapped vm_brk requests
The vm_brk() alignment calculations should refuse to overflow.  The ELF
loader depending on this, but it has been fixed now.  No other unsafe
callers have been found.

Link: http://lkml.kernel.org/r/1468014494-25291-3-git-send-email-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Reported-by: Hector Marco-Gisbert <hecmargi@upv.es>
Cc: Ismael Ripoll Ripoll <iripoll@upv.es>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Chen Gang <gang.chen.5i5j@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 19:35:15 -04:00
Fabian Frederick bd721ea73e treewide: replace obsolete _refok by __ref
There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485fb ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok     __ref
#define __initdata_refok __refdata
#define __exit_refok     __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Vladimir Davydov b5afba2974 mm: vmscan: fix memcg-aware shrinkers not called on global reclaim
We must call shrink_slab() for each memory cgroup on both global and
memcg reclaim in shrink_node_memcg().  Commit d71df22b55099 accidentally
changed that so that now shrink_slab() is only called with memcg != NULL
on memcg reclaim.  As a result, memcg-aware shrinkers (including
dentry/inode) are never invoked on global reclaim.  Fix that.

Fixes: b2e18757f2 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Link: http://lkml.kernel.org/r/1470056590-7177-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Alexander Potapenko c3cee37228 kasan: avoid overflowing quarantine size on low memory systems
If the total amount of memory assigned to quarantine is less than the
amount of memory assigned to per-cpu quarantines, |new_quarantine_size|
may overflow.  Instead, set it to zero.

[akpm@linux-foundation.org: cleanup: use WARN_ONCE return value]
Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-glider@google.com
Fixes: 55834c5909 ("mm: kasan: initial memory quarantine implementation")
Signed-off-by: Alexander Potapenko <glider@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Andrey Ryabinin 7e08897893 kasan: improve double-free reports
Currently we just dump stack in case of double free bug.
Let's dump all info about the object that we have.

[aryabinin@virtuozzo.com: change double free message per Alexander]
  Link: http://lkml.kernel.org/r/1470153654-30160-1-git-send-email-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/1470062715-14077-6-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Andrey Ryabinin b3cbd9bf77 mm/kasan: get rid of ->state in struct kasan_alloc_meta
The state of object currently tracked in two places - shadow memory, and
the ->state field in struct kasan_alloc_meta.  We can get rid of the
latter.  The will save us a little bit of memory.  Also, this allow us
to move free stack into struct kasan_alloc_meta, without increasing
memory consumption.  So now we should always know when the last time the
object was freed.  This may be useful for long delayed use-after-free
bugs.

As a side effect this fixes following UBSAN warning:
	UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
	member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
	which requires 8 byte alignment

Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
Reported-by: kernel test robot <xiaolong.ye@intel.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Andrey Ryabinin 47b5c2a0f0 mm/kasan: get rid of ->alloc_size in struct kasan_alloc_meta
Size of slab object already stored in cache->object_size.

Note, that kmalloc() internally rounds up size of allocation, so
object_size may be not equal to alloc_size, but, usually we don't need
to know the exact size of allocated object.  In case if we need that
information, we still can figure it out from the report.  The dump of
shadow memory allows to identify the end of allocated memory, and
thereby the exact allocation size.

Link: http://lkml.kernel.org/r/1470062715-14077-4-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Andrey Ryabinin f7376aed6c mm/kasan, slub: don't disable interrupts when object leaves quarantine
SLUB doesn't require disabled interrupts to call ___cache_free().

Link: http://lkml.kernel.org/r/1470062715-14077-3-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Andrey Ryabinin 4b3ec5a3f4 mm/kasan: don't reduce quarantine in atomic contexts
Currently we call quarantine_reduce() for ___GFP_KSWAPD_RECLAIM (implied
by __GFP_RECLAIM) allocation.  So, basically we call it on almost every
allocation.  quarantine_reduce() sometimes is heavy operation, and
calling it with disabled interrupts may trigger hard LOCKUP:

 NMI watchdog: Watchdog detected hard LOCKUP on cpu 2irq event stamp: 1411258
 Call Trace:
  <NMI>   dump_stack+0x68/0x96
   watchdog_overflow_callback+0x15b/0x190
   __perf_event_overflow+0x1b1/0x540
   perf_event_overflow+0x14/0x20
   intel_pmu_handle_irq+0x36a/0xad0
   perf_event_nmi_handler+0x2c/0x50
   nmi_handle+0x128/0x480
   default_do_nmi+0xb2/0x210
   do_nmi+0x1aa/0x220
   end_repeat_nmi+0x1a/0x1e
  <<EOE>>   __kernel_text_address+0x86/0xb0
   print_context_stack+0x7b/0x100
   dump_trace+0x12b/0x350
   save_stack_trace+0x2b/0x50
   set_track+0x83/0x140
   free_debug_processing+0x1aa/0x420
   __slab_free+0x1d6/0x2e0
   ___cache_free+0xb6/0xd0
   qlist_free_all+0x83/0x100
   quarantine_reduce+0x177/0x1b0
   kasan_kmalloc+0xf3/0x100

Reduce the quarantine_reduce iff direct reclaim is allowed.

Fixes: 55834c59098d("mm: kasan: initial memory quarantine implementation")
Link: http://lkml.kernel.org/r/1470062715-14077-2-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Andrey Ryabinin 4a3d308d66 mm/kasan: fix corruptions and false positive reports
Once an object is put into quarantine, we no longer own it, i.e.  object
could leave the quarantine and be reallocated.  So having set_track()
call after the quarantine_put() may corrupt slab objects.

 BUG kmalloc-4096 (Not tainted): Poison overwritten
 -----------------------------------------------------------------------------
 Disabling lock debugging due to kernel taint
 INFO: 0xffff8804540de850-0xffff8804540de857. First byte 0xb5 instead of 0x6b
...
 INFO: Freed in qlist_free_all+0x42/0x100 age=75 cpu=3 pid=24492
  __slab_free+0x1d6/0x2e0
  ___cache_free+0xb6/0xd0
  qlist_free_all+0x83/0x100
  quarantine_reduce+0x177/0x1b0
  kasan_kmalloc+0xf3/0x100
  kasan_slab_alloc+0x12/0x20
  kmem_cache_alloc+0x109/0x3e0
  mmap_region+0x53e/0xe40
  do_mmap+0x70f/0xa50
  vm_mmap_pgoff+0x147/0x1b0
  SyS_mmap_pgoff+0x2c7/0x5b0
  SyS_mmap+0x1b/0x30
  do_syscall_64+0x1a0/0x4e0
  return_from_SYSCALL_64+0x0/0x7a
 INFO: Slab 0xffffea0011503600 objects=7 used=7 fp=0x          (null) flags=0x8000000000004080
 INFO: Object 0xffff8804540de848 @offset=26696 fp=0xffff8804540dc588
 Redzone ffff8804540de840: bb bb bb bb bb bb bb bb                          ........
 Object ffff8804540de848: 6b 6b 6b 6b 6b 6b 6b 6b b5 52 00 00 f2 01 60 cc  kkkkkkkk.R....`.

Similarly, poisoning after the quarantine_put() leads to false positive
use-after-free reports:

 BUG: KASAN: use-after-free in anon_vma_interval_tree_insert+0x304/0x430 at addr ffff880405c540a0
 Read of size 8 by task trinity-c0/3036
 CPU: 0 PID: 3036 Comm: trinity-c0 Not tainted 4.7.0-think+ #9
 Call Trace:
   dump_stack+0x68/0x96
   kasan_report_error+0x222/0x600
   __asan_report_load8_noabort+0x61/0x70
   anon_vma_interval_tree_insert+0x304/0x430
   anon_vma_chain_link+0x91/0xd0
   anon_vma_clone+0x136/0x3f0
   anon_vma_fork+0x81/0x4c0
   copy_process.part.47+0x2c43/0x5b20
   _do_fork+0x16d/0xbd0
   SyS_clone+0x19/0x20
   do_syscall_64+0x1a0/0x4e0
   entry_SYSCALL64_slow_path+0x25/0x25

Fix this by putting an object in the quarantine after all other
operations.

Fixes: 80a9201a59 ("mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB")
Link: http://lkml.kernel.org/r/1470062715-14077-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Reported-by: Sasha Levin <alexander.levin@verizon.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Michal Hocko d6507ff533 memcg: put soft limit reclaim out of way if the excess tree is empty
We've had a report about soft lockups caused by lock bouncing in the
soft reclaim path:

  BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
  RIP: 0010:[<ffffffff81469798>]  [<ffffffff81469798>] _raw_spin_lock+0x18/0x20
  Call Trace:
    mem_cgroup_soft_limit_reclaim+0x25a/0x280
    shrink_zones+0xed/0x200
    do_try_to_free_pages+0x74/0x320
    try_to_free_pages+0x112/0x180
    __alloc_pages_slowpath+0x3ff/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_wp_page+0x19f/0x840
    handle_pte_fault+0x1cd/0x230
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30

There are no memcgs created so there cannot be any in the soft limit
excess obviously:

  [...]
  memory  0       1       1

so all this just seems to be mem_cgroup_largest_soft_limit_node trying
to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
excess tree is empty.  This is just pointless wasting of cycles and
cache line bouncing during heavy parallel reclaim on large machines.
The particular machine wasn't very healthy and most probably suffering
from a memory leak which just caused the memory reclaim to trash
heavily.  But bouncing on the lock certainly didn't help...

Fix this by optimistic lockless check and bail out early if the tree is
empty.  This is theoretically racy but that shouldn't matter all that
much.  First of all soft limit is a best effort feature and it is slowly
getting deprecated and its usage should be really scarce.  Bouncing on a
lock without a good reason is surely much bigger problem, especially on
large CPU machines.

Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Michal Hocko 4e666314d2 mm, hugetlb: fix huge_pte_alloc BUG_ON
Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he
runs his database load with memory online and offline running in
parallel.  The reason is that huge_pmd_share might detect a shared pmd
which is currently migrated and so it has migration pte which is
!pte_huge.

There doesn't seem to be any easy way to prevent from the race and in
fact seeing the migration swap entry is not harmful.  Both callers of
huge_pte_alloc are prepared to handle them.  copy_hugetlb_page_range
will copy the swap entry and make it COW if needed.  hugetlb_fault will
back off and so the page fault is retries if the page is still under
migration and waits for its completion in hugetlb_fault.

That means that the BUG_ON is wrong and we should update it.  Let's
simply check that all present ptes are pte_huge instead.

Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: zhongjiang <zhongjiang@huawei.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Jia He 649920c6ab mm/hugetlb: avoid soft lockup in set_max_huge_pages()
In powerpc servers with large memory(32TB), we watched several soft
lockups for hugepage under stress tests.

The call traces are as follows:
1.
get_page_from_freelist+0x2d8/0xd50
__alloc_pages_nodemask+0x180/0xc20
alloc_fresh_huge_page+0xb0/0x190
set_max_huge_pages+0x164/0x3b0

2.
prep_new_huge_page+0x5c/0x100
alloc_fresh_huge_page+0xc8/0x190
set_max_huge_pages+0x164/0x3b0

This patch fixes such soft lockups.  It is safe to call cond_resched()
there because it is out of spin_lock/unlock section.

Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com
Signed-off-by: Jia He <hejianet@gmail.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Minchan Kim 1a8018fb4c mm: move swap-in anonymous page into active list
Every swap-in anonymous page starts from inactive lru list's head.  It
should be activated unconditionally when VM decide to reclaim because
page table entry for the page always usually has marked accessed bit.
Thus, their window size for getting a new referece is 2 * NR_inactive +
NR_active while others is NR_inactive + NR_active.

It's not fair that it has more chance to be referenced compared to other
newly allocated page which starts from active lru list's head.

Johannes:

: The page can still have a valid copy on the swap device, so prefering to
: reclaim that page over a fresh one could make sense.  But as you point
: out, having it start inactive instead of active actually ends up giving it
: *more* LRU time, and that seems to be without justification.

Rik:

: The reason newly read in swap cache pages start on the inactive list is
: that we do some amount of read-around, and do not know which pages will
: get used.
:
: However, immediately activating the ones that DO get used, like your patch
: does, is the right thing to do.

Link: http://lkml.kernel.org/r/1469762740-17860-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Vegard Nossum c5f88bd29a mm: fail prefaulting if page table allocation fails
I ran into this:

    BUG: sleeping function called from invalid context at mm/page_alloc.c:3784
    in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1
    2 locks held by trinity-c1/1434:
     #0:  (&mm->mmap_sem){......}, at: [<ffffffff810ce31e>] __do_page_fault+0x1ce/0x8f0
     #1:  (rcu_read_lock){......}, at: [<ffffffff81378f86>] filemap_map_pages+0xd6/0xdd0

    CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    Call Trace:
      dump_stack+0x65/0x84
      panic+0x185/0x2dd
      ___might_sleep+0x51c/0x600
      __might_sleep+0x90/0x1a0
      __alloc_pages_nodemask+0x5b1/0x2160
      alloc_pages_current+0xcc/0x370
      pte_alloc_one+0x12/0x90
      __pte_alloc+0x1d/0x200
      alloc_set_pte+0xe3e/0x14a0
      filemap_map_pages+0x42b/0xdd0
      handle_mm_fault+0x17d5/0x28b0
      __do_page_fault+0x310/0x8f0
      trace_do_page_fault+0x18d/0x310
      do_async_page_fault+0x27/0xa0
      async_page_fault+0x28/0x30

The important bits from the above is that filemap_map_pages() is calling
into the page allocator while holding rcu_read_lock (sleeping is not
allowed inside RCU read-side critical sections).

According to Kirill Shutemov, the prefaulting code in do_fault_around()
is supposed to take care of this, but missing error handling means that
the allocation failure can go unnoticed.

We don't need to return VM_FAULT_OOM (or any other error) here, since we
can just let the normal fault path try again.

Fixes: 7267ec008b ("mm: postpone page table allocation until we have page to map")
Link: http://lkml.kernel.org/r/1469708107-11868-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-08-02 17:31:41 -04:00
Linus Torvalds 221bb8a46e - ARM: GICv3 ITS emulation and various fixes. Removal of the old
VGIC implementation.
 
 - s390: support for trapping software breakpoints, nested virtualization
 (vSIE), the STHYI opcode, initial extensions for CPU model support.
 
 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups,
 preliminary to this and the upcoming support for hardware virtualization
 extensions.
 
 - x86: support for execute-only mappings in nested EPT; reduced vmexit
 latency for TSC deadline timer (by about 30%) on Intel hosts; support for
 more than 255 vCPUs.
 
 - PPC: bugfixes.
 
 The ugly bit is the conflicts.  A couple of them are simple conflicts due
 to 4.7 fixes, but most of them are with other trees. There was definitely
 too much reliance on Acked-by here.  Some conflicts are for KVM patches
 where _I_ gave my Acked-by, but the worst are for this pull request's
 patches that touch files outside arch/*/kvm.  KVM submaintainers should
 probably learn to synchronize better with arch maintainers, with the
 latter providing topic branches whenever possible instead of Acked-by.
 This is what we do with arch/x86.  And I should learn to refuse pull
 requests when linux-next sends scary signals, even if that means that
 submaintainers have to rebase their branches.
 
 Anyhow, here's the list:
 
 - arch/x86/kvm/vmx.c: handle_pcommit and EXIT_REASON_PCOMMIT was removed
 by the nvdimm tree.  This tree adds handle_preemption_timer and
 EXIT_REASON_PREEMPTION_TIMER at the same place.  In general all mentions
 of pcommit have to go.
 
 There is also a conflict between a stable fix and this patch, where the
 stable fix removed the vmx_create_pml_buffer function and its call.
 
 - virt/kvm/kvm_main.c: kvm_cpu_notifier was removed by the hotplug tree.
 This tree adds kvm_io_bus_get_dev at the same place.
 
 - virt/kvm/arm/vgic.c: a few final bugfixes went into 4.7 before the
 file was completely removed for 4.8.
 
 - include/linux/irqchip/arm-gic-v3.h: this one is entirely our fault;
 this is a change that should have gone in through the irqchip tree and
 pulled by kvm-arm.  I think I would have rejected this kvm-arm pull
 request.  The KVM version is the right one, except that it lacks
 GITS_BASER_PAGES_SHIFT.
 
 - arch/powerpc: what a mess.  For the idle_book3s.S conflict, the KVM
 tree is the right one; everything else is trivial.  In this case I am
 not quite sure what went wrong.  The commit that is causing the mess
 (fd7bacbca4, "KVM: PPC: Book3S HV: Fix TB corruption in guest exit
 path on HMI interrupt", 2016-05-15) touches both arch/powerpc/kernel/
 and arch/powerpc/kvm/.  It's large, but at 396 insertions/5 deletions
 I guessed that it wasn't really possible to split it and that the 5
 deletions wouldn't conflict.  That wasn't the case.
 
 - arch/s390: also messy.  First is hypfs_diag.c where the KVM tree
 moved some code and the s390 tree patched it.  You have to reapply the
 relevant part of commits 6c22c98637, plus all of e030c1125e, to
 arch/s390/kernel/diag.c.  Or pick the linux-next conflict
 resolution from http://marc.info/?l=kvm&m=146717549531603&w=2.
 Second, there is a conflict in gmap.c between a stable fix and 4.8.
 The KVM version here is the correct one.
 
 I have pushed my resolution at refs/heads/merge-20160802 (commit
 3d1f53419842) at git://git.kernel.org/pub/scm/virt/kvm/kvm.git.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXoGm7AAoJEL/70l94x66DugQIAIj703ePAFepB/fCrKHkZZia
 SGrsBdvAtNsOhr7FQ5qvvjLxiv/cv7CymeuJivX8H+4kuUHUllDzey+RPHYHD9X7
 U6n1PdCH9F15a3IXc8tDjlDdOMNIKJixYuq1UyNZMU6NFwl00+TZf9JF8A2US65b
 x/41W98ilL6nNBAsoDVmCLtPNWAqQ3lajaZELGfcqRQ9ZGKcAYOaLFXHv2YHf2XC
 qIDMf+slBGSQ66UoATnYV2gAopNlWbZ7n0vO6tE2KyvhHZ1m399aBX1+k8la/0JI
 69r+Tz7ZHUSFtmlmyByi5IAB87myy2WQHyAPwj+4vwJkDGPcl0TrupzbG7+T05Y=
 =42ti
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:

 - ARM: GICv3 ITS emulation and various fixes.  Removal of the
   old VGIC implementation.

 - s390: support for trapping software breakpoints, nested
   virtualization (vSIE), the STHYI opcode, initial extensions
   for CPU model support.

 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
   of cleanups, preliminary to this and the upcoming support for
   hardware virtualization extensions.

 - x86: support for execute-only mappings in nested EPT; reduced
   vmexit latency for TSC deadline timer (by about 30%) on Intel
   hosts; support for more than 255 vCPUs.

 - PPC: bugfixes.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
  KVM: PPC: Introduce KVM_CAP_PPC_HTM
  MIPS: Select HAVE_KVM for MIPS64_R{2,6}
  MIPS: KVM: Reset CP0_PageMask during host TLB flush
  MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
  MIPS: KVM: Sign extend MFC0/RDHWR results
  MIPS: KVM: Fix 64-bit big endian dynamic translation
  MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
  MIPS: KVM: Use 64-bit CP0_EBase when appropriate
  MIPS: KVM: Set CP0_Status.KX on MIPS64
  MIPS: KVM: Make entry code MIPS64 friendly
  MIPS: KVM: Use kmap instead of CKSEG0ADDR()
  MIPS: KVM: Use virt_to_phys() to get commpage PFN
  MIPS: Fix definition of KSEGX() for 64-bit
  KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
  kvm: x86: nVMX: maintain internal copy of current VMCS
  KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
  KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
  KVM: arm64: vgic-its: Simplify MAPI error handling
  KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
  KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
  ...
2016-08-02 16:11:27 -04:00
Aneesh Kumar K.V 5491ae7b6f powerpc/mm/hugetlb: Add flush_hugetlb_tlb_range
Some archs like ppc64 need to do special things when flushing tlb for
hugepage. Add a new helper to flush hugetlb tlb range. This helps us to
avoid flushing the entire tlb mapping for the pid.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2016-08-01 11:15:13 +10:00
Linus Torvalds 27ae0c41ed Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse updates from Miklos Szeredi:
 "This fixes error propagation from writeback to fsync/close for
  writeback cache mode as well as adding a missing capability flag to
  the INIT message.  The rest are cleanups.

  (The commits are recent but all the code actually sat in -next for a
  while now.  The recommits are due to conflict avoidance and the
  addition of Cc: stable@...)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: use filemap_check_errors()
  mm: export filemap_check_errors() to modules
  fuse: fix wrong assignment of ->flags in fuse_send_init()
  fuse: fuse_flush must check mapping->flags for errors
  fuse: fsync() did not return IO errors
  fuse: don't mess with blocking signals
  new helper: wait_event_killable_exclusive()
  fuse: improve aio directIO write performance for size extending writes
2016-07-29 12:29:15 -07:00
Miklos Szeredi d72d9e2a5d mm: export filemap_check_errors() to modules
Can be used by fuse, btrfs and f2fs to replace opencoded variants.

Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2016-07-29 14:10:57 +02:00
Linus Torvalds 1c88e19b0f Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
 "The rest of MM"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (101 commits)
  mm, compaction: simplify contended compaction handling
  mm, compaction: introduce direct compaction priority
  mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  mm, page_alloc: make THP-specific decisions more generic
  mm, page_alloc: restructure direct compaction handling in slowpath
  mm, page_alloc: don't retry initial attempt in slowpath
  mm, page_alloc: set alloc_flags only once in slowpath
  lib/stackdepot.c: use __GFP_NOWARN for stack allocations
  mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
  mm, kasan: account for object redzone in SLUB's nearest_obj()
  mm: fix use-after-free if memory allocation failed in vma_adjust()
  zsmalloc: Delete an unnecessary check before the function call "iput"
  mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
  mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
  mm: optimize copy_page_to/from_iter_iovec
  mm: add cond_resched() to generic_swapfile_activate()
  Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
  mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
  mm: hwpoison: remove incorrect comments
  make __section_nr() more efficient
  ...
2016-07-28 16:36:48 -07:00
Vlastimil Babka c3486f5376 mm, compaction: simplify contended compaction handling
Async compaction detects contention either due to failing trylock on
zone->lock or lru_lock, or by need_resched().  Since 1f9efdef4f ("mm,
compaction: khugepaged should not give up due to need_resched()") the
code got quite complicated to distinguish these two up to the
__alloc_pages_slowpath() level, so different decisions could be taken
for khugepaged allocations.

After the recent changes, khugepaged allocations don't check for
contended compaction anymore, so we again don't need to distinguish lock
and sched contention, and simplify the current convoluted code a lot.

However, I believe it's also possible to simplify even more and
completely remove the check for contended compaction after the initial
async compaction for costly orders, which was originally aimed at THP
page fault allocations.  There are several reasons why this can be done
now:

- with the new defaults, THP page faults no longer do reclaim/compaction at
  all, unless the system admin has overridden the default, or application has
  indicated via madvise that it can benefit from THP's. In both cases, it
  means that the potential extra latency is expected and worth the benefits.
- even if reclaim/compaction proceeds after this patch where it previously
  wouldn't, the second compaction attempt is still async and will detect the
  contention and back off, if the contention persists
- there are still heuristics like deferred compaction and pageblock skip bits
  in place that prevent excessive THP page fault latencies

Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vlastimil Babka a5508cd83f mm, compaction: introduce direct compaction priority
In the context of direct compaction, for some types of allocations we
would like the compaction to either succeed or definitely fail while
trying as hard as possible.  Current async/sync_light migration mode is
insufficient, as there are heuristics such as caching scanner positions,
marking pageblocks as unsuitable or deferring compaction for a zone.  At
least the final compaction attempt should be able to override these
heuristics.

To communicate how hard compaction should try, we replace migration mode
with a new enum compact_priority and change the relevant function
signatures.  In compact_zone_order() where struct compact_control is
constructed, the priority is mapped to suitable control flags.  This
patch itself has no functional change, as the current priority levels
are mapped back to the same migration modes as before.  Expanding them
will be done next.

Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
removed, as the only caller exists under CONFIG_COMPACTION.

Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vlastimil Babka 2516035499 mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
After the previous patch, we can distinguish costly allocations that
should be really lightweight, such as THP page faults, with
__GFP_NORETRY.  This means we don't need to recognize khugepaged
allocations via PF_KTHREAD anymore.  We can also change THP page faults
in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
khugepaged, as the process has indicated that it benefits from THP's and
is willing to pay some initial latency costs.

We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default).  Adding
__GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.

The patch effectively changes the current GFP_TRANSHUGE users as
follows:

* get_huge_zero_page() - the zero page lifetime should be relatively
  long and it's shared by multiple users, so it's worth spending some
  effort on it.  We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
  This also restores direct reclaim to this allocation, which was
  unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
  by default to madvise and add a stall-free defrag option")

* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
  is not an issue.  So if khugepaged "defrag" is enabled (the default), do
  reclaim via GFP_TRANSHUGE without __GFP_NORETRY.  We can remove the
  PF_KTHREAD check from page alloc.

  As a side-effect, khugepaged will now no longer check if the initial
  compaction was deferred or contended.  This is OK, as khugepaged sleep
  times between collapsion attempts are long enough to prevent noticeable
  disruption, so we should allow it to spend some effort.

* migrate_misplaced_transhuge_page() - already was masking out
  __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
  equivalent.

* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
  are now allocating without __GFP_NORETRY.  Other vma's keep using
  __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
  it's allowed only for madvised vma's).  The rest is conversion to
  GFP_TRANSHUGE(_LIGHT).

[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vlastimil Babka 3eb2771b06 mm, page_alloc: make THP-specific decisions more generic
Since THP allocations during page faults can be costly, extra decisions
are employed for them to avoid excessive reclaim and compaction, if the
initial compaction doesn't look promising.  The detection has never been
perfect as there is no gfp flag specific to THP allocations.  At this
moment it checks the whole combination of flags that makes up
GFP_TRANSHUGE, and hopes that no other users of such combination exist,
or would mind being treated the same way.  Extra care is also taken to
separate allocations from khugepaged, where latency doesn't matter that
much.

It is however possible to distinguish these allocations in a simpler and
more reliable way.  The key observation is that after the initial
compaction followed by the first iteration of "standard"
reclaim/compaction, both __GFP_NORETRY allocations and costly
allocations without __GFP_REPEAT are declared as failures:

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_REPEAT
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                goto nopage;

This means we can further distinguish allocations that are costly order
*and* additionally include the __GFP_NORETRY flag.  As it happens,
GFP_TRANSHUGE allocations do already fall into this category.  This will
also allow other costly allocations with similar high-order benefit vs
latency considerations to use this semantic.  Furthermore, we can
distinguish THP allocations that should try a bit harder (such as from
khugepageed) by removing __GFP_NORETRY, as will be done in the next
patch.

Link: http://lkml.kernel.org/r/20160721073614.24395-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vlastimil Babka a8161d1ed6 mm, page_alloc: restructure direct compaction handling in slowpath
The retry loop in __alloc_pages_slowpath is supposed to keep trying
reclaim and compaction (and OOM), until either the allocation succeeds,
or returns with failure.  Success here is more probable when reclaim
precedes compaction, as certain watermarks have to be met for compaction
to even try, and more free pages increase the probability of compaction
success.  On the other hand, starting with light async compaction (if
the watermarks allow it), can be more efficient, especially for smaller
orders, if there's enough free memory which is just fragmented.

Thus, the current code starts with compaction before reclaim, and to
make sure that the last reclaim is always followed by a final
compaction, there's another direct compaction call at the end of the
loop.  This makes the code hard to follow and adds some duplicated
handling of migration_mode decisions.  It's also somewhat inefficient
that even if reclaim or compaction decides not to retry, the final
compaction is still attempted.  Some gfp flags combination also shortcut
these retry decisions by "goto noretry;", making it even harder to
follow.

This patch attempts to restructure the code with only minimal functional
changes.  The call to the first compaction and THP-specific checks are
now placed above the retry loop, and the "noretry" direct compaction is
removed.

The initial compaction is additionally restricted only to costly orders,
as we can expect smaller orders to be held back by watermarks, and only
larger orders to suffer primarily from fragmentation.  This better
matches the checks in reclaim's shrink_zones().

There are two other smaller functional changes.  One is that the upgrade
from async migration to light sync migration will always occur after the
initial compaction.  This is how it has been until recent patch "mm,
oom: protect !costly allocations some more", which introduced upgrading
the mode based on COMPACT_COMPLETE result, but kept the final compaction
always upgraded, which made it even more special.  It's better to return
to the simpler handling for now, as migration modes will be further
modified later in the series.

The second change is that once both reclaim and compaction declare it's
not worth to retry the reclaim/compact loop, there is no final
compaction attempt.  As argued above, this is intentional.  If that
final compaction were to succeed, it would be due to a wrong retry
decision, or simply a race with somebody else freeing memory for us.

The main outcome of this patch should be simpler code.  Logically, the
initial compaction without reclaim is the exceptional case to the
reclaim/compaction scheme, but prior to the patch, it was the last loop
iteration that was exceptional.  Now the code matches the logic better.
The change also enable the following patches.

Link: http://lkml.kernel.org/r/20160721073614.24395-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vlastimil Babka 23771235bb mm, page_alloc: don't retry initial attempt in slowpath
After __alloc_pages_slowpath() sets up new alloc_flags and wakes up
kswapd, it first tries get_page_from_freelist() with the new
alloc_flags, as it may succeed e.g. due to using min watermark instead
of low watermark.  It makes sense to to do this attempt before adjusting
zonelist based on alloc_flags/gfp_mask, as it's still relatively a fast
path if we just wake up kswapd and successfully allocate.

This patch therefore moves the initial attempt above the retry label and
reorganizes a bit the part below the retry label.  We still have to
attempt get_page_from_freelist() on each retry, as some allocations
cannot do that as part of direct reclaim or compaction, and yet are not
allowed to fail (even though they do a WARN_ON_ONCE() and thus should
not exist).  We can reuse the call meant for ALLOC_NO_WATERMARKS attempt
and just set alloc_flags to ALLOC_NO_WATERMARKS if the context allows
it.  As a side-effect, the attempts from direct reclaim/compaction will
also no longer obey watermarks once this is set, but there's little harm
in that.

Kswapd wakeups are also done on each retry to be safe from potential
races resulting in kswapd going to sleep while a process (that may not
be able to reclaim by itself) is still looping.

Link: http://lkml.kernel.org/r/20160721073614.24395-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vlastimil Babka 31a6c1909f mm, page_alloc: set alloc_flags only once in slowpath
In __alloc_pages_slowpath(), alloc_flags doesn't change after it's
initialized, so move the initialization above the retry: label.  Also
make the comment above the initialization more descriptive.

The only exception in the alloc_flags being constant is
ALLOC_NO_WATERMARKS, which may change due to TIF_MEMDIE being set on the
allocating thread.  We can fix this, and make the code simpler and a bit
more effective at the same time, by moving the part that determines
ALLOC_NO_WATERMARKS from gfp_to_alloc_flags() to gfp_pfmemalloc_allowed().

This means we don't have to mask out ALLOC_NO_WATERMARKS in numerous
places in __alloc_pages_slowpath() anymore.  The only two tests for the
flag can instead call gfp_pfmemalloc_allowed().

Link: http://lkml.kernel.org/r/20160721073614.24395-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Alexander Potapenko 80a9201a59 mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB
For KASAN builds:
 - switch SLUB allocator to using stackdepot instead of storing the
   allocation/deallocation stacks in the objects;
 - change the freelist hook so that parts of the freelist can be put
   into the quarantine.

[aryabinin@virtuozzo.com: fixes]
  Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Alexander Potapenko c146a2b98e mm, kasan: account for object redzone in SLUB's nearest_obj()
When looking up the nearest SLUB object for a given address, correctly
calculate its offset if SLAB_RED_ZONE is enabled for that cache.

Previously, when KASAN had detected an error on an object from a cache
with SLAB_RED_ZONE set, the actual start address of the object was
miscalculated, which led to random stacks having been reported.

When looking up the nearest SLUB object for a given address, correctly
calculate its offset if SLAB_RED_ZONE is enabled for that cache.

Fixes: 7ed2f9e663 ("mm, kasan: SLAB support")
Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Kirill A. Shutemov 734537c9cb mm: fix use-after-free if memory allocation failed in vma_adjust()
There's one case when vma_adjust() expands the vma, overlapping with
*two* next vma.  See case 6 of mprotect, described in the comment to
vma_merge().

To handle this (and only this) situation we iterate twice over main part
of the function.  See "goto again".

Vegard reported[1] that he sees out-of-bounds access complain from
KASAN, if anon_vma_clone() on the *second* iteration fails.

This happens because we free 'next' vma by the end of first iteration
and don't have a way to undo this if anon_vma_clone() fails on the
second iteration.

The solution is to do all required allocations upfront, before we touch
vmas.

The allocation on the second iteration is only required if first two
vmas don't have anon_vma, but third does.  So we need, in total, one
anon_vma_clone() call.

It's easy to adjust 'exporter' to the third vma for such case.

[1] http://lkml.kernel.org/r/1469514843-23778-1-git-send-email-vegard.nossum@oracle.com

Link: http://lkml.kernel.org/r/1469625255-126641-1-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Markus Elfring c3491eca37 zsmalloc: Delete an unnecessary check before the function call "iput"
iput() tests whether its argument is NULL and then returns immediately.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Link: http://lkml.kernel.org/r/559cf499-4a01-25f9-c87f-24d906626a57@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
zijun_hu fb399b4854 mm/memblock.c: fix index adjustment error in __next_mem_range_rev()
Fix region index adjustment error when parameter type_b of
__next_mem_range_rev() == NULL.

Signed-off-by: zijun_hu <zijun_hu@htc.com>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Richard Leitner <dev@g0hl1n.net>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Xishi Qiu 394e31d2ce mem-hotplug: alloc new page from a nearest neighbor node when mem-offline
If we offline a node, alloc the new page from a nearest neighbor node
instead of the current node or other remote nodes, because re-migrate is
a waste of time and the distance of the remote nodes is often very
large.

Also use GFP_HIGHUSER_MOVABLE to alloc new page if the zone is movable
zone or highmem zone.

Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mikulas Patocka 7e4411bfe6 mm: add cond_resched() to generic_swapfile_activate()
generic_swapfile_activate() can take quite long time, it iterates over
all blocks of a file, so add cond_resched to it.  I observed about 1
second stalls when activating a swapfile that was almost unfragmented -
this patch fixes it.

Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1607221710580.4818@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko 4e390b2b2f Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements"
This reverts commit f9054c70d2 ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements").

There has been a report about OOM killer invoked when swapping out to a
dm-crypt device.  The primary reason seems to be that the swapout out IO
managed to completely deplete memory reserves.  Ondrej was able to
bisect and explained the issue by pointing to f9054c70d2 ("mm,
mempool: only set __GFP_NOMEMALLOC if there are free elements").

The reason is that the swapout path is not throttled properly because
the md-raid layer needs to allocate from the generic_make_request path
which means it allocates from the PF_MEMALLOC context.  dm layer uses
mempool_alloc in order to guarantee a forward progress which used to
inhibit access to memory reserves when using page allocator.  This has
changed by f9054c70d2 ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements") which has dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.

If we are running out of memory and the only way forward to free memory
is to perform swapout we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete up to a moment when the memory is depleted completely and there
is no way forward but invoking the OOM killer.  This is less than
optimal.

The original intention of f9054c70d2 was to help with the OOM
situations where the oom victim depends on mempool allocation to make a
forward progress.  David has mentioned the following backtrace:

  schedule
  schedule_timeout
  io_schedule_timeout
  mempool_alloc
  __split_and_process_bio
  dm_request
  generic_make_request
  submit_bio
  mpage_readpages
  ext4_readpages
  __do_page_cache_readahead
  ra_submit
  filemap_fault
  handle_mm_fault
  __do_page_fault
  do_page_fault
  page_fault

We do not know more about why the mempool is depleted without being
replenished in time, though.  In any case the dm layer shouldn't depend
on any allocations outside of the dedicated pools so a forward progress
should be guaranteed.  If this is not the case then the dm should be
fixed rather than papering over the problem and postponing it to later
by accessing more memory reserves.

mempools are a mechanism to maintain dedicated memory reserves to
guaratee forward progress.  Allowing them an unbounded access to the
page allocator memory reserves is going against the whole purpose of
this mechanism.

Bisected by Ondrej Kozina.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160721145309.GR26379@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Ondrej Kozina <okozina@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: NeilBrown <neilb@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mikulas Patocka <mpatocka@redhat.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Hugh Dickins 1d2047fefa mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
isolate a PageWriteback page, which __unmap_and_move() then rejects with
-EBUSY: of course the writeback might complete in between, but that's
not what we usually expect, so probably better not to isolate it.

When tested by stress-highalloc from mmtests, this has reduced the
number of page migrate failures by 60-70%.

Link: http://lkml.kernel.org/r/20160721073614.24395-2-vbabka@suse.cz
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Naoya Horiguchi 7c7fd82556 mm: hwpoison: remove incorrect comments
dequeue_hwpoisoned_huge_page() can be called without page lock hold, so
let's remove incorrect comment.

The reason why the page lock is not really needed is that
dequeue_hwpoisoned_huge_page() checks page_huge_active() inside
hugetlb_lock, which allows us to avoid trying to dequeue a hugepage that
are just allocated but not linked to active list yet, even without
taking page lock.

Link: http://lkml.kernel.org/r/20160720092901.GA15995@www9186uo.sakura.ne.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reported-by: Zhan Chen <zhanc1@andrew.cmu.edu>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Zhou Chengming 91fd8b95d6 make __section_nr() more efficient
When CONFIG_SPARSEMEM_EXTREME is disabled, __section_nr can get the
section number with a subtraction directly.

Link: http://lkml.kernel.org/r/1468988310-11560-1-git-send-email-zhouchengming1@huawei.com
Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Vegard Nossum 98c42d9452 kmemleak: don't hang if user disables scanning early
If the user tries to disable automatic scanning early in the boot
process using e.g.:

  echo scan=off > /sys/kernel/debug/kmemleak

then this command will hang until SECS_FIRST_SCAN (= 60) seconds have
elapsed, even though the system is fully initialised.

We can fix this using interruptible sleep and checking if we're supposed
to stop whenever we wake up (like the rest of the code does).

Link: http://lkml.kernel.org/r/1468835005-2873-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Dennis Chen a571d4eb55 mm/memblock.c: add new infrastructure to address the mem limit issue
In some cases, memblock is queried by kernel to determine whether a
specified address is RAM or not.  For example, the ACPI core needs this
information to determine which attributes to use when mapping ACPI
regions(acpi_os_ioremap).  Use of incorrect memory types can result in
faults, data corruption, or other issues.

Removing memory with memblock_enforce_memory_limit() throws away this
information, and so a kernel booted with 'mem=' may suffer from the
issues described above.  To avoid this, we need to keep those NOMAP
regions instead of removing all above the limit, which preserves the
information we need while preventing other use of those regions.

This patch adds new infrastructure to retain all NOMAP memblock regions
while removing others, to cater for this.

Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com
Signed-off-by: Dennis Chen <dennis.chen@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Kaly Xin <kaly.xin@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Andy Lutomirski efdc949079 mm: fix memcg stack accounting for sub-page stacks
We should account for stacks regardless of stack size, and we need to
account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the units
to kilobytes and Move it into account_kernel_stack().

Fixes: 12580e4b54 ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Andy Lutomirski d30dd8be06 mm: track NR_KERNEL_STACK in KiB instead of number of stacks
Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
This only makes sense if each kernel stack exists entirely in one zone,
and allowing vmapped stacks could break this assumption.

Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
architectures.  Keep it simple and use KiB.

Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Dan Williams c02b6aec6d mm: CONFIG_ZONE_DEVICE stop depending on CONFIG_EXPERT
When it was first introduced CONFIG_ZONE_DEVICE depended on disabling
CONFIG_ZONE_DMA, a configuration choice reserved for "experts".
However, now that the ZONE_DMA conflict has been eliminated it no longer
makes sense to require CONFIG_EXPERT.

Link: http://lkml.kernel.org/r/146687646274.39261.14267596518720371009.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Eric Sandeen <sandeen@redhat.com>
Reported-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Christoph Hellwig c4c5ad6b35 memblock: include <asm/sections.h> instead of <asm-generic/sections.h>
asm-generic headers are generic implementations for architecture
specific code and should not be included by common code.  Thus use the
asm/ version of sections.h to get at the linker sections.

Link: http://lkml.kernel.org/r/1468285103-7470-1-git-send-email-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Huang Ying 319904ad40 mm, THP: clean up return value of madvise_free_huge_pmd
The definition of return value of madvise_free_huge_pmd is not clear
before.  According to the suggestion of Minchan Kim, change the type of
return value to bool and return true if we do MADV_FREE successfully on
entire pmd page, otherwise, return false.  Comments are added too.

Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Ganesh Mahendran 18fd06bf7a mm/zsmalloc: use helper to clear page->flags bit
Use ClearPagePrivate/ClearPagePrivate2 helpers to clear
PG_private/PG_private_2 in page->flags

Link: http://lkml.kernel.org/r/1467882338-4300-7-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Ganesh Mahendran 35b3445e97 mm/zsmalloc: add __init,__exit attribute
Add __init,__exit attribute for function that only called in module
init/exit to save memory.

Link: http://lkml.kernel.org/r/1467882338-4300-6-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Ganesh Mahendran fd8544639e mm/zsmalloc: keep comments consistent with code
Some minor commebnt changes:

 1). update zs_malloc(),zs_create_pool() function header
 2). update "Usage of struct page fields"

Link: http://lkml.kernel.org/r/1467882338-4300-5-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Ganesh Mahendran 64d90465f0 mm/zsmalloc: avoid calculate max objects of zspage twice
Currently, if a class can not be merged, the max objects of zspage in
that class may be calculated twice.

This patch calculate max objects of zspage at the begin, and pass the
value to can_merge() to decide whether the class can be merged.

Also this patch remove function get_maxobj_per_zspage(), as there is no
other place to call this function.

Link: http://lkml.kernel.org/r/1467882338-4300-4-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Ganesh Mahendran b4fd07a086 mm/zsmalloc: use class->objs_per_zspage to get num of max objects
num of max objects in zspage is stored in each size_class now.  So there
is no need to re-calculate it.

Link: http://lkml.kernel.org/r/1467882338-4300-3-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Ganesh Mahendran cf675acb74 mm/zsmalloc: take obj index back from find_alloced_obj
the obj index value should be updated after return from
find_alloced_obj() to avoid CPU burning caused by unnecessary object
scanning.

Link: http://lkml.kernel.org/r/1467882338-4300-2-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Ganesh Mahendran 41b88e14c1 mm/zsmalloc: use obj_index to keep consistent with others
This is a cleanup patch.  Change "index" to "obj_index" to keep
consistent with others in zsmalloc.

Link: http://lkml.kernel.org/r/1467882338-4300-1-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Minchan Kim 91dcade47a mm: bail out in shrink_inactive_list()
With node-lru, if there are enough reclaimable pages in highmem but
nothing in lowmem, VM can try to shrink inactive list although the
requested zone is lowmem.

The problem is that if the inactive list is full of highmem pages then a
direct reclaimer searching for a lowmem page waste CPU scanning
uselessly.  It just burns out CPU.  Even, many direct reclaimers are
stalled by too_many_isolated if lots of parallel reclaimer are going on
although there are no reclaimable memory in inactive list.

I tried the experiment 4 times in 32bit 2G 8 CPU KVM machine to get
elapsed time.

	hackbench 500 process 2

 = Old =

  1st: 289s 2nd: 310s 3rd: 112s 4th: 272s

 = Now =

  1st: 31s  2nd: 132s 3rd: 162s 4th: 50s.

[akpm@linux-foundation.org: fixes per Mel]
Link: http://lkml.kernel.org/r/1469433119-1543-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman d7f05528ee mm, vmscan: account for skipped pages as a partial scan
Page reclaim determines whether a pgdat is unreclaimable by examining
how many pages have been scanned since a page was freed and comparing
that to the LRU sizes.  Skipped pages are not reclaim candidates but
contribute to scanned.  This can prematurely mark a pgdat as
unreclaimable and trigger an OOM kill.

This patch accounts for skipped pages as a partial scan so that an
unreclaimable pgdat will still be marked as such but by scaling the cost
of a skip, it'll avoid the pgdat being marked prematurely.

Link: http://lkml.kernel.org/r/1469110261-7365-6-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman f8d1a31163 mm: consider whether to decivate based on eligible zones inactive ratio
Minchan Kim reported that with per-zone lru state it was possible to
identify that a normal zone with 8^M anonymous pages could trigger OOM
with non-atomic order-0 allocations as all pages in the zone were in the
active list.

   gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
   Call Trace:
     __alloc_pages_nodemask+0xe52/0xe60
     ? new_slab+0x39c/0x3b0
     new_slab+0x39c/0x3b0
     ___slab_alloc.constprop.87+0x6da/0x840
     ? __alloc_skb+0x3c/0x260
     ? enqueue_task_fair+0x73/0xbf0
     ? poll_select_copy_remaining+0x140/0x140
     __slab_alloc.isra.81.constprop.86+0x40/0x6d
     ? __alloc_skb+0x3c/0x260
     kmem_cache_alloc+0x22c/0x260
     ? __alloc_skb+0x3c/0x260
     __alloc_skb+0x3c/0x260
     alloc_skb_with_frags+0x4e/0x1a0
     sock_alloc_send_pskb+0x16a/0x1b0
     ? wait_for_unix_gc+0x31/0x90
     unix_stream_sendmsg+0x28d/0x340
     sock_sendmsg+0x2d/0x40
     sock_write_iter+0x6c/0xc0
     __vfs_write+0xc0/0x120
     vfs_write+0x9b/0x1a0
     ? __might_fault+0x49/0xa0
     SyS_write+0x44/0x90
     do_fast_syscall_32+0xa6/0x1e0

   Mem-Info:
   active_anon:101103 inactive_anon:102219 isolated_anon:0
    active_file:503 inactive_file:544 isolated_file:0
    unevictable:0 dirty:0 writeback:34 unstable:0
    slab_reclaimable:6298 slab_unreclaimable:74669
    mapped:863 shmem:0 pagetables:100998 bounce:0
    free:23573 free_pcp:1861 free_cma:0
   Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
   DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
   lowmem_reserve[]: 0 809 1965 1965
   Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
   lowmem_reserve[]: 0 0 9247 9247
   HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
   lowmem_reserve[]: 0 0 0 0
   DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
   Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
   HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
   Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
   54409 total pagecache pages
   53215 pages in swap cache
   Swap cache stats: add 300982, delete 247765, find 157978/226539
   Free swap  = 3803244kB
   Total swap = 4192252kB
   524186 pages RAM
   295934 pages HighMem/MovableOnly
   9642 pages reserved
   0 pages cma reserved

The problem is due to the active deactivation logic in
inactive_list_is_low:

	Node 0 active_anon:404412kB inactive_anon:409040kB

IOW, (inactive_anon of node * inactive_ratio > active_anon of node) due
to highmem anonymous stat so VM never deactivates normal zone's
anonymous pages.

This patch is a modified version of Minchan's original solution but
based upon it.  The problem with Minchan's patch is that any low zone
with an imbalanced list could force a rotation.

In this patch, a zone-constrained global reclaim will rotate the list if
the inactive/active ratio of all eligible zones needs to be corrected.
It is possible that higher zone pages will be initially rotated
prematurely but this is the safer choice to maintain overall LRU age.

Link: http://lkml.kernel.org/r/20160722090929.GJ10438@techsingularity.net
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 5a1c84b404 mm: remove reclaim and compaction retry approximations
If per-zone LRU accounting is available then there is no point
approximating whether reclaim and compaction should retry based on pgdat
statistics.  This is effectively a revert of "mm, vmstat: remove zone
and node double accounting by approximating retries" with the difference
that inactive/active stats are still available.  This preserves the
history of why the approximation was retried and why it had to be
reverted to handle OOM kills on 32-bit systems.

Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman bb4cc2bea6 mm, vmscan: remove highmem_file_pages
With the reintroduction of per-zone LRU stats, highmem_file_pages is
redundant so remove it.

[mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory]
  Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.netLink: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Minchan Kim 71c799f498 mm: add per-zone lru list stat
When I did stress test with hackbench, I got OOM message frequently
which didn't ever happen in zone-lru.

  gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
  ..
  ..
   __alloc_pages_nodemask+0xe52/0xe60
   ? new_slab+0x39c/0x3b0
   new_slab+0x39c/0x3b0
   ___slab_alloc.constprop.87+0x6da/0x840
   ? __alloc_skb+0x3c/0x260
   ? _raw_spin_unlock_irq+0x27/0x60
   ? trace_hardirqs_on_caller+0xec/0x1b0
   ? finish_task_switch+0xa6/0x220
   ? poll_select_copy_remaining+0x140/0x140
   __slab_alloc.isra.81.constprop.86+0x40/0x6d
   ? __alloc_skb+0x3c/0x260
   kmem_cache_alloc+0x22c/0x260
   ? __alloc_skb+0x3c/0x260
   __alloc_skb+0x3c/0x260
   alloc_skb_with_frags+0x4e/0x1a0
   sock_alloc_send_pskb+0x16a/0x1b0
   ? wait_for_unix_gc+0x31/0x90
   ? alloc_set_pte+0x2ad/0x310
   unix_stream_sendmsg+0x28d/0x340
   sock_sendmsg+0x2d/0x40
   sock_write_iter+0x6c/0xc0
   __vfs_write+0xc0/0x120
   vfs_write+0x9b/0x1a0
   ? __might_fault+0x49/0xa0
   SyS_write+0x44/0x90
   do_fast_syscall_32+0xa6/0x1e0
   sysenter_past_esp+0x45/0x74

  Mem-Info:
  active_anon:104698 inactive_anon:105791 isolated_anon:192
   active_file:433 inactive_file:283 isolated_file:22
   unevictable:0 dirty:0 writeback:296 unstable:0
   slab_reclaimable:6389 slab_unreclaimable:78927
   mapped:474 shmem:0 pagetables:101426 bounce:0
   free:10518 free_pcp:334 free_cma:0
  Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
  DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 809 1965 1965
  Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
  lowmem_reserve[]: 0 0 9247 9247
  HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
  Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
  HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
  Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
  25121 total pagecache pages
  24160 pages in swap cache
  Swap cache stats: add 86371, delete 62211, find 42865/60187
  Free swap  = 4015560kB
  Total swap = 4192252kB
  524186 pages RAM
  295934 pages HighMem/MovableOnly
  9658 pages reserved
  0 pages cma reserved

The order-0 allocation for normal zone failed while there are a lot of
reclaimable memory(i.e., anonymous memory with free swap).  I wanted to
analyze the problem but it was hard because we removed per-zone lru stat
so I couldn't know how many of anonymous memory there are in normal/dma
zone.

When we investigate OOM problem, reclaimable memory count is crucial
stat to find a problem.  Without it, it's hard to parse the OOM message
so I believe we should keep it.

With per-zone lru stat,

  gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
  Mem-Info:
  active_anon:101103 inactive_anon:102219 isolated_anon:0
   active_file:503 inactive_file:544 isolated_file:0
   unevictable:0 dirty:0 writeback:34 unstable:0
   slab_reclaimable:6298 slab_unreclaimable:74669
   mapped:863 shmem:0 pagetables:100998 bounce:0
   free:23573 free_pcp:1861 free_cma:0
  Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
  DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
  lowmem_reserve[]: 0 809 1965 1965
  Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
  lowmem_reserve[]: 0 0 9247 9247
  HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0
  DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
  Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
  HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
  Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
  54409 total pagecache pages
  53215 pages in swap cache
  Swap cache stats: add 300982, delete 247765, find 157978/226539
  Free swap  = 3803244kB
  Total swap = 4192252kB
  524186 pages RAM
  295934 pages HighMem/MovableOnly
  9642 pages reserved
  0 pages cma reserved

With that, we can see normal zone has a 86M reclaimable memory so we can
know something goes wrong(I will fix the problem in next patch) in
reclaim.

[mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
 Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 785b99febb mm, vmscan: release/reacquire lru_lock on pgdat change
With node-lru, the locking is based on the pgdat.  As Minchan pointed
out, there is an opportunity to reduce LRU lock release/acquire in
check_move_unevictable_pages by only changing lock on a pgdat change.

[mgorman@techsingularity.net: remove double initialisation]
  Link: http://lkml.kernel.org/r/20160719074835.GC10438@techsingularity.net
Link: http://lkml.kernel.org/r/1468853426-12858-3-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 22fecdf5e1 mm, vmscan: remove redundant check in shrink_zones()
As pointed out by Minchan Kim, shrink_zones() checks for populated zones
in a zonelist but a zonelist can never contain unpopulated zones.  While
it's not related to the node-lru series, it can be cleaned up now.

Link: http://lkml.kernel.org/r/1468853426-12858-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 7ee36a14f0 mm, vmscan: Update all zone LRU sizes before updating memcg
Minchan Kim reported setting the following warning on a 32-bit system
although it can affect 64-bit systems.

  WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
  mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
  Modules linked in:
  CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
    dump_stack+0x76/0xaf
    __warn+0xea/0x110
    ? mem_cgroup_update_lru_size+0x103/0x110
    warn_slowpath_fmt+0x3b/0x40
    mem_cgroup_update_lru_size+0x103/0x110
    isolate_lru_pages.isra.61+0x2e2/0x360
    shrink_active_list+0xac/0x2a0
    ? __delay+0xe/0x10
    shrink_node_memcg+0x53c/0x7a0
    shrink_node+0xab/0x2a0
    do_try_to_free_pages+0xc6/0x390
    try_to_free_pages+0x245/0x590

LRU list contents and counts are updated separately.  Counts are updated
before pages are added to the LRU and updated after pages are removed.
The warning above is from a check in mem_cgroup_update_lru_size that
ensures that list sizes of zero are empty.

The problem is that node-lru needs to account for highmem pages if
CONFIG_HIGHMEM is set.  One impact of the implementation is that the
sizes are updated in multiple passes when pages from multiple zones were
isolated.  This happens whether HIGHMEM is set or not.  When multiple
zones are isolated, it's possible for a debugging check in memcg to be
tripped.

This patch forces all the zone counts to be updated before the memcg
function is called.

Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Tested-by: Minchan Kim <minchan@kernel.org>
Reported-by: Minchan Kim <minchan@kernel.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Minchan Kim 33e077bd60 mm: show node_pages_scanned per node, not zone
The node_pages_scanned represents the number of scanned pages of node
for reclaim so it's pointless to show it as kilobytes.

As well, node_pages_scanned is per-node value, not per-zone.

This patch changes node_pages_scanned per-zone-killobytes with
per-node-count.

[minchan@kernel.org: fix node_pages_scanned]
  Link: http://lkml.kernel.org/r/20160716101431.GA10305@bbox
Link: http://lkml.kernel.org/r/1468588165-12461-5-git-send-email-mgorman@techsingularity.net
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 68eb0731c4 mm, pagevec: release/reacquire lru_lock on pgdat change
With node-lru, the locking is based on the pgdat.  Previously it was
required that a pagevec drain released one zone lru_lock and acquired
another zone lru_lock on every zone change.  Now, it's only necessary if
the node changes.  The end-result is fewer lock release/acquires if the
pages are all on the same node but in different zones.

Link: http://lkml.kernel.org/r/1468588165-12461-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Minchan Kim 9cb937e219 mm, page_alloc: fix dirtyable highmem calculation
When I tested vmscale in mmtest in 32bit, I found the benchmark was slow
down 0.5 times.

                base        node
                   1    global-1
User           12.98       16.04
System        147.61      166.42
Elapsed        26.48       38.08

With vmstat, I found IO wait avg is much increased compared to base.

The reason was highmem_dirtyable_memory accumulates free pages and
highmem_file_pages from HIGHMEM to MOVABLE zones which was wrong.  With
that, dirth_thresh in throtlle_vm_write is always 0 so that it calls
congestion_wait frequently if writeback starts.

With this patch, it is much recovered.

                base        node          fi
                   1    global-1         fix
User           12.98       16.04       13.78
System        147.61      166.42      143.92
Elapsed        26.48       38.08       29.64

Link: http://lkml.kernel.org/r/1468404004-5085-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman bca6759258 mm, vmstat: remove zone and node double accounting by approximating retries
The number of LRU pages, dirty pages and writeback pages must be
accounted for on both zones and nodes because of the reclaim retry
logic, compaction retry logic and highmem calculations all depending on
per-zone stats.

Many lowmem allocations are immune from OOM kill due to a check in
__alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
03668b3ceb ("oom: avoid oom killer for lowmem allocations").  The
exception is costly high-order allocations or allocations that cannot
fail.  If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
allocations then it would fall through to __alloc_pages_direct_compact.

This patch will blindly retry reclaim for zone-constrained allocations
in should_reclaim_retry up to MAX_RECLAIM_RETRIES.  This is not ideal
but without per-zone stats there are not many alternatives.  The impact
it that zone-constrained allocations may delay before considering the
OOM killer.

As there is no guarantee enough memory can ever be freed to satisfy
compaction, this patch avoids retrying compaction for zone-contrained
allocations.

In combination, that means that the per-node stats can be used when
deciding whether to continue reclaim using a rough approximation.  While
it is possible this will make the wrong decision on occasion, it will
not infinite loop as the number of reclaim attempts is capped by
MAX_RECLAIM_RETRIES.

The final step is calculating the number of dirtyable highmem pages.  As
those calculations only care about the global count of file pages in
highmem.  This patch uses a global counter used instead of per-zone
stats as it is sufficient.

In combination, this allows the per-zone LRU and dirty state counters to
be removed.

[mgorman@techsingularity.net: fix acct_highmem_file_pages()]
  Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Suggested by: Michal Hocko <mhocko@kernel.org>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman e2ecc8a79e mm, vmstat: print node-based stats in zoneinfo file
There are a number of stats that were previously accessible via zoneinfo
that are now invisible.  While it is possible to create a new file for
the node stats, this may be missed by users.  Instead this patch prints
the stats under the first populated zone in /proc/zoneinfo.

Link: http://lkml.kernel.org/r/1467970510-21195-34-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 7cc30fcfd2 mm: vmstat: account per-zone stalls and pages skipped during reclaim
The vmstat allocstall was fairly useful in the general sense but
node-based LRUs change that.  It's important to know if a stall was for
an address-limited allocation request as this will require skipping
pages from other zones.  This patch adds pgstall_* counters to replace
allocstall.  The sum of the counters will equal the old allocstall so it
can be trivially recalculated.  A high number of address-limited
allocation requests may result in a lot of useless LRU scanning for
suitable pages.

As address-limited allocations require pages to be skipped, it's
important to know how much useless LRU scanning took place so this patch
adds pgskip* counters.  This yields the following model

1. The number of address-space limited stalls can be accounted for (pgstall)
2. The amount of useless work required to reclaim the data is accounted (pgskip)
3. The total number of scans is available from pgscan_kswapd and pgscan_direct
   so from that the ratio of useful to useless scans can be calculated.

[mgorman@techsingularity.net: s/pgstall/allocstall/]
  Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-33-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 16709d1de1 mm: vmstat: replace __count_zone_vm_events with a zone id equivalent
This is partially a preparation patch for more vmstat work but it also
has the slight advantage that __count_zid_vm_events is cheaper to
calculate than __count_zone_vm_events().

Link: http://lkml.kernel.org/r/1467970510-21195-32-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 3b8c0be43c mm: page_alloc: cache the last node whose dirty limit is reached
If a page is about to be dirtied then the page allocator attempts to
limit the total number of dirty pages that exists in any given zone.
The call to node_dirty_ok is expensive so this patch records if the last
pgdat examined hit the dirty limits.  In some cases, this reduces the
number of calls to node_dirty_ok().

Link: http://lkml.kernel.org/r/1467970510-21195-31-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman e6cbd7f2ef mm, page_alloc: remove fair zone allocation policy
The fair zone allocation policy interleaves allocation requests between
zones to avoid an age inversion problem whereby new pages are reclaimed
to balance a zone.  Reclaim is now node-based so this should no longer
be an issue and the fair zone allocation policy is not free.  This patch
removes it.

Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman e5146b12e2 mm, vmscan: add classzone information to tracepoints
This is convenient when tracking down why the skip count is high because
it'll show what classzone kswapd woke up at and what zones are being
isolated.

Link: http://lkml.kernel.org/r/1467970510-21195-29-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 84c7a7771f mm, vmscan: Have kswapd reclaim from all zones if reclaiming and buffer_heads_over_limit
The buffer_heads_over_limit limit in kswapd is inconsistent with direct
reclaim behaviour.  It may force an an attempt to reclaim from all zones
and then not reclaim at all because higher zones were balanced than
required by the original request.

This patch will causes kswapd to consider reclaiming from all zones if
buffer_heads_over_limit.  However, if there are eligible zones for the
allocation request that woke kswapd then no reclaim will occur even if
buffer_heads_over_limit.  This avoids kswapd over-reclaiming just
because buffer_heads_over_limit.

[mgorman@techsingularity.net: fix comment about buffer_heads_over_limit]
  Link: http://lkml.kernel.org/r/1468404004-5085-2-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-28-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman d9f21d426d mm, vmscan: avoid passing in `remaining' unnecessarily to prepare_kswapd_sleep()
As pointed out by Minchan Kim, the first call to prepare_kswapd_sleep()
always passes in 0 for `remaining' and the second call can trivially
check the parameter in advance.

Suggested-by: Minchan Kim <minchan@kernel.org>
Link: http://lkml.kernel.org/r/1467970510-21195-27-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 4f588331bd mm, vmscan: avoid passing in classzone_idx unnecessarily to compaction_ready
The scan_control structure has enough information available for
compaction_ready() to make a decision.  The classzone_idx manipulations
in shrink_zones() are no longer necessary as the highest populated zone
is no longer used to determine if shrink_slab should be called or not.

[mgorman@techsingularity.net remove redundant check in shrink_zones()]
  Link: http://lkml.kernel.org/r/1468588165-12461-3-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-26-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 970a39a363 mm, vmscan: avoid passing in classzone_idx unnecessarily to shrink_node
shrink_node receives all information it needs about classzone_idx from
sc->reclaim_idx so remove the aliases.

Link: http://lkml.kernel.org/r/1467970510-21195-25-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman a5f5f91da6 mm: convert zone_reclaim to node_reclaim
As reclaim is now per-node based, convert zone_reclaim to be
node_reclaim.  It is possible that a node will be reclaimed multiple
times if it has multiple zones but this is unavoidable without caching
all nodes traversed so far.  The documentation and interface to
userspace is the same from a configuration perspective and will will be
similar in behaviour unless the node-local allocation requests were also
limited to lower zones.

Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 52e9f87ae8 mm, page_alloc: wake kswapd based on the highest eligible zone
The ac_classzone_idx is used as the basis for waking kswapd and that is
based on the preferred zoneref.  If the preferred zoneref's first zone
is lower than what is available on other nodes, it's possible that
kswapd is woken on a zone with only higher, but still eligible, zones.
As classzone_idx is strictly adhered to now, it causes a problem because
eligible pages are skipped.

For example, node 0 has only DMA32 and node 1 has only NORMAL.  An
allocating context running on node 0 may wake kswapd on node 1 telling
it to skip all NORMAL pages.

Link: http://lkml.kernel.org/r/1467970510-21195-23-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman e1a556374a mm, vmscan: only wakeup kswapd once per node for the requested classzone
kswapd is woken when zones are below the low watermark but the wakeup
decision is not taking the classzone into account.  Now that reclaim is
node-based, it is only required to wake kswapd once per node and only if
all zones are unbalanced for the requested classzone.

Note that one node might be checked multiple times if the zonelist is
ordered by node because there is no cheap way of tracking what nodes
have already been visited.  For zone-ordering, each node should be
checked only once.

Link: http://lkml.kernel.org/r/1467970510-21195-22-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman c4a25635b6 mm: move vmscan writes and file write accounting to the node
As reclaim is now node-based, it follows that page write activity due to
page reclaim should also be accounted for on the node.  For consistency,
also account page writes and page dirtying on a per-node basis.

After this patch, there are a few remaining zone counters that may appear
strange but are fine.  NUMA stats are still per-zone as this is a
user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
potentially pin low memory and cannot trivially be reclaimed on demand.
This information is still useful for debugging a page allocation failure
warning.

Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 11fb998986 mm: move most file-based accounting to the node
There are now a number of accounting oddities such as mapped file pages
being accounted for on the node while the total number of file pages are
accounted on the zone.  This can be coped with to some extent but it's
confusing so this patch moves the relevant file-based accounted.  Due to
throttling logic in the page allocator for reliable OOM detection, it is
still necessary to track dirty and writeback pages on a per-zone basis.

[mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
  Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 4b9d0fab71 mm: rename NR_ANON_PAGES to NR_ANON_MAPPED
NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_PAGES  is the number of mapped anon pages.

This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
have

NR_FILE_PAGES  is the number of        file pages.
NR_FILE_MAPPED is the number of mapped file pages.
NR_ANON_MAPPED is the number of mapped anon pages.

Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 50658e2e04 mm: move page mapped accounting to the node
Reclaim makes decisions based on the number of pages that are mapped but
it's mixing node and zone information.  Account NR_FILE_MAPPED and
NR_ANON_PAGES pages on the node.

Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 281e37265f mm, page_alloc: consider dirtyable memory in terms of nodes
Historically dirty pages were spread among zones but now that LRUs are
per-node it is more appropriate to consider dirty pages in a node.

Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 1e6b10857f mm, workingset: make working set detection node-aware
Working set and refault detection is still zone-based, fix it.

Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman ef8f232799 mm, memcg: move memcg limit enforcement from zones to nodes
Memcg needs adjustment after moving LRUs to the node.  Limits are
tracked per memcg but the soft-limit excess is tracked per zone.  As
global page reclaim is based on the node, it is easy to imagine a
situation where a zone soft limit is exceeded even though the memcg
limit is fine.

This patch moves the soft limit tree the node.  Technically, all the
variable names should also change but people are already familiar by the
meaning of "mz" even if "mn" would be a more appropriate name now.

Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman a9dd0a8310 mm, vmscan: make shrink_node decisions more node-centric
Earlier patches focused on having direct reclaim and kswapd use data
that is node-centric for reclaiming but shrink_node() itself still uses
too much zone information.  This patch removes unnecessary zone-based
information with the most important decision being whether to continue
reclaim or not.  Some memcg APIs are adjusted as a result even though
memcg itself still uses some zone information.

[mgorman@techsingularity.net: optimization]
  Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 86c79f6b54 mm: vmscan: do not reclaim from kswapd if there is any eligible zone
kswapd scans from highest to lowest for a zone that requires balancing.
This was necessary when reclaim was per-zone to fairly age pages on
lower zones.  Now that we are reclaiming on a per-node basis, any
eligible zone can be used and pages will still be aged fairly.  This
patch avoids reclaiming excessively unless buffer_heads are over the
limit and it's necessary to reclaim from a higher zone than requested by
the waker of kswapd to relieve low memory pressure.

[hillf.zj@alibaba-inc.com: Force kswapd reclaim no more than needed]
Link: http://lkml.kernel.org/r/1466518566-30034-12-git-send-email-mgorman@techsingularity.net
Link: http://lkml.kernel.org/r/1467970510-21195-13-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 6256c6b499 mm, vmscan: remove duplicate logic clearing node congestion and dirty state
Reclaim may stall if there is too much dirty or congested data on a
node.  This was previously based on zone flags and the logic for
clearing the flags is in two places.  As congestion/dirty tracking is
now tracked on a per-node basis, we can remove some duplicate logic.

Link: http://lkml.kernel.org/r/1467970510-21195-12-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 79dafcdca3 mm, vmscan: by default have direct reclaim only shrink once per node
Direct reclaim iterates over all zones in the zonelist and shrinking
them but this is in conflict with node-based reclaim.  In the default
case, only shrink once per node.

Link: http://lkml.kernel.org/r/1467970510-21195-11-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 38087d9b03 mm, vmscan: simplify the logic deciding whether kswapd sleeps
kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order.  It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat().  What matters most of all is whether a larger order has
been requsted and whether kswapd successfully reclaimed at the previous
order.  This patch irons out the logic to check just that and the end
result is less headache inducing.

Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 31483b6ad2 mm, vmscan: remove balance gap
The balance gap was introduced to apply equal pressure to all zones when
reclaiming for a higher zone.  With node-based LRU, the need for the
balance gap is removed and the code is dead so remove it.

[vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
Link: http://lkml.kernel.org/r/1467970510-21195-9-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 1d82de618d mm, vmscan: make kswapd reclaim in terms of nodes
Patch "mm: vmscan: Begin reclaiming pages on a per-node basis" started
thinking of reclaim in terms of nodes but kswapd is still zone-centric.
This patch gets rid of many of the node-based versus zone-based
decisions.

o A node is considered balanced when any eligible lower zone is balanced.
  This eliminates one class of age-inversion problem because we avoid
  reclaiming a newer page just because it's in the wrong zone
o pgdat_balanced disappears because we now only care about one zone being
  balanced.
o Some anomalies related to writeback and congestion tracking being based on
  zones disappear.
o kswapd no longer has to take care to reclaim zones in the reverse order
  that the page allocator uses.
o Most importantly of all, reclaim from node 0 with multiple zones will
  have similar aging and reclaiming characteristics as every
  other node.

Link: http://lkml.kernel.org/r/1467970510-21195-8-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman f7b60926eb mm, vmscan: have kswapd only scan based on the highest requested zone
kswapd checks all eligible zones to see if they need balancing even if
it was woken for a lower zone.  This made sense when we reclaimed on a
per-zone basis because we wanted to shrink zones fairly so avoid
age-inversion problems.  Ideally this is completely unnecessary when
reclaiming on a per-node basis.  In theory, there may still be anomalies
when all requests are for lower zones and very old pages are preserved
in higher zones but this should be the exceptional case.

Link: http://lkml.kernel.org/r/1467970510-21195-7-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman b2e18757f2 mm, vmscan: begin reclaiming pages on a per-node basis
This patch makes reclaim decisions on a per-node basis.  A reclaimer
knows what zone is required by the allocation request and skips pages
from higher zones.  In many cases this will be ok because it's a
GFP_HIGHMEM request of some description.  On 64-bit, ZONE_DMA32 requests
will cause some problems but 32-bit devices on 64-bit platforms are
increasingly rare.  Historically it would have been a major problem on
32-bit with big Highmem:Lowmem ratios but such configurations are also
now rare and even where they exist, they are not encouraged.  If it
really becomes a problem, it'll manifest as very low reclaim
efficiencies.

Link: http://lkml.kernel.org/r/1467970510-21195-6-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 599d0c954f mm, vmscan: move LRU lists to node
This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic.  Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes.  It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.

Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies.  For example, the scans are
per-zone but using per-node counters.  We also mark a node as congested
when a zone is congested.  This causes weird problems that are fixed
later but is easier to review.

In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions

1. Long-term isolation of highmem pages when reclaim is lowmem

   When pages are skipped, they are immediately added back onto the LRU
   list. If lowmem reclaim persisted for long periods of time, the same
   highmem pages get continually scanned. The idea would be that lowmem
   keeps those pages on a separate list until a reclaim for highmem pages
   arrives that splices the highmem pages back onto the LRU. It potentially
   could be implemented similar to the UNEVICTABLE list.

   That would reduce the skip rate with the potential corner case is that
   highmem pages have to be scanned and reclaimed to free lowmem slab pages.

2. Linear scan lowmem pages if the initial LRU shrink fails

   This will break LRU ordering but may be preferable and faster during
   memory pressure than skipping LRU pages.

Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman a52633d8e9 mm, vmscan: move lru_lock to the node
Node-based reclaim requires node-based LRUs and locking.  This is a
preparation patch that just moves the lru_lock to the node so later
patches are easier to review.  It is a mechanical change but note this
patch makes contention worse because the LRU lock is hotter and direct
reclaim and kswapd can contend on the same lock even when reclaiming
from different zones.

Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman 75ef718405 mm, vmstat: add infrastructure for per-node vmstats
Patchset: "Move LRU page reclaim from zones to nodes v9"

This series moves LRUs from the zones to the node.  While this is a
current rebase, the test results were based on mmotm as of June 23rd.
Conceptually, this series is simple but there are a lot of details.
Some of the broad motivations for this are;

1. The residency of a page partially depends on what zone the page was
   allocated from.  This is partially combatted by the fair zone allocation
   policy but that is a partial solution that introduces overhead in the
   page allocator paths.

2. Currently, reclaim on node 0 behaves slightly different to node 1. For
   example, direct reclaim scans in zonelist order and reclaims even if
   the zone is over the high watermark regardless of the age of pages
   in that LRU. Kswapd on the other hand starts reclaim on the highest
   unbalanced zone. A difference in distribution of file/anon pages due
   to when they were allocated results can result in a difference in
   again. While the fair zone allocation policy mitigates some of the
   problems here, the page reclaim results on a multi-zone node will
   always be different to a single-zone node.
   it was scheduled on as a result.

3. kswapd and the page allocator scan zones in the opposite order to
   avoid interfering with each other but it's sensitive to timing.  This
   mitigates the page allocator using pages that were allocated very recently
   in the ideal case but it's sensitive to timing. When kswapd is allocating
   from lower zones then it's great but during the rebalancing of the highest
   zone, the page allocator and kswapd interfere with each other. It's worse
   if the highest zone is small and difficult to balance.

4. slab shrinkers are node-based which makes it harder to identify the exact
   relationship between slab reclaim and LRU reclaim.

The reason we have zone-based reclaim is that we used to have
large highmem zones in common configurations and it was necessary
to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
less of a concern as machines with lots of memory will (or should) use
64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
rare. Machines that do use highmem should have relatively low highmem:lowmem
ratios than we worried about in the past.

Conceptually, moving to node LRUs should be easier to understand. The
page allocator plays fewer tricks to game reclaim and reclaim behaves
similarly on all nodes.

The series has been tested on a 16 core UMA machine and a 2-socket 48
core NUMA machine. The UMA results are presented in most cases as the NUMA
machine behaved similarly.

pagealloc
---------

This is a microbenchmark that shows the benefit of removing the fair zone
allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
shown as the other orders were comparable.

                                           4.7.0-rc4                  4.7.0-rc4
                                      mmotm-20160623                 nodelru-v9
Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)

This shows a steady improvement throughout. The primary benefit is from
reduced system CPU usage which is obvious from the overall times;

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623nodelru-v8
User          189.19      191.80
System       2604.45     2533.56
Elapsed      2855.30     2786.39

The vmstats also showed that the fair zone allocation policy was definitely
removed as can be seen here;

                             4.7.0-rc3   4.7.0-rc3
                         mmotm-20160623 nodelru-v8
DMA32 allocs               28794729769           0
Normal allocs              48432501431 77227309877
Movable allocs                       0           0

tiobench on ext4
----------------

tiobench is a benchmark that artifically benefits if old pages remain resident
while new pages get reclaimed. The fair zone allocation policy mitigates this
problem so pages age fairly. While the benchmark has problems, it is important
that tiobench performance remains constant as it implies that page aging
problems that the fair zone allocation policy fixes are not re-introduced.

                                         4.7.0-rc4             4.7.0-rc4
                                    mmotm-20160623            nodelru-v9
Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 approx-v9
User          645.72      525.90
System        403.85      331.75
Elapsed      6795.36     6783.67

This shows that the series has little or not impact on tiobench which is
desirable and a reduction in system CPU usage. It indicates that the fair
zone allocation policy was removed in a manner that didn't reintroduce
one class of page aging bug. There were only minor differences in overall
reclaim activity

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Minor Faults                    645838      647465
Major Faults                       573         640
Swap Ins                             0           0
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                  46041453    44190646
Normal allocs                 78053072    79887245
Movable allocs                       0           0
Allocation stalls                   24          67
Stall zone DMA                       0           0
Stall zone DMA32                     0           0
Stall zone Normal                    0           2
Stall zone HighMem                   0           0
Stall zone Movable                   0          65
Direct pages scanned             10969       30609
Kswapd pages scanned          93375144    93492094
Kswapd pages reclaimed        93372243    93489370
Direct pages reclaimed           10969       30609
Kswapd efficiency                  99%         99%
Kswapd velocity              13741.015   13781.934
Direct efficiency                 100%        100%
Direct velocity                  1.614       4.512
Percentage direct scans             0%          0%

kswapd activity was roughly comparable. There were differences in direct
reclaim activity but negligible in the context of the overall workload
(velocity of 4 pages per second with the patches applied, 1.6 pages per
second in the baseline kernel).

pgbench read-only large configuration on ext4
---------------------------------------------

pgbench is a database benchmark that can be sensitive to page reclaim
decisions. This also checks if removing the fair zone allocation policy
is safe

pgbench Transactions
                        4.7.0-rc4             4.7.0-rc4
                   mmotm-20160623            nodelru-v8
Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)

Negligible differences again. As with tiobench, overall reclaim activity
was comparable.

bonnie++ on ext4
----------------

No interesting performance difference, negligible differences on reclaim
stats.

paralleldd on ext4
------------------

This workload uses varying numbers of dd instances to read large amounts of
data from disk.

                               4.7.0-rc3             4.7.0-rc3
                          mmotm-20160623            nodelru-v9
Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)

           4.7.0-rc4   4.7.0-rc4
        mmotm-20160623 nodelru-v9
User         1548.01     1552.44
System       8609.71     8515.08
Elapsed      3587.10     3594.54

There is little or no change in performance but some drop in system CPU usage.

                             4.7.0-rc3   4.7.0-rc3
                        mmotm-20160623  nodelru-v9
Minor Faults                    362662      367360
Major Faults                      1204        1143
Swap Ins                            22           0
Swap Outs                         2855        1029
DMA allocs                           0           0
DMA32 allocs                  31409797    28837521
Normal allocs                 46611853    49231282
Movable allocs                       0           0
Direct pages scanned                 0           0
Kswapd pages scanned          40845270    40869088
Kswapd pages reclaimed        40830976    40855294
Direct pages reclaimed               0           0
Kswapd efficiency                  99%         99%
Kswapd velocity              11386.711   11369.769
Direct efficiency                 100%        100%
Direct velocity                  0.000       0.000
Percentage direct scans             0%          0%
Page writes by reclaim            2855        1029
Page writes file                     0           0
Page writes anon                  2855        1029
Page reclaim immediate             771        1628
Sector Reads                 293312636   293536360
Sector Writes                 18213568    18186480
Page rescued immediate               0           0
Slabs scanned                   128257      132747
Direct inode steals                181          56
Kswapd inode steals                 59        1131

It basically shows that kswapd was active at roughly the same rate in
both kernels. There was also comparable slab scanning activity and direct
reclaim was avoided in both cases. There appears to be a large difference
in numbers of inodes reclaimed but the workload has few active inodes and
is likely a timing artifact.

stutter
-------

stutter simulates a simple workload. One part uses a lot of anonymous
memory, a second measures mmap latency and a third copies a large file.
The primary metric is checking for mmap latency.

stutter
                             4.7.0-rc4             4.7.0-rc4
                        mmotm-20160623            nodelru-v8
Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)

This shows a number of improvements with the worst-case outlier greatly
improved.

Some of the vmstats are interesting

                             4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623nodelru-v8
Swap Ins                           163         502
Swap Outs                            0           0
DMA allocs                           0           0
DMA32 allocs                 618719206  1381662383
Normal allocs                891235743   564138421
Movable allocs                       0           0
Allocation stalls                 2603           1
Direct pages scanned            216787           2
Kswapd pages scanned          50719775    41778378
Kswapd pages reclaimed        41541765    41777639
Direct pages reclaimed          209159           0
Kswapd efficiency                  81%         99%
Kswapd velocity              16859.554   14329.059
Direct efficiency                  96%          0%
Direct velocity                 72.061       0.001
Percentage direct scans             0%          0%
Page writes by reclaim         6215049           0
Page writes file               6215049           0
Page writes anon                     0           0
Page reclaim immediate           70673          90
Sector Reads                  81940800    81680456
Sector Writes                100158984    98816036
Page rescued immediate               0           0
Slabs scanned                  1366954       22683

While this is not guaranteed in all cases, this particular test showed
a large reduction in direct reclaim activity. It's also worth noting
that no page writes were issued from reclaim context.

This series is not without its hazards. There are at least three areas
that I'm concerned with even though I could not reproduce any problems in
that area.

1. Reclaim/compaction is going to be affected because the amount of reclaim is
   no longer targetted at a specific zone. Compaction works on a per-zone basis
   so there is no guarantee that reclaiming a few THP's worth page pages will
   have a positive impact on compaction success rates.

2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
   are called is now different. This may or may not be a problem but if it
   is, it'll be because shrinkers are not called enough and some balancing
   is required.

3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
   distributed between zones and the fair zone allocation policy used to do
   something very similar for anon. The distribution is now different but not
   necessarily in any way that matters but it's still worth bearing in mind.

VM statistic counters for reclaim decisions are zone-based.  If the kernel
is to reclaim on a per-node basis then we need to track per-node
statistics but there is no infrastructure for that.  The most notable
change is that the old node_page_state is renamed to
sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
uses per-node stats but none exist yet.  There is some renaming such as
vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
patch with no functional change.  There is a lot of similarity between the
node and zone helpers which is unfortunate but there was no obvious way of
reusing the code and maintaining type safety.

Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Mel Gorman a621184ac6 mm, meminit: remove early_page_nid_uninitialised
The helper early_page_nid_uninitialised() has been dead since commit
974a786e63 ("mm, page_alloc: remove MIGRATE_RESERVE") so remove the
dead code.

Link: http://lkml.kernel.org/r/1468008031-3848-2-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Ganesh Mahendran b2b331f966 mm/compaction: remove unnecessary order check in try_to_compact_pages()
The caller __alloc_pages_direct_compact() already checked (order == 0)
so there's no need to check again.

Link: http://lkml.kernel.org/r/1465973568-3496-1-git-send-email-opensource.ganesh@gmail.com
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Johannes Weiner 55779ec759 mm: fix vm-scalability regression in cgroup-aware workingset code
Commit 23047a96d7 ("mm: workingset: per-cgroup cache thrash
detection") added a page->mem_cgroup lookup to the cache eviction,
refault, and activation paths, as well as locking to the activation
path, and the vm-scalability tests showed a regression of -23%.

While the test in question is an artificial worst-case scenario that
doesn't occur in real workloads - reading two sparse files in parallel
at full CPU speed just to hammer the LRU paths - there is still some
optimizations that can be done in those paths.

Inline the lookup functions to eliminate calls.  Also, page->mem_cgroup
doesn't need to be stabilized when counting an activation; we merely
need to hold the RCU lock to prevent the memcg from being freed.

This cuts down on overhead quite a bit:

23047a96d7 063f6715e77a7be5770d6081fe
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
  21621405 +- 0%     +11.3%   24069657 +- 2%  vm-scalability.throughput

[linux@roeck-us.net: drop unnecessary include file]
[hannes@cmpxchg.org: add WARN_ON_ONCE()s]
  Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
zhong jiang 400bc7fd4f mm: update the comment in __isolate_free_page
We need to assure the comment is consistent with the code.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466171914-21027-1-git-send-email-zhongjiang@huawei.com
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko 091f362c53 mm, oom: tighten task_will_free_mem() locking
"mm, oom: fortify task_will_free_mem" has dropped task_lock around
task_will_free_mem in oom_kill_process bacause it assumed that a
potential race when the selected task exits will not be a problem as the
oom_reaper will call exit_oom_victim.

Tetsuo was objecting that nommu doesn't have oom_reaper so the race
would be still possible.  The code would be racy and lockup prone
theoretically in other aspects without the oom reaper anyway so I didn't
considered this a big deal.  But it seems that further changes I am
planning in this area will benefit from stable task->mm in this path as
well.  So let's drop find_lock_task_mm from task_will_free_mem and call
it from under task_lock as we did previously.  Just pull the task->mm !=
NULL check inside the function.

Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko a373966d1f mm, oom: hide mm which is shared with kthread or global init
The only case where the oom_reaper is not triggered for the oom victim
is when it shares the memory with a kernel thread (aka use_mm) or with
the global init.  After "mm, oom: skip vforked tasks from being
selected" the victim cannot be a vforked task of the global init so we
are left with clone(CLONE_VM) (without CLONE_SIGHAND).  use_mm() users
are quite rare as well.

In order to help forward progress for the OOM killer, make sure that
this really rare case will not get in the way - we do this by hiding the
mm from the oom killer by setting MMF_OOM_REAPED flag for it.
oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
MMF_OOM_REAPED flag set to catch these oom victims.

After this patch we should guarantee forward progress for the OOM killer
even when the selected victim is sharing memory with a kernel thread or
global init as long as the victims mm is still alive.

Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko 11a410d516 mm, oom_reaper: do not attempt to reap a task more than twice
oom_reaper relies on the mmap_sem for read to do its job.  Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot.  Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.

This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails.  If the flag is present after
the failure then we set MMF_OOM_REAPED to hide this mm from the oom
killer completely so it can go and chose another victim.

As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.

Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko 696453e666 mm, oom: task_will_free_mem should skip oom_reaped tasks
The 0-day robot has encountered the following:

   Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
   Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
   oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
   oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
   oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
   oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
   oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

oom_reaper is trying to reap the same task again and again.

This is possible only when the oom killer is bypassed because of
task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
already set during select_bad_process.  Teach task_will_free_mem to skip
over MMF_OOM_REAPED tasks as well because they will be unlikely to free
anything more.

Analyzed by Tetsuo Handa.

Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko 1af8bb4326 mm, oom: fortify task_will_free_mem()
task_will_free_mem is rather weak.  It doesn't really tell whether the
task has chance to drop its mm.  98748bd722 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step into making
it more robust for multi-threaded applications so now we know that the
whole process is going down and probably drop the mm.

This patch builds on top for more complex scenarios where mm is shared
between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
use_mm().

Make sure that all processes sharing the mm are killed or exiting.  This
will allow us to replace try_oom_reaper by wake_oom_reaper because
task_will_free_mem implies the task is reapable now.  Therefore all paths
which bypass the oom killer are now reapable and so they shouldn't lock up
the oom killer.

Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko 97fd49c235 mm, oom: kill all tasks sharing the mm
Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_ADJUST_MIN.  After "mm,
oom_adj: make sure processes sharing mm have same view of oom_score_adj"
all such processes are sharing the same value so we shouldn't see such a
task at all (oom_badness would rule them out).

We can still encounter oom disabled vforked task which has to be killed as
well if we want to have other tasks sharing the mm reapable because it can
access the memory before doing exec.  Killing such a task should be
acceptable because it is highly unlikely it has done anything useful
because it cannot modify any memory before it calls exec.  An alternative
would be to keep the task alive and skip the oom reaper and risk all the
weird corner cases where the OOM killer cannot make forward progress
because the oom victim hung somewhere on the way to exit.

[rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
 the setting is inherently racy and we cannot do much about it without
 introducing locks in hot paths]
Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko b18dc5f291 mm, oom: skip vforked tasks from being selected
vforked tasks are not really sitting on any memory.  They are sharing the
mm with parent until they exec into a new code.  Until then it is just
pinning the address space.  OOM killer will kill the vforked task along
with its parent but we still can end up selecting vforked task when the
parent wouldn't be selected.  E.g.  init doing vfork to launch a task or
vforked being a child of oom unkillable task with an updated oom_score_adj
to be killable.

Add a new helper to check whether a task is in the vfork sharing memory
with its parent and use it in oom_badness to skip over these tasks.

Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Michal Hocko 44a70adec9 mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj
oom_score_adj is shared for the thread groups (via struct signal) but this
is not sufficient to cover processes sharing mm (CLONE_VM without
CLONE_SIGHAND) and so we can easily end up in a situation when some
processes update their oom_score_adj and confuse the oom killer.  In the
worst case some of those processes might hide from the oom killer
altogether via OOM_SCORE_ADJ_MIN while others are eligible.  OOM killer
would then pick up those eligible but won't be allowed to kill others
sharing the same mm so the mm wouldn't release the mm and so the memory.

It would be ideal to have the oom_score_adj per mm_struct because that is
the natural entity OOM killer considers.  But this will not work because
some programs are doing

	vfork()
	set_oom_adj()
	exec()

We can achieve the same though.  oom_score_adj write handler can set the
oom_score_adj for all processes sharing the same mm if the task is not in
the middle of vfork.  As a result all the processes will share the same
oom_score_adj.  The current implementation is rather pessimistic and
checks all the existing processes by default if there is more than 1
holder of the mm but we do not have any reliable way to check for external
users yet.

Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-28 16:07:41 -07:00
Linus Torvalds 6784725ab0 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
 "Assorted cleanups and fixes.

  Probably the most interesting part long-term is ->d_init() - that will
  have a bunch of followups in (at least) ceph and lustre, but we'll
  need to sort the barrier-related rules before it can get used for
  really non-trivial stuff.

  Another fun thing is the merge of ->d_iput() callers (dentry_iput()
  and dentry_unlink_inode()) and a bunch of ->d_compare() ones (all
  except the one in __d_lookup_lru())"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
  fs/dcache.c: avoid soft-lockup in dput()
  vfs: new d_init method
  vfs: Update lookup_dcache() comment
  bdev: get rid of ->bd_inodes
  Remove last traces of ->sync_page
  new helper: d_same_name()
  dentry_cmp(): use lockless_dereference() instead of smp_read_barrier_depends()
  vfs: clean up documentation
  vfs: document ->d_real()
  vfs: merge .d_select_inode() into .d_real()
  unify dentry_iput() and dentry_unlink_inode()
  binfmt_misc: ->s_root is not going anywhere
  drop redundant ->owner initializations
  ufs: get rid of redundant checks
  orangefs: constify inode_operations
  missed comment updates from ->direct_IO() prototype change
  file_inode(f)->i_mapping is f->f_mapping
  trim fsnotify hooks a bit
  9p: new helper - v9fs_parent_fid()
  debugfs: ->d_parent is never NULL or negative
  ...
2016-07-28 12:59:05 -07:00
Linus Torvalds 0e06f5c0de Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - a few misc bits

 - ocfs2

 - most(?) of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (125 commits)
  thp: fix comments of __pmd_trans_huge_lock()
  cgroup: remove unnecessary 0 check from css_from_id()
  cgroup: fix idr leak for the first cgroup root
  mm: memcontrol: fix documentation for compound parameter
  mm: memcontrol: remove BUG_ON in uncharge_list
  mm: fix build warnings in <linux/compaction.h>
  mm, thp: convert from optimistic swapin collapsing to conservative
  mm, thp: fix comment inconsistency for swapin readahead functions
  thp: update Documentation/{vm/transhuge,filesystems/proc}.txt
  shmem: split huge pages beyond i_size under memory pressure
  thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
  khugepaged: add support of collapse for tmpfs/shmem pages
  shmem: make shmem_inode_info::lock irq-safe
  khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
  thp: extract khugepaged from mm/huge_memory.c
  shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
  shmem: add huge pages support
  shmem: get_unmapped_area align huge page
  shmem: prepare huge= mount option and sysfs knob
  mm, rmap: account shmem thp pages
  ...
2016-07-26 19:55:54 -07:00
Huang Ying 8f19b0c058 thp: fix comments of __pmd_trans_huge_lock()
To make the comments consistent with the already changed code.

Link: http://lkml.kernel.org/r/1466200004-6196-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Li RongQing 25843c2b21 mm: memcontrol: fix documentation for compound parameter
Commit f627c2f537 ("memcg: adjust to support new THP refcounting")
adds a compound parameter for several functions, and change one as
compound for mem_cgroup_move_account but it does not change the
comments.

Link: http://lkml.kernel.org/r/1465368216-9393-1-git-send-email-roy.qing.li@gmail.com
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Li RongQing 17408d785a mm: memcontrol: remove BUG_ON in uncharge_list
When calling uncharge_list, if a page is transparent huge we don't need
to BUG_ON about non-transparent huge, since nobody should be able to see
the page at this stage and this page cannot be raced against with a THP
split.

This check became unneeded after 0a31bc97c8 ("mm: memcontrol: rewrite
uncharge API").

[mhocko@suse.com: changelog enhancements]
Link: http://lkml.kernel.org/r/1465369248-13865-1-git-send-email-roy.qing.li@gmail.com
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim dd4123f324 mm: fix build warnings in <linux/compaction.h>
Randy reported below build error.

> In file included from ../include/linux/balloon_compaction.h:48:0,
>                  from ../mm/balloon_compaction.c:11:
> ../include/linux/compaction.h:237:51: warning: 'struct node' declared inside parameter list [enabled by default]
>  static inline int compaction_register_node(struct node *node)
> ../include/linux/compaction.h:237:51: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
> ../include/linux/compaction.h:242:54: warning: 'struct node' declared inside parameter list [enabled by default]
>  static inline void compaction_unregister_node(struct node *node)
>

It was caused by non-lru page migration which needs compaction.h but
compaction.h doesn't include any header to be standalone.

I think proper header for non-lru page migration is migrate.h rather
than compaction.h because migrate.h has already headers needed to work
non-lru page migration indirectly like isolate_mode_t, migrate_mode
MIGRATEPAGE_SUCCESS.

[akpm@linux-foundation.org: revert mm-balloon-use-general-non-lru-movable-page-feature-fix.patch temp fix]
Link: http://lkml.kernel.org/r/20160610003304.GE29779@bbox
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ebru Akagunduz 0db501f7a3 mm, thp: convert from optimistic swapin collapsing to conservative
To detect whether khugepaged swapin is worthwhile, this patch checks the
amount of young pages.  There should be at least half of HPAGE_PMD_NR to
swapin.

Link: http://lkml.kernel.org/r/1468109451-1615-1-git-send-email-ebru.akagunduz@gmail.com
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Suggested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ebru Akagunduz 47f863ea22 mm, thp: fix comment inconsistency for swapin readahead functions
After fixing swapin issues, comment lines stayed as in old version.
This patch updates the comments.

Link: http://lkml.kernel.org/r/1468109345-32258-1-git-send-email-ebru.akagunduz@gmail.com
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 779750d20b shmem: split huge pages beyond i_size under memory pressure
Even if user asked to allocate huge pages always (huge=always), we
should be able to free up some memory by splitting pages which are
partly byound i_size if memory presure comes or once we hit limit on
filesystem size (-o size=).

In order to do this we maintain per-superblock list of inodes, which
potentially have huge pages on the border of file size.

Per-fs shrinker can reclaim memory by splitting such pages.

If we hit -ENOSPC during shmem_getpage_gfp(), we try to split a page to
free up space on the filesystem and retry allocation if it succeed.

Link: http://lkml.kernel.org/r/1466021202-61880-37-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov e496cf3d78 thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE
For file mappings, we don't deposit page tables on THP allocation
because it's not strictly required to implement split_huge_pmd(): we can
just clear pmd and let following page faults to reconstruct the page
table.

But Power makes use of deposited page table to address MMU quirk.

Let's hide THP page cache, including huge tmpfs, under separate config
option, so it can be forbidden on Power.

We can revert the patch later once solution for Power found.

Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov f3f0e1d215 khugepaged: add support of collapse for tmpfs/shmem pages
This patch extends khugepaged to support collapse of tmpfs/shmem pages.
We share fair amount of infrastructure with anon-THP collapse.

Few design points:

  - First we are looking for VMA which can be suitable for mapping huge
    page;

  - If the VMA maps shmem file, the rest scan/collapse operations
    operates on page cache, not on page tables as in anon VMA case.

  - khugepaged_scan_shmem() finds a range which is suitable for huge
    page. The scan is lockless and shouldn't disturb system too much.

  - once the candidate for collapse is found, collapse_shmem() attempts
    to create a huge page:

      + scan over radix tree, making the range point to new huge page;

      + new huge page is not-uptodate, locked and freezed (refcount
        is 0), so nobody can touch them until we say so.

      + we swap in pages during the scan. khugepaged_scan_shmem()
        filters out ranges with more than khugepaged_max_ptes_swap
	swapped out pages. It's HPAGE_PMD_NR/8 by default.

      + old pages are isolated, unmapped and put to local list in case
        to be restored back if collapse failed.

  - if collapse succeed, we retract pte page tables from VMAs where huge
    pages mapping is possible. The huge page will be mapped as PMD on
    next minor fault into the range.

Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 4595ef88d1 shmem: make shmem_inode_info::lock irq-safe
We are going to need to call shmem_charge() under tree_lock to get
accoutning right on collapse of small tmpfs pages into a huge one.

The problem is that tree_lock is irq-safe and lockdep is not happy, that
we take irq-unsafe lock under irq-safe[1].

Let's convert the lock to irq-safe.

[1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc

Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 988ddb710b khugepaged: move up_read(mmap_sem) out of khugepaged_alloc_page()
Both variants of khugepaged_alloc_page() do up_read(&mm->mmap_sem)
first: no point keep it inside the function.

Link: http://lkml.kernel.org/r/1466021202-61880-33-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov b46e756f5e thp: extract khugepaged from mm/huge_memory.c
khugepaged implementation grew to the point when it deserve separate
file in source.

Let's move it to mm/khugepaged.c.

Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 657e3038c4 shmem, thp: respect MADV_{NO,}HUGEPAGE for file mappings
Let's wire up existing madvise() hugepage hints for file mappings.

MADV_HUGEPAGE advise shmem to allocate huge page on page fault in the
VMA.  It only has effect if the filesystem is mounted with huge=advise
or huge=within_size.

MADV_NOHUGEPAGE prevents hugepage from being allocated on page fault in
the VMA.  It doesn't prevent a huge page from being allocated by other
means, i.e.  page fault into different mapping or write(2) into file.

Link: http://lkml.kernel.org/r/1466021202-61880-31-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 800d8c63b2 shmem: add huge pages support
Here's basic implementation of huge pages support for shmem/tmpfs.

It's all pretty streight-forward:

  - shmem_getpage() allcoates huge page if it can and try to inserd into
    radix tree with shmem_add_to_page_cache();

  - shmem_add_to_page_cache() puts the page onto radix-tree if there's
    space for it;

  - shmem_undo_range() removes huge pages, if it fully within range.
    Partial truncate of huge pages zero out this part of THP.

    This have visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending what
    pages happened to be allocated.

  - no need to change shmem_fault: core-mm will map an compound page as
    huge if VMA is suitable;

Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Hugh Dickins c01d5b3007 shmem: get_unmapped_area align huge page
Provide a shmem_get_unmapped_area method in file_operations, called at
mmap time to decide the mapping address.  It could be conditional on
CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
it unconditional.

shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout).  Lots of conditions, and in most cases it
just goes with the address that chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, go back to
ask for a larger arena, within which to align the mapping suitably.

There have to be some direct calls to shmem_get_unmapped_area(), not via
the file_operations: because of the way shmem_zero_setup() is called to
create a shmem object late in the mmap sequence, when MAP_SHARED is
requested with MAP_ANONYMOUS or /dev/zero.  Though this only matters
when /proc/sys/vm/shmem_huge has been set.

Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 5a6e75f811 shmem: prepare huge= mount option and sysfs knob
This patch adds new mount option "huge=".  It can have following values:

  - "always":
	Attempt to allocate huge pages every time we need a new page;

  - "never":
	Do not allocate huge pages;

  - "within_size":
	Only allocate huge page if it will be fully within i_size.
	Also respect fadvise()/madvise() hints;

  - "advise:
	Only allocate huge pages if requested with fadvise()/madvise();

Default is "never" for now.

"mount -o remount,huge= /mountpoint" works fine after mount: remounting
huge=never will not attempt to break up huge pages at all, just stop
more from being allocated.

No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which
is the appropriate option to protect those who don't want the new bloat,
and with which we shall share some pmd code.

Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
explicit).

Allow enabling THP only if the machine has_transparent_hugepage().

But what about Shmem with no user-visible mount? SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM
objects, Ashmem.  Though unlikely to suit all usages, provide sysfs knob
/sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with
huge on those.

And allow shmem_enabled two further values:

  - "deny":
	For use in emergencies, to force the huge option off from
	all mounts;
  - "force":
	Force the huge option on for all - very useful for testing;

Based on patch by Hugh Dickins.

Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 65c453778a mm, rmap: account shmem thp pages
Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
smaps.  It indicates how many times we allocate and map shmem THP.

NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov fc127da085 truncate: handle file thp
For shmem/tmpfs we only need to tweak truncate_inode_page() and
invalidate_mapping_pages().

truncate_inode_pages_range() and invalidate_inode_pages2_range() are
adjusted to use page_to_pgoff().

Link: http://lkml.kernel.org/r/1466021202-61880-26-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 83929372f6 filemap: prepare find and delete operations for huge pages
For now, we would have HPAGE_PMD_NR entries in radix tree for every huge
page.  That's suboptimal and it will be changed to use Matthew's
multi-order entries later.

'add' operation is not changed, because we don't need it to implement
hugetmpfs: shmem uses its own implementation.

Link: http://lkml.kernel.org/r/1466021202-61880-25-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 7751b2da6b vmscan: split file huge pages before paging them out
This is preparation of vmscan for file huge pages.  We cannot write out
huge pages, so we need to split them on the way out.

Link: http://lkml.kernel.org/r/1466021202-61880-22-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 9a73f61bdb thp, mlock: do not mlock PTE-mapped file huge pages
As with anon THP, we only mlock file huge pages if we can prove that the
page is not mapped with PTE.  This way we can avoid mlock leak into
non-mlocked vma on split.

We rely on PageDoubleMap() under lock_page() to check if the the page
may be PTE mapped.  PG_double_map is set by page_add_file_rmap() when
the page mapped with PTEs.

Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov baa355fd33 thp: file pages support for split_huge_page()
Basic scheme is the same as for anon THP.

Main differences:

  - File pages are on radix-tree, so we have head->_count offset by
    HPAGE_PMD_NR. The count got distributed to small pages during split.

  - mapping->tree_lock prevents non-lockless access to pages under split
    over radix-tree;

  - Lockless access is prevented by setting the head->_count to 0 during
    split;

  - After split, some pages can be beyond i_size. We drop them from
    radix-tree.

  - We don't setup migration entries. Just unmap pages. It helps
    handling cases when i_size is in the middle of the page: no need
    handle unmap pages beyond i_size manually.

Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 37f9f5595c thp: run vma_adjust_trans_huge() outside i_mmap_rwsem
vma_addjust_trans_huge() splits pmd if it's crossing VMA boundary.
During split we munlock the huge page which requires rmap walk.  rmap
wants to take the lock on its own.

Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.

Link: http://lkml.kernel.org/r/1466021202-61880-19-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov b237aded41 thp: prepare change_huge_pmd() for file thp
change_huge_pmd() has assert which is not relvant for file page.  For
shared mapping it's perfectly fine to have page table entry writable,
without explicit mkwrite.

Link: http://lkml.kernel.org/r/1466021202-61880-18-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 628d47ce98 thp: skip file huge pmd on copy_huge_pmd()
copy_page_range() has a check for "Don't copy ptes where a page fault
will fill them correctly." It works on VMA level.  We still copy all
page table entries from private mappings, even if they map page cache.

We can simplify copy_huge_pmd() a bit by skipping file PMDs.

We don't map file private pages with PMDs, so they only can map page
cache.  It's safe to skip them as they can be re-faulted later.

Link: http://lkml.kernel.org/r/1466021202-61880-17-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov af9e4d5f2d thp: handle file COW faults
File COW for THP is handled on pte level: just split the pmd.

It's not clear how benefitial would be allocation of huge pages on COW
faults.  And it would require some code to make them work.

I think at some point we can consider teaching khugepaged to collapse
pages in COW mappings, but allocating huge on fault is probably
overkill.

Link: http://lkml.kernel.org/r/1466021202-61880-16-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov d21b9e57c7 thp: handle file pages in split_huge_pmd()
Splitting THP PMD is simple: just unmap it as in DAX case.  This way we
can avoid memory overhead on page table allocation to deposit.

It's probably a good idea to try to allocation page table with
GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area,
but clearing pmd should be good enough for now.

Unlike DAX, we also remove the page from rmap and drop reference.
pmd_young() is transfered to PageReferenced().

Link: http://lkml.kernel.org/r/1466021202-61880-15-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov b5072380eb thp: support file pages in zap_huge_pmd()
split_huge_pmd() for file mappings (and DAX too) is implemented by just
clearing pmd entry as we can re-fill this area from page cache on pte
level later.

This means we don't need deposit page tables when file THP is mapped.
Therefore we shouldn't try to withdraw a page table on zap_huge_pmd()
file THP PMD.

Link: http://lkml.kernel.org/r/1466021202-61880-14-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 95ecedcd6a thp, vmstats: add counters for huge file pages
THP_FILE_ALLOC: how many times huge page was allocated and put page
cache.

THP_FILE_MAPPED: how many times file huge page was mapped.

Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 1010245964 mm: introduce do_set_pmd()
With postponed page table allocation we have chance to setup huge pages.
do_set_pte() calls do_set_pmd() if following criteria met:

 - page is compound;
 - pmd entry in pmd_none();
 - vma has suitable size and alignment;

Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov dd78fedde4 rmap: support file thp
Naive approach: on mapping/unmapping the page as compound we update
->_mapcount on each 4k page.  That's not efficient, but it's not obvious
how we can optimize this.  We can look into optimization later.

PG_double_map optimization doesn't work for file pages since lifecycle
of file pages is different comparing to anon pages: file page can be
mapped again at any time.

Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 7267ec008b mm: postpone page table allocation until we have page to map
The idea (and most of code) is borrowed again: from Hugh's patchset on
huge tmpfs[1].

Instead of allocation pte page table upfront, we postpone this until we
have page to map in hands.  This approach opens possibility to map the
page as huge if filesystem supports this.

Comparing to Hugh's patch I've pushed page table allocation a bit
further: into do_set_pte().  This way we can postpone allocation even in
faultaround case without moving do_fault_around() after __do_fault().

do_set_pte() got renamed to alloc_set_pte() as it can allocate page
table if required.

[1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils

Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov bae473a423 mm: introduce fault_env
The idea borrowed from Peter's patch from patchset on speculative page
faults[1]:

Instead of passing around the endless list of function arguments,
replace the lot with a single structure so we can change context without
endless function signature changes.

The changes are mostly mechanical with exception of faultaround code:
filemap_map_pages() got reworked a bit.

This patch is preparation for the next one.

[1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov dcddffd41d mm: do not pass mm_struct into handle_mm_fault
We always have vma->vm_mm around.

Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Kirill A. Shutemov 1f52e67e5e khugepaged: recheck pmd after mmap_sem re-acquired
Vlastimil noted[1] that pmd can be no longer valid after we drop
mmap_sem.  We need recheck it once mmap_sem taken again.

[1] http://lkml.kernel.org/r/12918dcd-a695-c6f4-e06f-69141c5f357f@suse.cz

Link: http://lkml.kernel.org/r/1466021202-61880-6-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ebru Akagunduz 8024ee2a09 mm, thp: fix locking inconsistency in collapse_huge_page
After creating revalidate vma function, locking inconsistency occured
due to directing the code path to wrong label.  This patch directs to
correct label and fix the inconsistency.

Related commit that caused inconsistency:
 http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0

Link: http://lkml.kernel.org/r/1464956884-4644-1-git-send-email-ebru.akagunduz@gmail.com
Link: http://lkml.kernel.org/r/1466021202-61880-4-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ebru Akagunduz 7269586252 mm, thp: make swapin readahead under down_read of mmap_sem
Currently khugepaged makes swapin readahead under down_write.  This
patch supplies to make swapin readahead under down_read instead of
down_write.

The patch was tested with a test program that allocates 800MB of memory,
writes to it, and then sleeps.  The system was forced to swap out all.
Afterwards, the test program touches the area by writing, it skips a
page in each 20 pages of the area.

[akpm@linux-foundation.org: update comment to match new code]
[kirill.shutemov@linux.intel.com: passing 'vma' to hugepage_vma_revlidate() is useless]
  Link: http://lkml.kernel.org/r/20160530095058.GA53044@black.fi.intel.com
  Link: http://lkml.kernel.org/r/1466021202-61880-3-git-send-email-kirill.shutemov@linux.intel.com
Link: http://lkml.kernel.org/r/1464335964-6510-4-git-send-email-ebru.akagunduz@gmail.com
Link: http://lkml.kernel.org/r/1466021202-61880-2-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ebru Akagunduz 8a966ed746 mm: make swapin readahead to improve thp collapse rate
This patch makes swapin readahead to improve thp collapse rate.  When
khugepaged scanned pages, there can be a few of the pages in swap area.

With the patch THP can collapse 4kB pages into a THP when there are up
to max_ptes_swap swap ptes in a 2MB range.

The patch was tested with a test program that allocates 400B of memory,
writes to it, and then sleeps.  I force the system to swap out all.
Afterwards, the test program touches the area by writing, it skips a
page in each 20 pages of the area.

Without the patch, system did not swap in readahead.  THP rate was %65
of the program of the memory, it did not change over time.

With this patch, after 10 minutes of waiting khugepaged had collapsed
%99 of the program's memory.

[kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
[kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
[kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Ebru Akagunduz 70652f6ec0 mm: make optimistic check for swapin readahead
Introduce a new sysfs integer knob
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap which makes
optimistic check for swapin readahead to increase thp collapse rate.
Before getting swapped out pages to memory, checks them and allows up to a
certain number.  It also prints out using tracepoints amount of unmapped
ptes.

[vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE]
[sfr@canb.auug.org.au: build fix]
  Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au
Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Xie XiuQi <xiexiuqi@huawei.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
nimisolo ef3cc4db41 mm/memblock.c:memblock_add_range(): if nr_new is 0 just return
If nr_new is 0 which means there's no region would be added, so just
return to the caller.

Signed-off-by: nimisolo <nimisolo@gmail.com>
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Wei Yang <weiyang@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Michal Hocko 8a5c743e30 mm, memcg: use consistent gfp flags during readahead
Vladimir has noticed that we might declare memcg oom even during
readahead because read_pages only uses GFP_KERNEL (with mapping_gfp
restriction) while __do_page_cache_readahead uses
page_cache_alloc_readahead which adds __GFP_NORETRY to prevent from
OOMs.  This gfp mask discrepancy is really unfortunate and easily
fixable.  Drop page_cache_alloc_readahead() which only has one user and
outsource the gfp_mask logic into readahead_gfp_mask and propagate this
mask from __do_page_cache_readahead down to read_pages.

This alone would have only very limited impact as most filesystems are
implementing ->readpages and the common implementation mpage_readpages
does GFP_KERNEL (with mapping_gfp restriction) again.  We can tell it to
use readahead_gfp_mask instead as this function is called only during
readahead as well.  The same applies to read_cache_pages.

ext4 has its own ext4_mpage_readpages but the path which has pages !=
NULL can use the same gfp mask.  Btrfs, cifs, f2fs and orangefs are
doing a very similar pattern to mpage_readpages so the same can be
applied to them as well.

[akpm@linux-foundation.org: coding-style fixes]
[mhocko@suse.com: restrict gfp mask in mpage_alloc]
  Link: http://lkml.kernel.org/r/20160610074223.GC32285@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1465301556-26431-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Chris Mason <clm@fb.com>
Cc: Steve French <sfrench@samba.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Marshall <hubcap@omnibond.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Changman Lee <cm224.lee@samsung.com>
Cc: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Michal Hocko e5e3f4c4f0 mm, oom_reaper: make sure that mmput_async is called only when memory was reaped
Tetsuo is worried that mmput_async might still lead to a premature new
oom victim selection due to the following race:

__oom_reap_task				exit_mm
  find_lock_task_mm
  atomic_inc(mm->mm_users) # = 2
  task_unlock
  					  task_lock
					  task->mm = NULL
					  up_read(&mm->mmap_sem)
		< somebody write locks mmap_sem >
					  task_unlock
					  mmput
  					    atomic_dec_and_test # = 1
					  exit_oom_victim
  down_read_trylock # failed - no reclaim
  mmput_async # Takes unpredictable amount of time
  		< new OOM situation >

the final __mmput will be executed in the delayed context which might
happen far in the future.  Such a race is highly unlikely because the
write holder of mmap_sem would have to be an external task (all direct
holders are already killed or exiting) and it usually have to pin
mm_users in order to do anything reasonable.

We can, however, make sure that the mmput_async is only called when we
do not back off and reap some memory.  That would reduce the impact of
the delayed __mmput because the real content would be already freed.
Pin mm_count to keep it alive after we drop task_lock and before we try
to get mmap_sem.  If the mmap_sem succeeds we can try to grab mm_users
reference and then go on with unmapping the address space.

It is not clear whether this race is possible at all but it is better to
be more robust and do not pin mm_users unless we are sure we are
actually doing some real work during __oom_reap_task.

Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim 91537fee00 mm: add NR_ZSMALLOC to vmstat
zram is very popular for some of the embedded world (e.g., TV, mobile
phones).  On those system, zsmalloc's consumed memory size is never
trivial (one of example from real product system, total memory: 800M,
zsmalloc consumed: 150M), so we have used this out of tree patch to
monitor system memory behavior via /proc/vmstat.

With zsmalloc in vmstat, it helps in tracking down system behavior due
to memory usage.

[minchan@kernel.org: zsmalloc: follow up zsmalloc vmstat]
  Link: http://lkml.kernel.org/r/20160607091737.GC23435@bbox
[akpm@linux-foundation.org: fix build with CONFIG_ZSMALLOC=m]
Link: http://lkml.kernel.org/r/1464919731-13255-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Chanho Min <chanho.min@lge.com>
Cc: Chan Gyun Jeong <chan.jeong@lge.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vlastimil Babka 8ea1d2a198 mm, frontswap: convert frontswap_enabled to static key
I have noticed that frontswap.h first declares "frontswap_enabled" as
extern bool variable, and then overrides it with "#define
frontswap_enabled (1)" for CONFIG_FRONTSWAP=Y or (0) when disabled.  The
bool variable isn't actually instantiated anywhere.

This all looks like an unfinished attempt to make frontswap_enabled
reflect whether a backend is instantiated.  But in the current state,
all frontswap hooks call unconditionally into frontswap.c just to check
if frontswap_ops is non-NULL.  This should at least be checked inline,
but we can further eliminate the overhead when CONFIG_FRONTSWAP is
enabled and no backend registered, using a static key that is initially
disabled, and gets enabled only upon first backend registration.

Thus, checks for "frontswap_enabled" are replaced with
"frontswap_enabled()" wrapping the static key check.  There are two
exceptions:

- xen's selfballoon_process() was testing frontswap_enabled in code guarded
  by #ifdef CONFIG_FRONTSWAP, which was effectively always true when reachable.
  The patch just removes this check. Using frontswap_enabled() does not sound
  correct here, as this can be true even without xen's own backend being
  registered.

- in SYSCALL_DEFINE2(swapon), change the check to IS_ENABLED(CONFIG_FRONTSWAP)
  as it seems the bitmap allocation cannot currently be postponed until a
  backend is registered. This means that frontswap will still have some
  memory overhead by being configured, but without a backend.

After the patch, we can expect that some functions in frontswap.c are
called only when frontswap_ops is non-NULL.  Change the checks there to
VM_BUG_ONs.  While at it, convert other BUG_ONs to VM_BUG_ONs as
frontswap has been stable for some time.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1463152235-9717-1-git-send-email-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Tetsuo Handa fbe84a09da mm,oom: remove unused argument from oom_scan_process_thread().
oom_scan_process_thread() does not use totalpages argument.
oom_badness() uses it.

Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov 5e8d35f849 mm: memcontrol: teach uncharge_list to deal with kmem pages
Page table pages are batched-freed in release_pages on most
architectures.  If we want to charge them to kmemcg (this is what is
done later in this series), we need to teach mem_cgroup_uncharge_list to
handle kmem pages.

Link: http://lkml.kernel.org/r/18d5c09e97f80074ed25b97a7d0f32b95d875717.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov 4949148ad4 mm: charge/uncharge kmemcg from generic page allocator paths
Currently, to charge a non-slab allocation to kmemcg one has to use
alloc_kmem_pages helper with __GFP_ACCOUNT flag.  A page allocated with
this helper should finally be freed using free_kmem_pages, otherwise it
won't be uncharged.

This API suits its current users fine, but it turns out to be impossible
to use along with page reference counting, i.e.  when an allocation is
supposed to be freed with put_page, as it is the case with pipe or unix
socket buffers.

To overcome this limitation, this patch moves charging/uncharging to
generic page allocator paths, i.e.  to __alloc_pages_nodemask and
free_pages_prepare, and zaps alloc/free_kmem_pages helpers.  This way,
one can use any of the available page allocation functions to get the
allocated page charged to kmemcg - it's enough to pass __GFP_ACCOUNT,
just like in case of kmalloc and friends.  A charged page will be
automatically uncharged on free.

To make it possible, we need to mark pages charged to kmemcg somehow.
To avoid introducing a new page flag, we make use of page->_mapcount for
marking such pages.  Since pages charged to kmemcg are not supposed to
be mapped to userspace, it should work just fine.  There are other
(ab)users of page->_mapcount - buddy and balloon pages - but we don't
conflict with them.

In case kmemcg is compiled out or not used at runtime, this patch
introduces no overhead to generic page allocator paths.  If kmemcg is
used, it will be plus one gfp flags check on alloc and plus one
page->_mapcount check on free, which shouldn't hurt performance, because
the data accessed are hot.

Link: http://lkml.kernel.org/r/a9736d856f895bcb465d9f257b54efe32eda6f99.1464079538.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov 452647784b mm: memcontrol: cleanup kmem charge functions
- Handle memcg_kmem_enabled check out to the caller. This reduces the
   number of function definitions making the code easier to follow. At
   the same time it doesn't result in code bloat, because all of these
   functions are used only in one or two places.

 - Move __GFP_ACCOUNT check to the caller as well so that one wouldn't
   have to dive deep into memcg implementation to see which allocations
   are charged and which are not.

 - Refresh comments.

Link: http://lkml.kernel.org/r/52882a28b542c1979fd9a033b4dc8637fc347399.1464079537.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Aneesh Kumar K.V e77b0852b5 mm/mmu_gather: track page size with mmu gather and force flush if page size change
This allows an arch which needs to do special handing with respect to
different page size when flushing tlb to implement the same in mmu
gather.

Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Aneesh Kumar K.V e9d55e1570 mm: change the interface for __tlb_remove_page()
This updates the generic and arch specific implementation to return true
if we need to do a tlb flush.  That means if a __tlb_remove_page
indicate a flush is needed, the page we try to remove need to be tracked
and added again after the flush.  We need to track it because we have
already update the pte to none and we can't just loop back.

This change is done to enable us to do a tlb_flush when we try to flush
a range that consists of different page sizes.  For architectures like
ppc64, we can do a range based tlb flush and we need to track page size
for that.  When we try to remove a huge page, we will force a tlb flush
and starts a new mmu gather.

[aneesh.kumar@linux.vnet.ibm.com: mm-change-the-interface-for-__tlb_remove_page-v3]
  Link: http://lkml.kernel.org/r/1465049193-22197-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1464860389-29019-2-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Aneesh Kumar K.V 31d49da5ad mm/hugetlb: simplify hugetlb unmap
For hugetlb like THP (and unlike regular page), we do tlb flush after
dropping ptl.  Because of the above, we don't need to track force_flush
like we do now.  Instead we can simply call tlb_remove_page() which will
do the flush if needed.

No functionality change in this patch.

Link: http://lkml.kernel.org/r/1465049193-22197-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Naoya Horiguchi 337d9abf1c mm: thp: check pmd_trans_unstable() after split_huge_pmd()
split_huge_pmd() doesn't guarantee that the pmd is normal pmd pointing
to pte entries, which can be checked with pmd_trans_unstable().  Some
callers make this assertion and some do it differently and some not, so
let's do it in a unified manner.

Link: http://lkml.kernel.org/r/1464741400-12143-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joonsoo Kim e3a2713c3c mm/page_isolation: clean up confused code
When there is an isolated_page, post_alloc_hook() is called with page
but __free_pages() is called with isolated_page.  Since they are the
same so no problem but it's very confusing.  To reduce it, this patch
changes isolated_page to boolean type and uses page variable
consistently.

Link: http://lkml.kernel.org/r/1466150259-27727-10-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joonsoo Kim 46f24fd857 mm/page_alloc: introduce post allocation processing on page allocator
This patch is motivated from Hugh and Vlastimil's concern [1].

There are two ways to get freepage from the allocator.  One is using
normal memory allocation API and the other is __isolate_free_page()
which is internally used for compaction and pageblock isolation.  Later
usage is rather tricky since it doesn't do whole post allocation
processing done by normal API.

One problematic thing I already know is that poisoned page would not be
checked if it is allocated by __isolate_free_page().  Perhaps, there
would be more.

We could add more debug logic for allocated page in the future and this
separation would cause more problem.  I'd like to fix this situation at
this time.  Solution is simple.  This patch commonize some logic for
newly allocated page and uses it on all sites.  This will solve the
problem.

[1] http://marc.info/?i=alpine.LSU.2.11.1604270029350.7066%40eggly.anvils%3E

[iamjoonsoo.kim@lge.com: mm-page_alloc-introduce-post-allocation-processing-on-page-allocator-v3]
  Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
  Link: http://lkml.kernel.org/r/1466150259-27727-9-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1464230275-25791-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joonsoo Kim f2ca0b5571 mm/page_owner: use stackdepot to store stacktrace
Currently, we store each page's allocation stacktrace on corresponding
page_ext structure and it requires a lot of memory.  This causes the
problem that memory tight system doesn't work well if page_owner is
enabled.  Moreover, even with this large memory consumption, we cannot
get full stacktrace because we allocate memory at boot time and just
maintain 8 stacktrace slots to balance memory consumption.  We could
increase it to more but it would make system unusable or change system
behaviour.

To solve the problem, this patch uses stackdepot to store stacktrace.
It obviously provides memory saving but there is a drawback that
stackdepot could fail.

stackdepot allocates memory at runtime so it could fail if system has
not enough memory.  But, most of allocation stack are generated at very
early time and there are much memory at this time.  So, failure would
not happen easily.  And, one failure means that we miss just one page's
allocation stacktrace so it would not be a big problem.  In this patch,
when memory allocation failure happens, we store special stracktrace
handle to the page that is failed to save stacktrace.  With it, user can
guess memory usage properly even if failure happens.

Memory saving looks as following.  (4GB memory system with page_owner)
(before the patch -> after the patch)

static allocation:
92274688 bytes -> 25165824 bytes

dynamic allocation after boot + kernel build:
0 bytes -> 327680 bytes

total:
92274688 bytes -> 25493504 bytes

72% reduction in total.

Note that implementation looks complex than someone would imagine
because there is recursion issue.  stackdepot uses page allocator and
page_owner is called at page allocation.  Using stackdepot in page_owner
could re-call page allcator and then page_owner.  That is a recursion.
To detect and avoid it, whenever we obtain stacktrace, recursion is
checked and page_owner is set to dummy information if found.  Dummy
information means that this page is allocated for page_owner feature
itself (such as stackdepot) and it's understandable behavior for user.

[iamjoonsoo.kim@lge.com: mm-page_owner-use-stackdepot-to-store-stacktrace-v3]
  Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com
  Link: http://lkml.kernel.org/r/1466150259-27727-7-git-send-email-iamjoonsoo.kim@lge.com
Link: http://lkml.kernel.org/r/1464230275-25791-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joonsoo Kim a9627bc5e3 mm/page_owner: introduce split_page_owner and replace manual handling
split_page() calls set_page_owner() to set up page_owner to each pages.
But, it has a drawback that head page and the others have different
stacktrace because callsite of set_page_owner() is slightly differnt.
To avoid this problem, this patch copies head page's page_owner to the
others.  It needs to introduce new function, split_page_owner() but it
also remove the other function, get_page_owner_gfp() so looks good to
do.

Link: http://lkml.kernel.org/r/1464230275-25791-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joonsoo Kim a8efe1c982 mm/page_owner: copy last_migrate_reason in copy_page_owner()
Currently, copy_page_owner() doesn't copy all the owner information.  It
skips last_migrate_reason because copy_page_owner() is used for
migration and it will be properly set soon.  But, following patch will
use copy_page_owner() and this skip will cause the problem that
allocated page has uninitialied last_migrate_reason.  To prevent it,
this patch also copy last_migrate_reason in copy_page_owner().

Link: http://lkml.kernel.org/r/1464230275-25791-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joonsoo Kim 83358ece26 mm/page_owner: initialize page owner without holding the zone lock
It's not necessary to initialized page_owner with holding the zone lock.
It would cause more contention on the zone lock although it's not a big
problem since it is just debug feature.  But, it is better than before
so do it.  This is also preparation step to use stackdepot in page owner
feature.  Stackdepot allocates new pages when there is no reserved space
and holding the zone lock in this case will cause deadlock.

Link: http://lkml.kernel.org/r/1464230275-25791-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Joonsoo Kim 66c64223ad mm/compaction: split freepages without holding the zone lock
We don't need to split freepages with holding the zone lock.  It will
cause more contention on zone lock so not desirable.

[rientjes@google.com: if __isolate_free_page() fails, avoid adding to freelist so we don't call map_pages() with it]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211447001.43430@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/1464230275-25791-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim 3b1d9ca65a zsmalloc: use OBJ_TAG_BIT for bit shifter
Static check warns using tag as bit shifter.  It doesn't break current
working but not good for redability.  Let's use OBJ_TAG_BIT as bit
shifter instead of OBJ_ALLOCATED_TAG.

Link: http://lkml.kernel.org/r/20160607045146.GF26230@bbox
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim 48b4800a1c zsmalloc: page migration support
This patch introduces run-time migration feature for zspage.

For migration, VM uses page.lru field so it would be better to not use
page.next field which is unified with page.lru for own purpose.  For
that, firstly, we can get first object offset of the page via runtime
calculation instead of using page.index so we can use page.index as link
for page chaining instead of page.next.

In case of huge object, it stores handle to page.index instead of next
link of page chaining because huge object doesn't need to next link for
page chaining.  So get_next_page need to identify huge object to return
NULL.  For it, this patch uses PG_owner_priv_1 flag of the page flag.

For migration, it supports three functions

* zs_page_isolate

It isolates a zspage which includes a subpage VM want to migrate from
class so anyone cannot allocate new object from the zspage.

We could try to isolate a zspage by the number of subpage so subsequent
isolation trial of other subpage of the zpsage shouldn't fail.  For
that, we introduce zspage.isolated count.  With that, zs_page_isolate
can know whether zspage is already isolated or not for migration so if
it is isolated for migration, subsequent isolation trial can be
successful without trying further isolation.

* zs_page_migrate

First of all, it holds write-side zspage->lock to prevent migrate other
subpage in zspage.  Then, lock all objects in the page VM want to
migrate.  The reason we should lock all objects in the page is due to
race between zs_map_object and zs_page_migrate.

  zs_map_object				zs_page_migrate

  pin_tag(handle)
  obj = handle_to_obj(handle)
  obj_to_location(obj, &page, &obj_idx);

					write_lock(&zspage->lock)
					if (!trypin_tag(handle))
						goto unpin_object

  zspage = get_zspage(page);
  read_lock(&zspage->lock);

If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can be
stale by migration so it goes crash.

If it locks all of objects successfully, it copies content from old page
to new one, finally, create new zspage chain with new page.  And if it's
last isolated subpage in the zspage, put the zspage back to class.

* zs_page_putback

It returns isolated zspage to right fullness_group list if it fails to
migrate a page.  If it find a zspage is ZS_EMPTY, it queues zspage
freeing to workqueue.  See below about async zspage freeing.

This patch introduces asynchronous zspage free.  The reason to need it
is we need page_lock to clear PG_movable but unfortunately, zs_free path
should be atomic so the apporach is try to grab page_lock.  If it got
page_lock of all of pages successfully, it can free zspage immediately.
Otherwise, it queues free request and free zspage via workqueue in
process context.

If zs_free finds the zspage is isolated when it try to free zspage, it
delays the freeing until zs_page_putback finds it so it will free free
the zspage finally.

In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL.  First
of all, it will use ZS_EMPTY list for delay freeing.  And with adding
ZS_FULL list, it makes to identify whether zspage is isolated or not via
list_empty(&zspage->list) test.

[minchan@kernel.org: zsmalloc: keep first object offset in struct page]
  Link: http://lkml.kernel.org/r/1465788015-23195-1-git-send-email-minchan@kernel.org
[minchan@kernel.org: zsmalloc: zspage sanity check]
  Link: http://lkml.kernel.org/r/20160603010129.GC3304@bbox
Link: http://lkml.kernel.org/r/1464736881-24886-12-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim bfd093f5e7 zsmalloc: use freeobj for index
Zsmalloc stores first free object's <PFN, obj_idx> position into freeobj
in each zspage.  If we change it with index from first_page instead of
position, it makes page migration simple because we don't need to
correct other entries for linked list if a page is migrated out.

Link: http://lkml.kernel.org/r/1464736881-24886-11-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim 4aa409cab7 zsmalloc: separate free_zspage from putback_zspage
Currently, putback_zspage does free zspage under class->lock if fullness
become ZS_EMPTY but it makes trouble to implement locking scheme for new
zspage migration.  So, this patch is to separate free_zspage from
putback_zspage and free zspage out of class->lock which is preparation
for zspage migration.

Link: http://lkml.kernel.org/r/1464736881-24886-10-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim 3783689a1a zsmalloc: introduce zspage structure
We have squeezed meta data of zspage into first page's descriptor.  So,
to get meta data from subpage, we should get first page first of all.
But it makes trouble to implment page migration feature of zsmalloc
because any place where to get first page from subpage can be raced with
first page migration.  IOW, first page it got could be stale.  For
preventing it, I have tried several approahces but it made code
complicated so finally, I concluded to separate metadata from first
page.  Of course, it consumes more memory.  IOW, 16bytes per zspage on
32bit at the moment.  It means we lost 1% at *worst case*(40B/4096B)
which is not bad I think at the cost of maintenance.

Link: http://lkml.kernel.org/r/1464736881-24886-9-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim bdb0af7ca8 zsmalloc: factor page chain functionality out
For page migration, we need to create page chain of zspage dynamically
so this patch factors it out from alloc_zspage.

Link: http://lkml.kernel.org/r/1464736881-24886-8-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim 4f42047bbd zsmalloc: use accessor
Upcoming patch will change how to encode zspage meta so for easy review,
this patch wraps code to access metadata as accessor.

Link: http://lkml.kernel.org/r/1464736881-24886-7-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim 1b8320b620 zsmalloc: use bit_spin_lock
Use kernel standard bit spin-lock instead of custom mess.  Even, it has
a bug which doesn't disable preemption.  The reason we don't have any
problem is that we have used it during preemption disable section by
class->lock spinlock.  So no need to go to stable.

Link: http://lkml.kernel.org/r/1464736881-24886-6-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim 1fc6e27d7b zsmalloc: keep max_object in size_class
Every zspage in a size_class has same number of max objects so we could
move it to a size_class.

Link: http://lkml.kernel.org/r/1464736881-24886-5-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim b1123ea6d3 mm: balloon: use general non-lru movable page feature
Now, VM has a feature to migrate non-lru movable pages so balloon
doesn't need custom migration hooks in migrate.c and compaction.c.

Instead, this patch implements the page->mapping->a_ops->
{isolate|migrate|putback} functions.

With that, we could remove hooks for ballooning in general migration
functions and make balloon compaction simple.

[akpm@linux-foundation.org: compaction.h requires that the includer first include node.h]
Link: http://lkml.kernel.org/r/1464736881-24886-4-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim bda807d444 mm: migrate: support non-lru movable page migration
We have allowed migration for only LRU pages until now and it was enough
to make high-order pages.  But recently, embedded system(e.g., webOS,
android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
have seen several reports about troubles of small high-order allocation.
For fixing the problem, there were several efforts (e,g,.  enhance
compaction algorithm, SLUB fallback to 0-order page, reserved memory,
vmalloc and so on) but if there are lots of non-movable pages in system,
their solutions are void in the long run.

So, this patch is to support facility to change non-movable pages with
movable.  For the feature, this patch introduces functions related to
migration to address_space_operations as well as some page flags.

If a driver want to make own pages movable, it should define three
functions which are function pointers of struct
address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What VM expects on isolate_page function of driver is to return *true*
if driver isolates page successfully.  On returing true, VM marks the
page as PG_isolated so concurrent isolation in several CPUs skip the
page for isolation.  If a driver cannot isolate the page, it should
return *false*.

Once page is successfully isolated, VM uses page.lru fields so driver
shouldn't expect to preserve values in that fields.

2. int (*migratepage) (struct address_space *mapping,
		struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, VM calls migratepage of driver with isolated page.  The
function of migratepage is to move content of the old page to new page
and set up fields of struct page newpage.  Keep in mind that you should
indicate to the VM the oldpage is no longer movable via
__ClearPageMovable() under page_lock if you migrated the oldpage
successfully and returns 0.  If driver cannot migrate the page at the
moment, driver can return -EAGAIN.  On -EAGAIN, VM will retry page
migration in a short time because VM interprets -EAGAIN as "temporal
migration failure".  On returning any error except -EAGAIN, VM will give
up the page migration without retrying in this time.

Driver shouldn't touch page.lru field VM using in the functions.

3. void (*putback_page)(struct page *);

If migration fails on isolated page, VM should return the isolated page
to the driver so VM calls driver's putback_page with migration failed
page.  In this function, driver should put the isolated page back to the
own data structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable page.

* PG_movable

Driver should use the below function to make page movable under
page_lock.

	void __SetPageMovable(struct page *page, struct address_space *mapping)

It needs argument of address_space for registering migration family
functions which will be called by VM.  Exactly speaking, PG_movable is
not a real flag of struct page.  Rather than, VM reuses page->mapping's
lower bits to represent it.

	#define PAGE_MAPPING_MOVABLE 0x2
	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so driver shouldn't access page->mapping directly.  Instead, driver
should use page_mapping which mask off the low two bits of page->mapping
so it can get right struct address_space.

For testing of non-lru movable page, VM supports __PageMovable function.
However, it doesn't guarantee to identify non-lru movable page because
page->mapping field is unified with other variables in struct page.  As
well, if driver releases the page after isolation by VM, page->mapping
doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
__ClearPageMovable).  But __PageMovable is cheap to catch whether page
is LRU or non-lru movable once the page has been isolated.  Because LRU
pages never can have PAGE_MAPPING_MOVABLE in page->mapping.  It is also
good for just peeking to test non-lru movable pages before more
expensive checking with lock_page in pfn scanning to select victim.

For guaranteeing non-lru movable page, VM provides PageMovable function.
Unlike __PageMovable, PageMovable functions validates page->mapping and
mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
sudden destroying of page->mapping.

Driver using __SetPageMovable should clear the flag via
__ClearMovablePage under page_lock before the releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, VM marks isolated
page as PG_isolated under lock_page.  So if a CPU encounters PG_isolated
non-lru movable page, it can skip it.  Driver doesn't need to manipulate
the flag because VM will set/clear it automatically.  Keep in mind that
if driver sees PG_isolated page, it means the page have been isolated by
VM so it shouldn't touch page.lru field.  PG_isolated is alias with
PG_reclaim flag so driver shouldn't use the flag for own purpose.

[opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
  Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Minchan Kim c6c919eb90 mm: use put_page() to free page instead of putback_lru_page()
Recently, I got many reports about perfermance degradation in embedded
system(Android mobile phone, webOS TV and so on) and easy fork fail.

The problem was fragmentation caused by zram and GPU driver mainly.
With memory pressure, their pages were spread out all of pageblock and
it cannot be migrated with current compaction algorithm which supports
only LRU pages.  In the end, compaction cannot work well so reclaimer
shrinks all of working set pages.  It made system very slow and even to
fail to fork easily which requires order-[2 or 3] allocations.

Other pain point is that they cannot use CMA memory space so when OOM
kill happens, I can see many free pages in CMA area, which is not memory
efficient.  In our product which has big CMA memory, it reclaims zones
too exccessively to allocate GPU and zram page although there are lots
of free space in CMA so system becomes very slow easily.

To solve these problem, this patch tries to add facility to migrate
non-lru pages via introducing new functions and page flags to help
migration.

struct address_space_operations {
	..
	..
	bool (*isolate_page)(struct page *, isolate_mode_t);
	void (*putback_page)(struct page *);
	..
}

new page flags

	PG_movable
	PG_isolated

For details, please read description in "mm: migrate: support non-lru
movable page migration".

Originally, Gioh Kim had tried to support this feature but he moved so I
took over the work.  I took many code from his work and changed a little
bit and Konstantin Khlebnikov helped Gioh a lot so he should deserve to
have many credit, too.

And I should mention Chulmin who have tested this patchset heavily so I
can find many bugs from him.  :)

Thanks, Gioh, Konstantin and Chulmin!

This patchset consists of five parts.

1. clean up migration
  mm: use put_page to free page instead of putback_lru_page

2. add non-lru page migration feature
  mm: migrate: support non-lru movable page migration

3. rework KVM memory-ballooning
  mm: balloon: use general non-lru movable page feature

4. zsmalloc refactoring for preparing page migration
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index

5. zsmalloc page migration
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

This patch (of 12):

Procedure of page migration is as follows:

First of all, it should isolate a page from LRU and try to migrate the
page.  If it is successful, it releases the page for freeing.
Otherwise, it should put the page back to LRU list.

For LRU pages, we have used putback_lru_page for both freeing and
putback to LRU list.  It's okay because put_page is aware of LRU list so
if it releases last refcount of the page, it removes the page from LRU
list.  However, It makes unnecessary operations (e.g., lru_cache_add,
pagevec and flags operations.  It would be not significant but no worth
to do) and harder to support new non-lru page migration because put_page
isn't aware of non-lru page's data structure.

To solve the problem, we can add new hook in put_page with PageMovable
flags check but it can increase overhead in hot path and needs new
locking scheme to stabilize the flag check with put_page.

So, this patch cleans it up to divide two semantic(ie, put and putback).
If migration is successful, use put_page instead of putback_lru_page and
use putback_lru_page only on failure.  That makes code more readable and
doesn't add overhead in put_page.

Comment from Vlastimil
 "Yeah, and compaction (perhaps also other migration users) has to drain
  the lru pvec...  Getting rid of this stuff is worth even by itself."

Link: http://lkml.kernel.org/r/1464736881-24886-2-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov 2a966b77ae mm: oom: add memcg to oom_control
It's a part of oom context just like allocation order and nodemask, so
let's move it to oom_control instead of passing it in the argument list.

Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Vladimir Davydov 798fd75695 mm: zap ZONE_OOM_LOCKED
Not used since oom_lock was instroduced.

Link: http://lkml.kernel.org/r/1464358093-22663-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Reza Arbab df429ac039 memory-hotplug: more general validation of zone during online
When memory is onlined, we are only able to rezone from ZONE_MOVABLE to
ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE.

To be more flexible, use the following criteria instead; to online
memory from zone X into zone Y,

* Any zones between X and Y must be unused.
* If X is lower than Y, the onlined memory must lie at the end of X.
* If X is higher than Y, the onlined memory must lie at the start of X.

Add zone_can_shift() to make this determination.

Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewd-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Reza Arbab e51e6c8f80 memory-hotplug: add move_pfn_range()
Add move_pfn_range(), a wrapper to call move_pfn_range_left() or
move_pfn_range_right().

No functional change. This will be utilized by a later patch.

Link: http://lkml.kernel.org/r/1462816419-4479-2-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab <arbab@linux.vnet.ibm.com>
Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Kiper <daniel.kiper@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrew Banman <abanman@sgi.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Oliver O'Halloran 90cae1fe1c mm/init: fix zone boundary creation
As a part of memory initialisation the architecture passes an array to
free_area_init_nodes() which specifies the max PFN of each memory zone.
This array is not necessarily monotonic (due to unused zones) so this
array is parsed to build monotonic lists of the min and max PFN for each
zone.  ZONE_MOVABLE is special cased here as its limits are managed by
the mm subsystem rather than the architecture.  Unfortunately, this
special casing is broken when ZONE_MOVABLE is the not the last zone in
the zone list.  The core of the issue is:

	if (i == ZONE_MOVABLE)
		continue;
	arch_zone_lowest_possible_pfn[i] =
		arch_zone_highest_possible_pfn[i-1];

As ZONE_MOVABLE is skipped the lowest_possible_pfn of the next zone will
be set to zero.  This patch fixes this bug by adding explicitly tracking
where the next zone should start rather than relying on the contents
arch_zone_highest_possible_pfn[].

Thie is low priority.  To get bitten by this you need to enable a zone
that appears after ZONE_MOVABLE in the zone_type enum.  As far as I can
tell this means running a kernel with ZONE_DEVICE or ZONE_CMA enabled,
so I can't see this affecting too many people.

I only noticed this because I've been fiddling with ZONE_DEVICE on
powerpc and 4.6 broke my test kernel.  This bug, in conjunction with the
changes in Taku Izumi's kernelcore=mirror patch (d91749c1dd) and
powerpc being the odd architecture which initialises max_zone_pfn[] to
~0ul instead of 0 caused all of system memory to be placed into
ZONE_DEVICE at boot, followed a panic since device memory cannot be used
for kernel allocations.  I've already submitted a patch to fix the
powerpc specific bits, but I figured this should be fixed too.

Link: http://lkml.kernel.org/r/1462435033-15601-1-git-send-email-oohall@gmail.com
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Li RongQing 48406ef897 mm/memcontrol.c: remove the useless parameter for mc_handle_swap_pte
It seems like this parameter has never been used since being introduced
by 90254a6583 ("memcg: clean up move charge").  Not a big deal because
I assume the function would get inlined into the caller anyway but why
not get rid of it.

[mhocko@suse.com: wrote changelog]
  Link: http://lkml.kernel.org/r/20160525151831.GJ20132@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1464145026-26693-1-git-send-email-roy.qing.li@gmail.com
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Wei Yongjun de24baecd7 mm/slab: use list_move instead of list_del/list_add
Using list_move() instead of list_del() + list_add() to avoid needlessly
poisoning the next and prev values.

Link: http://lkml.kernel.org/r/1468929772-9174-1-git-send-email-weiyj_lk@163.com
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Michal Hocko 72baeef0c2 slab: do not panic on invalid gfp_mask
Both SLAB and SLUB BUG() when a caller provides an invalid gfp_mask.
This is a rather harsh way to announce a non-critical issue.  Allocator
is free to ignore invalid flags.  Let's simply replace BUG() by
dump_stack to tell the offender and fixup the mask to move on with the
allocation request.

This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test
module:

  Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code!
  CPU: 0 PID: 2916 Comm: insmod Tainted: G           O    4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
  Call Trace:
    dump_stack+0x67/0x90
    cache_alloc_refill+0x201/0x617
    kmem_cache_alloc_trace+0xa7/0x24a
    ? 0xffffffffa0005000
    mymodule_init+0x20/0x1000 [test_slab]
    do_one_initcall+0xe7/0x16c
    ? rcu_read_lock_sched_held+0x61/0x69
    ? kmem_cache_alloc_trace+0x197/0x24a
    do_init_module+0x5f/0x1d9
    load_module+0x1a3d/0x1f21
    ? retint_kernel+0x2d/0x2d
    SyS_init_module+0xe8/0x10e
    ? SyS_init_module+0xe8/0x10e
    do_syscall_64+0x68/0x13f
    entry_SYSCALL64_slow_path+0x25/0x25

Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Michal Hocko bacdcb3460 slab: make GFP_SLAB_BUG_MASK information more human readable
printk offers %pGg for quite some time so let's use it to get a human
readable list of invalid flags.

The original output would be
  [  429.191962] gfp: 2

after the change
  [  429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM)

Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Thomas Garnier 210e7a43fa mm: SLUB freelist randomization
Implements freelist randomization for the SLUB allocator.  It was
previous implemented for the SLAB allocator.  Both use the same
configuration option (CONFIG_SLAB_FREELIST_RANDOM).

The list is randomized during initialization of a new set of pages.  The
order on different freelist sizes is pre-computed at boot for
performance.  Each kmem_cache has its own randomized freelist.

This security feature reduces the predictability of the kernel SLUB
allocator against heap overflows rendering attacks much less stable.

For example these attacks exploit the predictability of the heap:
 - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
 - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

Performance results:

slab_test impact is between 3% to 4% on average for 100000 attempts
without smp.  It is a very focused testing, kernbench show the overall
impact on the system is way lower.

Before:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
  100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
  100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
  100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
  100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
  100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
  100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
  100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
  100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
  100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 70 cycles
  100000 times kmalloc(16)/kfree -> 70 cycles
  100000 times kmalloc(32)/kfree -> 70 cycles
  100000 times kmalloc(64)/kfree -> 70 cycles
  100000 times kmalloc(128)/kfree -> 70 cycles
  100000 times kmalloc(256)/kfree -> 69 cycles
  100000 times kmalloc(512)/kfree -> 70 cycles
  100000 times kmalloc(1024)/kfree -> 73 cycles
  100000 times kmalloc(2048)/kfree -> 72 cycles
  100000 times kmalloc(4096)/kfree -> 71 cycles

After:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
  100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
  100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
  100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
  100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
  100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
  100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
  100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
  100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
  100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 66 cycles
  100000 times kmalloc(16)/kfree -> 66 cycles
  100000 times kmalloc(32)/kfree -> 66 cycles
  100000 times kmalloc(64)/kfree -> 66 cycles
  100000 times kmalloc(128)/kfree -> 65 cycles
  100000 times kmalloc(256)/kfree -> 67 cycles
  100000 times kmalloc(512)/kfree -> 67 cycles
  100000 times kmalloc(1024)/kfree -> 64 cycles
  100000 times kmalloc(2048)/kfree -> 67 cycles
  100000 times kmalloc(4096)/kfree -> 67 cycles

Kernbench, before:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 101.873 (1.16069)
  User Time 1045.22 (1.60447)
  System Time 88.969 (0.559195)
  Percent CPU 1112.9 (13.8279)
  Context Switches 189140 (2282.15)
  Sleeps 99008.6 (768.091)

After:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 102.47 (0.562732)
  User Time 1045.3 (1.34263)
  System Time 88.311 (0.342554)
  Percent CPU 1105.8 (6.49444)
  Context Switches 189081 (2355.78)
  Sleeps 99231.5 (800.358)

Link: http://lkml.kernel.org/r/1464295031-26375-3-git-send-email-thgarnie@google.com
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Thomas Garnier 7c00fce98c mm: reorganize SLAB freelist randomization
The kernel heap allocators are using a sequential freelist making their
allocation predictable.  This predictability makes kernel heap overflow
easier to exploit.  An attacker can careful prepare the kernel heap to
control the following chunk overflowed.

For example these attacks exploit the predictability of the heap:
 - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
 - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)

***Problems that needed solving:
 - Randomize the Freelist (singled linked) used in the SLUB allocator.
 - Ensure good performance to encourage usage.
 - Get best entropy in early boot stage.

***Parts:
 - 01/02 Reorganize the SLAB Freelist randomization to share elements
   with the SLUB implementation.
 - 02/02 The SLUB Freelist randomization implementation. Similar approach
   than the SLAB but tailored to the singled freelist used in SLUB.

***Performance data:

slab_test impact is between 3% to 4% on average for 100000 attempts
without smp.  It is a very focused testing, kernbench show the overall
impact on the system is way lower.

Before:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
  100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
  100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
  100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
  100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
  100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
  100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
  100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
  100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
  100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 70 cycles
  100000 times kmalloc(16)/kfree -> 70 cycles
  100000 times kmalloc(32)/kfree -> 70 cycles
  100000 times kmalloc(64)/kfree -> 70 cycles
  100000 times kmalloc(128)/kfree -> 70 cycles
  100000 times kmalloc(256)/kfree -> 69 cycles
  100000 times kmalloc(512)/kfree -> 70 cycles
  100000 times kmalloc(1024)/kfree -> 73 cycles
  100000 times kmalloc(2048)/kfree -> 72 cycles
  100000 times kmalloc(4096)/kfree -> 71 cycles

After:

  Single thread testing
  =====================
  1. Kmalloc: Repeatedly allocate then free test
  100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
  100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
  100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
  100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
  100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
  100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
  100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
  100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
  100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
  100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
  2. Kmalloc: alloc/free test
  100000 times kmalloc(8)/kfree -> 66 cycles
  100000 times kmalloc(16)/kfree -> 66 cycles
  100000 times kmalloc(32)/kfree -> 66 cycles
  100000 times kmalloc(64)/kfree -> 66 cycles
  100000 times kmalloc(128)/kfree -> 65 cycles
  100000 times kmalloc(256)/kfree -> 67 cycles
  100000 times kmalloc(512)/kfree -> 67 cycles
  100000 times kmalloc(1024)/kfree -> 64 cycles
  100000 times kmalloc(2048)/kfree -> 67 cycles
  100000 times kmalloc(4096)/kfree -> 67 cycles

Kernbench, before:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 101.873 (1.16069)
  User Time 1045.22 (1.60447)
  System Time 88.969 (0.559195)
  Percent CPU 1112.9 (13.8279)
  Context Switches 189140 (2282.15)
  Sleeps 99008.6 (768.091)

After:

  Average Optimal load -j 12 Run (std deviation):
  Elapsed Time 102.47 (0.562732)
  User Time 1045.3 (1.34263)
  System Time 88.311 (0.342554)
  Percent CPU 1105.8 (6.49444)
  Context Switches 189081 (2355.78)
  Sleeps 99231.5 (800.358)

This patch (of 2):

This commit reorganizes the previous SLAB freelist randomization to
prepare for the SLUB implementation.  It moves functions that will be
shared to slab_common.

The entropy functions are changed to align with the SLUB implementation,
now using get_random_(int|long) functions.  These functions were chosen
because they provide a bit more entropy early on boot and better
performance when specific arch instructions are not available.

[akpm@linux-foundation.org: fix build]
Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Dave Chinner 6c60d2b574 fs/fs-writeback.c: add a new writeback list for sync
wait_sb_inodes() currently does a walk of all inodes in the filesystem
to find dirty one to wait on during sync.  This is highly inefficient
and wastes a lot of CPU when there are lots of clean cached inodes that
we don't need to wait on.

To avoid this "all inode" walk, we need to track inodes that are
currently under writeback that we need to wait for.  We do this by
adding inodes to a writeback list on the sb when the mapping is first
tagged as having pages under writeback.  wait_sb_inodes() can then walk
this list of "inodes under IO" and wait specifically just for the inodes
that the current sync(2) needs to wait for.

Define a couple helpers to add/remove an inode from the writeback list
and call them when the overall mapping is tagged for or cleared from
writeback.  Update wait_sb_inodes() to walk only the inodes under
writeback due to the sync.

With this change, filesystem sync times are significantly reduced for
fs' with largely populated inode caches and otherwise no other work to
do.  For example, on a 16xcpu 2GHz x86-64 server, 10TB XFS filesystem
with a ~10m entry inode cache, sync times are reduced from ~7.3s to less
than 0.1s when the filesystem is fully clean.

Link: http://lkml.kernel.org/r/1466594593-6757-2-git-send-email-bfoster@redhat.com
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Holger Hoffstätte <holger.hoffstaette@applied-asynchrony.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-26 16:19:19 -07:00
Linus Torvalds d05d7f4079 Merge branch 'for-4.8/core' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:

   - the big change is the cleanup from Mike Christie, cleaning up our
     uses of command types and modified flags.  This is what will throw
     some merge conflicts

   - regression fix for the above for btrfs, from Vincent

   - following up to the above, better packing of struct request from
     Christoph

   - a 2038 fix for blktrace from Arnd

   - a few trivial/spelling fixes from Bart Van Assche

   - a front merge check fix from Damien, which could cause issues on
     SMR drives

   - Atari partition fix from Gabriel

   - convert cfq to highres timers, since jiffies isn't granular enough
     for some devices these days.  From Jan and Jeff

   - CFQ priority boost fix idle classes, from me

   - cleanup series from Ming, improving our bio/bvec iteration

   - a direct issue fix for blk-mq from Omar

   - fix for plug merging not involving the IO scheduler, like we do for
     other types of merges.  From Tahsin

   - expose DAX type internally and through sysfs.  From Toshi and Yigal

* 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
  block: Fix front merge check
  block: do not merge requests without consulting with io scheduler
  block: Fix spelling in a source code comment
  block: expose QUEUE_FLAG_DAX in sysfs
  block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
  Btrfs: fix comparison in __btrfs_map_block()
  block: atari: Return early for unsupported sector size
  Doc: block: Fix a typo in queue-sysfs.txt
  cfq-iosched: Charge at least 1 jiffie instead of 1 ns
  cfq-iosched: Fix regression in bonnie++ rewrite performance
  cfq-iosched: Convert slice_resid from u64 to s64
  block: Convert fifo_time from ulong to u64
  blktrace: avoid using timespec
  block/blk-cgroup.c: Declare local symbols static
  block/bio-integrity.c: Add #include "blk.h"
  block/partition-generic.c: Remove a set-but-not-used variable
  block: bio: kill BIO_MAX_SIZE
  cfq-iosched: temporarily boost queue priority for idle classes
  block: drbd: avoid to use BIO_MAX_SIZE
  block: bio: remove BIO_MAX_SECTORS
  ...
2016-07-26 15:03:07 -07:00
Kees Cook ed18adc1cd mm: SLUB hardened usercopy support
Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
SLUB allocator to catch any copies that may span objects. Includes a
redzone handling fix discovered by Michael Ellerman.

Based on code from PaX and grsecurity.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
Reviwed-by: Laura Abbott <labbott@redhat.com>
2016-07-26 14:43:54 -07:00
Kees Cook 04385fc5e8 mm: SLAB hardened usercopy support
Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
SLAB allocator to catch any copies that may span objects.

Based on code from PaX and grsecurity.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
2016-07-26 14:41:53 -07:00
Kees Cook f5509cc18d mm: Hardened usercopy
This is the start of porting PAX_USERCOPY into the mainline kernel. This
is the first set of features, controlled by CONFIG_HARDENED_USERCOPY. The
work is based on code by PaX Team and Brad Spengler, and an earlier port
from Casey Schaufler. Additional non-slab page tests are from Rik van Riel.

This patch contains the logic for validating several conditions when
performing copy_to_user() and copy_from_user() on the kernel object
being copied to/from:
- address range doesn't wrap around
- address range isn't NULL or zero-allocated (with a non-zero copy size)
- if on the slab allocator:
  - object size must be less than or equal to copy size (when check is
    implemented in the allocator, which appear in subsequent patches)
- otherwise, object must not span page allocations (excepting Reserved
  and CMA ranges)
- if on the stack
  - object must not extend before/after the current process stack
  - object must be contained by a valid stack frame (when there is
    arch/build support for identifying stack frames)
- object must not overlap with kernel text

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>
2016-07-26 14:41:47 -07:00
Linus Torvalds 015cd867e5 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 updates from Martin Schwidefsky:
 "There are a couple of new things for s390 with this merge request:

   - a new scheduling domain "drawer" is added to reflect the unusual
     topology found on z13 machines.  Performance tests showed up to 8
     percent gain with the additional domain.

   - the new crc-32 checksum crypto module uses the vector-galois-field
     multiply and sum SIMD instruction to speed up crc-32 and crc-32c.

   - proper __ro_after_init support, this requires RO_AFTER_INIT_DATA in
     the generic vmlinux.lds linker script definitions.

   - kcov instrumentation support.  A prerequisite for that is the
     inline assembly basic block cleanup, which is the reason for the
     net/iucv/iucv.c change.

   - support for 2GB pages is added to the hugetlbfs backend.

  Then there are two removals:

   - the oprofile hardware sampling support is dead code and is removed.
     The oprofile user space uses the perf interface nowadays.

   - the ETR clock synchronization is removed, this has been superseeded
     be the STP clock synchronization.  And it always has been
     "interesting" code..

  And the usual bug fixes and cleanups"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (82 commits)
  s390/pci: Delete an unnecessary check before the function call "pci_dev_put"
  s390/smp: clean up a condition
  s390/cio/chp : Remove deprecated create_singlethread_workqueue
  s390/chsc: improve channel path descriptor determination
  s390/chsc: sanitize fmt check for chp_desc determination
  s390/cio: make fmt1 channel path descriptor optional
  s390/chsc: fix ioctl CHSC_INFO_CU command
  s390/cio/device_ops: fix kernel doc
  s390/cio: allow to reset channel measurement block
  s390/console: Make preferred console handling more consistent
  s390/mm: fix gmap tlb flush issues
  s390/mm: add support for 2GB hugepages
  s390: have unique symbol for __switch_to address
  s390/cpuinfo: show maximum thread id
  s390/ptrace: clarify bits in the per_struct
  s390: stack address vs thread_info
  s390: remove pointless load within __switch_to
  s390: enable kcov support
  s390/cpumf: use basic block for ecctr inline assembly
  s390/hypfs: use basic block for diag inline assembly
  ...
2016-07-26 12:22:51 -07:00
Linus Torvalds 0f657262d5 Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Ingo Molnar:
 "Various x86 low level modifications:

   - preparatory work to support virtually mapped kernel stacks (Andy
     Lutomirski)

   - support for 64-bit __get_user() on 32-bit kernels (Benjamin
     LaHaise)

   - (involved) workaround for Knights Landing CPU erratum (Dave Hansen)

   - MPX enhancements (Dave Hansen)

   - mremap() extension to allow remapping of the special VDSO vma, for
     purposes of user level context save/restore (Dmitry Safonov)

   - hweight and entry code cleanups (Borislav Petkov)

   - bitops code generation optimizations and cleanups with modern GCC
     (H. Peter Anvin)

   - syscall entry code optimizations (Paolo Bonzini)"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  x86/mm/cpa: Add missing comment in populate_pdg()
  x86/mm/cpa: Fix populate_pgd(): Stop trying to deallocate failed PUDs
  x86/syscalls: Add compat_sys_preadv64v2/compat_sys_pwritev64v2
  x86/smp: Remove unnecessary initialization of thread_info::cpu
  x86/smp: Remove stack_smp_processor_id()
  x86/uaccess: Move thread_info::addr_limit to thread_struct
  x86/dumpstack: Rename thread_struct::sig_on_uaccess_error to sig_on_uaccess_err
  x86/uaccess: Move thread_info::uaccess_err and thread_info::sig_on_uaccess_err to thread_struct
  x86/dumpstack: When OOPSing, rewind the stack before do_exit()
  x86/mm/64: In vmalloc_fault(), use CR3 instead of current->active_mm
  x86/dumpstack/64: Handle faults when printing the "Stack: " part of an OOPS
  x86/dumpstack: Try harder to get a call trace on stack overflow
  x86/mm: Remove kernel_unmap_pages_in_pgd() and efi_cleanup_page_tables()
  x86/mm/cpa: In populate_pgd(), don't set the PGD entry until it's populated
  x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()
  x86/mm: Use pte_none() to test for empty PTE
  x86/mm: Disallow running with 32-bit PTEs to work around erratum
  x86/mm: Ignore A/D bits in pte/pmd/pud_none()
  x86/mm: Move swap offset/type up in PTE to work around erratum
  x86/entry: Inline enter_from_user_mode()
  ...
2016-07-25 15:34:18 -07:00
Johannes Weiner 73f576c04b mm: memcontrol: fix cgroup creation failure after many small jobs
The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears.  At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild.  Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.

Although pinning CSSs past cgroup removal is common, there are only two
instances that actually need an ID after a cgroup is deleted: cache
shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later.  They pose no hurdle.

Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages.  And those
references are under the user's control, so they are manageable.

This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that.  This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.

This script demonstrates the problem by faulting one cache page in a new
cgroup and deleting it again:

  set -e
  mkdir -p pages
  for x in `seq 128000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /cgroup/foo
    echo $$ >/cgroup/foo/cgroup.procs
    echo trex >pages/$x
    echo $$ >/cgroup/cgroup.procs
    rmdir /cgroup/foo
  done

When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:

  [root@ham ~]# ./cssidstress.sh
  [...]
  65000
  mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.

[hannes@cmpxchg.org: init the IDR]
  Link: http://lkml.kernel.org/r/20160621154601.GA22431@cmpxchg.org
Fixes: b2052564e6 ("mm: memcontrol: continue cache reclaim from offlined groups")
Link: http://lkml.kernel.org/r/20160617162516.GD19084@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: John Garcia <john.garcia@mesosphere.io>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Nikolay Borisov <kernel@kyup.com>
Cc: <stable@vger.kernel.org>	[3.19+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-23 10:25:54 +09:00
Anton Blanchard d3d36c4b5c mm: workingset: printk missing log level, use pr_info()
Commit 612e44939c ("mm: workingset: eviction buckets for bigmem/lowbit
machines") added a printk without a log level.  Quieten it by using
pr_info().

Link: http://lkml.kernel.org/r/1466982072-29836-2-git-send-email-anton@ozlabs.org
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Hugh Dickins 5a49973d71 mm: thp: refix false positive BUG in page_move_anon_rmap()
The VM_BUG_ON_PAGE in page_move_anon_rmap() is more trouble than it's
worth: the syzkaller fuzzer hit it again.  It's still wrong for some THP
cases, because linear_page_index() was never intended to apply to
addresses before the start of a vma.

That's easily fixed with a signed long cast inside linear_page_index();
and Dmitry has tested such a patch, to verify the false positive.  But
why extend linear_page_index() just for this case? when the avoidance in
page_move_anon_rmap() has already grown ugly, and there's no reason for
the check at all (nothing else there is using address or index).

Remove address arg from page_move_anon_rmap(), remove VM_BUG_ON_PAGE,
remove CONFIG_DEBUG_VM PageTransHuge adjustment.

And one more thing: should the compound_head(page) be done inside or
outside page_move_anon_rmap()? It's usually pushed down to the lowest
level nowadays (and mm/memory.c shows no other explicit use of it), so I
think it's better done in page_move_anon_rmap() than by caller.

Fixes: 0798d3c022 ("mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()")
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1607120444540.12528@eggly.anvils
Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Naoya Horiguchi 55bda43bb2 mm: rmap: call page_check_address() with sync enabled to avoid racy check
The previous patch addresses the race between split_huge_pmd_address()
and someone changing the pmd.  The fix is only for splitting of normal
thp (i.e.  pmd-mapped thp,) and for splitting of pte-mapped thp there
still is the similar race.

For splitting pte-mapped thp, the pte's conversion is done by
try_to_unmap_one(TTU_MIGRATION).  This function checks
page_check_address() to get the target pte, but it can return NULL under
some race, leading to VM_BUG_ON() in freeze_page().  Fortunately,
page_check_address() already has an argument to decide whether we do a
quick/racy check or not, so let's flip it when called from
freeze_page().

Link: http://lkml.kernel.org/r/1466990929-7452-2-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Naoya Horiguchi 33f4751e99 mm: thp: move pmd check inside ptl for freeze_page()
I found a race condition triggering VM_BUG_ON() in freeze_page(), when
running a testcase with 3 processes:
  - process 1: keep writing thp,
  - process 2: keep clearing soft-dirty bits from virtual address of process 1
  - process 3: call migratepages for process 1,

The kernel message is like this:

  kernel BUG at /src/linux-dev/mm/huge_memory.c:3096!
  invalid opcode: 0000 [#1] SMP
  Modules linked in: cfg80211 rfkill crc32c_intel ppdev serio_raw pcspkr virtio_balloon virtio_console parport_pc parport pvpanic acpi_cpufreq tpm_tis tpm i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy virtio_pci virtio_ring virtio
  CPU: 0 PID: 28863 Comm: migratepages Not tainted 4.6.0-v4.6-160602-0827-+ #2
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  task: ffff880037320000 ti: ffff88007cdd0000 task.ti: ffff88007cdd0000
  RIP: 0010:[<ffffffff811f8e06>]  [<ffffffff811f8e06>] split_huge_page_to_list+0x496/0x590
  RSP: 0018:ffff88007cdd3b70  EFLAGS: 00010202
  RAX: 0000000000000001 RBX: ffff88007c7b88c0 RCX: 0000000000000000
  RDX: 0000000000000000 RSI: 0000000700000200 RDI: ffffea0003188000
  RBP: ffff88007cdd3bb8 R08: 0000000000000001 R09: 00003ffffffff000
  R10: ffff880000000000 R11: ffffc000001fffff R12: ffffea0003188000
  R13: ffffea0003188000 R14: 0000000000000000 R15: 0400000000000080
  FS:  00007f8ec241d740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000             CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f8ec1f3ed20 CR3: 000000003707b000 CR4: 00000000000006f0
  Call Trace:
    ? list_del+0xd/0x30
    queue_pages_pte_range+0x4d1/0x590
    __walk_page_range+0x204/0x4e0
    walk_page_range+0x71/0xf0
    queue_pages_range+0x75/0x90
    ? queue_pages_hugetlb+0x190/0x190
    ? new_node_page+0xc0/0xc0
    ? change_prot_numa+0x40/0x40
    migrate_to_node+0x71/0xd0
    do_migrate_pages+0x1c3/0x210
    SyS_migrate_pages+0x261/0x290
    entry_SYSCALL_64_fastpath+0x1a/0xa4
  Code: e8 b0 87 fb ff 0f 0b 48 c7 c6 30 32 9f 81 e8 a2 87 fb ff 0f 0b 48 c7 c6 b8 46 9f 81 e8 94 87 fb ff 0f 0b 85 c0 0f 84 3e fd ff ff <0f> 0b 85 c0 0f 85 a6 00 00 00 48 8b 75 c0 4c 89 f7 41 be f0 ff
  RIP   split_huge_page_to_list+0x496/0x590

I'm not sure of the full scenario of the reproduction, but my debug
showed that split_huge_pmd_address(freeze=true) returned without running
main code of pmd splitting because pmd_present(*pmd) in precheck somehow
returned 0.  If this happens, the subsequent try_to_unmap() fails and
returns non-zero (because page_mapcount() still > 0), and finally
VM_BUG_ON() fires.  This patch tries to fix it by prechecking pmd state
inside ptl.

Link: http://lkml.kernel.org/r/1466990929-7452-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Mel Gorman ef70b6f41c mm, meminit: ensure node is online before checking whether pages are uninitialised
early_page_uninitialised looks up an arbitrary PFN.  While a machine
without node 0 will boot with "mm, page_alloc: Always return a valid
node from early_pfn_to_nid", it works because it assumes that nodes are
always in PFN order.  This is not guaranteed so this patch adds
robustness by always checking if the node being checked is online.

Link: http://lkml.kernel.org/r/1468008031-3848-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Mel Gorman e4568d3803 mm, meminit: always return a valid node from early_pfn_to_nid
early_pfn_to_nid can return node 0 if a PFN is invalid on machines that
has no node 0.  A machine with only node 1 was observed to crash with
the following message:

   BUG: unable to handle kernel paging request at 000000000002a3c8
   PGD 0
   Modules linked in:
   Hardware name: Supermicro H8DSP-8/H8DSP-8, BIOS 080011  06/30/2006
   task: ffffffff81c0d500 ti: ffffffff81c00000 task.ti: ffffffff81c00000
   RIP: reserve_bootmem_region+0x6a/0xef
   CR2: 000000000002a3c8 CR3: 0000000001c06000 CR4: 00000000000006b0
   Call Trace:
      free_all_bootmem+0x4b/0x12a
      mem_init+0x70/0xa3
      start_kernel+0x25b/0x49b

The problem is that early_page_uninitialised uses the early_pfn_to_nid
helper which returns node 0 for invalid PFNs.  No caller of
early_pfn_to_nid cares except early_page_uninitialised.  This patch has
early_pfn_to_nid always return a valid node.

Link: http://lkml.kernel.org/r/1468008031-3848-3-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org>	[4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Joonsoo Kim 0ab686d8c8 kasan/quarantine: fix bugs on qlist_move_cache()
There are two bugs on qlist_move_cache().  One is that qlist's tail
isn't set properly.  curr->next can be NULL since it is singly linked
list and NULL value on tail is invalid if there is one item on qlist.
Another one is that if cache is matched, qlist_put() is called and it
will set curr->next to NULL.  It would cause to stop the loop
prematurely.

These problems come from complicated implementation so I'd like to
re-implement it completely.  Implementation in this patch is really
simple.  Iterate all qlist_nodes and put them to appropriate list.

Unfortunately, I got this bug sometime ago and lose oops message.  But,
the bug looks trivial and no need to attach oops.

Fixes: 55834c5909 ("mm: kasan: initial memory quarantine implementation")
Link: http://lkml.kernel.org/r/1467766348-22419-1-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Kuthonuzo Luruo <poll.stdin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Huang Ying 9818b8cde6 madvise_free, thp: fix madvise_free_huge_pmd return value after splitting
madvise_free_huge_pmd should return 0 if the fallback PTE operations are
required.  In madvise_free_huge_pmd, if part pages of THP are discarded,
the THP will be split and fallback PTE operations should be used if
splitting succeeds.  But the original code will make fallback PTE
operations skipped, after splitting succeeds.  Fix that via make
madvise_free_huge_pmd return 0 after splitting successfully, so that the
fallback PTE operations will be done.

Link: http://lkml.kernel.org/r/1467135452-16688-1-git-send-email-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
David Rientjes a46cbf3bc5 mm, compaction: prevent VM_BUG_ON when terminating freeing scanner
It's possible to isolate some freepages in a pageblock and then fail
split_free_page() due to the low watermark check.  In this case, we hit
VM_BUG_ON() because the freeing scanner terminated early without a
contended lock or enough freepages.

This should never have been a VM_BUG_ON() since it's not a fatal
condition.  It should have been a VM_WARN_ON() at best, or even handled
gracefully.

Regardless, we need to terminate anytime the full pageblock scan was not
done.  The logic belongs in isolate_freepages_block(), so handle its
state gracefully by terminating the pageblock loop and making a note to
restart at the same pageblock next time since it was not possible to
complete the scan this time.

[rientjes@google.com: don't rescan pages in a pageblock]
  Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1607111244150.83138@chino.kir.corp.google.com
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606291436300.145590@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Reported-by: Minchan Kim <minchan@kernel.org>
Tested-by: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-15 14:54:27 +09:00
Hugh Dickins 7f55656703 tmpfs: fix regression hang in fallocate undo
The well-spotted fallocate undo fix is good in most cases, but not when
fallocate failed on the very first page.  index 0 then passes lend -1
to shmem_undo_range(), and that has two bad effects: (a) that it will
undo every fallocation throughout the file, unrestricted by the current
range; but more importantly (b) it can cause the undo to hang, because
lend -1 is treated as truncation, which makes it keep on retrying until
every page has gone, but those already fully instantiated will never go
away.  Big thank you to xfstests generic/269 which demonstrates this.

Fixes: b9b4bb26af ("tmpfs: don't undo fallocate past its last page")
Cc: stable@vger.kernel.org
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-07-10 20:08:44 -07:00
Dmitry Safonov b059a453b1 x86/vdso: Add mremap hook to vm_special_mapping
Add possibility for 32-bit user-space applications to move
the vDSO mapping.

Previously, when a user-space app called mremap() for the vDSO
address, in the syscall return path it would land on the previous
address of the vDSOpage, resulting in segmentation violation.

Now it lands fine and returns to userspace with a remapped vDSO.

This will also fix the context.vdso pointer for 64-bit, which does
not affect the user of vDSO after mremap() currently, but this
may change in the future.

As suggested by Andy, return -EINVAL for mremap() that would
split the vDSO image: that operation cannot possibly result in
a working system so reject it.

Renamed and moved the text_mapping structure declaration inside
map_vdso(), as it used only there and now it complements the
vvar_mapping variable.

There is still a problem for remapping the vDSO in glibc
applications: the linker relocates addresses for syscalls
on the vDSO page, so you need to relink with the new
addresses.

Without that the next syscall through glibc may fail:

  Program received signal SIGSEGV, Segmentation fault.
  #0  0xf7fd9b80 in __kernel_vsyscall ()
  #1  0xf7ec8238 in _exit () from /usr/lib32/libc.so.6

Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: 0x7f454c46@gmail.com
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20160628113539.13606-2-dsafonov@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-08 14:17:51 +02:00
Gerald Schaefer d08de8e2d8 s390/mm: add support for 2GB hugepages
This adds support for 2GB hugetlbfs pages on s390.

Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2016-07-06 08:46:43 +02:00
Paolo Bonzini add6a0cd1c KVM: MMU: try to fix up page faults before giving up
The vGPU folks would like to trap the first access to a BAR by setting
vm_ops on the VMAs produced by mmap-ing a VFIO device.  The fault handler
then can use remap_pfn_range to place some non-reserved pages in the VMA.

This kind of VM_PFNMAP mapping is not handled by KVM, but follow_pfn
and fixup_user_fault together help supporting it.  The patch also supports
VM_MIXEDMAP vmas where the pfns are not reserved and thus subject to
reference counting.

Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Neo Jia <cjia@nvidia.com>
Reported-by: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-05 14:41:26 +02:00
Al Viro b223f4e215 Merge branch 'd_real' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs into work.misc 2016-06-30 23:34:49 -04:00
Sudip Mukherjee 8285027fc4 mm/page_owner: avoid null pointer dereference
We have dereferenced page_ext before checking it.  Lets check it first
and then used it.

Fixes: f86e427197 ("mm: check the return value of lookup_page_ext for all call sites")
Link: http://lkml.kernel.org/r/1465249059-7883-1-git-send-email-sudipm.mukherjee@gmail.com
Signed-off-by: Sudip Mukherjee <sudip.mukherjee@codethink.co.uk>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
David Rientjes a4f04f2c69 mm, compaction: abort free scanner if split fails
If the memory compaction free scanner cannot successfully split a free
page (only possible due to per-zone low watermark), terminate the free
scanner rather than continuing to scan memory needlessly.  If the
watermark is insufficient for a free page of order <= cc->order, then
terminate the scanner since all future splits will also likely fail.

This prevents the compaction freeing scanner from scanning all memory on
very large zones (very noticeable for zones > 128GB, for instance) when
all splits will likely fail while holding zone->lock.

compaction_alloc() iterating a 128GB zone has been benchmarked to take
over 400ms on some systems whereas any free page isolated and ready to
be split ends up failing in split_free_page() because of the low
watermark check and thus the iteration continues.

The next time compaction occurs, the freeing scanner will likely start
at the end of the zone again since no success was made previously and we
get the same lengthy iteration until the zone is brought above the low
watermark.  All thp page faults can take >400ms in such a state without
this fix.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606211820350.97086@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Dmitry Vyukov 5c335fe020 mm: prevent KASAN false positives in kmemleak
When kmemleak dumps contents of leaked objects it reads whole objects
regardless of user-requested size.  This upsets KASAN.  Disable KASAN
checks around object dump.

Link: http://lkml.kernel.org/r/1466617631-68387-1-git-send-email-dvyukov@google.com
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Gerald Schaefer c8cc708a34 mm/hugetlb: clear compound_mapcount when freeing gigantic pages
While working on s390 support for gigantic hugepages I ran into the
following "Bad page state" warning when freeing gigantic pages:

  BUG: Bad page state in process bash  pfn:580001
  page:000003d116000040 count:0 mapcount:0 mapping:ffffffff00000000 index:0x0
  flags: 0x7fffc0000000000()
  page dumped because: non-NULL mapping

This is because page->compound_mapcount, which is part of a union with
page->mapping, is initialized with -1 in prep_compound_gigantic_page(),
and not cleared again during destroy_compound_gigantic_page().  Fix this
by clearing the compound_mapcount in destroy_compound_gigantic_page()
before clearing compound_head.

Interestingly enough, the warning will not show up on x86_64, although
this should not be architecture specific.  Apparently there is an
endianness issue, combined with the fact that the union contains both a
64 bit ->mapping pointer and a 32 bit atomic_t ->compound_mapcount as
members.  The resulting bogus page->mapping on x86_64 therefore contains
00000000ffffffff instead of ffffffff00000000 on s390, which will falsely
trigger the PageAnon() check in free_pages_prepare() because
page->mapping & PAGE_MAPPING_ANON is true on little-endian architectures
like x86_64 in this case (the page is not compound anymore,
->compound_head was already cleared before).  As a result, page->mapping
will be cleared before doing the checks in free_pages_check().

Not sure if the bogus "PageAnon() returning true" on x86_64 for the
first tail page of a gigantic page (at this stage) has other theoretical
implications, but they would also be fixed with this patch.

Link: http://lkml.kernel.org/r/1466612719-5642-1-git-send-email-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Lukasz Odzioba 8f182270df mm/swap.c: flush lru pvecs on compound page arrival
Currently we can have compound pages held on per cpu pagevecs, which
leads to a lot of memory unavailable for reclaim when needed.  In the
systems with hundreads of processors it can be GBs of memory.

On of the way of reproducing the problem is to not call munmap
explicitly on all mapped regions (i.e.  after receiving SIGTERM).  After
that some pages (with THP enabled also huge pages) may end up on
lru_add_pvec, example below.

  void main() {
  #pragma omp parallel
  {
	size_t size = 55 * 1000 * 1000; // smaller than  MEM/CPUS
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		MAP_PRIVATE | MAP_ANONYMOUS , -1, 0);
	if (p != MAP_FAILED)
		memset(p, 0, size);
	//munmap(p, size); // uncomment to make the problem go away
  }
  }

When we run it with THP enabled it will leave significant amount of
memory on lru_add_pvec.  This memory will be not reclaimed if we hit
OOM, so when we run above program in a loop:

	for i in `seq 100`; do ./a.out; done

many processes (95% in my case) will be killed by OOM.

The primary point of the LRU add cache is to save the zone lru_lock
contention with a hope that more pages will belong to the same zone and
so their addition can be batched.  The huge page is already a form of
batched addition (it will add 512 worth of memory in one go) so skipping
the batching seems like a safer option when compared to a potential
excess in the caching which can be quite large and much harder to fix
because lru_add_drain_all is way to expensive and it is not really clear
what would be a good moment to call it.

Similarly we can reproduce the problem on lru_deactivate_pvec by adding:
madvise(p, size, MADV_FREE); after memset.

This patch flushes lru pvecs on compound page arrival making the problem
less severe - after applying it kill rate of above example drops to 0%,
due to reducing maximum amount of memory held on pvec from 28MB (with
THP) to 56kB per CPU.

Suggested-by: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/1466180198-18854-1-git-send-email-lukasz.odzioba@intel.com
Signed-off-by: Lukasz Odzioba <lukasz.odzioba@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Ming Li <mingli199x@qq.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Tejun Heo ea3a964586 memcg: css_alloc should return an ERR_PTR value on error
mem_cgroup_css_alloc() was returning NULL on failure while cgroup core
expected it to return an ERR_PTR value leading to the following NULL
deref after a css allocation failure.  Fix it by return
ERR_PTR(-ENOMEM) instead.  I'll also update cgroup core so that it
can handle NULL returns.

  mkdir: page allocation failure: order:6, mode:0x240c0c0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO)
  CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
  ...
  Call Trace:
    dump_stack+0x68/0xa1
    warn_alloc_failed+0xd6/0x130
    __alloc_pages_nodemask+0x4c6/0xf20
    alloc_pages_current+0x66/0xe0
    alloc_kmem_pages+0x14/0x80
    kmalloc_order_trace+0x2a/0x1a0
    __kmalloc+0x291/0x310
    memcg_update_all_caches+0x6c/0x130
    mem_cgroup_css_alloc+0x590/0x610
    cgroup_apply_control_enable+0x18b/0x370
    cgroup_mkdir+0x1de/0x2e0
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xb9/0x150
    SyS_mkdir+0x66/0xd0
    do_syscall_64+0x53/0x120
    entry_SYSCALL64_slow_path+0x25/0x25
  ...
  BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
  IP:  init_and_link_css+0x37/0x220
  PGD 34b1e067 PUD 3a109067 PMD 0
  Oops: 0002 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 8738 Comm: mkdir Not tainted 4.7.0-rc3+ #123
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.2-20160422_131301-anatol 04/01/2014
  task: ffff88007cbc5200 ti: ffff8800666d4000 task.ti: ffff8800666d4000
  RIP: 0010:[<ffffffff810f2ca7>]  [<ffffffff810f2ca7>] init_and_link_css+0x37/0x220
  RSP: 0018:ffff8800666d7d90  EFLAGS: 00010246
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: ffffffff810f2499 RSI: 0000000000000000 RDI: 0000000000000008
  RBP: ffff8800666d7db8 R08: 0000000000000003 R09: 0000000000000000
  R10: 0000000000000001 R11: 0000000000000000 R12: ffff88005a5fb400
  R13: ffffffff81f0f8a0 R14: ffff88005a5fb400 R15: 0000000000000010
  FS:  00007fc944689700(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f3aed0d2b80 CR3: 000000003a1e8000 CR4: 00000000000006f0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
    cgroup_apply_control_enable+0x1ac/0x370
    cgroup_mkdir+0x1de/0x2e0
    kernfs_iop_mkdir+0x55/0x80
    vfs_mkdir+0xb9/0x150
    SyS_mkdir+0x66/0xd0
    do_syscall_64+0x53/0x120
    entry_SYSCALL64_slow_path+0x25/0x25
  Code: 89 f5 48 89 fb 49 89 d4 48 83 ec 08 8b 05 72 3b d8 00 85 c0 0f 85 60 01 00 00 4c 89 e7 e8 72 f7 ff ff 48 8d 7b 08 48 89 d9 31 c0 <48> c7 83 d0 00 00 00 00 00 00 00 48 83 e7 f8 48 29 f9 81 c1 d8
  RIP   init_and_link_css+0x37/0x220
   RSP <ffff8800666d7d90>
  CR2: 00000000000000d0
  ---[ end trace a2d8836ae1e852d1 ]---

Link: http://lkml.kernel.org/r/20160621165740.GJ3262@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Tejun Heo d93c4130a7 memcg: mem_cgroup_migrate() may be called with irq disabled
mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
with irq disabled from migrate_page_copy().  This ends up enabling irq
while holding a irq context lock triggering the following lockdep
warning.  Fix it by using irq_save/restore instead.

  =================================
  [ INFO: inconsistent lock state ]
  4.7.0-rc1+ #52 Tainted: G        W
  ---------------------------------
  inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
  kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
   (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<000000000038fd96>] aio_migratepage+0x156/0x1e8
  {IN-SOFTIRQ-W} state was registered at:
     __lock_acquire+0x5b6/0x1930
     lock_acquire+0xee/0x270
     _raw_spin_lock_irqsave+0x66/0xb0
     aio_complete+0x98/0x328
     dio_complete+0xe4/0x1e0
     blk_update_request+0xd4/0x450
     scsi_end_request+0x48/0x1c8
     scsi_io_completion+0x272/0x698
     blk_done_softirq+0xca/0xe8
     __do_softirq+0xc8/0x518
     irq_exit+0xee/0x110
     do_IRQ+0x6a/0x88
     io_int_handler+0x11a/0x25c
     __mutex_unlock_slowpath+0x144/0x1d8
     __mutex_unlock_slowpath+0x140/0x1d8
     kernfs_iop_permission+0x64/0x80
     __inode_permission+0x9e/0xf0
     link_path_walk+0x6e/0x510
     path_lookupat+0xc4/0x1a8
     filename_lookup+0x9c/0x160
     user_path_at_empty+0x5c/0x70
     SyS_readlinkat+0x68/0x140
     system_call+0xd6/0x270
  irq event stamp: 971410
  hardirqs last  enabled at (971409):  migrate_page_move_mapping+0x3ea/0x588
  hardirqs last disabled at (971410):  _raw_spin_lock_irqsave+0x3c/0xb0
  softirqs last  enabled at (970526):  __do_softirq+0x460/0x518
  softirqs last disabled at (970519):  irq_exit+0xee/0x110

  other info that might help us debug this:
   Possible unsafe locking scenario:

	 CPU0
	 ----
    lock(&(&ctx->completion_lock)->rlock);
    <Interrupt>
      lock(&(&ctx->completion_lock)->rlock);

    *** DEADLOCK ***

  3 locks held by kcompactd0/151:
   #0:  (&(&mapping->private_lock)->rlock){+.+.-.}, at:  aio_migratepage+0x42/0x1e8
   #1:  (&ctx->ring_lock){+.+.+.}, at:  aio_migratepage+0x5a/0x1e8
   #2:  (&(&ctx->completion_lock)->rlock){+.?.-.}, at:  aio_migratepage+0x156/0x1e8

  stack backtrace:
  CPU: 20 PID: 151 Comm: kcompactd0 Tainted: G        W       4.7.0-rc1+ #52
  Call Trace:
    show_trace+0xea/0xf0
    show_stack+0x72/0xf0
    dump_stack+0x9a/0xd8
    print_usage_bug.part.27+0x2d4/0x2e8
    mark_lock+0x17e/0x758
    mark_held_locks+0xa2/0xd0
    trace_hardirqs_on_caller+0x140/0x1c0
    mem_cgroup_migrate+0x266/0x370
    aio_migratepage+0x16a/0x1e8
    move_to_new_page+0xb0/0x260
    migrate_pages+0x8f4/0x9f0
    compact_zone+0x4dc/0xdc8
    kcompactd_do_work+0x1aa/0x358
    kcompactd+0xba/0x2c8
    kthread+0x10a/0x110
    kernel_thread_starter+0x6/0xc
    kernel_thread_starter+0x0/0xc
  INFO: lockdep is turned off.

Link: http://lkml.kernel.org/r/20160620184158.GO3262@mtj.duckdns.org
Link: http://lkml.kernel.org/g/5767CFE5.7080904@de.ibm.com
Fixes: 74485cf2bc ("mm: migrate: consolidate mem_cgroup_migrate() calls")
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Kirill A. Shutemov c17b1f4259 hugetlb: fix nr_pmds accounting with shared page tables
We account HugeTLB's shared page table to all processes who share it.
The accounting happens during huge_pmd_share().

If somebody populates pud entry under us, we should decrease pagetable's
refcount and decrease nr_pmds of the process.

By mistake, I increase nr_pmds again in this case.  :-/ It will lead to
"BUG: non-zero nr_pmds on freeing mm: 2" on process' exit.

Let's fix this by increasing nr_pmds only when we're sure that the page
table will be used.

Link: http://lkml.kernel.org/r/20160617122506.GC6534@node.shutemov.name
Fixes: dc6c9a35b6 ("mm: account pmd page tables to the process")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: zhongjiang <zhongjiang@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Kirill A. Shutemov 06d8fbc7cf Revert "mm: disable fault around on emulated access bit architecture"
This reverts commit d0834a6c2c.

After revert of 5c0a85fad9 ("mm: make faultaround produce old ptes")
faultaround doesn't have dependencies on hardware accessed bit, so let's
revert this one too.

Link: http://lkml.kernel.org/r/1465893750-44080-3-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Kirill A. Shutemov 315d09bf30 Revert "mm: make faultaround produce old ptes"
This reverts commit 5c0a85fad9.

The commit causes ~6% regression in unixbench.

Let's revert it for now and consider other solution for reclaim problem
later.

Link: http://lkml.kernel.org/r/1465893750-44080-2-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Mel Gorman e838a45f93 mm, sl[au]b: add __GFP_ATOMIC to the GFP reclaim mask
Commit d0164adc89 ("mm, page_alloc: distinguish between being unable
to sleep, unwilling to sleep and avoiding waking kswapd") modified
__GFP_WAIT to explicitly identify the difference between atomic callers
and those that were unwilling to sleep.  Later the definition was
removed entirely.

The GFP_RECLAIM_MASK is the set of flags that affect watermark checking
and reclaim behaviour but __GFP_ATOMIC was never added.  Without it,
atomic users of the slab allocator strip the __GFP_ATOMIC flag and
cannot access the page allocator atomic reserves.  This patch addresses
the problem.

The user-visible impact depends on the workload but potentially atomic
allocations unnecessarily fail without this path.

Link: http://lkml.kernel.org/r/20160610093832.GK2527@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Marcin Wojtas <mw@semihalf.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: <stable@vger.kernel.org>	[4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Andrey Ryabinin 9b75a867cc mm: mempool: kasan: don't poot mempool objects in quarantine
Currently we may put reserved by mempool elements into quarantine via
kasan_kfree().  This is totally wrong since quarantine may really free
these objects.  So when mempool will try to use such element,
use-after-free will happen.  Or mempool may decide that it no longer
need that element and double-free it.

So don't put object into quarantine in kasan_kfree(), just poison it.
Rename kasan_kfree() to kasan_poison_kfree() to respect that.

Also, we shouldn't use kasan_slab_alloc()/kasan_krealloc() in
kasan_unpoison_element() because those functions may update allocation
stacktrace.  This would be wrong for the most of the remove_element call
sites.

(The only call site where we may want to update alloc stacktrace is
 in mempool_alloc(). Kmemleak solves this by calling
 kmemleak_update_trace(), so we could make something like that too.
 But this is out of scope of this patch).

Fixes: 55834c5909 ("mm: kasan: initial memory quarantine implementation")
Link: http://lkml.kernel.org/r/575977C3.1010905@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reported-by: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Anthony Romano b9b4bb26af tmpfs: don't undo fallocate past its last page
When fallocate is interrupted it will undo a range that extends one byte
past its range of allocated pages.  This can corrupt an in-use page by
zeroing out its first byte.  Instead, undo using the inclusive byte
range.

Fixes: 1635f6a741 ("tmpfs: undo fallocation on failure")
Link: http://lkml.kernel.org/r/1462713387-16724-1-git-send-email-anthony.romano@coreos.com
Signed-off-by: Anthony Romano <anthony.romano@coreos.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Brandon Philips <brandon@ifup.co>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Tetsuo Handa 9df10fb7b8 oom_reaper: avoid pointless atomic_inc_not_zero usage.
Since commit 36324a990c ("oom: clear TIF_MEMDIE after oom_reaper
managed to unmap the address space") changed to use find_lock_task_mm()
for finding a mm_struct to reap, it is guaranteed that mm->mm_users > 0
because find_lock_task_mm() returns a task_struct with ->mm != NULL.
Therefore, we can safely use atomic_inc().

Link: http://lkml.kernel.org/r/1465024759-8074-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Tetsuo Handa 491a1c65ae mm,oom_reaper: don't call mmput_async() without atomic_inc_not_zero()
Commit e2fe14564d ("oom_reaper: close race with exiting task") reduced
frequency of needlessly selecting next OOM victim, but was calling
mmput_async() when atomic_inc_not_zero() failed.

Link: http://lkml.kernel.org/r/1464423365-5555-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-24 17:23:52 -07:00
Richard Weinberger 1118dce773 mm: Export migrate_page_move_mapping and migrate_page_copy
Export these symbols such that UBIFS can implement
->migratepage.

Cc: stable@vger.kernel.org
Signed-off-by: Richard Weinberger <richard@nod.at>
Acked-by: Christoph Hellwig <hch@lst.de>
2016-06-23 00:29:45 +02:00
Linus Torvalds db06d759d6 Merge branch 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
Pull percpu fixes from Tejun Heo:
 "While adding GFP_ATOMIC support to the percpu allocator, the
  synchronization for the fast-path which doesn't require external
  allocations was separated into pcpu_lock.

  Unfortunately, it incorrectly decoupled async paths and percpu
  chunks could get destroyed while still being operated on.  This
  contains two patches to fix the bug"

* 'for-4.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
  percpu: fix synchronization between synchronous map extension and chunk destruction
  percpu: fix synchronization between chunk->map_extend_work and chunk destruction
2016-06-13 19:54:46 -10:00
Linus Torvalds 8714f8f5fe Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A small collection of fixes for the current series.  This contains:

   - Two fixes for xen-blkfront, from Bob Liu.

   - A bug fix for NVMe, releasing only the specific resources we
     requested.

   - Fix for a debugfs flags entry for nbd, from Josef.

   - Plug fix from Omar, fixing up a case of code being switched between
     two functions.

   - A missing bio_put() for the new discard callers of
     submit_bio_wait(), fixing a regression causing a leak of the bio.
     From Shaun.

   - Improve dirty limit calculation precision in the writeback code,
     fixing a case where setting a limit lower than 1% of memory would
     end up being zero.  From Tejun"

* 'for-linus' of git://git.kernel.dk/linux-block:
  NVMe: Only release requested regions
  xen-blkfront: fix resume issues after a migration
  xen-blkfront: don't call talk_to_blkback when already connected to blkback
  nbd: pass the nbd pointer for flags debugfs
  block: missing bio_put following submit_bio_wait
  blk-mq: really fix plug list flushing for nomerge queues
  writeback: use higher precision calculation in domain_dirty_limits()
2016-06-11 18:42:59 -07:00
Oleg Drokin 18aba41cbf mm/fadvise.c: do not discard partial pages with POSIX_FADV_DONTNEED
I noticed that the logic in the fadvise64_64 syscall is incorrect for
partial pages.  While first page of the region is correctly skipped if
it is partial, the last page of the region is mistakenly discarded.
This leads to problems for applications that read data in
non-page-aligned chunks discarding already processed data between the
reads.

A somewhat misguided application that does something like write(XX bytes
(non-page-alligned)); drop the data it just wrote; repeat gets a
significant penalty in performance as a result.

Link: http://lkml.kernel.org/r/1464917140-1506698-1-git-send-email-green@linuxhacker.ru
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Wang Sheng-Hui f3a932baa7 mm: introduce dedicated WQ_MEM_RECLAIM workqueue to do lru_add_drain_all
This patch is based on https://patchwork.ozlabs.org/patch/574623/.

Tejun submitted commit 23d11a58a9 ("workqueue: skip flush dependency
checks for legacy workqueues") for the legacy create*_workqueue()
interface.

But some workq created by alloc_workqueue still reports warning on
memory reclaim, e.g nvme_workq with flag WQ_MEM_RECLAIM set:

    workqueue: WQ_MEM_RECLAIM nvme:nvme_reset_work is flushing !WQ_MEM_RECLAIM events:lru_add_drain_per_cpu
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 6 at SoC/linux/kernel/workqueue.c:2448 check_flush_dependency+0xb4/0x10c
    ...
    check_flush_dependency+0xb4/0x10c
    flush_work+0x54/0x140
    lru_add_drain_all+0x138/0x188
    migrate_prep+0xc/0x18
    alloc_contig_range+0xf4/0x350
    cma_alloc+0xec/0x1e4
    dma_alloc_from_contiguous+0x38/0x40
    __dma_alloc+0x74/0x25c
    nvme_alloc_queue+0xcc/0x36c
    nvme_reset_work+0x5c4/0xda8
    process_one_work+0x128/0x2ec
    worker_thread+0x58/0x434
    kthread+0xd4/0xe8
    ret_from_fork+0x10/0x50

That's because lru_add_drain_all() will schedule the drain work on
system_wq, whose flag is set to 0, !WQ_MEM_RECLAIM.

Introduce a dedicated WQ_MEM_RECLAIM workqueue to do
lru_add_drain_all(), aiding in getting memory freed.

Link: http://lkml.kernel.org/r/1464917521-9775-1-git-send-email-shhuiw@foxmail.com
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Gerald Schaefer 770a537022 mm: thp: broken page count after commit aa88b68c3b
Christian Borntraeger reported a kernel panic after corrupt page counts,
and it turned out to be a regression introduced with commit aa88b68c3b
("thp: keep huge zero page pinned until tlb flush"), at least on s390.

put_huge_zero_page() was moved over from zap_huge_pmd() to
release_pages(), and it was replaced by tlb_remove_page().  However,
release_pages() might not always be triggered by (the arch-specific)
tlb_remove_page().

On s390 we call free_page_and_swap_cache() from tlb_remove_page(), and
not tlb_flush_mmu() -> free_pages_and_swap_cache() like the generic
version, because we don't use the MMU-gather logic.  Although both
functions have very similar names, they are doing very unsimilar things,
in particular free_page_xxx is just doing a put_page(), while
free_pages_xxx calls release_pages().

This of course results in very harmful put_page()s on the huge zero
page, on architectures where tlb_remove_page() is implemented in this
way.  It seems to affect only s390 and sh, but sh doesn't have THP
support, so the problem (currently) probably only exists on s390.

The following quick hack fixed the issue:

Link: http://lkml.kernel.org/r/20160602172141.75c006a9@thinkpad
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@vger.kernel.org>	[4.6.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Andrew Morton d0db7afa1b revert "mm: memcontrol: fix possible css ref leak on oom"
Revert commit 1383399d7b ("mm: memcontrol: fix possible css ref leak
on oom").  Johannes points out "There is a task_in_memcg_oom() check
before calling mem_cgroup_oom()".

Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Shuah Khan 91a4c27214 kasan: change memory hot-add error messages to info messages
Change the following memory hot-add error messages to info messages.
There is no need for these to be errors.

   kasan: WARNING: KASAN doesn't support memory hot-add
   kasan: Memory hot-add will be disabled

Link: http://lkml.kernel.org/r/1464794430-5486-1-git-send-email-shuahkh@osg.samsung.com
Signed-off-by: Shuah Khan <shuahkh@osg.samsung.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Mike Kravetz 67961f9db8 mm/hugetlb: fix huge page reserve accounting for private mappings
When creating a private mapping of a hugetlbfs file, it is possible to
unmap pages via ftruncate or fallocate hole punch.  If subsequent faults
repopulate these mappings, the reserve counts will go negative.  This is
because the code currently assumes all faults to private mappings will
consume reserves.  The problem can be recreated as follows:

 - mmap(MAP_PRIVATE) a file in hugetlbfs filesystem
 - write fault in pages in the mapping
 - fallocate(FALLOC_FL_PUNCH_HOLE) some pages in the mapping
 - write fault in pages in the hole

This will result in negative huge page reserve counts and negative
subpool usage counts for the hugetlbfs.  Note that this can also be
recreated with ftruncate, but fallocate is more straight forward.

This patch modifies the routines vma_needs_reserves and vma_has_reserves
to examine the reserve map associated with private mappings similar to
that for shared mappings.  However, the reserve map semantics for
private and shared mappings are very different.  This results in subtly
different code that is explained in the comments.

Link: http://lkml.kernel.org/r/1464720957-15698-1-git-send-email-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-09 14:23:11 -07:00
Mike Christie 95fe6c1a20 block, fs, mm, drivers: use bio set/get op accessors
This patch converts the simple bi_rw use cases in the block,
drivers, mm and fs code to set/get the bio operation using
bio_set_op_attrs/bio_op

These should be simple one or two liner cases, so I just did them
in one patch. The next patches handle the more complicated
cases in a module per patch.

Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mike Christie 4e49ea4a3d block/fs/drivers: remove rw argument from submit_bio
This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
instead of passing it in. This makes that use the same as
generic_make_request and how we set the other bio fields.

Signed-off-by: Mike Christie <mchristi@redhat.com>

Fixed up fs/ext4/crypto.c

Signed-off-by: Jens Axboe <axboe@fb.com>
2016-06-07 13:41:38 -06:00
Mel Gorman e46e7b77c9 mm, page_alloc: recalculate the preferred zoneref if the context can ignore memory policies
The optimistic fast path may use cpuset_current_mems_allowed instead of
of a NULL nodemask supplied by the caller for cpuset allocations.  The
preferred zone is calculated on this basis for statistic purposes and as
a starting point in the zonelist iterator.

However, if the context can ignore memory policies due to being atomic
or being able to ignore watermarks then the starting point in the
zonelist iterator is no longer correct.  This patch resets the zonelist
iterator in the allocator slowpath if the context can ignore memory
policies.  This will alter the zone used for statistics but only after
it is known that it makes sense for that context.  Resetting it before
entering the slowpath would potentially allow an ALLOC_CPUSET allocation
to be accounted for against the wrong zone.  Note that while nodemask is
not explicitly set to the original nodemask, it would only have been
overwritten if cpuset_enabled() and it was reset before the slowpath was
entered.

Link: http://lkml.kernel.org/r/20160602103936.GU2527@techsingularity.net
Fixes: c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 16:02:57 -07:00
Mel Gorman 0d0bd89435 mm, page_alloc: reset zonelist iterator after resetting fair zone allocation policy
Geert Uytterhoeven reported the following problem that bisected to
commit c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone
in a zonelist twice") on m68k/ARAnyM

    BUG: scheduling while atomic: cron/668/0x10c9a0c0
    Modules linked in:
    CPU: 0 PID: 668 Comm: cron Not tainted 4.6.0-atari-05133-gc33d6c06f60f710f #364
    Call Trace: [<0003d7d0>] __schedule_bug+0x40/0x54
      __schedule+0x312/0x388
      __schedule+0x0/0x388
      prepare_to_wait+0x0/0x52
      schedule+0x64/0x82
      schedule_timeout+0xda/0x104
      set_next_entity+0x18/0x40
      pick_next_task_fair+0x78/0xda
      io_schedule_timeout+0x36/0x4a
      bit_wait_io+0x0/0x40
      bit_wait_io+0x12/0x40
      __wait_on_bit+0x46/0x76
      wait_on_page_bit_killable+0x64/0x6c
      bit_wait_io+0x0/0x40
      wake_bit_function+0x0/0x4e
      __lock_page_or_retry+0xde/0x124
      do_scan_async+0x114/0x17c
      lookup_swap_cache+0x24/0x4e
      handle_mm_fault+0x626/0x7de
      find_vma+0x0/0x66
      down_read+0x0/0xe
      wait_on_page_bit_killable_timeout+0x77/0x7c
      find_vma+0x16/0x66
      do_page_fault+0xe6/0x23a
      res_func+0xa3c/0x141a
      buserr_c+0x190/0x6d4
      res_func+0xa3c/0x141a
      buserr+0x20/0x28
      res_func+0xa3c/0x141a
      buserr+0x20/0x28

The relationship is not obvious but it's due to a failure to rescan the
full zonelist after the fair zone allocation policy exhausts the batch
count.  While this is a functional problem, it's also a performance
issue.  A page allocator microbenchmark showed the following

                                   4.7.0-rc1                  4.7.0-rc1
                                     vanilla                 reset-v1r2
  Min      alloc-odr0-1     327.00 (  0.00%)           326.00 (  0.31%)
  Min      alloc-odr0-2     235.00 (  0.00%)           235.00 (  0.00%)
  Min      alloc-odr0-4     198.00 (  0.00%)           198.00 (  0.00%)
  Min      alloc-odr0-8     170.00 (  0.00%)           170.00 (  0.00%)
  Min      alloc-odr0-16    156.00 (  0.00%)           156.00 (  0.00%)
  Min      alloc-odr0-32    150.00 (  0.00%)           150.00 (  0.00%)
  Min      alloc-odr0-64    146.00 (  0.00%)           146.00 (  0.00%)
  Min      alloc-odr0-128   145.00 (  0.00%)           145.00 (  0.00%)
  Min      alloc-odr0-256   155.00 (  0.00%)           155.00 (  0.00%)
  Min      alloc-odr0-512   168.00 (  0.00%)           165.00 (  1.79%)
  Min      alloc-odr0-1024  175.00 (  0.00%)           174.00 (  0.57%)
  Min      alloc-odr0-2048  180.00 (  0.00%)           180.00 (  0.00%)
  Min      alloc-odr0-4096  187.00 (  0.00%)           186.00 (  0.53%)
  Min      alloc-odr0-8192  190.00 (  0.00%)           190.00 (  0.00%)
  Min      alloc-odr0-16384 191.00 (  0.00%)           191.00 (  0.00%)
  Min      alloc-odr1-1     736.00 (  0.00%)           445.00 ( 39.54%)
  Min      alloc-odr1-2     343.00 (  0.00%)           335.00 (  2.33%)
  Min      alloc-odr1-4     277.00 (  0.00%)           270.00 (  2.53%)
  Min      alloc-odr1-8     238.00 (  0.00%)           233.00 (  2.10%)
  Min      alloc-odr1-16    224.00 (  0.00%)           218.00 (  2.68%)
  Min      alloc-odr1-32    210.00 (  0.00%)           208.00 (  0.95%)
  Min      alloc-odr1-64    207.00 (  0.00%)           203.00 (  1.93%)
  Min      alloc-odr1-128   276.00 (  0.00%)           202.00 ( 26.81%)
  Min      alloc-odr1-256   206.00 (  0.00%)           202.00 (  1.94%)
  Min      alloc-odr1-512   207.00 (  0.00%)           202.00 (  2.42%)
  Min      alloc-odr1-1024  208.00 (  0.00%)           205.00 (  1.44%)
  Min      alloc-odr1-2048  213.00 (  0.00%)           212.00 (  0.47%)
  Min      alloc-odr1-4096  218.00 (  0.00%)           216.00 (  0.92%)
  Min      alloc-odr1-8192  341.00 (  0.00%)           219.00 ( 35.78%)

Note that order-0 allocations are unaffected but higher orders get a
small boost from this patch and a large reduction in system CPU usage
overall as can be seen here:

             4.7.0-rc1   4.7.0-rc1
               vanilla  reset-v1r2
  User           85.32       86.31
  System       2221.39     2053.36
  Elapsed      2368.89     2202.47

Fixes: c33d6c06f6 ("mm, page_alloc: avoid looking up the first zone in a zonelist twice")
Link: http://lkml.kernel.org/r/20160531100848.GR2527@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Tested-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 16:02:56 -07:00
Michal Hocko cbdcf7f789 mm, oom_reaper: do not use siglock in try_oom_reaper()
Oleg has noted that siglock usage in try_oom_reaper is both pointless
and dangerous.  signal_group_exit can be checked lockless.  The problem
is that sighand becomes NULL in __exit_signal so we can crash.

Fixes: 3ef22dfff2 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 16:02:56 -07:00
Vlastimil Babka 83b9355bf6 mm, page_alloc: prevent infinite loop in buffered_rmqueue()
In DEBUG_VM kernel, we can hit infinite loop for order == 0 in
buffered_rmqueue() when check_new_pcp() returns 1, because the bad page
is never removed from the pcp list.  Fix this by removing the page
before retrying.  Also we don't need to check if page is non-NULL,
because we simply grab it from the list which was just tested for being
non-empty.

Fixes: 479f854a20 ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
Link: http://lkml.kernel.org/r/20160530090154.GM2527@techsingularity.net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 16:02:56 -07:00
Vitaly Wool 43afc19417 mm/z3fold.c: avoid modifying HEADLESS page and minor cleanup
Fix erroneous z3fold header access in a HEADLESS page in reclaim
function, and change one remaining direct handle-to-buddy conversion to
use the appropriate helper.

Link: http://lkml.kernel.org/r/5748706F.9020208@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Reviewed-by: Dan Streetman <ddstreet@ieee.org>
Cc: Seth Jennings <sjenning@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 16:02:55 -07:00
Tejun Heo 3a06bb78ce memcg: add RCU locking around css_for_each_descendant_pre() in memcg_offline_kmem()
memcg_offline_kmem() may be called from memcg_free_kmem() after a css
init failure.  memcg_free_kmem() is a ->css_free callback which is
called without cgroup_mutex and memcg_offline_kmem() ends up using
css_for_each_descendant_pre() without any locking.  Fix it by adding rcu
read locking around it.

    mkdir: cannot create directory `65530': No space left on device
    ===============================
    [ INFO: suspicious RCU usage. ]
    4.6.0-work+ #321 Not tainted
    -------------------------------
    kernel/cgroup.c:4008 cgroup_mutex or RCU read lock required!
     [  527.243970] other info that might help us debug this:
     [  527.244715]
    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by kworker/0:5/1664:
     #0:  ("cgroup_destroy"){.+.+..}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
     #1:  ((&css->destroy_work)#3){+.+...}, at: [<ffffffff81060ab5>] process_one_work+0x165/0x4a0
     [  527.248098] stack backtrace:
    CPU: 0 PID: 1664 Comm: kworker/0:5 Not tainted 4.6.0-work+ #321
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.1-1.fc24 04/01/2014
    Workqueue: cgroup_destroy css_free_work_fn
    Call Trace:
      dump_stack+0x68/0xa1
      lockdep_rcu_suspicious+0xd7/0x110
      css_next_descendant_pre+0x7d/0xb0
      memcg_offline_kmem.part.44+0x4a/0xc0
      mem_cgroup_css_free+0x1ec/0x200
      css_free_work_fn+0x49/0x5e0
      process_one_work+0x1c5/0x4a0
      worker_thread+0x49/0x490
      kthread+0xea/0x100
      ret_from_fork+0x1f/0x40

Link: http://lkml.kernel.org/r/20160526203018.GG23194@mtj.duckdns.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 16:02:55 -07:00
Yang Shi f86e427197 mm: check the return value of lookup_page_ext for all call sites
Per the discussion with Joonsoo Kim [1], we need check the return value
of lookup_page_ext() for all call sites since it might return NULL in
some cases, although it is unlikely, i.e.  memory hotplug.

Tested with ltp with "page_owner=0".

[1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE

[akpm@linux-foundation.org: fix build-breaking typos]
[arnd@arndb.de: fix build problems from lookup_page_ext]
  Link: http://lkml.kernel.org/r/6285269.2CksypHdYp@wuerfel
[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1464023768-31025-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 15:06:22 -07:00
Guillermo Julián Moreno 65ee03c4b9 mm: fix overflow in vm_map_ram()
When remapping pages accounting for 4G or more memory space, the
operation 'count << PAGE_SHIFT' overflows as it is performed on an
integer.  Solution: cast before doing the bitshift.

[akpm@linux-foundation.org: fix vm_unmap_ram() also]
[akpm@linux-foundation.org: fix vmap() as well, per Guillermo]
Link: http://lkml.kernel.org/r/etPan.57175fb3.7a271c6b.2bd@naudit.es
Signed-off-by: Guillermo Julián Moreno <guillermo.julian@naudit.es>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-06-03 15:06:22 -07:00
Tejun Heo 62a584fe05 writeback: use higher precision calculation in domain_dirty_limits()
As vm.dirty_[background_]bytes can't be applied verbatim to multiple
cgroup writeback domains, they get converted to percentages in
domain_dirty_limits() and applied the same way as
vm.dirty_[background]ratio.  However, if the specified bytes is lower
than 1% of available memory, the calculated ratios become zero and the
writeback domain gets throttled constantly.

Fix it by using per-PAGE_SIZE instead of percentage for ratio
calculations.  Also, the updated DIV_ROUND_UP() usages now should
yield 1/4096 (0.0244%) as the minimum ratio as long as the specified
bytes are above zero.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Miao Xie <miaoxie@huawei.com>
Link: http://lkml.kernel.org/g/57333E75.3080309@huawei.com
Cc: stable@vger.kernel.org # v4.2+
Fixes: 9fc3a43e17 ("writeback: separate out domain_dirty_limits()")
Reviewed-by: Jan Kara <jack@suse.cz>

Adjusted comment based on Jan's suggestion.
Signed-off-by: Jens Axboe <axboe@fb.com>
2016-05-30 08:54:40 -06:00
Al Viro 93c76a3d43 file_inode(f)->i_mapping is f->f_mapping
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-29 18:56:09 -04:00
Linus Torvalds d102a56edb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "Followups to the parallel lookup work:

   - update docs

   - restore killability of the places that used to take ->i_mutex
     killably now that we have down_write_killable() merged

   - Additionally, it turns out that I missed a prerequisite for
     security_d_instantiate() stuff - ->getxattr() wasn't the only thing
     that could be called before dentry is attached to inode; with smack
     we needed the same treatment applied to ->setxattr() as well"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch ->setxattr() to passing dentry and inode separately
  switch xattr_handler->set() to passing dentry and inode separately
  restore killability of old mutex_lock_killable(&inode->i_mutex) users
  add down_write_killable_nested()
  update D/f/directory-locking
2016-05-27 17:14:05 -07:00
Linus Torvalds 5d22fc25d4 mm: remove more IS_ERR_VALUE abuses
The do_brk() and vm_brk() return value was "unsigned long" and returned
the starting address on success, and an error value on failure.  The
reasons are entirely historical, and go back to it basically behaving
like the mmap() interface does.

However, nobody actually wanted that interface, and it causes totally
pointless IS_ERR_VALUE() confusion.

What every single caller actually wants is just the simpler integer
return of zero for success and negative error number on failure.

So just convert to that much clearer and more common calling convention,
and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 15:57:31 -07:00
Linus Torvalds 7ded384a12 mm: fix section mismatch warning
The register_page_bootmem_info_node() function needs to be marked __init
in order to avoid a new warning introduced by commit f65e91df25 ("mm:
use early_pfn_to_nid in register_page_bootmem_info_node").

Otherwise you'll get a warning about how a non-init function calls
early_pfn_to_nid (which is __meminit)

Cc: Yang Shi <yang.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 15:23:32 -07:00
Gavin Shan 11e685672a mm: disable DEFERRED_STRUCT_PAGE_INIT on !NO_BOOTMEM
When we have !NO_BOOTMEM, the deferred page struct initialization
doesn't work well because the pages reserved in bootmem are released to
the page allocator uncoditionally.  It causes memory corruption and
system crash eventually.

As Mel suggested, the bootmem is retiring slowly.  We fix the issue by
simply hiding DEFERRED_STRUCT_PAGE_INIT when bootmem is enabled.

Link: http://lkml.kernel.org/r/1460602170-5821-1-git-send-email-gwshan@linux.vnet.ibm.com
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Li RongQing 7cf7806ce1 mm/memcontrol.c: move comments for get_mctgt_type() to proper position
Move the comments for get_mctgt_type() to be before get_mctgt_type()
implementation.

Link: http://lkml.kernel.org/r/1463644638-7446-1-git-send-email-roy.qing.li@gmail.com
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Li RongQing cbedbac3e6 mm/memcontrol.c: fix the margin computation in mem_cgroup_margin()
mem_cgroup_margin() might return (memory.limit - memory_count) when the
memsw.limit is in excess.  This doesn't happen usually because we do not
allow excess on hard limits and (memory.limit <= memsw.limit), but
__GFP_NOFAIL charges can force the charge and cause the excess when no
memory is really swappable (swap is full or no anonymous memory is
left).

[mhocko@suse.com: rewrote changelog]
  Link: http://lkml.kernel.org/r/20160525155122.GK20132@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/1464068266-27736-1-git-send-email-roy.qing.li@gmail.com
Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Stephen Rothwell badbda53e5 mm/cma: silence warnings due to max() usage
pageblock_order can be (at least) an unsigned int or an unsigned long
depending on the kernel config and architecture, so use max_t(unsigned
long, ...) when comparing it.

fixes these warnings:

In file included from include/asm-generic/bug.h:13:0,
                 from arch/powerpc/include/asm/bug.h:127,
                 from include/linux/bug.h:4,
                 from include/linux/mmdebug.h:4,
                 from include/linux/mm.h:8,
                 from include/linux/memblock.h:18,
                 from mm/cma.c:28:
mm/cma.c: In function 'cma_init_reserved_mem':
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
  (void) (&_max1 == &_max2);                   ^
mm/cma.c:186:27: note: in expansion of macro 'max'
  alignment = PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order);
                           ^
mm/cma.c: In function 'cma_declare_contiguous':
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
  (void) (&_max1 == &_max2);                   ^
include/linux/kernel.h:747:9: note: in definition of macro 'max'
  typeof(y) _max2 = (y);            ^
mm/cma.c:270:29: note: in expansion of macro 'max'
   (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                             ^
include/linux/kernel.h:748:17: warning: comparison of distinct pointer types lacks a cast
  (void) (&_max1 == &_max2);                   ^
include/linux/kernel.h:747:21: note: in definition of macro 'max'
  typeof(y) _max2 = (y);                        ^
mm/cma.c:270:29: note: in expansion of macro 'max'
   (phys_addr_t)PAGE_SIZE << max(MAX_ORDER - 1, pageblock_order));
                             ^

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/20160526150748.5be38a4f@canb.auug.org.au
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Kirill A. Shutemov 0798d3c022 mm: thp: avoid false positive VM_BUG_ON_PAGE in page_move_anon_rmap()
If page_move_anon_rmap() is refiling a pmd-splitted THP mapped in a tail
page from a pte, the "address" must be THP aligned in order for the
page->index bugcheck to pass in the CONFIG_DEBUG_VM=y builds.

Link: http://lkml.kernel.org/r/1464253620-106404-1-git-send-email-kirill.shutemov@linux.intel.com
Fixes: 6d0a07edd1 ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>        [4.5]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Michal Hocko e2fe14564d oom_reaper: close race with exiting task
Tetsuo has reported:
  Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
  Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
  sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
  sh cpuset=/ mems_allowed=0
  CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
  Call Trace:
    dump_stack+0x85/0xc8
    dump_header+0x5b/0x394
  oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

In other words:

  __oom_reap_task		exit_mm
    atomic_inc_not_zero
				  tsk->mm = NULL
				  mmput
				    atomic_dec_and_test # > 0
				  exit_oom_victim # New victim will be
						  # selected
				<OOM killer invoked>
				  # no TIF_MEMDIE task so we can select a new one
    unmap_page_range # to release the memory

The race exists even without the oom_reaper because anybody who pins the
address space and gets preempted might race with exit_mm but oom_reaper
made this race more probable.

We can address the oom_reaper part by using oom_lock for __oom_reap_task
because this would guarantee that a new oom victim will not be selected
if the oom reaper might race with the exit path.  This doesn't solve the
original issue, though, because somebody else still might be pinning
mm_users and so __mmput won't be called to release the memory but that
is not really realiably solvable because the task will get away from the
oom sight as soon as it is unhashed from the task_list and so we cannot
guarantee a new victim won't be selected.

[akpm@linux-foundation.org: fix use of unused `mm', Per Stephen]
[akpm@linux-foundation.org: coding-style fixes]
Fixes: aac4536355 ("mm, oom: introduce oom reaper")
Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Yang Shi f65e91df25 mm: use early_pfn_to_nid in register_page_bootmem_info_node
register_page_bootmem_info_node() is invoked in mem_init(), so it will
be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is
enabled.  But, pfn_to_nid() depends on memmap which won't be fully setup
until page_alloc_init_late() is done, so replace pfn_to_nid() by
early_pfn_to_nid().

Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Yang Shi fe53ca5427 mm: use early_pfn_to_nid in page_ext_init
page_ext_init() checks suitable pages with pfn_to_nid(), but
pfn_to_nid() depends on memmap which will not be setup fully until
page_alloc_init_late() is done.  Use early_pfn_to_nid() instead of
pfn_to_nid() so that page extension could be still used early even
though CONFIG_ DEFERRED_STRUCT_PAGE_INIT is enabled and catch early page
allocation call sites.

Suggested by Joonsoo Kim [1], this fix basically undoes the change
introduced by commit b8f1a75d61 ("mm: call page_ext_init() after all
struct pages are initialized") and fixes the same problem with a better
approach.

[1] http://lkml.kernel.org/r/CAAmzW4OUmyPwQjvd7QUfc6W1Aic__TyAuH80MLRZNMxKy0-wPQ@mail.gmail.com

Link: http://lkml.kernel.org/r/1464198689-23458-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Vladimir Davydov edd9f7230f mm: oom: do not reap task if there are live threads in threadgroup
If the current process is exiting, we don't invoke oom killer, instead
we give it access to memory reserves and try to reap its mm in case
nobody is going to use it.  There's a mistake in the code performing
this check - we just ignore any process of the same thread group no
matter if it is exiting or not - see try_oom_reaper.  Fix it.

Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
Fixes: 3ef22dfff2 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-27 14:49:37 -07:00
Al Viro 5930122683 switch xattr_handler->set() to passing dentry and inode separately
preparation for similar switch in ->setxattr() (see the next commit for
rationale).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-27 15:39:43 -04:00
Linus Torvalds e12fab28df Merge branch 'akpm' (patches from Andrew)
Merge fixes from Andrew Morton:
 "10 fixes"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  drivers/pinctrl/intel/pinctrl-baytrail.c: fix build with gcc-4.4
  update "mm/zsmalloc: don't fail if can't create debugfs info"
  dma-debug: avoid spinlock recursion when disabling dma-debug
  mm: oom_reaper: remove some bloat
  memcg: fix mem_cgroup_out_of_memory() return value.
  ocfs2: fix improper handling of return errno
  mm: slub: remove unused virt_to_obj()
  mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta
  mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
  seqlock: fix raw_read_seqcount_latch()
2016-05-26 21:32:40 -07:00
Linus Torvalds 478a1469a7 Filesystem DAX locking for 4.7
- We use a bit in an exceptional radix tree entry as a lock bit and use it
   similarly to how page lock is used for normal faults.  This fixes races
   between hole instantiation and read faults of the same index.
 
 - Filesystem DAX PMD faults are disabled, and will be re-enabled when PMD
   locking is implemented.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXRKwLAAoJEJ/BjXdf9fLB+BkP/3HBm05KlAKDklvnBIPFDMUK
 hA7g2K6vuvaEDZXZQ1ioc1Ajf1sCpVip7shXJsojZqwWmRz0/4nneF7ytluW9AjS
 dBX+0qCgKGH1fnwyGFF+MN7fuj7kGrSDz34lG0OObRN6/oKiVNb2svXiYKkT6J6C
 AgsWlWRUpMy9jrn1u/FduMjDhk92Z3ojarexuicr0i8NUlBClCIrdCEmUMi4orSB
 DuiIjestLOc7+mERBUwrXkzoh9v8Z0FpIgnDLWwpeEkAvJwWkGe5eXrBJwF+hEbi
 RYfTrOYc7bBQLo22LRb8pdighjrx3OW9EpNCfEmLDOjM3cYBbMK/d2i/ww52H6IK
 Mw6iS5rXdGgJtQIGL8N96HLFk+cDyZ8J8xNUCwbYYBJqgpMzxzVkL3vTm72tyFnl
 InWhih+miCMbBPytQSRd6+1wZG2piJTv6SsFTd5K1OaiRmJhBJZG47t2QTBRBu7Y
 5A4FGPtlraV+iDJvD6VLO1Tp8twxdLluOJ2BwdGeiKXiGh6LP+FGGFF3aFa5N4Ro
 xSslCTX7Q1G66zXQwD4+IMWLwS1FDNymPkUSsF6RQo6qfAnl9SrmYTc4xJ4QXy92
 sUdrWEz2OBTfxKNqbGyc/KrXKZT3RnEkJNft8snB2h6WTCdOPaNYs/yETUwiwkSc
 CXpuQFrxm69QYwNsqVu1
 =Pkd0
 -----END PGP SIGNATURE-----

Merge tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull DAX locking updates from Ross Zwisler:
 "Filesystem DAX locking for 4.7

   - We use a bit in an exceptional radix tree entry as a lock bit and
     use it similarly to how page lock is used for normal faults.  This
     fixes races between hole instantiation and read faults of the same
     index.

   - Filesystem DAX PMD faults are disabled, and will be re-enabled when
     PMD locking is implemented"

* tag 'dax-locking-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
  dax: Remove i_mmap_lock protection
  dax: Use radix tree entry lock to protect cow faults
  dax: New fault locking
  dax: Allow DAX code to replace exceptional entries
  dax: Define DAX lock bit for radix tree exceptional entry
  dax: Make huge page handling depend of CONFIG_BROKEN
  dax: Fix condition for filling of PMD holes
2016-05-26 20:00:28 -07:00
Dan Streetman 4abaac9b73 update "mm/zsmalloc: don't fail if can't create debugfs info"
Some updates to commit d34f615720 ("mm/zsmalloc: don't fail if can't
create debugfs info"):

 - add pr_warn to all stat failure cases
 - do not prevent module loading on stat failure

Link: http://lkml.kernel.org/r/1463671123-5479-1-git-send-email-ddstreet@ieee.org
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reviewed-by: Ganesh Mahendran <opensource.ganesh@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <dan.streetman@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-26 15:35:44 -07:00
Tetsuo Handa 1ebab2db06 memcg: fix mem_cgroup_out_of_memory() return value.
mem_cgroup_out_of_memory() is returning "true" if it finds a TIF_MEMDIE
task after an eligible task was found, "false" if it found a TIF_MEMDIE
task before an eligible task is found.

This difference confuses memory_max_write() which checks the return
value of mem_cgroup_out_of_memory().  Since memory_max_write() wants to
continue looping, mem_cgroup_out_of_memory() should return "true" in
this case.

This patch sets a dummy pointer in order to return "true".

Link: http://lkml.kernel.org/r/1463753327-5170-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-26 15:35:44 -07:00
Andrey Ryabinin 9725759a96 mm: kasan: remove unused 'reserved' field from struct kasan_alloc_meta
Commit cd11016e5f ("mm, kasan: stackdepot implementation.  Enable
stackdepot for SLAB") added 'reserved' field, but never used it.

Link: http://lkml.kernel.org/r/1464021054-2307-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-26 15:35:44 -07:00
Yang Shi 957949243b mm: make CONFIG_DEFERRED_STRUCT_PAGE_INIT depends on !FLATMEM explicitly
Per the suggestion from Michal Hocko [1], DEFERRED_STRUCT_PAGE_INIT
requires some ordering wrt other initialization operations, e.g.
page_ext_init has to happen after the whole memmap is initialized
properly.

For SPARSEMEM this requires to wait for page_alloc_init_late.  Other
memory models (e.g.  flatmem) might have different initialization
layouts (page_ext_init_flatmem).  Currently DEFERRED_STRUCT_PAGE_INIT
depends on MEMORY_HOTPLUG which in turn

	depends on SPARSEMEM || X86_64_ACPI_NUMA
	depends on ARCH_ENABLE_MEMORY_HOTPLUG

and X86_64_ACPI_NUMA depends on NUMA which in turn disable FLATMEM
memory model:

config ARCH_FLATMEM_ENABLE
	def_bool y
	depends on X86_32 && !NUMA

so FLATMEM is ruled out via dependency maze.  Be explicit and disable
FLATMEM for DEFERRED_STRUCT_PAGE_INIT so that we do not reintroduce
subtle initialization bugs

[1] http://lkml.kernel.org/r/20160523073157.GD2278@dhcp22.suse.cz

Link: http://lkml.kernel.org/r/1464027356-32282-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi <yang.shi@linaro.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-26 15:35:44 -07:00
Tejun Heo 6710e594f7 percpu: fix synchronization between synchronous map extension and chunk destruction
For non-atomic allocations, pcpu_alloc() can try to extend the area
map synchronously after dropping pcpu_lock; however, the extension
wasn't synchronized against chunk destruction and the chunk might get
freed while extension is in progress.

This patch fixes the bug by putting most of non-atomic allocations
under pcpu_alloc_mutex to synchronize against pcpu_balance_work which
is responsible for async chunk management including destruction.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # v3.18+
Fixes: 1a4d76076c ("percpu: implement asynchronous chunk population")
2016-05-25 11:48:25 -04:00
Tejun Heo 4f996e234d percpu: fix synchronization between chunk->map_extend_work and chunk destruction
Atomic allocations can trigger async map extensions which is serviced
by chunk->map_extend_work.  pcpu_balance_work which is responsible for
destroying idle chunks wasn't synchronizing properly against
chunk->map_extend_work and may end up freeing the chunk while the work
item is still in flight.

This patch fixes the bug by rolling async map extension operations
into pcpu_balance_work.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # v3.18+
Fixes: 9c824b6a17 ("percpu: make sure chunk->map array has available space")
2016-05-25 11:48:25 -04:00
Linus Torvalds 84787c572d Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - Oleg's "wait/ptrace: assume __WALL if the child is traced".  It's a
   kernel-based workaround for existing userspace issues.

 - A few hotfixes

 - befs cleanups

 - nilfs2 updates

 - sys_wait() changes

 - kexec updates

 - kdump

 - scripts/gdb updates

 - the last of the MM queue

 - a few other misc things

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (84 commits)
  kgdb: depends on VT
  drm/amdgpu: make amdgpu_mn_get wait for mmap_sem killable
  drm/radeon: make radeon_mn_get wait for mmap_sem killable
  drm/i915: make i915_gem_mmap_ioctl wait for mmap_sem killable
  uprobes: wait for mmap_sem for write killable
  prctl: make PR_SET_THP_DISABLE wait for mmap_sem killable
  exec: make exec path waiting for mmap_sem killable
  aio: make aio_setup_ring killable
  coredump: make coredump_wait wait for mmap_sem for write killable
  vdso: make arch_setup_additional_pages wait for mmap_sem for write killable
  ipc, shm: make shmem attach/detach wait for mmap_sem killable
  mm, fork: make dup_mmap wait for mmap_sem for write killable
  mm, proc: make clear_refs killable
  mm: make vm_brk killable
  mm, elf: handle vm_brk error
  mm, aout: handle vm_brk failures
  mm: make vm_munmap killable
  mm: make vm_mmap killable
  mm: make mmap_sem for write waits killable for mm syscalls
  MAINTAINERS: add co-maintainer for scripts/gdb
  ...
2016-05-23 19:42:28 -07:00
Michal Hocko 2d6c928241 mm: make vm_brk killable
Now that all the callers handle vm_brk failure we can change it wait for
mmap_sem killable to help oom_reaper to not get blocked just because
vm_brk gets blocked behind mmap_sem readers.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Michal Hocko ae79878356 mm: make vm_munmap killable
Almost all current users of vm_munmap are ignoring the return value and
so they do not handle potential error.  This means that some VMAs might
stay behind.  This patch doesn't try to solve those potential problems.
Quite contrary it adds a new failure mode by using down_write_killable
in vm_munmap.  This should be safer than other failure modes, though,
because the process is guaranteed to die as soon as it leaves the kernel
and exit_mmap will clean the whole address space.

This will help in the OOM conditions when the oom victim might be stuck
waiting for the mmap_sem for write which in turn can block oom_reaper
which relies on the mmap_sem for read to make a forward progress and
reclaim the address space of the victim.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Michal Hocko 9fbeb5ab59 mm: make vm_mmap killable
All the callers of vm_mmap seem to check for the failure already and
bail out in one way or another on the error which means that we can
change it to use killable version of vm_mmap_pgoff and return -EINTR if
the current task gets killed while waiting for mmap_sem.  This also
means that vm_mmap_pgoff can be killable by default and drop the
additional parameter.

This will help in the OOM conditions when the oom victim might be stuck
waiting for the mmap_sem for write which in turn can block oom_reaper
which relies on the mmap_sem for read to make a forward progress and
reclaim the address space of the victim.

Please note that load_elf_binary is ignoring vm_mmap error for
current->personality & MMAP_PAGE_ZERO case but that shouldn't be a
problem because the address is not used anywhere and we never return to
the userspace if we got killed.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Michal Hocko dc0ef0df7b mm: make mmap_sem for write waits killable for mm syscalls
This is a follow up work for oom_reaper [1].  As the async OOM killing
depends on oom_sem for read we would really appreciate if a holder for
write didn't stood in the way.  This patchset is changing many of
down_write calls to be killable to help those cases when the writer is
blocked and waiting for readers to release the lock and so help
__oom_reap_task to process the oom victim.

Most of the patches are really trivial because the lock is help from a
shallow syscall paths where we can return EINTR trivially and allow the
current task to die (note that EINTR will never get to the userspace as
the task has fatal signal pending).  Others seem to be easy as well as
the callers are already handling fatal errors and bail and return to
userspace which should be sufficient to handle the failure gracefully.
I am not familiar with all those code paths so a deeper review is really
appreciated.

As this work is touching more areas which are not directly connected I
have tried to keep the CC list as small as possible and people who I
believed would be familiar are CCed only to the specific patches (all
should have received the cover though).

This patchset is based on linux-next and it depends on
down_write_killable for rw_semaphores which got merged into tip
locking/rwsem branch and it is merged into this next tree.  I guess it
would be easiest to route these patches via mmotm because of the
dependency on the tip tree but if respective maintainers prefer other
way I have no objections.

I haven't covered all the mmap_write(mm->mmap_sem) instances here

  $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
  98
  $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
  62

I have tried to cover those which should be relatively easy to review in
this series because this alone should be a nice improvement.  Other
places can be changed on top.

[0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
[1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
[2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org

This patch (of 18):

This is the first step in making mmap_sem write waiters killable.  It
focuses on the trivial ones which are taking the lock early after
entering the syscall and they are not changing state before.

Therefore it is very easy to change them to use down_write_killable and
immediately return with -EINTR.  This will allow the waiter to pass away
without blocking the mmap_sem which might be required to make a forward
progress.  E.g.  the oom reaper will need the lock for reading to
dismantle the OOM victim address space.

The only tricky function in this patch is vm_mmap_pgoff which has many
call sites via vm_mmap.  To reduce the risk keep vm_mmap with the
original non-killable semantic for now.

vm_munmap callers do not bother checking the return value so open code
it into the munmap syscall path for now for simplicity.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Vladimir Davydov 1383399d7b mm: memcontrol: fix possible css ref leak on oom
mem_cgroup_oom may be invoked multiple times while a process is handling
a page fault, in which case current->memcg_in_oom will be overwritten
leaking the previously taken css reference.

Link: http://lkml.kernel.org/r/1464019330-7579-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-23 17:04:14 -07:00
Linus Torvalds 1d6da87a32 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
 "Here's the main drm pull request for 4.7, it's been a busy one, and
  I've been a bit more distracted in real life this merge window.  Lots
  more ARM drivers, not sure if it'll ever end.  I think I've at least
  one more coming the next merge window.

  But changes are all over the place, support for AMD Polaris GPUs is in
  here, some missing GM108 support for nouveau (found in some Lenovos),
  a bunch of MST and skylake fixes.

  I've also noticed a few fixes from Arnd in my inbox, that I'll try and
  get in asap, but I didn't think they should hold this up.

  New drivers:
   - Hisilicon kirin display driver
   - Mediatek MT8173 display driver
   - ARC PGU - bitstreamer on Synopsys ARC SDP boards
   - Allwinner A13 initial RGB output driver
   - Analogix driver for DisplayPort IP found in exynos and rockchip

  DRM Core:
   - UAPI headers fixes and C++ safety
   - DRM connector reference counting
   - DisplayID mode parsing for Dell 5K monitors
   - Removal of struct_mutex from drivers
   - Connector registration cleanups
   - MST robustness fixes
   - MAINTAINERS updates
   - Lockless GEM object freeing
   - Generic fbdev deferred IO support

  panel:
   - Support for a bunch of new panels

  i915:
   - VBT refactoring
   - PLL computation cleanups
   - DSI support for BXT
   - Color manager support
   - More atomic patches
   - GEM improvements
   - GuC fw loading fixes
   - DP detection fixes
   - SKL GPU hang fixes
   - Lots of BXT fixes

  radeon/amdgpu:
   - Initial Polaris support
   - GPUVM/Scheduler/Clock/Power improvements
   - ASYNC pageflip support
   - New mesa feature support

  nouveau:
   - GM108 support
   - Power sensor support improvements
   - GR init + ucode fixes.
   - Use GPU provided topology information

  vmwgfx:
   - Add host messaging support

  gma500:
   - Some cleanups and fixes

  atmel:
   - Bridge support
   - Async atomic commit support

  fsl-dcu:
   - Timing controller for LCD support
   - Pixel clock polarity support

  rcar-du:
   - Misc fixes

  exynos:
   - Pipeline clock support
   - Exynoss4533 SoC support
   - HW trigger mode support
   - export HDMI_PHY clock
   - DECON5433 fixes
   - Use generic prime functions
   - use DMA mapping APIs

  rockchip:
   - Lots of little fixes

  vc4:
   - Render node support
   - Gamma ramp support
   - DPI output support

  msm:
   - Mostly cleanups and fixes
   - Conversion to generic struct fence

  etnaviv:
   - Fix for prime buffer handling
   - Allow hangcheck to be coalesced with other wakeups

  tegra:
   - Gamme table size fix"

* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (1050 commits)
  drm/edid: add displayid detailed 1 timings to the modelist. (v1.1)
  drm/edid: move displayid validation to it's own function.
  drm/displayid: Iterate over all DisplayID blocks
  drm/edid: move displayid tiled block parsing into separate function.
  drm: Nuke ->vblank_disable_allowed
  drm/vmwgfx: Report vmwgfx version to vmware.log
  drm/vmwgfx: Add VMWare host messaging capability
  drm/vmwgfx: Kill some lockdep warnings
  drm/nouveau/gr/gf100-: fix race condition in fecs/gpccs ucode
  drm/nouveau/core: recognise GM108 chipsets
  drm/nouveau/gr/gm107-: fix touching non-existent ppcs in attrib cb setup
  drm/nouveau/gr/gk104-: share implementation of ppc exception init
  drm/nouveau/gr/gk104-: move rop_active_fbps init to nonctx
  drm/nouveau/bios/pll: check BIT table version before trying to parse it
  drm/nouveau/bios/pll: prevent oops when limits table can't be parsed
  drm/nouveau/volt/gk104: round up in gk104_volt_set
  drm/nouveau/fb/gm200: setup mmu debug buffer registers at init()
  drm/nouveau/fb/gk20a,gm20b: setup mmu debug buffer registers at init()
  drm/nouveau/fb/gf100-: allocate mmu debug buffers
  drm/nouveau/fb: allow chipset-specific actions for oneinit()
  ...
2016-05-23 11:48:48 -07:00
Linus Torvalds 1f40c49570 libnvdimm for 4.7
1/ Device DAX for persistent memory:
    Device DAX is the device-centric analogue of Filesystem DAX
    (CONFIG_FS_DAX).  It allows memory ranges to be allocated and mapped
    without need of an intervening file system.  Device DAX is strict,
    precise and predictable.  Specifically this interface:
 
    a) Guarantees fault granularity with respect to a given page size
       (pte, pmd, or pud) set at configuration time.
 
    b) Enforces deterministic behavior by being strict about what fault
       scenarios are supported.
 
    Persistent memory is the first target, but the mechanism is also
    targeted for exclusive allocations of performance/feature differentiated
    memory ranges.
 
 2/ Support for the HPE DSM (device specific method) command formats.
    This enables management of these first generation devices until a
    unified DSM specification materializes.
 
 3/ Further ACPI 6.1 compliance with support for the common dimm
    identifier format.
 
 4/ Various fixes and cleanups across the subsystem.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXQhdeAAoJEB7SkWpmfYgCYP8P/RAgHkroL5lUKKU45TQUBKcY
 diC9POeNSccme4tIRIQCGQUZ7+7mKM5ECv2ulF4xYOHvFBCcd/8OF6xKAXs48r3v
 oguYhvX1YvIkBc9FUfBQbR1IsCOJ7uWp/UYiYCIQEXS5tS9Jv545j3ASqDt9xWoV
 TWlceZn3yWSbASiV9qZ2eXhEkk75pg4yara++rsm2/7rs/TTXn5EIjBs+57BtAo+
 6utI4fTy0CQvBYwVzam3m7y9dt2Z2jWXL4hgmT7pkvJ7HDoctVly0P9+bknJPUAo
 g+NugKgTGeiqH5GYp5CTZ9KvL91sDF4q00pfinITVdFl0E3VE293cIHlAzSQBm5/
 w58xxaRV958ZvpH7EaBmYQG82QDi/eFNqeHqVGn0xAM6MlaqO7avUMQp2lRPYMCJ
 u1z/NloR5yo+sffHxsn5Luiq9KqOf6zk33PuxEkKbN74OayCSPn/SeVCO7rQR0B6
 yPMJTTcTiCLnId1kOWAPaEmuK2U3BW/+ogg7hKgeCQSysuy5n6Ok5a2vEx/gJRAm
 v9yF68RmIWumpHr+QB0TmB8mVbD5SY+xWTm3CqJb9MipuFIOF7AVsPyTgucBvE7s
 v+i5F6MDO6tcVfiDT4AiZEt6D2TM5RbtckkUEX3ZTD6j7CGuR5D8bH0HNRrghrYk
 KT1lAk6tjWBOGAHc5Ji7
 =Y3Xv
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
 "The bulk of this update was stabilized before the merge window and
  appeared in -next.  The "device dax" implementation was revised this
  week in response to review feedback, and to address failures detected
  by the recently expanded ndctl unit test suite.

  Not included in this pull request are two dax topic branches (dax
  error handling, and dax radix-tree locking).  These topics were
  deferred to get a few more days of -next integration testing, and to
  coordinate a branch baseline with Ted and the ext4 tree.  Vishal and
  Ross will send the error handling and locking topics respectively in
  the next few days.

  This branch has received a positive build result from the kbuild robot
  across 226 configs.

  Summary:

   - Device DAX for persistent memory: Device DAX is the device-centric
     analogue of Filesystem DAX (CONFIG_FS_DAX).  It allows memory
     ranges to be allocated and mapped without need of an intervening
     file system.  Device DAX is strict, precise and predictable.
     Specifically this interface:

      a) Guarantees fault granularity with respect to a given page size
         (pte, pmd, or pud) set at configuration time.

      b) Enforces deterministic behavior by being strict about what
         fault scenarios are supported.

     Persistent memory is the first target, but the mechanism is also
     targeted for exclusive allocations of performance/feature
     differentiated memory ranges.

   - Support for the HPE DSM (device specific method) command formats.
     This enables management of these first generation devices until a
     unified DSM specification materializes.

   - Further ACPI 6.1 compliance with support for the common dimm
     identifier format.

   - Various fixes and cleanups across the subsystem"

* tag 'libnvdimm-for-4.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (40 commits)
  libnvdimm, dax: fix deletion
  libnvdimm, dax: fix alignment validation
  libnvdimm, dax: autodetect support
  libnvdimm: release ida resources
  Revert "block: enable dax for raw block devices"
  /dev/dax, core: file operations and dax-mmap
  /dev/dax, pmem: direct access to persistent memory
  libnvdimm: stop requiring a driver ->remove() method
  libnvdimm, dax: record the specified alignment of a dax-device instance
  libnvdimm, dax: reserve space to store labels for device-dax
  libnvdimm, dax: introduce device-dax infrastructure
  nfit: add sysfs dimm 'family' and 'dsm_mask' attributes
  tools/testing/nvdimm: ND_CMD_CALL support
  nfit: disable vendor specific commands
  nfit: export subsystem ids as attributes
  nfit: fix format interface code byte order per ACPI6.1
  nfit, libnvdimm: limited/whitelisted dimm command marshaling mechanism
  nfit, libnvdimm: clarify "commands" vs "_DSMs"
  libnvdimm: increase max envelope size for ioctl
  acpi/nfit: Add sysfs "id" for NVDIMM ID
  ...
2016-05-23 11:18:01 -07:00
Linus Torvalds bd28b14591 x86: remove more uaccess_32.h complexity
I'm looking at trying to possibly merge the 32-bit and 64-bit versions
of the x86 uaccess.h implementation, but first this needs to be cleaned
up.

For example, the 32-bit version of "__copy_from_user_inatomic()" is
mostly the special cases for the constant size, and it's actually almost
never relevant.  Most users aren't actually using a constant size
anyway, and the few cases that do small constant copies are better off
just using __get_user() instead.

So get rid of the unnecessary complexity.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-22 17:21:27 -07:00
Dan Williams dee4107924 /dev/dax, core: file operations and dax-mmap
The "Device DAX" core enables dax mappings of performance / feature
differentiated memory.  An open mapping or file handle keeps the backing
struct device live, but new mappings are only possible while the device
is enabled.   Faults are handled under rcu_read_lock to synchronize
with the enabled state of the device.

Similar to the filesystem-dax case the backing memory may optionally
have struct page entries.  However, unlike fs-dax there is no support
for private mappings, or mappings that are not backed by media (see
use of zero-page in fs-dax).

Mappings are always guaranteed to match the alignment of the dax_region.
If the dax_region is configured to have a 2MB alignment, all mappings
are guaranteed to be backed by a pmd entry.  Contrast this determinism
with the fs-dax case where pmd mappings are opportunistic.  If userspace
attempts to force a misaligned mapping, the driver will fail the mmap
attempt.  See dax_dev_check_vma() for other scenarios that are rejected,
like MAP_PRIVATE mappings.

Cc: Hannes Reinecke <hare@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-05-20 22:02:55 -07:00
Matthew Wilcox d604c32452 radix-tree: introduce radix_tree_replace_clear_tags()
In addition to replacing the entry, we also clear all associated tags.
This is really a one-off special for page_cache_tree_delete() which had
far too much detailed knowledge about how the radix tree works.

For efficiency, factor node_tag_clear() out of radix_tree_tag_clear() It
can be used by radix_tree_delete_item() as well as
radix_tree_replace_clear_tags().

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Matthew Wilcox 57578c2ea2 raxix-tree: introduce CONFIG_RADIX_TREE_MULTIORDER
I've been receiving increasingly concerned notes from 0day about how
much my recent changes have been bloating the radix tree.  Make it
happier by only including multiorder support if
CONFIG_TRANSPARENT_HUGEPAGES is set.

This is an independent Kconfig option, so other radix tree users can
also set it if they have a need.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jan Kara <jack@suse.com>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Dan Streetman d34f615720 mm/zsmalloc: don't fail if can't create debugfs info
Change the return type of zs_pool_stat_create() to void, and remove the
logic to abort pool creation if the stat debugfs dir/file could not be
created.

The debugfs stat file is for debugging/information only, and doesn't
affect operation of zsmalloc; there is no reason to abort creating the
pool if the stat file can't be created.  This was seen with zswap, which
used the same name for all pool creations, which caused zsmalloc to fail
to create a second pool for zswap if CONFIG_ZSMALLOC_STAT was enabled.

Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Dan Streetman <dan.streetman@canonical.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Dan Streetman 200867af4d mm/zswap: use workqueue to destroy pool
Add a work_struct to struct zswap_pool, and change __zswap_pool_empty to
use the workqueue instead of using call_rcu().

When zswap destroys a pool no longer in use, it uses call_rcu() to
perform the destruction/freeing.  Since that executes in softirq
context, it must not sleep.  However, actually destroying the pool
involves freeing the per-cpu compressors (which requires locking the
cpu_add_remove_lock mutex) and freeing the zpool, for which the
implementation may sleep (e.g.  zsmalloc calls kmem_cache_destroy, which
locks the slab_mutex).  So if either mutex is currently taken, or any
other part of the compressor or zpool implementation sleeps, it will
result in a BUG().

It's not easy to reproduce this when changing zswap's params normally.
In testing with a loaded system, this does not fail:

  $ cd /sys/module/zswap/parameters
  $ echo lz4 > compressor ; echo zsmalloc > zpool

nor does this:

  $ while true ; do
  > echo lzo > compressor ; echo zbud > zpool
  > sleep 1
  > echo lz4 > compressor ; echo zsmalloc > zpool
  > sleep 1
  > done

although it's still possible either of those might fail, depending on
whether anything else besides zswap has locked the mutexes.

However, changing a parameter with no delay immediately causes the
schedule while atomic BUG:

  $ while true ; do
  > echo lzo > compressor ; echo lz4 > compressor
  > done

This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
but moved to zswap, to cover compressor and zpool freeing.

Fixes: f1c54846ee ("zswap: dynamic pool creation")
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Reported-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Streetman <dan.streetman@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Sergey Senozhatsky d0d8da2dc4 zsmalloc: require GFP in zs_malloc()
Pass GFP flags to zs_malloc() instead of using a fixed mask supplied to
zs_create_pool(), so we can be more flexible, but, more importantly, we
need this to switch zram to per-cpu compression streams -- zram will try
to allocate handle with preemption disabled in a fast path and switch to
a slow path (using different gfp mask) if the fast one has failed.

Apart from that, this also align zs_malloc() interface with zspool/zbud.

[sergey.senozhatsky@gmail.com: pass GFP flags to zs_malloc() instead of using a fixed mask]
  Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
Link: http://lkml.kernel.org/r/20160429150942.GA637@swordfish
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Minchan Kim 1ee4716585 zsmalloc: remove unused pool param in obj_free
Let's remove unused pool param in obj_free

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Minchan Kim 251cbb951b zsmalloc: reorder function parameters
Clean up function parameter ordering to order higher data structure
first.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Minchan Kim 830e4bc5ba zsmalloc: clean up many BUG_ON
There are many BUG_ON in zsmalloc.c which is not recommened so change
them as alternatives.

Normal rule is as follows:

1. avoid BUG_ON if possible. Instead, use VM_BUG_ON or VM_BUG_ON_PAGE

2. use VM_BUG_ON_PAGE if we need to see struct page's fields

3. use those assertion in primitive functions so higher functions can
   rely on the assertion in the primitive function.

4. Don't use assertion if following instruction can trigger Oops

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Minchan Kim a42094676f zsmalloc: use first_page rather than page
Clean up function parameter "struct page".  Many functions of zsmalloc
expect that page paramter is "first_page" so use "first_page" rather
than "page" for code readability.

Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Andrey Ryabinin 64f8ebaf11 mm/kasan: add API to check memory regions
Memory access coded in an assembly won't be seen by KASAN as a compiler
can instrument only C code.  Add kasan_check_[read,write]() API which is
going to be used to check a certain memory range.

Link: http://lkml.kernel.org/r/1462538722-1574-3-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Andrey Ryabinin 936bb4bbbb mm/kasan: print name of mem[set,cpy,move]() caller in report
When bogus memory access happens in mem[set,cpy,move]() it's usually
caller's fault.  So don't blame mem[set,cpy,move]() in bug report, blame
the caller instead.

Before:
  BUG: KASAN: out-of-bounds access in memset+0x23/0x40 at <address>
After:
  BUG: KASAN: out-of-bounds access in <memset_caller> at <address>

Link: http://lkml.kernel.org/r/1462538722-1574-2-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Alexander Potapenko 4ebb31a42f mm, kasan: don't call kasan_krealloc() from ksize().
Instead of calling kasan_krealloc(), which replaces the memory
allocation stack ID (if stack depot is used), just unpoison the whole
memory chunk.

Signed-off-by: Alexander Potapenko <glider@google.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Konstantin Serebryany <kcc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Alexander Potapenko 55834c5909 mm: kasan: initial memory quarantine implementation
Quarantine isolates freed objects in a separate queue.  The objects are
returned to the allocator later, which helps to detect use-after-free
errors.

When the object is freed, its state changes from KASAN_STATE_ALLOC to
KASAN_STATE_QUARANTINE.  The object is poisoned and put into quarantine
instead of being returned to the allocator, therefore every subsequent
access to that object triggers a KASAN error, and the error handler is
able to say where the object has been allocated and deallocated.

When it's time for the object to leave quarantine, its state becomes
KASAN_STATE_FREE and it's returned to the allocator.  From now on the
allocator may reuse it for another allocation.  Before that happens,
it's still possible to detect a use-after free on that object (it
retains the allocation/deallocation stacks).

When the allocator reuses this object, the shadow is unpoisoned and old
allocation/deallocation stacks are wiped.  Therefore a use of this
object, even an incorrect one, won't trigger ASan warning.

Without the quarantine, it's not guaranteed that the objects aren't
reused immediately, that's why the probability of catching a
use-after-free is lower than with quarantine in place.

Quarantine isolates freed objects in a separate queue.  The objects are
returned to the allocator later, which helps to detect use-after-free
errors.

Freed objects are first added to per-cpu quarantine queues.  When a
cache is destroyed or memory shrinking is requested, the objects are
moved into the global quarantine queue.  Whenever a kmalloc call allows
memory reclaiming, the oldest objects are popped out of the global queue
until the total size of objects in quarantine is less than 3/4 of the
maximum quarantine size (which is a fraction of installed physical
memory).

As long as an object remains in the quarantine, KASAN is able to report
accesses to it, so the chance of reporting a use-after-free is
increased.  Once the object leaves quarantine, the allocator may reuse
it, in which case the object is unpoisoned and KASAN can't detect
incorrect accesses to it.

Right now quarantine support is only enabled in SLAB allocator.
Unification of KASAN features in SLAB and SLUB will be done later.

This patch is based on the "mm: kasan: quarantine" patch originally
prepared by Dmitry Chernenkov.  A number of improvements have been
suggested by Andrey Ryabinin.

[glider@google.com: v9]
  Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Konstantin Serebryany <kcc@google.com>
Cc: Dmitry Chernenkov <dmitryc@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
David Rientjes dfef2ef402 mm, migrate: increment fail count on ENOMEM
If page migration fails due to -ENOMEM, nr_failed should still be
incremented for proper statistics.

This was encountered recently when all page migration vmstats showed 0,
and inferred that migrate_pages() was never called, although in reality
the first page migration failed because compaction_alloc() failed to
find a migration target.

This patch increments nr_failed so the vmstat is properly accounted on
ENOMEM.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605191510230.32658@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Chen Feng 6cd9dc3e75 mm/compaction.c: fix zoneindex in kcompactd()
While testing the kcompactd in my platform 3G MEM only DMA ZONE.  I
found the kcompactd never wakeup.  It seems the zoneindex has already
minus 1 before.  So the traverse here should be <=.

It fixes a regression where kswapd could previously compact, but
kcompactd not.  Not a crash fix though.

[akpm@linux-foundation.org: fix kcompactd_do_work() as well, per Hugh]
Link: http://lkml.kernel.org/r/1463659121-84124-1-git-send-email-puck.chen@hisilicon.com
Fixes: accf62422b ("mm, kswapd: replace kswapd compaction with waking up kcompactd")
Signed-off-by: Chen Feng <puck.chen@hisilicon.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zhuangluan Su <suzhuangluan@hisilicon.com>
Cc: Yiping Xu <xuyiping@hisilicon.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
David Rientjes f050897778 mm, thp: khugepaged should scan when sleep value is written
If a large value is written to scan_sleep_millisecs, for example, that
period must lapse before khugepaged will wake up for periodic
collapsing.

If this value is tuned to 1 day, for example, and then re-tuned to its
default 10s, khugepaged will still wait for a day before scanning again.

This patch causes khugepaged to wakeup immediately when the value is
changed and then sleep until that value is rewritten or the new value
lapses.

Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1605181453200.4786@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
NeilBrown a53eaff8c1 MM: increase safety margin provided by PF_LESS_THROTTLE
When nfsd is exporting a filesystem over NFS which is then NFS-mounted
on the local machine there is a risk of deadlock.  This happens when
there are lots of dirty pages in the NFS filesystem and they cause NFSD
to be throttled, either in throttle_vm_writeout() or in
balance_dirty_pages().

To avoid this problem the PF_LESS_THROTTLE flag is set for NFSD threads
and it provides a 25% increase to the limits that affect NFSD.  Any
process writing to an NFS filesystem will be throttled well before the
number of dirty NFS pages reaches the limit imposed on NFSD, so NFSD
will not deadlock on pages that it needs to write out.  At least it
shouldn't.

All processes are allowed a small excess margin to avoid performing too
many calculations: ratelimit_pages.

ratelimit_pages is set so that if a thread on every CPU uses the entire
margin, the total will only go 3% over the limit, and this is much less
than the 25% bonus that PF_LESS_THROTTLE provides, so this margin
shouldn't be a problem.  But it is.

The "total memory" that these 3% and 25% are calculated against are not
really total memory but are "global_dirtyable_memory()" which doesn't
include anonymous memory, just free memory and page-cache memory.

The "ratelimit_pages" number is based on whatever the
global_dirtyable_memory was on the last CPU hot-plug, which might not be
what you expect, but is probably close to the total freeable memory.

The throttle threshold uses the global_dirtable_memory at the moment
when the throttling happens, which could be much less than at the last
CPU hotplug.  So if lots of anonymous memory has been allocated, thus
pushing out lots of page-cache pages, then NFSD might end up being
throttled due to dirty NFS pages because the "25%" bonus it gets is
calculated against a rather small amount of dirtyable memory, while the
"3%" margin that other processes are allowed to dirty without penalty is
calculated against a much larger number.

To remove this possibility of deadlock we need to make sure that the
margin granted to PF_LESS_THROTTLE exceeds that rate-limit margin.
Simply adding ratelimit_pages isn't enough as that should be multiplied
by the number of cpus.

So add "global_wb_domain.dirty_limit / 32" as that more accurately
reflects the current total over-shoot margin.  This ensures that the
number of dirty NFS pages never gets so high that nfsd will be throttled
waiting for them to be written.

Link: http://lkml.kernel.org/r/87futgowwv.fsf@notabene.neil.brown.name
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Naoya Horiguchi e570f56ccc mm: check_new_page_bad() directly returns in __PG_HWPOISON case
Currently we check page->flags twice for "HWPoisoned" case of
check_new_page_bad(), which can cause a race with unpoisoning.

This race unnecessarily taints kernel with "BUG: Bad page state".
check_new_page_bad() is the only caller of bad_page() which is
interested in __PG_HWPOISON, so let's move the hwpoison related code in
bad_page() to it.

Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jp
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
seokhoon.yoon 29b52de182 mm, kasan: fix to call kasan_free_pages() after poisoning page
When CONFIG_PAGE_POISONING and CONFIG_KASAN is enabled,
free_pages_prepare()'s codeflow is below.

  1)kmemcheck_free_shadow()
  2)kasan_free_pages()
    - set shadow byte of page is freed
  3)kernel_poison_pages()
  3.1) check access to page is valid or not using kasan
    ---> error occur, kasan think it is invalid access
  3.2) poison page
  4)kernel_map_pages()

So kasan_free_pages() should be called after poisoning the page.

Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.com
Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Laura Abbott <labbott@fedoraproject.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Minchan Kim d0834a6c2c mm: disable fault around on emulated access bit architecture
fault_around aims to reduce minor faults of file-backed pages via
speculative ahead pte mapping and relying on readahead logic.  However,
on non-HW access bit architecture the benefit is highly limited because
they should emulate the young bit with minor faults for reclaim's page
aging algorithm.  IOW, we cannot reduce minor faults on those
architectures.

I did quick a test on my ARM machine.

512M file mmap sequential every word read on eSATA drive 4 times.
stddev is stable.

  = fault_around 4096 =
  elapsed time(usec): 6747645

  = fault_around 65536 =
  elapsed time(usec): 6709263

  0.5% gain.

Even when I tested it with eMMC there is no gain because I guess with
slow storage the major fault is the dominant factor.

Also, fault_around has the side effect of shrinking slab more
aggressively and causes higher vmpressure, so if such speculation fails,
it can evict slab more which can result in page I/O (e.g., inode cache).
In the end, it would make void any benefit of fault_around.

So let's make the default "disabled" on those architectures.

Link: http://lkml.kernel.org/r/20160518014229.GB21538@bbox
Signed-off-by: Minchan Kim <minchan@kernel.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Kirill A. Shutemov 5c0a85fad9 mm: make faultaround produce old ptes
Currently, faultaround code produces young pte.  This can screw up
vmscan behaviour[1], as it makes vmscan think that these pages are hot
and not push them out on first round.

During sparse file access faultaround gets more pages mapped and all of
them are young.  Under memory pressure, this makes vmscan swap out anon
pages instead, or to drop other page cache pages which otherwise stay
resident.

Modify faultaround to produce old ptes, so they can easily be reclaimed
under memory pressure.

This can to some extend defeat the purpose of faultaround on machines
without hardware accessed bit as it will not help us with reducing the
number of minor page faults.

We may want to disable faultaround on such machines altogether, but
that's subject for separate patchset.

Minchan:
 "I tested 512M mmap sequential word read test on non-HW access bit
  system (i.e., ARM) and confirmed it doesn't increase minor fault any
  more.

  old: 4096 fault_around
  minor fault: 131291
  elapsed time: 6747645 usec

  new: 65536 fault_around
  minor fault: 131291
  elapsed time: 6709263 usec

  0.56% benefit"

[1] https://lkml.kernel.org/r/1460992636-711-1-git-send-email-vinmenon@codeaurora.org

Link: http://lkml.kernel.org/r/1463488366-47723-1-git-send-email-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Tested-by: Minchan Kim <minchan@kernel.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Stefan Bader 4b50bcc7ed mm: use phys_addr_t for reserve_bootmem_region() arguments
Since commit 92923ca3aa ("mm: meminit: only set page reserved in the
memblock region") the reserved bit is set on reserved memblock regions.
However start and end address are passed as unsigned long.  This is only
32bit on i386, so it can end up marking the wrong pages reserved for
ranges at 4GB and above.

This was observed on a 32bit Xen dom0 which was booted with initial
memory set to a value below 4G but allowing to balloon in memory
(dom0_mem=1024M for example).  This would define a reserved bootmem
region for the additional memory (for example on a 8GB system there was
a reverved region covering the 4GB-8GB range).  But since the addresses
were passed on as unsigned long, this was actually marking all pages
from 0 to 4GB as reserved.

Fixes: 92923ca3aa ("mm: meminit: only set page reserved in the memblock region")
Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.com
Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
Cc: <stable@vger.kernel.org>	[4.2+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Richard Leitner cd33a76b0f mm/memblock.c: remove unnecessary always-true comparison
Comparing an u64 variable to >= 0 returns always true and can therefore
be removed.  This issue was detected using the -Wtype-limits gcc flag.

This patch fixes following type-limits warning:

  mm/memblock.c: In function `__next_reserved_mem_region':
  mm/memblock.c:843:11: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
    if (*idx >= 0 && *idx < type->cnt) {

Link: http://lkml.kernel.org/r/20160510103625.3a7f8f32@g0hl1n.net
Signed-off-by: Richard Leitner <dev@g0hl1n.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Vitaly Wool 9a001fc19c z3fold: the 3-fold allocator for compressed pages
This patch introduces z3fold, a special purpose allocator for storing
compressed pages.  It is designed to store up to three compressed pages
per physical page.  It is a ZBUD derivative which allows for higher
compression ratio keeping the simplicity and determinism of its
predecessor.

This patch comes as a follow-up to the discussions at the Embedded Linux
Conference in San-Diego related to the talk [1].  The outcome of these
discussions was that it would be good to have a compressed page
allocator as stable and deterministic as zbud with with higher
compression ratio.

To keep the determinism and simplicity, z3fold, just like zbud, always
stores an integral number of compressed pages per page, but it can store
up to 3 pages unlike zbud which can store at most 2.  Therefore the
compression ratio goes to around 2.6x while zbud's one is around 1.7x.

The patch is based on the latest linux.git tree.

This version has been updated after testing on various simulators (e.g.
ARM Versatile Express, MIPS Malta, x86_64/Haswell) and basing on
comments from Dan Streetman [3].

[1] https://openiotelc2016.sched.org/event/6DAC/swapping-and-embedded-compression-relieves-the-pressure-vitaly-wool-softprise-consulting-ou
[2] https://lkml.org/lkml/2016/4/21/799
[3] https://lkml.org/lkml/2016/5/4/852

Link: http://lkml.kernel.org/r/20160509151753.ec3f9fda3c9898d31ff52a32@gmail.com
Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Andrea Arcangeli d5ee7c3bcc mm: thp: split_huge_pmd_address() comment improvement
Comment is partly wrong, this improves it by including the case of
split_huge_pmd_address() called by try_to_unmap_one if TTU_SPLIT_HUGE_PMD
is set.

Link: http://lkml.kernel.org/r/1462547040-1737-4-git-send-email-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Christoph Lameter 7b8da4c7f0 vmstat: get rid of the ugly cpu_stat_off variable
The cpu_stat_off variable is unecessary since we can check if a
workqueue request is pending otherwise.  Removal of cpu_stat_off makes
it pretty easy for the vmstat shepherd to ensure that the proper things
happen.

Removing the state also removes all races related to it.  Should a
workqueue not be scheduled as needed for vmstat_update then the shepherd
will notice and schedule it as needed.  Should a workqueue be
unecessarily scheduled then the vmstat updater will disable it.

[akpm@linux-foundation.org: fix indentation, per Michal]
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1605061306460.17934@east.gentwo.org
Signed-off-by: Christoph Lameter <cl@linux.com>
Cc: Tejun Heo <htejun@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Greg Thelen 51038171b7 memcg: fix stale mem_cgroup_force_empty() comment
Commit f61c42a7d9 ("memcg: remove tasks/children test from
mem_cgroup_force_empty()") removed memory reparenting from the function.

Fix the function's comment.

Link: http://lkml.kernel.org/r/1462569810-54496-1-git-send-email-gthelen@google.com
Signed-off-by: Greg Thelen <gthelen@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Minfei Huang 2a138dc7e5 mm: use existing helper to convert "on"/"off" to boolean
It's more convenient to use existing function helper to convert string
"on/off" to boolean.

Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.com
Signed-off-by: Minfei Huang <mnghuan@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Ming Li a4a921aa5c mm/swap.c: put activate_page_pvecs and other pagevecs together
Put the activate_page_pvecs definition next to those of the other
pagevecs, for clarity.

Signed-off-by: Ming Li <mingli199x@qq.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
David Rientjes 297880f4af mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size
The page_counter rounds limits down to page size values.  This makes
sense, except in the case of hugetlb_cgroup where it's not possible to
charge partial hugepages.  If the hugetlb_cgroup margin is less than the
hugepage size being charged, it will fail as expected.

Round the hugetlb_cgroup limit down to hugepage size, since it is the
effective limit of the cgroup.

For consistency, round down PAGE_COUNTER_MAX as well when a
hugetlb_cgroup is created: this prevents error reports when a user
cannot restore the value to the kernel default.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nikolay Borisov <kernel@kyup.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Konstantin Khlebnikov f4fcd55841 mm: enable RLIMIT_DATA by default with workaround for valgrind
Since commit 8463833590 ("mm: rework virtual memory accounting")
RLIMIT_DATA limits both brk() and private mmap() but this's disabled by
default because of incompatibility with older versions of valgrind.

Valgrind always set limit to zero and fails if RLIMIT_DATA is enabled.
Fortunately it changes only rlim_cur and keeps rlim_max for reverting
limit back when needed.

This patch checks current usage also against rlim_max if rlim_cur is
zero.  This is safe because task anyway can increase rlim_cur up to
rlim_max.  Size of brk is still checked against rlim_cur, so this part
is completely compatible - zero rlim_cur forbids brk() but allows
private mmap().

Link: http://lkml.kernel.org/r/56A28613.5070104@de.ibm.com
Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Yongji Xie d5957d2fc2 mm: fix incorrect pfn passed to untrack_pfn() in remap_pfn_range()
We use generic hooks in remap_pfn_range() to help archs to track pfnmap
regions.  The code is something like:

  int remap_pfn_range()
  {
	...
	track_pfn_remap(vma, &prot, pfn, addr, PAGE_ALIGN(size));
	...
	pfn -= addr >> PAGE_SHIFT;
	...
	untrack_pfn(vma, pfn, PAGE_ALIGN(size));
	...
  }

Here we can easily find the pfn is changed but not recovered before
untrack_pfn() is called.  That's incorrect.

There are no known runtime effects - this is from inspection.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Chris Wilson 80c4bd7a5e mm/vmalloc: keep a separate lazy-free list
When mixing lots of vmallocs and set_memory_*() (which calls
vm_unmap_aliases()) I encountered situations where the performance
degraded severely due to the walking of the entire vmap_area list each
invocation.

One simple improvement is to add the lazily freed vmap_area to a
separate lockless free list, such that we then avoid having to walk the
full list on each purge.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Roman Pen <r.peniaev@gmail.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Roman Pen <r.peniaev@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Shawn Lin <shawn.lin@rock-chips.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Alexander Kuleshov f705ac4b39 mm/memblock.c: move memblock_{add,reserve}_region into memblock_{add,reserve}
memblock_add_region() and memblock_reserve_region() do nothing specific
before the call of memblock_add_range(), only print debug output.

We can do the same in memblock_add() and memblock_reserve() since both
memblock_add_region() and memblock_reserve_region() are not used by
anybody outside of memblock.c and memblock_{add,reserve}() have the same
set of flags and nids.

Since memblock_add_region() and memblock_reserve_region() will be
inlined, there will not be functional changes, but will improve code
readability a little.

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Tang Chen <tangchen@cn.fujitsu.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Chen Yucong 495367c051 mm/memory-failure.c: replace "MCE" with "Memory failure"
HWPoison was specific to some particular x86 platforms.  And it is often
seen as high level machine check handler.  And therefore, 'MCE' is used
for the format prefix of printk().  However, 'PowerNV' has also used
HWPoison for handling memory errors[1], so 'MCE' is no longer suitable
to memory_failure.c.

Additionally, 'MCE' and 'Memory failure' have different context.  The
former belongs to exception context and the latter belongs to process
context.  Furthermore, HWPoison can also be used for off-lining those
sub-health pages that do not trigger any machine check exception.

This patch aims to replace 'MCE' with a more appropriate prefix.

[1] commit 75eb3d9b60 ("powerpc/powernv: Get FSP memory errors
and plumb into memory poison infrastructure.")

Signed-off-by: Chen Yucong <slaoub@gmail.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Yang Shi 340a43bed6 mm: thp: simplify the implementation of mk_huge_pmd()
The implementation of mk_huge_pmd looks verbose, it could be just
simplified to one line code.

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Tetsuo Handa f44666b046 mm,oom: speed up select_bad_process() loop
Since commit 3a5dda7a17 ("oom: prevent unnecessary oom kills or kernel
panics"), select_bad_process() is using for_each_process_thread().

Since oom_unkillable_task() scans all threads in the caller's thread
group and oom_task_origin() scans signal_struct of the caller's thread
group, we don't need to call oom_unkillable_task() and oom_task_origin()
on each thread.  Also, since !mm test will be done later at
oom_badness(), we don't need to do !mm test on each thread.  Therefore,
we only need to do TIF_MEMDIE test on each thread.

Although the original code was correct it was quite inefficient because
each thread group was scanned num_threads times which can be a lot
especially with processes with many threads.  Even though the OOM is
extremely cold path it is always good to be as effective as possible
when we are inside rcu_read_lock() - aka unpreemptible context.

If we track number of TIF_MEMDIE threads inside signal_struct, we don't
need to do TIF_MEMDIE test on each thread.  This will allow
select_bad_process() to use for_each_process().

This patch adds a counter to signal_struct for tracking how many
TIF_MEMDIE threads are in a given thread group, and check it at
oom_scan_process_thread() so that select_bad_process() can use
for_each_process() rather than for_each_process_thread().

[mhocko@suse.com: do not blow the signal_struct size]
  Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko ec8d7c14ea mm, oom_reaper: do not mmput synchronously from the oom reaper context
Tetsuo has properly noted that mmput slow path might get blocked waiting
for another party (e.g.  exit_aio waits for an IO).  If that happens the
oom_reaper would be put out of the way and will not be able to process
next oom victim.  We should strive for making this context as reliable
and independent on other subsystems as much as possible.

Introduce mmput_async which will perform the slow path from an async
(WQ) context.  This will delay the operation but that shouldn't be a
problem because the oom_reaper has reclaimed the victim's address space
for most cases as much as possible and the remaining context shouldn't
bind too much memory anymore.  The only exception is when mmap_sem
trylock has failed which shouldn't happen too often.

The issue is only theoretical but not impossible.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko bb8a4b7fd1 mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully
Commit 36324a990c ("oom: clear TIF_MEMDIE after oom_reaper managed to
unmap the address space") not only clears TIF_MEMDIE for oom reaped task
but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the
oom killer.  This works in simple cases but it is not sufficient for
(unlikely) cases where the mm is shared between independent processes
(as they do not share signal struct).  If the mm had only small amount
of memory which could be reaped then another task sharing the mm could
be selected and that wouldn't help to move out from the oom situation.

Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same
as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set.  Set the
flag after __oom_reap_task is done with a task.  This will force the
select_bad_process() to ignore all already oom reaped tasks as well as
no such task is sacrificed for its parent.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko 31e49bfda1 mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION
Joonsoo has reported that he is able to trigger OOM for !costly high
order requests (heavy fork() workload close the OOM) with the new oom
detection rework.  This is because we rely only on should_reclaim_retry
when the compaction is disabled and it only checks watermarks for the
requested order and so we might trigger OOM when there is a lot of free
memory.

It is not very clear what are the usual workloads when the compaction is
disabled.  Relying on high order allocations heavily without any
mechanism to create those orders except for unbound amount of reclaim is
certainly not a good idea.

To prevent from potential regressions let's help this configuration
some.  We have to sacrifice the determinsm though because there simply
is none here possible.  should_compact_retry implementation for
!CONFIG_COMPACTION, which was empty so far, will do watermark check for
order-0 on all eligible zones.  This will cause retrying until either
the reclaim cannot make any further progress or all the zones are
depleted even for order-0 pages.  This means that the number of retries
is basically unbounded for !costly orders but that was the case before
the rework as well so this shouldn't regress.

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.org
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko 86a294a81f mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders
"mm: consider compaction feedback also for costly allocation" has
removed the upper bound for the reclaim/compaction retries based on the
number of reclaimed pages for costly orders.  While this is desirable
the patch did miss a mis interaction between reclaim, compaction and the
retry logic.  The direct reclaim tries to get zones over min watermark
while compaction backs off and returns COMPACT_SKIPPED when all zones
are below low watermark + 1<<order gap.  If we are getting really close
to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
high order request (e.g.  hugetlb order-9) while the reclaim is not able
to release enough pages to get us over low watermark.  The reclaim is
still able to make some progress (usually trashing over few remaining
pages) so we are not able to break out from the loop.

I have seen this happening with the same test described in "mm: consider
compaction feedback also for costly allocation" on a swapless system.
The original problem got resolved by "vmscan: consider classzone_idx in
compaction_ready" but it shows how things might go wrong when we
approach the oom event horizont.

The reason why compaction requires being over low rather than min
watermark is not clear to me.  This check was there essentially since
56de7263fc ("mm: compaction: direct compact when a high-order
allocation fails").  It is clearly an implementation detail though and
we shouldn't pull it into the generic retry logic while we should be
able to cope with such eventuality.  The only place in
should_compact_retry where we retry without any upper bound is for
compaction_withdrawn() case.

Introduce compaction_zonelist_suitable function which checks the given
zonelist and returns true only if there is at least one zone which would
would unblock __compaction_suitable if more memory got reclaimed.  In
this implementation it checks __compaction_suitable with NR_FREE_PAGES
plus part of the reclaimable memory as the target for the watermark
check.  The reclaimable memory is reduced linearly by the allocation
order.  The idea is that we do not want to reclaim all the remaining
memory for a single allocation request just unblock
__compaction_suitable which doesn't guarantee we will make a further
progress.

The new helper is then used if compaction_withdrawn() feedback was
provided so we do not retry if there is no outlook for a further
progress.  !costly requests shouldn't be affected much - e.g.  order-2
pages would require to have at least 64kB on the reclaimable LRUs while
order-9 would need at least 32M which should be enough to not lock up.

[vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
[akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko 7854ea6c28 mm: consider compaction feedback also for costly allocation
PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
should_reclaim_retry currently where we decide to not retry after at
least order worth of pages were reclaimed or the watermark check for at
least one zone would succeed after reclaiming all pages if the reclaim
hasn't made any progress.  Compaction feedback is mostly ignored and we
just try to make sure that the compaction did at least something before
giving up.

The first condition was added by a41f24ea9f ("page allocator: smarter
retry of costly-order allocations) and it assumed that lumpy reclaim
could have created a page of the sufficient order.  Lumpy reclaim, has
been removed quite some time ago so the assumption doesn't hold anymore.
Remove the check for the number of reclaimed pages and rely on the
compaction feedback solely.  should_reclaim_retry now only makes sure
that we keep retrying reclaim for high order pages only if they are
hidden by watermaks so order-0 reclaim makes really sense.

should_compact_retry now keeps retrying even for the costly allocations.
The number of retries is reduced wrt.  !costly requests because they are
less important and harder to grant and so their pressure shouldn't cause
contention for other requests or cause an over reclaim.  We also do not
reset no_progress_loops for costly request to make sure we do not keep
reclaiming too agressively.

This has been tested by running a process which fragments memory:
	- compact memory
	- mmap large portion of the memory (1920M on 2GRAM machine with 2G
	  of swapspace)
	- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
	  steps until certain amount of memory is freed (250M in my test)
	  and reduce the step to (step / 2) + 1 after reaching the end of
	  the mapping
	- then run a script which populates the page cache 2G (MemTotal)
	  from /dev/zero to a new file
And then tries to allocate
nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
huge pages.

root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
Node 0, zone      DMA     31     28     31     10      2      0      2      1      2      3      1
Node 0, zone    DMA32    437    319    171     50     28     25     20     16     16     14    437

* This is the /proc/buddyinfo after the compaction

Done fragmenting. size=2013265920 freed=262144000
Node 0, zone      DMA    165     48      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32  35109  14575    185     51     41     12      6      0      0      0      0

* /proc/buddyinfo after memory got fragmented

Executing "/root/alloc_hugepages.sh"
Eating some pagecache
508623+0 records in
508623+0 records out
2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32    111    344    153     20     24     10      3      0      0      0      0

* /proc/buddyinfo after page cache got eaten

Trying to allocate 129
129

* 129 hugepages requested and all of them granted.

Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
Node 0, zone    DMA32    127     97     30     99     11      6      2      1      4      0      0

* /proc/buddyinfo after hugetlb allocation.

10 runs will behave as follows:
Trying to allocate 130
130
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129
--
Trying to allocate 132
132
--
Trying to allocate 129
129
--
Trying to allocate 128
128
--
Trying to allocate 129
129

So basically 100% success for all 10 attempts.
Without the patch numbers looked much worse:
Trying to allocate 128
12
--
Trying to allocate 129
14
--
Trying to allocate 129
7
--
Trying to allocate 129
16
--
Trying to allocate 129
30
--
Trying to allocate 129
38
--
Trying to allocate 129
19
--
Trying to allocate 129
37
--
Trying to allocate 129
28
--
Trying to allocate 129
37

Just for completness the base kernel without oom detection rework looks
as follows:
Trying to allocate 127
30
--
Trying to allocate 129
12
--
Trying to allocate 129
52
--
Trying to allocate 128
32
--
Trying to allocate 129
12
--
Trying to allocate 129
10
--
Trying to allocate 129
32
--
Trying to allocate 128
14
--
Trying to allocate 128
16
--
Trying to allocate 129
8

As we can see the success rate is much more volatile and smaller without
this patch. So the patch not only makes the retry logic for costly
requests more sensible the success rate is even higher.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko 33c2d21438 mm, oom: protect !costly allocations some more
should_reclaim_retry will give up retries for higher order allocations
if none of the eligible zones has any requested or higher order pages
available even if we pass the watermak check for order-0.  This is done
because there is no guarantee that the reclaimable and currently free
pages will form the required order.

This can, however, lead to situations where the high-order request (e.g.
order-2 required for the stack allocation during fork) will trigger OOM
too early - e.g.  after the first reclaim/compaction round.  Such a
system would have to be highly fragmented and there is no guarantee
further reclaim/compaction attempts would help but at least make sure
that the compaction was active before we go OOM and keep retrying even
if should_reclaim_retry tells us to oom if

	- the last compaction round backed off or
	- we haven't completed at least MAX_COMPACT_RETRIES active
	  compaction rounds.

The first rule ensures that the very last attempt for compaction was not
ignored while the second guarantees that the compaction has done some
work.  Multiple retries might be needed to prevent occasional pigggy
backing of other contexts to steal the compacted pages before the
current context manages to retry to allocate them.

compaction_failed() is taken as a final word from the compaction that
the retry doesn't make much sense.  We have to be careful though because
the first compaction round is MIGRATE_ASYNC which is rather weak as it
ignores pages under writeback and gives up too easily in other
situations.  We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
has been used before we give up.  With this logic in place we do not
have to increase the migration mode unconditionally and rather do it
only if the compaction failed for the weaker mode.  A nice side effect
is that the stronger migration mode is used only when really needed so
this has a potential of smaller latencies in some cases.

Please note that the compaction doesn't tell us much about how
successful it was when returning compaction_made_progress so we just
have to blindly trust that another retry is worthwhile and cap the
number to something reasonable to guarantee a convergence.

If the given number of successful retries is not sufficient for a
reasonable workloads we should focus on the collected compaction
tracepoints data and try to address the issue in the compaction code.
If this is not feasible we can increase the retries limit.

[mhocko@suse.com: fix warning]
  Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko ede3771373 mm: throttle on IO only when there are too many dirty and writeback pages
wait_iff_congested has been used to throttle allocator before it retried
another round of direct reclaim to allow the writeback to make some
progress and prevent reclaim from looping over dirty/writeback pages
without making any progress.

We used to do congestion_wait before commit 0e093d9976 ("writeback: do
not sleep on the congestion queue if there are no congested BDIs or if
significant congestion is not being encountered in the current zone")
but that led to undesirable stalls and sleeping for the full timeout
even when the BDI wasn't congested.  Hence wait_iff_congested was used
instead.

But it seems that even wait_iff_congested doesn't work as expected.  We
might have a small file LRU list with all pages dirty/writeback and yet
the bdi is not congested so this is just a cond_resched in the end and
can end up triggering pre mature OOM.

This patch replaces the unconditional wait_iff_congested by
congestion_wait which is executed only if we _know_ that the last round
of direct reclaim didn't make any progress and dirty+writeback pages are
more than a half of the reclaimable pages on the zone which might be
usable for our target allocation.  This shouldn't reintroduce stalls
fixed by 0e093d9976 because congestion_wait is called only when we are
getting hopeless when sleeping is a better choice than OOM with many
pages under IO.

We have to preserve logic introduced by commit 373ccbe592 ("mm,
vmstat: allow WQ concurrency to discover memory reclaim doesn't make any
progress") into the __alloc_pages_slowpath now that wait_iff_congested
is not used anymore.  As the only remaining user of wait_iff_congested
is shrink_inactive_list we can remove the WQ specific short sleep from
wait_iff_congested because the sleep is needed to be done only once in
the allocation retry cycle.

[mhocko@suse.com: high_zoneidx->ac_classzone_idx to evaluate memory reserves properly]
 Link: http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko 0a0337e0d1 mm, oom: rework oom detection
__alloc_pages_slowpath has traditionally relied on the direct reclaim
and did_some_progress as an indicator that it makes sense to retry
allocation rather than declaring OOM.  shrink_zones had to rely on
zone_reclaimable if shrink_zone didn't make any progress to prevent from
a premature OOM killer invocation - the LRU might be full of dirty or
writeback pages and direct reclaim cannot clean those up.

zone_reclaimable allows to rescan the reclaimable lists several times
and restart if a page is freed.  This is really subtle behavior and it
might lead to a livelock when a single freed page keeps allocator
looping but the current task will not be able to allocate that single
page.  OOM killer would be more appropriate than looping without any
progress for unbounded amount of time.

This patch changes OOM detection logic and pulls it out from shrink_zone
which is too low to be appropriate for any high level decisions such as
OOM which is per zonelist property.  It is __alloc_pages_slowpath which
knows how many attempts have been done and what was the progress so far
therefore it is more appropriate to implement this logic.

The new heuristic is implemented in should_reclaim_retry helper called
from __alloc_pages_slowpath.  It tries to be more deterministic and
easier to follow.  It builds on an assumption that retrying makes sense
only if the currently reclaimable memory + free pages would allow the
current allocation request to succeed (as per __zone_watermark_ok) at
least for one zone in the usable zonelist.

This alone wouldn't be sufficient, though, because the writeback might
get stuck and reclaimable pages might be pinned for a really long time
or even depend on the current allocation context.  Therefore there is a
backoff mechanism implemented which reduces the reclaim target after
each reclaim round without any progress.  This means that we should
eventually converge to only NR_FREE_PAGES as the target and fail on the
wmark check and proceed to OOM.  The backoff is simple and linear with
1/16 of the reclaimable pages for each round without any progress.  We
are optimistic and reset counter for successful reclaim rounds.

Costly high order pages mostly preserve their semantic and those without
__GFP_REPEAT fail right away while those which have the flag set will
back off after the amount of reclaimable pages reaches equivalent of the
requested order.  The only difference is that if there was no progress
during the reclaim we rely on zone watermark check.  This is more
logical thing to do than previous 1<<order attempts which were a result
of zone_reclaimable faking the progress.

[vdavydov@virtuozzo.com: check classzone_idx for shrink_zone]
[hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
[rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
[rientjes@google.com: shrink_zones doesn't need to return anything]
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko c5d01d0d18 mm, compaction: simplify __alloc_pages_direct_compact feedback interface
__alloc_pages_direct_compact communicates potential back off by two
variables:
	- deferred_compaction tells that the compaction returned
	  COMPACT_DEFERRED
	- contended_compaction is set when there is a contention on
	  zone->lock resp. zone->lru_lock locks

__alloc_pages_slowpath then backs of for THP allocation requests to
prevent from long stalls. This is rather messy and it would be much
cleaner to return a single compact result value and hide all the nasty
details into __alloc_pages_direct_compact.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko c8f7de0bfa mm, compaction: distinguish between full and partial COMPACT_COMPLETE
COMPACT_COMPLETE now means that compaction and free scanner met.  This
is not very useful information if somebody just wants to use this
feedback and make any decisions based on that.  The current caller might
be a poor guy who just happened to scan tiny portion of the zone and
that could be the reason no suitable pages were compacted.  Make sure we
distinguish the full and partial zone walks.

Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
and be optimistic in retrying.

The existing users of COMPACT_COMPLETE are conservatively changed to use
COMPACT_PARTIAL_SKIPPED as well but some of them should be probably
reconsidered and only defer the compaction only for COMPACT_COMPLETE
with the new semantic.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko 1d4746d395 mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED
try_to_compact_pages() can currently return COMPACT_SKIPPED even when
the compaction is defered for some zone just because zone DMA is skipped
in 99% of cases due to watermark checks.  This makes COMPACT_DEFERRED
basically unusable for the page allocator as a feedback mechanism.

Make sure we distinguish those two states properly and switch their
ordering in the enum.  This would mean that the COMPACT_SKIPPED will be
returned only when all eligible zones are skipped.

As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
will be more precise and we would bail out rather than reclaim.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko c46649deae mm, compaction: cover all compaction mode in compact_zone
The compiler is complaining after "mm, compaction: change COMPACT_
constants into enum"

  mm/compaction.c: In function `compact_zone':
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_DEFERRED' not handled in switch [-Wswitch]
    switch (ret) {
    ^
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_COMPLETE' not handled in switch [-Wswitch]
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NO_SUITABLE_PAGE' not handled in switch [-Wswitch]
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_NOT_SUITABLE_ZONE' not handled in switch [-Wswitch]
  mm/compaction.c:1350:2: warning: enumeration value `COMPACT_CONTENDED' not handled in switch [-Wswitch]

compaction_suitable is allowed to return only COMPACT_PARTIAL,
COMPACT_SKIPPED and COMPACT_CONTINUE so other cases are simply
impossible.  Put a VM_BUG_ON to catch an impossible return value.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko ea7ab982b6 mm, compaction: change COMPACT_ constants into enum
Compaction code is doing weird dances between COMPACT_FOO -> int ->
unsigned long

But there doesn't seem to be any reason for that.  All functions which
return/use one of those constants are not expecting any other value so it
really makes sense to define an enum for them and make it clear that no
other values are expected.

This is a pure cleanup and shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Michal Hocko b6459cc154 vmscan: consider classzone_idx in compaction_ready
Motivation:
As pointed out by Linus [2][3] relying on zone_reclaimable as a way to
communicate the reclaim progress is rater dubious. I tend to agree,
not only it is really obscure, it is not hard to imagine cases where a
single page freed in the loop keeps all the reclaimers looping without
getting any progress because their gfp_mask wouldn't allow to get that
page anyway (e.g. single GFP_ATOMIC alloc and free loop). This is rather
rare so it doesn't happen in the practice but the current logic which we
have is rather obscure and hard to follow a also non-deterministic.

This is an attempt to make the OOM detection more deterministic and
easier to follow because each reclaimer basically tracks its own
progress which is implemented at the page allocator layer rather spread
out between the allocator and the reclaim.  The more on the
implementation is described in the first patch.

I have tested several different scenarios but it should be clear that
testing OOM killer is quite hard to be representative.  There is usually
a tiny gap between almost OOM and full blown OOM which is often time
sensitive.  Anyway, I have tested the following 2 scenarios and I would
appreciate if there are more to test.

Testing environment: a virtual machine with 2G of RAM and 2CPUs without
any swap to make the OOM more deterministic.

1) 2 writers (each doing dd with 4M blocks to an xfs partition with 1G
   file size, removes the files and starts over again) running in
   parallel for 10s to build up a lot of dirty pages when 100 parallel
   mem_eaters (anon private populated mmap which waits until it gets
   signal) with 80M each.

   This causes an OOM flood of course and I have compared both patched
   and unpatched kernels. The test is considered finished after there
   are no OOM conditions detected. This should tell us whether there are
   any excessive kills or some of them premature (e.g. due to dirty pages):

I have performed two runs this time each after a fresh boot.

* base kernel
$ grep "Out of memory:" base-oom-run1.log | wc -l
78
$ grep "Out of memory:" base-oom-run2.log | wc -l
78

$ grep "Kill process" base-oom-run1.log | tail -n1
[   91.391203] Out of memory: Kill process 3061 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" base-oom-run2.log | tail -n1
[   82.141919] Out of memory: Kill process 3086 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" base-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5376.00 max: 6776.00 avg: 5530.75 std: 166.50 nr: 61
$ grep "DMA32 free:" base-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5416.00 max: 5608.00 avg: 5514.15 std: 42.94 nr: 52

$ grep "DMA32.*all_unreclaimable? no" base-oom-run1.log | wc -l
1
$ grep "DMA32.*all_unreclaimable? no" base-oom-run2.log | wc -l
3

* patched kernel
$ grep "Out of memory:" patched-oom-run1.log | wc -l
78
miso@tiehlicka /mnt/share/devel/miso/kvm $ grep "Out of memory:" patched-oom-run2.log | wc -l
77

e grep "Kill process" patched-oom-run1.log | tail -n1
[  497.317732] Out of memory: Kill process 3108 (mem_eater) score 39 or sacrifice child
$ grep "Kill process" patched-oom-run2.log | tail -n1
[  316.169920] Out of memory: Kill process 3093 (mem_eater) score 39 or sacrifice child

$ grep "DMA32 free:" patched-oom-run1.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5420.00 max: 5808.00 avg: 5513.90 std: 60.45 nr: 78
$ grep "DMA32 free:" patched-oom-run2.log | sed 's@.*free:\([0-9]*\)kB.*@\1@' | calc_min_max.awk
min: 5380.00 max: 6384.00 avg: 5520.94 std: 136.84 nr: 77

e grep "DMA32.*all_unreclaimable? no" patched-oom-run1.log | wc -l
2
$ grep "DMA32.*all_unreclaimable? no" patched-oom-run2.log | wc -l
3

The patched kernel run noticeably longer while invoking OOM killer same
number of times. This means that the original implementation is much
more aggressive and triggers the OOM killer sooner. free pages stats
show that neither kernels went OOM too early most of the time, though. I
guess the difference is in the backoff when retries without any progress
do sleep for a while if there is memory under writeback or dirty which
is highly likely considering the parallel IO.
Both kernels have seen races where zone wasn't marked unreclaimable
and we still hit the OOM killer. This is most likely a race where
a task managed to exit between the last allocation attempt and the oom
killer invocation.

2) 2 writers again with 10s of run and then 10 mem_eaters to consume as much
   memory as possible without triggering the OOM killer. This required a lot
   of tuning but I've considered 3 consecutive runs in three different boots
   without OOM as a success.

* base kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(16*1024)}' /proc/meminfo)

* patched kernel
size=$(awk '/MemFree/{printf "%dK", ($2/10)-(12*1024)}' /proc/meminfo)

That means 40M more memory was usable without triggering OOM killer. The
base kernel sometimes managed to handle the same as patched but it
wasn't consistent and failed in at least on of the 3 runs. This seems
like a minor improvement.

I was testing also GPF_REPEAT costly requests (hughetlb) with fragmented
memory and under memory pressure. The results are in patch 11 where the
logic is implemented. In short I can see huge improvement there.

I am certainly interested in other usecases as well as well as any
feedback. Especially those which require higher order requests.

This patch (of 14):

While playing with the oom detection rework [1] I have noticed that my
heavy order-9 (hugetlb) load close to OOM ended up in an endless loop
where the reclaim hasn't made any progress but did_some_progress didn't
reflect that and compaction_suitable was backing off because no zone is
above low wmark + 1 << order.

It turned out that this is in fact an old standing bug in
compaction_ready which ignores the requested_highidx and did the
watermark check for 0 classzone_idx.  This succeeds for zone DMA most
of the time as the zone is mostly unused because of lowmem protection.
As a result costly high order allocatios always report a successfull
progress even when there was none.  This wasn't a problem so far
because these allocations usually fail quite early or retry only few
times with __GFP_REPEAT but this will change after later patch in this
series so make sure to not lie about the progress and propagate
requested_highidx down to compaction_ready and use it for both the
watermak check and compaction_suitable to fix this issue.

[1] http://lkml.kernel.org/r/1459855533-4600-1-git-send-email-mhocko@kernel.org
[2] https://lkml.org/lkml/2015/10/12/808
[3] https://lkml.org/lkml/2015/10/13/597

Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Joonsoo Kim <js1304@gmail.com>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Rik van Riel 59dc76b0d4 mm: vmscan: reduce size of inactive file list
The inactive file list should still be large enough to contain readahead
windows and freshly written file data, but it no longer is the only
source for detecting multiple accesses to file pages.  The workingset
refault measurement code causes recently evicted file pages that get
accessed again after a shorter interval to be promoted directly to the
active list.

With that mechanism in place, we can afford to (on a larger system)
dedicate more memory to the active file list, so we can actually cache
more of the frequently used file pages in memory, and not have them
pushed out by streaming writes, once-used streaming file reads, etc.

This can help things like database workloads, where only half the page
cache can currently be used to cache the database working set.  This
patch automatically increases that fraction on larger systems, using the
same ratio that has already been used for anonymous memory.

[hannes@cmpxchg.org: cgroup-awareness]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Johannes Weiner bbddabe2e4 mm: filemap: only do access activations on reads
Andres observed that his database workload is struggling with the
transaction journal creating pressure on frequently read pages.

Access patterns like transaction journals frequently write the same
pages over and over, but in the majority of cases those pages are never
read back.  There are no caching benefits to be had for those pages, so
activating them and having them put pressure on pages that do benefit
from caching is a bad choice.

Leave page activations to read accesses and don't promote pages based on
writes alone.

It could be said that partially written pages do contain cache-worthy
data, because even if *userspace* does not access the unwritten part,
the kernel still has to read it from the filesystem for correctness.
However, a counter argument is that these pages enjoy at least *some*
protection over other inactive file pages through the writeback cache,
in the sense that dirty pages are written back with a delay and cache
reclaim leaves them alone until they have been written back to disk.
Should that turn out to be insufficient and we see increased read IO
from partial writes under memory pressure, we can always go back and
update grab_cache_page_write_begin() to take (pos, len) so that it can
tell partial writes from pages that don't need partial reads.  But for
now, keep it simple.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Rik van Riel f0281a00fe mm: workingset: only do workingset activations on reads
This is a follow-up to

  http://www.spinics.net/lists/linux-mm/msg101739.html

where Andres reported his database workingset being pushed out by the
minimum size enforcement of the inactive file list - currently 50% of
cache - as well as repeatedly written file pages that are never actually
read.

Two changes fell out of the discussions.  The first change observes that
pages that are only ever written don't benefit from caching beyond what
the writeback cache does for partial page writes, and so we shouldn't
promote them to the active file list where they compete with pages whose
cached data is actually accessed repeatedly.  This change comes in two
patches - one for in-cache write accesses and one for refaults triggered
by writes, neither of which should promote a cache page.

Second, with the refault detection we don't need to set 50% of the cache
aside for used-once cache anymore since we can detect frequently used
pages even when they are evicted between accesses.  We can allow the
active list to be bigger and thus protect a bigger workingset that isn't
challenged by streamers.  Depending on the access patterns, this can
increase major faults during workingset transitions for better
performance during stable phases.

This patch (of 3):

When rewriting a page, the data in that page is replaced with new data.
This means that evicting something else from the active file list, in
order to cache data that will be replaced by something else, is likely
to be a waste of memory.

It is better to save the active list for frequently read pages, because
reads actually use the data that is in the page.

This patch ignores partial writes, because it is unclear whether the
complexity of identifying those is worth any potential performance gain
obtained from better caching pages that see repeated partial writes at
large enough intervals to not get caught by the use-twice promotion code
used for the inactive file list.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-20 17:58:30 -07:00
Mel Gorman 4741526b83 mm, page_alloc: restore the original nodemask if the fast path allocation failed
The page allocator fast path uses either the requested nodemask or
cpuset_current_mems_allowed if cpusets are enabled.  If the allocation
context allows watermarks to be ignored then it can also ignore memory
policies.  However, on entering the allocator slowpath the nodemask may
still be cpuset_current_mems_allowed and the policies are enforced.
This patch resets the nodemask appropriately before entering the
slowpath.

Link: http://lkml.kernel.org/r/20160504143628.GU2858@techsingularity.net
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00
Vlastimil Babka 4e6118016e mm, page_alloc: uninline the bad page part of check_new_page()
Bad pages should be rare so the code handling them doesn't need to be
inline for performance reasons.  Put it to separate function which
returns void.  This also assumes that the initial page_expected_state()
result will match the result of the thorough check, i.e.  the page
doesn't become "good" in the meanwhile.  This matches the same
expectations already in place in free_pages_check().

!DEBUG_VM bloat-o-meter:

  add/remove: 1/0 grow/shrink: 0/1 up/down: 134/-274 (-140)
  function                                     old     new   delta
  check_new_page_bad                             -     134    +134
  get_page_from_freelist                      3468    3194    -274

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-05-19 19:12:14 -07:00