From 7258ae5c5a2ce2f5969e8b18b881be40ab55433d Mon Sep 17 00:00:00 2001
From: James Morse
Date: Fri, 16 Jun 2017 14:02:29 -0700
Subject: [PATCH 1/5] mm/memory-failure.c: use compound_head() flags for huge pages

memory_failure() chooses a recovery action function based on the page
flags.  For huge pages it uses the tail page flags, which don't have
anything interesting set, resulting in:

> Memory failure: 0x9be3b4: Unknown page state
> Memory failure: 0x9be3b4: recovery action for unknown page: Failed

Instead, save a copy of the head page's flags if this is a huge page;
this means that if there are no relevant flags set on this tail page,
we use the head page's flags instead.  This results in the
me_huge_page() recovery action being called:

> Memory failure: 0x9b7969: recovery action for huge page: Delayed

For hugepages that have not yet been allocated, this allows the hugepage
to be dequeued.

Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages")
Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.com
Signed-off-by: James Morse
Tested-by: Punit Agrawal
Acked-by: Punit Agrawal
Acked-by: Naoya Horiguchi
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/memory-failure.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 342fac9ba89b..ecc183fd94f3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1184,7 +1184,10 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
 	 * correctly, we save a copy of the page flags at this time.
 	 */
-	page_flags = p->flags;
+	if (PageHuge(p))
+		page_flags = hpage->flags;
+	else
+		page_flags = p->flags;
 
 	/*
 	 * unpoison always clear PG_hwpoison inside page lock

From ef70762948dde012146926720b70e79736336764 Mon Sep 17 00:00:00 2001
From: Yu Zhao
Date: Fri, 16 Jun 2017 14:02:31 -0700
Subject: [PATCH 2/5] swap: cond_resched in swap_cgroup_prepare()

I saw need_resched() warnings when swapping on a large swapfile (TBs)
because continuously allocating many pages in swap_cgroup_prepare() took
too long.

We already cond_resched() when freeing a page in swap_cgroup_swapoff().
Do the same for the page allocation.

Link: http://lkml.kernel.org/r/20170604200109.17606-1-yuzhao@google.com
Signed-off-by: Yu Zhao
Acked-by: Michal Hocko
Acked-by: Vladimir Davydov
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/swap_cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index ac6318a064d3..3405b4ee1757 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -48,6 +48,9 @@ static int swap_cgroup_prepare(int type)
 		if (!page)
 			goto not_enough_page;
 		ctrl->map[idx] = page;
+
+		if (!(idx % SWAP_CLUSTER_MAX))
+			cond_resched();
 	}
 	return 0;
 not_enough_page:

From 3c226c637b69104f6b9f1c6ec5b08d7b741b3229 Mon Sep 17 00:00:00 2001
From: Mark Rutland
Date: Fri, 16 Jun 2017 14:02:34 -0700
Subject: [PATCH 3/5] mm: numa: avoid waiting on freed migrated pages

In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
waiting until the pmd is unlocked before we return and retry.  However,
we can race with migrate_misplaced_transhuge_page():

// do_huge_pmd_numa_page                // migrate_misplaced_transhuge_page()
// Holds 0 refs on page                 // Holds 2 refs on page

vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
/* ... */
if (pmd_trans_migrating(*vmf->pmd)) {
        page = pmd_page(*vmf->pmd);
        spin_unlock(vmf->ptl);
                                        ptl = pmd_lock(mm, pmd);
                                        if (page_count(page) != 2) {
                                                /* roll back */
                                        }
                                        /* ... */
                                        mlock_migrate_page(new_page, page);
                                        /* ... */
                                        spin_unlock(ptl);
                                        put_page(page);
                                        put_page(page); // page freed here
        wait_on_page_locked(page);
        goto out;
}

This can result in the freed page having its waiters flag set
unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
page alloc/free functions.  This has been observed on arm64 KVM guests.

We can avoid this by having do_huge_pmd_numa_page() take a reference on
the page before dropping the pmd lock, mirroring what we do in
__migration_entry_wait().

When we hit the race, migrate_misplaced_transhuge_page() will see the
reference and abort the migration, as it may do today in other cases.

Fixes: b8916634b77bffb2 ("mm: Prevent parallel splits during THP migration")
Link: http://lkml.kernel.org/r/1497349722-6731-2-git-send-email-will.deacon@arm.com
Signed-off-by: Mark Rutland
Signed-off-by: Will Deacon
Acked-by: Steve Capper
Acked-by: Kirill A. Shutemov
Acked-by: Vlastimil Babka
Cc: Mel Gorman
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 mm/huge_memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a84909cf20d3..88c6167f194d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1426,8 +1426,11 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	 */
 	if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
 		page = pmd_page(*vmf->pmd);
+		if (!get_page_unless_zero(page))
+			goto out_unlock;
 		spin_unlock(vmf->ptl);
 		wait_on_page_locked(page);
+		put_page(page);
 		goto out;
 	}
 
@@ -1459,9 +1462,12 @@ int do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 
 	/* Migration could have started since the pmd_trans_migrating check */
 	if (!page_locked) {
+		page_nid = -1;
+		if (!get_page_unless_zero(page))
+			goto out_unlock;
 		spin_unlock(vmf->ptl);
 		wait_on_page_locked(page);
-		page_nid = -1;
+		put_page(page);
 		goto out;
 	}
 

From 64c2b20301f62c697352c8028c569b1b2bdd8e82 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli
Date: Fri, 16 Jun 2017 14:02:37 -0700
Subject: [PATCH 4/5] userfaultfd: shmem: handle coredumping in handle_userfault()

Anon and hugetlbfs handle FOLL_DUMP, set by get_dump_page(), internally
to __get_user_pages().  shmem, by contrast, has no special FOLL_DUMP
handling there, so handle_mm_fault() is invoked without mmap_sem and
ends up calling handle_userfault(), which isn't expecting to be invoked
without mmap_sem held.

This makes handle_userfault() fail immediately if invoked through
shmem_vm_ops->fault during coredumping, and solves the problem.

The side effect of the bug is a BUG_ON, hit with no lock held, triggered
by the coredumping process, which then exits.

Only 4.11 is affected; pre-4.11, anon memory holes are skipped in
__get_user_pages() by checking FOLL_DUMP explicitly against empty
pagetables (mm/gup.c:no_page_table()).

It's zero cost, as we already had a check on current->flags to prevent
futex from triggering userfaults during exit (PF_EXITING).

Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli
Reported-by: "Dr. David Alan Gilbert"
David Alan Gilbert" Cc: [4.11+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/userfaultfd.c | 29 +++++++++++++++++++++-------- 1 file changed, 21 insertions(+), 8 deletions(-) diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c index f7555fc25877..1d622f276e3a 100644 --- a/fs/userfaultfd.c +++ b/fs/userfaultfd.c @@ -340,9 +340,28 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason) bool must_wait, return_to_userland; long blocking_state; - BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); - ret = VM_FAULT_SIGBUS; + + /* + * We don't do userfault handling for the final child pid update. + * + * We also don't do userfault handling during + * coredumping. hugetlbfs has the special + * follow_hugetlb_page() to skip missing pages in the + * FOLL_DUMP case, anon memory also checks for FOLL_DUMP with + * the no_page_table() helper in follow_page_mask(), but the + * shmem_vm_ops->fault method is invoked even during + * coredumping without mmap_sem and it ends up here. + */ + if (current->flags & (PF_EXITING|PF_DUMPCORE)) + goto out; + + /* + * Coredumping runs without mmap_sem so we can only check that + * the mmap_sem is held, if PF_DUMPCORE was not set. + */ + WARN_ON_ONCE(!rwsem_is_locked(&mm->mmap_sem)); + ctx = vmf->vma->vm_userfaultfd_ctx.ctx; if (!ctx) goto out; @@ -360,12 +379,6 @@ int handle_userfault(struct vm_fault *vmf, unsigned long reason) if (unlikely(ACCESS_ONCE(ctx->released))) goto out; - /* - * We don't do userfault handling for the final child pid update. - */ - if (current->flags & PF_EXITING) - goto out; - /* * Check that we can return VM_FAULT_RETRY. * From d7143e31259cb029e207619209b31aa7520f8e28 Mon Sep 17 00:00:00 2001 From: zhongjiang Date: Fri, 16 Jun 2017 14:02:40 -0700 Subject: [PATCH 5/5] mm: correct the comment when reclaimed pages exceed the scanned pages Commit e1587a494540 ("mm: vmpressure: fix sending wrong events on underflow") declared that reclaimed pages exceed the scanned pages due to the thp reclaim. That is incorrect because THP will be spilt to normal page and loop again, which will result in the scanned pages increment. [akpm@linux-foundation.org: tweak comment text] Link: http://lkml.kernel.org/r/1496824266-25235-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhongjiang Acked-by: Minchan Kim Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmpressure.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/vmpressure.c b/mm/vmpressure.c index 6063581f705c..ce0618bfa8d0 100644 --- a/mm/vmpressure.c +++ b/mm/vmpressure.c @@ -115,9 +115,9 @@ static enum vmpressure_levels vmpressure_calc_level(unsigned long scanned, unsigned long pressure = 0; /* - * reclaimed can be greater than scanned in cases - * like THP, where the scanned is 1 and reclaimed - * could be 512 + * reclaimed can be greater than scanned for things such as reclaimed + * slab pages. shrink_node() just adds reclaimed pages without a + * related increment to scanned pages. */ if (reclaimed >= scanned) goto out;