From e746bf730a76fe53b82c9e6b6da72d58e9ae3565 Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Thu, 31 Aug 2017 16:15:20 -0700 Subject: [PATCH 1/6] mm,page_alloc: don't call __node_reclaim() with oom_lock held. We are doing a last second memory allocation attempt before calling out_of_memory(). But since slab shrinker functions might indirectly wait for other thread's __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocations via sleeping locks, calling slab shrinker functions from node_reclaim() from get_page_from_freelist() with oom_lock held has possibility of deadlock. Therefore, make sure that last second memory allocation attempt does not call slab shrinker functions. Link: http://lkml.kernel.org/r/1503577106-9196-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa Acked-by: Michal Hocko Cc: Mel Gorman Cc: Vlastimil Babka Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7a58eb5757e3..1423da8dd16f 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3291,10 +3291,13 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, /* * Go through the zonelist yet one more time, keep very high watermark * here, this is only to catch a parallel oom killing, we must fail if - * we're still under heavy pressure. + * we're still under heavy pressure. But make sure that this reclaim + * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY + * allocation which will never fail due to oom_lock already held. */ - page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order, - ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac); + page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) & + ~__GFP_DIRECT_RECLAIM, order, + ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac); if (page) goto out; From 22cf8bc6cb0d48c6d31da8eccbd4201ab9b0c4c6 Mon Sep 17 00:00:00 2001 From: Shaohua Li Date: Thu, 31 Aug 2017 16:15:23 -0700 Subject: [PATCH 2/6] kernel/kthread.c: kthread_worker: don't hog the cpu If the worker thread continues getting work, it will hog the cpu and rcu stall complains. Make it a good citizen. This is triggered in a loop block device test. Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com Signed-off-by: Shaohua Li Cc: Petr Mladek Cc: Thomas Gleixner Cc: Tejun Heo Cc: Oleg Nesterov Cc: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kthread.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/kthread.c b/kernel/kthread.c index 26db528c1d88..1c19edf82427 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -637,6 +637,7 @@ int kthread_worker_fn(void *worker_ptr) schedule(); try_to_freeze(); + cond_resched(); goto repeat; } EXPORT_SYMBOL_GPL(kthread_worker_fn); From 355627f518978b5167256d27492fe0b343aaf2f2 Mon Sep 17 00:00:00 2001 From: Eric Biggers Date: Thu, 31 Aug 2017 16:15:26 -0700 Subject: [PATCH 3/6] mm, uprobes: fix multiple free of ->uprobes_state.xol_area Commit 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") made it possible to kill a forking task while it is waiting to acquire its ->mmap_sem for write, in dup_mmap(). However, it was overlooked that this introduced an new error path before the new mm_struct's ->uprobes_state.xol_area has been set to NULL after being copied from the old mm_struct by the memcpy in dup_mm(). For a task that has previously hit a uprobe tracepoint, this resulted in the 'struct xol_area' being freed multiple times if the task was killed at just the right time while forking. Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather than in uprobe_dup_mmap(). With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C program given by commit 2b7e8665b4ff ("fork: fix incorrect fput of ->exe_file causing use-after-free"), provided that a uprobe tracepoint has been set on the fork_thread() function. For example: $ gcc reproducer.c -o reproducer -lpthread $ nm reproducer | grep fork_thread 0000000000400719 t fork_thread $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable $ ./reproducer Here is the use-after-free reported by KASAN: BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200 Read of size 8 at addr ffff8800320a8b88 by task reproducer/198 CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f3fb5 #255 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 Call Trace: dump_stack+0xdb/0x185 print_address_description+0x7e/0x290 kasan_report+0x23b/0x350 __asan_report_load8_noabort+0x19/0x20 uprobe_clear_state+0x1c4/0x200 mmput+0xd6/0x360 do_exit+0x740/0x1670 do_group_exit+0x13f/0x380 get_signal+0x597/0x17d0 do_signal+0x99/0x1df0 exit_to_usermode_loop+0x166/0x1e0 syscall_return_slowpath+0x258/0x2c0 entry_SYSCALL_64_fastpath+0xbc/0xbe ... Allocated by task 199: save_stack_trace+0x1b/0x20 kasan_kmalloc+0xfc/0x180 kmem_cache_alloc_trace+0xf3/0x330 __create_xol_area+0x10f/0x780 uprobe_notify_resume+0x1674/0x2210 exit_to_usermode_loop+0x150/0x1e0 prepare_exit_to_usermode+0x14b/0x180 retint_user+0x8/0x20 Freed by task 199: save_stack_trace+0x1b/0x20 kasan_slab_free+0xa8/0x1a0 kfree+0xba/0x210 uprobe_clear_state+0x151/0x200 mmput+0xd6/0x360 copy_process.part.8+0x605f/0x65d0 _do_fork+0x1a5/0xbd0 SyS_clone+0x19/0x20 do_syscall_64+0x22f/0x660 return_from_SYSCALL_64+0x0/0x7a Note: without KASAN, you may instead see a "Bad page state" message, or simply a general protection fault. Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com Fixes: 7c051267931a ("mm, fork: make dup_mmap wait for mmap_sem for write killable") Signed-off-by: Eric Biggers Reported-by: Oleg Nesterov Acked-by: Oleg Nesterov Cc: Alexander Shishkin Cc: Arnaldo Carvalho de Melo Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: Konstantin Khlebnikov Cc: Mark Rutland Cc: Michal Hocko Cc: Peter Zijlstra Cc: Vlastimil Babka Cc: [4.7+] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/events/uprobes.c | 2 -- kernel/fork.c | 8 ++++++++ 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 0e137f98a50c..267f6ef91d97 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -1262,8 +1262,6 @@ void uprobe_end_dup_mmap(void) void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm) { - newmm->uprobes_state.xol_area = NULL; - if (test_bit(MMF_HAS_UPROBES, &oldmm->flags)) { set_bit(MMF_HAS_UPROBES, &newmm->flags); /* unconditionally, dup_mmap() skips VM_DONTCOPY vmas */ diff --git a/kernel/fork.c b/kernel/fork.c index cbbea277b3fb..b7e9e57b71ea 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -785,6 +785,13 @@ static void mm_init_owner(struct mm_struct *mm, struct task_struct *p) #endif } +static void mm_init_uprobes_state(struct mm_struct *mm) +{ +#ifdef CONFIG_UPROBES + mm->uprobes_state.xol_area = NULL; +#endif +} + static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, struct user_namespace *user_ns) { @@ -812,6 +819,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS mm->pmd_huge_pte = NULL; #endif + mm_init_uprobes_state(mm); if (current->mm) { mm->flags = current->mm->flags & MMF_INIT_MASK; From c461ad6a63b37ba74632e90c063d14823c884247 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 31 Aug 2017 16:15:30 -0700 Subject: [PATCH 4/6] mm, madvise: ensure poisoned pages are removed from per-cpu lists Wendy Wang reported off-list that a RAS HWPOISON-SOFT test case failed and bisected it to the commit 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP"). The problem is that a page that was poisoned with madvise() is reused. The commit removed a check that would trigger if DEBUG_VM was enabled but re-enabling the check only fixes the problem as a side-effect by printing a bad_page warning and recovering. The root of the problem is that an madvise() can leave a poisoned page on the per-cpu list. This patch drains all per-cpu lists after pages are poisoned so that they will not be reused. Wendy reports that the test case in question passes with this patch applied. While this could be done in a targeted fashion, it is over-complicated for such a rare operation. Link: http://lkml.kernel.org/r/20170828133414.7qro57jbepdcyz5x@techsingularity.net Fixes: 479f854a207c ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") Signed-off-by: Mel Gorman Reported-by: Wang, Wendy Tested-by: Wang, Wendy Acked-by: David Rientjes Acked-by: Vlastimil Babka Cc: "Hansen, Dave" Cc: "Luck, Tony" Cc: Naoya Horiguchi Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/madvise.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/mm/madvise.c b/mm/madvise.c index 23ed525bc2bc..4d7d1e5ddba9 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -613,6 +613,7 @@ static int madvise_inject_error(int behavior, unsigned long start, unsigned long end) { struct page *page; + struct zone *zone; if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -646,6 +647,11 @@ static int madvise_inject_error(int behavior, if (ret) return ret; } + + /* Ensure that all poisoned pages are removed from per-cpu lists */ + for_each_populated_zone(zone) + drain_all_pages(zone); + return 0; } #endif From c03567a8e8d5cf2aaca40e605c48f319dc2ead57 Mon Sep 17 00:00:00 2001 From: Joe Stringer Date: Thu, 31 Aug 2017 16:15:33 -0700 Subject: [PATCH 5/6] include/linux/compiler.h: don't perform compiletime_assert with -O0 Commit c7acec713d14 ("kernel.h: handle pointers to arrays better in container_of()") made use of __compiletime_assert() from container_of() thus increasing the usage of this macro, allowing developers to notice type conflicts in usage of container_of() at compile time. However, the implementation of __compiletime_assert relies on compiler optimizations to report an error. This means that if a developer uses "-O0" with any code that performs container_of(), the compiler will always report an error regardless of whether there is an actual problem in the code. This patch disables compile_time_assert when optimizations are disabled to allow such code to compile with CFLAGS="-O0". Example compilation failure: ./include/linux/compiler.h:547:38: error: call to `__compiletime_assert_94' declared with attribute error: pointer type mismatch in container_of() _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__) ^ ./include/linux/compiler.h:530:4: note: in definition of macro `__compiletime_assert' prefix ## suffix(); \ ^~~~~~ ./include/linux/compiler.h:547:2: note: in expansion of macro `_compiletime_assert' _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__) ^~~~~~~~~~~~~~~~~~~ ./include/linux/build_bug.h:46:37: note: in expansion of macro `compiletime_assert' #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) ^~~~~~~~~~~~~~~~~~ ./include/linux/kernel.h:860:2: note: in expansion of macro `BUILD_BUG_ON_MSG' BUILD_BUG_ON_MSG(!__same_type(*(ptr), ((type *)0)->member) && \ ^~~~~~~~~~~~~~~~ [akpm@linux-foundation.org: use do{}while(0), per Michal] Link: http://lkml.kernel.org/r/20170829230114.11662-1-joe@ovn.org Fixes: c7acec713d14c6c ("kernel.h: handle pointers to arrays better in container_of()") Signed-off-by: Joe Stringer Cc: Ian Abbott Cc: Arnd Bergmann Cc: Michal Nazarewicz Cc: Kees Cook Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/compiler.h | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/include/linux/compiler.h b/include/linux/compiler.h index eca8ad75e28b..043b60de041e 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -517,7 +517,8 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s # define __compiletime_error_fallback(condition) do { } while (0) #endif -#define __compiletime_assert(condition, msg, prefix, suffix) \ +#ifdef __OPTIMIZE__ +# define __compiletime_assert(condition, msg, prefix, suffix) \ do { \ bool __cond = !(condition); \ extern void prefix ## suffix(void) __compiletime_error(msg); \ @@ -525,6 +526,9 @@ static __always_inline void __write_once_size(volatile void *p, void *res, int s prefix ## suffix(); \ __compiletime_error_fallback(__cond); \ } while (0) +#else +# define __compiletime_assert(condition, msg, prefix, suffix) do { } while (0) +#endif #define _compiletime_assert(condition, msg, prefix, suffix) \ __compiletime_assert(condition, msg, prefix, suffix) From e66186920bff278b18ebe460c710c7b0e0cfdf6e Mon Sep 17 00:00:00 2001 From: Russell King Date: Thu, 31 Aug 2017 16:15:36 -0700 Subject: [PATCH 6/6] scripts/dtc: fix '%zx' warning dtc uses an incorrect format specifier for printing a uint64_t value. uint64_t may be either 'unsigned long' or 'unsigned long long' depending on the host architecture. Fix this by using %llx and casting to unsigned long long, which ensures that we always have a wide enough variable to print 64 bits of hex. HOSTCC scripts/dtc/checks.o scripts/dtc/checks.c: In function 'check_simple_bus_reg': scripts/dtc/checks.c:876:2: warning: format '%zx' expects argument of type 'size_t', but argument 4 has type 'uint64_t' [-Wformat=] snprintf(unit_addr, sizeof(unit_addr), "%zx", reg); ^ scripts/dtc/checks.c:876:2: warning: format '%zx' expects argument of type 'size_t', but argument 4 has type 'uint64_t' [-Wformat=] Link: http://lkml.kernel.org/r/20170829222034.GJ20805@n2100.armlinux.org.uk Fixes: 828d4cdd012c ("dtc: check.c fix compile error") Signed-off-by: Russell King Cc: Rob Herring Cc: Frank Rowand Cc: Shuah Khan Cc: David Gibson Cc: Michal Marek Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- scripts/dtc/checks.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/scripts/dtc/checks.c b/scripts/dtc/checks.c index 4b72b530c84f..62ea8f83d4a0 100644 --- a/scripts/dtc/checks.c +++ b/scripts/dtc/checks.c @@ -873,7 +873,7 @@ static void check_simple_bus_reg(struct check *c, struct dt_info *dti, struct no while (size--) reg = (reg << 32) | fdt32_to_cpu(*(cells++)); - snprintf(unit_addr, sizeof(unit_addr), "%zx", reg); + snprintf(unit_addr, sizeof(unit_addr), "%llx", (unsigned long long)reg); if (!streq(unitname, unit_addr)) FAIL(c, dti, "Node %s simple-bus unit address format error, expected \"%s\"", node->fullpath, unit_addr);