From 1105a2fc022f3c7482e32faf516e8bc44095f778 Mon Sep 17 00:00:00 2001 From: Jia He Date: Thu, 14 Jun 2018 15:26:14 -0700 Subject: [PATCH 01/27] mm/ksm.c: ignore STABLE_FLAG of rmap_item->address in rmap_walk_ksm() In our armv8a server(QDF2400), I noticed lots of WARN_ON caused by PAGE_SIZE unaligned for rmap_item->address under memory pressure tests(start 20 guests and run memhog in the host). WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8 CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8 Call trace: kvm_age_hva_handler+0xc0/0xc8 handle_hva_to_gpa+0xa8/0xe0 kvm_age_hva+0x4c/0xe8 kvm_mmu_notifier_clear_flush_young+0x54/0x98 __mmu_notifier_clear_flush_young+0x6c/0xa0 page_referenced_one+0x154/0x1d8 rmap_walk_ksm+0x12c/0x1d0 rmap_walk+0x94/0xa0 page_referenced+0x194/0x1b0 shrink_page_list+0x674/0xc28 shrink_inactive_list+0x26c/0x5b8 shrink_node_memcg+0x35c/0x620 shrink_node+0x100/0x430 do_try_to_free_pages+0xe0/0x3a8 try_to_free_pages+0xe4/0x230 __alloc_pages_nodemask+0x564/0xdc0 alloc_pages_vma+0x90/0x228 do_anonymous_page+0xc8/0x4d0 __handle_mm_fault+0x4a0/0x508 handle_mm_fault+0xf8/0x1b0 do_page_fault+0x218/0x4b8 do_translation_fault+0x90/0xa0 do_mem_abort+0x68/0xf0 el0_da+0x24/0x28 In rmap_walk_ksm, the rmap_item->address might still have the STABLE_FLAG, then the start and end in handle_hva_to_gpa might not be PAGE_SIZE aligned. Thus it will cause exceptions in handle_hva_to_gpa on arm64. This patch fixes it by ignoring (not removing) the low bits of address when doing rmap_walk_ksm. IMO, it should be backported to stable tree. the storm of WARN_ONs is very easy for me to reproduce. More than that, I watched a panic (not reproducible) as follows: page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0 flags: 0x1fffc00000000000() raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000 raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000 page dumped because: nonzero _refcount CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1 Hardware name: Call trace: dump_backtrace+0x0/0x22c show_stack+0x24/0x2c dump_stack+0x8c/0xb0 bad_page+0xf4/0x154 free_pages_check_bad+0x90/0x9c free_pcppages_bulk+0x464/0x518 free_hot_cold_page+0x22c/0x300 __put_page+0x54/0x60 unmap_stage2_range+0x170/0x2b4 kvm_unmap_hva_handler+0x30/0x40 handle_hva_to_gpa+0xb0/0xec kvm_unmap_hva_range+0x5c/0xd0 I even injected a fault on purpose in kvm_unmap_hva_range by seting size=size-0x200, the call trace is similar as above. So I thought the panic is similarly caused by the root cause of WARN_ON. Andrea said: : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would : zap those bits and hide the misalignment caused by the low metadata : bits being erroneously left set in the address, but the arm code : notices when that's the last page in the memslot and the hva_end is : getting aligned and the size is below one page. : : I think the problem triggers in the addr += PAGE_SIZE of : unmap_stage2_ptes that never matches end because end is aligned but : addr is not. : : } while (pte++, addr += PAGE_SIZE, addr != end); : : x86 again only works on hva_start/hva_end after converting it to : gfn_start/end and that being in pfn units the bits are zapped before : they risk to cause trouble. Jia He said: : I've tested by myself in arm64 server (QDF2400,46 cpus,96G mem) Without : this patch, the WARN_ON is very easy for reproducing. After this patch, I : have run the same benchmarch for a whole day without any WARN_ONs Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com Signed-off-by: Jia He Reviewed-by: Andrea Arcangeli Tested-by: Jia He Cc: Suzuki K Poulose Cc: Minchan Kim Cc: Claudio Imbrenda Cc: Arvind Yadav Cc: Mike Rapoport Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/ksm.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/ksm.c b/mm/ksm.c index e2d2886fb1df..a6d43cf9a982 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -216,6 +216,8 @@ struct rmap_item { #define SEQNR_MASK 0x0ff /* low bits of unstable tree seqnr */ #define UNSTABLE_FLAG 0x100 /* is a node of the unstable tree */ #define STABLE_FLAG 0x200 /* is listed from the stable tree */ +#define KSM_FLAG_MASK (SEQNR_MASK|UNSTABLE_FLAG|STABLE_FLAG) + /* to mask all the flags */ /* The stable and unstable tree heads */ static struct rb_root one_stable_tree[1] = { RB_ROOT }; @@ -2598,10 +2600,15 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc) anon_vma_lock_read(anon_vma); anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root, 0, ULONG_MAX) { + unsigned long addr; + cond_resched(); vma = vmac->vma; - if (rmap_item->address < vma->vm_start || - rmap_item->address >= vma->vm_end) + + /* Ignore the stable/unstable/sqnr flags */ + addr = rmap_item->address & ~KSM_FLAG_MASK; + + if (addr < vma->vm_start || addr >= vma->vm_end) continue; /* * Initially we examine only the vma which covers this @@ -2615,8 +2622,7 @@ void rmap_walk_ksm(struct page *page, struct rmap_walk_control *rwc) if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg)) continue; - if (!rwc->rmap_one(page, vma, - rmap_item->address, rwc->arg)) { + if (!rwc->rmap_one(page, vma, addr, rwc->arg)) { anon_vma_unlock_read(anon_vma); return; } From df2a419677652439b473a73c5acf32354b217cc6 Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Thu, 14 Jun 2018 15:26:17 -0700 Subject: [PATCH 02/27] mm: fix null pointer dereference in mem_cgroup_protected Shakeel reported a crash in mem_cgroup_protected(), which can be triggered by memcg reclaim if the legacy cgroup v1 use_hierarchy=0 mode is used: BUG: unable to handle kernel NULL pointer dereference at 0000000000000120 PGD 8000001ff55da067 P4D 8000001ff55da067 PUD 1fdc7df067 PMD 0 Oops: 0000 [#4] SMP PTI CPU: 0 PID: 15581 Comm: bash Tainted: G D 4.17.0-smp-clean #5 Hardware name: ... RIP: 0010:mem_cgroup_protected+0x54/0x130 Code: 4c 8b 8e 00 01 00 00 4c 8b 86 08 01 00 00 48 8d 8a 08 ff ff ff 48 85 d2 ba 00 00 00 00 48 0f 44 ca 48 39 c8 0f 84 cf 00 00 00 <48> 8b 81 20 01 00 00 4d 89 ca 4c 39 c8 4c 0f 46 d0 4d 85 d2 74 05 RSP: 0000:ffffabe64dfafa58 EFLAGS: 00010286 RAX: ffff9fb6ff03d000 RBX: ffff9fb6f5b1b000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff9fb6f5b1b000 RDI: ffff9fb6f5b1b000 RBP: ffffabe64dfafb08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000c800 R12: ffffabe64dfafb88 R13: ffff9fb6f5b1b000 R14: ffffabe64dfafb88 R15: ffff9fb77fffe000 FS: 00007fed1f8ac700(0000) GS:ffff9fb6ff400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000120 CR3: 0000001fdcf86003 CR4: 00000000001606f0 Call Trace: ? shrink_node+0x194/0x510 do_try_to_free_pages+0xfd/0x390 try_to_free_mem_cgroup_pages+0x123/0x210 try_charge+0x19e/0x700 mem_cgroup_try_charge+0x10b/0x1a0 wp_page_copy+0x134/0x5b0 do_wp_page+0x90/0x460 __handle_mm_fault+0x8e3/0xf30 handle_mm_fault+0xfe/0x220 __do_page_fault+0x262/0x500 do_page_fault+0x28/0xd0 ? page_fault+0x8/0x30 page_fault+0x1e/0x30 RIP: 0033:0x485b72 The problem happens because parent_mem_cgroup() returns a NULL pointer, which is dereferenced later without a check. As cgroup v1 has no memory guarantee support, let's make mem_cgroup_protected() immediately return MEMCG_PROT_NONE, if the given cgroup has no parent (non-hierarchical mode is used). Link: http://lkml.kernel.org/r/20180611175418.7007-2-guro@fb.com Fixes: bf8d5d52ffe8 ("memcg: introduce memory.min") Signed-off-by: Roman Gushchin Reported-by: Shakeel Butt Tested-by: Shakeel Butt Tested-by: John Stultz Acked-by: Johannes Weiner Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c1e64d60ed02..5a3873e9d657 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5480,6 +5480,10 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, elow = memcg->memory.low; parent = parent_mem_cgroup(memcg); + /* No parent means a non-hierarchical mode on v1 memcg */ + if (!parent) + return MEMCG_PROT_NONE; + if (parent == root) goto exit; From 955c97f0859abef698e77f5697f5c4008303abb9 Mon Sep 17 00:00:00 2001 From: Daniel Jordan Date: Thu, 14 Jun 2018 15:26:21 -0700 Subject: [PATCH 03/27] mm/swapfile.c: fix swap_count comment about nonexistent SWAP_HAS_CONT Commit 570a335b8e22 ("swap_info: swap count continuations") introduces COUNT_CONTINUED but refers to it incorrectly as SWAP_HAS_CONT in a comment in swap_count. Fix it. Link: http://lkml.kernel.org/r/20180612175919.30413-1-daniel.m.jordan@oracle.com Fixes: 570a335b8e22 ("swap_info: swap count continuations") Signed-off-by: Daniel Jordan Reviewed-by: Andrew Morton Cc: "Huang, Ying" Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 925cf795a652..2cc2972eedaf 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -100,7 +100,7 @@ atomic_t nr_rotate_swap = ATOMIC_INIT(0); static inline unsigned char swap_count(unsigned char ent) { - return ent & ~SWAP_HAS_CACHE; /* may include SWAP_HAS_CONT flag */ + return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ } /* returns 1 if swap entry is freed */ From 2bdce74412c249ac01dfe36b6b0043ffd7a5361e Mon Sep 17 00:00:00 2001 From: Dan Williams Date: Thu, 14 Jun 2018 15:26:24 -0700 Subject: [PATCH 04/27] mm: fix devmem_is_allowed() for sub-page System RAM intersections Hussam reports: I was poking around and for no real reason, I did cat /dev/mem and strings /dev/mem. Then I saw the following warning in dmesg. I saved it and rebooted immediately. memremap attempted on mixed range 0x000000000009c000 size: 0x1000 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 11810 at kernel/memremap.c:98 memremap+0x104/0x170 [..] Call Trace: xlate_dev_mem_ptr+0x25/0x40 read_mem+0x89/0x1a0 __vfs_read+0x36/0x170 The memremap() implementation checks for attempts to remap System RAM with MEMREMAP_WB and instead redirects those mapping attempts to the linear map. However, that only works if the physical address range being remapped is page aligned. In low memory we have situations like the following: 00000000-00000fff : Reserved 00001000-0009fbff : System RAM 0009fc00-0009ffff : Reserved ...where System RAM intersects Reserved ranges on a sub-page page granularity. Given that devmem_is_allowed() special cases any attempt to map System RAM in the first 1MB of memory, replace page_is_ram() with the more precise region_intersects() to trap attempts to map disallowed ranges. Link: https://bugzilla.kernel.org/show_bug.cgi?id=199999 Link: http://lkml.kernel.org/r/152856436164.18127.2847888121707136898.stgit@dwillia2-desk3.amr.corp.intel.com Fixes: 92281dee825f ("arch: introduce memremap()") Signed-off-by: Dan Williams Reported-by: Hussam Al-Tayeb Tested-by: Hussam Al-Tayeb Cc: Christoph Hellwig Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/x86/mm/init.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index fec82b577c18..cee58a972cb2 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -706,7 +706,9 @@ void __init init_mem_mapping(void) */ int devmem_is_allowed(unsigned long pagenr) { - if (page_is_ram(pagenr)) { + if (region_intersects(PFN_PHYS(pagenr), PAGE_SIZE, + IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE) + != REGION_DISJOINT) { /* * For disallowed memory regions in the low 1MB range, * request that the page be shown as all zeros. From 92ee383f6daab4da5471b86f6fdaba775e6928f6 Mon Sep 17 00:00:00 2001 From: Shakeel Butt Date: Thu, 14 Jun 2018 15:26:27 -0700 Subject: [PATCH 05/27] mm: fix race between kmem_cache destroy, create and deactivate The memcg kmem cache creation and deactivation (SLUB only) is asynchronous. If a root kmem cache is destroyed whose memcg cache is in the process of creation or deactivation, the kernel may crash. Example of one such crash: general protection fault: 0000 [#1] SMP PTI CPU: 1 PID: 1721 Comm: kworker/14:1 Not tainted 4.17.0-smp ... Workqueue: memcg_kmem_cache kmemcg_deactivate_workfn RIP: 0010:has_cpu_slab ... Call Trace: ? on_each_cpu_cond __kmem_cache_shrink kmemcg_cache_deact_after_rcu kmemcg_deactivate_workfn process_one_work worker_thread kthread ret_from_fork+0x35/0x40 To fix this race, on root kmem cache destruction, mark the cache as dying and flush the workqueue used for memcg kmem cache creation and deactivation. SLUB's memcg kmem cache deactivation also includes RCU callback and thus make sure all previous registered RCU callbacks have completed as well. [shakeelb@google.com: handle the RCU callbacks for SLUB deactivation] Link: http://lkml.kernel.org/r/20180611192951.195727-1-shakeelb@google.com [shakeelb@google.com: add more documentation, rename fields for readability] Link: http://lkml.kernel.org/r/20180522201336.196994-1-shakeelb@google.com [akpm@linux-foundation.org: fix build, per Shakeel] [shakeelb@google.com: v3. Instead of refcount, flush the workqueue] Link: http://lkml.kernel.org/r/20180530001204.183758-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180521174116.171846-1-shakeelb@google.com Signed-off-by: Shakeel Butt Acked-by: Vladimir Davydov Cc: Michal Hocko Cc: Greg Thelen Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Johannes Weiner Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/slab.h | 1 + mm/slab_common.c | 33 ++++++++++++++++++++++++++++++++- 2 files changed, 33 insertions(+), 1 deletion(-) diff --git a/include/linux/slab.h b/include/linux/slab.h index 4d759e1ddc33..14e3fe4bd6a1 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -600,6 +600,7 @@ struct memcg_cache_params { struct memcg_cache_array __rcu *memcg_caches; struct list_head __root_caches_node; struct list_head children; + bool dying; }; struct { struct mem_cgroup *memcg; diff --git a/mm/slab_common.c b/mm/slab_common.c index 98dcdc352062..42aca26d61d0 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -136,6 +136,7 @@ void slab_init_memcg_params(struct kmem_cache *s) s->memcg_params.root_cache = NULL; RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL); INIT_LIST_HEAD(&s->memcg_params.children); + s->memcg_params.dying = false; } static int init_memcg_params(struct kmem_cache *s, @@ -608,7 +609,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg, * The memory cgroup could have been offlined while the cache * creation work was pending. */ - if (memcg->kmem_state != KMEM_ONLINE) + if (memcg->kmem_state != KMEM_ONLINE || root_cache->memcg_params.dying) goto out_unlock; idx = memcg_cache_id(memcg); @@ -712,6 +713,9 @@ void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s, WARN_ON_ONCE(s->memcg_params.deact_fn)) return; + if (s->memcg_params.root_cache->memcg_params.dying) + return; + /* pin memcg so that @s doesn't get destroyed in the middle */ css_get(&s->memcg_params.memcg->css); @@ -823,11 +827,36 @@ static int shutdown_memcg_caches(struct kmem_cache *s) return -EBUSY; return 0; } + +static void flush_memcg_workqueue(struct kmem_cache *s) +{ + mutex_lock(&slab_mutex); + s->memcg_params.dying = true; + mutex_unlock(&slab_mutex); + + /* + * SLUB deactivates the kmem_caches through call_rcu_sched. Make + * sure all registered rcu callbacks have been invoked. + */ + if (IS_ENABLED(CONFIG_SLUB)) + rcu_barrier_sched(); + + /* + * SLAB and SLUB create memcg kmem_caches through workqueue and SLUB + * deactivates the memcg kmem_caches through workqueue. Make sure all + * previous workitems on workqueue are processed. + */ + flush_workqueue(memcg_kmem_cache_wq); +} #else static inline int shutdown_memcg_caches(struct kmem_cache *s) { return 0; } + +static inline void flush_memcg_workqueue(struct kmem_cache *s) +{ +} #endif /* CONFIG_MEMCG && !CONFIG_SLOB */ void slab_kmem_cache_release(struct kmem_cache *s) @@ -845,6 +874,8 @@ void kmem_cache_destroy(struct kmem_cache *s) if (unlikely(!s)) return; + flush_memcg_workqueue(s); + get_online_cpus(); get_online_mems(); From a8311f647e419675f6ecba9f4284080fd38a0a37 Mon Sep 17 00:00:00 2001 From: Jarrett Farnitano Date: Thu, 14 Jun 2018 15:26:31 -0700 Subject: [PATCH 06/27] kexec: yield to scheduler when loading kimage segments Without yielding while loading kimage segments, a large initrd will block all other work on the CPU performing the load until it is completed. For example loading an initrd of 200MB on a low power single core system will lock up the system for a few seconds. To increase system responsiveness to other tasks at that time, call cond_resched() in both the crash kernel and normal kernel segment loading loops. I did run into a practical problem. Hardware watchdogs on embedded systems can have short timers on the order of seconds. If the system is locked up for a few seconds with only a single core available, the watchdog may not be pet in a timely fashion. If this happens, the hardware watchdog will fire and reset the system. This really only becomes a problem when you are working with a single core, a decently sized initrd, and have a constrained hardware watchdog. Link: http://lkml.kernel.org/r/1528738546-3328-1-git-send-email-jmf@amazon.com Signed-off-by: Jarrett Farnitano Reviewed-by: "Eric W. Biederman" Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kexec_core.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c index 20fef1a38602..23a83a4da38a 100644 --- a/kernel/kexec_core.c +++ b/kernel/kexec_core.c @@ -829,6 +829,8 @@ static int kimage_load_normal_segment(struct kimage *image, else buf += mchunk; mbytes -= mchunk; + + cond_resched(); } out: return result; @@ -893,6 +895,8 @@ static int kimage_load_crash_segment(struct kimage *image, else buf += mchunk; mbytes -= mchunk; + + cond_resched(); } out: return result; From 655c79bb40a0870adcd0871057d01de11625882b Mon Sep 17 00:00:00 2001 From: Tetsuo Handa Date: Thu, 14 Jun 2018 15:26:34 -0700 Subject: [PATCH 07/27] mm: check for SIGKILL inside dup_mmap() loop As a theoretical problem, dup_mmap() of an mm_struct with 60000+ vmas can loop while potentially allocating memory, with mm->mmap_sem held for write by current thread. This is bad if current thread was selected as an OOM victim, for current thread will continue allocations using memory reserves while OOM reaper is unable to reclaim memory. As an actually observable problem, it is not difficult to make OOM reaper unable to reclaim memory if the OOM victim is blocked at i_mmap_lock_write() in this loop. Unfortunately, since nobody can explain whether it is safe to use killable wait there, let's check for SIGKILL before trying to allocate memory. Even without an OOM event, there is no point with continuing the loop from the beginning if current thread is killed. I tested with debug printk(). This patch should be safe because we already fail if security_vm_enough_memory_mm() or kmem_cache_alloc(GFP_KERNEL) fails and exit_mmap() handles it. ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting dup_mmap() due to SIGKILL ***** ***** Aborting exit_mmap() due to NULL mmap ***** [akpm@linux-foundation.org: add comment] Link: http://lkml.kernel.org/r/201804071938.CDE04681.SOFVQJFtMHOOLF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa Cc: Alexander Viro Cc: Rik van Riel Cc: Michal Hocko Cc: Kirill A. Shutemov Cc: David Rientjes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/fork.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/fork.c b/kernel/fork.c index 92870be50bba..9440d61b925c 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -440,6 +440,14 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, continue; } charge = 0; + /* + * Don't duplicate many vmas if we've been oom-killed (for + * example) + */ + if (fatal_signal_pending(current)) { + retval = -EINTR; + goto out; + } if (mpnt->vm_flags & VM_ACCOUNT) { unsigned long len = vma_pages(mpnt); From 69b5086b12cda645d95f00575c25f1dfd1e929ad Mon Sep 17 00:00:00 2001 From: Mathieu Malaterre Date: Thu, 14 Jun 2018 15:26:38 -0700 Subject: [PATCH 08/27] mm/memblock: add missing include Commit 26f09e9b3a06 ("mm/memblock: add memblock memory allocation apis") introduced two new function definitions: memblock_virt_alloc_try_nid_nopanic() memblock_virt_alloc_try_nid() Commit ea1f5f3712af ("mm: define memblock_virt_alloc_try_nid_raw") introduced the following function definition: memblock_virt_alloc_try_nid_raw() This commit adds an includeof header file to provide the missing function prototypes. Silence the following gcc warning (W=1): mm/memblock.c:1334:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_raw' [-Wmissing-prototypes] mm/memblock.c:1371:15: warning: no previous prototype for `memblock_virt_alloc_try_nid_nopanic' [-Wmissing-prototypes] mm/memblock.c:1407:15: warning: no previous prototype for `memblock_virt_alloc_try_nid' [-Wmissing-prototypes] Link: http://lkml.kernel.org/r/20180606194144.16990-1-malat@debian.org Signed-off-by: Mathieu Malaterre Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memblock.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/memblock.c b/mm/memblock.c index 93ad42bc8a73..f1dba2826d0f 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include From 37a4094e828f3c7673aa9c60f8b2b9d1019db81b Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Thu, 14 Jun 2018 15:26:41 -0700 Subject: [PATCH 09/27] mremap: remove LATENCY_LIMIT from mremap to reduce the number of TLB shootdowns Commit 5d1904204c99 ("mremap: fix race between mremap() and page cleanning") fixed races between mremap and other operations for both file-backed and anonymous mappings. The file-backed was the most critical as it allowed the possibility that data could be changed on a physical page after page_mkclean returned which could trigger data loss or data integrity issues. A customer reported that the cost of the TLBs for anonymous regressions was excessive and resulting in a 30-50% drop in performance overall since this commit on a microbenchmark. Unfortunately I neither have access to the test-case nor can I describe what it does other than saying that mremap operations dominate heavily. This patch removes the LATENCY_LIMIT to handle TLB flushes on a PMD boundary instead of every 64 pages to reduce the number of TLB shootdowns by a factor of 8 in the ideal case. LATENCY_LIMIT was almost certainly used originally to limit the PTL hold times but the latency savings are likely offset by the cost of IPIs in many cases. This patch is not reported to completely restore performance but gets it within an acceptable percentage. The given metric here is simply described as "higher is better". Baseline that was known good 002: Metric: 91.05 004: Metric: 109.45 008: Metric: 73.08 016: Metric: 58.14 032: Metric: 61.09 064: Metric: 57.76 128: Metric: 55.43 Current 001: Metric: 54.98 002: Metric: 56.56 004: Metric: 41.22 008: Metric: 35.96 016: Metric: 36.45 032: Metric: 35.71 064: Metric: 35.73 128: Metric: 34.96 With patch 001: Metric: 61.43 002: Metric: 81.64 004: Metric: 67.92 008: Metric: 51.67 016: Metric: 50.47 032: Metric: 52.29 064: Metric: 50.01 128: Metric: 49.04 So for low threads, it's not restored but for larger number of threads, it's closer to the "known good" baseline. Using a different mremap-intensive workload that is not representative of the real workload there is little difference observed outside of noise in the headline metrics However, the TLB shootdowns are reduced by 11% on average and at the peak, TLB shootdowns were reduced by 21%. Interrupts were sampled every second while the workload ran to get those figures. It's known that the figures will vary as the non-representative load is non-deterministic. An alternative patch was posted that should have significantly reduced the TLB flushes but unfortunately it does not perform as well as this version on the customer test case. If revisited, the two patches can stack on top of each other. Link: http://lkml.kernel.org/r/20180606183803.k7qaw2xnbvzshv34@techsingularity.net Signed-off-by: Mel Gorman Reviewed-by: Andrew Morton Cc: Nadav Amit Cc: Dave Hansen Cc: Michal Hocko Cc: Vlastimil Babka Cc: Aaron Lu Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mremap.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/mm/mremap.c b/mm/mremap.c index 049470aa1e3e..5c2e18505f75 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -191,8 +191,6 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, drop_rmap_locks(vma); } -#define LATENCY_LIMIT (64 * PAGE_SIZE) - unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, @@ -247,8 +245,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma, next = (new_addr + PMD_SIZE) & PMD_MASK; if (extent > next - new_addr) extent = next - new_addr; - if (extent > LATENCY_LIMIT) - extent = LATENCY_LIMIT; move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma, new_pmd, new_addr, need_rmap_locks, &need_flush); } From 26b95137d673d8d6a9a124d433d968e18298f4ed Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Thu, 14 Jun 2018 15:27:17 -0700 Subject: [PATCH 10/27] proc: skip branch in /proc/*/* lookup Code is structured like this: for ( ... p < last; p++) { if (memcmp == 0) break; } if (p >= last) ERROR OK gcc doesn't see that if if lookup succeeds than post loop branch will never be taken and skip it. [akpm@linux-foundation.org: proc_pident_instantiate() no longer takes an inode*] Link: http://lkml.kernel.org/r/20180423213954.GD9043@avx2 Signed-off-by: Alexey Dobriyan Reviewed-by: Andrew Morton Cc: Al Viro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/proc/base.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 80aa42506b8b..b6572944efc3 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2439,14 +2439,11 @@ static struct dentry *proc_pident_lookup(struct inode *dir, for (p = ents; p < last; p++) { if (p->len != dentry->d_name.len) continue; - if (!memcmp(dentry->d_name.name, p->name, p->len)) + if (!memcmp(dentry->d_name.name, p->name, p->len)) { + res = proc_pident_instantiate(dentry, task, p); break; + } } - if (p >= last) - goto out; - - res = proc_pident_instantiate(dentry, task, p); -out: put_task_struct(task); out_no_task: return res; From c2574aaa5d98684519f0dfa64abe6896aeb124c6 Mon Sep 17 00:00:00 2001 From: OGAWA Hirofumi Date: Thu, 14 Jun 2018 15:27:21 -0700 Subject: [PATCH 11/27] fat: use fat_fs_error() instead of BUG_ON() in __fat_get_block() If file size and FAT cluster chain is not matched (corrupted image), we can hit BUG_ON(!phys) in __fat_get_block(). So, use fat_fs_error() instead. [hirofumi@mail.parknet.co.jp: fix printk warning] Link: http://lkml.kernel.org/r/87po12aq5p.fsf@mail.parknet.co.jp Link: http://lkml.kernel.org/r/874lilcu67.fsf@mail.parknet.co.jp Signed-off-by: OGAWA Hirofumi Reported-by: Anatoly Trosinenko Tested-by: Anatoly Trosinenko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/fat/inode.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/fat/inode.c b/fs/fat/inode.c index ffbbf0520d9e..4f818f7fa155 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -158,8 +158,14 @@ static inline int __fat_get_block(struct inode *inode, sector_t iblock, err = fat_bmap(inode, iblock, &phys, &mapped_blocks, create, false); if (err) return err; + if (!phys) { + fat_fs_error(sb, + "invalid FAT chain (i_pos %lld, last_block %llu)", + MSDOS_I(inode)->i_pos, + (unsigned long long)last_block); + return -EIO; + } - BUG_ON(!phys); BUG_ON(*max_blocks != mapped_blocks); set_buffer_new(bh_result); map_bh(bh_result, sb, phys); From 86a2bb5ad83161cc687671bdf188699e137ae226 Mon Sep 17 00:00:00 2001 From: Alexey Dobriyan Date: Thu, 14 Jun 2018 15:27:24 -0700 Subject: [PATCH 12/27] coredump: fix spam with zero VMA process Nobody ever tried to self destruct by unmapping whole address space at once: munmap((void *)0, (1ULL << 47) - 4096); Doing this produces 2 warnings for zero-length vmalloc allocations: a.out[1353]: segfault at 7f80bcc4b757 ip 00007f80bcc4b757 sp 00007fff683939b8 error 14 a.out: vmalloc: allocation failure: 0 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null) ... a.out: vmalloc: allocation failure: 0 bytes, mode:0xcc0(GFP_KERNEL), nodemask=(null) ... Fix is to switch to kvmalloc(). Steps to reproduce: // vsyscall=none #include #include int main(void) { setrlimit(RLIMIT_CORE, &(struct rlimit){RLIM_INFINITY, RLIM_INFINITY}); munmap((void *)0, (1ULL << 47) - 4096); return 0; } Link: http://lkml.kernel.org/r/20180410180353.GA2515@avx2 Signed-off-by: Alexey Dobriyan Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/binfmt_elf.c | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 070b6184642d..0ac456b52bdd 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1621,8 +1621,8 @@ static int fill_files_note(struct memelfnote *note) if (size >= MAX_FILE_NOTE_SIZE) /* paranoia check */ return -EINVAL; size = round_up(size, PAGE_SIZE); - data = vmalloc(size); - if (!data) + data = kvmalloc(size, GFP_KERNEL); + if (ZERO_OR_NULL_PTR(data)) return -ENOMEM; start_end_ofs = data + 2; @@ -1639,7 +1639,7 @@ static int fill_files_note(struct memelfnote *note) filename = file_path(file, name_curpos, remaining); if (IS_ERR(filename)) { if (PTR_ERR(filename) == -ENAMETOOLONG) { - vfree(data); + kvfree(data); size = size * 5 / 4; goto alloc; } @@ -1932,7 +1932,7 @@ static void free_note_info(struct elf_note_info *info) kfree(t); } kfree(info->psinfo.data); - vfree(info->files.data); + kvfree(info->files.data); } #else @@ -2148,7 +2148,7 @@ static void free_note_info(struct elf_note_info *info) /* Free data possibly allocated by fill_files_note(): */ if (info->notes_files) - vfree(info->notes_files->data); + kvfree(info->notes_files->data); kfree(info->prstatus); kfree(info->psinfo); @@ -2294,8 +2294,9 @@ static int elf_core_dump(struct coredump_params *cprm) if (segs - 1 > ULONG_MAX / sizeof(*vma_filesz)) goto end_coredump; - vma_filesz = vmalloc(array_size(sizeof(*vma_filesz), (segs - 1))); - if (!vma_filesz) + vma_filesz = kvmalloc(array_size(sizeof(*vma_filesz), (segs - 1)), + GFP_KERNEL); + if (ZERO_OR_NULL_PTR(vma_filesz)) goto end_coredump; for (i = 0, vma = first_vma(current, gate_vma); vma != NULL; @@ -2402,7 +2403,7 @@ static int elf_core_dump(struct coredump_params *cprm) cleanup: free_note_info(&info); kfree(shdr4extnum); - vfree(vma_filesz); + kvfree(vma_filesz); kfree(phdr4note); kfree(elf); out: From 20fe935358de5f3481790f3e3d399cf97801f830 Mon Sep 17 00:00:00 2001 From: Kees Cook Date: Thu, 14 Jun 2018 15:27:27 -0700 Subject: [PATCH 13/27] exofs: avoid VLA in structures On the quest to remove all VLAs from the kernel[1] this adjusts several cases where allocation is made after an array of structures that points back into the allocation. The allocations are changed to perform explicit calculations instead of using a Variable Length Array in a structure. Additionally, this lets Clang compile this code now, since Clang does not support VLAIS[2]. [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com [2] https://lkml.kernel.org/r/CA+55aFy6h1c3_rP_bXFedsTXzwW+9Q9MfJaW7GUmMBrAp-fJ9A@mail.gmail.com [keescook@chromium.org: v2] Link: http://lkml.kernel.org/r/20180418163546.GA45794@beast Link: http://lkml.kernel.org/r/20180327203904.GA1151@beast Signed-off-by: Kees Cook Reviewed-by: Nick Desaulniers Cc: Boaz Harrosh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- fs/exofs/ore.c | 82 ++++++++++++++++++++++++++------------------- fs/exofs/ore_raid.c | 75 ++++++++++++++++++++++++++++++----------- fs/exofs/super.c | 23 ++++++------- 3 files changed, 114 insertions(+), 66 deletions(-) diff --git a/fs/exofs/ore.c b/fs/exofs/ore.c index ddbf87246898..1b8b44637e70 100644 --- a/fs/exofs/ore.c +++ b/fs/exofs/ore.c @@ -146,68 +146,82 @@ int _ore_get_io_state(struct ore_layout *layout, struct ore_io_state **pios) { struct ore_io_state *ios; - struct page **pages; - struct osd_sg_entry *sgilist; + size_t size_ios, size_extra, size_total; + void *ios_extra; + + /* + * The desired layout looks like this, with the extra_allocation + * items pointed at from fields within ios or per_dev: + struct __alloc_all_io_state { struct ore_io_state ios; struct ore_per_dev_state per_dev[numdevs]; union { struct osd_sg_entry sglist[sgs_per_dev * numdevs]; struct page *pages[num_par_pages]; - }; - } *_aios; + } extra_allocation; + } whole_allocation; - if (likely(sizeof(*_aios) <= PAGE_SIZE)) { - _aios = kzalloc(sizeof(*_aios), GFP_KERNEL); - if (unlikely(!_aios)) { - ORE_DBGMSG("Failed kzalloc bytes=%zd\n", - sizeof(*_aios)); + */ + + /* This should never happen, so abort early if it ever does. */ + if (sgs_per_dev && num_par_pages) { + ORE_DBGMSG("Tried to use both pages and sglist\n"); + *pios = NULL; + return -EINVAL; + } + + if (numdevs > (INT_MAX - sizeof(*ios)) / + sizeof(struct ore_per_dev_state)) + return -ENOMEM; + size_ios = sizeof(*ios) + sizeof(struct ore_per_dev_state) * numdevs; + + if (sgs_per_dev * numdevs > INT_MAX / sizeof(struct osd_sg_entry)) + return -ENOMEM; + if (num_par_pages > INT_MAX / sizeof(struct page *)) + return -ENOMEM; + size_extra = max(sizeof(struct osd_sg_entry) * (sgs_per_dev * numdevs), + sizeof(struct page *) * num_par_pages); + + size_total = size_ios + size_extra; + + if (likely(size_total <= PAGE_SIZE)) { + ios = kzalloc(size_total, GFP_KERNEL); + if (unlikely(!ios)) { + ORE_DBGMSG("Failed kzalloc bytes=%zd\n", size_total); *pios = NULL; return -ENOMEM; } - pages = num_par_pages ? _aios->pages : NULL; - sgilist = sgs_per_dev ? _aios->sglist : NULL; - ios = &_aios->ios; + ios_extra = (char *)ios + size_ios; } else { - struct __alloc_small_io_state { - struct ore_io_state ios; - struct ore_per_dev_state per_dev[numdevs]; - } *_aio_small; - union __extra_part { - struct osd_sg_entry sglist[sgs_per_dev * numdevs]; - struct page *pages[num_par_pages]; - } *extra_part; - - _aio_small = kzalloc(sizeof(*_aio_small), GFP_KERNEL); - if (unlikely(!_aio_small)) { + ios = kzalloc(size_ios, GFP_KERNEL); + if (unlikely(!ios)) { ORE_DBGMSG("Failed alloc first part bytes=%zd\n", - sizeof(*_aio_small)); + size_ios); *pios = NULL; return -ENOMEM; } - extra_part = kzalloc(sizeof(*extra_part), GFP_KERNEL); - if (unlikely(!extra_part)) { + ios_extra = kzalloc(size_extra, GFP_KERNEL); + if (unlikely(!ios_extra)) { ORE_DBGMSG("Failed alloc second part bytes=%zd\n", - sizeof(*extra_part)); - kfree(_aio_small); + size_extra); + kfree(ios); *pios = NULL; return -ENOMEM; } - pages = num_par_pages ? extra_part->pages : NULL; - sgilist = sgs_per_dev ? extra_part->sglist : NULL; /* In this case the per_dev[0].sgilist holds the pointer to * be freed */ - ios = &_aio_small->ios; ios->extra_part_alloc = true; } - if (pages) { - ios->parity_pages = pages; + if (num_par_pages) { + ios->parity_pages = ios_extra; ios->max_par_pages = num_par_pages; } - if (sgilist) { + if (sgs_per_dev) { + struct osd_sg_entry *sgilist = ios_extra; unsigned d; for (d = 0; d < numdevs; ++d) { diff --git a/fs/exofs/ore_raid.c b/fs/exofs/ore_raid.c index 27cbdb697649..199590f36203 100644 --- a/fs/exofs/ore_raid.c +++ b/fs/exofs/ore_raid.c @@ -71,6 +71,11 @@ static int _sp2d_alloc(unsigned pages_in_unit, unsigned group_width, { struct __stripe_pages_2d *sp2d; unsigned data_devs = group_width - parity; + + /* + * Desired allocation layout is, though when larger than PAGE_SIZE, + * each struct __alloc_1p_arrays is separately allocated: + struct _alloc_all_bytes { struct __alloc_stripe_pages_2d { struct __stripe_pages_2d sp2d; @@ -82,55 +87,85 @@ static int _sp2d_alloc(unsigned pages_in_unit, unsigned group_width, char page_is_read[data_devs]; } __a1pa[pages_in_unit]; } *_aab; + struct __alloc_1p_arrays *__a1pa; struct __alloc_1p_arrays *__a1pa_end; - const unsigned sizeof__a1pa = sizeof(_aab->__a1pa[0]); + + */ + + char *__a1pa; + char *__a1pa_end; + + const size_t sizeof_stripe_pages_2d = + sizeof(struct __stripe_pages_2d) + + sizeof(struct __1_page_stripe) * pages_in_unit; + const size_t sizeof__a1pa = + ALIGN(sizeof(struct page *) * (2 * group_width) + data_devs, + sizeof(void *)); + const size_t sizeof__a1pa_arrays = sizeof__a1pa * pages_in_unit; + const size_t alloc_total = sizeof_stripe_pages_2d + + sizeof__a1pa_arrays; + unsigned num_a1pa, alloc_size, i; /* FIXME: check these numbers in ore_verify_layout */ - BUG_ON(sizeof(_aab->__asp2d) > PAGE_SIZE); + BUG_ON(sizeof_stripe_pages_2d > PAGE_SIZE); BUG_ON(sizeof__a1pa > PAGE_SIZE); - if (sizeof(*_aab) > PAGE_SIZE) { - num_a1pa = (PAGE_SIZE - sizeof(_aab->__asp2d)) / sizeof__a1pa; - alloc_size = sizeof(_aab->__asp2d) + sizeof__a1pa * num_a1pa; + /* + * If alloc_total would be larger than PAGE_SIZE, only allocate + * as many a1pa items as would fill the rest of the page, instead + * of the full pages_in_unit count. + */ + if (alloc_total > PAGE_SIZE) { + num_a1pa = (PAGE_SIZE - sizeof_stripe_pages_2d) / sizeof__a1pa; + alloc_size = sizeof_stripe_pages_2d + sizeof__a1pa * num_a1pa; } else { num_a1pa = pages_in_unit; - alloc_size = sizeof(*_aab); + alloc_size = alloc_total; } - _aab = kzalloc(alloc_size, GFP_KERNEL); - if (unlikely(!_aab)) { + *psp2d = sp2d = kzalloc(alloc_size, GFP_KERNEL); + if (unlikely(!sp2d)) { ORE_DBGMSG("!! Failed to alloc sp2d size=%d\n", alloc_size); return -ENOMEM; } + /* From here Just call _sp2d_free */ - sp2d = &_aab->__asp2d.sp2d; - *psp2d = sp2d; /* From here Just call _sp2d_free */ - - __a1pa = _aab->__a1pa; - __a1pa_end = __a1pa + num_a1pa; + /* Find start of a1pa area. */ + __a1pa = (char *)sp2d + sizeof_stripe_pages_2d; + /* Find end of the _allocated_ a1pa area. */ + __a1pa_end = __a1pa + alloc_size; + /* Allocate additionally needed a1pa items in PAGE_SIZE chunks. */ for (i = 0; i < pages_in_unit; ++i) { + struct __1_page_stripe *stripe = &sp2d->_1p_stripes[i]; + if (unlikely(__a1pa >= __a1pa_end)) { num_a1pa = min_t(unsigned, PAGE_SIZE / sizeof__a1pa, pages_in_unit - i); + alloc_size = sizeof__a1pa * num_a1pa; - __a1pa = kcalloc(num_a1pa, sizeof__a1pa, GFP_KERNEL); + __a1pa = kzalloc(alloc_size, GFP_KERNEL); if (unlikely(!__a1pa)) { ORE_DBGMSG("!! Failed to _alloc_1p_arrays=%d\n", num_a1pa); return -ENOMEM; } - __a1pa_end = __a1pa + num_a1pa; + __a1pa_end = __a1pa + alloc_size; /* First *pages is marked for kfree of the buffer */ - sp2d->_1p_stripes[i].alloc = true; + stripe->alloc = true; } - sp2d->_1p_stripes[i].pages = __a1pa->pages; - sp2d->_1p_stripes[i].scribble = __a1pa->scribble ; - sp2d->_1p_stripes[i].page_is_read = __a1pa->page_is_read; - ++__a1pa; + /* + * Attach all _lp_stripes pointers to the allocation for + * it which was either part of the original PAGE_SIZE + * allocation or the subsequent allocation in this loop. + */ + stripe->pages = (void *)__a1pa; + stripe->scribble = stripe->pages + group_width; + stripe->page_is_read = (char *)stripe->scribble + group_width; + __a1pa += sizeof__a1pa; } sp2d->parity = parity; diff --git a/fs/exofs/super.c b/fs/exofs/super.c index 719a3152da80..41cf2fbee50d 100644 --- a/fs/exofs/super.c +++ b/fs/exofs/super.c @@ -549,27 +549,26 @@ static int exofs_devs_2_odi(struct exofs_dt_device_info *dt_dev, static int __alloc_dev_table(struct exofs_sb_info *sbi, unsigned numdevs, struct exofs_dev **peds) { - struct __alloc_ore_devs_and_exofs_devs { - /* Twice bigger table: See exofs_init_comps() and comment at - * exofs_read_lookup_dev_table() - */ - struct ore_dev *oreds[numdevs * 2 - 1]; - struct exofs_dev eds[numdevs]; - } *aoded; + /* Twice bigger table: See exofs_init_comps() and comment at + * exofs_read_lookup_dev_table() + */ + const size_t numores = numdevs * 2 - 1; struct exofs_dev *eds; unsigned i; - aoded = kzalloc(sizeof(*aoded), GFP_KERNEL); - if (unlikely(!aoded)) { + sbi->oc.ods = kzalloc(numores * sizeof(struct ore_dev *) + + numdevs * sizeof(struct exofs_dev), GFP_KERNEL); + if (unlikely(!sbi->oc.ods)) { EXOFS_ERR("ERROR: failed allocating Device array[%d]\n", numdevs); return -ENOMEM; } - sbi->oc.ods = aoded->oreds; - *peds = eds = aoded->eds; + /* Start of allocated struct exofs_dev entries */ + *peds = eds = (void *)sbi->oc.ods[numores]; + /* Initialize pointers into struct exofs_dev */ for (i = 0; i < numdevs; ++i) - aoded->oreds[i] = &eds[i].ored; + sbi->oc.ods[i] = &eds[i].ored; return 0; } From 3fb3894b84c2e0f83cb1e4f4e960243742e6b3a6 Mon Sep 17 00:00:00 2001 From: Souptick Joarder Date: Thu, 14 Jun 2018 15:27:31 -0700 Subject: [PATCH 14/27] kernel/relay.c: change return type to vm_fault_t Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. commit 1c8f422059ae ("mm: change return type to vm_fault_t") Link: http://lkml.kernel.org/r/20180510140335.GA25363@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder Reviewed-by: Matthew Wilcox Reviewed-by: Andrew Morton Cc: Eric Biggers Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/relay.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/relay.c b/kernel/relay.c index 9f5326e8a036..04f248644e06 100644 --- a/kernel/relay.c +++ b/kernel/relay.c @@ -39,7 +39,7 @@ static void relay_file_mmap_close(struct vm_area_struct *vma) /* * fault() vm_op implementation for relay file mapping. */ -static int relay_buf_fault(struct vm_fault *vmf) +static vm_fault_t relay_buf_fault(struct vm_fault *vmf) { struct page *page; struct rchan_buf *buf = vmf->vma->vm_private_data; From c9484b986ef03492357fddd50afbdd02929cfa72 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Thu, 14 Jun 2018 15:27:34 -0700 Subject: [PATCH 15/27] kcov: ensure irq code sees a valid area Patch series "kcov: fix unexpected faults". These patches fix a few issues where KCOV code could trigger recursive faults, discovered while debugging a patch enabling KCOV for arch/arm: * On CONFIG_PREEMPT kernels, there's a small race window where __sanitizer_cov_trace_pc() can see a bogus kcov_area. * Lazy faulting of the vmalloc area can cause mutual recursion between fault handling code and __sanitizer_cov_trace_pc(). * During the context switch, switching the mm can cause the kcov_area to be transiently unmapped. These are prerequisites for enabling KCOV on arm, but the issues themsevles are generic -- we just happen to avoid them by chance rather than design on x86-64 and arm64. This patch (of 3): For kernels built with CONFIG_PREEMPT, some C code may execute before or after the interrupt handler, while the hardirq count is zero. In these cases, in_task() can return true. A task can be interrupted in the middle of a KCOV_DISABLE ioctl while it resets the task's kcov data via kcov_task_init(). Instrumented code executed during this period will call __sanitizer_cov_trace_pc(), and as in_task() returns true, will inspect t->kcov_mode before trying to write to t->kcov_area. In kcov_init_task() we update t->kcov_{mode,area,size} with plain stores, which may be re-ordered, torn, etc. Thus __sanitizer_cov_trace_pc() may see bogus values for any of these fields, and may attempt to write to memory which is not mapped. Let's avoid this by using WRITE_ONCE() to set t->kcov_mode, with a barrier() to ensure this is ordered before we clear t->kov_{area,size}. This ensures that any code execute while kcov_init_task() is preempted will either see valid values for t->kcov_{area,size}, or will see that t->kcov_mode is KCOV_MODE_DISABLED, and bail out without touching t->kcov_area. Link: http://lkml.kernel.org/r/20180504135535.53744-2-mark.rutland@arm.com Signed-off-by: Mark Rutland Acked-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kcov.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/kcov.c b/kernel/kcov.c index 2c16f1ab5e10..5be9a60a959f 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -241,7 +241,8 @@ static void kcov_put(struct kcov *kcov) void kcov_task_init(struct task_struct *t) { - t->kcov_mode = KCOV_MODE_DISABLED; + WRITE_ONCE(t->kcov_mode, KCOV_MODE_DISABLED); + barrier(); t->kcov_size = 0; t->kcov_area = NULL; t->kcov = NULL; From dc55daff9040a90adce97208e776ee0bf515ab12 Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Thu, 14 Jun 2018 15:27:37 -0700 Subject: [PATCH 16/27] kcov: prefault the kcov_area On many architectures the vmalloc area is lazily faulted in upon first access. This is problematic for KCOV, as __sanitizer_cov_trace_pc accesses the (vmalloc'd) kcov_area, and fault handling code may be instrumented. If an access to kcov_area faults, this will result in mutual recursion through the fault handling code and __sanitizer_cov_trace_pc(), eventually leading to stack corruption and/or overflow. We can avoid this by faulting in the kcov_area before __sanitizer_cov_trace_pc() is permitted to access it. Once it has been faulted in, it will remain present in the process page tables, and will not fault again. [akpm@linux-foundation.org: code cleanup] [akpm@linux-foundation.org: add comment explaining kcov_fault_in_area()] [akpm@linux-foundation.org: fancier code comment from Mark] Link: http://lkml.kernel.org/r/20180504135535.53744-3-mark.rutland@arm.com Signed-off-by: Mark Rutland Acked-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- kernel/kcov.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/kernel/kcov.c b/kernel/kcov.c index 5be9a60a959f..cf250392c55c 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -324,6 +324,21 @@ static int kcov_close(struct inode *inode, struct file *filep) return 0; } +/* + * Fault in a lazily-faulted vmalloc area before it can be used by + * __santizer_cov_trace_pc(), to avoid recursion issues if any code on the + * vmalloc fault handling path is instrumented. + */ +static void kcov_fault_in_area(struct kcov *kcov) +{ + unsigned long stride = PAGE_SIZE / sizeof(unsigned long); + unsigned long *area = kcov->area; + unsigned long offset; + + for (offset = 0; offset < kcov->size; offset += stride) + READ_ONCE(area[offset]); +} + static int kcov_ioctl_locked(struct kcov *kcov, unsigned int cmd, unsigned long arg) { @@ -372,6 +387,7 @@ static int kcov_ioctl_locked(struct kcov *kcov, unsigned int cmd, #endif else return -EINVAL; + kcov_fault_in_area(kcov); /* Cache in task struct for performance. */ t->kcov_size = kcov->size; t->kcov_area = kcov->area; From 0ed557aa813922f6f32adec69e266532091c895b Mon Sep 17 00:00:00 2001 From: Mark Rutland Date: Thu, 14 Jun 2018 15:27:41 -0700 Subject: [PATCH 17/27] sched/core / kcov: avoid kcov_area during task switch During a context switch, we first switch_mm() to the next task's mm, then switch_to() that new task. This means that vmalloc'd regions which had previously been faulted in can transiently disappear in the context of the prev task. Functions instrumented by KCOV may try to access a vmalloc'd kcov_area during this window, and as the fault handling code is instrumented, this results in a recursive fault. We must avoid accessing any kcov_area during this window. We can do so with a new flag in kcov_mode, set prior to switching the mm, and cleared once the new task is live. Since task_struct::kcov_mode isn't always a specific enum kcov_mode value, this is made an unsigned int. The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers, which are empty for !CONFIG_KCOV kernels. The code uses macros because I can't use static inline functions without a circular include dependency between and , since the definition of task_struct uses things defined in Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com Signed-off-by: Mark Rutland Acked-by: Andrey Ryabinin Cc: Dmitry Vyukov Cc: Ingo Molnar Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/kcov.h | 14 ++++++++++++++ include/linux/sched.h | 2 +- kernel/kcov.c | 2 +- kernel/sched/core.c | 4 ++++ 4 files changed, 20 insertions(+), 2 deletions(-) diff --git a/include/linux/kcov.h b/include/linux/kcov.h index 3ecf6f5e3a5f..b76a1807028d 100644 --- a/include/linux/kcov.h +++ b/include/linux/kcov.h @@ -22,13 +22,27 @@ enum kcov_mode { KCOV_MODE_TRACE_CMP = 3, }; +#define KCOV_IN_CTXSW (1 << 30) + void kcov_task_init(struct task_struct *t); void kcov_task_exit(struct task_struct *t); +#define kcov_prepare_switch(t) \ +do { \ + (t)->kcov_mode |= KCOV_IN_CTXSW; \ +} while (0) + +#define kcov_finish_switch(t) \ +do { \ + (t)->kcov_mode &= ~KCOV_IN_CTXSW; \ +} while (0) + #else static inline void kcov_task_init(struct task_struct *t) {} static inline void kcov_task_exit(struct task_struct *t) {} +static inline void kcov_prepare_switch(struct task_struct *t) {} +static inline void kcov_finish_switch(struct task_struct *t) {} #endif /* CONFIG_KCOV */ #endif /* _LINUX_KCOV_H */ diff --git a/include/linux/sched.h b/include/linux/sched.h index cfb7da88c217..87bf02d93a27 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1130,7 +1130,7 @@ struct task_struct { #ifdef CONFIG_KCOV /* Coverage collection mode enabled for this task (0 if disabled): */ - enum kcov_mode kcov_mode; + unsigned int kcov_mode; /* Size of the kcov_area: */ unsigned int kcov_size; diff --git a/kernel/kcov.c b/kernel/kcov.c index cf250392c55c..3ebd09efe72a 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -58,7 +58,7 @@ struct kcov { static bool check_kcov_mode(enum kcov_mode needed_mode, struct task_struct *t) { - enum kcov_mode mode; + unsigned int mode; /* * We are interested in code coverage as a function of a syscall inputs, diff --git a/kernel/sched/core.c b/kernel/sched/core.c index a98d54cd5535..78d8facba456 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -10,6 +10,8 @@ #include #include +#include + #include #include @@ -2633,6 +2635,7 @@ static inline void prepare_task_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next) { + kcov_prepare_switch(prev); sched_info_switch(rq, prev, next); perf_event_task_sched_out(prev, next); rseq_preempt(prev); @@ -2702,6 +2705,7 @@ static struct rq *finish_task_switch(struct task_struct *prev) finish_task(prev); finish_lock_switch(rq); finish_arch_post_lock_switch(); + kcov_finish_switch(current); fire_sched_in_preempt_notifiers(current); /* From 758517202bd2e427664857c9f2aa59da36848aca Mon Sep 17 00:00:00 2001 From: Dmitry Vyukov Date: Thu, 14 Jun 2018 15:27:44 -0700 Subject: [PATCH 18/27] arm: port KCOV to arm KCOV is code coverage collection facility used, in particular, by syzkaller system call fuzzer. There is some interest in using syzkaller on arm devices. So port KCOV to arm. On implementation level this merely declares that KCOV is supported and disables instrumentation of 3 special cases. Reasons for disabling are commented in code. Tested with qemu-system-arm/vexpress-a15. Link: http://lkml.kernel.org/r/20180511143248.112484-1-dvyukov@google.com Signed-off-by: Dmitry Vyukov Acked-by: Mark Rutland Cc: Russell King Cc: Abbott Liu Cc: Catalin Marinas Cc: Koguchi Takuo Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm/Kconfig | 3 ++- arch/arm/boot/compressed/Makefile | 3 +++ arch/arm/kvm/hyp/Makefile | 8 ++++++++ arch/arm/vdso/Makefile | 3 +++ 4 files changed, 16 insertions(+), 1 deletion(-) diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 2a78bdef9a24..47eb26dc2427 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -8,9 +8,10 @@ config ARM select ARCH_HAS_DEVMEM_IS_ALLOWED select ARCH_HAS_ELF_RANDOMIZE select ARCH_HAS_FORTIFY_SOURCE + select ARCH_HAS_KCOV select ARCH_HAS_PTE_SPECIAL if ARM_LPAE - select ARCH_HAS_SET_MEMORY select ARCH_HAS_PHYS_TO_DMA + select ARCH_HAS_SET_MEMORY select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL select ARCH_HAS_STRICT_MODULE_RWX if MMU select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST diff --git a/arch/arm/boot/compressed/Makefile b/arch/arm/boot/compressed/Makefile index a3c5fbcad4ab..1f5a5ffe7fcf 100644 --- a/arch/arm/boot/compressed/Makefile +++ b/arch/arm/boot/compressed/Makefile @@ -25,6 +25,9 @@ endif GCOV_PROFILE := n +# Prevents link failures: __sanitizer_cov_trace_pc() is not linked in. +KCOV_INSTRUMENT := n + # # Architecture dependencies # diff --git a/arch/arm/kvm/hyp/Makefile b/arch/arm/kvm/hyp/Makefile index 7fc0638f263a..d2b5ec9c4b92 100644 --- a/arch/arm/kvm/hyp/Makefile +++ b/arch/arm/kvm/hyp/Makefile @@ -23,3 +23,11 @@ obj-$(CONFIG_KVM_ARM_HOST) += hyp-entry.o obj-$(CONFIG_KVM_ARM_HOST) += switch.o CFLAGS_switch.o += $(CFLAGS_ARMV7VE) obj-$(CONFIG_KVM_ARM_HOST) += s2-setup.o + +# KVM code is run at a different exception code with a different map, so +# compiler instrumentation that inserts callbacks or checks into the code may +# cause crashes. Just disable it. +GCOV_PROFILE := n +KASAN_SANITIZE := n +UBSAN_SANITIZE := n +KCOV_INSTRUMENT := n diff --git a/arch/arm/vdso/Makefile b/arch/arm/vdso/Makefile index bb4118213fee..f4efff9d3afb 100644 --- a/arch/arm/vdso/Makefile +++ b/arch/arm/vdso/Makefile @@ -30,6 +30,9 @@ CFLAGS_vgettimeofday.o = -O2 # Disable gcov profiling for VDSO code GCOV_PROFILE := n +# Prevents link failures: __sanitizer_cov_trace_pc() is not linked in. +KCOV_INSTRUMENT := n + # Force dependency $(obj)/vdso.o : $(obj)/vdso.so From f1b4bd0676c2b3d4a023cf3f5d535e618f7e6eff Mon Sep 17 00:00:00 2001 From: Mikulas Patocka Date: Thu, 14 Jun 2018 15:27:48 -0700 Subject: [PATCH 19/27] fault-injection: reorder config entries Reorder Kconfig entries, so that menuconfig displays proper indentation. Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1804251601160.30569@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Mikulas Patocka Acked-by: Randy Dunlap Tested-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/Kconfig.debug | 36 ++++++++++++++++++------------------ 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index d543c65ce0eb..8838d1158d19 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1506,6 +1506,10 @@ config NETDEV_NOTIFIER_ERROR_INJECT If unsure, say N. +config FUNCTION_ERROR_INJECTION + def_bool y + depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES + config FAULT_INJECTION bool "Fault-injection framework" depends on DEBUG_KERNEL @@ -1513,10 +1517,6 @@ config FAULT_INJECTION Provide fault-injection framework. For more details, see Documentation/fault-injection/. -config FUNCTION_ERROR_INJECTION - def_bool y - depends on HAVE_FUNCTION_ERROR_INJECTION && KPROBES - config FAILSLAB bool "Fault-injection capability for kmalloc" depends on FAULT_INJECTION @@ -1547,16 +1547,6 @@ config FAIL_IO_TIMEOUT Only works with drivers that use the generic timeout handling, for others it wont do anything. -config FAIL_MMC_REQUEST - bool "Fault-injection capability for MMC IO" - depends on FAULT_INJECTION_DEBUG_FS && MMC - help - Provide fault-injection capability for MMC IO. - This will make the mmc core return data errors. This is - useful to test the error handling in the mmc block device - and to test how the mmc host driver handles retries from - the block device. - config FAIL_FUTEX bool "Fault-injection capability for futexes" select DEBUG_FS @@ -1564,6 +1554,12 @@ config FAIL_FUTEX help Provide fault-injection capability for futexes. +config FAULT_INJECTION_DEBUG_FS + bool "Debugfs entries for fault-injection capabilities" + depends on FAULT_INJECTION && SYSFS && DEBUG_FS + help + Enable configuration of fault-injection capabilities via debugfs. + config FAIL_FUNCTION bool "Fault-injection capability for functions" depends on FAULT_INJECTION_DEBUG_FS && FUNCTION_ERROR_INJECTION @@ -1574,11 +1570,15 @@ config FAIL_FUNCTION an error value and have to handle it. This is useful to test the error handling in various subsystems. -config FAULT_INJECTION_DEBUG_FS - bool "Debugfs entries for fault-injection capabilities" - depends on FAULT_INJECTION && SYSFS && DEBUG_FS +config FAIL_MMC_REQUEST + bool "Fault-injection capability for MMC IO" + depends on FAULT_INJECTION_DEBUG_FS && MMC help - Enable configuration of fault-injection capabilities via debugfs. + Provide fault-injection capability for MMC IO. + This will make the mmc core return data errors. This is + useful to test the error handling in the mmc block device + and to test how the mmc host driver handles retries from + the block device. config FAULT_INJECTION_STACKTRACE_FILTER bool "stacktrace filter for fault-injection capabilities" From ec67aaa46dce26d671b46c94ac674ad0b67d044c Mon Sep 17 00:00:00 2001 From: Davidlohr Bueso Date: Thu, 14 Jun 2018 15:27:51 -0700 Subject: [PATCH 20/27] sysvipc/sem: mitigate semnum index against spectre v1 Both smatch and coverity are reporting potential issues with spectre variant 1 with the 'semnum' index within the sma->sems array, ie: ipc/sem.c:388 sem_lock() warn: potential spectre issue 'sma->sems' ipc/sem.c:641 perform_atomic_semop_slow() warn: potential spectre issue 'sma->sems' ipc/sem.c:721 perform_atomic_semop() warn: potential spectre issue 'sma->sems' Avoid any possible speculation by using array_index_nospec() thus ensuring the semnum value is bounded to [0, sma->sem_nsems). With the exception of sem_lock() all of these are slowpaths. Link: http://lkml.kernel.org/r/20180423171131.njs4rfm2yzyeg6do@linux-n805 Signed-off-by: Davidlohr Bueso Reported-by: Dan Carpenter Cc: Peter Zijlstra Cc: "Gustavo A. R. Silva" Cc: Manfred Spraul Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- ipc/sem.c | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/ipc/sem.c b/ipc/sem.c index 59a3cd1d3252..5af1943ad782 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -85,6 +85,7 @@ #include #include #include +#include #include #include "util.h" @@ -368,6 +369,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, int nsops) { struct sem *sem; + int idx; if (nsops != 1) { /* Complex operation - acquire a full lock */ @@ -385,7 +387,8 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops, * * Both facts are tracked by use_global_mode. */ - sem = &sma->sems[sops->sem_num]; + idx = array_index_nospec(sops->sem_num, sma->sem_nsems); + sem = &sma->sems[idx]; /* * Initial check for use_global_lock. Just an optimization, @@ -638,7 +641,8 @@ static int perform_atomic_semop_slow(struct sem_array *sma, struct sem_queue *q) un = q->undo; for (sop = sops; sop < sops + nsops; sop++) { - curr = &sma->sems[sop->sem_num]; + int idx = array_index_nospec(sop->sem_num, sma->sem_nsems); + curr = &sma->sems[idx]; sem_op = sop->sem_op; result = curr->semval; @@ -718,7 +722,9 @@ static int perform_atomic_semop(struct sem_array *sma, struct sem_queue *q) * until the operations can go through. */ for (sop = sops; sop < sops + nsops; sop++) { - curr = &sma->sems[sop->sem_num]; + int idx = array_index_nospec(sop->sem_num, sma->sem_nsems); + + curr = &sma->sems[idx]; sem_op = sop->sem_op; result = curr->semval; @@ -1356,6 +1362,7 @@ static int semctl_setval(struct ipc_namespace *ns, int semid, int semnum, return -EIDRM; } + semnum = array_index_nospec(semnum, sma->sem_nsems); curr = &sma->sems[semnum]; ipc_assert_locked_object(&sma->sem_perm); @@ -1509,6 +1516,8 @@ static int semctl_main(struct ipc_namespace *ns, int semid, int semnum, err = -EIDRM; goto out_unlock; } + + semnum = array_index_nospec(semnum, nsems); curr = &sma->sems[semnum]; switch (cmd) { @@ -2081,7 +2090,8 @@ static long do_semtimedop(int semid, struct sembuf __user *tsops, */ if (nsops == 1) { struct sem *curr; - curr = &sma->sems[sops->sem_num]; + int idx = array_index_nospec(sops->sem_num, sma->sem_nsems); + curr = &sma->sems[idx]; if (alter) { if (sma->complex_count) { From 14f28f5776927be30717986f86b765d49eec392c Mon Sep 17 00:00:00 2001 From: Souptick Joarder Date: Thu, 14 Jun 2018 15:27:55 -0700 Subject: [PATCH 21/27] ipc: use new return type vm_fault_t Use new return type vm_fault_t for fault handler. For now, this is just documenting that the function returns a VM_FAULT value rather than an errno. Once all instances are converted, vm_fault_t will become a distinct type. Commit 1c8f422059ae ("mm: change return type to vm_fault_t") Link: http://lkml.kernel.org/r/20180425043413.GA21467@jordon-HP-15-Notebook-PC Signed-off-by: Souptick Joarder Reviewed-by: Matthew Wilcox Reviewed-by: Andrew Morton Acked-by: Davidlohr Bueso Cc: Manfred Spraul Cc: Eric W. Biederman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- ipc/shm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ipc/shm.c b/ipc/shm.c index 29978ee76c2e..051a3e1fb8df 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -408,7 +408,7 @@ void exit_shm(struct task_struct *task) up_write(&shm_ids(ns).rwsem); } -static int shm_fault(struct vm_fault *vmf) +static vm_fault_t shm_fault(struct vm_fault *vmf) { struct file *file = vmf->vma->vm_file; struct shm_file_data *sfd = shm_file_data(file); From 0825a6f98689d847ab8058c51b3a55f0abcc6563 Mon Sep 17 00:00:00 2001 From: Joe Perches Date: Thu, 14 Jun 2018 15:27:58 -0700 Subject: [PATCH 22/27] mm: use octal not symbolic permissions mm/*.c files use symbolic and octal styles for permissions. Using octal and not symbolic permissions is preferred by many as more readable. https://lkml.org/lkml/2016/8/2/1945 Prefer the direct use of octal for permissions. Done using $ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c and some typing. Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l 44 After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l 86 Miscellanea: o Whitespace neatening around these conversions. Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com Signed-off-by: Joe Perches Acked-by: David Rientjes Acked-by: Michal Hocko Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/cleancache.c | 10 ++++------ mm/cma_debug.c | 25 ++++++++++--------------- mm/compaction.c | 2 +- mm/dmapool.c | 2 +- mm/failslab.c | 2 +- mm/frontswap.c | 11 +++++------ mm/memblock.c | 9 ++++++--- mm/page_alloc.c | 2 +- mm/page_idle.c | 2 +- mm/page_owner.c | 4 ++-- mm/shmem.c | 9 +++++---- mm/slab_common.c | 4 ++-- mm/vmalloc.c | 4 ++-- mm/zsmalloc.c | 5 +++-- mm/zswap.c | 38 +++++++++++++++++++------------------- 15 files changed, 63 insertions(+), 66 deletions(-) diff --git a/mm/cleancache.c b/mm/cleancache.c index 126548b5a292..2bf12da9baa0 100644 --- a/mm/cleancache.c +++ b/mm/cleancache.c @@ -307,12 +307,10 @@ static int __init init_cleancache(void) struct dentry *root = debugfs_create_dir("cleancache", NULL); if (root == NULL) return -ENXIO; - debugfs_create_u64("succ_gets", S_IRUGO, root, &cleancache_succ_gets); - debugfs_create_u64("failed_gets", S_IRUGO, - root, &cleancache_failed_gets); - debugfs_create_u64("puts", S_IRUGO, root, &cleancache_puts); - debugfs_create_u64("invalidates", S_IRUGO, - root, &cleancache_invalidates); + debugfs_create_u64("succ_gets", 0444, root, &cleancache_succ_gets); + debugfs_create_u64("failed_gets", 0444, root, &cleancache_failed_gets); + debugfs_create_u64("puts", 0444, root, &cleancache_puts); + debugfs_create_u64("invalidates", 0444, root, &cleancache_invalidates); #endif return 0; } diff --git a/mm/cma_debug.c b/mm/cma_debug.c index 275df8b5b22e..f23467291cfb 100644 --- a/mm/cma_debug.c +++ b/mm/cma_debug.c @@ -172,23 +172,18 @@ static void cma_debugfs_add_one(struct cma *cma, int idx) tmp = debugfs_create_dir(name, cma_debugfs_root); - debugfs_create_file("alloc", S_IWUSR, tmp, cma, - &cma_alloc_fops); - - debugfs_create_file("free", S_IWUSR, tmp, cma, - &cma_free_fops); - - debugfs_create_file("base_pfn", S_IRUGO, tmp, - &cma->base_pfn, &cma_debugfs_fops); - debugfs_create_file("count", S_IRUGO, tmp, - &cma->count, &cma_debugfs_fops); - debugfs_create_file("order_per_bit", S_IRUGO, tmp, - &cma->order_per_bit, &cma_debugfs_fops); - debugfs_create_file("used", S_IRUGO, tmp, cma, &cma_used_fops); - debugfs_create_file("maxchunk", S_IRUGO, tmp, cma, &cma_maxchunk_fops); + debugfs_create_file("alloc", 0200, tmp, cma, &cma_alloc_fops); + debugfs_create_file("free", 0200, tmp, cma, &cma_free_fops); + debugfs_create_file("base_pfn", 0444, tmp, + &cma->base_pfn, &cma_debugfs_fops); + debugfs_create_file("count", 0444, tmp, &cma->count, &cma_debugfs_fops); + debugfs_create_file("order_per_bit", 0444, tmp, + &cma->order_per_bit, &cma_debugfs_fops); + debugfs_create_file("used", 0444, tmp, cma, &cma_used_fops); + debugfs_create_file("maxchunk", 0444, tmp, cma, &cma_maxchunk_fops); u32s = DIV_ROUND_UP(cma_bitmap_maxno(cma), BITS_PER_BYTE * sizeof(u32)); - debugfs_create_u32_array("bitmap", S_IRUGO, tmp, (u32*)cma->bitmap, u32s); + debugfs_create_u32_array("bitmap", 0444, tmp, (u32 *)cma->bitmap, u32s); } static int __init cma_debugfs_init(void) diff --git a/mm/compaction.c b/mm/compaction.c index 29bd1df18b98..faca45ebe62d 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1899,7 +1899,7 @@ static ssize_t sysfs_compact_node(struct device *dev, return count; } -static DEVICE_ATTR(compact, S_IWUSR, NULL, sysfs_compact_node); +static DEVICE_ATTR(compact, 0200, NULL, sysfs_compact_node); int compaction_register_node(struct node *node) { diff --git a/mm/dmapool.c b/mm/dmapool.c index 4d90a64b2fdc..6d4b97e7e9e9 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -105,7 +105,7 @@ show_pools(struct device *dev, struct device_attribute *attr, char *buf) return PAGE_SIZE - size; } -static DEVICE_ATTR(pools, S_IRUGO, show_pools, NULL); +static DEVICE_ATTR(pools, 0444, show_pools, NULL); /** * dma_pool_create - Creates a pool of consistent memory blocks, for dma. diff --git a/mm/failslab.c b/mm/failslab.c index 1f2f248e3601..b135ebb88b6f 100644 --- a/mm/failslab.c +++ b/mm/failslab.c @@ -42,7 +42,7 @@ __setup("failslab=", setup_failslab); static int __init failslab_debugfs_init(void) { struct dentry *dir; - umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; + umode_t mode = S_IFREG | 0600; dir = fault_create_debugfs_attr("failslab", NULL, &failslab.attr); if (IS_ERR(dir)) diff --git a/mm/frontswap.c b/mm/frontswap.c index 4f5476a0f955..157e5bf63504 100644 --- a/mm/frontswap.c +++ b/mm/frontswap.c @@ -486,12 +486,11 @@ static int __init init_frontswap(void) struct dentry *root = debugfs_create_dir("frontswap", NULL); if (root == NULL) return -ENXIO; - debugfs_create_u64("loads", S_IRUGO, root, &frontswap_loads); - debugfs_create_u64("succ_stores", S_IRUGO, root, &frontswap_succ_stores); - debugfs_create_u64("failed_stores", S_IRUGO, root, - &frontswap_failed_stores); - debugfs_create_u64("invalidates", S_IRUGO, - root, &frontswap_invalidates); + debugfs_create_u64("loads", 0444, root, &frontswap_loads); + debugfs_create_u64("succ_stores", 0444, root, &frontswap_succ_stores); + debugfs_create_u64("failed_stores", 0444, root, + &frontswap_failed_stores); + debugfs_create_u64("invalidates", 0444, root, &frontswap_invalidates); #endif return 0; } diff --git a/mm/memblock.c b/mm/memblock.c index f1dba2826d0f..cc16d70b8333 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1809,10 +1809,13 @@ static int __init memblock_init_debugfs(void) struct dentry *root = debugfs_create_dir("memblock", NULL); if (!root) return -ENXIO; - debugfs_create_file("memory", S_IRUGO, root, &memblock.memory, &memblock_debug_fops); - debugfs_create_file("reserved", S_IRUGO, root, &memblock.reserved, &memblock_debug_fops); + debugfs_create_file("memory", 0444, root, + &memblock.memory, &memblock_debug_fops); + debugfs_create_file("reserved", 0444, root, + &memblock.reserved, &memblock_debug_fops); #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP - debugfs_create_file("physmem", S_IRUGO, root, &memblock.physmem, &memblock_debug_fops); + debugfs_create_file("physmem", 0444, root, + &memblock.physmem, &memblock_debug_fops); #endif return 0; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 07b3c23762ad..1521100f1e63 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3061,7 +3061,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) static int __init fail_page_alloc_debugfs(void) { - umode_t mode = S_IFREG | S_IRUSR | S_IWUSR; + umode_t mode = S_IFREG | 0600; struct dentry *dir; dir = fault_create_debugfs_attr("fail_page_alloc", NULL, diff --git a/mm/page_idle.c b/mm/page_idle.c index e412a63b2b74..6302bc62c27d 100644 --- a/mm/page_idle.c +++ b/mm/page_idle.c @@ -201,7 +201,7 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj, } static struct bin_attribute page_idle_bitmap_attr = - __BIN_ATTR(bitmap, S_IRUSR | S_IWUSR, + __BIN_ATTR(bitmap, 0600, page_idle_bitmap_read, page_idle_bitmap_write, 0); static struct bin_attribute *page_idle_bin_attrs[] = { diff --git a/mm/page_owner.c b/mm/page_owner.c index 75d21a2259b3..d80adfe702d3 100644 --- a/mm/page_owner.c +++ b/mm/page_owner.c @@ -631,8 +631,8 @@ static int __init pageowner_init(void) return 0; } - dentry = debugfs_create_file("page_owner", S_IRUSR, NULL, - NULL, &proc_page_owner_operations); + dentry = debugfs_create_file("page_owner", 0400, NULL, + NULL, &proc_page_owner_operations); return PTR_ERR_OR_ZERO(dentry); } diff --git a/mm/shmem.c b/mm/shmem.c index e9a7ac74823d..2cab84403055 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3013,7 +3013,8 @@ static int shmem_symlink(struct inode *dir, struct dentry *dentry, const char *s if (len > PAGE_SIZE) return -ENAMETOOLONG; - inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK|S_IRWXUGO, 0, VM_NORESERVE); + inode = shmem_get_inode(dir->i_sb, dir, S_IFLNK | 0777, 0, + VM_NORESERVE); if (!inode) return -ENOSPC; @@ -3445,7 +3446,7 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root) sbinfo->max_blocks << (PAGE_SHIFT - 10)); if (sbinfo->max_inodes != shmem_default_max_inodes()) seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes); - if (sbinfo->mode != (S_IRWXUGO | S_ISVTX)) + if (sbinfo->mode != (0777 | S_ISVTX)) seq_printf(seq, ",mode=%03ho", sbinfo->mode); if (!uid_eq(sbinfo->uid, GLOBAL_ROOT_UID)) seq_printf(seq, ",uid=%u", @@ -3486,7 +3487,7 @@ int shmem_fill_super(struct super_block *sb, void *data, int silent) if (!sbinfo) return -ENOMEM; - sbinfo->mode = S_IRWXUGO | S_ISVTX; + sbinfo->mode = 0777 | S_ISVTX; sbinfo->uid = current_fsuid(); sbinfo->gid = current_fsgid(); sb->s_fs_info = sbinfo; @@ -3929,7 +3930,7 @@ static struct file *__shmem_file_setup(struct vfsmount *mnt, const char *name, l d_set_d_op(path.dentry, &anon_ops); res = ERR_PTR(-ENOSPC); - inode = shmem_get_inode(sb, NULL, S_IFREG | S_IRWXUGO, 0, flags); + inode = shmem_get_inode(sb, NULL, S_IFREG | 0777, 0, flags); if (!inode) goto put_memory; diff --git a/mm/slab_common.c b/mm/slab_common.c index 42aca26d61d0..890b1f04a03a 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1243,9 +1243,9 @@ void cache_random_seq_destroy(struct kmem_cache *cachep) #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG) #ifdef CONFIG_SLAB -#define SLABINFO_RIGHTS (S_IWUSR | S_IRUSR) +#define SLABINFO_RIGHTS (0600) #else -#define SLABINFO_RIGHTS S_IRUSR +#define SLABINFO_RIGHTS (0400) #endif static void print_slabinfo_header(struct seq_file *m) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 89efac3a020e..cfea25be7754 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -2741,11 +2741,11 @@ static const struct seq_operations vmalloc_op = { static int __init proc_vmalloc_init(void) { if (IS_ENABLED(CONFIG_NUMA)) - proc_create_seq_private("vmallocinfo", S_IRUSR, NULL, + proc_create_seq_private("vmallocinfo", 0400, NULL, &vmalloc_op, nr_node_ids * sizeof(unsigned int), NULL); else - proc_create_seq("vmallocinfo", S_IRUSR, NULL, &vmalloc_op); + proc_create_seq("vmallocinfo", 0400, NULL, &vmalloc_op); return 0; } module_init(proc_vmalloc_init); diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 61cb05dc950c..8d87e973a4f5 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -661,8 +661,9 @@ static void zs_pool_stat_create(struct zs_pool *pool, const char *name) } pool->stat_dentry = entry; - entry = debugfs_create_file("classes", S_IFREG | S_IRUGO, - pool->stat_dentry, pool, &zs_stats_size_fops); + entry = debugfs_create_file("classes", S_IFREG | 0444, + pool->stat_dentry, pool, + &zs_stats_size_fops); if (!entry) { pr_warn("%s: debugfs file entry <%s> creation failed\n", name, "classes"); diff --git a/mm/zswap.c b/mm/zswap.c index 61a5c41972db..7d34e69507e3 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1256,26 +1256,26 @@ static int __init zswap_debugfs_init(void) if (!zswap_debugfs_root) return -ENOMEM; - debugfs_create_u64("pool_limit_hit", S_IRUGO, - zswap_debugfs_root, &zswap_pool_limit_hit); - debugfs_create_u64("reject_reclaim_fail", S_IRUGO, - zswap_debugfs_root, &zswap_reject_reclaim_fail); - debugfs_create_u64("reject_alloc_fail", S_IRUGO, - zswap_debugfs_root, &zswap_reject_alloc_fail); - debugfs_create_u64("reject_kmemcache_fail", S_IRUGO, - zswap_debugfs_root, &zswap_reject_kmemcache_fail); - debugfs_create_u64("reject_compress_poor", S_IRUGO, - zswap_debugfs_root, &zswap_reject_compress_poor); - debugfs_create_u64("written_back_pages", S_IRUGO, - zswap_debugfs_root, &zswap_written_back_pages); - debugfs_create_u64("duplicate_entry", S_IRUGO, - zswap_debugfs_root, &zswap_duplicate_entry); - debugfs_create_u64("pool_total_size", S_IRUGO, - zswap_debugfs_root, &zswap_pool_total_size); - debugfs_create_atomic_t("stored_pages", S_IRUGO, - zswap_debugfs_root, &zswap_stored_pages); + debugfs_create_u64("pool_limit_hit", 0444, + zswap_debugfs_root, &zswap_pool_limit_hit); + debugfs_create_u64("reject_reclaim_fail", 0444, + zswap_debugfs_root, &zswap_reject_reclaim_fail); + debugfs_create_u64("reject_alloc_fail", 0444, + zswap_debugfs_root, &zswap_reject_alloc_fail); + debugfs_create_u64("reject_kmemcache_fail", 0444, + zswap_debugfs_root, &zswap_reject_kmemcache_fail); + debugfs_create_u64("reject_compress_poor", 0444, + zswap_debugfs_root, &zswap_reject_compress_poor); + debugfs_create_u64("written_back_pages", 0444, + zswap_debugfs_root, &zswap_written_back_pages); + debugfs_create_u64("duplicate_entry", 0444, + zswap_debugfs_root, &zswap_duplicate_entry); + debugfs_create_u64("pool_total_size", 0444, + zswap_debugfs_root, &zswap_pool_total_size); + debugfs_create_atomic_t("stored_pages", 0444, + zswap_debugfs_root, &zswap_stored_pages); debugfs_create_atomic_t("same_filled_pages", 0444, - zswap_debugfs_root, &zswap_same_filled_pages); + zswap_debugfs_root, &zswap_same_filled_pages); return 0; } From d7dc899abefb4412388a5d3ec690070197d07d20 Mon Sep 17 00:00:00 2001 From: Stefan Agner Date: Thu, 14 Jun 2018 15:28:02 -0700 Subject: [PATCH 23/27] treewide: use PHYS_ADDR_MAX to avoid type casting ULLONG_MAX With PHYS_ADDR_MAX there is now a type safe variant for all bits set. Make use of it. Patch created using a semantic patch as follows: // @@ typedef phys_addr_t; @@ -(phys_addr_t)ULLONG_MAX +PHYS_ADDR_MAX // Link: http://lkml.kernel.org/r/20180419214204.19322-1-stefan@agner.ch Signed-off-by: Stefan Agner Reviewed-by: Andrew Morton Acked-by: Ard Biesheuvel Acked-by: Catalin Marinas [arm64] Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/arm64/mm/init.c | 6 +++--- arch/mips/kernel/setup.c | 4 ++-- arch/powerpc/mm/mem.c | 2 +- arch/sparc/mm/init_64.c | 2 +- arch/x86/mm/init_32.c | 2 +- arch/x86/mm/init_64.c | 2 +- drivers/firmware/efi/arm-init.c | 2 +- drivers/remoteproc/qcom_q6v5_pil.c | 2 +- drivers/soc/qcom/mdt_loader.c | 4 ++-- 9 files changed, 13 insertions(+), 13 deletions(-) diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c index 1b18b4722420..325cfb3b858a 100644 --- a/arch/arm64/mm/init.c +++ b/arch/arm64/mm/init.c @@ -310,7 +310,7 @@ static void __init arm64_memory_present(void) } #endif -static phys_addr_t memory_limit = (phys_addr_t)ULLONG_MAX; +static phys_addr_t memory_limit = PHYS_ADDR_MAX; /* * Limit the memory size that was specified via FDT. @@ -401,7 +401,7 @@ void __init arm64_memblock_init(void) * high up in memory, add back the kernel region that must be accessible * via the linear mapping. */ - if (memory_limit != (phys_addr_t)ULLONG_MAX) { + if (memory_limit != PHYS_ADDR_MAX) { memblock_mem_limit_remove_map(memory_limit); memblock_add(__pa_symbol(_text), (u64)(_end - _text)); } @@ -666,7 +666,7 @@ __setup("keepinitrd", keepinitrd_setup); */ static int dump_mem_limit(struct notifier_block *self, unsigned long v, void *p) { - if (memory_limit != (phys_addr_t)ULLONG_MAX) { + if (memory_limit != PHYS_ADDR_MAX) { pr_emerg("Memory Limit: %llu MB\n", memory_limit >> 20); } else { pr_emerg("Memory Limit: none\n"); diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c index 563188ac6fa2..2c96c0c68116 100644 --- a/arch/mips/kernel/setup.c +++ b/arch/mips/kernel/setup.c @@ -93,7 +93,7 @@ void __init add_memory_region(phys_addr_t start, phys_addr_t size, long type) * If the region reaches the top of the physical address space, adjust * the size slightly so that (start + size) doesn't overflow */ - if (start + size - 1 == (phys_addr_t)ULLONG_MAX) + if (start + size - 1 == PHYS_ADDR_MAX) --size; /* Sanity check */ @@ -376,7 +376,7 @@ static void __init bootmem_init(void) unsigned long reserved_end; unsigned long mapstart = ~0UL; unsigned long bootmap_size; - phys_addr_t ramstart = (phys_addr_t)ULLONG_MAX; + phys_addr_t ramstart = PHYS_ADDR_MAX; bool bootmap_valid = false; int i; diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c index 8cecda4bd66a..5c8530d0c611 100644 --- a/arch/powerpc/mm/mem.c +++ b/arch/powerpc/mm/mem.c @@ -215,7 +215,7 @@ void __init mem_topology_setup(void) /* Place all memblock_regions in the same node and merge contiguous * memblock_regions */ - memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0); + memblock_set_node(0, PHYS_ADDR_MAX, &memblock.memory, 0); } void __init initmem_init(void) diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c index 8aeb1aabe76e..f396048a0d68 100644 --- a/arch/sparc/mm/init_64.c +++ b/arch/sparc/mm/init_64.c @@ -1620,7 +1620,7 @@ static void __init bootmem_init_nonnuma(void) (top_of_ram - total_ram) >> 20); init_node_masks_nonnuma(); - memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0); + memblock_set_node(0, PHYS_ADDR_MAX, &memblock.memory, 0); allocate_node_data(0); node_set_online(0); } diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c index c893c6a3d707..979e0a02cbe1 100644 --- a/arch/x86/mm/init_32.c +++ b/arch/x86/mm/init_32.c @@ -692,7 +692,7 @@ void __init initmem_init(void) high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1; #endif - memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0); + memblock_set_node(0, PHYS_ADDR_MAX, &memblock.memory, 0); sparse_memory_present_with_active_regions(0); #ifdef CONFIG_FLATMEM diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 17383f9677fa..045f492d5f68 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -742,7 +742,7 @@ kernel_physical_mapping_init(unsigned long paddr_start, #ifndef CONFIG_NUMA void __init initmem_init(void) { - memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0); + memblock_set_node(0, PHYS_ADDR_MAX, &memblock.memory, 0); } #endif diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c index 80d1a885def5..b5214c143fee 100644 --- a/drivers/firmware/efi/arm-init.c +++ b/drivers/firmware/efi/arm-init.c @@ -193,7 +193,7 @@ static __init void reserve_regions(void) * uses its own memory map instead. */ memblock_dump_all(); - memblock_remove(0, (phys_addr_t)ULLONG_MAX); + memblock_remove(0, PHYS_ADDR_MAX); for_each_efi_memory_desc(md) { paddr = md->phys_addr; diff --git a/drivers/remoteproc/qcom_q6v5_pil.c b/drivers/remoteproc/qcom_q6v5_pil.c index 2190debf3d35..2bf8e7c49f2a 100644 --- a/drivers/remoteproc/qcom_q6v5_pil.c +++ b/drivers/remoteproc/qcom_q6v5_pil.c @@ -686,7 +686,7 @@ static int q6v5_mpss_load(struct q6v5 *qproc) struct elf32_hdr *ehdr; phys_addr_t mpss_reloc; phys_addr_t boot_addr; - phys_addr_t min_addr = (phys_addr_t)ULLONG_MAX; + phys_addr_t min_addr = PHYS_ADDR_MAX; phys_addr_t max_addr = 0; bool relocate = false; char seg_name[10]; diff --git a/drivers/soc/qcom/mdt_loader.c b/drivers/soc/qcom/mdt_loader.c index 17b314d9a148..dc09d7ac905f 100644 --- a/drivers/soc/qcom/mdt_loader.c +++ b/drivers/soc/qcom/mdt_loader.c @@ -50,7 +50,7 @@ ssize_t qcom_mdt_get_size(const struct firmware *fw) const struct elf32_phdr *phdrs; const struct elf32_phdr *phdr; const struct elf32_hdr *ehdr; - phys_addr_t min_addr = (phys_addr_t)ULLONG_MAX; + phys_addr_t min_addr = PHYS_ADDR_MAX; phys_addr_t max_addr = 0; int i; @@ -97,7 +97,7 @@ int qcom_mdt_load(struct device *dev, const struct firmware *fw, const struct elf32_hdr *ehdr; const struct firmware *seg_fw; phys_addr_t mem_reloc; - phys_addr_t min_addr = (phys_addr_t)ULLONG_MAX; + phys_addr_t min_addr = PHYS_ADDR_MAX; phys_addr_t max_addr = 0; size_t fw_name_len; ssize_t offset; From fe6bdfc8e1e131720abbe77a2eb990c94c9024cb Mon Sep 17 00:00:00 2001 From: Roman Gushchin Date: Thu, 14 Jun 2018 15:28:05 -0700 Subject: [PATCH 24/27] mm: fix oom_kill event handling Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate when waking pollers") converted most of memcg event counters to per-memcg atomics, which made them less confusing for a user. The "oom_kill" counter remained untouched, so now it behaves differently than other counters (including "oom"). This adds nothing but confusion. Let's fix this by adding the MEMCG_OOM_KILL event, and follow the MEMCG_OOM approach. This also removes a hack from count_memcg_event_mm(), introduced earlier specially for the OOM_KILL counter. [akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch] Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com Signed-off-by: Roman Gushchin Acked-by: Konstantin Khlebnikov Acked-by: Johannes Weiner Acked-by: Michal Hocko Cc: Vladimir Davydov Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- include/linux/memcontrol.h | 26 ++++++++++++++++++++++---- mm/memcontrol.c | 6 ++++-- mm/oom_kill.c | 2 +- 3 files changed, 27 insertions(+), 7 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 4f52ec755725..6c6fb116e925 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -53,6 +53,7 @@ enum memcg_memory_event { MEMCG_HIGH, MEMCG_MAX, MEMCG_OOM, + MEMCG_OOM_KILL, MEMCG_SWAP_MAX, MEMCG_SWAP_FAIL, MEMCG_NR_MEMORY_EVENTS, @@ -720,11 +721,8 @@ static inline void count_memcg_event_mm(struct mm_struct *mm, rcu_read_lock(); memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); - if (likely(memcg)) { + if (likely(memcg)) count_memcg_events(memcg, idx, 1); - if (idx == OOM_KILL) - cgroup_file_notify(&memcg->events_file); - } rcu_read_unlock(); } @@ -735,6 +733,21 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg, cgroup_file_notify(&memcg->events_file); } +static inline void memcg_memory_event_mm(struct mm_struct *mm, + enum memcg_memory_event event) +{ + struct mem_cgroup *memcg; + + if (mem_cgroup_disabled()) + return; + + rcu_read_lock(); + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner)); + if (likely(memcg)) + memcg_memory_event(memcg, event); + rcu_read_unlock(); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE void mem_cgroup_split_huge_fixup(struct page *head); #endif @@ -756,6 +769,11 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg, { } +static inline void memcg_memory_event_mm(struct mm_struct *mm, + enum memcg_memory_event event) +{ +} + static inline enum mem_cgroup_protection mem_cgroup_protected( struct mem_cgroup *root, struct mem_cgroup *memcg) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 5a3873e9d657..e6f0d5ef320a 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -3550,7 +3550,8 @@ static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v) seq_printf(sf, "oom_kill_disable %d\n", memcg->oom_kill_disable); seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom); - seq_printf(sf, "oom_kill %lu\n", memcg_sum_events(memcg, OOM_KILL)); + seq_printf(sf, "oom_kill %lu\n", + atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL])); return 0; } @@ -5239,7 +5240,8 @@ static int memory_events_show(struct seq_file *m, void *v) atomic_long_read(&memcg->memory_events[MEMCG_MAX])); seq_printf(m, "oom %lu\n", atomic_long_read(&memcg->memory_events[MEMCG_OOM])); - seq_printf(m, "oom_kill %lu\n", memcg_sum_events(memcg, OOM_KILL)); + seq_printf(m, "oom_kill %lu\n", + atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL])); return 0; } diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 6694348b27e9..84081e77bc51 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -913,7 +913,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message) /* Raise event before sending signal: task reaper must see this */ count_vm_event(OOM_KILL); - count_memcg_event_mm(mm, OOM_KILL); + memcg_memory_event_mm(mm, MEMCG_OOM_KILL); /* * We should send SIGKILL before granting access to memory reserves From 2738f359b1dcae50b704efe4ab799ea4861fc490 Mon Sep 17 00:00:00 2001 From: Randy Dunlap Date: Thu, 14 Jun 2018 15:28:09 -0700 Subject: [PATCH 25/27] hexagon: fix printk format warning in setup.c Fix printk format warning in hexagon/kernel/setup.c: ../arch/hexagon/kernel/setup.c: In function 'setup_arch': ../arch/hexagon/kernel/setup.c:69:2: warning: format '%x' expects argument of type 'unsigned int', but argument 2 has type 'long unsigned int' [-Wformat] where: extern unsigned long __phys_offset; #define PHYS_OFFSET __phys_offset Link: http://lkml.kernel.org/r/adce8db5-4b01-dc10-7fbb-6a64e0787eb5@infradead.org Signed-off-by: Randy Dunlap Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/hexagon/kernel/setup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/hexagon/kernel/setup.c b/arch/hexagon/kernel/setup.c index 6981949f5df3..dc8c7e75b5d1 100644 --- a/arch/hexagon/kernel/setup.c +++ b/arch/hexagon/kernel/setup.c @@ -66,7 +66,7 @@ void __init setup_arch(char **cmdline_p) */ __vmsetvec(_K_VM_event_vector); - printk(KERN_INFO "PHYS_OFFSET=0x%08x\n", PHYS_OFFSET); + printk(KERN_INFO "PHYS_OFFSET=0x%08lx\n", PHYS_OFFSET); /* * Simulator has a few differences from the hardware. From 608dbdfb1f0299f4500e56d62b0d84c44dcfa3be Mon Sep 17 00:00:00 2001 From: Anshuman Khandual Date: Thu, 14 Jun 2018 15:28:12 -0700 Subject: [PATCH 26/27] hexagon: drop the unused variable zero_page_mask Hexagon arch does not seem to have subscribed to _HAVE_COLOR_ZERO_PAGE framework. Hence zero_page_mask variable is not needed. Link: http://lkml.kernel.org/r/20180517061105.30447-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual Reviewed-by: Andrew Morton Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- arch/hexagon/include/asm/pgtable.h | 1 - arch/hexagon/mm/init.c | 3 --- 2 files changed, 4 deletions(-) diff --git a/arch/hexagon/include/asm/pgtable.h b/arch/hexagon/include/asm/pgtable.h index aef02f7ca8aa..65125d0b02dd 100644 --- a/arch/hexagon/include/asm/pgtable.h +++ b/arch/hexagon/include/asm/pgtable.h @@ -30,7 +30,6 @@ /* A handy thing to have if one has the RAM. Declared in head.S */ extern unsigned long empty_zero_page; -extern unsigned long zero_page_mask; /* * The PTE model described here is that of the Hexagon Virtual Machine, diff --git a/arch/hexagon/mm/init.c b/arch/hexagon/mm/init.c index 192584d5ac2f..1495d45e472d 100644 --- a/arch/hexagon/mm/init.c +++ b/arch/hexagon/mm/init.c @@ -39,9 +39,6 @@ unsigned long __phys_offset; /* physical kernel offset >> 12 */ /* Set as variable to limit PMD copies */ int max_kernel_seg = 0x303; -/* think this should be (page_size-1) the way it's used...*/ -unsigned long zero_page_mask; - /* indicate pfn's of high memory */ unsigned long highstart_pfn, highend_pfn; From ee410f15b1418f2f4428e79980674c979081bcb7 Mon Sep 17 00:00:00 2001 From: Thierry Escande Date: Thu, 14 Jun 2018 15:28:15 -0700 Subject: [PATCH 27/27] lib/test_printf.c: call wait_for_random_bytes() before plain %p tests If the test_printf module is loaded before the crng is initialized, the plain 'p' tests will fail because the printed address will not be hashed and the buffer will contain '(ptrval)' instead. This patch adds a call to wait_for_random_bytes() before plain 'p' tests to make sure the crng is initialized. Link: http://lkml.kernel.org/r/20180604113708.11554-1-thierry.escande@linaro.org Signed-off-by: Thierry Escande Acked-by: Tobin C. Harding Reviewed-by: Andrew Morton Cc: David Miller Cc: Rasmus Villemoes Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- lib/test_printf.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/lib/test_printf.c b/lib/test_printf.c index cea592f402ed..b2aa8f514844 100644 --- a/lib/test_printf.c +++ b/lib/test_printf.c @@ -260,6 +260,13 @@ plain(void) { int err; + /* + * Make sure crng is ready. Otherwise we get "(ptrval)" instead + * of a hashed address when printing '%p' in plain_hash() and + * plain_format(). + */ + wait_for_random_bytes(); + err = plain_hash(); if (err) { pr_warn("plain 'p' does not appear to be hashed\n");