mirror of https://gitee.com/openkylin/linux.git
13305 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Yang Shi | 2cee57d1b0 |
mm: ksm: do not block on page lock when searching stable tree
ksmd needs to search the stable tree to look for the suitable KSM page,
but the KSM page might be locked for a while due to i.e. KSM page rmap
walk. Basically it is not a big deal since commit
|
|
Chris Down | 1ff9e6e179 |
mm: memcontrol: expose THP events on a per-memcg basis
Currently THP allocation events data is fairly opaque, since you can only get it system-wide. This patch makes it easier to reason about transparent hugepage behaviour on a per-memcg basis. For anonymous THP-backed pages, we already have MEMCG_RSS_HUGE in v1, which is used for v1's rss_huge [sic]. This is reused here as it's fairly involved to untangle NR_ANON_THPS right now to make it per-memcg, since right now some of this is delegated to rmap before we have any memcg actually assigned to the page. It's a good idea to rework that, but let's leave untangling THP allocation for a future patch. [akpm@linux-foundation.org: fix build] [chris@chrisdown.name: fix memcontrol build when THP is disabled] Link: http://lkml.kernel.org/r/20190131160802.GA5777@chrisdown.name Link: http://lkml.kernel.org/r/20190129205852.GA7310@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | 2bb0f34fe3 |
mm: vmscan: do not iterate all mem cgroups for global direct reclaim
In current implementation, both kswapd and direct reclaim has to iterate all mem cgroups. It is not a problem before offline mem cgroups could be iterated. But, currently with iterating offline mem cgroups, it could be very time consuming. In our workloads, we saw over 400K mem cgroups accumulated in some cases, only a few hundred are online memcgs. Although kswapd could help out to reduce the number of memcgs, direct reclaim still get hit with iterating a number of offline memcgs in some cases. We experienced the responsiveness problems due to this occassionally. A simple test with pref shows it may take around 220ms to iterate 8K memcgs in direct reclaim: dd 13873 [011] 578.542919: vmscan:mm_vmscan_direct_reclaim_begin dd 13873 [011] 578.758689: vmscan:mm_vmscan_direct_reclaim_end So for 400K, it may take around 11 seconds to iterate all memcgs. Here just break the iteration once it reclaims enough pages as what memcg direct reclaim does. This may hurt the fairness among memcgs. But the cached iterator cookie could help to achieve the fairness more or less. Link: http://lkml.kernel.org/r/1548799877-10949-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joel Fernandes (Google) | ab3948f58f |
mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then add protection against making any "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow documentation in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week where we don't need to modify core VFS structures to get the same behavior of the seal. This solves several side-effects pointed by Andy. self-tests are provided in later patch to verify the expected semantics. [1] https://lore.kernel.org/lkml/20181111173650.GA256781@google.com/ Thanks a lot to Andy for suggestions to improve code. Link: http://lkml.kernel.org/r/20190112203816.85534-2-joel@joelfernandes.org Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: John Stultz <john.stultz@linaro.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Marc-Andr Lureau <marcandre.lureau@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Aneesh Kumar K.V | 9a4e9f3b2d |
mm: update get_user_pages_longterm to migrate pages allocated from CMA region
This patch updates get_user_pages_longterm to migrate pages allocated out of CMA region. This makes sure that we don't keep non-movable pages (due to page reference count) in the CMA area. This will be used by ppc64 in a later patch to avoid pinning pages in the CMA region. ppc64 uses CMA region for allocation of the hardware page table (hash page table) and not able to migrate pages out of CMA region results in page table allocation failures. One case where we hit this easy is when a guest using a VFIO passthrough device. VFIO locks all the guest's memory and if the guest memory is backed by CMA region, it becomes unmovable resulting in fragmenting the CMA and possibly preventing other guests from allocation a large enough hash page table. NOTE: We allocate the new page without using __GFP_THISNODE Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vineeth Remanan Pillai | b56a2d8af9 |
mm: rid swapoff of quadratic complexity
This patch was initially posted by Kelley Nielsen. Reposting the patch with all review comments addressed and with minor modifications and optimizations. Also, folding in the fixes offered by Hugh Dickins and Huang Ying. Tests were rerun and commit message updated with new results. try_to_unuse() is of quadratic complexity, with a lot of wasted effort. It unuses swap entries one by one, potentially iterating over all the page tables for all the processes in the system for each one. This new proposed implementation of try_to_unuse simplifies its complexity to linear. It iterates over the system's mms once, unusing all the affected entries as it walks each set of page tables. It also makes similar changes to shmem_unuse. Improvement swapoff was called on a swap partition containing about 6G of data, in a VM(8cpu, 16G RAM), and calls to unuse_pte_range() were counted. Present implementation....about 1200M calls(8min, avg 80% cpu util). Prototype.................about 9.0K calls(3min, avg 5% cpu util). Details In shmem_unuse(), iterate over the shmem_swaplist and, for each shmem_inode_info that contains a swap entry, pass it to shmem_unuse_inode(), along with the swap type. In shmem_unuse_inode(), iterate over its associated xarray, and store the index and value of each swap entry in an array for passing to shmem_swapin_page() outside of the RCU critical section. In try_to_unuse(), instead of iterating over the entries in the type and unusing them one by one, perhaps walking all the page tables for all the processes for each one, iterate over the mmlist, making one pass. Pass each mm to unuse_mm() to begin its page table walk, and during the walk, unuse all the ptes that have backing store in the swap type received by try_to_unuse(). After the walk, check the type for orphaned swap entries with find_next_to_unuse(), and remove them from the swap cache. If find_next_to_unuse() starts over at the beginning of the type, repeat the check of the shmem_swaplist and the walk a maximum of three times. Change unuse_mm() and the intervening walk functions down to unuse_pte_range() to take the type as a parameter, and to iterate over their entire range, calling the next function down on every iteration. In unuse_pte_range(), make a swap entry from each pte in the range using the passed in type. If it has backing store in the type, call swapin_readahead() to retrieve the page and pass it to unuse_pte(). Pass the count of pages_to_unuse down the page table walks in try_to_unuse(), and return from the walk when the desired number of pages has been swapped back in. Link: http://lkml.kernel.org/r/20190114153129.4852-2-vpillai@digitalocean.com Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Signed-off-by: Kelley Nielsen <kelleynnn@gmail.com> Signed-off-by: Huang Ying <ying.huang@intel.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vineeth Remanan Pillai | c5bf121e43 |
mm: refactor swap-in logic out of shmem_getpage_gfp
swapin logic can be reused independently without rest of the logic in shmem_getpage_gfp. So lets refactor it out as an independent function. Link: http://lkml.kernel.org/r/20190114153129.4852-1-vpillai@digitalocean.com Signed-off-by: Vineeth Remanan Pillai <vpillai@digitalocean.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill Tkhai | a9e7c39fa9 |
mm/vmscan.c: remove 7th argument of isolate_lru_pages()
We may simply check for sc->may_unmap in isolate_lru_pages() instead of doing that in both of its callers. Link: http://lkml.kernel.org/r/154748280735.29962.15867846875217618569.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | 2e25644e8d |
mm, mempolicy: fix uninit memory access
Syzbot with KMSAN reports (excerpt): ================================================================== BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline] BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384 CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x173/0x1d0 lib/dump_stack.c:113 kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613 __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295 mpol_rebind_policy mm/mempolicy.c:353 [inline] mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384 update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120 update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline] update_nodemask kernel/cgroup/cpuset.c:1253 [inline] cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728 ... Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline] kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158 kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176 kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777 mpol_new mm/mempolicy.c:276 [inline] do_mbind mm/mempolicy.c:1180 [inline] kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347 __do_sys_mbind mm/mempolicy.c:1354 [inline] As it's difficult to report where exactly the uninit value resides in the mempolicy object, we have to guess a bit. mm/mempolicy.c:353 contains this part of mpol_rebind_policy(): if (!mpol_store_user_nodemask(pol) && nodes_equal(pol->w.cpuset_mems_allowed, *newmask)) "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't ever see being uninitialized after leaving mpol_new(). So I'll guess it's actually about accessing pol->w.cpuset_mems_allowed on line 354, but still part of statement starting on line 353. For w.cpuset_mems_allowed to be not initialized, and the nodes_equal() reachable for a mempolicy where mpol_set_nodemask() is called in do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy with empty set of nodes, i.e. MPOL_LOCAL equivalent, with MPOL_F_LOCAL flag. Let's exclude such policies from the nodes_equal() check. Note the uninit access should be benign anyway, as rebinding this kind of policy is always a no-op. Therefore no actual need for stable inclusion. Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: zhong jiang <zhongjiang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tetsuo Handa | 7775face20 |
memcg: killed threads should not invoke memcg OOM killer
If a memory cgroup contains a single process with many threads (including different process group sharing the mm) then it is possible to trigger a race when the oom killer complains that there are no oom elible tasks and complain into the log which is both annoying and confusing because there is no actual problem. The race looks as follows: P1 oom_reaper P2 try_charge try_charge mem_cgroup_out_of_memory mutex_lock(oom_lock) out_of_memory oom_kill_process(P1,P2) wake_oom_reaper mutex_unlock(oom_lock) oom_reap_task mutex_lock(oom_lock) select_bad_process # no victim The problem is more visible with many threads. Fix this by checking for fatal_signal_pending from mem_cgroup_out_of_memory when the oom_lock is already held. The oom bypass is safe because we do the same early in the try_charge path already. The situation migh have changed in the mean time. It should be safe to check for fatal_signal_pending and tsk_is_oom_victim but for a better code readability abstract the current charge bypass condition into should_force_charge and reuse it from that path. " Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mike Rapoport | 23a7052a5d |
mm/page_alloc.c: check return value of memblock_alloc_node_nopanic()
There are two early memory allocations that use memblock_alloc_node_nopanic() and do not check its return value. While this happens very early during boot and chances that the allocation will fail are diminishing, it is still worth to have proper checks for the allocation errors. Link: http://lkml.kernel.org/r/1547734941-944-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Aneesh Kumar K.V | 023bdd0023 |
mm/hugetlb: add prot_modify_start/commit sequence for hugetlb update
Architectures like ppc64 require to do a conditional tlb flush based on the old and new value of pte. Follow the regular pte change protection sequence for hugetlb too. This allows the architectures to override the update sequence. Link: http://lkml.kernel.org/r/20190116085035.29729-5-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Aneesh Kumar K.V | 04a8645304 |
mm: update ptep_modify_prot_commit to take old pte value as arg
Architectures like ppc64 require to do a conditional tlb flush based on the old and new value of pte. Enable that by passing old pte value as the arg. Link: http://lkml.kernel.org/r/20190116085035.29729-3-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Aneesh Kumar K.V | 0cbe3e26ab |
mm: update ptep_modify_prot_start/commit to take vm_area_struct as arg
Patch series "NestMMU pte upgrade workaround for mprotect", v5.
We can upgrade pte access (R -> RW transition) via mprotect. We need to
make sure we follow the recommended pte update sequence as outlined in
commit
|
|
Wei Yang | 8bb4e7a2ee |
mm: fix some typos in mm directory
No functional change. Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Changbin Du | 8aa49762db |
mm/page_owner: move config option to mm/Kconfig.debug
Move the PAGE_OWNER option from submenu "Compile-time checks and compiler options" to dedicated submenu "Memory Debugging". Link: http://lkml.kernel.org/r/20190120024254.6270-1-changbin.du@gmail.com Signed-off-by: Changbin Du <changbin.du@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Fan | 43cca0b1c5 |
mm/mmap.c: remove some redundancy in arch_get_unmapped_area_topdown()
The variable 'addr' is redundant in arch_get_unmapped_area_topdown(), just use parameter 'addr0' directly. Then remove the const qualifier of the parameter, and change its name to 'addr'. And in according with other functions, remove the const qualifier of all other no-pointer parameters in function arch_get_unmapped_area_topdown(). Link: http://lkml.kernel.org/r/20190127041112.25599-1-nullptr.cpp@gmail.com Signed-off-by: Yang Fan <nullptr.cpp@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: William Kucharski <william.kucharski@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Shakeel Butt | bbbe480297 |
mm, oom: remove 'prefer children over parent' heuristic
Since the start of the git history of Linux, the kernel after selecting the worst process to be oom-killed, prefer to kill its child (if the child does not share mm with the parent). Later it was changed to prefer to kill a child who is worst. If the parent is still the worst then the parent will be killed. This heuristic assumes that the children did less work than their parent and by killing one of them, the work lost will be less. However this is very workload dependent. If there is a workload which can benefit from this heuristic, can use oom_score_adj to prefer children to be killed before the parent. The select_bad_process() has already selected the worst process in the system/memcg. There is no need to recheck the badness of its children and hoping to find a worse candidate. That's a lot of unneeded racy work. Also the heuristic is dangerous because it make fork bomb like workloads to recover much later because we constantly pick and kill processes which are not memory hogs. So, let's remove this whole heuristic. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: David Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Greg Kroah-Hartman | d9f7979c92 |
mm: no need to check return value of debugfs_create functions
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Laura Abbott <labbott@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Matthew Wilcox | 0ee930e6ca |
mm/memory.c: prevent mapping typed pages to userspace
Pages which use page_type must never be mapped to userspace as it would destroy their page type. Add an explicit check for this instead of assuming that kernel drivers always get this right. Link: http://lkml.kernel.org/r/20190129053830.3749-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Matthew Wilcox | 2d432cb709 |
mm: prevent mapping slab pages to userspace
It's never appropriate to map a page allocated by SLAB into userspace. A buggy device driver might try this, or an attacker might be able to find a way to make it happen. Christoph said: : Let's just fail the code. Currently this may work with SLUB. But SLAB : and SLOB overlay fields with mapcount. So you would have a corrupted page : struct if you mapped a slab page to user space. Link: http://lkml.kernel.org/r/20190125173827.2658-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Kees Cook <keescook@chromium.org> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Uladzislau Rezki (Sony) | afd07389d3 |
mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512!
One of the vmalloc stress test case triggers the kernel BUG(): <snip> [60.562151] ------------[ cut here ]------------ [60.562154] kernel BUG at mm/vmalloc.c:512! [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161 [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390 <snip> it can happen due to big align request resulting in overflowing of calculated address, i.e. it becomes 0 after ALIGN()'s fixup. Fix it by checking if calculated address is within vstart/vend range. Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Joel Fernandes <joelaf@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Garnier <thgarnie@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Chris Down | 677dc9731b |
mm, memcg: extract memcg maxable seq_file logic to seq_show_memcg_tunable
memcg has a significant number of files exposed to kernfs where their value is either exposed directly or is "max" in the case of PAGE_COUNTER_MAX. This patch makes this generic by providing a single function to do this work. In combination with the previous patch adding mem_cgroup_from_seq, this makes all of the seq_show feeder functions significantly more simple. Link: http://lkml.kernel.org/r/20190124194100.GA31425@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Chris Down | aa9694bb78 |
mm, memcg: create mem_cgroup_from_seq
This is the start of a series of patches similar to my earlier DEFINE_MEMCG_MAX_OR_VAL work, but with less Macro Magic(tm). There are a bunch of places we go from seq_file to mem_cgroup, which currently requires manually getting the css, then getting the mem_cgroup from the css. It's in enough places now that having mem_cgroup_from_seq makes sense (and also makes the next patch a bit nicer). Link: http://lkml.kernel.org/r/20190124194050.GA31341@chrisdown.name Signed-off-by: Chris Down <chris@chrisdown.name> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 5e1f0f098b |
mm, compaction: capture a page under direct compaction
Compaction is inherently race-prone as a suitable page freed during compaction can be allocated by any parallel task. This patch uses a capture_control structure to isolate a page immediately when it is freed by a direct compactor in the slow path of the page allocator. The intent is to avoid redundant scanning. 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%) Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%) Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%) Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%) Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%) Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%* Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%) Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%) Latency is only moderately affected but the devil is in the details. A closer examination indicates that base page fault latency is reduced but latency of huge pages is increased as it takes creater care to succeed. Part of the "problem" is that allocation success rates are close to 100% even when under pressure and compaction gets harder 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%) Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%) Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%) Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%) Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%) Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%) Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%) Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%) And scan rates are reduced as expected by 6% for the migration scanner and 29% for the free scanner indicating that there is less redundant work. Compaction migrate scanned 20815362 19573286 Compaction free scanned 16352612 11510663 [mgorman@techsingularity.net: remove redundant check] Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | e332f741a8 |
mm, compaction: be selective about what pageblocks to clear skip hints
Pageblock hints are cleared when compaction restarts or kswapd makes enough progress that it can sleep but it's over-eager in that the bit is cleared for migration sources with no LRU pages and migration targets with no free pages. As pageblock skip hint flushes are relatively rare and out-of-band with respect to kswapd, this patch makes a few more expensive checks to see if it's appropriate to even clear the bit. Every pageblock that is not cleared will avoid 512 pages being scanned unnecessarily on x86-64. The impact is variable with different workloads showing small differences in latency, success rates and scan rates. This is expected as clearing the hints is not that common but doing a small amount of work out-of-band to avoid a large amount of work in-band later is generally a good thing. Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> [cai@lca.pw: no stuck in __reset_isolation_pfn()] Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 4fca9730c5 |
mm, compaction: sample pageblocks for free pages
Once fast searching finishes, there is a possibility that the linear scanner is scanning full blocks found by the fast scanner earlier. This patch uses an adaptive stride to sample pageblocks for free pages. The more consecutive full pageblocks encountered, the larger the stride until a pageblock with free pages is found. The scanners might meet slightly sooner but it is an acceptable risk given that the search of the free lists may still encounter the pages and adjust the cached PFN of the free scanner accordingly. 5.0.0-rc1 5.0.0-rc1 roundrobin-v3r17 samplefree-v3r17 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2752.37 ( 0.00%) 2729.95 ( 0.81%) Amean fault-both-5 4341.69 ( 0.00%) 4397.80 ( -1.29%) Amean fault-both-7 6308.75 ( 0.00%) 6097.61 ( 3.35%) Amean fault-both-12 10241.81 ( 0.00%) 9407.15 ( 8.15%) Amean fault-both-18 13736.09 ( 0.00%) 10857.63 * 20.96%* Amean fault-both-24 16853.95 ( 0.00%) 13323.24 * 20.95%* Amean fault-both-30 15862.61 ( 0.00%) 17345.44 ( -9.35%) Amean fault-both-32 18450.85 ( 0.00%) 16892.00 ( 8.45%) The latency is mildly improved offseting some overhead from earlier patches that are prerequisites for the rest of the series. However, a major impact is on the free scan rate with an 82% reduction. 5.0.0-rc1 5.0.0-rc1 roundrobin-v3r17 samplefree-v3r17 Compaction migrate scanned 21607271 20116887 Compaction free scanned 95336406 16668703 It's also the first time in the series where the number of pages scanned by the migration scanner is greater than the free scanner due to the increased search efficiency. Link: http://lkml.kernel.org/r/20190118175136.31341-21-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | dbe2d4e4f1 |
mm, compaction: round-robin the order while searching the free lists for a target
As compaction proceeds and creates high-order blocks, the free list search gets less efficient as the larger blocks are used as compaction targets. Eventually, the larger blocks will be behind the migration scanner for partially migrated pageblocks and the search fails. This patch round-robins what orders are searched so that larger blocks can be ignored and find smaller blocks that can be used as migration targets. The overall impact was small on 1-socket but it avoids corner cases where the migration/free scanners meet prematurely or situations where many of the pageblocks encountered by the free scanner are almost full instead of being properly packed. Previous testing had indicated that without this patch there were occasional large spikes in the free scanner without this patch. [dan.carpenter@oracle.com: fix static checker warning] Link: http://lkml.kernel.org/r/20190118175136.31341-20-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | d097a6f635 |
mm, compaction: reduce premature advancement of the migration target scanner
The fast isolation of free pages allows the cached PFN of the free scanner to advance faster than necessary depending on the contents of the free list. The key is that fast_isolate_freepages() can update zone->compact_cached_free_pfn via isolate_freepages_block(). When the fast search fails, the linear scan can start from a point that has skipped valid migration targets, particularly pageblocks with just low-order free pages. This can cause the migration source/target scanners to meet prematurely causing a reset. This patch starts by avoiding an update of the pageblock skip information and cached PFN from isolate_freepages_block() and puts the responsibility of updating that information in the callers. The fast scanner will update the cached PFN if and only if it finds a block that is higher than the existing cached PFN and sets the skip if the pageblock is full or nearly full. The linear scanner will update skipped information and the cached PFN only when a block is completely scanned. The total impact is that the free scanner advances more slowly as it is primarily driven by the linear scanner instead of the fast search. 5.0.0-rc1 5.0.0-rc1 noresched-v3r17 slowfree-v3r17 Amean fault-both-3 2965.68 ( 0.00%) 3036.75 ( -2.40%) Amean fault-both-5 3995.90 ( 0.00%) 4522.24 * -13.17%* Amean fault-both-7 5842.12 ( 0.00%) 6365.35 ( -8.96%) Amean fault-both-12 9550.87 ( 0.00%) 10340.93 ( -8.27%) Amean fault-both-18 13304.72 ( 0.00%) 14732.46 ( -10.73%) Amean fault-both-24 14618.59 ( 0.00%) 16288.96 ( -11.43%) Amean fault-both-30 16650.96 ( 0.00%) 16346.21 ( 1.83%) Amean fault-both-32 17145.15 ( 0.00%) 19317.49 ( -12.67%) The impact to latency is higher than the last version but it appears to be due to a slight increase in the free scan rates which is a potential side-effect of the patch. However, this is necessary for later patches that are more careful about how pageblocks are treated as earlier iterations of those patches hit corner cases where the restarts were punishing and very visible. Link: http://lkml.kernel.org/r/20190118175136.31341-19-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | cf66f0700c |
mm, compaction: do not consider a need to reschedule as contention
Scanning on large machines can take a considerable length of time and eventually need to be rescheduled. This is treated as an abort event but that's not appropriate as the attempt is likely to be retried after making numerous checks and taking another cycle through the page allocator. This patch will check the need to reschedule if necessary but continue the scanning. The main benefit is reduced scanning when compaction is taking a long time or the machine is over-saturated. It also avoids an unnecessary exit of compaction that ends up being retried by the page allocator in the outer loop. 5.0.0-rc1 5.0.0-rc1 synccached-v3r16 noresched-v3r17 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2958.27 ( 0.00%) 2965.68 ( -0.25%) Amean fault-both-5 4091.90 ( 0.00%) 3995.90 ( 2.35%) Amean fault-both-7 5803.05 ( 0.00%) 5842.12 ( -0.67%) Amean fault-both-12 9481.06 ( 0.00%) 9550.87 ( -0.74%) Amean fault-both-18 14141.51 ( 0.00%) 13304.72 ( 5.92%) Amean fault-both-24 16438.00 ( 0.00%) 14618.59 ( 11.07%) Amean fault-both-30 17531.72 ( 0.00%) 16650.96 ( 5.02%) Amean fault-both-32 17101.96 ( 0.00%) 17145.15 ( -0.25%) Link: http://lkml.kernel.org/r/20190118175136.31341-18-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | cb810ad294 |
mm, compaction: rework compact_should_abort as compact_check_resched
With incremental changes, compact_should_abort no longer makes any documented sense. Rename to compact_check_resched and update the associated comments. There is no benefit other than reducing redundant code and making the intent slightly clearer. It could potentially be merged with earlier patches but it just makes the review slightly harder. Link: http://lkml.kernel.org/r/20190118175136.31341-17-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 8854c55f54 |
mm, compaction: keep cached migration PFNs synced for unusable pageblocks
Migrate has separate cached PFNs for ASYNC and SYNC* migration on the basis that some migrations will fail in ASYNC mode. However, if the cached PFNs match at the start of scanning and pageblocks are skipped due to having no isolation candidates, then the sync state does not matter. This patch keeps matching cached PFNs in sync until a pageblock with isolation candidates is found. The actual benefit is marginal given that the sync scanner following the async scanner will often skip a number of pageblocks but it's useless work. Any benefit depends heavily on whether the scanners restarted recently. Link: http://lkml.kernel.org/r/20190118175136.31341-16-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 9bebefd590 |
mm, compaction: check early for huge pages encountered by the migration scanner
When scanning for sources or targets, PageCompound is checked for huge pages as they can be skipped quickly but it happens relatively late after a lot of setup and checking. This patch short-cuts the check to make it earlier. It might still change when the lock is acquired but this has less overhead overall. The free scanner advances but the migration scanner does not. Typically the free scanner encounters more movable blocks that change state over the lifetime of the system and also tends to scan more aggressively as it's actively filling its portion of the physical address space with data. This could change in the future but for the moment, this worked better in practice and incurred fewer scan restarts. The impact on latency and allocation success rates is marginal but the free scan rates are reduced by 15% and system CPU usage is reduced by 3.3%. The 2-socket results are not materially different. Link: http://lkml.kernel.org/r/20190118175136.31341-15-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | cb2dcaf023 |
mm, compaction: finish pageblock scanning on contention
Async migration aborts on spinlock contention but contention can be high when there are multiple compaction attempts and kswapd is active. The consequence is that the migration scanners move forward uselessly while still contending on locks for longer while leaving suitable migration sources behind. This patch will acquire the lock but track when contention occurs. When it does, the current pageblock will finish as compaction may succeed for that block and then abort. This will have a variable impact on latency as in some cases useless scanning is avoided (reduces latency) but a lock will be contended (increase latency) or a single contended pageblock is scanned that would otherwise have been skipped (increase latency). 5.0.0-rc1 5.0.0-rc1 norescan-v3r16 finishcontend-v3r16 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 3002.07 ( 0.00%) 3153.17 ( -5.03%) Amean fault-both-5 4684.47 ( 0.00%) 4280.52 ( 8.62%) Amean fault-both-7 6815.54 ( 0.00%) 5811.50 * 14.73%* Amean fault-both-12 10864.02 ( 0.00%) 9276.85 ( 14.61%) Amean fault-both-18 12247.52 ( 0.00%) 11032.67 ( 9.92%) Amean fault-both-24 15683.99 ( 0.00%) 14285.70 ( 8.92%) Amean fault-both-30 18620.02 ( 0.00%) 16293.76 * 12.49%* Amean fault-both-32 19250.28 ( 0.00%) 16721.02 * 13.14%* 5.0.0-rc1 5.0.0-rc1 norescan-v3r16 finishcontend-v3r16 Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%) Percentage huge-3 95.00 ( 0.00%) 96.82 ( 1.92%) Percentage huge-5 94.22 ( 0.00%) 95.40 ( 1.26%) Percentage huge-7 92.35 ( 0.00%) 95.92 ( 3.86%) Percentage huge-12 91.90 ( 0.00%) 96.73 ( 5.25%) Percentage huge-18 89.58 ( 0.00%) 96.77 ( 8.03%) Percentage huge-24 90.03 ( 0.00%) 96.05 ( 6.69%) Percentage huge-30 89.14 ( 0.00%) 96.81 ( 8.60%) Percentage huge-32 90.58 ( 0.00%) 97.41 ( 7.54%) There is a variable impact that is mostly good on latency while allocation success rates are slightly higher. System CPU usage is reduced by about 10% but scan rate impact is mixed Compaction migrate scanned 27997659.00 20148867 Compaction free scanned 120782791.00 118324914 Migration scan rates are reduced 28% which is expected as a pageblock is used by the async scanner instead of skipped. The impact on the free scanner is known to be variable. Overall the primary justification for this patch is that completing scanning of a pageblock is very important for later patches. [yuehaibing@huawei.com: fix unused variable warning] Link: http://lkml.kernel.org/r/20190118175136.31341-14-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: YueHaibing <yuehaibing@huawei.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 804d3121ba |
mm, compaction: avoid rescanning the same pageblock multiple times
Pageblocks are marked for skip when no pages are isolated after a scan. However, it's possible to hit corner cases where the migration scanner gets stuck near the boundary between the source and target scanner. Due to pages being migrated in blocks of COMPACT_CLUSTER_MAX, pages that are migrated can be reallocated before the pageblock is complete. The pageblock is not necessarily skipped so it can be rescanned multiple times. Similarly, a pageblock with some dirty/writeback pages may fail to migrate and be rescanned until writeback completes which is wasteful. This patch tracks if a pageblock is being rescanned. If so, then the entire pageblock will be migrated as one operation. This narrows the race window during which pages can be reallocated during migration. Secondly, if there are pages that cannot be isolated then the pageblock will still be fully scanned and marked for skipping. On the second rescan, the pageblock skip is set and the migration scanner makes progress. 5.0.0-rc1 5.0.0-rc1 findfree-v3r16 norescan-v3r16 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 3200.68 ( 0.00%) 3002.07 ( 6.21%) Amean fault-both-5 4847.75 ( 0.00%) 4684.47 ( 3.37%) Amean fault-both-7 6658.92 ( 0.00%) 6815.54 ( -2.35%) Amean fault-both-12 11077.62 ( 0.00%) 10864.02 ( 1.93%) Amean fault-both-18 12403.97 ( 0.00%) 12247.52 ( 1.26%) Amean fault-both-24 15607.10 ( 0.00%) 15683.99 ( -0.49%) Amean fault-both-30 18752.27 ( 0.00%) 18620.02 ( 0.71%) Amean fault-both-32 21207.54 ( 0.00%) 19250.28 * 9.23%* 5.0.0-rc1 5.0.0-rc1 findfree-v3r16 norescan-v3r16 Percentage huge-3 96.86 ( 0.00%) 95.00 ( -1.91%) Percentage huge-5 93.72 ( 0.00%) 94.22 ( 0.53%) Percentage huge-7 94.31 ( 0.00%) 92.35 ( -2.08%) Percentage huge-12 92.66 ( 0.00%) 91.90 ( -0.82%) Percentage huge-18 91.51 ( 0.00%) 89.58 ( -2.11%) Percentage huge-24 90.50 ( 0.00%) 90.03 ( -0.52%) Percentage huge-30 91.57 ( 0.00%) 89.14 ( -2.65%) Percentage huge-32 91.00 ( 0.00%) 90.58 ( -0.46%) Negligible difference but this was likely a case when the specific corner case was not hit. A previous run of the same patch based on an earlier iteration of the series showed large differences where migration rates could be halved when the corner case was hit. The specific corner case where migration scan rates go through the roof was due to a dirty/writeback pageblock located at the boundary of the migration/free scanner did not happen in this case. When it does happen, the scan rates multipled by massive margins. Link: http://lkml.kernel.org/r/20190118175136.31341-13-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 5a811889de |
mm, compaction: use free lists to quickly locate a migration target
Similar to the migration scanner, this patch uses the free lists to quickly locate a migration target. The search is different in that lower orders will be searched for a suitable high PFN if necessary but the search is still bound. This is justified on the grounds that the free scanner typically scans linearly much more than the migration scanner. If a free page is found, it is isolated and compaction continues if enough pages were isolated. For SYNC* scanning, the full pageblock is scanned for any remaining free pages so that is can be marked for skipping in the near future. 1-socket thpfioscale 5.0.0-rc1 5.0.0-rc1 isolmig-v3r15 findfree-v3r16 Amean fault-both-3 3024.41 ( 0.00%) 3200.68 ( -5.83%) Amean fault-both-5 4749.30 ( 0.00%) 4847.75 ( -2.07%) Amean fault-both-7 6454.95 ( 0.00%) 6658.92 ( -3.16%) Amean fault-both-12 10324.83 ( 0.00%) 11077.62 ( -7.29%) Amean fault-both-18 12896.82 ( 0.00%) 12403.97 ( 3.82%) Amean fault-both-24 13470.60 ( 0.00%) 15607.10 * -15.86%* Amean fault-both-30 17143.99 ( 0.00%) 18752.27 ( -9.38%) Amean fault-both-32 17743.91 ( 0.00%) 21207.54 * -19.52%* The impact on latency is variable but the search is optimistic and sensitive to the exact system state. Success rates are similar but the major impact is to the rate of scanning 5.0.0-rc1 5.0.0-rc1 isolmig-v3r15 findfree-v3r16 Compaction migrate scanned 25646769 29507205 Compaction free scanned 201558184 100359571 The free scan rates are reduced by 50%. The 2-socket reductions for the free scanner are more dramatic which is a likely reflection that the machine has more memory. [dan.carpenter@oracle.com: fix static checker warning] [vbabka@suse.cz: correct number of pages scanned for lower orders] Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | e380bebe47 |
mm, compaction: keep migration source private to a single compaction instance
Due to either a fast search of the free list or a linear scan, it is possible for multiple compaction instances to pick the same pageblock for migration. This is lucky for one scanner and increased scanning for all the others. It also allows a race between requests on which first allocates the resulting free block. This patch tests and updates the pageblock skip for the migration scanner carefully. When isolating a block, it will check and skip if the block is already in use. Once the zone lock is acquired, it will be rechecked so that only one scanner can set the pageblock skip for exclusive use. Any scanner contending will continue with a linear scan. The skip bit is still set if no pages can be isolated in a range. While this may result in redundant scanning, it avoids unnecessarily acquiring the zone lock when there are no suitable migration sources. 1-socket thpscale Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 3390.40 ( 0.00%) 3024.41 ( 10.80%) Amean fault-both-5 5082.28 ( 0.00%) 4749.30 ( 6.55%) Amean fault-both-7 7012.51 ( 0.00%) 6454.95 ( 7.95%) Amean fault-both-12 11346.63 ( 0.00%) 10324.83 ( 9.01%) Amean fault-both-18 15324.19 ( 0.00%) 12896.82 * 15.84%* Amean fault-both-24 16088.50 ( 0.00%) 13470.60 * 16.27%* Amean fault-both-30 18723.42 ( 0.00%) 17143.99 ( 8.44%) Amean fault-both-32 18612.01 ( 0.00%) 17743.91 ( 4.66%) 5.0.0-rc1 5.0.0-rc1 findmig-v3r15 isolmig-v3r15 Percentage huge-3 89.83 ( 0.00%) 92.96 ( 3.48%) Percentage huge-5 91.96 ( 0.00%) 93.26 ( 1.41%) Percentage huge-7 92.85 ( 0.00%) 93.63 ( 0.84%) Percentage huge-12 92.74 ( 0.00%) 92.80 ( 0.07%) Percentage huge-18 91.71 ( 0.00%) 91.62 ( -0.10%) Percentage huge-24 92.13 ( 0.00%) 91.50 ( -0.69%) Percentage huge-30 93.79 ( 0.00%) 92.73 ( -1.13%) Percentage huge-32 91.27 ( 0.00%) 91.94 ( 0.74%) This shows a reasonable reduction in latency as multiple compaction scanners do not operate on the same blocks with a similar allocation success rate. Compaction migrate scanned 41093126 25646769 Migration scan rates are reduced by 38%. Link: http://lkml.kernel.org/r/20190118175136.31341-11-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 70b44595ea |
mm, compaction: use free lists to quickly locate a migration source
The migration scanner is a linear scan of a zone with a potentiall large search space. Furthermore, many pageblocks are unusable such as those filled with reserved pages or partially filled with pages that cannot migrate. These still get scanned in the common case of allocating a THP and the cost accumulates. The patch uses a partial search of the free lists to locate a migration source candidate that is marked as MOVABLE when allocating a THP. It prefers picking a block with a larger number of free pages already on the basis that there are fewer pages to migrate to free the entire block. The lowest PFN found during searches is tracked as the basis of the start for the linear search after the first search of the free list fails. After the search, the free list is shuffled so that the next search will not encounter the same page. If the search fails then the subsequent searches will be shorter and the linear scanner is used. If this search fails, or if the request is for a small or unmovable/reclaimable allocation then the linear scanner is still used. It is somewhat pointless to use the list search in those cases. Small free pages must be used for the search and there is no guarantee that movable pages are located within that block that are contiguous. 5.0.0-rc1 5.0.0-rc1 noboost-v3r10 findmig-v3r15 Amean fault-both-3 3771.41 ( 0.00%) 3390.40 ( 10.10%) Amean fault-both-5 5409.05 ( 0.00%) 5082.28 ( 6.04%) Amean fault-both-7 7040.74 ( 0.00%) 7012.51 ( 0.40%) Amean fault-both-12 11887.35 ( 0.00%) 11346.63 ( 4.55%) Amean fault-both-18 16718.19 ( 0.00%) 15324.19 ( 8.34%) Amean fault-both-24 21157.19 ( 0.00%) 16088.50 * 23.96%* Amean fault-both-30 21175.92 ( 0.00%) 18723.42 * 11.58%* Amean fault-both-32 21339.03 ( 0.00%) 18612.01 * 12.78%* 5.0.0-rc1 5.0.0-rc1 noboost-v3r10 findmig-v3r15 Percentage huge-3 86.50 ( 0.00%) 89.83 ( 3.85%) Percentage huge-5 92.52 ( 0.00%) 91.96 ( -0.61%) Percentage huge-7 92.44 ( 0.00%) 92.85 ( 0.44%) Percentage huge-12 92.98 ( 0.00%) 92.74 ( -0.25%) Percentage huge-18 91.70 ( 0.00%) 91.71 ( 0.02%) Percentage huge-24 91.59 ( 0.00%) 92.13 ( 0.60%) Percentage huge-30 90.14 ( 0.00%) 93.79 ( 4.04%) Percentage huge-32 90.03 ( 0.00%) 91.27 ( 1.37%) This shows an improvement in allocation latencies with similar allocation success rates. While not presented, there was a 31% reduction in migration scanning and a 8% reduction on system CPU usage. A 2-socket machine showed similar benefits. [mgorman@techsingularity.net: several fixes] Link: http://lkml.kernel.org/r/20190204120111.GL9565@techsingularity.net [vbabka@suse.cz: migrate block that was found-fast, some optimisations] Link: http://lkml.kernel.org/r/20190118175136.31341-10-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <Vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | fd1444b272 |
mm, compaction: ignore the fragmentation avoidance boost for isolation and compaction
When pageblocks get fragmented, watermarks are artifically boosted to reclaim pages to avoid further fragmentation events. However, compaction is often either fragmentation-neutral or moving movable pages away from unmovable/reclaimable pages. As the true watermarks are preserved, allow compaction to ignore the boost factor. The expected impact is very slight as the main benefit is that compaction is slightly more likely to succeed when the system has been fragmented very recently. On both 1-socket and 2-socket machines for THP-intensive allocation during fragmentation the success rate was increased by less than 1% which is marginal. However, detailed tracing indicated that failure of migration due to a premature ENOMEM triggered by watermark checks were eliminated. Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | efe771c760 |
mm, compaction: always finish scanning of a full pageblock
When compaction is finishing, it uses a flag to ensure the pageblock is complete but it makes sense to always complete migration of a pageblock. Minimally, skip information is based on a pageblock and partially scanned pageblocks may incur more scanning in the future. The pageblock skip handling also becomes more strict later in the series and the hint is more useful if a complete pageblock was always scanned. The potentially impacts latency as more scanning is done but it's not a consistent win or loss as the scanning is not always a high percentage of the pageblock and sometimes it is offset by future reductions in scanning. Hence, the results are not presented this time due to a misleading mix of gains/losses without any clear pattern. However, full scanning of the pageblock is important for later patches. Link: http://lkml.kernel.org/r/20190118175136.31341-8-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 806031bb5e |
mm, migrate: immediately fail migration of a page with no migration handler
Pages with no migration handler use a fallback handler which sometimes works and sometimes persistently retries. A historical example was blockdev pages but there are others such as odd refcounting when page->private is used. These are retried multiple times which is wasteful during compaction so this patch will fail migration faster unless the caller specifies MIGRATE_SYNC. This is not expected to help THP allocation success rates but it did reduce latencies very slightly in some cases. 1-socket thpfioscale 4.20.0 4.20.0 noreserved-v2r15 failfast-v2r15 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 3839.67 ( 0.00%) 3833.72 ( 0.15%) Amean fault-both-5 5177.47 ( 0.00%) 4967.15 ( 4.06%) Amean fault-both-7 7245.03 ( 0.00%) 7139.19 ( 1.46%) Amean fault-both-12 11534.89 ( 0.00%) 11326.30 ( 1.81%) Amean fault-both-18 16241.10 ( 0.00%) 16270.70 ( -0.18%) Amean fault-both-24 19075.91 ( 0.00%) 19839.65 ( -4.00%) Amean fault-both-30 22712.11 ( 0.00%) 21707.05 ( 4.43%) Amean fault-both-32 21692.92 ( 0.00%) 21968.16 ( -1.27%) The 2-socket results are not materially different. Scan rates are similar as expected. Link: http://lkml.kernel.org/r/20190118175136.31341-7-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 4469ab9847 |
mm, compaction: rename map_pages to split_map_pages
It's non-obvious that high-order free pages are split into order-0 pages from the function name. Fix it. Link: http://lkml.kernel.org/r/20190118175136.31341-6-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 40cacbcb32 |
mm, compaction: remove unnecessary zone parameter in some instances
A zone parameter is passed into a number of top-level compaction functions despite the fact that it's already in compact_control. This is harmless but it did need an audit to check if zone actually ever changes meaningfully. This patches removes the parameter in a number of top-level functions. The change could be much deeper but this was enough to briefly clarify the flow. No functional change. Link: http://lkml.kernel.org/r/20190118175136.31341-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | 566e54e113 |
mm, compaction: remove last_migrated_pfn from compact_control
The last_migrated_pfn field is a bit dubious as to whether it really helps but either way, the information from it can be inferred without increasing the size of compact_control so remove the field. Link: http://lkml.kernel.org/r/20190118175136.31341-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | c5943b9c53 |
mm, compaction: rearrange compact_control
compact_control spans two cache lines with write-intensive lines on both. Rearrange so the most write-intensive fields are in the same cache line. This has a negligible impact on the overall performance of compaction and is more a tidying exercise than anything. Link: http://lkml.kernel.org/r/20190118175136.31341-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mel Gorman | c5fbd937b6 |
mm, compaction: shrink compact_control
Patch series "Increase success rates and reduce latency of compaction", v3. This series reduces scan rates and success rates of compaction, primarily by using the free lists to shorten scans, better controlling of skip information and whether multiple scanners can target the same block and capturing pageblocks before being stolen by parallel requests. The series is based on mmotm from January 9th, 2019 with the previous compaction series reverted. I'm mostly using thpscale to measure the impact of the series. The benchmark creates a large file, maps it, faults it, punches holes in the mapping so that the virtual address space is fragmented and then tries to allocate THP. It re-executes for different numbers of threads. From a fragmentation perspective, the workload is relatively benign but it does stress compaction. The overall impact on latencies for a 1-socket machine is baseline patches Amean fault-both-3 3832.09 ( 0.00%) 2748.56 * 28.28%* Amean fault-both-5 4933.06 ( 0.00%) 4255.52 ( 13.73%) Amean fault-both-7 7017.75 ( 0.00%) 6586.93 ( 6.14%) Amean fault-both-12 11610.51 ( 0.00%) 9162.34 * 21.09%* Amean fault-both-18 17055.85 ( 0.00%) 11530.06 * 32.40%* Amean fault-both-24 19306.27 ( 0.00%) 17956.13 ( 6.99%) Amean fault-both-30 22516.49 ( 0.00%) 15686.47 * 30.33%* Amean fault-both-32 23442.93 ( 0.00%) 16564.83 * 29.34%* The allocation success rates are much improved baseline patches Percentage huge-3 85.99 ( 0.00%) 97.96 ( 13.92%) Percentage huge-5 88.27 ( 0.00%) 96.87 ( 9.74%) Percentage huge-7 85.87 ( 0.00%) 94.53 ( 10.09%) Percentage huge-12 82.38 ( 0.00%) 98.44 ( 19.49%) Percentage huge-18 83.29 ( 0.00%) 99.14 ( 19.04%) Percentage huge-24 81.41 ( 0.00%) 97.35 ( 19.57%) Percentage huge-30 80.98 ( 0.00%) 98.05 ( 21.08%) Percentage huge-32 80.53 ( 0.00%) 97.06 ( 20.53%) That's a nearly perfect allocation success rate. The biggest impact is on the scan rates Compaction migrate scanned 55893379 19341254 Compaction free scanned 474739990 11903963 The number of pages scanned for migration was reduced by 65% and the free scanner was reduced by 97.5%. So much less work in exchange for lower latency and better success rates. The series was also evaluated using a workload that heavily fragments memory but the benefits there are also significant, albeit not presented. It was commented that we should be rethinking scanning entirely and to a large extent I agree. However, to achieve that you need a lot of this series in place first so it's best to make the linear scanners as best as possible before ripping them out. This patch (of 22): The isolate and migrate scanners should never isolate more than a pageblock of pages so unsigned int is sufficient saving 8 bytes on a 64-bit build. Link: http://lkml.kernel.org/r/20190118175136.31341-2-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
zhengbin | 35f12f0f5c |
mm/filemap: pass inclusive 'end_byte' parameter to filemap_range_has_page
The 'end_byte' parameter of filemap_range_has_page is required to be
inclusive, so follow the rule.
Link: http://lkml.kernel.org/r/1548678679-18122-1-git-send-email-zhengbin13@huawei.com
Fixes:
|
|
Yang Shi | e9f598730e |
mm: swap: add comment for swap_vma_readahead
swap_vma_readahead()'s comment is missing, just add it. Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Hugh Dickins <hughd@google.com Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | 8fd2e0b505 |
mm: swap: check if swap backing device is congested or not
Swap readahead would read in a few pages regardless if the underlying device is busy or not. It may incur long waiting time if the device is congested, and it may also exacerbate the congestion. Use inode_read_congested() to check if the underlying device is busy or not like what file page readahead does. Get inode from swap_info_struct. Although we can add inode information in swap_address_space (address_space->host), it may lead some unexpected side effect, i.e. it may break mapping_cap_account_dirty(). Using inode from swap_info_struct seems simple and good enough. Just does the check in vma_cluster_readahead() since swap_vma_readahead() is just used for non-rotational device which much less likely has congestion than traditional HDD. Although swap slots may be consecutive on swap partition, it still may be fragmented on swap file. This check would help to reduce excessive stall for such case. The test with page_fault1 of will-it-scale (sometimes tracing may just show runtest.py that is the wrapper script of page_fault1), which basically launches NR_CPU threads to generate 128MB anonymous pages for each thread, on my virtual machine with congested HDD shows long tail latency is reduced significantly. Without the patch page_fault1_thr-1490 [023] 129.311706: funcgraph_entry: #57377.796 us | do_swap_page(); page_fault1_thr-1490 [023] 129.369103: funcgraph_entry: 5.642us | do_swap_page(); page_fault1_thr-1490 [023] 129.369119: funcgraph_entry: #1289.592 us | do_swap_page(); page_fault1_thr-1490 [023] 129.370411: funcgraph_entry: 4.957us | do_swap_page(); page_fault1_thr-1490 [023] 129.370419: funcgraph_entry: 1.940us | do_swap_page(); page_fault1_thr-1490 [023] 129.378847: funcgraph_entry: #1411.385 us | do_swap_page(); page_fault1_thr-1490 [023] 129.380262: funcgraph_entry: 3.916us | do_swap_page(); page_fault1_thr-1490 [023] 129.380275: funcgraph_entry: #4287.751 us | do_swap_page(); With the patch runtest.py-1417 [020] 301.925911: funcgraph_entry: #9870.146 us | do_swap_page(); runtest.py-1417 [020] 301.935785: funcgraph_entry: 9.802us | do_swap_page(); runtest.py-1417 [020] 301.935799: funcgraph_entry: 3.551us | do_swap_page(); runtest.py-1417 [020] 301.935806: funcgraph_entry: 2.142us | do_swap_page(); runtest.py-1417 [020] 301.935853: funcgraph_entry: 6.938us | do_swap_page(); runtest.py-1417 [020] 301.935864: funcgraph_entry: 3.765us | do_swap_page(); runtest.py-1417 [020] 301.935871: funcgraph_entry: 3.600us | do_swap_page(); runtest.py-1417 [020] 301.935878: funcgraph_entry: 7.202us | do_swap_page(); [akpm@linux-foundation.org: code cleanup] [yang.shi@linux.alibaba.com: add comment] Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Tim Chen <tim.c.chen@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Hugh Dickins <hughd@google.com Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Matthew Wilcox | 14ef1fc72a |
mm/filemap.c: remove redundant test from find_get_pages_contig
After we establish a reference on the page, we check the pointer
continues to be in the correct position in i_pages. Checking
page->index afterwards is unnecessary; if it were to change, then the
pointer to it from the page cache would also move. The check used to be
done before grabbing a reference on the page which was racy (see commit
|
|
Gustavo A. R. Silva | 67b8046f42 |
mm/memcontrol.c: use struct_size() in kmalloc()
One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; void *entry[]; }; instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Link: http://lkml.kernel.org/r/20190104183726.GA6374@embeddedor Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | c52e75935f |
mm: remove extra drain pages on pcp list
In the current implementation, there are two places to isolate a range of page: __offline_pages() and alloc_contig_range(). During this procedure, it will drain pages on pcp list. Below is a brief call flow: __offline_pages()/alloc_contig_range() start_isolate_page_range() set_migratetype_isolate() drain_all_pages() drain_all_pages() <--- A This snippet shows the current logic is isolate and drain pcp list for each pageblock and drain pcp list again for the whole range. start_isolate_page_range is responsible for isolating the given pfn range. One part of that job is to make sure that also pages that are on the allocator pcp lists are properly isolated. Otherwise they could be reused and the range wouldn't be completely isolated until the memory is freed back. While there is no strict guarantee here because pages might get allocated at any time before drain_all_pages is called there doesn't seem to be any strong demand for such a guarantee. In any case, draining is already done at the isolation level and there is no need to do it again later by start_isolate_page_range callers (memory hotplug and CMA allocator currently). Therefore remove pointless draining in existing callers to make the code more clear and functionally correct. [mhocko@suse.com: provide a clearer changelog for the last two paragraphs] Link: http://lkml.kernel.org/r/20190105233141.2329-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Anshuman Khandual | 7ed2c31dab |
mm/hugetlb: distinguish between migratability and movability
Patch series "arm64/mm: Enable HugeTLB migration", v4. This patch series enables HugeTLB migration support for all supported huge page sizes at all levels including contiguous bit implementation. Following HugeTLB migration support matrix has been enabled with this patch series. All permutations have been tested except for the 16GB. CONT PTE PMD CONT PMD PUD -------- --- -------- --- 4K: 64K 2M 32M 1G 16K: 2M 32M 1G 64K: 2M 512M 16G First the series adds migration support for PUD based huge pages. It then adds a platform specific hook to query an architecture if a given huge page size is supported for migration while also providing a default fallback option preserving the existing semantics which just checks for (PMD|PUD|PGDIR)_SHIFT macros. The last two patches enables HugeTLB migration on arm64 and subscribe to this new platform specific hook by defining an override. The second patch differentiates between movability and migratability aspects of huge pages and implements hugepage_movable_supported() which can then be used during allocation to decide whether to place the huge page in movable zone or not. This patch (of 5): During huge page allocation it's migratability is checked to determine if it should be placed under movable zones with GFP_HIGHUSER_MOVABLE. But the movability aspect of the huge page could depend on other factors than just migratability. Movability in itself is a distinct property which should not be tied with migratability alone. This differentiates these two and implements an enhanced movability check which also considers huge page size to determine if it is feasible to be placed under a movable zone. At present it just checks for gigantic pages but going forward it can incorporate other enhanced checks. Link: http://lkml.kernel.org/r/1545121450-1663-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Steve Capper <steve.capper@arm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Matthew Wilcox | 6b7e5cad65 |
mm: remove sysctl_extfrag_handler()
sysctl_extfrag_handler() neglects to propagate the return value from proc_dointvec_minmax() to its caller. It's a wrapper that doesn't need to exist, so just use proc_dointvec_minmax() directly. Link: http://lkml.kernel.org/r/20190104032557.3056-1-willy@infradead.org Signed-off-by: Matthew Wilcox <willy@infradead.org> Reported-by: Aditya Pakki <pakki001@umn.edu> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Uladzislau Rezki (Sony) | 153178edc7 |
vmalloc: export __vmalloc_node_range for CONFIG_TEST_VMALLOC_MODULE
Export __vmaloc_node_range() function if CONFIG_TEST_VMALLOC_MODULE is enabled. Some test cases in vmalloc test suite module require and make use of that function. Please note, that it is not supposed to be used for other purposes. We need it only for performance analysis, stressing and stability check of vmalloc allocator. Link: http://lkml.kernel.org/r/20190103142108.20744-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Roman Penyaev | bc84c53525 |
mm/vmalloc: pass VM_USERMAP flags directly to __vmalloc_node_range()
vmalloc_user*() calls differ from normal vmalloc() only in that they set VM_USERMAP flags for the area. During the whole history of vmalloc.c changes now it is possible simply to pass VM_USERMAP flags directly to __vmalloc_node_range() call instead of finding the area (which obviously takes time) after the allocation. Link: http://lkml.kernel.org/r/20190103145954.16942-4-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Roman Penyaev | c67dc62475 |
mm/vmalloc: do not call kmemleak_free() on not yet accounted memory
__vmalloc_area_node() calls vfree() on error path, which in turn calls kmemleak_free(), but area is not yet accounted by kmemleak_vmalloc(). Link: http://lkml.kernel.org/r/20190103145954.16942-3-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Roman Penyaev | 401592d2e0 |
mm/vmalloc: fix size check for remap_vmalloc_range_partial()
When VM_NO_GUARD is not set area->size includes adjacent guard page, thus for correct size checking get_vm_area_size() should be used, but not area->size. This fixes possible kernel oops when userspace tries to mmap an area on 1 page bigger than was allocated by vmalloc_user() call: the size check inside remap_vmalloc_range_partial() accounts non-existing guard page also, so check successfully passes but vmalloc_to_page() returns NULL (guard page does not physically exist). The following code pattern example should trigger an oops: static int oops_mmap(struct file *file, struct vm_area_struct *vma) { void *mem; mem = vmalloc_user(4096); BUG_ON(!mem); /* Do not care about mem leak */ return remap_vmalloc_range(vma, mem, 0); } And userspace simply mmaps size + PAGE_SIZE: mmap(NULL, 8192, PROT_WRITE|PROT_READ, MAP_PRIVATE, fd, 0); Possible candidates for oops which do not have any explicit size checks: *** drivers/media/usb/stkwebcam/stk-webcam.c: v4l_stk_mmap[789] ret = remap_vmalloc_range(vma, sbuf->buffer, 0); Or the following one: *** drivers/video/fbdev/core/fbmem.c static int fb_mmap(struct file *file, struct vm_area_struct * vma) ... res = fb->fb_mmap(info, vma); Where fb_mmap callback calls remap_vmalloc_range() directly without any explicit checks: *** drivers/video/fbdev/vfb.c static int vfb_mmap(struct fb_info *info, struct vm_area_struct *vma) { return remap_vmalloc_range(vma, (void *)info->fix.smem_start, vma->vm_pgoff); } Link: http://lkml.kernel.org/r/20190103145954.16942-2-rpenyaev@suse.de Signed-off-by: Roman Penyaev <rpenyaev@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Joe Perches <joe@perches.com> Cc: "Luis R. Rodriguez" <mcgrof@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Roman Penyaev | 5a82ac715d |
mm/vmalloc.c: make vmalloc_32_user() align base kernel virtual address to SHMLBA
This patch repeats the original one from David S Miller:
|
|
Shakeel Butt | 60cd4bcd62 |
memcg: localize memcg_kmem_enabled() check
Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge functions, so, the users don't have to explicitly check that condition. This is purely code cleanup patch without any functional change. Only the order of checks in memcg_charge_slab() can potentially be changed but the functionally it will be same. This should not matter as memcg_charge_slab() is not in the hot path. Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | 9234bae9b2 |
mm, slub: make the comment of put_cpu_partial() complete
There are two cases when put_cpu_partial() is invoked. * __slab_free * get_partial_node This patch just makes it cover these two cases. Link: http://lkml.kernel.org/r/20181025094437.18951-3-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kirill Tkhai | 52d1e606ee |
mm: reuse only-pte-mapped KSM page in do_wp_page()
Add an optimization for KSM pages almost in the same way that we have for ordinary anonymous pages. If there is a write fault in a page, which is mapped to an only pte, and it is not related to swap cache; the page may be reused without copying its content. [ Note that we do not consider PageSwapCache() pages at least for now, since we don't want to complicate __get_ksm_page(), which has nice optimization based on this (for the migration case). Currenly it is spinning on PageSwapCache() pages, waiting for when they have unfreezed counters (i.e., for the migration finish). But we don't want to make it also spinning on swap cache pages, which we try to reuse, since there is not a very high probability to reuse them. So, for now we do not consider PageSwapCache() pages at all. ] So in reuse_ksm_page() we check for 1) PageSwapCache() and 2) page_stable_node(), to skip a page, which KSM is currently trying to link to stable tree. Then we do page_ref_freeze() to prohibit KSM to merge one more page into the page, we are reusing. After that, nobody can refer to the reusing page: KSM skips !PageSwapCache() pages with zero refcount; and the protection against of all other participants is the same as for reused ordinary anon pages pte lock, page lock and mmap_sem. [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s] Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Anshuman Khandual | 98fa15f34c |
mm: replace all open encodings for NUMA_NO_NODE
Patch series "Replace all open encodings for NUMA_NO_NODE", v3. All these places for replacement were found by running the following grep patterns on the entire kernel code. Please let me know if this might have missed some instances. This might also have replaced some false positives. I will appreciate suggestions, inputs and review. 1. git grep "nid == -1" 2. git grep "node == -1" 3. git grep "nid = -1" 4. git grep "node = -1" This patch (of 2): At present there are multiple places where invalid node number is encoded as -1. Even though implicitly understood it is always better to have macros in there. Replace these open encodings for an invalid node number with the global macro NUMA_NO_NODE. This helps remove NUMA related assumptions like 'invalid node' from various places redirecting them to a common definition. Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [ixgbe] Acked-by: Jens Axboe <axboe@kernel.dk> [mtip32xx] Acked-by: Vinod Koul <vkoul@kernel.org> [dmaengine.c] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: Doug Ledford <dledford@redhat.com> [drivers/infiniband] Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Liviu Dudau | 6ade20327d |
mm/vmalloc.c: don't dereference possible NULL pointer in __vunmap()
find_vmap_area() can return a NULL pointer and we're going to
dereference it without checking it first. Use the existing
find_vm_area() function which does exactly what we want and checks for
the NULL pointer.
Link: http://lkml.kernel.org/r/20181228171009.22269-1-liviu@dudau.co.uk
Fixes:
|
|
Arun KS | a9cd410a3d |
mm/page_alloc.c: memory hotplug: free pages as higher order
When freeing pages are done with higher order, time spent on coalescing pages by buddy allocator can be reduced. With section size of 256MB, hot add latency of a single section shows improvement from 50-60 ms to less than 1 ms, hence improving the hot add latency by 60 times. Modify external providers of online callback to align with the change. [arunks@codeaurora.org: v11] Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org [akpm@linux-foundation.org: remove unused local, per Arun] [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar] [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch] [arunks@codeaurora.org: v8] Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org [arunks@codeaurora.org: v9] Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org Signed-off-by: Arun KS <arunks@codeaurora.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Qian Cai | 278d7756df |
mm/slub.c: remove an unused addr argument
"addr" function argument is not used in alloc_consistency_checks() at
all, so remove it.
Link: http://lkml.kernel.org/r/20190211123214.35592-1-cai@lca.pw
Fixes:
|
|
Qian Cai | 92d1d07daa |
mm/slab.c: kmemleak no scan alien caches
Kmemleak throws endless warnings during boot due to in
__alloc_alien_cache(),
alc = kmalloc_node(memsize, gfp, node);
init_arraycache(&alc->ac, entries, batch);
kmemleak_no_scan(ac);
Kmemleak does not track the array cache (alc->ac) but the alien cache
(alc) instead, so let it track the latter by lifting kmemleak_no_scan()
out of init_arraycache().
There is another place that calls init_arraycache(), but
alloc_kmem_cache_cpus() uses the percpu allocation where will never be
considered as a leak.
kmemleak: Found object by alias at 0xffff8007b9aa7e38
CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
Call trace:
dump_backtrace+0x0/0x168
show_stack+0x24/0x30
dump_stack+0x88/0xb0
lookup_object+0x84/0xac
find_and_get_object+0x84/0xe4
kmemleak_no_scan+0x74/0xf4
setup_kmem_cache_node+0x2b4/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
ret_from_fork+0x10/0x18
kmemleak: Object 0xffff8007b9aa7e00 (size 256):
kmemleak: comm "swapper/0", pid 1, jiffies 4294697137
kmemleak: min_count = 1
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
kmemleak_alloc+0x84/0xb8
kmem_cache_alloc_node_trace+0x31c/0x3a0
__kmalloc_node+0x58/0x78
setup_kmem_cache_node+0x26c/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
Call trace:
dump_backtrace+0x0/0x168
show_stack+0x24/0x30
dump_stack+0x88/0xb0
kmemleak_no_scan+0x90/0xf4
setup_kmem_cache_node+0x2b4/0x35c
__do_tune_cpucache+0x250/0x2d4
do_tune_cpucache+0x4c/0xe4
enable_cpucache+0xc8/0x110
setup_cpu_cache+0x40/0x1b8
__kmem_cache_create+0x240/0x358
create_cache+0xc0/0x198
kmem_cache_create_usercopy+0x158/0x20c
kmem_cache_create+0x50/0x64
fsnotify_init+0x58/0x6c
do_one_initcall+0x194/0x388
kernel_init_freeable+0x668/0x688
kernel_init+0x18/0x124
ret_from_fork+0x10/0x18
Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
Fixes:
|
|
Peng Wang | edde82b6df |
mm/slub.c: freelist is ensured to be NULL when new_slab() fails
new_slab_objects() will return immediately if freelist is not NULL. if (freelist) return freelist; One more assignment operation could be avoided. Link: http://lkml.kernel.org/r/20181229062512.30469-1-rocking@whu.edu.cn Signed-off-by: Peng Wang <rocking@whu.edu.cn> Reviewed-by: Pekka Enberg <penberg@kernel.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | 5c0198b6fb |
kasan: fix coccinelle warnings in kasan_p*_table
kasan_p4d_table(), kasan_pmd_table() and kasan_pud_table() are declared
as returning bool, but return 0 instead of false, which produces a
coccinelle warning. Fix it.
Link: http://lkml.kernel.org/r/1fa6fadf644859e8a6a8ecce258444b49be8c7ee.1551716733.git.andreyknvl@google.com
Fixes:
|
|
Arnd Bergmann | bcf6f55a0d |
kasan: fix kasan_check_read/write definitions
Building little-endian allmodconfig kernels on arm64 started failing with the generated atomic.h implementation, since we now try to call kasan helpers from the EFI stub: aarch64-linux-gnu-ld: drivers/firmware/efi/libstub/arm-stub.stub.o: in function `atomic_set': include/generated/atomic-instrumented.h:44: undefined reference to `__efistub_kasan_check_write' I suspect that we get similar problems in other files that explicitly disable KASAN for some reason but call atomic_t based helper functions. We can fix this by checking the predefined __SANITIZE_ADDRESS__ macro that the compiler sets instead of checking CONFIG_KASAN, but this in turn requires a small hack in mm/kasan/common.c so we do see the extern declaration there instead of the inline function. Link: http://lkml.kernel.org/r/20181211133453.2835077-1-arnd@arndb.de Fixes: b1864b828644 ("locking/atomics: build atomic headers as required") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: Anders Roxell <anders.roxell@linaro.org> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au>, Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Qian Cai | 4117992df6 |
page_poison: play nicely with KASAN
KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING). It triggers false positives in the allocation path: BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330 Read of size 8 at addr ffff88881f800000 by task swapper/0 CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54 Call Trace: dump_stack+0xe0/0x19a print_address_description.cold.2+0x9/0x28b kasan_report.cold.3+0x7a/0xb5 __asan_report_load8_noabort+0x19/0x20 memchr_inv+0x2ea/0x330 kernel_poison_pages+0x103/0x3d5 get_page_from_freelist+0x15e7/0x4d90 because KASAN has not yet unpoisoned the shadow page for allocation before it checks memchr_inv() but only found a stale poison pattern. Also, false positives in free path, BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5 Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1 CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55 Call Trace: dump_stack+0xe0/0x19a print_address_description.cold.2+0x9/0x28b kasan_report.cold.3+0x7a/0xb5 check_memory_region+0x22d/0x250 memset+0x28/0x40 kernel_poison_pages+0x29e/0x3d5 __free_pages_ok+0x75f/0x13e0 due to KASAN adds poisoned redzones around slab objects, but the page poisoning needs to poison the whole page. Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Ryabinin | 7771bdbbfd |
kasan: remove use after scope bugs detection.
Use after scope bugs detector seems to be almost entirely useless for the linux kernel. It exists over two years, but I've seen only one valid bug so far [1]. And the bug was fixed before it has been reported. There were some other use-after-scope reports, but they were false-positives due to different reasons like incompatibility with structleak plugin. This feature significantly increases stack usage, especially with GCC < 9 version, and causes a 32K stack overflow. It probably adds performance penalty too. Given all that, let's remove use-after-scope detector entirely. While preparing this patch I've noticed that we mistakenly enable use-after-scope detection for clang compiler regardless of CONFIG_KASAN_EXTRA setting. This is also fixed now. [1] http://lkml.kernel.org/r/<20171129052106.rhgbjhhis53hkgfn@wfg-t540p.sh.intel.com> Link: http://lkml.kernel.org/r/20190111185842.13978-1-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Will Deacon <will.deacon@arm.com> [arm64] Cc: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
zhongjiang | 46612b751c |
mm: hwpoison: fix thp split handing in soft_offline_in_use_page()
When soft_offline_in_use_page() runs on a thp tail page after pmd is split, we trigger the following VM_BUG_ON_PAGE(): Memory failure: 0x3755ff: non anonymous thp __get_any_page: 0x3755ff: unknown zero refcount page type 2fffff80000000 Soft offlining pfn 0x34d805 at process virtual address 0x20fff000 page:ffffea000d360140 count:0 mapcount:0 mapping:0000000000000000 index:0x1 flags: 0x2fffff80000000() raw: 002fffff80000000 ffffea000d360108 ffffea000d360188 0000000000000000 raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) ------------[ cut here ]------------ kernel BUG at ./include/linux/mm.h:519! soft_offline_in_use_page() passed refcount and page lock from tail page to head page, which is not needed because we can pass any subpage to split_huge_page(). Naoya had fixed a similar issue in |
|
Linus Torvalds | 2d28e01dca |
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton: "2 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: hugetlbfs: fix races and page leaks during migration kasan: turn off asan-stack for clang-8 and earlier |
|
Mike Kravetz | cb6acd01e2 |
hugetlbfs: fix races and page leaks during migration
hugetlb pages should only be migrated if they are 'active'. The
routines set/clear_page_huge_active() modify the active state of hugetlb
pages.
When a new hugetlb page is allocated at fault time, set_page_huge_active
is called before the page is locked. Therefore, another thread could
race and migrate the page while it is being added to page table by the
fault code. This race is somewhat hard to trigger, but can be seen by
strategically adding udelay to simulate worst case scheduling behavior.
Depending on 'how' the code races, various BUG()s could be triggered.
To address this issue, simply delay the set_page_huge_active call until
after the page is successfully added to the page table.
Hugetlb pages can also be leaked at migration time if the pages are
associated with a file in an explicitly mounted hugetlbfs filesystem.
For example, consider a two node system with 4GB worth of huge pages
available. A program mmaps a 2G file in a hugetlbfs filesystem. It
then migrates the pages associated with the file from one node to
another. When the program exits, huge page counts are as follows:
node0
1024 free_hugepages
1024 nr_hugepages
node1
0 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
That is as expected. 2G of huge pages are taken from the free_hugepages
counts, and 2G is the size of the file in the explicitly mounted
filesystem. If the file is then removed, the counts become:
node0
1024 free_hugepages
1024 nr_hugepages
node1
1024 free_hugepages
1024 nr_hugepages
Filesystem Size Used Avail Use% Mounted on
nodev 4.0G 2.0G 2.0G 50% /var/opt/hugepool
Note that the filesystem still shows 2G of pages used, while there
actually are no huge pages in use. The only way to 'fix' the filesystem
accounting is to unmount the filesystem
If a hugetlb page is associated with an explicitly mounted filesystem,
this information in contained in the page_private field. At migration
time, this information is not preserved. To fix, simply transfer
page_private from old to new page at migration time if necessary.
There is a related race with removing a huge page from a file and
migration. When a huge page is removed from the pagecache, the
page_mapping() field is cleared, yet page_private remains set until the
page is actually freed by free_huge_page(). A page could be migrated
while in this state. However, since page_mapping() is not set the
hugetlbfs specific routine to transfer page_private is not called and we
leak the page count in the filesystem.
To fix that, check for this condition before migrating a huge page. If
the condition is detected, return EBUSY for the page.
Link: http://lkml.kernel.org/r/74510272-7319-7372-9ea6-ec914734c179@oracle.com
Link: http://lkml.kernel.org/r/20190212221400.3512-1-mike.kravetz@oracle.com
Fixes:
|
|
Dave Hansen | 2794129e90 |
mm/memory-hotplug: Allow memory resources to be children
The mm/resource.c code is used to manage the physical address space. The current resource configuration can be viewed in /proc/iomem. An example of this is at the bottom of this description. The nvdimm subsystem "owns" the physical address resources which map to persistent memory and has resources inserted for them as "Persistent Memory". The best way to repurpose this for volatile use is to leave the existing resource in place, but add a "System RAM" resource underneath it. This clearly communicates the ownership relationship of this memory. The request_resource_conflict() API only deals with the top-level resources. Replace it with __request_region() which will search for !IORESOURCE_BUSY areas lower in the resource tree than the top level. We *could* also simply truncate the existing top-level "Persistent Memory" resource and take over the released address space. But, this means that if we ever decide to hot-unplug the "RAM" and give it back, we need to recreate the original setup, which may mean going back to the BIOS tables. This should have no real effect on the existing collision detection because the areas that truly conflict should be marked IORESOURCE_BUSY. 00000000-00000fff : Reserved 00001000-0009fbff : System RAM 0009fc00-0009ffff : Reserved 000a0000-000bffff : PCI Bus 0000:00 000c0000-000c97ff : Video ROM 000c9800-000ca5ff : Adapter ROM 000f0000-000fffff : Reserved 000f0000-000fffff : System ROM 00100000-9fffffff : System RAM 01000000-01e071d0 : Kernel code 01e071d1-027dfdff : Kernel data 02dc6000-0305dfff : Kernel bss a0000000-afffffff : Persistent Memory (legacy) a0000000-a7ffffff : System RAM b0000000-bffdffff : System RAM bffe0000-bfffffff : Reserved c0000000-febfffff : PCI Bus 0000:00 Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying <ying.huang@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Borislav Petkov <bp@suse.de> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> |
|
Dave Hansen | b926b7f3ba |
mm/resource: Move HMM pr_debug() deeper into resource code
HMM consumes physical address space for its own use, even though nothing is mapped or accessible there. It uses a special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY) to uniquely identify these areas. When HMM consumes address space, it makes a best guess about what to consume. However, it is possible that a future memory or device hotplug can collide with the reserved area. In the case of these conflicts, there is an error message in register_memory_resource(). Later patches in this series move register_memory_resource() from using request_resource_conflict() to __request_region(). Unfortunately, __request_region() does not return the conflict like the previous function did, which makes it impossible to check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting resource. Instead of warning in register_memory_resource(), move the check into the core resource code itself (__request_region()) where the conflicting resource _is_ available. This has the added bonus of producing a warning in case of HMM conflicts with devices *or* RAM address space, as opposed to the RAM- only warnings that were there previously. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Ross Zwisler <zwisler@kernel.org> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: linux-nvdimm@lists.01.org Cc: linux-kernel@vger.kernel.org Cc: linux-mm@kvack.org Cc: Huang Ying <ying.huang@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> |
|
Jann Horn | 0a1d52994d |
mm: enforce min addr even if capable() in expand_downwards()
security_mmap_addr() does a capability check with current_cred(), but
we can reach this code from contexts like a VFS write handler where
current_cred() must not be used.
This can be abused on systems without SMAP to make NULL pointer
dereferences exploitable again.
Fixes:
|
|
Peng Fan | 1b046b445c |
percpu: km: no need to consider pcpu_group_offsets[0]
percpu-km is used on UP systems which only has one group, so the group offset will be always 0, there is no need to subtract pcpu_group_offsets[0] when assigning chunk->base_addr Signed-off-by: Peng Fan <peng.fan@nxp.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> |
|
Darrick J. Wong | 29b00e6099 |
tmpfs: fix uninitialized return value in shmem_link
When we made the shmem_reserve_inode call in shmem_link conditional, we
forgot to update the declaration for ret so that it always has a known
value. Dan Carpenter pointed out this deficiency in the original patch.
Fixes:
|
|
Linus Torvalds | 53a41cb7ed |
Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses"
This reverts commit
|
|
Peng Fan | 2de7852fe9 |
percpu: use nr_groups as check condition
group_cnt array is defined with NR_CPUS entries, but normally nr_groups will not reach up to NR_CPUS. So there is no issue to the current code. Checking other parts of pcpu_build_alloc_info, use nr_groups as check condition, so make it consistent to use 'group < nr_groups' as for loop check. In case we do have nr_groups equals with NR_CPUS, we could also avoid memory access out of bounds. Signed-off-by: Peng Fan <peng.fan@nxp.com> Signed-off-by: Dennis Zhou <dennis@kernel.org> |
|
Michal Hocko | 891cb2a72d |
mm, memory_hotplug: fix off-by-one in is_pageblock_removable
Rong Chen has reported the following boot crash: PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 1 PID: 239 Comm: udevd Not tainted 5.0.0-rc4-00149-gefad4e4 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 RIP: 0010:page_mapping+0x12/0x80 Code: 5d c3 48 89 df e8 0e ad 02 00 85 c0 75 da 89 e8 5b 5d c3 0f 1f 44 00 00 53 48 89 fb 48 8b 43 08 48 8d 50 ff a8 01 48 0f 45 da <48> 8b 53 08 48 8d 42 ff 83 e2 01 48 0f 44 c3 48 83 38 ff 74 2f 48 RSP: 0018:ffff88801fa87cd8 EFLAGS: 00010202 RAX: ffffffffffffffff RBX: fffffffffffffffe RCX: 000000000000000a RDX: fffffffffffffffe RSI: ffffffff820b9a20 RDI: ffff88801e5c0000 RBP: 6db6db6db6db6db7 R08: ffff88801e8bb000 R09: 0000000001b64d13 R10: ffff88801fa87cf8 R11: 0000000000000001 R12: ffff88801e640000 R13: ffffffff820b9a20 R14: ffff88801f145258 R15: 0000000000000001 FS: 00007fb2079817c0(0000) GS:ffff88801dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000006 CR3: 000000001fa82000 CR4: 00000000000006a0 Call Trace: __dump_page+0x14/0x2c0 is_mem_section_removable+0x24c/0x2c0 removable_show+0x87/0xa0 dev_attr_show+0x25/0x60 sysfs_kf_seq_show+0xba/0x110 seq_read+0x196/0x3f0 __vfs_read+0x34/0x180 vfs_read+0xa0/0x150 ksys_read+0x44/0xb0 do_syscall_64+0x5e/0x4a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe and bisected it down to commit |
|
Daniel Vetter | 6c8fcc096b |
mm: don't let userspace spam allocations warnings
memdump_user usually gets fed unchecked userspace input. Blasting a full backtrace into dmesg every time is a bit excessive - I'm not sure on the kernel rule in general, but at least in drm we're trying not to let unpriviledge userspace spam the logs freely. Definitely not entire warning backtraces. It also means more filtering for our CI, because our testsuite exercises these corner cases and so hits these a lot. Link: http://lkml.kernel.org/r/20190220204058.11676-1-daniel.vetter@ffwll.ch Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Roman Gushchin <guro@fb.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Stancek <jstancek@redhat.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Bartosz Golaszewski <brgl@bgdev.pl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Qian Cai | 6373dca16c |
slub: fix a crash with SLUB_DEBUG + KASAN_SW_TAGS
In process_slab(), "p = get_freepointer()" could return a tagged pointer, but "addr = page_address()" always return a native pointer. As the result, slab_index() is messed up here, return (p - addr) / s->size; All other callers of slab_index() have the same situation where "addr" is from page_address(), so just need to untag "p". # cat /sys/kernel/slab/hugetlbfs_inode_cache/alloc_calls Unable to handle kernel paging request at virtual address 2bff808aa4856d48 Mem abort info: ESR = 0x96000007 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000007 CM = 0, WnR = 0 swapper pgtable: 64k pages, 48-bit VAs, pgdp = 0000000002498338 [2bff808aa4856d48] pgd=00000097fcfd0003, pud=00000097fcfd0003, pmd=00000097fca30003, pte=00e8008b24850712 Internal error: Oops: 96000007 [#1] SMP CPU: 3 PID: 79210 Comm: read_all Tainted: G L 5.0.0-rc7+ #84 Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018 pstate: 00400089 (nzcv daIf +PAN -UAO) pc : get_map+0x78/0xec lr : get_map+0xa0/0xec sp : aeff808989e3f8e0 x29: aeff808989e3f940 x28: ffff800826200000 x27: ffff100012d47000 x26: 9700000000002500 x25: 0000000000000001 x24: 52ff8008200131f8 x23: 52ff8008200130a0 x22: 52ff800820013098 x21: ffff800826200000 x20: ffff100013172ba0 x19: 2bff808a8971bc00 x18: ffff1000148f5538 x17: 000000000000001b x16: 00000000000000ff x15: ffff1000148f5000 x14: 00000000000000d2 x13: 0000000000000001 x12: 0000000000000000 x11: 0000000020000002 x10: 2bff808aa4856d48 x9 : 0000020000000000 x8 : 68ff80082620ebb0 x7 : 0000000000000000 x6 : ffff1000105da1dc x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000010 x2 : 2bff808a8971bc00 x1 : ffff7fe002098800 x0 : ffff80082620ceb0 Process read_all (pid: 79210, stack limit = 0x00000000f65b9361) Call trace: get_map+0x78/0xec process_slab+0x7c/0x47c list_locations+0xb0/0x3c8 alloc_calls_show+0x34/0x40 slab_attr_show+0x34/0x48 sysfs_kf_seq_show+0x2e4/0x570 kernfs_seq_show+0x12c/0x1a0 seq_read+0x48c/0xf84 kernfs_fop_read+0xd4/0x448 __vfs_read+0x94/0x5d4 vfs_read+0xcc/0x194 ksys_read+0x6c/0xe8 __arm64_sys_read+0x68/0xb0 el0_svc_handler+0x230/0x3bc el0_svc+0x8/0xc Code: d3467d2a 9ac92329 8b0a0e6a f9800151 (c85f7d4b) ---[ end trace a383a9a44ff13176 ]--- Kernel panic - not syncing: Fatal exception SMP: stopping secondary CPUs SMP: failed to stop secondary CPUs 1-7,32,40,127 Kernel Offset: disabled CPU features: 0x002,20000c18 Memory Limit: none ---[ end Kernel panic - not syncing: Fatal exception ]--- Link: http://lkml.kernel.org/r/20190220020251.82039-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | 557ea25383 |
kasan, slab: remove redundant kasan_slab_alloc hooks
kasan_slab_alloc() calls in kmem_cache_alloc() and kmem_cache_alloc_node() are redundant as they are already called via slab_alloc/slab_alloc_node()-> slab_post_alloc_hook()->kasan_slab_alloc(). Remove them. Link: http://lkml.kernel.org/r/4ca1655cdcfc4379c49c50f7bf80f81c4ad01485.1550602886.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | 51dedad06b |
kasan, slab: make freelist stored without tags
Similarly to "kasan, slub: move kasan_poison_slab hook before page_address", move kasan_poison_slab() before alloc_slabmgmt(), which calls page_address(), to make page_address() return value to be non-tagged. This, combined with calling kasan_reset_tag() for off-slab slab management object, leads to freelist being stored non-tagged. Link: http://lkml.kernel.org/r/dfb53b44a4d00de3879a05a9f04c1f55e584f7a1.1550602886.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | 219667c23c |
kasan, slab: fix conflicts with CONFIG_HARDENED_USERCOPY
Similarly to commit
|
|
Andrey Konovalov | dc15a8a254 |
kasan: prevent tracing of tags.c
Similarly to commit
|
|
Andrey Konovalov | 3f41b60938 |
kasan: fix random seed generation for tag-based mode
There are two issues with assigning random percpu seeds right now: 1. We use for_each_possible_cpu() to iterate over cpus, but cpumask is not set up yet at the moment of kasan_init(), and thus we only set the seed for cpu #0. 2. A call to get_random_u32() always returns the same number and produces a message in dmesg, since the random subsystem is not yet initialized. Fix 1 by calling kasan_init_tags() after cpumask is set up. Fix 2 by using get_cycles() instead of get_random_u32(). This gives us lower quality random numbers, but it's good enough, as KASAN is meant to be used as a debugging tool and not a mitigation. Link: http://lkml.kernel.org/r/1f815cc914b61f3516ed4cc9bfd9eeca9bd5d9de.1550677973.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Darrick J. Wong | 1062af920c |
tmpfs: fix link accounting when a tmpfile is linked in
tmpfs has a peculiarity of accounting hard links as if they were
separate inodes: so that when the number of inodes is limited, as it is
by default, a user cannot soak up an unlimited amount of unreclaimable
dcache memory just by repeatedly linking a file.
But when v3.11 added O_TMPFILE, and the ability to use linkat() on the
fd, we missed accommodating this new case in tmpfs: "df -i" shows that
an extra "inode" remains accounted after the file is unlinked and the fd
closed and the actual inode evicted. If a user repeatedly links
tmpfiles into a tmpfs, the limit will be hit (ENOSPC) even after they
are deleted.
Just skip the extra reservation from shmem_link() in this case: there's
a sense in which this first link of a tmpfile is then cheaper than a
hard link of another file, but the accounting works out, and there's
still good limiting, so no need to do anything more complicated.
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182134370.7035@eggly.anvils
Fixes:
|
|
Michal Hocko | 6ea183d60c |
mm: handle lru_add_drain_all for UP properly
Since for_each_cpu(cpu, mask) added by commit |
|
Mel Gorman | 94b3334cbe |
mm, page_alloc: fix a division by zero error when boosting watermarks v2
Yury Norov reported that an arm64 KVM instance could not boot since after v5.0-rc1 and could addressed by reverting the patches |
|
Robin Murphy | 311ade0eab |
mm/debug.c: fix __dump_page() for poisoned pages
Evaluating page_mapping() on a poisoned page ends up dereferencing junk
and making PF_POISONED_CHECK() considerably crashier than intended:
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000006
Mem abort info:
ESR = 0x96000005
Exception class = DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
Data abort info:
ISV = 0, ISS = 0x00000005
CM = 0, WnR = 0
user pgtable: 4k pages, 39-bit VAs, pgdp = 00000000c2f6ac38
[0000000000000006] pgd=0000000000000000, pud=0000000000000000
Internal error: Oops: 96000005 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 491 Comm: bash Not tainted 5.0.0-rc1+ #1
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
pstate: 00000005 (nzcv daif -PAN -UAO)
pc : page_mapping+0x18/0x118
lr : __dump_page+0x1c/0x398
Process bash (pid: 491, stack limit = 0x000000004ebd4ecd)
Call trace:
page_mapping+0x18/0x118
__dump_page+0x1c/0x398
dump_page+0xc/0x18
remove_store+0xbc/0x120
dev_attr_store+0x18/0x28
sysfs_kf_write+0x40/0x50
kernfs_fop_write+0x130/0x1d8
__vfs_write+0x30/0x180
vfs_write+0xb4/0x1a0
ksys_write+0x60/0xd0
__arm64_sys_write+0x18/0x20
el0_svc_common+0x94/0xf8
el0_svc_handler+0x68/0x70
el0_svc+0x8/0xc
Code: f9400401 d1000422 f240003f 9a801040 (f9400402)
---[ end trace cdb5eb5bf435cecb ]---
Fix that by not inspecting the mapping until we've determined that it's
likely to be valid. Now the above condition still ends up stopping the
kernel, but in the correct manner:
page:ffffffbf20000000 is uninitialized and poisoned
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
------------[ cut here ]------------
kernel BUG at ./include/linux/mm.h:1006!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 PID: 483 Comm: bash Not tainted 5.0.0-rc1+ #3
Hardware name: ARM LTD ARM Juno Development Platform/ARM Juno Development Platform, BIOS EDK II Dec 17 2018
pstate: 40000005 (nZcv daif -PAN -UAO)
pc : remove_store+0xbc/0x120
lr : remove_store+0xbc/0x120
...
Link: http://lkml.kernel.org/r/03b53ee9d7e76cda4b9b5e1e31eea080db033396.1550071778.git.robin.murphy@arm.com
Fixes:
|
|
Qian Cai | 338cfaad49 |
slub: fix SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS
Enabling SLUB_DEBUG's SLAB_CONSISTENCY_CHECKS with KASAN_SW_TAGS triggers endless false positives during boot below due to check_valid_pointer() checks tagged pointers which have no addresses that is valid within slab pages: BUG radix_tree_node (Tainted: G B ): Freelist Pointer check fails ----------------------------------------------------------------------------- INFO: Slab objects=69 used=69 fp=0x (null) flags=0x7ffffffc000200 INFO: Object @offset=15060037153926966016 fp=0x Redzone: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 18 6b 06 00 08 80 ff d0 .........k...... Object : 18 6b 06 00 08 80 ff d0 00 00 00 00 00 00 00 00 .k.............. Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Object : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ Redzone: bb bb bb bb bb bb bb bb ........ Padding: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B 5.0.0-rc5+ #18 Call trace: dump_backtrace+0x0/0x450 show_stack+0x20/0x2c __dump_stack+0x20/0x28 dump_stack+0xa0/0xfc print_trailer+0x1bc/0x1d0 object_err+0x40/0x50 alloc_debug_processing+0xf0/0x19c ___slab_alloc+0x554/0x704 kmem_cache_alloc+0x2f8/0x440 radix_tree_node_alloc+0x90/0x2fc idr_get_free+0x1e8/0x6d0 idr_alloc_u32+0x11c/0x2a4 idr_alloc+0x74/0xe0 worker_pool_assign_id+0x5c/0xbc workqueue_init_early+0x49c/0xd50 start_kernel+0x52c/0xac4 FIX radix_tree_node: Marking all objects used Link: http://lkml.kernel.org/r/20190209044128.3290-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | d36a63a943 |
kasan, slub: fix more conflicts with CONFIG_SLAB_FREELIST_HARDENED
When CONFIG_KASAN_SW_TAGS is enabled, ptr_addr might be tagged. Normally, this doesn't cause any issues, as both set_freepointer() and get_freepointer() are called with a pointer with the same tag. However, there are some issues with CONFIG_SLUB_DEBUG code. For example, when __free_slub() iterates over objects in a cache, it passes untagged pointers to check_object(). check_object() in turns calls get_freepointer() with an untagged pointer, which causes the freepointer to be restored incorrectly. Add kasan_reset_tag to freelist_ptr(). Also add a detailed comment. Link: http://lkml.kernel.org/r/bf858f26ef32eb7bd24c665755b3aee4bc58d0e4.1550103861.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Tested-by: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | 18e5066102 |
kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED
CONFIG_SLAB_FREELIST_HARDENED hashes freelist pointer with the address of the object where the pointer gets stored. With tag based KASAN we don't account for that when building freelist, as we call set_freepointer() with the first argument untagged. This patch changes the code to properly propagate tags throughout the loop. Link: http://lkml.kernel.org/r/3df171559c52201376f246bf7ce3184fe21c1dc7.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | a710122428 |
kasan, slub: move kasan_poison_slab hook before page_address
With tag based KASAN page_address() looks at the page flags to see whether the resulting pointer needs to have a tag set. Since we don't want to set a tag when page_address() is called on SLAB pages, we call page_kasan_tag_reset() in kasan_poison_slab(). However in allocate_slab() page_address() is called before kasan_poison_slab(). Fix it by changing the order. [andreyknvl@google.com: fix compilation error when CONFIG_SLUB_DEBUG=n] Link: http://lkml.kernel.org/r/ac27cc0bbaeb414ed77bcd6671a877cf3546d56e.1550066133.git.andreyknvl@google.com Link: http://lkml.kernel.org/r/cd895d627465a3f1c712647072d17f10883be2a1.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | a2f775751d |
kmemleak: account for tagged pointers when calculating pointer range
kmemleak keeps two global variables, min_addr and max_addr, which store the range of valid (encountered by kmemleak) pointer values, which it later uses to speed up pointer lookup when scanning blocks. With tagged pointers this range will get bigger than it needs to be. This patch makes kmemleak untag pointers before saving them to min_addr and max_addr and when performing a lookup. Link: http://lkml.kernel.org/r/16e887d442986ab87fe87a755815ad92fa431a5f.1550066133.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Qian Cai <cai@lca.pw> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrey Konovalov | 53128245b4 |
kasan, kmemleak: pass tagged pointers to kmemleak
Right now we call kmemleak hooks before assigning tags to pointers in KASAN hooks. As a result, when an objects gets allocated, kmemleak sees a differently tagged pointer, compared to the one it sees when the object gets freed. Fix it by calling KASAN hooks before kmemleak's ones. Link: http://lkml.kernel.org/r/cd825aa4897b0fc37d3316838993881daccbe9f5.1549921721.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reported-by: Qian Cai <cai@lca.pw> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgeniy Stepanov <eugenis@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |