mirror of https://gitee.com/openkylin/linux.git
13935 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Qian Cai | 2b38d01b4d |
mm/zsmalloc.c: fix a -Wunused-function warning
set_zspage_inuse() was introduced in the commit |
|
Vitaly Wool | 068619e32f |
zswap: do not map same object twice
zswap_writeback_entry() maps a handle to read swpentry first, and then in the most common case it would map the same handle again. This is ok when zbud is the backend since its mapping callback is plain and simple, but it slows things down for z3fold. Since there's hardly a point in unmapping a handle _that_ fast as zswap_writeback_entry() does when it reads swpentry, the suggestion is to keep the handle mapped till the end. Link: http://lkml.kernel.org/r/20190916004640.b453167d3556c4093af4cf7d@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Reviewed-by: Dan Streetman <ddstreet@ieee.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitalywool@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hui Zhu | d2fcd82bb8 |
zswap: use movable memory if zpool support allocate movable memory
This is the third version that was updated according to the comments from Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt https://lkml.org/lkml/2019/6/4/973 zswap compresses swap pages into a dynamically allocated RAM-based memory pool. The memory pool should be zbud, z3fold or zsmalloc. All of them will allocate unmovable pages. It will increase the number of unmovable page blocks that will bad for anti-fragment. zsmalloc support page migration if request movable page: handle = zs_malloc(zram->mem_pool, comp_len, GFP_NOIO | __GFP_HIGHMEM | __GFP_MOVABLE); And commit "zpool: Add malloc_support_movable to zpool_driver" add zpool_malloc_support_movable check malloc_support_movable to make sure if a zpool support allocate movable memory. This commit let zswap allocate block with gfp __GFP_HIGHMEM | __GFP_MOVABLE if zpool support allocate movable memory. Following part is test log in a pc that has 8G memory and 2G swap. Without this commit: ~# echo lz4 > /sys/module/zswap/parameters/compressor ~# echo zsmalloc > /sys/module/zswap/parameters/zpool ~# echo 1 > /sys/module/zswap/parameters/enabled ~# swapon /swapfile ~# cd /home/teawater/kernel/vm-scalability/ /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024)) /home/teawater/kernel/vm-scalability# ./case-anon-w-seq 2717908992 bytes / 4826062 usecs = 549973 KB/s 2717908992 bytes / 4864201 usecs = 545661 KB/s 2717908992 bytes / 4867015 usecs = 545346 KB/s 2717908992 bytes / 4915485 usecs = 539968 KB/s 397853 usecs to free memory 357820 usecs to free memory 421333 usecs to free memory 420454 usecs to free memory /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo Page block order: 9 Pages per block: 512 Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 1 1 1 0 2 1 1 0 1 0 0 Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3 Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type Unmovable 6 5 8 6 6 5 4 1 1 1 0 Node 0, zone DMA32, type Movable 25 20 20 19 22 15 14 11 11 5 767 Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Unmovable 4753 5588 5159 4613 3712 2520 1448 594 188 11 0 Node 0, zone Normal, type Movable 16 3 457 2648 2143 1435 860 459 223 224 296 Node 0, zone Normal, type Reclaimable 0 0 44 38 11 2 0 0 0 0 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate Node 0, zone DMA 1 7 0 0 0 0 Node 0, zone DMA32 4 1652 0 0 0 0 Node 0, zone Normal 931 1485 15 0 0 0 With this commit: ~# echo lz4 > /sys/module/zswap/parameters/compressor ~# echo zsmalloc > /sys/module/zswap/parameters/zpool ~# echo 1 > /sys/module/zswap/parameters/enabled ~# swapon /swapfile ~# cd /home/teawater/kernel/vm-scalability/ /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024)) /home/teawater/kernel/vm-scalability# ./case-anon-w-seq 2717908992 bytes / 4689240 usecs = 566020 KB/s 2717908992 bytes / 4760605 usecs = 557535 KB/s 2717908992 bytes / 4803621 usecs = 552543 KB/s 2717908992 bytes / 5069828 usecs = 523530 KB/s 431546 usecs to free memory 383397 usecs to free memory 456454 usecs to free memory 224487 usecs to free memory /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo Page block order: 9 Pages per block: 512 Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 1 1 1 0 2 1 1 0 1 0 0 Node 0, zone DMA, type Movable 0 0 0 0 0 0 0 0 0 1 3 Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type CMA 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type Unmovable 10 8 10 9 10 4 3 2 3 0 0 Node 0, zone DMA32, type Movable 18 12 14 16 16 11 9 5 5 6 775 Node 0, zone DMA32, type Reclaimable 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone DMA32, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type CMA 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Unmovable 2669 1236 452 118 37 14 4 1 2 3 0 Node 0, zone Normal, type Movable 3850 6086 5274 4327 3510 2494 1520 934 438 220 470 Node 0, zone Normal, type Reclaimable 56 93 155 124 47 31 17 7 3 0 0 Node 0, zone Normal, type HighAtomic 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type CMA 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Number of blocks type Unmovable Movable Reclaimable HighAtomic CMA Isolate Node 0, zone DMA 1 7 0 0 0 0 Node 0, zone DMA32 4 1650 2 0 0 0 Node 0, zone Normal 79 2326 26 0 0 0 You can see that the number of unmovable page blocks is decreased when the kernel has this commit. Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.com Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitalywool@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hui Zhu | c165f25d23 |
zpool: add malloc_support_movable to zpool_driver
As a zpool_driver, zsmalloc can allocate movable memory because it support migate pages. But zbud and z3fold cannot allocate movable memory. Add malloc_support_movable to zpool_driver. If a zpool_driver support allocate movable memory, set it to true. And add zpool_malloc_support_movable check malloc_support_movable to make sure if a zpool support allocate movable memory. Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.com Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Vitaly Wool <vitalywool@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Miles Chen | 28eb3c8087 |
shmem: fix obsolete comment in shmem_getpage_gfp()
Replace "fault_mm" with "vmf" in code comment because commit
|
|
Mike Rapoport | f3bc0dba31 |
mm/madvise: reduce code duplication in error handling paths
madvise_behavior() converts -ENOMEM to -EAGAIN in several places using identical code. Move that code to a common error handling path. No functional changes. Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Ivan Khoronzhuk | 76f3495077 |
mm: mmap: increase sockets maximum memory size pgoff for 32bits
The AF_XDP sockets umem mapping interface uses XDP_UMEM_PGOFF_FILL_RING and XDP_UMEM_PGOFF_COMPLETION_RING offsets. These offsets are established already and are part of the configuration interface. But for 32-bit systems, using AF_XDP socket configuration, these values are too large to pass the maximum allowed file size verification. The offsets can be tuned off, but instead of changing the existing interface, let's extend the max allowed file size for sockets. No one has been using this until this patch with 32 bits as without this fix af_xdp sockets can't be used at all, so it unblocks af_xdp socket usage for 32bit systems. All list of mmap cbs for sockets was verified for side effects and all of them contain dummy cb - sock_no_mmap() at this moment, except the following: xsk_mmap() - it's what this fix is needed for. tcp_mmap() - doesn't have obvious issues with pgoff - no any references on it. packet_mmap() - return -EINVAL if it's even set. Link: http://lkml.kernel.org/r/20190812124326.32146-1-ivan.khoronzhuk@linaro.org Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Björn Töpel <bjorn.topel@intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Magnus Karlsson <magnus.karlsson@intel.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | 73848a9711 |
mm/mmap.c: refine find_vma_prev() with rb_last()
When addr is out of range of the whole rb_tree, pprev will point to the right-most node. rb_tree facility already provides a helper function, rb_last(), to do this task. We can leverage this instead of reimplementing it. This patch refines find_vma_prev() with rb_last() to make it a little nicer to read. [akpm@linux-foundation.org: little cleanup, per Vlastimil] Link: http://lkml.kernel.org/r/20190809001928.4950-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexandre Ghiti | e7142bf5d2 |
arm64, mm: make randomization selected by generic topdown mmap layout
This commits selects ARCH_HAS_ELF_RANDOMIZE when an arch uses the generic topdown mmap layout functions so that this security feature is on by default. Note that this commit also removes the possibility for arm64 to have elf randomization and no MMU: without MMU, the security added by randomization is worth nothing. Link: http://lkml.kernel.org/r/20190730055113.23635-6-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: James Hogan <jhogan@kernel.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexandre Ghiti | 67f3977f80 |
arm64, mm: move generic mmap layout functions to mm
arm64 handles top-down mmap layout in a way that can be easily reused by other architectures, so make it available in mm. It then introduces a new config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT that can be set by other architectures to benefit from those functions. Note that this new config depends on MMU being enabled, if selected without MMU support, a warning will be thrown. Link: http://lkml.kernel.org/r/20190730055113.23635-5-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Suggested-by: Christoph Hellwig <hch@infradead.org> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: James Hogan <jhogan@kernel.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alexandre Ghiti | 649775be63 |
mm, fs: move randomize_stack_top from fs to mm
Patch series "Provide generic top-down mmap layout functions", v6. This series introduces generic functions to make top-down mmap layout easily accessible to architectures, in particular riscv which was the initial goal of this series. The generic implementation was taken from arm64 and used successively by arm, mips and finally riscv. Note that in addition the series fixes 2 issues: - stack randomization was taken into account even if not necessary. - [1] fixed an issue with mmap base which did not take into account randomization but did not report it to arm and mips, so by moving arm64 into a generic library, this problem is now fixed for both architectures. This work is an effort to factorize architecture functions to avoid code duplication and oversights as in [1]. [1]: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html This patch (of 14): This preparatory commit moves this function so that further introduction of generic topdown mmap layout is contained only in mm/util.c. Link: http://lkml.kernel.org/r/20190730055113.23635-2-alex@ghiti.fr Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Acked-by: Kees Cook <keescook@chromium.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paul.burton@mips.com> Cc: James Hogan <jhogan@kernel.org> Cc: Palmer Dabbelt <palmer@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 27e1f82731 |
khugepaged: enable collapse pmd for pte-mapped THP
khugepaged needs exclusive mmap_sem to access page table. When it fails to lock mmap_sem, the page will fault in as pte-mapped THP. As the page is already a THP, khugepaged will not handle this pmd again. This patch enables the khugepaged to retry collapse the page table. struct mm_slot (in khugepaged.c) is extended with an array, containing addresses of pte-mapped THPs. We use array here for simplicity. We can easily replace it with more advanced data structures when needed. In khugepaged_scan_mm_slot(), if the mm contains pte-mapped THP, we try to collapse the page table. Since collapse may happen at an later time, some pages may already fault in. collapse_pte_mapped_thp() is added to properly handle these pages. collapse_pte_mapped_thp() also double checks whether all ptes in this pmd are mapping to the same THP. This is necessary because some subpage of the THP may be replaced, for example by uprobe. In such cases, it is not possible to collapse the pmd. [kirill.shutemov@linux.intel.com: add comments for retract_page_tables()] Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box Link: http://lkml.kernel.org/r/20190815164525.1848545-6-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | bfe7b00de6 |
mm, thp: introduce FOLL_SPLIT_PMD
Introduce a new foll_flag: FOLL_SPLIT_PMD. As the name says FOLL_SPLIT_PMD splits huge pmd for given mm_struct, the underlining huge page stays as-is. FOLL_SPLIT_PMD is useful for cases where we need to use regular pages, but would switch back to huge page and huge pmd on. One of such example is uprobe. The following patches use FOLL_SPLIT_PMD in uprobe. Link: http://lkml.kernel.org/r/20190815164525.1848545-4-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 010c164a5f |
mm: move memcmp_pages() and pages_identical()
Patch series "THP aware uprobe", v13. This patchset makes uprobe aware of THPs. Currently, when uprobe is attached to text on THP, the page is split by FOLL_SPLIT. As a result, uprobe eliminates the performance benefit of THP. This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduces FOLL_SPLIT_PMD, which only split PMD for uprobe. After all uprobes within the THP are removed, the PTE-mapped pages are regrouped as huge PMD. This set (plus a few THP patches) is also available at https://github.com/liu-song-6/linux/tree/uprobe-thp This patch (of 6): Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we can use them in other files. Link: http://lkml.kernel.org/r/20190815164525.1848545-2-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <matthew.wilcox@oracle.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | 87eaceb3fa |
mm: thp: make deferred split shrinker memcg aware
Currently THP deferred split shrinker is not memcg aware, this may cause premature OOM with some configuration. For example the below test would run into premature OOM easily: $ cgcreate -g memory:thp $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes $ cgexec -g memory:thp transhuge-stress 4000 transhuge-stress comes from kernel selftest. It is easy to hit OOM, but there are still a lot THP on the deferred split queue, memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware. Convert deferred split shrinker memcg aware by introducing per memcg deferred split queue. The THP should be on either per node or per memcg deferred split queue if it belongs to a memcg. When the page is immigrated to the other memcg, it will be immigrated to the target memcg's deferred split queue too. Reuse the second tail page's deferred_list for per memcg list since the same THP can't be on multiple deferred split queues. [yang.shi@linux.alibaba.com: simplify deferred split queue dereference per Kirill Tkhai] Link: http://lkml.kernel.org/r/1566496227-84952-5-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1565144277-36240-5-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | 0a432dcbeb |
mm: shrinker: make shrinker not depend on memcg kmem
Currently shrinker is just allocated and can work when memcg kmem is enabled. But, THP deferred split shrinker is not slab shrinker, it doesn't make too much sense to have such shrinker depend on memcg kmem. It should be able to reclaim THP even though memcg kmem is disabled. Introduce a new shrinker flag, SHRINKER_NONSLAB, for non-slab shrinker. When memcg kmem is disabled, just such shrinkers can be called in shrinking memcg slab. [yang.shi@linux.alibaba.com: add comment] Link: http://lkml.kernel.org/r/1566496227-84952-4-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1565144277-36240-4-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | 7ae88534cd |
mm: move mem_cgroup_uncharge out of __page_cache_release()
A later patch makes THP deferred split shrinker memcg aware, but it needs page->mem_cgroup information in THP destructor, which is called after mem_cgroup_uncharge() now. So move mem_cgroup_uncharge() from __page_cache_release() to compound page destructor, which is called by both THP and other compound pages except HugeTLB. And call it in __put_single_page() for single order page. Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | 364c1eebe4 |
mm: thp: extract split_queue_* into a struct
Patch series "Make deferred split shrinker memcg aware", v6. Currently THP deferred split shrinker is not memcg aware, this may cause premature OOM with some configuration. For example the below test would run into premature OOM easily: $ cgcreate -g memory:thp $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes $ cgexec -g memory:thp transhuge-stress 4000 transhuge-stress comes from kernel selftest. It is easy to hit OOM, but there are still a lot THP on the deferred split queue, memcg direct reclaim can't touch them since the deferred split shrinker is not memcg aware. Convert deferred split shrinker memcg aware by introducing per memcg deferred split queue. The THP should be on either per node or per memcg deferred split queue if it belongs to a memcg. When the page is immigrated to the other memcg, it will be immigrated to the target memcg's deferred split queue too. Reuse the second tail page's deferred_list for per memcg list since the same THP can't be on multiple deferred split queues. Make deferred split shrinker not depend on memcg kmem since it is not slab. It doesn't make sense to not shrink THP even though memcg kmem is disabled. With the above change the test demonstrated above doesn't trigger OOM even though with cgroup.memory=nokmem. This patch (of 4): Put split_queue, split_queue_lock and split_queue_len into a struct in order to reduce code duplication when we convert deferred_split to memcg aware in the later patches. Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Suggested-by: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 09d91cda0e |
mm,thp: avoid writes to file with THP in pagecache
In previous patch, an application could put part of its text section in THP via madvise(). These THPs will be protected from writes when the application is still running (TXTBSY). However, after the application exits, the file is available for writes. This patch avoids writes to file THP by dropping page cache for the file when the file is open for write. A new counter nr_thps is added to struct address_space. In do_dentry_open(), if the file is open for write and nr_thps is non-zero, we drop page cache for the whole file. Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 99cb0dbd47 |
mm,thp: add read-only THP support for (non-shmem) FS
This patch is (hopefully) the first step to enable THP for non-shmem filesystems. This patch enables an application to put part of its text sections to THP via madvise, for example: madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE); We tried to reuse the logic for THP on tmpfs. Currently, write is not supported for non-shmem THP. khugepaged will only process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is execve(). This requirement limits non-shmem THP to text sections. The next patch will handle writes, which would only happen when the all the vmas with VM_DENYWRITE are unmapped. An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this feature. [songliubraving@fb.com: fix build without CONFIG_SHMEM] Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com [songliubraving@fb.com: fix double unlock in collapse_file()] Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 579c571e2e |
khugepaged: rename collapse_shmem() and khugepaged_scan_shmem()
Next patch will add khugepaged support of non-shmem files. This patch renames these two functions to reflect the new functionality: collapse_shmem() => collapse_file() khugepaged_scan_shmem() => khugepaged_scan_file() Link: http://lkml.kernel.org/r/20190801184244.3169074-6-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 60fbf0ab5d |
mm,thp: stats for file backed THP
In preparation for non-shmem THP, this patch adds a few stats and exposes them in /proc/meminfo, /sys/bus/node/devices/<node>/meminfo, and /proc/<pid>/task/<tid>/smaps. This patch is mostly a rewrite of Kirill A. Shutemov's earlier version: https://lkml.kernel.org/r/20170126115819.58875-5-kirill.shutemov@linux.intel.com/ Link: http://lkml.kernel.org/r/20190801184244.3169074-5-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 520e5ba415 |
filemap: update offset check in filemap_fault()
With THP, current check of offset: VM_BUG_ON_PAGE(page->index != offset, page); is no longer accurate. Update it to: VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page); Link: http://lkml.kernel.org/r/20190801184244.3169074-4-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 31895438e7 |
filemap: check compound_head(page)->mapping in pagecache_get_page()
Similar to previous patch, pagecache_get_page() avoids race condition with truncate by checking page->mapping == mapping. This does not work for compound pages. This patch let it check compound_head(page)->mapping instead. Link: http://lkml.kernel.org/r/20190801184244.3169074-3-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Song Liu | 585e5a7bab |
filemap: check compound_head(page)->mapping in filemap_fault()
Patch series "Enable THP for text section of non-shmem files", v10; This patchset follows up discussion at LSF/MM 2019. The motivation is to put text section of an application in THP, and thus reduces iTLB miss rate and improves performance. Both Facebook and Oracle showed strong interests to this feature. To make reviews easier, this set aims a mininal valid product. Current version of the work does not have any changes to file system specific code. This comes with some limitations (discussed later). This set enables an application to "hugify" its text section by simply running something like: madvise(0x600000, 0x80000, MADV_HUGEPAGE); Before this call, the /proc/<pid>/maps looks like: 00400000-074d0000 r-xp 00000000 00:27 2006927 app After this call, part of the text section is split out and mapped to THP: 00400000-00425000 r-xp 00000000 00:27 2006927 app 00600000-00e00000 r-xp 00200000 00:27 2006927 app <<< on THP 00e00000-074d0000 r-xp 00a00000 00:27 2006927 app Limitations: 1. This only works for text section (vma with VM_DENYWRITE). 2. Original limitation #2 is removed in v3. We gated this feature with an experimental config, READ_ONLY_THP_FOR_FS. Once we get better support on the write path, we can remove the config and enable it by default. Tested cases: 1. Tested with btrfs and ext4. 2. Tested with real work application (memcache like caching service). 3. Tested with "THP aware uprobe": https://patchwork.kernel.org/project/linux-mm/list/?series=131339 This patch (of 7): Currently, filemap_fault() avoids race condition with truncate by checking page->mapping == mapping. This does not work for compound pages. This patch let it check compound_head(page)->mapping instead. Link: http://lkml.kernel.org/r/20190801184244.3169074-2-songliubraving@fb.com Signed-off-by: Song Liu <songliubraving@fb.com> Acked-by: Rik van Riel <riel@surriel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Pingfan Liu | 276f756d70 |
mm/migrate.c: clean up useless code in migrate_vma_collect_pmd()
Remove unused 'pfn' variable. Link: http://lkml.kernel.org/r/1565167272-21453-1-git-send-email-kernelfans@gmail.com Signed-off-by: Pingfan Liu <kernelfans@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Jan Kara <jack@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mike Kravetz | f60858f9d3 |
hugetlbfs: don't retry when pool page allocations start to fail
When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages, the pages will be interleaved between all nodes of the system. If nodes are not equal, it is quite possible for one node to fill up before the others. When this happens, the code still attempts to allocate pages from the full node. This results in calls to direct reclaim and compaction which slow things down considerably. When allocating pool pages, note the state of the previous allocation for each node. If previous allocation failed, do not use the aggressive retry algorithm on successive attempts. The allocation will still succeed if there is memory available, but it will not try as hard to free up memory. Link: http://lkml.kernel.org/r/20190806014744.15446-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | 4943308556 |
mm, compaction: raise compaction priority after it withdrawns
Mike Kravetz reports that "hugetlb allocations could stall for minutes or hours when should_compact_retry() would return true more often then it should. Specifically, this was in the case where compact_result was COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being made." The problem is that the compaction_withdrawn() test in should_compact_retry() includes compaction outcomes that are only possible on low compaction priority, and results in a retry without increasing the priority. This may result in furter reclaim, and more incomplete compaction attempts. With this patch, compaction priority is raised when possible, or should_compact_retry() returns false. The COMPACT_SKIPPED result doesn't really fit together with the other outcomes in compaction_withdrawn(), as that's a result caused by insufficient order-0 pages, not due to low compaction priority. With this patch, it is moved to a new compaction_needs_reclaim() function, and for that outcome we keep the current logic of retrying if it looks like reclaim will be able to help. Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com Reported-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Vlastimil Babka | 5ee04716c4 |
mm, reclaim: cleanup should_continue_reclaim()
After commit "mm, reclaim: make should_continue_reclaim perform dryrun detection", closer look at the function shows, that nr_reclaimed == 0 means the function will always return false. And since non-zero nr_reclaimed implies non_zero nr_scanned, testing nr_scanned serves no purpose, and so does the testing for __GFP_RETRY_MAYFAIL. This patch thus cleans up the function to test only !nr_reclaimed upfront, and remove the __GFP_RETRY_MAYFAIL test and nr_scanned parameter completely. Comment is also updated, explaining that approximating "full LRU list has been scanned" with nr_scanned == 0 didn't really work. Link: http://lkml.kernel.org/r/20190806014744.15446-3-mike.kravetz@oracle.com Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Hillf Danton | 1c6c15971e |
mm, reclaim: make should_continue_reclaim perform dryrun detection
Patch series "address hugetlb page allocation stalls", v2. Allocation of hugetlb pages via sysctl or procfs can stall for minutes or hours. A simple example on a two node system with 8GB of memory is as follows: echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages echo 4096 > /proc/sys/vm/nr_hugepages Obviously, both allocation attempts will fall short of their 8GB goal. However, one or both of these commands may stall and not be interruptible. The issues were initially discussed in mail thread [1] and RFC code at [2]. This series addresses the issues causing the stalls. There are two distinct fixes, a cleanup, and an optimization. The reclaim patch by Hillf and compaction patch by Vlasitmil address corner cases in their respective areas. hugetlb page allocation could stall due to either of these issues. Vlasitmil added a cleanup patch after Hillf's modifications. The hugetlb patch by Mike is an optimization suggested during the debug and development process. [1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com [2] http://lkml.kernel.org/r/20190724175014.9935-1-mike.kravetz@oracle.com This patch (of 4): Address the issue of should_continue_reclaim returning true too often for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned. This was observed during hugetlb page allocation causing stalls for minutes or hours. We can stop reclaiming pages if compaction reports it can make a progress. There might be side-effects for other high-order allocations that would potentially benefit from reclaiming more before compaction so that they would be faster and less likely to stall. However, the consequences of premature/over-reclaim are considered worse. We can also bail out of reclaiming pages if we know that there are not enough inactive lru pages left to satisfy the costly allocation. We can give up reclaiming pages too if we see dryrun occur, with the certainty of plenty of inactive pages. IOW with dryrun detected, we are sure we have reclaimed as many pages as we could. Link: http://lkml.kernel.org/r/20190806014744.15446-2-mike.kravetz@oracle.com Signed-off-by: Hillf Danton <hdanton@sina.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Michal Hocko | 0158115f70 |
memcg, kmem: deprecate kmem.limit_in_bytes
Cgroup v1 memcg controller has exposed a dedicated kmem limit to users which turned out to be really a bad idea because there are paths which cannot shrink the kernel memory usage enough to get below the limit (e.g. because the accounted memory is not reclaimable). There are cases when the failure is even not allowed (e.g. __GFP_NOFAIL). This means that the kmem limit is in excess to the hard limit without any way to shrink and thus completely useless. OOM killer cannot be invoked to handle the situation because that would lead to a premature oom killing. As a result many places might see ENOMEM returning from kmalloc and result in unexpected errors. E.g. a global OOM killer when there is a lot of free memory because ENOMEM is translated into VM_FAULT_OOM in #PF path and therefore pagefault_out_of_memory would result in OOM killer. Please note that the kernel memory is still accounted to the overall limit along with the user memory so removing the kmem specific limit should still allow to contain kernel memory consumption. Unlike the kmem one, though, it invokes memory reclaim and targeted memcg oom killing if necessary. Start the deprecation process by crying to the kernel log. Let's see whether there are relevant usecases and simply return to EINVAL in the second stage if nobody complains in few releases. [akpm@linux-foundation.org: tweak documentation text] Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Thomas Lindroth <thomas.lindroth@gmail.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Qian Cai | 4d0e3230a5 |
mm/memcontrol.c: fix a -Wunused-function warning
mem_cgroup_id_get() was introduced in commit |
|
Michal Hocko | 1eb41bb07e |
mm, oom: consider present pages for the node size
constrained_alloc() calculates the size of the oom domain by using node_spanned_pages which is incorrect because this is the full range of the physical memory range that the numa node occupies rather than the memory that backs that range which is represented by node_present_pages. Sparsely populated nodes (e.g. after memory hot remove or simply sparse due to memory layout) can have really a large difference between the two. This shouldn't really cause any real user observable problems because the oom calculates a ratio against totalpages and used memory cannot exceed present pages but it is confusing and wrong from code point of view. Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yi Wang | f364f06b34 |
mm/oom_kill.c: fix oom_cpuset_eligible() comment
Commit
|
|
Edward Chron | 70cb6d2677 |
mm/oom: add oom_score_adj and pgtables to Killed process message
For an OOM event: print oom_score_adj value for the OOM Killed process to document what the oom score adjust value was at the time the process was OOM Killed. The adjustment value can be set by user code and it affects the resulting oom_score so it is used to influence kill process selection. When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing this value is the only documentation of the value for the process being killed. Having this value on the Killed process message is useful to document if a miscconfiguration occurred or to confirm that the oom_score_adj configuration applies as expected. An example which illustates both misconfiguration and validation that the oom_score_adj was applied as expected is: Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, shmem-rss:0kB pgtables:22kB oom_score_adj:1000 The systemd-udevd is a critical system application that should have an oom_score_adj of -1000. It was miconfigured to have a adjustment of 1000 making it a highly favored OOM kill target process. The output documents both the misconfiguration and the fact that the process was correctly targeted by OOM due to the miconfiguration. This can be quite helpful for triage and problem determination. The addition of the pgtables_bytes shows page table usage by the process and is a useful measure of the memory size of the process. Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com Signed-off-by: Edward Chron <echron@arista.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Tetsuo Handa | f9c645621a |
memcg, oom: don't require __GFP_FS when invoking memcg OOM killer
Masoud Sharbiani noticed that commit |
|
Joel Savitz | 8ac3f8fe91 |
mm/oom_kill.c: add task UID to info message on an oom kill
In the event of an oom kill, useful information about the killed process is printed to dmesg. Users, especially system administrators, will find it useful to immediately see the UID of the process. We already print uid when dumping eligible tasks so it is not overly hard to find that information in the oom report. However this information is unavailable when dumping of eligible tasks is disabled. In the following example, abuse_the_ram is the name of a program that attempts to iteratively allocate all available memory until it is stopped by force. Current message: Out of memory: Killed process 35389 (abuse_the_ram) total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB, shmem-rss:0kB Patched message: Out of memory: Killed process 2739 (abuse_the_ram), total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB, shmem-rss:0kB, UID:0 [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk] Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com Signed-off-by: Joel Savitz <jsavitz@redhat.com> Suggested-by: David Rientjes <rientjes@google.com> Acked-by: Rafael Aquini <aquini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Kefeng Wang | 4406548ee3 |
mm/mempolicy.c: remove unnecessary nodemask check in kernel_migrate_pages()
1) task_nodes = cpuset_mems_allowed(current); -> cpuset_mems_allowed() guaranteed to return some non-empty subset of node_states[N_MEMORY]. 2) nodes_and(*new, *new, task_nodes); -> after nodes_and(), the 'new' should be empty or appropriate nodemask(online node and with memory). After 1) and 2), we could remove unnecessary check whether the 'new' AND node_states[N_MEMORY] is empty. Link: http://lkml.kernel.org/r/20190806023634.55356-1-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Pengfei Li | 32aaf0553d |
mm/compaction.c: remove unnecessary zone parameter in isolate_migratepages()
Like commit
|
|
Yafang Shao | a94b525241 |
mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone
total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and
COMPACTFREE_SCANNED in compact_zone(). We should clear them before
scanning a new zone. In the proc triggered compaction, we forgot clearing
them.
[laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
[akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
[vbabka@suse.cz: squash compact_zone() list_head init as well]
Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
[akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
Fixes:
|
|
Vitaly Wool | 63398413c0 |
z3fold: fix memory leak in kmem cache
Currently there is a leak in init_z3fold_page() -- it allocates handles from kmem cache even for headless pages, but then they are never used and never freed, so eventually kmem cache may get exhausted. This patch provides a fix for that. Link: http://lkml.kernel.org/r/20190917185352.44cf285d3ebd9e64548de5de@gmail.com Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Reported-by: Markus Linnala <markus.linnala@gmail.com> Tested-by: Markus Linnala <markus.linnala@gmail.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Qian Cai | b57a775f51 |
mm: silence -Woverride-init/initializer-overrides
When compiling a kernel with W=1, there are several of those warnings due to arm64 overriding a field on purpose. Just disable those warnings for both GCC and Clang of this file, so it will help dig "gems" hidden in the W=1 warnings by reducing some noises. mm/init-mm.c:39:2: warning: initializer overrides prior initialization of this subobject [-Winitializer-overrides] INIT_MM_CONTEXT(init_mm) ^~~~~~~~~~~~~~~~~~~~~~~~ ./arch/arm64/include/asm/mmu.h:133:9: note: expanded from macro 'INIT_MM_CONTEXT' .pgd = init_pg_dir, ^~~~~~~~~~~ mm/init-mm.c:30:10: note: previous initialization is here .pgd = swapper_pg_dir, ^~~~~~~~~~~~~~ Note: there is a side project trying to support explicitly allowing specific initializer overrides in Clang, but there is no guarantee it will happen or not. https://github.com/ClangBuiltLinux/linux/issues/639 Link: http://lkml.kernel.org/r/1566920867-27453-1-git-send-email-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mike Rapoport | 2286bf4e4d |
mm: use CPU_BITS_NONE to initialize init_mm.cpu_bitmask
Replace open-coded bitmap array initialization of init_mm.cpu_bitmask with neat CPU_BITS_NONE macro. And, since init_mm.cpu_bitmask is statically set to zero, there is no way to clear it again in start_kernel(). Link: http://lkml.kernel.org/r/1565703815-8584-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Austin Kim | 7ea362427c |
mm/vmalloc.c: move 'area->pages' after if statement
If !area->pages statement is true where memory allocation fails, area is freed. In this case 'area->pages = pages' should not executed. So move 'area->pages = pages' after if statement. [akpm@linux-foundation.org: give area->pages the same treatment] Link: http://lkml.kernel.org/r/20190830035716.GA190684@LGEARND20B15 Signed-off-by: Austin Kim <austindh.kim@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Roman Penyaev <rpenyaev@suse.de> Cc: Rick Edgecombe <rick.p.edgecombe@intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Pengfei Li | 688fcbfc06 |
mm/vmalloc: modify struct vmap_area to reduce its size
Objective --------- The current implementation of struct vmap_area wasted space. After applying this commit, sizeof(struct vmap_area) has been reduced from 11 words to 8 words. Description ----------- 1) Pack "subtree_max_size", "vm" and "purge_list". This is no problem because A) "subtree_max_size" is only used when vmap_area is in "free" tree B) "vm" is only used when vmap_area is in "busy" tree C) "purge_list" is only used when vmap_area is in vmap_purge_list 2) Eliminate "flags". ;Since only one flag VM_VM_AREA is being used, and the same thing can be done by judging whether "vm" is NULL, then the "flags" can be eliminated. Link: http://lkml.kernel.org/r/20190716152656.12255-3-lpf.vector@gmail.com Signed-off-by: Pengfei Li <lpf.vector@gmail.com> Suggested-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Roman Gushchin <guro@fb.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Uladzislau Rezki (Sony) | dd3b8353ba |
mm/vmalloc: do not keep unpurged areas in the busy tree
The busy tree can be quite big, even though the area is freed or unmapped it still stays there until "purge" logic removes it. 1) Optimize and reduce the size of "busy" tree by removing a node from it right away as soon as user triggers free paths. It is possible to do so, because the allocation is done using another augmented tree. The vmalloc test driver shows the difference, for example the "fix_size_alloc_test" is ~11% better comparing with default configuration: sudo ./test_vmalloc.sh performance <default> Summary: fix_size_alloc_test loops: 1000000 avg: 993985 usec Summary: full_fit_alloc_test loops: 1000000 avg: 973554 usec Summary: long_busy_list_alloc_test loops: 1000000 avg: 12617652 usec <default> <this patch> Summary: fix_size_alloc_test loops: 1000000 avg: 882263 usec Summary: full_fit_alloc_test loops: 1000000 avg: 973407 usec Summary: long_busy_list_alloc_test loops: 1000000 avg: 12593929 usec <this patch> 2) Since the busy tree now contains allocated areas only and does not interfere with lazily free nodes, introduce the new function show_purge_info() that dumps "unpurged" areas that is propagated through "/proc/vmallocinfo". 3) Eliminate VM_LAZY_FREE flag. Link: http://lkml.kernel.org/r/20190716152656.12255-2-lpf.vector@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Pengfei Li <lpf.vector@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Alastair D'Silva | 5ed867037e |
mm/sparse.c: remove NULL check in clear_hwpoisoned_pages()
There is no possibility for memmap to be NULL in the current codebase. This check was added in commit |
|
Alastair D'Silva | 9f82883c6d |
mm/sparse.c: don't manually decrement num_poisoned_pages
Use the function written to do it instead. Link: http://lkml.kernel.org/r/20190827053656.32191-2-alastair@au1.ibm.com Signed-off-by: Alastair D'Silva <alastair@d-silva.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Wei Yang <richardw.yang@linux.intel.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | c1cbc3eebf |
mm/sparse.c: use __nr_to_section(section_nr) to get mem_section
__pfn_to_section is defined as __nr_to_section(pfn_to_section_nr(pfn)). Since we already get section_nr, it is not necessary to get mem_section from start_pfn. By doing so, we reduce one redundant operation. Link: http://lkml.kernel.org/r/20190809010242.29797-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Tested-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Lecopzer Chen | db57e98d87 |
mm/sparse.c: fix ALIGN() without power of 2 in sparse_buffer_alloc()
The size argument passed into sparse_buffer_alloc() has already been aligned with PAGE_SIZE or PMD_SIZE. If the size after aligned is not power of 2 (e.g. 0x480000), the PTR_ALIGN() will return wrong value. Use roundup to round sparsemap_buf up to next multiple of size. Link: http://lkml.kernel.org/r/20190705114826.28586-1-lecopzer.chen@mediatek.com Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com> Signed-off-by: Mark-PK Tsai <Mark-PK.Tsai@mediatek.com> Cc: YJ Chiang <yj.chiang@mediatek.com> Cc: Lecopzer Chen <lecopzer.chen@mediatek.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |