mirror of https://gitee.com/openkylin/linux.git
1630 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
David S. Miller | 150f29f5e6 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-09-01 The following pull-request contains BPF updates for your *net-next* tree. There are two small conflicts when pulling, resolve as follows: 1) Merge conflict in tools/lib/bpf/libbpf.c between |
|
Alexei Starovoitov | 76cd61739f |
mm/error_inject: Fix allow_error_inject function signatures.
'static' and 'static noinline' function attributes make no guarantees that gcc/clang won't optimize them. The compiler may decide to inline 'static' function and in such case ALLOW_ERROR_INJECT becomes meaningless. The compiler could have inlined __add_to_page_cache_locked() in one callsite and didn't inline in another. In such case injecting errors into it would cause unpredictable behavior. It's worse with 'static noinline' which won't be inlined, but it still can be optimized. Like the compiler may decide to remove one argument or constant propagate the value depending on the callsite. To avoid such issues make sure that these functions are global noinline. Fixes: |
|
Charan Teja Reddy | 88e8ac11d2 |
mm, page_alloc: fix core hung in free_pcppages_bulk()
The following race is observed with the repeated online, offline and a
delay between two successive online of memory blocks of movable zone.
P1 P2
Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values,i.e., pcp->high = 0 &
pcp->batch = 1.
Allocate the pages from the
movable zone.
Try to Online the second memory
block in the movable zone thus it
entered the online_pages() but yet
to call zone_pcp_update().
This process is entered into
the exit path thus it tries
to release the order-0 pages
to pcp lists through
free_unref_page_commit().
As pcp->high = 0, pcp->count = 1
proceed to call the function
free_pcppages_bulk().
Update the pcp values thus the
new pcp values are like, say,
pcp->high = 378, pcp->batch = 63.
Read the pcp's batch value using
READ_ONCE() and pass the same to
free_pcppages_bulk(), pcp values
passed here are, batch = 63,
count = 1.
Since num of pages in the pcp
lists are less than ->batch,
then it will stuck in
while(list_empty(list)) loop
with interrupts disabled thus
a core hung.
Avoid this by ensuring free_pcppages_bulk() is called with proper count of
pcp list pages.
The mentioned race is some what easily reproducible without [1] because
pcp's are not updated for the first memory block online and thus there is
a enough race window for P2 between alloc+free and pcp struct values
update through onlining of second memory block.
With [1], the race still exists but it is very narrow as we update the pcp
struct values for the first memory block online itself.
This is not limited to the movable zone, it could also happen in cases
with the normal zone (e.g., hotplug to a node that only has DMA memory, or
no other memory yet).
[1]: https://patchwork.kernel.org/patch/11696389/
Fixes:
|
|
Doug Berger | e08d3fdfe2 |
mm: include CMA pages in lowmem_reserve at boot
The lowmem_reserve arrays provide a means of applying pressure against
allocations from lower zones that were targeted at higher zones. Its
values are a function of the number of pages managed by higher zones and
are assigned by a call to the setup_per_zone_lowmem_reserve() function.
The function is initially called at boot time by the function
init_per_zone_wmark_min() and may be called later by accesses of the
/proc/sys/vm/lowmem_reserve_ratio sysctl file.
The function init_per_zone_wmark_min() was moved up from a module_init to
a core_initcall to resolve a sequencing issue with khugepaged.
Unfortunately this created a sequencing issue with CMA page accounting.
The CMA pages are added to the managed page count of a zone when
cma_init_reserved_areas() is called at boot also as a core_initcall. This
makes it uncertain whether the CMA pages will be added to the managed page
counts of their zones before or after the call to
init_per_zone_wmark_min() as it becomes dependent on link order. With the
current link order the pages are added to the managed count after the
lowmem_reserve arrays are initialized at boot.
This means the lowmem_reserve values at boot may be lower than the values
used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
ratio values are unchanged.
In many cases the difference is not significant, but for example
an ARM platform with 1GB of memory and the following memory layout
cma: Reserved 256 MiB at 0x0000000030000000
Zone ranges:
DMA [mem 0x0000000000000000-0x000000002fffffff]
Normal empty
HighMem [mem 0x0000000030000000-0x000000003fffffff]
would result in 0 lowmem_reserve for the DMA zone. This would allow
userspace to deplete the DMA zone easily.
Funnily enough
$ cat /proc/sys/vm/lowmem_reserve_ratio
would fix up the situation because as a side effect it forces
setup_per_zone_lowmem_reserve.
This commit breaks the link order dependency by invoking
init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
have the chance to be properly accounted in their zone(s) and allowing
the lowmem_reserve arrays to receive consistent values.
Fixes:
|
|
Matthew Wilcox (Oracle) | 1378a5ee45 |
mm: store compound_nr as well as compound_order
Patch series "THP prep patches". These are some generic cleanups and improvements, which I would like merged into mmotm soon. The first one should be a performance improvement for all users of compound pages, and the others are aimed at getting code to compile away when CONFIG_TRANSPARENT_HUGEPAGE is disabled (ie small systems). Also better documented / less confusing than the current prefix mixture of compound, hpage and thp. This patch (of 7): This removes a few instructions from functions which need to know how many pages are in a compound page. The storage used is either page->mapping on 64-bit or page->index on 32-bit. Both of these are fine to overlay on tail pages. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-1-willy@infradead.org Link: http://lkml.kernel.org/r/20200629151959.15779-2-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 8b94e0b8be |
mm/page_alloc: remove a wrapper for alloc_migration_target()
There is a well-defined standard migration target callback. Use it directly. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Hellwig <hch@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Roman Gushchin <guro@fb.com> Link: http://lkml.kernel.org/r/1594622517-20681-8-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Randy Dunlap | 047b9967d5 |
mm/page_alloc.c: delete or fix duplicated words
Drop the repeated word "them" and "that". Change "the the" to "to the". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 8510e69c8e |
mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
Currently, memalloc_nocma_{save/restore} API that prevents CMA area
in page allocation is implemented by using current_gfp_context(). However,
there are two problems of this implementation.
First, this doesn't work for allocation fastpath. In the fastpath,
original gfp_mask is used since current_gfp_context() is introduced in
order to control reclaim and it is on slowpath. So, CMA area can be
allocated through the allocation fastpath even if
memalloc_nocma_{save/restore} APIs are used. Currently, there is just
one user for these APIs and it has a fallback method to prevent actual
problem.
Second, clearing __GFP_MOVABLE in current_gfp_context() has a side effect
to exclude the memory on the ZONE_MOVABLE for allocation target.
To fix these problems, this patch changes the implementation to exclude
CMA area in page allocation. Main point of this change is using the
alloc_flags. alloc_flags is mainly used to control allocation so it fits
for excluding CMA area in allocation.
Fixes:
|
|
Muchun Song | 182f3d7a02 |
mm/page_alloc.c: skip setting nodemask when we are in interrupt
When we are in the interrupt context, it is irrelevant to the current task context. If we use current task's mems_allowed, we can be fair to alloc pages in the fast path and fall back to slow path memory allocation when the current node(which is the current task mems_allowed) does not have enough memory to allocate. In this case, it slows down the memory allocation speed of interrupt context. So we can skip setting the nodemask to allow any node to allocate memory, so that fast path allocation can success. Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pekka Enberg <penberg@kernel.org> Cc: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200706025921.53683-1-songmuchun@bytedance.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | da41566399 |
mm/page_alloc: fallbacks at most has 3 elements
MIGRAGE_TYPES is used to be the mark of end and there are at most 3 elements for the one dimension array. Reduce to 3 to save little memory. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200625231022.18784-1-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Qian Cai | 9e15afa5a8 |
mm/page_alloc: silence a KASAN false positive
kernel_init_free_pages() will use memset() on s390 to clear all pages from
kmalloc_order() which will override KASAN redzones because a redzone was
setup from the end of the allocation size to the end of the last page.
Silence it by not reporting it there. An example of the report is,
BUG: KASAN: slab-out-of-bounds in __free_pages_ok
Write of size 4096 at addr 000000014beaa000
Call Trace:
show_stack+0x152/0x210
dump_stack+0x1f8/0x248
print_address_description.isra.13+0x5e/0x4d0
kasan_report+0x130/0x178
check_memory_region+0x190/0x218
memset+0x34/0x60
__free_pages_ok+0x894/0x12f0
kfree+0x4f2/0x5e0
unpack_to_rootfs+0x60e/0x650
populate_rootfs+0x56/0x358
do_one_initcall+0x1f4/0xa20
kernel_init_freeable+0x758/0x7e8
kernel_init+0x1c/0x170
ret_from_fork+0x24/0x28
Memory state around the buggy address:
000000014bea9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
000000014bea9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>000000014beaa000: 03 fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
^
000000014beaa080: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
000000014beaa100: fe fe fe fe fe fe fe fe fe fe fe fe fe fe
Fixes:
|
|
Wei Yang | 535b81e209 |
mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
After previous cleanup, the end_bitidx is not necessary any more. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20200623124201.8199-4-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | d93d5ab9ca |
mm/page_alloc.c: simplify pageblock bitmap access
Due to commit
|
|
Wei Yang | 399b795b7a |
mm/page_alloc.c: extract the common part in pfn_to_bitidx()
The return value calculation is the same both for SPARSEMEM or not. Just take it out. Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Link: http://lkml.kernel.org/r/20200623124201.8199-2-richard.weiyang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
David Hildenbrand | 56b9413bcb |
mm/page_alloc: remove nr_free_pagecache_pages()
nr_free_pagecache_pages() isn't used outside page_alloc.c anymore - and the name does not really help to understand what's going on. Let's open-code it instead and add a comment. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Huang Ying <ying.huang@intel.com> Link: http://lkml.kernel.org/r/20200619132410.23859-3-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
David Hildenbrand | 0a18e60788 |
mm: remove vm_total_pages
The global variable "vm_total_pages" is a relic from older days. There is only a single user that reads the variable - build_all_zonelists() - and the first thing it does is update it. Use a local variable in build_all_zonelists() instead and remove the global variable. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/20200619132410.23859-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Charan Teja Reddy | f80b08fc44 |
mm, page_alloc: skip ->waternark_boost for atomic order-0 allocations
When boosting is enabled, it is observed that rate of atomic order-0 allocation failures are high due to the fact that free levels in the system are checked with ->watermark_boost offset. This is not a problem for sleepable allocations but for atomic allocations which looks like regression. This problem is seen frequently on system setup of Android kernel running on Snapdragon hardware with 4GB RAM size. When no extfrag event occurred in the system, ->watermark_boost factor is zero, thus the watermark configurations in the system are: _watermark = ( [WMARK_MIN] = 1272, --> ~5MB [WMARK_LOW] = 9067, --> ~36MB [WMARK_HIGH] = 9385), --> ~38MB watermark_boost = 0 After launching some memory hungry applications in Android which can cause extfrag events in the system to an extent that ->watermark_boost can be set to max i.e. default boost factor makes it to 150% of high watermark. _watermark = ( [WMARK_MIN] = 1272, --> ~5MB [WMARK_LOW] = 9067, --> ~36MB [WMARK_HIGH] = 9385), --> ~38MB watermark_boost = 14077, -->~57MB With default system configuration, for an atomic order-0 allocation to succeed, having free memory of ~2MB will suffice. But boosting makes the min_wmark to ~61MB thus for an atomic order-0 allocation to be successful system should have minimum of ~23MB of free memory(from calculations of zone_watermark_ok(), min = 3/4(min/2)). But failures are observed despite system is having ~20MB of free memory. In the testing, this is reproducible as early as first 300secs since boot and with furtherlowram configurations(<2GB) it is observed as early as first 150secs since boot. These failures can be avoided by excluding the ->watermark_boost in watermark caluculations for atomic order-0 allocations. [akpm@linux-foundation.org: fix comment grammar, reflow comment] [charante@codeaurora.org: fix suggested by Mel Gorman] Link: http://lkml.kernel.org/r/31556793-57b1-1c21-1a9d-22674d9bd938@codeaurora.org Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Mel Gorman <mgorman@techsingularity.net> Link: http://lkml.kernel.org/r/1589882284-21010-1-git-send-email-charante@codeaurora.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Jaewon Kim | f27ce0e140 |
page_alloc: consider highatomic reserve in watermark fast
zone_watermark_fast was introduced by commit
|
|
Vlastimil Babka | deba04872b |
mm, page_alloc: use unlikely() in task_capc()
Hugh noted that task_capc() could use unlikely(), as most of the time there is no capture in progress and we are in page freeing hot path. Indeed adding unlikely() produces assembly that better matches the assumption and moves all the tests away from the hot path. I have also noticed that we don't need to test for cc->direct_compaction as the only place we set current->task_capture is compact_zone_order() which also always sets cc->direct_compaction true. Suggested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Hugh Dickins <hughd@googlecom> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Li Wang <liwang@redhat.com> Link: http://lkml.kernel.org/r/4a24f7af-3aa5-6e80-4ae6-8f253b562039@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mike Rapoport | c89ab04feb |
mm/sparse: cleanup the code surrounding memory_present()
After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent functions that call memory_present() for each region in memblock.memory: sparse_memory_present_with_active_regions() and membocks_present(). Moreover, all architectures have a call to either of these functions preceding the call to sparse_init() and in the most cases they are called one after the other. Mark the regions from memblock.memory as present during sparce_init() by making sparse_init() call memblocks_present(), make memblocks_present() and memory_present() functions static and remove redundant sparse_memory_present_with_active_regions() function. Also remove no longer required HAVE_MEMORY_PRESENT configuration option. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Shakeel Butt | 991e767385 |
mm: memcontrol: account kernel stack per node
Currently the kernel stack is being accounted per-zone. There is no need to do that. In addition due to being per-zone, memcg has to keep a separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of node_stat_item. In addition localize the kernel stack stats updates to account_kernel_stack(). Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Roman Gushchin | d42f3245c7 |
mm: memcg: convert vmstat slab counters to bytes
In order to prepare for per-object slab memory accounting, convert NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes. To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB). Internally global and per-node counters are stored in pages, however memcg and lruvec counters are stored in bytes. This scheme may look weird, but only for now. As soon as slab pages will be shared between multiple cgroups, global and node counters will reflect the total number of slab pages. However memcg and lruvec counters will be used for per-memcg slab memory tracking, which will take separate kernel objects in the account. Keeping global and node counters in pages helps to avoid additional overhead. The size of slab memory shouldn't exceed 4Gb on 32-bit machines, so it will fit into atomic_long_t we use for vmstats. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 99ea1521a0 |
Remove uninitialized_var() macro for v5.9-rc1
- Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var() -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+ drhmmImUr1YylrtVOw== =xHmk -----END PGP SIGNATURE----- Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var() |
|
Kees Cook | 3f649ab728 |
treewide: Remove uninitialized_var() usage
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org> |
|
Joel Savitz | 8beeae86b8 |
mm/page_alloc: fix documentation error
When I increased the upper bound of the min_free_kbytes value in
|
|
Linus Torvalds | 09102704c6 |
virtio: features, fixes
virtio-mem doorbell mapping for vdpa config interrupt support in ifc fixes all over the place Signed-off-by: Michael S. Tsirkin <mst@redhat.com> -----BEGIN PGP SIGNATURE----- iQFDBAABCAAtFiEEXQn9CHHI+FuUyooNKB8NuNKNVGkFAl7fZ6APHG1zdEByZWRo YXQuY29tAAoJECgfDbjSjVRpkDoIAMcBcQx5su1iuX7vT35xzUWZO478eAf1jOMZ 7KxKUVBeztkcxVFUlRVRu9MR6wOzwHils+1HD6025775Smr5M6x3aJxR6xOORaBj RoU6OVGkpDvbzsxlhW+xhONz4O7/RkveKJPCwzGjqHrsFeh92lkfTqroz/EuNpw+ LZsO0+DhdUf123HbwHQp5lxW8EjyrRabgeZZg/D9VLPhoCP88vCjRhBXU2GPuaUl /UNXsQafn4xUgrxPaoN5f4Phn/P46NNrbZ1jmlkw/z/3QhF/DhktGXGaZsIHDCN/ vicUii0or5QLeBsZpMbKko/BIe2xWHxFjkMRhMOMZOfcBb6sMBI= =auUa -----END PGP SIGNATURE----- Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost Pull virtio updates from Michael Tsirkin: - virtio-mem: paravirtualized memory hotplug - support doorbell mapping for vdpa - config interrupt support in ifc - fixes all over the place * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (40 commits) vhost/test: fix up after API change virtio_mem: convert device block size into 64bit virtio-mem: drop unnecessary initialization ifcvf: implement config interrupt in IFCVF vhost: replace -1 with VHOST_FILE_UNBIND in ioctls vhost_vdpa: Support config interrupt in vdpa ifcvf: ignore continuous setting same status value virtio-mem: Don't rely on implicit compiler padding for requests virtio-mem: Try to unplug the complete online memory block first virtio-mem: Use -ETXTBSY as error code if the device is busy virtio-mem: Unplug subblocks right-to-left virtio-mem: Drop manual check for already present memory virtio-mem: Add parent resource for all added "System RAM" virtio-mem: Better retry handling virtio-mem: Offline and remove completely unplugged memory blocks mm/memory_hotplug: Introduce offline_and_remove_memory() virtio-mem: Allow to offline partially unplugged memory blocks mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE virtio-mem: Paravirtualized memory hotunplug part 2 virtio-mem: Paravirtualized memory hotunplug part 1 ... |
|
Vlastimil Babka | 0a477e1ae2 |
kernel/sysctl: support handling command line aliases
We can now handle sysctl parameters on kernel command line, but historically some parameters introduced their own command line equivalent, which we don't want to remove for compatibility reasons. We can, however, convert them to the generic infrastructure with a table translating the legacy command line parameters to their sysctl names, and removing the one-off param handlers. This patch adds the support and makes the first conversion to demonstrate it, on the (deprecated) numa_zonelist_order parameter. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Luis Chamberlain <mcgrof@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: David Rientjes <rientjes@google.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200427180433.7029-3-vbabka@suse.cz Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
David Hildenbrand | aa218795cb |
mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE
virtio-mem wants to allow to offline memory blocks of which some parts were unplugged (allocated via alloc_contig_range()), especially, to later offline and remove completely unplugged memory blocks. The important part is that PageOffline() has to remain set until the section is offline, so these pages will never get accessed (e.g., when dumping). The pages should not be handed back to the buddy (which would require clearing PageOffline() and result in issues if offlining fails and the pages are suddenly in the buddy). Let's allow to do that by allowing to isolate any PageOffline() page when offlining. This way, we can reach the memory hotplug notifier MEM_GOING_OFFLINE, where the driver can signal that he is fine with offlining this page by dropping its reference count. PageOffline() pages with a reference count of 0 can then be skipped when offlining the pages (like if they were free, however they are not in the buddy). Anybody who uses PageOffline() pages and does not agree to offline them (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not decrement the reference count and make offlining fail when trying to migrate such an unmovable page. So there should be no observable change. Same applies to balloon compaction users (movable PageOffline() pages), the pages will simply be migrated. Note 1: If offlining fails, a driver has to increment the reference count again in MEM_CANCEL_OFFLINE. Note 2: A driver that makes use of this has to be aware that re-onlining the memory block has to be handled by hooking into onlining code (online_page_callback_t), resetting the page PageOffline() and not giving them to the buddy. Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Qian Cai <cai@lca.pw> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-7-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> |
|
David Hildenbrand | 255f598507 |
virtio-mem: Paravirtualized memory hotunplug part 2
We also want to unplug online memory (contained in online memory blocks and, therefore, managed by the buddy), and eventually replug it later. When requested to unplug memory, we use alloc_contig_range() to allocate subblocks in online memory blocks (so we are the owner) and send them to our hypervisor. When requested to plug memory, we can replug such memory using free_contig_range() after asking our hypervisor. We also want to mark all allocated pages PG_offline, so nobody will touch them. To differentiate pages that were never onlined when onlining the memory block from pages allocated via alloc_contig_range(), we use PageDirty(). Based on this flag, virtio_mem_fake_online() can either online the pages for the first time or use free_contig_range(). It is worth noting that there are no guarantees on how much memory can actually get unplugged again. All device memory might completely be fragmented with unmovable data, such that no subblock can get unplugged. We are not touching the ZONE_MOVABLE. If memory is onlined to the ZONE_MOVABLE, it can only get unplugged after that memory was offlined manually by user space. In normal operation, virtio-mem memory is suggested to be onlined to ZONE_NORMAL. In the future, we will try to make unplug more likely to succeed. Add a module parameter to control if online memory shall be touched. As we want to access alloc_contig_range()/free_contig_range() from kernel module context, export the symbols. Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages are on the same node, in the same zone, and contain no holes. Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> |
|
Linus Torvalds | ee01c4d72a |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "More mm/ work, plenty more to come Subsystems affected by this patch series: slub, memcg, gup, kasan, pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs, thp, mmap, kconfig" * akpm: (131 commits) arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined riscv: support DEBUG_WX mm: add DEBUG_WX support drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid() powerpc/mm: drop platform defined pmd_mknotpresent() mm: thp: don't need to drain lru cache when splitting and mlocking THP hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs sparc32: register memory occupied by kernel as memblock.memory include/linux/memblock.h: fix minor typo and unclear comment mm, mempolicy: fix up gup usage in lookup_node tools/vm/page_owner_sort.c: filter out unneeded line mm: swap: memcg: fix memcg stats for huge pages mm: swap: fix vmstats for huge pages mm: vmscan: limit the range of LRU type balancing mm: vmscan: reclaim writepage is IO cost mm: vmscan: determine anon/file pressure balance at the reclaim root mm: balance LRU lists based on relative thrashing mm: only count actual rotations as LRU reclaim cost ... |
|
Maninder Singh | 730ec8c01a |
mm/vmscan.c: change prototype for shrink_page_list
commit
|
|
Chen Tao | 633bf2fe8d |
mm/page_alloc.c: add missing newline
Add missing line breaks on pr_warn(). Signed-off-by: Chen Tao <chentao107@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200603063547.235825-1-chentao107@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Daniel Jordan | ecd0965069 |
mm: make deferred init's max threads arch-specific
Using padata during deferred init has only been tested on x86, so for now limit it to this architecture. If another arch wants this, it can find the max thread limit that's best for it and override deferred_page_init_max_threads(). Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Josh Triplett <josh@joshtriplett.org> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Robert Elliott <elliott@hpe.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200527173608.2885243-8-daniel.m.jordan@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Daniel Jordan | e44431498f |
mm: parallelize deferred_init_memmap()
Deferred struct page init is a significant bottleneck in kernel boot. Optimizing it maximizes availability for large-memory systems and allows spinning up short-lived VMs as needed without having to leave them running. It also benefits bare metal machines hosting VMs that are sensitive to downtime. In projects such as VMM Fast Restart[1], where guest state is preserved across kexec reboot, it helps prevent application and network timeouts in the guests. Multithread to take full advantage of system memory bandwidth. The maximum number of threads is capped at the number of CPUs on the node because speedups always improve with additional threads on every system tested, and at this phase of boot, the system is otherwise idle and waiting on page init to finish. Helper threads operate on section-aligned ranges to both avoid false sharing when setting the pageblock's migrate type and to avoid accessing uninitialized buddy pages, though max order alignment is enough for the latter. The minimum chunk size is also a section. There was benefit to using multiple threads even on relatively small memory (1G) systems, and this is the smallest size that the alignment allows. The time (milliseconds) is the slowest node to initialize since boot blocks until all nodes finish. intel_pstate is loaded in active mode without hwp and with turbo enabled, and intel_idle is active as well. Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal) 2 nodes * 26 cores * 2 threads = 104 CPUs 384G/node = 768G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 4089.7 ( 8.1) -- 1785.7 ( 7.6) 2% ( 1) 1.7% 4019.3 ( 1.5) 3.8% 1717.7 ( 11.8) 12% ( 6) 34.9% 2662.7 ( 2.9) 79.9% 359.3 ( 0.6) 25% ( 13) 39.9% 2459.0 ( 3.6) 91.2% 157.0 ( 0.0) 37% ( 19) 39.2% 2485.0 ( 29.7) 90.4% 172.0 ( 28.6) 50% ( 26) 39.3% 2482.7 ( 25.7) 90.3% 173.7 ( 30.0) 75% ( 39) 39.0% 2495.7 ( 5.5) 89.4% 190.0 ( 1.0) 100% ( 52) 40.2% 2443.7 ( 3.8) 92.3% 138.0 ( 1.0) Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest) 1 node * 16 cores * 2 threads = 32 CPUs 192G/node = 192G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 1988.7 ( 9.6) -- 1096.0 ( 11.5) 3% ( 1) 1.1% 1967.0 ( 17.6) 0.3% 1092.7 ( 11.0) 12% ( 4) 41.1% 1170.3 ( 14.2) 73.8% 287.0 ( 3.6) 25% ( 8) 47.1% 1052.7 ( 21.9) 83.9% 177.0 ( 13.5) 38% ( 12) 48.9% 1016.3 ( 12.1) 86.8% 144.7 ( 1.5) 50% ( 16) 48.9% 1015.7 ( 8.1) 87.8% 134.0 ( 4.4) 75% ( 24) 49.1% 1012.3 ( 3.1) 88.1% 130.3 ( 2.3) 100% ( 32) 49.5% 1004.0 ( 5.3) 88.5% 125.7 ( 2.1) Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal) 2 nodes * 18 cores * 2 threads = 72 CPUs 128G/node = 256G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 1680.0 ( 4.6) -- 627.0 ( 4.0) 3% ( 1) 0.3% 1675.7 ( 4.5) -0.2% 628.0 ( 3.6) 11% ( 4) 25.6% 1250.7 ( 2.1) 67.9% 201.0 ( 0.0) 25% ( 9) 30.7% 1164.0 ( 17.3) 81.8% 114.3 ( 17.7) 36% ( 13) 31.4% 1152.7 ( 10.8) 84.0% 100.3 ( 17.9) 50% ( 18) 31.5% 1150.7 ( 9.3) 83.9% 101.0 ( 14.1) 75% ( 27) 31.7% 1148.0 ( 5.6) 84.5% 97.3 ( 6.4) 100% ( 36) 32.0% 1142.3 ( 4.0) 85.6% 90.0 ( 1.0) AMD EPYC 7551 32-Core Processor (Zen, kvm guest) 1 node * 8 cores * 2 threads = 16 CPUs 64G/node = 64G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 1029.3 ( 25.1) -- 240.7 ( 1.5) 6% ( 1) -0.6% 1036.0 ( 7.8) -2.2% 246.0 ( 0.0) 12% ( 2) 11.8% 907.7 ( 8.6) 44.7% 133.0 ( 1.0) 25% ( 4) 13.9% 886.0 ( 10.6) 62.6% 90.0 ( 6.0) 38% ( 6) 17.8% 845.7 ( 14.2) 69.1% 74.3 ( 3.8) 50% ( 8) 16.8% 856.0 ( 22.1) 72.9% 65.3 ( 5.7) 75% ( 12) 15.4% 871.0 ( 29.2) 79.8% 48.7 ( 7.4) 100% ( 16) 21.0% 813.7 ( 21.0) 80.5% 47.0 ( 5.2) Server-oriented distros that enable deferred page init sometimes run in small VMs, and they still benefit even though the fraction of boot time saved is smaller: AMD EPYC 7551 32-Core Processor (Zen, kvm guest) 1 node * 2 cores * 2 threads = 4 CPUs 16G/node = 16G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 716.0 ( 14.0) -- 49.7 ( 0.6) 25% ( 1) 1.8% 703.0 ( 5.3) -4.0% 51.7 ( 0.6) 50% ( 2) 1.6% 704.7 ( 1.2) 43.0% 28.3 ( 0.6) 75% ( 3) 2.7% 696.7 ( 13.1) 49.7% 25.0 ( 0.0) 100% ( 4) 4.1% 687.0 ( 10.4) 55.7% 22.0 ( 0.0) Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest) 1 node * 2 cores * 2 threads = 4 CPUs 14G/node = 14G memory kernel boot deferred init ------------------------ ------------------------ node% (thr) speedup time_ms (stdev) speedup time_ms (stdev) ( 0) -- 787.7 ( 6.4) -- 122.3 ( 0.6) 25% ( 1) 0.2% 786.3 ( 10.8) -2.5% 125.3 ( 2.1) 50% ( 2) 5.9% 741.0 ( 13.9) 37.6% 76.3 ( 19.7) 75% ( 3) 8.3% 722.0 ( 19.0) 49.9% 61.3 ( 3.2) 100% ( 4) 9.3% 714.7 ( 9.5) 56.4% 53.3 ( 1.5) On Josh's 96-CPU and 192G memory system: Without this patch series: [ 0.487132] node 0 initialised, 23398907 pages in 292ms [ 0.499132] node 1 initialised, 24189223 pages in 304ms ... [ 0.629376] Run /sbin/init as init process With this patch series: [ 0.231435] node 1 initialised, 24189223 pages in 32ms [ 0.236718] node 0 initialised, 23398907 pages in 36ms [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Josh Triplett <josh@joshtriplett.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Robert Elliott <elliott@hpe.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200527173608.2885243-7-daniel.m.jordan@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Daniel Jordan | 89c7c4022d |
mm: don't track number of pages during deferred initialization
Deferred page init used to report the number of pages initialized: node 0 initialised, 32439114 pages in 97ms Tracking this makes the code more complicated when using multiple threads. Given that the statistic probably has limited value, especially since a zone grows on demand so that the page count can vary, just remove it. The boot message now looks like node 0 deferred pages initialised in 97ms Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Robert Elliott <elliott@hpe.com> Cc: Shile Zhang <shile.zhang@linux.alibaba.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200527173608.2885243-6-daniel.m.jordan@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Pavel Tatashin | da97f2d56b |
mm: call cond_resched() from deferred_init_memmap()
Now that deferred pages are initialized with interrupts enabled we can replace touch_nmi_watchdog() with cond_resched(), as it was before |
|
Pavel Tatashin | 3d060856ad |
mm: initialize deferred pages with interrupts enabled
Initializing struct pages is a long task and keeping interrupts disabled for the duration of this operation introduces a number of problems. 1. jiffies are not updated for long period of time, and thus incorrect time is reported. See proposed solution and discussion here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com 2. It prevents farther improving deferred page initialization by allowing intra-node multi-threading. We are keeping interrupts disabled to solve a rather theoretical problem that was never observed in real world (See |
|
Daniel Jordan | 117003c327 |
mm/pagealloc.c: call touch_nmi_watchdog() on max order boundaries in deferred init
Patch series "initialize deferred pages with interrupts enabled", v4.
Keep interrupts enabled during deferred page initialization in order to
make code more modular and allow jiffies to update.
Original approach, and discussion can be found here:
http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
This patch (of 3):
deferred_init_memmap() disables interrupts the entire time, so it calls
touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
will run with interrupts enabled, at which point cond_resched() should be
used instead.
deferred_grow_zone() makes the same watchdog calls through code shared
with deferred init but will continue to run with interrupts disabled, so
it can't call cond_resched().
Pull the watchdog calls up to these two places to allow the first to be
changed later, independently of the second. The frequency reduces from
twice per pageblock (init and free) to once per max order block.
Fixes:
|
|
Anshuman Khandual | ae70eddd56 |
mm/page_alloc: restrict and formalize compound_page_dtors[]
Restrict elements in compound_page_dtors[] array per NR_COMPOUND_DTORS and explicitly position them according to enum compound_dtor_id. This improves protection against possible misalignment between compound_page_dtors[] and enum compound_dtor_id later on. Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/1589795958-19317-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Charan Teja Reddy | aa09259109 |
mm, page_alloc: reset the zone->watermark_boost early
Updating the zone watermarks by any means, like min_free_kbytes, water_mark_scale_factor etc, when ->watermark_boost is set will result in higher low and high watermarks than the user asked. Below are the steps to reproduce the problem on system setup of Android kernel running on Snapdragon hardware. 1) Default settings of the system are as below: #cat /proc/sys/vm/min_free_kbytes = 5162 #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node Node 0, zone Normal min 797 low 8340 high 8539 2) Monitor the zone->watermark_boost(by adding a debug print in the kernel) and whenever it is greater than zero value, write the same value of min_free_kbytes obtained from step 1. #echo 5162 > /proc/sys/vm/min_free_kbytes 3) Then read the zone watermarks in the system while the ->watermark_boost is zero. This should show the same values of watermarks as step 1 but shown a higher values than asked. #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node Node 0, zone Normal min 797 low 21148 high 21347 These higher values are because of updating the zone watermarks using the macro min_wmark_pages(zone) which also adds the zone->watermark_boost. #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost) So the steps that lead to the issue are: 1) On the extfrag event, watermarks are boosted by storing the required value in ->watermark_boost. 2) User tries to update the zone watermarks level in the system through min_free_kbytes or watermark_scale_factor. 3) Later, when kswapd woke up, it resets the zone->watermark_boost to zero. In step 2), we use the min_wmark_pages() macro to store the watermarks in the zone structure thus the values are always offsetted by ->watermark_boost value. This can be avoided by resetting the ->watermark_boost to zero before it is used. Signed-off-by: Charan Teja Reddy <charante@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Cc: Vinayak Menon <vinmenon@codeaurora.org> Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Sandipan Das | b418a0f9f0 |
mm/page_alloc.c: reset numa stats for boot pagesets
Initially, the per-cpu pagesets of each zone are set to the boot pagesets. The real pagesets are allocated later but before that happens, page allocations do occur and the numa stats for the boot pagesets get incremented since they are common to all zones at that point. The real pagesets, however, are allocated for the populated zones only. Unpopulated zones, like those associated with memory-less nodes, continue using the boot pageset and end up skewing the numa stats of the corresponding node. E.g. $ numactl -H available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 0 MB node 0 free: 0 MB node 1 cpus: 4 5 6 7 node 1 size: 8131 MB node 1 free: 6980 MB node distances: node 0 1 0: 10 40 1: 40 10 $ numastat node0 node1 numa_hit 108 56495 numa_miss 0 0 numa_foreign 0 0 interleave_hit 0 4537 local_node 108 31547 other_node 0 24948 Hence, the boot pageset stats need to be cleared after the real pagesets are allocated. After this point, the stats of the boot pagesets do not change as page allocations requested for a memory-less node will either fail (if __GFP_THISNODE is used) or get fulfilled by a preferred zone of a different node based on the fallback zonelist. [sandipan@linux.ibm.com: v3] Link: http://lkml.kernel.org/r/20200511170356.162531-1-sandipan@linux.ibm.com Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Link: http://lkml.kernel.org/r/9c9c2d1b15e37f6e6bf32f99e3100035e90c4ac9.1588868430.git.sandipan@linux.ibm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | 01c0bfe061 |
mm: rename gfpflags_to_migratetype to gfp_migratetype for same convention
Pageblock migrate type is encoded in GFP flags, just as zone_type and zonelist. Currently we use gfp_zone() and gfp_zonelist() to extract related information, it would be proper to use the same naming convention for migrate type. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200329080823.7735-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | d0ddf49b7c |
mm/page_alloc.c: use NODE_MASK_NONE in build_zonelists()
Slightly simplify the code by initializing user_mask with NODE_MASK_NONE, instead of later calling nodes_clear(). This saves a line of code. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200330220840.21228-1-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | 97a225e69a |
mm/page_alloc: integrate classzone_idx and high_zoneidx
classzone_idx is just different name for high_zoneidx now. So, integrate them and add some comment to struct alloc_context in order to reduce future confusion about the meaning of this variable. The accessor, ac_classzone_idx() is also removed since it isn't needed after integration. In addition to integration, this patch also renames high_zoneidx to highest_zoneidx since it represents more precise meaning. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Ye Xiaolong <xiaolong.ye@intel.com> Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Baoquan He | f63661566f |
mm/page_alloc.c: clear out zone->lowmem_reserve[] if the zone is empty
When requesting memory allocation from a specific zone is not satisfied, it will fall to lower zone to try allocating memory. In this case, lower zone's ->lowmem_reserve[] will help protect its own memory resource. The higher the relevant ->lowmem_reserve[] is, the harder the upper zone can get memory from this lower zone. However, this protection mechanism should be applied to populated zone, but not an empty zone. So filling ->lowmem_reserve[] for empty zone is not necessary, and may mislead people that it's valid data in that zone. Node 2, zone DMA pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 1024, 1024) Node 2, zone DMA32 pages free 0 min 0 low 0 high 0 spanned 0 present 0 managed 0 protection: (0, 0, 1024, 1024) Node 2, zone Normal per-node stats nr_inactive_anon 0 nr_active_anon 143 nr_inactive_file 0 nr_active_file 0 nr_unevictable 0 nr_slab_reclaimable 45 nr_slab_unreclaimable 254 Here clear out zone->lowmem_reserve[] if zone is empty. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200402140113.3696-3-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Baoquan He | 86aaf25543 |
mm/page_alloc.c: only tune sysctl_lowmem_reserve_ratio value once when changing it
Patch series "improvements about lowmem_reserve and /proc/zoneinfo", v2. This patch (of 3): When people write to /proc/sys/vm/lowmem_reserve_ratio to change sysctl_lowmem_reserve_ratio[], setup_per_zone_lowmem_reserve() is called to recalculate all ->lowmem_reserve[] for each zone of all nodes as below: static void setup_per_zone_lowmem_reserve(void) { ... for_each_online_pgdat(pgdat) { for (j = 0; j < MAX_NR_ZONES; j++) { ... while (idx) { ... if (sysctl_lowmem_reserve_ratio[idx] < 1) { sysctl_lowmem_reserve_ratio[idx] = 0; lower_zone->lowmem_reserve[j] = 0; } else { ... } } } } Meanwhile, here, sysctl_lowmem_reserve_ratio[idx] will be tuned if its value is smaller than '1'. As we know, sysctl_lowmem_reserve_ratio[] is set for zone without regarding to which node it belongs to. That means the tuning will be done on all nodes, even though it has been done in the first node. And the tuning will be done too even when init_per_zone_wmark_min() calls setup_per_zone_lowmem_reserve(), where actually nobody tries to change sysctl_lowmem_reserve_ratio[]. So now move the tuning into lowmem_reserve_ratio_sysctl_handler(), to make code logic more reasonable. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Link: http://lkml.kernel.org/r/20200402140113.3696-1-bhe@redhat.com Link: http://lkml.kernel.org/r/20200402140113.3696-2-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Baoquan He | 4ca7be24ee |
mm/page_alloc.c: remove unused free_bootmem_with_active_regions
Since commit
|
|
Roman Gushchin | 1686766493 |
mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocations
Currently a cma area is barely used by the page allocator because it's used only as a fallback from movable, however kswapd tries hard to make sure that the fallback path isn't used. This results in a system evicting memory and pushing data into swap, while lots of CMA memory is still available. This happens despite the fact that alloc_contig_range is perfectly capable of moving any movable allocations out of the way of an allocation. To effectively use the cma area let's alter the rules: if the zone has more free cma pages than the half of total free pages in the zone, use cma pageblocks first and fallback to movable blocks in the case of failure. [guro@fb.com: ifdef the cma-specific code] Link: http://lkml.kernel.org/r/20200311225832.GA178154@carbon.DHCP.thefacebook.com Co-developed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Joonsoo Kim <js1304@gmail.com> Link: http://lkml.kernel.org/r/20200306150102.3e77354b@imladris.surriel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | 58b7f1194f |
mm/page_alloc.c: extract check_[new|free]_page_bad() common part to page_bad_reason()
We share similar code in check_[new|free]_page_bad() to get the page's bad reason. Let's extract it and reduce code duplication. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200411220357.9636-6-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Wei Yang | 534fe5e3c4 |
mm/page_alloc.c: rename free_pages_check() to check_free_page()
free_pages_check() is the counterpart of check_new_page(). Rename it to use the same naming convention. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Link: http://lkml.kernel.org/r/20200411220357.9636-5-richard.weiyang@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |