mirror of https://gitee.com/openkylin/linux.git
15539 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Jakub Kicinski | 46d5e62dd3 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
Gerald Schaefer | ba9c1201be |
mm/hugetlb: clear compound_nr before freeing gigantic pages
Commit |
|
Kuan-Ying Lee | 6c82d45c7f |
kasan: fix object remaining in offline per-cpu quarantine
We hit this issue in our internal test. When enabling generic kasan, a kfree()'d object is put into per-cpu quarantine first. If the cpu goes offline, object still remains in the per-cpu quarantine. If we call kmem_cache_destroy() now, slub will report "Objects remaining" error. ============================================================================= BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown() ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200 CPU: 3 PID: 176 Comm: cat Tainted: G B 5.10.0-rc1-00007-g4525c8781ec0-dirty #10 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2b0 show_stack+0x18/0x68 dump_stack+0xfc/0x168 slab_err+0xac/0xd4 __kmem_cache_shutdown+0x1e4/0x3c8 kmem_cache_destroy+0x68/0x130 test_version_show+0x84/0xf0 module_attr_show+0x40/0x60 sysfs_kf_seq_show+0x128/0x1c0 kernfs_seq_show+0xa0/0xb8 seq_read+0x1f0/0x7e8 kernfs_fop_read+0x70/0x338 vfs_read+0xe4/0x250 ksys_read+0xc8/0x180 __arm64_sys_read+0x44/0x58 el0_svc_common.constprop.0+0xac/0x228 do_el0_svc+0x38/0xa0 el0_sync_handler+0x170/0x178 el0_sync+0x174/0x180 INFO: Object 0x(____ptrval____) @offset=15848 INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172 stack_trace_save+0x9c/0xd0 set_track+0x64/0xf0 alloc_debug_processing+0x104/0x1a0 ___slab_alloc+0x628/0x648 __slab_alloc.isra.0+0x2c/0x58 kmem_cache_alloc+0x560/0x588 test_version_show+0x98/0xf0 module_attr_show+0x40/0x60 sysfs_kf_seq_show+0x128/0x1c0 kernfs_seq_show+0xa0/0xb8 seq_read+0x1f0/0x7e8 kernfs_fop_read+0x70/0x338 vfs_read+0xe4/0x250 ksys_read+0xc8/0x180 __arm64_sys_read+0x44/0x58 el0_svc_common.constprop.0+0xac/0x228 kmem_cache_destroy test_module_slab: Slab cache still has objects Register a cpu hotplug function to remove all objects in the offline per-cpu quarantine when cpu is going offline. Set a per-cpu variable to indicate this cpu is offline. [qiang.zhang@windriver.com: fix slab double free when cpu-hotplug] Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com> Signed-off-by: Zqiang <qiang.zhang@windriver.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Reported-by: Guangye Yang <guangye.yang@mediatek.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Matthias Brugger <matthias.bgg@gmail.com> Cc: Nicholas Tang <nicholas.tang@mediatek.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Qian Cai <qcai@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Andrew Morton | 16c0cc0ce3 |
revert "mm/filemap: add static for function __add_to_page_cache_locked"
Revert commit
|
|
Minchan Kim | a68a0262ab |
mm/madvise: remove racy mm ownership check
Jann spotted the security hole due to race of mm ownership check.
If the task is sharing the mm_struct but goes through execve() before
mm_access(), it could skip process_madvise_behavior_valid check. That
makes *any advice hint* to reach into the remote process.
This patch removes the mm ownership check. With it, it will lose the
ability that local process could give *any* advice hint with vector
interface for some reason (e.g., performance). Since there is no
concrete example in upstream yet, it would be better to remove the
abiliity at this moment and need to review when such new advice comes
up.
Fixes:
|
|
Liu Zixian | 309d08d9b3 |
mm/mmap.c: fix mmap return value when vma is merged after call_mmap()
On success, mmap should return the begin address of newly mapped area,
but patch "mm: mmap: merge vma after call_mmap() if possible" set
vm_start of newly merged vma to return value addr. Users of mmap will
get wrong address if vma is merged after call_mmap(). We fix this by
moving the assignment to addr before merging vma.
We have a driver which changes vm_flags, and this bug is found by our
testcases.
Fixes:
|
|
Mike Kravetz | 7a5bde3798 |
hugetlb_cgroup: fix offline of hugetlb cgroup with reservations
Adrian Moreno was ruuning a kubernetes 1.19 + containerd/docker workload
using hugetlbfs. In this environment the issue is reproduced by:
- Start a simple pod that uses the recently added HugePages medium
feature (pod yaml attached)
- Start a DPDK app. It doesn't need to run successfully (as in transfer
packets) nor interact with real hardware. It seems just initializing
the EAL layer (which handles hugepage reservation and locking) is
enough to trigger the issue
- Delete the Pod (or let it "Complete").
This would result in a kworker thread going into a tight loop (top output):
1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy
'perf top -g' reports:
- 63.28% 0.01% [kernel] [k] worker_thread
- 49.97% worker_thread
- 52.64% process_one_work
- 62.08% css_killed_work_fn
- hugetlb_cgroup_css_offline
41.52% _raw_spin_lock
- 2.82% _cond_resched
rcu_all_qs
2.66% PageHuge
- 0.57% schedule
- 0.57% __schedule
We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
infinitely spinning. Little else can be done on the system as the
cgroup_mutex can not be acquired.
Do note that the issue can be reproduced by simply offlining a hugetlb
cgroup containing pages with reservation counts.
The loop in hugetlb_cgroup_css_offline is moving page counts from the
cgroup being offlined to the parent cgroup. This is done for each
hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
The routine moving counts (hugetlb_cgroup_move_parent) is only moving
'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
both 'usage' and 'reservation' counts. Discussion about what to do with
reservation counts when reparenting was discussed here:
https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/
The decision was made to leave a zombie cgroup for with reservation
counts. Unfortunately, the code checking reservation counts was
incorrectly added to hugetlb_cgroup_have_usage.
To fix the issue, simply remove the check for reservation counts. While
fixing this issue, a related bug in hugetlb_cgroup_css_offline was
noticed. The hstate index is not reinitialized each time through the
do-while loop. Fix this as well.
Fixes:
|
|
Alex Shi | 3351b16af4 |
mm/filemap: add static for function __add_to_page_cache_locked
mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes] Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Qian Cai | b11a76b37a |
mm/swapfile: do not sleep with a spin lock held
We can't call kvfree() with a spin lock held, so defer it. Fixes a
might_sleep() runtime warning.
Fixes:
|
|
Minchan Kim | e91d8d7823 |
mm/zsmalloc.c: drop ZSMALLOC_PGTABLE_MAPPING
While I was doing zram testing, I found sometimes decompression failed
since the compression buffer was corrupted. With investigation, I found
below commit calls cond_resched unconditionally so it could make a
problem in atomic context if the task is reschedule.
BUG: sleeping function called from invalid context at mm/vmalloc.c:108
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
3 locks held by memhog/946:
#0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
#1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
#2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
Call Trace:
unmap_kernel_range_noflush+0x2eb/0x350
unmap_kernel_range+0x14/0x30
zs_unmap_object+0xd5/0xe0
zram_bvec_rw.isra.0+0x38c/0x8e0
zram_rw_page+0x90/0x101
bdev_write_page+0x92/0xe0
__swap_writepage+0x94/0x4a0
pageout+0xe3/0x3a0
shrink_page_list+0xb94/0xd60
shrink_inactive_list+0x158/0x460
We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
contains the offending calling code) from zsmalloc.
Even though this option showed some amount improvement(e.g., 30%) in
some arm32 platforms, it has been headache to maintain since it have
abused APIs[1](e.g., unmap_kernel_range in atomic context).
Since we are approaching to deprecate 32bit machines and already made
the config option available for only builtin build since v5.8, lastly it
has been not default option in zsmalloc, it's time to drop the option
for better maintenance.
[1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
Fixes:
|
|
Yang Shi | 8199be001a |
mm: list_lru: set shrinker map bit when child nr_items is not zero
When investigating a slab cache bloat problem, significant amount of negative dentry cache was seen, but confusingly they neither got shrunk by reclaimer (the host has very tight memory) nor be shrunk by dropping cache. The vmcore shows there are over 14M negative dentry objects on lru, but tracing result shows they were even not scanned at all. Further investigation shows the memcg's vfs shrinker_map bit is not set. So the reclaimer or dropping cache just skip calling vfs shrinker. So we have to reboot the hosts to get the memory back. I didn't manage to come up with a reproducer in test environment, and the problem can't be reproduced after rebooting. But it seems there is race between shrinker map bit clear and reparenting by code inspection. The hypothesis is elaborated as below. The memcg hierarchy on our production environment looks like: root / \ system user The main workloads are running under user slice's children, and it creates and removes memcg frequently. So reparenting happens very often under user slice, but no task is under user slice directly. So with the frequent reparenting and tight memory pressure, the below hypothetical race condition may happen: CPU A CPU B reparent dst->nr_items == 0 shrinker: total_objects == 0 add src->nr_items to dst set_bit return SHRINK_EMPTY clear_bit child memcg offline replace child's kmemcg_id with parent's (in memcg_offline_kmem()) list_lru_del() between shrinker runs see parent's kmemcg_id dec dst->nr_items reparent again dst->nr_items may go negative due to concurrent list_lru_del() The second run of shrinker: read nr_items without any synchronization, so it may see intermediate negative nr_items then total_objects may return 0 coincidently keep the bit cleared dst->nr_items != 0 skip set_bit add scr->nr_item to dst After this point dst->nr_item may never go zero, so reparenting will not set shrinker_map bit anymore. And since there is no task under user slice directly, so no new object will be added to its lru to set the shrinker map bit either. That bit is kept cleared forever. How does list_lru_del() race with reparenting? It is because reparenting replaces children's kmemcg_id to parent's without protecting from nlru->lock, so list_lru_del() may see parent's kmemcg_id but actually deleting items from child's lru, but dec'ing parent's nr_items, so the parent's nr_items may go negative as commit |
|
Roman Gushchin | becaba65f6 |
mm: memcg/slab: fix obj_cgroup_charge() return value handling
Commit |
|
Jakub Kicinski | a1dd1d8697 |
Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says: ==================== pull-request: bpf-next 2020-12-03 The main changes are: 1) Support BTF in kernel modules, from Andrii. 2) Introduce preferred busy-polling, from Björn. 3) bpf_ima_inode_hash() and bpf_bprm_opts_set() helpers, from KP Singh. 4) Memcg-based memory accounting for bpf objects, from Roman. 5) Allow bpf_{s,g}etsockopt from cgroup bind{4,6} hooks, from Stanislav. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (118 commits) selftests/bpf: Fix invalid use of strncat in test_sockmap libbpf: Use memcpy instead of strncpy to please GCC selftests/bpf: Add fentry/fexit/fmod_ret selftest for kernel module selftests/bpf: Add tp_btf CO-RE reloc test for modules libbpf: Support attachment of BPF tracing programs to kernel modules libbpf: Factor out low-level BPF program loading helper bpf: Allow to specify kernel module BTFs when attaching BPF programs bpf: Remove hard-coded btf_vmlinux assumption from BPF verifier selftests/bpf: Add CO-RE relocs selftest relying on kernel module BTF selftests/bpf: Add support for marking sub-tests as skipped selftests/bpf: Add bpf_testmod kernel module for testing libbpf: Add kernel module BTF support for CO-RE relocations libbpf: Refactor CO-RE relocs to not assume a single BTF object libbpf: Add internal helper to load BTF data by FD bpf: Keep module's btf_data_size intact after load bpf: Fix bpf_put_raw_tracepoint()'s use of __module_address() selftests/bpf: Add Userspace tests for TCP_WINDOW_CLAMP bpf: Adds support for setting window clamp samples/bpf: Fix spelling mistake "recieving" -> "receiving" bpf: Fix cold build of test_progs-no_alu32 ... ==================== Link: https://lore.kernel.org/r/20201204021936.85653-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
|
Roman Gushchin | 18b2db3b03 |
mm: Convert page kmemcg type to a page memcg flag
PageKmemcg flag is currently defined as a page type (like buddy, offline, table and guard). Semantically it means that the page was accounted as a kernel memory by the page allocator and has to be uncharged on the release. As a side effect of defining the flag as a page type, the accounted page can't be mapped to userspace (look at page_has_type() and comments above). In particular, this blocks the accounting of vmalloc-backed memory used by some bpf maps, because these maps do map the memory to userspace. One option is to fix it by complicating the access to page->mapcount, which provides some free bits for page->page_type. But it's way better to move this flag into page->memcg_data flags. Indeed, the flag makes no sense without enabled memory cgroups and memory cgroup pointer set in particular. This commit replaces PageKmemcg() and __SetPageKmemcg() with PageMemcgKmem() and an open-coded OR operation setting the memcg pointer with the MEMCG_DATA_KMEM bit. __ClearPageKmemcg() can be simple deleted, as the whole memcg_data is zeroed at once. As a bonus, on !CONFIG_MEMCG build the PageMemcgKmem() check will be compiled out. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-5-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-5-guro@fb.com |
|
Roman Gushchin | 270c6a7146 |
mm: memcontrol/slab: Use helpers to access slab page's memcg_data
To gather all direct accesses to struct page's memcg_data field in one place, let's introduce 3 new helpers to use in the slab accounting code: struct obj_cgroup **page_objcgs(struct page *page); struct obj_cgroup **page_objcgs_check(struct page *page); bool set_page_objcgs(struct page *page, struct obj_cgroup **objcgs); They are similar to the corresponding API for generic pages, except that the setter can return false, indicating that the value has been already set from a different thread. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/20201027001657.3398190-3-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-3-guro@fb.com |
|
Roman Gushchin | bcfe06bf26 |
mm: memcontrol: Use helpers to read page's memcg data
Patch series "mm: allow mapping accounted kernel pages to userspace", v6. Currently a non-slab kernel page which has been charged to a memory cgroup can't be mapped to userspace. The underlying reason is simple: PageKmemcg flag is defined as a page type (like buddy, offline, etc), so it takes a bit from a page->mapped counter. Pages with a type set can't be mapped to userspace. But in general the kmemcg flag has nothing to do with mapping to userspace. It only means that the page has been accounted by the page allocator, so it has to be properly uncharged on release. Some bpf maps are mapping the vmalloc-based memory to userspace, and their memory can't be accounted because of this implementation detail. This patchset removes this limitation by moving the PageKmemcg flag into one of the free bits of the page->mem_cgroup pointer. Also it formalizes accesses to the page->mem_cgroup and page->obj_cgroups using new helpers, adds several checks and removes a couple of obsolete functions. As the result the code became more robust with fewer open-coded bit tricks. This patch (of 4): Currently there are many open-coded reads of the page->mem_cgroup pointer, as well as a couple of read helpers, which are barely used. It creates an obstacle on a way to reuse some bits of the pointer for storing additional bits of information. In fact, we already do this for slab pages, where the last bit indicates that a pointer has an attached vector of objcg pointers instead of a regular memcg pointer. This commits uses 2 existing helpers and introduces a new helper to converts all read sides to calls of these helpers: struct mem_cgroup *page_memcg(struct page *page); struct mem_cgroup *page_memcg_rcu(struct page *page); struct mem_cgroup *page_memcg_check(struct page *page); page_memcg_check() is intended to be used in cases when the page can be a slab page and have a memcg pointer pointing at objcg vector. It does check the lowest bit, and if set, returns NULL. page_memcg() contains a VM_BUG_ON_PAGE() check for the page not being a slab page. To make sure nobody uses a direct access, struct page's mem_cgroup/obj_cgroups is converted to unsigned long memcg_data. Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com |
|
Hugh Dickins | 073861ed77 |
mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)
Twice now, when exercising ext4 looped on shmem huge pages, I have crashed on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling end_page_writeback() calling wake_up_page() on tail of a shmem huge page, no longer an ext4 page at all. The problem is that PageWriteback is not accompanied by a page reference (as the NOTE at the end of test_clear_page_writeback() acknowledges): as soon as TestClearPageWriteback has been done, that page could be removed from page cache, freed, and reused for something else by the time that wake_up_page() is reached. https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/ Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail check; but I'm paranoid about even looking at an unreferenced struct page, lest its memory might itself have already been reused or hotremoved (and wake_up_page_bit() may modify that memory with its ClearPageWaiters()). Then on crashing a second time, realized there's a stronger reason against that approach. If my testing just occasionally crashes on that check, when the page is reused for part of a compound page, wouldn't it be much more common for the page to get reused as an order-0 page before reaching wake_up_page()? And on rare occasions, might that reused page already be marked PageWriteback by its new user, and already be waited upon? What would that look like? It would look like BUG_ON(PageWriteback) after wait_on_page_writeback() in write_cache_pages() (though I have never seen that crash myself). Matthew Wilcox explaining this to himself: "page is allocated, added to page cache, dirtied, writeback starts, --- thread A --- filesystem calls end_page_writeback() test_clear_page_writeback() --- context switch to thread B --- truncate_inode_pages_range() finds the page, it doesn't have writeback set, we delete it from the page cache. Page gets reallocated, dirtied, writeback starts again. Then we call write_cache_pages(), see PageWriteback() set, call wait_on_page_writeback() --- context switch back to thread A --- wake_up_page(page, PG_writeback); ... thread B is woken, but because the wakeup was for the old use of the page, PageWriteback is still set. Devious" And prior to |
|
Matthew Wilcox (Oracle) | 66383800df |
mm: fix madvise WILLNEED performance problem
The calculation of the end page index was incorrect, leading to a
regression of 70% when running stress-ng.
With this fix, we instead see a performance improvement of 3%.
Fixes:
|
|
Gerald Schaefer | bfe8cc1db0 |
mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault()
Alexander reported a syzkaller / KASAN finding on s390, see below for complete output. In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be freed in some cases. In the case of userfaultfd_missing(), this will happen after calling handle_userfault(), which might have released the mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will access an unstable vma->vm_mm, which could have been freed or re-used already. For all architectures other than s390 this will go w/o any negative impact, because pte_free() simply frees the page and ignores the passed-in mm. The implementation for SPARC32 would also access mm->page_table_lock for pte_free(), but there is no THP support in SPARC32, so the buggy code path will not be used there. For s390, the mm->context.pgtable_list is being used to maintain the 2K pagetable fragments, and operating on an already freed or even re-used mm could result in various more or less subtle bugs due to list / pagetable corruption. Fix this by calling pte_free() before handle_userfault(), similar to how it is already done in __do_huge_pmd_anonymous_page() for the WRITE / non-huge_zero_page case. Commit |
|
Muchun Song | 8faeb1ffd7 |
mm: memcg/slab: fix root memcg vmstats
If we reparent the slab objects to the root memcg, when we free the slab
object, we need to update the per-memcg vmstats to keep it correct for
the root memcg. Now this at least affects the vmstat of
NR_KERNEL_STACK_KB for !CONFIG_VMAP_STACK when the thread stack size is
smaller than the PAGE_SIZE.
David said:
"I assume that without this fix that the root memcg's vmstat would
always be inflated if we reparented"
Fixes:
|
|
Dan Williams | a927bd6ba9 |
mm: fix phys_to_target_node() and memory_add_physaddr_to_nid() exports
The core-mm has a default __weak implementation of phys_to_target_node()
to mirror the weak definition of memory_add_physaddr_to_nid(). That
symbol is exported for modules. However, while the export in
mm/memory_hotplug.c exported the symbol in the configuration cases of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=y
...and:
CONFIG_NUMA_KEEP_MEMINFO=n
CONFIG_MEMORY_HOTPLUG=y
...it failed to export the symbol in the case of:
CONFIG_NUMA_KEEP_MEMINFO=y
CONFIG_MEMORY_HOTPLUG=n
Not only is that broken, but Christoph points out that the kernel should
not be exporting any __weak symbol, which means that
memory_add_physaddr_to_nid() example that phys_to_target_node() copied
is broken too.
Rework the definition of phys_to_target_node() and
memory_add_physaddr_to_nid() to not require weak symbols. Move to the
common arch override design-pattern of an asm header defining a symbol
to replace the default implementation.
The only common header that all memory_add_physaddr_to_nid() producing
architectures implement is asm/sparsemem.h. In fact, powerpc already
defines its memory_add_physaddr_to_nid() helper in sparsemem.h.
Double-down on that observation and define phys_to_target_node() where
necessary in asm/sparsemem.h. An alternate consideration that was
discarded was to put this override in asm/numa.h, but that entangles
with the definition of MAX_NUMNODES relative to the inclusion of
linux/nodemask.h, and requires powerpc to grow a new header.
The dependency on NUMA_KEEP_MEMINFO for DEV_DAX_HMEM_DEVICES is invalid
now that the symbol is properly exported / stubbed in all combinations
of CONFIG_NUMA_KEEP_MEMINFO and CONFIG_MEMORY_HOTPLUG.
[dan.j.williams@intel.com: v4]
Link: https://lkml.kernel.org/r/160461461867.1505359.5301571728749534585.stgit@dwillia2-desk3.amr.corp.intel.com
[dan.j.williams@intel.com: powerpc: fix create_section_mapping compile warning]
Link: https://lkml.kernel.org/r/160558386174.2948926.2740149041249041764.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes:
|
|
Eric Dumazet | 450677dcb0 |
mm/madvise: fix memory leak from process_madvise
The early return in process_madvise() will produce a memory leak.
Fix it.
Fixes:
|
|
Linus Torvalds | fa5fca78bb |
io_uring-5.10-2020-11-20
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+4DAwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgphdOD/9xOEnYPuekvVH9G9nyNd//Q9fPArG2+j6V /MCnze07GNtDt7z15oR+T07hKXmf+Ejh4nu3JJ6MUNfe/47hhJqHSxRHU6+PJCjk hPrsaTsDedxxLEDiLmvhXnUPzfVzJtefxVAAaKikWOb3SBqLdh7xTFSlor1HbRBl Zk4d343cjBDYfvSSt/zMWDzwwvramdz7rJnnPMKXITu64ITL5314vuK2YVZmBOet YujSah7J8FL1jKhiG1Iw5rayd2Q3smnHWIEQ+lvW6WiTvMJMLOxif2xNF4/VEZs1 CBGJUQt42LI6QGEzRBHohcefZFuPGoxnduSzHCOIhh7d6+k+y9mZfsPGohr3g9Ov NotXpVonnA7GbRqzo1+IfBRve7iRONdZ3/LBwyRmqav4I4jX68wXBNH5IDpVR0Sn c31avxa/ZL7iLIBx32enp0/r3mqNTQotEleSLUdyJQXAZTyG2INRhjLLXTqSQ5BX oVp0fZzKCwsr6HCPZpXZ/f2G7dhzuF0ghoceC02GsOVooni22gdVnQj+AWNus398 e+wcimT4MX6AHNFxO2aUtJow0KWWZRzC1p5Mxu/9W3YiMtJiC0YOGePfSqiTqX0g Uk0H5dOAgBUQrAsusf7bKr0K6W25yEk/JipxhWqi0rC71x42mLTsCT1wxSCvLwqs WxhdtVKroQ== =7PAe -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Mostly regression or stable fodder: - Disallow async path resolution of /proc/self - Tighten constraints for segmented async buffered reads - Fix double completion for a retry error case - Fix for fixed file life times (Pavel)" * tag 'io_uring-5.10-2020-11-20' of git://git.kernel.dk/linux-block: io_uring: order refnode recycling io_uring: get an active ref_node from files_data io_uring: don't double complete failed reissue request mm: never attempt async page lock if we've transferred data already io_uring: handle -EOPNOTSUPP on path resolution proc: don't allow async path resolution of /proc/self components |
|
Linus Torvalds | 4d02da974e |
Networking fixes for 5.10-rc5, including fixes from the WiFi (mac80211),
can and bpf (including the strncpy_from_user fix). Current release - regressions: - mac80211: fix memory leak of filtered powersave frames - mac80211: free sta in sta_info_insert_finish() on errors to avoid sleeping in atomic context - netlabel: fix an uninitialized variable warning added in -rc4 Previous release - regressions: - vsock: forward all packets to the host when no H2G is registered, un-breaking AWS Nitro Enclaves - net: Exempt multicast addresses from five-second neighbor lifetime requirement, decreasing the chances neighbor tables fill up - net/tls: fix corrupted data in recvmsg - qed: fix ILT configuration of SRC block - can: m_can: process interrupt only when not runtime suspended Previous release - always broken: - page_frag: Recover from memory pressure by not recycling pages allocating from the reserves - strncpy_from_user: Mask out bytes after NUL terminator - ip_tunnels: Set tunnel option flag only when tunnel metadata is present, always setting it confuses Open vSwitch - bpf, sockmap: - Fix partial copy_page_to_iter so progress can still be made - Fix socket memory accounting and obeying SO_RCVBUF - net: Have netpoll bring-up DSA management interface - net: bridge: add missing counters to ndo_get_stats64 callback - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt - enetc: Workaround MDIO register access HW bug - net/ncsi: move netlink family registration to a subsystem init, instead of tying it to driver probe - net: ftgmac100: unregister NC-SI when removing driver to avoid crash - lan743x: prevent interrupt storm on open - lan743x: fix freeing skbs in the wrong context - net/mlx5e: Fix socket refcount leak on kTLS RX resync - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097 - fix 21 unset return codes and other mistakes on error paths, mostly detected by the Hulk Robot Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAl+226AACgkQMUZtbf5S IruE1w/+JX3CqJwGIqyzyhwVshNaKxmX9gAOMJzkckjEohn8932zPaNq7kbmNYqt 5QsJoou3cXjFeoIEAkQA5fqR4stTZpZMnLO+7JnxxQ0vb2YBN+tIGQRNCnmd1Q0h u9gb5+5AdORdlmk3E7oC8v50dzQRfboJXLEEZTo2uGJwUgLlEAiqTSV2w4YDHMhL JtgtWA/fraL0CUc2WMoxuimg9NirbRuMijsU6+d/yExzznDpdoho/qsxL+Odu1NF hSdaKirA8B8ml0pOd/b4mj+fm4+lKyXZBfSyLx4Ki1TqluEMLzDp7gQPRnU6yyJm AOu4zsKxx6qitOX2qLQCNlEpkQp6LA2N2Zb1orliUV3Bsq2DJRhU35FgLcghtdRP GTRSdKHr2BvMScOZ7dQo8l4TqVc3e/khSZDRGdvpsM275Dt0JyS/l7yAWxunPqMb +/483/s75OuBRO57ULLJ/hR02TG37g/Jv5sI0sG/7oDpGfnulinQX+fxy9izyTEM KYl0mAPSqhb6RcjE0YXWG0rhJN6FSvc/lwPQHjq8wPSkwEdD/FTb6/eYqbXDi1ld UTYhFpkh1PQrwct14eSScMeJqTsNKvG0VV39/uZLZCzcqa3yOY5+oTzwaCFlMsy3 a5yGGxqoh7/FTM8t1ml21is9uZ31LAQEnNTMPv69pZPwAv5G5yE= =SRwI -----END PGP SIGNATURE----- Merge tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.10-rc5, including fixes from the WiFi (mac80211), can and bpf (including the strncpy_from_user fix). Current release - regressions: - mac80211: fix memory leak of filtered powersave frames - mac80211: free sta in sta_info_insert_finish() on errors to avoid sleeping in atomic context - netlabel: fix an uninitialized variable warning added in -rc4 Previous release - regressions: - vsock: forward all packets to the host when no H2G is registered, un-breaking AWS Nitro Enclaves - net: Exempt multicast addresses from five-second neighbor lifetime requirement, decreasing the chances neighbor tables fill up - net/tls: fix corrupted data in recvmsg - qed: fix ILT configuration of SRC block - can: m_can: process interrupt only when not runtime suspended Previous release - always broken: - page_frag: Recover from memory pressure by not recycling pages allocating from the reserves - strncpy_from_user: Mask out bytes after NUL terminator - ip_tunnels: Set tunnel option flag only when tunnel metadata is present, always setting it confuses Open vSwitch - bpf, sockmap: - Fix partial copy_page_to_iter so progress can still be made - Fix socket memory accounting and obeying SO_RCVBUF - net: Have netpoll bring-up DSA management interface - net: bridge: add missing counters to ndo_get_stats64 callback - tcp: brr: only postpone PROBE_RTT if RTT is < current min_rtt - enetc: Workaround MDIO register access HW bug - net/ncsi: move netlink family registration to a subsystem init, instead of tying it to driver probe - net: ftgmac100: unregister NC-SI when removing driver to avoid crash - lan743x: - prevent interrupt storm on open - fix freeing skbs in the wrong context - net/mlx5e: Fix socket refcount leak on kTLS RX resync - net: dsa: mv88e6xxx: Avoid VLAN database corruption on 6097 - fix 21 unset return codes and other mistakes on error paths, mostly detected by the Hulk Robot" * tag 'net-5.10-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (115 commits) fail_function: Remove a redundant mutex unlock selftest/bpf: Test bpf_probe_read_user_str() strips trailing bytes after NUL lib/strncpy_from_user.c: Mask out bytes after NUL terminator. net/smc: fix direct access to ib_gid_addr->ndev in smc_ib_determine_gid() net/smc: fix matching of existing link groups ipv6: Remove dependency of ipv6_frag_thdr_truncated on ipv6 module libbpf: Fix VERSIONED_SYM_COUNT number parsing net/mlx4_core: Fix init_hca fields offset atm: nicstar: Unmap DMA on send error page_frag: Recover from memory pressure net: dsa: mv88e6xxx: Wait for EEPROM done after HW reset mlxsw: core: Use variable timeout for EMAD retries mlxsw: Fix firmware flashing net: Have netpoll bring-up DSA management interface atl1e: fix error return code in atl1e_probe() atl1c: fix error return code in atl1c_probe() ah6: fix error return code in ah6_input() net: usb: qmi_wwan: Set DTR quirk for MR400 can: m_can: process interrupt only when not runtime suspended can: flexcan: flexcan_chip_start(): fix erroneous flexcan_transceiver_enable() during bus-off recovery ... |
|
Dongli Zhang | d8c19014bb |
page_frag: Recover from memory pressure
The ethernet driver may allocate skb (and skb->data) via napi_alloc_skb().
This ends up to page_frag_alloc() to allocate skb->data from
page_frag_cache->va.
During the memory pressure, page_frag_cache->va may be allocated as
pfmemalloc page. As a result, the skb->pfmemalloc is always true as
skb->data is from page_frag_cache->va. The skb will be dropped if the
sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
under memory pressure.
However, once kernel is not under memory pressure any longer (suppose large
amount of memory pages are just reclaimed), the page_frag_alloc() may still
re-use the prior pfmemalloc page_frag_cache->va to allocate skb->data. As a
result, the skb->pfmemalloc is always true unless page_frag_cache->va is
re-allocated, even if the kernel is not under memory pressure any longer.
Here is how kernel runs into issue.
1. The kernel is under memory pressure and allocation of
PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() will fail. Instead,
the pfmemalloc page is allocated for page_frag_cache->va.
2: All skb->data from page_frag_cache->va (pfmemalloc) will have
skb->pfmemalloc=true. The skb will always be dropped by sock without
SOCK_MEMALLOC. This is an expected behaviour.
3. Suppose a large amount of pages are reclaimed and kernel is not under
memory pressure any longer. We expect skb->pfmemalloc drop will not happen.
4. Unfortunately, page_frag_alloc() does not proactively re-allocate
page_frag_alloc->va and will always re-use the prior pfmemalloc page. The
skb->pfmemalloc is always true even kernel is not under memory pressure any
longer.
Fix this by freeing and re-allocating the page instead of recycling it.
References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/
References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
Cc: Bert Barbe <bert.barbe@oracle.com>
Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Cc: Manjunath Patil <manjunath.b.patil@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: SRINIVAS <srinivas.eeda@oracle.com>
Fixes:
|
|
Jens Axboe | 0abed7c69b |
mm: never attempt async page lock if we've transferred data already
We catch the case where we enter generic_file_buffered_read() with data
already transferred, but we also need to be careful not to allow an async
page lock if we're looping transferring data. If not, we could be
returning -EIOCBQUEUED instead of the transferred amount, and it could
result in double waitqueue additions as well.
Cc: stable@vger.kernel.org # v5.9
Fixes:
|
|
Linus Torvalds | a50cf15906 |
Merge branch 'for-5.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu
Pull percpu fix and cleanup from Dennis Zhou: "A fix for a Wshadow warning in the asm-generic percpu macros came in and then I tacked on the removal of flexible array initializers in the percpu allocator" * 'for-5.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu: percpu: convert flexible array initializers to use struct_size() asm-generic: percpu: avoid Wshadow warning |
|
Mike Kravetz | 336bf30eb7 |
hugetlbfs: fix anon huge page migration race
Qian Cai reported the following BUG in [1]
LTP: starting move_pages12
BUG: unable to handle page fault for address: ffffffffffffffe0
...
RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
Call Trace:
rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
migrate_pages+0x1005/0x1fb0
move_pages_and_store_status.isra.47+0xd7/0x1a0
__x64_sys_move_pages+0xa5c/0x1100
do_syscall_64+0x5f/0x310
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Hugh Dickins diagnosed this as a migration bug caused by code introduced
to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
anon pages as the anon_vma_lock should be held in this case. Further
analysis suggested that i_mmap_rwsem was not required to he held at all
when calling try_to_unmap for anon pages as an anon page could never be
part of a shared pmd mapping.
Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
keep mapping valid while dropping page lock.
This patch does the following:
- Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
calling try_to_unmap.
- Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
will now simply do a 'trylock' while still holding the page lock. If
the trylock fails, it will return NULL. This could impact the
callers:
- migration calling code will receive -EAGAIN and retry up to the
hard coded limit (10).
- memory error code will treat the page as BUSY. This will force
killing (SIGKILL) instead of SIGBUS any mapping tasks.
Do note that this change in behavior only happens when there is a
race. None of the standard kernel testing suites actually hit this
race, but it is possible.
[1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
[2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
Fixes:
|
|
Jason Gunthorpe | 96e1fac162 |
mm/gup: use unpin_user_pages() in __gup_longterm_locked()
When FOLL_PIN is passed to __get_user_pages() the page list must be put
back using unpin_user_pages() otherwise the page pin reference persists
in a corrupted state.
There are two places in the unwind of __gup_longterm_locked() that put
the pages back without checking. Normally on error this function would
return the partial page list making this the caller's responsibility,
but in these two cases the caller is not allowed to see these pages at
all.
Fixes:
|
|
Laurent Dufour | 22e4663e91 |
mm/slub: fix panic in slab_alloc_node()
While doing memory hot-unplug operation on a PowerPC VM running 1024 CPUs with 11TB of ram, I hit the following panic: BUG: Kernel NULL pointer dereference on read at 0x00000007 Faulting instruction address: 0xc000000000456048 Oops: Kernel access of bad area, sig: 11 [#2] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp CPU: 160 PID: 1 Comm: systemd Tainted: G D 5.9.0 #1 NIP: c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350 REGS: c00006028d1b77a0 TRAP: 0300 Tainted: G D (5.9.0) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004228 XER: 00000000 CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0 GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000 GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320 GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000 GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1 GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8 GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000 GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001 NIP [c000000000456048] __kmalloc_node+0x108/0x790 LR [c000000000455fd4] __kmalloc_node+0x94/0x790 Call Trace: kvmalloc_node+0x58/0x110 mem_cgroup_css_online+0x10c/0x270 online_css+0x48/0xd0 cgroup_apply_control_enable+0x2c4/0x470 cgroup_mkdir+0x408/0x5f0 kernfs_iop_mkdir+0x90/0x100 vfs_mkdir+0x138/0x250 do_mkdirat+0x154/0x1c0 system_call_exception+0xf8/0x200 system_call_common+0xf0/0x27c Instruction dump: e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a 2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78 This pointing to the following code: mm/slub.c:2851 if (unlikely(!object || !node_match(page, node))) { c000000000456038: 00 00 bc 2f cmpdi cr7,r28,0 c00000000045603c: 18 00 9e 41 beq cr7,c000000000456054 <__kmalloc_node+0x114> node_match(): mm/slub.c:2491 if (node != NUMA_NO_NODE && page_to_nid(page) != node) c000000000456040: 30 02 92 41 beq cr4,c000000000456270 <__kmalloc_node+0x330> page_to_nid(): include/linux/mm.h:1294 c000000000456044: 10 00 27 e9 ld r9,16(r7) c000000000456048: 07 00 29 89 lbz r9,7(r9) <<<< r9 = NULL node_match(): mm/slub.c:2491 c00000000045604c: 00 48 99 7f cmpw cr7,r25,r9 c000000000456050: 20 02 9e 41 beq cr7,c000000000456270 <__kmalloc_node+0x330> The panic occurred in slab_alloc_node() when checking for the page's node: object = c->freelist; page = c->page; if (unlikely(!object || !node_match(page, node))) { object = __slab_alloc(s, gfpflags, node, addr, c); stat(s, ALLOC_SLOWPATH); The issue is that object is not NULL while page is NULL which is odd but may happen if the cache flush happened after loading object but before loading page. Thus checking for the page pointer is required too. The cache flush is done through an inter processor interrupt when a piece of memory is off-lined. That interrupt is triggered when a memory hot-unplug operation is initiated and offline_pages() is calling the slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback() which is calling flush_cpu_slab(). If that interrupt is caught between the reading of c->freelist and the reading of c->page, this could lead to such a situation. That situation is expected and the later call to this_cpu_cmpxchg_double() will detect the change to c->freelist and redo the whole operation. In commit |
|
Nicholas Piggin | 2da9f6305f |
mm/vmscan: fix NR_ISOLATED_FILE corruption on 64-bit
Previously the negated unsigned long would be cast back to signed long which would have the correct negative value. After commit |
|
Zi Yan | d20bdd571e |
mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
In isolate_migratepages_block, if we have too many isolated pages and
nr_migratepages is not zero, we should try to migrate what we have
without wasting time on isolating.
In theory it's possible that multiple parallel compactions will cause
too_many_isolated() to become true even if each has isolated less than
COMPACT_CLUSTER_MAX, and loop forever in the while loop. Bailing
immediately prevents that.
[vbabka@suse.cz: changelog addition]
Fixes:
|
|
Zi Yan | 38935861d8 |
mm/compaction: count pages and stop correctly during page isolation
In isolate_migratepages_block, when cc->alloc_contig is true, we are
able to isolate compound pages. But nr_migratepages and nr_isolated did
not count compound pages correctly, causing us to isolate more pages
than we thought.
So count compound pages as the number of base pages they contain.
Otherwise, we might be trapped in too_many_isolated while loop, since
the actual isolated pages can go up to COMPACT_CLUSTER_MAX*512=16384,
where COMPACT_CLUSTER_MAX is 32, since we stop isolation after
cc->nr_migratepages reaches to COMPACT_CLUSTER_MAX.
In addition, after we fix the issue above, cc->nr_migratepages could
never be equal to COMPACT_CLUSTER_MAX if compound pages are isolated,
thus page isolation could not stop as we intended. Change the isolation
stop condition to '>='.
The issue can be triggered as follows:
In a system with 16GB memory and an 8GB CMA region reserved by
hugetlb_cma, if we first allocate 10GB THPs and mlock them (so some THPs
are allocated in the CMA region and mlocked), reserving 6 1GB hugetlb
pages via /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages will
get stuck (looping in too_many_isolated function) until we kill either
task. With the patch applied, oom will kill the application with 10GB
THPs and let hugetlb page reservation finish.
[ziy@nvidia.com: v3]
Link: https://lkml.kernel.org/r/20201030183809.3616803-1-zi.yan@sent.com
Fixes:
|
|
Jason Yan | a77eedbc87 |
mm/truncate.c: make __invalidate_mapping_pages() static
Fix the following sparse warning:
mm/truncate.c:531:15: warning: symbol '__invalidate_mapping_pages' was not declared. Should it be static?
Fixes:
|
|
Shijie Luo | 3f08842098 |
mm: mempolicy: fix potential pte_unmap_unlock pte error
When flags in queue_pages_pte_range don't have MPOL_MF_MOVE or MPOL_MF_MOVE_ALL bits, code breaks and passing origin pte - 1 to pte_unmap_unlock seems like not a good idea. queue_pages_pte_range can run in MPOL_MF_MOVE_ALL mode which doesn't migrate misplaced pages but returns with EIO when encountering such a page. Since commit |
|
Roman Gushchin | 8de15e920d |
mm: memcg: link page counters to root if use_hierarchy is false
Richard reported a warning which can be reproduced by running the LTP
madvise6 test (cgroup v1 in the non-hierarchical mode should be used):
WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
Modules linked in:
CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
Workqueue: events drain_local_stock
RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
Call Trace:
__memcg_kmem_uncharge (mm/memcontrol.c:3022)
drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
drain_local_stock (mm/memcontrol.c:2255)
process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
kthread (kernel/kthread.c:292)
ret_from_fork (arch/x86/entry/entry_64.S:300)
The problem occurs because in the non-hierarchical mode non-root page
counters are not linked to root page counters, so the charge is not
propagated to the root memory cgroup.
After the removal of the original memory cgroup and reparenting of the
object cgroup, the root cgroup might be uncharged by draining a objcg
stock, for example. It leads to an eventual underflow of the charge and
triggers a warning.
Fix it by linking all page counters to corresponding root page counters
in the non-hierarchical mode.
Please note, that in the non-hierarchical mode all objcgs are always
reparented to the root memory cgroup, even if the hierarchy has more
than 1 level. This patch doesn't change it.
The patch also doesn't affect how the hierarchical mode is working,
which is the only sane and truly supported mode now.
Thanks to Richard for reporting, debugging and providing an alternative
version of the fix!
Fixes:
|
|
zhongjiang-ali | 7de2e9f195 |
mm: memcontrol: correct the NR_ANON_THPS counter of hierarchical memcg
memcg_page_state will get the specified number in hierarchical memcg, It
should multiply by HPAGE_PMD_NR rather than an page if the item is
NR_ANON_THPS.
[akpm@linux-foundation.org: fix printk warning]
[akpm@linux-foundation.org: use u64 cast, per Michal]
Fixes:
|
|
Mike Kravetz | 79aa925bf2 |
hugetlb_cgroup: fix reservation accounting
Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
with hugetlbfs and hit the warning below. QEMU with free page hinting
uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
as free by a VM. The reporting granularity is in pageblock granularity.
So when the guest reports 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE)
one huge page in QEMU.
WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
Modules linked in: ...
CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
RIP: 0010:page_counter_uncharge+0x4b/0x50
...
Call Trace:
hugetlb_cgroup_uncharge_file_region+0x4b/0x80
region_del+0x1d3/0x300
hugetlb_unreserve_pages+0x39/0xb0
remove_inode_hugepages+0x1a8/0x3d0
hugetlbfs_fallocate+0x3c4/0x5c0
vfs_fallocate+0x146/0x290
__x64_sys_fallocate+0x3e/0x70
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Investigation of the issue uncovered bugs in hugetlb cgroup reservation
accounting. This patch addresses the found issues.
Fixes:
|
|
Ralph Campbell | 46b1ee38b2 |
mm/mremap_pages: fix static key devmap_managed_key updates
commit
|
|
Dennis Zhou | 61cf93d3e1 |
percpu: convert flexible array initializers to use struct_size()
Use the safer macro as sparked by the long discussion in [1]. [1] https://lore.kernel.org/lkml/20200917204514.GA2880159@google.com/ Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Dennis Zhou <dennis@kernel.org> |
|
Geert Uytterhoeven | f78f63da91 |
mm/process_vm_access: Add missing #include <linux/compat.h>
With e.g. m68k/defconfig:
mm/process_vm_access.c: In function ‘process_vm_rw’:
mm/process_vm_access.c:277:5: error: implicit declaration of function ‘in_compat_syscall’ [-Werror=implicit-function-declaration]
277 | in_compat_syscall());
| ^~~~~~~~~~~~~~~~~
Fix this by adding #include <linux/compat.h>.
Reported-by: noreply@ellerman.id.au
Reported-by: damian <damian.tometzki@familie-tometzki.de>
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Fixes:
|
|
Jens Axboe | 38dc5079da |
Fix compat regression in process_vm_rw()
The removal of compat_process_vm_{readv,writev} didn't change
process_vm_rw(), which always assumes it's not doing a compat syscall.
Instead of passing in 'false' unconditionally for 'compat', make it
conditional on in_compat_syscall().
[ Both Al and Christoph point out that trying to access a 64-bit process
from a 32-bit one cannot work anyway, and is likely better prohibited,
but that's a separate issue - Linus ]
Fixes:
|
|
Linus Torvalds | c4728cfbed |
Refactored code for 5.10:
- Move the file range remap generic functions out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to reduce clutter in the first two files. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAl+SADAACgkQ+H93GTRK tOtZPxAAjwh/wOD+QPWAlu2zs1qvq9aU5uU56nWZC86JXr5RTokc2DIIwHvsT28I Xr3Oya8hiegsIVohQWLQr7AhVe469G2iegTkn7YmmLJLfrwhtSYxvkYTNMI/Uyx3 LzGRcaqg9QR6DnrEHzI9QfCHyKz73PMD26eJR1wLerVIIcMYIsg7xp3Yd6Y0G5iD VX9qJ15OZNnXlQelG8E/A44dggZPt10D20czD9f/N7ZIpPxrQQLonO08i2YhPlRz sqQT4RjkZoJeZGY2wv2+vGMsbUxTui7sJj7Zsk+ljfo8ByY/wy1nK2IM9xR0jeZx o/td9YcSzGEMan9Q4jSIwMYbgMLw/x79nNWpnFdRh4+xQYGGPfkGOseJ9Sm0SlW5 P6zb2bWMxZkiE/xq/Dsxbnl5Obzk3xc8c1w4nsStsQTcgBTLFJupP626Ib+yythZ pOzWRc2wdH9f4Oy52kxO8GB8kg23abXMACgTfSpzqU9GtSIijoS/Z+AN36jWT890 mkoLFsssRfufmalQX438c8XF94xD+tRCOkxgq9ud71kcWgQnUVzQWvCflkIfetEa jcw+uuChuPOaQ9x6M6Z7gGt+a2zYreyGAmTw67M32UsgXQGO/nCx7f2j/7raYitd ZJb/XoGB1aRfWKpWjaL+66ORmOFY7Uuq9UkRibtYzmR6iMknQcA= =DAPl -----END PGP SIGNATURE----- Merge tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull clone/dedupe/remap code refactoring from Darrick Wong: "Move the generic file range remap (aka reflink and dedupe) functions out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to reduce clutter in the first two files" * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: vfs: move the generic write and copy checks out of mm vfs: move the remap range helpers to remap_range.c vfs: move generic_remap_checks out of mm |
|
Linus Torvalds | c4d6fe7311 |
XArray updates for 5.9
- Fix the test suite after introduction of the local_lock - Fix a bug in the IDA spotted by Coverity - Change the API that allows the workingset code to delete a node - Fix xas_reload() when dealing with entries that occupy multiple indices - Add a few more tests to the test suite - Fix an unsigned int being shifted into an unsigned long -----BEGIN PGP SIGNATURE----- iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAl+OzzAACgkQDpNsjXcp gj5YFgf/cV99dyPaal7AfMwhVwFcuVjIRH4S/VeOHkjS2QT1lpu3ffqfKALVR8vU 3IObM3oDCmLk0mYz9O+V/udVJoBYWiduI0LZhR6+V5ZrDjbw/d4VdCbwOplpeF5x rntyI9r8f5d4LxBJ/moLjsosc1KfCzyVnV389eZRvZ8Muxuyc73WdAwZZZfD79nY 66gScEXQokU99zqJJ1nWfh05XTcTsKF25fVBGMLZTUBAytoFyPuC/kO2z8Uq9lEi Ug6gDClskSB7A2W5gvprMcoUAVYcHfTb0wqJD5/MhkHyoTdcWdW8Re0kssXvD86V KwlBdYQ/JuskgY/hbynZ/FP3p8+t1Q== =12E/ -----END PGP SIGNATURE----- Merge tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray Pull XArray updates from Matthew Wilcox: - Fix the test suite after introduction of the local_lock - Fix a bug in the IDA spotted by Coverity - Change the API that allows the workingset code to delete a node - Fix xas_reload() when dealing with entries that occupy multiple indices - Add a few more tests to the test suite - Fix an unsigned int being shifted into an unsigned long * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray: XArray: Fix xas_create_range for ranges above 4 billion radix-tree: fix the comment of radix_tree_next_slot() XArray: Fix xas_reload for multi-index entries XArray: Add private interface for workingset node deletion XArray: Fix xas_for_each_conflict documentation XArray: Test marked multiorder iterations XArray: Test two more things about xa_cmpxchg ida: Free allocated bitmap in error path radix tree test suite: Fix compilation |
|
Linus Torvalds | 4962a85696 |
io_uring-5.10-2020-10-20
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+O9WEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqajEADSD5PiO94YWTtVNWFUjn5RW+GlCE70/0VV XImdasmvM8nb48E+z2EW0Ky4vKXeVy5r+WZAeIYqPUHy/ogQDVpEn00NL7tFQmOz 8UYrlZ3LLE/bSeWM5iavgG7TldVs/ZfspJ0hj3/Ac7jJpzuRGEI5TClsxJ0mWV39 b2qT4OYBDhwvwVPZ/qhWgEwJXJFzFywckouIbMw8gPkveebUYeyu/yuScNGwYuiQ 46YPEk/XIuOy8iUvQjqhLY+NNlAKJwt3z9WZgt5F+TZIhkpp7z6h20+jezFQcuFP GXzIDN+EADpsbw7MWJYIZVffxDEMlDpkJlAVMT1hsYLDfTXoEzmFwRddoFh2Fjf6 ghWqhOKffUuAOX2xs1MrS2xLaxd0ot7QqZJVTYk7zEljkaRANlstSZZ+PpI+Sad/ rNieQvs6jnsmTODDEaV3qyFX5aBQ2NdvyndZNU9wz0GZAWAdz+wxE0A1FVD0A37i p6m8sIvhNg3/cW89G04IDYUkAygi8knVDnEDHRwaJtswZQ4pRSGMp+N4qZ0GpnK7 BviaAhofGaYlqruavO6Ug2YyomYpWGlUxTaB9ZKh0HkEDlDM945+0sgQRdxfsE8d OboycqJn3puOl/wh5Fc4oGYrWLsDbaA/5kksC4lm85Z+HUf+UXMS4QFdoPJYjhuM H6oMz1w2bQ== =v56S -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block Pull io_uring updates from Jens Axboe: "A mix of fixes and a few stragglers. In detail: - Revert the bogus __read_mostly that we discussed for the initial pull request. - Fix a merge window regression with fixed file registration error path handling. - Fix io-wq numa node affinities. - Series abstracting out an io_identity struct, making it both easier to see what the personality items are, and also easier to to adopt more. Use this to cover audit logging. - Fix for read-ahead disabled block condition in async buffered reads, and using single page read-ahead to unify what generic_file_buffer_read() path is used. - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel) - Poll fix (Pavel)" * tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits) io_uring: use blk_queue_nowait() to check if NOWAIT supported mm: use limited read-ahead to satisfy read mm: mark async iocb read as NOWAIT once some data has been copied io_uring: fix double poll mask init io-wq: inherit audit loginuid and sessionid io_uring: use percpu counters to track inflight requests io_uring: assign new io_identity for task if members have changed io_uring: store io_identity in io_uring_task io_uring: COW io_identity on mismatch io_uring: move io identity items into separate struct io_uring: rely solely on work flags to determine personality. io_uring: pass required context in as flags io-wq: assign NUMA node locality if appropriate io_uring: fix error path cleanup in io_sqe_files_register() Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly" io_uring: fix REQ_F_COMP_LOCKED by killing it io_uring: dig out COMP_LOCK from deep call chain io_uring: don't put a poll req under spinlock io_uring: don't unnecessarily clear F_LINK_TIMEOUT io_uring: don't set COMP_LOCKED if won't put ... |
|
Christoph Hellwig | b71df8de41 |
mm: remove the filename in the top of file comment in vmalloc.c
No point in having the filename inside the file. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002124035.1539300-3-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Christoph Hellwig | f255935b97 |
mm: cleanup the gfp_mask handling in __vmalloc_area_node
Patch series "two small vmalloc cleanups". This patch (of 2): __vmalloc_area_node currently has four different gfp_t variables to just express this simple logic: - use the passed in mask, plus __GFP_NOWARN and __GFP_HIGHMEM (if suitable) for the underlying page allocation - use just the reclaim flags from the passed in mask plus __GFP_ZERO for allocating the page array Simplify this down to just use the pre-existing nested_gfp as-is for the page array allocation, and just the passed in gfp_mask for the page allocation, after conditionally ORing __GFP_HIGHMEM into it. This also makes the allocation warning a little more correct. Also initialize two variables at the time of declaration while touching this area. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002124035.1539300-1-hch@lst.de Link: https://lkml.kernel.org/r/20201002124035.1539300-2-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Christoph Hellwig | 301fa9f2dd |
mm: remove alloc_vm_area
All users are gone now. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Christoph Hellwig | d1b6d2e1fe |
zsmalloc: switch from alloc_vm_area to get_vm_area
Just manually pre-fault the PTEs using apply_to_page_range. Co-developed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-6-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Christoph Hellwig | eeb4a05fce |
mm: allow a NULL fn callback in apply_to_page_range
Besides calling the callback on each page, apply_to_page_range also has the effect of pre-faulting all PTEs for the range. To support callers that only need the pre-faulting, make the callback optional. Based on a patch from Minchan Kim <minchan@kernel.org>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |