mirror of https://gitee.com/openkylin/linux.git
963 Commits
Author | SHA1 | Message | Date |
---|---|---|---|
Nadav Amit | 6ce64428d6 |
mm/userfaultfd: fix memory corruption due to writeprotect
Userfaultfd self-test fails occasionally, indicating a memory corruption. Analyzing this problem indicates that there is a real bug since mmap_lock is only taken for read in mwriteprotect_range() and defers flushes, and since there is insufficient consideration of concurrent deferred TLB flushes in wp_page_copy(). Although the PTE is flushed from the TLBs in wp_page_copy(), this flush takes place after the copy has already been performed, and therefore changes of the page are possible between the time of the copy and the time in which the PTE is flushed. To make matters worse, memory-unprotection using userfaultfd also poses a problem. Although memory unprotection is logically a promotion of PTE permissions, and therefore should not require a TLB flush, the current userrfaultfd code might actually cause a demotion of the architectural PTE permission: when userfaultfd_writeprotect() unprotects memory region, it unintentionally *clears* the RW-bit if it was already set. Note that this unprotecting a PTE that is not write-protected is a valid use-case: the userfaultfd monitor might ask to unprotect a region that holds both write-protected and write-unprotected PTEs. The scenario that happens in selftests/vm/userfaultfd is as follows: cpu0 cpu1 cpu2 ---- ---- ---- [ Writable PTE cached in TLB ] userfaultfd_writeprotect() [ write-*unprotect* ] mwriteprotect_range() mmap_read_lock() change_protection() change_protection_range() ... change_pte_range() [ *clear* “write”-bit ] [ defer TLB flushes ] [ page-fault ] ... wp_page_copy() cow_user_page() [ copy page ] [ write to old page ] ... set_pte_at_notify() A similar scenario can happen: cpu0 cpu1 cpu2 cpu3 ---- ---- ---- ---- [ Writable PTE cached in TLB ] userfaultfd_writeprotect() [ write-protect ] [ deferred TLB flush ] userfaultfd_writeprotect() [ write-unprotect ] [ deferred TLB flush] [ page-fault ] wp_page_copy() cow_user_page() [ copy page ] ... [ write to page ] set_pte_at_notify() This race exists since commit |
|
Peter Xu | 97a7e4733b |
mm: introduce page_needs_cow_for_dma() for deciding whether cow
We've got quite a few places (pte, pmd, pud) that explicitly checked against whether we should break the cow right now during fork(). It's easier to provide a helper, especially before we work the same thing on hugetlbfs. Since we'll reference is_cow_mapping() in mm.h, move it there too. Actually it suites mm.h more since internal.h is mm/ only, but mm.h is exported to the whole kernel. With that we should expect another patch to use is_cow_mapping() whenever we can across the kernel since we do use it quite a lot but it's always done with raw code against VM_* flags. Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Gal Pressman <galpress@amazon.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Kirill Shutemov <kirill@shutemov.name> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Roland Scheidegger <sroland@vmware.com> Cc: VMware Graphics <linux-graphics-maintainer@vmware.com> Cc: Wei Zhang <wzam@amazon.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Huang Pei | f685a533a7 |
MIPS: make userspace mapping young by default
MIPS page fault path(except huge page) takes 3 exceptions (1 TLB Miss + 2 TLB Invalid), butthe second TLB Invalid exception is just triggered by __update_tlb from do_page_fault writing tlb without _PAGE_VALID set. With this patch, user space mapping prot is made young by default (with both _PAGE_VALID and _PAGE_YOUNG set), and it only take 1 TLB Miss + 1 TLB Invalid exception Remove pte_sw_mkyoung without polluting MM code and make page fault delay of MIPS on par with other architecture Link: https://lkml.kernel.org/r/20210204013942.8398-1-huangpei@loongson.cn Signed-off-by: Huang Pei <huangpei@loongson.cn> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: <huangpei@loongson.cn> Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: <ambrosehua@gmail.com> Cc: Bibo Mao <maobibo@loongson.cn> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: Paul Burton <paulburton@kernel.org> Cc: Li Xuefeng <lixuefeng@loongson.cn> Cc: Yang Tiezhu <yangtiezhu@loongson.cn> Cc: Gao Juxin <gaojuxin@loongson.cn> Cc: Fuxin Zhang <zhangfx@lemote.com> Cc: Huacai Chen <chenhc@lemote.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Mike Kravetz | 3272cfc252 |
hugetlb: fix copy_huge_page_from_user contig page struct assumption
page structs are not guaranteed to be contiguous for gigantic pages. The
routine copy_huge_page_from_user can encounter gigantic pages, yet it
assumes page structs are contiguous when copying pages from user space.
Since page structs for the target gigantic page are not contiguous, the
data copied from user space could overwrite other pages not associated
with the gigantic page and cause data corruption.
Non-contiguous page structs are generally not an issue. However, they can
exist with a specific kernel configuration and hotplug operations. For
example: Configure the kernel with CONFIG_SPARSEMEM and
!CONFIG_SPARSEMEM_VMEMMAP. Then, hotplug add memory for the area where
the gigantic page will be allocated.
Link: https://lkml.kernel.org/r/20210217184926.33567-2-mike.kravetz@oracle.com
Fixes:
|
|
Miaohe Lin | 8abb50c76b |
mm/memory.c: fix potential pte_unmap_unlock pte error
If all pte entry is none in 'non-create' case, we would break the loop with
pte unchanged. Then the wrong pte - 1 would be passed to pte_unmap_unlock.
This is a theoretical issue which may not be a real bug. So it's not worth
cc stable.
Link: https://lkml.kernel.org/r/20210205081925.59809-1-linmiaohe@huawei.com
Fixes:
|
|
Miaohe Lin | 90a3e375d3 |
mm/memory.c: fix potential pte_unmap_unlock pte error
Since commit |
|
Linus Torvalds | e913a8cdc2 |
Fixes around VM_FPNMAP and follow_pfn
- replace mm/frame_vector.c by get_user_pages in misc/habana and drm/exynos drivers, then move that into media as it's sole user - close race in generic_access_phys - s390 pci ioctl fix of this series landed in 5.11 already - properly revoke iomem mappings (/dev/mem, pci files) -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAmAzgywACgkQTA9ye/CY qnFPbA//RUHB5bD7vwnEglfJhonKSi/Vt3dNQwUI+pCFK8muWvvPyTkGXKjjT2dI uAOY2F23wymtIexV3fNLgnMez7kMcupOLkdxJic4GiO+HJn1jnkshdX7/dGtUW7O G3yfnf/D27i912tT3j6PN7dVnasAYYtndCgImM027Zigzn4ibY+02tnzd5XTj1F8 yq8Swx88oqF8v10HxfpF3RLShqT3S17mFmd9dTv0GkZX497Pe75O44XcXzkD33Bj wasH2Tz8gMEQx6TNAGlJe13dzDHReh2cG0z2r+6PTA6KnaMMxbEIImHNuhWOmHb/ nf8Jpu9uMOLzB+3hG3TzISTDBhAgPfoJ8Ov40VJCWMtCVBnyMyPJr28Oobb8Dj3V SXvjSVlLeobOLt+E9vAS+Rmas07LCGBdNP9sexxV7S/sveSQ5W+FptaQW03EghwA nBYEUC68WqpX99lJCFPmv5zmy5xkecjpU6mLHZljtV1ORzktqWZdVhmC8njHMAMY Hi/emnPxEX1FpOD38rr7F9KUUSsy4t/ZaCgVaLcxCcbglCHXSHC41R09p9TBRSJo G6Lksjyj4aa+UL5dZDAtLY0shg0bv2u93dGQNaDAC+uzj6D0ErBBzDK570zBKjp/ 75+nqezJlD0d7I6rOl6FwiEYeSrYXJxYEveKVUr8CnH6sfeBlwo= =lQoR -----END PGP SIGNATURE----- Merge tag 'topic/iomem-mmap-vs-gup-2021-02-22' of git://anongit.freedesktop.org/drm/drm Pull follow_pfn() updates from Daniel Vetter: "Fixes around VM_FPNMAP and follow_pfn: - replace mm/frame_vector.c by get_user_pages in misc/habana and drm/exynos drivers, then move that into media as it's sole user - close race in generic_access_phys - s390 pci ioctl fix of this series landed in 5.11 already - properly revoke iomem mappings (/dev/mem, pci files)" * tag 'topic/iomem-mmap-vs-gup-2021-02-22' of git://anongit.freedesktop.org/drm/drm: PCI: Revoke mappings like devmem PCI: Also set up legacy files only after sysfs init sysfs: Support zapping of binary attr mmaps resource: Move devmem revoke code to resource framework /dev/mem: Only set filp->f_mapping PCI: Obey iomem restrictions for procfs mmap mm: Close race in generic_access_phys media: videobuf2: Move frame_vector into media subsystem mm/frame-vector: Use FOLL_LONGTERM misc/habana: Use FOLL_LONGTERM for userptr misc/habana: Stop using frame_vector helpers drm/exynos: Use FOLL_LONGTERM for g2d cmdlists drm/exynos: Stop using frame_vector helpers |
|
Linus Torvalds | 3e10585335 |
x86:
- Support for userspace to emulate Xen hypercalls - Raise the maximum number of user memslots - Scalability improvements for the new MMU. Instead of the complex "fast page fault" logic that is used in mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent, but the code that can run against page faults is limited. Right now only page faults take the lock for reading; in the future this will be extended to some cases of page table destruction. I hope to switch the default MMU around 5.12-rc3 (some testing was delayed due to Chinese New Year). - Cleanups for MAXPHYADDR checks - Use static calls for vendor-specific callbacks - On AMD, use VMLOAD/VMSAVE to save and restore host state - Stop using deprecated jump label APIs - Workaround for AMD erratum that made nested virtualization unreliable - Support for LBR emulation in the guest - Support for communicating bus lock vmexits to userspace - Add support for SEV attestation command - Miscellaneous cleanups PPC: - Support for second data watchpoint on POWER10 - Remove some complex workarounds for buggy early versions of POWER9 - Guest entry/exit fixes ARM64 - Make the nVHE EL2 object relocatable - Cleanups for concurrent translation faults hitting the same page - Support for the standard TRNG hypervisor call - A bunch of small PMU/Debug fixes - Simplification of the early init hypercall handling Non-KVM changes (with acks): - Detection of contended rwlocks (implemented only for qrwlocks, because KVM only needs it for x86) - Allow __DISABLE_EXPORTS from assembly code - Provide a saner follow_pfn replacements for modules -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmApSRgUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOc7wf9FnlinKoTFaSk7oeuuhF/CoCVwSFs Z9+A2sNI99tWHQxFR6dyDkEFeQoXnqSxfLHtUVIdH/JnTg0FkEvFz3NK+0PzY1PF PnGNbSoyhP58mSBG4gbBAxdF3ZJZMB8GBgYPeR62PvMX2dYbcHqVBNhlf6W4MQK4 5mAUuAnbf19O5N267sND+sIg3wwJYwOZpRZB7PlwvfKAGKf18gdBz5dQ/6Ej+apf P7GODZITjqM5Iho7SDm/sYJlZprFZT81KqffwJQHWFMEcxFgwzrnYPx7J3gFwRTR eeh9E61eCBDyCTPpHROLuNTVBqrAioCqXLdKOtO5gKvZI3zmomvAsZ8uXQ== =uFZU -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "x86: - Support for userspace to emulate Xen hypercalls - Raise the maximum number of user memslots - Scalability improvements for the new MMU. Instead of the complex "fast page fault" logic that is used in mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent, but the code that can run against page faults is limited. Right now only page faults take the lock for reading; in the future this will be extended to some cases of page table destruction. I hope to switch the default MMU around 5.12-rc3 (some testing was delayed due to Chinese New Year). - Cleanups for MAXPHYADDR checks - Use static calls for vendor-specific callbacks - On AMD, use VMLOAD/VMSAVE to save and restore host state - Stop using deprecated jump label APIs - Workaround for AMD erratum that made nested virtualization unreliable - Support for LBR emulation in the guest - Support for communicating bus lock vmexits to userspace - Add support for SEV attestation command - Miscellaneous cleanups PPC: - Support for second data watchpoint on POWER10 - Remove some complex workarounds for buggy early versions of POWER9 - Guest entry/exit fixes ARM64: - Make the nVHE EL2 object relocatable - Cleanups for concurrent translation faults hitting the same page - Support for the standard TRNG hypervisor call - A bunch of small PMU/Debug fixes - Simplification of the early init hypercall handling Non-KVM changes (with acks): - Detection of contended rwlocks (implemented only for qrwlocks, because KVM only needs it for x86) - Allow __DISABLE_EXPORTS from assembly code - Provide a saner follow_pfn replacements for modules" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits) KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes KVM: selftests: Don't bother mapping GVA for Xen shinfo test KVM: selftests: Fix hex vs. decimal snafu in Xen test KVM: selftests: Fix size of memslots created by Xen tests KVM: selftests: Ignore recently added Xen tests' build output KVM: selftests: Add missing header file needed by xAPIC IPI tests KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static locking/arch: Move qrwlock.h include after qspinlock.h KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2 KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path KVM: PPC: remove unneeded semicolon KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest KVM: PPC: Book3S HV: Fix radix guest SLB side channel KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR ... |
|
Linus Torvalds | 99ca0edb41 |
arm64 updates for 5.12
- vDSO build improvements including support for building with BSD. - Cleanup to the AMU support code and initialisation rework to support cpufreq drivers built as modules. - Removal of synthetic frame record from exception stack when entering the kernel from EL0. - Add support for the TRNG firmware call introduced by Arm spec DEN0098. - Cleanup and refactoring across the board. - Avoid calling arch_get_random_seed_long() from add_interrupt_randomness() - Perf and PMU updates including support for Cortex-A78 and the v8.3 SPE extensions. - Significant steps along the road to leaving the MMU enabled during kexec relocation. - Faultaround changes to initialise prefaulted PTEs as 'old' when hardware access-flag updates are supported, which drastically improves vmscan performance. - CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55 (#1024718) - Preparatory work for yielding the vector unit at a finer granularity in the crypto code, which in turn will one day allow us to defer softirq processing when it is in use. - Support for overriding CPU ID register fields on the command-line. -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmAmwZcQHHdpbGxAa2Vy bmVsLm9yZwAKCRC3rHDchMFjNLA1B/0XMwWUhmJ4ZPK4sr28YWHNGLuCFHDgkMKU dEmS806OF9d0J7fTczGsKdS4IKtXWko67Z0UGiPIStwfm0itSW2Zgbo9KZeDPqPI fH0s23nQKxUMyNW7b9p4cTV3YuGVMZSBoMug2jU2DEDpSqeGBk09NPi6inERBCz/ qZxcqXTKxXbtOY56eJmq09UlFZiwfONubzuCrrUH7LU8ZBSInM/6Q4us/oVm4zYI Pnv996mtL4UxRqq/KoU9+cQ1zsI01kt9/coHwfCYvSpZEVAnTWtfECsJ690tr3mF TSKQLvOzxbDtU+HcbkNVKW0A38EIO1xXr8yXW9SJx6BJBkyb24xo =IwMb -----END PGP SIGNATURE----- Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 updates from Will Deacon: - vDSO build improvements including support for building with BSD. - Cleanup to the AMU support code and initialisation rework to support cpufreq drivers built as modules. - Removal of synthetic frame record from exception stack when entering the kernel from EL0. - Add support for the TRNG firmware call introduced by Arm spec DEN0098. - Cleanup and refactoring across the board. - Avoid calling arch_get_random_seed_long() from add_interrupt_randomness() - Perf and PMU updates including support for Cortex-A78 and the v8.3 SPE extensions. - Significant steps along the road to leaving the MMU enabled during kexec relocation. - Faultaround changes to initialise prefaulted PTEs as 'old' when hardware access-flag updates are supported, which drastically improves vmscan performance. - CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55 (#1024718) - Preparatory work for yielding the vector unit at a finer granularity in the crypto code, which in turn will one day allow us to defer softirq processing when it is in use. - Support for overriding CPU ID register fields on the command-line. * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits) drivers/perf: Replace spin_lock_irqsave to spin_lock mm: filemap: Fix microblaze build failure with 'mmu_defconfig' arm64: Make CPU_BIG_ENDIAN depend on ld.bfd or ld.lld 13.0.0+ arm64: cpufeatures: Allow disabling of Pointer Auth from the command-line arm64: Defer enabling pointer authentication on boot core arm64: cpufeatures: Allow disabling of BTI from the command-line arm64: Move "nokaslr" over to the early cpufeature infrastructure KVM: arm64: Document HVC_VHE_RESTART stub hypercall arm64: Make kvm-arm.mode={nvhe, protected} an alias of id_aa64mmfr1.vh=0 arm64: Add an aliasing facility for the idreg override arm64: Honor VHE being disabled from the command-line arm64: Allow ID_AA64MMFR1_EL1.VH to be overridden from the command line arm64: cpufeature: Add an early command-line cpufeature override facility arm64: Extract early FDT mapping from kaslr_early_init() arm64: cpufeature: Use IDreg override in __read_sysreg_by_encoding() arm64: cpufeature: Add global feature override facility arm64: Move SCTLR_EL1 initialisation to EL-agnostic code arm64: Simplify init_el2_state to be non-VHE only arm64: Move VHE-specific SPE setup to mutate_to_vhe() arm64: Drop early setting of MDSCR_EL2.TPMS ... |
|
Paolo Bonzini | 9fd6dad126 |
mm: provide a saner PTE walking API for modules
Currently, the follow_pfn function is exported for modules but follow_pte is not. However, follow_pfn is very easy to misuse, because it does not provide protections (so most of its callers assume the page is writable!) and because it returns after having already unlocked the page table lock. Provide instead a simplified version of follow_pte that does not have the pmdpp and range arguments. The older version survives as follow_invalidate_pte() for use by fs/dax.c. Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> |
|
Will Deacon | a72afd8730 |
tlb: mmu_gather: Remove start/end arguments from tlb_gather_mmu()
The 'start' and 'end' arguments to tlb_gather_mmu() are no longer needed now that there is a separate function for 'fullmm' flushing. Remove the unused arguments and update all callers. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Yu Zhao <yuzhao@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com |
|
Will Deacon | ae8eba8b5d |
tlb: mmu_gather: Remove unused start/end arguments from tlb_finish_mmu()
Since commit
|
|
Will Deacon | 9d3af4b448 |
mm: Pass 'address' to map to do_set_pte() and drop FAULT_FLAG_PREFAULT
Rather than modifying the 'address' field of the 'struct vm_fault' passed to do_set_pte(), leave that to identify the real faulting address and pass in the virtual address to be mapped by the new pte as a separate argument. This makes FAULT_FLAG_PREFAULT redundant, as a prefault entry can be identified simply by comparing the new address parameter with the faulting address, so remove the redundant flag at the same time. Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will@kernel.org> |
|
Will Deacon | 46bdb4277f |
mm: Allow architectures to request 'old' entries when prefaulting
Commit |
|
Kirill A. Shutemov | f9ce0be71d |
mm: Cleanup faultaround and finish_fault() codepaths
alloc_set_pte() has two users with different requirements: in the faultaround code, it called from an atomic context and PTE page table has to be preallocated. finish_fault() can sleep and allocate page table as needed. PTL locking rules are also strange, hard to follow and overkill for finish_fault(). Let's untangle the mess. alloc_set_pte() has gone now. All locking is explicit. The price is some code duplication to handle huge pages in faultaround path, but it should be fine, having overall improvement in readability. Link: https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [will: s/from from/from/ in comment; spotted by willy] Signed-off-by: Will Deacon <will@kernel.org> |
|
Daniel Vetter | 96667f8a43 |
mm: Close race in generic_access_phys
Way back it was a reasonable assumptions that iomem mappings never change the pfn range they point at. But this has changed: - gpu drivers dynamically manage their memory nowadays, invalidating ptes with unmap_mapping_range when buffers get moved - contiguous dma allocations have moved from dedicated carvetouts to cma regions. This means if we miss the unmap the pfn might contain pagecache or anon memory (well anything allocated with GFP_MOVEABLE) - even /dev/mem now invalidates mappings when the kernel requests that iomem region when CONFIG_IO_STRICT_DEVMEM is set, see |
|
Nicholas Piggin | 111fe7186b |
mm: generalise COW SMC TLB flushing race comment
I'm not sure if I'm completely missing something here, but AFAIKS the
reference to the mysterious "COW SMC race" confuses the issue. The
original changelog and mailing list thread didn't help me either.
This SMC race is where the problem was detected, but isn't the general
problem bigger and more obvious: that the new PTE could be picked up at
any time by any TLB while entries for the old PTE exist in other TLBs
before the TLB flush takes effect?
The case where the iTLB and dTLB of a CPU are pointing at different pages
is an interesting one but follows from the general problem.
The other (minor) thing with the comment I think it makes it a bit clearer
to say what the old code was doing (i.e., it avoids the race as opposed to
what?).
References:
|
|
Christoph Hellwig | ff5c19ed4b |
mm: simplify follow_pte{,pmd}
Merge __follow_pte_pmd, follow_pte_pmd and follow_pte into a single follow_pte function and just pass two additional NULL arguments for the two previous follow_pte callers. [sfr@canb.auug.org.au: merge fix for "s390/pci: remove races against pte updates"] Link: https://lkml.kernel.org/r/20201111221254.7f6a3658@canb.auug.org.au Link: https://lkml.kernel.org/r/20201029101432.47011-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Christoph Hellwig | 7336375734 |
mm: unexport follow_pte_pmd
Patch series "simplify follow_pte a bit". This small series drops the not needed follow_pte_pmd exports, and simplifies the follow_pte family of functions a bit. This patch (of 2): follow_pte_pmd() is only used by the DAX code, which can't be modular. Link: https://lkml.kernel.org/r/20201029101432.47011-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
John Hubbard | d3f5ffcacd |
mm: cleanup: remove unused tsk arg from __access_remote_vm
Despite a comment that said that page fault accounting would be charged to whatever task_struct* was passed into __access_remote_vm(), the tsk argument was actually unused. Making page fault accounting actually use this task struct is quite a project, so there is no point in keeping the tsk argument. Delete both the comment, and the argument. [rppt@linux.ibm.com: changelog addition] Link: https://lkml.kernel.org/r/20201026074137.4147787-1-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Jason Gunthorpe | 57efa1fe59 |
mm/gup: prevent gup_fast from racing with COW during fork
Since commit |
|
Christoph Hellwig | eeb4a05fce |
mm: allow a NULL fn callback in apply_to_page_range
Besides calling the callback on each page, apply_to_page_range also has the effect of pre-faulting all PTEs for the range. To support callers that only need the pre-faulting, make the callback optional. Based on a patch from Minchan Kim <minchan@kernel.org>. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Matthew Auld <matthew.auld@intel.com> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Matthew Wilcox (Oracle) | d01ac3c352 |
mm/memory: remove page fault assumption of compound page size
A compound page in the page cache will not necessarily be of PMD size, so check explicitly. [willy@infradead.org: fix remove page fault assumption of compound page size] Link: https://lkml.kernel.org/r/20201001152259.14932-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Huang Ying <ying.huang@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lkml.kernel.org/r/20200908195539.25896-3-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 5a32c3413d |
dma-mapping updates for 5.10
- rework the non-coherent DMA allocator - move private definitions out of <linux/dma-mapping.h> - lower CMA_ALIGNMENT (Paul Cercueil) - remove the omap1 dma address translation in favor of the common code - make dma-direct aware of multiple dma offset ranges (Jim Quinlan) - support per-node DMA CMA areas (Barry Song) - increase the default seg boundary limit (Nicolin Chen) - misc fixes (Robin Murphy, Thomas Tai, Xu Wang) - various cleanups -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAl+IiPwLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYPKEQ//TM8vxjucnRl/pklpMin49dJorwiVvROLhQqLmdxw 286ZKpVzYYAPc7LnNqwIBugnFZiXuHu8xPKQkIiOa2OtNDTwhKNoBxOAmOJaV6DD 8JfEtZYeX5mKJ/Nqd2iSkIqOvCwZ9Wzii+aytJ2U88wezQr1fnyF4X49MegETEey FHWreSaRWZKa0MMRu9AQ0QxmoNTHAQUNaPc0PeqEtPULybfkGOGw4/ghSB7WcKrA gtKTuooNOSpVEHkTas2TMpcBp6lxtOjFqKzVN0ml+/nqq5NeTSDx91VOCX/6Cj76 mXIg+s7fbACTk/BmkkwAkd0QEw4fo4tyD6Bep/5QNhvEoAriTuSRbhvLdOwFz0EF vhkF0Rer6umdhSK7nPd7SBqn8kAnP4vBbdmB68+nc3lmkqysLyE4VkgkdH/IYYQI 6TJ0oilXWFmU6DT5Rm4FBqCvfcEfU2dUIHJr5wZHqrF2kLzoZ+mpg42fADoG4GuI D/oOsz7soeaRe3eYfWybC0omGR6YYPozZJ9lsfftcElmwSsFrmPsbO1DM5IBkj1B gItmEbOB9ZK3RhIK55T/3u1UWY3Uc/RVr+kchWvADGrWnRQnW0kxYIqDgiOytLFi JZNH8uHpJIwzoJAv6XXSPyEUBwXTG+zK37Ce769HGbUEaUrE71MxBbQAQsK8mDpg 7fM= =Bkf/ -----END PGP SIGNATURE----- Merge tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping updates from Christoph Hellwig: - rework the non-coherent DMA allocator - move private definitions out of <linux/dma-mapping.h> - lower CMA_ALIGNMENT (Paul Cercueil) - remove the omap1 dma address translation in favor of the common code - make dma-direct aware of multiple dma offset ranges (Jim Quinlan) - support per-node DMA CMA areas (Barry Song) - increase the default seg boundary limit (Nicolin Chen) - misc fixes (Robin Murphy, Thomas Tai, Xu Wang) - various cleanups * tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits) ARM/ixp4xx: add a missing include of dma-map-ops.h dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling dma-direct: factor out a dma_direct_alloc_from_pool helper dma-direct check for highmem pages in dma_direct_alloc_pages dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-map-ops.h> dma-mapping: move large parts of <linux/dma-direct.h> to kernel/dma dma-mapping: move dma-debug.h to kernel/dma/ dma-mapping: remove <asm/dma-contiguous.h> dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h> dma-contiguous: remove dma_contiguous_set_default dma-contiguous: remove dev_set_cma_area dma-contiguous: remove dma_declare_contiguous dma-mapping: split <linux/dma-mapping.h> cma: decrease CMA_ALIGNMENT lower limit to 2 firewire-ohci: use dma_alloc_pages dma-iommu: implement ->alloc_noncoherent dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods dma-mapping: add a new dma_alloc_pages API dma-mapping: remove dma_cache_sync 53c700: convert to dma_alloc_noncoherent ... |
|
Peter Xu | c78f463649 |
mm: remove src/dst mm parameter in copy_page_range()
Both of the mm pointers are not needed after commit
|
|
Randy Dunlap | f1dc1685f6 |
mm/memory.c: fix spello of "function"
Fix typo/spello of "function". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lkml.kernel.org/r/e7bf180e-c558-b1d5-9a15-6d9708823c9c@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yanfei Xu | a7069ee3f8 |
mm/memory.c: replace vmf->vma with variable vma
The code has declared a vma_struct named vma which is assigned a value of vmf->vma. Thus, use variable vma directly here. Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Link: http://lkml.kernel.org/r/20200818084607.37616-1-yanfei.xu@windriver.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yanfei Xu | d383807aaf |
mm/memory.c: fix typo in __do_fault() comment
It's "pte_alloc_one", not "pte_alloc_pne". Let's fix that. Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20200818104339.5310-1-yanfei.xu@windriver.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | f3c64eda3e |
mm: avoid early COW write protect games during fork()
In commit |
|
Christoph Hellwig | a1fd09e8e6 |
dma-mapping: move dma-debug.h to kernel/dma/
Most of dma-debug.h is not required by anything outside of kernel/dma. Move the four declarations needed by dma-mappin.h or dma-ops providers into dma-mapping.h and dma-map-ops.h, and move the remainder of the file to kernel/dma/debug.h. Signed-off-by: Christoph Hellwig <hch@lst.de> |
|
Peter Xu | 70e806e4e6 |
mm: Do early cow for pinned pages during fork() for ptes
This allows copy_pte_range() to do early cow if the pages were pinned on the source mm. Currently we don't have an accurate way to know whether a page is pinned or not. The only thing we have is page_maybe_dma_pinned(). However that's good enough for now. Especially, with the newly added mm->has_pinned flag to make sure we won't affect processes that never pinned any pages. It would be easier if we can do GFP_KERNEL allocation within copy_one_pte(). Unluckily, we can't because we're with the page table locks held for both the parent and child processes. So the page allocation needs to be done outside copy_one_pte(). Some trick is there in copy_present_pte(), majorly the wrprotect trick to block concurrent fast-gup. Comments in the function should explain better in place. Oleg Nesterov reported a (probably harmless) bug during review that we didn't reset entry.val properly in copy_pte_range() so that potentially there's chance to call add_swap_count_continuation() multiple times on the same swp entry. However that should be harmless since even if it happens, the same function (add_swap_count_continuation()) will return directly noticing that there're enough space for the swp counter. So instead of a standalone stable patch, it is touched up in this patch directly. Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/ Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Peter Xu | 7a4830c380 |
mm/fork: Pass new vma pointer into copy_page_range()
This prepares for the future work to trigger early cow on pinned pages during fork(). No functional change intended. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | be068f2903 |
mm: fix misplaced unlock_page in do_wp_page()
Commit |
|
Linus Torvalds | 79a1971c5f |
mm: move the copy_one_pte() pte_present check into the caller
This completes the split of the non-present and present pte cases by moving the check for the source pte being present into the single caller, which also means that we clearly separate out the very different return value case for a non-present pte. The present pte case currently always succeeds. This is a pure code re-organization with no semantic change: the intent is to make it much easier to add a new return case to the present pte case for when we do early COW at page table copy time. This was split out from the previous commit simply to make it easy to visually see that there were no semantic changes from this code re-organization. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | df3a57d1f6 |
mm: split out the non-present case from copy_one_pte()
This is a purely mechanical split of the copy_one_pte() function. It's not immediately obvious when looking at the diff because of the indentation change, but the way to see what is going on in this commit is to use the "-w" flag to not show pure whitespace changes, and you see how the first part of copy_one_pte() is simply lifted out into a separate function. And since the non-present case is marked unlikely, don't make the new function be inlined. Not that gcc really seems to care, since it looks like it will inline it anyway due to the whole "single callsite for static function" logic. In fact, code generation with the function split is almost identical to before. But not marking it inline is the right thing to do. This is pure prep-work and cleanup for subsequent changes. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 7514c0362f |
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton: "19 patches. Subsystems affected by this patch series: MAINTAINERS, ipc, fork, checkpatch, lib, and mm (memcg, slub, pagemap, madvise, migration, hugetlb)" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: include/linux/log2.h: add missing () around n in roundup_pow_of_two() mm/khugepaged.c: fix khugepaged's request size in collapse_file mm/hugetlb: fix a race between hugetlb sysctl handlers mm/hugetlb: try preferred node first when alloc gigantic page from cma mm/migrate: preserve soft dirty in remove_migration_pte() mm/migrate: remove unnecessary is_zone_device_page() check mm/rmap: fixup copying of soft dirty and uffd ptes mm/migrate: fixup setting UFFD_WP flag mm: madvise: fix vma user-after-free checkpatch: fix the usage of capture group ( ... ) fork: adjust sysctl_max_threads definition to match prototype ipc: adjust proc_ipc_sem_dointvec definition to match prototype mm: track page table modifications in __apply_to_page_range() MAINTAINERS: IA64: mark Status as Odd Fixes only MAINTAINERS: add LLVM maintainers MAINTAINERS: update Cavium/Marvell entries mm: slub: fix conversion of freelist_corrupted() mm: memcg: fix memcg reclaim soft lockup memcg: fix use-after-free in uncharge_batch |
|
Joerg Roedel | e80d3909be |
mm: track page table modifications in __apply_to_page_range()
__apply_to_page_range() is also used to change and/or allocate page-table pages in the vmalloc area of the address space. Make sure these changes get synchronized to other page-tables in the system by calling arch_sync_kernel_mappings() when necessary. The impact appears limited to x86-32, where apply_to_page_range may miss updating the PMD. That leads to explosions in drivers like BUG: unable to handle page fault for address: fe036000 #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page *pde = 00000000 Oops: 0002 [#1] SMP CPU: 3 PID: 1300 Comm: gem_concurrent_ Not tainted 5.9.0-rc1+ #16 Hardware name: /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015 EIP: __execlists_context_alloc+0x132/0x2d0 [i915] Code: 31 d2 89 f0 e8 2f 55 02 00 89 45 e8 3d 00 f0 ff ff 0f 87 11 01 00 00 8b 4d e8 03 4b 30 b8 5a 5a 5a 5a ba 01 00 00 00 8d 79 04 <c7> 01 5a 5a 5a 5a c7 81 fc 0f 00 00 5a 5a 5a 5a 83 e7 fc 29 f9 81 EAX: 5a5a5a5a EBX: f60ca000 ECX: fe036000 EDX: 00000001 ESI: f43b7340 EDI: fe036004 EBP: f6389cb8 ESP: f6389c9c DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286 CR0: 80050033 CR2: fe036000 CR3: 2d361000 CR4: 001506d0 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: fffe0ff0 DR7: 00000400 Call Trace: execlists_context_alloc+0x10/0x20 [i915] intel_context_alloc_state+0x3f/0x70 [i915] __intel_context_do_pin+0x117/0x170 [i915] i915_gem_do_execbuffer+0xcc7/0x2500 [i915] i915_gem_execbuffer2_ioctl+0xcd/0x1f0 [i915] drm_ioctl_kernel+0x8f/0xd0 drm_ioctl+0x223/0x3d0 __ia32_sys_ioctl+0x1ab/0x760 __do_fast_syscall_32+0x3f/0x70 do_fast_syscall_32+0x29/0x60 do_SYSENTER_32+0x15/0x20 entry_SYSENTER_32+0x9f/0xf2 EIP: 0xb7f28559 Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76 EAX: ffffffda EBX: 00000005 ECX: c0406469 EDX: bf95556c ESI: b7e68000 EDI: c0406469 EBP: 00000005 ESP: bf9554d8 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296 Modules linked in: i915 x86_pkg_temp_thermal intel_powerclamp crc32_pclmul crc32c_intel intel_cstate intel_uncore intel_gtt drm_kms_helper intel_pch_thermal video button autofs4 i2c_i801 i2c_smbus fan CR2: 00000000fe036000 It looks like kasan, xen and i915 are vulnerable. Actual impact is "on thinkpad X60 in 5.9-rc1, screen starts blinking after 30-or-so minutes, and machine is unusable" [sfr@canb.auug.org.au: ARCH_PAGE_TABLE_SYNC_MASK needs vmalloc.h] Link: https://lkml.kernel.org/r/20200825172508.16800a4f@canb.auug.org.au [chris@chris-wilson.co.uk: changelog addition] [pavel@ucw.cz: changelog addition] Fixes: |
|
Linus Torvalds | b25d1dc947 |
Merge branch 'simplify-do_wp_page'
Merge emailed patches from Peter Xu:
"This is a small series that I picked up from Linus's suggestion to
simplify cow handling (and also make it more strict) by checking
against page refcounts rather than mapcounts.
This makes uffd-wp work again (verified by running upmapsort)"
Note: this is horrendously bad timing, and making this kind of
fundamental vm change after -rc3 is not at all how things should work.
The saving grace is that it really is a a nice simplification:
8 files changed, 29 insertions(+), 120 deletions(-)
The reason for the bad timing is that it turns out that commit
|
|
Peter Xu | 798a6b87ec |
mm: Add PGREUSE counter
This accounts for wp_page_reuse() case, where we reused a page for COW. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 09854ba94c |
mm: do_wp_page() simplification
How about we just make sure we're the only possible valid user fo the page before we bother to reuse it? Simplify, simplify, simplify. And get rid of the nasty serialization on the page lock at the same time. [peterx: add subject prefix] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Yang Shi | b7333b58f3 |
mm/memory.c: skip spurious TLB flush for retried page fault
Recently we found regression when running will_it_scale/page_fault3 test
on ARM64. Over 70% down for the multi processes cases and over 20% down
for the multi threads cases. It turns out the regression is caused by
commit
|
|
Linus Torvalds | 18737f4243 |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton: "Subsystems affected by this patch series: mm/hotfixes, lz4, exec, mailmap, mm/thp, autofs, sysctl, mm/kmemleak, mm/misc and lib" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits) virtio: pci: constify ioreadX() iomem argument (as in generic implementation) ntb: intel: constify ioreadX() iomem argument (as in generic implementation) rtl818x: constify ioreadX() iomem argument (as in generic implementation) iomap: constify ioreadX() iomem argument (as in generic implementation) sh: use generic strncpy() sh: clkfwk: remove r8/r16/r32 include/asm-generic/vmlinux.lds.h: align ro_after_init mm: annotate a data race in page_zonenum() mm/swap.c: annotate data races for lru_rotate_pvecs mm/rmap: annotate a data race at tlb_flush_batched mm/mempool: fix a data race in mempool_free() mm/list_lru: fix a data race in list_lru_count_one mm/memcontrol: fix a data race in scan count mm/page_counter: fix various data races at memsw mm/swapfile: fix and annotate various data races mm/filemap.c: fix a data race in filemap_fault() mm/swap_state: mark various intentional data races mm/page_io: mark various intentional data races mm/frontswap: mark various intentional data races mm/kmemleak: silence KCSAN splats in checksum ... |
|
Qian Cai | a449bf58e4 |
mm/swapfile: fix and annotate various data races
swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could be accessed concurrently separately as noticed by KCSAN, === si.highest_bit === write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24: swap_range_alloc+0x81/0x130 swap_range_alloc at mm/swapfile.c:681 scan_swap_map_slots+0x371/0xb90 get_swap_pages+0x39d/0x5c0 get_swap_page+0xf2/0x524 add_to_swap+0xe4/0x1c0 shrink_page_list+0x1795/0x2870 shrink_inactive_list+0x316/0x880 shrink_lruvec+0x8dc/0x1380 shrink_node+0x317/0xd80 do_try_to_free_pages+0x1f7/0xa10 try_to_free_pages+0x26c/0x5e0 __alloc_pages_slowpath+0x458/0x1290 read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70: scan_swap_map_slots+0x4a6/0xb90 scan_swap_map_slots at mm/swapfile.c:892 get_swap_pages+0x39d/0x5c0 get_swap_page+0xf2/0x524 add_to_swap+0xe4/0x1c0 shrink_page_list+0x1795/0x2870 shrink_inactive_list+0x316/0x880 shrink_lruvec+0x8dc/0x1380 shrink_node+0x317/0xd80 do_try_to_free_pages+0x1f7/0xa10 try_to_free_pages+0x26c/0x5e0 __alloc_pages_slowpath+0x458/0x1290 Reported by Kernel Concurrency Sanitizer on: CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 === si.swap_map[offset] === write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86: __swap_entry_free_locked+0x8c/0x100 __swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4) __swap_entry_free.constprop.20+0x69/0xb0 free_swap_and_cache+0x53/0xa0 unmap_page_range+0x7f8/0x1d70 unmap_single_vma+0xcd/0x170 unmap_vmas+0x18b/0x220 exit_mmap+0xee/0x220 mmput+0x10e/0x270 do_exit+0x59b/0xf40 do_group_exit+0x8b/0x180 read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20: _swap_info_get+0x81/0xa0 _swap_info_get at mm/swapfile.c:1140 free_swap_and_cache+0x40/0xa0 unmap_page_range+0x7f8/0x1d70 unmap_single_vma+0xcd/0x170 unmap_vmas+0x18b/0x220 exit_mmap+0xee/0x220 mmput+0x10e/0x270 do_exit+0x59b/0xf40 do_group_exit+0x8b/0x180 === si.flags === write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23: scan_swap_map_slots+0x6fe/0xb50 scan_swap_map_slots at mm/swapfile.c:887 get_swap_pages+0x39d/0x5c0 get_swap_page+0x377/0x524 add_to_swap+0xe4/0x1c0 shrink_page_list+0x1795/0x2870 shrink_inactive_list+0x316/0x880 shrink_lruvec+0x8dc/0x1380 shrink_node+0x317/0xd80 do_try_to_free_pages+0x1f7/0xa10 try_to_free_pages+0x26c/0x5e0 __alloc_pages_slowpath+0x458/0x1290 read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63: _swap_info_get+0x41/0xa0 __swap_info_get at mm/swapfile.c:1114 put_swap_page+0x84/0x490 __remove_mapping+0x384/0x5f0 shrink_page_list+0xff1/0x2870 shrink_inactive_list+0x316/0x880 shrink_lruvec+0x8dc/0x1380 shrink_node+0x317/0xd80 do_try_to_free_pages+0x1f7/0xa10 try_to_free_pages+0x26c/0x5e0 __alloc_pages_slowpath+0x458/0x1290 The writes are under si->lock but the reads are not. For si.highest_bit and si.swap_map[offset], data race could trigger logic bugs, so fix them by having WRITE_ONCE() for the writes and READ_ONCE() for the reads except those isolated reads where they compare against zero which a data race would cause no harm. Thus, annotate them as intentional data races using the data_race() macro. For si.flags, the readers are only interested in a single bit where a data race there would cause no issue there. [cai@lca.pw: add a missing annotation for si->flags in memory.c] Link: http://lkml.kernel.org/r/1581612647-5958-1-git-send-email-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Hugh Dickins <hughd@google.com> Link: http://lkml.kernel.org/r/1581095163-12198-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Linus Torvalds | 5848dc5b1b |
dma-debug: remove debug_dma_assert_idle() function
This remoes the code from the COW path to call debug_dma_assert_idle(), which was added many years ago. Google shows that it hasn't caught anything in the 6+ years we've had it apart from a false positive, and Hugh just noticed how it had a very unfortunate spinlock serialization in the COW path. He fixed that issue the previous commit (a85ffd59bd36: "dma-debug: fix debug_dma_assert_idle(), use rcu_read_lock()"), but let's see if anybody even notices when we remove this function entirely. NOTE! We keep the dma tracking infrastructure that was added by the commit that introduced it. Partly to make it easier to resurrect this debug code if we ever deside to, and partly because that tracking by pfn and offset looks quite reasonable. The problem with this debug code was simply that it was expensive and didn't seem worth it, not that it was wrong per se. Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Peter Xu | 64019a2e46 |
mm/gup: remove task_struct pointer for all gup code
After the cleanup of page fault accounting, gup does not need to pass task_struct around any more. Remove that parameter in the whole gup stack. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Peter Xu | a2beb5f1ef |
mm: clean up the last pieces of page fault accountings
Here're the last pieces of page fault accounting that were still done outside handle_mm_fault() where we still have regs==NULL when calling handle_mm_fault(): arch/powerpc/mm/copro_fault.c: copro_handle_mm_fault arch/sparc/mm/fault_32.c: force_user_fault arch/um/kernel/trap.c: handle_page_fault mm/gup.c: faultin_page fixup_user_fault mm/hmm.c: hmm_vma_fault mm/ksm.c: break_ksm Some of them has the issue of duplicated accounting for page fault retries. Some of them didn't do the accounting at all. This patch cleans all these up by letting handle_mm_fault() to do per-task page fault accounting even if regs==NULL (though we'll still skip the perf event accountings). With that, we can safely remove all the outliers now. There's another functional change in that now we account the page faults to the caller of gup, rather than the task_struct that passed into the gup code. More information of this can be found at [1]. After this patch, below things should never be touched again outside handle_mm_fault(): - task_struct.[maj|min]_flt - PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN] [1] https://lore.kernel.org/lkml/CAHk-=wj_V2Tps2QrMn20_W0OJF9xqNh52XSGA42s-ZJ8Y+GyKw@mail.gmail.com/ Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Cain <bcain@codeaurora.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Chris Zankel <chris@zankel.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Guo Ren <guoren@kernel.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Ley Foon Tan <ley.foon.tan@intel.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Nick Hu <nickhu@andestech.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Stafford Horne <shorne@gmail.com> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Link: http://lkml.kernel.org/r/20200707225021.200906-25-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Peter Xu | bce617edec |
mm: do page fault accounting in handle_mm_fault
Patch series "mm: Page fault accounting cleanups", v5.
This is v5 of the pf accounting cleanup series. It originates from Gerald
Schaefer's report on an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit
|
|
Randy Dunlap | a1a0aea592 |
mm/memory.c: delete duplicated words
Drop the repeated word "to" in two places. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Link: http://lkml.kernel.org/r/20200801173822.14973-7-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | aae466b005 |
mm/swap: implement workingset detection for anonymous LRU
This patch implements workingset detection for anonymous LRU. All the infrastructure is implemented by the previous patches so this patch just activates the workingset detection by installing/retrieving the shadow entry and adding refault calculation. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-6-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
Joonsoo Kim | b518154e59 |
mm/vmscan: protect the workingset on anonymous LRU
In current implementation, newly created or swap-in anonymous page is started on active list. Growing active list results in rebalancing active/inactive list so old pages on active list are demoted to inactive list. Hence, the page on active list isn't protected at all. Following is an example of this situation. Assume that 50 hot pages on active list. Numbers denote the number of pages on active/inactive list (active | inactive). 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(uo) | 50(h) 3. workload: another 50 newly created (used-once) pages 50(uo) | 50(uo), swap-out 50(h) This patch tries to fix this issue. Like as file LRU, newly created or swap-in anonymous pages will be inserted to the inactive list. They are promoted to active list if enough reference happens. This simple modification changes the above example as following. 1. 50 hot pages on active list 50(h) | 0 2. workload: 50 newly created (used-once) pages 50(h) | 50(uo) 3. workload: another 50 newly created (used-once) pages 50(h) | 50(uo), swap-out 50(uo) As you can see, hot pages on active list would be protected. Note that, this implementation has a drawback that the page cannot be promoted and will be swapped-out if re-access interval is greater than the size of inactive list but less than the size of total(active+inactive). To solve this potential issue, following patch will apply workingset detection similar to the one that's already applied to file LRU. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |