commit 97308f8b0d867e9ef59528cd97f0db55ffdf5651 upstream
In iomap_dio_rw, when iomap_apply returns an -EFAULT error and the
IOMAP_DIO_PARTIAL flag is set, complete the request synchronously and
return a partial result. This allows the caller to deal with the page
fault and retry the remainder of the request.
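
A minimal userspace sketch of the behaviour described above, not the kernel code. All names (model_buf, model_apply, model_dio_rw, MODEL_DIO_PARTIAL) are hypothetical, and the requirement that some bytes were already transferred before a partial result is returned is an assumption of this model:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

#define MODEL_DIO_PARTIAL 0x1u	/* hypothetical stand-in for IOMAP_DIO_PARTIAL */

/* Hypothetical model: bytes [0, resident) of the user buffer are resident;
 * touching anything beyond that would need a page fault. */
struct model_buf {
	size_t len;
	size_t resident;
};

/* Transfer up to `chunk` bytes at `pos`; -EFAULT if `pos` is not resident. */
static ssize_t model_apply(const struct model_buf *b, size_t pos, size_t chunk)
{
	size_t n;

	if (pos >= b->resident)
		return -EFAULT;
	n = b->resident - pos;
	return (ssize_t)(n < chunk ? n : chunk);
}

/* Returns bytes transferred, or a negative errno. */
static ssize_t model_dio_rw(const struct model_buf *b, size_t chunk,
			    unsigned int flags)
{
	size_t done = 0;
	ssize_t ret = 0;

	while (done < b->len) {
		size_t want = b->len - done < chunk ? b->len - done : chunk;

		ret = model_apply(b, done, want);
		if (ret <= 0)
			break;
		done += (size_t)ret;
		ret = 0;
	}
	/* With the PARTIAL flag, report what was done instead of the error. */
	if (ret == -EFAULT && done > 0 && (flags & MODEL_DIO_PARTIAL))
		ret = 0;
	return ret < 0 ? ret : (ssize_t)done;
}
```

In this model the caller sees a short count, can fault in the next part of the buffer, and then call again for the remainder.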
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ea7a578588)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5de5ec3e55f18305f76e78ab4c72ce3ee1272522
commit 42c498c18a94eed79896c50871889af52fa0822e upstream
When a user copy fails in one of the helpers of iomap_dio_rw, fail with
-EFAULT instead of returning 0. This matches what iomap_dio_bio_actor
returns when it gets an -EFAULT from bio_iov_iter_get_pages. With these
changes, iomap_dio_actor now consistently fails with -EFAULT when a user
page cannot be faulted in.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a00cc46f97)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib80793f127fe9e0a72195802fcb5a865834c05fc
commit 00bfe02f479688a67a29019d1228f1470e26f014 upstream
In the .read_iter and .write_iter file operations, we're accessing
user-space memory while holding the inode glock. There is a possibility
that the memory is mapped to the same file, in which case we'd recurse
on the same glock.
We could detect and work around this simple case of recursive locking,
but more complex scenarios exist that involve multiple glocks,
processes, and cluster nodes, and working around all of those cases
isn't practical or even possible.
Avoid these kinds of problems by disabling page faults while holding the
inode glock. If a page fault would occur, we either end up with a
partial read or write or with -EFAULT if nothing could be read or
written. In either case, we know that we're not done with the
operation, so we indicate that we're willing to give up the inode glock
and then we fault in the missing pages. If that made us lose the inode
glock, we return a partial read or write. Otherwise, we resume the
operation.
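
The retry scheme above can be sketched as a small userspace model. This is an illustration only: model_req, copy_nofault, fault_in_may_lose_lock and model_rw are hypothetical names, and the "lock lost" event is simulated by a counter rather than by real glock state:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical model of one read or write request. */
struct model_req {
	size_t len;		/* total bytes requested */
	size_t resident;	/* bytes of the user buffer currently resident */
	size_t fault_step;	/* bytes made resident per fault-in round */
	int lose_lock_round;	/* lock is lost in this round (0 = never) */
};

/* Copy with page faults "disabled": stops at the first non-resident byte. */
static ssize_t copy_nofault(const struct model_req *r, size_t done)
{
	size_t avail = r->resident < r->len ? r->resident : r->len;

	if (done >= avail)
		return -EFAULT;
	return (ssize_t)(avail - done);
}

/* Fault in more of the buffer while the lock may be taken away.
 * Returns true if the lock is still held afterwards. */
static bool fault_in_may_lose_lock(struct model_req *r, int round)
{
	r->resident += r->fault_step;
	if (r->resident > r->len)
		r->resident = r->len;
	return r->lose_lock_round == 0 || round != r->lose_lock_round;
}

static ssize_t model_rw(struct model_req *r)
{
	size_t done = 0;
	int round = 0;

	for (;;) {
		ssize_t n = copy_nofault(r, done);	/* lock held, faults off */

		if (n > 0)
			done += (size_t)n;
		if (done == r->len)
			return (ssize_t)done;		/* finished */
		if (r->fault_step == 0)			/* page cannot be faulted in */
			return done ? (ssize_t)done : -EFAULT;
		if (!fault_in_may_lose_lock(r, ++round))
			return (ssize_t)done;		/* lock lost: partial result */
		/* lock still held: resume where the copy stopped */
	}
}
```

What happens when the lock is lost before any byte was transferred is not described above. The model simply returns 0 in that case.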
This locking problem was originally reported by Jan Kara. Linus came up
with the idea of disabling page faults. Many thanks to Al Viro and
Matthew Wilcox for their feedback.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 81a7fc397a)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia00865e0cd79d269d8b6be204a529eb97d0a632e
commit 1b223f7065bc7d89c4677c27381817cc95b117a8 upstream
Now that gfs2_file_buffered_write is the only remaining user of
ip->i_gh, we can move the glock holder to the stack (or rather, use the
one we already have on the stack); there is no need for keeping the
holder in the inode anymore.
This is slightly complicated by the fact that we're using ip->i_gh for
the statfs inode in gfs2_file_buffered_write as well. Writing to the
statfs inode isn't very common, so allocate the statfs holder
dynamically when needed.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 38b5849881)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Icf91ad1960e8dec76e009059fd8ead472920971e
commit b924bdab7445946e2ed364a0e6e249d36f1f1158 upstream
So far, for buffered writes, we were taking the inode glock in
gfs2_iomap_begin and dropping it in gfs2_iomap_end with the intention of
not holding the inode glock while iomap_write_actor faults in user
pages. It turns out that iomap_write_actor is called inside iomap_begin
... iomap_end, so the user pages were still faulted in while holding the
inode glock and the locking code in iomap_begin / iomap_end was
completely pointless.
Move the locking into gfs2_file_buffered_write instead. We'll take care
of the potential deadlocks due to faulting in user pages while holding a
glock in a subsequent patch.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8d363d8173)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I4ab7278f9255974aa5b4ae0d9dc3157909145fff
commit dc732906c2450939c319fec6e258aa89ecb5a632 upstream
This patch introduces a new HIF_MAY_DEMOTE flag and infrastructure that
will allow glocks to be demoted automatically on locking conflicts.
When a locking request comes in that isn't compatible with the locking
state of an active holder and that holder has the HIF_MAY_DEMOTE flag
set, the holder will be demoted before the incoming locking request is
granted.
Note that this mechanism demotes active holders (with the HIF_HOLDER
flag set), while before we were only demoting glocks without any active
holders. This allows processes to keep hold of locks that may form a
cyclic locking dependency; the core glock logic will then break those
dependencies in case a conflicting locking request occurs. We'll use
this to avoid giving up the inode glock proactively before faulting in
pages.
Processes that allow a glock holder to be taken away indicate this by
calling gfs2_holder_allow_demote(), which sets the HIF_MAY_DEMOTE flag.
Later, they call gfs2_holder_disallow_demote() to clear the flag again,
and then they check if their holder is still queued: if it is, they are
still holding the glock; if it isn't, they can re-acquire the glock (or
abort).
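
A rough userspace model of the holder protocol described above. It is not the glock implementation: the MODEL_* masks, struct model_holder and the model_* functions are hypothetical, and the bit values are arbitrary:

```c
#include <stdbool.h>

#define MODEL_HIF_HOLDER	0x1u	/* arbitrary bit values for this model */
#define MODEL_HIF_MAY_DEMOTE	0x2u

struct model_holder {
	unsigned int flags;
	bool queued;		/* still on the lock's holder list */
};

static void model_allow_demote(struct model_holder *gh)
{
	gh->flags |= MODEL_HIF_MAY_DEMOTE;
}

static void model_disallow_demote(struct model_holder *gh)
{
	gh->flags &= ~MODEL_HIF_MAY_DEMOTE;
}

static bool model_holder_queued(const struct model_holder *gh)
{
	return gh->queued;
}

/* A conflicting request arrives. Returns true if it can be granted. */
static bool model_conflicting_request(struct model_holder *gh)
{
	if (!(gh->flags & MODEL_HIF_HOLDER))
		return true;		/* no active holder in the way */
	if (!(gh->flags & MODEL_HIF_MAY_DEMOTE))
		return false;		/* active holder keeps the lock */
	/* The holder agreed to be demoted: take the lock away from it. */
	gh->flags &= ~(MODEL_HIF_HOLDER | MODEL_HIF_MAY_DEMOTE);
	gh->queued = false;
	return true;
}
```

A caller in this model brackets the risky section with model_allow_demote() and model_disallow_demote(), then checks model_holder_queued() to decide whether it must re-acquire the lock.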
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 416a705304)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ief7088f69c59f2bbe73867846ee53c0735e514bc
commit 6144464937fe1e6135b13a30502a339d549bf093 upstream
Pass the first current glock holder into function may_grant and
deobfuscate the logic there.
While at it, switch from BUG_ON to GLOCK_BUG_ON in may_grant. To make
that build cleanly, de-constify the may_grant arguments.
We're now using function find_first_holder in do_promote, so move the
function's definition above do_promote.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b25cfbc0e7)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ib02a752cdf8da470494fa06782bc92ca3f047a6f
commit 2eb7509a05443048fb4df60b782de3f03c6c298b upstream
Add a wrapper around iomap_file_buffered_write. We'll add code for when
the operation needs to be retried here later.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b88b998579)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I144fa4b02e35bb8a1c9ac1f1f0ed149c5cdf91b6
commit a6294593e8a1290091d0b078d5d33da5e0cd3dfe upstream
Turn iov_iter_fault_in_readable into a function that returns the number
of bytes not faulted in, similar to copy_to_user, instead of returning a
non-zero value when any of the requested pages couldn't be faulted in.
This supports the existing users that require all pages to be faulted in
as well as new users that are happy if any pages can be faulted in.
Rename iov_iter_fault_in_readable to fault_in_iov_iter_readable to make
sure this change doesn't silently break things.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 30e66b1dfc)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Idc687713cf0585664c6eae5a075b3b067ba76b5e
commit bb523b406c849eef8f265a07cd7f320f1f177743 upstream
Turn fault_in_pages_{readable,writeable} into versions that return the
number of bytes not faulted in, similar to copy_to_user, instead of
returning a non-zero value when any of the requested pages couldn't be
faulted in. This supports the existing users that require all pages to
be faulted in as well as new users that are happy if any pages can be
faulted in.
Rename the functions to fault_in_{readable,writeable} to make sure
this change doesn't silently break things.
Neither of these functions is entirely trivial and it doesn't seem
useful to inline them, so move them to mm/gup.c.
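
To illustrate the new return convention, here is a small userspace model, not the mm/gup.c code. model_fault_in, need_all and need_any are hypothetical names, and the "accessible" limit stands in for whatever makes a page impossible to fault in:

```c
#include <stddef.h>

/* Hypothetical model: addresses below `accessible_end` can be faulted in.
 * Returns the number of bytes NOT faulted in, like copy_to_user. */
static size_t model_fault_in(size_t start, size_t size, size_t accessible_end)
{
	size_t ok;

	if (start >= accessible_end)
		return size;		/* nothing could be faulted in */
	ok = accessible_end - start;
	return ok >= size ? 0 : size - ok;
}

/* Existing-style caller: every requested byte must be faulted in. */
static int need_all(size_t start, size_t size, size_t accessible_end)
{
	return model_fault_in(start, size, accessible_end) == 0;
}

/* New-style caller: any progress at all is good enough. */
static int need_any(size_t start, size_t size, size_t accessible_end)
{
	return model_fault_in(start, size, accessible_end) != size;
}
```

With a single return value both kinds of caller are served: strict callers compare against 0, lenient callers compare against the requested size.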
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 923f05a660)
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: If7387298a868e4e72244bd9c123e5ef9cf833629
Merge 5.15.41 into android13-5.15
Changes in 5.15.41
batman-adv: Don't skb_split skbuffs with frag_list
iwlwifi: iwl-dbg: Use del_timer_sync() before freeing
hwmon: (tmp401) Add OF device ID table
mac80211: Reset MBSSID parameters upon connection
net: Fix features skip in for_each_netdev_feature()
net: mscc: ocelot: fix last VCAP IS1/IS2 filter persisting in hardware when deleted
net: mscc: ocelot: fix VCAP IS2 filters matching on both lookups
net: mscc: ocelot: restrict tc-trap actions to VCAP IS2 lookup 0
net: mscc: ocelot: avoid corrupting hardware counters when moving VCAP filters
fbdev: simplefb: Cleanup fb_info in .fb_destroy rather than .remove
fbdev: efifb: Cleanup fb_info in .fb_destroy rather than .remove
fbdev: vesafb: Cleanup fb_info in .fb_destroy rather than .remove
platform/surface: aggregator: Fix initialization order when compiling as builtin module
ice: Fix race during aux device (un)plugging
ice: fix PTP stale Tx timestamps cleanup
ipv4: drop dst in multicast routing path
drm/nouveau: Fix a potential theorical leak in nouveau_get_backlight_name()
netlink: do not reset transport header in netlink_recvmsg()
net: chelsio: cxgb4: Avoid potential negative array offset
fbdev: efifb: Fix a use-after-free due early fb_info cleanup
sfc: Use swap() instead of open coding it
net: sfc: fix memory leak due to ptp channel
mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
nfs: fix broken handling of the softreval mount option
ionic: fix missing pci_release_regions() on error in ionic_probe()
dim: initialize all struct fields
hwmon: (ltq-cputemp) restrict it to SOC_XWAY
procfs: prevent unprivileged processes accessing fdinfo dir
selftests: vm: Makefile: rename TARGETS to VMTARGETS
arm64: vdso: fix makefile dependency on vdso.so
virtio: fix virtio transitional ids
s390/ctcm: fix variable dereferenced before check
s390/ctcm: fix potential memory leak
s390/lcs: fix variable dereferenced before check
net/sched: act_pedit: really ensure the skb is writable
net: ethernet: mediatek: ppe: fix wrong size passed to memset()
net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
drm/vc4: hdmi: Fix build error for implicit function declaration
net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
tls: Fix context leak on tls_device_down
drm/vmwgfx: Fix fencing on SVGAv3
gfs2: Fix filesystem block deallocation for short writes
hwmon: (f71882fg) Fix negative temperature
RDMA/irdma: Fix deadlock in irdma_cleanup_cm_core()
iommu: arm-smmu: disable large page mappings for Nvidia arm-smmu
ASoC: max98090: Reject invalid values in custom control put()
ASoC: max98090: Generate notifications on changes for custom control
ASoC: ops: Validate input values in snd_soc_put_volsw_range()
s390: disable -Warray-bounds
ASoC: SOF: Fix NULL pointer exception in sof_pci_probe callback
net: emaclite: Don't advertise 1000BASE-T and do auto negotiation
net: sfp: Add tx-fault workaround for Huawei MA5671A SFP ONT
secure_seq: use the 64 bits of the siphash for port offset calculation
tcp: use different parts of the port_offset for index and offset
tcp: resalt the secret every 10 seconds
tcp: add small random increments to the source port
tcp: dynamically allocate the perturb table used by source ports
tcp: increase source port perturb table to 2^16
tcp: drop the hash_32() part from the index calculation
interconnect: Restore sync state by ignoring ipa-virt in provider count
firmware_loader: use kernel credentials when reading firmware
KVM: PPC: Book3S PR: Enable MSR_DR for switch_mmu_context()
usb: xhci-mtk: fix fs isoc's transfer error
x86/mm: Fix marking of unused sub-pmd ranges
tty/serial: digicolor: fix possible null-ptr-deref in digicolor_uart_probe()
tty: n_gsm: fix buffer over-read in gsm_dlci_data()
tty: n_gsm: fix mux activation issues in gsm_config()
usb: cdc-wdm: fix reading stuck on device close
usb: typec: tcpci: Don't skip cleanup in .remove() on error
usb: typec: tcpci_mt6360: Update for BMC PHY setting
USB: serial: pl2303: add device id for HP LM930 Display
USB: serial: qcserial: add support for Sierra Wireless EM7590
USB: serial: option: add Fibocom L610 modem
USB: serial: option: add Fibocom MA510 modem
slimbus: qcom: Fix IRQ check in qcom_slim_probe
fsl_lpuart: Don't enable interrupts too early
serial: 8250_mtk: Fix UART_EFR register address
serial: 8250_mtk: Fix register address for XON/XOFF character
ceph: fix setting of xattrs on async created inodes
Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
mm/huge_memory: do not overkill when splitting huge_zero_page
drm/vmwgfx: Disable command buffers on svga3 without gbobjects
drm/nouveau/tegra: Stop using iommu_present()
i40e: i40e_main: fix a missing check on list iterator
net: atlantic: always deep reset on pm op, fixing up my null deref regression
net: phy: Fix race condition on link status change
writeback: Avoid skipping inode writeback
cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
net: phy: micrel: Do not use kszphy_suspend/resume for KSZ8061
net: phy: micrel: Pass .probe for KS8737
SUNRPC: Ensure that the gssproxy client can start in a connected state
drm/vmwgfx: Initialize drm_mode_fb_cmd2
Revert "drm/amd/pm: keep the BACO feature enabled for suspend"
dma-buf: call dma_buf_stats_setup after dmabuf is in valid list
mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
SUNRPC: Ensure we flush any closed sockets before xs_xprt_free()
ping: fix address binding wrt vrf
usb: gadget: uvc: rename function to be more consistent
usb: gadget: uvc: allow for application to cleanly shutdown
Linux 5.15.41
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia65cdbbddf553237d6a3a38efb9bcb2fcc3990ec
commit f00432063db1a0db484e85193eccc6845435b80e upstream.
We must ensure that all sockets are closed before we call xprt_free()
and release the reference to the net namespace. The problem is that
calling fput() will defer closing the socket until delayed_fput() gets
called.
Let's fix the situation by allowing rpciod and the transport teardown
code (which runs on the system wq) to call __fput_sync(), and directly
close the socket.
Reported-by: Felix Fu <foyjog@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Fixes: a73881c96d ("SUNRPC: Fix an Oops in udp_poll()")
Cc: stable@vger.kernel.org # 5.1.x: 3be232f11a3c: SUNRPC: Prevent immediate close+reconnect
Cc: stable@vger.kernel.org # 5.1.x: 89f42494f92f: SUNRPC: Don't call connect() more than once on a TCP socket
Cc: stable@vger.kernel.org # 5.1.x
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Meena Shanmugam <meenashanmugam@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 846a3351ddfe4a86eede4bb26a205c3f38ef84d3 upstream.
We have run into an issue where a task gets stuck in
balance_dirty_pages_ratelimited() when performing I/O stress testing.
The reason we observed is that an I_DIRTY_PAGES inode with lots
of dirty pages is in the b_dirty_time list and standard background
writeback cannot write back the inode.
After studying the relevant code, the following scenario may lead
to the issue:
task1                                   task2
-----                                   -----
fuse_flush
 write_inode_now //in b_dirty_time
  writeback_single_inode
   __writeback_single_inode
                                        fuse_write_end
                                         filemap_dirty_folio
                                          __xa_set_mark:PAGECACHE_TAG_DIRTY
    lock inode->i_lock
    if mapping tagged PAGECACHE_TAG_DIRTY
    inode->i_state |= I_DIRTY_PAGES
    unlock inode->i_lock
                                           __mark_inode_dirty:I_DIRTY_PAGES
                                              lock inode->i_lock
                                              -was dirty,inode stays in
                                              -b_dirty_time
                                              unlock inode->i_lock
   if(!(inode->i_state & I_DIRTY_ALL))
      -not true,so nothing done
This patch moves the dirty inode to b_dirty list when the inode
currently is not queued in b_io or b_more_io list at the end of
writeback_single_inode.
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
CC: stable@vger.kernel.org
Fixes: 0ae45f63d4 ("vfs: add support for a lazytime mount option")
Signed-off-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220510023514.27399-1-jing.xia@unisoc.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 620239d9a32e9fe27c9204ec11e40058671aeeb6 upstream.
Currently when we create a file, we spin up an xattr buffer to send
along with the create request. If we end up doing an async create,
however, we currently pass down a zero-length xattr buffer.
Fix the code to send down the xattr buffer in req->r_pagelist. If the
xattrs span more than a page, however, give up and don't try to do an
async create.
Cc: stable@vger.kernel.org
URL: https://bugzilla.redhat.com/show_bug.cgi?id=2063929
Fixes: 9a8d03ca2e ("ceph: attempt to do async create when possible")
Reported-by: John Fortin <fortinj66@gmail.com>
Reported-by: Sri Ramanujam <sri@ramanujam.io>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d031a8866e709c9d1ee5537a321b6192b4d2dc5b ]
When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.
To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the next block boundary
to arrive at the end of the allocation. It would have to round up, but
the end of the allocation is also available as iomap->offset +
iomap->length, so just use that instead.
In addition, use round_up() for computing the start of the unused range.
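
The arithmetic can be checked with a short userspace sketch. This is not the gfs2 code: rnd_up and rnd_down are local helpers that give the same results as the kernel's round_up()/round_down() for power-of-two block sizes, and old_end, unused_range and struct model_range are hypothetical names:

```c
#include <stdint.h>

/* Power-of-two rounding helpers (`a` must be a power of two). */
static uint64_t rnd_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

static uint64_t rnd_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

struct model_range {
	uint64_t start;
	uint64_t end;		/* exclusive */
};

/* Old end computation: end of the attempted write, rounded DOWN. */
static uint64_t old_end(uint64_t pos, uint64_t attempted, uint64_t blksize)
{
	return rnd_down(pos + attempted, blksize);
}

/* Fixed computation: the unused range runs from the first block boundary
 * at or after the last written byte to the end of the mapping. */
static struct model_range unused_range(uint64_t map_offset, uint64_t map_length,
				       uint64_t pos, uint64_t written,
				       uint64_t blksize)
{
	struct model_range r;

	r.start = rnd_up(pos + written, blksize);
	r.end = map_offset + map_length;
	if (r.start > r.end)
		r.start = r.end;	/* nothing left to release */
	return r;
}
```

For example, with 4096-byte blocks, a 10000-byte write attempt at offset 0 backed by a 12288-byte mapping, and only 5000 bytes written: the old end is 8192, so the range starting at 8192 is empty and the third block is never released. The fixed range is [8192, 12288).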
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1927e498aee1757b3df755a194cbfc5cc0f2b663 ]
The file permissions on the fdinfo dir were changed from
S_IRUSR|S_IXUSR to S_IRUGO|S_IXUGO, and a PTRACE_MODE_READ check was added
for opening the fdinfo files [1]. However, the ptrace permission check
was not added to the directory, allowing anyone to get the open FD numbers
by reading the fdinfo directory.
Add the missing ptrace permission check for opening the fdinfo directory.
[1] https://lkml.kernel.org/r/20210308170651.919148-1-kaleshsingh@google.com
Link: https://lkml.kernel.org/r/20210713162008.1056986-1-kaleshsingh@google.com
Fixes: 7bc3fa0172 ("procfs: allow reading fdinfo with PTRACE_MODE_READ")
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Hridya Valsaraju <hridya@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 085d16d5f949b64713d5e960d6c9bbf51bc1d511 ]
Turns out that ever since this mount option was added, passing
`softreval` in NFS mount options cancelled all other flags while not
affecting the underlying flag `NFS_MOUNT_SOFTREVAL`.
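
One way to get exactly this symptom is a bitwise AND where an OR was intended. The sketch below only illustrates that class of bug. The MODEL_* flag values and both functions are hypothetical, and it is an assumption that the upstream change has this shape:

```c
/* Hypothetical flag bits; the real NFS_MOUNT_* values are not used here. */
#define MODEL_FLAG_SOFT		0x01u
#define MODEL_FLAG_NOAC		0x02u
#define MODEL_FLAG_SOFTREVAL	0x04u

/* Buggy: AND keeps only the SOFTREVAL bit (if it was already set) and
 * clears every other flag, without ever setting SOFTREVAL. */
static unsigned int set_softreval_buggy(unsigned int flags)
{
	return flags & MODEL_FLAG_SOFTREVAL;
}

/* Intended: OR sets SOFTREVAL and leaves the other flags alone. */
static unsigned int set_softreval_fixed(unsigned int flags)
{
	return flags | MODEL_FLAG_SOFTREVAL;
}
```

Starting from MODEL_FLAG_SOFT | MODEL_FLAG_NOAC, the buggy variant yields 0 and the fixed variant yields all three bits.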
Fixes: c74dfe97c1 ("NFS: Add mount option 'softreval'")
Signed-off-by: Dan Aloni <dan.aloni@vastdata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Changes in 5.15.39
MIPS: Fix CP0 counter erratum detection for R4k CPUs
parisc: Merge model and model name into one line in /proc/cpuinfo
ALSA: hda/realtek: Add quirk for Yoga Duet 7 13ITL6 speakers
ALSA: fireworks: fix wrong return count shorter than expected by 4 bytes
mmc: sdhci-msm: Reset GCC_SDCC_BCR register for SDHC
mmc: sunxi-mmc: Fix DMA descriptors allocated above 32 bits
mmc: core: Set HS clock speed before sending HS CMD13
gpiolib: of: fix bounds check for 'gpio-reserved-ranges'
x86/fpu: Prevent FPU state corruption
KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id
iommu/vt-d: Calculate mask for non-aligned flushes
iommu/arm-smmu-v3: Fix size calculation in arm_smmu_mm_invalidate_range()
drm/amd/display: Avoid reading audio pattern past AUDIO_CHANNELS_COUNT
drm/amdgpu: do not use passthrough mode in Xen dom0
RISC-V: relocate DTB if it's outside memory region
Revert "SUNRPC: attempt AF_LOCAL connect on setup"
timekeeping: Mark NMI safe time accessors as notrace
firewire: fix potential uaf in outbound_phy_packet_callback()
firewire: remove check of list iterator against head past the loop body
firewire: core: extend card->lock in fw_core_handle_bus_reset
net: stmmac: disable Split Header (SPH) for Intel platforms
genirq: Synchronize interrupt thread startup
ASoC: da7219: Fix change notifications for tone generator frequency
ASoC: wm8958: Fix change notifications for DSP controls
ASoC: meson: Fix event generation for AUI ACODEC mux
ASoC: meson: Fix event generation for G12A tohdmi mux
ASoC: meson: Fix event generation for AUI CODEC mux
s390/dasd: fix data corruption for ESE devices
s390/dasd: prevent double format of tracks for ESE devices
s390/dasd: Fix read for ESE with blksize < 4k
s390/dasd: Fix read inconsistency for ESE DASD devices
can: grcan: grcan_close(): fix deadlock
can: isotp: remove re-binding of bound socket
can: grcan: use ofdev->dev when allocating DMA memory
can: grcan: grcan_probe(): fix broken system id check for errata workaround needs
can: grcan: only use the NAPI poll budget for RX
nfc: replace improper check device_is_registered() in netlink related functions
nfc: nfcmrvl: main: reorder destructive operations in nfcmrvl_nci_unregister_dev to avoid bugs
NFC: netlink: fix sleep in atomic bug when firmware download timeout
gpio: visconti: Fix fwnode of GPIO IRQ
gpio: pca953x: fix irq_stat not updated when irq is disabled (irq_mask not set)
hwmon: (adt7470) Fix warning on module removal
hwmon: (pmbus) disable PEC if not enabled
ASoC: dmaengine: Restore NULL prepare_slave_config() callback
ASoC: soc-ops: fix error handling
iommu/vt-d: Drop stop marker messages
iommu/dart: check return value after calling platform_get_resource()
net/mlx5e: Fix trust state reset in reload
net/mlx5e: Don't match double-vlan packets if cvlan is not set
net/mlx5e: CT: Fix queued up restore put() executing after relevant ft release
net/mlx5e: Fix the calling of update_buffer_lossy() API
net/mlx5: Avoid double clear or set of sync reset requested
net/mlx5: Fix deadlock in sync reset flow
selftests/seccomp: Don't call read() on TTY from background pgrp
SUNRPC release the transport of a relocated task with an assigned transport
RDMA/siw: Fix a condition race issue in MPA request processing
RDMA/irdma: Flush iWARP QP if modified to ERR from RTR state
RDMA/irdma: Reduce iWARP QP destroy time
RDMA/irdma: Fix possible crash due to NULL netdev in notifier
NFSv4: Don't invalidate inode attributes on delegation return
net: ethernet: mediatek: add missing of_node_put() in mtk_sgmii_init()
net: dsa: mt7530: add missing of_node_put() in mt7530_setup()
net: stmmac: dwmac-sun8i: add missing of_node_put() in sun8i_dwmac_register_mdio_mux()
net: mdio: Fix ENOMEM return value in BCM6368 mux bus controller
net: cpsw: add missing of_node_put() in cpsw_probe_dt()
net: igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter()
net: emaclite: Add error handling for of_address_to_resource()
selftests/net: so_txtime: fix parsing of start time stamp on 32 bit systems
selftests/net: so_txtime: usage(): fix documentation of default clock
drm/msm/dp: remove fail safe mode related code
btrfs: do not BUG_ON() on failure to update inode when setting xattr
hinic: fix bug of wq out of bound access
mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter()
rxrpc: Enable IPv6 checksums on transport socket
selftests: mirror_gre_bridge_1q: Avoid changing PVID while interface is operational
bnxt_en: Fix possible bnxt_open() failure caused by wrong RFS flag
bnxt_en: Fix unnecessary dropping of RX packets
selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer
smsc911x: allow using IRQ0
btrfs: force v2 space cache usage for subpage mount
btrfs: always log symlinks in full mode
drm/amdgpu: unify BO evicting method in amdgpu_ttm
drm/amdgpu: explicitly check for s0ix when evicting resources
drm/amdgpu: don't set s3 and s0ix at the same time
drm/amdgpu: Ensure HDA function is suspended before ASIC reset
gpio: mvebu: drop pwm base assignment
kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU
fbdev: Make fb_release() return -ENODEV if fbdev was unregistered
net/mlx5: Fix slab-out-of-bounds while reading resource dump menu
net/mlx5e: Lag, Fix use-after-free in fib event handler
net/mlx5e: Lag, Fix fib_info pointer assignment
net/mlx5e: Lag, Don't skip fib events on current dst
iommu/dart: Add missing module owner to ops structure
kvm: selftests: do not use bitfields larger than 32-bits for PTEs
KVM: selftests: Silence compiler warning in the kvm_page_table_test
x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume
KVM: x86: Do not change ICR on write to APIC_SELF_IPI
KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugs
KVM: LAPIC: Enable timer posted-interrupt only when mwait/hlt is advertised
selftest/vm: verify mmap addr in mremap_test
selftest/vm: verify remap destination address in mremap_test
mmc: rtsx: add 74 Clocks in power on flow
Revert "parisc: Mark sched_clock unstable only if clocks are not syncronized"
rcu: Fix callbacks processing time limit retaining cond_resched()
rcu: Apply callbacks processing time limit only on softirq
PCI: pci-bridge-emul: Add description for class_revision field
PCI: pci-bridge-emul: Add definitions for missing capabilities registers
PCI: aardvark: Add support for DEVCAP2, DEVCTL2, LNKCAP2 and LNKCTL2 registers on emulated bridge
PCI: aardvark: Clear all MSIs at setup
PCI: aardvark: Comment actions in driver remove method
PCI: aardvark: Disable bus mastering when unbinding driver
PCI: aardvark: Mask all interrupts when unbinding driver
PCI: aardvark: Fix memory leak in driver unbind
PCI: aardvark: Assert PERST# when unbinding driver
PCI: aardvark: Disable link training when unbinding driver
PCI: aardvark: Disable common PHY when unbinding driver
PCI: aardvark: Replace custom PCIE_CORE_INT_* macros with PCI_INTERRUPT_*
PCI: aardvark: Rewrite IRQ code to chained IRQ handler
PCI: aardvark: Check return value of generic_handle_domain_irq() when processing INTx IRQ
PCI: aardvark: Make MSI irq_chip structures static driver structures
PCI: aardvark: Make msi_domain_info structure a static driver structure
PCI: aardvark: Use dev_fwnode() instead of of_node_to_fwnode(dev->of_node)
PCI: aardvark: Refactor unmasking summary MSI interrupt
PCI: aardvark: Add support for masking MSI interrupts
PCI: aardvark: Fix setting MSI address
PCI: aardvark: Enable MSI-X support
PCI: aardvark: Add support for ERR interrupt on emulated bridge
PCI: aardvark: Optimize writing PCI_EXP_RTCTL_PMEIE and PCI_EXP_RTSTA_PME on emulated bridge
PCI: aardvark: Add support for PME interrupts
PCI: aardvark: Fix support for PME requester on emulated bridge
PCI: aardvark: Use separate INTA interrupt for emulated root bridge
PCI: aardvark: Remove irq_mask_ack() callback for INTx interrupts
PCI: aardvark: Don't mask irq when mapping
PCI: aardvark: Drop __maybe_unused from advk_pcie_disable_phy()
PCI: aardvark: Update comment about link going down after link-up
Linux 5.15.39
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic51d6d05a0d99156c6fd844786e984aff8e7386a
Changes in 5.15.38
usb: mtu3: fix USB 3.0 dual-role-switch from device to host
USB: quirks: add a Realtek card reader
USB: quirks: add STRING quirk for VCOM device
USB: serial: whiteheat: fix heap overflow in WHITEHEAT_GET_DTR_RTS
USB: serial: cp210x: add PIDs for Kamstrup USB Meter Reader
USB: serial: option: add support for Cinterion MV32-WA/MV32-WB
USB: serial: option: add Telit 0x1057, 0x1058, 0x1075 compositions
usb: xhci: tegra:Fix PM usage reference leak of tegra_xusb_unpowergate_partitions
xhci: Enable runtime PM on second Alderlake controller
xhci: stop polling roothubs after shutdown
xhci: increase usb U3 -> U0 link resume timeout from 100ms to 500ms
iio: dac: ad5592r: Fix the missing return value.
iio: dac: ad5446: Fix read_raw not returning set value
iio: magnetometer: ak8975: Fix the error handling in ak8975_power_on()
iio: imu: inv_icm42600: Fix I2C init possible nack
usb: misc: fix improper handling of refcount in uss720_probe()
usb: core: Don't hold the device lock while sleeping in do_proc_control()
usb: typec: ucsi: Fix reuse of completion structure
usb: typec: ucsi: Fix role swapping
usb: gadget: uvc: Fix crash when encoding data for usb request
usb: gadget: configfs: clear deactivation flag in configfs_composite_unbind()
usb: dwc3: Try usb-role-switch first in dwc3_drd_init
usb: dwc3: core: Fix tx/rx threshold settings
usb: dwc3: core: Only handle soft-reset in DCTL
usb: dwc3: gadget: Return proper request status
usb: dwc3: pci: add support for the Intel Meteor Lake-P
usb: cdns3: Fix issue for clear halt endpoint
usb: phy: generic: Get the vbus supply
serial: imx: fix overrun interrupts in DMA mode
serial: amba-pl011: do not time out prematurely when draining tx fifo
serial: 8250: Also set sticky MCR bits in console restoration
serial: 8250: Correct the clock for EndRun PTP/1588 PCIe device
arch_topology: Do not set llc_sibling if llc_id is invalid
ceph: fix possible NULL pointer dereference for req->r_session
bus: mhi: host: pci_generic: Add missing poweroff() PM callback
bus: mhi: host: pci_generic: Flush recovery worker during freeze
arm64: dts: imx8mm-venice: fix spi2 pin configuration
pinctrl: samsung: fix missing GPIOLIB on ARM64 Exynos config
hex2bin: make the function hex_to_bin constant-time
hex2bin: fix access beyond string end
riscv: patch_text: Fixup last cpu should be master
x86/pci/xen: Disable PCI/MSI[-X] masking for XEN_HVM guests
iocost: don't reset the inuse weight of under-weighted debtors
virtio_net: fix wrong buf address calculation when using xdp
cpufreq: qcom-hw: fix the race between LMH worker and cpuhp
cpufreq: qcom-cpufreq-hw: Fix throttle frequency value on EPSS platforms
video: fbdev: udlfb: properly check endpoint type
arm64: dts: meson: remove CPU opps below 1GHz for G12B boards
arm64: dts: meson: remove CPU opps below 1GHz for SM1 boards
iio:imu:bmi160: disable regulator in error path
mtd: rawnand: fix ecc parameters for mt7622
xsk: Fix l2fwd for copy mode + busy poll combo
arm64: dts: imx8qm: Correct SCU clock controller's compatible property
USB: Fix xhci event ring dequeue pointer ERDP update issue
ARM: dts: imx6qdl-apalis: Fix sgtl5000 detection issue
arm64: dts: imx8mn: Fix SAI nodes
arm64: dts: meson-sm1-bananapi-m5: fix wrong GPIO pin labeling for CON1
phy: samsung: Fix missing of_node_put() in exynos_sata_phy_probe
phy: samsung: exynos5250-sata: fix missing device put in probe error paths
ARM: OMAP2+: Fix refcount leak in omap_gic_of_init
bus: ti-sysc: Make omap3 gpt12 quirk handling SoC specific
ARM: dts: dra7: Fix suspend warning for vpe powerdomain
phy: ti: omap-usb2: Fix error handling in omap_usb2_enable_clocks
ARM: dts: at91: Map MCLK for wm8731 on at91sam9g20ek
ARM: dts: at91: sama5d4_xplained: fix pinctrl phandle name
ARM: dts: at91: fix pinctrl phandles
phy: mapphone-mdm6600: Fix PM error handling in phy_mdm6600_probe
phy: ti: Add missing pm_runtime_disable() in serdes_am654_probe
interconnect: qcom: sdx55: Drop IP0 interconnects
ARM: dts: Fix mmc order for omap3-gta04
ARM: dts: am3517-evm: Fix misc pinmuxing
ARM: dts: logicpd-som-lv: Fix wrong pinmuxing on OMAP35
ipvs: correctly print the memory size of ip_vs_conn_tab
phy: amlogic: fix error path in phy_g12a_usb3_pcie_probe()
pinctrl: mediatek: moore: Fix build error
mtd: rawnand: Fix return value check of wait_for_completion_timeout
mtd: fix 'part' field data corruption in mtd_info
pinctrl: stm32: Do not call stm32_gpio_get() for edge triggered IRQs in EOI
memory: renesas-rpc-if: Fix HF/OSPI data transfer in Manual Mode
net: dsa: Add missing of_node_put() in dsa_port_link_register_of
netfilter: nft_set_rbtree: overlap detection with element re-addition after deletion
bpf, lwt: Fix crash when using bpf_skb_set_tunnel_key() from bpf_xmit lwt hook
pinctrl: rockchip: fix RK3308 pinmux bits
tcp: md5: incorrect tcp_header_len for incoming connections
pinctrl: stm32: Keep pinctrl block clock enabled when LEVEL IRQ requested
tcp: ensure to use the most recently sent skb when filling the rate sample
wireguard: device: check for metadata_dst with skb_valid_dst()
sctp: check asoc strreset_chunk in sctp_generate_reconf_event
ARM: dts: imx6ull-colibri: fix vqmmc regulator
arm64: dts: imx8mn-ddr4-evk: Describe the 32.768 kHz PMIC clock
pinctrl: pistachio: fix use of irq_of_parse_and_map()
cpufreq: fix memory leak in sun50i_cpufreq_nvmem_probe
net: hns3: clear inited state and stop client after failed to register netdev
net: hns3: modify the return code of hclge_get_ring_chain_from_mbx
net: hns3: add validity check for message data length
net: hns3: add return value for mailbox handling in PF
net/smc: sync err code when tcp connection was refused
ip_gre: Make o_seqno start from 0 in native mode
ip6_gre: Make o_seqno start from 0 in native mode
ip_gre, ip6_gre: Fix race condition on o_seqno in collect_md mode
tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT
tcp: make sure treq->af_specific is initialized
bus: sunxi-rsb: Fix the return value of sunxi_rsb_device_create()
clk: sunxi: sun9i-mmc: check return value after calling platform_get_resource()
cpufreq: qcom-cpufreq-hw: Clear dcvs interrupts
net: bcmgenet: hide status block before TX timestamping
net: phy: marvell10g: fix return value on error
net: dsa: mv88e6xxx: Fix port_hidden_wait to account for port_base_addr
drm/sun4i: Remove obsolete references to PHYS_OFFSET
net: dsa: lantiq_gswip: Don't set GSWIP_MII_CFG_RMII_CLK
io_uring: check reserved fields for send/sendmsg
io_uring: check reserved fields for recv/recvmsg
netfilter: conntrack: fix udp offload timeout sysctl
drm/amdkfd: Fix GWS queue count
drm/amd/display: Fix memory leak in dcn21_clock_source_create
tls: Skip tls_append_frag on zero copy size
bnx2x: fix napi API usage sequence
net: fec: add missing of_node_put() in fec_enet_init_stop_mode()
gfs2: Prevent endless loops in gfs2_file_buffered_write
gfs2: Minor retry logic cleanup
gfs2: Make sure not to return short direct writes
gfs2: No short reads or writes upon glock contention
perf arm-spe: Fix addresses of synthesized SPE events
ixgbe: ensure IPsec VF<->PF compatibility
Revert "ibmvnic: Add ethtool private flag for driver-defined queue limits"
tcp: fix F-RTO may not work correctly when receiving DSACK
ASoC: Intel: soc-acpi: correct device endpoints for max98373
ASoC: wm8731: Disable the regulator when probing fails
ext4: fix bug_on in start_this_handle during umount filesystem
arch: xtensa: platforms: Fix deadlock in rs_close()
ksmbd: increment reference count of parent fp
ksmbd: set fixed sector size to FS_SECTOR_SIZE_INFORMATION
bonding: do not discard lowest hash bit for non layer3+4 hashing
x86: __memcpy_flushcache: fix wrong alignment if size > 2^32
cifs: destage any unwritten data to the server before calling copychunk_write
drivers: net: hippi: Fix deadlock in rr_close()
powerpc/perf: Fix 32bit compile
selftest/vm: verify mmap addr in mremap_test
selftest/vm: verify remap destination address in mremap_test
Revert "ACPI: processor: idle: fix lockup regression on 32-bit ThinkPad T40"
zonefs: Fix management of open zones
zonefs: Clear inode information flags on inode creation
kasan: prevent cpu_quarantine corruption when CPU offline and cache shrink occur at same time
mtd: rawnand: qcom: fix memory corruption that causes panic
netfilter: Update ip6_route_me_harder to consider L3 domain
drm/i915: Check EDID for HDR static metadata when choosing blc
drm/i915: Fix SEL_FETCH_PLANE_*(PIPE_B+) register addresses
net: ethernet: stmmac: fix write to sgmii_adapter_base
ACPI: processor: idle: Avoid falling back to C3 type C-states
thermal: int340x: Fix attr.show callback prototype
btrfs: fix leaked plug after failure syncing log on zoned filesystems
ARM: dts: at91: sama7g5ek: enable pull-up on flexcom3 console lines
ARM: dts: imx8mm-venice-gw{71xx,72xx,73xx}: fix OTG controller OC mode
x86/cpu: Load microcode during restore_processor_state()
perf symbol: Pass is_kallsyms to symbols__fixup_end()
perf symbol: Update symbols__fixup_end()
tty: n_gsm: fix restart handling via CLD command
tty: n_gsm: fix decoupled mux resource
tty: n_gsm: fix mux cleanup after unregister tty device
tty: n_gsm: fix wrong signal octet encoding in convergence layer type 2
tty: n_gsm: fix malformed counter for out of frame data
netfilter: nft_socket: only do sk lookups when indev is available
tty: n_gsm: fix insufficient txframe size
tty: n_gsm: fix wrong DLCI release order
tty: n_gsm: fix missing explicit ldisc flush
tty: n_gsm: fix wrong command retry handling
tty: n_gsm: fix wrong command frame length field encoding
tty: n_gsm: fix wrong signal octets encoding in MSC
tty: n_gsm: fix missing tty wakeup in convergence layer type 2
tty: n_gsm: fix reset fifo race condition
tty: n_gsm: fix incorrect UA handling
tty: n_gsm: fix software flow control handling
perf symbol: Remove arch__symbols__fixup_end()
eeprom: at25: Use DMA safe buffers
objtool: Fix code relocs vs weak symbols
objtool: Fix type of reloc::addend
powerpc/64: Add UADDR64 relocation support
Linux 5.15.38
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic5e398d47dd6240ecde38f635b3085fae7c3c0ab
This reverts the merge of 5.15.37 into the android13-5.15 branch.
There are lots of ABI issues, and many of the commits are not needed in
the Android tree at this time. Revert the merge (except for the
Makefile change), so that future merges will continue to work, and the
needed individual changes from this release will be manually added to
the tree at a later point in time.
Fixes: f7dace75d276 ("Merge 5.15.37 into android13-5.15")
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I0632858e5c0fb94fc14c0f4216997330eca260a7
Changes in 5.15.37
floppy: disable FDRAWCMD by default
bpf: Introduce composable reg, ret and arg types.
bpf: Replace ARG_XXX_OR_NULL with ARG_XXX | PTR_MAYBE_NULL
bpf: Replace RET_XXX_OR_NULL with RET_XXX | PTR_MAYBE_NULL
bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX | PTR_MAYBE_NULL
bpf: Introduce MEM_RDONLY flag
bpf: Convert PTR_TO_MEM_OR_NULL to composable types.
bpf: Make per_cpu_ptr return rdonly PTR_TO_MEM.
bpf: Add MEM_RDONLY for helper args that are pointers to rdonly mem.
bpf/selftests: Test PTR_TO_RDONLY_MEM
bpf: Fix crash due to out of bounds access into reg2btf_ids.
spi: cadence-quadspi: fix write completion support
ARM: dts: socfpga: change qspi to "intel,socfpga-qspi"
mm: kfence: fix objcgs vector allocation
gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable}
iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable
iov_iter: Introduce fault_in_iov_iter_writeable
gfs2: Add wrapper for iomap_file_buffered_write
gfs2: Clean up function may_grant
gfs2: Introduce flag for glock holder auto-demotion
gfs2: Move the inode glock locking to gfs2_file_buffered_write
gfs2: Eliminate ip->i_gh
gfs2: Fix mmap + page fault deadlocks for buffered I/O
iomap: Fix iomap_dio_rw return value for user copies
iomap: Support partial direct I/O on user copy failures
iomap: Add done_before argument to iomap_dio_rw
gup: Introduce FOLL_NOFAULT flag to disable page faults
iov_iter: Introduce nofault flag to disable page faults
gfs2: Fix mmap + page fault deadlocks for direct I/O
btrfs: fix deadlock due to page faults during direct IO reads and writes
btrfs: fallback to blocking mode when doing async dio over multiple extents
mm: gup: make fault_in_safe_writeable() use fixup_user_fault()
selftests/bpf: Add test for reg2btf_ids out of bounds access
Linux 5.15.37
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ica39b8856d6e3928a82f4e34f8b401f1a5cba5ee
Changes in 5.15.36
fs: remove __sync_filesystem
block: remove __sync_blockdev
block: simplify the block device syncing code
vfs: make sync_filesystem return errors from ->sync_fs
xfs: return errors in xfs_fs_sync_fs
dma-mapping: remove bogus test for pfn_valid from dma_map_resource
arm64/mm: drop HAVE_ARCH_PFN_VALID
etherdevice: Adjust ether_addr* prototypes to silence -Wstringop-overead
mm: page_alloc: fix building error on -Werror=array-compare
perf tools: Fix segfault accessing sample_id xyarray
mm, kfence: support kmem_dump_obj() for KFENCE objects
gfs2: assign rgrp glock before compute_bitstructs
scsi: ufs: core: scsi_get_lba() error fix
net/sched: cls_u32: fix netns refcount changes in u32_change()
ALSA: usb-audio: Clear MIDI port active flag after draining
ALSA: hda/realtek: Add quirk for Clevo NP70PNP
ASoC: atmel: Remove system clock tree configuration for at91sam9g20ek
ASoC: topology: Correct error handling in soc_tplg_dapm_widget_create()
ASoC: rk817: Use devm_clk_get() in rk817_platform_probe
ASoC: msm8916-wcd-digital: Check failure for devm_snd_soc_register_component
ASoC: codecs: wcd934x: do not switch off SIDO Buck when codec is in use
dmaengine: idxd: fix device cleanup on disable
dmaengine: imx-sdma: Fix error checking in sdma_event_remap
dmaengine: mediatek:Fix PM usage reference leak of mtk_uart_apdma_alloc_chan_resources
dmaengine: dw-edma: Fix unaligned 64bit access
spi: spi-mtk-nor: initialize spi controller after resume
esp: limit skb_page_frag_refill use to a single page
spi: cadence-quadspi: fix incorrect supports_op() return value
igc: Fix infinite loop in release_swfw_sync
igc: Fix BUG: scheduling while atomic
igc: Fix suspending when PTM is active
ALSA: hda/hdmi: fix warning about PCM count when used with SOF
rxrpc: Restore removed timer deletion
net/smc: Fix sock leak when release after smc_shutdown()
net/packet: fix packet_sock xmit return value checking
ip6_gre: Avoid updating tunnel->tun_hlen in __gre6_xmit()
ip6_gre: Fix skb_under_panic in __gre6_xmit()
net: restore alpha order to Ethernet devices in config
net/sched: cls_u32: fix possible leak in u32_init_knode()
l3mdev: l3mdev_master_upper_ifindex_by_index_rcu should be using netdev_master_upper_dev_get_rcu
ipv6: make ip6_rt_gc_expire an atomic_t
can: isotp: stop timeout monitoring when no first frame was sent
net: dsa: hellcreek: Calculate checksums in tagger
net: mscc: ocelot: fix broken IP multicast flooding
netlink: reset network and mac headers in netlink_dump()
drm/i915/display/psr: Unset enable_psr2_sel_fetch if other checks in intel_psr2_config_valid() fails
net: stmmac: Use readl_poll_timeout_atomic() in atomic state
dmaengine: idxd: add RO check for wq max_batch_size write
dmaengine: idxd: add RO check for wq max_transfer_size write
dmaengine: idxd: skip clearing device context when device is read-only
selftests: mlxsw: vxlan_flooding: Prevent flooding of unwanted packets
arm64: mm: fix p?d_leaf()
ARM: vexpress/spc: Avoid negative array index when !SMP
reset: renesas: Check return value of reset_control_deassert()
reset: tegra-bpmp: Restore Handle errors in BPMP response
platform/x86: samsung-laptop: Fix an unsigned comparison which can never be negative
ALSA: usb-audio: Fix undefined behavior due to shift overflowing the constant
drm/msm/disp: check the return value of kzalloc()
arm64: dts: imx: Fix imx8*-var-som touchscreen property sizes
vxlan: fix error return code in vxlan_fdb_append
cifs: Check the IOCB_DIRECT flag, not O_DIRECT
net: atlantic: Avoid out-of-bounds indexing
mt76: Fix undefined behavior due to shift overflowing the constant
brcmfmac: sdio: Fix undefined behavior due to shift overflowing the constant
dpaa_eth: Fix missing of_node_put in dpaa_get_ts_info()
drm/msm/mdp5: check the return of kzalloc()
net: macb: Restart tx only if queue pointer is lagging
scsi: iscsi: Release endpoint ID when its freed
scsi: iscsi: Merge suspend fields
scsi: iscsi: Fix NOP handling during conn recovery
scsi: qedi: Fix failed disconnect handling
stat: fix inconsistency between struct stat and struct compat_stat
VFS: filename_create(): fix incorrect intent.
nvme: add a quirk to disable namespace identifiers
nvme-pci: disable namespace identifiers for the MAXIO MAP1002/1202
nvme-pci: disable namespace identifiers for Qemu controllers
EDAC/synopsys: Read the error count from the correct register
mm/memory-failure.c: skip huge_zero_page in memory_failure()
memcg: sync flush only if periodic flush is delayed
mm, hugetlb: allow for "high" userspace addresses
oom_kill.c: futex: delay the OOM reaper to allow time for proper futex cleanup
mm/mmu_notifier.c: fix race in mmu_interval_notifier_remove()
ata: pata_marvell: Check the 'bmdma_addr' beforing reading
dma: at_xdmac: fix a missing check on list iterator
dmaengine: imx-sdma: fix init of uart scripts
net: atlantic: invert deep par in pm functions, preventing null derefs
Input: omap4-keypad - fix pm_runtime_get_sync() error checking
scsi: sr: Do not leak information in ioctl
sched/pelt: Fix attach_entity_load_avg() corner case
perf/core: Fix perf_mmap fail when CONFIG_PERF_USE_VMALLOC enabled
drm/panel/raspberrypi-touchscreen: Avoid NULL deref if not initialised
drm/panel/raspberrypi-touchscreen: Initialise the bridge in prepare
KVM: PPC: Fix TCE handling for VFIO
drm/vc4: Use pm_runtime_resume_and_get to fix pm_runtime_get_sync() usage
powerpc/perf: Fix power9 event alternatives
powerpc/perf: Fix power10 event alternatives
perf script: Always allow field 'data_src' for auxtrace
perf report: Set PERF_SAMPLE_DATA_SRC bit for Arm SPE event
xtensa: patch_text: Fixup last cpu should be master
xtensa: fix a7 clobbering in coprocessor context load/store
openvswitch: fix OOB access in reserve_sfa_size()
gpio: Request interrupts after IRQ is initialized
ASoC: soc-dapm: fix two incorrect uses of list iterator
e1000e: Fix possible overflow in LTR decoding
ARC: entry: fix syscall_trace_exit argument
arm_pmu: Validate single/group leader events
KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog
KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs
netfilter: conntrack: convert to refcount_t api
netfilter: conntrack: avoid useless indirection during conntrack destruction
ext4: fix fallocate to use file_modified to update permissions consistently
ext4: fix symlink file size not match to file content
ext4: fix use-after-free in ext4_search_dir
ext4: limit length to bitmap_maxbytes - blocksize in punch_hole
ext4, doc: fix incorrect h_reserved size
ext4: fix overhead calculation to account for the reserved gdt blocks
ext4: force overhead calculation if the s_overhead_cluster makes no sense
netfilter: nft_ct: fix use after free when attaching zone template
jbd2: fix a potential race while discarding reserved buffers after an abort
spi: atmel-quadspi: Fix the buswidth adjustment between spi-mem and controller
block/compat_ioctl: fix range check in BLKGETSIZE
arm64: dts: qcom: add IPA qcom,qmp property
Linux 5.15.36
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I44d3a4de9b6fa1d2016b4e063eb211e8373a1216
Some platforms might still expect a dedicated memory region for the
ramoops node. So add logic to detect the start and size of the ramoops
memory region by looking up the reserved memory region with
of_reserved_mem_lookup() when platform_get_resource() fails.
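The fallback order described above can be sketched in plain C. This is a minimal user-space model, not the driver code: struct sketch_region, sketch_pick_ramoops_region() and the -EINVAL return value are hypothetical, and each lookup is modeled as a pointer that is NULL when that lookup found nothing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified description of a memory region: start address and size. */
struct sketch_region {
	unsigned long long start;
	unsigned long long size;
};

/*
 * Pick the ramoops region. platform_res models the result of the
 * platform resource lookup, reserved_mem models the result of the
 * reserved memory lookup; either is NULL when nothing was found.
 */
static int sketch_pick_ramoops_region(const struct sketch_region *platform_res,
				      const struct sketch_region *reserved_mem,
				      struct sketch_region *out)
{
	if (platform_res) {
		/* The dedicated region is still preferred when present. */
		*out = *platform_res;
		return 0;
	}
	if (reserved_mem) {
		/* Fall back to the reserved memory region. */
		*out = *reserved_mem;
		return 0;
	}
	/* Neither lookup succeeded (error code is an assumption). */
	return -EINVAL;
}
```

In this model the reserved memory region is consulted only after the platform resource lookup has failed.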
Bug: 191636717
Change-Id: Idc479b45fb3f637f7235efd6eabac62059d5e92b
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
(cherry picked from commit 9b136eab76eee8988a8474008aa7a8cfc83be421)
The reserved memory region for ramoops is assumed to be at a fixed
and known location when read from the devicetree. This is not desirable
in environments where it is preferred for the region to be dynamically
allocated at runtime, as opposed to it being fixed at compile time.
Change the logic for detecting the start and size of the ramoops
memory region by looking up the reserved memory region instead of
using platform_get_resource(), which assumes that the location
of the memory is known ahead of time.
Bug: 191636717
Link: https://lore.kernel.org/patchwork/patch/1451704/
Change-Id: I24066de9f4fe1f1575cb1bbb1687c37a2b1938a4
Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
(cherry picked from commit bd2ca0ba5b9ccf3aa846d0f08a06c193e5a388f4)
commit c1ad35dd0548ce947d97aaf92f7f2f9a202951cf upstream.
udf_write_fi() uses lengthOfImpUse of the entry it is writing to.
However, this field has not yet been initialized, so it contains either
a completely bogus value or the value from the last directory entry at
that place. In either case this is wrong and can lead to filesystem
corruption or kernel crashes.
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
CC: stable@vger.kernel.org
Fixes: 979a6e28dd ("udf: Get rid of 0-length arrays in struct fileIdentDesc")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d0e64a981fd841cb0f28fcd6afcac55e6f1e6994 upstream.
On Linux, empty symlinks are invalid, and attempting to create one with
the system call symlink(2) results in an -ENOENT error and this is
explicitly documented in the man page.
If we rename a symlink that was created in the current transaction and its
parent directory was logged before, we actually end up logging the symlink
without logging its content, which is stored in an inline extent. That
means that after a power failure we can end up with an empty symlink,
having no content and an i_size of 0 bytes.
It can be easily reproduced like this:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ mkdir /mnt/testdir
$ sync
# Create a file inside the directory and fsync the directory.
$ touch /mnt/testdir/foo
$ xfs_io -c "fsync" /mnt/testdir
# Create a symlink inside the directory and then rename the symlink.
$ ln -s /mnt/testdir/foo /mnt/testdir/bar
$ mv /mnt/testdir/bar /mnt/testdir/baz
# Now fsync the directory again; this persists the log tree.
$ xfs_io -c "fsync" /mnt/testdir
<power failure>
$ mount /dev/sdc /mnt
$ stat -c %s /mnt/testdir/baz
0
$ readlink /mnt/testdir/baz
$
Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that
their content is also logged.
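The rule in the fix can be sketched as a small mode-selection helper. This is an illustrative model only: the sketch_ names are hypothetical, and SKETCH_LOG_INODE_EXISTS merely stands in for any lighter logging mode that would skip the symlink content.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_log_mode {
	SKETCH_LOG_INODE_ALL,    /* log the inode together with its content */
	SKETCH_LOG_INODE_EXISTS, /* hypothetical lighter mode: no content */
};

/*
 * Symlinks are always logged in full mode so that the inline extent
 * holding the link target is persisted too; other inodes keep the
 * mode the caller asked for.
 */
static enum sketch_log_mode
sketch_effective_log_mode(bool is_symlink, enum sketch_log_mode requested)
{
	if (is_symlink)
		return SKETCH_LOG_INODE_ALL;
	return requested;
}
```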
A test case for fstests will follow.
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 9f73f1aef98b2fa7252c0a89be64840271ce8ea0 upstream.
[BUG]
A btrfs filesystem with a 4K sector size and the v1 space cache enabled
that has only been mounted on systems with a 4K page size can trigger the
following warning on the v1 space cache when it is mounted on a subpage
(64K page size) system:
BTRFS error (device dm-1): csum mismatch on free space cache
BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now
Although this is not a big deal, as the kernel can rebuild the cache
without problems, such a warning will bother end users, especially if
they want to switch the same btrfs seamlessly between systems with
different page sizes.
[CAUSE]
The v1 free space cache still uses a fixed PAGE_SIZE for various bitmap
values, such as BITS_PER_BITMAP.
Such hard-coded PAGE_SIZE usage causes various mismatches, from the v1
cache size to the checksum.
Thus the kernel will always reject a v1 cache written with a different
PAGE_SIZE, reporting a csum mismatch.
[FIX]
Although we should fix the v1 cache, it is going to be marked deprecated
soon.
And we have the v2 cache based on metadata (which is already fully
subpage compatible), which is superior to the v1 cache in almost every
respect.
So just force subpage mounts to use the v2 cache.
Reported-by: Matt Corallo <blnxfsl@bluematt.me>
CC: stable@vger.kernel.org # 5.15+
Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 193b4e83986d7ee6caa8ceefb5ee9f58240fbee0 upstream.
We are doing a BUG_ON() if we fail to update an inode after setting (or
clearing) a xattr, but there's really no reason to not instead simply
abort the transaction and return the error to the caller. This should be
a rare error because we have previously reserved enough metadata space to
update the inode and the delayed inode should have already been setup, so
an -ENOSPC or -ENOMEM, which are the possible errors, are very unlikely to
happen.
So replace the BUG_ON()s with a transaction abort.
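The change in error handling can be sketched as follows. This is a user-space model under stated assumptions: struct sketch_trans, sketch_abort_transaction() and sketch_finish_setxattr() are hypothetical names, and update_ret stands in for the result of the inode update that used to be wrapped in BUG_ON().

```c
#include <assert.h>
#include <errno.h>

/* Minimal model of a transaction: remembers the first abort error. */
struct sketch_trans {
	int aborted_errno;
};

static void sketch_abort_transaction(struct sketch_trans *trans, int err)
{
	/* Keep the first error; later aborts do not overwrite it. */
	if (!trans->aborted_errno)
		trans->aborted_errno = err;
}

/*
 * update_ret models the return value of the inode update performed
 * after setting or clearing an xattr. Instead of crashing on failure,
 * abort the transaction and hand the error back to the caller.
 */
static int sketch_finish_setxattr(struct sketch_trans *trans, int update_ret)
{
	if (update_ret) {
		sketch_abort_transaction(trans, update_ret);
		return update_ret;
	}
	return 0;
}
```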
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 00c94ebec5925593c0377b941289224469e72ac7 upstream.
There is no need to declare attributes such as the ctime, mtime and
block size invalid when we're just returning a delegation, so it is
inappropriate to call nfs_post_op_update_inode_force_wcc().
Instead, just call nfs_refresh_inode() after faking up the change
attribute. We know that the GETATTR op occurs before the DELEGRETURN, so
we are safe when doing this.
Fixes: 0bc2c9b4dc ("NFSv4: Don't discard the attributes returned by asynchronous DELEGRETURN")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 50ff57888d0b13440e7f4cde05dc339ee8d0f1f8 upstream.
On a zoned filesystem, if we fail to allocate the root node for the log
root tree while syncing the log, we end up returning without finishing
the IO plug we started before, resulting in leaking resources as we
have started writeback for extent buffers of a log tree before. That
allocation failure, which typically is either -ENOMEM or -ENOSPC, is not
fatal and the fsync can safely fall back to a full transaction commit.
So release the IO plug if we fail to allocate the extent buffer for the
root of the log root tree when syncing the log on a zoned filesystem.
Fixes: 3ddebf27fc ("btrfs: zoned: reorder log node allocation on zoned filesystem")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 694852ead287a3433126e7ebda397b242dc99624 upstream.
Ensure that the i_flags field of struct zonefs_inode_info is cleared to
0 when initializing a zone file inode, avoiding seeing the flag
ZONEFS_ZONE_OPEN being incorrectly set.
Fixes: b5c00e9757 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 1da18a296f5ba4f99429e62a7cf4fdbefa598902 upstream.
The mount option "explicit_open" manages the device open zone
resources to ensure that if an application opens a sequential file for
writing, the file zone can always be written by explicitly opening
the zone and accounting for that state with the s_open_zones counter.
However, if some zones are already open when mounting, the device open
zone resource usage status will be larger than the initial s_open_zones
value of 0. Ensure that this inconsistency does not happen by closing
any sequential zone that is open when mounting.
Furthermore, with ZNS drives, closing an explicitly open zone that has
not been written will change the zone state to "closed", that is, the
zone will remain in an active state. Since this can then cause failures
of explicit open operations on other zones if the drive active zone
resources are exceeded, we need to make sure that the zone is not
active anymore by resetting it instead of closing it. To address this,
zonefs_zone_mgmt() is modified to change a REQ_OP_ZONE_CLOSE request
into a REQ_OP_ZONE_RESET for sequential zones that have not been
written.
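The request rewrite described above can be sketched as a pure function. This is a simplified model, not the zonefs code: enum sketch_zone_op, struct sketch_zone_file and its fields are hypothetical stand-ins, and "not written" is modeled as a write pointer offset of zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the zone management operations. */
enum sketch_zone_op {
	SKETCH_ZONE_OPEN,
	SKETCH_ZONE_CLOSE,
	SKETCH_ZONE_RESET,
};

/* Only the state needed for the decision. */
struct sketch_zone_file {
	bool sequential;             /* sequential write zone? */
	unsigned long long wpoffset; /* bytes written so far */
};

/*
 * Closing an explicitly opened sequential zone that was never written
 * would leave it active, so turn the close into a reset instead.
 */
static enum sketch_zone_op
sketch_effective_zone_op(const struct sketch_zone_file *zf,
			 enum sketch_zone_op op)
{
	if (op == SKETCH_ZONE_CLOSE && zf->sequential && zf->wpoffset == 0)
		return SKETCH_ZONE_RESET;
	return op;
}
```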
Fixes: b5c00e9757 ("zonefs: open/close zone on file open/close")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hans Holmberg <hans.holmberg@wdc.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit f5d0f921ea362636e4a2efb7c38d1ead373a8700 ]
Destage any unwritten data to the server before calling copychunk_write,
because the copychunk_write might cover a region of the file that has not yet
been sent to the server and thus fail.
A simple way to reproduce this is:
truncate -s 0 /mnt/testfile; strace -f -o x -ttT xfs_io -i -f -c 'pwrite 0k 128k' -c 'fcollapse 16k 24k' /mnt/testfile
The issue is that the 'pwrite 0k 128k' is reordered on the wire with
the 'fcollapse 16k 24k' due to write-back caching.
fcollapse is implemented in cifs.ko as an SMB2 IOCTL(COPYCHUNK_WRITE) call,
and it fails on the server side because the file is still 0 bytes in size
there until the writes have been destaged.
To avoid this we must ensure that we destage any unwritten data to the
server before calling COPYCHUNK_WRITE.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1997373
Reported-by: Xiaoli Feng <xifeng@redhat.com>
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 02655a70b7cc0f534531ee65fa72692f4d31a944 ]
Currently ksmbd uses ->f_bsize from vfs_statfs() as the sector size.
If fat/exfat is a local share, ->f_bsize is a cluster size that is too
large to be used as a sector size. Sector sizes larger than 4K cause
problems when mounting an ISO file through a Windows client.
The error message can be obtained using the Mount-DiskImage command;
the error is:
"Mount-DiskImage : The sector size of the physical disk on which the
virtual disk resides is not supported."
This patch reports a fixed 4KB sector size if ->s_blocksize is bigger
than 4KB.
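The clamping rule can be sketched in a few lines. This is a minimal illustration, assuming the block size is available as a plain unsigned integer; SKETCH_MAX_SECTOR_SIZE and sketch_reported_sector_size() are hypothetical names, not ksmbd identifiers.

```c
#include <assert.h>

/* Largest sector size reported to the client in this sketch: 4KB. */
#define SKETCH_MAX_SECTOR_SIZE 4096u

/*
 * Report the filesystem block size as the sector size, but never
 * report more than 4KB.
 */
static unsigned int sketch_reported_sector_size(unsigned int fs_block_size)
{
	if (fs_block_size > SKETCH_MAX_SECTOR_SIZE)
		return SKETCH_MAX_SECTOR_SIZE;
	return fs_block_size;
}
```

For example, a share with a 32KB cluster size would be reported with a 4KB sector size, while a 512-byte block size is passed through unchanged.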
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 296abc0d91d8b65d42224dd33452ace14491ad08 ]
Commit 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks for buffered
I/O") changed gfs2_file_read_iter() and gfs2_file_buffered_write() to
allow dropping the inode glock while faulting in user buffers. When the
lock was dropped, a short result was returned to indicate that the
operation was interrupted.
As pointed out by Linus (see the link below), this behavior is broken
and the operations should always re-acquire the inode glock and resume
the operation instead.
Link: https://lore.kernel.org/lkml/CAHk-=whaz-g_nOOoo8RRiWNjnv2R+h6_xk2F1J4TuSRxk1MtLw@mail.gmail.com/
Fixes: 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks for buffered I/O")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 3bde4c48586074202044456285a97ccdf9048988 ]
When direct writes fail with -ENOTBLK because we're writing into a
hole (gfs2_iomap_begin()) or because of a page invalidation failure
(iomap_dio_rw()), we're falling back to buffered writes. In that case,
when we lose the inode glock in gfs2_file_buffered_write(), we want to
re-acquire it instead of returning a short write.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 124c458a401a2497f796e4f2d6cafac6edbea8e9 ]
Clean up the retry logic in the read and write functions somewhat.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 554c577cee95bdc1d03d9f457e57dc96eb791845 ]
Currently, instead of performing a short write,
iomap_file_buffered_write will fail when part of its iov iterator cannot
be read. In contrast, gfs2_file_buffered_write will loop around if it
can read part of the iov iterator, so we can end up in an endless loop.
This should be fixed in iomap_file_buffered_write (and also
generic_perform_write), but this comes a bit late in the 5.16
development cycle, so work around it in the filesystem by
trimming the iov iterator to the known-good size for now.
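The workaround can be modeled in user space as follows. This is a rough sketch under stated assumptions: sketch_all_or_nothing_write() stands in for a buffered write helper that fails outright when part of the source cannot be read, readable models how many bytes of the user buffer are known to be accessible, and all names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Models a write helper that fails instead of performing a short
 * write when part of the source buffer cannot be read.
 */
static long sketch_all_or_nothing_write(size_t count, size_t readable)
{
	if (count > readable)
		return -EFAULT;
	return (long)count;
}

/*
 * Caller-side workaround: trim the request to the known-good size
 * before calling the helper, so the helper can make progress and the
 * caller does not retry the same failing request forever.
 */
static long sketch_trimmed_write(size_t count, size_t readable)
{
	size_t window = count < readable ? count : readable;

	if (window == 0)
		return -EFAULT; /* nothing readable at all: give up */
	return sketch_all_or_nothing_write(window, readable);
}
```

With the trim in place, a request larger than the readable part of the buffer completes as a short write of the readable size instead of failing repeatedly.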
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 5a1e99b61b0c81388cde0c808b3e4173907df19f ]
We should check unused fields for non-zero values and return -EINVAL if
any are set, making it consistent with other opcodes.
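The check has roughly the following shape. This is a sketch, not the io_uring code: struct sketch_sqe is a hypothetical, reduced submission entry, and unused_a and unused_b stand in for whichever fields the opcode does not use.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, reduced submission entry. */
struct sketch_sqe {
	unsigned long long addr;     /* used by the opcode */
	unsigned int len;            /* used by the opcode */
	unsigned long long unused_a; /* not used by this opcode */
	unsigned int unused_b;       /* not used by this opcode */
};

/*
 * Reject requests that set fields the opcode does not use, so those
 * fields stay available for future extensions.
 */
static int sketch_prep_msg(const struct sketch_sqe *sqe)
{
	if (sqe->unused_a || sqe->unused_b)
		return -EINVAL;
	return 0;
}
```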
Fixes: aa1fa28fc7 ("io_uring: add support for recvmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 588faa1ea5eecb351100ee5d187b9be99210f70d ]
We should check unused fields for non-zero values and return -EINVAL if
any are set, making it consistent with other opcodes.
Fixes: 0fa03c624d ("io_uring: add support for sendmsg()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7acae6183cf37c48b8da48bbbdb78820fb3913f3 upstream.
The request will be inserted into the ci->i_unsafe_dirops before
assigning the req->r_session, so it's possible to hit a
NULL pointer dereference here.
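The hazard can be sketched in user space as follows. The types and the
function name are hypothetical and only model the pattern (a request
that is already visible on a list before its session pointer is set);
the actual ceph change may be structured differently.

```c
#include <stddef.h>

struct model_session {
	int id;
};

struct model_request {
	struct model_session *session; /* may still be NULL after insertion */
	struct model_request *next;
};

/* Sum the session ids of all listed requests, skipping any request
 * whose session has not been assigned yet. */
static int model_sum_session_ids(const struct model_request *head)
{
	int sum = 0;

	for (const struct model_request *r = head; r; r = r->next) {
		if (!r->session)
			continue; /* dereferencing here would crash */
		sum += r->session->id;
	}
	return sum;
}
```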
Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/55327
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This reverts commit 0c5b51622c.
It is no longer needed as we are able to update the abi at this point in
time.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I5b2c6e89737a2ac647f19719a1ccf256c2794a02
This reverts commit c31598eb0b.
It is no longer needed as we are able to update the abi at this point in
time.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ia44bf70cb550e4840e3be1be7c8b2b0bea0a330e
This reverts commit ff1561ac7f.
It is no longer needed as we are able to update the abi at this point in
time.
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I20acdca0aa5b59436e890887c03f125f027a9d45
commit ca93e44bfb5fd7996b76f0f544999171f647f93b upstream
Some users recently reported that MariaDB was getting a read corruption
when using io_uring on top of btrfs. This started to happen in 5.16,
after commit 51bd9563b6783d ("btrfs: fix deadlock due to page faults
during direct IO reads and writes"). That changed btrfs to use the new
iomap flag IOMAP_DIO_PARTIAL and to disable page faults before calling
iomap_dio_rw(). This was necessary to fix deadlocks when the iovector
corresponds to a memory mapped file region. That type of scenario is
exercised by test case generic/647 from fstests.
For this MariaDB scenario, we attempt to read 16K from file offset X
using IOCB_NOWAIT and io_uring. In that range we have 4 extents, each
with a size of 4K, and what happens is the following:
1) btrfs_direct_read() disables page faults and calls iomap_dio_rw();
2) iomap creates a struct iomap_dio object, its reference count is
initialized to 1 and its ->size field is initialized to 0;
3) iomap calls btrfs_dio_iomap_begin() with file offset X, which finds
the first 4K extent, and sets up an iomap for this extent consisting
of a single page;
4) At iomap_dio_bio_iter(), we are able to access the first page of the
buffer (struct iov_iter) with bio_iov_iter_get_pages() without
triggering a page fault;
5) iomap submits a bio for this 4K extent
(iomap_dio_submit_bio() -> btrfs_submit_direct()) and increments
the refcount on the struct iomap_dio object to 2; The ->size field
of the struct iomap_dio object is incremented to 4K;
6) iomap calls btrfs_dio_iomap_begin() again, this time with a file
offset of X + 4K. There we set up an iomap for the next extent
that also has a size of 4K;
7) Then at iomap_dio_bio_iter() we call bio_iov_iter_get_pages(),
which tries to access the next page (2nd page) of the buffer.
This triggers a page fault and returns -EFAULT;
8) At __iomap_dio_rw() we see the -EFAULT, but we reset the error
to 0 because we passed the flag IOMAP_DIO_PARTIAL to iomap and
the struct iomap_dio object has a ->size value of 4K (we submitted
a bio for an extent already). The 'wait_for_completion' variable
is not set to true, because our iocb has IOCB_NOWAIT set;
9) At the bottom of __iomap_dio_rw(), we decrement the reference count
of the struct iomap_dio object from 2 to 1. Because we were not
the only ones holding a reference on it and 'wait_for_completion' is
set to false, -EIOCBQUEUED is returned to btrfs_direct_read(), which
just returns it up the callchain, up to io_uring;
10) The bio submitted for the first extent (step 5) completes and its
bio endio function, iomap_dio_bio_end_io(), decrements the last
reference on the struct iomap_dio object, resulting in calling
iomap_dio_complete_work() -> iomap_dio_complete().
11) At iomap_dio_complete() we adjust the iocb->ki_pos from X to X + 4K
and return 4K (the amount of io done) to iomap_dio_complete_work();
12) iomap_dio_complete_work() calls the iocb completion callback,
iocb->ki_complete() with a second argument value of 4K (total io
done) and the iocb with the adjusted ki_pos of X + 4K. This results
in completing the read request for io_uring, leaving it with a
result of 4K bytes read, and only the first page of the buffer
filled in, while the remaining 3 pages, corresponding to the other
3 extents, were not filled;
13) For the application, the result is unexpected because if we ask
to read N bytes, it expects to get N bytes read as long as those
N bytes don't cross the EOF (i_size).
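The sequence above can be condensed into a small user-space model. It
is a sketch under stated assumptions, not the iomap code: struct
model_dio, model_nowait_partial_read() and MODEL_EIOCBQUEUED are names
invented here, and the model assumes IOMAP_DIO_PARTIAL and IOCB_NOWAIT
are always in effect and that each extent is 4K.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_EXTENT_SIZE 4096u
#define MODEL_EIOCBQUEUED 529 /* kernel-internal code, redefined for the model */

struct model_dio {
	int ref;     /* reference count, starts at 1 (step 2) */
	size_t size; /* bytes submitted so far, starts at 0 */
};

/*
 * Walk 'nr_extents' extents; only the first 'resident_pages' pages of
 * the buffer can be pinned without a fault. Returns what the submitter
 * sees and stores in *completed what the completion callback would
 * later report.
 */
static int model_nowait_partial_read(int nr_extents, int resident_pages,
				     size_t *completed)
{
	struct model_dio dio = { .ref = 1, .size = 0 };
	int ret = 0;

	for (int i = 0; i < nr_extents; i++) {
		if (i >= resident_pages) { /* step 7: page fault */
			ret = -EFAULT;
			break;
		}
		dio.ref++; /* step 5: one bio in flight */
		dio.size += MODEL_EXTENT_SIZE;
	}

	/* step 8: partial progress hides the -EFAULT */
	if (ret == -EFAULT && dio.size > 0)
		ret = 0;
	if (ret < 0) {
		*completed = 0;
		return ret;
	}

	dio.ref--;             /* step 9: drop the submitter's reference */
	*completed = dio.size; /* steps 10-12: completion reports ->size */
	return dio.ref > 0 ? -MODEL_EIOCBQUEUED : (int)dio.size;
}
```

With 4 extents and 1 resident page, the model returns the queued
status to the submitter while the completion reports only 4096 bytes,
which is the short read seen by the application.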
MariaDB reports this as an error, as it's not expecting a short read,
since it knows it's asking for read operations fully within the i_size
boundary. This is typical in many applications, although it is
debatable whether they should instead react to such short reads by issuing
more read calls to get the remaining data. Nevertheless, the short read
happened due to a change in btrfs regarding how it deals with page
faults while in the middle of a read operation, and there's no reason
why btrfs can't have the previous behaviour of returning the whole data
that was requested by the application.
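For reference, an application-side loop that tolerates short reads
could look like the sketch below. read_fully() is a hypothetical helper
name; it uses plain pread() rather than io_uring, and with O_DIRECT it
assumes each short count leaves the next offset and buffer address
suitably aligned (true for the 4K-multiple short reads described here).

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read 'len' bytes at offset 'off', reissuing pread() after short
 * reads. Returns the number of bytes read (less than 'len' only at
 * EOF), or -1 with errno set on error.
 */
static ssize_t read_fully(int fd, void *buf, size_t len, off_t off)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = pread(fd, (char *)buf + done, len - done,
				  off + (off_t)done);

		if (n < 0) {
			if (errno == EINTR)
				continue; /* interrupted, try again */
			return -1;
		}
		if (n == 0)
			break; /* EOF */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```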
The problem can also be triggered with the following simple program:
/* Get O_DIRECT */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
#include <liburing.h>
int main(int argc, char *argv[])
{
char *foo_path;
struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;
struct iovec iovec;
int fd;
long pagesize;
void *write_buf;
void *read_buf;
ssize_t ret;
int i;
if (argc != 2) {
fprintf(stderr, "Use: %s <directory>\n", argv[0]);
return 1;
}
foo_path = malloc(strlen(argv[1]) + 5);
if (!foo_path) {
fprintf(stderr, "Failed to allocate memory for file path\n");
return 1;
}
strcpy(foo_path, argv[1]);
strcat(foo_path, "/foo");
/*
* Create file foo with 2 extents, each with a size matching
* the page size. Then allocate a buffer to read both extents
* with io_uring, using O_DIRECT and IOCB_NOWAIT. Before doing
* the read with io_uring, access the first page of the buffer
* to fault it in, so that during the read we only trigger a
* page fault when accessing the second page of the buffer.
*/
fd = open(foo_path, O_CREAT | O_TRUNC | O_WRONLY |
O_DIRECT, 0666);
if (fd == -1) {
fprintf(stderr,
"Failed to create file 'foo': %s (errno %d)\n",
strerror(errno), errno);
return 1;
}
pagesize = sysconf(_SC_PAGE_SIZE);
ret = posix_memalign(&write_buf, pagesize, 2 * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate write buffer\n");
return 1;
}
memset(write_buf, 0xab, pagesize);
memset(write_buf + pagesize, 0xcd, pagesize);
/* Create 2 extents, each with a size matching page size. */
for (i = 0; i < 2; i++) {
ret = pwrite(fd, write_buf + i * pagesize, pagesize,
i * pagesize);
if (ret != pagesize) {
fprintf(stderr,
"Failed to write to file, ret = %ld errno %d (%s)\n",
ret, errno, strerror(errno));
return 1;
}
ret = fsync(fd);
if (ret != 0) {
fprintf(stderr, "Failed to fsync file\n");
return 1;
}
}
close(fd);
fd = open(foo_path, O_RDONLY | O_DIRECT);
if (fd == -1) {
fprintf(stderr,
"Failed to open file 'foo': %s (errno %d)\n",
strerror(errno), errno);
return 1;
}
ret = posix_memalign(&read_buf, pagesize, 2 * pagesize);
if (ret) {
fprintf(stderr, "Failed to allocate read buffer\n");
return 1;
}
/*
* Fault in only the first page of the read buffer.
* We want to trigger a page fault for the 2nd page of the
* read buffer during the read operation with io_uring
* (O_DIRECT and IOCB_NOWAIT).
*/
memset(read_buf, 0, 1);
ret = io_uring_queue_init(1, &ring, 0);
if (ret != 0) {
fprintf(stderr, "Failed to create io_uring queue\n");
return 1;
}
sqe = io_uring_get_sqe(&ring);
if (!sqe) {
fprintf(stderr, "Failed to get io_uring sqe\n");
return 1;
}
iovec.iov_base = read_buf;
iovec.iov_len = 2 * pagesize;
io_uring_prep_readv(sqe, fd, &iovec, 1, 0);
ret = io_uring_submit_and_wait(&ring, 1);
if (ret != 1) {
fprintf(stderr,
"Failed at io_uring_submit_and_wait()\n");
return 1;
}
ret = io_uring_wait_cqe(&ring, &cqe);
if (ret < 0) {
fprintf(stderr, "Failed at io_uring_wait_cqe()\n");
return 1;
}
printf("io_uring read result for file foo:\n\n");
printf(" cqe->res == %d (expected %ld)\n", cqe->res, 2 * pagesize);
printf(" memcmp(read_buf, write_buf) == %d (expected 0)\n",
memcmp(read_buf, write_buf, 2 * pagesize));
io_uring_cqe_seen(&ring, cqe);
io_uring_queue_exit(&ring);
return 0;
}
When running it on an unpatched kernel:
$ gcc io_uring_test.c -luring
$ mkfs.btrfs -f /dev/sda
$ mount /dev/sda /mnt/sda
$ ./a.out /mnt/sda
io_uring read result for file foo:
cqe->res == 4096 (expected 8192)
memcmp(read_buf, write_buf) == -205 (expected 0)
After this patch, the read always returns 8192 bytes, with the buffer
filled with the correct data. Although that reproducer always triggers
the bug in my test VMs, it may be less reliable in other environments:
the bug is not triggered if the bio for the first
extent completes and decrements the reference on the struct iomap_dio
object before we do the atomic_dec_and_test() on the reference at
__iomap_dio_rw().
Fix this in btrfs by having btrfs_dio_iomap_begin() return -EAGAIN
whenever we try to satisfy a non blocking IO request (IOMAP_NOWAIT flag
set) over a range that spans multiple extents (or a mix of extents and
holes). This avoids returning success to the caller when we only did
partial IO, which is not optimal for writes and for reads it's actually
incorrect, as the caller doesn't expect to get fewer bytes read than it has
requested (unless EOF is crossed), as previously mentioned. This is also
the type of behaviour that xfs follows (xfs_direct_write_iomap_begin()),
even though it doesn't use IOMAP_DIO_PARTIAL.
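The idea can be sketched as a single predicate. This is a simplified
model, not the btrfs change itself: model_nowait_check() and its
parameters are hypothetical, with 'mapped_len' standing for the length
of the one extent (or hole) found at the start of the requested range.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * For a non-blocking request, refuse to start IO when the first
 * mapping does not cover the whole requested range, so the caller
 * can retry in a blocking context instead of getting partial IO.
 */
static int model_nowait_check(uint64_t req_len, uint64_t mapped_len,
			      bool nowait)
{
	if (nowait && mapped_len < req_len)
		return -EAGAIN;
	return 0;
}
```

For the 16K read over four 4K extents described earlier, the first
mapping covers only 4K, so the model returns -EAGAIN rather than
letting a 4K partial result complete the request.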
A test case for fstests will follow soon.
Link: https://lore.kernel.org/linux-btrfs/CABVffEM0eEWho+206m470rtM0d9J8ue85TtR-A_oVTuGLWFicA@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/CAHF2GV6U32gmqSjLe=XKgfcZAmLCiH26cJ2OnHGp5x=VAH4OHQ@mail.gmail.com/
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 51bd9563b6783de8315f38f7baed949e77c42311 upstream
If we do a direct IO read or write when the buffer given by the user is
memory mapped to the file range we are going to do IO on, we end up
in a deadlock. This is triggered by the new test case generic/647 from
fstests.
For a direct IO read we get a trace like this:
[967.872718] INFO: task mmap-rw-fault:12176 blocked for more than 120 seconds.
[967.874161] Not tainted 5.14.0-rc7-btrfs-next-95 #1
[967.874909] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[967.875983] task:mmap-rw-fault state:D stack: 0 pid:12176 ppid: 11884 flags:0x00000000
[967.875992] Call Trace:
[967.875999] __schedule+0x3ca/0xe10
[967.876015] schedule+0x43/0xe0
[967.876020] wait_extent_bit.constprop.0+0x1eb/0x260 [btrfs]
[967.876109] ? do_wait_intr_irq+0xb0/0xb0
[967.876118] lock_extent_bits+0x37/0x90 [btrfs]
[967.876150] btrfs_lock_and_flush_ordered_range+0xa9/0x120 [btrfs]
[967.876184] ? extent_readahead+0xa7/0x530 [btrfs]
[967.876214] extent_readahead+0x32d/0x530 [btrfs]
[967.876253] ? lru_cache_add+0x104/0x220
[967.876255] ? kvm_sched_clock_read+0x14/0x40
[967.876258] ? sched_clock_cpu+0xd/0x110
[967.876263] ? lock_release+0x155/0x4a0
[967.876271] read_pages+0x86/0x270
[967.876274] ? lru_cache_add+0x125/0x220
[967.876281] page_cache_ra_unbounded+0x1a3/0x220
[967.876291] filemap_fault+0x626/0xa20
[967.876303] __do_fault+0x36/0xf0
[967.876308] __handle_mm_fault+0x83f/0x15f0
[967.876322] handle_mm_fault+0x9e/0x260
[967.876327] __get_user_pages+0x204/0x620
[967.876332] ? get_user_pages_unlocked+0x69/0x340
[967.876340] get_user_pages_unlocked+0xd3/0x340
[967.876349] internal_get_user_pages_fast+0xbca/0xdc0
[967.876366] iov_iter_get_pages+0x8d/0x3a0
[967.876374] bio_iov_iter_get_pages+0x82/0x4a0
[967.876379] ? lock_release+0x155/0x4a0
[967.876387] iomap_dio_bio_actor+0x232/0x410
[967.876396] iomap_apply+0x12a/0x4a0
[967.876398] ? iomap_dio_rw+0x30/0x30
[967.876414] __iomap_dio_rw+0x29f/0x5e0
[967.876415] ? iomap_dio_rw+0x30/0x30
[967.876420] ? lock_acquired+0xf3/0x420
[967.876429] iomap_dio_rw+0xa/0x30
[967.876431] btrfs_file_read_iter+0x10b/0x140 [btrfs]
[967.876460] new_sync_read+0x118/0x1a0
[967.876472] vfs_read+0x128/0x1b0
[967.876477] __x64_sys_pread64+0x90/0xc0
[967.876483] do_syscall_64+0x3b/0xc0
[967.876487] entry_SYSCALL_64_after_hwframe+0x44/0xae
[967.876490] RIP: 0033:0x7fb6f2c038d6
[967.876493] RSP: 002b:00007fffddf586b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000011
[967.876496] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007fb6f2c038d6
[967.876498] RDX: 0000000000001000 RSI: 00007fb6f2c17000 RDI: 0000000000000003
[967.876499] RBP: 0000000000001000 R08: 0000000000000003 R09: 0000000000000000
[967.876501] R10: 0000000000001000 R11: 0000000000000246 R12: 0000000000000003
[967.876502] R13: 0000000000000000 R14: 00007fb6f2c17000 R15: 0000000000000000
This happens because at btrfs_dio_iomap_begin() we lock the extent range
and return with it locked - we only unlock in the endio callback, at
end_bio_extent_readpage() -> endio_readpage_release_extent(). Then, after
iomap calls the btrfs_dio_iomap_begin() callback, it triggers page
faults that result in reading the pages through the readahead callback
btrfs_readahead(), where we end up attempting to lock the
same extent range again (or a subrange of what we locked before), resulting
in the deadlock.
For a direct IO write, the scenario is a bit different, and it results in
a trace like this:
[1132.442520] run fstests generic/647 at 2021-08-31 18:53:35
[1330.349355] INFO: task mmap-rw-fault:184017 blocked for more than 120 seconds.
[1330.350540] Not tainted 5.14.0-rc7-btrfs-next-95 #1
[1330.351158] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1330.351900] task:mmap-rw-fault state:D stack: 0 pid:184017 ppid:183725 flags:0x00000000
[1330.351906] Call Trace:
[1330.351913] __schedule+0x3ca/0xe10
[1330.351930] schedule+0x43/0xe0
[1330.351935] btrfs_start_ordered_extent+0x108/0x1c0 [btrfs]
[1330.352020] ? do_wait_intr_irq+0xb0/0xb0
[1330.352028] btrfs_lock_and_flush_ordered_range+0x8c/0x120 [btrfs]
[1330.352064] ? extent_readahead+0xa7/0x530 [btrfs]
[1330.352094] extent_readahead+0x32d/0x530 [btrfs]
[1330.352133] ? lru_cache_add+0x104/0x220
[1330.352135] ? kvm_sched_clock_read+0x14/0x40
[1330.352138] ? sched_clock_cpu+0xd/0x110
[1330.352143] ? lock_release+0x155/0x4a0
[1330.352151] read_pages+0x86/0x270
[1330.352155] ? lru_cache_add+0x125/0x220
[1330.352162] page_cache_ra_unbounded+0x1a3/0x220
[1330.352172] filemap_fault+0x626/0xa20
[1330.352176] ? filemap_map_pages+0x18b/0x660
[1330.352184] __do_fault+0x36/0xf0
[1330.352189] __handle_mm_fault+0x1253/0x15f0
[1330.352203] handle_mm_fault+0x9e/0x260
[1330.352208] __get_user_pages+0x204/0x620
[1330.352212] ? get_user_pages_unlocked+0x69/0x340
[1330.352220] get_user_pages_unlocked+0xd3/0x340
[1330.352229] internal_get_user_pages_fast+0xbca/0xdc0
[1330.352246] iov_iter_get_pages+0x8d/0x3a0
[1330.352254] bio_iov_iter_get_pages+0x82/0x4a0
[1330.352259] ? lock_release+0x155/0x4a0
[1330.352266] iomap_dio_bio_actor+0x232/0x410
[1330.352275] iomap_apply+0x12a/0x4a0
[1330.352278] ? iomap_dio_rw+0x30/0x30
[1330.352292] __iomap_dio_rw+0x29f/0x5e0
[1330.352294] ? iomap_dio_rw+0x30/0x30
[1330.352306] btrfs_file_write_iter+0x238/0x480 [btrfs]
[1330.352339] new_sync_write+0x11f/0x1b0
[1330.352344] ? NF_HOOK_LIST.constprop.0.cold+0x31/0x3e
[1330.352354] vfs_write+0x292/0x3c0
[1330.352359] __x64_sys_pwrite64+0x90/0xc0
[1330.352365] do_syscall_64+0x3b/0xc0
[1330.352369] entry_SYSCALL_64_after_hwframe+0x44/0xae
[1330.352372] RIP: 0033:0x7f4b0a580986
[1330.352379] RSP: 002b:00007ffd34d75418 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[1330.352382] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f4b0a580986
[1330.352383] RDX: 0000000000001000 RSI: 00007f4b0a3a4000 RDI: 0000000000000003
[1330.352385] RBP: 00007f4b0a3a4000 R08: 0000000000000003 R09: 0000000000000000
[1330.352386] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
[1330.352387] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Unlike for reads, at btrfs_dio_iomap_begin() we return with the extent
range unlocked, but later when the page faults are triggered and we try
to read the extents, we end up in btrfs_lock_and_flush_ordered_range() where
we find the ordered extent for our write, created by the iomap callback
btrfs_dio_iomap_begin(), and we wait for it to complete, which makes us
deadlock since we can't complete the ordered extent without reading the
pages (the iomap code only submits the bio after the pages are faulted
in).
Fix this by setting the nofault attribute of the given iov_iter and retrying
the direct IO read/write if we get an -EFAULT error returned from iomap.
For reads, also disable page faults completely, because when we
read from a hole or a prealloc extent, we can still trigger page faults
due to the call to iov_iter_zero() done by iomap - at the moment, it is
oblivious to the value of the ->nofault attribute of an iov_iter.
We also need to keep track of the number of bytes written or read, and
pass it to iomap_dio_rw(), as well as use the new flag IOMAP_DIO_PARTIAL.
This depends on the iov_iter and iomap changes introduced in commit
c03098d4b9ad ("Merge tag 'gfs2-v5.15-rc5-mmap-fault' of
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2").
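The retry scheme can be modeled in user space as below. This is a
sketch with invented names (struct model_buf, model_nofault_dio(),
model_fault_in(), model_dio_with_retry()), not the btrfs code: a
"nofault" attempt transfers only up to the first non-resident byte,
and the caller faults more of the buffer in, with no locks held,
before retrying from where it stopped.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_PAGE 4096u

struct model_buf {
	size_t len;      /* total bytes requested */
	size_t resident; /* leading bytes that are faulted in */
};

/* One attempt with page faults disabled: returns the bytes moved
 * starting at 'done', or -EFAULT if no progress is possible. */
static long model_nofault_dio(const struct model_buf *b, size_t done)
{
	size_t limit = b->resident < b->len ? b->resident : b->len;

	if (done >= limit)
		return -EFAULT;
	return (long)(limit - done);
}

/* Fault in one more page of the buffer (done outside any locks). */
static void model_fault_in(struct model_buf *b)
{
	b->resident += MODEL_PAGE;
	if (b->resident > b->len)
		b->resident = b->len;
}

/* Keep track of the bytes done so far and retry after -EFAULT. */
static long model_dio_with_retry(struct model_buf *b)
{
	size_t done = 0;

	while (done < b->len) {
		long ret = model_nofault_dio(b, done);

		if (ret == -EFAULT) {
			model_fault_in(b);
			continue;
		}
		done += (size_t)ret;
	}
	return (long)done;
}
```

The running 'done' count plays the role of the bytes-read/written
value that is carried across retries and passed back to iomap_dio_rw().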
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>