-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5RXt4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgprPZEACevRyIjhdEmD9eyXlixw1O6zs/dHR4QVf6
RuuNoX1Ssxmf4zHBPcifBVenoUhIviJ/hBACdYNuPz+YWdx3FO/BF8FFv656ssHr
xhj8sC/8vz+fnwKyb/Lwt56NdRc8Ddtw6iWsF4po650n7JItq8BmDkHT/y3SJI0Z
L1UrUX4TxXEDfKsW2gbNCNIPjaiDSErJFP6FT1pcUZwLmF3zyJC6btR21AaAJbRC
CwatdbBg9K1SnvArn/NMd16C0p1LVBt3P2clagC90zlkCyb2vANN+YTnbo7KCsX7
XmssosPu5lamJQdsTNNxH7DHVUh/lZg9CEhUpy2ctXYSf1a6Ak6Y3qktCM5VW7FX
x+6aZdJj0UDdA+MvdcHZWjxKfJFmbS2iRjTfbTXpyLX/1qFmvI9ww9xzgP68iK8s
guxLxOQoCDx102SNKGmffcKY2C+yl3HHGRZATxy9C85WSvz7bwtvcbWwT/x13UxO
TWa8ghe0N4jfJ3sNfADZ0Dtehrj8ryslrRc0XS6y7v3m7MqOABkz7texH006j43G
FW23kqMyYJTlm+JIEIly9C5MSd4nFU0gyfBtMKGMBHF2JHgZez1LkDEiC2B5O1he
m9IAhGgFzgOuTFwJxwLcutDNUv4GyK6dMdLl+DzAv0hthSHjVsT2vb06X99NMenq
nzMADXHvxQ==
=euzH
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-02-22' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Here's a small collection of fixes that were queued up:
- Remove unnecessary NULL check (Dan)
- Missing io_req_cancelled() call in fallocate (Pavel)
- Put the cleanup check for aux data in the right spot (Pavel)
- Two fixes for SQPOLL (Stefano, Xiaoguang)"
* tag 'io_uring-5.6-2020-02-22' of git://git.kernel.dk/linux-block:
io_uring: fix __io_iopoll_check deadlock in io_sq_thread
io_uring: prevent sq_thread from spinning when it should stop
io_uring: fix use-after-free by io_cleanup_req()
io_uring: remove unnecessary NULL checks
io_uring: add missing io_req_cancelled()
Since commit a3a0e43fd7 ("io_uring: don't enter poll loop if we have
CQEs pending"), if we already events pending, we won't enter poll loop.
In case SETUP_IOPOLL and SETUP_SQPOLL are both enabled, if app has
been terminated and don't reap pending events which are already in cq
ring, and there are some reqs in poll_list, io_sq_thread will enter
__io_iopoll_check(), and find pending events, then return, this loop
will never have a chance to exit.
I have seen this issue in fio stress tests, to fix this issue, let
io_sq_thread call io_iopoll_getevents() with argument 'min' being zero,
and remove __io_iopoll_check().
Fixes: a3a0e43fd7 ("io_uring: don't enter poll loop if we have CQEs pending")
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When CONFIG_QFMT_V2 is configured as a module, the test in
ext4_feature_set_ok() fails and so mount of filesystems with quota or
project features fails. Fix the test to use IS_ENABLED macro which
works properly even for modules.
Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz
Fixes: d65d87a074 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
on it, the following warning in ext4_add_complete_io() can be hit:
WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120
Here's a minimal reproducer (not 100% reliable) (root isn't required):
while true; do
sync
done &
while true; do
rm -f file
touch file
chattr -e file
echo X >> file
chattr +e file
done
The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
(which only returns true on extent-based files) is checked once to set
the number of reserved journal credits, and also again later to select
the flags for ext4_map_blocks() and copy the reserved journal handle to
ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set,
the first check can see dioread_nolock disabled while the later one can
see it enabled, causing the reserved handle to unexpectedly be NULL.
Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
related to doing so as well, fix this by synchronizing changing
EXT4_EXTENTS_FL with ext4_writepages() via the existing
s_writepages_rwsem (previously called s_journal_flag_rwsem).
This was originally reported by syzbot without a reproducer at
https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
but now that dioread_nolock is the default I also started seeing this
when running syzkaller locally.
Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
Fixes: 6b523df4fb ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
In preparation for making s_journal_flag_rwsem synchronize
ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
flags (rather than just JOURNAL_DATA as it does currently), rename it to
s_writepages_rwsem.
Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
During an online resize an array of s_flex_groups structures gets replaced
so it can get enlarged. If there is a concurrent access to the array and
this memory has been reused then this can lead to an invalid memory access.
The s_flex_group array has been converted into an array of pointers rather
than an array of structures. This is to ensure that the information
contained in the structures cannot get out of sync during a resize due to
an accessor updating the value in the old structure after it has been
copied but before the array pointer is updated. Since the structures them-
selves are no longer copied but only the pointers to them this case is
mitigated.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
This patch drops 'cur_mm' before calling cond_resched(), to prevent
the sq_thread from spinning even when the user process is finished.
Before this patch, if the user process ended without closing the
io_uring fd, the sq_thread continues to spin until the
'sq_thread_idle' timeout ends.
In the worst case where the 'sq_thread_idle' parameter is bigger than
INT_MAX, the sq_thread will spin forever.
Fixes: 6c271ce2f1 ("io_uring: add submission polling")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:
* the fsync is ranged
* the file has prealloc extents beyond eof
* writeback for a range different from the fsync range starts
during the fsync
* the size of the file is not sector size aligned
Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.
The following diagram shows how the deadlock can happen.
CPU 1 CPU 2
btrfs_sync_file()
--> for range [0, 1MiB)
--> inode has a size of
1MiB and has 1 prealloc
extent beyond the
i_size, starting at offset
4MiB
flushes all delalloc for the
range [0MiB, 1MiB) and waits
for the respective ordered
extents to complete
--> before task at CPU 1 locks the
inode, a write into file range
[1MiB, 2MiB + 1KiB) is made
--> i_size is updated to 2MiB + 1KiB
--> writeback is started for that
range, [1MiB, 2MiB + 4KiB)
--> end offset rounded up to
be sector size aligned
btrfs_log_dentry_safe()
btrfs_log_inode_parent()
btrfs_log_inode()
btrfs_log_changed_extents()
btrfs_log_prealloc_extents()
--> does a search on the
inode's root
--> holds a read lock on
leaf X
btrfs_finish_ordered_io()
--> locks range [1MiB, 2MiB + 4KiB)
--> end offset rounded up
to be sector size aligned
--> tries to cow leaf X, through
insert_reserved_file_extent()
--> already locked by the
task at CPU 1
btrfs_truncate_inode_items()
--> gets an i_size of
2MiB + 1KiB, which is
not sector size
aligned
--> tries to lock file
range [2MiB, (u64)-1)
--> the start range
is rounded down
from 2MiB + 1K
to 2MiB to be sector
size aligned
--> but the subrange
[2MiB, 2MiB + 4KiB) is
already locked by
task at CPU 2 which
is waiting to get a
write lock on leaf X
for which we are
holding a read lock
*** deadlock ***
This results in a stack trace like the following, triggered by test case
generic/561 from fstests:
[ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
[ 2779.979536] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2779.990136] kworker/u8:6 D 0 247 2 0x80004000
[ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
[ 2779.990466] Call Trace:
[ 2779.990491] ? __schedule+0x384/0xa30
[ 2779.990521] schedule+0x33/0xe0
[ 2779.990616] btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
[ 2779.990632] ? remove_wait_queue+0x60/0x60
[ 2779.990730] btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
[ 2779.990782] btrfs_search_slot+0x510/0x1000 [btrfs]
[ 2779.990869] btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
[ 2779.990944] __btrfs_drop_extents+0x161/0x1060 [btrfs]
[ 2779.990987] ? mark_held_locks+0x6d/0xc0
[ 2779.990994] ? __slab_alloc.isra.49+0x99/0x100
[ 2779.991060] ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
[ 2779.991145] insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
[ 2779.991222] ? start_transaction+0xdd/0x5c0 [btrfs]
[ 2779.991291] btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
[ 2779.991405] btrfs_work_helper+0xaa/0x720 [btrfs]
[ 2779.991432] process_one_work+0x26d/0x6a0
[ 2779.991460] worker_thread+0x4f/0x3e0
[ 2779.991481] ? process_one_work+0x6a0/0x6a0
[ 2779.991489] kthread+0x103/0x140
[ 2779.991499] ? kthread_create_worker_on_cpu+0x70/0x70
[ 2779.991515] ret_from_fork+0x3a/0x50
(...)
[ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
[ 2780.027480] Not tainted 5.6.0-rc2-btrfs-next-53 #1
[ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2780.030035] fsstress D 0 17375 17373 0x00004000
[ 2780.030038] Call Trace:
[ 2780.030044] ? __schedule+0x384/0xa30
[ 2780.030052] schedule+0x33/0xe0
[ 2780.030075] lock_extent_bits+0x20c/0x320 [btrfs]
[ 2780.030094] ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
[ 2780.030098] ? rcu_read_lock_sched_held+0x59/0xa0
[ 2780.030102] ? remove_wait_queue+0x60/0x60
[ 2780.030122] btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
[ 2780.030151] ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
[ 2780.030165] ? btrfs_search_slot+0x379/0x1000 [btrfs]
[ 2780.030195] btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
[ 2780.030202] ? do_raw_spin_unlock+0x49/0xc0
[ 2780.030215] ? btrfs_get_num_csums+0x10/0x10 [btrfs]
[ 2780.030239] btrfs_log_inode+0xf83/0x1124 [btrfs]
[ 2780.030251] ? __mutex_unlock_slowpath+0x45/0x2a0
[ 2780.030275] btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
[ 2780.030282] ? dget_parent+0xa1/0x370
[ 2780.030309] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[ 2780.030329] btrfs_sync_file+0x3f3/0x490 [btrfs]
[ 2780.030339] do_fsync+0x38/0x60
[ 2780.030343] __x64_sys_fdatasync+0x13/0x20
[ 2780.030345] do_syscall_64+0x5c/0x280
[ 2780.030348] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
[ 2780.030361] Code: Bad RIP value.
[ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
[ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
[ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
[ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
[ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90
So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.
Fixes: 28553fa992 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During an online resize an array of pointers to s_group_info gets replaced
so it can get enlarged. If there is a concurrent access to the array in
ext4_get_group_info() and this memory has been reused then this can lead to
an invalid memory access.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Balbir Singh <sblbir@amazon.com>
Cc: stable@kernel.org
During an online resize an array of pointers to buffer heads gets
replaced so it can get enlarged. If there is a racing block
allocation or deallocation which uses the old array, and the old array
has gotten reused this can lead to a GPF or some other random kernel
memory getting modified.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu
Reported-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
We tested a soft lockup problem in linux 4.19 which could also
be found in linux 5.x.
When dir inode takes up a large number of blocks, and if the
directory is growing when we are searching, it's possible the
restart branch could be called many times, and the do while loop
could hold cpu a long time.
Here is the call trace in linux 4.19.
[ 473.756186] Call trace:
[ 473.756196] dump_backtrace+0x0/0x198
[ 473.756199] show_stack+0x24/0x30
[ 473.756205] dump_stack+0xa4/0xcc
[ 473.756210] watchdog_timer_fn+0x300/0x3e8
[ 473.756215] __hrtimer_run_queues+0x114/0x358
[ 473.756217] hrtimer_interrupt+0x104/0x2d8
[ 473.756222] arch_timer_handler_virt+0x38/0x58
[ 473.756226] handle_percpu_devid_irq+0x90/0x248
[ 473.756231] generic_handle_irq+0x34/0x50
[ 473.756234] __handle_domain_irq+0x68/0xc0
[ 473.756236] gic_handle_irq+0x6c/0x150
[ 473.756238] el1_irq+0xb8/0x140
[ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4]
[ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4]
[ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4]
[ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4]
[ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4]
[ 473.756402] ext4_lookup+0x8c/0x258 [ext4]
[ 473.756407] __lookup_hash+0x8c/0xe8
[ 473.756411] filename_create+0xa0/0x170
[ 473.756413] do_mkdirat+0x6c/0x140
[ 473.756415] __arm64_sys_mkdirat+0x28/0x38
[ 473.756419] el0_svc_common+0x78/0x130
[ 473.756421] el0_svc_handler+0x38/0x78
[ 473.756423] el0_svc+0x8/0xc
[ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149]
Add cond_resched() to avoid soft lockup and to provide a better
system responding.
Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@kernel.org
EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4]
write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127:
ext4_write_end+0x4e3/0x750 [ext4]
ext4_update_i_disksize at fs/ext4/ext4.h:3032
(inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046
(inlined by) ext4_write_end at fs/ext4/inode.c:1287
generic_perform_write+0x208/0x2a0
ext4_buffered_write_iter+0x11f/0x210 [ext4]
ext4_file_write_iter+0xce/0x9e0 [ext4]
new_sync_write+0x29c/0x3b0
__vfs_write+0x92/0xa0
vfs_write+0x103/0x260
ksys_write+0x9d/0x130
__x64_sys_write+0x4c/0x60
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe
read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37:
ext4_writepages+0x10ac/0x1d00 [ext4]
mpage_map_and_submit_extent at fs/ext4/inode.c:2468
(inlined by) ext4_writepages at fs/ext4/inode.c:2772
do_writepages+0x5e/0x130
__writeback_single_inode+0xeb/0xb20
writeback_sb_inodes+0x429/0x900
__writeback_inodes_wb+0xc4/0x150
wb_writeback+0x4bd/0x870
wb_workfn+0x6b4/0x960
process_one_work+0x54c/0xbe0
worker_thread+0x80/0x650
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
Reported by Kernel Concurrency Sanitizer on:
CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
Workqueue: writeback wb_workfn (flush-7:0)
Since only the read is operating as lockless (outside of the
"i_data_sem"), load tearing could introduce a logic bug. Fix it by
adding READ_ONCE() for the read and WRITE_ONCE() for the write.
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
io_cleanup_req() should be called before req->io is freed, and so
shouldn't be after __io_free_req() -> __io_req_aux_free(). Also,
it will be ignored for in io_free_req_many(), which use
__io_req_aux_free().
Place cleanup_req() into __io_req_aux_free().
Fixes: 99bc4c3853 ("io_uring: fix iovec leaks")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In btrfs_wait_ordered_range() once we find an ordered extent that has
finished with an error we exit the loop and don't wait for any other
ordered extents that might be still in progress.
All the users of btrfs_wait_ordered_range() expect that there are no more
ordered extents in progress after that function returns. So past fixes
such like the ones from the two following commits:
ff612ba784 ("btrfs: fix panic during relocation after ENOSPC before
writeback happens")
28aeeac1dd ("Btrfs: fix panic when starting bg cache writeout after
IO error")
don't work when there are multiple ordered extents in the range.
Fix that by making btrfs_wait_ordered_range() wait for all ordered extents
even after it finds one that had an error.
Link: https://github.com/kdave/btrfs-progs/issues/228#issuecomment-569777554
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I hit the following warning while running my error injection stress
testing:
WARNING: CPU: 3 PID: 1453 at fs/btrfs/space-info.h:108 btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
RIP: 0010:btrfs_free_reserved_data_space_noquota+0xfd/0x160 [btrfs]
Call Trace:
btrfs_free_reserved_data_space+0x4f/0x70 [btrfs]
__btrfs_prealloc_file_range+0x378/0x470 [btrfs]
elfcorehdr_read+0x40/0x40
? elfcorehdr_read+0x40/0x40
? btrfs_commit_transaction+0xca/0xa50 [btrfs]
? dput+0xb4/0x2a0
? btrfs_log_dentry_safe+0x55/0x70 [btrfs]
? btrfs_sync_file+0x30e/0x420 [btrfs]
? do_fsync+0x38/0x70
? __x64_sys_fdatasync+0x13/0x20
? do_syscall_64+0x5b/0x1b0
? entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens if we fail to insert our reserved file extent. At this
point we've already converted our reservation from ->bytes_may_use to
->bytes_reserved. However once we break we will attempt to free
everything from [cur_offset, end] from ->bytes_may_use, but our extent
reservation will overlap part of this.
Fix this problem by adding ins.offset (our extent allocation size) to
cur_offset so we remove the actual remaining part from ->bytes_may_use.
I validated this fix using my inject-error.py script
python inject-error.py -o should_fail_bio -t cache_save_setup -t \
__btrfs_prealloc_file_range \
-t insert_reserved_file_extent.constprop.0 \
-r "-5" ./run-fsstress.sh
where run-fsstress.sh simply mounts and runs fsstress on a disk.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're allocating a logged extent we attempt to insert an extent
record for the file extent directly. We increase
space_info->bytes_reserved, because the extent entry addition will call
btrfs_update_block_group(), which will convert the ->bytes_reserved to
->bytes_used. However if we fail at any point while inserting the
extent entry we will bail and leave space on ->bytes_reserved, which
will trigger a WARN_ON() on umount. Fix this by pinning the space if we
fail to insert, which is what happens in every other failure case that
involves adding the extent entry.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_assert_delayed_root_empty() will check if the delayed root is
completely empty, but this is a filesystem-wide check. On cleanup we
may have allowed other transactions to begin, for whatever reason, and
thus the delayed root is not empty.
So remove this check from cleanup_one_transation(). This however can
stay in btrfs_cleanup_transaction(), because it checks only after all of
the transactions have been properly cleaned up, and thus is valid.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While running my error injection script I hit a panic when we tried to
clean up the fs_root when freeing the fs_root. This is because
fs_info->fs_root == PTR_ERR(-EIO), which isn't great. Fix this by
setting fs_info->fs_root = NULL; if we fail to read the root.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We clean up the delayed references when we abort a transaction but we
leave the pending qgroup extent records behind, leaking memory.
This patch destroys the extent records when we destroy the delayed refs
and makes sure ensure they're gone before releasing the transaction.
Fixes: 3368d001ba ("btrfs: qgroup: Record possible quota-related extent for qgroup.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
[ Rebased to latest upstream, remove to_qgroup() helper, use
rbtree_postorder_for_each_entry_safe() wrapper ]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Andrei Vagin reported that commit 0ddad21d3e ("pipe: use exclusive
waits when reading or writing") broke one of the CRIU tests. He even
has a trivial reproducer:
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
int main()
{
int p[2];
pid_t p1, p2;
int status;
if (pipe(p) == -1)
return 1;
p1 = fork();
if (p1 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
p2 = fork();
if (p2 == 0) {
close(p[1]);
read(p[0], &status, sizeof(status));
return 0;
}
sleep(1);
close(p[1]);
wait(&status);
wait(&status);
return 0;
}
and the problem - once he points it out - is obvious. We use these nice
exclusive waits, but when the last writer goes away, it then needs to
wake up _every_ reader (and conversely, the last reader disappearing
needs to wake every writer, of course).
In fact, when going through this, we had several small oddities around
how to wake things. We did in fact wake every reader when we changed
the size of the pipe buffers. But that's entirely pointless, since that
just acts as a possible source of new space - no new data to read.
And when we change the size of the buffer, we don't need to wake all
writers even when we add space - that case acts just as if somebody made
space by reading, and any writer that finds itself not filling it up
entirely will wake the next one.
On the other hand, on the exit path, we tried to limit the wakeups with
the proper poll keys etc, which is entirely pointless, because at that
point we obviously need to wake up everybody. So don't do that: just
wake up everybody - but only do that if the counts changed to zero.
So fix those non-IO wakeups to be more proper: space change doesn't add
any new data, but it might make room for writers, so it wakes up a
writer. And the actual changes to reader/writer counts should wake up
everybody, since everybody is affected (ie readers will all see EOF if
the writers have gone away, and writers will all get EPIPE if all
readers have gone away).
Fixes: 0ddad21d3e ("pipe: use exclusive waits when reading or writing")
Reported-and-tested-by: Andrei Vagin <avagin@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "kmsg" pointer can't be NULL and we have already dereferenced it so
a check here would be useless.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Downgrade the eCryptfs maintenance status to "Odd Fixes"
- Change my email address
- Fix a couple memory leaks in error paths
- Stability improvement to avoid a needless BUG_ON()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKvuQkp28KvPJn/fVfL7QslYSS/0FAl5K+wkACgkQfL7QslYS
S/1vrA//TIaQWadfWeXbdd8Kj+zEsd6BDZEiGq56w5WtuYe+XcAKdXITBLXXVZR1
/MHR4h16BpkmyRbKojvGjqECLf8W8XeZC30gEXJEM7PVbFvLvM92ZKUT/K3ax7Tt
HBQs8L0MM2BDWFDyfQ9vWnPakIvy1pHS4iHQeJgtcBfqcMS+u6fYmpoCIbWZ9UfA
7ilS931CjFxgRvKdwMRp4C08fUs7awqYcZxkDMjp3uAOSp0YFtp3/EiPrNAe1+ot
/PuwzV69a7raqRurdBUApj9QMY1VmKheY80Wgl4C0aS5lc+JUiOpclKDRW174W51
UFI9CanRcqNig7KY+eTenJWDS8l9nEK40nQWKHQZ5JT+kitWMyvjrXh6MwIRFvKE
N+7gSWZWumQknKSg3y1AVkEZK+mHw7lRJ8ziLCxrR28EkS3TTvw2mpp80ed0jC/v
SCUwj4rqP1l7wR7xa4vlYSXOLznK/DRBSrud7bbQlZOTjVXZAeNv2LOe8qLu8C0b
j+/Mhf6OKUd5c7Pjl+7YD1/DcKsUlayyVn0k2HX/V17QK1a5Kp5R+6mbvyFclq1T
nQTgu5prq7BiS49GGY/JePO0wKCi2+A3T87rj2Y1yfGjYNgcR/zsQa2LuPtP92Tt
5aAHpP+lC9CoJohYjM/cRncq1rsOnBchQRz+i/VgVtUqJ9vfgXM=
=kXnY
-----END PGP SIGNATURE-----
Merge tag 'ecryptfs-5.6-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs
Pull eCryptfs fixes from Tyler Hicks:
- downgrade the eCryptfs maintenance status to "Odd Fixes"
- change my email address
- fix a couple memory leaks in error paths
- stability improvement to avoid a needless BUG_ON()
* tag 'ecryptfs-5.6-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tyhicks/ecryptfs:
ecryptfs: replace BUG_ON with error handling code
eCryptfs: Replace deactivated email address
MAINTAINERS: eCryptfs: Update maintainer address and downgrade status
ecryptfs: fix a memory leak bug in ecryptfs_init_messaging()
ecryptfs: fix a memory leak bug in parse_tag_1_packet()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5K0xIACgkQxWXV+ddt
WDvihhAAjTdEr8Gbt4Beu+F9I2j+1+x/qdVADsrJErBJ5s1J0prXI27Vx5auHMMD
qJViTGTTjwlIo4jFI+AtlrzcE07eIMN1zzjVb2o8efrpkFDTfgT+dAkSVR8zAX20
nycsQ93QebCrlRGpPuXsniczy+QTRXWBY+vPXe1UbfGrzb3q+1pBJgfdq/7pkt/V
0RgB78JDVWUhw+YBjLuV0Y7tjmWlANGvofT34A34CR23i/JoIjVx7GfeQyILBmHE
iuCwzMgAdIKlRpjG5WJeK60bW0Dkjmbr3HDyJexiXSa4u2UwgUlT8UsITbVCr48t
OrfYxb7aWmwEqa5fbZzIiyTCWDKSxiQ5X0XW8e6ehdkh+s9DWmNp48xt9OogpYcv
8qni+vMgjrOLZwj//DgHpr2CYrISxJ+vhCvaOLMK++8TMIH/EQZEyaYkKn5Osg/8
iSwI9b2IF8LmnnvWhkbs1p4Jn2Ca4fMhlN2D5xflwSK/lbhgJk5E6P+4h6pQ9MvG
U5bWR2mqW1xT+lCO7AD+MBq+VbF/BZik0TjuNssoYJJtx/VbTbpydrHW65SvZvQK
QT7MTHekkbtIR+SGYl66n1Sc4RG7znWNwjfoyVEbSblLIczfzDDsl1O/9ldqbia4
BoPWghEO8Wy543UTH9DyX+oMzB2Pwb/TMa4eQyNhbQMqcwXHv5Q=
=nlEW
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"This is the fix for sleeping in a locked section bug reported by Dave
Jones, caused by a patch dependence in development and pulled
branches.
I picked the existing patch over the fixup that Filipe sent, as it's a
bit more generic fix. I've verified it with a specific test case, some
rsync stress and one round of fstests"
* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't set path->leave_spinning for truncate
The only time we actually leave the path spinning is if we're truncating
a small amount and don't actually free an extent, which is not a common
occurrence. We have to set the path blocking in order to add the
delayed ref anyway, so the first extent we find we set the path to
blocking and stay blocking for the duration of the operation. With the
upcoming file extent map stuff there will be another case that we have
to have the path blocking, so just swap to blocking always.
Note: this patch also fixes a warning after 28553fa992 ("Btrfs: fix
race between shrinking truncate and fiemap") got merged that inserts
extent locks around truncation so the path must not leave spinning locks
after btrfs_search_slot.
[70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565
[70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync
[70.794863] 5 locks held by rsync/1141:
[70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50
[70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100
[70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560
[70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160
[70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2
[70.795362] Call Trace:
[70.795374] dump_stack+0x71/0xa0
[70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6
[70.795459] kmem_cache_alloc+0x1d3/0x290
[70.795471] alloc_extent_state+0x22/0x1c0
[70.795544] __clear_extent_bit+0x3ba/0x580
[70.795557] ? _raw_spin_unlock_irq+0x24/0x30
[70.795569] btrfs_truncate_inode_items+0x339/0xe50
[70.795647] btrfs_evict_inode+0x269/0x540
[70.795659] ? dput.part.38+0x29/0x460
[70.795671] evict+0xcd/0x190
[70.795682] __dentry_kill+0xd6/0x180
[70.795754] dput.part.38+0x2ad/0x460
[70.795765] do_renameat2+0x3cb/0x540
[70.795777] __x64_sys_rename+0x1c/0x20
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Fixes: 28553fa992 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl5JQQUACgkQxWXV+ddt
WDv+fRAAh2VeACdVgGPCqH3kBphdBzqT2x9EUWZK++AWTr0Upg4Mz1mvD2h1WX6L
Yd/cuG0hIy2AFl1fEwrOp9d64PGFRnVcs0IS0C9GC/kH9ZzrSuz3pcfni3uxCr0N
t+/Pghxb654fLPd8eho9Pnc9bi/EPPRZ+mmwup4A9MyC7Bdfzf821ykrhqwam54V
LWRUanVfeJOuxNbgF2Hjnr9ZI2fALsFfh5+q1VCeNsm76Le67RGVi7V6i9Gxj8eE
3zOH/XM1Jzft3cLinfimIUC0BQKBuFXSKnYun5zP47Y5VHgxYCcfBBAD71uGgs52
mxaz1vLGE6GZo631zBBXvM6BalEHyQROP8Daman9NiJDctkvICXzv/bwi4k5ZHSF
i45MANAtBb10S/AYitscqmGR7sYHZtKBkgJHjH+0YqKgcu3OWInqNVuW77kZEguV
/UjcdjAmxirvyAaxOdfXtBfuZKBU1kL+dZwg4OeqzbDoBsUAtkEG+3oIcjkGnP9e
6QuQx5XYKmI1tugcAz0uj5ryXW6UtIFVy+Fgxht+XMSNRR5CiUpu7Ndm/ZrKx5RS
/hv2KZot+/p7IA9J0YvZbYwwvw28YxR4PgGMOKYhqKnTVHR6eaFTDNKiaQRkJaXr
Nimk6TQrpc9Qu/rcoeyuQajxCRhwSqNAJH97MJPMDz39xolEys8=
=i1Ct
-----END PGP SIGNATURE-----
Merge tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Two races fixed, memory leak fix, sysfs directory fixup and two new
log messages:
- two fixed race conditions: extent map merging and truncate vs
fiemap
- create the right sysfs directory with device information and move
the individual device dirs under it
- print messages when the tree-log is replayed at mount time or
cannot be replayed on remount"
* tag 'for-5.6-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: sysfs, move device id directories to UUID/devinfo
btrfs: sysfs, add UUID/devinfo kobject
Btrfs: fix race between shrinking truncate and fiemap
btrfs: log message when rw remount is attempted with unclean tree-log
btrfs: print message when tree-log replay starts
Btrfs: fix race between using extent maps and merging them
btrfs: ref-verify: fix memory leaks
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl5IHmgACgkQiiy9cAdy
T1Hv9gv/QzgtpGUkK+8mLkv6BCoc0vChvr6s3xIxuSBOnAdhe6km1NSwVek7lUxU
Liupsw+mLOt4TO8hCsHKEIfR4QKnPp22UF0nUj6bYWMJH2+zNNwlcV1TF01xGW8t
J7jTKNd0MYOC1jQ+NLX4pwAz85Sp6yw4iYxg+PBpgPjEOUlGFttVK7U22nDMR1cz
hlNY8KfJ3CiAOQEGIu+lIpQAo2/WcvUUGKjDwOlIRUwsodPJdy5XCromPa4wktuS
OcDcrehfs16QfYW6K7rFSvRvcHUJ9nOf/Ht31rm9kD3DL8nN3MSV2vVWzszq+b4y
LFm/6Z0/rjYE9+W+/w0H5jFGatjGpChrGzzGTaTjqIEAh9xNFGmfd96ZpixGZpIL
SaJD8ab0RBUJ5Q0DAmBJHcaWCuN1ELbGu2ZFJNiJROH53epaRtyi54ryVv9XHNsh
RQTbmmnbTPPqlv4S0Rd++Z6H+eYBSU1Vbm7lmZzK84XHQ0w2ttnU+grVsxoUdPlb
530Yvu5z
=IQBE
-----END PGP SIGNATURE-----
Merge tag '5.6-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four small CIFS/SMB3 fixes. One (the EA overflow fix) for stable"
* tag '5.6-rc1-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: make sure we do not overflow the max EA buffer size
cifs: enable change notification for SMB2.1 dialect
cifs: Fix mode output in debugging statements
cifs: fix mount option display for sec=krb5i
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAl5IrGgACgkQ8vlZVpUN
gaMi2Qf+NuXFQ6xpxjw90ZmkOXbsqbFVAIx3yjzW5WKIVyHrhG19P8NuNaCbUajP
WEUzrY/LMmcODd46zKcCR61jOygKCJ5NXyF8qTEFltiMuqugvwmTv49ofRxX+Zlf
YkavQGo+cPs5p1xzo+bKqhA29zhc4PApd3bGpjgUqDlfQxBmm2vNpkbXKj/Bk6CL
k9c1EodygpTO45Kb3Y/JvPcbTLOOu0hPnrCemtb1Vc9tm0j6I+g5RtQofUag+1BX
FOR5Z2EUwByb7F+TTxO1Jse0oFjcxH9J/VqeKi2Fbzucdq3NrRypiFj0XPFBRPFn
g7ONbaoIGiK+02pCNFuwbY0WmLQVfg==
=T3j1
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Miscellaneous ext4 bug fixes (all stable fodder)"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: improve explanation of a mount failure caused by a misconfigured kernel
jbd2: do not clear the BH_Mapped flag when forgetting a metadata buffer
jbd2: move the clearing of b_modified flag to the journal_unmap_buffer()
ext4: add cond_resched() to ext4_protect_reserved_inode
ext4: fix checksum errors with indexed dirs
ext4: fix support for inode sizes > 1024 bytes
ext4: simplify checking quota limits in ext4_statfs()
ext4: don't assume that mmp_nodename/bdevname have NUL
fallocate_finish() is missing cancellation check. Add it.
It's safe to do that, as only flags setup and sqe fields copy are done
before it gets into __io_fallocate().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If CONFIG_QFMT_V2 is not enabled, but CONFIG_QUOTA is enabled, when a
user tries to mount a file system with the quota or project quota
enabled, the kernel will emit a very confusing messsage:
EXT4-fs warning (device vdc): ext4_enable_quotas:5914: Failed to enable quota tracking (type=0, err=-3). Please run e2fsck to fix.
EXT4-fs (vdc): mount failed
We will now report an explanatory message indicating which kernel
configuration options have to be enabled, to avoid customer/sysadmin
confusion.
Link: https://lore.kernel.org/r/20200215012738.565735-1-tytso@mit.edu
Google-Bug-Id: 149093531
Fixes: 7c319d3285 ("ext4: make quota as first class supported feature")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Stable Bugfixes:
- Fix DMA scatter-gather list mapping imbalance
Other Fixes:
- Fix directory verifier races
- Fix races between open and dentry revalidation
- Fix revalidation of dentries with delegations
- Fix "cachethis" setting for writes
- Fix delegation and delegation cred pinning
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5HEa4ACgkQ18tUv7Cl
QOvwAA/9FAIwiE5UU+OrbpJ47IKFay2lfu6ywWVt0Q9i/tDeTRawa9aZDNKmKuYk
rViVzRhIrmdkhpSVeTl3BdV/U9oXmNQ9ubFxWkQ0TI1j/tkTotT4dA0/kjeampZJ
dZeW+NtRnFxYtfyV6fV/Bp/GB1RE9tAr1hTzHYEpNfnKxmAgkOeIXsVebuWKvUxz
XCeRED30Ekt2n1JQqb1m3VAMpYmprF2i9TErAVR8ILt0dkkJD8nNlPogrZwMjzuD
GtB7BSJ7rIDHQomuF3HVrz+SpMhM530mpsZiVGYCKjChJWaPm12+AZGPtv3UcpAf
/M2VTa+z8MJDwirzjWogo3OvYAqZGI6KUG0o2t5dKmguQoi1RrDlPbFMLXlxow2e
5gi/qKFbnBOt6iIV9FffMb50jpnShBE6hQS4/rb2inGmAKYgEXcL+ouBY6308soS
fnWJSViTQfGbF9uVxGeOYl766zEqlNleclfjKBCNXMC3HOfIslIBpilMDwCoOHeB
MqYSqmnwAH9ZrqjNmR3Qh2RlcOxN8CTUMhkUW2KLMAgnEAd9WINaYu4u2Y50Fe6j
yDa6YEAmaNSGVYVofh8qA1R1Zi4tnKlg1/WFdy/SnEQKi8mQqXASGiVu2bXjhlF1
22cHVXNIjp1PMijZKnJtimuxChygj9xxgzw4Kstd1WCLm2VirAc=
=Ase9
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client bugfixes from Anna Schumaker:
"The only stable fix this time is the DMA scatter-gather list bug fixed
by Chuck.
The rest fix up races and refcounting issues that have been found
during testing.
Stable fix:
- fix DMA scatter-gather list mapping imbalance
The rest:
- fix directory verifier races
- fix races between open and dentry revalidation
- fix revalidation of dentries with delegations
- fix "cachethis" setting for writes
- fix delegation and delegation cred pinning"
* tag 'nfs-for-5.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFSv4: Ensure the delegation cred is pinned when we call delegreturn
NFSv4: Ensure the delegation is pinned in nfs_do_return_delegation()
NFSv4.1 make cachethis=no for writes
xprtrdma: Fix DMA scatter-gather list mapping imbalance
NFSv4: Fix revalidation of dentries with delegations
NFSv4: Fix races between open and dentry revalidation
NFS: Fix up directory verifier races
of server path canonicalization patch that went into -rc1 and a fixup
for noacl mount option that got broken by the conversion to the new
mount API in 5.5.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl5GzToTHGlkcnlvbW92
QGdtYWlsLmNvbQAKCRBKf944AhHzi6tKB/9VZ5DIkd5m3UyKh+rgw565LddPiJTU
zIngCUV2EwyRyTwgOCMzG/AsCxcjQDS3RzEZsCCIpAsT09vAJxP0uFFm8o3LE0xX
NX8F/0cHc14lqpI39dzB3HiBETx3XIf3eNUpSJphtwKqmn3MOhAASPK9xDZdICNh
VziVMOogns4QEYTxY+aA5DoKmkOuuolCVKUZj7/fyFXX90NGhLVBsIfRIuIE2O0v
xbtitsGqqZtf+1Iar4gKRLIWCpwzETSBE9G+GJlxqUSDz2VTovDDhn6tBtu65m8d
APvN9+Le355ts9iFuep3SYkYQhNXRZ/ydxx3fOqTM60Tzi3cVTtO+Mk4
=YoNz
-----END PGP SIGNATURE-----
Merge tag 'ceph-for-5.6-rc2' of https://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
- make O_DIRECT | O_APPEND combination work better
- redo the server path canonicalization patch that went into -rc1
- fix the 'noacl' mount option that got broken by the conversion to the
new mount API in 5.5
* tag 'ceph-for-5.6-rc2' of https://github.com/ceph/ceph-client:
ceph: noacl mount option is effectively ignored
ceph: canonicalize server path in place
ceph: do not execute direct write in parallel if O_APPEND is specified
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl5GzcYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgptCNEAC7F8fAOTKDtYxl7vNgoSVd4lUJAllyaYFk
gFaxuh6fQXqXnqQhPQxJxHCTqvREqfdbGItLmWp/g3r9bEYZouKflPoVTSdgeTjz
47JoIXhdcv4A26lGO2N+tAuRDk68dcyG31tHqHNcrHhFL+1hcs8pMMf++KD8i/VF
kooo56E7Wfh4170q5Jr2pRpVv5O2G/WgkKqbmHRfOXLvvBvCNUDYGfK8Gu8glp7T
auMNdRy7+3uia5oBre5zz8oqeSXS8Th6BWUPTJYEE3kl4rnZMnVsRSfCXDszUchP
yl59tDM83WR4Jc/uq6baStL/IAjK6y76O3mKIZPdjoxhL3YIzPHOTxiVzynWsFDh
abbRqMdmst7c7qYIaWA0S6jZE7uN8eR9huuo+EwKdzXQeKJIYBLo447ysHmaglq3
amzEJUBrw/attgpqakQ1nL6aj5D+XhthuDCYKK5/447ygdNUQfES1ZobzHZVMjx1
74ljXJ24Y/QsFfqntf1szf8mh3lxD3KO85OGgxjYj/iv7mtBCPALTLn3xbwNzJ15
Mrb2xY7n6rxO6gCEmJEJAAs2vGQFewS8MaXQu/fOcUWujRP4a5xNvaL4YDZ6epXh
wXHPTrNDxmF948r78EYxF6N7BqR/uRv+razMYyZ/OWVi4CLogwAOoTtfaHhbA04H
Jh2GUgotdA==
=6MrA
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.6-2020-02-14' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Here's a set of fixes for io_uring:
- Various fixes with cleanups from Pavel, fixing corner cases where
we're not correctly dealing with iovec cleanup.
- Clarify that statx/openat/openat2 don't accept fixed files
- Buffered raw device write EOPTNOTSUPP fix
- Ensure async workers grab current->fs
- A few task exit fixes with pending requests that grab the file
table
- send/recvmsg async load fix
- io-wq offline node setup fix
- CQ overflow flush in poll"
* tag 'io_uring-5.6-2020-02-14' of git://git.kernel.dk/linux-block: (21 commits)
io_uring: prune request from overflow list on flush
io-wq: don't call kXalloc_node() with non-online node
io_uring: retain sockaddr_storage across send/recvmsg async punt
io_uring: cancel pending async work if task exits
io-wq: add io_wq_cancel_pid() to cancel based on a specific pid
io-wq: make io_wqe_cancel_work() take a match handler
io_uring: fix openat/statx's filename leak
io_uring: fix double prep iovec leak
io_uring: fix async close() with f_op->flush()
io_uring: allow AT_FDCWD for non-file openat/openat2/statx
io_uring: grab ->fs as part of async preparation
io-wq: add support for inheriting ->fs
io_uring: retry raw bdev writes if we hit -EOPNOTSUPP
io_uring: add cleanup for openat()/statx()
io_uring: fix iovec leaks
io_uring: remove unused struct io_async_open
io_uring: flush overflowed CQ events in the io_uring_poll()
io_uring: statx/openat/openat2 don't support fixed files
io_uring: fix deferred req iovec leak
io_uring: fix 1-bit bitfields to be unsigned
...
In crypt_scatterlist, if the crypt_stat argument is not set up
correctly, the kernel crashes. Instead, by returning an error code
upstream, the error is handled safely.
The issue is detected via a static analysis tool written by us.
Fixes: 237fead619 (ecryptfs: fs/Makefile and fs/Kconfig)
Signed-off-by: Aditya Pakki <pakki001@umn.edu>
Signed-off-by: Tyler Hicks <code@tyhicks.com>
Replace a recently deactived email address with one that I'll be able to
personally control and keep alive.
Signed-off-by: Tyler Hicks <code@tyhicks.com>
RHBZ: 1752437
Before we add a new EA we should check that this will not overflow
the maximum buffer we have available to read the EAs back.
Otherwise we can get into a situation where the EAs are so big that
we can not read them back to the client and thus we can not list EAs
anymore or delete them.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
It was originally enabled only for SMB3 or later dialects, but
had requests to add it to SMB2.1 mounts as well given the
large number of systems at that dialect level.
Signed-off-by: Steve French <stfrench@microsoft.com>
Reported-by: L Walsh <cifs@tlinx.org>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Carter reported an issue where he could produce a stall on ring exit,
when we're cleaning up requests that match the given file table. For
this particular test case, a combination of a few things caused the
issue:
- The cq ring was overflown
- The request being canceled was in the overflow list
The combination of the above means that the cq overflow list holds a
reference to the request. The request is canceled correctly, but since
the overflow list holds a reference to it, the final put won't happen.
Since the final put doesn't happen, the request remains in the inflight.
Hence we never finish the cancelation flush.
Fix this by removing requests from the overflow list if we're canceling
them.
Cc: stable@vger.kernel.org # 5.5
Reported-by: Carter Li 李通洲 <carter.li@eoitek.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ensure we don't release the delegation cred during the call to
nfs4_proc_delegreturn().
Fixes: ee05f45677 ("NFSv4: Fix races between open and delegreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
The call to nfs_do_return_delegation() needs to be taken without
any RCU locks. Add a refcount to make sure the delegation remains
pinned in memory until we're done.
Fixes: ee05f45677 ("NFSv4: Fix races between open and delegreturn")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Turning caching off for writes on the server should improve performance.
Fixes: fba83f3411 ("NFS: Pass "privileged" value to nfs4_init_sequence()")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Commit 904cdbd41d ("jbd2: clear dirty flag when revoking a buffer from
an older transaction") set the BH_Freed flag when forgetting a metadata
buffer which belongs to the committing transaction, it indicate the
committing process clear dirty bits when it is done with the buffer. But
it also clear the BH_Mapped flag at the same time, which may trigger
below NULL pointer oops when block_size < PAGE_SIZE.
rmdir 1 kjournald2 mkdir 2
jbd2_journal_commit_transaction
commit transaction N
jbd2_journal_forget
set_buffer_freed(bh1)
jbd2_journal_commit_transaction
commit transaction N+1
...
clear_buffer_mapped(bh1)
ext4_getblk(bh2 ummapped)
...
grow_dev_page
init_page_buffers
bh1->b_private=NULL
bh2->b_private=NULL
jbd2_journal_put_journal_head(jh1)
__journal_remove_journal_head(hb1)
jh1 is NULL and trigger oops
*) Dir entry block bh1 and bh2 belongs to one page, and the bh2 has
already been unmapped.
For the metadata buffer we forgetting, we should always keep the mapped
flag and clear the dirty flags is enough, so this patch pick out the
these buffers and keep their BH_Mapped flag.
Link: https://lore.kernel.org/r/20200213063821.30455-3-yi.zhang@huawei.com
Fixes: 904cdbd41d ("jbd2: clear dirty flag when revoking a buffer from an older transaction")
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
There is no need to delay the clearing of b_modified flag to the
transaction committing time when unmapping the journalled buffer, so
just move it to the journal_unmap_buffer().
Link: https://lore.kernel.org/r/20200213063821.30455-2-yi.zhang@huawei.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
DIR_INDEX has been introduced as a compat ext4 feature. That means that
even kernels / tools that don't understand the feature may modify the
filesystem. This works because for kernels not understanding indexed dir
format, internal htree nodes appear just as empty directory entries.
Index dir aware kernels then check the htree structure is still
consistent before using the data. This all worked reasonably well until
metadata checksums were introduced. The problem is that these
effectively made DIR_INDEX only ro-compatible because internal htree
nodes store checksums in a different place than normal directory blocks.
Thus any modification ignorant to DIR_INDEX (or just clearing
EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch
and trigger kernel errors. So we have to be more careful when dealing
with indexed directories on filesystems with checksumming enabled.
1) We just disallow loading any directory inodes with EXT4_INDEX_FL when
DIR_INDEX is not enabled. This is harsh but it should be very rare (it
means someone disabled DIR_INDEX on existing filesystem and didn't run
e2fsck), e2fsck can fix the problem, and we don't want to answer the
difficult question: "Should we rather corrupt the directory more or
should we ignore that DIR_INDEX feature is not set?"
2) When we find out htree structure is corrupted (but the filesystem and
the directory should in support htrees), we continue just ignoring htree
information for reading but we refuse to add new entries to the
directory to avoid corrupting it more.
Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz
Fixes: dbe8944404 ("ext4: Calculate and verify checksums for htree nodes")
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
A recent commit, 9803387c55 ("ext4: validate the
debug_want_extra_isize mount option at parse time"), moved mount-time
checks around. One of those changes moved the inode size check before
the blocksize variable was set to the blocksize of the file system.
After 9803387c55 was set to the minimum allowable blocksize, which
in practice on most systems would be 1024 bytes. This cuased file
systems with inode sizes larger than 1024 bytes to be rejected with a
message:
EXT4-fs (sdXX): unsupported inode size: 4096
Fixes: 9803387c55 ("ext4: validate the debug_want_extra_isize mount option at parse time")
Link: https://lore.kernel.org/r/20200206225252.GA3673@mit.edu
Reported-by: Herbert Poetzl <herbert@13thfloor.at>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Coverity reports that conditions checking quota limits in ext4_statfs()
contain dead code. Indeed it is right and current conditions can be
simplified.
Link: https://lore.kernel.org/r/20200130111148.10766-1-jack@suse.cz
Reported-by: Coverity <scan-admin@coverity.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
A number of the debug statements output file or directory mode
in hex. Change these to print using octal.
Signed-off-by: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
If a dentry was not initially looked up while we were holding a
delegation, then we do still need to revalidate that it still holds
the same name. If there are multiple hard links to the same file,
then all the hard links need validation.
Reported-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>
[Anna: Put nfs_unset_verifier_delegated() under CONFIG_NFS_V4]
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Originally it was planned to create device id directories under
UUID/devinfo, but it got under UUID/devices by mistake. We really want
it under definfo so the bare device node names are not mixed with device
ids and are easy to enumerate.
Fixes: 668e48af7a ("btrfs: sysfs, add devid/dev_state kobject and device attributes")
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Create directory /sys/fs/btrfs/UUID/devinfo to hold devices directories
by the id (unlike /devices).
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When there is a fiemap executing in parallel with a shrinking truncate
we can end up in a situation where we have extent maps for which we no
longer have corresponding file extent items. This is generally harmless
and at the moment the only consequences are missing file extent items
representing holes after we expand the file size again after the
truncate operation removed the prealloc extent items, and stale
information for future fiemap calls (reporting extents that no longer
exist or may have been reallocated to other files for example).
Consider the following example:
1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0
and a 1MiB prealloc extent at file offset 128KiB;
2) Task A starts doing a shrinking truncate of our inode to reduce it to
a size of 64KiB. Before it searches the subvolume tree for file
extent items to delete, it drops all the extent maps in the range
from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache();
3) Task B starts doing a fiemap against our inode. When looking up for
the inode's extent maps in the range from 128KiB to (u64)-1, it
doesn't find any in the inode's extent map tree, since they were
removed by task A. Because it didn't find any in the extent map
tree, it scans the inode's subvolume tree for file extent items, and
it finds the 1MiB prealloc extent at file offset 128KiB, then it
creates an extent map based on that file extent item and adds it to
inode's extent map tree (this ends up being done by
btrfs_get_extent() <- btrfs_get_extent_fiemap() <-
get_extent_skip_holes());
4) Task A then drops the prealloc extent at file offset 128KiB and
shrinks the 128KiB extent file offset 0 to a length of 64KiB. The
truncation operation finishes and we end up with an extent map
representing a 1MiB prealloc extent at file offset 128KiB, despite we
don't have any more that extent;
After this the two types of problems we have are:
1) Future calls to fiemap always report that a 1MiB prealloc extent
exists at file offset 128KiB. This is stale information, no longer
correct;
2) If the size of the file is increased, by a truncate operation that
increases the file size or by a write into a file offset > 64KiB for
example, we end up not inserting file extent items to represent holes
for any range between 128KiB and 128KiB + 1MiB, since the hole
expansion function, btrfs_cont_expand() will skip hole insertion for
any range for which an extent map exists that represents a prealloc
extent. This causes fsck to complain about missing file extent items
when not using the NO_HOLES feature.
The second issue could be often triggered by test case generic/561 from
fstests, which runs fsstress and duperemove in parallel, and duperemove
does frequent fiemap calls.
Essentially the problems happens because fiemap does not acquire the
inode's lock while truncate does, and fiemap locks the file range in the
inode's iotree while truncate does not. So fix the issue by making
btrfs_truncate_inode_items() lock the file range from the new file size
to (u64)-1, so that it serializes with fiemap.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A remount to a read-write filesystem is not safe when there's tree-log
to be replayed. Files that could be opened until now might be affected
by the changes in the tree-log.
A regular mount is needed to replay the log so the filesystem presents
the consistent view with the pending changes included.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no logged information about tree-log replay although this is
something that points to previous unclean unmount. Other filesystems
report that as well.
Suggested-by: Chris Murphy <lists@colorremedies.com>
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a few cases where we allow an extent map that is in an extent map
tree to be merged with other extents in the tree. Such cases include the
unpinning of an extent after the respective ordered extent completed or
after logging an extent during a fast fsync. This can lead to subtle and
dangerous problems because when doing the merge some other task might be
using the same extent map and as consequence see an inconsistent state of
the extent map - for example sees the new length but has seen the old start
offset.
With luck this triggers a BUG_ON(), and not some silent bug, such as the
following one in __do_readpage():
$ cat -n fs/btrfs/extent_io.c
3061 static int __do_readpage(struct extent_io_tree *tree,
3062 struct page *page,
(...)
3127 em = __get_extent_map(inode, page, pg_offset, cur,
3128 end - cur + 1, get_extent, em_cached);
3129 if (IS_ERR_OR_NULL(em)) {
3130 SetPageError(page);
3131 unlock_extent(tree, cur, end);
3132 break;
3133 }
3134 extent_offset = cur - em->start;
3135 BUG_ON(extent_map_end(em) <= cur);
(...)
Consider the following example scenario, where we end up hitting the
BUG_ON() in __do_readpage().
We have an inode with a size of 8KiB and 2 extent maps:
extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by
a previous transaction
extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet
persisted but writeback started for it already. The extent map
is pinned since there's writeback and an ordered extent in
progress, so it can not be merged with extent map A yet
The following sequence of steps leads to the BUG_ON():
1) The ordered extent for extent B completes, the respective page gets its
writeback bit cleared and the extent map is unpinned, at that point it
is not yet merged with extent map A because it's in the list of modified
extents;
2) Due to memory pressure, or some other reason, the MM subsystem releases
the page corresponding to extent B - btrfs_releasepage() is called and
returns 1, meaning the page can be released as it's not dirty, not under
writeback anymore and the extent range is not locked in the inode's
iotree. However the extent map is not released, either because we are
not in a context that allows memory allocations to block or because the
inode's size is smaller than 16MiB - in this case our inode has a size
of 8KiB;
3) Task B needs to read extent B and ends up __do_readpage() through the
btrfs_readpage() callback. At __do_readpage() it gets a reference to
extent map B;
4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B
while holding the write lock on the inode's extent map tree - this
results in try_merge_map() being called and since it's possible to merge
extent map B with extent map A now (the extent map B was removed from
the list of modified extents), the merging begins - it sets extent map
B's start offset to 0 (was 4KiB), but before it increments the map's
length to 8KiB (4kb + 4KiB), task A is at:
BUG_ON(extent_map_end(em) <= cur);
The call to extent_map_end() sees the extent map has a start of 0
and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so
the BUG_ON() is triggered.
So it's dangerous to modify an extent map that is in the tree, because some
other task might have got a reference to it before and still using it, and
needs to see a consistent map while using it. Generally this is very rare
since most paths that lookup and use extent maps also have the file range
locked in the inode's iotree. The fsync path is pretty much the only
exception where we don't do it to avoid serialization with concurrent
reads.
Fix this by not allowing an extent map do be merged if if it's being used
by tasks other then the one attempting to merge the extent map (when the
reference count of the extent map is greater than 2).
Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp>
Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_ref_tree_mod(), 'ref' and 'ra' are allocated through kzalloc() and
kmalloc(), respectively. In the following code, if an error occurs, the
execution will be redirected to 'out' or 'out_unlock' and the function will
be exited. However, on some of the paths, 'ref' and 'ra' are not
deallocated, leading to memory leaks. For example, if 'action' is
BTRFS_ADD_DELAYED_EXTENT, add_block_entry() will be invoked. If the return
value indicates an error, the execution will be redirected to 'out'. But,
'ref' is not deallocated on this path, causing a memory leak.
To fix the above issues, deallocate both 'ref' and 'ra' before exiting from
the function when an error is encountered.
CC: stable@vger.kernel.org # 4.15+
Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
- Fix RWF_NOWAIT writes to properly return -EAGAIN
- Clean up an unused helper
- Update dax_writeback_mapping_range to not need a block_device argument
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEf41QbsdZzFdA8EfZHtKRamZ9iAIFAl5DH00ACgkQHtKRamZ9
iAIPUA/+KHTrnE4Pb69d1r/wmdNOQE1vCpZxN606ozfzAMIfCgzw6i7V82xpFUyZ
VdQ1wcCSYLoPTQ+y/0Bk7SQlLjBx58c0ShaOTHmE2IRdYwzKMOBtwqy5vxGYWT/i
k2b44Xdwc4x8KMdajKCZlZOWNIM4BnlhE6nqnyL7Zv7J4BC71IKTCgh0WrriKvKe
t7IkYK6fLWx9Y64UhcIoXL9jDp1r5N/pJCjKaSfpS7gH+iqz/M3NaAwRfDr8UuIY
aHUUoPHP/cbwakQnYZtpxGaP1dzkKQ1FG3nL74Lp2XUOulivSbGa0fmw/7Vs18Of
M2d8/yKWMh0pElPdoh/2ORhHAcMOsIUCx3HRxMm1x6g293BkE96ESpbN/s62oo7H
uND3AOqnE79jxd6AqECHyYXGpxwlHah5HXZdCjU5b6rmNbz9YNpHGK8STwpa2ReL
AnYpWlDPjUkSMAD/rzwR7T6xh4TlzYQa2y6QR3HyffftPg6Dm0g4I8pBi0PVZYBB
4Whg8dLsiK73KZsjhraPaSFFDT42Btd6BHKLrMLckoVoIoyt3EB4FPwMQjCwXgqI
WySehmiMfEqXWUEKsYxBf+j0+ASC5ewIKuP1Ziqxa9kBel7964my7ariGn/mEAbo
yJ6Iyn+wEeLE5VhImZKxrBNqNPo6mripT6omJmgl2G2HZt+JfK4=
=ZA5Y
-----END PGP SIGNATURE-----
Merge tag 'dax-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull dax fixes from Dan Williams:
"A fix for an xfstest failure and some and an update that removes an
fsdax dependency on block devices.
Summary:
- Fix RWF_NOWAIT writes to properly return -EAGAIN
- Clean up an unused helper
- Update dax_writeback_mapping_range to not need a block_device
argument"
* tag 'dax-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: pass NOWAIT flag to iomap_apply
dax: Get rid of fs_dax_get_by_host() helper
dax: Pass dax_dev instead of bdev to dax_writeback_mapping_range()
For the old mount API, the module parameters parseing function will
be called in ceph_mount() and also just after the default posix acl
flag set, so we can control to enable/disable it via the mount option.
But for the new mount API, it will call the module parameters
parseing function before ceph_get_tree(), so the posix acl will always
be enabled.
Fixes: 82995cc6c5 ("libceph, rbd, ceph: convert to use the new mount API")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
syzbot reported that 4fbc0c711b ("ceph: remove the extra slashes in
the server path") had caused a regression where an allocation could be
done under a spinlock -- compare_mount_options() is called by sget_fc()
with sb_lock held.
We don't really need the supplied server path, so canonicalize it
in place and compare it directly. To make this work, the leading
slash is kept around and the logic in ceph_real_mount() to skip it
is restored. CEPH_MSG_CLIENT_SESSION now reports the same (i.e.
canonicalized) path, with the leading slash of course.
Fixes: 4fbc0c711b ("ceph: remove the extra slashes in the server path")
Reported-by: syzbot+98704a51af8e3d9425a9@syzkaller.appspotmail.com
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
In O_APPEND & O_DIRECT mode, the data from different writers will
be possibly overlapping each other since they take the shared lock.
For example, both Writer1 and Writer2 are in O_APPEND and O_DIRECT
mode:
Writer1 Writer2
shared_lock() shared_lock()
getattr(CAP_SIZE) getattr(CAP_SIZE)
iocb->ki_pos = EOF iocb->ki_pos = EOF
write(data1)
write(data2)
shared_unlock() shared_unlock()
The data2 will overlap the data1 from the same file offset, the
old EOF.
Switch to exclusive lock instead when O_APPEND is specified.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
No one checks the return value of debugfs_create_regset32(), as it's not
needed, so make the return value void, so that no one tries to do so in
the future.
Link: https://lore.kernel.org/r/20191122104453.GA2017837@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We want to make sure that we revalidate the dentry if and only if
we've done an OPEN by filename.
In order to avoid races with remote changes to the directory on the
server, we want to save the verifier before calling OPEN. The exception
is if the server returned a delegation with our OPEN, as we then
know that the filename can't have changed on the server.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@gmail.com>
Tested-by: Benjamin Coddington <bcodding@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
In order to avoid having our dentry revalidation race with an update
of the directory on the server, we need to store the verifier before
the RPC calls to LOOKUP and READDIR.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Benjamin Coddington <bcodding@gmail.com>
Tested-by: Benjamin Coddington <bcodding@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Fix display for sec=krb5i which was wrongly interleaved by cruid,
resulting in string "sec=krb5,cruid=<...>i" instead of
"sec=krb5i,cruid=<...>".
Fixes: 96281b9e46 ("smb3: for kerberos mounts display the credential uid used")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
- fix randconfig to generate a sane .config
- rename hostprogs-y / always to hostprogs / always-y, which are
more natual syntax.
- optimize scripts/kallsyms
- fix yes2modconfig and mod2yesconfig
- make multiple directory targets ('make foo/ bar/') work
-----BEGIN PGP SIGNATURE-----
iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl47NfMVHG1hc2FoaXJv
eUBrZXJuZWwub3JnAAoJED2LAQed4NsGRGwP/3AHO8P0wGEeFKs3ziSMjs2W7/Pj
lN08Kuxm0u3LnyEEcHVUveoi+xBYqvrw0RsGgYf5S8q0Mpep7MPqbfkDUxV/0Zkj
QP2CsvOTbjdBjH7q3ojkwLcDl0Pxu9mg3eZMRXZ2WQeNXuMRw6Bicoh7ElvB1Bv/
HC+j30i2Me3cf/riQGSAsstvlXyIR8RaerR8PfRGESTysiiN76+JcHTatJHhOJL9
O6XKkzo8/CXMYKKVF4Ae4NP+WFg6E96/pAPx0Rf47RbPX9UG35L9rkzTDnk70Ms6
OhKiu3hXsRX7mkqApuoTqjge4+iiQcKZxYmMXU1vGlIRzjwg19/4YFP6pDSCcnIu
kKb8KN4o4N41N7MFS3OLZWwISA8Vw6RbtwDZ3AghDWb7EHb9oNW42mGfcAPr1+wZ
/KH6RHTzaz+5q2MgyMY1NhADFrhIT9CvDM+UJECgbokblnw7PHAnPmbsuVak9ZOH
u9ojO1HpTTuIYO6N6v4K5zQBZF1N+RvkmBnhHd8j6SksppsCoC/G62QxgXhF2YK3
FQMpATCpuyengLxWAmPEjsyyPOlrrdu9UxqNsXVy5ol40+7zpxuHwKcQKCa9urJR
rcpbIwLaBcLhHU4BmvBxUk5aZxxGV2F0O0gXTOAbT2xhd6BipZSMhUmN49SErhQm
NC/coUmQX7McxMXh
=sv4U
-----END PGP SIGNATURE-----
Merge tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- fix randconfig to generate a sane .config
- rename hostprogs-y / always to hostprogs / always-y, which are more
natual syntax.
- optimize scripts/kallsyms
- fix yes2modconfig and mod2yesconfig
- make multiple directory targets ('make foo/ bar/') work
* tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kbuild: make multiple directory targets work
kconfig: Invalidate all symbols after changing to y or m.
kallsyms: fix type of kallsyms_token_table[]
scripts/kallsyms: change table to store (strcut sym_entry *)
scripts/kallsyms: rename local variables in read_symbol()
kbuild: rename hostprogs-y/always to hostprogs/always-y
kbuild: fix the document to use extra-y for vmlinux.lds
kconfig: fix broken dependency in randconfig-generated .config
Zonefs is a very simple file system exposing each zone of a zoned block
device as a file.
Unlike a regular file system with native zoned block device support
(e.g. f2fs or the on-going btrfs effort), zonefs does not hide the
sequential write constraint of zoned block devices to the user. As a
result, zonefs is not a POSIX compliant file system. Its goal is to
simplify the implementation of zoned block devices support in
applications by replacing raw block device file accesses with a richer
file based API, avoiding relying on direct block device file ioctls
which may be more obscure to developers.
One example of this approach is the implementation of LSM
(log-structured merge) tree structures (such as used in RocksDB and
LevelDB) on zoned block devices by allowing SSTables to be stored in a
zone file similarly to a regular file system rather than as a range of
sectors of a zoned device. The introduction of the higher level
construct "one file is one zone" can help reducing the amount of changes
needed in the application while at the same time allowing the use of
zoned block devices with various programming languages other than C.
Zonefs IO management implementation uses the new iomap generic code.
Zonefs has been successfully tested using a functional test suite
(available with zonefs userland format tool on github) and a prototype
implementation of LevelDB on top of zonefs.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCXj1y8QAKCRDdoc3SxdoY
dqozAP9J3t+Q95BgKgI5jP+XEtyYsPBTaVrvaSaViEnwtJLVoQD/ZQ1lTCZSE9OI
UkvWawkuFtLGfOxTqyA3eZrZi22Ttwk=
=YVvO
-----END PGP SIGNATURE-----
Merge tag 'zonefs-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull new zonefs file system from Damien Le Moal:
"Zonefs is a very simple file system exposing each zone of a zoned
block device as a file.
Unlike a regular file system with native zoned block device support
(e.g. f2fs or the on-going btrfs effort), zonefs does not hide the
sequential write constraint of zoned block devices to the user. As a
result, zonefs is not a POSIX compliant file system. Its goal is to
simplify the implementation of zoned block devices support in
applications by replacing raw block device file accesses with a richer
file based API, avoiding relying on direct block device file ioctls
which may be more obscure to developers.
One example of this approach is the implementation of LSM
(log-structured merge) tree structures (such as used in RocksDB and
LevelDB) on zoned block devices by allowing SSTables to be stored in a
zone file similarly to a regular file system rather than as a range of
sectors of a zoned device. The introduction of the higher level
construct "one file is one zone" can help reducing the amount of
changes needed in the application while at the same time allowing the
use of zoned block devices with various programming languages other
than C.
Zonefs IO management implementation uses the new iomap generic code.
Zonefs has been successfully tested using a functional test suite
(available with zonefs userland format tool on github) and a prototype
implementation of LevelDB on top of zonefs"
* tag 'zonefs-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Add documentation
fs: New zonefs file system
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAl49bNsACgkQiiy9cAdy
T1EGlQwArDJiHUV7W/WaoDZnusPPQqUT3ayqAHL0P8cDsjxLu3uNMkUISr0HdbxC
kqYahSTb+/BKQzoZhVe5wK3S8W6R8+wyaPJExRCL3brlIHVP/eC9uUjSgkT6QVDl
/vZCwxj7KmTK/S+ofji/XTl2f8f8BCw2biGVxwR2Jj5pwKI4wFIMFm7mDetTQRD4
bK0UR2Owiw4DpPXdwHlXPf9N06z0ETa1UdMXklIBgeK9B1eT1STD9q/iHJh3bLpO
klhbiq5eGRCcs9cBVTQcn6U+zGYBOcdJuhPGbAObEU+R2vNX06clydKlKy1oz1VL
4jbVVn9xuGZ9evFBC3h7Na1X7C3V28WcpfeRfFxZ157hNuQSNo5wiq0rF66EQ14U
hbmlx2S2ooyNKcnrj46SUw9zVLZ0xcx1Mw7kmoyHgI/vznW9fvV0Y2JXawJMPei5
VuQTgDLFsvnIIrUnrGBu2UXMzXghxLZ3SXJVKXuW3luvNRk82RAGHmIdty3OTgPp
DN9lhGvv
=F1qf
-----END PGP SIGNATURE-----
Merge tag '5.6-rc-smb3-plugfest-patches' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"13 cifs/smb3 patches, most from testing at the SMB3 plugfest this week:
- Important fix for multichannel and for modefromsid mounts.
- Two reconnect fixes
- Addition of SMB3 change notify support
- Backup tools fix
- A few additional minor debug improvements (tracepoints and
additional logging found useful during testing this week)"
* tag '5.6-rc-smb3-plugfest-patches' of git://git.samba.org/sfrench/cifs-2.6:
smb3: Add defines for new information level, FileIdInformation
smb3: print warning once if posix context returned on open
smb3: add one more dynamic tracepoint missing from strict fsync path
cifs: fix mode bits from dir listing when mounted with modefromsid
cifs: fix channel signing
cifs: add SMB3 change notification support
cifs: make multichannel warning more visible
cifs: fix soft mounts hanging in the reconnect code
cifs: Add tracepoints for errors on flush or fsync
cifs: log warning message (once) if out of disk space
cifs: fail i/o on soft mounts if sessionsetup errors out
smb3: fix problem with null cifs super block with previous patch
SMB3: Backup intent flag missing from some more ops
Pull vboxfs from Al Viro:
"This is the VirtualBox guest shared folder support by Hans de Goede,
with fixups for fs_parse folded in to avoid bisection hazards from
those API changes..."
* 'work.vboxsf' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: Add VirtualBox guest shared folder (vboxsf) support
Jonas reports that he sometimes sees -97/-22 error returns from
sendmsg, if it gets punted async. This is due to not retaining the
sockaddr_storage between calls. Include that in the state we copy when
going async.
Cc: stable@vger.kernel.org # 5.3+
Reported-by: Jonas Bonn <jonas@norrbonn.se>
Tested-by: Jonas Bonn <jonas@norrbonn.se>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Normally we cancel all work we track, but for untracked work we could
leave the async worker behind until that work completes. This is totally
fine, but does leave resources pending after the task is gone until that
work completes.
Cancel work that this task queued up when it goes away.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add a helper that allows the caller to cancel work based on what mm
it belongs to. This allows io_uring to cancel work from a given
task or thread when it exits.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We want to use the cancel functionality for canceling based on not
just the work itself. Instead of matching on the work address
manually, allow a match handler to tell us if we found the right work
item or not.
No functional changes in this patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
VirtualBox hosts can share folders with guests, this commit adds a
VFS driver implementing the Linux-guest side of this, allowing folders
exported by the host to be mounted under Linux.
This driver depends on the guest <-> host IPC functions exported by
the vboxguest driver.
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
One patch in the compat-ioctl series broke 32-bit rootfs for multiple
people testing on 64-bit kernels. Let's fix it in -rc1 before others
run into the same issue.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJePyKSAAoJEGCrR//JCVInFMEP/0fH0rI/M2xL4neSjp50qV7Q
M86gLTmatgNFLZGdC+xbaImaxpZHEMIF93JgOzN+0IRfdc7MbD23JEm5yzCZEc2I
NCAsmjcoA+mq8ntOl8o2J1pgLpVOe4BoBRvAGYFIM2kS63JXloCi9mD/3svDJF8C
WVPOskzoNT1pO9mRUKsAE740qdI86US/ksvrGOQQHVFVm3Iwm3srot4OBQYRaErw
bVEl8vqSOEKtXQwk3r7QcW9Bi83WZuqTbkLxqTtWm/U/8JreA82qH1/6zDbKj7d5
IH/J8D4FJ2LLRP5860bLj0vGl5mKBEJaou2S8ak7sRNu4K8DZJlTYu7YNMTgocQE
kbNylNqftKastVRGS1EPmOPdam8PiPlu3sxOY1X1TKO5+4jtb1WcqtFqmUkM5/Jk
C4iMDXNHXmHcC+ORrjchpIZs53nkfSAdeUzQ7xxwZlfaHD3DXnsFi5DHEZ9PWf9a
UlCOaWFLXRqdso30iobaaLa6JKa50Znjiqfsh6mbUuU9tmDnTffaHMy9eFhfMjqH
eP3acYXkAyK2toHuE6l5qaM6mOlYfrooCw/75CfQB0CR3e5tmXXLdFA319/OHRwX
m+tFby6Fw4KGGZWzXls54rZNa3ZmODj6N3Ymx1lKZfBVwblLNGHJYYVPP1ZUBCTS
43oommoGHO6/roOjLl3j
=uMS2
-----END PGP SIGNATURE-----
Merge tag 'compat-ioctl-fix' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground
Pull compat-ioctl fix from Arnd Bergmann:
"One patch in the compat-ioctl series broke 32-bit rootfs for multiple
people testing on 64-bit kernels. Let's fix it in -rc1 before others
run into the same issue"
* tag 'compat-ioctl-fix' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
compat_ioctl: fix FIONREAD on devices
Pull vfs file system parameter updates from Al Viro:
"Saner fs_parser.c guts and data structures. The system-wide registry
of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
the horror switch() in fs_parse() that would have to grow another case
every time something got added to that system-wide registry.
New syntax types can be added by filesystems easily now, and their
namespace is that of functions - not of system-wide enum members. IOW,
they can be shared or kept private and if some turn out to be widely
useful, we can make them common library helpers, etc., without having
to do anything whatsoever to fs_parse() itself.
And we already get that kind of requests - the thing that finally
pushed me into doing that was "oh, and let's add one for timeouts -
things like 15s or 2h". If some filesystem really wants that, let them
do it. Without somebody having to play gatekeeper for the variants
blessed by direct support in fs_parse(), TYVM.
Quite a bit of boilerplate is gone. And IMO the data structures make a
lot more sense now. -200LoC, while we are at it"
* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
tmpfs: switch to use of invalfc()
cgroup1: switch to use of errorfc() et.al.
procfs: switch to use of invalfc()
hugetlbfs: switch to use of invalfc()
cramfs: switch to use of errofc() et.al.
gfs2: switch to use of errorfc() et.al.
fuse: switch to use errorfc() et.al.
ceph: use errorfc() and friends instead of spelling the prefix out
prefix-handling analogues of errorf() and friends
turn fs_param_is_... into functions
fs_parse: handle optional arguments sanely
fs_parse: fold fs_parameter_desc/fs_parameter_spec
fs_parser: remove fs_parameter_description name field
add prefix to fs_context->log
ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
new primitive: __fs_parse()
switch rbd and libceph to p_log-based primitives
struct p_log, variants of warnf() et.al. taking that one instead
teach logfc() to handle prefices, give it saner calling conventions
get rid of cg_invalf()
...
Pull misc vfs updates from Al Viro:
- bmap series from cmaiolino
- getting rid of convolutions in copy_mount_options() (use a couple of
copy_from_user() instead of the __get_user() crap)
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
saner copy_mount_options()
fibmap: Reject negative block numbers
fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
ecryptfs: drop direct calls to ->bmap
cachefiles: drop direct usage of ->bmap method.
fs: Enable bmap() function to properly return errors
As in the previous patch, make openat*_prep() and statx_prep() handle
double preparation to avoid resource leakage.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Requests may be prepared multiple times with ->io allocated (i.e. async
prepared). Preparation functions don't handle it and forget about
previously allocated resources. This may happen in case of:
- spurious defer_check
- non-head (i.e. async prepared) request executed in sync (via nxt).
Make the handlers check, whether they already allocated resources, which
is true IFF REQ_F_NEED_CLEANUP is set.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
First, io_close() misses filp_close() and io_cqring_add_event(), when
f_op->flush is defined. That's because in this case it will
io_queue_async_work() itself not grabbing files, so the corresponding
chunk in io_close_finish() won't be executed.
Second, when submitted through io_wq_submit_work(), it will do
filp_close() and *_add_event() twice: first inline in io_close(),
and the second one in call to io_close_finish() from io_close().
The second one will also fire, because it was submitted async through
generic path, and so have grabbed files.
And the last nice thing is to remove this weird pilgrimage with checking
work/old_work and casting it to nxt. Just use a helper instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This passes it in to io-wq, so it assumes the right fs_struct when
executing async work that may need to do lookups.
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some work items need this for relative path lookup, make it available
like the other inherited credentials/mm/etc.
Cc: stable@vger.kernel.org # 5.3+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
For non-blocking issue, we set IOCB_NOWAIT in the kiocb. However, on a
raw block device, this yields an -EOPNOTSUPP return, as non-blocking
writes aren't supported. Turn this -EOPNOTSUPP into -EAGAIN, so we retry
from blocking context with IOCB_NOWAIT cleared.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Jens Axboe <axboe@kernel.dk>
openat() and statx() may have allocated ->open.filename, which should be
be put. Add cleanup handlers for them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allocated iovec is freed only in io_{read,write,send,recv)(), and just
leaves it if an error occured. There are plenty of such cases:
- cancellation of non-head requests
- fail grabbing files in __io_queue_sqe()
- set REQ_F_NOWAIT and returning in __io_queue_sqe()
Add REQ_F_NEED_CLEANUP, which will force such requests with custom
allocated resourses go through cleanup handlers on put.
Cc: stable@vger.kernel.org # 5.5
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In io_uring_poll() we must flush overflowed CQ events before to
check if there are CQ events available, to avoid missing events.
We call the io_cqring_events() that checks and flushes any overflow
and returns the number of CQ events available.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All of these opcodes take a directory file descriptor. We can't easily
support fixed files for these operations, and the use case for that
probably isn't all that clear (or sensible) anyway.
Disable IOSQE_FIXED_FILE for these operations.
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Merge thundering herd avoidance on pipe IO.
This would have been applied for 5.5 already, but got delayed because of
a user-space race condition in the GNU make jobserver code. Now that
there's a new GNU make 4.3 release, and most distributions seem to have
at least applied the (almost three year old) fix for the problem, let's
see if people notice.
And it might have been just bad random timing luck on my machine.
If you do hit the race condition, things will still work, but the
symptom is that you don't get nearly the expected parallelism when using
"make -j<N>".
The jobserver bug can definitely happen without this patch too, but
seems to be easier to trigger when we no longer wake up pipe waiters
unnecessarily.
* pipe-exclusive-wakeup:
pipe: use exclusive waits when reading or writing
This makes the pipe code use separate wait-queues and exclusive waiting
for readers and writers, avoiding a nasty thundering herd problem when
there are lots of readers waiting for data on a pipe (or, less commonly,
lots of writers waiting for a pipe to have space).
While this isn't a common occurrence in the traditional "use a pipe as a
data transport" case, where you typically only have a single reader and
a single writer process, there is one common special case: using a pipe
as a source of "locking tokens" rather than for data communication.
In particular, the GNU make jobserver code ends up using a pipe as a way
to limit parallelism, where each job consumes a token by reading a byte
from the jobserver pipe, and releases the token by writing a byte back
to the pipe.
This pattern is fairly traditional on Unix, and works very well, but
will waste a lot of time waking up a lot of processes when only a single
reader needs to be woken up when a writer releases a new token.
A simplified test-case of just this pipe interaction is to create 64
processes, and then pass a single token around between them (this
test-case also intentionally passes another token that gets ignored to
test the "wake up next" logic too, in case anybody wonders about it):
#include <unistd.h>
int main(int argc, char **argv)
{
int fd[2], counters[2];
pipe(fd);
counters[0] = 0;
counters[1] = -1;
write(fd[1], counters, sizeof(counters));
/* 64 processes */
fork(); fork(); fork(); fork(); fork(); fork();
do {
int i;
read(fd[0], &i, sizeof(i));
if (i < 0)
continue;
counters[0] = i+1;
write(fd[1], counters, (1+(i & 1)) *sizeof(int));
} while (counters[0] < 1000000);
return 0;
}
and in a perfect world, passing that token around should only cause one
context switch per transfer, when the writer of a token causes a
directed wakeup of just a single reader.
But with the "writer wakes all readers" model we traditionally had, on
my test box the above case causes more than an order of magnitude more
scheduling: instead of the expected ~1M context switches, "perf stat"
shows
231,852.37 msec task-clock # 15.857 CPUs utilized
11,250,961 context-switches # 0.049 M/sec
616,304 cpu-migrations # 0.003 M/sec
1,648 page-faults # 0.007 K/sec
1,097,903,998,514 cycles # 4.735 GHz
120,781,778,352 instructions # 0.11 insn per cycle
27,997,056,043 branches # 120.754 M/sec
283,581,233 branch-misses # 1.01% of all branches
14.621273891 seconds time elapsed
0.018243000 seconds user
3.611468000 seconds sys
before this commit.
After this commit, I get
5,229.55 msec task-clock # 3.072 CPUs utilized
1,212,233 context-switches # 0.232 M/sec
103,951 cpu-migrations # 0.020 M/sec
1,328 page-faults # 0.254 K/sec
21,307,456,166 cycles # 4.074 GHz
12,947,819,999 instructions # 0.61 insn per cycle
2,881,985,678 branches # 551.096 M/sec
64,267,015 branch-misses # 2.23% of all branches
1.702148350 seconds time elapsed
0.004868000 seconds user
0.110786000 seconds sys
instead. Much better.
[ Note! This kernel improvement seems to be very good at triggering a
race condition in the make jobserver (in GNU make 4.2.1) for me. It's
a long known bug that was fixed back in June 2017 by GNU make commit
b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to
avoid hangs.").
But there wasn't a new release of GNU make until 4.3 on Jan 19 2020,
so a number of distributions may still have the buggy version. Some
have backported the fix to their 4.2.1 release, though, and even
without the fix it's quite timing-dependent whether the bug actually
is hit. ]
Josh Triplett says:
"I've been hammering on your pipe fix patch (switching to exclusive
wait queues) for a month or so, on several different systems, and I've
run into no issues with it. The patch *substantially* improves
parallel build times on large (~100 CPU) systems, both with parallel
make and with other things that use make's pipe-based jobserver.
All current distributions (including stable and long-term stable
distributions) have versions of GNU make that no longer have the
jobserver bug"
Tested-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
My final cleanup patch for sys_compat_ioctl() introduced a regression on
the FIONREAD ioctl command, which is used for both regular and special
files, but only works on regular files after my patch, as I had missed
the warning that Al Viro put into a comment right above it.
Change it back so it can work on any file again by moving the implementation
to do_vfs_ioctl() instead.
Fixes: 77b9040195 ("compat_ioctl: simplify the implementation")
Reported-and-tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Reported-and-tested-by: youling257 <youling257@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCXj182AAKCRDh3BK/laaZ
PIiWAQCprdMIBe0u9Rd9cqQYXClOI7PI9oIcpLmkIlHDuUWDgQD/Y4c1UMsN8yQY
d8cYZXMivKKhyY2nRitR1mbv0RPoGwE=
=8hFo
-----END PGP SIGNATURE-----
Merge tag 'fuse-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fixes from Miklos Szeredi:
- Fix a regression introduced in v5.1 that triggers WARNINGs for some
fuse filesystems
- Fix an xfstest failure
- Allow overlayfs to be used on top of fuse/virtiofs
- Code and documentation cleanups
* tag 'fuse-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: use true,false for bool variable
Documentation: filesystems: convert fuse to RST
fuse: Support RENAME_WHITEOUT flag
fuse: don't overflow LLONG_MAX with end offset
fix up iter on short count in fuse_direct_io()
- Fix a bug in Abhi Das's journal head lookup improvements that can cause a
valid journal to be rejected.
- Fix an O_SYNC write handling bug reported by Christoph Hellwig.
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAl49TKgUHGFncnVlbmJh
QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTqG4RAAmXJfWI7UskM3ZRSVivbBwc0R3P9h
PgdgeXlr/k4mJ2I421/btln614l4nD+JuG/FplOvj2m3YCzeKj2W+OiSxqahmwGm
7yGb0sFHjx6wZcID37mig5aqo2MJYI0xTAR6rSI2tAA7E/B7KX8SfFIS2EWB2vLz
+rzCawqUGWKCiS/tZTe6JlB0Aeg20oy0Y00p3gewMN0ILNqy0w9kA+LVOGQYyUmW
rrad638czIXi7kugWvNo82vcoU140m3A6OTaINf3EaT8AmOtw+e4qyej+f82BiVS
RIXWKI+uRSfKFE+aYkwTxQn0BCMor63QIs3aaDXyLZBqnhTywRsckK6O7iBLFVDb
NQc86wxiHzDoWubXstV0lrlel5m5dHT0dUq8mVogj4kOtwOvejiDRopqtdGkdxA5
j8zFV6O8BRPUw5g7MS37n9myuTNRj5q3L2vmN2xZ6fZmHykorwbwLbw0WMeAQ/4i
pQh4NGE6h/lSPZIa/ZVCH8hwj2b41ZtAQK3k8G5IhNUhMHNkU8iynTbddoHFPsOF
67hj/aAycDx2NA6j/suS8PGt7PPU+VwwoDLKIG64bTHFTpwnRDNzZDt9wPIB9sGu
P4QlSQYPrLf/u8+TD8yv2VJsyAEMFl1UQWbxqf6EzLtKjRM/TIoQsdRUF7upqUnN
k5CM8FgmfYpIiFQ=
=+6PA
-----END PGP SIGNATURE-----
Merge tag 'gfs2-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 fixes from Andreas Gruenbacher:
- Fix a bug in Abhi Das's journal head lookup improvements that can
cause a valid journal to be rejected.
- Fix an O_SYNC write handling bug reported by Christoph Hellwig.
* tag 'gfs2-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: fix O_SYNC write handling
gfs2: move setting current->backing_dev_info
gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0
Vasliy Averin noticed that "if seq_file .next function does not change
position index, read after some lseek can generate unexpected output."
and sent in this fix.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJePJJTAAoJEM9EDqnrzg2+S8YQAIYo2nEphfAjWufSYdPwVY0f
OQ3xAwX69gpzlQPwNTcTmoxQ/A/E8r2lwOmL5H7wYPlqUHtDjiS5EAhnCvtCIRS1
mHR35qsdlW0o4yKWQtYnfcuY6p3ucSASBaKYGO5NrKuVdsqASJT/aFTixVFmO8V2
eo8VZlhwYARTgGWJQpvIvSbQJRi6BDnE3kWimuRdq5TMnH0piMm6Vpef2P5vTaE3
L5D63iCuJbg7nBR4KS2Mc77Ou4lj3RCQGBQBz1hRGs0zJbqNrlExGaCE+FkabJ92
WWITYrh4DaA+wAhmkMbxh8g7DThZFkkK+fsNOZIlSVry3wFuZ1E+6UFUHVICrgxG
y99wGKV816SfbKD0DZ9cer6slMnNXRRNWzbXqOzey85UmNQSuCNZrVokM8TE1Pve
q7OOxoqg6N/S46gc+3nrv0kNZ9eg1rh4jyvCdtL4we1AgPPb6O+ihuxlrj9oFPlW
zo9wHn7Zt1xu+jtQJxR5tChpD5FmO1im0LOYeaJHv76zhOakRTcQ19OnQMUlQ/gk
ez+5T5LxfmYUHew6kTnGvvp76zZF5QkylsUVOLPke8KChXvLP06ib9Nuya7lBX7+
WtvLKVRNYMmwose1HsYjQo2DgD0p2fdJhEe61Q5VDFCBI8X8wkv7fV+Hm2ud3/hz
b9RfP2CnwJbCKzHTolKE
=tjs3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-5.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs fix from Mike Marshall:
"Debugfs fix for orangefs.
Vasliy Averin noticed that 'if seq_file .next function does not change
position index, read after some lseek can generate unexpected output'
and sent in this fix"
* tag 'for-linus-5.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
help_next should increase position index
- Server-to-server copy code from Olga. To use it, client and
both servers must have support, the target server must be able
to access the source server over NFSv4.2, and the target
server must have the inter_copy_offload_enable module
parameter set.
- Improvements and bugfixes for the new filehandle cache,
especially in the container case, from Trond
- Also from Trond, better reporting of write errors.
- Y2038 work from Arnd.
-----BEGIN PGP SIGNATURE-----
iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl490mAVHGJmaWVsZHNA
ZmllbGRzZXMub3JnAAoJECebzXlCjuG+HkkP/33CsYXp0wvfNrfxCY3zHRxHpfw+
T9Ownxxw0RAJc/dRluC/2PIKJ20uVqtLrplU63bMBqJn84WF7OALq9twZ79a3fVF
mvdmnZbNq9B3ncKJlT7akkEelyJCRap7NgG/oTyubE8MlPl6gKpD8c+G7XdW/uN+
r0fprQz4rW4CYCBGSHq7HusEKqY4Gw+gbyAfJ6A79TMjF1ei51PG+9c8rkIsI5CO
1TQ3gY1gSJmGf2DoF86Q9WTVb+DvRTEs+t7QkxY/Vlo+QXY8CZyu+qSxN7i/F20m
gv2GrSpQMS9DEK/ZaG6cxaH+sM18Db4KLvcl3koL6lONHDR2OafSdKLyy0I60jhO
WfDSHhfDCrAdASTjNlTPrjBrdK3gafiaJVL9vy901ZJjPaNb3EH0nMQ5bEvOBECq
TCqPcQUcbku+qUVIcFwzSK1hXQFQHNh8WIuqXvNviZIzFDoipwsHVnQK02Owj89L
R2tbZue1O8voacg/9xw3tWAT7pI+SaBb0EvJuqRxBshiZEU8kKKtMchOwSECRDcu
k4lcqC5EFW7e4EzGlr6Wx8sI5lwCapva8ccjmPXX+R/vyM81oxWGB84GqWjjwubH
3Fcok23F9rW2IQJkqgPlNj/9hAjTn2+vM13UbfMlnchGNsQ2gbkc5CDGC/J6Wwpo
tHVristV9Gu5bJym
=FxLY
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux
Pull nfsd updates from Bruce Fields:
"Highlights:
- Server-to-server copy code from Olga.
To use it, client and both servers must have support, the target
server must be able to access the source server over NFSv4.2, and
the target server must have the inter_copy_offload_enable module
parameter set.
- Improvements and bugfixes for the new filehandle cache, especially
in the container case, from Trond
- Also from Trond, better reporting of write errors.
- Y2038 work from Arnd"
* tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux: (55 commits)
sunrpc: expiry_time should be seconds not timeval
nfsd: make nfsd_filecache_wq variable static
nfsd4: fix double free in nfsd4_do_async_copy()
nfsd: convert file cache to use over/underflow safe refcount
nfsd: Define the file access mode enum for tracing
nfsd: Fix a perf warning
nfsd: Ensure sampling of the write verifier is atomic with the write
nfsd: Ensure sampling of the commit verifier is atomic with the commit
sunrpc: clean up cache entry add/remove from hashtable
sunrpc: Fix potential leaks in sunrpc_cache_unhash()
nfsd: Ensure exclusion between CLONE and WRITE errors
nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()
nfsd: Update the boot verifier on stable writes too.
nfsd: Fix stable writes
nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argument
nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create()
nfsd: Reduce the number of calls to nfsd_file_gc()
nfsd: Schedule the laundrette regularly irrespective of file errors
nfsd: Remove unused constant NFSD_FILE_LRU_RESCAN
nfsd: Containerise filecache laundrette
...
Stable bugfixes:
- Fix memory leaks and corruption in readdir # v2.6.37+
- Directory page cache needs to be locked when read # v2.6.37+
New features:
- Convert NFS to use the new mount API
- Add "softreval" mount option to let clients use cache if server goes down
- Add a config option to compile without UDP support
- Limit the number of inactive delegations the client can cache at once
- Improved readdir concurrency using iterate_shared()
Other bugfixes and cleanups:
- More 64-bit time conversions
- Add additional diagnostic tracepoints
- Check for holes in swapfiles, and add dependency on CONFIG_SWAP
- Various xprtrdma cleanups to prepare for 5.7's changes
- Several fixes for NFS writeback and commit handling
- Fix acls over krb5i/krb5p mounts
- Recover from premature loss of openstateids
- Fix NFS v3 chacl and chmod bug
- Compare creds using cred_fscmp()
- Use kmemdup_nul() in more places
- Optimize readdir cache page invalidation
- Lease renewal and recovery fixes
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl48kMUACgkQ18tUv7Cl
QOs/bA/+KAHaee+1jWdgRS88CnNDfeokU2sGWuyXWrVTmiKZ+IjnIUIWqmeKhVyg
RTbaG4PGTIwiLDFibgzdnc3cTOQEgLnVGWWZ50Xh3b7ubock7+/4JHxqZS+/f3vf
yqwM0dZaXi5Kcx1kEJ+niBxuzkc9mFI+nHh+wLIlin/kaaUdLKu7mP3NXj2cmWxN
NoRaKc2gEvkPHhPSH4Z1DVXTHxvH2REFvt9APPUgfLfqcUVHV9b7V/wI/roiGWMn
53h6f38IdqoNQIpzMog/k/va67NLmEvUZOlpCYPyanPOjuxTrmi8iC2S6gLEOjtc
GGnQnc5skVL31seFR1NbOJiiN3hTLTncnoXza0cKtYxmo7a/FjXApw4jCu3Rkrav
UXpCI4O6+2AVVG+pEPbjQy3/GEImeoGvp+xr57jBSZBHoDZU9LDwag65qvZ1btIq
KOBx2gweQz0aB2heXmfee7qzxFdftHmtMWhIMnJASKNuAWGL23Scqem+d97i2T6H
7y9OJ3aOXiYxFMLYJCsLWjUJxYiaIANNBmHMjf27mZzcdDuxGFms277CMpNPr3SU
WZk6/oKw9jaRSzHzaKgVDXiULLXQE1/xZ/mvgR/zk1QAusyeXPvVnMdxoRdxFdXb
QGZHgUqvFvYi8Lufvs+ZLGS4sAp7oD/Q+lNPXn7cniSwfY4uJiw=
=b6+F
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Puyll NFS client updates from Anna Schumaker:
"Stable bugfixes:
- Fix memory leaks and corruption in readdir # v2.6.37+
- Directory page cache needs to be locked when read # v2.6.37+
New features:
- Convert NFS to use the new mount API
- Add "softreval" mount option to let clients use cache if server goes down
- Add a config option to compile without UDP support
- Limit the number of inactive delegations the client can cache at once
- Improved readdir concurrency using iterate_shared()
Other bugfixes and cleanups:
- More 64-bit time conversions
- Add additional diagnostic tracepoints
- Check for holes in swapfiles, and add dependency on CONFIG_SWAP
- Various xprtrdma cleanups to prepare for 5.7's changes
- Several fixes for NFS writeback and commit handling
- Fix acls over krb5i/krb5p mounts
- Recover from premature loss of openstateids
- Fix NFS v3 chacl and chmod bug
- Compare creds using cred_fscmp()
- Use kmemdup_nul() in more places
- Optimize readdir cache page invalidation
- Lease renewal and recovery fixes"
* tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits)
NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
NFSv4: try lease recovery on NFS4ERR_EXPIRED
NFS: Fix memory leaks
nfs: optimise readdir cache page invalidation
NFS: Switch readdir to using iterate_shared()
NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
NFS: Directory page cache pages need to be locked when read
NFS: Fix memory leaks and corruption in readdir
SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
NFSv4: Limit the total number of cached delegations
NFSv4: Add accounting for the number of active delegations held
NFSv4: Try to return the delegation immediately when marked for return on close
NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
NFS: nfs_find_open_context() should use cred_fscmp()
NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
NFS: remove unused macros
nfs: Return EINVAL rather than ERANGE for mount parse errors
...