Pull parisc updates from Helge Deller:
- Optimize parisc page table locks by using the existing
page_table_lock
- Export argv0-preserve flag in binfmt_misc for usage in qemu-user
- Fix interrupt table (IVT) checksum so firmware will call crash
handler (HPMC)
- Increase IRQ stack to 64kb on 64-bit kernel
- Switch to common devmem_is_allowed() implementation
- Minor fix to get_whan()
* 'parisc-5.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
binfmt_misc: pass binfmt_misc flags to the interpreter
parisc: Optimize per-pagetable spinlocks
parisc: Replace test_ti_thread_flag() with test_tsk_thread_flag()
parisc: Bump 64-bit IRQ stack size to 64 KB
parisc: Fix IVT checksum calculation wrt HPMC
parisc: Use the generic devmem_is_allowed()
parisc: Drop out of get_whan() if task is running again
- vDSO build improvements including support for building with BSD.
- Cleanup to the AMU support code and initialisation rework to support
cpufreq drivers built as modules.
- Removal of synthetic frame record from exception stack when entering
the kernel from EL0.
- Add support for the TRNG firmware call introduced by Arm spec
DEN0098.
- Cleanup and refactoring across the board.
- Avoid calling arch_get_random_seed_long() from
add_interrupt_randomness()
- Perf and PMU updates including support for Cortex-A78 and the v8.3
SPE extensions.
- Significant steps along the road to leaving the MMU enabled during
kexec relocation.
- Faultaround changes to initialise prefaulted PTEs as 'old' when
hardware access-flag updates are supported, which drastically
improves vmscan performance.
- CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55
(#1024718)
- Preparatory work for yielding the vector unit at a finer granularity
in the crypto code, which in turn will one day allow us to defer
softirq processing when it is in use.
- Support for overriding CPU ID register fields on the command-line.
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmAmwZcQHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNLA1B/0XMwWUhmJ4ZPK4sr28YWHNGLuCFHDgkMKU
dEmS806OF9d0J7fTczGsKdS4IKtXWko67Z0UGiPIStwfm0itSW2Zgbo9KZeDPqPI
fH0s23nQKxUMyNW7b9p4cTV3YuGVMZSBoMug2jU2DEDpSqeGBk09NPi6inERBCz/
qZxcqXTKxXbtOY56eJmq09UlFZiwfONubzuCrrUH7LU8ZBSInM/6Q4us/oVm4zYI
Pnv996mtL4UxRqq/KoU9+cQ1zsI01kt9/coHwfCYvSpZEVAnTWtfECsJ690tr3mF
TSKQLvOzxbDtU+HcbkNVKW0A38EIO1xXr8yXW9SJx6BJBkyb24xo
=IwMb
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
- vDSO build improvements including support for building with BSD.
- Cleanup to the AMU support code and initialisation rework to support
cpufreq drivers built as modules.
- Removal of synthetic frame record from exception stack when entering
the kernel from EL0.
- Add support for the TRNG firmware call introduced by Arm spec
DEN0098.
- Cleanup and refactoring across the board.
- Avoid calling arch_get_random_seed_long() from
add_interrupt_randomness()
- Perf and PMU updates including support for Cortex-A78 and the v8.3
SPE extensions.
- Significant steps along the road to leaving the MMU enabled during
kexec relocation.
- Faultaround changes to initialise prefaulted PTEs as 'old' when
hardware access-flag updates are supported, which drastically
improves vmscan performance.
- CPU errata updates for Cortex-A76 (#1463225) and Cortex-A55
(#1024718)
- Preparatory work for yielding the vector unit at a finer granularity
in the crypto code, which in turn will one day allow us to defer
softirq processing when it is in use.
- Support for overriding CPU ID register fields on the command-line.
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (85 commits)
drivers/perf: Replace spin_lock_irqsave to spin_lock
mm: filemap: Fix microblaze build failure with 'mmu_defconfig'
arm64: Make CPU_BIG_ENDIAN depend on ld.bfd or ld.lld 13.0.0+
arm64: cpufeatures: Allow disabling of Pointer Auth from the command-line
arm64: Defer enabling pointer authentication on boot core
arm64: cpufeatures: Allow disabling of BTI from the command-line
arm64: Move "nokaslr" over to the early cpufeature infrastructure
KVM: arm64: Document HVC_VHE_RESTART stub hypercall
arm64: Make kvm-arm.mode={nvhe, protected} an alias of id_aa64mmfr1.vh=0
arm64: Add an aliasing facility for the idreg override
arm64: Honor VHE being disabled from the command-line
arm64: Allow ID_AA64MMFR1_EL1.VH to be overridden from the command line
arm64: cpufeature: Add an early command-line cpufeature override facility
arm64: Extract early FDT mapping from kaslr_early_init()
arm64: cpufeature: Use IDreg override in __read_sysreg_by_encoding()
arm64: cpufeature: Add global feature override facility
arm64: Move SCTLR_EL1 initialisation to EL-agnostic code
arm64: Simplify init_el2_state to be non-VHE only
arm64: Move VHE-specific SPE setup to mutate_to_vhe()
arm64: Drop early setting of MDSCR_EL2.TPMS
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtYbYQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgppeWD/4xKhzBCGZWOkdycaaPhsUTOjNNIPmCBhlz
QQj4KFSEuJNKACUg53Ak0oECJTaH5976kjKkKs7Z+hzmkEwboLBI4erkcT9MGC3M
mPx349qBq9X3sYaFrUJF3h0sjRr+wa60nWQ01oVH8HkfI4bCNCHoqo5jDvMPWsYT
ksFbUm8YWEZmi0K2yXFWXuJIN2bVBd72a8CrvtF3ksdEMYxbWWTOAcrhYJ4H5/U7
BQjWIxiIVsAoJohcXWq/Swh8cgvgb5uJVpNUU8VEFob/jI3Gc3YojIToISB6soUL
DNhDJLeyZjuXfE1Ej+ySas9bpdG4LgxzsDBl9lFl9EQkSo1c3h/lEx85aeixAZla
QfjTOVUabzdPzvZ9H1yDQISxjVLy2PotnhVMy/rSSrnDKlowtNB9iEzd6cpzFzxU
fxomz1d6+w8rZY9jaRIAcMNa6bEOuYmcP9V8rIzGeg3Mm3jqL7H/JgJu5s2YbjpN
InmTNu4cwLeTO65DzqVxF8UGbZ2tHbMm5pNeVBYxuY1adRgJFlIOP5kYlNlyiY+D
Bt41CRuK3hqpYfXh7nSK8U4BKEhMikTCS0W4aKL5EzLZ20rxjgTlaHZiOBqd9vep
1tqNjPIvL2jWfF+5shwAZbupj3WKbuVqi4S2jXljv+Wkmk4ZVLSX3fQZv2I7JTHM
I2qa59PB4A==
=8MX/
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Highlights from this cycles are things like request recycling and
task_work optimizations, which net us anywhere from 10-20% of speedups
on workloads that mostly are inline.
This work was originally done to put io_uring under memcg, which adds
considerable overhead. But it's a really nice win as well. Also worth
highlighting is the LOOKUP_CACHED work in the VFS, and using it in
io_uring. Greatly speeds up the fast path for file opens.
Summary:
- Put io_uring under memcg protection. We accounted just the rings
themselves under rlimit memlock before, now we account everything.
- Request cache recycling, persistent across invocations (Pavel, me)
- First part of a cleanup/improvement to buffer registration (Bijan)
- SQPOLL fixes (Hao)
- File registration NULL pointer fixup (Dan)
- LOOKUP_CACHED support for io_uring
- Disable /proc/thread-self/ for io_uring, like we do for /proc/self
- Add Pavel to the io_uring MAINTAINERS entry
- Tons of code cleanups and optimizations (Pavel)
- Support for skip entries in file registration (Noah)"
* tag 'for-5.12/io_uring-2021-02-17' of git://git.kernel.dk/linux-block: (103 commits)
io_uring: tctx->task_lock should be IRQ safe
proc: don't allow async path resolution of /proc/thread-self components
io_uring: kill cached requests from exiting task closing the ring
io_uring: add helper to free all request caches
io_uring: allow task match to be passed to io_req_cache_free()
io-wq: clear out worker ->fs and ->files
io_uring: optimise io_init_req() flags setting
io_uring: clean io_req_find_next() fast check
io_uring: don't check PF_EXITING from syscall
io_uring: don't split out consume out of SQE get
io_uring: save ctx put/get for task_work submit
io_uring: don't duplicate io_req_task_queue()
io_uring: optimise SQPOLL mm/files grabbing
io_uring: optimise out unlikely link queue
io_uring: take compl state from submit state
io_uring: inline io_complete_rw_common()
io_uring: move res check out of io_rw_reissue()
io_uring: simplify iopoll reissuing
io_uring: clean up io_req_free_batch_finish()
io_uring: move submit side state closer in the ring
...
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP
EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+
RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt
Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK
dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw
ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg
rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u
ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l
Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ
wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC
VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26
WC22RGC2MA==
=os1E
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Another nice round of removing more code than what is added, mostly
due to Christoph's relentless pursuit of tech debt removal/cleanups.
This pull request contains:
- Two series of BFQ improvements (Paolo, Jan, Jia)
- Block iov_iter improvements (Pavel)
- bsg error path fix (Pan)
- blk-mq scheduler improvements (Jan)
- -EBUSY discard fix (Jan)
- bvec allocation improvements (Ming, Christoph)
- bio allocation and init improvements (Christoph)
- Store bdev pointer in bio instead of gendisk + partno (Christoph)
- Block trace point cleanups (Christoph)
- hard read-only vs read-only split (Christoph)
- Block based swap cleanups (Christoph)
- Zoned write granularity support (Damien)
- Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)"
* tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits)
mm: simplify swapdev_block
sd_zbc: clear zone resources for non-zoned case
block: introduce blk_queue_clear_zone_settings()
zonefs: use zone write granularity as block size
block: introduce zone_write_granularity limit
block: use blk_queue_set_zoned in add_partition()
nullb: use blk_queue_set_zoned() to setup zoned devices
nvme: cleanup zone information initialization
block: document zone_append_max_bytes attribute
block: use bi_max_vecs to find the bvec pool
md/raid10: remove dead code in reshape_request
block: mark the bio as cloned in bio_iov_bvec_set
block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
block: remove a layer of indentation in bio_iov_iter_get_pages
block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
block: remove the 1 and 4 vec bvec_slabs entries
block: streamline bvec_alloc
block: factor out a bvec_alloc_gfp helper
block: move struct biovec_slab to bio.c
block: reuse BIO_INLINE_VECS for integrity bvecs
...
The "oprofile" user-space tools don't use the kernel OPROFILE support any more,
and haven't in a long time. User-space has been converted to the perf
interfaces.
The dcookies stuff is only used by the oprofile code. Now that oprofile's
support is getting removed from the kernel, there is no need for dcookies as
well.
Remove kernel's old oprofile and dcookies support.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJgJMEVAAoJENK5HDyugRIcL8YP/jkmXH5CZT80ntcqrJGWKcG7
lWbach7uNeQteht7B1ZPKvojxizTkmfrN2sClX0B2hbGkc5TiWUQ2ZSnvnfWDZ8+
z2qQcEB11G/ReL2vvRk1fJlWdAOyUfrPee/44AkemnLRv+Niw/8PqnGd87yDQGsK
qy5E1XXfbjUq6Y/uMiLOX3+21I6w6o2Q6I3NNXC93s0wS3awqnft8n0XBC7iAPBj
eowRJxpdRU2Vcuj8UOzzOI7gQlwdjwYImyLPbRy/V8NawC8a+FHrPrf5/GCYlVzl
7TGFBsDQSmzvrBChUfoGz1Rq/VZ1a357p5rhRqemfUrdkjW+vyzelnD8I1W/hb2o
SmBXoPoyl3+UkFHNyJI0mI7obaV+2PzyXMV0JIQUj+IiX/mfeFv0nF4XfZD2IkRt
6xhaYj775Zrx32iBdGZIvvLg5Gh9ZkZmR5vJ7Fi/EIZFe6Z+bZnPKUROnAgS/o0z
+UkSygOhgo/1XbqrzZVk1iweWeu+EUMbY4YQv2qVnFhpvsq4ieThcUGQpWcxGjjH
WP8O0n1yq1slsnpUtxhiTsm46ENajx9zZp6Iv6Ws+NM0RUqjND8BdF1co9WGD3LS
cnZMFBs4Bg/V1HICL/D4s6L7t1ofrEXIgJH1y3iF0HeECq03mU4CgA/qly9Aebqg
UxPF3oNlVOPlds9FzsU2
=I2Ac
-----END PGP SIGNATURE-----
Merge tag 'oprofile-removal-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux
Pull oprofile and dcookies removal from Viresh Kumar:
"Remove oprofile and dcookies support
The 'oprofile' user-space tools don't use the kernel OPROFILE support
any more, and haven't in a long time. User-space has been converted to
the perf interfaces.
The dcookies stuff is only used by the oprofile code. Now that
oprofile's support is getting removed from the kernel, there is no
need for dcookies as well.
Remove kernel's old oprofile and dcookies support"
* tag 'oprofile-removal-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/linux:
fs: Remove dcookies support
drivers: Remove CONFIG_OPROFILE support
arch: xtensa: Remove CONFIG_OPROFILE support
arch: x86: Remove CONFIG_OPROFILE support
arch: sparc: Remove CONFIG_OPROFILE support
arch: sh: Remove CONFIG_OPROFILE support
arch: s390: Remove CONFIG_OPROFILE support
arch: powerpc: Remove oprofile
arch: powerpc: Stop building and using oprofile
arch: parisc: Remove CONFIG_OPROFILE support
arch: mips: Remove CONFIG_OPROFILE support
arch: microblaze: Remove CONFIG_OPROFILE support
arch: ia64: Remove rest of perfmon support
arch: ia64: Remove CONFIG_OPROFILE support
arch: hexagon: Don't select HAVE_OPROFILE
arch: arc: Remove CONFIG_OPROFILE support
arch: arm: Remove CONFIG_OPROFILE support
arch: alpha: Remove CONFIG_OPROFILE support
- Fix an ABBA deadlock when renaming files on overlayfs.
- Make sure that we can't overflow the inode extent counters when adding
to or removing extents from a file.
- Make directory sgid inheritance work the same way as all the other
filesystems.
- Don't drain the buffer cache on freeze and ro remount, which should
reduce the amount of time if read-only workloads are continuing
during the freeze.
- Fix a bug where symlink size isn't reported to the vfs in ecryptfs.
- Disentangle log cleaning from log covering. This refactoring sets us
up for future changes to the log, though for now it simply means that
we can use covering for freezes, and cleaning becomes something we
only do at unmount.
- Speed up file fsyncs by reducing iolock cycling.
- Fix delalloc blocks leaking when changing the project id fails because
of input validation errors in FSSETXATTR.
- Fix oversized quota reservation when converting unwritten extents
during a DAX write.
- Create a transaction allocation helper function to standardize the
idiom of allocating a transaction, reserving blocks, locking inodes,
and reserving quota. Replace all the open-coded logic for file
creation, file ownership changes, and file modifications to use them.
- Actually shut down the fs if the incore quota reservations get
corrupted.
- Fix background block garbage collection scans to not block and to
actually clean out CoW staging extents properly.
- Run block gc scans when we run low on project quota.
- Use the standardized transaction allocation helpers to make it so that
ENOSPC and EDQUOT errors during reservation will back out, invoke the
block gc scanner, and try again. This is preparation for introducing
background inode garbage collection in the next cycle.
- Combine speculative post-EOF block garbage collection with speculative
copy on write block garbage collection.
- Enable multithreaded quotacheck.
- Allow sysadmins to tweak the CPU affinities and maximum concurrency
levels of quotacheck and background blockgc worker pools.
- Expose the inode btree counter feature in the fs geometry ioctl.
- Cleanups of the growfs code in preparation for starting work on
filesystem shrinking.
- Fix all the bloody gcc warnings that the maintainer knows about. :P
- Fix a RST syntax error.
- Don't trigger bmbt corruption assertions after the fs shuts down.
- Restore behavior of forcing SIGBUS on a shut down filesystem when
someone triggers a mmap write fault (or really, any buffered write).
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAlX/UACgkQ+H93GTRK
tOta+RAAiGqLKxeY07HH7F98pRJ86j6lU0zmc5i5UCOGMvZd8hLKDdThzggsjqO6
rrUSc7Ppg7MQt1JdXLSdZw2N6Ksb9yy6chufj+j3Dq1JQfSL4YvBO/LlXmZmFE6d
80Qbqq6HFSRWb6JzCMr3knhC+FJovAGhFgZYZGBZ817A/FXacTg9/A5Ow8SX81WX
42s517QOmegAn7YhC3xcPZp5iavjbMd7Y9v7izpuo4FBB9AY7NYyb5wVhvffILfS
/SMLQPw3T/tccRJuVJ8TfLA9R+B9+LaGmQ5tn/AtdwN+Lv7ykinzGKYLagkdlTmE
onGkEIwrebEgq9phT47eX7ixiEt7oWQiQGZukXLVn7mL/0WPVI2pbYi/M1BNpi8i
UftOEVroav+m4h0DF3duOE7rLGuBIEdjPuuAs85QhZ6UTusBjwxp1gOJbjuN0Up9
9hBGTtYQIRhWxHkxWKAeuYzIbtMxC2S2XGxnW4cNOxbE7GxwfxBw0KP/38ZP4iYQ
LKt6JVX+iFDQ+lH8JA6DD7+j+m7W37Alu89OPmpW2nYpFyisFDY+1dEIFvPw9roZ
BtbKlZzS2O2zD67/tTVh+ZcPoEcPfp156GDCrgfgdIdiBvQtGbyOLB/WQC6wSU1L
2PLt1inFBx5wNrIEMFMHT1hsduRihNMM+eLn6LV5XIK2RmSCT+I=
=CaLz
-----END PGP SIGNATURE-----
Merge tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"There's a lot going on this time, which seems about right for this
drama-filled year.
Community developers added some code to speed up freezing when
read-only workloads are still running, refactored the logging code,
added checks to prevent file extent counter overflow, reduced iolock
cycling to speed up fsync and gc scans, and started the slow march
towards supporting filesystem shrinking.
There's a huge refactoring of the internal speculative preallocation
garbage collection code which fixes a bunch of bugs, makes the gc
scheduling per-AG and hence multithreaded, and standardizes the retry
logic when we try to reserve space or quota, can't, and want to
trigger a gc scan. We also enable multithreaded quotacheck to reduce
mount times further. This is also preparation for background file gc,
which may or may not land for 5.13.
We also fixed some deadlocks in the rename code, fixed a quota
accounting leak when FSSETXATTR fails, restored the behavior that
write faults to an mmap'd region actually cause a SIGBUS, fixed a bug
where sgid directory inheritance wasn't quite working properly, and
fixed a bug where symlinks weren't working properly in ecryptfs. We
also now advertise the inode btree counters feature that was
introduced two cycles ago.
Summary:
- Fix an ABBA deadlock when renaming files on overlayfs.
- Make sure that we can't overflow the inode extent counters when
adding to or removing extents from a file.
- Make directory sgid inheritance work the same way as all the other
filesystems.
- Don't drain the buffer cache on freeze and ro remount, which should
reduce the amount of time if read-only workloads are continuing
during the freeze.
- Fix a bug where symlink size isn't reported to the vfs in ecryptfs.
- Disentangle log cleaning from log covering. This refactoring sets
us up for future changes to the log, though for now it simply means
that we can use covering for freezes, and cleaning becomes
something we only do at unmount.
- Speed up file fsyncs by reducing iolock cycling.
- Fix delalloc blocks leaking when changing the project id fails
because of input validation errors in FSSETXATTR.
- Fix oversized quota reservation when converting unwritten extents
during a DAX write.
- Create a transaction allocation helper function to standardize the
idiom of allocating a transaction, reserving blocks, locking
inodes, and reserving quota. Replace all the open-coded logic for
file creation, file ownership changes, and file modifications to
use them.
- Actually shut down the fs if the incore quota reservations get
corrupted.
- Fix background block garbage collection scans to not block and to
actually clean out CoW staging extents properly.
- Run block gc scans when we run low on project quota.
- Use the standardized transaction allocation helpers to make it so
that ENOSPC and EDQUOT errors during reservation will back out,
invoke the block gc scanner, and try again. This is preparation for
introducing background inode garbage collection in the next cycle.
- Combine speculative post-EOF block garbage collection with
speculative copy on write block garbage collection.
- Enable multithreaded quotacheck.
- Allow sysadmins to tweak the CPU affinities and maximum concurrency
levels of quotacheck and background blockgc worker pools.
- Expose the inode btree counter feature in the fs geometry ioctl.
- Cleanups of the growfs code in preparation for starting work on
filesystem shrinking.
- Fix all the bloody gcc warnings that the maintainer knows about. :P
- Fix a RST syntax error.
- Don't trigger bmbt corruption assertions after the fs shuts down.
- Restore behavior of forcing SIGBUS on a shut down filesystem when
someone triggers a mmap write fault (or really, any buffered
write)"
* tag 'xfs-5.12-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (85 commits)
xfs: consider shutdown in bmapbt cursor delete assert
xfs: fix boolreturn.cocci warnings
xfs: restore shutdown check in mapped write fault path
xfs: fix rst syntax error in admin guide
xfs: fix incorrect root dquot corruption error when switching group/project quota types
xfs: get rid of xfs_growfs_{data,log}_t
xfs: rename `new' to `delta' in xfs_growfs_data_private()
libxfs: expose inobtcount in xfs geometry
xfs: don't bounce the iolock between free_{eof,cow}blocks
xfs: expose the blockgc workqueue knobs publicly
xfs: parallelize block preallocation garbage collection
xfs: rename block gc start and stop functions
xfs: only walk the incore inode tree once per blockgc scan
xfs: consolidate the eofblocks and cowblocks workers
xfs: consolidate incore inode radix tree posteof/cowblocks tags
xfs: remove trivial eof/cowblocks functions
xfs: hide xfs_icache_free_cowblocks
xfs: hide xfs_icache_free_eofblocks
xfs: relocate the eofb/cowb workqueue functions
xfs: set WQ_SYSFS on all workqueues in debug mode
...
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN if
the write is not a pure overwrite within EOF; this will be used to
reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for unaligned
direct writes if the circumstances permit
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAZgQAACgkQ+H93GTRK
tOtNqw/+KPff1NjQVK2k361R0+LjlEHfe2nxh7+kS10IiR5nbBz4Fu+GwEosZKq+
H9ficBbZ0wIveV+5CEt2xZLEJFC4LZUpNPVVrUf8XPLKiVexP/U3wtKzmv9Z7D5J
5walMWQycVeR+ycomynV36giqekvARL7KCQG5By2ITfSNxfnb/wvKhn1d61ZDOF6
f4xzq7F6+cEOrSZt2LcFzGSfsTl6oakYMAomPU57sqGmw7MHRqoPTErbdh2HnVJy
yQ47eiZgSKWKA+Qm+VvHHePYCYnu0nvA2rbNerjTN70hnO8rK9S0Vle6Sp5CUqAX
sXOy8zxOLYKqyM4S/QkIN2TGIyWg+CHiakVLZGF3Q4AUDDYfpD0cHvAe9N3v9euL
qt8ypT8dz2C3qiTg5E31xy033wlAP0wg3FZiLAqEjL5o3fzD+qbplTiSmYbMV2Fb
xuu7a2T6u1MHaIn1IhaL0cB49Fzn+5EMyp6BlAucAOakyuqJCyJiXokdk0Looy5e
jUshvcwWcmHMpI/YYYY6t56KV6tl2exGq5sySY5U6dr8/r5lwc0SI+TrYFG0jTR8
59DGd5CkKgdBFcuys+eaZDXgr7A4ymkVE+pE0QNDz9UwNP20tLb3dQNlhgxchUgu
NgPaFgQkoNM3HmQNyU2wX/t1aFlC/doqSkb/96UWQSxq6IrajMU=
=AR07
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The big change in this cycle is some new code to make it possible for
XFS to try unaligned directio overwrites without taking locks. If the
block is fully written and within EOF (i.e. doesn't require any
further fs intervention) then we can let the unlocked write proceed.
If not, we fall back to synchronizing direct writes.
Summary:
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN
if the write is not a pure overwrite within EOF; this will be used
to reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for
unaligned direct writes if the circumstances permit"
* tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: reduce exclusive locking on unaligned dio
xfs: split the unaligned DIO write code out
xfs: improve the reflink_bounce_dio_write tracepoint
xfs: simplify the read/write tracepoints
xfs: remove the buffered I/O fallback assert
xfs: cleanup the read/write helper naming
xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
xfs: factor out a xfs_ilock_iocb helper
iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
iomap: pass a flags argument to iomap_dio_rw
iomap: rename the flags variable in __iomap_dio_rw
Add an ioctl which allows reading fs-verity metadata from a file.
This is useful when a file with fs-verity enabled needs to be served
somewhere, and the other end wants to do its own fs-verity compatible
verification of the file. See the commit messages for details.
This new ioctl has been tested using new xfstests I've written for it.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCYCv/2hQcZWJpZ2dlcnNA
Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK6/7AQDRmmnV+G34yGPCWfu8tyjdYvWPyak2
IA/I+eM6S/F+4QEAkbX6rOwYVhLHN9KSOYyNhJiBchm6xq83J+R8BYh/Kw0=
=FPNK
-----END PGP SIGNATURE-----
Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt
Pull fsverity updates from Eric Biggers:
"Add an ioctl which allows reading fs-verity metadata from a file.
This is useful when a file with fs-verity enabled needs to be served
somewhere, and the other end wants to do its own fs-verity compatible
verification of the file. See the commit messages for details.
This new ioctl has been tested using new xfstests I've written for it"
* tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt:
fs-verity: support reading signature with ioctl
fs-verity: support reading descriptor with ioctl
fs-verity: support reading Merkle tree with ioctl
fs-verity: add FS_IOC_READ_VERITY_METADATA ioctl
fs-verity: don't pass whole descriptor to fsverity_verify_signature()
fs-verity: factor out fsverity_get_descriptor()
- Update NFSv2 and NFSv3 XDR decoding functions
- Further improve support for re-exporting NFS mounts
- Convert NFSD stats to per-CPU counters
- Add batch Receive posting to the server's RPC/RDMA transport
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmAYVsAACgkQM2qzM29m
f5f1Lg/+IBC7Bhnnc8jNr4nv4IntCwwKdx2VzSzQszbN/kkhLZK89u36nZyqp0RB
Vg3olyS5DseEisMMx0rI0KkHBz7pz+kXVdOGvve8fHBZvewnJ/FpxNZPChG4aMDc
mfjHLvDHO0/GoUqSftrBrjSEJ2jHoNdDcmvzgdAlugTuLOjGX3HhmKa3ZYVTNgFn
kDmFMaEHjS3pb3LqNDHNIYYpNnvtIukxHUh9weDvr+AH8Rmt/WVfjDc26xBS0FQu
jDJUk9AP06VYgZx0dLKp4In8GJYwz9DNjNrWm91+RyJml9AWrFswdBHHcfi0W/Yy
GipkBZGYE6ZblyMlITZCB4etyHQsq7qLuqicTlcXjL/Fdkd7xlT8DwFlZ8LjpyCU
LeHTI2cGzRSJ/JjL2hvhPvT3gR5hln/qk17jSP7V4S6psZAqAEvw/Xa/+MDJhB/b
vnzltFPvEgZc59Q/SJLbaWZLHy1q0enbrOBLMZDmUlk911/tgAuflHJM60N8o732
vkfy05pvZlrV0cFY546pQd7zTKZcAOYPVHHoP25wPa2ibKBu6eQ6kZEi5zu+tVK3
CkvqIhePFspBMQ6GOPKixTiFV4KFoO1HBtk+JEeMkiHXHk1xATCWbg1m7wkaagsq
NNS/qFkLRnftGYpFViBaxTFBGxiBOSbsTIS/zfj5L7JOpW4FRD4=
=02xw
-----END PGP SIGNATURE-----
Merge tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
- Update NFSv2 and NFSv3 XDR decoding functions
- Further improve support for re-exporting NFS mounts
- Convert NFSD stats to per-CPU counters
- Add batch Receive posting to the server's RPC/RDMA transport
* tag 'nfsd-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (65 commits)
nfsd: skip some unnecessary stats in the v4 case
nfs: use change attribute for NFS re-exports
NFSv4_2: SSC helper should use its own config.
nfsd: cstate->session->se_client -> cstate->clp
nfsd: simplify nfsd4_check_open_reclaim
nfsd: remove unused set_client argument
nfsd: find_cpntf_state cleanup
nfsd: refactor set_client
nfsd: rename lookup_clientid->set_client
nfsd: simplify nfsd_renew
nfsd: simplify process_lock
nfsd4: simplify process_lookup1
SUNRPC: Correct a comment
svcrdma: DMA-sync the receive buffer in svc_rdma_recvfrom()
svcrdma: Reduce Receive doorbell rate
svcrdma: Deprecate stat variables that are no longer used
svcrdma: Restore read and write stats
svcrdma: Convert rdma_stat_sq_starve to a per-CPU counter
svcrdma: Convert rdma_stat_recv to a per-CPU counter
svcrdma: Refactor svc_rdma_init() and svc_rdma_clean_up()
...
- fix shift-out-of-bounds of crafted blkszbits generated by syzkaller;
- ensure initialized fields can only be observed after bit is set.
-----BEGIN PGP SIGNATURE-----
iIsEABYIADMWIQThPAmQN9sSA0DVxtI5NzHcH7XmBAUCYC5qFBUcaHNpYW5na2Fv
QHJlZGhhdC5jb20ACgkQOTcx3B+15gT4PwD/W8BGqC3/uBC6qGJuNkRteFmaIDvB
EplXizcZ+6ennkkBAIbbEsFx8K3TM/tg45YqV+ebjRbsH4NG1owVqb8ZAc0M
=Ni8F
-----END PGP SIGNATURE-----
Merge tag 'erofs-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"This contains a somewhat important but rarely reproduced fix reported
month ago for platforms which have weak memory model (e.g. arm64).
The root cause is that test_bit/set_bit atomic operations are actually
implemented in relaxed forms, and uninitialized fields governed by an
atomic bit could be observed in advance due to memory reordering thus
memory barrier pairs should be used.
There is also a trivial fix of crafted blkszbits generated by
syzkaller.
Summary:
- fix shift-out-of-bounds of crafted blkszbits generated by syzkaller
- ensure initialized fields can only be observed after bit is set"
* tag 'erofs-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: initialized fields can only be observed after bit is set
erofs: fix shift-out-of-bounds of blkszbits
We've added two major features: 1) compression level and 2) checkpoint_merge, in
this round. 1) compression level expands 'compress_algorithm' mount option to
accept parameter as format of <algorithm>:<level>, by this way, it gives a way
to allow user to do more specified config on lz4 and zstd compression level,
then f2fs compression can provide higher compress ratio. 2) checkpoint_merge
creates a kernel daemon and makes it to merge concurrent checkpoint requests as
much as possible to eliminate redundant checkpoint issues. Plus, we can
eliminate the sluggish issue caused by slow checkpoint operation when the
checkpoint is done in a process context in a cgroup having low i/o budget and
cpu shares.
Enhancement:
- add compress level for lz4 and zstd in mount option
- checkpoint_merge mount option
- deprecate f2fs_trace_io
Bug fix:
- flush data when enabling checkpoint back
- handle corner cases of mount options
- missing ACL update and lock for I_LINKABLE flag
- attach FIEMAP_EXTENT_MERGED in f2fs_fiemap
- fix potential deadlock in compression flow
- fix wrong submit_io condition
As usual, we've cleaned up many code flows and fixed minor bugs.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmAtdrAACgkQQBSofoJI
UNLvLg//XWERjTZ3tfHHLtNcIkNCd2WaKXwpanTXJsn0kVUc6H5m8lqkutn5Vh/z
ZAtQE89aqwbw/FPQQl6jEA/aHhXAnCBbXS0Rjx7QFwlqs+772H10VLvdNXewgvJB
r/u7CIlxbmu3p6ZLSG/a8uJe3CMimJe4lrswjnFlLYgKiho40tcQL8qfQEtkNQSF
+MV2npS7ka4x/PenFykVbTI0OcwOpblpgkpjgfl5A9bcOsGbli+1qzcasbcX9z9k
20TwZqk5q7rZHVDjvtYERSyS9mmn3fzEJStK4sdZ6uk+EKxyC+KNHrv9cKwemTCm
ZATR/YBJKeYhjYppyYLLTRp5eL08PBNgE15SmnkVRjMcAiFxM689WfShrIVhBaf1
dRr9DxAMLuFSiwFuLBLE/8yMwed38RH9e0RrfQRVjj8Zs2kHcUdwD1WqyDg7omS8
NuH776LhJSsSVgC8ZKTacQgX8l2NvsjAigeBj/6v4o0lzr1msn2ADpQ9Bww9Iqtt
lv/09350ww78UV+ipLlVSHw4rl8sebatMUSHtmF4SP7U7Jqv2MaGhNAteWlCklmV
0cTzjEueiuvmrmkiphTHtl1fHHDVCE0xtScpoylchPVd8bal0pVq4XbZLmGsQwDt
9V9qOebt2xLmx9EXDyqdRWRbDrtE0FG/AZiN8Q0VcJSzUI/ATx8=
=+/7T
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"We've added two major features: 1) compression level and 2)
checkpoint_merge, in this round.
Compression level expands 'compress_algorithm' mount option to accept
parameter as format of <algorithm>:<level>, by this way, it gives a
way to allow user to do more specified config on lz4 and zstd
compression level, then f2fs compression can provide higher compress
ratio.
checkpoint_merge creates a kernel daemon and makes it to merge
concurrent checkpoint requests as much as possible to eliminate
redundant checkpoint issues. Plus, we can eliminate the sluggish issue
caused by slow checkpoint operation when the checkpoint is done in a
process context in a cgroup having low i/o budget and cpu shares.
Enhancements:
- add compress level for lz4 and zstd in mount option
- checkpoint_merge mount option
- deprecate f2fs_trace_io
Bug fixes:
- flush data when enabling checkpoint back
- handle corner cases of mount options
- missing ACL update and lock for I_LINKABLE flag
- attach FIEMAP_EXTENT_MERGED in f2fs_fiemap
- fix potential deadlock in compression flow
- fix wrong submit_io condition
As usual, we've cleaned up many code flows and fixed minor bugs"
* tag 'f2fs-for-5.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (32 commits)
Documentation: f2fs: fix typo s/automaic/automatic
f2fs: give a warning only for readonly partition
f2fs: don't grab superblock freeze for flush/ckpt thread
f2fs: add ckpt_thread_ioprio sysfs node
f2fs: introduce checkpoint_merge mount option
f2fs: relocate inline conversion from mmap() to mkwrite()
f2fs: fix a wrong condition in __submit_bio
f2fs: remove unnecessary initialization in xattr.c
f2fs: fix to avoid inconsistent quota data
f2fs: flush data when enabling checkpoint back
f2fs: deprecate f2fs_trace_io
f2fs: Remove readahead collision detection
f2fs: remove unused stat_{inc, dec}_atomic_write
f2fs: introduce sb_status sysfs node
f2fs: fix to use per-inode maxbytes
f2fs: compress: fix potential deadlock
libfs: unexport generic_ci_d_compare() and generic_ci_d_hash()
f2fs: fix to set/clear I_LINKABLE under i_lock
f2fs: fix null page reference in redirty_blocks
f2fs: clean up post-read processing
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyGEACgkQxWXV+ddt
WDuU6BAAhfI5BndMm6a1LooMsBHTR7Mh/aFXZEKX7vCDRnrkr+WiihDFhXu4tH3y
arRsdwMnJCnta2/JMI5xCZZRg9Bsb/Sa0qWoR9sDBVoGRMnE1DS5YHQyv0bfJYk0
qYOW/jorBV1n/hL19+WbDFajwajP86uGtlDKV7cJ/C3lIogQma7zQ7ygwxbDcZqm
ZQVHg7ooM4P1t7EV0eDlatxn0Sm8KFkxXD7dbu37qDLWr3Aw8N4IwT7I9h4b+/tg
hL4dqMPxX6AyRiI0VBsqKnmcRWtT9cN7yw0+J+/JK5KuaFFx3qyZZ+EQu1jAGZDt
2m432YKya8LQfyBuSe8uoCIcczhGoD0EPIhspecDMfWTvxdo+AeTJZzZzj3u1y+v
3pih+gBN1sa8vRVSX08mIBF/k0pPfxRu7gIjvl4wl18bm3Khq5VJ93ImP7DNroNg
bKiUG35K+kvXGBNaLY71zZfO6aLMddK73aDudSbYOS8XcbKhor1G8j5o5/EkcVQA
wio4Gw5BmfVeRuXOl2h1aEXThk+469s0DR7MiMiAA6917cUjQiFUgFOaogR0XY3S
8ffX+S50AFW834J0eIGHPLmzi70WwSSXCS2q+zl87PPRK5+jCp9ZzWGi9MGG1qdh
fp7XVMkzHVSKGK5GXB+ICUfzkShxfTCh+EbxcXIulONxsEdADsc=
=0O6r
-----END PGP SIGNATURE-----
Merge tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings updates of space handling, performance improvements or bug
fixes. The subpage block size and zoned mode features have reached
state where they're usable but with limitations.
Performance or related:
- do not block on deleted block group mutex in the cleaner, avoids
some long stalls
- improved flushing: make it work better with ticket space
reservations and avoid excessive transaction commits in some
scenarios, slightly improves throughput for random write load
- preemptive background flushing: separate the logic from ticket
reservations, improve the accounting and decisions when to flush in
low space conditions
- less lock contention related to running delayed refs, let just one
thread do the flushing when there are many inside transaction
commit
- dbench workload improvements: avoid unnecessary work when logging
inodes, fewer fallbacks to transaction commit and thus less waiting
for it (+7% throughput, -20% latency)
Core:
- subpage block size
- currently read-only support
- refactor and generalize code where sectorsize is assumed to be
page size, add the subpage handling everywhere
- the read-write support is on the way, page sizes are still
limited to 4K or 64K
- zoned mode, first working version but with limitations
- SMR/ZBC/ZNS friendly allocation mode, utilizing the "no fixed
location for structures" and chunked allocation
- superblock as the only fixed data structure needs special
handling, uses 2 consecutive zones as a ring buffer
- tree-log support with a dedicated block group to avoid unordered
writes
- emulated zones on non-zoned devices
- not yet working
- all non-single block group profiles, requires more zone write
pointer synchronization between the multiple block groups
- fitrim due to dependency on space cache, can be implemented
Fixes:
- ref-verify: proper tree owner and node level tracking
- fix pinned byte accounting, causing some early ENOSPC now more
likely due to other changes in delayed refs
Other:
- error handling fixes and improvements
- more error injection points
- more function documentation
- more and updated tracepoints
- subset of W=1 checked by default
- update comments to allow more automatic kdoc parameter checks"
* tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (144 commits)
btrfs: zoned: enable to mount ZONED incompat flag
btrfs: zoned: deal with holes writing out tree-log pages
btrfs: zoned: reorder log node allocation on zoned filesystem
btrfs: zoned: serialize log transaction on zoned filesystems
btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
btrfs: split alloc_log_tree()
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
btrfs: zoned: enable relocation on a zoned filesystem
btrfs: zoned: support dev-replace in zoned filesystems
btrfs: zoned: implement copying for zoned device-replace
btrfs: zoned: implement cloning for zoned device-replace
btrfs: zoned: mark block groups to copy for device-replace
btrfs: zoned: do not use async metadata checksum on zoned filesystems
btrfs: zoned: wait for existing extents before truncating
btrfs: zoned: serialize metadata IO
btrfs: zoned: introduce dedicated data write path for zoned filesystems
btrfs: zoned: enable zone append writing for direct IO
btrfs: zoned: use ZONE_APPEND write for zoned mode
btrfs: save irq flags when looking up an ordered extent
btrfs: zoned: cache if block group is on a sequential zone
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyNYACgkQxWXV+ddt
WDt9Ow/9Fsulw3gwsgTzuhlM08Ax7uJWhYSvq7hdg4kfrwCgsR7gI6BalIErstOz
R8pxiRwXLI6C3muQGUHVTYa7t9IkYqhYfE6hTNtFYlpomVwZPm0URkwAnbwkL+VK
rL94bimLtsbvkdMI17rHSvQ5wEEvEUGZBF2Jvy3s2sx3P1tt6nFHFf51alIKY+Lv
u4J3/8otevd+nGRKeMahUOV2v4ssTTcASGLPudvRAIj3g+nAjM/ODTeopN7SBvnd
b708r4e5HsPXCSW+aN2E2IwrwOiNrcezSgQsl6xtUobvBcTjeFoEGnbgNK8FTepr
GaE2sJnHhH2+ZhSph21iMONVFY34hJJwl26SrixjfhGh+88QsgHD91dypkPfPKMn
2TLiCpmPg95UCBmElSJubgqOAC2KT/rwTN4dob7G+mFwEKSza2Oqc4dBVrB5rWiW
bYyexkobZt83ybwgL1ySiyA3t9GZiuDpORylE1rXB28KfQbHDaCwOgtc4qV6TJbr
z4F9ya+Yoop3/1M1xbknuA9AtPykmnAjxK96NKEeAiWpzCrcnP0PFQ4Vh1tHRQoY
yhE3mEaAHgMbEa9N+9gO8RyJSzqlqvneA2kgoTQoFfcUWoGdgzk6d1dWJmvZuUT1
I3K+K48E+2Cwq0aewCPUv44z8N/NmgDK+vDRR/U3cXG6RlJUkJM=
=Eo74
-----END PGP SIGNATURE-----
Merge tag 'affs-for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull AFFS fix from David Sterba:
"One minor fix for error handling in rename exchange"
* tag 'affs-for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
fs/affs: release old buffer head on error path
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAmAqmbYACgkQNqiEXrVA
jGSf9A/7BatjusxdjAQ3pF8xROzVl3+xZR/uG7NZFvhh7X8/Gm+/DIPH4MIpI99s
gaQLCOPaBz3s7ZAK+LYMyJ6/Ko0e1tgBWXCVNdm24bc5ETNbT68NNWiqsKw+HG/q
iVf/2n2PIgeDWjwkXfuOnYK6vpisl6l5gst8d2aIorPHk2oE9qTvylxmTBg114dP
6gJEyNnokrqo9oVPoEGwFsDIOigM0QSrreiBtzb5+8nWxd366VoOh8zznehPjGAs
C2MiKxQYOTub7AcyKdnuOwrWjmWiHHhkXq2w33QVZKSVU2m0Uoa7XkA75n+PE6GT
ypxUopxNZmQu3WB7BzkoZB6zsNHdyCbp9RdFtzLO2o1eKj8B2yvSTrp8TmSd8ReM
4Wi3CjVjVQcGyFgbng6071h5eRfXpxuFg4blGscFnttkHGaKNGtmklhie2qQAPiJ
ToV1bdam7CuvlMsOSX+DSFM7ZZbnLFlvcD8eDAztMKPWim1qgkMiY6tumSLPAGrj
9N02IIET9Iixx0BE9/HeauU3/0CTbgNwRBqBTqwBBYH9RTER4B3/+4ouWM7aLsNJ
ky/d4IB+QGXgVTbNj+FCo2dyCc3tLy/TZvY/uIq7QBNEqTuLmGwGl71BuZIvWYEV
hM4oHmV//ncgBFDM8a+cWp+saDkI2CRVJTAn/pd1vIPZRWgVwgM=
=ESGZ
-----END PGP SIGNATURE-----
Merge tag 'jfs-5.12' of git://github.com/kleikamp/linux-shaggy
Pull jfs updates from David Kleikamp:
"A few jfs fixes"
* tag 'jfs-5.12' of git://github.com/kleikamp/linux-shaggy:
fs/jfs: fix potential integer overflow on shift of a int
jfs: turn diLog(), dataLog() and txLog() into void functions
JFS: more checks for invalid superblock
-----BEGIN PGP SIGNATURE-----
iQJHBAABCAAxFiEES8DXskRxsqGE6vXTAA5oQRlWghUFAmAqXhITHGpsYXl0b25A
a2VybmVsLm9yZwAKCRAADmhBGVaCFc2CEAC2WgxNFYXUINTo8FzmgYquLrVfj04X
ecXUJwOJBUQjg+F46OENufh0uREI9DmwlW9RWQAwiVBecLK24vz0WBhKOi/88JhG
8S1I2YL3zIBbnOyBKwAiuK7y3uAQswvKRFRzaY7+aFxVvagDO2YC0l4QCdg3WDp/
n9es8OksUR04ztMYLn6qT1xHb1pWXUmHeYiGzmhgWBwyPygs5OxSP+y2qmDkj08l
o64f3GdUZivF6tT7m7rBDrx9pzUha8oqEw8+LDgiUEaq7ZeMVxHSuFVNHW7fCWVH
ICLfeZPUEZgdMD0w2v5+z/jpy8H4tm2bWNtOWxba1uQoUj5cHrPVuYXSSU1rt5SP
+yHCSyr4eEfR211d7j/U+v/O+WwJCFHRxzE9PdUpi6VlMnuTVkBhrbSGMtBiQRv7
UUwXN3JLRPO63d1D2rfpqxMspZpp5e70TpWKXYLQ69Fl1j0GcF1eLfnKsHPZld8C
Uqfa+CUwRDJKEpnprVn0BOHUlWoPHu4pUIz/gf52pN2v+mTAziZA7WHdxR30V8Pm
H1VAhRX+rPNXsjHzc9TuQK+IsaDenKHRyBOrteBS0TT1hBLF+pe0ocOVgMSP+H3w
p0BL3bVf6gToKRZMnT5+L5GA0Zp1PIQCODyjfSRxQGtNNumnGr/vmZsGka0j3gIW
JO6I+6fsEr0TEg==
=hsmB
-----END PGP SIGNATURE-----
Merge tag 'locks-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
Pull fcntl fix from Jeff Layton.
* tag 'locks-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
fcntl: make F_GETOWN(EX) return 0 on dead owner task
Pull namei updates from Al Viro:
"Most of that pile is LOOKUP_CACHED series; the rest is a couple of
misc cleanups in the general area...
There's a minor bisect hazard in the end of series, and normally I
would've just folded the fix into the previous commit, but this branch
is shared with Jens' tree, with stuff on top of it in there, so that
would've required rebases outside of vfs.git"
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix handling of nd->depth on LOOKUP_CACHED failures in try_to_unlazy*
fs: expose LOOKUP_CACHED through openat2() RESOLVE_CACHED
fs: add support for LOOKUP_CACHED
saner calling conventions for unlazy_child()
fs: make unlazy_walk() error handling consistent
fs/namei.c: Remove unlikely of status being -ECHILD in lookup_fast()
do_tmpfile(): don't mess with finish_open()
Pull ELF compat updates from Al Viro:
"Sanitizing ELF compat support, especially for triarch architectures:
- X32 handling cleaned up
- MIPS64 uses compat_binfmt_elf.c both for O32 and N32 now
- Kconfig side of things regularized
Eventually I hope to have compat_binfmt_elf.c killed, with both native
and compat built from fs/binfmt_elf.c, with -DELF_BITS={64,32} passed
by kbuild, but that's a separate story - not included here"
* 'work.elf-compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
get rid of COMPAT_ELF_EXEC_PAGESIZE
compat_binfmt_elf: don't bother with undef of ELF_ARCH
Kconfig: regularize selection of CONFIG_BINFMT_ELF
mips compat: switch to compat_binfmt_elf.c
mips: don't bother with ELF_CORE_EFLAGS
mips compat: don't bother with ELF_ET_DYN_BASE
mips: KVM_GUEST makes no sense for 64bit builds...
mips: kill unused definitions in binfmt_elf[on]32.c
mips binfmt_elf*32.c: use elfcore-compat.h
x32: make X32, !IA32_EMULATION setups able to execute x32 binaries
[amd64] clean PRSTATUS_SIZE/SET_PR_FPVALID up properly
elf_prstatus: collect the common part (everything before pr_reg) into a struct
binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID
Pull sendfile updates from Al Viro:
"Make sendfile() to pipe destination do the right thing, should make
'fs/pipe: allow sendfile() to pipe again' redundant"
* 'work.sendfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
teach sendfile(2) to handle send-to-pipe directly
take the guts of file-to-pipe splice into a helper function
do_splice_to(): move the logics for limiting the read length in
There are a lot of platforms that have not seen any interesting code
changes in the past five years or more.
I made a list and asked around which ones are no longer in use [1], and
received confirmation about six ARM platforms and the TI C6x architecture
that have all reached the end of their life upstream, with no known
users remaining:
- efm32 -- added in 2011, first Cortex-M, no notable changes after 2013
- picoxcell -- added in 2011, abandoned after 2012 acquisition
- prima2 -- added in 20111, no notable changes since 2015
- tango -- added in 2015, sporadic changes until 2017, but abandoned
- u300 -- added in 2009, no notable changes since 2013
- zx --added in 2015 for both 32, 2017 for 64 bit, no notable changes
- arch/c6x -- added in 2011, but work stalled soon after that
A number of other platforms on the original list turned out to still
have users. In some cases there are out-of-tree patches and users
that plan to contribute them in the future, in other cases the code
is complete and works reliably.
[1] https://lore.kernel.org/lkml/CAK8P3a2DZ8xQp7R=H=wewHnT2=a_=M53QsZOueMVEf7tOZLKNg@mail.gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmApiR8ACgkQYKtH/8kJ
Uifl7A//RZVyxUSlbD/StS6oEOmkZH8j0L7yeYOKkSHGZI+6Dqxo6rooKymbeflk
jJvDVQqLcrclT/7rWsKesdN8aW+ilfWrby5nDsWivsROrTw3DdvZgkjh7KYz7tA/
OxygKQu4W9I+ywJltR4ykTUxXohjU+duHPuZJawQk64xE3Q0MWxJlQQ2kHJYVJRu
/rWgNDQaI2d8HFhhEVsn4PC0RLWfUuBevKEuRYqZwM/oB/HuYjY+uTUGe2RhlgWb
sbcoD93JP2MghSypq33/UtEl4Uk7Wpdv2bshTTv8DL5ToltY7wD8qIIh+aSJk9hP
0FG3NTia7e9dqQQR2bskspGxP73iIuSN1exAbm/Ten5sysy6IsESmzqZRxXv+7Z1
q1Oyc4wYaotJPAxMOE00RMLiRa5domI8V6Y10I5uyOcmpRvwWK2WfCOE7D3WSQ5M
i1JiqLnC5JtJ0vyVBeRKo99zZImeXXrmS0n+fcARGtcKwAqKSvKxFcLTmkj3KqHv
L4Xgy5f83QrMZWmldX7IiwWjTar2geBM7pFgG/z3R6JqkaxWiDHxyok6j1WUCE7b
MViRZ8wT7JC5sIkHuwXZ4jvAXPqHq6J1rmJreU6N/jzmv/PTQoUnQ3C/MbDNhuv8
NDVSRgrPcd/T0BrBkzIWk3t+Oh6ikDgflWsWkqIRFG0vCNx+KdM=
=pf3b
-----END PGP SIGNATURE-----
Merge tag 'arm-platform-removal-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC platform removals from Arnd Bergmann:
"There are a lot of platforms that have not seen any interesting code
changes in the past five years or more.
I made a list and asked around which ones are no longer in use, and
received confirmation about six ARM platforms and the TI C6x
architecture that have all reached the end of their life upstream,
with no known users remaining:
- efm32 - added in 2011, first Cortex-M, no notable changes after 2013
- picoxcell - added in 2011, abandoned after 2012 acquisition
- prima2 - added in 20111, no notable changes since 2015
- tango - added in 2015, sporadic changes until 2017, but abandoned
- u300 - added in 2009, no notable changes since 2013
- zx - added in 2015 for both 32, 2017 for 64 bit, no notable changes
- arch/c6x - added in 2011, but work stalled soon after that
A number of other platforms on the original list turned out to still
have users. In some cases there are out-of-tree patches and users that
plan to contribute them in the future, in other cases the code is
complete and works reliably"
Link: https://lore.kernel.org/lkml/CAK8P3a2DZ8xQp7R=H=wewHnT2=a_=M53QsZOueMVEf7tOZLKNg@mail.gmail.com/
* tag 'arm-platform-removal-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
ARM: remove u300 platform
ARM: remove tango platform
ARM: remove zte zx platform
ARM: remove sirf prima2/atlas platforms
c6x: remove architecture
MAINTAINERS: Remove deleted platform efm32
ARM: drop efm32 platform
ARM: Remove PicoXcell platform support
ARM: dts: Remove PicoXcell platforms
sqe->flags are subset of req flags, so incorrectly copied may span into
in-kernel flags and wreck havoc, e.g. by setting REQ_F_INFLIGHT.
Fixes: 5be9ad1e42 ("io_uring: optimise io_init_req() flags setting")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is a short window where percpu_refs are already turned zero, but
we try to do resurrect(). Play nicer and wait for ->release() to happen
in this case and proceed as everything is ok. One downside for ctx refs
is that we can ignore signal_pending() on a rare occasion, but someone
else should check for it later if needed.
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_rsrc_ref_quiesce() is a generic resource function, though now it
was wired to allocate and initialise ref nodes with file-specific
callbacks/etc. Keep it sane by passing in as a parameters everything we
need for initialisations, otherwise it will hurt us badly one day.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After a rsrc/files reference node's refs are killed, it must never be
used. And that's how it works, it either assigns a new node or kills the
whole data table.
Let's explicitly NULL it, that shouldn't be necessary, but if something
would go wrong I'd rather catch a NULL dereference to using a dangling
pointer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the prep and prep async split, we now have potentially 3 helpers
that need to be defined for !CONFIG_NET. Add some helpers to do just
that.
Fixes the following compile error on !CONFIG_NET:
fs/io_uring.c:6171:10: error: implicit declaration of function
'io_sendmsg_prep_async'; did you mean 'io_req_prep_async'?
[-Werror=implicit-function-declaration]
return io_sendmsg_prep_async(req);
^~~~~~~~~~~~~~~~~~~~~
io_req_prep_async
Fixes: 93642ef884 ("io_uring: split sqe-prep and async setup")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In case of failure io_wq_submit_work() needs to post an CQE and so
potentially take uring_lock. The safest way to deal with it is to do
that from under task_work where we can safely take the lock.
Also, as io_iopoll_check() holds the lock tight and releases it
reluctantly, it will play nicer in the furuter with notifying an
iopolling task about new such pending failed requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
After switching to non-RCU mode, we want nd->depth to match the number
of entries in nd->stack[] that need eventual path_put().
legitimize_links() takes care of that on failures; unfortunately,
failure exits added for LOOKUP_CACHED do not.
We could add the logics for that into those failure exits, both in
try_to_unlazy() and in try_to_unlazy_next(), but since both checks
are immediately followed by legitimize_links() and there's no calls
of legitimize_links() other than those two... It's easier to
move the check (and required handling of nd->depth on failure) into
legitimize_links() itself.
[caught by Jens: ... and since we are zeroing ->depth here, we need
to do drop_links() first]
Fixes: 6c6ec2b0a3 "fs: add support for LOOKUP_CACHED"
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix inconsistent IS_ERR and PTR_ERR in cifs_find_swn_reg(). The proper
pointer to be passed as argument to PTR_ERR() is share_name.
This bug was detected with the help of Coccinelle.
Fixes: bf80e5d425 ("cifs: Send witness register and unregister commands to userspace daemon")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
Both pstore_compress() and decompress_record() use a mistyped config
option name ("PSTORE_COMPRESSION" instead of "PSTORE_COMPRESS"). As
a result compression and decompression of pstore records was always
disabled.
Use the correct config option name.
Signed-off-by: Jiri Bohac <jbohac@suse.cz>
Fixes: fd49e03280 ("pstore: Fix linking when crypto API disabled")
Acked-by: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210218111547.johvp5klpv3xrpnn@dwarf.suse.cz
Instead of marking a link with REQ_F_FAIL_LINK on an error and delaying
its failing to the caller, do it eagerly right when after getting an
error in io_submit_sqe(). This renders FAIL_LINK checks in
io_queue_link_head() useless and we can skip it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now, as we can do async setup without holding an SQE, we can skip doing
io_req_defer_prep() for link heads, it will be tried to be executed
inline and follows all the rules of the non-linked requests.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now as preparations are split from async setup, we can do the first one
pretty early not spilling it across multiple call sites. And after it's
done SQE is not needed anymore and we can save on passing it deeply into
the submission stack.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two kinds of opcode-specific preparations we do. The first is
just initialising req with what is always needed for an opcode and
reading all non-generic SQE fields. And the second is copying some of
the stuff like iovec preparing to punt a request to somewhere async,
e.g. to io-wq or for draining. For requests that have tried an inline
execution but still needing to be punted, the second prep type is done
by the opcode handler itself.
Currently, we don't explicitly split those preparation steps, but
combining both of them into io_*_prep(), altering the behaviour by
allocating ->async_data. That's pretty messy and hard to follow and also
gets in the way of some optimisations.
Split the steps, leave the first type as where it is now, and put the
second into a new io_req_prep_async() helper. It may make us to do opcode
switch twice, but it's worth it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If we get an error in io_init_req() for a request that would have been
linked, we break the submission but still issue a partially composed
link, that's nasty, fail it instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move struct io_submit_link into submit_state, which is a part of a
submission state and so belongs to it. It saves us from explicitly
passing it, and init/deinit is now nicely hidden in
io_submit_state_[start,end].
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Behaves identically, just move io_init_req() call into the beginning of
io_submit_sqes(). That looks better unloads io_submit_sqes().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A preparation patch, symbol to symbol move io_init_req() +
io_check_restriction() a bit up. The submission path is pretty settled
down, so don't worry about backports and move the functions instead of
relying on forward declarations in the future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
IORING_OP_SYNC_FILE_RANGE is marked as .needs_file, so the common path
will take care of assigning and validating req->file, no need to
duplicate it in io_sfr_prep().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Follow io_*_prep() naming pattern, there are only fsync and sfr that
don't do that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@i and @submitted are very much coupled together, and there is no need
to keep them both. Remove @i, it doesn't change generated binary but
helps to keep a single source of truth.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Some subsystems want to add debugfs files at early boot, way before
debugfs is initialized. This seems to work somehow as the vfs layer
will not allow it to happen, but let's be explicit and test to ensure we
are properly up and running before allowing files to be created.
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: stable <stable@vger.kernel.org>
Reported-by: Michael Walle <michael@walle.cc>
Reported-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210218100818.3622317-2-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
debugfs_lookup() doesn't like it if it is passed an illegal name
pointer, or if the filesystem isn't even initialized yet. If either of
these happen, it will crash the system, so fix it up by properly testing
for valid input and that we are up and running before trying to find a
file in the filesystem.
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: stable <stable@vger.kernel.org>
Reported-by: Michael Walle <michael@walle.cc>
Tested-by: Michael Walle <michael@walle.cc>
Tested-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210218100818.3622317-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Per ZBC/ZAC/ZNS specifications, write pointers may not have valid values
when zones are in full condition. However, when zonefs mounts a zoned
block device, zonefs refers write pointers to set file size even when
the zones are in full condition. This results in wrong file size. To fix
this, refer maximum file size in place of write pointers for zones in
full condition.
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: 8dcc1a9d90 ("fs: New zonefs file system")
Cc: <stable@vger.kernel.org> # 5.6+
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Don't forget to free iovec read inline completion and bunch of other
cases that do "goto done" before setting up an async context.
Fixes: 5ea5dd4584 ("io_uring: inline io_read()'s iovec freeing")
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This patch takes advantage of the new glock holder sharing feature for
resource groups. We have already introduced local resource group
locking in a previous patch, so competing accesses of local processes
are already under control.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Introduce a new LM_FLAG_NODE_SCOPE glock holder flag: when taking a
glock in LM_ST_EXCLUSIVE (EX) mode and with the LM_FLAG_NODE_SCOPE flag
set, the exclusive lock is shared among all local processes who are
holding the glock in EX mode and have the LM_FLAG_NODE_SCOPE flag set.
From the point of view of other nodes, the lock is still held
exclusively.
A future patch will start using this flag to improve performance with
rgrp sharing.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Prepare for treating resource group glocks as exclusive among nodes but
shared among all tasks running on a node: introduce another layer of
node-specific locking that the local tasks can use to coordinate their
accesses.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Add a rs_reserved field to struct gfs2_blkreserv to keep track of the number of
blocks reserved by this particular reservation, and a rd_reserved field to
struct gfs2_rgrpd to keep track of the total number of reserved blocks in the
resource group. Those blocks are exclusively reserved, as opposed to the
rs_requested / rd_requested blocks which are tracked in the reservation tree
(rd_rstree) and which can be stolen if necessary.
When making a reservation with gfs2_inplace_reserve, rs_reserved is set to
somewhere between ap->min_target and ap->target depending on the number of free
blocks in the resource group. When allocating blocks with gfs2_alloc_blocks,
rs_reserved is decremented accordingly. Eventually, any reserved but not
consumed blocks are returned to the resource group by gfs2_inplace_release.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
We keep track of what we've so far been referring to as reservations in
rd_rstree: the nodes in that tree indicate where in a resource group we'd
like to allocate the next couple of blocks for a particular inode. Local
processes take those as hints, but they may still "steal" blocks from those
extents, so when actually allocating a block, we must double check in the
bitmap whether that block is actually still free. Likewise, other cluster
nodes may "steal" such blocks as well.
One of the following patches introduces resource group glock sharing, i.e.,
sharing of an exclusively locked resource group glock among local processes to
speed up allocations. To make that work, we'll need to keep track of how many
blocks we've actually reserved for each inode, so we end up with two different
kinds of reservations.
Distinguish these two kinds by referring to blocks which are reserved but may
still be "stolen" as "requested". This rename also makes it more obvious that
rs_requested and rd_requested are strongly related.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
In gfs2_release, check if the inode has an active reservation to avoid
unnecessary lock taking.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
If gfs2_inplace_reserve has chosen a resource group but it couldn't make a
reservation there, there are too many other reservations in that resource
group. In that case, don't even try to respect existing reservations in
gfs2_alloc_blocks.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Only pass the current reservation down to gfs2_rbm_find rather than the entire
inode; we don't need any of the other information.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Pass a non-NULL minext to gfs2_rbm_find even for single-block allocations. In
gfs2_rbm_find, also set rgd->rd_extfail_pt when a single-block allocation
fails in a resource group: there is no reason for treating that case
differently. In gfs2_reservation_check_and_update, only check how many free
blocks we have if more than one block is requested; we already know there's at
least one free block.
In addition, when allocating N blocks fails in gfs2_rbm_find, we need to set
rd_extfail_pt to N - 1 rather than N: rd_extfail_pt defines the biggest
allocation that might still succeed.
Finally, reset rd_extfail_pt when updating the resource group statistics in
update_rgrp_lvb, as we already do in gfs2_rgrp_bh_get.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Introduced a new field conn_id in TCP_Server_Info structure.
This is a non-persistent unique identifier maintained by the client
for a connection to a file server. For this, a global counter named
tcpSesNextId is maintained. On allocating a new TCP_Server_Info,
this counter is incremented and assigned.
Changed the dynamic tracepoints related to reconnects and
crediting to be more informative (with conn_id printed).
Debugging a crediting issue helped me understand the
important things to print here.
Always call dynamic tracepoints outside the scope of spinlocks.
To do this, copy out the credits and in_flight fields of the
server struct before dropping the lock.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
For failure by timeout waiting for credits, changed the error
returned to the app with EBUSY, instead of ENOTSUPP. This is done
because this situation is possible even in non-buggy cases. i.e.
overloaded server can return 0 credits until done with outstanding
requests. And this feels like a better error to return to the app.
For cases of zero credits found even when there are no requests
in flight, replaced ENOTSUPP with EDEADLK, since we're avoiding
deadlock here by returning error.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We used to share the CIFS_NEG_OP flag between negotiate and
session authentication. There was an assumption in the code that
CIFS_NEG_OP is used by negotiate only. So introcuded CIFS_SESS_OP
and used it for session setup optypes.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
We need to wait for outstanding writes on the page to complete before we
can update it.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Support eager writing to the server, meaning that we write the data to
cache on the server, and wait for that to complete. This ensures that we
see ENOSPC errors immediately.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
We add task_work from any context, hence we need to ensure that we can
tolerate it being from IRQ context as well.
Fixes: 7cbf1722d5 ("io_uring: provide FIFO ordering for task_work")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the Fb cap is used it means the current inode is flushing the
dirty data to OSD, just defer flushing the capsnap.
URL: https://tracker.ceph.com/issues/48640
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Testing with the fscache overhaul has triggered some lockdep warnings
about circular lock dependencies involving page_mkwrite and the
mmap_lock. It'd be better to do the "real work" without the mmap lock
being held.
Change the skip_checking_caps parameter in __ceph_put_cap_refs to an
enum, and use that to determine whether to queue check_caps, do it
synchronously or not at all. Change ceph_page_mkwrite to do a
ceph_put_cap_refs_async().
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add a generic function for taking an inode reference, setting the I_WORK
bit and queueing i_work. Turn the ceph_queue_* functions into static
inline wrappers that pass in the right bit.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
A primary reason for skipping ceph_check_caps after putting the
references was to avoid the locking in ceph_check_caps during a
reconnect. __ceph_put_cap_refs can still call ceph_flush_snaps in that
case though, and that takes many of the same inconvenient locks.
Fix the logic in __ceph_put_cap_refs to skip flushing snaps when the
skip_checking_caps flag is set.
Fixes: e64f44a884 ("ceph: skip checking caps when session reconnecting and releasing reqs")
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Userspace has discovered the functionality offered by SYS_kcmp and has
started to depend upon it. In particular, Mesa uses SYS_kcmp for
os_same_file_description() in order to identify when two fd (e.g. device
or dmabuf) point to the same struct file. Since they depend on it for
core functionality, lift SYS_kcmp out of the non-default
CONFIG_CHECKPOINT_RESTORE into the selectable syscall category.
Rasmus Villemoes also pointed out that systemd uses SYS_kcmp to
deduplicate the per-service file descriptor store.
Note that some distributions such as Ubuntu are already enabling
CHECKPOINT_RESTORE in their configs and so, by extension, SYS_kcmp.
References: https://gitlab.freedesktop.org/drm/intel/-/issues/3046
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Will Drewry <wad@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Lucas Stach <l.stach@pengutronix.de>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: stable@vger.kernel.org
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> # DRM depends on kcmp
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> # systemd uses kcmp
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210205220012.1983-1-chris@chris-wilson.co.uk
Add tracepoints for file I/O operations to aid in debugging of I/O errors
with zonefs.
The added tracepoints are in:
- zonefs_zone_mgmt() for tracing zone management operations
- zonefs_iomap_begin() for tracing regular file I/O
- zonefs_file_dio_append() for tracing zone-append operations
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
If this is attempted by an io-wq kthread, then return -EOPNOTSUPP as we
don't currently support that. Once we can get task_pid_ptr() doing the
right thing, then this can go away again.
Use PF_IO_WORKER for this to speciically target the io_uring workers.
Modify the /proc/self/ check to use PF_IO_WORKER as well.
Cc: stable@vger.kernel.org
Fixes: 8d4c3e76e3 ("proc: don't allow async path resolution of /proc/self components")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
It can be useful to the interpreter to know which flags are in use.
For instance, knowing if the preserve-argv[0] is in use would
allow to skip the pathname argument.
This patch uses an unused auxiliary vector, AT_FLAGS, to add a
flag to inform interpreter if the preserve-argv[0] is enabled.
Note by Helge Deller:
The real-world user of this patch is qemu-user, which needs to know
if it has to preserve the argv[0]. See Debian bug #970460.
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
Reviewed-by: YunQiang Su <ysu@wavecomp.com>
URL: http://bugs.debian.org/970460
Signed-off-by: Helge Deller <deller@gmx.de>
SMB3.1.1 is the newest, and preferred dialect, and is included in
the requested dialect list by default (ie if no vers= is specified
on mount) but it should also be requested if SMB3 or later is requested
(vers=3 instead of a specific dialect: vers=2.1, vers=3.02 or vers=3.0).
Currently specifying "vers=3" only requests smb3.0 and smb3.02 but this
patch fixes it to also request smb3.1.1 dialect, as it is the newest
and most secure dialect and is a "version 3 or later" dialect (the intent
of "vers=3").
Signed-off-by: Steve French <stfrench@microsoft.com>
Suggested-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
These pernet operations may depend on stuff set up or torn down in the
module init/exit functions. And they may be called at any time in
between. So it makes more sense for them to be the last to be
registered in the init function, and the first to be unregistered in the
exit function.
In particular, without this, the drc slab is being destroyed before all
the per-net drcs are shut down, resulting in an "Objects remaining in
nfsd_drc on __kmem_cache_shutdown()" warning in exit_nfsd.
Reported-by: Zhi Li <yieli@redhat.com>
Fixes: 3ba75830ce "nfsd4: drc containerization"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fix to return PTR_ERR() error code from the error handling case instead
fo 0 in function alloc_wbufs(), as done elsewhere in this function.
Fixes: 6a98bc4614 ("ubifs: Add authentication nodes to journal")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAmlkAACgkQxWXV+ddt
WDuwNxAAiBAhEwPllzyU86p4RMMip5pa24zu11HkTya65yGk6EFuj4zTlx/L5Fn6
JOjxwlPqaTItER1PYJ5HRdIy1Y2E4eWEiDLolvmvDCPZrfKRKhBU1MZbgXwDbp+Z
pwaJGIm5ZaXDGyuFge3bKA48BERfqxRBO3qIOZ0tzgsUFLlZ2d9EdDc99093/J6k
QzIijXQjFnvnB2MNawN1b/KQ63xqXLo2hemKcKIFCxJHm9eaet/qwGHl5iuR5ScY
bOGCWvLSkCXceartDur3msOZXur09YLyfeYmE9dj1FN3aNu97sW8VivWRrs3aglK
if51iYrrjKSnDr4SOK28S5UYdgeStb/qWWtosdcMsQVBo0t7iCnGT2psGaQCkdfG
FChqbs2uXlbJrojlelV6xbaU3S2D2MtSz5mF+I2G5MpQbj1jkhYE9ZTUQeibcd7o
l+edn/VJvVK4X0NAX8pIWJ4nFY1HqUTyfn28IQ7ymBhyyUloIoazvSkBuSWy6iy0
9aPpohOKjCw8Y3MbgcIfIEJhdK+aIKF8ZPh52+zcXQzf1OtSryVarLHsNXWm9vJ8
tHsRHCzrbLFdAXZccT6YlerzPs4+PVf44UknDbFCg7sLcG04NIGGrMXOtTHwgEZL
BEywTjAMlMDjrEXouxYAPNPnEg/NlvQGZYRvBnxrtZE4G2fxJ7o=
=7w6G
-----END PGP SIGNATURE-----
Merge tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"A regression fix caused by a refactoring in 5.11.
A corrupted superblock wouldn't be detected by checksum verification
due to wrongly placed initialization of the checksum length, thus
making memcmp always work"
* tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: initialize fs_info::csum_size earlier in open_ctree
s390 and alpha are the only 64 bit architectures with a 32-bit ino_t.
Since this is quite unusual this causes bugs from time to time.
See e.g. commit ebce3eb2f7 ("ceph: fix inode number handling on
arches with 32-bit ino_t") for an example.
This (obviously) also prevents s390 and alpha to use 64-bit ino_t for
tmpfs. See commit b85a7a8bb5 ("tmpfs: disallow CONFIG_TMPFS_INODE64
on s390").
Therefore switch both s390 and alpha to 64-bit ino_t. This should only
have an effect on the ustat system call. To prevent ABI breakage
define struct ustat compatible to the old layout and change
sys_ustat() accordingly.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Be nice and prune these upfront, in case the ring is being shared and
one of the tasks is going away. This is a bit more important now that
we account the allocations.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have three different ones, put it in a helper for easy calling. This
is in preparation for doing it outside of ring freeing as well. With
that in mind, also ensure that we do the proper locking for safe calling
from a context where the ring it still live.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
No changes in this patch, just allows a caller to pass in a targeted
task that we must match for freeing requests in the cache.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmAm8mYACgkQiiy9cAdy
T1FDMgwArdtgQtRL9hSOINBl19/OM9GSLszUtI/EctfZGgnGoNysIq4/pIvv7uqE
egVohlVwJI4niGguU7AABj5vrthLsAbmzKi+e2N6kAcYtzpXeLvjkXVfyq/Bld66
oe7sYWMjH1TFEc64gejW7nYcxOsg93HQtAvNyyoAS3SlOOWsl2LI/AQiw3BXXVBo
Jb1AkfpdBGglUW7esYVZUyVwCI/ZzYVEA0YTCpaGX+EIfdCWXm3ArNPP7E9gHOLb
3HUbNP6W5QKwgYL1tPX3s7AFEtj0+PxuREgB6mSTFOkWRRfZJUTma1AEfa9MUWGA
KOnQKiIzIsmaOQGP/BumcrPr/7kgeqYEFZ2exNT8kVw6ETEHP1+A4j61KZI0mduz
rgnQx21gPzVcDo0tfO8SjGSt3vzuRA+vkyZO4eB/nmTqJ4YVqX+E6E4vophKOUk2
ELqk0fUlX3uspqocZCor5nrLA0EadNV6P/LfFiRyGUTt+tOcOeYmyAdK7dWY1JTf
wsd20mCY
=vJKS
-----END PGP SIGNATURE-----
Merge tag '5.11-rc7-smb3-github' of git://github.com/smfrench/smb3-kernel
Pull cifs fixes from Steve French:
"Four small smb3 fixes to the new mount API (including a particularly
important one for DFS links).
These were found in testing this week of additional DFS scenarios, and
a user testing of an apache container problem"
* tag '5.11-rc7-smb3-github' of git://github.com/smfrench/smb3-kernel:
cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath.
cifs: In the new mount api we get the full devname as source=
cifs: do not disable noperm if multiuser mount option is not provided
cifs: fix dfs-links
Let's allow mounting readonly partition. We're able to recovery later once we
have it as read-write back.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
By default, kernel threads have init_fs and init_files assigned. In the
past, this has triggered security problems, as commands that don't ask
for (and hence don't get assigned) fs/files from the originating task
can then attempt path resolution etc with access to parts of the system
they should not be able to.
Rather than add checks in the fs code for misuse, just set these to
NULL. If we do attempt to use them, then the resulting code will oops
rather than provide access to something that it should not permit.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
An inode is allowed to have ubifs_xattr_max_cnt() xattrs, so we must
complain only when an inode has more xattrs, having exactly
ubifs_xattr_max_cnt() xattrs is fine.
With this the maximum number of xattrs can be created without hitting
the "has too many xattrs" warning when removing it.
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Signed-off-by: Richard Weinberger <richard@nod.at>
An earlier commit moved out some functions to not be inlined by gcc, but
after some other rework to remove one of those, clang started inlining
the other one and ran into the same problem as gcc did before:
fs/ubifs/replay.c:1174:5: error: stack frame size of 1152 bytes in function 'ubifs_replay_journal' [-Werror,-Wframe-larger-than=]
Mark the function as noinline_for_stack to ensure it doesn't happen
again.
Fixes: f80df38512 ("ubifs: use crypto_shash_tfm_digest()")
Fixes: eb66eff663 ("ubifs: replay: Fix high stack usage")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
When crypto_shash_digestsize() fails, c->hmac_tfm
has not been freed before returning, which leads
to memleak.
Fixes: 49525e5eec ("ubifs: Add helper functions for authentication support")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
clang static analysis reports this problem
fs/jffs2/summary.c:794:31: warning: Use of memory after it is freed
c->summary->sum_list_head = temp->u.next;
^~~~~~~~~~~~
In jffs2_sum_write_data(), in a loop summary data is handles a node at
a time. When it has written out the node it is removed the summary list,
and the node is deleted. In the corner case when a
JFFS2_FEATURE_RWCOMPAT_COPY is seen, a call is made to
jffs2_sum_disable_collecting(). jffs2_sum_disable_collecting() deletes
the whole list which conflicts with the loop's deleting the list by parts.
To preserve the old behavior of stopping the write midway, bail out of
the loop after disabling summary collection.
Fixes: 6171586a7a ("[JFFS2] Correct handling of JFFS2_FEATURE_RWCOMPAT_COPY nodes.")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
This collects all of them together and makes it possible to
e.g. exclude it from slub debugging.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAmi4wQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplrtD/9BTWebvU17q/9grBrZsIyfCB9UOoPzOSwd
4dY1trOnFk91vAVM2I8RLkUSYS6UHPSEE3Xa+jVkNOBWy0W/8RJtQpqQTVjoY8zn
jMrvPHTyW5tc/rniG5FNsCilyDEOBZY4pIpJqwXpj/Ez14D+mb+A3nTCTdwat4sU
zdSVLcGJU2O7RJx1yLiJLvqct1dbZI8axSd/gEOCVhxKgP6UVoYfjkjYm9hCU4et
y7OzRfvPTb0N03EVoA6XeVke7sDK9cJLySbcwiGczCmPkKEJmFOyJP2xWSlE5Z91
UMYeg4pOSg3tHYvPFuUShjzaYJTKAvzObHomyPjRCve+847AcqPpdoHaYQASUXvF
ORs8vXkgjyd9lyrBa+8oqWYvXYj/3M05qPO2LhfSyDbwzEzBAmaAyf0JIr9mvem+
7mgJ6R7uTCqPt0FXzxIfNnWSq/Rtiyuw+DP/y2sgYMDRjg70hyFhhud9K67hMplP
wc1UAp9vD3PalTQG3fHIJycWIXd6A/RxBM+KbXdIyi6aqd6iHgf2Plz5CI2Orz7W
sMPlG2IYfwwKDyNf9LE+sXmrDbfM3wdSQGr3BXmMXBRNWicxD6P4IM8FbsVemrt1
QZ77zt7xPtGY3CMabYPAycbVxdf52TofcTvp3Z5gEWUEaQw1j681wsBF/V3Btnk/
704EKKZraw==
=y7ZL
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.11-2021-02-12' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Revert of a patch from this release that caused a regression"
* tag 'io_uring-5.11-2021-02-12' of git://git.kernel.dk/linux-block:
Revert "io_uring: don't take fs for recvmsg/sendmsg"
Invalid req->flags are tolerated by free/put well, avoid this dancing
needlessly presetting it to zero, and then not even resetting but
modifying it, i.e. "|=".
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Indirectly io_req_find_next() is called for every request, optimise the
check by testing flags as it was long before -- __io_req_find_next()
tolerates false-positives well (i.e. link==NULL), and those should be
really rare.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_sq_thread_acquire_mm_files() can find a PF_EXITING task only when
it's called from task_work context. Don't check it in all other cases,
that are when we're in io_uring_enter().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
- Make the nVHE EL2 object relocatable, resulting in much more
maintainable code
- Handle concurrent translation faults hitting the same page
in a more elegant way
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Allow the disabling of symbol export from assembly code
- Simplification of the early init hypercall handling
-----BEGIN PGP SIGNATURE-----
iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmAmjqEPHG1hekBrZXJu
ZWwub3JnAAoJECPQ0LrRPXpDoUEQAIrJ7YF4v4gz06a0HG9+b6fbmykHyxlG7jfm
trvctfaiKzOybKoY5odPpNFzhbYOOdXXqYipyTHGwBYtGSy9G/9SjMKSUrfln2Ni
lr1wBqapr9TE+SVKoR8pWWuZxGGbHVa7brNuMbMsMi1wwAsM2/n70H9PXrdq3QiK
Ge1DWLso2oEfhtTwqNKa4dwB2MHjBhBFhhq+Nq5pslm6mmxJaYqz7pyBmw/C+2cc
oU/6kpAa1yPAauptWXtYXJYOMHihxgEa1IdK3Gl0hUyFyu96xVkwH/KFsj+bRs23
QGGCSdy4313hzaoGaSOTK22R98Aeg0wI9a6tcCBvVVjTAztnlu1FPtUZr8e/F7uc
+r8xVJUJFiywt3Zktf/D7YDK9LuMMqFnj0BkI4U9nIBY59XZRNhENsBCmjru5lnL
iXa5cuta03H4emfssIChLpgn0XHFas6t5dFXBPGbXyw0qsQchTw98iQX9LVxefUK
rOUGPIN4nE9ESRIZe0SPlAVeCtNP8cLH7+0YG9MJ1QeDVYaUsnvy9Ln/ox+514mR
5y2KJ6y7xnLB136SKCzPDDloYtz7BDiJq6a/RPiXKGheKoxy+N+BSe58yWCqFZYE
Fx/cGUr7oSg39U7gCboog6BDp5e2CXBfbRllg6P47bZFfdPNwzNEzHvk49VltMxx
Rl2W05bk
=6EwV
-----END PGP SIGNATURE-----
Merge tag 'kvmarm-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 updates for Linux 5.12
- Make the nVHE EL2 object relocatable, resulting in much more
maintainable code
- Handle concurrent translation faults hitting the same page
in a more elegant way
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Allow the disabling of symbol export from assembly code
- Simplification of the early init hypercall handling
User reported that btrfs-progs misc-tests/028-superblock-recover fails:
[TEST/misc] 028-superblock-recover
unexpected success: mounted fs with corrupted superblock
test failed for case 028-superblock-recover
The test case expects that a broken image with bad superblock will be
rejected to be mounted. However, the test image just passed csum check
of superblock and was successfully mounted.
Commit 55fc29bed8 ("btrfs: use cached value of fs_info::csum_size
everywhere") replaces all calls to btrfs_super_csum_size by
fs_info::csum_size. The calls include the place where fs_info->csum_size
is not initialized. So btrfs_check_super_csum() passes because memcmp()
with len 0 always returns 0.
Fix it by caching csum size in btrfs_fs_info::csum_size once we know the
csum type in superblock is valid in open_ctree().
Link: https://github.com/kdave/btrfs-progs/issues/250
Fixes: 55fc29bed8 ("btrfs: use cached value of fs_info::csum_size everywhere")
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove io_consume_sqe() and inline it back into io_get_sqe(). It
requires req dealloc on error, but in exchange we get cleaner
io_submit_sqes() and better locality for cached_sq_head.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Do a little trick in io_ring_ctx_free() briefly taking uring_lock, that
will wait for everyone currently holding it, so we can skip pinning ctx
with ctx->refs for __io_req_task_submit(), which is executed and loses
its refs/reqs while holding the lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't hand code io_req_task_queue() inside of io_async_buf_func(), just
call it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are two reasons for this. First is to optimise
io_sq_thread_acquire_mm_files() for non-SQPOLL case, which currently do
too many checks and function calls in the hot path, e.g. in
io_init_req().
The second is to not grab mm/files when there are not needed. As
__io_queue_sqe() issues only one request now, we can reuse
io_sq_thread_acquire_mm_files() instead of unconditional acquire
mm/files.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_queue_sqe() tries to issue as much requests of a link as it can,
and uses io_put_req_find_next() to extract a next one, targeting inline
completed requests. As now __io_queue_sqe() is always used together with
struct io_comp_state, it leaves next propagation only a small window and
only for async reqs, that doesn't justify its existence.
Remove it, make __io_queue_sqe() to issue only a head request. It
simplifies the code and will allow other optimisations.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Completion and submission states are now coupled together, it's weird to
get one from argument and another from ctx, do it consistently for
io_req_free_batch(). It's also faster as we already have @state cached
in registers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As of [1], we no longer want EXT4_KUNIT_TESTS and others to `select`
their deps. This means it can get harder to get all the right things
selected as we gain more tests w/ more deps over time.
This patch (and [2]) proposes we store kunitconfig fragments in-tree to
represent sets of tests. (N.B. right now we only have one ext4 test).
There's still a discussion to be had about how to have a hierarchy of
these files (e.g. if one wanted to test all of fs/, not just fs/ext4).
But this fragment would likely be a leaf node and isn't blocked on
deciding if we want `import` statements and the like.
Usage
=====
Before [2] (on its way to being merged):
$ cp fs/ext4/.kunitconfig .kunit/
$ ./tools/testing/kunit/kunit.py run
After [2]:
$ ./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig
".kunitconfig" vs "kunitconfig"
===============================
See also: commit 14ee5cfd45 ("kunit: Rename 'kunitconfig' to '.kunitconfig'").
* The bit about .gitignore exluding it by default is now a con, however.
* But there are a lot of directories with files that begin with "k" and
so this could cause some annoyance w/ tab completion*
* This is the name kunit.py expects right now, so some people are used
to .kunitconfig over "kunitconfig"
[1] https://lore.kernel.org/linux-ext4/20210122110234.2825685-1-geert@linux-m68k.org/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git/commit/?h=kunit&id=243180f5924ed27ea417db39feb7f9691777688e
* 372/5556 directories isn't too much, but still not a small number:
$ find -type f -name 'k*' | xargs dirname | sort -u | wc -l
372
Signed-off-by: Daniel Latypov <dlatypov@google.com>
Link: https://lore.kernel.org/r/20210210013206.136227-1-dlatypov@google.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
user may not want to enable. Fix this by making the test depend on
EXT4_FS instead.
Fixes: 1cbeab1b24 ("ext4: add kunit test for decoding extended timestamps")
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://lore.kernel.org/r/20210122110234.2825685-1-geert@linux-m68k.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
__io_complete_rw() casts request to kiocb for it to be immediately
container_of()'ed by io_complete_rw_common(). And the last function's name
doesn't do a great job of illuminating its purposes, so just inline it in
its only user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We pass return code into io_rw_reissue() only to be able to check if it's
-EAGAIN. That's not the cleanest approach and may prevent inlining of the
non-EAGAIN fast path, so do it at call sites.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't stash -EAGAIN'ed iopoll requests into a list to reissue it later,
do it eagerly. It removes overhead on keeping and checking that list,
and allows in case of failure for these requests to be completed through
normal iopoll completion path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_req_free_batch_finish() is final and does not permit struct req_batch
to be reused without re-init. To be more consistent don't clear ->task
there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We recently added the submit side req cache, but it was placed at the
end of the struct. Move it near the other submission state for better
memory placement, and reshuffle a few other members at the same time.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The left shift of int 32 bit integer constant 1 is evaluated using 32 bit
arithmetic and then assigned to a signed 64 bit integer. In the case where
l2nb is 32 or more this can lead to an overflow. Avoid this by shifting
the value 1LL instead.
Addresses-Coverity: ("Uninitentional integer overflow")
Fixes: b40c2e665c ("fs/jfs: TRIM support for JFS Filesystem")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
While debugging another issue today, Steve and I noticed that if a
subdir for a file share is already mounted on the client, any new
mount of any other subdir (or the file share root) of the same share
results in sharing the cifs superblock, which e.g. can result in
incorrect device name.
While setting prefix path for the root of a cifs_sb,
CIFS_MOUNT_USE_PREFIX_PATH flag should also be set.
Without it, prepath is not even considered in some places,
and output of "mount" and various /proc/<>/*mount* related
options can be missing part of the device name.
Signed-off-by: Shyam Prasad N <sprasad@microsoft.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
so we no longer need to handle or parse the UNC= and prefixpath=
options that mount.cifs are generating.
This also fixes a bug in the mount command option where the devname
would be truncated into just //server/share because we were looking
at the truncated UNC value and not the full path.
I.e. in the mount command output the devive //server/share/path
would show up as just //server/share
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Shyam Prasad N <nspmangalore@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
The assert in xfs_btree_del_cursor() checks that the bmapbt block
allocation field has been handled correctly before the cursor is
freed. This field is used for accurate calculation of indirect block
reservation requirements (for delayed allocations), for example.
generic/019 reproduces a scenario where this assert fails because
the filesystem has shutdown while in the middle of a bmbt record
insertion. This occurs after a bmbt block has been allocated via the
cursor but before the higher level bmap function (i.e.
xfs_bmap_add_extent_hole_real()) completes and resets the field.
Update the assert to accommodate the transient state if the
filesystem has shutdown. While here, clean up the indentation and
comments in the function.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
We use the assigned slot in io_sqe_file_register(), and a previous
patch moved the assignment to after we have called it. This isn't
super pretty, and will get cleaned up in the future. For now, fix
the regression by restoring the previous assignment/clear of the
file_slot.
Fixes: ea64ec02b3 ("io_uring: deduplicate file table slot calculation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently, although set_bit() & test_bit() pairs are used as a fast-
path for initialized configurations. However, these atomic ops are
actually relaxed forms. Instead, load-acquire & store-release form is
needed to make sure uninitialized fields won't be observed in advance
here (yet no such corresponding bitops so use full barriers instead.)
Link: https://lore.kernel.org/r/20210209130618.15838-1-hsiangkao@aol.com
Fixes: 62dc45979f ("staging: erofs: fix race of initializing xattrs of a inode at the same time")
Fixes: 152a333a58 ("staging: erofs: add compacted compression indexes support")
Cc: <stable@vger.kernel.org> # 5.3+
Reported-by: Huang Jianan <huangjianan@oppo.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
fs/xfs/xfs_log.c:1062:9-10: WARNING: return of 0/1 in function 'xfs_log_need_covered' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
Fixes: 37444fc4cc ("xfs: lift writable fs check up into log worker task")
CC: Brian Foster <bfoster@redhat.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
XFS triggers an iomap warning in the write fault path due to a
!PageUptodate() page if a write fault happens to occur on a page
that recently failed writeback. The iomap writeback error handling
code can clear the Uptodate flag if no portion of the page is
submitted for I/O. This is reproduced by fstest generic/019, which
combines various forms of I/O with simulated disk failures that
inevitably lead to filesystem shutdown (which then unconditionally
fails page writeback).
This is a regression introduced by commit f150b42343 ("xfs: split
the iomap ops for buffered vs direct writes") due to the removal of
a shutdown check and explicit error return in the ->iomap_begin()
path used by the write fault path. The explicit error return
historically translated to a SIGBUS, but now carries on with iomap
processing where it complains about the unexpected state. Restore
the shutdown check to xfs_buffered_write_iomap_begin() to restore
historical behavior.
Fixes: f150b42343 ("xfs: split the iomap ops for buffered vs direct writes")
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
The variable ret is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.
Addresses-Coverity: ("Unused value")
Fixes: b63534c41e ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We park SQPOLL task before going into io_uring_cancel_files(), so the
task won't run task_works including those that might be important for
the cancellation passes. In this case it's io_poll_remove_one(), which
frees requests via io_put_req_deferred().
Unpark it for while waiting, it's ok as we disable submissions
beforehand, so no new requests will be generated.
INFO: task syz-executor893:8493 blocked for more than 143 seconds.
Call Trace:
context_switch kernel/sched/core.c:4327 [inline]
__schedule+0x90c/0x21a0 kernel/sched/core.c:5078
schedule+0xcf/0x270 kernel/sched/core.c:5157
io_uring_cancel_files fs/io_uring.c:8912 [inline]
io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
__io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
io_uring_files_cancel include/linux/io_uring.h:51 [inline]
do_exit+0x2fe/0x2ae0 kernel/exit.c:780
do_group_exit+0x125/0x310 kernel/exit.c:922
__do_sys_exit_group kernel/exit.c:933 [inline]
__se_sys_exit_group kernel/exit.c:931 [inline]
__x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This reverts commit 10cad2c40d.
Petr reports that with this commit in place, io_uring fails the chroot
test (CVE-202-29373). We do need to retain ->fs for send/recvmsg, so
revert this commit.
Reported-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since 5.10, splice() or sendfile() to NILFS2 return EINVAL. This was
caused by commit 36e2c7421f ("fs: don't allow splice read/write
without explicit ops").
This patch initializes the splice_write field in file_operations, like
most file systems do, to restore the functionality.
Link: https://lkml.kernel.org/r/1612784101-14353-1-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Joachim Henke <joachim.henke@t-systems.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> [5.10+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zoned block devices have different granularity constraints for write
operations into sequential zones. E.g. ZBC and ZAC devices require that
writes be aligned to the device physical block size while NVMe ZNS
devices allow logical block size aligned write operations. To correctly
handle such difference, use the device zone write granularity limit to
set the block size of a zonefs volume, thus allowing the smallest
possible write unit for all zoned device types.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@edc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of imposing rlimit memlock limits for the rings themselves,
ensure that we account them properly under memcg with __GFP_ACCOUNT.
We retain rlimit memlock for registered buffers, this is just for the
ring arrays themselves.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This is the last class of requests that cannot utilize the req alloc
cache. Add a per-ctx req cache that is protected by the completion_lock,
and refill our submit side cache when it gets over our batch count.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Awhile there are requests in the allocation cache -- use them, only if
those ended go for the stashed memory in comp.free_list. As list
manipulation are generally heavy and are not good for caches, flush them
all or as much as can in one go.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: return success/failure from io_flush_cached_reqs()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
__io_queue_sqe() is always called with a non-NULL comp_state, which is
taken directly from context. Don't pass it around but infer from ctx.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
task_work is run without utilizing the req alloc cache, so any deferred
items don't get to take advantage of either the alloc or free side of it.
With task_work now being wrapped by io_uring, we can use the ctx
completion state to both use the req cache and the completion flush
batching.
With this, the only request type that cannot take advantage of the req
cache is IRQ driven IO for regular files / block devices. Anything else,
including IOPOLL polled IO to those same tyes, will take advantage of it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
task_work is a LIFO list, due to how it's implemented as a lockless
list. For long chains of task_work, this can be problematic as the
first entry added is the last one processed. Similarly, we'd waste
a lot of CPU cycles reversing this list.
Wrap the task_work so we have a single task_work entry per task per
ctx, and use that to run it in the right order.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Now that we have the submit_state in the ring itself, we can have io_kiocb
allocations that are persistent across invocations. This reduces the time
spent doing slab allocations and frees.
[sil: rebased]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make io_req_free_batch(), which is used for inline executed requests and
IOPOLL, to return requests back into the allocation cache, so avoid
most of kmalloc()/kfree() for those cases.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Currently batch free handles request memory freeing and ctx ref putting
together. Separate them and use different counters, that will be needed
for reusing reqs memory.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove fallback_req for now, it gets in the way of other changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_submit_flush_completions() does completion batching, but may also use
free batching as iopoll does. The main beneficiaries should be buffered
reads/writes and send/recv.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reincarnation of an old patch that replaces a list in struct
io_compl_batch with an array. It's needed to avoid hooking requests via
their compl.list, because it won't be always available in the future.
It's also nice to split io_submit_flush_completions() to avoid free
under locks and remove unlock/lock with a long comment describing when
it can be done.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As now submit_state is retained across syscalls, we can save ourself
from initialising it from ground up for each io_submit_sqes(). Set some
fields during ctx allocation, and just keep them always consistent.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: remove unnecessary zeroing of ctx members]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
completion state is closely bound to ctx, we don't need to store ctx
inside as we always have it around to pass to flush.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
struct io_submit_state is quite big (168 bytes) and going to grow. It's
better to not keep it on stack as it is now. Move it to context, it's
always protected by uring_lock, so it's fine to have only one instance
of it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There is no reason to drag io_comp_state into opcode handlers, we just
need a flag and the actual work will be done in __io_queue_sqe().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
When starting an iomap write, gfs2_quota_lock_check -> gfs2_quota_lock
-> gfs2_quota_hold is called from gfs2_iomap_begin. At the end of the
write, before unlocking the quotas, punch_hole -> gfs2_quota_hold can be
called again in gfs2_iomap_end, which is incorrect and leads to a failed
assertion. Instead, move the call to gfs2_quota_unlock before the call
to punch_hole to fix that.
Fixes: 64bc06bb32 ("gfs2: iomap buffered write support")
Cc: stable@vger.kernel.org # v4.19+
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Fixes small regression in implementation of new mount API.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reported-by: Hyunchul Lee <hyc.lee@gmail.com>
Tested-by: Hyunchul Lee <hyc.lee@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Make opcode handler interfaces a bit more consistent by always passing
in issue flags. Bulky but pretty easy and mechanical change.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace bool force_nonblock with flags. It has a long standing goal of
differentiating context from which we execute. Currently we have some
subtle places where some invariants, like holding of uring_lock, are
subtly inferred.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As with s390, alpha is a 64-bit architecture with a 32-bit ino_t. With
CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers and
display "inode64" in the mount options, whereas passing "inode64" in the
mount options will fail. This leads to erroneous behaviours such as
this:
# mkdir mnt
# mount -t tmpfs nodev mnt
# mount -o remount,rw mnt
mount: /home/ubuntu/mnt: mount point not mounted or bad option.
Prevent CONFIG_TMPFS_INODE64 from being selected on alpha.
Link: https://lkml.kernel.org/r/20210208215726.608197-1-seth.forshee@canonical.com
Fixes: ea3271f719 ("tmpfs: support 64-bit inums per-sb")
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: <stable@vger.kernel.org> [5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently there is an assumption in tmpfs that 64-bit architectures also
have a 64-bit ino_t. This is not true on s390 which has a 32-bit ino_t.
With CONFIG_TMPFS_INODE64=y tmpfs mounts will get 64-bit inode numbers
and display "inode64" in the mount options, but passing the "inode64"
mount option will fail. This leads to the following behavior:
# mkdir mnt
# mount -t tmpfs nodev mnt
# mount -o remount,rw mnt
mount: /home/ubuntu/mnt: mount point not mounted or bad option.
As mount sees "inode64" in the mount options and thus passes it in the
options for the remount.
So prevent CONFIG_TMPFS_INODE64 from being selected on s390.
Link: https://lkml.kernel.org/r/20210205230620.518245-1-seth.forshee@canonical.com
Fixes: ea3271f719 ("tmpfs: support 64-bit inums per-sb")
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Hugh Dickins <hughd@google.com>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: <stable@vger.kernel.org> [5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sysbot has reported a warning where a kmalloc() attempt exceeds the
maximum limit. This has been identified as corruption of the xattr_ids
count when reading the xattr id lookup table.
This patch adds a number of additional sanity checks to detect this
corruption and others.
1. It checks for a corrupted xattr index read from the inode. This could
be because the metadata block is uncompressed, or because the
"compression" bit has been corrupted (turning a compressed block
into an uncompressed block). This would cause an out of bounds read.
2. It checks against corruption of the xattr_ids count. This can either
lead to the above kmalloc failure, or a smaller than expected
table to be read.
3. It checks the contents of the index table for corruption.
[phillip@squashfs.org.uk: fix checkpatch issue]
Link: https://lkml.kernel.org/r/270245655.754655.1612770082682@webmail.123-reg.co.uk
Link: https://lkml.kernel.org/r/20210204130249.4495-5-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: syzbot+2ccea6339d368360800d@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sysbot has reported an "slab-out-of-bounds read" error which has been
identified as being caused by a corrupted "ino_num" value read from the
inode. This could be because the metadata block is uncompressed, or
because the "compression" bit has been corrupted (turning a compressed
block into an uncompressed block).
This patch adds additional sanity checks to detect this, and the
following corruption.
1. It checks against corruption of the inodes count. This can either
lead to a larger table to be read, or a smaller than expected
table to be read.
In the case of a too large inodes count, this would often have been
trapped by the existing sanity checks, but this patch introduces
a more exact check, which can identify too small values.
2. It checks the contents of the index table for corruption.
[phillip@squashfs.org.uk: fix checkpatch issue]
Link: https://lkml.kernel.org/r/527909353.754618.1612769948607@webmail.123-reg.co.uk
Link: https://lkml.kernel.org/r/20210204130249.4495-4-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: syzbot+04419e3ff19d2970ea28@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sysbot has reported a number of "slab-out-of-bounds reads" and
"use-after-free read" errors which has been identified as being caused
by a corrupted index value read from the inode. This could be because
the metadata block is uncompressed, or because the "compression" bit has
been corrupted (turning a compressed block into an uncompressed block).
This patch adds additional sanity checks to detect this, and the
following corruption.
1. It checks against corruption of the ids count. This can either
lead to a larger table to be read, or a smaller than expected
table to be read.
In the case of a too large ids count, this would often have been
trapped by the existing sanity checks, but this patch introduces
a more exact check, which can identify too small values.
2. It checks the contents of the index table for corruption.
Link: https://lkml.kernel.org/r/20210204130249.4495-3-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: syzbot+b06d57ba83f604522af2@syzkaller.appspotmail.com
Reported-by: syzbot+c021ba012da41ee9807c@syzkaller.appspotmail.com
Reported-by: syzbot+5024636e8b5fd19f0f19@syzkaller.appspotmail.com
Reported-by: syzbot+bcbc661df46657d0fa4f@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Squashfs: fix BIO migration regression and add sanity checks".
Patch [1/4] fixes a regression introduced by the "migrate from
ll_rw_block usage to BIO" patch, which has produced a number of
Sysbot/Syzkaller reports.
Patches [2/4], [3/4], and [4/4] fix a number of filesystem corruption
issues which have produced Sysbot reports in the id, inode and xattr
lookup code.
Each patch has been tested against the Sysbot reproducers using the
given kernel configuration. They have the appropriate "Reported-by:"
lines added.
Additionally, all of the reproducer filesystems are indirectly fixed by
patch [4/4] due to the fact they all have xattr corruption which is now
detected there.
Additional testing with other configurations and architectures (32bit,
big endian), and normal filesystems has also been done to trap any
inadvertent regressions caused by the additional sanity checks.
This patch (of 4):
This is a regression introduced by the patch "migrate from ll_rw_block
usage to BIO".
Sysbot/Syskaller has reported a number of "out of bounds writes" and
"unable to handle kernel paging request in squashfs_decompress" errors
which have been identified as a regression introduced by the above
patch.
Specifically, the patch removed the following sanity check
if (length < 0 || length > output->length ||
(index + length) > msblk->bytes_used)
This check did two things:
1. It ensured any reads were not beyond the end of the filesystem
2. It ensured that the "length" field read from the filesystem
was within the expected maximum length. Without this any
corrupted values can over-run allocated buffers.
Link: https://lkml.kernel.org/r/20210204130249.4495-1-phillip@squashfs.org.uk
Link: https://lkml.kernel.org/r/20210204130249.4495-2-phillip@squashfs.org.uk
Fixes: 93e72b3c61 ("squashfs: migrate from ll_rw_block usage to BIO")
Reported-by: syzbot+6fba78f99b9afd4b5634@syzkaller.appspotmail.com
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Cc: Philippe Liard <pliard@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes a regression following dfs links that was introduced in the
patch series for the new mount api.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
The reference count of the old buffer head should be decremented on path
that fails to get the new buffer head.
Fixes: 6b4657667b ("fs/affs: add rename exchange")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Pan Bian <bianpan2016@163.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, the follow_pfn function is exported for modules but
follow_pte is not. However, follow_pfn is very easy to misuse,
because it does not provide protections (so most of its callers
assume the page is writable!) and because it returns after having
already unlocked the page table lock.
Provide instead a simplified version of follow_pte that does
not have the pmdpp and range arguments. The older version
survives as follow_invalidate_pte() for use by fs/dax.c.
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This final patch adds the ZONED incompat flag to the supported flags
and enables to mount ZONED flagged file system.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the zoned filesystem requires sequential write out of metadata, we
cannot proceed with a hole in tree-log pages. When such a hole exists,
btree_write_cache_pages() will return -EAGAIN. This happens when someone,
e.g., a concurrent transaction commit, writes a dirty extent in this
tree-log commit.
If we are not going to wait for the extents, we can hope the concurrent
writing fills the hole for us. So, we can ignore the error in this case and
hope the next write will succeed.
If we want to wait for them and got the error, we cannot wait for them
because it will cause a deadlock. So, let's bail out to a full commit in
this case.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the 3/3 patch to enable tree-log on zoned filesystems.
The allocation order of nodes of "fs_info->log_root_tree" and nodes of
"root->log_root" is not the same as the writing order of them. So, the
writing causes unaligned write errors.
Reorder the allocation of them by delaying allocation of the root node of
"fs_info->log_root_tree," so that the node buffers can go out sequentially
to devices.
Cc: Filipe Manana <fdmanana@gmail.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the 2/3 patch to enable tree-log on zoned filesystems.
Since we can start more than one log transactions per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at
the time of a log transaction commit. The nodes of the global log root
tree (fs_info->log_root_tree), also have the same problem with mixed
allocation.
Serializes log transactions by waiting for a committing transaction when
someone tries to start a new transaction, to avoid the mixed allocation
problem. We must also wait for running log transactions from another
subvolume, but there is no easy way to detect which subvolume root is
running a log transaction. So, this patch forbids starting a new log
transaction when other subvolumes already allocated the global log root
tree.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the 1/3 patch to enable tree log on zoned filesystems.
The tree-log feature does not work on a zoned filesystem as is. Blocks for
a tree-log tree are allocated mixed with other metadata blocks and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which has a different timing than a global transaction commit. As a
result, both writing tree-log blocks and writing other metadata blocks
become non-sequential writes that zoned filesystems must avoid.
Introduce a dedicated block group for tree-log blocks, so that tree-log
blocks and other metadata blocks can be separate write streams. As a
result, each write stream can now be written to devices separately.
"fs_info->treelog_bg" tracks the dedicated block group and assigns
"treelog_bg" on-demand on tree-log block allocation time.
This commit extends the zoned block allocator to use the block group.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparation patch for the next patch. Split alloc_log_tree()
into two parts. The first one allocating the tree structure, remains in
alloc_log_tree() and the second part allocating the tree node, which is
moved into btrfs_alloc_log_tree_node().
Also export the latter part is to be used in the next patch.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently fallocate() is disabled on a zoned filesystem. Since current
relocation process relies on preallocation to move file data extents, it
must be handled differently.
On a zoned filesystem, we just truncate the inode to the size that we
wanted to pre-allocate. Then, we flush dirty pages on the file before
finishing the relocation process. run_delalloc_zoned() will handle all
the allocations and submit IOs to the underlying layers.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is 4/4 patch to implement device-replace on zoned filesystems.
Even after the copying is done, the write pointers of the source device
and the destination device may not be synchronized. For example, when
the last allocated extent is freed before device-replace process, the
extent is not copied, leaving a hole there.
Synchronize the write pointers by writing zeroes to the destination
device.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is 3/4 patch to implement device-replace on zoned filesystems.
This commit implements copying. To do this, it tracks the write pointer
during the device replace process. As device-replace's copy process is
smart enough to only copy used extents on the source device, we have to
fill the gap to honor the sequential write requirement in the target
device.
The device-replace process on zoned filesystems must copy or clone all
the extents in the source device exactly once. So, we need to ensure
allocations started just before the dev-replace process to have their
corresponding extent information in the B-trees.
finish_extent_writes_for_zoned() implements that functionality, which
basically is the removed code in the commit 042528f8d8 ("Btrfs: fix
block group remaining RO forever after error during device replace").
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is 2/4 patch to implement device replace for zoned filesystems.
In zoned mode, a block group must be either copied (from the source
device to the target device) or cloned (to both devices).
Implement the cloning part. If a block group targeted by an IO is marked
to copy, we should not clone the IO to the destination device, because
the block group is eventually copied by the replace process.
This commit also handles cloning of device reset.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is the 1/4 patch to support device-replace on zoned filesystems.
We have two types of IOs during the device replace process. One is an IO
to "copy" (by the scrub functions) all the device extents from the source
device to the destination device. The other one is an IO to "clone" (by
handle_ops_on_dev_replace()) new incoming write IOs from users to the
source device into the target device.
Cloning incoming IOs can break the sequential write rule in on target
device. When a write is mapped in the middle of a block group, the IO is
directed to the middle of a target device zone, which breaks the
sequential write requirement.
However, the cloning function cannot be disabled since incoming IOs
targeting already copied device extents must be cloned so that the IO is
executed on the target device.
We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not yet copied region. Since we have a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned nor copied.
So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as a
target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.
Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
is the last stripe in the block group. With the last stripe copied,
the to_copy flag is finally disabled. Afterwards we can safely clone
incoming IOs on this block group.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize
the metadata write IOs.
Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by async checksum workers as these workers are per CPU
and not per zone.
To preserve write bio ordering, we disable async metadata checksum on a
zoned filesystem. This does not result in lower performance with HDDs as
a single CPU core is fast enough to do checksum for a single zone write
stream with the maximum possible bandwidth of the device. If multiple
zones are being written simultaneously, HDD seek overhead lowers the
achievable maximum bandwidth, resulting again in a per zone checksum
serialization not affecting the performance.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When truncating a file, file buffers which have already been allocated
but not yet written may be truncated. Truncating these buffers could
cause breakage of a sequential write pattern in a block group if the
truncated blocks are for example followed by blocks allocated to another
file. To avoid this problem, always wait for write out of all unwritten
buffers before proceeding with the truncate execution.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.
We cannot add a mutex around allocation and submission because metadata
blocks are allocated in an earlier stage to build up B-trees.
Add a zoned_meta_io_lock and hold it during metadata IO submission in
btree_write_cache_pages() to serialize IOs.
Furthermore, this adds a per-block group metadata IO submission pointer
"meta_write_pointer" to ensure sequential writing, which can break when
attempting to write back blocks in an unfinished transaction. If the
writing out failed because of a hole and the write out is for data
integrity (WB_SYNC_ALL), it returns EAGAIN.
A caller like fsync() code should handle this properly e.g. by falling
back to a full transaction commit.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If more than one IO is issued for one file extent, these IO can be
written to separate regions on a device. Since we cannot map one file
extent to such a separate area on a zoned filesystem, we need to follow
the "one IO == one ordered extent" rule.
The normal buffered, uncompressed and not pre-allocated write path (used
by cow_file_range()) sometimes does not follow this rule. It can write a
part of an ordered extent when specified a region to write e.g., when
its called from fdatasync().
Introduce a dedicated (uncompressed buffered) data write path for zoned
filesystems, that will COW the region and write it at once.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Likewise to buffered IO, enable zone append writing for direct IO when
its used on a zoned block device.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enable zone append writing for zoned mode. When using zone append, a
bio is issued to the start of a target zone and the device decides to
place it inside the zone. Upon completion the device reports the actual
written position back to the host.
Three parts are necessary to enable zone append mode. First, modify the
bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the
bi_sector to point the beginning of the zone.
Second, record the returned physical address (and disk/partno) to the
ordered extent in end_bio_extent_writepage() after the bio has been
completed. We cannot resolve the physical address to the logical address
because we can neither take locks nor allocate a buffer in this end_bio
context. So, we need to record the physical address to resolve it later
in btrfs_finish_ordered_io().
And finally, rewrite the logical addresses of the extent mapping and
checksum data according to the physical address using btrfs_rmap_block.
If the returned address matches the originally allocated address, we can
skip this rewriting process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A following patch will add another caller of
btrfs_lookup_ordered_extent(), but from a bio's endio context.
btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally
disables interrupts. Change this to spin_lock_irqsave() so interrupts
aren't disabled and re-enabled unconditionally.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
On a zoned filesystem, cache if a block group is on a sequential write
only zone.
On sequential write only zones, we can use REQ_OP_ZONE_APPEND for
writing data, therefore provide btrfs_use_zone_append() to figure out if
IO is targeting a sequential write only zone and we can use
REQ_OP_ZONE_APPEND for data writing.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_rmap_block currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.
Extend the function to match to a specified device. The old functionality
of querying all devices is left intact by specifying NULL as target
device.
A block_device instead of a btrfs_device is passed into btrfs_rmap_block,
as this function is intended to reverse-map the result of a bio, which
only has a block_device.
Also export the function for later use.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To ensure that an ordered extent maps to a contiguous region on disk, we
need to maintain a "one bio == one ordered extent" rule.
Ensure that constructing bio does not span more than an ordered extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For a zone append write, the device decides the location the data is being
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent
maps to a contiguous region on disk, we need to maintain a "one bio ==
one ordered extent" rule.
Implement splitting of an ordered extent and extent map on bio submission
to adhere to the rule.
extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the
corresponding ordered extent so that the ordered extent's region fits into
one bio and the corresponding device limits.
Several sanity checks need to be done in extract_ordered_extent() e.g.
- We cannot split once end_bio'd ordered extent because we cannot divide
ordered->bytes_left for the split ones
- We do not expect a compressed ordered extent
- We should not have checksum list because we omit the list splitting.
Since the function is called before btrfs_wq_submit_bio() or
btrfs_csum_one_bio(), this should be always ensured.
We also need to split an extent map by creating a new one. If not,
unpin_extent_cache() complains about the difference between the start of
the extent map and the file's logical offset.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
devices.
Let btrfs_end_bio() and btrfs_op be aware of it, by mapping
REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
bio_op().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A zoned device has its own hardware restrictions e.g. max_zone_append_size
when using REQ_OP_ZONE_APPEND. To follow these restrictions, use
bio_add_zone_append_page() instead of bio_add_page(). We need target device
to use bio_add_zone_append_page(), so this commit reads the chunk
information to cache the target device to btrfs_io_bio(bio)->device.
Caching only the target device is sufficient here as zoned filesystems
only supports the single profile at the moment. Once more profiles will be
supported btrfs_io_bio can hold an extent_map to be able to check for the
restrictions of all devices the btrfs_bio will be mapped to.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out adding a page to a bio from submit_extent_page(). The page
is added only when bio_flags are the same, contiguous and the added page
fits in the same stripe as pages in the bio.
Condition checks are reordered to allow early return to avoid possibly
heavy btrfs_bio_fits_in_stripe() calling.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We must reset the zones of a deleted unused block group to rewind the
zones' write pointers to the zones' start.
To do this, we can use the DISCARD_SYNC code to do the reset when the
filesystem is running on zoned devices.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the allocation info of a tree log node is not recorded in the extent
tree, calculate_alloc_pointer() cannot detect this node, so the pointer
can be over a tree node.
Replaying the log calls btrfs_remove_free_space() for each node in the
log tree.
So, advance the pointer after the node to not allocate over it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Such nodes are cleaned so that pages in the
node are not uselessly written out. On zoned volumes, however, such
optimization blocks the following IOs as the cancellation of the write
out of the freed blocks breaks the sequential write sequence expected by
the device.
Introduce a list of clean and unwritten extent buffers that have been
released in a transaction. Redirty the buffers so that
btree_write_cache_pages() can send proper bios to the devices.
Besides it clears the entire content of the extent buffer not to confuse
raw block scanners e.g. 'btrfs check'. By clearing the content,
csum_dirty_buffer() complains about bytenr mismatch, so avoid the
checking and checksum using newly introduced buffer flag
EXTENT_BUFFER_NO_CHECK.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement a sequential extent allocator for zoned filesystems. This
allocator only needs to check if there is enough space in the block group
after the allocation pointer to satisfy the extent allocation request.
Therefore the allocator never manages bitmaps or clusters. Also, add
assertions to the corresponding functions.
As zone append writing is used, it would be unnecessary to track the
allocation offset, as the allocator only needs to check available space.
But by tracking and returning the offset as an allocated region, we can
skip modification of ordered extents and checksum information when there
is no IO reordering.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In a zoned filesystem a once written then freed region is not usable
until the underlying zone has been reset. So we need to distinguish such
unusable space from usable free space.
Therefore we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the space_info structure
to track the unusable space.
Pinned bytes are always reclaimed to the unusable space. But, when an
allocated region is returned before using e.g., the block group becomes
read-only between allocation time and reservation time, we can safely
return the region to the block group. For the situation, this commit
introduces "btrfs_add_free_space_unused". This behaves the same as
btrfs_add_free_space() on regular filesystem. On zoned filesystems, it
rewinds the allocation offset.
Because the read-only bytes tracks free but unusable bytes when the block
group is read-only, we need to migrate the zone_unusable bytes to
read-only bytes when a block group is marked read-only.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset for sequential allocation if a block
group contains a conventional zone.
But instead, we can consider the end of the highest addressed extent in
the block group for the allocation offset.
For new block group, we cannot calculate the allocation offset by
consulting the extent tree, because it can cause deadlock by taking
extent buffer lock after chunk mutex, which is already taken in
btrfs_make_block_group(). Since it is a new block group anyways, we can
simply set the allocation offset to 0.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A zoned filesystem must allocate blocks at the zones' write pointer. The
device's write pointer position can be mapped to a logical address within
a block group. To facilitate this, add an "alloc_offset" to the
block-group to track the logical addresses of the write pointer.
This logical address is populated in btrfs_load_block_group_zone_info()
from the write pointers of corresponding zones.
For now, zoned filesystems the single profile. Supporting non-single
profile with zone append writing is not trivial. For example, in the DUP
profile, we send a zone append writing IO to two zones on a device. The
device reply with written LBAs for the IOs. If the offsets of the
returned addresses from the beginning of the zone are different, then it
results in different logical addresses.
We need fine-grained logical to physical mapping to support such separated
physical address issue. Since it should require additional metadata type,
disable non-single profiles for now.
This commit supports the case all the zones in a block group are
sequential. The next patch will handle the case having a conventional
zone.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a check in verify_one_dev_extent() to ensure that a device extent on
a zoned block device is aligned to the respective zone boundary.
If it isn't, mark the filesystem as unclean.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement a zoned chunk and device extent allocator. One device zone
becomes a device extent so that a zone reset affects only this device
extent and does not change the state of blocks in the neighbor device
extents.
To implement the allocator, we need to extend the following functions for
a zoned filesystem.
- init_alloc_chunk_ctl
- dev_extent_search_start
- dev_extent_hole_check
- decide_stripe_size
init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always
set the stripe_size to the zone size and aligns the parameters to the zone
size.
dev_extent_search_start() only aligns the start offset to zone boundaries.
We don't care about the first 1MB like in regular filesystem because we
anyway reserve the first two zones for superblock logging.
dev_extent_hole_check_zoned() checks if zones in given hole are either
conventional or empty sequential zones. Also, it skips zones reserved for
superblock logging.
With the change to the hole, the new hole may now contain pending extents.
So, in this case, loop again to check that.
Finally, decide_stripe_size_zoned() should shrink the number of devices
instead of stripe size because we need to honor stripe_size == zone_size.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Run a zoned filesystem on non-zoned devices. This is done by "slicing up"
the block device into static sized chunks and fake a conventional zone on
each of them. The emulated zone size is determined from the size of device
extent.
This is mainly aimed at testing of zoned filesystems, i.e. the zoned
chunk allocator, on regular block devices.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The implementation of fitrim depends on space cache, which is not used
and disabled for zoned extent allocator. So the current code does not
work with zoned filesystem.
In the future, we can implement fitrim for zoned filesystems by enabling
space cache (but, only for fitrim) or scanning the extent tree at fitrim
time. For now, disallow fitrim on zoned filesystems.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Don't set the zoned flag in fs_info as soon as we're encountering the
incompat filesystem flag for a zoned filesystem on mount. The zoned flag
in fs_info is in a union together with the zone_size, so setting it too
early will result in setting an incorrect zone_size as well.
Once the correct zone_size is read from the device, we can rely on the
zoned flag in fs_info as well to determine if the filesystem is zoned.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have no write pointer in conventional zones, we cannot
determine the allocation offset from it. Instead, we set the allocation
offset after the highest addressed extent. This is done by reading the
extent tree in btrfs_load_block_group_zone_info().
However, this function is called from btrfs_read_block_groups(), so the
read lock for the tree node could be recursively taken.
To avoid this unsafe locking scenario, release the path before reading
the extent tree to get the allocation offset.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A zoned filesystem currently has a superblock at the beginning of the
superblock logging zones if the zones are conventional. This difference
in superblock position causes a chicken-and-egg problem for filesystems
with emulated zones. Since the device is a regular (non-zoned) device,
we cannot know if the filesystem is regular or zoned while reading the
superblock. But, to load the superblock, we need to see if it is
emulated zoned or not.
Place the superblocks at the same location as they are on regular
filesystem on regular devices to solve the problem. It is possible
because it's ensured that all the superblock locations are at an
(emulated) conventional zone on regular devices.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is a preparation patch to implement zone emulation on a regular
device.
To emulate a zoned filesystem on a regular (non-zoned) device, we need to
decide an emulated zone size. Instead of making it a compile-time static
value, we'll make it configurable at mkfs time. Since we have one zone ==
one device extent restriction, we can determine the emulated zone size
from the size of a device extent. We can extend btrfs_get_dev_zone_info()
to show a regular device filled with conventional zones once the zone size
is decided.
The current call site of btrfs_get_dev_zone_info() during the mount process
is earlier than loading the file system trees so that we don't know the
size of a device extent at this point. Thus we can't slice a regular device
to conventional zones.
This patch introduces btrfs_get_dev_zone_info_all_devices to load the zone
info for all the devices. And, it places this function in open_ctree()
after loading the trees.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) not to be split. bio_iov_iter_get_pages builds
such restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
REQ_OP_ZONE_APPEND.
To utilize it, we need to set the bio_op before calling
bio_iov_iter_get_pages(). This commit introduces IOMAP_F_ZONE_APPEND, so
that iomap user can set the flag to indicate they want REQ_OP_ZONE_APPEND
and restricted bio.
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the retry policy in ext4_alloc_file_blocks() to allow for a full
retry cycle whenever a portion of an allocation request has been
fulfilled. A large allocation request often results in multiple calls
to ext4_map_blocks(), each of which is potentially subject to a
temporary ENOSPC condition and retry cycle. The current code only
allows for a single retry cycle.
This patch does not address a known bug or reported complaint.
However, it should make block allocation for fallocate and zero range
more robust.
In addition, simplify the conditional controlling the allocation while
loop, where testing len alone is sufficient. Remove the assignment to
ret2 in the error path after the call to ext4_map_blocks() since its
value isn't subsequently used.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Link: https://lore.kernel.org/r/20210113221403.18258-1-enwlinux@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().
So fix that by unlocking the extent buffer and dropping our reference on
it before returning.
Fixes: be20aa9dba ("Btrfs: Add mount option to turn off data cow")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In read_extent_buffer_pages(), if we failed to lock the page atomically,
we just exit with return value 0.
This is counter-intuitive, as normally if we can't lock what we need, we
would return something like EAGAIN.
But that return hides under (wait == WAIT_NONE) branch, which only gets
triggered for readahead.
And for readahead, if we failed to lock the page, it means the extent
buffer is either being read by other thread, or has been read and is
under modification. Either way the eb will or has been cached, thus
readahead has no need to wait for it.
Add comment on this counter-intuitive behavior.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This adds the basic RO mount ability for 4K sector size on 64K page
system.
Currently we only plan to support 4K and 64K page system.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs data page read path, the page status update are handled in two
different locations:
btrfs_do_read_page()
{
while (cur <= end) {
/* No need to read from disk */
if (HOLE/PREALLOC/INLINE){
memset();
set_extent_uptodate();
continue;
}
/* Read from disk */
ret = submit_extent_page(end_bio_extent_readpage);
}
end_bio_extent_readpage()
{
endio_readpage_uptodate_page_status();
}
This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.
But for subpage, there is more work to be done in page status update:
- Page Unlock condition
Unlike regular page size == sectorsize case, we can no longer just
unlock a page.
Only the last reader of the page can unlock the page.
This means, we can unlock the page either in the while() loop, or in
the endio function.
- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.
To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:
- begin_page_read()
For regular case, it does nothing.
For subpage case, it updates the reader counters so that later
end_page_read() can know who is the last one to unlock the page.
- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.
The new thing added is the condition for page unlock.
Now for subpage data, we unlock the page if we're the last reader.
This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.
Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support subpage sector size, data also need extra info to make sure
which sectors in a page are uptodate/dirty/...
This patch will make pages for data inodes get btrfs_subpage structure
attached, and detached when the page is freed.
This patch also slightly changes the timing when
set_page_extent_mapped() is called to make sure:
- We have page->mapping set
page->mapping->host is used to grab btrfs_fs_info, thus we can only
call this function after page is mapped to an inode.
One call site attaches pages to inode manually, thus we have to modify
the timing of set_page_extent_mapped() a bit.
- As soon as possible, before other operations
Since memory allocation can fail, we have to do extra error handling.
Calling set_page_extent_mapped() as soon as possible can simply the
error handling for several call sites.
The idea is pretty much the same as iomap_page, but with more bitmaps
for btrfs specific cases.
Currently the plan is to switch iomap if iomap can provide sector
aligned write back (only write back dirty sectors, but not the full
page, data balance require this feature).
So we will stick to btrfs specific bitmap for now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage metadata validation check, there are some differences:
- Read must finish in one bvec
Since we're just reading one subpage range in one page, it should
never be split into two bios nor two bvecs.
- How to grab the existing eb
Instead of grabbing eb using page->private, we have to go search radix
tree as we don't have any direct pointer at hand.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To handle subpage status update, add the following:
- Use btrfs_page_*() subpage-aware helpers to update page status
Now we can handle both cases well.
- No page unlock for subpage metadata
Since subpage metadata doesn't utilize page locking at all, skip it.
For subpage data locking, it's handled in later commits.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a helper, read_extent_buffer_subpage(), to do the subpage
extent buffer read.
The difference between regular and subpage routines are:
- No page locking
Here we completely rely on extent locking.
Page locking can reduce the concurrency greatly, as if we lock one
page to read one extent buffer, all the other extent buffers in the
same page will have to wait.
- Extent uptodate condition
Despite the existing PageUptodate() and EXTENT_BUFFER_UPTODATE check,
We also need to check btrfs_subpage::uptodate_bitmap.
- No page iteration
Just one page, no need to loop, this greatly simplified the subpage
routine.
This patch only implements the bio submit part, no endio support yet.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Unlike the original try_release_extent_buffer(),
try_release_subpage_extent_buffer() will iterate through all the ebs in
the page, and try to release each.
We can release the full page only after there's no private attached,
which means all ebs of that page have been released as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For btrfs_clone_extent_buffer(), it's mostly the same code of
__alloc_dummy_extent_buffer(), except it has extra page copy.
So to make it subpage compatible, we only need to:
- Call set_extent_buffer_uptodate() instead of SetPageUptodate()
This will set correct uptodate bit for subpage and regular sector size
cases.
Since we're calling set_extent_buffer_uptodate() which will also set
EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit
either.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support subpage in set_extent_buffer_uptodate and
clear_extent_buffer_uptodate we only need to use the subpage-aware
helpers to update the page bits.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce the following functions to handle subpage error status:
- btrfs_subpage_set_error()
- btrfs_subpage_clear_error()
- btrfs_subpage_test_error()
These helpers can only be called when the page has subpage attached
and the range is ensured to be inside the page.
- btrfs_page_set_error()
- btrfs_page_clear_error()
- btrfs_page_test_error()
These helpers can handle both regular sector size and subpage without
problem.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce the following functions to handle subpage uptodate status:
- btrfs_subpage_set_uptodate()
- btrfs_subpage_clear_uptodate()
- btrfs_subpage_test_uptodate()
These helpers can only be called when the page has subpage attached
and the range is ensured to be inside the page.
- btrfs_page_set_uptodate()
- btrfs_page_clear_uptodate()
- btrfs_page_test_uptodate()
These helpers can handle both regular sector size and subpage.
Although caller should still ensure that the range is inside the page.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are locations where we allocate dummy extent buffers for temporary
usage, like in tree_mod_log_rewind() or get_old_root().
These dummy extent buffers will be handled by the same eb accessors, and
if they don't have page::private subpage eb accessors could fail.
To address such problems, make __alloc_dummy_extent_buffer() attach
page private for dummy extent buffers too.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_release_extent_buffer_pages(), we need to add extra handling
for subpage.
Introduce a helper, detach_extent_buffer_page(), to do different
handling for regular and subpage cases.
For subpage case, handle detaching page private.
For unmapped (dummy or cloned) ebs, we can detach the page private
immediately as the page can only be attached to one unmapped eb.
For mapped ebs, we have to ensure there are no eb in the page range
before we delete it, as page->private is shared between all ebs in the
same page.
But there is a subpage specific race, where we can race with extent
buffer allocation, and clear the page private while new eb is still
being utilized, like this:
Extent buffer A is the new extent buffer which will be allocated,
while extent buffer B is the last existing extent buffer of the page.
T1 (eb A) | T2 (eb B)
-------------------------------+------------------------------
alloc_extent_buffer() | btrfs_release_extent_buffer_pages()
|- p = find_or_create_page() | |
|- attach_extent_buffer_page() | |
| | |- detach_extent_buffer_page()
| | |- if (!page_range_has_eb())
| | | No new eb in the page range yet
| | | As new eb A hasn't yet been
| | | inserted into radix tree.
| | |- btrfs_detach_subpage()
| | |- detach_page_private();
|- radix_tree_insert() |
Then we have a metadata eb whose page has no private bit.
To avoid such race, we introduce a subpage metadata-specific member,
btrfs_subpage::eb_refs.
In alloc_extent_buffer() we increase eb_refs in the critical section of
private_lock. Then page_range_has_eb() will return true for
detach_extent_buffer_page(), and will not detach page private.
The section is marked by:
- btrfs_page_inc_eb_refs()
- btrfs_page_dec_eb_refs()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage case, grab_extent_buffer() can't really get an extent buffer
just from btrfs_subpage.
We have radix tree lock protecting us from inserting the same eb into
the tree. Thus we don't really need to do the extra hassle, just let
alloc_extent_buffer() handle the existing eb in radix tree.
Now if two ebs are being allocated as the same time, one will fail with
-EEIXST when inserting into the radix tree.
So for grab_extent_buffer(), just always return NULL for subpage case.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage case, we need to allocate additional memory for each
metadata page.
So we need to:
- Allow attach_extent_buffer_page() to return int to indicate allocation
failure
- Allow manually pre-allocate subpage memory for alloc_extent_buffer()
As we don't want to use GFP_ATOMIC under spinlock, we introduce
btrfs_alloc_subpage() and btrfs_free_subpage() functions for this
purpose.
(The simple wrap for btrfs_free_subpage() is for later convert to
kmem_cache. Already internally tested without problem)
- Preallocate btrfs_subpage structure for alloc_extent_buffer()
We don't want to call memory allocation with spinlock held, so
do preallocation before we acquire mapping->private_lock.
- Handle subpage and regular case differently in
attach_extent_buffer_page()
For regular case, no change, just do the usual thing.
For subpage case, allocate new memory or use the preallocated memory.
For future subpage metadata, we will make use of radix tree to grab
extent buffer.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For sectorsize < page size support, we need a structure to record extra
status info for each sector of a page.
Introduce the skeleton structure, all subpage related code would go to
subpage.[ch].
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the incoming subpage support, UNMAPPED extent buffer will have
different behavior in btrfs_release_extent_buffer().
This means we need to set UNMAPPED bit early before calling
btrfs_release_extent_buffer().
Currently there is only one caller which relies on
btrfs_release_extent_buffer() in its error path while set UNMAPPED bit
late:
- btrfs_clone_extent_buffer()
Make it subpage compatible by setting the UNMAPPED bit early, since
we're here, also move the UPTODATE bit early.
There is another caller, __alloc_dummy_extent_buffer(), setting
UNMAPPED bit late, but that function clean up the allocated page
manually, thus no need for any modification.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in
__process_pages_contig(), to let the function know to clear page dirty
bit and then set page writeback.
However page writeback and dirty bits are conflicting (at least for
sector size == PAGE_SIZE case), this means these two have to be always
updated together.
This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to
PAGE_START_WRITEBACK.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Often an fsync needs to fallback to a transaction commit for several
reasons (to ensure consistency after a power failure, a new block group
was allocated or a temporary error such as ENOMEM or ENOSPC happened).
In that case the log is marked as needing a full commit and any concurrent
tasks attempting to log inodes or commit the log will also fallback to the
transaction commit. When this happens they all wait for the task that first
started the transaction commit to finish the transaction commit - however
they wait until the full transaction commit happens, which is not needed,
as they only need to wait for the superblocks to be persisted and not for
unpinning all the extents pinned during the transaction's lifetime, which
even for short lived transactions can be a few thousand and take some
significant amount of time to complete - for dbench workloads I have
observed up to 4~5 milliseconds of time spent unpinning extents in the
worst cases, and the number of pinned extents was between 2 to 3 thousand.
So allow fsync tasks to skip waiting for the unpinning of extents when
they call btrfs_commit_transaction() and they were not the task that
started the transaction commit (that one has to do it, the alternative
would be to offload the transaction commit to another task so that it
could avoid waiting for the extent unpinning or offload the extent
unpinning to another task).
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
After applying the entire patchset, dbench shows improvements in respect
to throughput and latency. The script used to measure it is the following:
$ cat dbench-test.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-m single -d single"
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
umount $DEV &> /dev/null
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a physical machine with 12 cores (Intel corei7), 64G
of ram, using a NVMe device and a non-debug kernel configuration (Debian's
default configuration).
Before applying patchset, 32 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9627107 0.153 61.938
Close 7072076 0.001 3.175
Rename 407633 1.222 44.439
Unlink 1943895 0.658 44.440
Deltree 256 17.339 110.891
Mkdir 128 0.003 0.009
Qpathinfo 8725406 0.064 17.850
Qfileinfo 1529516 0.001 2.188
Qfsinfo 1599884 0.002 1.457
Sfileinfo 784200 0.005 3.562
Find 3373513 0.411 30.312
WriteX 4802132 0.053 29.054
ReadX 15089959 0.002 5.801
LockX 31344 0.002 0.425
UnlockX 31344 0.001 0.173
Flush 674724 5.952 341.830
Throughput 1008.02 MB/sec 32 clients 32 procs max_latency=341.833 ms
After applying patchset, 32 clients:
After patchset, with 32 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9931568 0.111 25.597
Close 7295730 0.001 2.171
Rename 420549 0.982 49.714
Unlink 2005366 0.497 39.015
Deltree 256 11.149 89.242
Mkdir 128 0.002 0.014
Qpathinfo 9001863 0.049 20.761
Qfileinfo 1577730 0.001 2.546
Qfsinfo 1650508 0.002 3.531
Sfileinfo 809031 0.005 5.846
Find 3480259 0.309 23.977
WriteX 4952505 0.043 41.283
ReadX 15568127 0.002 5.476
LockX 32338 0.002 0.978
UnlockX 32338 0.001 2.032
Flush 696017 7.485 228.835
Throughput 1049.91 MB/sec 32 clients 32 procs max_latency=228.847 ms
--> +4.1% throughput, -39.6% max latency
Before applying patchset, 64 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 8956748 0.342 108.312
Close 6579660 0.001 3.823
Rename 379209 2.396 81.897
Unlink 1808625 1.108 131.148
Deltree 256 25.632 172.176
Mkdir 128 0.003 0.018
Qpathinfo 8117615 0.131 55.916
Qfileinfo 1423495 0.001 2.635
Qfsinfo 1488496 0.002 5.412
Sfileinfo 729472 0.007 8.643
Find 3138598 0.855 78.321
WriteX 4470783 0.102 79.442
ReadX 14038139 0.002 7.578
LockX 29158 0.002 0.844
UnlockX 29158 0.001 0.567
Flush 627746 14.168 506.151
Throughput 924.738 MB/sec 64 clients 64 procs max_latency=506.154 ms
After applying patchset, 64 clients:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 9069003 0.303 43.193
Close 6662328 0.001 3.888
Rename 383976 2.194 46.418
Unlink 1831080 1.022 43.873
Deltree 256 24.037 155.763
Mkdir 128 0.002 0.005
Qpathinfo 8219173 0.137 30.233
Qfileinfo 1441203 0.001 3.204
Qfsinfo 1507092 0.002 4.055
Sfileinfo 738775 0.006 5.431
Find 3177874 0.936 38.170
WriteX 4526152 0.084 39.518
ReadX 14213562 0.002 24.760
LockX 29522 0.002 1.221
UnlockX 29522 0.001 0.694
Flush 635652 14.358 422.039
Throughput 990.13 MB/sec 64 clients 64 procs max_latency=422.043 ms
--> +6.8% throughput, -18.1% max latency
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Whenever we fsync an inode, if it is a directory, a regular file that was
created in the current transaction or has last_unlink_trans set to the
generation of the current transaction, we check if any of its ancestor
inodes (and the inode itself if it is a directory) can not be logged and
need a fallback to a full transaction commit - if so, we return with a
value of 1 in order to fallback to a transaction commit.
However we often do not need to fallback to a transaction commit because:
1) The ancestor inode is not an immediate parent, and therefore there is
not an explicit request to log it and it is not needed neither to
guarantee the consistency of the inode originally asked to be logged
(fsynced) nor its immediate parent;
2) The ancestor inode was already logged before, in which case any link,
unlink or rename operation updates the log as needed.
So for these two cases we can avoid an unnecessary transaction commit.
Therefore remove check_parent_dirs_for_sync() and add a check at the top
of btrfs_log_inode() to make us fallback immediately to a transaction
commit when we are logging a directory inode that can not be logged and
needs a full transaction commit. All we need to protect is the case where
after renaming a file someone fsyncs only the old directory, which would
result is losing the renamed file after a log replay.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging new directory entries of a directory, we log the inodes of
new dentries and the inodes of dentries pointing to directories that
may have been created in past transactions. For the case of directories
we log in full mode, which can be particularly expensive for large
directories.
We do use btrfs_inode_in_log() to skip already logged inodes, however for
that helper to return true, it requires that the log transaction used to
log the inode to be already committed. This means that when we have more
than one task using the same log transaction we can end up logging an
inode multiple times, which is a waste of time and not necessary since
the log will be committed by one of the tasks and the others will wait for
the log transaction to be committed before returning to user space.
So simply replace the use of btrfs_inode_in_log() with the new helper
function need_log_inode(), introduced in a previous commit.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some times when we fsync an inode we need to do a full log of all its
ancestors (due to unlink, link or rename operations), which can be an
expensive operation, specially if the directories are large.
However if we find an ancestor directory inode that is already logged in
the current transaction, and has no inserted/updated/deleted xattrs since
it was last logged, we can skip logging the directory again. We are safe
to skip that since we know that for logged directories, any link, unlink
or rename operations that implicate the directory will update the log as
necessary.
So use the helper need_log_dir(), introduced in a previous commit, to
detect already logged directories that can be skipped.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we fsync a new file, created in the current transaction, we check
all its ancestor inodes and always log them if they were created in the
current transaction - even if we have already logged them before, which
is a waste of time.
So avoid logging new ancestor inodes if they were already logged before
and have no xattrs added/updated/removed since they were last logged.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we fill an inode item for logging we are setting its nbytes field
with the value returned by inode_get_bytes() (a VFS API), however we do
not need it because it is not used during log replay. In fact, for fast
fsyncs, when we call inode_get_bytes() we may even get an outdated value
for nbytes because the nbytes field of the inode is only updated when
ordered extents complete, and a fast fsync only waits for writeback to
complete, it does not wait for ordered extent completion.
So just remove the setup of nbytes and add an explicit comment mentioning
why we do not set it. This also avoids adding contention on the inode's
i_lock (VFS) with concurrent stat() calls, since that spinlock is used by
inode_get_bytes() which is also called by our stat callback
(btrfs_getattr()).
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we remove a directory entry, as part of an unlink operation, if the
directory was logged before we must remove the directory index items from
the log. We are also updating the inode item of the directory to update
its i_size, but that is not necessary because during log replay we do not
need it and we correctly adjust the i_size in the inode item of the
subvolume as we process directory index items and replay deletes.
This is not needed since commit d555438b6e ("Btrfs: drop dir i_size
when adding new names on replay"), where we explicitly ignore the i_size
of directory inode items on log replay. Before that we used it but it
was buggy as mentioned in that commit's change log (i_size got a larger
value then it should have).
So stop updating the i_size of the directory inode item in the log, as
that is a waste of time, adds more log contention to the log tree and
often results in COWing more extent buffers for the log tree.
This code path is triggered often during dbench workloads for example.
This patch is part of a patchset comprised of the following patches:
btrfs: remove unnecessary directory inode item update when deleting dir entry
btrfs: stop setting nbytes when filling inode item for logging
btrfs: avoid logging new ancestor inodes when logging new inode
btrfs: skip logging directories already logged when logging all parents
btrfs: skip logging inodes already logged when logging new entries
btrfs: remove unnecessary check_parent_dirs_for_sync()
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Performance results, after applying all patches, are mentioned in the
change log of the last patch.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Before this change, the btrfs_get_io_geometry() function was calling
btrfs_get_chunk_map() to get the extent mapping, necessary for
calculating the I/O geometry. It was using that extent mapping only
internally and freeing the pointer after its execution.
That resulted in calling btrfs_get_chunk_map() de facto twice by the
__btrfs_map_block() function. It was calling btrfs_get_io_geometry()
first and then calling btrfs_get_chunk_map() directly to get the extent
mapping, used by the rest of the function.
Change that to passing the extent mapping to the btrfs_get_io_geometry()
function as an argument.
This could improve performance in some cases. For very large
filesystems, i.e. several thousands of allocated chunks, not only this
avoids searching two times the rbtree, saving time, it may also help
reducing contention on the lock that protects the tree - thinking of
writeback starting for multiple inodes, other tasks allocating or
removing chunks, and anything else that requires access to the rbtree.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Michal Rostecki <mrostecki@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add Filipe's analysis ]
Signed-off-by: David Sterba <dsterba@suse.com>
Commit dbfdb6d1b3 ("Btrfs: Search for all ordered extents that could
span across a page") make btrfs_invalidapage() to search all ordered
extents.
The offending code looks like this:
again:
start = page_start;
ordered = btrfs_lookup_ordered_range(inode, start, page_end - start + 1);
if (ordred) {
end = min(page_end,
ordered->file_offset + ordered->num_bytes - 1);
/* Do the cleanup */
start = end + 1;
if (start < page_end)
goto again;
}
The behavior is indeed necessary for the incoming subpage support, but
when it iterates through all the ordered extents, it also resets the
search range @start.
This means, for the following cases, we can double account the ordered
extents, causing its bytes_left underflow:
Page offset
0 16K 32K
|<--- OE 1 --->|<--- OE 2 ---->|
As the first iteration will find ordered extent (OE) 1, which doesn't
cover the full page, thus after cleanup code, we need to retry again.
But again label will reset start to page_start, and we got OE 1 again,
which causes double accounting on OE 1, and cause OE 1's byte_left to
underflow.
This problem can only happen for subpage case, as for regular sectorsize
== PAGE_SIZE case, we will always find a OE ends at or after page end,
thus no way to trigger the problem.
Move the again label after start = page_start. There will be more
comprehensive rework to convert the open coded loop to a proper while
loop for subpage support.
Fixes: dbfdb6d1b3 ("Btrfs: Search for all ordered extents that could span across a page")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the following coccicheck warnings:
./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is
equivalent to !A || B.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Suggested-by: Jiapeng Zhong <oswb@linux.alibaba.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Abaci Team <abaci-bugfix@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment for can_nocow_extent() says that the function will flush
ordered extents, however that never happens and was never true before the
comment was added in commit e4ecaf90bc ("btrfs: add comments for
btrfs_check_can_nocow() and can_nocow_extent()"). This is true only for
the function btrfs_check_can_nocow(), which after that commit was renamed
to check_can_nocow(). So just remove that part of the comment.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Often when I'm debugging ENOSPC related issues I have to resort to
printing the entire ENOSPC state with trace_printk() in different spots.
This gets pretty annoying, so add a trace state that does this for us.
Then add a trace point at the end of preemptive flushing so you can see
the state of the space_info when we decide to exit preemptive flushing.
This helped me figure out we weren't kicking in the preemptive flushing
soon enough.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we have normal ticketed flushing and preemptive flushing, adjust
the tracepoint so that we know the source of the flushing action to make
it easier to debug problems.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Starting preemptive flushing at 50% of available free space is a good
start, but some workloads are particularly abusive and can quickly
overwhelm the preemptive flushing code and drive us into using tickets.
Handle this by clamping down on our threshold for starting and
continuing to run preemptive flushing. This is particularly important
for our overcommit case, as we can really drive the file system into
overages and then it's more difficult to pull it back as we start to
actually fill up the file system.
The clamping is essentially 2^CLAMP, but we start at 1 so whatever we
calculate for overcommit is the baseline.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A lot of this was added all in one go with no explanation, and is a bit
unwieldy and confusing. Simplify the logic to start preemptive flushing
if we've reserved more than half of our available free space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_calc_reclaim_metadata_size does two things, it returns
the space currently required for flushing by the tickets, and if there
are no tickets it calculates a value for the preemptive flushing.
However for the normal ticketed flushing we really only care about the
space required for tickets. We will accidentally come in and flush one
time, but as soon as we see there are no tickets we bail out of our
flushing.
Fix this by making btrfs_calc_reclaim_metadata_size really only tell us
what is required for flushing if we have people waiting on space. Then
move the preemptive flushing logic into need_preemptive_reclaim(). We
ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim()
because if we are in this path then we made our reservation and there
are not pending tickets currently, so we do not need to check it, simply
do the fuzzy logic to check if we're getting low on space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're flushing space for tickets then we have
space_info->reclaim_size set and we do not need to do background
reclaim.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All of our normal flushing is asynchronous reclaim, so this helper is
poorly named. This is more checking if we need to preemptively flush
space, so rename it to need_preemptive_reclaim.
Also switch it to bool and make it plain static as followup patches will
move more code here.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently if we ever have to flush space because we do not have enough
we allocate a ticket and attach it to the space_info, and then
systematically flush things in the filesystem that hold space
reservations until our space is reclaimed.
However this has a latency cost, we must go to sleep and wait for the
flushing to make progress before we are woken up and allowed to continue
doing our work.
In order to address that we used to kick off the async worker to flush
space preemptively, so that we could be reclaiming space hopefully
before any tasks needed to stop and wait for space to reclaim.
When I introduced the ticketed ENOSPC stuff this broke slightly in the
fact that we were using tickets to indicate if we were done flushing.
No tickets, no more flushing. However this meant that we essentially
never preemptively flushed. This caused a write performance regression
that Nikolay noticed in an unrelated patch that removed the committing
of the transaction during btrfs_end_transaction.
The behavior that happened pre that patch was btrfs_end_transaction()
would see that we were low on space, and it would commit the
transaction. This was bad because in this particular case you could end
up with thousands and thousands of transactions being committed during
the 5 minute reproducer. With the patch to remove this behavior we got
much more sane transaction commits, but we ended up slower because we
would write for a while, flush, write for a while, flush again.
To address this we need to reinstate a preemptive flushing mechanism.
However it is distinctly different from our ticketing flushing in that
it doesn't have tickets to base it's decisions on. Instead of bolting
this logic into our existing flushing work, add another worker to handle
this preemptive flushing. Here we will attempt to be slightly
intelligent about the things that we flushing, attempting to balance
between whichever pool is taking up the most space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Solely for preemptive flushing, we want to be able to force the
transaction commit without any of the ambiguity of
may_commit_transaction(). This is because may_commit_transaction()
checks tickets and such, and in preemptive flushing we already know
it'll be helpful, so use this to keep the code nice and clean and
straightforward.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
We track dio_bytes because the shrink delalloc code needs to know if we
have more DIO in flight than we have normal buffered IO. The reason for
this is because we can't "flush" DIO, we have to just wait on the
ordered extents to finish.
However this is true of all ordered extents. If we have more ordered
space outstanding than dirty pages we should be waiting on ordered
extents. We already are ok on this front technically, because we always
do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
the preemptive flushing code as well, so change this to count all
ordered bytes instead of just DIO ordered bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While debugging a ENOSPC related performance problem I needed to see the
time difference between start and end of a reserve ticket, so add a
trace point to report when we handle a reserve ticket.
I opted to spit out start_ns itself without calculating the difference
because there could be a gap between enabling the tracepoint and setting
start_ns. Doing it this way allows us to filter on 0 start_ns so we
don't get bogus entries, and we can easily calculate the time difference
with bpftrace or something else.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I got a automated message from somebody who runs clang against our
kernels and it's because I used the wrong enum type for what I passed
into flush_space, caught by -Wenum-conversion. Change the argument to
be explicitly the enum we're expecting to make everything consistent.
Maybe eventually gcc will catch errors like this.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
struct send_ctx *sctx but when used in changed_cb it is immediately
cast to `struct send_ctx *sctx = ctx;`.
changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
- full_send_tree already passes a struct send_ctx *sctx
- btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
- void *ctx in btrfs_compare_trees is only used to be passed to changed_cb
So casting to/from void *ctx seems unnecessary and directly using
struct send_ctx *sctx instead provides better type-safety.
The original reason for using void *ctx in the first place seems to have
been dropped with 1b51d6fce4 ("btrfs: send: remove indirect callback
parameter for changed_cb").
Signed-off-by: Roman Anasal <roman.anasal@bdsu.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We love running delayed refs in commit_cowonly_roots, but it is a bit
excessive. I was seeing cases of running 3 or 4 refs a few times in a
row during this time. Instead simply:
- update all of the roots first
- then run delayed refs
- then handle the empty block groups case
- and then if we have any more dirty roots do the whole thing again
This allows us to be much more efficient with our delayed ref running,
as we can batch a few more operations at once.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This was added in commit 361048f586 ("Btrfs: fix full backref problem
when inserting shared block reference") to address a problem where we
hit the following BUG_ON() in alloc_reserved_tree_block
if (node->type == BTRFS_SHARED_BLOCK_REF_KEY) {
BUG_ON(!(flags & BTRFS_BLOCK_FLAG_FULL_BACKREF));
However this BUG_ON() is bogus, and was removed by previous commit:
btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
We no longer need to run delayed refs because of this, and can remove
this flushing here.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fix 361048f586 ("Btrfs: fix full backref problem when inserting
shared block reference") added a delayed ref flushing at subvolume
creation time in order to avoid hitting this particular BUG_ON().
Before this fix, we were tripping the BUG_ON() by
1. Modify snapshot A, which creates blocks with a normal reference for
snapshot A, as A is the owner of these blocks. We now have delayed
refs for these blocks.
2. Create a snapshot of A named B, which pushes references for the
children blocks of the root node for the new root B, thus creating
more delayed refs for newly allocated blocks.
3. A is modified, and because the metadata blocks can now be shared, it
must push FULL_BACKREF references to the children of any block that A
COWs down it's path to its target key.
4. Delayed refs are run. Because these are newly allocated blocks, we
have ->must_insert_reserved reserved set on the delayed ref head, we
call into alloc_reserved_tree_block() to add the extent item, and
then add our ref. At the time of this fix, we were ordering
FULL_BACKREF delayed ref operations first, so we'd go to add this
reference and then BUG_ON() because we didn't have the FULL_BACKREF
flag set.
The patch fixed this problem by making sure we ran the delayed refs
before we had the chance to modify A. This meant that any *new* blocks
would have had their extent items created _before_ we would ever
actually COW down and generate FULL_BACKREF entries. Thus the problem
went away.
However this BUG_ON() is actually completely bogus. The existence of a
full backref doesn't necessarily mean that FULL_BACKREF must be set on
that block, it must only be set on the actual parent itself. Consider
the example provided above. If we COW down one path from A, any nodes
are going to have a FULL_BACKREF ref pushed down to _all_ of their
children, but not all of the children are going to have FULL_BACKREF
set. It is completely valid to have an extent item with normal and full
backrefs without FULL_BACKREF actually set on the block itself.
As a final note, I have been testing with the patch (applied after this
one)
btrfs: stop running all delayed refs during snapshot
which removed this flushing. My test was a torture test which did a lot
of operations while snapshotting and deleting snapshots as well as
relocation, and I never tripped this BUG_ON(). This is actually because
at the time of 361048f586, we ordered SHARED keys _before_ normal
references, and thus they would get run first. However currently they
are ordered _after_ normal references, so we'd do the initial creation
without having a shared reference, and thus not hit this BUG_ON(), which
explains why I didn't start hitting this problem during my testing with
my other patch applied.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The commit d672633545 ("btrfs: qgroup: Make snapshot accounting work
with new extent-oriented qgroup.") added a flush of the delayed refs
during snapshot creation in order to get the qgroup accounting properly.
However this code has changed and been moved to it's own helper that is
skipped if qgroups are turned off. Move the flushing to the helper, as
we do not need it when qgroups are turned off.
Also add a comment explaining why it exists, and why it doesn't actually
save us. This will be helpful later when we try to fix qgroup
accounting properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We try to pre-flush the delayed refs when committing, because we want to
do as little work as possible in the critical section of the transaction
commit.
However doing this twice can lead to very long transaction commit delays
as other threads are allowed to continue to generate more delayed refs,
which potentially delays the commit by multiple minutes in very extreme
cases.
So simply stick to one pre-flush, and then continue the rest of the
transaction commit.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously our delayed ref running used the total number of items as the
items to run. However we changed that to number of heads to run with
the delayed_refs_rsv, as generally we want to run all of the operations
for one bytenr.
But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the
number of items that we have. This is generally fine, but if we have
some operation generation loads of delayed refs while we're doing this
pre-flushing in the transaction commit, we'll just spin forever doing
delayed refs.
Fix this to simply pick the number of delayed refs we currently have,
that way we do not end up doing a lot of extra work that's being
generated in other threads.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>