In this round, we've mainly focused on performance tuning and critical bug fixes
occurred in low-end devices. Sheng Yong introduced lost_found feature to keep
missing files during recovery instead of thrashing them. We're preparing coming
fsverity implementation. And, we've got more features to communicate with users
for better performance. In low-end devices, some memory-related issues were
fixed, and subtle race condtions and corner cases were addressed as well.
Enhancement:
- large nat bitmaps for more free node ids
- add three block allocation policies to pass down write hints given by user
- expose extension list to user and introduce hot file extension
- tune small devices seamlessly for low-end devices
- set readdir_ra by default
- give more resources under gc_urgent mode regarding to discard and cleaning
- introduce fsync_mode to enforce posix or not
- nowait aio support
- add lost_found feature to keep dangling inodes
- reserve bits for future fsverity feature
- add test_dummy_encryption for FBE
Bug fix:
- don't use highmem for dentry pages
- align memory boundary for bitops
- truncate preallocated blocks in write errors
- guarantee i_times on fsync call
- clear CP_TRIMMED_FLAG correctly
- prevent node chain loop during recovery
- avoid data race between atomic write and background cleaning
- avoid unnecessary selinux violation warnings on resgid option
- GFP_NOFS to avoid deadlock in quota and read paths
- fix f2fs_skip_inode_update to allow i_size recovery
In addition to them, there are several minor bug fixes and clean-ups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAlrFmaYACgkQQBSofoJI
UNKaSw//RZraDQmODfDieyjGAd8y1NgGDju1jgk62QJUqe/G3jIKN/A39HU3I2Ak
jLIotwZ8j1yNBUGbXMXZTY+6sRkQGJrcgawgvZ0gwWSeqTAiCi+AbqnNysTv8pDp
Rh1iMCGvNKv1/vbFizJumcK2haH/KzX9Z7uDBjji4N7tTT1nUGVz+vmUO/qS6Xi0
44BrEN1rhaF71JNaiQE0WuEc0GY2IU0W7l4IWirJeNqsK0BTVPnq1vxc8SsQQRmn
Wdr4rUw2SRAynGMCQZlMW/sG0sxV0WMAE0xVE8WCFMXzHx2GNVakUgXMTZ2DgUKT
JlgdsB5NQYrZJqSEAdNq3WclJpznlhOLHp8hjITxSRxhd2j7GH7xujUoGSsNiXDr
n2s3Xg6wWMx+D4vsfXYJuZQKYqZUcBlPRgXa74/aXnsxRTLaormJtFNWkLqELIzR
Qaw7XIhZoOyKkH7CPSWk0eYyQVi13l9QV70iyBfY4bcN2aLZLxTX2Axwws3ESOCt
Nf0jIe3L4k7MSdoMR0K51rhCMlPWDuddnQSs9Vsw3w8U5pVph+BKG5RBUI9N4hPv
U0gKJfj8LIaxhHTfborEHf+hjq/0sN0TBPFBsAtqvv1LBXkMeWEeo0wXtxSJC89b
3xBRQ0p4Nox4fEHs36AZID2OMOV8iamiEqzP/o9ja0sMzEjnYmo=
=Ligi
-----END PGP SIGNATURE-----
Merge tag 'f2fs-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs update from Jaegeuk Kim:
"In this round, we've mainly focused on performance tuning and critical
bug fixes occurred in low-end devices. Sheng Yong introduced
lost_found feature to keep missing files during recovery instead of
thrashing them. We're preparing coming fsverity implementation. And,
we've got more features to communicate with users for better
performance. In low-end devices, some memory-related issues were
fixed, and subtle race condtions and corner cases were addressed as
well.
Enhancements:
- large nat bitmaps for more free node ids
- add three block allocation policies to pass down write hints given by user
- expose extension list to user and introduce hot file extension
- tune small devices seamlessly for low-end devices
- set readdir_ra by default
- give more resources under gc_urgent mode regarding to discard and cleaning
- introduce fsync_mode to enforce posix or not
- nowait aio support
- add lost_found feature to keep dangling inodes
- reserve bits for future fsverity feature
- add test_dummy_encryption for FBE
Bug fixes:
- don't use highmem for dentry pages
- align memory boundary for bitops
- truncate preallocated blocks in write errors
- guarantee i_times on fsync call
- clear CP_TRIMMED_FLAG correctly
- prevent node chain loop during recovery
- avoid data race between atomic write and background cleaning
- avoid unnecessary selinux violation warnings on resgid option
- GFP_NOFS to avoid deadlock in quota and read paths
- fix f2fs_skip_inode_update to allow i_size recovery
In addition to the above, there are several minor bug fixes and clean-ups"
* tag 'f2fs-for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (50 commits)
f2fs: remain written times to update inode during fsync
f2fs: make assignment of t->dentry_bitmap more readable
f2fs: truncate preallocated blocks in error case
f2fs: fix a wrong condition in f2fs_skip_inode_update
f2fs: reserve bits for fs-verity
f2fs: Add a segment type check in inplace write
f2fs: no need to initialize zero value for GFP_F2FS_ZERO
f2fs: don't track new nat entry in nat set
f2fs: clean up with F2FS_BLK_ALIGN
f2fs: check blkaddr more accuratly before issue a bio
f2fs: Set GF_NOFS in read_cache_page_gfp while doing f2fs_quota_read
f2fs: introduce a new mount option test_dummy_encryption
f2fs: introduce F2FS_FEATURE_LOST_FOUND feature
f2fs: release locks before return in f2fs_ioc_gc_range()
f2fs: align memory boundary for bitops
f2fs: remove unneeded set_cold_node()
f2fs: add nowait aio support
f2fs: wrap all options with f2fs_sb_info.mount_opt
f2fs: Don't overwrite all types of node to keep node chain
f2fs: introduce mount option for fsync mode
...
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABCAAGBQJawr05AAoJEPfTWPspceCmT2UP/1uuaqwzyl4VjFNb/k7KS7UM
+Cs/1HBlGomgMA8orDTGqtWqLRdR3z4RSh0+MvXTzQ78HpFVYz7CbDc9itHm+G9M
X0ypD4kF/JGCFb5cxk+x6qv28uO2nv4DP3+0hHqJWLH4UVJBWDY6bs4BPShsf9QB
I6XjioNMhoqylXgdOITLODJZz+TcChlJMDAqwhpJwh9TH1wjobleAZ6AdmCPfgi5
h0UCKMUKzcVJlNZwQUrzrs2cxcx9Uhunnbz7HK0ZV4n/FKFtDpGynFpQQ71pZxKe
Be0ZOBPCQvC3ykOM/egCIvC/e5y7FgrjORD6jxyu1PTwAugI5E1VYSMxHkXvgPAx
zOo9A7RT4GPO2tDQv+DbzNFpqeSAclTgSmr+/y1wmheBs8DiSt7MPVBiNM4zdCNv
NLk9z7IEjFhdmluSB/LbTb1aokypMb/q7QTLouPHdwGn80k7yrhFyLHgdjpNTQ2K
UHfHZvGxkOX6SmFhBNOtIFUkuSceenh64a0RkRle7filx+ImpbCVm2/GYi9zZNCu
EtctgzLbLmz40zMiyDaZS2bxBgGzfn6yf4xd9LsaAJPMhvZnmXogT0D9ctWXB0WU
mMaS7sOkLnNjnGkzF1fHkeiZ/oigrstJbe+CA7BtOdwxpWn6MZBgKEoFQ6iA2b3X
5J1axMgVH5LAsIEcEQVq
=RVhK
-----END PGP SIGNATURE-----
Merge tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:
"It's a pretty quiet round this time, which is nice. This contains:
- series from Bart, cleaning up the way we set/test/clear atomic
queue flags.
- series from Bart, fixing races between gendisk and queue
registration and removal.
- set of bcache fixes and improvements from various folks, by way of
Michael Lyle.
- set of lightnvm updates from Matias, most of it being the 1.2 to
2.0 transition.
- removal of unused DIO flags from Nikolay.
- blk-mq/sbitmap memory ordering fixes from Omar.
- divide-by-zero fix for BFQ from Paolo.
- minor documentation patches from Randy.
- timeout fix from Tejun.
- Alpha "can't write a char atomically" fix from Mikulas.
- set of NVMe fixes by way of Keith.
- bsg and bsg-lib improvements from Christoph.
- a few sed-opal fixes from Jonas.
- cdrom check-disk-change deadlock fix from Maurizio.
- various little fixes, comment fixes, etc from various folks"
* tag 'for-4.17/block-20180402' of git://git.kernel.dk/linux-block: (139 commits)
blk-mq: Directly schedule q->timeout_work when aborting a request
blktrace: fix comment in blktrace_api.h
lightnvm: remove function name in strings
lightnvm: pblk: remove some unnecessary NULL checks
lightnvm: pblk: don't recover unwritten lines
lightnvm: pblk: implement 2.0 support
lightnvm: pblk: implement get log report chunk
lightnvm: pblk: rename ppaf* to addrf*
lightnvm: pblk: check for supported version
lightnvm: implement get log report chunk helpers
lightnvm: make address conversions depend on generic device
lightnvm: add support for 2.0 address format
lightnvm: normalize geometry nomenclature
lightnvm: complete geo structure with maxoc*
lightnvm: add shorten OCSSD version in geo
lightnvm: add minor version to generic geometry
lightnvm: simplify geometry structure
lightnvm: pblk: refactor init/exit sequences
lightnvm: Avoid validation of default op value
lightnvm: centralize permission check for lightnvm ioctl
...
Pull trivial tree updates from Jiri Kosina.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
kfifo: fix inaccurate comment
tools/thermal: tmon: fix for segfault
net: Spelling s/stucture/structure/
edd: don't spam log if no EDD information is present
Documentation: Fix early-microcode.txt references after file rename
tracing: Block comments should align the * on each line
treewide: Fix typos in printk
GenWQE: Fix a typo in two comments
treewide: Align function definition open/close braces
do_chunk_alloc implements a loop checking whether there is a pending
chunk allocation and if so causes the caller do loop. Generally this
loop is executed only once, however testing with btrfs/072 on a single
core vm machines uncovered an extreme case where the system could loop
indefinitely. This is due to a missing cond_resched when loop which
doesn't give a chance to the previous chunk allocator finish its job.
The fix is to simply add the missing cond_resched.
Fixes: 6d74119f1a ("Btrfs: avoid taking the chunk_mutex in do_chunk_alloc")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If errors were returned by btrfs_next_leaf(), replay_dir_deletes needs
to bail out, otherwise @ret would be forced to be 0 after 'break;' and
the caller won't be aware of it.
Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
0, 1 and <0 can be returned by btrfs_next_leaf(), and when <0 is
returned, path->nodes[0] could be NULL, log_dir_items lacks such a
check for <0 and we may run into a null pointer dereference panic.
Fixes: e02119d5a7 ("Btrfs: Add a write ahead tree log to optimize synchronous operations")
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Here is the big set of char/misc driver patches for 4.17-rc1.
There are a lot of little things in here, nothing huge, but all
important to the different hardware types involved:
- thunderbolt driver updates
- parport updates (people still care...)
- nvmem driver updates
- mei updates (as always)
- hwtracing driver updates
- hyperv driver updates
- extcon driver updates
- and a handfull of even smaller driver subsystem and individual
driver updates
All of these have been in linux-next with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWsShSQ8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykNqwCfUbfvopswb1PesHCLABDBsFQChgoAniDa6pS9
kI8TN5MdLN85UU27Mkb6
=BzFR
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc updates from Greg KH:
"Here is the big set of char/misc driver patches for 4.17-rc1.
There are a lot of little things in here, nothing huge, but all
important to the different hardware types involved:
- thunderbolt driver updates
- parport updates (people still care...)
- nvmem driver updates
- mei updates (as always)
- hwtracing driver updates
- hyperv driver updates
- extcon driver updates
- ... and a handful of even smaller driver subsystem and individual
driver updates
All of these have been in linux-next with no reported issues"
* tag 'char-misc-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (149 commits)
hwtracing: Add HW tracing support menu
intel_th: Add ACPI glue layer
intel_th: Allow forcing host mode through drvdata
intel_th: Pick up irq number from resources
intel_th: Don't touch switch routing in host mode
intel_th: Use correct method of finding hub
intel_th: Add SPDX GPL-2.0 header to replace GPLv2 boilerplate
stm class: Make dummy's master/channel ranges configurable
stm class: Add SPDX GPL-2.0 header to replace GPLv2 boilerplate
MAINTAINERS: Bestow upon myself the care for drivers/hwtracing
hv: add SPDX license id to Kconfig
hv: add SPDX license to trace
Drivers: hv: vmbus: do not mark HV_PCIE as perf_device
Drivers: hv: vmbus: respect what we get from hv_get_synint_state()
/dev/mem: Avoid overwriting "err" in read_mem()
eeprom: at24: use SPDX identifier instead of GPL boiler-plate
eeprom: at24: simplify the i2c functionality checking
eeprom: at24: fix a line break
eeprom: at24: tweak newlines
eeprom: at24: refactor at24_probe()
...
Here is the big set of tty and serial driver patches for 4.17-rc1
Not all that big really, most are just small fixes and additions to
existing drivers. There's a bunch of work on the imx serial driver
recently for some reason, and a new embedded serial driver added as
well.
Full details are in the shortlog.
All of these have been in the linux-next tree for a while with no
reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCWsSn+w8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ynb6wCdEf5dAUrSB37ptZY78n4kc6nI6NAAniDO+rjL
ppZZp7QTIB/bnPfW8cOH
=+tfY
-----END PGP SIGNATURE-----
Merge tag 'tty-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here is the big set of tty and serial driver patches for 4.17-rc1
Not all that big really, most are just small fixes and additions to
existing drivers. There's a bunch of work on the imx serial driver
recently for some reason, and a new embedded serial driver added as
well.
Full details are in the shortlog.
All of these have been in the linux-next tree for a while with no
reported issues"
* tag 'tty-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (66 commits)
serial: expose buf_overrun count through proc interface
serial: mvebu-uart: fix tx lost characters
tty: serial: msm_geni_serial: Fix return value check in qcom_geni_serial_probe()
tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP
8250-men-mcb: add support for 16z025 and 16z057
powerpc: Mark the variable earlycon_acpi_spcr_enable maybe_unused
serial: stm32: fix initialization of RS485 mode
ARM: dts: STi: Remove "console=ttyASN" from bootargs for STi boards
vt: change SGR 21 to follow the standards
serdev: Fix typo in serdev_device_alloc
ARM: dts: STi: Fix aliases property name for STi boards
tty: st-asc: Update tty alias
serial: stm32: add support for RS485 hardware control mode
dt-bindings: serial: stm32: add RS485 optional properties
selftests: add devpts selftests
devpts: comment devpts_mntget()
devpts: resolve devpts bind-mounts
devpts: hoist out check for DEVPTS_SUPER_MAGIC
serial: 8250: Add Nuvoton NPCM UART
serial: mxs-auart: disable clks of Alphascale ASM9260
...
The parameter *old_lprops* is never used in lpt_heap_replace(),
remove it to avoid compile warning.
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
Constify struct ubifs_lprops in scan_for_leb_for_idx to be
consistent with other references.
Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Richard Weinberger <richard@nod.at>
Assigning a value of a variable to itself is not useful. This
fixes a warning shown when using clang:
warning: explicitly assigning value of variable of type 'int' to itself [-Wself-assign]
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Richard Weinberger <richard@nod.at>
If ubifs_wbuf_sync() fails we must not write a master node with the
dirty marker cleared.
Otherwise it is possible that in case of an IO error while syncing we
mark the filesystem as clean and UBIFS refuses to recover upon next
mount.
Cc: <stable@vger.kernel.org>
Fixes: 1e51764a3c ("UBIFS: add new flash file system")
Signed-off-by: Richard Weinberger <richard@nod.at>
robust against maliciously crafted file system images. (I still don't
recommend that container folks hold any delusions that mounting
arbitary images that can be crafted by malicious attackers should be
considered sane thing to do, though!)
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlrE13oACgkQ8vlZVpUN
gaM/fQf/bw5IDWNZ9vAYNEBDsu4Je14XVTZVgpDFESk5BGjdVusAv562QCQjC3Pf
O3uqRSh7DbulICzYGy0BmjM7Gw0Kj5WhfDjPbVXhnGoNpF7NUEhNos8xWLRWukky
P8mPxv8cj2oprtfM1Cy8mtl0KT8jywkOahq6OmRo6+F1V32IzQAfpPpItIw9lCv8
JcFgc/2K9zjAG89TR/vWChi8AGhNUXsYEYt+l3tMu+CRC+FWaJ7aEGPn9pmOlqOV
svHG9skImvINOO4FPydOu+yw9spqFKBX9NO0J9MsCAHmmmKW2GW3RANGIhY9kgBb
a/T+/cQZTTc+Xu6VzRf4e/SGUa5mgA==
=HndN
-----END PGP SIGNATURE-----
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"Cleanups and bugfixes for ext4, including some fixes to make ext4 more
robust against maliciously crafted file system images.
(I still don't recommend that container folks hold any delusions that
mounting arbitary images that can be crafted by malicious attackers
should be considered sane thing to do, though!)"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (29 commits)
ext4: force revalidation of directory pointer after seekdir(2)
ext4: add extra checks to ext4_xattr_block_get()
ext4: add bounds checking to ext4_xattr_find_entry()
ext4: move call to ext4_error() into ext4_xattr_check_block()
ext4: don't show data=<mode> option if defaulted
ext4: omit init_itable=n in procfs when disabled
ext4: show more binary mount options in procfs
ext4: simplify kobject usage
ext4: remove unused parameters in sysfs code
ext4: null out kobject* during sysfs cleanup
ext4: don't allow r/w mounts if metadata blocks overlap the superblock
ext4: always initialize the crc32c checksum driver
ext4: fail ext4_iget for root directory if unallocated
ext4: limit xattr size to INT_MAX
ext4: add validity checks for bitmap block numbers
ext4: fix comments in ext4_swap_extents()
ext4: use generic_writepages instead of __writepage/write_cache_pages
ext4: don't complain about incorrect features when probing
ext4: remove EXT4_STATE_DIOREAD_LOCK flag
ext4: fix offset overflow on 32-bit archs in ext4_iomap_begin()
...
-----BEGIN PGP SIGNATURE-----
iQGwBAABCAAaBQJaxRzZExxzbWZyZW5jaEBnbWFpbC5jb20ACgkQiiy9cAdyT1Fb
FQv/Rd/5CrYhZumBrPvFW2jcbRQ2ANTnSRTA3rpd/jJM52DZ7nvcePr/qm9wRLrT
puMfd8e0a4Df5Mo/ns806iphRtYctKpMKLnkBqPL0WqrXLYSi/Nz3wy/DFyuh3C7
U22gDYAjQ4dy6Am0CG/y4i1h8D0hRkmMS6PQECpjmNwqjtmfZn5kWJRv+W5UNNj9
QPldz5PdyNpPw7DxDRetl5uGqUKqsvUATo109hL7ks97qgHUzMHeXWmQpOSS+exh
P7tPNphIPJYM2VG+uDvIg15l00lgQxzzN0uOs+x7ZDnZ1Bil/a3So823SfJyXNU2
utJNWSuN/OSHUCmmd7yn+rLd2oJa55+U+Bb3gWaZ8beP639d8P1kEF/isJzu6ede
gh92lyU2ecfyHNbjKzAbQwwQxnkmMC5XGhP/2+eawsCyo+vk/NQR4NIehqawj/OB
eQSRnT0vNgi4p4OXgJ3FpBBKNBFtTmWrxqKyl0U9C5nw+YdxBdkr16qlFi2b9DjC
bW0/
=VW5l
-----END PGP SIGNATURE-----
Merge tag '4.17-SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"Includes SMB3.11 security improvements, as well as various fixes for
stable and some debugging improvements"
* tag '4.17-SMB3-Fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Add minor debug message during negprot
smb3: Fix root directory when server returns inode number of zero
cifs: fix sparse warning on previous patch in a few printks
cifs: add server->vals->header_preamble_size
cifs: smbd: disconnect transport on RDMA errors
cifs: smbd: avoid reconnect lockup
Don't log confusing message on reconnect by default
Don't log expected error on DFS referral request
fs: cifs: Replace _free_xid call in cifs_root_iget function
SMB3.1.1 dialect is no longer experimental
Tree connect for SMB3.1.1 must be signed for non-encrypted shares
fix smb3-encryption breakage when CONFIG_DEBUG_SG=y
CIFS: fix sha512 check in cifs_crypto_secmech_release
CIFS: implement v3.11 preauth integrity
CIFS: add sha512 secmech
CIFS: refactor crypto shash/sdesc allocation&free
Update README file for cifs.ko
Update TODO list for cifs.ko
cifs: fix memory leak in SMB2_open()
CIFS: SMBD: fix spelling mistake: "faield" and "legnth"
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAlqKKI0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGRNAH/0v3+nuJ0oiHE1Cl
fH89F9Ma17j8oTo28byRPi7X5XJfJAqANhHa209rguvnC27y3ew/l9k93HoxG12i
ttvyKFDQulQbytfJZXw8lhUyYGXVsTpyNaihPe/NtqPdIxNgfrXsUN9EIEtcnuS2
SiAj51jUySDRNR4ST6TOx4ulDm1zLrmA28WHOBNOTvDi4jTQMt1TsngHfF5AySBB
lD4RTRDDwWDWtdMI7euYSq019TiDXCxmwQ94vZjrqmjmSQcl/yCK/JzEV33SZslg
4WqGIllxONvP/UlwxZLaJ+RrslqxNgDVqQKwJdfYhGaWvpgPFtS1s86zW6IgyXny
02jJfD0=
=DLWn
-----END PGP SIGNATURE-----
Merge tag 'v4.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into mtd/next
Backmerge v4.16-rc2 into mtd/next to resolve a conflict between Linus'
master branch and nand/for-4.17.
1. Abhi Das contributed a patch to report journal recovery times more
accurately during journal replay.
2. Andreas Gruenbacher contributed a patch to fix fallocate chunk size.
3. Andreas added a patch to correctly dirty inodes during rename.
4. Andreas added a patch to improve the comment for function gfs2_block_map.
5. Andreas added a patch to improve kernel trace point iomap end:
The physical block address was added.
6. Andreas added a patch to fix a nasty file system corruption bug that
surfaced in xfstests 476 in punch-hole/truncate.
7. Andreas fixed a problem Christoph Helwig pointed out, namely, that GFS2
was misusing the IOMAP_ZERO flag. The zeroing of new blocks was moved
to the proper fallocate code.
8. I contributed a patch to declare function gfs2_remove_from_ail as static.
9. I added a patch to only set PageChecked for jdata page writes.
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJaw7mxAAoJENeLYdPf93o7kbUH/3MVpJKJe+jdtb5zMGE+byPZ
T4x+kjZcGwIjWtqIMnga2X1mMIpJ/MCQzZriT+zno2BRRxPS8MQsKp9rcmEGiwoP
mKZnVGm85xvBDg/JaAnCOeH2S90BlXX3roaB4U4031tAmcZOz/tVbBiTCBXbZ7FP
F9tnKt/CkRwBBE8CBvstpgYYvbM+kkfiQ8x38FJFx79lOBytIRhPCHpbR3mgko+p
HrJNTadGJie65nsplKZcQpDQTPrslwP/ynyGh313ReztThHfCB6teL6sNOgJ5u+T
ThCA95QesH3P45Dd+xl/SuCDd0pZLLslNgOl/q7qiFWaoxFFBKldN3Bnc7ItnDw=
=iw4e
-----END PGP SIGNATURE-----
Merge tag 'gfs2-4.17.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Bob Peterson:
"We've only got nine GFS2 patches for this merge window:
- report journal recovery times more accurately during journal replay
(Abhi Das)
- fix fallocate chunk size (Andreas Gruenbacher)
- correctly dirty inodes during rename (Andreas Gruenbacher)
- improve the comment for function gfs2_block_map (Andreas
Gruenbacher)
- improve kernel trace point iomap end: The physical block address
was added (Andreas Gruenbacher)
- fix a nasty file system corruption bug that surfaced in xfstests
476 in punch-hole/truncate (Andreas Gruenbacher)
- fix a problem Christoph Helwig pointed out, namely, that GFS2 was
misusing the IOMAP_ZERO flag. The zeroing of new blocks was moved
to the proper fallocate code (Andreas Gruenbacher)
- declare function gfs2_remove_from_ail as static (Bob Peterson)
- only set PageChecked for jdata page writes (Bob Peterson)"
* tag 'gfs2-4.17.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: time journal recovery steps accurately
gfs2: Zero out fallocated blocks in fallocate_chunk
gfs2: Check for the end of metadata in punch_hole
gfs2: gfs2_iomap_end tracepoint: log block address
gfs2: Improve gfs2_block_map comment
GFS2: Only set PageChecked for jdata pages
GFS2: Make function gfs2_remove_from_ail static
gfs2: Dirty source inode during rename
gfs2: Fix fallocate chunk size
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAlrDg0UACgkQxWXV+ddt
WDvtOQ//QCk0zH2EcPQnqOW6HqAfkkDc7D9P51sK1izNM3vBEYtbuPlY6wp3xnJr
0hbPjGNU7vMC/4SIwSgEdXyAvlpOr1gm9n+w1GcxpkjLa8l4P+2wt9OX0BSzRUMu
X7LQxqg2zmQibFy4b1MSmDGsO2dxB2eqvVUT/Ir4b56uqkdtValYRWY75APJIZ5l
6w0Ja3HVvgOX3pVwSmadCpfMEonN4JE+mfHaP8RajAlTGQcUPq8If9w4BtEoWQRl
QC7kUCCTmp+isnzH7u4EqEQC6XUqEqeuQH+Bli1pNYTvipHY+9EO1dSxHZoCelgk
M9PpQz8x+N6ZMcMNtJQVifkfN6tAp/acdWBTZtlpqB8nZR4v5bBndS/5TNBMu3/v
JfhMEsIUz5o2mWz9qUneyK80RoTRkL5SfdgOclx2Yd7K1fKuJsUjmdXaT08BzotS
5cxTFZu7EcFJyg4eHemjdyRYr3cUkS19P2uIJle9nj3DAMpCvIyL1c1vn5eB/7MN
3JeRME6AOQcD7sFgVNuYhGVdIBuwHU6kj4mf1WN27YKGdbsaMZsFz7/HH3SsBLOF
E0p7Q25HcHSrAimcmLTifN+gol8Y5m6dT0Pjuf9y9QbWvHwh7oRq7DVQ3L8hJkGz
j64b0jlv43P0QorWvsX6/VJBevWedhe1hZtrBhPFSE0CtJ3Ml4Q=
=wkSk
-----END PGP SIGNATURE-----
Merge tag 'for-4.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are a several user visible changes, the rest is mostly invisible
and continues to clean up the whole code base.
User visible changes:
- new mount option nossd_spread (pair for ssd_spread)
- mount option subvolid will detect junk after the number and fail
the mount
- add message after cancelled device replace
- direct module dependency on libcrc32, removed own crc wrappers
- removed user space transaction ioctls
- use lighter locking when reading /proc/self/mounts, RCU instead of
mutex to avoid unnecessary contention
Enhancements:
- skip writeback of last page when truncating file to same size
- send: do not issue unnecessary truncate operations
- mount option token specifiers: use %u for unsigned values, more
validation
- selftests: more tree block validations
qgroups:
- preparatory work for splitting reservation types for data and
metadata, this should allow for more accurate tracking and fix some
issues with underflows or do further enhancements
- split metadata reservations for started and joined transaction so
they do not get mixed up and are accounted correctly at commit time
- with the above, it's possible to revert patch that potentially
deadlocks when trying to make more space by explicitly committing
when the quota limit is hit
- fix root item corruption when multiple same source snapshots are
created with quota enabled
RAID56:
- make sure target is identical to source when raid56 rebuild fails
after dev-replace
- faster rebuild during scrub, batch by stripes and not
block-by-block
- make more use of cached data when rebuilding from a missing device
Fixes:
- null pointer deref when device replace target is missing
- fix fsync after hole punching when using no-holes feature
- fix lockdep splat when allocating percpu data with wrong GFP flags
Cleanups, refactoring, core changes:
- drop redunant parameters from various functions
- kill and opencode trivial helpers
- __cold/__exit function annotations
- dead code removal
- continued audit and documentation of memory barriers
- error handling: handle removal from uuid tree
- error handling: remove handling of impossible condtitons
- more debugging or error messages
- updated tracepoints
- one VLA use removal (and one still left)"
* tag 'for-4.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (164 commits)
btrfs: lift errors from add_extent_changeset to the callers
Btrfs: print error messages when failing to read trees
btrfs: user proper type for btrfs_mask_flags flags
btrfs: split dev-replace locking helpers for read and write
btrfs: remove stale comments about fs_mutex
btrfs: use RCU in btrfs_show_devname for device list traversal
btrfs: update barrier in should_cow_block
btrfs: use lockdep_assert_held for mutexes
btrfs: use lockdep_assert_held for spinlocks
btrfs: Validate child tree block's level and first key
btrfs: tests/qgroup: Fix wrong tree backref level
Btrfs: fix copy_items() return value when logging an inode
Btrfs: fix fsync after hole punching when using no-holes feature
btrfs: use helper to set ulist aux from a qgroup
Revert "btrfs: qgroups: Retry after commit on getting EDQUOT"
btrfs: qgroup: Update trace events for metadata reservation
btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space
btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item
btrfs: qgroup: Use separate meta reservation type for delalloc
btrfs: qgroup: Introduce function to convert META_PREALLOC into META_PERTRANS
...
- Various cleanups and code fixes
- Implement lazytime as a mount option
- Convert various on-disk metadata checks from asserts to -EFSCORRUPTED
- Fix accounting problems with the rmap per-ag reservations
- Refactorings and cleanups for xfs_log_force
- Various bugfixes for the reflink code
- Work around v5 AGFL padding problems to prevent fs shutdowns
- Establish inode fork verifiers to inspect on-disk metadata correctness
- Various online scrub fixes
- Fix v5 swapext blowing up on deleted inodes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCgAGBQJawZs5AAoJEPh/dxk0SrTrUbAQAKCT0zaYDHViC6p0yxVMTa1z
7fivnwtKNYc2LiihV6wPp+Hj5YtTGExYncJOLuTsAIuBZ6px+jlV9bpA8X9mWgbN
e5XXyqz1O8nn/5iBwKRQm2yFdSnsQQfWXNm0XPNTuPGxuzlzxF/rpFN4UlWGdZul
tigHom5gZD//GYfYHrsOb/7CIRGw90ebpqM3Nt4eAi5o0H5eK46sHKUYtAngSfPm
FdPHJwmw5Kx+yZW5EdR+ELbLqGsBKsOfsp9SG+un0R+kvj/CKC2ovgwS6tuU+gsi
MRD8C0zHlz4ikQrmJ0bV+no7T+9bC8fQDIZu0h7dQ1acWb2F1Epr1LRIxNH/1bLi
qbtchVZkCNXiV0GMQ2iNo1cDJO3AICsQwTuktpoUMU1QOWgQenvzdZCUOQAUqne6
xwnrCq19UbmNlCdkRWChrVn9Gb7FNYVhe15W/y0qZhzJxWam6yIzKBm91Zc/XLp8
L5VUc+FVmtSiHXpEVttSwVeMSzhDfG6qOL42dFmw7xwh7JO/vXi0MlxjGe215ApS
lhBWjEOGB9kbUxMjhqS5KsFn8E1DhL0AMD7N53z7eBTh5Eani81ytf1PzXWhvLbI
1auY0+7cVggXFltcW6rfAJFC0EEuw6wsx86rl3G+dQ9vmlhy4zaWlt0EJEGmNC90
Kw4GpFLDmtV93K++lD1C
=fdIf
-----END PGP SIGNATURE-----
Merge tag 'xfs-4.17-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs updates from Darrick Wong:
"Here's the first round of fixes for XFS for 4.17.
The biggest new features this time around are the addition of lazytime
support, further enhancement of the on-disk inode metadata verifiers,
and a patch to smooth over some of the AGFL padding problems that have
intermittently plagued users since 4.5. I forsee sending a second pull
request next week with further bug fixes and speedups in the online
scrub code and elsewhere.
This series has been run through a full xfstests run over the weekend
and through a quick xfstests run against this morning's master, with
no major failures reported.
Summary of changes for this release:
- Various cleanups and code fixes
- Implement lazytime as a mount option
- Convert various on-disk metadata checks from asserts to -EFSCORRUPTED
- Fix accounting problems with the rmap per-ag reservations
- Refactorings and cleanups for xfs_log_force
- Various bugfixes for the reflink code
- Work around v5 AGFL padding problems to prevent fs shutdowns
- Establish inode fork verifiers to inspect on-disk metadata
correctness
- Various online scrub fixes
- Fix v5 swapext blowing up on deleted inodes"
* tag 'xfs-4.17-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (49 commits)
xfs: do not log/recover swapext extent owner changes for deleted inodes
xfs: clean up xfs_mount allocation and dynamic initializers
xfs: remove dead inode version setting code
xfs: catch inode allocation state mismatch corruption
xfs: xfs_scrub_iallocbt_xref_rmap_inodes should use xref_set_corrupt
xfs: flag inode corruption if parent ptr doesn't get us a real inode
xfs: don't accept inode buffers with suspicious unlinked chains
xfs: move inode extent size hint validation to libxfs
xfs: record inode buf errors as a xref error in inobt scrubber
xfs: remove xfs_buf parameter from inode scrub methods
xfs: inode scrubber shouldn't bother with raw checks
xfs: bmap scrubber should do rmap xref with bmap for sparse files
xfs: refactor inode buffer verifier error logging
xfs: refactor inode verifier error logging
xfs: refactor bmap record validation
xfs: sanity-check the unused space before trying to use it
xfs: detect agfl count corruption and reset agfl
xfs: unwind the try_again loop in xfs_log_force
xfs: refactor xfs_log_force_lsn
xfs: minor cleanup for xfs_reflink_end_cow
...
Pull vfs dcache updates from Al Viro:
"Part of this is what the trylock loop elimination series has turned
into, part making d_move() preserve the parent (and thus the path) of
victim, plus some general cleanups"
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits)
d_genocide: move export to definition
fold dentry_lock_for_move() into its sole caller and clean it up
make non-exchanging __d_move() copy ->d_parent rather than swap them
oprofilefs: don't oops on allocation failure
lustre: get rid of pointless casts to struct dentry *
debugfs_lookup(): switch to lookup_one_len_unlocked()
fold lookup_real() into __lookup_hash()
take out orphan externs (empty_string/slash_string)
split d_path() and friends into a separate file
dcache.c: trim includes
fs/dcache: Avoid a try_lock loop in shrink_dentry_list()
get rid of trylock loop around dentry_kill()
handle move to LRU in retain_dentry()
dput(): consolidate the "do we need to retain it?" into an inlined helper
split the slow part of lock_parent() off
now lock_parent() can't run into killed dentry
get rid of trylock loop in locking dentries on shrink list
d_delete(): get rid of trylock loop
fs/dcache: Move dentry_kill() below lock_parent()
fs/dcache: Remove stale comment from dentry_kill()
...
Attach copies of the index key and auxiliary data to the fscache cookie so
that:
(1) The callbacks to the netfs for this stuff can be eliminated. This
can simplify things in the cache as the information is still
available, even after the cache has relinquished the cookie.
(2) Simplifies the locking requirements of accessing the information as we
don't have to worry about the netfs object going away on us.
(3) The cache can do lazy updating of the coherency information on disk.
As long as the cache is flushed before reboot/poweroff, there's no
need to update the coherency info on disk every time it changes.
(4) Cookies can be hashed or put in a tree as the index key is easily
available. This allows:
(a) Checks for duplicate cookies can be made at the top fscache layer
rather than down in the bowels of the cache backend.
(b) Caching can be added to a netfs object that has a cookie if the
cache is brought online after the netfs object is allocated.
A certain amount of space is made in the cookie for inline copies of the
data, but if it won't fit there, extra memory will be allocated for it.
The downside of this is that live cache operation requires more memory.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Anna Schumaker <anna.schumaker@netapp.com>
Tested-by: Steve Dickson <steved@redhat.com>
Add more tracepoints to fscache, including:
(*) fscache_page - Tracks netfs pages known to fscache.
(*) fscache_check_page - Tracks the netfs querying whether a page is
pending storage.
(*) fscache_wake_cookie - Tracks cookies being woken up after a page
completes/aborts storage in the cache.
(*) fscache_op - Tracks operations being initialised.
(*) fscache_wrote_page - Tracks return of the backend write_page op.
(*) fscache_gang_lookup - Tracks lookup of pages to be stored in the write
operation.
Signed-off-by: David Howells <dhowells@redhat.com>
Add some tracepoints to fscache:
(*) fscache_cookie - Tracks a cookie's usage count.
(*) fscache_netfs - Logs registration of a network filesystem, including
the pointer to the cookie allocated.
(*) fscache_acquire - Logs cookie acquisition.
(*) fscache_relinquish - Logs cookie relinquishment.
(*) fscache_enable - Logs enablement of a cookie.
(*) fscache_disable - Logs disablement of a cookie.
(*) fscache_osm - Tracks execution of states in the object state machine.
and cachefiles:
(*) cachefiles_ref - Tracks a cachefiles object's usage count.
(*) cachefiles_lookup - Logs result of lookup_one_len().
(*) cachefiles_mkdir - Logs result of vfs_mkdir().
(*) cachefiles_create - Logs result of vfs_create().
(*) cachefiles_unlink - Logs calls to vfs_unlink().
(*) cachefiles_rename - Logs calls to vfs_rename().
(*) cachefiles_mark_active - Logs an object becoming active.
(*) cachefiles_wait_active - Logs a wait for an old object to be
destroyed.
(*) cachefiles_mark_inactive - Logs an object becoming inactive.
(*) cachefiles_mark_buried - Logs the burial of an object.
Signed-off-by: David Howells <dhowells@redhat.com>
If the fscache asynchronous write operation elects to discard a page that's
pending storage to the cache because the page would be over the store limit
then it needs to wake the page as someone may be waiting on completion of
the write.
The problem is that the store limit may be updated by a different
asynchronous operation - and so may miss the write - and that the store
limit may not even get updated until later by the netfs.
Fix the kernel hang by making fscache_write_op() mark as written any pages
that are over the limit.
Signed-off-by: David Howells <dhowells@redhat.com>
The last parameter to fscache_op_complete() is a bool indicating whether or
not the operation was cancelled. A lot of the time the inverse value is
given or no differentiation is made. Fix this.
Signed-off-by: David Howells <dhowells@redhat.com>
Fix a couple of checker warnings in fscache and cachefiles:
(1) fscache_n_op_requeue is never used, so get rid of it.
(2) cachefiles_uncache_page() is passed in a lock that it releases, so
this needs annotating.
Signed-off-by: David Howells <dhowells@redhat.com>
When relinquishing cookies, either due to iget failure or to inode
eviction, retire a cookie if we think the corresponding vnode got deleted
on the server rather than just letting it lie in the cache.
Signed-off-by: David Howells <dhowells@redhat.com>
AFS vnodes (files) are referenced by a triplet of { volume ID, vnode ID,
uniquifier }. Currently, kafs is only using the vnode ID as the file key
in the volume fscache index and checking the uniquifier on cookie
acquisition against the contents of the auxiliary data stored in the cache.
Unfortunately, this is subject to a race in which an FS.RemoveFile or
FS.RemoveDir op is issued against the server but the local afs inode isn't
torn down and disposed off before another thread issues something like
FS.CreateFile. The latter then gets given the vnode ID that just got
removed, but with a new uniquifier and a cookie collision occurs in the
cache because the cookie is only keyed on the vnode ID whereas the inode is
keyed on the vnode ID plus the uniquifier.
Fix this by keying the cookie on the uniquifier in addition to the vnode ID
and dropping the uniquifier from the auxiliary data supplied.
Signed-off-by: David Howells <dhowells@redhat.com>
Invalidate any data stored in fscache for a vnode that changes on the
server so that we don't end up with the cache in a bad state locally.
Signed-off-by: David Howells <dhowells@redhat.com>
Must retrieve size before running filemap_fault so the kernel has
an up-to-date size.
This should have been caught by xfstests generic/246, but it was masked
by orangefs_new_inode, which set i_size to PAGE_SIZE. When nothing
caused a getattr prior to a pagefault, i_size was still PAGE_SIZE.
Since xfstests only read 10 bytes, it did not catch this bug.
When orangefs_new_inode was modified to perform a getattr instead,
i_size was set to zero, as it was a newly created file. Then
orangefs_file_write_iter did NOT set i_size. Instead it invalidated the
attribute cache, which should have caused the next caller to retrieve
i_size. But the fault handler did not know it was supposed to retrieve
i_size. So during xfstests, i_size was still zero, and filemap_fault
returned VM_FAULT_SIGBUS.
Fixes xfstests generic/452.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This fixes xfstests/generic/392.
The failure was caused by different times between 1) one marked in the last
fsync(2) call and 2) the other given by roll-forward recovery after power-cut.
The reason was that we skipped updating inode block at 1), since its i_size
was recoverable along with 4KB-aligned data writes, which was fixed by:
"f2fs: fix a wrong condition in f2fs_skip_inode_update"
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Pull workqueue updates from Tejun Heo:
"rcu_work addition and a couple trivial changes"
* 'for-4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: remove the comment about the old manager_arb mutex
workqueue: fix the comments of nr_idle
fs/aio: Use rcu_work instead of explicit rcu and work item
cgroup: Use rcu_work instead of explicit rcu and work item
RCU, workqueue: Implement rcu_work
Pull networking updates from David Miller:
1) Support offloading wireless authentication to userspace via
NL80211_CMD_EXTERNAL_AUTH, from Srinivas Dasari.
2) A lot of work on network namespace setup/teardown from Kirill Tkhai.
Setup and cleanup of namespaces now all run asynchronously and thus
performance is significantly increased.
3) Add rx/tx timestamping support to mv88e6xxx driver, from Brandon
Streiff.
4) Support zerocopy on RDS sockets, from Sowmini Varadhan.
5) Use denser instruction encoding in x86 eBPF JIT, from Daniel
Borkmann.
6) Support hw offload of vlan filtering in mvpp2 dreiver, from Maxime
Chevallier.
7) Support grafting of child qdiscs in mlxsw driver, from Nogah
Frankel.
8) Add packet forwarding tests to selftests, from Ido Schimmel.
9) Deal with sub-optimal GSO packets better in BBR congestion control,
from Eric Dumazet.
10) Support 5-tuple hashing in ipv6 multipath routing, from David Ahern.
11) Add path MTU tests to selftests, from Stefano Brivio.
12) Various bits of IPSEC offloading support for mlx5, from Aviad
Yehezkel, Yossi Kuperman, and Saeed Mahameed.
13) Support RSS spreading on ntuple filters in SFC driver, from Edward
Cree.
14) Lots of sockmap work from John Fastabend. Applications can use eBPF
to filter sendmsg and sendpage operations.
15) In-kernel receive TLS support, from Dave Watson.
16) Add XDP support to ixgbevf, this is significant because it should
allow optimized XDP usage in various cloud environments. From Tony
Nguyen.
17) Add new Intel E800 series "ice" ethernet driver, from Anirudh
Venkataramanan et al.
18) IP fragmentation match offload support in nfp driver, from Pieter
Jansen van Vuuren.
19) Support XDP redirect in i40e driver, from Björn Töpel.
20) Add BPF_RAW_TRACEPOINT program type for accessing the arguments of
tracepoints in their raw form, from Alexei Starovoitov.
21) Lots of striding RQ improvements to mlx5 driver with many
performance improvements, from Tariq Toukan.
22) Use rhashtable for inet frag reassembly, from Eric Dumazet.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1678 commits)
net: mvneta: improve suspend/resume
net: mvneta: split rxq/txq init and txq deinit into SW and HW parts
ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh
net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
net: bgmac: Correctly annotate register space
route: check sysctl_fib_multipath_use_neigh earlier than hash
fix typo in command value in drivers/net/phy/mdio-bitbang.
sky2: Increase D3 delay to sky2 stops working after suspend
net/mlx5e: Set EQE based as default TX interrupt moderation mode
ibmvnic: Disable irqs before exiting reset from closed state
net: sched: do not emit messages while holding spinlock
vlan: also check phy_driver ts_info for vlan's real device
Bluetooth: Mark expected switch fall-throughs
Bluetooth: Set HCI_QUIRK_SIMULTANEOUS_DISCOVERY for BTUSB_QCA_ROME
Bluetooth: btrsi: remove unused including <linux/version.h>
Bluetooth: hci_bcm: Remove DMI quirk for the MINIX Z83-4
sh_eth: kill useless check in __sh_eth_get_regs()
sh_eth: add sh_eth_cpu_data::no_xdfar flag
ipv6: factorize sk_wmem_alloc updates done by __ip6_append_data()
ipv4: factorize sk_wmem_alloc updates done by __ip_append_data()
...
We're neglecting to clear the umask after it's set, which can cause a
later unrelated rpc to (incorrectly) use the same umask if it happens to
be processed by the same thread.
There's a more subtle problem here too:
An NFSv4 compound request is decoded all in one pass before any
operations are executed.
Currently we're setting current->fs->umask at the time we decode the
compound. In theory a single compound could contain multiple creates
each setting a umask. In that case we'd end up using whichever umask
was passed in the *last* operation as the umask for all the creates,
whether that was correct or not.
So, we should just be saving the umask at decode time and waiting to set
it until we actually process the corresponding operation.
In practice it's unlikely any client would do multiple creates in a
single compound. And even if it did they'd likely be from the same
process (hence carry the same umask). So this is a little academic, but
we should get it right anyway.
Fixes: 47057abde5 (nfsd: add support for the umask attribute)
Cc: stable@vger.kernel.org
Reported-by: Lucash Stach <l.stach@pengutronix.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Move common code in NFSD's legacy SYMLINK decoders into a helper.
The immediate benefits include:
- one fewer data copies on transports that support DDP
- consistent error checking across all versions
- reduction of code duplication
- support for both legal forms of SYMLINK requests on RDMA
transports for all versions of NFS (in particular, NFSv2, for
completeness)
In the long term, this helper is an appropriate spot to perform a
per-transport call-out to fill the pathname argument using, say,
RDMA Reads.
Filling the pathname in the proc function also means that eventually
the incoming filehandle can be interpreted so that filesystem-
specific memory can be allocated as a sink for the pathname
argument, rather than using anonymous pages.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Move common code in NFSD's legacy NFS WRITE decoders into a helper.
The immediate benefit is reduction of code duplication and some nice
micro-optimizations (see below).
In the long term, this helper can perform a per-transport call-out
to fill the rq_vec (say, using RDMA Reads).
The legacy WRITE decoders and procs are changed to work like NFSv4,
which constructs the rq_vec just before it is about to call
vfs_writev.
Why? Calling a transport call-out from the proc instead of the XDR
decoder means that the incoming FH can be resolved to a particular
filesystem and file. This would allow pages from the backing file to
be presented to the transport to be filled, rather than presenting
anonymous pages and copying or flipping them into the file's page
cache later.
I also prefer using the pages in rq_arg.pages, instead of pulling
the data pages directly out of the rqstp::rq_pages array. This is
currently the way the NFSv3 write decoder works, but the other two
do not seem to take this approach. Fixing this removes the only
reference to rq_pages found in NFSD, eliminating an NFSD assumption
about how transports use the pages in rq_pages.
Lastly, avoid setting up the first element of rq_vec as a zero-
length buffer. This happens with an RDMA transport when a normal
Read chunk is present because the data payload is in rq_arg's
page list (none of it is in the head buffer).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
This helps record the identity and timing of the ops in each NFSv4
COMPOUND, replacing dprintk calls that did much the same thing.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4 read compound processing invokes nfsd_splice_read and
nfs_readv directly, so the trace points currently in nfsd_read are
not invoked for NFSv4 reads.
Move the NFSD READ trace points to common helpers so that NFSv4
reads are captured.
Also, record any local I/O error that occurs, the total count of
bytes that were actually returned, and whether splice or vectored
read was used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
NFSv4 write compound processing invokes nfsd_vfs_write directly. The
trace points currently in nfsd_write are not effective for NFSv4
writes.
Move the trace points into the shared nfsd_vfs_write() helper.
After the I/O, we also want to record any local I/O error that
might have occurred, and the total count of bytes that were actually
moved (rather than the requested number).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Follow naming convention used in client and in sunrpc layers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Byte count is more helpful to know than vector count.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
nfsd-1915 [003] 77915.780959: write_opened:
[FAILED TO PARSE] xid=3286130958 fh=0 offset=154624 len=1
nfsd-1915 [003] 77915.780960: write_io_done:
[FAILED TO PARSE] xid=3286130958 fh=0 offset=154624 len=1
nfsd-1915 [003] 77915.780964: write_done:
[FAILED TO PARSE] xid=3286130958 fh=0 offset=154624 len=1
Byte swapping and knfsd_fh_hash() are not available in "trace-cmd
report", where the print format string is actually used. These
data transformations have to be done during the TP_fast_assign step.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Use enum nfs_cb_opnum4 in decode_cb_op_status. This fixes warnings
seen with clang:
fs/nfsd/nfs4callback.c:451:36: warning: implicit conversion from
enumeration type 'enum nfs_cb_opnum4' to different enumeration
type 'enum nfs_opnum4' [-Wenum-conversion]
status = decode_cb_op_status(xdr, OP_CB_SEQUENCE, &cb->cb_seq_status);
~~~~~~~~~~~~~~~~~~~ ^~~~~~~~~~~~~~
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
fs/nfsd/nfs4state.c:926:8-9: WARNING: return of 0/1 in function 'nfs4_delegation_exists' with return type bool
fs/nfsd/nfs4state.c:2955:9-10: WARNING: return of 0/1 in function 'nfsd4_compound_in_session' with return type bool
Return statements in functions returning bool should use
true/false instead of 1/0.
Generated by: scripts/coccinelle/misc/boolreturn.cocci
Fixes: 68b18f5294 ("nfsd: make nfs4_get_existing_delegation less confusing")
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
[bfields: also fix -EAGAIN]
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Catch cases where extent unmap operations encounter pages that are
pinned / busy. Typically this is pinned pages that are under active dma.
This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
performing i/o.
Here is an example of a collision that this implementation catches:
WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
[..]
Call Trace:
__dax_invalidate_mapping_entry+0x6c/0xf0
dax_delete_mapping_entry+0xf/0x20
truncate_exceptional_pvec_entries.part.12+0x1af/0x200
truncate_inode_pages_range+0x268/0x970
? tlb_gather_mmu+0x10/0x20
? up_write+0x1c/0x40
? unmap_mapping_range+0x73/0x140
xfs_free_file_space+0x1b6/0x5b0 [xfs]
? xfs_file_fallocate+0x7f/0x320 [xfs]
? down_write_nested+0x40/0x70
? xfs_ilock+0x21d/0x2f0 [xfs]
xfs_file_fallocate+0x162/0x320 [xfs]
? rcu_read_lock_sched_held+0x3f/0x70
? rcu_sync_lockdep_assert+0x2a/0x50
? __sb_start_write+0xd0/0x1b0
? vfs_fallocate+0x20c/0x270
vfs_fallocate+0x154/0x270
SyS_fallocate+0x43/0x80
entry_SYSCALL_64_fastpath+0x1f/0x96
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.
Reviewed-by: Jan Kara <jack@suse.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
"System calls are interaction points between userspace and the kernel.
Therefore, system call functions such as sys_xyzzy() or
compat_sys_xyzzy() should only be called from userspace via the
syscall table, but not from elsewhere in the kernel.
At least on 64-bit x86, it will likely be a hard requirement from
v4.17 onwards to not call system call functions in the kernel: It is
better to use use a different calling convention for system calls
there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
which then hands processing over to the actual syscall function. This
means that only those parameters which are actually needed for a
specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the
time (which may cause serious trouble down the call chain). Those
x86-specific patches will be pushed through the x86 tree in the near
future.
Moreover, rules on how data may be accessed may differ between kernel
data and user data. This is another reason why calling sys_xyzzy() is
generally a bad idea, and -- at most -- acceptable in arch-specific
code.
This patchset removes all in-kernel calls to syscall functions in the
kernel with the exception of arch/. On top of this, it cleans up the
three places where many syscalls are referenced or prototyped, namely
kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
bpf: whitelist all syscalls for error injection
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
kernel/sys_ni: sort cond_syscall() entries
syscalls/x86: auto-create compat_sys_*() prototypes
syscalls: sort syscall prototypes in include/linux/compat.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
kexec: move sys_kexec_load() prototype to syscalls.h
x86/sigreturn: use SYSCALL_DEFINE0
x86: fix sys_sigreturn() return type to be long, not unsigned long
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
...
This removes the entire architecture code for blackfin, cris, frv, m32r,
metag, mn10300, score, and tile, including the associated device drivers.
I have been working with the (former) maintainers for each one to ensure
that my interpretation was right and the code is definitely unused in
mainline kernels. Many had fond memories of working on the respective
ports to start with and getting them included in upstream, but also saw
no point in keeping the port alive without any users.
In the end, it seems that while the eight architectures are extremely
different, they all suffered the same fate: There was one company
in charge of an SoC line, a CPU microarchitecture and a software
ecosystem, which was more costly than licensing newer off-the-shelf
CPU cores from a third party (typically ARM, MIPS, or RISC-V). It seems
that all the SoC product lines are still around, but have not used the
custom CPU architectures for several years at this point. In contrast,
CPU instruction sets that remain popular and have actively maintained
kernel ports tend to all be used across multiple licensees.
The removal came out of a discussion that is now documented at
https://lwn.net/Articles/748074/. Unlike the original plans, I'm not
marking any ports as deprecated but remove them all at once after I made
sure that they are all unused. Some architectures (notably tile, mn10300,
and blackfin) are still being shipped in products with old kernels,
but those products will never be updated to newer kernel releases.
After this series, we still have a few architectures without mainline
gcc support:
- unicore32 and hexagon both have very outdated gcc releases, but the
maintainers promised to work on providing something newer. At least
in case of hexagon, this will only be llvm, not gcc.
- openrisc, risc-v and nds32 are still in the process of finishing their
support or getting it added to mainline gcc in the first place.
They all have patched gcc-7.3 ports that work to some degree, but
complete upstream support won't happen before gcc-8.1. Csky posted
their first kernel patch set last week, their situation will be similar.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJawdL2AAoJEGCrR//JCVInuH0P/RJAZh1nTD+TR34ZhJq2TBoo
PgygwDU7Z2+tQVU+EZ453Gywz9/NMRFk1RWAZqrLix4ZtyIMvC6A1qfT2yH1Y7Fb
Qh6tccQeLe4ezq5u4S/46R/fQXu3Txr92yVwzJJUuPyU0arF9rv5MmI8e6p7L1en
yb74kSEaCe+/eMlsEj1Cc1dgthDNXGKIURHkRsILoweysCpesjiTg4qDcL+yTibV
FP2wjVbniKESMKS6qL71tiT5sexvLsLwMNcGiHPj94qCIQuI7DLhLdBVsL5Su6gI
sbtgv0dsq4auRYAbQdMaH1hFvu6WptsuttIbOMnz2Yegi2z28H8uVXkbk2WVLbqG
ZESUwutGh8MzOL2RJ4jyyQq5sfo++CRGlfKjr6ImZRv03dv0pe/W85062cK5cKNs
cgDDJjGRorOXW7dyU6jG2gRqODOQBObIv3w5efdq5OgzOWlbI4EC+Y5u1Z0JF/76
pSwtGXA6YhwC+9LLAlnVTHG+yOwuLmAICgoKcTbzTVDKA2YQZG/cYuQfI5S1wD8e
X6urPx3Md2GCwLXQ9mzKBzKZUpu/Tuhx0NvwF4qVxy6x1PELjn68zuP7abDHr46r
57/09ooVN+iXXnEGMtQVS/OPvYHSa2NgTSZz6Y86lCRbZmUOOlK31RDNlMvYNA+s
3iIVHovno/JuJnTOE8LY
=fQ8z
-----END PGP SIGNATURE-----
Merge tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pul removal of obsolete architecture ports from Arnd Bergmann:
"This removes the entire architecture code for blackfin, cris, frv,
m32r, metag, mn10300, score, and tile, including the associated device
drivers.
I have been working with the (former) maintainers for each one to
ensure that my interpretation was right and the code is definitely
unused in mainline kernels. Many had fond memories of working on the
respective ports to start with and getting them included in upstream,
but also saw no point in keeping the port alive without any users.
In the end, it seems that while the eight architectures are extremely
different, they all suffered the same fate: There was one company in
charge of an SoC line, a CPU microarchitecture and a software
ecosystem, which was more costly than licensing newer off-the-shelf
CPU cores from a third party (typically ARM, MIPS, or RISC-V). It
seems that all the SoC product lines are still around, but have not
used the custom CPU architectures for several years at this point. In
contrast, CPU instruction sets that remain popular and have actively
maintained kernel ports tend to all be used across multiple licensees.
[ See the new nds32 port merged in the previous commit for the next
generation of "one company in charge of an SoC line, a CPU
microarchitecture and a software ecosystem" - Linus ]
The removal came out of a discussion that is now documented at
https://lwn.net/Articles/748074/. Unlike the original plans, I'm not
marking any ports as deprecated but remove them all at once after I
made sure that they are all unused. Some architectures (notably tile,
mn10300, and blackfin) are still being shipped in products with old
kernels, but those products will never be updated to newer kernel
releases.
After this series, we still have a few architectures without mainline
gcc support:
- unicore32 and hexagon both have very outdated gcc releases, but the
maintainers promised to work on providing something newer. At least
in case of hexagon, this will only be llvm, not gcc.
- openrisc, risc-v and nds32 are still in the process of finishing
their support or getting it added to mainline gcc in the first
place. They all have patched gcc-7.3 ports that work to some
degree, but complete upstream support won't happen before gcc-8.1.
Csky posted their first kernel patch set last week, their situation
will be similar
[ Palmer Dabbelt points out that RISC-V support is in mainline gcc
since gcc-7, although gcc-7.3.0 is the recommended minimum - Linus ]"
This really says it all:
2498 files changed, 95 insertions(+), 467668 deletions(-)
* tag 'arch-removal' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (74 commits)
MAINTAINERS: UNICORE32: Change email account
staging: iio: remove iio-trig-bfin-timer driver
tty: hvc: remove tile driver
tty: remove bfin_jtag_comm and hvc_bfin_jtag drivers
serial: remove tile uart driver
serial: remove m32r_sio driver
serial: remove blackfin drivers
serial: remove cris/etrax uart drivers
usb: Remove Blackfin references in USB support
usb: isp1362: remove blackfin arch glue
usb: musb: remove blackfin port
usb: host: remove tilegx platform glue
pwm: remove pwm-bfin driver
i2c: remove bfin-twi driver
spi: remove blackfin related host drivers
watchdog: remove bfin_wdt driver
can: remove bfin_can driver
mmc: remove bfin_sdh driver
input: misc: remove blackfin rotary driver
input: keyboard: remove bf54x driver
...
In make_dentry_ptr_block, it is confused with "&" for t->dentry_bitmap
but without "&" for t->dentry, so delete "&" to make code more readable.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
If write is failed, we must deallocate the blocks that we couldn't write.
Cc: stable@vger.kernel.org
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
When an intent is aborted during it's initial commit through
xfs_defer_trans_abort(), there is a use after free. The current
report is for a RUI through this path in generic/388:
Freed by task 6274:
__kasan_slab_free+0x136/0x180
kmem_cache_free+0xe7/0x4b0
xfs_trans_free_items+0x198/0x2e0
__xfs_trans_commit+0x27f/0xcc0
xfs_trans_roll+0x17b/0x2a0
xfs_defer_trans_roll+0x6ad/0xe60
xfs_defer_finish+0x2a6/0x2140
xfs_alloc_file_space+0x53a/0xf90
xfs_file_fallocate+0x5c6/0xac0
vfs_fallocate+0x2f5/0x930
ioctl_preallocate+0x1dc/0x320
do_vfs_ioctl+0xfe4/0x1690
The problem is that the RUI has two active references - one in the
current transaction, and another held by the defer_ops structure
that is passed to the RUD (intent done) so that both the intent and
the intent done structures are freed on commit of the intent done.
Hence during abort, we need to release the intent item, because the
defer_ops reference is released separately via ->abort_intent
callback. Fix all the intent code to do this correctly.
Signed-Off-By: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Pull wait_var_event updates from Ingo Molnar:
"This introduces the new wait_var_event() API, which is a more flexible
waiting primitive than wait_on_atomic_t().
All wait_on_atomic_t() users are migrated over to the new API and
wait_on_atomic_t() is removed. The migration fixes one bug and should
result in no functional changes for the other usecases"
* 'sched-wait-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/wait: Improve __var_waitqueue() code generation
sched/wait: Remove the wait_on_atomic_t() API
sched/wait, arch/mips: Fix and convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/ocfs2: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/nfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/fscache: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/btrfs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, fs/afs: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, drivers/media: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait, drivers/drm: Convert wait_on_atomic_t() usage to the new wait_var_event() API
sched/wait: Introduce wait_var_event()
xfs_dir_ialloc() rolls the current transaction when allocation of a new
inode required the space manager to perform an allocation and replinish
the Inode btree.
None of the callers of xfs_dir_ialloc() need to know if the
transaction was committed. Hence this commit removes the "committed"
argument of xfs_dir_ialloc.
Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Fix commit 97dd26ad83 (f2fs: fix wrong AUTO_RECOVER condition).
We should use ~PAGE_MASK to determine whether i_size is aligned to
the f2fs's block size or not.
Signed-off-by: Junling Zheng <zhengjunling@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Using the ksys_fallocate() wrapper allows us to get rid of in-kernel
calls to the sys_fallocate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_fallocate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_p{read,write}64() wrappers allows us to get rid of
in-kernel calls to the sys_pread64() and sys_pwrite64() syscalls.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_p{read,write}64().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_truncate() wrapper allows us to get rid of in-kernel
calls to the sys_truncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_truncate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_sync_file_range() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_sync_file_range().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_sync() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_sync().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_read() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_read().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_lseek() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_lseek().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_ioctl() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_ioctl().
After careful review, at least some of these calls could be converted
to do_vfs_ioctl() in future.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_getdents64() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_getdents64().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this wrapper allows us to avoid the in-kernel calls to the
sys_open() syscall. The ksys_ prefix denotes that this function is meant
as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_open().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_close() wrapper allows us to get rid of in-kernel calls
to the sys_close() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it
uses the same calling convention as sys_close(), with one subtle
difference:
The few places which checked the return value did not care about the return
value re-writing in sys_close(), so simply use a wrapper around
__close_fd().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_ftruncate() wrapper allows us to get rid of in-kernel
calls to the sys_ftruncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_ftruncate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-interal do_fchownat() wrapper allows us to get rid of
fs-internal calls to the sys_fchownat() syscall.
Introducing the ksys_fchown() helper and the ksys_{,}chown() wrappers
allows us to avoid the in-kernel calls to the sys_{,l,f}chown() syscalls.
The ksys_ prefix denotes that these functions are meant as a drop-in
replacement for the syscalls. In particular, they use the same calling
convention as sys_{,l,f}chown().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_faccessat() helper allows us to get rid of
fs-internal calls to the sys_faccessat() syscall.
Introducing the ksys_access() wrapper allows us to avoid the in-kernel
calls to the sys_access() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_access().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_fchmodat() helper allows us to get rid of
fs-internal calls to the sys_fchmodat() syscall.
Introducing the ksys_fchmod() helper and the ksys_chmod() wrapper allows
us to avoid the in-kernel calls to the sys_fchmod() and sys_chmod()
syscalls. The ksys_ prefix denotes that these functions are meant as a
drop-in replacement for the syscalls. In particular, they use the same
calling convention as sys_fchmod() and sys_chmod().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_linkat() helper allows us to get rid of
fs-internal calls to the sys_linkat() syscall.
Introducing the ksys_link() wrapper allows us to avoid the in-kernel
calls to sys_link() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_link().
In the near future, the only fs-external user of ksys_link() should be
converted to use vfs_link() instead.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_mknodat() helper allows us to get rid of
fs-internal calls to the sys_mknodat() syscall.
Introducing the ksys_mknod() wrapper allows us to avoid the in-kernel
calls to sys_mknod() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_mknod().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_symlinkat() helper allows us to get rid of
fs-internal calls to the sys_symlinkat() syscall.
Introducing the ksys_symlink() wrapper allows us to avoid the in-kernel
calls to the sys_symlink() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In particular,
it uses the same calling convention as sys_symlink().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_mkdirat() helper allows us to get rid of
fs-internal calls to the sys_mkdirat() syscall.
Introducing the ksys_mkdir() wrapper allows us to avoid the in-kernel calls
to the sys_mkdir() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_mkdir().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this wrapper allows us to avoid the in-kernel calls to the
sys_rmdir() syscall. The ksys_ prefix denotes that this function is meant
as a drop-in replacement for the syscall. In particular, it uses the same
calling convention as sys_rmdir().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
do_rmdir() is used in the VFS layer at fs/namei.c, so use a different
name in hostfs.
Cc: Jeff Dike <jdike@addtoit.com>
Cc: user-mode-linux-devel@lists.sourceforge.net
Acked-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the sys_chdir()
syscall. The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_chdir().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the sys_write()
syscall. The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_write().
In the near future, the do_mounts / initramfs callers of ksys_write()
should be converted to use filp_open() and vfs_write() instead.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-s390@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_chroot() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_chroot().
In the near future, the fs-external callers of ksys_chroot() should be
converted to use kern_path()/set_fs_root() directly. Then ksys_chroot()
can be moved within sys_chroot() again.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using ksys_dup() and ksys_dup3() as helper functions allows us to
avoid the in-kernel calls to the sys_dup() and sys_dup3() syscalls.
The ksys_ prefix denotes that these functions are meant as a drop-in
replacement for the syscalls. In particular, they use the same
calling convention as sys_dup{,3}().
In the near future, the fs-external callers of ksys_dup{,3}() should be
converted to call do_dup2() directly. Then, ksys_dup{,3}() can be moved
within sys_dup{,3}() again.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel call to the sys_umount()
syscall. The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as ksys_umount().
In the near future, the only fs-external caller of ksys_umount() should be
converted to call do_umount() directly. Then, ksys_umount() can be moved
within sys_umount() again.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the sys_mount()
syscall. The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_mount().
In the near future, all callers of ksys_mount() should be converted to call
do_mount() directly.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
While sys32_quotactl() is only needed on x86, it can use the recommended
COMPAT_SYSCALL_DEFINEx() machinery for its setup.
Acked-by: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal kernel_quotactl() helper allows us to get rid of
the fs-internal call to the sys_quotactl() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_fanotify_mark() helper allows us to get rid of
the fs-internal call to the sys_fanotify_mark() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Acked-by: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the inotify-internal do_inotify_init() helper allows us to get rid
of the in-kernel call to sys_inotify_init1() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Acked-by: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_compat_futimesat() helper allows us to get rid of
the fs-internal call to the compat_sys_futimesat() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_compat_signalfd4() helper allows us to get rid of
the fs-internal call to the compat_sys_signalfd4() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_compat_select() helper allows us to get rid of
the fs-internal call to the compat_sys_select() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_compat_fcntl64() helper allows us to get rid of
the fs-internal call to the compat_sys_fcntl64() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel call to the sys_umount()
syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_vmsplice() helper allows us to get rid of the
fs-internal call to the sys_vmsplice() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_lookup_dcookie() helper allows us to get rid of
fs-internal calls to the sys_lookup_dcookie() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper removes an in-kernel call to the sys_eventfd() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper removes in-kernel calls to the sys_signalfd4() syscall
function.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the helper functions do_epoll_create() and do_epoll_wait() allows us
to remove in-kernel calls to the related syscall functions.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper removes the in-kernel call to the sys_futimesat()
syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper removes in-kernel calls to the sys_renameat2() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper removes an in-kernel call to the sys_pipe2() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the do_readlinkat() helper removes an in-kernel call to the
sys_readlinkat() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Check for unknown security mode flags during negotiate protocol
if debugging enabled.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Some servers return inode number zero for the root directory, which
causes ls to display incorrect data (missing "." and "..").
If the server returns zero for the inode number of the root directory,
fake an inode number for it.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
This variable is set to 4 for all protocol versions and replaces
the hardcoded constant 4 throughought the code.
This will later be updated to reflect whether a response packet
has a 4 byte length preamble or not once we start removing this
field from the SMB2+ dialects.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Use vzalloc instead of the vmalloc, memset combo
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When a slot becomes free, call wake_up_locked regardless of the number
of slots available.
Without this patch, wake_up_locked is only called when going from no
free slots to one. This means that there is a chance a waiting task
will not be woken up. In many cases, the system will bounce between 0
and 1 free slots, and the waiting tasks will be woken up. But if there
is still a waiting task and another slot becomes available before the
number of free slots reaches zero, that waiting task may never be woken
up since the number of free slots may never reach zero again.
The bug behavior is easy to reproduce with the following script,
where /mnt/orangefs is an OrangeFS file system.
for i in {1..100}; do
for j in {1..20}; do
dd if=/dev/zero of=/mnt/orangefs/tmp$j bs=32768 count=32 &
done
wait
done
Signed-off-by: David Reynolds <david@omnibond.com>
Reviewed-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This commit changes statfs default behaviour when reporting usage
statistics. Instead of using the overall filesystem usage, statfs now
reports the quota for the filesystem root, if ceph.quota.max_bytes has
been set for this inode. If quota hasn't been set, it falls back to the
old statfs behaviour.
A new mount option is also added ('noquotadf') to disable this behaviour.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
By keeping a counter with the number of snaprealms that have quota set
allows to optimize the functions that need to walk throught the realms
hierarchy looking for quotas. Thus, if this counter is zero it's safe to
assume that there are no realms with quota.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Keep a pointer to the inode in struct ceph_snap_realm. This allows to
optimize functions that walk the realms hierarchy (e.g. in quotas).
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When we're reaching the ceph.quota.max_bytes limit, i.e., when writing
more than 1/16th of the space left in a quota realm, update the MDS with
the new file size.
This mirrors the fuse-client approach with commit 122c50315ed1 ("client:
Inform mds file size when approaching quota limit"), in the ceph git tree.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch changes ceph_rename so that -EXDEV is returned if an attempt is
made to mv a file between two different dir trees with different quotas
setup.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch adds support for the max_files quota. It hooks into all the
ceph functions that add new filesystem objects that need to be checked
against the quota limits. When these limits are hit, -EDQUOT is returned.
Note that we're not checking quotas on ceph_link(). ceph_link doesn't
really create a new inode, and since the MDS doesn't update the directory
statistics when a new (hard) link is created (only with symlinks), they
are not accounted as a new file.
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This patch adds the infrastructure required to support cephfs quotas as it
is currently implemented in the ceph fuse client. Cephfs quotas can be
set on any directory, and can restrict the number of bytes or the number
of files stored beneath that point in the directory hierarchy.
Quotas are set using the extended attributes 'ceph.quota.max_files' and
'ceph.quota.max_bytes', and can be removed by setting these attributes to
'0'.
Link: http://tracker.ceph.com/issues/22372
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
1. set fsc->mdsc after successfully allocate all necessary memory
in mdsc init.
2. if fsc->mdsc is NULL, just skip destroy operation in mdsc destroy.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In current code, regular file and directory use same struct
ceph_file_info to store fs specific data so the struct has to
include some fields which are only used for directory
(e.g., readdir related info), when having plenty of regular files,
it will lead to memory waste.
This patch introduces dedicated ceph_dir_file_info cache for
readdir related thins. So that regular file does not include those
unused fields anymore.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add __init attribution to the functions which are called only once
during initiating/registering operations and deleting unnecessary
symbol exports.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
In sync mode, writepages() needs to write all dirty pages. But
it can only write dirty pages associated with the oldest snapc.
To write dirty pages associated with next snapc, it needs to wait
until current writes complete.
If there is no more dirty pages, writepages() should not wait on
writeback. Otherwise, dirty page writeback becomes very slow.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Dirty pages can be associated with different capsnap. Different capsnap
may have different EOF value. So invalidating dirty pages according to
the largest EOF value is wrong. Dirty pages beyond EOF, but associated
with other capsnap, do not get invalidated.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Releasing cap is affected by many factors (e.g., avail_count/reserve_count/min_count)
and min_count could be specified high volume in client mount option. Hence it's better
to mark cap cache as unreclaimable in case of non-trivial discrepancies between memory
shown as reclaimable and what is actually reclaimed.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Variable name ci is mostly used for ceph_inode_info.
Variable name fi is mostly used for ceph_file_info.
Variable name cf is mostly used for ceph_cap_flush.
Change variable name to follow above common rules
in case of confusing.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When caps_avail_count is in a low level, most newly
trimmed caps will probably go into ->caps_list and
caps_avail_count will be increased. Hence after trimming,
should recheck caps_avail_count to effectly reuse
newly trimmed caps. Also, when releasing unnecessary
caps follow the same rule of ceph_put_cap.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When unreserving caps check if there is too mamy available caps
in the ->caps_list, if so release unreserved caps.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When setting high volume of caps_min_count or having many
unreserved caps, unused caps may always keep in the ->caps_list
even can't get new cap from kmem_cache_alloc because lack of
maximum limitation of caps_avail_count. Hence reuse caps in
->caps_list if available, it's maybe better than setting max
limitation of caps_avail_count and releasing unused caps when
reaching the limit.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Adding spinlock protection during getting cap reservation
ralated fields so that the numbers match below BUG_ON condition
in the code.
BUG_ON(mdsc->caps_total_count != mdsc->caps_use_count +
mdsc->caps_reserve_count +
mdsc->caps_avail_count);
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When specifying multiple fscache related options, the result isn't always
the same as option order, this fix will keep strict consistent meaning
by order.
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Some of dout format do not include newline in the end,
fix for the files which are in fs/ceph and net/ceph directories,
and changing printk to dout for printing debug info in super.c
Signed-off-by: Chengguang Xu <cgxu519@icloud.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
- make it void
- xlen (object extent length) out parameter should be u32 because only
a single stripe unit is mapped at a time
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Alex Elder <elder@linaro.org>
A malicious user could force the directory pointer to be in an invalid
spot by using seekdir(2). Use the mechanism we already have to notice
if the directory has changed since the last time we called
ext4_readdir() to force a revalidation of the pointer.
Reported-by: syzbot+1236ce66f79263e8a862@syzkaller.appspotmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
On RDMA errors, transport should disconnect the RDMA CM connection. This
will notify the upper layer, and it will attempt transport reconnect.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
During transport reconnect, other processes may have registered memory
and blocked on transport. This creates a deadlock situation because the
transport resources can't be freed, and reconnect is blocked.
Fix this by returning to upper layer on timeout. Before returning,
transport status is set to reconnecting so other processes will release
memory registration resources.
Upper layer will retry the reconnect. This is not in fast I/O path so
setting the timeout to 5 seconds.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Change the following message (which can occur on reconnect) from
a warning to an FYI message. It is confusing to users.
[58360.523634] CIFS VFS: Free previous auth_key.response = 00000000a91cdc84
By default this message won't show up on reconnect unless the user bumps
up the log level to include FYI messages.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
STATUS_FS_DRIVER_REQUIRED is expected when DFS is not turned
on on the server. Do not log it on DFS referral response.
It clutters the dmesg log unnecessarily at mount time.
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com
Reviewed-by: Ronnie sahlberg <lsahlber@redhat.com>
Modify end of cifs_root_iget function in fs/cifs/inode.c to call
free_xid(xid) instead of _free_xid(xid), thereby allowing debug
notification of this action when enabled.
Signed-off-by: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: Steve French <smfrench@gmail.com>
SMB3.1.1 is a very important dialect, with much improved security.
We can remove the ExPERIMENTAL comments about it. It is widely
supported by servers.
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
SMB3.1.1 tree connect was only being signed when signing was mandatory
but needs to always be signed (for non-guest users).
See MS-SMB2 section 3.2.4.1.1
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
CC: Stable <stable@vger.kernel.org>
We can not use the standard sg_set_buf() fucntion since when
CONFIG_DEBUG_SG=y this adds a check that will BUG_ON for cifs.ko
when we pass it an object from the stack.
Create a new wrapper smb2_sg_set_buf() which avoids doing that particular check
and use it for smb3 encryption instead.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
It seems this is a copy-paste error and that the proper variable to use
in this particular case is _sha512_ instead of _md5_.
Addresses-Coverity-ID: 1465358 ("Copy-paste error")
Fixes: 1c6614d229e7 ("CIFS: add sha512 secmech")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
SMB3.11 clients must implement pre-authentification integrity.
* new mechanism to certify requests/responses happening before Tree
Connect.
* supersedes VALIDATE_NEGOTIATE
* fixes signing for SMB3.11
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
* prepare for SMB3.11 pre-auth integrity
* enable sha512 when SMB311 is enabled in Kconfig
* add sha512 as a soft dependency
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
shash and sdesc and always allocated and freed together.
* abstract this in new functions cifs_alloc_hash() and cifs_free_hash().
* make smb2/3 crypto allocation independent from each other.
Signed-off-by: Aurelien Aptel <aaptel@suse.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: Stable <stable@vger.kernel.org>
Trivial fix to spelling mistake in log_rdma_send and log_rdma_mr
message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steve French <smfrench@gmail.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Minor conflicts in drivers/net/ethernet/mellanox/mlx5/core/en_rep.c,
we had some overlapping changes:
1) In 'net' MLX5E_PARAMS_LOG_{SQ,RQ}_SIZE -->
MLX5E_REP_PARAMS_LOG_{SQ,RQ}_SIZE
2) In 'net-next' params->log_rq_size is renamed to be
params->log_rq_mtu_frames.
3) In 'net-next' params->hard_mtu is added.
Signed-off-by: David S. Miller <davem@davemloft.net>
Add explicit checks in ext4_xattr_block_get() just in case the
e_value_offs and e_value_size fields in the the xattr block are
corrupted in memory after the buffer_verified bit is set on the xattr
block.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
The missing error handling in add_extent_changeset was hidden, so make
it at least visible in the callers.
Signed-off-by: David Sterba <dsterba@suse.com>
When mount fails to read trees like fs tree, checksum tree, extent
tree, etc, there is not enough information about where went wrong.
With this, messages like
"BTRFS warning (device sdf): failed to read root (objectid=7): -5"
would help us a bit.
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
All users pass a local unsigned int and not the __uXX types that are
supposed to be used for userspace interfaces.
Signed-off-by: David Sterba <dsterba@suse.com>
The current calls are unclear in what way btrfs_dev_replace_lock takes
the locks, so drop the argument, split the helpers and use similar
naming as for read and write locks.
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_mutex has been killed in 2008, a213501153 ("Btrfs: Replace
the big fs_mutex with a collection of other locks"), still remembered in
some comments.
We don't have any extra needs for locking in the ACL handlers.
Signed-off-by: David Sterba <dsterba@suse.com>
The show_devname callback is used to print device name in
/proc/self/mounts, we need to traverse the device list consistently and
read the name that's copied to a seq buffer so we don't need further
locking.
If the first device is being deleted at the same time, the RCU will
allow us to read the device name, though it will become stale right
after the RCU protection ends. This is unavoidable and the user can
expect that the device will disappear from the filesystem's list at some
point.
The device_list_mutex was pretty heavy as it is used eg. for writing
superblock and a few other IO related contexts. This can stall any
application that reads the proc file for no reason.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Once there was a simple int force_cow that was used with the plain
barriers, and then converted to a bit, so we should use the appropriate
barrier helper.
Other variables in the complex if condition do not depend on a barrier,
so we should be fine in case the atomic barrier becomes a no-op.
Signed-off-by: David Sterba <dsterba@suse.com>
We have several reports about node pointer points to incorrect child
tree blocks, which could have even wrong owner and level but still with
valid generation and checksum.
Although btrfs check could handle it and print error message like:
leaf parent key incorrect 60670574592
Kernel doesn't have enough check on this type of corruption correctly.
At least add such check to read_tree_block() and btrfs_read_buffer(),
where we need two new parameters @level and @first_key to verify the
child tree block.
The new @level check is mandatory and all call sites are already
modified to extract expected level from its call chain.
While @first_key is optional, the following call sites are skipping such
check:
1) Root node/leaf
As ROOT_ITEM doesn't contain the first key, skip @first_key check.
2) Direct backref
Only parent bytenr and level is known and we need to resolve the key
all by ourselves, skip @first_key check.
Another note of this verification is, it needs extra info from nodeptr
or ROOT_ITEM, so it can't fit into current tree-checker framework, which
is limited to node/leaf boundary.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent tree of the test fs is like the following:
BTRFS info (device (null)): leaf 16327509003777336587 total ptrs 1 free space 3919
item 0 key (4096 168 4096) itemoff 3944 itemsize 51
extent refs 1 gen 1 flags 2
tree block key (68719476736 0 0) level 1
^^^^^^^
ref#0: tree block backref root 5
And it's using an empty tree for fs tree, so there is no way that its
level can be 1.
For REAL (created by mkfs) fs tree backref with no skinny metadata, the
result should look like:
item 3 key (30408704 EXTENT_ITEM 4096) itemoff 3845 itemsize 51
refs 1 gen 4 flags TREE_BLOCK
tree block key (256 INODE_ITEM 0) level 0
^^^^^^^
tree block backref root 5
Fix the level to 0, so it won't break later tree level checker.
Fixes: faa2dbf004 ("Btrfs: add sanity tests for new qgroup accounting code")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode, at tree-log.c:copy_items(), if we call
btrfs_next_leaf() at the loop which checks for the need to log holes, we
need to make sure copy_items() returns the value 1 to its caller and
not 0 (on success). This is because the path the caller passed was
released and is now different from what is was before, and the caller
expects a return value of 0 to mean both success and that the path
has not changed, while a return value of 1 means both success and
signals the caller that it can not reuse the path, it has to perform
another tree search.
Even though this is a case that should not be triggered on normal
circumstances or very rare at least, its consequences can be very
unpredictable (especially when replaying a log tree).
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we have the no-holes mode enabled and fsync a file after punching a
hole in it, we can end up not logging the whole hole range in the log tree.
This happens if the file has extent items that span more than one leaf and
we punch a hole that covers a range that starts in a leaf but does not go
beyond the offset of the first extent in the next leaf.
Example:
$ mkfs.btrfs -f -O no-holes -n 65536 /dev/sdb
$ mount /dev/sdb /mnt
$ for ((i = 0; i <= 831; i++)); do
offset=$((i * 2 * 256 * 1024))
xfs_io -f -c "pwrite -S 0xab -b 256K $offset 256K" \
/mnt/foobar >/dev/null
done
$ sync
# We now have 2 leafs in our filesystem fs tree, the first leaf has an
# item corresponding the extent at file offset 216530944 and the second
# leaf has a first item corresponding to the extent at offset 217055232.
# Now we punch a hole that partially covers the range of the extent at
# offset 216530944 but does go beyond the offset 217055232.
$ xfs_io -c "fpunch $((216530944 + 128 * 1024 - 4000)) 256K" /mnt/foobar
$ xfs_io -c "fsync" /mnt/foobar
<power fail>
# mount to replay the log
$ mount /dev/sdb /mnt
# Before this patch, only the subrange [216658016, 216662016[ (length of
# 4000 bytes) was logged, leaving an incorrect file layout after log
# replay.
Fix this by checking if there is a hole between the last extent item that
we processed and the first extent item in the next leaf, and if there is
one, log an explicit hole extent item.
Fixes: 16e7549f04 ("Btrfs: incompatible format change to remove hole extents")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a nice helper to do proper casting of a qgroup to a ulist aux
value. And several places that could make use of it.
Signed-off-by: David Sterba <dsterba@suse.com>
This reverts commit 48a89bc4f2.
The idea to commit transaction and free some space after hitting qgroup
limit is good, although the problem is it can easily cause deadlocks.
One deadlock example is caused by trying to flush data while still
holding it:
Call Trace:
__schedule+0x49d/0x10f0
schedule+0xc6/0x290
schedule_timeout+0x187/0x1c0
wait_for_completion+0x204/0x3a0
btrfs_wait_ordered_extents+0xa40/0xaf0 [btrfs]
qgroup_reserve+0x913/0xa10 [btrfs]
btrfs_qgroup_reserve_data+0x3ef/0x580 [btrfs]
btrfs_check_data_free_space+0x96/0xd0 [btrfs]
__btrfs_buffered_write+0x3ac/0xd40 [btrfs]
btrfs_file_write_iter+0x62a/0xba0 [btrfs]
__vfs_write+0x320/0x430
vfs_write+0x107/0x270
SyS_write+0xbf/0x150
do_syscall_64+0x1b0/0x3d0
entry_SYSCALL64_slow_path+0x25/0x25
Another can be caused by trying to commit one transaction while nesting
with trans handle held by ourselves:
btrfs_start_transaction()
|- btrfs_qgroup_reserve_meta_pertrans()
|- qgroup_reserve()
|- btrfs_join_transaction()
|- btrfs_commit_transaction()
The retry is causing more problems than exppected when limit is enabled.
At least a graceful EDQUOT is way better than deadlock.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now trace_qgroup_meta_reserve() will have extra type parameter.
And introduce two new trace events:
1) trace_qgroup_meta_free_all_pertrans()
For btrfs_qgroup_free_meta_all_pertrans()
2) trace_qgroup_meta_convert()
For btrfs_qgroup_convert_reserved_meta()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For quota disabled->enable case, it's possible that at reservation time
quota was not enabled so no bytes were really reserved, while at release
time, quota was enabled so we will try to release some bytes we didn't
really own.
Such situation can cause metadata reserveation underflow, for both types,
also less possible for per-trans type since quota enable will commit
transaction.
To address this, record qgroup meta reserved bytes into
root::qgroup_meta_rsv_pertrans and ::prealloc.
So at releasing time we won't free any bytes we didn't reserve.
For DATA, it's already handled by io_tree, so nothing needs to be done
there.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Quite similar for delalloc, some modification to delayed-inode and
delayed-item reservation. Also needs extra parameter for release case
to distinguish normal release and error release.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add some paranoia checks to make sure we don't stray beyond the end of
the valid memory region containing ext4 xattr entries while we are
scanning for a match.
Also rename the function to xattr_find_entry() since it is static and
thus only used in fs/ext4/xattr.c
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
preallocated meta rsv, making it quite easy to underflow qgroup meta
reservation.
Since we have the new qgroup meta rsv types, apply it to delalloc
reservation.
Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
rsv type.
And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
they will convert corresponding META_PREALLOC reservation to
META_PERTRANS.
This is mainly due to the fact that current qgroup numbers will only be
updated in btrfs_commit_transaction(), that's to say if we don't keep
such placeholder reservation, we can exceed qgroup limitation.
And for callers freeing outstanding extent in error handler, we will
just free META_PREALLOC bytes.
This behavior makes callers of btrfs_qgroup_release_meta() or
btrfs_qgroup_convert_meta() to be aware of which type they are.
So in this patch, btrfs_delalloc_release_metadata() and its callers get
an extra parameter to info qgroup to do correct meta convert/release.
The good news is, even we use the wrong type (convert or free), it won't
cause obvious bug, as prealloc type is always in good shape, and the
type only affects how per-trans meta is increased or not.
So the worst case will be at most metadata limitation can be sometimes
exceeded (no convert at all) or metadata limitation is reached too soon
(no free at all).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For meta_prealloc reservation users, after btrfs_join_transaction()
caller will modify tree so part (or even all) meta_prealloc reservation
should be converted to meta_pertrans until transaction commit time.
This patch introduces a new function,
btrfs_qgroup_convert_reserved_meta() to do this for META_PREALLOC
reservation user.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since qgroup has seperate metadata reservation types now, we can
completely get rid of the old root->qgroup_meta_rsv, which mostly acts
as current META_PERTRANS reservation type.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs uses 2 different methods to reseve metadata qgroup space.
1) Reserve at btrfs_start_transaction() time
This is quite straightforward, caller will use the trans handler
allocated to modify b-trees.
In this case, reserved metadata should be kept until qgroup numbers
are updated.
2) Reserve by using block_rsv first, and later btrfs_join_transaction()
This is more complicated, caller will reserve space using block_rsv
first, and then later call btrfs_join_transaction() to get a trans
handle.
In this case, before we modify trees, the reserved space can be
modified on demand, and after btrfs_join_transaction(), such reserved
space should also be kept until qgroup numbers are updated.
Since these two types behave differently, split the original "META"
reservation type into 2 sub-types:
META_PERTRANS:
For above case 1)
META_PREALLOC:
For reservations that happened before btrfs_join_transaction() of
case 2)
NOTE: This patch will only convert existing qgroup meta reservation
callers according to its situation, not ensuring all callers are at
correct timing.
Such fix will be added in later patches.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
When modifying qgroup relationship, for qgroup which only owns exclusive
extents, we will go through quick update path.
In this path, we will add/subtract exclusive and reference number for
parent qgroup, since the source (child) qgroup only has exclusive
extents, destination (parent) qgroup will also own or lose those extents
exclusively.
The same should be the same for reservation, since later reservation
adding/releasing will also affect parent qgroup, without the reservation
carried from child, parent will underflow reservation or have dead
reservation which will never be freed.
However original code doesn't do the same thing for reservation.
It handles qgroup reservation quite differently:
It removes qgroup reservation, as it's allocating space from the
reserved qgroup for relationship adding.
But does nothing for qgroup reservation if we're removing a qgroup
relationship.
According to the original code, it looks just like because we're adding
qgroup->rfer, the code assumes we're writing new data, so it's follows
the normal write routine, by reducing qgroup->reserved and adding
qgroup->rfer/excl.
This old behavior is wrong, and should be fixed to follow the same
excl/rfer behavior.
Just fix it by using the correct behavior described above.
Fixes: 31193213f1 ("Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since most callers of qgroup_reserve() are already defined by type,
converting qgroup_reserve() is quite an easy work.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce helpers to:
1) Get total reserved space
For limit calculation
2) Add/release reserved space for given type
With underflow detection and warning
3) Add/release reserved space according to child qgroup
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of single qgroup->reserved, use a new structure btrfs_qgroup_rsv
to store different types of reservation.
This patch only updates the header needed to compile.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_orphan_add() has had this case commented out since it was first
introduced in commit d68fc57b7e ("Btrfs: Metadata reservation for
orphan inodes"). Most of the orphan cleanup code has been rewritten
since then, so it's safe to say that this code isn't needed.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ switch to bool ]
Signed-off-by: David Sterba <dsterba@suse.com>
Any time the first block group of a new type is created, we add a new
kobject to sysfs to hold the attributes for that type. Kobject-internal
allocations always use GFP_KERNEL, making them prone to fs-reclaim races.
While it appears as if this can occur any time a block group is created,
the only times the first block group of a new type can be created in
memory is at mount and when we create the first new block group during
raid conversion.
This patch adds a new list to track pending kobject additions and then
handles them after we do chunk relocation. Between relocating the
target chunk (or forcing allocation of a new chunk in the case of data)
and removing the old chunk, we're in a safe place for fs-reclaim to
occur. We're holding the volume mutex, which is already held across
page faults, and the delete_unused_bgs_mutex, which will only stall
the cleaner thread.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit 2be12ef79 (btrfs: Separate space_info create/update), we've
separated out the creation and updating of the space info structures.
That commit was a straightforward refactoring of the two parts of
update_space_info, but we can go a step further. Since commits
c59021f84 (Btrfs: fix OOPS of empty filesystem after balance) and
b742bb82f (Btrfs: Link block groups of different raid types), we know
that the space_info structures will be created at mount and there will
only ever be, at most, three of them.
This patch cleans out the create_space_info calls after __find_space_info
returns NULL since __find_space_info *can't* return NULL.
The initial cause for reviewing this was the kobject_add calls from
create_space_info occuring in sites where fs-reclaim wasn't allowed. Now
we are certain they occur only early in the mount process and are safe.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rebuild on missing device is as same as recover, after it's done, rbio
has data which is consistent with on-disk data, so it can be cached to
avoid further reads.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop optimal argument from the function find_live_mirror() as we can
deduce it in the function itself. Also rename optimal to
preferred_mirror.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Obtain the stripes info from the map directly and so no need
to pass it as an argument.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added by 08e007d2e5 ("Btrfs: improve the noflush reservation") and
made redundant by 17024ad0a0 ("Btrfs: fix early ENOSPC due to
delalloc").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added by b4570aa994 ("btrfs: fix compiling with CONFIG_BTRFS_DEBUG
enabled.") and obsoleted by 2ff7e61e0d ("btrfs: take an fs_info
directly when the root is not used otherwise").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduced by 5cdc7ad337 ("btrfs: Replace fs_info->workers with
btrfs_workqueue.") but obsoleted by 2a4581983f ("btrfs: factor
btrfs_init_workqueues() out of open_ctree()").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added in 38c227d87c ("Btrfs: snapshot-aware defrag") but subsequently
made redundant by 0b246afa62 ("btrfs: root->fs_info cleanup, add
fs_info convenience variables").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added in b5d67f64f9 ("Btrfs: change scrub to support big blocks") but
rendered redundant by be50a8ddaa ("Btrfs: Simplify
scrub_setup_recheck_block()'s argument").
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Added as part of 86d5f99442 ("btrfs: convert prelimary reference
tracking to use rbtrees") but never used. tmp_op_key essentially
subsumed that variable.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When checking the minimal nr_devs, there is one dead and meaningless
condition:
if (ndevs < devs_increment * sub_stripes || ndevs < devs_min) {
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This condition is meaningless, @devs_increment has nothing to do with
@sub_stripes.
In fact, in btrfs_raid_array[], profile with sub_stripes larger than 1
(RAID10) already has the @devs_increment set to 2.
So no need to multiple it by @sub_stripes.
And above condition is also dead.
For RAID10, @devs_increment * @sub_stripes equals 4, which is also the
@devs_min of RAID10.
For other profiles, @sub_stripes is always 1, and since @ndevs is
rounded down to @devs_increment, the condition will always be true.
Remove the meaningless condition to make later reader wander less.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is the entry to the extent allocator and as such has
quite a number of parameters. Some of those have subtle effects on the
allocation algorithm. Document the parameters.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As with every function which deals with modifying the btree
btrfs_uuid_tree_rem can fail for any number of reasons (ie. EIO/ENOMEM).
Handle return error value from this function gracefully by aborting the
transaction.
Fixes: dd5f9615fc ("Btrfs: maintain subvolume items in the UUID tree")
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The kernel would like to have all stack VLA usage removed[1].
Unfortunately using an integer constant variable as the size of an
array is still considered a VLA. Instead let's use directly sizeof(var)
which removes the VLA usage. Use the occasion to remove csum_size
altogether and use sizeof() also for the size passed to memcmp
[1]: https://lkml.org/lkml/2018/3/7/621
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The callbacks make use of different parameters that are passed to the
other type unnecessarily. This patch adds separate types for each and
the unused parameters will be removed.
The type extent_submit_bio_hook_t keeps all parameters and can be used
where the start/done types are not appropriate.
Signed-off-by: David Sterba <dsterba@suse.com>
A useless wrapper around tree_mod_log_insert_root that hides missing
error handling. Move it to the callers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A trivial wrapper that can be simply opencoded and makes the GFP
allocation request more visible. The error handling is now moved to the
callers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wrapper is effectively an alias for tree_mod_log_insert_move but
also hides the missing error handling. To make that more visible, lift
the BUG_ON to the callers.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The wrappers are trivial and do not bring any extra value on top of the
plain locking primitives.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The merge call was factored out to a separate helper but it's a trivial
one and arguably we can opencode it and cache the value.
Signed-off-by: David Sterba <dsterba@suse.com>
All callers pass a valid pointer so we can drop the redundant checks.
The call to submit_one_bio never happend and can be removed.
Signed-off-by: David Sterba <dsterba@suse.com>
In case of raid56, writes and rebuilds always take BTRFS_STRIPE_LEN(64K)
as unit, however, scrub_extent() sets blocksize as unit, so rebuild
process may be triggered on every block on a same stripe.
A typical example would be that when we're replacing a disappeared disk,
all reads on the disks get -EIO, every block (size is 4K if blocksize is
4K) would go thru these,
scrub_handle_errored_block
scrub_recheck_block # re-read pages one by one
scrub_recheck_block # rebuild by calling raid56_parity_recover()
page by page
Although with raid56 stripe cache most of reads during rebuild can be
avoided, the parity recover calculation(xor or raid6 algorithms) needs to
be done $(BTRFS_STRIPE_LEN / blocksize) times.
This makes it smarter by doing raid56 scrub/replace on stripe length.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sort mount options by the primary name, followed by the 'no-'
counterpart if it exists. Group the deprecated and debugging options.
Enum and token defintions are synced.
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has two mount options for SSD optimizations: ssd and ssd_spread.
Presently there is an option to disable all SSD optimizations, but there
isn't an option to disable just ssd_spread.
This patch adds a mount option nossd_spread that disables ssd_spread
only.
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since userspace transaction have been removed we no longer have use
for this field so delete it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the userspace transaction ioctls have been removed,
TRANS_USERSPACE is no longer used hence we can remove it.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the userspace transaction IOCTL have been removed, this member
is no longer used so just remove it
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 3558d4f88e ("btrfs: Deprecate userspace transaction ioctls")
marked the beginning of the end of userspace transaction. This commit
finishes the job! There are no known users and ceph does not use the
ioctl anymore.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Sage Weil <sage@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When multiple pending snapshots referring to the same source subvolume
are executed, enabled quota will cause root item corruption, where root
items are using old bytenr (no backref in extent tree).
This can be triggered by fstests btrfs/152.
The cause is when source subvolume is still dirty, extra commit
(simplied transaction commit) of qgroup_account_snapshot() can skip
dirty roots not recorded in current transaction, making root item of
source subvolume not updated.
Fix it by forcing recording source subvolume in current transaction
before qgroup sub-transaction commit.
Reported-by: Justin Maggard <jmaggard@netgear.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When performing an unlock on an extent buffer we'd like to order the
decrement of extent_buffer::blocking_writers with waking up any
waiters. In such situations it's sufficient to use smp_mb__after_atomic
rather than the heavy smp_mb. On architectures where atomic operations
are fully ordered (such as x86 or s390) unconditionally executing
a heavyweight smp_mb instruction causes a severe hit to performance
while bringin no improvements in terms of correctness.
The better thing is to use the appropriate smp_mb__after_atomic routine
which will do the correct thing (invoke a full smp_mb or in the case
of ordered atomics insert a compiler barrier). Put another way,
an RMW atomic op + smp_load__after_atomic equals, in terms of
semantics, to a full smp_mb. This ensures that none of the problems
described in the accompanying comment of waitqueue_active occur.
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some functions can filter metadata by the generation. Add a define that
will annotate such arguments.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The current implementation of btrfs_page_exists_in_range() gives the
wrong answer if the workingset code has stored a shadow entry in the
page cache. The filemap_range_has_page() function does not have this
problem, and it's shared code, so use it instead.
eigned-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the call to EXT4_ERROR_INODE() into ext4_xattr_check_block().
This simplifies the code, and fixes a problem where not all callers of
ext4_xattr_check_block() were not resulting in ext4_error() getting
called when the xattr block is corrupted.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: linux-ext4@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:
WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
[..]
CPU: 27 PID: 1783 Comm: dma-collision Tainted: G O 4.15.0-rc2+ #984
[..]
Call Trace:
set_page_dirty_lock+0x40/0x60
bio_set_pages_dirty+0x37/0x50
iomap_dio_actor+0x2b7/0x3b0
? iomap_dio_zero+0x110/0x110
iomap_apply+0xa4/0x110
iomap_dio_rw+0x29e/0x3b0
? iomap_dio_zero+0x110/0x110
? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
xfs_file_read_iter+0xa0/0xc0 [xfs]
__vfs_read+0xf9/0x170
vfs_read+0xa6/0x150
SyS_pread64+0x93/0xb0
entry_SYSCALL_64_fastpath+0x1f/0x96
...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Block device inodes never have S_DAX set, so kill the check for DAX and
diversion to dax_writeback_mapping_range().
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Define some generic VFS aops
helpers for dax. These noop implementations are there in the dax case to
prevent the VFS from falling back to operations with page-cache
assumptions, dax_writeback_mapping_range() may not be referenced in the
FS_DAX=n case.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Matthew Wilcox <mawilcox@microsoft.com>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
If a page is already locked, attempting to dirty it leads to a deadlock
in lock_page(). This is what currently happens to ITER_BVEC pages when
a dio-enabled loop device is backed by ceph:
$ losetup --direct-io /dev/loop0 /mnt/cephfs/img
$ xfs_io -c 'pread 0 4k' /dev/loop0
Follow other file systems and only dirty ITER_IOVEC pages.
Cc: stable@kernel.org
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Previously, mount -l would show data=<mode> even if the ext4 default
journaling mode was being used. Change this to be consistent with the
rest of the options.
Ext4 already did the right thing when the journaling mode being used
matched the one specified in the superblock's default mount options. The
reason it failed to do the right thing for the ext4 defaults is that,
when set, they were never included in sbi->s_def_mount_opt (unlike the
superblock's defaults, which were).
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Don't show init_itable=n in /proc/fs/ext4/<dev>/options when filesystem
is mounted with noinit_itable.
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Previously, /proc/fs/ext4/<dev>/options would only show binary options
if they were set (1 in the options bit mask). E.g. it would show "grpid"
if it was set, but it would not show "nogrpid" if grpid was not set.
This seems sensible, but when an option is absent from the file, it can
be hard for the unfamiliar to know what is being used. E.g. if there
isn't a (no)grpid entry, nogrpid is in effect. But if there isn't a
(no)auto_da_alloc entry, auto_da_alloc is in effect. If there isn't a
(minixdf|bsddf) entry, it turns out bsddf is in effect. It all depends
on how the option is implemented.
It's clearer to be explicit, so print the corresponding option
regardless of whether it means a 1 or a 0 in the bit mask.
Note that options which do not have an explicit disable option aren't
indicated as being disabled even with this change (e.g. dax).
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Replace kset with generic kobject provided by kobject_create_and_add(),
since the latter is sufficient.
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Make cleanup of ext4_feat kobject consistent with similar objects.
Signed-off-by: Tyson Nottingham <tgnottingham@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
If some metadata block, such as an allocation bitmap, overlaps the
superblock, it's very likely that if the file system is mounted
read/write, the results will not be pretty. So disallow r/w mounts
for file systems corrupted in this particular way.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
If the root directory has an i_links_count of zero, then when the file
system is mounted, then when ext4_fill_super() notices the problem and
tries to call iput() the root directory in the error return path,
ext4_evict_inode() will try to free the inode on disk, before all of
the file system structures are set up, and this will result in an OOPS
caused by a NULL pointer dereference.
This issue has been assigned CVE-2018-1092.
https://bugzilla.kernel.org/show_bug.cgi?id=199179https://bugzilla.redhat.com/show_bug.cgi?id=1560777
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Currently d_move(from, to) does the following:
* name/parent of from <- old name/parent of to, from hashed there
* to is unhashed
* name of to is preserved
* if from used to be detached, to gets detached
* if from used to be attached, parent of to <- old parent of from.
That's both user-visibly bogus and complicates reasoning a lot.
Much saner semantics would be
* name/parent of from <- name/parent of to, from hashed there.
* to is unhashed
* name/parent of to is unchanged.
The price, of course, is that old parent of from might lose a reference.
However,
* all potentially cross-directory callers of d_move() have both
parents pinned directly; typically, dentries themselves are grabbed
only after we have grabbed and locked both parents. IOW, the decrement
of old parent's refcount in case of d_move() won't reach zero.
* __d_move() from d_splice_alias() is done to detached alias.
No refcount decrements in that case
* __d_move() from __d_unalias() *can* get the refcount to zero.
So let's grab a reference to alias' old parent before calling __d_unalias()
and dput() it after we'd dropped rename_lock.
That does make d_splice_alias() potentially blocking. However, it has
no callers in non-sleepable contexts (and the case where we'd grown
that dget/dput pair is _very_ rare, so performance is not an issue).
Another thing that needs adjustment is unlocking in the end of __d_move();
folded it in. And cleaned the remnants of bogus ordering from the
"lock them in the beginning" counterpart - it's never been right and
now (well, for 7 years now) we have that thing always serialized on
rename_lock anyway.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
shrink_dentry_list() holds dentry->d_lock and needs to acquire
dentry->d_inode->i_lock. This cannot be done with a spin_lock()
operation because it's the reverse of the regular lock order.
To avoid ABBA deadlocks it is done with a trylock loop.
Trylock loops are problematic in two scenarios:
1) PREEMPT_RT converts spinlocks to 'sleeping' spinlocks, which are
preemptible. As a consequence the i_lock holder can be preempted
by a higher priority task. If that task executes the trylock loop
it will do so forever and live lock.
2) In virtual machines trylock loops are problematic as well. The
VCPU on which the i_lock holder runs can be scheduled out and a
task on a different VCPU can loop for a whole time slice. In the
worst case this can lead to starvation. Commits 47be61845c
("fs/dcache.c: avoid soft-lockup in dput()") and 046b961b45
("shrink_dentry_list(): take parent's d_lock earlier") are
addressing exactly those symptoms.
Avoid the trylock loop by using dentry_kill(). When pruning ancestors,
the same code applies that is used to kill a dentry in dput(). This
also has the benefit that the locking order is now the same. First
the inode is locked, then the parent.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case when trylock in there fails, deal with it directly in
dentry_kill(). Note that in cases when we drop and retake
->d_lock, we need to recheck whether to retain the dentry.
Another thing is that dropping/retaking ->d_lock might have
ended up with negative dentry turning into positive; that,
of course, can happen only once...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case of trylock failure don't re-add to the list - drop the locks
and carefully get them in the right order. For shrink_dentry_list(),
somebody having grabbed a reference to dentry means that we can
kick it off-list, so if we find dentry being modified under us we
don't need to play silly buggers with retries anyway - off the list
it is.
The locking logics taken out into a helper of its own; lock_parent()
is no longer used for dentries that can be killed under us.
[fix from Eric Biggers folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
ext4 isn't validating the sizes of xattrs where the value of the xattr
is stored in an external inode. This is problematic because
->e_value_size is a u32, but ext4_xattr_get() returns an int. A very
large size is misinterpreted as an error code, which ext4_get_acl()
translates into a bogus ERR_PTR() for which IS_ERR() returns false,
causing a crash.
Fix this by validating that all xattrs are <= INT_MAX bytes.
This issue has been assigned CVE-2018-1095.
https://bugzilla.kernel.org/show_bug.cgi?id=199185https://bugzilla.redhat.com/show_bug.cgi?id=1560793
Reported-by: Wen Xu <wen.xu@gatech.edu>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
This patch spits out the time taken by the various steps in the
journal recover process. Previously, the journal recovery time
didn't account for finding the journal head in the log which takes
up a significant portion of time.
Signed-off-by: Abhi Das <adas@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Today if we run xfs_fsr and crash[1], log replay can fail because
the recovery code tries to instantiate the donor inode from
disk to replay the swapext, but it's been deleted and we get
verifier failures when we try to read the inode off disk with
i_mode == 0.
This fixes both sides: We don't log the swapext change if the
inode has been deleted, and we don't try to recover it either.
[1] or if systemd doesn't cleanly unmount root, as it is wont
to do ...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Instead of zeroing out fallocated blocks in gfs2_iomap_alloc, zero them
out in fallocate_chunk, much higher up the call stack. This gets rid of
gfs2's abuse of the IOMAP_ZERO flag as well as the gfs2 specific zeronew
buffer flag. I can't think of a reason why zeroing out the blocks in
gfs2_iomap_alloc would have any benefits: there is no additional locking
at that level that would add protection to the newly allocated blocks.
While at it, change fallocate over from gs2_block_map to gfs2_iomap_begin.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Trivial fix to spelling mistake in debug message text.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
mount_crypt_stat is assigned to
&ecryptfs_superblock_to_private(ecryptfs_dentry->d_sb)->mount_crypt_stat,
and mount_crypt_stat is not the first object in struct ecryptfs_sb_info.
mount_crypt_stat is therefore never NULL. At the same time, no crash
in ecryptfs_lookup() has been reported, and the lookup functions in
other file systems don't check if d_sb is NULL either.
Given that, remove the NULL check.
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Reserve an F2FS feature flag and inode flag for fs-verity. This is an
in-development feature that is planned be discussed at LSF/MM 2018 [1].
It will provide file-based integrity and authenticity for read-only
files. Most code will be in a filesystem-independent module, with
smaller changes needed to individual filesystems that opt-in to
supporting the feature. An early prototype supporting F2FS is available
[2]. Reserving the F2FS on-disk bits for fs-verity will prevent users
of the prototype from conflicting with other new F2FS features.
Note that we're reserving the inode flag in f2fs_inode.i_advise, which
isn't really appropriate since it's not a hint or advice. But
->i_advise is already being used to hold the 'encrypt' flag; and F2FS's
->i_flags uses the generic FS_* values, so it seems ->i_flags can't be
used for an F2FS-specific flag without additional work to remove the
assumption that ->i_flags uses the generic flags namespace.
[1] https://marc.info/?l=linux-fsdevel&m=151690752225644
[2] https://git.kernel.org/pub/scm/linux/kernel/git/mhalcrow/linux.git/log/?h=fs-verity-dev
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
And use it in a few more places rather than opencoding the values.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well. So dirtying
for both flags makes no sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well. So dirtying
for both flags makes no sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I_DIRTY_DATASYNC is a strict superset of I_DIRTY_SYNC semantics, as
in mark dirty to be written out by fdatasync as well. So dirtying
for both flags makes no sense.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_dentry_open is where we do the actual open of the file, so this is
where we should do our O_DIRECT sanity check to cover all potential
callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This patch add a segment type check in IPU, in
case of something wrong with blkadd in dnode.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Since f2fs_inode_info is allocated with flag GFP_F2FS_ZERO, so we do not
need to initialize zero value for its member any more.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Nat entry set is used only in checkpoint(), and during checkpoint() we
won't flush new nat entry with unallocated address, so we don't need to
add new nat entry into nat set, then nat_entry_set::entry_cnt can
indicate actual entry count we need to flush in checkpoint().
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In rxrpc and afs, use the debug_ids that are monotonically allocated to
various objects as they're allocated rather than pointers as kernel
pointers are now hashed making them less useful. Further, the debug ids
aren't reused anywhere nearly as quickly.
In addition, allow kernel services that use rxrpc, such as afs, to take
numbers from the rxrpc counter, assign them to their own call struct and
pass them in to rxrpc for both client and service calls so that the trace
lines for each will have the same ID tag.
Signed-off-by: David Howells <dhowells@redhat.com>
Synchronous pernet_operations are not allowed anymore.
All are asynchronous. So, drop the structure member.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another caches. So, they also
can be async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes spelling typos found in printk.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
These pernet_operations create and destroy per-net pipe
and dentry, and they seem safe to be marked as async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
These pernet_operations look similar to rpcsec_gss_net_ops,
they just create and destroy another cache. Also they create
and destroy directory. So, they also look safe to be async.
Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Anna Schumaker <Anna.Schumaker@netapp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Most of the generic data structures embedded in xfs_mount are
dynamically initialized immediately after mp is allocated. A few
fields are left out and initialized during the xfs_mountfs()
sequence, after mp has been attached to the superblock.
To clean this up and help prevent premature access of associated
fields, refactor xfs_mount allocation and all dependent init calls
into a new helper. This self-documents that all low level data
structures (i.e., locks, trees, etc.) should be initialized before
xfs_mount is attached to the superblock.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
A lot of Kconfig symbols have architecture specific dependencies.
In those cases that depend on architectures we have already removed,
they can be omitted.
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
In the last step of scrub_handle_error_block, we try to combine good
copies on all possible mirrors, this works fine for raid1 and raid10,
but not for raid56 as it's doing parity rebuild.
If parity rebuild doesn't get back with correct data which matches its
checksum, in case of replace we'd rather write what is stored in the
source device than the data calculuated from parity.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
async_missing_raid56() is identical to async_read_rebuild().
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously, btrfs_inode_by_name() returned 0 which left caller to check
objectid of location even location if the type was invalid.
Let btrfs_inode_by_name() return -EUCLEAN if a corrupted location of a
dir entry is found. Removal of label out_err also simplifies the
function.
Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ drop unlikely ]
Signed-off-by: David Sterba <dsterba@suse.com>
This function btrfs_close_extra_devices() is about freeing
extra devids which once it may have belonged to this filesystem.
So rename it and add the comment. The _devid suffix is
appropriate as this function won't handle devices which are
outside of the filesytem being mounted.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This argument is always set to the root of the inode, which is also
passed. So let's get a reference inside the function and simplify
the arg list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
According to tlv_put()'s prototype, data and attrlen needs to be
exchanged in the macro, but seems all callers are already aware of
this misorder and are therefore not affected.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that nothing uses the root arg of btrfs_log_dentry_safe it can be
safely removed. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_log_inode_parent is called from 2 places (btrfs_log_dentry_safe
and btrfs_log_new_name) both of which pass inode->root as the root
argument and the inode itself. Remove the redundant root argument and
get a reference to the root directly from the inode, also remove
redundant root != inode->root check from the same function. No
functional change.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function always sets keep_locks to 1 and saves the old value of
keep_locks which is restored at the end. So there is no way it can be
called without keep_locks being set. Remove comment imposing redundant
requirement on callers.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The xattr_handler::get prototype returns int, use it. The only ssize_t
exception is the per-inode listxattr handler.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Extern for functions does not make any difference, there are only a few
so let's remove them before it's too late.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When send finishes processing an inode representing a regular file, it
always issues a truncate operation for that file, even if its size did
not change or the last write sets the file size correctly. In the most
common cases, the issued write operations set the file to correct size
(either full or incremental sends) or the file size did not change (for
incremental sends), so the only case where a truncate operation is needed
is when a file size becomes smaller in the send snapshot when compared
to the parent snapshot.
By not issuing unnecessary truncate operations we reduce the stream size
and save time in the receiver. Currently truncating a file to the same
size triggers writeback of its last page (if it's dirty) and waits for it
to complete (only if the file size is not aligned with the filesystem's
sector size). This is being fixed by another patch and is independent of
this change (that patch's title is "Btrfs: skip writeback of last page
when truncating file to same size").
The following script was used to measure time spent by a receiver without
this change applied, with this change applied, and without this change and
with the truncate fix applied (the fix to not make it start and wait for
writeback to complete).
$ cat test_send.sh
#!/bin/bash
SRC_DEV=/dev/sdc
DST_DEV=/dev/sdd
SRC_MNT=/mnt/sdc
DST_MNT=/mnt/sdd
mkfs.btrfs -f $SRC_DEV >/dev/null
mkfs.btrfs -f $DST_DEV >/dev/null
mount $SRC_DEV $SRC_MNT
mount $DST_DEV $DST_MNT
echo "Creating source filesystem"
for ((t = 0; t < 10; t++)); do
(
for ((i = 1; i <= 20000; i++)); do
xfs_io -f -c "pwrite -S 0xab 0 5000" \
$SRC_MNT/file_$i > /dev/null
done
) &
worker_pids[$t]=$!
done
wait ${worker_pids[@]}
echo "Creating and sending snapshot"
btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
/usr/bin/time -f "send took %e seconds" \
btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
/usr/bin/time -f "receive took %e seconds" \
btrfs receive -f $SRC_MNT/send_file $DST_MNT
umount $SRC_MNT
umount $DST_MNT
The results, which are averages for 5 runs for each case, were the
following:
* Without this change
average receive time was 26.49 seconds
standard deviation of 2.53 seconds
* Without this change and with the truncate fix
average receive time was 12.51 seconds
standard deviation of 0.32 seconds
* With this change and without the truncate fix
average receive time was 10.02 seconds
standard deviation of 1.11 seconds
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we truncate a file to the same size and that size is not aligned
with the sector size, we end up triggering writeback (and wait for it to
complete) of the last page. This is unncessary as we can not have delayed
allocation beyond the inode's i_size and the goal of truncating a file
to its own size is to discard prealloc extents (allocated via the
fallocate(2) system call). Besides the unnecessary IO start and wait, it
also breaks the oppurtunity for larger contiguous extents on disk, as
before the last dirty page there might be other dirty pages.
This scenario is probably not very common in general, however it is
common for btrfs receive implementations because currently the send
stream always issues a truncate operation for each processed inode as
the last operation for that inode (this truncate operation is not
always needed and the send implementation will be addressed to avoid
them).
So improve this by not starting and waiting for writeback of the inode's
last page when we are truncating to exactly the same size.
The following script was used to quickly measure the time a receive
operation takes:
$ cat test_send.sh
#!/bin/bash
SRC_DEV=/dev/sdc
DST_DEV=/dev/sdd
SRC_MNT=/mnt/sdc
DST_MNT=/mnt/sdd
mkfs.btrfs -f $SRC_DEV >/dev/null
mkfs.btrfs -f $DST_DEV >/dev/null
mount $SRC_DEV $SRC_MNT
mount $DST_DEV $DST_MNT
echo "Creating source filesystem"
for ((t = 0; t < 10; t++)); do
(
for ((i = 1; i <= 20000; i++)); do
xfs_io -f -c "pwrite -S 0xab 0 5000" \
$SRC_MNT/file_$i > /dev/null
done
) &
worker_pids[$t]=$!
done
wait ${worker_pids[@]}
echo "Creating and sending snapshot"
btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
/usr/bin/time -f "send took %e seconds" \
btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
/usr/bin/time -f "receive took %e seconds" \
btrfs receive -f $SRC_MNT/send_file $DST_MNT
umount $SRC_MNT
umount $DST_MNT
The results for 5 runs were the following:
* Without this change
average receive time was 26.49 seconds
standard deviation of 2.53 seconds
* With this change
average receive time was 12.51 seconds
standard deviation of 0.32 seconds
Reported-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It doens't make sense to process prealloc extents as pages will be
filled with zero when reading prealloc extents.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have btrfs_fs_info::data_chunk_allocations and
btrfs_fs_info::metadata_ratio declared as unsigned which would be
unsinged int and kernel style prefers unsigned int over bare unsigned.
So this patch changes them to u32.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using any kind of memory barriers around atomic operations which have
a return value is redundant, since those operations themselves are
fully ordered. atomic_t.txt states:
- RMW operations that have a return value are fully ordered;
Fully ordered primitives are ordered against everything prior and
everything subsequent. Therefore a fully ordered primitive is like
having an smp_mb() before and an smp_mb() after the primitive.
Given this let's replace the extra memory barriers with comments.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the same function we just ran btrfs_alloc_device() which means the
btrfs_device::resized_list is sure to be empty and we are protected
with the btrfs_fs_info::volume_mutex.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.
Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:
- printf wrappers, error messages
- exit helpers
Signed-off-by: David Sterba <dsterba@suse.com>
Recently, the __init annotations have been added. There's unfortunatelly
only one case where we can add __exit, because most of the cleanup
helpers are also called from the __init phase.
As the __exit annotated functions get discarded completely for a
built-in code, we'd miss them from the init phase.
Signed-off-by: David Sterba <dsterba@suse.com>
We aren't verifying the parameter passed to the subvolid mount option,
so we won't report and fail the mount if a junk value is specified for
example, -o subvolid=abc.
This patch verifies the subvolid option with match_u64.
Up to now the memparse function accepts the K/M/G/ suffixes, that are
usually meant for size values and do not make sense for a subvolume it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Fstests generic/475 provides a way to fail metadata reads while
checking if checksum exists for the inode inside run_delalloc_nocow(),
and csum_exist_in_range() interprets error (-EIO) as inode having
checksum and makes its caller enter the cow path.
In case of free space inode, this ends up with a warning in
cow_file_range().
The same problem applies to btrfs_cross_ref_exist() since it may also
read metadata in between.
With this, run_delalloc_nocow() bails out when errors occur at the two
places.
cc: <stable@vger.kernel.org> v2.6.28+
Fixes: 17d217fe97 ("Btrfs: fix nodatasum handling in balancing code")
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The custom crc32 init code was introduced in
14a958e678 ("Btrfs: fix btrfs boot when compiled as built-in") to
enable using btrfs as a built-in. However, later as pointed out by
60efa5eb2e ("Btrfs: use late_initcall instead of module_init") this
wasn't enough and finally btrfs was switched to late_initcall which
comes after the generic crc32c implementation is initiliased. The
latter commit superseeded the former. Now that we don't have to
maintain our own code let's just remove it and switch to using the
generic implementation.
Despite touching a lot of files the patch is really simple. Here is the gist of
the changes:
1. Select LIBCRC32C rather than the low-level modules.
2. s/btrfs_crc32c/crc32c/g
3. replace hash.h with linux/crc32c.h
4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
I've tested this with btrfs being both a module and a built-in and xfstest
doesn't complain.
Does seem to fix the longstanding problem of not automatically selectiong
the crc32c module when btrfs is used. Possibly there is a workaround in
dracut.
The modinfo confirms that now all the module dependencies are there:
before:
depends: zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
after:
depends: libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add more info to changelog from mails ]
Signed-off-by: David Sterba <dsterba@suse.com>
Function __get_raid_index() is used to convert block group flags into
raid index, which can be used to get various info directly from
btrfs_raid_array[].
Refactor this function a little:
1) Rename to btrfs_bg_flags_to_raid_index()
Double underline prefix is normally for internal functions, while the
function is used by both extent-tree and volumes.
Although the name is a little longer, but it should explain its usage
quite well.
2) Move it to volumes.h and make it static inline
Just several if-else branches, really no need to define it as a normal
function.
This also makes later code re-use between kernel and btrfs-progs
easier.
3) Remove function get_block_group_index()
Really no need to do such a simple thing as an exported function.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inspecting the error message with real corruption, the "root=%llu"
always shows "1" (root tree), instead of the correct owner.
The problem is that we are getting @root from page->mapping->host, which
points the same btree inode, so we will always get the same root.
This makes the root owner output meaningless, and harder to port
tree-checker to btrfs-progs.
So get rid of the false and meaningless @root parameter and replace it
with @fs_info.
To get the owner, we can only rely on btrfs_header_owner() now.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the
subtle bugs around merge_extent_mapping.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_run_qgroups is doing a bit too much. Not only is it
responsible for synchronizing in-memory state of qgroups to disk but
it also contains code to trigger the initial qgroup rescan when
quota is enabled initially. This condition is detected by checking that
BTRFS_FS_QUOTA_ENABLED is not set and BTRFS_FS_QUOTA_ENABLING is set.
Nothing really requires from the code to be structured (and scattered)
the way it is so let's streamline things. First move the quota rescan
code into btrfs_quota_enable, where its invocation is closer to the
use. This also makes the FS_QUOTA_ENABLING flag redundant so let's
remove it as well.
This has been tested with a full xfstest run with qgroups enabled on
the scratch device of every xfstest and no regressions were observed.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
load_free_space_tree calls either function load_free_space_bitmaps or
load_free_space_extents. And either of those two will lead to call
btrfs_next_item. So in function load_free_space_tree, use READA_FORWARD
to read forward ahead.
This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
populate_free_space_tree calls function btrfs_search_slot_for_read with
parameter int find_higher = 1, it means that, if no exact match is
found, then use the next higher item. So in function
populate_free_space_tree, use READA_FORWARD to read forward ahead.
This also changes the value from READA_BACK to READA_FORWARD, since
according to the logic, it should reada_for_search forward, not
backward.
Signed-off-by: Gu JinXiang <gujx@cn.fujitsu.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
delayed_iput_count wa supposed to be used to implement, well, delayed
iput. The idea is that we keep accumulating the number of iputs we do
until eventually the inode is deleted. Turns out we never really
switched the delayed_iput_count from 0 to 1, hence all conditional
code relying on the value of that member being different than 0 was
never executed. This, as it turns out, didn't cause any problem due
to the simple fact that the generic inode's i_count member was always
used to count the number of iputs. So let's just remove the unused
member and all unused code. This patch essentially provides no
functional changes. While at it, also add proper documentation for
btrfs_add_delayed_iput
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
The behavior of btrfs_delalloc_reserve_metadata depends on whether
the inode we are allocating for is the freespace inode or not. As it
stands if we are the free node we set 'flush' and 'delalloc_lock'
variable to certain values. Subsequently we check the values of those
vars and act accordingly. Instead, simplify things by having 1 if
which checks whether we are the freespace inode or not and do any
specific operation in either branches of that if. This makes the code
a bit easier to understand, as an added bonus it also shrinks the
compiled size:
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-17 (-17)
Function old new delta
btrfs_delalloc_reserve_metadata 1876 1859 -17
Total: Before=85966, After=85949, chg -0.02%
No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Edmund Nadolski <enadolski@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add opened device to the tail of dev_alloc_list instead of head, so that
it maintains the same order as dev_list.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
By maintaining the device list sorted lets us reproduce the problems
related to missing chunk in the degraded mode much more consistent. So
fix this by sorting the devices by devid within the kernel. So that we
know which device is assigned to the struct fs_info::latest_bdev when
all the devices are having and same SB generation.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's not necessary to hold ->orphan_lock when checking inode's runtime
flags.
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of manually fiddling with the state of the task
(RUNNING->INTERRUPTIBLE->RUNNING) again just use schedule_timeout_interruptible
which adjusts the task state as needed. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Even though btrfs_start_dirty_block_groups is fairly in the beginning of
btrfs_commit_transaction outside of the critical section defined by the
transaction states it can only be run by a single comitter. In other
words it defines its own critical section thanks to the
BTRFS_TRANS_DIRTY_BG run flag and ro_block_group_mutex. However, its
error handling is outside of this critical section which is a bit
counter-intuitive. So move the error handling righ after the function is
executed and let the sole runner of dirty block groups handle the return
value. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bio_add_page() can fail for logical reasons as from the bio_add_page()
comments:
/*
* This will only fail if either bio->bi_vcnt == bio->bi_max_vecs or
* it's a cloned bio.
*/
Here we have just allocated the bio, so both of those failures can't
occur. So drop the check. We can also drop the error stats for write
error.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Enospc_debug makes extent allocator print more debug messages,
however for chunk allocation, there is no debug message for enospc_debug
at all.
This patch will add message for the following parts of chunk allocator:
1) No rw device at all
Quite rare, but at least output one message for this case.
2) Not enough space for some device
This debug message is quite handy for unbalanced disks with stripe
based profiles (RAID0/10/5/6).
3) Not enough free devices
This debug message should tell us if current chunk allocator is
working correctly under minimal device requirements.
Although in most cases, we will hit other ENOSPC before we even hit a
chunk allocator ENOSPC, but such debug info won't help.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use ASSERT to report logical error in cow_file_range(), also move it a
bit closer to when the num_bytes is derived.
The extent start could be (u64)-1 in some cases, the assert should catch
that we do not accidentally pass it to cow_file_range.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch deletes local variable disk_num_bytes as its value
is same as num_bytes in the function cow_file_range().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit [1] removed the need to use btrfs_async_submit_limit(), so
delete it.
[1]
commit 736cd52e0c
Btrfs: remove nr_async_submits and async_submit_draining
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Preempt counter APIs have been split out, currently, hardirq.h just
includes irq_enter/exit APIs which are not used by btrfs at all.
So, remove the unused hardirq.h.
Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently when enospc_debug mount option is turned on we do not print
any debug info in case metadata reservation failures happen. Fix this
by adding the necessary hook in reserve_metadata_bytes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reason why io_bgs can be modified without holding any lock is
non-obvious. Document it and reference that documentation from the
respective call sites.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
list_first_entry is essentially a wrapper over cotnainer_of. The latter
can never return null even if it's working on inconsistent list since it
will either crash or return some offset in the wrong struct.
Additionally, for the dirty_bgs list the iteration is done under
dirty_bgs_lock which ensures consistency of the list.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For debugging or administration purposes, we would want to know if and
when the user cancels the replace, to complement the existing messages
when dev-replace starts or finishes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog, fold fix for RCU warning from Nikolay ]
Signed-off-by: David Sterba <dsterba@suse.com>
The replace target device can be missing when mounted with -o degraded,
but we wont allocate a missing btrfs_device to it. So check the device
before accessing.
BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
btrfs_ioctl+0x2216/0x2590 [btrfs]
do_vfs_ioctl+0x625/0x650
SyS_ioctl+0x4e/0x80
do_syscall_64+0x5d/0x160
entry_SYSCALL64_slow_path+0x25/0x25
This patch has been moved in front of patch "btrfs: log, when replace,
is canceled by the user" that could reproduce the crash if the system
reboots inside btrfs_dev_replace_start before the
btrfs_dev_replace_finishing call.
$ mkfs /dev/sda
$ mount /dev/sda mnt
$ btrfs replace start /dev/sda /dev/sdb
<insert reboot>
$ mount po degraded /dev/sdb mnt
<crash>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[ added reproducer description from mail ]
Signed-off-by: David Sterba <dsterba@suse.com>
The options alloc_start and subvolrootid are deprecated, comment them in
the tokens list. And leave them as it is. No functional changes.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As the commit mount option is unsigned so manage it as %u for token
verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As check_int_print_mask mount option is unsigned so manage it as %u for
token verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As metadata_ratio mount option is unsinged so manage it as %u for token
verifications, instead of %d.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mount option thread_pool is always unsigned. Manage it that way all
around.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>