linux

Commit Graph

Author	SHA1	Message	Date
Jeff Layton	7fe0cdeb0f	ceph: take a cred reference instead of tracking individual uid/gid Replace req->r_uid/r_gid with an r_cred pointer and take a reference to that at the point where we previously would sample the two. Use that to populate the uid and gid in the header and release the reference when the request is freed. This should enable us to later add support for sending supplementary group lists in MDS requests. [ idryomov: break unnecessarily long lines ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Jeff Layton	0f51a98361	ceph: don't reach into request header for readdir info We already have a pointer to the argument struct in req->r_args. Use that instead of groveling around in the ceph_mds_request_head. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Xiubo Li	968cd14edc	ceph: set osdmap epoch for setxattr When setting the file/dir layout, it may need data pool info. So in mds server, it needs to check the osdmap. At present, if mds doesn't find the data pool specified, it will try to get the latest osdmap. Now if pass the osd epoch for setxattr, the mds server can only check this epoch of osdmap. URL: https://tracker.ceph.com/issues/48504 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Colin Ian King	4a756db2a1	ceph: remove redundant assignment to variable i The variable i is being initialized with a value that is never read and it is being updated later with a new value in a for-loop. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Luis Henriques	dd980fc0d5	ceph: add ceph.caps vxattr Add a new vxattr that allows userspace to list the caps for a specific directory or file. [ jlayton: change format delimiter to '/' ] Signed-off-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Jeff Layton	bca9fc14c7	ceph: when filling trace, call ceph_get_inode outside of mutexes Geng Jichao reported a rather complex deadlock involving several moving parts: 1) readahead is issued against an inode and some of its pages are locked while the read is in flight 2) the same inode is evicted from the cache, and this task gets stuck waiting for the page lock because of the above readahead 3) another task is processing a reply trace, and looks up the inode being evicted while holding the s_mutex. That ends up waiting for the eviction to complete 4) a write reply for an unrelated inode is then processed in the ceph_con_workfn job. It calls ceph_check_caps after putting wrbuffer caps, and that gets stuck waiting on the s_mutex held by 3. The reply to "1" is stuck behind the write reply in "4", so we deadlock at that point. This patch changes the trace processing to call ceph_get_inode outside of the s_mutex and snap_rwsem, which should break the cycle above. [ idryomov: break unnecessarily long lines ] URL: https://tracker.ceph.com/issues/47998 Reported-by: Geng Jichao <gengjichao@jd.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:48 +01:00
Luis Henriques	6646ea1c8e	Revert "ceph: allow rename operation under different quota realms" This reverts commit `dffdcd7145`. When doing a rename across quota realms, there's a corner case that isn't handled correctly. Here's a testcase: mkdir files limit truncate files/file -s 10G setfattr limit -n ceph.quota.max_bytes -v 1000000 mv files limit/ The above will succeed because ftruncate(2) won't immediately notify the MDSs with the new file size, and thus the quota realms stats won't be updated. Since the possible fixes for this issue would have a huge performance impact, the solution for now is to simply revert to returning -EXDEV when doing a cross quota realms rename. URL: https://tracker.ceph.com/issues/48203 Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Jeff Layton	68cbb8056a	ceph: fix inode refcount leak when ceph_fill_inode on non-I_NEW inode fails Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Luis Henriques	ccd1acdf1c	ceph: downgrade warning from mdsmap decode to debug While the MDS cluster is unstable and changing state the client may get mdsmap updates that will trigger warnings: [144692.478400] ceph: mdsmap_decode got incorrect state(up:standby-replay) [144697.489552] ceph: mdsmap_decode got incorrect state(up:standby-replay) [144697.489580] ceph: mdsmap_decode got incorrect state(up:standby-replay) This patch downgrades these warnings to debug, as they may flood the logs if the cluster is unstable for a while. Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Luis Henriques	e5cafce3ad	ceph: fix race in concurrent __ceph_remove_cap invocations A NULL pointer dereference may occur in __ceph_remove_cap with some of the callbacks used in ceph_iterate_session_caps, namely trim_caps_cb and remove_session_caps_cb. Those callers hold the session->s_mutex, so they are prevented from concurrent execution, but ceph_evict_inode does not. Since the callers of this function hold the i_ceph_lock, the fix is simply a matter of returning immediately if caps->ci is NULL. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/43272 Suggested-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Jeff Layton	4a357f5069	ceph: pass down the flags to grab_cache_page_write_begin write_begin operations are passed a flags parameter that we need to mirror here, so that we don't (e.g.) recurse back into filesystem code inappropriately. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Xiubo Li	5a9e2f5d55	ceph: add ceph.{cluster_fsid/client_id} vxattrs These two vxattrs will only exist in local client side, with which we can easily know which mountpoint the file belongs to and also they can help locate the debugfs path quickly. URL: https://tracker.ceph.com/issues/48057 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Xiubo Li	247b1f19db	ceph: add status debugfs file This will help list some useful client side info, like the client entity address/name and blocklisted status, etc. URL: https://tracker.ceph.com/issues/48057 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Jeff Layton	04fabb1199	ceph: ensure we have Fs caps when fetching dir link count The link count for a directory is defined as inode->i_subdirs + 2, (for "." and ".."). i_subdirs is only populated when Fs caps are held. Ensure we grab Fs caps when fetching the link count for a directory. [ idryomov: break unnecessarily long line ] URL: https://tracker.ceph.com/issues/48125 Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Xiubo Li	8ba3b8c7fb	ceph: send dentry lease metrics to MDS daemon For the old ceph version, if it received this one metric message containing the dentry lease metric info, it will just ignore it. URL: https://tracker.ceph.com/issues/43423 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Jeff Layton	81048c00d1	ceph: acquire Fs caps when getting dir stats We only update the inode's dirstats when we have Fs caps from the MDS. Declare a new VXATTR_FLAG_DIRSTAT that we set on all dirstats, and have the vxattr handling code acquire those caps when it's set. URL: https://tracker.ceph.com/issues/48104 Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Patrick Donnelly <pdonnell@redhat.com> Reviewed-by: Xiubo Li <xiubli@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Jeff Layton	06a1ad438b	ceph: fix up some warnings on W=1 builds Convert some decodes into unused variables into skips, and fix up some non-kerneldoc comment headers to not start with "/**". Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Jeff Layton	4ae3713fe4	ceph: queue MDS requests to REJECTED sessions when CLEANRECOVER is set Ilya noticed that the first access to a blacklisted mount would often get back -EACCES, but then subsequent calls would be OK. The problem is in __do_request. If the session is marked as REJECTED, a hard error is returned instead of waiting for a new session to come into being. When the session is REJECTED and the mount was done with recover_session=clean, queue the request to the waiting_for_map queue, which will be awoken after tearing down the old session. We can only do this for sync requests though, so check for async ones first and just let the callers redrive a sync request. URL: https://tracker.ceph.com/issues/47385 Reported-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:47 +01:00
Jeff Layton	dbeec07bc8	ceph: remove timeout on allowing reconnect after blocklisting 30 minutes is a long time to wait, and this makes it difficult to test the feature by manually blocklisting clients. Remove the timeout infrastructure and just allow the client to reconnect at will. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:46 +01:00
Jeff Layton	50c9132ddf	ceph: add new RECOVER mount_state when recovering session When recovering a session (a'la recover_session=clean), we want to do all of the operations that we do on a forced umount, but changing the mount state to SHUTDOWN is can cause queued MDS requests to fail when the session comes back. Most of those can idle until the session is recovered in this situation. Reserve SHUTDOWN state for forced umount, and make a new RECOVER state for the forced reconnect situation. Change several tests for equality with SHUTDOWN to test for that or RECOVER. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:46 +01:00
Jeff Layton	aa5c791053	ceph: make fsc->mount_state an int This field is an unsigned long currently, which is a bit of a waste on most arches since this just holds an enum. Make it (signed) int instead. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:46 +01:00
Jeff Layton	dc167e38a0	ceph: don't WARN when removing caps due to blocklisting We expect to remove dirty caps when the client is blocklisted. Don't throw a warning in that case. [ idryomov: break unnecessarily long line ] Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: "Yan, Zheng" <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>	2020-12-14 23:21:46 +01:00
Linus Torvalds	405f868f13	- Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in the end (Gabriel Krisman Bertazi) - All kinds of minor cleanups all over the tree. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl/XgtoACgkQEsHwGGHe VUqGuA/9GqN2zNQdhgRvAQ+FLZiOYK9MfXcoayfMq8T61VRPDBWaQRfVYKmfmEjS 0l5OnYgZQ9n6vzqFy6pmgc/ix8Jr553dZp5NCamcOqjCTcuO/LwRRh+ZBeFSBTPi r2qFYKKRYvM7nbyUMm4WqvAakxJ18xsjNbIslr9Aqe8WtHBKKX3MOu8SOpFtGyXz aEc4rhsS45iZa5gTXhvOn73tr3yHGWU1rzyyAAAmDGTgAxRwsTna8v16C4+v+Bua Zg18Wiutj8ZjtFpzKJtGWGZoSBap3Jw2Ys64g42MBQUE56KY/99tQVo/SvbYvvlf PHWLH0f3rPNJ6J2qeKwhtNzPlEAH/6e416A1/6TVwsK+8pdfGmkfaQh2iDHLhJ5i CSwF61H44ZaE3pc1tHHbC5ALvydPlup7D4MKgztfq0mZ3OoV2Vg7dtyyr+Ybz72b G+Kl/tmyacQTXo0FiYbZKETo3/VfTdBXGyVax1rHkx3pt8zvhFg3kxb1TT/l/CoM eSTx53PtTdVtbGOq1CjnUm0FKlbh4+kLoNuo9DYKeXUQBs8PWOCZmL3wXmm4cqlZ mDZVWvll7CjToY8izzcE/AG279cWkgcL5Tcg7W7CR66+egfDdpuqOZ4tv4TyzoWq 0J7WeNj+TAo98b7RA0Ux8LOlszRxS2ykuI6uB2MgwCaRMbbaQao= =lLiH -----END PGP SIGNATURE----- Merge tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Borislav Petkov: "Another branch with a nicely negative diffstat, just the way I like 'em: - Remove all uses of TIF_IA32 and TIF_X32 and reclaim the two bits in the end (Gabriel Krisman Bertazi) - All kinds of minor cleanups all over the tree" * tag 'x86_cleanups_for_v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86/ia32_signal: Propagate __user annotation properly x86/alternative: Update text_poke_bp() kernel-doc comment x86/PCI: Make a kernel-doc comment a normal one x86/asm: Drop unused RDPID macro x86/boot/compressed/64: Use TEST %reg,%reg instead of CMP $0,%reg x86/head64: Remove duplicate include x86/mm: Declare 'start' variable where it is used x86/head/64: Remove unused GET_CR2_INTO() macro x86/boot: Remove unused finalize_identity_maps() x86/uaccess: Document copy_from_user_nmi() x86/dumpstack: Make show_trace_log_lvl() static x86/mtrr: Fix a kernel-doc markup x86/setup: Remove unused MCA variables x86, libnvdimm/test: Remove COPY_MC_TEST x86: Reclaim TIF_IA32 and TIF_X32 x86/mm: Convert mmu context ia32_compat into a proper flags field x86/elf: Use e_machine to check for x32/ia32 in setup_additional_pages() elf: Expose ELF header on arch_setup_additional_pages() x86/elf: Use e_machine to select start_thread for x32 elf: Expose ELF header in compat_start_thread() ...	2020-12-14 13:45:26 -08:00
Linus Torvalds	9e4b0d55d8	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto updates from Herbert Xu: "API: - Add speed testing on 1420-byte blocks for networking Algorithms: - Improve performance of chacha on ARM for network packets - Improve performance of aegis128 on ARM for network packets Drivers: - Add support for Keem Bay OCS AES/SM4 - Add support for QAT 4xxx devices - Enable crypto-engine retry mechanism in caam - Enable support for crypto engine on sdm845 in qce - Add HiSilicon PRNG driver support" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (161 commits) crypto: qat - add capability detection logic in qat_4xxx crypto: qat - add AES-XTS support for QAT GEN4 devices crypto: qat - add AES-CTR support for QAT GEN4 devices crypto: atmel-i2c - select CONFIG_BITREVERSE crypto: hisilicon/trng - replace atomic_add_return() crypto: keembay - Add support for Keem Bay OCS AES/SM4 dt-bindings: Add Keem Bay OCS AES bindings crypto: aegis128 - avoid spurious references crypto_aegis128_update_simd crypto: seed - remove trailing semicolon in macro definition crypto: x86/poly1305 - Use TEST %reg,%reg instead of CMP $0,%reg crypto: x86/sha512 - Use TEST %reg,%reg instead of CMP $0,%reg crypto: aesni - Use TEST %reg,%reg instead of CMP $0,%reg crypto: cpt - Fix sparse warnings in cptpf hwrng: ks-sa - Add dependency on IOMEM and OF crypto: lib/blake2s - Move selftest prototype into header file crypto: arm/aes-ce - work around Cortex-A57/A72 silion errata crypto: ecdh - avoid unaligned accesses in ecdh_set_secret() crypto: ccree - rework cache parameters handling crypto: cavium - Use dma_set_mask_and_coherent to simplify code crypto: marvell/octeontx - Use dma_set_mask_and_coherent to simplify code ...	2020-12-14 12:18:19 -08:00
Linus Torvalds	51895d58c7	fsverity updates for 5.11 Some cleanups for fs-verity: - Rename some names that have been causing confusion. - Move structs needed for file signing to the UAPI header. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCX9bedBQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK7s6AQD4xMyyPTAQbznYlxPzn+2d8H3y+qik xbX7JwATnyOM4AEApDMGPvokvmEJO4z/ionUunR20i//JIyOklxy8w4whQ0= =XO4L -----END PGP SIGNATURE----- Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fsverity updates from Eric Biggers: "Some cleanups for fs-verity: - Rename some names that have been causing confusion - Move structs needed for file signing to the UAPI header" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fs-verity: move structs needed for file signing to UAPI header fs-verity: rename "file measurement" to "file digest" fs-verity: rename fsverity_signed_digest to fsverity_formatted_digest fs-verity: remove filenames from file comments	2020-12-14 12:10:32 -08:00
Linus Torvalds	7c7fdaf6ad	fscrypt updates for 5.11 This release there are some fixes for longstanding problems, as well as some cleanups: - Fix a race condition where a duplicate filename could be created in an encrypted directory if a syscall that creates a new filename raced with the directory's encryption key being added. - Allow deleting files that use an unsupported encryption policy. - Simplify the locking for 'struct fscrypt_master_key'. - Remove kernel-internal constants from the UAPI header. As usual, all these patches have been in linux-next with no reported issues, and I've tested them with xfstests. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCX9bcDxQcZWJpZ2dlcnNA Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK/HRAP95FGQqS47rIEh4LrvS7cohMJxb5NiX KokAyB88GgmzLQD/c4Xh+iYOxxhFX5NRgruuoec876DrzsuNbEt7WNJ6CQc= =CBoc -----END PGP SIGNATURE----- Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fscrypt updates from Eric Biggers: "This release there are some fixes for longstanding problems, as well as some cleanups: - Fix a race condition where a duplicate filename could be created in an encrypted directory if a syscall that creates a new filename raced with the directory's encryption key being added. - Allow deleting files that use an unsupported encryption policy. - Simplify the locking for 'struct fscrypt_master_key'. - Remove kernel-internal constants from the UAPI header. As usual, all these patches have been in linux-next with no reported issues, and I've tested them with xfstests" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: fscrypt: allow deleting files with unsupported encryption policy fscrypt: unexport fscrypt_get_encryption_info() fscrypt: move fscrypt_require_key() to fscrypt_private.h fscrypt: move body of fscrypt_prepare_setattr() out-of-line fscrypt: introduce fscrypt_prepare_readdir() ext4: don't call fscrypt_get_encryption_info() from dx_show_leaf() ubifs: remove ubifs_dir_open() f2fs: remove f2fs_dir_open() ext4: remove ext4_dir_open() fscrypt: simplify master key locking fscrypt: remove unnecessary calls to fscrypt_require_key() ubifs: prevent creating duplicate encrypted filenames f2fs: prevent creating duplicate encrypted filenames ext4: prevent creating duplicate encrypted filenames fscrypt: add fscrypt_is_nokey_name() fscrypt: remove kernel-internal constants from UAPI header	2020-12-14 12:06:54 -08:00
Ronnie Sahlberg	5c4b642141	cifs: fix uninitialized variable in smb3_fs_context_parse_param Addresses an issue noted by the kernel test robot Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>	2020-12-14 13:40:43 -06:00
Ronnie Sahlberg	1cb6c3d62c	cifs: update mnt_cifs_flags during reconfigure Many mount flags (e.g. for noperm, noxattr, nobrl, cifsacl, mfsymlinks and more) can be updated now. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 13:37:43 -06:00
Ronnie Sahlberg	2d39f50c2b	cifs: move update of flags into a separate function This function will set/clear flags that can be changed during mount or remount Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:28:25 -06:00
Ronnie Sahlberg	51acd208bd	cifs: remove ctx argument from cifs_setup_cifs_sb Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Ronnie Sahlberg	531f03bc6d	cifs: do not allow changing posix_paths during remount Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Ronnie Sahlberg	7c7ee628f8	cifs: uncomplicate printing the iocharset parameter There is no need to load the default nls to check if the iocharset argument was specified or not since we have it in cifs_sb->ctx Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Ronnie Sahlberg	6fd4ea88b5	cifs: don't create a temp nls in cifs_setup_ipc just use the one that is already available in ctx Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Ronnie Sahlberg	387ec58f33	cifs: simplify handling of cifs_sb/ctx->local_nls Only load/unload local_nls from cifs_sb and just make the ctx contain a pointer to cifs_sb->ctx. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Ronnie Sahlberg	9ccecae8d1	cifs: we do not allow changing username/password/unc/... during remount Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Ronnie Sahlberg	d6a7878340	cifs: add initial reconfigure support Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Ronnie Sahlberg	522aa3b575	cifs: move [brw]size from cifs_sb to cifs_sb->ctx Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Ronnie Sahlberg	c741cba2cd	cifs: move cifs_cleanup_volume_info[_content] to fs_context.c and rename it to smb3_cleanup_fs_context[_content] Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:26:30 -06:00
Dmitry Osipenko	427c4f004e	cifs: Add missing sentinel to smb3_fs_parameters Add missing sentinel to smb3_fs_parameters. This fixes ARM32 kernel crashing once CIFS is registered. Unable to handle kernel paging request at virtual address 33626d73 ... (strcmp) from (fs_validate_description) (fs_validate_description) from (register_filesystem) (register_filesystem) from (init_cifs [cifs]) (init_cifs [cifs]) from (do_one_initcall) (do_one_initcall) from (do_init_module) (do_init_module) from (load_module) (load_module) from (sys_finit_module) (sys_finit_module) from (ret_fast_syscal) Fixes: e07724d1cf38 ("cifs: switch to new mount api") Signed-off-by: Dmitry Osipenko <digetx@gmail.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:21:33 -06:00
Samuel Cabrero	121d947d4f	cifs: Handle witness client move notification This message is sent to tell a client to close its current connection and connect to the specified address. Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:18:55 -06:00
Ronnie Sahlberg	af1e40d9ac	cifs: remove actimeo from cifs_sb Can now be accessed via the ctx Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:23 -06:00
Ronnie Sahlberg	8401e93678	cifs: remove [gu]id/backup[gu]id/file_mode/dir_mode from cifs_sb We can already access these from cifs_sb->ctx so we no longer need a local copy in cifs_sb. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:23 -06:00
Steve French	ee0dce4926	cifs: remove some minor warnings pointed out by kernel test robot Correct some trivial warnings caused when new file unc.c was created. For example: In file included from fs/cifs/unc.c:11: >> fs/cifs/cifsproto.h:44:28: warning: 'struct TCP_Server_Info' declared inside parameter list will not be visible outside of this definition or declaration 44 \| extern int smb_send(struct TCP_Server_Info , struct smb_hdr , Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:23 -06:00
Steve French	607dfc79c3	cifs: remove various function description warnings When compiling with W=1 I noticed various functions that did not follow proper style in describing (in the comments) the parameters passed in to the function. For example: fs/cifs/inode.c:2236: warning: Function parameter or member 'mode' not described in 'cifs_wait_bit_killable' I did not address the style warnings in two of the six files (connect.c and misc.c) in order to reduce risk of merge conflict with pending patches. We can update those later. Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:23 -06:00
Samuel Cabrero	7d6535b720	cifs: Simplify reconnect code when dfs upcall is enabled Some witness notifications, like client move, tell the client to reconnect to a specific IP address. In this situation the DFS failover code path has to be skipped so clean up as much as possible the cifs_reconnect() code. Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:23 -06:00
Samuel Cabrero	21077c62e1	cifs: Send witness register messages to userspace daemon in echo task If the daemon starts after mounting a share, or if it crashes, this provides a mechanism to register again. Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:23 -06:00
Samuel Cabrero	20fab0da2f	cifs: Add witness information to debug data dump + Indicate if witness feature is supported + Indicate if witness is used when dumping tcons + Dumps witness registrations. Example: Witness registrations: Id: 1 Refs: 1 Network name: 'fs.fover.ad'(y) Share name: 'share1'(y) \ Ip address: 192.168.103.200(n) Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Samuel Cabrero	fed979a7e0	cifs: Set witness notification handler for messages from userspace daemon + Set a handler for the witness notification messages received from the userspace daemon. + Handle the resource state change notification. When the resource becomes unavailable or available set the tcp status to CifsNeedReconnect for all channels. Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Samuel Cabrero	bf80e5d425	cifs: Send witness register and unregister commands to userspace daemon + Define the generic netlink family commands and message attributes to communicate with the userspace daemon + The register and unregister commands are sent when connecting or disconnecting a tree. The witness registration keeps a pointer to the tcon and has the same lifetime. + Each registration has an id allocated by an IDR. This id is sent to the userspace daemon in the register command, and will be included in the notification messages from the userspace daemon to retrieve from the IDR the matching registration. + The authentication information is bundled in the register message. If kerberos is used the message just carries a flag. Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Steve French	e68f4a7bf0	cifs: minor updates to Kconfig Correct references to fs/cifs/README which has been replaced by Documentation/filesystems/admin-guide/cifs/usage.rst, and also correct a typo. Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Samuel Cabrero	0ac4e2919a	cifs: add witness mount option and data structs Add 'witness' mount option to register for witness notifications. Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Samuel Cabrero	06f08dab3c	cifs: Register generic netlink family Register a new generic netlink family to talk to the witness service userspace daemon. Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Steve French	047092ffe2	cifs: cleanup misc.c misc.c was getting a little large, move two of the UNC parsing relating functions to a new C file unc.c which makes the coding of the upcoming witness protocol patch series a little cleaner as well. Suggested-by: Rafal Szczesniak <rafal@elbingbrewery.org> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Steve French	bc04499477	cifs: minor kernel style fixes for comments Trivial fix for a few comments which didn't follow kernel style Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Samuel Cabrero	e73a42e07a	cifs: Make extract_sharename function public Move the function to misc.c Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Samuel Cabrero	a87e67254b	cifs: Make extract_hostname function public Move the function to misc.c and give it a public header. Signed-off-by: Samuel Cabrero <scabrero@suse.de> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-14 09:16:22 -06:00
Miklos Szeredi	459c7c565a	ovl: unprivieged mounts Enable unprivileged user namespace mounts of overlayfs. Overlayfs's permission model () ensures that the mounter itself cannot gain additional privileges by the act of creating an overlayfs mount. This feature request is coming from the "rootless" container crowd. () Documentation/filesystems/overlayfs.txt#Permission model Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	87b2c60c61	ovl: do not get metacopy for userxattr When looking up an inode on the lower layer for which the mounter lacks read permisison the metacopy check will fail. This causes the lookup to fail as well, even though the directory is readable. So ignore EACCES for the "userxattr" case and assume no metacopy for the unreadable file. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	b6650dab40	ovl: do not fail because of O_NOATIME In case the file cannot be opened with O_NOATIME because of lack of capabilities, then clear O_NOATIME instead of failing. Remove WARN_ON(), since it would now trigger if O_NOATIME was cleared. Noticed by Amir Goldstein. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	6939f977c5	ovl: do not fail when setting origin xattr Comment above call already says this, but only EOPNOTSUPP is ignored, other failures are not. For example setting "user.*" will fail with EPERM on symlink/special. Ignore this error as well. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	2d2f2d7322	ovl: user xattr Optionally allow using "user.overlay." namespace instead of "trusted.overlay." This is necessary for overlayfs to be able to be mounted in an unprivileged namepsace. Make the option explicit, since it makes the filesystem format be incompatible. Disable redirect_dir and metacopy options, because these would allow privilege escalation through direct manipulation of the "user.overlay.redirect" or "user.overlay.metacopy" xattrs. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	82a763e61e	ovl: simplify file splice generic_file_splice_read() and iter_file_splice_write() will call back into f_op->iter_read() and f_op->iter_write() respectively. These already do the real file lookup and cred override. So the code in ovl_splice_read() and ovl_splice_write() is redundant. In addition the ovl_file_accessed() call in ovl_splice_write() is incorrect, though probably harmless. Fix by calling generic_file_splice_read() and iter_file_splice_write() directly. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	89bdfaf93d	ovl: make ioctl() safe ovl_ioctl_set_flags() does a capability check using flags, but then the real ioctl double-fetches flags and uses potentially different value. The "Check the capability before cred override" comment misleading: user can skip this check by presenting benign flags first and then overwriting them to non-benign flags. Just remove the cred override for now, hoping this doesn't cause a regression. The proper solution is to create a new setxflags i_op (patches are in the works). Xfstests don't show a regression. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Fixes: `dab5ca8fd9` ("ovl: add lsattr/chattr support") Cc: <stable@vger.kernel.org> # v4.19	2020-12-14 15:26:14 +01:00
Miklos Szeredi	c846af050f	ovl: check privs before decoding file handle CAP_DAC_READ_SEARCH is required by open_by_handle_at(2) so check it in ovl_decode_real_fh() as well to prevent privilege escalation for unprivileged overlay mounts. [Amir] If the mounter is not capable in init ns, ovl_check_origin() and ovl_verify_index() will not function as expected and this will break index and nfs export features. So check capability in ovl_can_decode_fh(), to auto disable those features. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-12-14 15:26:14 +01:00
Miklos Szeredi	3078d85c9a	vfs: verify source area in vfs_dedupe_file_range_one() Call remap_verify_area() on the source file as well as the destination. When called from vfs_dedupe_file_range() the check as already been performed, but not so if called from layered fs (overlayfs, etc...) Could ommit the redundant check in vfs_dedupe_file_range(), but leave for now to get error early (for fear of breaking backward compatibility). This call shouldn't be performance sensitive. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>	2020-12-14 15:26:13 +01:00
Miklos Szeredi	7c03e2cda4	vfs: move cap_convert_nscap() call into vfs_setxattr() cap_convert_nscap() does permission checking as well as conversion of the xattr value conditionally based on fs's user-ns. This is needed by overlayfs and probably other layered fs (ecryptfs) and is what vfs_foo() is supposed to do anyway. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Acked-by: James Morris <jamorris@linux.microsoft.com>	2020-12-14 15:26:13 +01:00
Trond Myklebust	5c3485bb12	NFSv4.2/pnfs: Don't use READ_PLUS with pNFS yet We have no way of tracking server READ_PLUS support in pNFS for now, so just disable it. Reported-by: "Mkrtchyan, Tigran" <tigran.mkrtchyan@desy.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:08 -05:00
Trond Myklebust	7aedc687c9	NFSv4.2: Deal with potential READ_PLUS data extent buffer overflow If the server returns more data than we have buffer space for, then we need to truncate and exit early. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:08 -05:00
Trond Myklebust	503b934a75	NFSv4.2: Don't error when exiting early on a READ_PLUS buffer overflow Expanding the READ_PLUS extents can cause the read buffer to overflow. If it does, then don't error, but just exit early. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:08 -05:00
Trond Myklebust	dac3b1059b	NFSv4.2: Handle hole lengths that exceed the READ_PLUS read buffer If a hole extends beyond the READ_PLUS read buffer, then we want to fill just the remaining buffer with zeros. Also ignore eof... Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:08 -05:00
Trond Myklebust	82f98c8b11	NFSv4.2: decode_read_plus_hole() needs to check the extent offset The server is allowed to return a hole extent with an offset that starts before the offset supplied in the READ_PLUS argument. Ensure that we support that case too. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	5c4afe2ab6	NFSv4.2: decode_read_plus_data() must skip padding after data segment All XDR opaque object sizes are 32-bit aligned, and a data segment is no exception. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Trond Myklebust	1ee6310119	NFSv4.2: Ensure we always reset the result->count in decode_read_plus() Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Geliang Tang	1f70ea7009	NFSv4.1: use BITS_PER_LONG macro in nfs4session.h Use the existing BITS_PER_LONG macro instead of calculating the value. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Frank van der Linden	a1f26739cc	NFSv4.2: improve page handling for GETXATTR XDRBUF_SPARSE_PAGES can cause problems for the RDMA transport, and it's easy enough to allocate enough pages for the request up front, so do that. Also, since we've allocated the pages anyway, use the full page aligned length for the receive buffer. This will allow caching of valid replies that are too large for the caller, but that still fit in the allocated pages. Signed-off-by: Frank van der Linden <fllinden@amazon.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-14 06:51:07 -05:00
Ronnie Sahlberg	a2a52a8a36	cifs: get rid of cifs_sb->mountdata as we now have a full smb3_fs_context as part of the cifs superblock we no longer need a local copy of the mount options and can just reference the copy in the smb3_fs_context. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Ronnie Sahlberg	d17abdf756	cifs: add an smb3_fs_context to cifs_sb and populate it during mount in cifs_smb3_do_mount() Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Ronnie Sahlberg	4deb075985	cifs: remove the devname argument to cifs_compose_mount_options none of the callers use this argument any more. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Ronnie Sahlberg	24e0a1eff9	cifs: switch to new mount api See Documentation/filesystems/mount_api.rst for details on new mount API Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Ronnie Sahlberg	66e7b09c73	cifs: move cifs_parse_devname to fs_context.c Also rename the function from cifs_ to smb3_ Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Ronnie Sahlberg	15c7d09af2	cifs: move the enum for cifs parameters into fs_context.h No change to logic, just moving the enum of cifs mount parms into a header Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Ronnie Sahlberg	837e3a1bbf	cifs: rename dup_vol to smb3_fs_context_dup and move it into fs_context.c Continue restructuring needed for support of new mount API Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Ronnie Sahlberg	3fa1c6d1b8	cifs: rename smb_vol as smb3_fs_context and move it to fs_context.h Harmonize and change all such variables to 'ctx', where possible. No changes to actual logic. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Steve French	7955f105af	SMB3.1.1: do not log warning message if server doesn't populate salt In the negotiate protocol preauth context, the server is not required to populate the salt (although it is done by most servers) so do not warn on mount. We retain the checks (warn) that the preauth context is the minimum size and that the salt does not exceed DataLength of the SMB response. Although we use the defaults in the case that the preauth context response is invalid, these checks may be useful in the future as servers add support for additional mechanisms. CC: Stable <stable@vger.kernel.org> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Steve French	145024e3e4	SMB3.1.1: update comments clarifying SPNEGO info in negprot response Trivial changes to clarify confusing comment about SPNEGO blog (and also one length comparisons in negotiate context parsing). Suggested-by: Tom Talpey <tom@talpey.com> Suggested-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Shyam Prasad N	f2156d35c9	cifs: Enable sticky bit with cifsacl mount option. For the cifsacl mount option, we did not support sticky bits. With this patch, we do support it, by setting the DELETE_CHILD perm on the directory only for the owner user. When sticky bit is not enabled, allow DELETE_CHILD perm for everyone. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Shyam Prasad N	0f22053e81	cifs: Fix unix perm bits to cifsacl conversion for "other" bits. With the "cifsacl" mount option, the mode bits set on the file/dir is converted to corresponding ACEs in DACL. However, only the ALLOWED ACEs were being set for "owner" and "group" SIDs. Since owner is a subset of group, and group is a subset of everyone/world SID, in order to properly emulate unix perm groups, we need to add DENIED ACEs. If we don't do that, "owner" and "group" SIDs could get more access rights than they should. Which is what was happening. This fixes it. We try to keep the "preferred" order of ACEs, i.e. DENYs followed by ALLOWs. However, for a small subset of cases we cannot maintain the preferred order. In that case, we'll end up with the DENY ACE for group after the ALLOW for the owner. If owner SID == group SID, use the more restrictive among the two perm bits and convert them to ACEs. Also, for reverse mapping, i.e. to convert ACL to unix perm bits, for the "others" bits, we needed to add the masked bits of the owner and group masks to others mask. Updated version of patch fixes a problem noted by the kernel test robot. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Steve French	bc7c4129d4	SMB3.1.1: remove confusing mount warning when no SPNEGO info on negprot rsp Azure does not send an SPNEGO blob in the negotiate protocol response, so we shouldn't assume that it is there when validating the location of the first negotiate context. This avoids the potential confusing mount warning: CIFS: Invalid negotiate context offset CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Steve French	ebcd6de987	SMB3: avoid confusing warning message on mount to Azure Mounts to Azure cause an unneeded warning message in dmesg "CIFS: VFS: parse_server_interfaces: incomplete interface info" Azure rounds up the size (by 8 additional bytes, to a 16 byte boundary) of the structure returned on the query of the server interfaces at mount time. This is permissible even though different than other servers so do not log a warning if query network interfaces response is only rounded up by 8 bytes or fewer. CC: Stable <stable@vger.kernel.org> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Gustavo A. R. Silva	21ac58f495	cifs: Fix fall-through warnings for Clang In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple break/goto statements instead of just letting the code fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>	2020-12-13 19:12:07 -06:00
Zhihao Cheng	b80a974b8c	ubifs: ubifs_dump_node: Dump all branches of the index node An index node can have up to c->fanout branches, all branches should be displayed while dumping index node. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 22:12:42 +01:00
Zhihao Cheng	bf6dab7a6c	ubifs: ubifs_dump_sleb: Remove unused function Function ubifs_dump_sleb() is defined but unused, it can be removed. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 22:12:38 +01:00
Zhihao Cheng	a33e30a0e0	ubifs: Pass node length in all node dumping callers Function ubifs_dump_node() has been modified to avoid memory oob accessing while dumping node, node length (corresponding to the size of allocated memory for node) should be passed into all node dumping callers. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 22:12:32 +01:00
Zhihao Cheng	c8be097530	Revert "ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len" This reverts commit `acc5af3efa` ("ubifs: Fix out-of-bounds memory access caused by abnormal value of node_len") No need to avoid memory oob in dumping for data node alone. Later, node length will be passed into function 'ubifs_dump_node()' which replaces all node dumping places. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 22:11:51 +01:00
Zhihao Cheng	c4c0d19d39	ubifs: Limit dumping length by size of memory which is allocated for the node To prevent memory out-of-bounds accessing in ubifs_dump_node(), actual dumping length should be restricted by another condition(size of memory which is allocated for the node). This patch handles following situations (These situations may be caused by bit flipping due to hardware error, writing bypass ubifs, unknown bugs in ubifs, etc.): 1. bad node_len: Dumping data according to 'ch->len' which may exceed the size of memory allocated for node. 2. bad node content: Some kinds of node can record additional data, eg. index node and orphan node, make sure the size of additional data not beyond the node length. 3. node_type changes: Read data according to type A, but expected type B, before that, node is allocated according to type B's size. Length of type A node is greater than type B node. Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 22:11:36 +01:00
Chengsong Ke	32f6ccc743	ubifs: Remove the redundant return in dbg_check_nondata_nodes_order There is a redundant return in dbg_check_nondata_nodes_order, which will be never reached. In addition, error code should be returned instead of zero in this branch. Signed-off-by: Chengsong Ke <kechengsong@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 22:11:23 +01:00
Jamie Iles	a61df3c413	jffs2: Fix NULL pointer dereference in rp_size fs option parsing syzkaller found the following JFFS2 splat: Unable to handle kernel paging request at virtual address dfffa00000000001 Mem abort info: ESR = 0x96000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [dfffa00000000001] address between user and kernel address ranges Internal error: Oops: 96000004 [#1] SMP Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 12745 Comm: syz-executor.5 Tainted: G S 5.9.0-rc8+ #98 Hardware name: linux,dummy-virt (DT) pstate: 20400005 (nzCv daif +PAN -UAO BTYPE=--) pc : jffs2_parse_param+0x138/0x308 fs/jffs2/super.c:206 lr : jffs2_parse_param+0x108/0x308 fs/jffs2/super.c:205 sp : ffff000022a57910 x29: ffff000022a57910 x28: 0000000000000000 x27: ffff000057634008 x26: 000000000000d800 x25: 000000000000d800 x24: ffff0000271a9000 x23: ffffa0001adb5dc0 x22: ffff000023fdcf00 x21: 1fffe0000454af2c x20: ffff000024cc9400 x19: 0000000000000000 x18: 0000000000000000 x17: 0000000000000000 x16: ffffa000102dbdd0 x15: 0000000000000000 x14: ffffa000109e44bc x13: ffffa00010a3a26c x12: ffff80000476e0b3 x11: 1fffe0000476e0b2 x10: ffff80000476e0b2 x9 : ffffa00010a3ad60 x8 : ffff000023b70593 x7 : 0000000000000003 x6 : 00000000f1f1f1f1 x5 : ffff000023fdcf00 x4 : 0000000000000002 x3 : ffffa00010000000 x2 : 0000000000000001 x1 : dfffa00000000000 x0 : 0000000000000008 Call trace: jffs2_parse_param+0x138/0x308 fs/jffs2/super.c:206 vfs_parse_fs_param+0x234/0x4e8 fs/fs_context.c:117 vfs_parse_fs_string+0xe8/0x148 fs/fs_context.c:161 generic_parse_monolithic+0x17c/0x208 fs/fs_context.c:201 parse_monolithic_mount_data+0x7c/0xa8 fs/fs_context.c:649 do_new_mount fs/namespace.c:2871 [inline] path_mount+0x548/0x1da8 fs/namespace.c:3192 do_mount+0x124/0x138 fs/namespace.c:3205 __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount fs/namespace.c:3390 [inline] __arm64_sys_mount+0x164/0x238 fs/namespace.c:3390 __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline] invoke_syscall arch/arm64/kernel/syscall.c:48 [inline] el0_svc_common.constprop.0+0x15c/0x598 arch/arm64/kernel/syscall.c:149 do_el0_svc+0x60/0x150 arch/arm64/kernel/syscall.c:195 el0_svc+0x34/0xb0 arch/arm64/kernel/entry-common.c:226 el0_sync_handler+0xc8/0x5b4 arch/arm64/kernel/entry-common.c:236 el0_sync+0x15c/0x180 arch/arm64/kernel/entry.S:663 Code: d2d40001 f2fbffe1 91002260 d343fc02 (38e16841) ---[ end trace 4edf690313deda44 ]--- This is because since `ec10a24f10`, the option parsing happens before fill_super and so the MTD device isn't associated with the filesystem. Defer the size check until there is a valid association. Fixes: `ec10a24f10` ("vfs: Convert jffs2 to use the new mount API") Cc: <stable@vger.kernel.org> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Jamie Iles <jamie@nuviainc.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:57:21 +01:00
Fangping Liang	89f40d0a96	ubifs: Fixed print foramt mismatch in ubifs fs/ubifs/super.c: function mount_ubifs: the format specifier "lld" need arg type "long long", but the according arg "old_idx_sz" has type "unsigned long long" Signed-off-by: Fangping Liang <liangfangping@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:57:21 +01:00
Tom Rix	22bdb8b6fd	jffs2: remove trailing semicolon in macro definition The macro use will already have a semicolon. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:57:20 +01:00
Wang ShaoBo	3cded66330	ubifs: Fix error return code in ubifs_init_authentication() Fix to return PTR_ERR() error code from the error handling case where ubifs_hash_get_desc() failed instead of 0 in ubifs_init_authentication(), as done elsewhere in this function. Fixes: `49525e5eec` ("ubifs: Add helper functions for authentication support") Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Reviewed-by: Sascha Hauer <s.hauer@pengutronix.de> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:57:20 +01:00
Richard Weinberger	20f1431160	ubifs: wbuf: Don't leak kernel memory to flash Write buffers use a kmalloc()'ed buffer, they can leak up to seven bytes of kernel memory to flash if writes are not aligned. So use ubifs_pad() to fill these gaps with padding bytes. This was never a problem while scanning because the scanner logic manually aligns node lengths and skips over these gaps. Cc: <stable@vger.kernel.org> Fixes: `1e51764a3c` ("UBIFS: add new flash file system") Signed-off-by: Richard Weinberger <richard@nod.at> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:57:20 +01:00
Chengsong Ke	f212400783	ubifs: Fix the printing type of c->big_lpt Ubifs uses %d to print c->big_lpt, but c->big_lpt is a variable of type unsigned int and should be printed with %u. Signed-off-by: Chengsong Ke <kechengsong@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:57:10 +01:00
lizhe	cd3ed3c73a	jffs2: Allow setting rp_size to zero during remounting Set rp_size to zero will be ignore during remounting. The method to identify whether we input a remounting option of rp_size is to check if the rp_size input is zero. It can not work well if we pass "rp_size=0". This patch add a bool variable "set_rp_size" to fix this problem. Reported-by: Jubin Zhong <zhongjubin@huawei.com> Signed-off-by: lizhe <lizhe67@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:56:24 +01:00
lizhe	08cd274f9b	jffs2: Fix ignoring mounting options problem during remounting The jffs2 mount options will be ignored when remounting jffs2. It can be easily reproduced with the steps listed below. 1. mount -t jffs2 -o compr=none /dev/mtdblockx /mnt 2. mount -o remount compr=zlib /mnt Since `ec10a24f10`, the option parsing happens before fill_super and then pass fc, which contains the options parsing results, to function jffs2_reconfigure during remounting. But function jffs2_reconfigure do not update c->mount_opts. This patch add a function jffs2_update_mount_opts to fix this problem. By the way, I notice that tmpfs use the same way to update remounting options. If it is necessary to unify them? Cc: <stable@vger.kernel.org> Fixes: `ec10a24f10` ("vfs: Convert jffs2 to use the new mount API") Signed-off-by: lizhe <lizhe67@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:55:59 +01:00
Zhe Li	9afc9a8a49	jffs2: Fix GC exit abnormally The log of this problem is: jffs2: Error garbage collecting node at 0x**! jffs2: No space for garbage collection. Aborting GC thread This is because GC believe that it do nothing, so it abort. After going over the image of jffs2, I find a scene that can trigger this problem stably. The scene is: there is a normal dirent node at summary-area, but abnormal at corresponding not-summary-area with error name_crc. The reason that GC exit abnormally is because it find that abnormal dirent node to GC, but when it goes to function jffs2_add_fd_to_list, it cannot meet the condition listed below: if ((prev)->nhash == new->nhash && !strcmp((*prev)->name, new->name)) So no node is marked obsolete, statistical information of erase_block do not change, which cause GC exit abnormally. The root cause of this problem is: we do not check the name_crc of the abnormal dirent node with summary is enabled. Noticed that in function jffs2_scan_dirent_node, we use function jffs2_scan_dirty_space to deal with the dirent node with error name_crc. So this patch add a checking code in function read_direntry to ensure the correctness of dirent node. If checked failed, the dirent node will be marked obsolete so GC will pass this node and this problem will be fixed. Cc: <stable@vger.kernel.org> Signed-off-by: Zhe Li <lizhe67@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:55:39 +01:00
Chengguang Xu	2976c19c95	ubifs: Code cleanup by removing ifdef macro surrounding Define ubifs_listxattr and ubifs_xattr_handlers to NULL when CONFIG_UBIFS_FS_XATTR is not enabled, then we can remove many ugly ifdef macros in the code. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:51:59 +01:00
Randy Dunlap	8fdaaf4cf3	jffs2: Fix if/else empty body warnings When debug (print) macros are not enabled, change them to use the no_printk() macro instead of <nothing>. This fixes gcc warnings when -Wextra is used: ../fs/jffs2/nodelist.c:255:37: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body] ../fs/jffs2/nodelist.c:278:38: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body] ../fs/jffs2/nodelist.c:558:52: warning: suggest braces around empty body in an ‘else’ statement [-Wempty-body] ../fs/jffs2/xattr.c:1247:58: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] ../fs/jffs2/xattr.c:1281:65: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body] Builds without warnings on all 3 levels of CONFIG_JFFS2_FS_DEBUG. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: linux-mtd@lists.infradead.org Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:51:54 +01:00
Randy Dunlap	b8f1da98a2	ubifs: Delete duplicated words + other fixes Delete repeated words in fs/ubifs/. {negative, is, of, and, one, it} where "it it" was changed to "if it". Signed-off-by: Randy Dunlap <rdunlap@infradead.org> To: linux-fsdevel@vger.kernel.org Cc: Richard Weinberger <richard@nod.at> Cc: linux-mtd@lists.infradead.org Signed-off-by: Richard Weinberger <richard@nod.at>	2020-12-13 21:51:46 +01:00
Zheng Yongjun	1189686e54	fs/xfs: convert comma to semicolon Replace a comma between expression statements by a semicolon. Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:49:47 -08:00
Christoph Hellwig	5d24ec4c7d	xfs: open code updating i_mode in xfs_set_acl Rather than going through the big and hairy xfs_setattr_nonsize function, just open code a transactional i_mode and i_ctime update. This allows to mark xfs_setattr_nonsize and remove the flags argument to it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:49:38 -08:00
Christoph Hellwig	26f88363ec	xfs: remove xfs_vn_setattr_nonsize Merge xfs_vn_setattr_nonsize into the only caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:48:25 -08:00
Gao Xiang	3937493c50	xfs: kill ialloced in xfs_dialloc() It's enough to just use return code, and get rid of an argument. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:48:25 -08:00
Dave Chinner	8d822dc38a	xfs: spilt xfs_dialloc() into 2 functions This patch explicitly separates free inode chunk allocation and inode allocation into two individual high level operations. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:48:25 -08:00
Dave Chinner	f3bf6e0f11	xfs: move xfs_dialloc_roll() into xfs_dialloc() Get rid of the confusing ialloc_context and failure handling around xfs_dialloc() by moving xfs_dialloc_roll() into xfs_dialloc(). Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:48:24 -08:00
Dave Chinner	1abcf26101	xfs: move on-disk inode allocation out of xfs_ialloc() So xfs_ialloc() will only address in-core inode allocation then, Also, rename xfs_ialloc() to xfs_dir_ialloc_init() in order to keep everything in xfs_inode.c under the same namespace. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:48:24 -08:00
Dave Chinner	aececc9f8d	xfs: introduce xfs_dialloc_roll() Introduce a helper to make the on-disk inode allocation rolling logic clearer in preparation of the following cleanup. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:48:24 -08:00
Gao Xiang	15574ebbff	xfs: convert noroom, okalloc in xfs_dialloc() to bool Boolean is preferred for such use. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>	2020-12-12 10:48:24 -08:00
Linus Torvalds	31d00f6eb1	io_uring-5.10-2020-12-11 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/UMOwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpmI3EAC22QohUftfWAJeks142J9wFKQgYfNZpZHs iJR07NTYUSghp9rpfGVNGY/ZonUoKds0VSbDZpRU8TTjglGf9GguzYE4iOMedtG3 MA9diNOnAAK0gwZrBx1uB882X4pf1qYGChMe9fXXJ4mQsHpG84q00GnJI0cPjkKn PaYRVugr8AxCzG4dHHL0ckQtsEHWaPwe7i5u4UkoOC3IsrVjRri6wYAa98xDuznv fqssoMp2cSOcVHsfypQAVZbg0heR3Zo2OuXE4uw6yjB7w5j8tuLIV9XKtqwiRn2n 57rdnGfY6vSfJb9EPm6fwBRnve2cI2Xbr30x/XWSIx+QcUOFOsBfdblaJikX8cSJ FG5c0AgBeRo1heWC4FRNIFStAubm0RPJAPqNm9p6T0DzO4hC3xTTdkl04B5bA4Ms 4tNWhk2fP8YRKhC0gRcw7zRHAl/qjuMjrUor5L68jxGlmFR7h+uMky0AsS4KZKNZ o1+HFmc/m+3ZBNzL7NoXkJzWFaKzZVf0iuEqJP3Xqy91hCA85pvh7DGAlQpZBppM C0eNeJ59rWvfSuUS9bLlLfoMO3Ab5BsE4IEji7K2lBNZsCXQpXHVv4xJxgKJRja9 t+SuBgYpNOfHrIPSY0LaJh93MNzoP22TP8qF8U5gWQ0Bus/mVKE9S1EXSUAvfHPm IyOfTB9M6g== =wjOp -----END PGP SIGNATURE----- Merge tag 'io_uring-5.10-2020-12-11' of git://git.kernel.dk/linux-block Pull io_uring fixes from Jens Axboe: "Two fixes in here, fixing issues introduced in this merge window" * tag 'io_uring-5.10-2020-12-11' of git://git.kernel.dk/linux-block: io_uring: fix file leak on error path of io ctx creation io_uring: fix mis-seting personality's creds	2020-12-12 09:45:01 -08:00
Jens Axboe	355fb9e2b7	io_uring: remove 'twa_signal_ok' deadlock work-around The TIF_NOTIFY_SIGNAL based implementation of TWA_SIGNAL is always safe to use, regardless of context, as we won't be recursing into the signal lock. So now that all archs are using that, we can drop this deadlock work-around as it's always safe to use TWA_SIGNAL. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-12 09:17:38 -07:00
Jens Axboe	792ee0f6db	io_uring: JOBCTL_TASK_WORK is no longer used by task_work Remove the dead code, TWA_SIGNAL will never set JOBCTL_TASK_WORK at this point. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-12 09:17:38 -07:00
Jakub Kicinski	46d5e62dd3	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net xdp_return_frame_bulk() needs to pass a xdp_buff to __xdp_return(). strlcpy got converted to strscpy but here it makes no functional difference, so just keep the right code. Conflicts: net/netfilter/nf_tables_api.c Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2020-12-11 22:29:38 -08:00
Linus Torvalds	782598ecea	zonefs fixes for 5.10-rc7 A single patch in this pull request to fix a BIO and page reference leak when writing sequential zone files. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSRPv8tYSvhwAzJdzjdoc3SxdoYdgUCX9NY0gAKCRDdoc3SxdoY dga9AQCq+hAOyA1FeCkWwBG0gywIIVhPscNphe1bH576JoaIZAEA86DsGB9BED7H yaezfh5W1lr63ctxtSVdyo+x5172fgY= =Te7s -----END PGP SIGNATURE----- Merge tag 'zonefs-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs fix from Damien Le Moal: "A single patch in this pull request to fix a BIO and page reference leak when writing sequential zone files" * tag 'zonefs-5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: fix page reference and BIO leak	2020-12-11 14:22:42 -08:00
Miles Chen	40d6366e9d	proc: use untagged_addr() for pagemap_read addresses When we try to visit the pagemap of a tagged userspace pointer, we find that the start_vaddr is not correct because of the tag. To fix it, we should untag the userspace pointers in pagemap_read(). I tested with 5.10-rc4 and the issue remains. Explanation from Catalin in [1]: "Arguably, that's a user-space bug since tagged file offsets were never supported. In this case it's not even a tag at bit 56 as per the arm64 tagged address ABI but rather down to bit 47. You could say that the problem is caused by the C library (malloc()) or whoever created the tagged vaddr and passed it to this function. It's not a kernel regression as we've never supported it. Now, pagemap is a special case where the offset is usually not generated as a classic file offset but rather derived by shifting a user virtual address. I guess we can make a concession for pagemap (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)" My test code is based on [2]: A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8 userspace program: uint64 OsLayer::VirtualToPhysical(void vaddr) { uint64 frame, paddr, pfnmask, pagemask; int pagesize = sysconf(_SC_PAGESIZE); off64_t off = ((uintptr_t)vaddr) / pagesize 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0 int fd = open(kPagemapPath, O_RDONLY); ... if (lseek64(fd, off, SEEK_SET) != off \|\| read(fd, &frame, 8) != 8) { int err = errno; string errtxt = ErrorString(err); if (fd >= 0) close(fd); return 0; } ... } kernel fs/proc/task_mmu.c: static ssize_t pagemap_read(struct file file, char __user buf, size_t count, loff_t ppos) { ... src = ppos; svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54 start_vaddr = svpfn << PAGE_SHIFT; // start_vaddr == 0xb400007662f54000 end_vaddr = mm->task_size; /* watch out for wraparound */ // svpfn == 0xb400007662f54 // (mm->task_size >> PAGE) == 0x8000000 if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4 start_vaddr = end_vaddr; ret = 0; while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr int len; unsigned long end; ... } ... } [1] https://lore.kernel.org/patchwork/patch/1343258/ [2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158 Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com Signed-off-by: Miles Chen <miles.chen@mediatek.com> Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will@kernel.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Song Bao Hua (Barry Song) <song.bao.hua@hisilicon.com> Cc: <stable@vger.kernel.org> [5.4-] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-12-11 14:02:14 -08:00
Amir Goldstein	fecc455978	fsnotify: fix events reported to watching parent and child fsnotify_parent() used to send two separate events to backends when a parent inode is watching children and the child inode is also watching. In an attempt to avoid duplicate events in fanotify, we unified the two backend callbacks to a single callback and handled the reporting of the two separate events for the relevant backends (inotify and dnotify). However the handling is buggy and can result in inotify and dnotify listeners receiving events of the type they never asked for or spurious events. The problem is the unified event callback with two inode marks (parent and child) is called when any of the parent and child inodes are watched and interested in the event, but the parent inode's mark that is interested in the event on the child is not necessarily the one we are currently reporting to (it could belong to a different group). So before reporting the parent or child event flavor to backend we need to check that the mark is really interested in that event flavor. The semantics of INODE and CHILD marks were hard to follow and made the logic more complicated than it should have been. Replace it with INODE and PARENT marks semantics to hopefully make the logic more clear. Thanks to Hugh Dickins for spotting a bug in the earlier version of this patch. Fixes: `497b0c5a7c` ("fsnotify: send event to parent and child with single callback") CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201202120713.702387-4-amir73il@gmail.com Reported-by: Hugh Dickins <hughd@google.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz>	2020-12-11 11:40:43 +01:00
Linus Torvalds	6840a3dcc2	More NFS Client Bugfixes for Linux 5.10 - Bugfixes: - Fix array overflow when flexfiles mirroring is enabled - Fix rpcrdma_inline_fixup() crash with new LISTXATTRS - Fix 5 second delay when doing inter-server copy - Disable READ_PLUS by default -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl/Sl70ACgkQ18tUv7Cl QOuGdw/9GMmrLF26brO3q3lSTD9xdljJGQi0q1DEF2amNUL9cPgxB0xZB9lyUPqs AbpLxsCgfJ+rN3sbOnDdz8Ae/wiWS1PkeJWw4cu0/kIIjyPD2Uu4jWuIwxpFvP1v lFfBHeEdsNgqS5t9t9cF9u30PjPKt9SHG/64/8GY92zlrKDRUaJUyvW1zUKo4ne4 m3JxZ1rCxNkBYV+odrQIYE9gqFI6OBWpVsUeIXjBWWDZ/+zK9rGiO5fxZ8oe+yT2 cmrR/vOznoPgUGaH380HrCbXcIKwim2mn76t1MH9fScZQc1ocSWckrZRRdOVWlJ+ 228ImDPACLFaTqpgi8ZhFmuD3Mwbz4AtPLce52lvZLuMAPVMoFhR8xrm/VpJFSim pNjwsl/j9qlGSAe5XdGW+X/DzmHPVd6uhfzOFAd7ZErt7rU0gjyJTWzrmayoB3N1 GF5g4LaK1dYYj7SG8d8DrEV2CGyYle2UP4aDm8qJ/ZLQhd10RHeVt4BHfSKwY05l 03Gim7wlH/ercYfgU6wrgmEG2SWjuFJr+j1v//aqcmJVhA7FvTKygGpGyNK16pNL 9dZ81wUHKHzVt5C/ruG+3tARvnobvmB4ngQAQmtu0aXe9c+Kr2XlaXUAusHPK+Me pVqtmaKBgddUtJwFms8JF6dePiSBRQgpwyvluqNi1D7ntRhS+Ik= =EXrS -----END PGP SIGNATURE----- Merge tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs Pull NFS client fixes from Anna Schumaker: "Here are a handful more bugfixes for 5.10. Unfortunately, we found some problems with the new READ_PLUS operation that aren't easy to fix. We've decided to disable this codepath through a Kconfig option for now, but a series of patches going into 5.11 will clean up the code and fix the issues at the same time. This seemed like the best way to go about it. Summary: - Fix array overflow when flexfiles mirroring is enabled - Fix rpcrdma_inline_fixup() crash with new LISTXATTRS - Fix 5 second delay when doing inter-server copy - Disable READ_PLUS by default" * tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs: NFS: Disable READ_PLUS by default NFSv4.2: Fix 5 seconds delay when doing inter server copy NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled	2020-12-10 15:36:09 -08:00
Al Viro	1a97d899ec	Make sure that make_create_in_sticky() never sees uninitialized value of dir_mode make sure nd->dir_mode is always initialized after success exit from link_path_walk(); in case of empty path it did not happen. Reported-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Tested-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-12-10 17:33:17 -05:00
Hao Li	77573fa310	fs: Kill DCACHE_DONTCACHE dentry even if DCACHE_REFERENCED is set If DCACHE_REFERENCED is set, fast_dput() will return true, and then retain_dentry() have no chance to check DCACHE_DONTCACHE. As a result, the dentry won't be killed and the corresponding inode can't be evicted. In the following example, the DAX policy can't take effects unless we do a drop_caches manually. # DCACHE_LRU_LIST will be set echo abcdefg > test.txt # DCACHE_REFERENCED will be set and DCACHE_DONTCACHE can't do anything xfs_io -c 'chattr +x' test.txt # Drop caches to make DAX changing take effects echo 2 > /proc/sys/vm/drop_caches What this patch does is preventing fast_dput() from returning true if DCACHE_DONTCACHE is set. Then retain_dentry() will detect the DCACHE_DONTCACHE and will return false. As a result, the dentry will be killed and the inode will be evicted. In this way, if we change per-file DAX policy, it will take effects automatically after this file is closed by all processes. I also add some comments to make the code more clear. Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-12-10 17:33:17 -05:00
Hao Li	88149082bb	fs: Handle I_DONTCACHE in iput_final() instead of generic_drop_inode() If generic_drop_inode() returns true, it means iput_final() can evict this inode regardless of whether it is dirty or not. If we check I_DONTCACHE in generic_drop_inode(), any inode with this bit set will be evicted unconditionally. This is not the desired behavior because I_DONTCACHE only means the inode shouldn't be cached on the LRU list. As for whether we need to evict this inode, this is what generic_drop_inode() should do. This patch corrects the usage of I_DONTCACHE. This patch was proposed in [1]. [1]: https://lore.kernel.org/linux-fsdevel/20200831003407.GE12096@dread.disaster.area/ Fixes: `dae2f8ed79` ("fs: Lift XFS_IDONTCACHE to the VFS layer") Signed-off-by: Hao Li <lihao2018.fnst@cn.fujitsu.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-12-10 17:33:17 -05:00
Eric Biggers	edf7ddbf1c	fs/namespace.c: WARN if mnt_count has become negative Missing calls to mntget() (or equivalently, too many calls to mntput()) are hard to detect because mntput() delays freeing mounts using task_work_add(), then again using call_rcu(). As a result, mnt_count can often be decremented to -1 without getting a KASAN use-after-free report. Such cases are still bugs though, and they point to real use-after-frees being possible. For an example of this, see the bug fixed by commit `1b0b9cc8d3` ("vfs: fsmount: add missing mntget()"), discussed at https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u. This bug should have been trivial to find. But actually, it wasn't found until syzkaller happened to use fchdir() to manipulate the reference count just right for the bug to be noticeable. Address this by making mntput_no_expire() issue a WARN if mnt_count has become negative. Suggested-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-12-10 17:33:17 -05:00
Anna Schumaker	21e31401fc	NFS: Disable READ_PLUS by default We've been seeing failures with xfstests generic/091 and generic/263 when using READ_PLUS. I've made some progress on these issues, and the tests fail later on but still don't pass. Let's disable READ_PLUS by default until we can work out what is going on. Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-12-10 16:48:03 -05:00
Dai Ngo	fe8eb820e3	NFSv4.2: Fix 5 seconds delay when doing inter server copy Since commit `b4868b44c5` ("NFSv4: Wait for stateid updates after CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5 seconds delay regardless of the size of the copy. The delay is from nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential fails because the seqid in both nfs4_state and nfs4_stateid are 0. Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state, until after the call to update_open_stateid, to indicate this is the 1st open. This fix is part of a 2 patches, the other patch is the fix in the source server to return the stateid for COPY_NOTIFY request with seqid 1 instead of 0. Fixes: `ce0887ac96` ("NFSD add nfs4 inter ssc to nfsd4_copy") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-12-10 16:48:03 -05:00
Chuck Lever	1c87b85162	NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation By switching to an XFS-backed export, I am able to reproduce the ibcomp worker crash on my client with xfstests generic/013. For the failing LISTXATTRS operation, xdr_inline_pages() is called with page_len=12 and buflen=128. - When ->send_request() is called, rpcrdma_marshal_req() does not set up a Reply chunk because buflen is smaller than the inline threshold. Thus rpcrdma_convert_iovs() does not get invoked at all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked on the receive buffer. - During reply processing, rpcrdma_inline_fixup() tries to copy received data into rq_rcv_buf->pages because page_len is positive. But there are no receive pages because rpcrdma_marshal_req() never allocated them. The result is that the ibcomp worker faults and dies. Sometimes that causes a visible crash, and sometimes it results in a transport hang without other symptoms. RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and should eventually be fixed or replaced. However, my preference is that upper-layer operations should explicitly allocate their receive buffers (using GFP_KERNEL) when possible, rather than relying on XDRBUF_SPARSE_PAGES. Reported-by: Olga kornievskaia <kolga@netapp.com> Suggested-by: Olga kornievskaia <kolga@netapp.com> Fixes: `c10a75145f` ("NFSv4.2: add the extended attribute proc functions.") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Olga kornievskaia <kolga@netapp.com> Reviewed-by: Frank van der Linden <fllinden@amazon.com> Tested-by: Olga kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>	2020-12-10 16:48:03 -05:00
Eric W. Biederman	f7cfd871ae	exec: Transform exec_update_mutex into a rw_semaphore Recently syzbot reported[0] that there is a deadlock amongst the users of exec_update_mutex. The problematic lock ordering found by lockdep was: perf_event_open (exec_update_mutex -> ovl_i_mutex) chown (ovl_i_mutex -> sb_writes) sendfile (sb_writes -> p->lock) by reading from a proc file and writing to overlayfs proc_pid_syscall (p->lock -> exec_update_mutex) While looking at possible solutions it occured to me that all of the users and possible users involved only wanted to state of the given process to remain the same. They are all readers. The only writer is exec. There is no reason for readers to block on each other. So fix this deadlock by transforming exec_update_mutex into a rw_semaphore named exec_update_lock that only exec takes for writing. Cc: Jann Horn <jannh@google.com> Cc: Vasiliy Kulikov <segoon@openwall.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Christopher Yeoh <cyeoh@au1.ibm.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Fixes: `eea9673250` ("exec: Add exec_update_mutex to replace cred_guard_mutex") [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 13:13:32 -06:00
Eric W. Biederman	9ee1206dcf	exec: Move io_uring_task_cancel after the point of no return Now that unshare_files happens in begin_new_exec after the point of no return, io_uring_task_cancel can also happen later. Effectively this means io_uring activities for a task are only canceled when exec succeeds. Link: https://lkml.kernel.org/r/878saih2op.fsf@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:57:13 -06:00
Eric W. Biederman	c39ab6de22	coredump: Document coredump code exclusively used by cell spufs Oleg Nesterov recently asked[1] why is there an unshare_files in do_coredump. After digging through all of the callers of lookup_fd it turns out that it is arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context that needs the unshare_files in do_coredump. Looking at the history[2] this code was also the only piece of coredump code that required the unshare_files when the unshare_files was added. Looking at that code it turns out that cell is also the only architecture that implements elf_coredump_extra_notes_size and elf_coredump_extra_notes_write. I looked at the gdb repo[3] support for cell has been removed[4] in binutils 2.34. Geoff Levand reports he is still getting questions on how to run modern kernels on the PS3, from people using 3rd party firmware so this code is not dead. According to Wikipedia the last PS3 shipped in Japan sometime in 2017. So it will probably be a little while before everyone's hardware dies. Add some comments briefly documenting the coredump code that exists only to support cell spufs to make it easier to understand the coredump code. Eventually the hardware will be dead, or their won't be userspace tools, or the coredump code will be refactored and it will be too difficult to update a dead architecture and these comments make it easy to tell where to pull to remove cell spufs support. [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com [2] `179e037fc1` ("do_coredump(): make sure that descriptor table isn't shared") [3] git://sourceware.org/git/binutils-gdb.git [4] abf516c6931a ("Remove Cell Broadband Engine debugging support"). Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:57:08 -06:00
Eric W. Biederman	fa67bf885e	file: Remove get_files_struct When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Now that get_files_struct has no more users and can not cause the problems for posix file locking and fget_light remove get_files_struct so that it does not gain any new users. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-13-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-24-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:59 -06:00
Eric W. Biederman	9fe83c43e7	file: Rename __close_fd_get_file close_fd_get_file The function close_fd_get_file is explicitly a variant of __close_fd[1]. Now that __close_fd has been renamed close_fd, rename close_fd_get_file to be consistent with close_fd. When __alloc_fd, __close_fd and __fd_install were introduced the double underscore indicated that the function took a struct files_struct parameter. The function __close_fd_get_file never has so the naming has always been inconsistent. This just cleans things up so there are not any lingering mentions or references __close_fd left in the code. [1] `80cd795630` ("binder: fix use-after-free due to ksys_close() during fdget()") Link: https://lkml.kernel.org/r/20201120231441.29911-23-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:59 -06:00
Eric W. Biederman	1572bfdf21	file: Replace ksys_close with close_fd Now that ksys_close is exactly identical to close_fd replace the one caller of ksys_close with close_fd. [1] https://lkml.kernel.org/r/20200818112020.GA17080@infradead.org Suggested-by: Christoph Hellwig <hch@infradead.org> Link: https://lkml.kernel.org/r/20201120231441.29911-22-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:59 -06:00
Eric W. Biederman	8760c909f5	file: Rename __close_fd to close_fd and remove the files parameter The function __close_fd was added to support binder[1]. Now that binder has been fixed to no longer need __close_fd[2] all calls to __close_fd pass current->files. Therefore transform the files parameter into a local variable initialized to current->files, and rename __close_fd to close_fd to reflect this change, and keep it in sync with the similar changes to __alloc_fd, and __fd_install. This removes the need for callers to care about the extra care that needs to be take if anything except current->files is passed, by limiting the callers to only operation on current->files. [1] `483ce1d4b8` ("take descriptor-related part of close() to file.c") [2] `44d8047f1d` ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-17-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-21-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:59 -06:00
Eric W. Biederman	aa384d10f3	file: Merge __alloc_fd into alloc_fd The function __alloc_fd was added to support binder[1]. With binder fixed[2] there are no more users. As alloc_fd just calls __alloc_fd with "files=current->files", merge them together by transforming the files parameter into a local variable initialized to current->files. [1] `dcfadfa4ec` ("new helper: __alloc_fd()") [2] `44d8047f1d` ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-16-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-20-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:59 -06:00
Eric W. Biederman	e06b53c22f	file: In f_dupfd read RLIMIT_NOFILE once. Simplify the code, and remove the chance of races by reading RLIMIT_NOFILE only once in f_dupfd. Pass the read value of RLIMIT_NOFILE into alloc_fd which is the other location the rlimit was read in f_dupfd. As f_dupfd is the only caller of alloc_fd this changing alloc_fd is trivially safe. Further this causes alloc_fd to take all of the same arguments as __alloc_fd except for the files_struct argument. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-15-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-19-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:59 -06:00
Eric W. Biederman	d74ba04d91	file: Merge __fd_install into fd_install The function __fd_install was added to support binder[1]. With binder fixed[2] there are no more users. As fd_install just calls __fd_install with "files=current->files", merge them together by transforming the files parameter into a local variable initialized to current->files. [1] `f869e8a7f7` ("expose a low-level variant of fd_install() for binder") [2] `44d8047f1d` ("binder: use standard functions to allocate fds") Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1:https://lkml.kernel.org/r/20200817220425.9389-14-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-18-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:58 -06:00
Eric W. Biederman	775e0656b2	proc/fd: In fdinfo seq_show don't use get_files_struct When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Instead hold task_lock for the duration that task->files needs to be stable in seq_show. The task_lock was already taken in get_files_struct, and so skipping get_files_struct performs less work overall, and avoids the problems with the files_struct reference count. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-12-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-17-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:58 -06:00
Eric W. Biederman	5b17b61870	proc/fd: In proc_readfd_common use task_lookup_next_fd_rcu When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Using task_lookup_next_fd_rcu simplifies proc_readfd_common, by moving the checking for the maximum file descritor into the generic code, and by remvoing the need for capturing and releasing a reference on files_struct. As task_lookup_fd_rcu may update the fd ctx->pos has been changed to be the fd +2 after task_lookup_fd_rcu returns. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Tested-by: Andy Lavr <andy.lavr@gmail.com> v1: https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-15-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:58 -06:00
Eric W. Biederman	e9a53aeb5e	file: Implement task_lookup_next_fd_rcu As a companion to fget_task and task_lookup_fd_rcu implement task_lookup_next_fd_rcu that will return the struct file for the first file descriptor number that is equal or greater than the fd argument value, or NULL if there is no such struct file. This allows file descriptors of foreign processes to be iterated through safely, without needed to increment the count on files_struct. Some concern[1] has been expressed that this function takes the task_lock for each iteration and thus for each file descriptor. This place where this function will be called in a commonly used code path is for listing /proc/<pid>/fd. I did some small benchmarks and did not see any measurable performance differences. For ordinary users ls is likely to stat each of the directory entries and tid_fd_mode called from tid_fd_revalidae has always taken the task lock for each file descriptor. So this does not look like it will be a big change in practice. At some point is will probably be worth changing put_files_struct to free files_struct after an rcu grace period so that task_lock won't be needed at all. [1] https://lkml.kernel.org/r/20200817220425.9389-10-ebiederm@xmission.com v1: https://lkml.kernel.org/r/20200817220425.9389-9-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-14-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:42:58 -06:00
Eric W. Biederman	64eb661fda	proc/fd: In tid_fd_mode use task_lookup_fd_rcu When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Instead of manually coding finding the files struct for a task and then calling files_lookup_fd_rcu, use the helper task_lookup_fd_rcu that combines those to steps. Making the code simpler and removing the need to get a reference on a files_struct. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-7-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-12-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:40:14 -06:00
Eric W. Biederman	3a879fb380	file: Implement task_lookup_fd_rcu As a companion to lookup_fd_rcu implement task_lookup_fd_rcu for querying an arbitrary process about a specific file. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200818103713.aw46m7vprsy4vlve@wittgenstein Link: https://lkml.kernel.org/r/20201120231441.29911-11-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:40:10 -06:00
Eric W. Biederman	460b4f812a	file: Rename fcheck lookup_fd_rcu Also remove the confusing comment about checking if a fd exists. I could not find one instance in the entire kernel that still matches the description or the reason for the name fcheck. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-10-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:40:07 -06:00
Eric W. Biederman	f36c294327	file: Replace fcheck_files with files_lookup_fd_rcu This change renames fcheck_files to files_lookup_fd_rcu. All of the remaining callers take the rcu_read_lock before calling this function so the _rcu suffix is appropriate. This change also tightens up the debug check to verify that all callers hold the rcu_read_lock. All callers that used to call files_check with the files->file_lock held have now been changed to call files_lookup_fd_locked. This change of name has helped remind me of which locks and which guarantees are in place helping me to catch bugs later in the patchset. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-9-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:40:03 -06:00
Eric W. Biederman	120ce2b0cd	file: Factor files_lookup_fd_locked out of fcheck_files To make it easy to tell where files->file_lock protection is being used when looking up a file create files_lookup_fd_locked. Only allow this function to be called with the file_lock held. Update the callers of fcheck and fcheck_files that are called with the files->file_lock held to call files_lookup_fd_locked instead. Hopefully this makes it easier to quickly understand what is going on. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-8-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:39:59 -06:00
Eric W. Biederman	bebf684bf3	file: Rename __fcheck_files to files_lookup_fd_raw The function fcheck despite it's comment is poorly named as it has no callers that only check it's return value. All of fcheck's callers use the returned file descriptor. The same is true for fcheck_files and __fcheck_files. A new less confusing name is needed. In addition the names of these functions are confusing as they do not report the kind of locks that are needed to be held when these functions are called making error prone to use them. To remedy this I am making the base functio name lookup_fd and will and prefixes and sufficies to indicate the rest of the context. Name the function (previously called __fcheck_files) that proceeds from a struct files_struct, looks up the struct file of a file descriptor, and requires it's callers to verify all of the appropriate locks are held files_lookup_fd_raw. The need for better names became apparent in the last round of discussion of this set of changes[1]. [1] https://lkml.kernel.org/r/CAHk-=wj8BQbgJFLa+J0e=iT-1qpmCRTbPAJ8gd6MJQ=kbRPqyQ@mail.gmail.com Link: https://lkml.kernel.org/r/20201120231441.29911-7-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:39:54 -06:00
Eric W. Biederman	439be32656	proc/fd: In proc_fd_link use fget_task When discussing[1] exec and posix file locks it was realized that none of the callers of get_files_struct fundamentally needed to call get_files_struct, and that by switching them to helper functions instead it will both simplify their code and remove unnecessary increments of files_struct.count. Those unnecessary increments can result in exec unnecessarily unsharing files_struct which breaking posix locks, and it can result in fget_light having to fallback to fget reducing system performance. Simplifying proc_fd_link is a little bit tricky. It is necessary to know that there is a reference to fd_f ile while path_get is running. This reference can either be guaranteed to exist either by locking the fdtable as the code currently does or by taking a reference on the file in question. Use fget_task to remove the need for get_files_struct and to take a reference to file in question. [1] https://lkml.kernel.org/r/20180915160423.GA31461@redhat.com Suggested-by: Oleg Nesterov <oleg@redhat.com> v1: https://lkml.kernel.org/r/20200817220425.9389-8-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-6-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:39:48 -06:00
Eric W. Biederman	950db38ff2	exec: Remove reset_files_struct Now that exec no longer needs to restore the previous value of current->files on error there are no more callers of reset_files_struct so remove it. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-3-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-3-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:39:36 -06:00
Eric W. Biederman	1f702603e7	exec: Simplify unshare_files Now that exec no longer needs to return the unshared files to their previous value there is no reason to return displaced. Instead when unshare_fd creates a copy of the file table, call put_files_struct before returning from unshare_files. Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-2-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-2-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:39:32 -06:00
Eric W. Biederman	b604350128	exec: Move unshare_files to fix posix file locking during exec Many moons ago the binfmts were doing some very questionable things with file descriptors and an unsharing of the file descriptor table was added to make things better[1][2]. The helper steal_lockss was added to avoid breaking the userspace programs[3][4][6]. Unfortunately it turned out that steal_locks did not work for network file systems[5], so it was removed to see if anyone would complain[7][8]. It was thought at the time that NPTL would not be affected as the unshare_files happened after the other threads were killed[8]. Unfortunately because there was an unshare_files in binfmt_elf.c before the threads were killed this analysis was incorrect. This unshare_files in binfmt_elf.c resulted in the unshares_files happening whenever threads were present. Which led to unshare_files being moved to the start of do_execve[9]. Later the problems were rediscovered and the suggested approach was to readd steal_locks under a different name[10]. I happened to be reviewing patches and I noticed that this approach was a step backwards[11]. I proposed simply moving unshare_files[12] and it was pointed out that moving unshare_files without auditing the code was also unsafe[13]. There were then several attempts to solve this[14][15][16] and I even posted this set of changes[17]. Unfortunately because auditing all of execve is time consuming this change did not make it in at the time. Well now that I am cleaning up exec I have made the time to read through all of the binfmts and the only playing with file descriptors is either the security modules closing them in security_bprm_committing_creds or is in the generic code in fs/exec.c. None of it happens before begin_new_exec is called. So move unshare_files into begin_new_exec, after the point of no return. If memory is very very very low and the application calling exec is sharing file descriptor tables between processes we might fail past the point of no return. Which is unfortunate but no different than any of the other places where we allocate memory after the point of no return. This movement allows another process that shares the file table, or another thread of the same process and that closes files or changes their close on exec behavior and races with execve to cause some unexpected things to happen. There is only one time of check to time of use race and it is just there so that execve fails instead of an interpreter failing when it tries to open the file it is supposed to be interpreting. Failing later if userspace is being silly is not a problem. With this change it the following discription from the removal of steal_locks[8] finally becomes true. Apps using NPTL are not affected, since all other threads are killed before execve. Apps using LinuxThreads are only affected if they - have multiple threads during exec (LinuxThreads doesn't kill other threads, the app may do it with pthread_kill_other_threads_np()) - rely on POSIX locks being inherited across exec Both conditions are documented, but not their interaction. Apps using clone() natively are affected if they - use clone(CLONE_FILES) - rely on POSIX locks being inherited across exec I have investigated some paths to make it possible to solve this without moving unshare_files but they all look more complicated[18]. Reported-by: Daniel P. Berrangé <berrange@redhat.com> Reported-by: Jeff Layton <jlayton@redhat.com> History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git [1] 02cda956de0b ("[PATCH] unshare_files" [2] 04e9bcb4d106 ("[PATCH] use new unshare_files helper") [3] 088f5d7244de ("[PATCH] add steal_locks helper") [4] 02c541ec8ffa ("[PATCH] use new steal_locks helper") [5] https://lkml.kernel.org/r/E1FLIlF-0007zR-00@dorka.pomaz.szeredi.hu [6] https://lkml.kernel.org/r/0060321191605.GB15997@sorel.sous-sol.org [7] https://lkml.kernel.org/r/E1FLwjC-0000kJ-00@dorka.pomaz.szeredi.hu [8] `c89681ed7d` ("[PATCH] remove steal_locks()") [9] `fd8328be87` ("[PATCH] sanitize handling of shared descriptor tables in failing execve()") [10] https://lkml.kernel.org/r/20180317142520.30520-1-jlayton@kernel.org [11] https://lkml.kernel.org/r/87r2nwqk73.fsf@xmission.com [12] https://lkml.kernel.org/r/87bmfgvg8w.fsf@xmission.com [13] https://lkml.kernel.org/r/20180322111424.GE30522@ZenIV.linux.org.uk [14] https://lkml.kernel.org/r/20180827174722.3723-1-jlayton@kernel.org [15] https://lkml.kernel.org/r/20180830172423.21964-1-jlayton@kernel.org [16] https://lkml.kernel.org/r/20180914105310.6454-1-jlayton@kernel.org [17] https://lkml.kernel.org/r/87a7ohs5ow.fsf@xmission.com [18] https://lkml.kernel.org/r/87pn8c1uj6.fsf_-_@x220.int.ebiederm.org Acked-by: Christian Brauner <christian.brauner@ubuntu.com> v1: https://lkml.kernel.org/r/20200817220425.9389-1-ebiederm@xmission.com Link: https://lkml.kernel.org/r/20201120231441.29911-1-ebiederm@xmission.com Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:39:24 -06:00
Eric W. Biederman	878f12dbb8	exec: Don't open code get_close_on_exec Al Viro pointed out that using the phrase "close_on_exec(fd, rcu_dereference_raw(current->files->fdt))" instead of wrapping it in rcu_read_lock(), rcu_read_unlock() is a very questionable optimization[1]. Once wrapped with rcu_read_lock()/rcu_read_unlock() that phrase becomes equivalent the helper function get_close_on_exec so simplify the code and make it more robust by simply using get_close_on_exec. [1] https://lkml.kernel.org/r/20201207222214.GA4115853@ZenIV.linux.org.uk Suggested-by: Al Viro <viro@ftp.linux.org.uk> Link: https://lkml.kernel.org/r/87k0tqr6zi.fsf_-_@x220.int.ebiederm.org Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>	2020-12-10 12:39:00 -06:00
Chao Yu	75e91c8889	f2fs: compress: fix compression chksum This patch addresses minor issues in compression chksum. Fixes: `b28f047b28` ("f2fs: compress: support chksum") Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-12-10 09:13:53 -08:00
Chao Yu	e584bbe821	f2fs: fix shift-out-of-bounds in sanity_check_raw_super() syzbot reported a bug which could cause shift-out-of-bounds issue, fix it. Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 sanity_check_raw_super fs/f2fs/super.c:2812 [inline] read_raw_super_block fs/f2fs/super.c:3267 [inline] f2fs_fill_super.cold+0x16c9/0x16f6 fs/f2fs/super.c:3519 mount_bdev+0x34d/0x410 fs/super.c:1366 legacy_get_tree+0x105/0x220 fs/fs_context.c:592 vfs_get_tree+0x89/0x2f0 fs/super.c:1496 do_new_mount fs/namespace.c:2896 [inline] path_mount+0x12ae/0x1e70 fs/namespace.c:3227 do_mount fs/namespace.c:3240 [inline] __do_sys_mount fs/namespace.c:3448 [inline] __se_sys_mount fs/namespace.c:3425 [inline] __x64_sys_mount+0x27f/0x300 fs/namespace.c:3425 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+ca9a785f8ac472085994@syzkaller.appspotmail.com Signed-off-by: Anant Thazhemadam <anant.thazhemadam@gmail.com> Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>	2020-12-10 09:13:19 -08:00
Miklos Szeredi	5d069dbe8a	fuse: fix bad inode Jan Kara's analysis of the syzbot report (edited): The reproducer opens a directory on FUSE filesystem, it then attaches dnotify mark to the open directory. After that a fuse_do_getattr() call finds that attributes returned by the server are inconsistent, and calls make_bad_inode() which, among other things does: inode->i_mode = S_IFREG; This then confuses dnotify which doesn't tear down its structures properly and eventually crashes. Avoid calling make_bad_inode() on a live inode: switch to a private flag on the fuse inode. Also add the test to ops which the bad_inode_ops would have caught. This bug goes back to the initial merge of fuse in 2.6.14... Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org>	2020-12-10 15:33:14 +01:00
Trond Myklebust	fa94a951bf	NFSv4.2: Fix up the get/listxattr calls to rpc_prepare_reply_pages() Ensure that both getxattr and listxattr page array are correctly aligned, and that getxattr correctly accounts for the page padding word. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>	2020-12-10 09:01:53 -05:00
Damien Le Moal	6bea0225a4	zonefs: fix page reference and BIO leak In zonefs_file_dio_append(), the pages obtained using bio_iov_iter_get_pages() are not released on completion of the REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails. Furthermore, a call to bio_put() is missing when bio_iov_iter_get_pages() fails. Fix these resource leaks by adding BIO resource release code (bio_put()i and bio_release_pages()) at the end of the function after the BIO execution and add a jump to this resource cleanup code in case of bio_iov_iter_get_pages() failure. While at it, also fix the call to task_io_account_write() to be passed the correct BIO size instead of bio_iov_iter_get_pages() return value. Reported-by: Christoph Hellwig <hch@lst.de> Fixes: `02ef12a663` ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2020-12-10 15:14:19 +09:00
Huang Jianan	d8b3df8b10	erofs: avoid using generic_block_bmap Surprisingly, `block' in sector_t indicates the number of i_blkbits-sized blocks rather than sectors for bmap. In addition, considering buffer_head limits mapped size to 32-bits, should avoid using generic_block_bmap. Link: https://lore.kernel.org/r/20201209115740.18802-1-huangjianan@oppo.com Fixes: `9da681e017` ("staging: erofs: support bmap") Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Huang Jianan <huangjianan@oppo.com> Signed-off-by: Guo Weichao <guoweichao@oppo.com> [ Gao Xiang: slightly update the commit message description. ] Signed-off-by: Gao Xiang <hsiangkao@redhat.com>	2020-12-10 11:07:40 +08:00
Chunguang Xu	41fca96e63	ext4: delete nonsensical (commented-out) code inside ext4_xattr_block_set() Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604764698-4269-7-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-12-09 14:22:56 -05:00
Chunguang Xu	8041ac642a	ext4: update ext4_data_block_valid related comments Since ext4_data_block_valid() has been renamed to ext4_inode_block_valid(), the related comments need to be updated. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604764698-4269-5-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>	2020-12-09 14:11:27 -05:00
Pavel Begunkov	59850d226e	io_uring: fix io_cqring_events()'s noflush Checking !list_empty(&ctx->cq_overflow_list) around noflush in io_cqring_events() is racy, because if it fails but a request overflowed just after that, io_cqring_overflow_flush() still will be called. Remove the second check, it shouldn't be a problem for performance, because there is cq_check_overflow bit check just above. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:02 -07:00
Pavel Begunkov	634578f800	io_uring: fix racy IOPOLL flush overflow It's not safe to call io_cqring_overflow_flush() for IOPOLL mode without hodling uring_lock, because it does synchronisation differently. Make sure we have it. As for io_ring_exit_work(), we don't even need it there because io_ring_ctx_wait_and_kill() already set force flag making all overflowed requests to be dropped. Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	31bff9a51b	io_uring: fix racy IOPOLL completions IOPOLL allows buffer remove/provide requests, but they doesn't synchronise by rules of IOPOLL, namely it have to hold uring_lock. Cc: <stable@vger.kernel.org> # 5.7+ Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Xiaoguang Wang	dad1b1242f	io_uring: always let io_iopoll_complete() complete polled io Abaci Fuzz reported a double-free or invalid-free BUG in io_commit_cqring(): [ 95.504842] BUG: KASAN: double-free or invalid-free in io_commit_cqring+0x3ec/0x8e0 [ 95.505921] [ 95.506225] CPU: 0 PID: 4037 Comm: io_wqe_worker-0 Tainted: G B W 5.10.0-rc5+ #1 [ 95.507434] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [ 95.508248] Call Trace: [ 95.508683] dump_stack+0x107/0x163 [ 95.509323] ? io_commit_cqring+0x3ec/0x8e0 [ 95.509982] print_address_description.constprop.0+0x3e/0x60 [ 95.510814] ? vprintk_func+0x98/0x140 [ 95.511399] ? io_commit_cqring+0x3ec/0x8e0 [ 95.512036] ? io_commit_cqring+0x3ec/0x8e0 [ 95.512733] kasan_report_invalid_free+0x51/0x80 [ 95.513431] ? io_commit_cqring+0x3ec/0x8e0 [ 95.514047] __kasan_slab_free+0x141/0x160 [ 95.514699] kfree+0xd1/0x390 [ 95.515182] io_commit_cqring+0x3ec/0x8e0 [ 95.515799] __io_req_complete.part.0+0x64/0x90 [ 95.516483] io_wq_submit_work+0x1fa/0x260 [ 95.517117] io_worker_handle_work+0xeac/0x1c00 [ 95.517828] io_wqe_worker+0xc94/0x11a0 [ 95.518438] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.519151] ? __kthread_parkme+0x11d/0x1d0 [ 95.519806] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.520512] ? io_worker_handle_work+0x1c00/0x1c00 [ 95.521211] kthread+0x396/0x470 [ 95.521727] ? _raw_spin_unlock_irq+0x24/0x30 [ 95.522380] ? kthread_mod_delayed_work+0x180/0x180 [ 95.523108] ret_from_fork+0x22/0x30 [ 95.523684] [ 95.523985] Allocated by task 4035: [ 95.524543] kasan_save_stack+0x1b/0x40 [ 95.525136] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 95.525882] kmem_cache_alloc_trace+0x17b/0x310 [ 95.533930] io_queue_sqe+0x225/0xcb0 [ 95.534505] io_submit_sqes+0x1768/0x25f0 [ 95.535164] __x64_sys_io_uring_enter+0x89e/0xd10 [ 95.535900] do_syscall_64+0x33/0x40 [ 95.536465] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 95.537199] [ 95.537505] Freed by task 4035: [ 95.538003] kasan_save_stack+0x1b/0x40 [ 95.538599] kasan_set_track+0x1c/0x30 [ 95.539177] kasan_set_free_info+0x1b/0x30 [ 95.539798] __kasan_slab_free+0x112/0x160 [ 95.540427] kfree+0xd1/0x390 [ 95.540910] io_commit_cqring+0x3ec/0x8e0 [ 95.541516] io_iopoll_complete+0x914/0x1390 [ 95.542150] io_do_iopoll+0x580/0x700 [ 95.542724] io_iopoll_try_reap_events.part.0+0x108/0x200 [ 95.543512] io_ring_ctx_wait_and_kill+0x118/0x340 [ 95.544206] io_uring_release+0x43/0x50 [ 95.544791] __fput+0x28d/0x940 [ 95.545291] task_work_run+0xea/0x1b0 [ 95.545873] do_exit+0xb6a/0x2c60 [ 95.546400] do_group_exit+0x12a/0x320 [ 95.546967] __x64_sys_exit_group+0x3f/0x50 [ 95.547605] do_syscall_64+0x33/0x40 [ 95.548155] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The reason is that once we got a non EAGAIN error in io_wq_submit_work(), we'll complete req by calling io_req_complete(), which will hold completion_lock to call io_commit_cqring(), but for polled io, io_iopoll_complete() won't hold completion_lock to call io_commit_cqring(), then there maybe concurrent access to ctx->defer_list, double free may happen. To fix this bug, we always let io_iopoll_complete() complete polled io. Cc: <stable@vger.kernel.org> # 5.5+ Reported-by: Abaci Fuzz <abaci@linux.alibaba.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	9c8e11b36c	io_uring: add timeout update Support timeout updates through IORING_OP_TIMEOUT_REMOVE with passed in IORING_TIMEOUT_UPDATE. Updates doesn't support offset timeout mode. Oirignal timeout.off will be ignored as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: remove now unused 'ret' variable] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	fbd15848f3	io_uring: restructure io_timeout_cancel() Add io_timeout_extract() helper, which searches and disarms timeouts, but doesn't complete them. No functional changes. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	bee749b187	io_uring: fix files cancellation io_uring_cancel_files()'s task check condition mistakenly got flipped. 1. There can't be a request in the inflight list without IO_WQ_WORK_FILES, kill this check to keep the whole condition simpler. 2. Also, don't call the function for files==NULL to not do such a check, all that staff is already handled well by its counter part, __io_uring_cancel_task_requests(). With that just flip the task check. Also, it iowq-cancels all request of current task there, don't forget to set right ->files into struct io_task_cancel. Fixes: c1973b38bf639 ("io_uring: cancel only requests of current task") Reported-by: syzbot+c0d52d0b3c0c3ffb9525@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Jens Axboe	ac0648a56c	io_uring: use bottom half safe lock for fixed file data io_file_data_ref_zero() can be invoked from soft-irq from the RCU core, hence we need to ensure that the file_data lock is bottom half safe. Use the _bh() variants when grabbing this lock. Reported-by: syzbot+1f4ba1e5520762c523c6@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	bd5bbda72f	io_uring: fix miscounting ios_left io_req_init() doesn't decrement state->ios_left if a request doesn't need ->file, it just returns before that on if(!needs_file). That's not really a problem but may cause overhead for an additional fput(). Also inline and kill io_req_set_file() as it's of no use anymore. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	6e1271e60c	io_uring: change submit file state invariant Keep submit state invariant of whether there are file refs left based on state->nr_refs instead of (state->file==NULL), and always check against the first one. It's easier to track and allows to remove 1 if. It also automatically leaves struct submit_state in a consistent state after io_submit_state_end(), that's not used yet but nice. btw rename has_refs to file_refs for more clarity. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Xiaoguang Wang	65b2b21348	io_uring: check kthread stopped flag when sq thread is unparked syzbot reports following issue: INFO: task syz-executor.2:12399 can't die for more than 143 seconds. task:syz-executor.2 state:D stack:28744 pid:12399 ppid: 8504 flags:0x00004004 Call Trace: context_switch kernel/sched/core.c:3773 [inline] __schedule+0x893/0x2170 kernel/sched/core.c:4522 schedule+0xcf/0x270 kernel/sched/core.c:4600 schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847 do_wait_for_common kernel/sched/completion.c:85 [inline] __wait_for_common kernel/sched/completion.c:106 [inline] wait_for_common kernel/sched/completion.c:117 [inline] wait_for_completion+0x163/0x260 kernel/sched/completion.c:138 kthread_stop+0x17a/0x720 kernel/kthread.c:596 io_put_sq_data fs/io_uring.c:7193 [inline] io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290 io_finish_async fs/io_uring.c:7297 [inline] io_sq_offload_create fs/io_uring.c:8015 [inline] io_uring_create fs/io_uring.c:9433 [inline] io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x45deb9 Code: Unable to access opcode bytes at RIP 0x45de8f. RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9 RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9 RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5 RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c INFO: task syz-executor.2:12399 blocked for more than 143 seconds. Not tainted 5.10.0-rc3-next-20201110-syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Currently we don't have a reproducer yet, but seems that there is a race in current codes: => io_put_sq_data ctx_list is empty now. \| ==> kthread_park(sqd->thread); \| \| T1: sq thread is parked now. ==> kthread_stop(sqd->thread); \| KTHREAD_SHOULD_STOP is set now.\| ===> kthread_unpark(k); \| \| T2: sq thread is now unparkd, run again. \| \| T3: sq thread is now preempted out. \| ===> wake_up_process(k); \| \| \| T4: Since sqd ctx_list is empty, needs_sched will be true, \| then sq thread sets task state to TASK_INTERRUPTIBLE, \| and schedule, now sq thread will never be waken up. ===> wait_for_completion \| I have artificially used mdelay() to simulate above race, will get same stack like this syzbot report, but to be honest, I'm not sure this code race triggers syzbot report. To fix this possible code race, when sq thread is unparked, need to check whether sq thread has been stopped. Reported-by: syzbot+03beeb595f074db9cfd1@syzkaller.appspotmail.com Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	36f72fe279	io_uring: share fixed_file_refs b/w multiple rsrcs Double fixed files for splice/tee are done in a nasty way, it takes 2 ref_node refs, and during the second time it blindly overrides req->fixed_file_refs hoping that it haven't changed. That works because all that is done under iouring_lock in a single go but is error-prone. Bind everything explicitly to a single ref_node and take only one ref, with current ref_node ordering it's guaranteed to keep all files valid awhile the request is inflight. That's mainly a cleanup + preparation for generic resource handling, but also saves pcpu_ref get/put for splice/tee with 2 fixed files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	c98de08c99	io_uring: replace inflight_wait with tctx->wait As tasks now cancel only theirs requests, and inflight_wait is awaited only in io_uring_cancel_files(), which should be called with ->in_idle set, instead of keeping a separate inflight_wait use tctx->wait. That will add some spurious wakeups but actually is safer from point of not hanging the task. e.g. task1 \| IRQ \| start io_complete_rw_common(link) \| link: req1 -> req2 -> req3(with files) *cancel_files() \| io_wq_cancel(), etc. \| \| put_req(link), adds to io-wq req2 schedule() \| So, task1 will never try to cancel req2 or req3. If req2 is long-standing (e.g. read(empty_pipe)), this may hang. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Pavel Begunkov	10cad2c40d	io_uring: don't take fs for recvmsg/sendmsg We don't even allow not plain data msg_control, which is disallowed in __sys_{send,revb}msg_sock(). So no need in fs for IORING_OP_SENDMSG and IORING_OP_RECVMSG. fs->lock is less contanged not as much as before, but there are cases that can be, e.g. IOSQE_ASYNC. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:01 -07:00
Xiaoguang Wang	2e9dbe902d	io_uring: only wake up sq thread while current task is in io worker context If IORING_SETUP_SQPOLL is enabled, sqes are either handled in sq thread task context or in io worker task context. If current task context is sq thread, we don't need to check whether should wake up sq thread. io_iopoll_req_issued() calls wq_has_sleeper(), which has smp_mb() memory barrier, before this patch, perf shows obvious overhead: Samples: 481K of event 'cycles', Event count (approx.): 299807382878 Overhead Comma Shared Object Symbol 3.69% :9630 [kernel.vmlinux] [k] io_issue_sqe With this patch, perf shows: Samples: 482K of event 'cycles', Event count (approx.): 299929547283 Overhead Comma Shared Object Symbol 0.70% :4015 [kernel.vmlinux] [k] io_issue_sqe It shows some obvious improvements. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Xiaoguang Wang	906a3c6f9c	io_uring: don't acquire uring_lock twice Both IOPOLL and sqes handling need to acquire uring_lock, combine them together, then we just need to acquire uring_lock once. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Xiaoguang Wang	a0d9205f7d	io_uring: initialize 'timeout' properly in io_sq_thread() Some static checker reports below warning: fs/io_uring.c:6939 io_sq_thread() error: uninitialized symbol 'timeout'. This is a false positive, but let's just initialize 'timeout' to make sure we don't trip over this. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Xiaoguang Wang	0836924634	io_uring: refactor io_sq_thread() handling There are some issues about current io_sq_thread() implementation: 1. The prepare_to_wait() usage in __io_sq_thread() is weird. If multiple ctxs share one same poll thread, one ctx will put poll thread in TASK_INTERRUPTIBLE, but if other ctxs have work to do, we don't need to change task's stat at all. I think only if all ctxs don't have work to do, we can do it. 2. We use round-robin strategy to make multiple ctxs share one same poll thread, but there are various condition in __io_sq_thread(), which seems complicated and may affect round-robin strategy. To improve above issues, I take below actions: 1. If multiple ctxs share one same poll thread, only if all all ctxs don't have work to do, we can call prepare_to_wait() and schedule() to make poll thread enter sleep state. 2. To make round-robin strategy more straight, I simplify __io_sq_thread() a bit, it just does io poll and sqes submit work once, does not check various condition. 3. For multiple ctxs share one same poll thread, we choose the biggest sq_thread_idle among these ctxs as timeout condition, and will update it when ctx is in or out. 4. Not need to check EBUSY especially, if io_submit_sqes() returns EBUSY, IORING_SQ_CQ_OVERFLOW should be set, helper in liburing should be aware of cq overflow and enters kernel to flush work. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	f6edbabb83	io_uring: always batch cancel in *cancel_files() Instead of iterating over each request and cancelling it individually in io_uring_cancel_files(), try to cancel all matching requests and use ->inflight_list only to check if there anything left. In many cases it should be faster, and we can reuse a lot of code from task cancellation. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	6b81928d4c	io_uring: pass files into kill timeouts/poll Make io_poll_remove_all() and io_kill_timeouts() to match against files as well. A preparation patch, effectively not used by now. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	b52fda00dd	io_uring: don't iterate io_uring_cancel_files() io_uring_cancel_files() guarantees to cancel all matching requests, that's not necessary to do that in a loop. Move it up in the callchain into io_uring_cancel_task_requests(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	df9923f967	io_uring: cancel only requests of current task io_uring_cancel_files() cancels all request that match files regardless of task. There is no real need in that, cancel only requests of the specified task. That also handles SQPOLL case as it already changes task to it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	08d2363464	io_uring: add a {task,files} pair matching helper Add io_match_task() that matches both task and files. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	06de5f5973	io_uring: simplify io_task_match() If IORING_SETUP_SQPOLL is set all requests belong to the corresponding SQPOLL task, so skip task checking in that case and always match. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	2846c481c9	io_uring: inline io_import_iovec() Inline io_import_iovec() and leave only its former __io_import_iovec() renamed to the original name. That makes it more obious what is reused in io_read/write(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	632546c4b5	io_uring: remove duplicated io_size from rw io_size and iov_count in io_read() and io_write() hold the same value, kill the last one. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
David Laight	10fc72e433	fs/io_uring Don't use the return value from import_iovec(). This is the only code that relies on import_iovec() returning iter.count on success. This allows a better interface to import_iovec(). Signed-off-by: David Laight <david.laight@aculab.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:04:00 -07:00
Pavel Begunkov	1a38ffc9cb	io_uring: NULL files dereference by SQPOLL SQPOLL task may find sqo_task->files == NULL and __io_sq_thread_acquire_files() would leave it unset, so following fget_many() and others try to dereference NULL and fault. Propagate an error files are missing. [ 118.962785] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 118.963812] #PF: supervisor read access in kernel mode [ 118.964534] #PF: error_code(0x0000) - not-present page [ 118.969029] RIP: 0010:__fget_files+0xb/0x80 [ 119.005409] Call Trace: [ 119.005651] fget_many+0x2b/0x30 [ 119.005964] io_file_get+0xcf/0x180 [ 119.006315] io_submit_sqes+0x3a4/0x950 [ 119.007481] io_sq_thread+0x1de/0x6a0 [ 119.007828] kthread+0x114/0x150 [ 119.008963] ret_from_fork+0x22/0x30 Reported-by: Josef Grieb <josef.grieb@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Hao Xu	c73ebb685f	io_uring: add timeout support for io_uring_enter() Now users who want to get woken when waiting for events should submit a timeout command first. It is not safe for applications that split SQ and CQ handling between two threads, such as mysql. Users should synchronize the two threads explicitly to protect SQ and that will impact the performance. This patch adds support for timeout to existing io_uring_enter(). To avoid overloading arguments, it introduces a new parameter structure which contains sigmask and timeout. I have tested the workloads with one thread submiting nop requests while the other reaping the cqe with timeout. It shows 1.8~2x faster when the iodepth is 16. Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Signed-off-by: Hao Xu <haoxu@linux.alibaba.com> [axboe: various cleanups/fixes, and name change to SIG_IS_DATA] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	27926b683d	io_uring: only plug when appropriate We unconditionally call blk_start_plug() when starting the IO submission, but we only really should do that if we have more than 1 request to submit AND we're potentially dealing with block based storage underneath. For any other type of request, it's just a waste of time to do so. Add a ->plug bit to io_op_def and set it for read/write requests. We could make this more precise and check the file itself as well, but it doesn't matter that much and would quickly become more expensive. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	0415767e7f	io_uring: rearrange io_kiocb fields for better caching We've got extra 8 bytes in the 2nd cacheline, put ->fixed_file_refs there, so inline execution path mostly doesn't touch the 3rd cacheline for fixed_file requests as well. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	f2f87370bb	io_uring: link requests with singly linked list Singly linked list for keeping linked requests is enough, because we almost always operate on the head and traverse forward with the exception of linked timeouts going 1 hop backwards. Replace ->link_list with a handmade singly linked list. Also kill REQ_F_LINK_HEAD in favour of checking a newly added ->list for NULL directly. That saves 8B in io_kiocb, is not as heavy as list fixup, makes better use of cache by not touching a previous request (i.e. last request of the link) each time on list modification and optimises cache use further in the following patch, and actually makes travesal easier removing in the end some lines. Also, keeping invariant in ->list instead of having REQ_F_LINK_HEAD is less error-prone. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	90cd7e4249	io_uring: track link timeout's master explicitly In preparation for converting singly linked lists for chaining requests, make linked timeouts save requests that they're responsible for and not count on doubly linked list for back referencing. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	863e05604a	io_uring: track link's head and tail during submit Explicitly save not only a link's head in io_submit_sqe[s]() but the tail as well. That's in preparation for keeping linked requests in a singly linked list. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Pavel Begunkov	018043be1f	io_uring: split poll and poll_remove structs Don't use a single struct for polls and poll remove requests, they have totally different layouts. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00
Jens Axboe	14a1143b68	io_uring: add support for IORING_OP_UNLINKAT IORING_OP_UNLINKAT behaves like unlinkat(2) and takes the same flags and arguments. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2020-12-09 12:03:59 -07:00

... 2 3 4 5 6 ...

68374 Commits